CN112800977A

CN112800977A - Teacher blackboard writing action identification method based on multi-granularity convolutional neural network pruning

Info

Publication number: CN112800977A
Application number: CN202110130937.0A
Authority: CN
Inventors: 张文博; 包振山; 周晚晴; 杜嘉磊
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-01-30
Filing date: 2021-01-30
Publication date: 2021-05-14

Abstract

A teacher blackboard-writing action recognition method based on multi-granularity convolutional neural network pruning, belonging to the field of deep learning. The invention addresses the practical application of intelligent recording of high-quality classroom courses in intelligent education, and applies multi-granularity convolutional neural network pruning to the human action recognition algorithm, thereby improving its processing speed. The blackboard-writing action recognition algorithm is divided into three steps: OpenPose feature extraction, coordinate normalization, and BP neural network classification. In addition, within the OpenPose algorithm, a multi-granularity convolutional neural network pruning framework based on the filter level and the connection level is used to compress the OpenPose backbone network, and a corresponding training strategy is designed and implemented so that the two types of pruning methods are combined. The final experimental results show that the accuracy and speed of the pruned network fully meet the requirements of practical application.

Description

Teacher blackboard writing action identification method based on multi-granularity convolutional neural network pruning

Technical Field

The invention belongs to the field of deep learning, specifically the technical field of deep neural network model compression, and relates to a teacher blackboard-writing action recognition method based on multi-granularity convolutional neural network pruning.

Background

In recent years, intelligent recording systems for high-quality classroom courses have been able to capture images of teachers and students with high-definition cameras, switch between teaching scenes by analyzing the video images, automatically film teacher and student subjects, and identify the subjects' actions.

In such a recording system, the teacher in the input video needs to be detected and tracked, and the teacher's blackboard-writing actions need to be identified. Generally, the OpenPose algorithm is selected to detect the teacher in the image and recognize blackboard-writing actions. OpenPose extracts the key points of the human body in the input image using a bottom-up strategy: the positions of the key points in the image are extracted first, and the human skeleton information is then computed from the learned key point relations. The teacher's position in the image can be computed from the key point information, which solves the teacher detection problem.

Disclosure of Invention

Combining the practical application of intelligent recording of high-quality classroom courses in intelligent education, the invention provides a teacher blackboard-writing action recognition method based on multi-granularity convolutional neural network pruning, in which a multi-granularity pruning compression algorithm achieves network compression through an effective training strategy that combines filter-level pruning and connection-level pruning.

The flow of the teacher blackboard-writing action recognition method is shown in fig. 1: first, a standard definition camera is used to collect a high-definition video image, and after the region of interest is extracted the original image is input into the teacher blackboard-writing action recognition method.

In the teacher blackboard-writing action recognition method, the OpenPose key point extraction algorithm extracts the key points of the human body from the image, the key point coordinates are then position-normalized, and finally the normalized coordinate information is input into a trained BP neural network for classification to obtain the output result.

Each stage of the blackboard-writing action recognition method is explained in detail below:

(1) OpenPose performs feature extraction

First, an RGB image of size $w \times h$ is taken as input. The OpenPose backbone network then computes basic features in a feedforward pass, and at the same time extracts a set of two-dimensional confidence maps $S$ that predict human key points and a set of two-dimensional vector fields $V$ that represent the degree of association between human key points. The set $S = (S_1, S_2, S_3, \ldots, S_j, \ldots, S_J)$, $S_j \in \mathbb{R}^{w \times h}$, where $\mathbb{R}^{w \times h}$ has the size $w \times h$ of the input RGB image, comprises $J$ confidence maps; each map represents one type of human joint key point, and each response peak in a map indicates the presence of one key point. The set $V = (V_1, V_2, V_3, \ldots, V_c, \ldots, V_C)$, $V_c \in \mathbb{R}^{w \times h \times 2}$, comprises $C$ two-dimensional vector fields, one for each limb, each encoding the direction that points from one part of the limb to the other. Finally, the confidence maps and the affinity fields are parsed by the Hungarian algorithm, and the key point information of all human bodies in the image is output.
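The statement that each response peak in a confidence map indicates one key point can be illustrated with a minimal sketch. The function name `find_peaks`, the 8-neighbour comparison, and the threshold of 0.1 are assumptions made for illustration; the source does not specify how peaks are located.

```python
import numpy as np

def find_peaks(conf_map, threshold=0.1):
    """Return (x, y) positions that exceed `threshold` and are not smaller
    than any of their 8 neighbours (a simple local-maximum test)."""
    conf_map = np.asarray(conf_map, dtype=float)
    h, w = conf_map.shape
    # Pad with -inf so border pixels can be compared like interior ones.
    padded = np.pad(conf_map, 1, mode="constant", constant_values=-np.inf)
    is_peak = conf_map > threshold
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = padded[1 + dy:h + 1 + dy, 1 + dx:w + 1 + dx]
            is_peak &= conf_map >= neighbour
    ys, xs = np.nonzero(is_peak)
    return list(zip(xs.tolist(), ys.tolist()))

# Hypothetical 5 x 5 confidence map with two isolated responses.
demo_map = np.zeros((5, 5))
demo_map[2, 3] = 0.9
demo_map[0, 0] = 0.5
peaks = find_peaks(demo_map)
```

A flat plateau would be reported as several peaks by this test; a practical implementation would usually smooth the map or apply non-maximum suppression first.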

Fig. 2 shows the network framework of OpenPose, which consists of a basic VGG19 network and two iterated branches. Branch one predicts the locations of the key points, and branch two predicts the affinity fields between limbs, commonly known as PAFs. In the first stage, the two branches take the feature map $F$ output by VGG as input and produce a set of outputs $S^1 = \rho^1(F)$, $V^1 = \phi^1(F)$,

where $\rho(\cdot)$ and $\phi(\cdot)$ denote regression functions implemented by convolution, in which $D$ is the convolution kernel (of size 3×3) and $F$ is the input feature map. Each subsequent stage takes the outputs $S^{t-1}$ and $V^{t-1}$ of the previous stage together with the feature map $F$ as input and produces the new outputs $S^t$ and $V^t$. By repeating this process $t$ times, the human key point confidence maps $S$ and the affinity field matrix $V$ representing the key point relations are finally output. Here $t$ is the number of iterations, with $t \ge 2$; iteration generally continues until $S^t$ converges, meaning that the value of $S^t$ no longer changes. The calculation is shown in formulas (1) and (2).

$$S^t = \rho^t(F, S^{t-1}, V^{t-1}), \quad t \ge 2 \qquad (1)$$

$$V^t = \phi^t(F, S^{t-1}, V^{t-1}), \quad t \ge 2 \qquad (2)$$

In addition, the OpenPose algorithm provides several output forms for human key points, including the BODY_25, COCO, Face, and Hand models. For blackboard writing, a teacher who raises one hand while facing the blackboard can be considered to be writing on the blackboard. Requirement analysis of the recognition method shows that attention should focus on the key points of the upper limbs and the head, and that the lower limbs can be ignored. The present invention therefore adopts the COCO output model. Because of desk occlusion, camera angle, and similar problems, only the upper body is considered when recognizing the teacher's blackboard-writing action, so the 12 key points numbered 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, and 17 in the COCO model are activated, and their coordinates are used as the raw input data of the next stage. The COCO model is shown in fig. 3.
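Selecting the 12 activated key points can be sketched as a simple index operation. The array layout (one person, 18 rows of (x, y) coordinates in COCO numbering) and the helper name `select_upper_body` are assumptions for illustration; only the index list comes from the text.

```python
import numpy as np

# Indices of the 12 upper-body key points named in the text (COCO output model).
UPPER_BODY_IDS = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17]

def select_upper_body(keypoints):
    """keypoints: array of shape (18, 2) holding (x, y) for one person.
    Returns the (12, 2) sub-array of activated key points."""
    keypoints = np.asarray(keypoints, dtype=float)
    if keypoints.shape != (18, 2):
        raise ValueError("expected an (18, 2) array of COCO key points")
    return keypoints[UPPER_BODY_IDS]

# Hypothetical data: key point i is placed at (i, 10 * i).
demo = np.stack([np.arange(18), 10 * np.arange(18)], axis=1)
upper = select_upper_body(demo)
```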

(2) Coordinate normalization

Because the teacher's position in the image is not fixed, and blackboard-writing recognition depends on the relative positions of the person's key points rather than on where the person is in the image, the obtained key point coordinates are position-normalized by the method shown in formula (3). In the formula, the coordinate origin $(A_0, B_0)$ is the key point of the human neck, $(A_{max}, B_{max})$ and $(A_{min}, B_{min})$ are respectively the maximum and minimum values in the sample data, and $(A_b, B_b)$ and $(A_b, B_b)$ are respectively the key point coordinates before and after the normalization.
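Formula (3) is not reproduced in this text, so the following is only a sketch of one normalization that fits the description: shift the origin to the neck key point and divide by the coordinate range of the sample. The exact form, the per-sample range, and the use of row 1 as the neck are assumptions.

```python
import numpy as np

def normalize_keypoints(points, neck_index=1):
    """points: (K, 2) array of (A, B) coordinates for one sample.
    Assumed form: (p - neck) / (max - min), computed per axis."""
    points = np.asarray(points, dtype=float)
    origin = points[neck_index]                       # (A_0, B_0)
    span = points.max(axis=0) - points.min(axis=0)    # (A_max - A_min, B_max - B_min)
    span = np.where(span == 0, 1.0, span)             # avoid division by zero
    return (points - origin) / span

pts = np.array([[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]])
normalized = normalize_keypoints(pts)
shifted = normalize_keypoints(pts + 100.0)
```

With this form the output does not change when the whole person is translated in the image, which is the property the paragraph above asks for.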

(3) BP neural network classification

The BP neural network is an effective multilayer feedforward neural network with strong nonlinear classification capability and robustness, and is widely used for pattern recognition and classification tasks. The normalized key point coordinates obtained in step (2) are usually linearly inseparable, but a fully connected layer in the BP neural network can first map low-dimensional linearly inseparable data into a higher-dimensional space, where the data become linearly separable and can be classified by a hyperplane. Taking linearly inseparable two-dimensional data as an example, fig. 4 shows how fully connected layers classify such data. Fig. 4(a) shows linearly inseparable two-dimensional data $[x_1, x_2]$, which formula (4) maps into three-dimensional data $[y_1, y_2, y_3]$; at this point the three-dimensional data can be separated into two classes by one hyperplane, as shown in fig. 4(b). The classified three-dimensional data $[y_1, y_2, y_3]$ are then mapped into two-dimensional data $[z_1, z_2]$ by formula (5), as shown in fig. 4(c) and 4(d).

Therefore, the normalized key point coordinates are input into the BP neural network, and the BP neural network is trained. The BP neural network is structured as follows: the input layer has 1 × 24 neurons, where 24 results from flattening the 12 two-dimensional coordinate points into one-dimensional data; the hidden layer has 32 neurons; and the output layer contains 2 neurons, representing the blackboard-writing state and the non-blackboard-writing state respectively, which are distinguished by a Softmax classifier. The output of the Softmax function and the loss function L() are shown in formula (6) and formula (7), respectively. In formula (6), $y_q$ is the vector of the q normalized key points obtained in step (2), each q corresponds to one Softmax, and n is the number of output categories; teacher actions are classified into two categories, so n = 2. $y_q'$ is the output value of the Softmax function and is used hereinafter to denote that output. The purpose of the loss function L() is to measure the difference between the model and the actual result; by iterating the neural network, the value of the loss function is made smaller and smaller, so that the model's result becomes more accurate. Hereinafter, unless otherwise specified, the loss function L() has this same meaning and is calculated by formula (7).
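The 24-32-2 structure described above can be sketched as a forward pass in NumPy. The layer sizes come from the text; the tanh hidden activation, the random initialization, and the cross-entropy loss are assumptions, since formulas (6) and (7) are not reproduced here.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, params):
    """x: (batch, 24) flattened key point coordinates. Returns (batch, 2) class probabilities."""
    w1, b1, w2, b2 = params
    hidden = np.tanh(x @ w1 + b1)            # 32 hidden neurons (activation is an assumption)
    return softmax(hidden @ w2 + b2)         # 2 outputs: writing / not writing

def cross_entropy(probs, labels):
    """Mean negative log-likelihood; assumed stand-in for the loss L()."""
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(picked + 1e-12))

rng = np.random.default_rng(0)
params = (rng.normal(0, 0.1, (24, 32)), np.zeros(32),
          rng.normal(0, 0.1, (32, 2)), np.zeros(2))
x = rng.normal(size=(5, 24))                 # 5 hypothetical samples
probs = forward(x, params)
loss = cross_entropy(probs, np.zeros(5, dtype=int))
```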

Further, in the teacher blackboard-writing action recognition method based on multi-granularity convolutional neural network pruning, the OpenPose algorithm of step (1) accounts for most of the computation and therefore needs to be optimized. The optimization process is as follows:

(1.1) Truncating the refinement stages

Analysis of OpenPose shows that the algorithm extracts image features with the first 4 convolution modules of the VGG19 network, and that after feature extraction two convolution layers (Conv4-3, Conv4-4) reduce the dimension of the feature map. The dimension-reduced feature maps are input into the two branches, which respectively perform regression of the human key points and prediction of the part affinity vector field representing the degree of association between two key points. The two branches have the same cascaded network structure, consisting of one initial stage and one refinement stage that is repeated five times. Fig. 5 lists the distribution of computation over the stages of the OpenPose algorithm and the mean Average Precision (mAP) on the COCO data set for different numbers of stages, with the input image size uniformly adjusted to 368 × 368. As can be seen from fig. 5, with 2 refinement stages the accuracy of OpenPose already reaches 46.2%; the remaining 3 stages do not significantly improve the accuracy but increase the computation by more than 40%. Therefore only 2 refinement stages are kept, which reduces the computation of the algorithm from 136.6 GFLOPs to 80.8 GFLOPs and significantly improves its computational efficiency.
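As a quick check of the figures quoted above, the share of computation removed by keeping only 2 refinement stages can be worked out directly from the two GFLOPs values given in the text (the variable names are illustrative):

```python
full_gflops = 136.6        # all five refinement stages
truncated_gflops = 80.8    # two refinement stages kept

removed = full_gflops - truncated_gflops        # computation of the three dropped stages
share_of_original = removed / full_gflops       # fraction of the original cost that is saved
print(f"removed {removed:.1f} GFLOPs, {share_of_original:.1%} of the original")
```

The three dropped stages therefore account for roughly 41% of the original computation.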

(1.2) Model compression of the VGG19 backbone network

The VGG19 network comprises 16 convolutional layers (all with 3 × 3 kernels) and 3 fully connected layers; the network model has 144 million parameters, about 574 MB. Fig. 6 shows the steps of optimizing the network, in which the multi-granularity convolutional neural network pruning framework proposed by the present invention is used to compress the OpenPose backbone network. Specifically, the parameters of the initial stage and refinement stages of the two iterated branches in OpenPose and of the BP neural network are first fixed, multi-granularity convolutional neural network pruning is applied to the VGG19 network, and the pruned network is retrained on a given data set to recover its performance. During pruning, the parameters of the other layers in the network are kept fixed and only the first 10 layers of the network (up to conv4-2) are pruned; during retraining, the parameters of the other layers are likewise kept fixed and only the parameters of the first 10 layers (up to conv4-2) are updated. The purpose of keeping a single variable in this way is to ensure that the ability of the first 10 layers of the VGG19 model to extract image features is not reduced after pruning. Finally, the entire OpenPose algorithm is retrained on the COCO data set to recover the accuracy lost through the changes to the VGG19 network. Replacing the original network model with the pruned network model completes the optimization of the OpenPose algorithm.

Further, the multi-granularity convolutional neural network pruning algorithm in the step (1.2) specifically comprises the following steps:

(1.2.1) Filter level pruning

First, a number of images are randomly selected as an evaluation set. For each image taken as input, the mean of each filter's output feature map is computed and used as that filter's response value to the input image, which gives the response tensor for this batch of images. The information entropy is then used to measure how much this tensor varies: the value range of the tensor elements is divided equally into m blocks, the number of elements in each block is counted, the occurrence probability $p_j$ is calculated, and the information entropy is calculated according to formula (8).

$$H_{i,k} = -\sum_{j=1}^{m} p_j \log p_j, \quad i = 1, \ldots, N, \; k = 1, \ldots, C_i \qquad (8)$$

where $H_{i,k}$ is the information entropy of the response tensor generated by filter $k$ of the $i$-th convolutional layer, $j$ indexes the current block, and $N$ and $C_i$ are respectively the number of network layers and the number of channels in the $i$-th convolutional layer. After the information entropy has been calculated, the filters in the $i$-th convolutional layer are sorted in ascending order of information entropy. Based on an evaluation of the convolutional neural network to be compressed, the user sets the expected compression ratio $C_r$ ($0 \le C_r \le 1$), which can be understood intuitively as the proportion of filters the user expects to remain in the network after compression. The number of filters to be cut from the corresponding layer is calculated using equation (9).

$$n_i = C_i (1 - C_r) \qquad (9)$$

The first $n_i$ filters in the sorted order of layer $i$ are deleted, and the corresponding two-dimensional convolution kernels in layer $i+1$ are removed, which completes the pruning.
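The entropy-based ranking above can be sketched as follows. The input is assumed to be a precomputed matrix of mean responses (one row per evaluation image, one column per filter); the function names, the natural logarithm, the default m = 10, and rounding $n_i$ to the nearest integer are assumptions.

```python
import numpy as np

def filter_entropies(responses, m=10):
    """responses: (num_images, num_filters) mean feature-map response of each
    filter to each evaluation image. Returns one entropy value per filter."""
    responses = np.asarray(responses, dtype=float)
    entropies = np.empty(responses.shape[1])
    for k in range(responses.shape[1]):
        # Split the value range into m equal blocks and count elements per block.
        counts, _ = np.histogram(responses[:, k], bins=m)
        p = counts[counts > 0] / counts.sum()
        entropies[k] = -np.sum(p * np.log(p))
    return entropies

def filters_to_prune(responses, compression_ratio, m=10):
    """Indices of the n_i = C_i * (1 - C_r) filters with the lowest entropy."""
    c_i = np.asarray(responses).shape[1]
    n_i = int(round(c_i * (1.0 - compression_ratio)))
    order = np.argsort(filter_entropies(responses, m), kind="stable")
    return order[:n_i]

# Hypothetical responses: filter 0 never varies, filter 1 varies across images.
resp = np.stack([np.ones(100), np.linspace(0.0, 1.0, 100)], axis=1)
pruned = filters_to_prune(resp, compression_ratio=0.5)
```

A filter whose response barely changes across images has low entropy and is removed first, which matches the ascending sort in the text.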

In the implementation, this process requires a binary mask matrix T whose dimensions match those of the convolutional neural network model exactly. T is a 0-1 matrix that represents the pruning state: each element of T corresponds to one parameter of the network model and is initialized to 1, and when a filter is pruned, all matrix elements corresponding to that filter are set to 0. Thus, for the filter bank $W_{i,k}$ with input feature map $F_i$, the convolution operation changes as shown in equation (10).

where f() denotes the activation function, $T_{i,k}$ is the mask matrix corresponding to $W_{i,k}$, $*$ denotes the convolution operation, and $\odot$ denotes the Hadamard product.
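The role of the mask can be sketched with a weight tensor of assumed shape (filters, input channels, k, k). The helper name `prune_filters` and the tensor sizes are illustrative only.

```python
import numpy as np

def prune_filters(mask, filter_ids):
    """Return a copy of the 0-1 mask with every element of the listed filters set to 0."""
    mask = mask.copy()
    mask[list(filter_ids)] = 0
    return mask

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 3, 3, 3))     # 4 filters, 3 input channels, 3 x 3 kernels
mask = np.ones_like(weights)                # every element starts at 1
mask = prune_filters(mask, [1, 3])          # prune filters 1 and 3
effective = weights * mask                  # Hadamard product used in place of the raw weights
```

Because the original weights are kept and only the mask changes, a pruned parameter can later be restored by setting its mask element back to 1.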

The convolutional neural network is a feedforward neural network whose basic unit is the neuron. A number of neurons form a two-dimensional array used to extract basic image features, referred to below as a feature matrix; a number of such arrays form a convolutional layer. Two adjacent convolutional layers are connected through neurons to transmit information, while the neurons within the same convolutional layer are independent of one another. Convolutional layers extract features from their input, and each convolutional layer consists of a number of filter banks trained by the back-propagation algorithm. Let $w_i$ and $h_i$ denote the width and height of the input three-dimensional feature tensor $X_i \in \mathbb{R}^{C_i \times h_i \times w_i}$. After the convolution calculation, $X_i$ becomes the output feature tensor $X_{i+1} \in \mathbb{R}^{C_{i+1} \times h_{i+1} \times w_{i+1}}$, which in turn serves as the input of the next convolutional layer. The convolutional layer operation is implemented by applying $C_{i+1}$ filters to the $C_i$ input channels; each filter generates one feature map and consists of $C_i$ two-dimensional convolution kernels of size $k \times k$. The number of operations of the $(i+1)$-th convolutional layer is therefore $C_{i+1} C_i k^2 h_{i+1} w_{i+1}$. Pruning one filter of the $i$-th layer removes $C_i k^2 h_{i+1} w_{i+1}$ operations; the corresponding input feature map of the $(i+1)$-th layer is removed as well, which saves a further $C_{i+2} k^2 h_{i+2} w_{i+2}$ operations. Pruning m filters in the $i$-th layer thus reduces the computation of the $i$-th and $(i+1)$-th layers by a fraction $m / C_{i+1}$.
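The operation counts above can be checked numerically. The sketch below counts operations for two consecutive layers before and after pruning and confirms the $m / C_{i+1}$ fraction; the layer sizes are hypothetical and the function names are illustrative.

```python
def conv_ops(c_in, c_out, k, h_out, w_out):
    """Operation count of one convolutional layer: c_out * c_in * k^2 * h_out * w_out."""
    return c_out * c_in * k * k * h_out * w_out

def pruned_fraction(m, c_in, c_mid, c_out, k, hw_mid, hw_out):
    """Fraction of operations removed from two consecutive layers when m of the
    c_mid filters of the first layer (and the matching kernels of the second) are pruned."""
    (h1, w1), (h2, w2) = hw_mid, hw_out
    before = conv_ops(c_in, c_mid, k, h1, w1) + conv_ops(c_mid, c_out, k, h2, w2)
    after = conv_ops(c_in, c_mid - m, k, h1, w1) + conv_ops(c_mid - m, c_out, k, h2, w2)
    return (before - after) / before

# Hypothetical sizes: 64 -> 128 -> 256 channels, 3 x 3 kernels, 56 x 56 feature maps.
fraction = pruned_fraction(m=32, c_in=64, c_mid=128, c_out=256, k=3,
                           hw_mid=(56, 56), hw_out=(56, 56))
```

Here 32 of 128 filters are removed, so a quarter of the operations of both layers disappears.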

(1.2.2) Connection-level pruning

A dynamic pruning method is used, in which the thresholds $TH_A$ and $TH_B$ ($TH_B \ge TH_A \ge 0$) are obtained according to formula (11). Connections below the threshold $TH_A$ are cut off, and connections above the threshold $TH_B$ are restored. This recoverable mechanism solves the problem that the network cannot recover when important connections are deleted by mistake during pruning.

In formula (11), $W_{i,k}$ denotes the set of parameters in the filter, mean() denotes the average of the parameter set, and std() denotes its standard deviation. s takes the value −1, and Δt takes the value $-2 \times s \times \mathrm{std}(W_{i,k})$. Pruning and restoring a connection are realized by clearing and setting the corresponding elements of the mask matrix. Let $W_{i,k}(p)$ be the p-th parameter of the filter $W_{i,k}$ and $T_{i,k}(p)$ the corresponding element of the mask matrix; the update strategy for each element of the mask matrix is shown in formula (12).

When the network parameters are updated again, the update strategy of stochastic gradient descent is adopted, as shown in formula (13).

where I denotes the set of all filters in the deep network and L() denotes the loss function of the network during pruning; the partial derivative of the loss function in equation (11) is taken first. β is the learning rate of the parameter update ($0 < \beta \le 1$). To avoid the problem that parameters stop updating because β is too small, the minimum value of β is taken as $10^{-4}$, that is, $0.0001 \le \beta \le 1$, and each convolutional layer needs more than 10000 training iterations.
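Formulas (11) to (13) are not reproduced in this text, so the following is only a sketch of the recoverable mechanism they describe. It assumes that the thresholds are compared against the absolute value of each parameter, that elements between the two thresholds keep their previous mask value, and that the gradient step is applied to all parameters, including currently pruned ones, so that they can grow back. The thresholds are passed in as plain numbers rather than derived from mean() and std().

```python
import numpy as np

def update_mask(weights, mask, th_a, th_b):
    """Clear mask elements whose |weight| is below th_a, set those above th_b,
    and leave the elements in between unchanged."""
    magnitude = np.abs(weights)
    new_mask = mask.copy()
    new_mask[magnitude < th_a] = 0
    new_mask[magnitude > th_b] = 1
    return new_mask

def sgd_step(weights, grad_wrt_masked, beta=1e-4):
    """Plain gradient step with learning rate beta (0.0001 <= beta <= 1).
    grad_wrt_masked is the loss gradient taken at the masked weights."""
    return weights - beta * grad_wrt_masked

weights = np.array([0.05, 0.5, 2.0, 0.3])
mask = np.array([1.0, 0.0, 0.0, 0.0])
new_mask = update_mask(weights, mask, th_a=0.1, th_b=1.0)
stepped = sgd_step(weights, np.ones(4), beta=0.1)
```

In the example the first connection is cut, the third is restored, and the other two keep their previous state.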

(1.2.3) Accuracy recovery training strategy: use of L1 and L2 regularization

The objective function is minimized, see equation (14).

In formula (14), ω denotes the parameters to be processed in the network model, and the minimizer denotes the parameters obtained after the regularization processing. λ is the regularization parameter; how its value is chosen is described below in the introductions to L1 and L2 regularization. The first term in formula (14) represents the error between the network model's predicted value $f(x_e; \omega)$ for the e-th sample and the training label $y_e$. The second term Ω(ω) in formula (14) is a regularization function of the parameter ω. There are many choices for Ω(ω); L1 regularization and L2 regularization are the ones introduced here. Both are adopted in this method: L1 regularization is introduced in the recovery training after filter-level pruning is completed, and L2 regularization is introduced in the recovery training of the connection-level pruning method.

L1 regularization

After the filter-level pruning is completed, L1 regularization is introduced in the recovery training process. The L1 regularization term of the parameter ω to be processed in the network model is shown in equation (15).

where n is the number of parameters contained in ω. The L1 regularization term is the sum of the absolute values of these parameters.

From the definition of regularization, the loss function with the regularization term of L1 is shown in equation (16).

Differentiating the objective function with the L1 regularization term gives the result shown in equation (17).

With L1 regularization, ω is updated by gradient descent as shown in formula (18), where β is the learning rate of the parameter update ($0.0001 \le \beta \le 1$), as above.

In the gradient descent process of formula (18), the regularization parameter satisfies λ ≥ 0; by setting the value of λ, part of ω can be driven to 0, so that a sparse model is obtained and the problem of parameter overfitting is alleviated.
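Formula (18) is not reproduced in this text; the sketch below shows a common form of the L1-regularized gradient step, in which the penalty contributes the sign of each parameter. The λ/n scaling and the function name `l1_step` are assumptions.

```python
import numpy as np

def l1_step(omega, grad, beta=1e-4, lam=1.0):
    """One gradient step on loss + (lam / n) * sum(|omega|).
    The penalty gradient is (lam / n) * sign(omega); the scaling is an assumption."""
    n = omega.size
    return omega - beta * (grad + (lam / n) * np.sign(omega))

omega = np.array([1.0, -2.0, 0.0])
updated = l1_step(omega, grad=np.zeros(3), beta=0.1, lam=3.0)
```

Each nonzero parameter is pulled toward 0 by the same fixed amount per step regardless of its size, which is why small parameters can reach 0 and the model becomes sparse.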

L2 regularization

L2 regularization is introduced in the recovery training process for connection-level pruning and is described below. The L2 regularization term of the parameter ω to be processed in the network model is shown in equation (19), where n is the number of parameters contained in ω.

The calculation of the L2 regularization term is to take the sum of the squares of these parameters.

From the definition of regularization, a loss function with a regularization term of L2 is known (see equation (20)).

In formula (20), one symbol denotes the initial function before regularization and the other denotes the function obtained after L2 regularization.

Differentiating the objective function with the L2 regularization term gives the result shown in equation (21).

With L2 regularization, ω is updated by gradient descent as shown in equation (22), where n and β are defined as above and λ is the regularization parameter. First, the learning rate β is determined according to the expected number of training iterations; 0.0001 is recommended. The value of λ is then tuned from coarse to fine: starting from an initial value of 1, λ is gradually increased or decreased, the parameters are learned on the training set, and the error is verified on the test set, in order to find the value that makes the test-set error smaller. This process is repeated until the error on the test set is minimized. Specifically, the regularization parameter is first set to 1 and then increased by a factor of 10 at each step, according to the result on the verification set; if the error on the test set is unchanged or increases after 2 to 3 such trials, the search switches to decreasing λ by a factor of 10 at each step, and so on until the order of magnitude that minimizes the test-set error is found. A further fine adjustment is then made at this order of magnitude: starting from 0 in the lowest digit, the lowest digit is incremented by 1 each time until the value that minimizes the test-set error is found.
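The coarse-to-fine search for λ described above can be sketched as follows. `error_fn` is a hypothetical callable that trains with a given λ and returns the resulting test-set error; the patience of 3 trials, the cap on the number of steps, and the interpretation of the fine stage as trying the multiples 2 to 9 of the best power of ten are assumptions.

```python
import math

def coarse_to_fine_lambda(error_fn, patience=3, max_steps=8):
    """Coarse stage: scan powers of ten upward and downward from 1.
    Fine stage: scan multiples of the best power of ten."""
    def scan(factor):
        best_lam, best_err = 1.0, error_fn(1.0)
        lam, misses = 1.0, 0
        for _ in range(max_steps):
            lam *= factor
            err = error_fn(lam)
            if err < best_err:
                best_lam, best_err, misses = lam, err, 0
            else:
                misses += 1
                if misses >= patience:   # error unchanged or larger: stop this direction
                    break
        return best_lam, best_err

    magnitude, best_err = min(scan(10.0), scan(0.1), key=lambda pair: pair[1])
    best_lam = magnitude
    for digit in range(2, 10):           # fine adjustment within the chosen magnitude
        err = error_fn(digit * magnitude)
        if err < best_err:
            best_lam, best_err = digit * magnitude, err
    return best_lam

# Hypothetical error curve whose minimum lies at lambda = 0.03.
demo_error = lambda lam: (math.log10(lam) - math.log10(0.03)) ** 2
best = coarse_to_fine_lambda(demo_error)
```

In practice every call to `error_fn` is a full training run, so the number of coarse and fine trials is the main cost of this procedure.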

Drawings

FIG. 1 is a flow chart of a teacher blackboard-writing action recognition method;

FIG. 2 is a network framework for OpenPose;

FIG. 3 is a COCO model;

FIG. 4 is a process for classifying two-dimensional linearly indivisible data by a fully connected layer;

FIG. 5 is a table comparing the computation and precision at different stages of OpenPose;

FIG. 6 is a backbone network optimization process for OpenPose;

FIG. 7 is a server-side configuration;

FIG. 8 is a comparison of initial and optimized OpenPose algorithm performance;

FIG. 9 is the calculated amount change of each network layer of VGG19 in the network pruning stage;

Detailed Description

In order to make the objects, technical solutions and features of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.

In this work, the training of the BP neural network and the retraining of the pruned OpenPose backbone network are performed on a server, whose configuration is shown in fig. 7. The trained model is then ported to NVIDIA Jetson TX2, an embedded GPU platform designed for deep learning, for testing.

Fig. 8 shows, for the initial algorithm and the optimized algorithm, the mAP of OpenPose on the MS COCO data set, the computation of the teacher recognition method, its single-frame computation time, and its accuracy on the teacher posture verification set. As can be seen from fig. 8, after the last three refinement stages of the OpenPose model are removed, the accuracy of the OpenPose model on the COCO key point verification set drops by 2.4%, but the algorithm becomes significantly faster: the single-frame processing time on TX2 drops from 251.7 ms to 150.1 ms. At this point the accuracy of the teacher blackboard-writing recognition method is 98.1%, a drop of only 0.5%. This is because the teacher is always in a standing posture, which is an ideal input for OpenPose key point extraction, so the drop in the accuracy of the OpenPose model has little influence on the teacher blackboard-writing recognition method.

Details and results of the network pruning of VGG19 are shown in fig. 9. The two pruning stages reduce the computation by 18.3 GFLOPs and 6.0 GFLOPs respectively. After pruning is completed, the features extracted by each convolution kernel of the backbone network have changed slightly, so the parameters of the original OpenPose algorithm are less well adapted to the backbone, and the accuracy of the OpenPose algorithm on the COCO verification set is reduced by 43.7%. The network therefore needs to be retrained to recover its accuracy: with the hyperparameters base_size = 10 and base_lr = 10^-5, the network is trained for 60K iterations, and its accuracy is finally recovered to 46.0%. Network pruning thus causes an accuracy loss of only 0.1% while reducing the computation by 30%, which fully meets the requirements of practical application.

Claims

1. A teacher blackboard-writing action recognition method based on multi-granularity convolutional neural network pruning is characterized by comprising the following steps:

(1) OpenPose performs feature extraction

First, an RGB image of size $w \times h$ is taken as input; the OpenPose backbone network then computes basic features in a feedforward pass, and at the same time extracts a set of two-dimensional confidence maps $S$ that predict human key points and a set of two-dimensional vector fields $V$ that represent the degree of association between human key points; the set $S = (S_1, S_2, S_3, \ldots, S_j, \ldots, S_J)$, $S_j \in \mathbb{R}^{w \times h}$, where $\mathbb{R}^{w \times h}$ has the size $w \times h$ of the input RGB image, comprises $J$ confidence maps, each representing one type of human joint key point, and each response peak in a map indicates the presence of one key point; the set $V = (V_1, V_2, V_3, \ldots, V_c, \ldots, V_C)$, $V_c \in \mathbb{R}^{w \times h \times 2}$, comprises $C$ two-dimensional vector fields, one for each limb, each encoding the direction that points from one part of the limb to the other; finally, the confidence maps and the affinity fields are parsed by the Hungarian algorithm, and the key point information of all human bodies in the image is output;

OpenPose consists of a basic VGG19 network and two iterated branches; branch one predicts the locations of the key points, and branch two predicts the affinity fields between limbs, commonly known as PAFs; in the first stage, the two branches take the feature map $F$ output by VGG as input and produce a set of outputs $S^1 = \rho^1(F)$, $V^1 = \phi^1(F)$,

where $\rho(\cdot)$ and $\phi(\cdot)$ denote regression functions implemented by convolution, in which $D$ is the convolution kernel (of size 3×3) and $F$ is the input feature map; each subsequent stage takes the outputs $S^{t-1}$ and $V^{t-1}$ of the previous stage together with the feature map $F$ as input and produces the new outputs $S^t$ and $V^t$; by repeating this process $t$ times, the human key point confidence maps $S$ and the affinity field matrix $V$ representing the key point relations are finally output, where $t$ is the number of iterations, with $t \ge 2$, and iteration continues until $S^t$ converges, meaning that the value of $S^t$ no longer changes; the calculation is shown in formulas (1) and (2);

$$S^t = \rho^t(F, S^{t-1}, V^{t-1}), \quad t \ge 2 \qquad (1)$$

$$V^t = \phi^t(F, S^{t-1}, V^{t-1}), \quad t \ge 2 \qquad (2)$$

among the output forms of the OpenPose algorithm, the COCO output model is adopted; the 12 key points numbered 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, and 17 in the COCO model are activated, and the coordinates of the activated key points are used as the raw input data of the next stage;

(2) coordinate normalization

The obtained key point coordinates are position-normalized by the method shown in formula (3); in the formula, the coordinate origin $(A_0, B_0)$ is the key point of the human neck, $(A_{max}, B_{max})$ and $(A_{min}, B_{min})$ are respectively the maximum and minimum values in the sample data, and $(A_b, B_b)$ and $(A_b, B_b)$ are respectively the key point coordinates before and after the normalization;

(3) BP neural network classification

Inputting the normalized key points obtained in the step (2) into a BP neural network, and training the BP neural network; the BP neural network consists of: the number of neurons in an input layer is 1 multiplied by 24, 24 are obtained by changing 12 2-dimensional coordinate point data into one-dimensional data, the number of neurons in a key point hiding layer is 32, an output layer comprises 2 neurons and respectively represents a blackboard-writing state and a non-blackboard-writing state, and the neurons are distinguished through a Softmax classifier; the output of the Softmax function and the loss function L () are respectively shown as formula (4) and formula (5); y in formula (4)_qThe vector with q key points after normalization processing obtained in the step (2) corresponds to one Softmax for each q, n is the number of output categories, and the teacher action is classified into two categories, namely n is 2; y is_q' is the output value of the Softmax function, and for convenience of writing, is hereinafter denoted by y_q' instead of representing the output value of the Softmax function; hereinafter, unless otherwise specified, the loss function L () is referred to in the same sense and is calculated by the calculation method of formula (5);

2. The teacher blackboard writing action recognition method based on multi-granularity convolutional neural network pruning as claimed in claim 1, wherein the OpenPose algorithm in step (1) is specifically as follows:

(2.1) Truncating the refinement stage

Analysis of OpenPose shows that the algorithm uses the first 4 convolution modules of a VGG19 network for image feature extraction, and that after feature extraction two convolutional layers, Conv4-3 and Conv4-4, reduce the dimensionality of the feature map; the dimension-reduced feature maps are input into two branches, which respectively perform regression of the human body key points and prediction of the part affinity fields that represent the degree of association between two key points; the two branches have the same cascaded network structure, consisting of an initial stage and a refinement stage repeated 2 times;

(2.2) model compression of VGG19 backbone networks

The VGG19 network comprises 16 convolutional layers and 3 fully connected layers, and all convolution kernels are 3×3; the proposed multi-granularity convolutional neural network pruning framework is used to compress the backbone network of OpenPose; first, the parameters of the two cascaded branches of OpenPose (initial stage and refinement stage) and of the BP neural network part are fixed, and multi-granularity convolutional neural network pruning is carried out on the VGG19 network; during pruning, the parameters of the other layers of the network are kept fixed and only the first 10 layers of the network are pruned; during retraining, the parameters of the other layers are kept fixed and only the parameters of the first 10 layers are updated; finally, the whole OpenPose algorithm is retrained on the COCO data set to recover the accuracy lost through the changes to the VGG19 network; replacing the original network model with the pruned network model completes the optimization of the OpenPose algorithm; apart from the pruning, no other part of the VGG19 network is changed;

the pruning process is as follows:

(3.1) performing filter level pruning on the convolutional layers of the input network model layer by layer, specifically:

first, a number of images are randomly selected as an evaluation set; for each input image, the mean of the filter's output feature map is calculated and used as the response value of the filter to that image, thereby obtaining the response tensor for the image set; then, the information entropy is used to measure the degree of variation of the tensor: the value range of the tensor elements is divided equally into m blocks (m = 10 is recommended), the number of elements falling in each block is counted, the occurrence probability p_j of each block is calculated, and the information entropy is calculated according to formula (6);

where H_{i,k} denotes the information entropy of the tensor generated by the k-th filter of the i-th layer, j denotes the current block, and N and C_i denote respectively the number of network layers and the number of channels contained in the i-th convolutional layer; after the information entropy has been calculated, the filters in the i-th convolutional layer are sorted in ascending order of information entropy; the user sets the expected compression ratio C_r according to an assessment of the convolutional neural network to be compressed; C_r can be understood intuitively as the proportion of filters that the user expects to remain in the network after compression; the value of C_r is between 0 and 1, and 0.5 is recommended; the number of filters to be pruned from the corresponding layer is calculated by formula (7);

R_i＝C_i(1-C_r) (7)

after the i-th layer has been sorted, the first R_i filters are deleted and the corresponding two-dimensional convolution kernels in layer i+1 are removed, which completes the pruning;
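The entropy-based ranking and formula (7) can be sketched as below. Formula (6) is not reproduced in this text, so the sketch assumes the standard Shannon entropy with the natural logarithm; rounding R_i to the nearest integer, the array layout and the function names are also assumptions.

```python
import numpy as np

def filter_entropy(responses, m=10):
    """Entropy of one filter's responses over the evaluation set.

    responses: 1-D array, one mean feature-map value per evaluation image.
    The value range is split into m equal-width blocks, as described in the text.
    """
    counts, _ = np.histogram(responses, bins=m)
    p = counts / counts.sum()
    p = p[p > 0]                       # empty blocks contribute nothing
    return float(-(p * np.log(p)).sum())

def filters_to_prune(feature_maps, c_r=0.5, m=10):
    """feature_maps: (num_images, C_i, H, W) outputs of layer i (assumed layout).

    Returns the indices of the R_i lowest-entropy filters and all entropies.
    """
    responses = feature_maps.mean(axis=(2, 3))            # (num_images, C_i)
    c_i = responses.shape[1]
    entropies = np.array([filter_entropy(responses[:, k], m) for k in range(c_i)])
    r_i = int(round(c_i * (1 - c_r)))                     # formula (7): R_i = C_i (1 - C_r)
    order = np.argsort(entropies)                         # ascending entropy
    return order[:r_i], entropies

rng = np.random.default_rng(0)
maps = rng.normal(size=(64, 4, 8, 8))
maps[:, 2] = 1.0        # filter 2 responds identically to every image, so its entropy is 0
pruned, entropies = filters_to_prune(maps, c_r=0.5)
```

A filter whose response barely varies across images carries little information about the input, which is why the lowest-entropy filters are the ones removed first.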

in the implementation, a binary mask matrix T with exactly the same dimensions as the convolutional neural network model is set up; T is a 0-1 matrix representing the pruning state, each element of T corresponds to one parameter in the network model, and the initial value of every element is 1; when a filter is pruned, the matrix elements corresponding to that filter are all set to 0; thus, for the filter bank W_{i,k} with input feature map F_i, the convolution operation changes as shown in formula (8);

in which the two operators denote the convolution operation and the Hadamard product, respectively;
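The role of the mask matrix can be illustrated with a small NumPy sketch. Formula (8) is not reproduced in this text; the sketch only shows the element-wise (Hadamard) product of weights and mask that the text describes, applied to layer i and to the matching kernels of layer i+1. The (filters, input channels, k, k) weight layout and the function name are assumptions.

```python
import numpy as np

def mask_filters(mask_i, mask_next, filter_ids):
    """Clear the mask entries for the pruned filters of layer i and for the
    corresponding input kernels of layer i+1. Inputs are left unmodified."""
    mask_i = mask_i.copy()
    mask_next = mask_next.copy()
    mask_i[filter_ids] = 0           # whole filters of layer i
    mask_next[:, filter_ids] = 0     # two-dimensional kernels in layer i+1 fed by those filters
    return mask_i, mask_next

w_i = np.ones((4, 3, 3, 3))          # layer i: 4 filters over 3 input channels
w_next = np.ones((8, 4, 3, 3))       # layer i+1: 8 filters over 4 input channels
t_i, t_next = mask_filters(np.ones_like(w_i), np.ones_like(w_next), [1, 3])

masked_i = w_i * t_i                 # Hadamard product of weights and mask
masked_next = w_next * t_next
```

Keeping the mask separate from the weights means the original parameter values survive pruning, which is what later allows a connection to be restored.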

the convolutional neural network is a feedforward neural network whose basic unit is the neuron; a number of neurons form a two-dimensional vector used to extract basic features of an image, hereinafter called a feature matrix for short; a number of two-dimensional vectors form a convolutional layer; two adjacent convolutional layers are connected through neurons to transmit information, while the neurons within one convolutional layer are independent of each other; the convolutional layers extract features from the input vectors, and each convolutional layer consists of a number of filter banks trained by the back-propagation algorithm; let w_i and h_i denote respectively the width and height of the input three-dimensional feature vector; the output three-dimensional feature vector in turn serves as the input of the next convolutional layer; the convolutional layer operation is implemented by applying C_{i+1} filters to the C_i input channels, each filter generating one feature vector and each filter consisting of C_i convolution kernels of size k × k; therefore, the number of operations of the (i+1)-th convolutional layer is C_{i+1} · C_i · k² · h_{i+1} · w_{i+1}; pruning one filter of the i-th layer removes C_i · k² · h_{i+1} · w_{i+1} operations, and because the corresponding input feature vector of the (i+1)-th layer is removed as well, a further C_{i+2} · k² · h_{i+2} · w_{i+2} operations are saved; pruning M filters of the i-th layer, where M takes the same value as for the previous layer, reduces the computation of the i-th layer and of the (i+1)-th layer by M/C_{i+1} each;
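The operation counts above can be checked with a short worked example. The layer sizes used here are hypothetical and chosen only for illustration; the function names are not part of the method.

```python
def conv_ops(c_in, c_out, k, h_out, w_out):
    """Operation count C_out * C_in * k^2 * h_out * w_out of one convolutional layer."""
    return c_out * c_in * k * k * h_out * w_out

def ops_saved(c_i, c_i2, k, h1, w1, h2, w2, m):
    """Operations saved by pruning m filters from the layer that maps C_i to C_{i+1} channels."""
    saved_here = m * c_i * k * k * h1 * w1     # the pruned filters themselves
    saved_next = m * c_i2 * k * k * h2 * w2    # kernels of the next layer that read the removed maps
    return saved_here, saved_next

# Hypothetical sizes: C_i = 64, C_{i+1} = 128, C_{i+2} = 256, 3x3 kernels.
c_i, c_i1, c_i2, k = 64, 128, 256, 3
h1 = w1 = 56
h2 = w2 = 28
m = 32

here, nxt = ops_saved(c_i, c_i2, k, h1, w1, h2, w2, m)
frac_here = here / conv_ops(c_i, c_i1, k, h1, w1)
frac_next = nxt / conv_ops(c_i1, c_i2, k, h2, w2)
```

Both fractions equal M / C_{i+1} = 32 / 128, matching the statement that each of the two layers loses the same share of its computation.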

(3.2) performing connection level pruning on the network model after the filter level network pruning, specifically:

a dynamic pruning method is adopted, and thresholds TH_A and TH_B are set by formula (9), where TH_B ≥ TH_A ≥ 0; connections below threshold TH_A are pruned and connections above threshold TH_B are restored; this recoverable mechanism solves the problem that the network cannot recover when important connections are deleted by mistake during pruning;

W_{i,k} in formula (9) denotes the group of parameters in the k-th filter of the i-th layer; mean() denotes the average value of this group of parameters, and std() denotes its standard deviation; s takes the value -1; Δt takes the value -2 × s × std(W_{i,k}); pruning and restoring a connection are realized by clearing and setting the corresponding element of the mask matrix; let W_{i,k}(p) be the p-th parameter of filter W_{i,k} and T_{i,k}(p) the corresponding element in the mask matrix; the update strategy for each element of the mask matrix is shown in formula (10);
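The recoverable mask update can be sketched as follows. Formulas (9) and (10) are not reproduced in this text, so the thresholds are passed in as plain arguments rather than computed; comparing the absolute value of each parameter against the thresholds, and leaving the mask unchanged for values between TH_A and TH_B, are assumptions of this sketch.

```python
import numpy as np

def update_mask(weights, mask, th_a, th_b):
    """Return an updated 0-1 mask for one filter.

    Connections with magnitude below th_a are pruned (mask 0), connections with
    magnitude above th_b are restored (mask 1), and all others keep their state.
    """
    if not 0 <= th_a <= th_b:
        raise ValueError("thresholds must satisfy TH_B >= TH_A >= 0")
    magnitude = np.abs(weights)
    new_mask = mask.copy()
    new_mask[magnitude < th_a] = 0
    new_mask[magnitude > th_b] = 1
    return new_mask

w = np.array([0.01, 0.2, 0.6, -0.9, 0.3])
t = np.array([1, 1, 0, 0, 0])
new_t = update_mask(w, t, th_a=0.1, th_b=0.5)
```

In this example the first connection is pruned, the third and fourth are restored, and the second and fifth keep their previous state because they fall between the two thresholds.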

when the network parameters are updated, the update strategy of stochastic gradient descent is adopted, as shown in formula (11);

β is the learning rate of the parameter update; to avoid the parameters no longer being updated because β is too small, β takes a value in the range 0.0001 ≤ β ≤ 1;

(3.3) Accuracy recovery training strategy: use of L1 and L2 regularization

Minimizing the objective function, see equation (12);

in formula (12), ω denotes the parameters of the network model to be processed, and the minimizer of the objective gives the parameters obtained after regularization; λ is the regularization parameter, whose value is defined in the descriptions of L1 and L2 regularization below; the first term in formula (12) represents the error between the network model's predicted value f(x_e; ω) for the e-th sample and the training label y_e; the second term Ω(ω) in formula (12) is the regularization function of the parameter ω; there are many choices for Ω(ω), and L1 regularization and L2 regularization are the ones described here; both are adopted: specifically, L1 regularization is introduced in the recovery training after filter-level pruning is completed, and L2 regularization is introduced in the recovery training of the connection-level pruning method;

L1 regularization

After filter-level pruning is completed, introducing L1 regularization in the training recovery process; the L1 regular term of the parameter ω to be processed of the known network model is shown in formula (13);

where n_L1 is the number of batch-processed parameters contained in ω; the L1 regular term is calculated by summing the absolute values of the parameters;

according to the definition of regularization, a loss function with an L1 regularization term is shown in an equation (14);

differentiating the objective function with the L1 regularization term gives the result shown in formula (15);

under L1 regularization, when ω is updated by the gradient descent method, the update is shown in formula (16), where β is the learning rate of the parameter update and 0.0001 ≤ β ≤ 1;

in the gradient descent update of formula (16), the regularization parameter satisfies λ ≥ 0;
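Formulas (13) to (16) are not reproduced in this text. As a rough sketch under the usual textbook form, the L1 term is the sum of absolute values and its (sub)gradient is the sign of each parameter; any scaling of λ by n_L1 that the original formulas may contain is omitted here, and the function names are hypothetical.

```python
import numpy as np

def l1_penalty(omega):
    """L1 regular term: sum of the absolute values of the parameters."""
    return float(np.abs(omega).sum())

def l1_sgd_step(omega, grad_loss, beta=0.0001, lam=0.01):
    """One gradient descent step on loss + lam * L1 penalty.

    grad_loss is the gradient of the unregularized loss with respect to omega.
    np.sign(omega) is the subgradient of the L1 term (0 where omega is 0).
    """
    return omega - beta * (grad_loss + lam * np.sign(omega))

omega = np.array([0.5, -0.2, 0.0])
grad = np.array([0.1, 0.1, 0.1])
penalty = l1_penalty(omega)
new_omega = l1_sgd_step(omega, grad, beta=0.1, lam=1.0)
```

The sign term pushes every non-zero parameter toward zero by the same amount per step, which encourages sparsity after filter-level pruning.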

L2 regularization

L2 regularization is introduced in the recovery training after connection-level pruning, and is described below; the L2 regular term of the network-model parameter ω to be processed is shown in formula (17), where n_L2 denotes the number of batch-processed parameters contained in ω;

the L2 regular term is calculated as the sum of squares of the parameters;

according to the definition of regularization, the loss function with the L2 regularization term is shown in formula (18), in which one symbol denotes the original function before regularization and the other denotes the function obtained after L2 regularization;

differentiating the objective function with the L2 regularization term gives the result shown in formula (19);

under L2 regularization, when ω is updated by the gradient descent method, the update is shown in formula (20), where n and β are defined as above and λ is the regularization parameter; first, the learning rate β is determined according to the expected number of training iterations, and 0.0001 is recommended; then the value of λ is tuned from coarse to fine: starting from an initial value of 1, λ is gradually increased or decreased, the parameters are learned on the training set, and the error is then checked on the test set, in order to find the λ that gives the smaller test-set error; this process is repeated until the error on the test set is minimal; specifically, the regularization parameter is first set to 1 and then increased step by step by a factor of 10, according to the range of the validation set; if the error on the test set is unchanged or has increased after 2 to 3 trials, the search switches to decreasing λ, by a factor of 10 each time; this continues until the order of magnitude that minimizes the test-set error is found; then, at this order of magnitude, a finer adjustment is made: starting from 0 in the lowest digit, the lowest digit is increased by 1 each time until the value that minimizes the test-set error is found.
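The coarse-to-fine search for λ can be sketched as below. This is one possible interpretation of the procedure: the coarse pass moves in powers of ten from λ = 1, upward first and downward only if nothing larger improves the error, and the fine pass tries the multiples 1 to 9 of the best order of magnitude. The function `error_fn` is a hypothetical stand-in for "train with this λ and return the test-set error"; the toy error used in the example is not real data.

```python
import math

def coarse_to_fine_lambda(error_fn, patience=3, max_steps=8):
    """Search for the lambda with the smallest error, coarse (x10 steps) then fine."""

    def scan(factor):
        # Step lambda by the given factor, stopping after `patience` non-improving trials.
        best_lam, best_err = 1.0, error_fn(1.0)
        lam, misses = 1.0, 0
        for _ in range(max_steps):
            lam *= factor
            err = error_fn(lam)
            if err < best_err:
                best_lam, best_err, misses = lam, err, 0
            else:
                misses += 1
                if misses >= patience:
                    break
        return best_lam, best_err

    magnitude, best_err = scan(10.0)           # coarse pass, increasing
    if magnitude == 1.0:
        magnitude, best_err = scan(0.1)        # no improvement upward: decrease instead

    best_lam = magnitude
    for digit in range(1, 10):                 # fine pass within the chosen order of magnitude
        lam = digit * magnitude
        err = error_fn(lam)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err

# Toy error curve with its minimum at lambda = 0.03 (illustrative only).
toy_error = lambda lam: (math.log10(lam) - math.log10(0.03)) ** 2
best_lam, best_err = coarse_to_fine_lambda(toy_error)
```

In practice each call to `error_fn` is a full recovery-training run, so the `patience` and `max_steps` limits (both assumptions here) bound the cost of the search.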