CN109492581A

CN109492581A - A human action recognition method based on the TP-STG framework

Info

Publication number: CN109492581A
Application number: CN201811328308.3A
Authority: CN
Inventors: 宫法明; 马玉辉; 宫文娟
Original assignee: China University of Petroleum East China
Current assignee: China University of Petroleum East China
Priority date: 2018-11-09
Filing date: 2018-11-09
Publication date: 2019-03-19
Anticipated expiration: 2038-11-09
Also published as: CN109492581B

Abstract

The invention discloses a human action recognition method based on the TP-STG framework. The method comprises: taking video information as input, adding prior knowledge to an SVM classifier, and proposing a posterior criterion to remove non-personnel targets; segmenting personnel targets with a target positioning and detection algorithm and outputting them as target boxes with coordinate information, which provides the input data for human keypoint detection; performing body-part localization and association analysis with an improved pose recognition algorithm to extract all human keypoint information and form a keypoint sequence; and constructing a spatio-temporal graph on the keypoint sequence with an action recognition algorithm, applying multi-layer spatio-temporal graph convolution to it, and classifying the action with a Softmax classifier, thereby achieving human action recognition in complex scenes. The method is the first to address the real-world scene of an offshore platform: the proposed TP-STG framework is the first attempt to recognize worker activity on an offshore drilling platform by combining target detection, pose recognition and spatio-temporal graph convolution.

Description

A human action recognition method based on the TP-STG framework

Technical field

The invention belongs to the fields of computer vision and image processing, and relates to a human action recognition method based on the TP-STG framework.

Background technique

With the spread and widespread use of surveillance cameras, massive volumes of video data put enormous pressure on manual identification. When video data is analyzed and assessed manually, the manpower, experience and analytical ability of the monitoring personnel become a bottleneck that limits the overall efficiency of intelligent behavior discrimination applications. In recent years, as research has advanced, human action recognition has made some progress. The most common traditional methods are template matching, three-dimensional analysis and time-series methods, but they are computationally expensive, susceptible to noise and lacking in robustness. They also give insufficient consideration to the overall, global pattern of an action, and the features they extract are few and simple, so recognition accuracy is low.

Earlier human action recognition algorithms perform well in a single specific scene, but most of them are only suitable for simple scenes and are strongly affected by environmental factors. When traditional algorithms are applied to complex scenes, factors such as cluttered backgrounds make it difficult to detect human action features correctly, and recognition performance drops sharply. Human action recognition in complex scenes is therefore a problem that urgently needs to be solved.

Summary of the invention

The object of the invention is to provide a human action recognition method based on the TP-STG framework. The method solves the problem that the prior art recognizes human actions in complex scenes poorly and with large error. It can be used for human action recognition in complex scenes: it performs target positioning and pose recognition and estimation for personnel in dynamic scenes, and achieves accurate recognition of the actions of personnel targets in images.

To achieve the above object, the invention provides a human action recognition method based on the TP-STG framework, the method comprising:

(S100) taking video information as input, adding prior knowledge to an SVM classifier, and proposing a posterior criterion to remove non-personnel targets;

(S200) segmenting personnel targets with a target positioning and detection algorithm and outputting them as target boxes with coordinate information, which provides the input data for human keypoint detection;

(S300) performing body-part localization and association analysis with an improved pose recognition algorithm to extract all human keypoint information and form a keypoint sequence;

(S400) constructing a spatio-temporal graph on the keypoint sequence with an action recognition algorithm, applying multi-layer spatio-temporal graph convolution to it, and classifying the action with a Softmax classifier, thereby achieving human action recognition in complex scenes.

In step S100, prior knowledge suited to the specific scene is matched to the environmental characteristics of that scene. Take the complex scene of an offshore oil platform: the safety clothing worn by personnel is very similar in color and form to certain cylindrical pipes, so color, texture and shape features can hardly tell them apart, and conventional models designed for simple scenes often confuse the two, which leads to a high false-alarm rate. To address this, the invention adds prior knowledge to an SVM classifier: the SVM is pre-trained on the detection targets and the confusable targets, and the non-personnel targets it identifies are treated as negative samples and removed. This reduces the computation spent on negative samples and improves the accuracy of target detection in the next stage.
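
By way of non-limiting illustration only, the filtering idea of step S100 can be sketched as a small linear SVM trained by sub-gradient descent on the hinge loss. The two-dimensional feature vectors, the training values, the hyperparameters and the names `train_linear_svm`, `is_person` and `remove_non_personnel` are all assumptions made for this sketch; the source does not specify the features, the kernel or the training procedure.

```python
# Hypothetical 2-D features per candidate region (for example a centered
# color score and a centered shape score). Label +1 = person, -1 = pipe.
TRAIN_X = [(2.0, 2.0), (3.0, 2.5), (-2.0, -2.0), (-3.0, -2.5)]
TRAIN_Y = [1, 1, -1, -1]


def train_linear_svm(samples, labels, lr=0.1, reg=0.01, epochs=20):
    """Fit (w, b) by sub-gradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks w on every step.
            w = [wi * (1.0 - lr * reg) for wi in w]
            # The hinge term only contributes when the margin is violated.
            if margin < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def is_person(features, w, b):
    """Positive side of the decision boundary is treated as 'person'."""
    return sum(wi * xi for wi, xi in zip(w, features)) + b > 0


def remove_non_personnel(candidates, w, b):
    """Drop candidates the classifier labels as non-personnel (negatives)."""
    return [c for c in candidates if is_person(c["features"], w, b)]
```

In this sketch the candidates that survive `remove_non_personnel` would be the only ones passed to the detection stage, which mirrors the stated aim of reducing computation on negative samples.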

Preferably, the target positioning and detection algorithm comprises:

(S210) converting the video data into image data through data preprocessing and labeling the samples, to serve as the input data of the algorithm;

(S211) dividing the input image into N*N grid cells and, through feature extraction, predicting r bounding boxes for each grid cell; if the center of an object falls within a bounding box, the corresponding grid cell is responsible for detecting that object;

(S212) running a single convolutional network on the image to compute the confidence scores of the bounding boxes; these scores reflect how likely a bounding box is to contain a target and how accurate the predicted box is;

(S213) increasing the loss for bounding-box coordinate prediction and decreasing the loss for confidence prediction of bounding boxes that contain no object, to prevent the model from diverging early and becoming unstable;

(S214) normalizing the bounding-box width and height by a certain ratio so that they fall between 0 and 1, and obtaining the final predicted target class probabilities and bounding-box coordinate information.
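
By way of non-limiting illustration only, steps S211 and S214 can be sketched as follows. The sketch assumes that the responsible grid cell is the one containing the object center and that width and height are normalized by the image size; the source only says "a certain ratio", and the function names are hypothetical.

```python
def responsible_cell(cx, cy, img_w, img_h, n):
    """Return (row, col) of the grid cell that contains the object center."""
    col = min(int(cx / img_w * n), n - 1)
    row = min(int(cy / img_h * n), n - 1)
    return row, col


def encode_box(cx, cy, bw, bh, img_w, img_h, n):
    """Encode a pixel-space box as (row, col, x, y, w, h) with values in [0, 1].

    x, y are the center offsets inside the responsible cell; w, h are the
    box width and height divided by the image size (an assumed ratio).
    """
    row, col = responsible_cell(cx, cy, img_w, img_h, n)
    x = cx / img_w * n - col
    y = cy / img_h * n - row
    return row, col, x, y, bw / img_w, bh / img_h
```

For example, with a 448 x 448 image and N = 7, a box centered at (224, 100) is assigned to the cell in row 1, column 3.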

Preferably, in step S212, computing the confidence score requires defining the degree of intersection between the predicted bounding box and the actual bounding box, which serves as the basis for the score. If no target exists in the predicted bounding-box cell, the confidence score should be zero; otherwise the confidence score equals the product of the intersection PIA between the predicted box and the ground-truth object bounding box and the ground-truth target box. The confidence is thus defined as:

Cr = Gr(Object) × PIA (1)

In formula (1), Cr denotes the confidence, Gr(Object) denotes the ground-truth target box, and PIA denotes the intersection between the predicted box and the ground-truth object bounding box.
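
By way of non-limiting illustration only, the confidence definition can be sketched in code. The sketch assumes that the intersection measure is the usual intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2), and that the ground-truth term is 1 when a target is present and 0 otherwise; both are assumptions, as are the function names.

```python
def intersection_over_union(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence(pred_box, truth_box):
    """Zero when no target is present, otherwise the overlap with the truth."""
    if truth_box is None:
        return 0.0
    return 1.0 * intersection_over_union(pred_box, truth_box)
```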

Preferably, in step S213 the loss is computed by an overall target loss function, in which the loss function L(x, y, w, h) for coordinate prediction is:

In formula (2), x and y denote the coordinates of the bounding-box center relative to the grid cell, and w and h denote the width and height relative to the entire input image; the paired terms in the formula are the mean values of x_i, y_i, w_i and h_i respectively; the indicator term marks whether the j-th grid in the i-th network is responsible for the target object; and λ denotes a weight.

The loss function L(obj) for confidence prediction of bounding boxes containing a real target is expressed as:

In formula (3), C_i denotes the confidence value of the i-th bounding box. For bounding boxes that contain no real target, the loss function L(noobj) for confidence prediction is expressed as:

The loss function L(class) for target class prediction is expressed as:

In formula (5), p_i(c) denotes the probability of the i-th target class and the paired term denotes the mean probability of the i-th target class. Once each loss function has been obtained, they are fused by weighting into the final target loss function L(f):

L(f) = L(x, y, w, h) + L(obj) + L(noobj) + L(class) (6)

In formula (6), the final target loss function L(f) is the weighted sum of the coordinate prediction loss, the confidence prediction loss for bounding boxes containing a real target, the confidence prediction loss for bounding boxes without a real target, and the target class prediction loss.
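
The individual loss formulas are not reproduced in this text. By way of non-limiting illustration only, and assuming they follow the widely used single-stage sum-of-squares form that the symbol definitions above resemble, the four terms could be written as below. The hatted symbols, the indicator notation, the square roots on w and h, and the names \lambda_{coord} and \lambda_{noobj} are assumptions of this sketch and are not taken from the source.

```latex
\begin{align*}
L(x,y,w,h) &= \lambda_{\mathrm{coord}} \sum_{i=1}^{N^2} \sum_{j=1}^{r}
  \mathbb{1}_{ij}^{\mathrm{obj}} \Big[ (x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2
  + \big(\sqrt{w_i}-\sqrt{\hat{w}_i}\big)^2 + \big(\sqrt{h_i}-\sqrt{\hat{h}_i}\big)^2 \Big] \\
L(\mathrm{obj}) &= \sum_{i=1}^{N^2} \sum_{j=1}^{r}
  \mathbb{1}_{ij}^{\mathrm{obj}} \,(C_i-\hat{C}_i)^2 \\
L(\mathrm{noobj}) &= \lambda_{\mathrm{noobj}} \sum_{i=1}^{N^2} \sum_{j=1}^{r}
  \mathbb{1}_{ij}^{\mathrm{noobj}} \,(C_i-\hat{C}_i)^2 \\
L(\mathrm{class}) &= \sum_{i=1}^{N^2} \mathbb{1}_{i}^{\mathrm{obj}}
  \sum_{c} \big(p_i(c)-\hat{p}_i(c)\big)^2
\end{align*}
```

Under these assumptions, a weight \lambda_{coord} greater than 1 and a weight \lambda_{noobj} less than 1 would correspond to step S213, which increases the coordinate loss and decreases the confidence loss for boxes without objects.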

Preferably, the improved pose recognition algorithm comprises:

(S310) taking as input the color image of size w*h obtained by target detection in the previous stage;

(S311) adopting a multi-scale scheme, enlarging the receptive field by a factor of 1.0 to 1.2;

(S312) obtaining a feature map F through feature extraction by the first 8 layers of a VGG network;

(S313) splitting the network into two iterative branches: one branch predicts the two-dimensional confidence maps S of body-part positions and performs body-part localization to obtain all visible human keypoints; the other branch predicts the two-dimensional vector fields L of pixels within the skeleton and performs association analysis to obtain the invisible human keypoints;

(S314) the iterative branches take the feature map F as input and produce a first set S_1, L_1;

(S315) each subsequent stage takes the previous stage's outputs S_(t-1), L_(t-1) together with the feature map F as input, and the iteration continues;

(S316) after p stages, outputting the final S(p) and L(p);

(S317) computing the L2 norm between the predicted values of S and L and the ground truth (S*, L*); the ground truth of S and L is computed from the annotated 2D points, and if the annotation of a keypoint is missing, the value at that point is not computed; finally, the information of all keypoints is output.
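
By way of non-limiting illustration only, the data flow of steps S314 to S317 can be sketched as follows. The stage networks are passed in as plain callables because the source does not specify their internals; `run_stages` and `masked_l2` are hypothetical names, and flat lists of numbers stand in for the real maps.

```python
def run_stages(features, first_stage, refine_stage, p):
    """Run p stages: the first sees only F, later ones see F and (S, L)."""
    s, l = first_stage(features)
    outputs = [(s, l)]
    for _ in range(2, p + 1):
        s, l = refine_stage(features, s, l)
        outputs.append((s, l))
    return outputs  # outputs[-1] holds the final S(p), L(p)


def masked_l2(pred, truth, annotated):
    """Squared L2 distance that skips points whose annotation is missing."""
    return sum((pv - tv) ** 2
               for pv, tv, ok in zip(pred, truth, annotated) if ok)
```

In such a sketch the per-stage loss would be `masked_l2` applied to each stage's S and L outputs, so unannotated keypoints do not penalize the prediction.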

Preferably, in step S313, the two iterative branches regress S and L respectively. The loss is computed once per stage, after which S and L, together with the original input, are fed into the next stage for training. As the number of iterations increases, on the basis of the known keypoint positions, the relative positional relations between body parts are established from the displacement between vectors, which makes it possible to predict and estimate the invisible human keypoints.

Preferably, in step S317, the ground truth of S (S*) is computed from the 2D points X_(j,k) annotated in the image, where X_(j,k) denotes the j-th type of keypoint of the k-th person in the image and the corresponding per-person confidence value follows a normal distribution. When a pixel P is close to the annotated point X_(j,k), the peak of the normal curve is reached; the S for the j-th type of keypoint in each image is then the peak of the normal distributions of the k people in the image:

In formula (7), σ is the parameter of the normal distribution. For S, each type of keypoint has one channel. When the ground truth is generated, the response peak of each point is retained by taking the maximum over the multiple distributions; the value of the j-th type of keypoint of the k-th person is obtained when the peak reaches its maximum:
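
By way of non-limiting illustration only, the confidence-map ground truth described above can be sketched as follows. The exact formulas are not reproduced in this text, so the sketch assumes an unnormalized Gaussian exp(-d^2 / sigma^2) per person and a maximum over people; the function names are hypothetical.

```python
import math


def keypoint_confidence(p, x_jk, sigma):
    """Gaussian response at pixel p for one person's annotated keypoint."""
    d2 = (p[0] - x_jk[0]) ** 2 + (p[1] - x_jk[1]) ** 2
    return math.exp(-d2 / sigma ** 2)


def confidence_map_value(p, keypoints_of_people, sigma):
    """Channel value at p: maximum over all people, skipping missing labels."""
    return max((keypoint_confidence(p, x, sigma)
                for x in keypoints_of_people if x is not None),
               default=0.0)
```

Taking the maximum rather than the average keeps the peak of each person distinct when two people stand close together.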

The ground truth of L (L_b*) is computed from the unit vector at any pixel P between two keypoints X_(j1,k) and X_(j2,k) of the k-th person, where L denotes the part affinity of the human body, k denotes the k-th person, j1 and j2 denote two joints that can be connected (for example, the head is connected directly to the torso through the neck), and b denotes the b-th type of body part.

Computing the ground truth of L (L_b*) is relatively complex. When keypoint X_(j1,k) of the k-th person in the image points to X_(j2,k), the unit vector must be computed case by case, as shown in formula (9):

Here the magnitude and direction of v are fixed, and x_(j,k) denotes the position of the j-th keypoint of the k-th person. Whether pixel P falls on the limb is decided by two conditions, which act as a threshold range:

In formula (10), l_(b,k) and σ_l denote the length and width of the limb respectively. Limbs of the same type are averaged over all personnel targets, so that the number of output channels of L equals the number of limb types, which gives the mean:

In formula (11), the mean two-dimensional vector field at part b in each image is the average of the part affinity vectors of the k people at pixel P. Once keypoints d_j1 and d_j2 and the part affinity vectors between them are known, the correlation R between the two keypoints is computed as the integral, along the line connecting d_j1 and d_j2, of the dot product between the line's vector and the part affinity vector at each pixel on the line:

In formula (12), w denotes the weight between the two keypoints, taking values in [0, 1]; P(w) denotes the weighting function; and L_b(P(w)) denotes the part affinity vector at each pixel on the line.
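
By way of non-limiting illustration only, the part affinity ground truth and the correlation between two keypoints can be sketched as follows. The formulas themselves are not reproduced in this text, so the sketch assumes that a pixel lies on the limb when its projection along the limb is within the limb length and its perpendicular distance is within σ_l, that P(w) linearly interpolates between the two keypoints, and that the integral is approximated by midpoint sampling. All function names are hypothetical.

```python
import math


def unit_vector(a, b):
    """Unit vector from a to b and the distance between them."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    return (dx / length, dy / length), length


def paf_ground_truth(p, x1, x2, sigma_l):
    """Unit limb direction if p lies on the limb x1 -> x2, else zero."""
    v, length = unit_vector(x1, x2)
    rx, ry = p[0] - x1[0], p[1] - x1[1]
    along = v[0] * rx + v[1] * ry            # projection onto the limb
    across = abs(-v[1] * rx + v[0] * ry)     # distance from the limb axis
    if 0.0 <= along <= length and across <= sigma_l:
        return v
    return (0.0, 0.0)


def association_score(d1, d2, field, samples=10):
    """Approximate the line integral of field(p) . v between d1 and d2."""
    v, _ = unit_vector(d1, d2)
    total = 0.0
    for s in range(samples):
        u = (s + 0.5) / samples              # midpoint rule on [0, 1]
        p = ((1 - u) * d1[0] + u * d2[0], (1 - u) * d1[1] + u * d2[1])
        lx, ly = field(p)
        total += lx * v[0] + ly * v[1]
    return total / samples
```

A score close to 1 indicates that the field along the candidate connection points in the same direction as the connection, which supports pairing the two keypoints.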

Preferably, the action recognition algorithm comprises:

(S410) taking as input the continuous sequence of human keypoints obtained by pose recognition in the previous stage;

(S411) constructing a spatio-temporal graph according to graph-structure rules, making full use of spatial and temporal structural information;

(S412) applying multi-layer graph convolution on the spatio-temporal graph to extract high-level features, in accordance with the spatial partition rule;

(S413) determining the number of neighborhood subsets of the spatio-temporal graph nodes, designing the corresponding spatial partition rules, and deciding which rule to use;

(S414) classifying the action with a standard Softmax action classifier;

(S415) outputting the action class label and the corresponding action score.
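
By way of non-limiting illustration only, steps S414 and S415 can be sketched as a Softmax over raw class scores followed by selection of the best class. The function names and the example labels in the usage note are hypothetical; the source does not list the action classes.

```python
import math


def softmax(scores):
    """Numerically stable Softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_action(scores, labels):
    """Return the action label with the highest probability and its score."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]
```

For instance, `classify_action([2.0, 1.0, 0.1], ["walking", "climbing", "bending"])` returns the first label together with its probability.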

Preferably, in step S411, the spatio-temporal graph is constructed as follows. Within each video frame, a spatial graph is built according to the natural skeletal connections of the human body, and the same keypoint in two adjacent frames is connected to form a temporal edge. The keypoints in all input frames form the node set V and all directed edges form the edge set E, which gives the spatio-temporal graph G = (V, E). In this way the spatial information of the skeleton keypoints is naturally preserved, and the motion trajectories of the keypoints are expressed as temporal edges. Specifically, the node set V = {v_ti | t = 1, 2, ..., T; i = 1, 2, ..., N} contains all the joints in the keypoint sequence, where T is the number of video frames and N is the number of human keypoints, set to 18. When the graph is constructed, the feature vector F(v_ti) of the i-th joint in frame t consists of the keypoint's coordinate information and confidence. The edge set E consists of two subsets: the links between joints within each video frame, E_s = {v_ti v_tj | (i, j) ∈ P}, and the links between different video frames, E_t = {v_ti v_(t+1)i}, where P denotes the set of all human keypoints and i and j are any two joints in that set.
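
By way of non-limiting illustration only, the construction of V, E_s and E_t can be sketched as follows. Frames and joints are indexed from 0 here rather than from 1, the function name is hypothetical, and the skeleton used in the usage note is a placeholder chain, since the source does not list the actual 18-joint connections.

```python
def build_spatiotemporal_graph(num_frames, num_joints, skeleton_edges):
    """Return (nodes, spatial_edges, temporal_edges) of G = (V, E).

    A node is a (frame, joint) pair. Spatial edges follow the skeleton
    inside each frame; temporal edges link the same joint in adjacent frames.
    """
    nodes = [(t, i) for t in range(num_frames) for i in range(num_joints)]
    spatial_edges = [((t, i), (t, j))
                     for t in range(num_frames)
                     for (i, j) in skeleton_edges]
    temporal_edges = [((t, i), (t + 1, i))
                      for t in range(num_frames - 1)
                      for i in range(num_joints)]
    return nodes, spatial_edges, temporal_edges
```

With T = 3 frames, N = 18 joints and a placeholder chain of 17 skeleton edges, the graph has 54 nodes, 51 spatial edges and 36 temporal edges.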

Preferably, in step S413, the set of pixels adjacent to a center pixel, i.e. the neighborhood set, is partitioned in spatial order into a series of T sets, each containing exactly one image pixel; these sets form a partition of the neighborhood set. If the 1-neighborhood of a node is kept as a single subset, the rule is labeled the unique partition. If it is divided into two subsets, namely the node itself and its neighbor nodes, the rule is labeled the distance-based partition. If it is divided into three subsets, namely the node itself, the neighbor nodes spatially closer to the center of gravity of the whole skeleton than the node, and the neighbor nodes farther from the center of gravity, which correspond to centripetal and centrifugal motion as defined in motion analysis, the rule is labeled the spatial configuration partition.
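
By way of non-limiting illustration only, the three partition rules can be sketched as a function that assigns a subset index to each node in the 1-neighborhood of a root node. The strategy names, the function name and the `center_dist` mapping (distance of each joint to the skeleton's center of gravity) are assumptions of this sketch.

```python
def partition_label(root, neighbor, center_dist, strategy):
    """Return the subset index of `neighbor` in the 1-neighborhood of `root`."""
    if strategy == "unique":
        return 0                      # everything in one subset
    if strategy == "distance":
        return 0 if neighbor == root else 1
    if strategy == "spatial":
        if neighbor == root:
            return 0                  # the node itself
        if center_dist[neighbor] < center_dist[root]:
            return 1                  # closer to the center: centripetal
        return 2                      # farther from the center: centrifugal
    raise ValueError("unknown strategy: " + strategy)
```

The number of distinct labels (1, 2 or 3) is the number of neighborhood subsets, which in turn fixes how many parts the convolution kernel parameters have.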

The human action recognition method based on the TP-STG framework of the invention solves the problem that the prior art recognizes human actions in complex scenes with large error and poor real-time performance, and has the following advantages:

(1) The method is the first to address the real-world scene of an offshore platform; the proposed TP-STG framework is the first attempt to recognize worker activity on an offshore drilling platform using target detection, pose recognition and spatio-temporal graph convolution;

(2) Applied to the complex scene of an offshore platform, the method proposes an SVM-based data preprocessing scheme to reduce the influence of cylindrical pipes on target detection, which improves the accuracy of target recognition;

(3) For cases with occluders, the method proposes an improved pose recognition algorithm that uses the target detection results to detect and estimate invisible human keypoints, eliminating the tedious work of manually annotating targets;

(4) By constructing a spatio-temporal graph on the keypoint sequence, the method makes full use of spatial and temporal structural information, applies multi-layer spatio-temporal graph convolution, and classifies and predicts actions with a Softmax classifier, achieving human action recognition in complex scenes.

Detailed description of the invention

Fig. 1 is the structural flow chart of the human action recognition method based on the TP-STG framework of the invention.

Fig. 2 is the flow chart of the target positioning and detection algorithm of the invention.

Fig. 3 is the flow chart of the improved pose recognition algorithm of the invention.

Fig. 4 is the flow chart of the action recognition algorithm of the invention.

Specific embodiment

The technical solution of the invention is further described below with reference to the drawings and embodiments.

A human action recognition method based on the TP-STG framework, whose structural flow chart is shown in Fig. 1, comprises:

(S100) taking video information as input and performing feature extraction, then adding prior knowledge to an SVM classifier and proposing a posterior criterion to remove non-personnel targets;

(S200) segmenting personnel targets with a target positioning and detection algorithm and outputting them as target boxes with coordinate information, with feature selection used to provide information for human keypoint detection;

(S300) performing body-part localization and association analysis with an improved pose recognition algorithm to detect all human keypoint information and form a keypoint sequence;

(S400) constructing a spatio-temporal graph on the keypoint sequence with an action recognition algorithm, applying multi-layer spatio-temporal graph convolution to it, and classifying the action with a Softmax classifier, thereby achieving classification and estimation of human actions in complex scenes.

In step S100, video information is taken as input; through data preprocessing and sample labeling, the video data is converted into image data that can be fed into a deep network, and the labeled image dataset is used to train the personnel target detection model. Prior knowledge suited to the specific scene is matched to the environmental characteristics of that scene. Take the complex scene of an offshore oil platform: the safety clothing worn by personnel is very similar in color and form to certain cylindrical pipes, so color, texture and shape features can hardly tell them apart, and conventional models designed for simple scenes often confuse the two, which leads to a high false-alarm rate. To address this, the invention adds prior knowledge to an SVM classifier: the SVM is pre-trained on the detection targets and the confusable targets, and the non-personnel targets it identifies are treated as negative samples and removed. This reduces the computation spent on negative samples and improves the accuracy of target detection in the next stage.

As shown in Fig. 2, the flow of the personnel target positioning and detection algorithm of the invention comprises:

(S211) dividing the input image into N*N grid cells and, through feature extraction, predicting r bounding boxes for each grid cell; a target judgment mechanism is applied: if the center of an object falls within a bounding box, the corresponding grid cell is responsible for detecting that object, and each grid cell predicts r bounding boxes and the confidence scores of those boxes; if no target is present, detection moves on to the next grid cell;

(S212) running a single convolutional network on the image to compute the confidence scores of the bounding boxes; these scores reflect how likely a bounding box is to contain a target and how accurate the predicted box is, and the way the loss function is computed is selected according to the confidence;

The target positioning and detection algorithm treats target detection as a single regression problem, mapping directly from input image pixels to bounding-box coordinates and class probabilities. It predicts, in one pass over the image, the bounding boxes, localization confidence and class probability vectors of the targets in all grid cells, so the problem is solved in a single shot: running a single convolutional network is enough to predict the target classes and their positions.

As for the network design, the initial convolutional layers extract features from the image and the fully connected layers predict the output target probabilities and position coordinates. The whole detection network has 24 convolutional layers of size 3 × 3 and 2 fully connected layers, with alternating 1 × 1 reduction layers that shrink the feature space coming from the preceding layer. To let the network accept input images of various sizes, the invention removes the fully connected layers from the network structure, because a fully connected layer requires fixed-length input and output feature vectors; once the whole network becomes a fully convolutional network, inputs of various sizes can be detected. A fully convolutional network also preserves the spatial position information of targets better than fully connected layers do. In addition, to improve the detection of small objects, the invention reduces the number of pooling layers in the network structure so that the final feature map is larger. The feature map size depends on the original image size, but it must be odd, which guarantees that there is a central position able to hold a target located at the center of the original image.

For the confidence, the algorithm sets the initial threshold to 0.6; an object class scoring above the threshold is regarded as the target class, otherwise the target is regarded as absent. Whether the target is judged present determines how the loss function is computed. Based on the confidence, the loss function L(obj) for confidence prediction of bounding boxes containing a real target is expressed as:

In formula (4), the indicator term marks whether the j-th grid in the i-th network is not responsible for the target object, and λ denotes a weight.

As shown in Fig. 3, the flow of the improved pose recognition algorithm of the invention comprises:

(S313) based on the keypoint analysis result, splitting the algorithm into two iterative branches: one branch predicts the two-dimensional confidence maps S of body-part positions and performs body-part localization and keypoint detection to obtain all visible human keypoints; the other branch predicts the two-dimensional vector fields L of pixels within the skeleton and performs association analysis and part affinity computation to obtain the information of the invisible human keypoints;

(S314) while the current stage t ≤ p, the iterative branches take the feature map F as input and produce a set S_1, L_1;

(S316) after p stages, outputting the final S(p) and L(p);

In step S313, the image is analyzed by the convolutional network formed by the first 8 layers, and the two iterative branches regress S and L respectively: branch one iteratively predicts the confidence maps of body parts and branch two predicts the part affinity of the human body. At the end of each stage a different loss function is applied to each branch, after which S and L, together with the original input, are fed into the next stage for training. An L2 loss is used between the estimated predictions and the ground-truth confidence maps and for the connections between keypoints, and the loss function is weighted spatially to address some practical problems. As the number of iterations increases, on the basis of the known keypoint positions, the relative positional relations between body parts are established from the displacement between vectors, which makes it possible to predict and estimate the invisible human keypoints.

As for the representation of part affinity, it is a set of 2D vectors; each set of 2D vectors encodes the position and direction of one limb, and retains position and direction information across the region that represents the limb. For each pixel belonging to a particular limb, the 2D vector encodes the direction from one part of the limb to the other, and each type of limb has a corresponding affinity field joining its two connected parts. These part affinities are learned and predicted jointly with the body-part confidence maps by a CNN, which makes multi-person pose estimation effective and, while maintaining accuracy, achieves real-time performance.

As shown in Fig. 4, the flow of the action recognition algorithm of the invention comprises:

(S415) outputting the action class label and the corresponding action score.

In step S413, for a convolution operation on an image, the set of pixels adjacent to the center pixel, i.e. the neighborhood set, is partitioned in spatial order (for example left to right, top to bottom) into a series of T sets, each containing exactly one pixel; these sets form a partition of the neighborhood set. The parameters of the convolution kernel depend only on the number of subsets in the partition and on the feature vector length, so once a partition rule is defined, the kernel parameters can be defined by analogy with image convolution.

Defining a convolution operation for different graph convolutional networks thus reduces to designing the corresponding partition rule. For a partition rule with T subsets, the kernel parameters consist of T parts, each with the same number of parameters as the feature vector. In a convolution with a 3 × 3 window, the neighborhood of a pixel is partitioned in spatial order into 9 subsets (upper-left, upper, upper-right, left, center, right, lower-left, lower and lower-right), each containing one pixel. The kernel parameters consist of 9 parts, each matching the feature vector length of the feature map. Image convolution can therefore be regarded as graph convolution applied to a regular grid graph.
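
By way of non-limiting illustration only, the view of a 3 × 3 image convolution as a sum over 9 single-pixel subsets can be sketched as follows. For brevity the sketch assumes a feature vector of length 1, so each of the 9 kernel parts is a single number; the names `OFFSETS` and `conv3x3_at` are hypothetical.

```python
# The 9 subsets of the neighborhood, in spatial order: upper-left, upper,
# upper-right, left, center, right, lower-left, lower, lower-right.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           (0, -1), (0, 0), (0, 1),
           (1, -1), (1, 0), (1, 1)]


def conv3x3_at(image, row, col, kernel):
    """Convolution output at an interior pixel.

    kernel[k] is the parameter part that belongs to subset k, so the
    output is the sum over subsets of (kernel part * pixel in that subset).
    """
    return sum(kernel[k] * image[row + dr][col + dc]
               for k, (dr, dc) in enumerate(OFFSETS))
```

Replacing `OFFSETS` with the subsets produced by a skeleton partition rule gives the same computation on a graph, which is the sense in which image convolution is a special case.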

As for the spatial partition rules on the skeleton: if the 1-neighborhood of a node is kept as a single subset, the rule is labeled the unique partition. If it is divided into two subsets, namely the node itself and its neighbor nodes, the rule is labeled the distance-based partition. If it is divided into three subsets, namely the node itself, the neighbor nodes spatially closer to the center of gravity of the whole skeleton than the node, and the neighbor nodes farther from the center of gravity, which correspond to centripetal and centrifugal motion as defined in motion analysis, the rule is labeled the spatial configuration partition. By using different skeleton partition rules, the action classifier predicts and scores action classes.

The method of the invention uses advanced techniques such as deep learning, intelligent video behavior analysis and big data analysis to analyze, quickly assess, track and recognize personnel targets in video data, providing video-assisted analysis of abnormal actions and early warning and alarms for security incidents. Specifically, in an oil field, analyzing the actions of drilling and production operators makes it possible to detect potential dangers to personnel in time and strengthens the ability to respond quickly to abnormal situations. In addition, computer video analysis is used to locate and recognize personnel targets and detect their keypoints in dynamic scenes, and target actions are analyzed and assessed on that basis. This reduces the time spent on manual intervention, avoids the disruption and economic loss that personal accidents and operating violations cause to oil field production, and thereby supports safe production, saves manpower and material resources, and raises the level of production management. Moreover, the human action recognition method based on the TP-STG framework achieves personnel action recognition and analysis in complex scenes; it is applicable not only to oil fields but also has significant practical value in other fields such as healthcare and security.

In conclusion, the human action recognition method based on the TP-STG framework of the invention recognizes human actions in complex scenes quickly and accurately, and can be applied in many fields for target positioning and detection, pose recognition, keypoint detection, and the discrimination and analysis of behaviors and actions.

Although the invention has been described in detail through the preferred embodiments above, it should be understood that the above description is not to be regarded as limiting the invention. Various modifications and substitutions will be apparent to those skilled in the art after reading the above. The scope of protection of the invention should therefore be defined by the appended claims.

Claims

1. A human motion recognition method based on the TP-STG framework, characterized by the following specific steps:

(S100) taking video information as input, adding prior knowledge to an SVM classifier, and proposing a posterior criterion to remove non-personnel targets;

(S200) segmenting personnel targets by a target positioning and detection algorithm and outputting them in the form of target boxes and coordinate information, which provide the input data for human key point detection;

(S300) performing body part localization and association analysis with an improved posture recognition algorithm to extract all human key point information and form a key point sequence;

(S400) constructing a space-time graph on the key point sequence by an action recognition algorithm, applying multi-layer space-time graph convolution operations to it, and classifying the action with a Softmax classifier, thereby realizing human action recognition in complex scenes;

In step S100, prior knowledge suited to the scene is matched according to the environmental characteristics of the specific scene. In a complex scene such as an offshore oil platform, the safety clothing of personnel targets is very similar in colour and form to certain cylindrical pipes, so that colour, texture and shape features are difficult to distinguish; conventional models designed for simple scenes often confuse the two, which causes a high false alarm rate. To address this problem, the invention adds prior knowledge to an SVM classifier and pre-trains the SVM on the detection targets and the confusable targets; the non-personnel targets identified in this way are treated as negative samples and removed, which reduces the computation spent on negative samples and improves the accuracy of target detection in the next stage.

2. The human motion recognition method based on the TP-STG framework according to claim 1, characterized in that the target positioning and detection algorithm comprises:

(S210) converting the video data into image data by data preprocessing and labelling the samples, to serve as the input data of the algorithm;

(S211) dividing the input image into an N*N grid and predicting r bounding boxes for each grid cell by feature extraction; if the center of an object falls within a bounding box of a grid cell, that grid cell is responsible for detecting the object;

(S212) running a single convolutional network on the image to compute the confidence scores of the bounding boxes; these confidence scores reflect how likely a bounding box is to contain a target and how accurately the predicted box locates it;

(S213) increasing the loss for bounding box coordinate prediction and reducing the loss for confidence prediction of bounding boxes that contain no object, to prevent the model from diverging early and becoming unstable;

(S214) normalizing the bounding box width and height by a certain ratio so that they fall between 0 and 1, and obtaining the final predicted target class probabilities and bounding box coordinate information.
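Steps S211 and S214 can be illustrated with a minimal Python sketch. It assumes that the responsible grid cell is the one containing the center of the box and that width and height are normalized by the image size; the function name `assign_grid_cell` and the pixel-space box layout are hypothetical and are not taken from the claims.

```python
def assign_grid_cell(box, image_w, image_h, n):
    """Map a pixel-space box to its responsible grid cell and normalized size.

    box is (x_min, y_min, x_max, y_max) in pixels; n is the grid size N.
    Returns (row, col, x_off, y_off, w, h).
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    # Position of the center measured in grid-cell units.
    gx = cx * n / image_w
    gy = cy * n / image_h
    # Cell containing the center (clamped for centers on the far edge).
    col = min(int(gx), n - 1)
    row = min(int(gy), n - 1)
    # Offset of the center inside its cell, in [0, 1].
    x_off = gx - col
    y_off = gy - row
    # Width and height normalized by the image size, so they fall in [0, 1].
    w = (x_max - x_min) / image_w
    h = (y_max - y_min) / image_h
    return row, col, x_off, y_off, w, h
```

For a 448 x 448 image with N = 7, each cell covers 64 pixels, so a box centered at pixel (160, 224) is assigned to row 3, column 2.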

3. The human motion recognition method based on the TP-STG framework according to claim 2, characterized in that, in step S212, the calculation of the confidence score requires defining the degree of intersection between the predicted bounding box and the actual bounding box, which serves as the basis for the confidence score; if no target exists in the predicted bounding box unit, the confidence score shall be zero;

Otherwise, the confidence score equals the product of the intersection PIA between the predicted box and the true object bounding box and the ground truth of the real target box, so that the confidence is defined as:

Cr = Gr_(Object) × PIA    (1)

In formula (1), Cr denotes the confidence, Gr_(Object) denotes the real target box, and PIA denotes the intersection between the predicted box and the true object bounding box.
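The confidence rule of claim 3 can be sketched in Python as follows. The sketch assumes that the intersection PIA is measured as intersection over union and that the ground-truth term is 1 when a target is present and 0 otherwise; both choices, and the function names `iou` and `confidence`, are assumptions rather than definitions given in the claim.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap extents are clipped at zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def confidence(pred_box, truth_box):
    """Confidence is zero without a target, otherwise the overlap measure."""
    if truth_box is None:  # assumed encoding of "no target in this unit"
        return 0.0
    return 1.0 * iou(pred_box, truth_box)
```

With this encoding, a predicted box that coincides exactly with the real target box receives a confidence of 1, and a unit without a target always receives 0.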

4. The human motion recognition method based on the TP-STG framework according to claim 2, characterized in that, in step S213, the loss calculation is performed by an overall target loss function, in which the loss function L(x, y, w, h) for coordinate prediction is:

In formula (2), x and y denote the coordinates of the center point of the bounding box relative to the grid cell, w and h denote the width and height of the entire input image, the accompanying mean-value terms denote the mean values of x_i, y_i, w_i and h_i respectively, the indicator term denotes whether the j-th grid in the i-th network is responsible for the target object, and λ denotes the weight;

In formula (3), C_i denotes the confidence value of the i-th bounding box; the loss function L(noobj) for confidence prediction of bounding boxes that contain no real target is:

In formula (5), p_i(c) denotes the probability of the i-th target class, and its mean-value counterpart denotes the mean probability of the i-th target class. Once each loss function has been obtained, the loss functions are fused by weighting to give the final target loss function L(f):

L(f) = L(x, y, w, h) + L(obj) + L(noobj) + L(class)    (6)

In formula (6), the final target loss function L(f) is the weighted sum of the loss function for coordinate prediction, the loss function for confidence prediction of bounding boxes that contain a real target, the loss function for confidence prediction of bounding boxes that contain no real target, and the loss function for target class prediction.
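The way the four terms of formula (6) combine can be sketched in Python. The sketch assumes sum-of-squared-error terms in the style of single-stage grid detectors, with the weights folded into the coordinate and no-object terms; the per-term forms, the default weights `lambda_coord=5.0` and `lambda_noobj=0.5`, the dictionary layout and the name `detection_loss` are all assumptions and are not taken from the claims.

```python
def detection_loss(preds, targets, lambda_coord=5.0, lambda_noobj=0.5):
    """Return the four loss terms and their sum L(f).

    preds and targets are equal-length lists of dicts with keys
    x, y, w, h, conf and probs; each target also has a boolean 'obj'.
    """
    l_coord = l_obj = l_noobj = l_class = 0.0
    for p, t in zip(preds, targets):
        if t["obj"]:
            # Coordinate, confidence and class errors count only for
            # boxes that are responsible for a real target.
            l_coord += sum((p[k] - t[k]) ** 2 for k in ("x", "y", "w", "h"))
            l_obj += (p["conf"] - t["conf"]) ** 2
            l_class += sum((pc - tc) ** 2
                           for pc, tc in zip(p["probs"], t["probs"]))
        else:
            # Boxes without a target contribute only a confidence error.
            l_noobj += (p["conf"] - t["conf"]) ** 2
    l_coord *= lambda_coord   # coordinate loss is weighted up
    l_noobj *= lambda_noobj   # no-object confidence loss is weighted down
    total = l_coord + l_obj + l_noobj + l_class
    return {"coord": l_coord, "obj": l_obj, "noobj": l_noobj,
            "class": l_class, "total": total}
```

Choosing a coordinate weight above 1 and a no-object weight below 1 mirrors step S213, where the coordinate loss is increased and the confidence loss for boxes without objects is reduced.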

5. The human motion recognition method based on the TP-STG framework according to claim 1, characterized in that the improved posture recognition algorithm comprises:

(S313) dividing the network into two iterative branches: one branch predicts the two-dimensional confidence map S of body part positions and performs body part localization to obtain all visible key points of the human body; the other branch predicts the two-dimensional vector field L of pixels along the skeleton and performs association analysis to obtain the invisible key points of the human body;

(S315) each subsequent branch takes the outputs S_t-1 and L_t-1 of the preceding branches, together with the feature map F, as input, and the process iterates continuously;

(S316) after p stages, finally outputting S(p) and L(p);

(S317) calculating the L2 norm between the predicted values of S and L and the ground truth (S^*, L^*), where the ground truth of S and L is computed from the annotated 2D points; if the annotation of a key point is missing, the value for that point is not calculated; finally, all key point information is output;

In step S313, the two iterative branches regress S and L respectively; the loss is computed once at each stage, after which S and L, together with the original input, are fed to the next stage for further training. As the number of iterations increases, on the basis of the known key point positions, the relative positional relationships between human body parts are established from the displacement lengths between vectors, thereby realizing the prediction and estimation of invisible key points of the human body.
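The per-stage loss of step S317, including the rule that key points with missing annotations are skipped, can be sketched in Python. The sketch assumes a squared L2 distance, represents each map as a flat list of floats, and marks a missing annotation with `None`; these encodings and the names `stage_loss` and `multi_stage_loss` are assumptions made for illustration.

```python
def stage_loss(pred_maps, truth_maps):
    """Squared L2 distance summed over annotated key points only.

    pred_maps and truth_maps hold one flat list of floats per key point;
    truth_maps[k] is None when key point k has no annotation.
    """
    loss = 0.0
    for pred, truth in zip(pred_maps, truth_maps):
        if truth is None:
            continue  # missing annotation: this key point is not scored
        loss += sum((p - t) ** 2 for p, t in zip(pred, truth))
    return loss


def multi_stage_loss(stages, truth_s, truth_l):
    """Sum the S-branch and L-branch losses over all stages.

    stages is a list of (S_t, L_t) predictions, one pair per stage.
    """
    return sum(stage_loss(s, truth_s) + stage_loss(l, truth_l)
               for s, l in stages)
```

Computing the loss at every stage, rather than only at stage p, matches the statement that each stage calculates the loss once before passing S and L on.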

6. The human motion recognition method based on the TP-STG framework according to claim 1, characterized in that the action recognition algorithm comprises:

(S413) determining the number of neighborhood subsets of each space-time graph node, designing the corresponding spatial partition rules, and deciding which rule to use;

(S415) outputting the action class label and the corresponding action score.
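The output of step S415, combined with the Softmax classifier named in step S400, can be sketched in Python. The sketch assumes that the last graph convolution layer produces one raw score per action class; the function names and the example labels are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels):
    """Return the most probable action label and its score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical labels, chosen only for illustration.
label, score = classify([0.0, 2.0, 1.0], ["walking", "climbing", "sitting"])
```

Here `label` is the class with the largest raw score and `score` is its Softmax probability, which can serve as the action score.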

7. The human motion recognition method based on the TP-STG framework according to claim 6, characterized in that, in step S411, when the space-time graph is constructed, a spatial graph is built inside each video frame according to the natural skeleton connections of the human body, and the same key point in two adjacent frames is connected to form a temporal edge; the key points in all input frames constitute the node set V and all directed edges constitute the edge set E, so that the space-time graph G = (V, E) is obtained according to the above rules, which naturally preserves the spatial information of the skeleton key points and represents the motion trajectories of the key points in the form of temporal edges;

Specifically, the node set V = {v_ti | t = 1, 2, ..., T; i = 1, 2, ..., N} of the graph contains all joints in the key point sequence, where T denotes the number of video frames and N denotes the number of human key points, set to 18. When the space-time graph is constructed, the feature vector F(v_ti) of the i-th joint in frame t consists of the coordinate information and the confidence of the key point. The edge set E consists of two subsets: the intra-frame links between joints within each video frame, E_s = {v_ti v_tj | (i, j) ∈ P}, and the inter-frame links between video frames, E_t = {v_ti v_(t+1)i}, where P denotes the set of all key points of the human body and i and j are any two joints in the key point set.
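The construction of G = (V, E) can be sketched in Python. The sketch uses zero-based frame and joint indices and takes the skeleton connections as a list of joint index pairs; the function name `build_spacetime_graph` and the two-pair skeleton in the usage note are hypothetical, since the claim does not list the actual connections.

```python
def build_spacetime_graph(num_frames, num_joints, skeleton_pairs):
    """Build the node set V and the edge subsets E_s and E_t.

    A node is a tuple (t, i): joint i in frame t.
    skeleton_pairs lists the joint index pairs (i, j) that are connected
    inside a frame.
    """
    nodes = [(t, i) for t in range(num_frames) for i in range(num_joints)]
    # E_s: intra-frame edges that follow the skeleton in every frame.
    spatial_edges = [((t, i), (t, j))
                     for t in range(num_frames)
                     for (i, j) in skeleton_pairs]
    # E_t: inter-frame edges linking the same joint in adjacent frames.
    temporal_edges = [((t, i), (t + 1, i))
                      for t in range(num_frames - 1)
                      for i in range(num_joints)]
    return nodes, spatial_edges, temporal_edges
```

For example, with T = 3 frames, N = 18 joints and a skeleton of two pairs, the graph has 54 nodes, 6 spatial edges and 36 temporal edges.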