WO2019232894A1

WO2019232894A1 - Complex scene-based human body key point detection system and method

Info

Publication number: WO2019232894A1
Application number: PCT/CN2018/096157
Authority: WO
Inventors: 宫法明; 马玉辉; 徐燕; 袁向兵; 宫文娟; 李传涛; 岳寒冰; 丁洪金
Original assignee: 中国石油大学(华东)
Priority date: 2018-06-05
Filing date: 2018-07-18
Publication date: 2019-12-12
Also published as: CN108710868B; CN108710868A

Abstract

Discloses are a complex scene-based human body key point detection system and method. The method comprises: inputting monitor video information to obtain a single-frame static map and a multi-frame optical flow map; extracting features from the single-frame static map by means of a convolution operation to obtain a feature map, and in order to solve the impact of an interference target on personnel target detection under a complex scene, discriminating an actual confidence coefficient of the feature map from a preset confidence coefficient by using a personnel target detection algorithm to obtain a discretized personnel target bounding box; performing optical flow stacking on the multi-frame optical flow map to form a two-dimensional vector field; and extracting features in the discretized personnel target bounding box to obtain the feature map, obtaining key points and association degrees of parts, generating a part confidence map for each part of a human body by using a predictor, and precisely detecting human body key points by means of the part confidence map and the two-dimensional vector field. The system and method of the present invention are used for human body key point detection under a complex scene, thereby implementing precise detection of personnel target key points.

Description

Human key point detection system and method based on complex scene

Technical field

The invention relates to a technology for detecting key points of a human body, in particular to a system and method for detecting key points of a human body based on a complex scene.

Background technique

At present, China's "Skynet" project construction has begun to take shape. With the development of advanced technologies such as deep learning and intelligent video behavior analysis, how to effectively use surveillance video has become the focus of video data analysis.

Computer video surveillance uses computer vision and image processing methods to perform target detection, target classification, target tracking, and behavior recognition of human targets in surveillance scenarios. Among them, human behavior recognition is a research hotspot that has received extensive attention in recent years, and human keypoint detection is the basis and key technology of intelligent video behavior recognition. It analyzes and judges target behaviors through human keypoint sequences, realizes active detection of hidden dangers, and provides early warning of abnormal events in public places. It has important practical application value in oilfields, hospitals, and homes for the elderly.

Human keypoint detection is to identify and locate key parts of human targets in the image. With the popularization of deep convolutional neural networks, this problem is further solved. The detection methods of human key points are mainly divided into two categories: top-down methods and bottom-up methods. Among them, the top-down method refers to the detection of a person's target, then the target bounding box is used for positioning, and the single-person estimation method is used to locate all joints of the human body; the bottom-up method refers to the position of all joints first , Then distinguish the subordinate targets of the joint, and finally assemble the joint into a complete human posture. The former is suitable for the situation where the human target is sparse, and the latter is suitable for the situation where the human target is dense.

Traditional human keypoint detection methods include template-based methods, statistical classification-based methods, and sliding window-based methods. The method based on template matching is intuitive and simple, but it lacks robustness and is generally used in a single scene. The method of probability statistics is widely used, but requires a large amount of training data to learn model parameters, and the calculation is more complicated. The method based on sliding windows The labeling requirements for the training database are low, but they cannot overcome the effects of partial occlusion and construct the relative positional relationship between various parts of the human body.

In summary, due to the non-rigid characteristics of the human body, the variability of posture, and changes in lighting, the traditional method is more effective in a single specific scene, but it is more affected by background changes in complex scenes, and human parts are vulnerable to Occlusion and interference of other objects and targets are difficult to ensure the accuracy and completeness of human keypoint detection.

Summary of the Invention

An object of the present invention is to provide a human body key point detection system and method based on complex scenes. The system and method solve the problems of poor detection effect and large error of human body key points in complex scenes in the prior art, and can be used for complex Human keypoint detection in the scene, locate, identify and track human targets in dynamic scenes, to achieve accurate detection of keypoints of all human targets in the image.

In order to achieve the above object, the present invention provides a method for detecting a key point of a human body based on a complex scene. The method includes:

(S100) Input monitoring video information, and perform preprocessing to obtain single-frame static image and multi-frame optical flow image;

(S200) Extract features from a single frame static map by convolution operations to obtain feature maps. In order to solve the impact of interference targets on human target detection in complex scenes, a human target detection algorithm is used to actualize the confidence and preset of the feature map. Discriminate with confidence, remove non-human targets, and obtain discrete human target bounding boxes;

(S300) Use optical flow stacking for multi-frame optical flow diagrams to form a two-dimensional vector field;

(S400) Extract the features in the discretized human target bounding box to obtain a feature map, obtain the key points and the degree of association of the part, use a predictor to generate a partial position information map for each part of the human body, and use the partial position information map and two The dimensional vector field enables accurate detection of key points of the human body.

Wherein, in the step S400, in the first stage, the target bounding box is enlarged, the original image is used as input, and features are extracted using a convolution operation, and the confidence value of each part is predicted by the classifier from the original image to generate a corresponding And use the confidence map obtained in the previous stage and the extracted features as inputs to the next stage, and iterate continuously between several stages to obtain an accurate partial location confidence map.

Preferably, the human target detection algorithm includes:

(S210) Generate a set of fixed-size default bounding box sets for single-frame still images of different sizes, and perform feature extraction on the regions within the set of default bounding boxes;

(S211) For the physical characterization of the human target, extract the main features to form feature map units at different levels. As an image data set, the feature map units at each level are tiled with feature maps in a convolutional manner so that each default The position of the bounding box and the corresponding feature map unit is fixed;

(S212) On each feature map unit, a small convolution kernel convolution filter is used to predict the actual bounding box of the object in each default bounding box, the actual bounding box is used as the target bounding box, and the actual confidence degree is calculated , Judging the actual confidence level and the preset reliability level to remove invalid bounding boxes and to correct the target bounding box position;

(S213) Output discrete target bounding boxes at different levels, which have different aspect ratio scales.

Preferably, in the step S212, during the confidence determination process, the error and corresponding score of each default bounding box and the corresponding actual bounding box need to be calculated to predict all the values in the default bounding box area. The category and confidence of the target; set the threshold of the preset reliability; when the actual confidence is greater than the threshold, perform model loss calculation; when the actual confidence is less than the threshold, perform SVM A posteriori discrimination; when it is judged to be a human target, fine-tune the bounding box of the target; when it is judged to be a non-human target, remove the invalid bounding box.

Preferably, the model loss calculation is completed by a loss function, and the loss function is:

In equation (1), L (e) is the loss error, y is the expected output, and α is the actual output.

Perform moment estimation on the distribution of y, and use α to represent the cross entropy of y as:

In Equation (2), α _i is the actual output of the i-th default bounding box, and y _i is the expected output of the i-th default bounding box.

The average cross entropy of the n default bounding boxes is:

In formula (3), y _{i, n} represents the expected output of the i-th default bounding box when the number of matching default bounding boxes is n; α _{i, n} represents the number of matching default bounding boxes as n , The actual output of the i-th default bounding box.

Preferably, in step S212, when there are confusing targets, SVM post-judgment is performed on the human targets and the confusing targets, and a large number of manually labeled image data sets are sent to the SVM to pre-train the classification of the human targets and the confusing targets. The local SVM binary classification is performed after the confidence determination, and the identified confused target is removed as a negative sample, and the human target is used as a positive sample. Based on the confidence level of the positive sample person category, the score is determined to determine whether it is true. Personnel goals.

Preferably, the overall target loss function of the double discrimination is a weighted average sum of the confidence loss and the localization score loss, and the overall target loss function is:

In Equation (4), δ is the initial weight term; V is the number of default bounding boxes that match the actual bounding box; L (α, c) is the loss function of confidence; L (α, f) is the localization score Loss function.

The cross-validation sets the initial weighting term δ to 1; when the expected output is evaluated with confidence, the output is the confidence C of each class, and the loss function L (α, c) of the confidence is:

In formula (5), y _{i, N} indicates the expected output of the i-th default bounding box when the number of matching default bounding boxes is N; α _{i, N} indicates that when the number of matching default bounding boxes is N , The actual output of the i-th default bounding box.

When V = 0, the confidence loss is zero.

when

, It means that the i-th default bounding box matches the j-th actual bounding box of the category p.

when

, It means that the i-th default bounding box does not match the j-th actual bounding box of the category p, and the localized scoring loss function is:

In formula (6),

Represents the score that the default bounding box matches the actual bounding box; f _j represents the preset score of the default bounding box,

Represents the actual score of the α _i default bounding box; Δ represents the interval.

Preferably, the structure of the classifier C ₁ in the first stage is:

among them,

Represents the pixel space of the image, x _i represents the position of each pixel in the image, p represents the specific model part,

Represents the confidence value of location p in the first stage.

By using the confidence map and extracted features obtained in the previous stage as data input in the next stage to modify the position in the previous stage, the overall objective F (t) is:

In equation (7),

Denotes that the ideal confidence is obtained at t ∈ T.

Preferably, in step S300, an optical flow threshold is set by the optical flow method on the multi-frame optical flow map, an effective motion area in the video is extracted, and a video segment with a human target is filtered to be converted into For a single frame of image, a hash function calculation is performed for every interval frame, a random function random is selected, the frame number where each frame is located is used as its hash address, and a randomly generated frame number is obtained as the extracted frame.

The constraint equation of the multi-frame optical flow diagram is converted into:

I _x × V _x + I _y × V _y + I _z × V _z = -It (8)

In formula (8), I _x , I _y , I _z , and I _t are components of I (x, y, z, t) at x, y, z, and t, and V _x , V _y , and V _z are respectively Is the composition of x, y, z in the optical flow vector of I (x, y, z, t), and I (x, y, z, t) is the voxel at (x, y, z) position.

The method for forming a two-dimensional vector field includes: successively extracting multiple frames at time t to obtain an optical flow map, and assigning a velocity vector to each pixel in the image to form a motion vector field, which is obtained through a preprocessing operation; Optical flow between successive frames shifts the stacked fields to form a two-dimensional vector field.

Preferably, the human keypoint detection algorithm flow includes:

(S410) Using the coordinates of the discretized human target bounding box obtained by the target detection as the initial input of the algorithm, extracting features through a convolution operation to obtain a feature map;

(S411) Body part positioning and correlation degree analysis are performed simultaneously on two branches, all key points are obtained through body part positioning, and degree of correlation between all parts is obtained through correlation degree analysis to establish a relative position relationship;

The body part positioning algorithm described in (S412) consists of a predictor. It is divided into several stages. Each stage repeatedly generates a confidence map for each part of the human body. Each confidence map contains a certain kind of key points. The confidence map and the original image Features are also used as inputs in the next stage to predict the positions of various parts, and then determine the positions of key points of the human body;

(S413) encode the position and direction of the human body part, and determine the subordinate problem of the key points of multiple people by the direction of the vector in the two-dimensional vector field;

(S414) Use the displacement length between the vectors to establish the relative positional relationship between various parts of the human body, realize the prediction and estimation of invisible key points of the human body, and obtain detailed information of all key points of the human body.

In step S412, the confidence maps at all scales are accumulated for each part to obtain the total confidence map, and the point with the highest confidence is found, and this point is the position of the corresponding key point.

Among them, for the detection of multi-person key points, each person's body is combined together through a two-dimensional vector field to form a complete human body; when there are multiple people at a certain point, the vectors of n people are summed and divided by Number of people.

The invention also provides a human key point detection system based on a complex scene. The system includes: a data pre-processing module that processes monitoring video information to obtain a single frame static image and a multi-frame optical flow image; personnel targets A detection module that extracts features of a single frame static map sent by the data preprocessing module through a convolution operation, uses a small convolution kernel convolution filter to predict the actual bounding box of the object in each bounding box, and calculates the actual confidence , Using the actual bounding box as the target bounding box, using SVM posterior discrimination to discriminate the actual confidence from the preset reliability to remove invalid bounding boxes, modify the position of the target bounding box, and obtain a discrete person target bounding box; and The human key point detection module receives the discrete human target bounding box coordinates sent by the human target detection module, extracts features through a convolution operation to obtain a feature map, and obtains the key points and the degree of association of the part. The predictor is used as Generate partial position information maps for each part of the human body, and realize key points of the human body through partial position information maps and two-dimensional vector fields. Detection.

Wherein, the human keypoint detection module adopts several stages of iteration, and uses the confidence map and extracted features obtained in the previous stage as the input of the next stage to continuously iterate between several stages to obtain accurate part positions Letter illustration.

The system and method for detecting key points of a human body in a complex scene based on the present invention solve the problems of poor detection effect and large error of the key points of a human body in a complex scene in the prior art, and have the following advantages:

(1) The method and system of the present invention use a human target detection algorithm to remove non-human targets, simplifying complex scenes, and can be used to detect key points of human bodies in complex scenes for accurate detection;

(2) The method and system of the present invention use a two-dimensional vector field for encoding the position and direction of a human body part in the image domain, which can subordinate the key points of multiple people, and achieve accurate detection of key points of all human targets in the image ;

(3) The overall target loss function used in the method of the present invention in the SVM a posteriori determination is to let the localized scoring loss function find a global minimum in a gradual process, so that the score difference is minimized and the predicted value is more accurate, making the The target bounding box is adjusted to better match the shape of the target object;

(4) The method of the present invention can also process the targets that are easy to be confused in special scenes, such as the color of the safety clothing of the human target in the offshore platform and the color and shape of some cylindrical pipes, to remove the confused targets and improve recognition. Accuracy;

(5) The method of the present invention uses the confidence map of each part to express the space constraints between the parts when detecting key points of the human body, and simultaneously processes the input feature map and response map under multiple scales, which can ensure accuracy and Taking into account the distance relationship between various parts, by continuously expanding the network's acceptance domain to detect the positions of other parts, accurate detection of all key points of the human body is achieved.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of a human body keypoint detection method based on a complex scene of the present invention.

FIG. 2 is a principle diagram of a human body keypoint detection method based on a complex scene of the present invention.

FIG. 3 is a flowchart of a human target detection algorithm of the present invention.

FIG. 4 is a flowchart of a human keypoint detection algorithm of the present invention.

FIG. 5 is a structural diagram of a human body key point detection system based on a complex scene according to the present invention.

Detailed ways

The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.

A human body key point detection method based on a complex scene, as shown in FIG. 1, is a flowchart of a human body key point detection method based on a complex scene, as shown in FIG. 2, which is a complex scene based on the present invention. The schematic diagram of the human key point detection method under the method, which includes:

(S400) Extract the features in the discretized human target bounding box, obtain the feature map, obtain the key points and the degree of association of the part, use the predictor to generate a partial position information map for each part of the human body, and pass the partial position information map and two-dimensional vector field Realize accurate detection of key points of the human body.

Among them, in step S400, in the first stage, the target bounding box is enlarged, the original image is used as an input, and features are extracted using a convolution operation, and then the confidence value of each part is predicted by the classifier from the original image to generate a corresponding confidence map. , And the confidence map and the extracted features obtained in the previous stage are used as the input of the next stage, and iterative between several stages to obtain an accurate partial location trust map.

As shown in FIG. 3, it is a flowchart of a human target detection algorithm according to the present invention. The process of the human target detection algorithm includes:

(S210) Generate a set of fixed-size default bounding box sets for single-frame static maps of different sizes, and extract features from the area within the set of default bounding boxes; use a number of default bounding boxes to extract features for larger single-frame static maps ;

(S211) According to the physical representation of human targets in complex scenes, features such as color, shape, and texture are extracted as main features to form feature map units at different levels. As image data sets, feature map units at each level are The feature map is tiled in a convolutional manner, so that the position of each default bounding box and the corresponding feature map unit is fixed;

(S212) Use a small convolution kernel convolution filter on each feature map unit to predict the actual bounding box of the object in each default bounding box. The actual bounding box is used as the target bounding box, and the actual confidence is calculated. The actual confidence is determined from the preset reliability; the confidence threshold can be set to 0.6, and the model loss calculation is performed for the case that is greater than the confidence threshold; for the case that is less than the confidence threshold, the SVM post-judgment judgment is directly performed. If the judgment is For human targets, the target bounding box needs to be fine-tuned, otherwise invalid bounding boxes are eliminated; specifically, the target bounding box is fine-tuned using a linear regression device to finely modify the position of the bounding box, otherwise it is regarded as an invalid bounding box (in (If the judgment is not the goal of the person), perform the culling operation;

(S213) A series of discretized target bounding boxes at different levels are output with different aspect ratio scales.

For the determination of the actual bounding box, use static images for data processing on the video stream, label the input image data set by deep learning technology, use the labeled image data set to train the human target detection model, and use this model to perform static image processing. Human target detection, get the specific position information of the target, and use the position information as input to get the target bounding box, which provides a data source for the key points extraction of the human body. In different scenarios, the corresponding data set is selected, for example, the image data set of an oil offshore platform, and the labeled image data set is used for training. The deep learning SSD framework is used.

It is further necessary to know that feature maps of different scales use default bounding boxes with different aspect ratios at each position. In step (S212), during the confidence determination process, the error and corresponding score of each default bounding box and the corresponding actual bounding box need to be calculated to predict the categories and confidence of all targets in the region, which are greater than The target category of the above confidence threshold is regarded as the target category. By calculating errors and scores, the actual bounding box needs to be matched with multiple default bounding boxes in the image, and the final result is the modified target bounding box.

Moreover, the confidence discrimination is a preliminary screening process of target detection. The default bounding box is matched with any actual bounding box with a value higher than the threshold, and the matching process is simplified by SVM posterior discrimination. In addition, the algorithm allows to predict the scores of multiple overlapping default bounding boxes, instead of just picking the bounding box with the largest degree of overlap for score estimation.

Therefore, the human object detection algorithm of the present invention combines the prediction of multiple feature maps with different resolutions, and can naturally process target objects of various sizes. Compared with other single-level methods, even the size of the input image (single frame still image) Smaller also has higher accuracy.

It is further necessary to know that in step (S212), the model loss calculation is completed by a loss function, and most commonly used loss functions are square difference functions:

When the difference between the actual output and the expected output is larger, the model loss is higher, and in actual operation, the distribution of y cannot be accurately obtained through calculation. Only the moment of the distribution of y can be estimated, that is, the value of α. Represents the cross entropy of y:

Therefore, the average cross entropy of the n default bounding boxes is as follows:

In formula (3), y _{i, n} represents the expected output of the ith default bounding box when the number of matching default bounding boxes is n; α _{i, n} represents the number of matching default bounding boxes as n , The actual output of the i-th default bounding box.

Further, according to an embodiment of the present invention, for a specific scene, such as a special scene of an offshore platform, since the color of the safety clothing of the human target is consistent with the color and shape of some cylindrical pipes, the conventional technology often uses conventional models in simple scenes. Confusion between the two leads to a higher false positive rate. In this embodiment of the present invention, the two types of targets are subjected to SVM post-judgment discrimination. A large number of manually labeled image data sets are sent to the SVM classifier in which the human target and the cylindrical pipe target are trained in advance, and the locality is performed after the confidence determination. The SVM two-class classification is then used to discriminate the identified cylindrical pipelines as negative samples. Only the confidence level of the positive sample person category is used to estimate whether the score is indeed a real person target, reducing the amount of calculation of negative samples. The overall objective loss function through double discrimination is the weighted average sum of the confidence loss and the localized scoring loss. The overall objective loss function is as follows:

In Equation (4), δ is the initial weight term.

Further, the initial weighting term δ is set to 1 through cross-validation. When the expected output is evaluated with confidence, the output is the confidence C of each class, and the loss function L (α, c) of the confidence is as follows:

In formula (5), y _{i, N} indicates the expected output of the i-th default bounding box when the number of matching default bounding boxes is N; α _{i, N} indicates that when the number of matching default bounding boxes is N , The actual output of the i-th default bounding box; N is the number of default bounding boxes that match the actual bounding box; if N = 0, the confidence loss is set to 0. Assume

Indicates that the i-th default bounding box matches the j-th actual bounding box of the category p, otherwise, if it does not match, then

The localization scoring loss function is:

In formula (6),

The overall target loss function is to let the localized scoring loss function find a global minimum value in a gradual process, so that the difference in score is minimum and the prediction value is more accurate, so that the target bounding box is adjusted to better match the target object shape.

As shown in FIG. 4, it is a flowchart of a human keypoint detection algorithm according to the present invention. The human keypoint detection algorithm flow includes:

(S410) Using the coordinates of the discretized human target bounding box obtained by target detection as the initial input of the algorithm, extracting features through a series of convolution operations to obtain a feature map;

(S411) Body part positioning and correlation analysis are performed simultaneously on two branches. The former is to find all key points, including: head, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow , Left wrist, right hip, right knee, right ankle, left hip, left knee, and left ankle; 14 key points; the latter is to find the degree of correlation between all parts to establish a relative position relationship;

(S412) The body part localization algorithm consists of a series of predictors, which are divided into multiple stages. Each stage repeatedly generates a confidence map for each part of the human body. Each confidence map contains a certain type of key point. The confidence map is related to the original image. Features are also used as inputs in the next stage to predict the positions of various parts, and then determine the positions of key points of the human body;

(S413) encode the position and direction of the human body part, and determine the subordinate problem of the key points of multiple people through the direction of the vector in the two-dimensional vector field;

(S414) Use the displacement length between the vectors to establish the relative positional relationship between various parts of the human body, so as to realize the prediction and estimation of invisible key points of the human body, and finally obtain detailed information of all key points of the human body.

In step S412, the confidence maps at all scales are accumulated for each part to obtain a total confidence map, and the point with the highest degree of confidence is found, which is the position of the corresponding key point.

The human keypoint detection algorithm performs feature extraction on the input image at each scale to obtain the confidence map of each part of the human body. The larger the confidence value, the deeper the color on the confidence map, and the depth of the color is relative in the entire confidence map. . The algorithm of the present invention uses the confidence map of each part to express the space constraints between the parts, and simultaneously processes the input feature map and response map under multiple scales, which can ensure accuracy and take into account the distance relationship between each part By continuously expanding the network's acceptance domain to detect the location of other parts, eventually accurate detection of all key points of the human body is achieved.

Specifically, according to an embodiment of the present invention, in order to avoid the problem that the human target bounding box obtained by the target detection has a partial error in a certain range, which causes the problem that the part of the human target may not be completely displayed in the bounding box, the present invention The embodiment adopts a multi-scale method to expand the perceptual field and reduce errors caused by target detection. Specifically, the original bounding box is enlarged according to a ratio of 1.0: 1.2, and in this way, a complete human target is obtained, so that all key point coordinates are detected during the human key point detection stage. After using the convolutional network for feature extraction, the confidence value of each part is directly predicted from the original image, thereby generating a corresponding confidence map, which includes a background confidence map. When the human body is divided into P model parts, there are P + 1 layers of confidence maps. The preset value of P is 14. Suppose x is a pixel with salient features in the image. The original image is input into the network and the salient features in the image are extracted through a convolution operation. The salient features are mainly texture features. Using C _{1 to} represent the classifier in the first stage, the classifier can roughly predict the position of each part, thereby generating a confidence map of each part. The classifier structure is as follows:

among them,

Represents the pixel space of the image, x _i is the position of each pixel in the image, and p represents a specific model part,

Represents the confidence value of location p in the first stage.

The confidence map and image features obtained in the first stage are used as input data in the second stage, and the original image is used as input again. As the acceptance domain of the network continues to expand, the learned features will be different from the previous stage. The feature functions used include the image data features, the confidence map of each part at this stage, and the context information of the classifiers at all levels. The classifier C ₂ continues to predict the position of each part, which is a modification of the predicted position in the previous stage. The overall target F (t) is as follows:

In equation (7),

Denotes that the ideal confidence is obtained at t ∈ T. Through the continuous iteration of the two stages, the location of the predicted part is made more accurate, and finally a more accurate position of each part is obtained.

It is further necessary to know that, for multi-frame optical flow maps, the optical flow threshold can be set by the optical flow method, the effective motion areas in the video are extracted, and the video fragments with human targets are screened for single frame image conversion. In order to generate a random extraction frame, set a hash function calculation every 24 frames, select a random function random each time, take the frame number where each frame is located as its hash address, and get a randomly generated frame number, that is, For extracting frames.

For the constraint equation of the multi-frame optical flow diagram, set the target moving distance to be small enough, and at the same time, the time required for the movement can be ignored, then the constraint equation of the multi-frame optical flow diagram is transformed by Taylor formula, as shown below:

I _x × V _x + I _y × V _y + I _z × V _z = -It (8)

In formula (8), I _x , I _y , I _z , and I _t are components of I (x, y, z, t) at x, y, z, and t, and V _x , V _y , and V _z are respectively Is the composition of x, y, and z in the optical flow vector of I (x, y, z, and t). The three partial differentials are approximated by the corresponding direction difference of the image at the pixel of x, y, z, and t. .

A method for forming a two-dimensional vector field is specifically: obtaining an optical flow map by continuously extracting multiple frames at time t, and assigning a velocity vector to each pixel in the image to form a motion vector field, which is obtained through a preprocessing operation A two-dimensional vector field formed by optical flow displacement between successive frames.

Further, according to an embodiment of the present invention, for the problem of detecting key points of multiple people, detecting the body parts of different people, it is necessary to combine the bodies of each person separately to form a complete human body. The method used is two-dimensional Vector field. It is a 2D vector set. Each 2D vector set encodes the position and orientation of a human body part. The position and orientation information is stored in the vector. Each vector will have an affinity area between the two related human body parts. Each of these pixels has a 2D vector describing the direction. The affinity region exists in the form of a response graph, and the dimensions are two-dimensional. If multiple people overlap at a certain point, sum the vectors of n people and divide by the number of people.

The detection area is set in the video, the target detection is performed by the method of the present invention in a complex scene, the human target is located, identified and tracked, and the hover detection is performed on the event that the same human target moves within the area for a certain time, available Intelligent monitoring in banks, government agencies, embassies, cultural and religious gathering places, high-security perimeters, commercial and residential areas, etc., find suspicious targets and issue timely warnings to eliminate potential security risks.

Moreover, the method of the present invention can accurately identify and locate the key points of the human body, and on this basis, can judge the behavior and posture of personnel. It can be applied to many fields such as petroleum, industry, medical treatment and security, and these fields are facing many safety issues. Hidden danger factors, such as accidentally falling into the sea for oil drilling and production operations, industrial production personnel failing to wear safety equipment in compliance with regulations, and elderly and patient falls. The method of the invention can reduce the time of manual intervention, avoid economic losses caused by personal accidents and illegal operation of production, thereby ensuring the safety of industrial production, saving manpower and material resources, and improving the level of production management.

A human body key point detection system based on a complex scene, as shown in FIG. 5, is a structural diagram of a human body key point detection system based on a complex scene according to the present invention. The system includes a data pre-processing module for monitoring video. The information is processed to obtain a single frame static image and a multi-frame optical flow image; a human target detection module that extracts features of a single frame static image sent by the data preprocessing module through a convolution operation, and uses a small convolution kernel convolution filter Predict the actual bounding box of the object in each bounding box and calculate the actual confidence, and use the actual confidence as the target bounding box. Use SVM posterior discrimination to discriminate the actual confidence from the preset confidence to remove invalid bounding boxes. In order to modify the position of the target bounding box, a discrete human target bounding box is obtained; and the human keypoint detection module receives the discrete human target bounding box coordinates sent by the human target detection module, and extracts features through a convolution operation to obtain a feature map, and Obtain the key points and the degree of correlation of the parts, and use the predictor to generate a position position map for each part of the human body. FIG position signal and the two-dimensional vector fields to achieve precise detection of the key body.

Among them, the human keypoint detection module uses several stages of iteration, and uses the confidence map obtained in the previous stage and the extracted features as the input of the next stage to continuously iterate between several stages to obtain an accurate partial position information map. Specifically, the human keypoint detection module adopts the steps of the human keypoint detection algorithm described above based on the human keypoint detection method in a complex scene.

Among them, the human target detection module also adopts the steps of the human target detection algorithm described above based on the human key point detection method in a complex scene.

To sum up, the human body keypoint detection system and method based on complex scenes of the present invention can quickly and accurately detect all keypoints of human targets in complex scenes, and can be applied to multiple fields for positioning, recognition, tracking, and Behavior and posture discrimination.

Although the content of the present invention has been described in detail through the above preferred embodiments, it should be recognized that the above description should not be considered as limiting the present invention. After reading the above content by those skilled in the art, various modifications and alternatives to the present invention will be apparent. Therefore, the protection scope of the present invention should be defined by the appended claims.

Claims

A method for detecting key points of a human body based on a complex scene is characterized in that the method includes:

(S100) Input monitoring video information, and perform preprocessing to obtain single-frame static image and multi-frame optical flow image;

(S200) Extract features from a single frame static map by convolution operations to obtain feature maps. In order to solve the impact of interference targets on human target detection in complex scenes, a human target detection algorithm is used to actualize the confidence and preset of the feature map. Discriminate with confidence, remove non-human targets, and obtain discrete human target bounding boxes;

(S300) Use optical flow stacking for multi-frame optical flow diagrams to form a two-dimensional vector field;

(S400) Extract the features in the discretized human target bounding box to obtain a feature map, obtain the key points and the degree of association of the part, use a predictor to generate a partial position information map for each part of the human body, and use the partial position information map and two The dimensional vector field enables accurate detection of key points of the human body;

In the step S400, in the first stage, the discretized human target bounding box is enlarged, and the original image is used as an input. After the feature is extracted by using a convolution operation, the confidence value of each part is predicted by the classifier from the original image to generate Corresponding confidence map, and using the confidence map obtained in the previous stage and the extracted features as the input of the next stage, iterating continuously between several stages to obtain an accurate partial location confidence map.
The method for detecting key points of a human body based on a complex scene according to claim 1, wherein the human target detection algorithm comprises:

(S210) Generate a set of fixed-size default bounding box sets for single-frame still images of different sizes, and perform feature extraction on the regions within the set of default bounding boxes;

(S211) For the physical characterization of the human target, extract the main features to form feature map units at different levels. As an image data set, the feature map units at each level are tiled with feature maps in a convolutional manner so that each default The position of the bounding box and the corresponding feature map unit is fixed;

(S212) On each feature map unit, a small convolution kernel convolution filter is used to predict the actual bounding box of the object in each default bounding box, the actual bounding box is used as the target bounding box, and the actual confidence degree is calculated , Judging the actual confidence level and the preset reliability level to remove invalid bounding boxes and to correct the target bounding box position;

(S213) Output discrete target bounding boxes at different levels, which have different aspect ratio scales.
The method for detecting key points of a human body based on a complex scene according to claim 2, characterized in that in said step S212, in performing the confidence discrimination process, it is necessary to calculate each default bounding box and the corresponding The actual bounding box error and corresponding scores to predict the categories and confidence of all targets in the default bounding box area;

Set the threshold of the preset reliability; when the actual confidence is greater than the threshold, perform model loss calculation; when the actual confidence is less than the threshold, perform SVM posterior determination; when the determination is For human targets, fine-tune the target bounding box; when it is judged to be a non-human target, remove invalid bounding boxes.
The method for detecting key points of a human body based on a complex scene according to claim 3, wherein the model loss calculation is completed by a loss function, and the loss function is:

In formula (1), L (e) is the loss error, y is the expected output, and α is the actual output;

Perform moment estimation on the distribution of y, and use α to represent the cross entropy of y as:

In Equation (2), α i is the actual output of the ith default bounding box, and y i is the expected output of the ith default bounding box;

The average cross entropy of the n default bounding boxes is:

In formula (3), y i, n represents the expected output of the i-th default bounding box when the number of matching default bounding boxes is n; α i, n represents the number of matching default bounding boxes as n , The actual output of the i-th default bounding box.
The method for detecting key points of a human body based on a complex scene according to claim 4, characterized in that, in step S212, when there is a confusing target, the SVM post-judgment is performed on the human target and the confusing target, and a large number of The manually labeled image data set is sent to the SVM classifier with pre-trained human targets and confused targets. After the confidence level is determined, the local SVM is classified again, and the identified confused targets are removed as negative samples, and the human targets are regarded as positive. The sample is scored to determine whether it is a real person target based on the confidence of the positive sample person category.
The method for detecting human key points based on a complex scene according to claim 5, wherein the overall target loss function of the double discrimination is a weighted average sum of the confidence loss and the localization score loss, and the overall target loss function is:

In Equation (4), δ is the initial weight term; N is the number of default bounding boxes that match the actual bounding box; L (α, c) is the loss function of confidence; L (α, f) is the localization score Loss function

The cross-validation sets the initial weighting term δ to 1; when the expected output is evaluated with confidence, the output is the confidence C of each class, and the loss function L (α, c) of the confidence is:

In formula (5), y i, N indicates the expected output of the i-th default bounding box when the number of matching default bounding boxes is N; α i, N indicates that when the number of matching default bounding boxes is N , The actual output of the i-th default bounding box;

When N = 0, the confidence loss is 0;

when
, It means that the i-th default bounding box matches the j-th actual bounding box of the category p;

when
, It means that the i-th default bounding box does not match the j-th actual bounding box of the category p, and the localized scoring loss function is:

In formula (6),
Represents the score that the default bounding box matches the actual bounding box; f j represents the preset score of the default bounding box,
Represents the actual score of the α i default bounding box; Δ represents the interval.
The method for detecting key points of a human body based on a complex scene according to any one of claims 1-6, wherein the structure of the classifier C 1 in the first stage is:

among them,
Represents the pixel space of the image, x i represents the position of each pixel in the image, p represents the specific model part,
Represents the confidence value of the position p in the first stage;

By using the confidence map and extracted features obtained in the previous stage as data input in the next stage to modify the position in the previous stage, the overall objective F (t) is:

In equation (7),
Denotes that the ideal confidence is obtained at t ∈ T.
The method for detecting key points of a human body based on a complex scene according to claim 7, characterized in that in said step S300, an optical flow threshold is set and extracted by using an optical flow method for said multi-frame optical flow map. The effective motion area in the video is selected, and the video segment with the human target is filtered to be converted into a single frame image, and a hash function calculation is performed every arbitrary interval frame. A random function random is selected, and the frame where each frame is located is taken. The number is its hash address, and the randomly generated frame number is the extracted frame;

The constraint equation of the multi-frame optical flow diagram is converted into:

I x × V x + I y × V y + I z × V z = -It (8)

In formula (8), I x , I y , I z , and I t are components of I (x, y, z, t) at x, y, z, and t, and V x , V y , and V z are respectively Is the composition of x, y, z in the optical flow vector of I (x, y, z, t), and I (x, y, z, t) is the voxel at (x, y, z) position;

The method for forming a two-dimensional vector field includes: successively extracting multiple frames at time t to obtain an optical flow map, and assigning a velocity vector to each pixel in the image to form a motion vector field, which is obtained through a preprocessing operation; Optical flow between successive frames shifts the stacked fields to form a two-dimensional vector field.
The method for detecting a human keypoint based on a complex scene according to claim 8, wherein the human keypoint detection algorithm flow comprises:

(S410) Using the coordinates of the discretized human target bounding box obtained by the target detection as the initial input of the algorithm, extracting features through a convolution operation to obtain a feature map;

(S411) Body part positioning and degree of association analysis are performed simultaneously on two branches. All key points are obtained through body part positioning, and degree of association between all parts is obtained through association degree analysis to establish a relative position relationship;

The body part positioning algorithm described in (S412) consists of a predictor. It is divided into several stages. Each stage repeatedly generates a confidence map for each part of the human body. Each confidence map contains a certain kind of key points. The confidence map and the original image Features are also used as inputs in the next stage to predict the positions of various parts, and then determine the positions of key points of the human body;

(S413) encode the position and direction of the human body part, and determine the subordinate problem of the key points of multiple people through the direction of the vector in the two-dimensional vector field;

(S414) Use the displacement length between the vectors to establish the relative position relationship between various parts of the human body, realize the prediction and estimation of invisible key points of the human body, and obtain detailed information of all key points of the human body;

In step S412, the confidence maps at all scales are accumulated for each part to obtain a total confidence map, and the point with the highest degree of confidence is found, which is the position of the corresponding key point;

For multi-person keypoint detection, each person's body is combined together through a two-dimensional vector field to form a complete human body; when multiple people overlap at a certain point, the vectors of n people are summed and divided by the number of people.
A human body keypoint detection system based on a complex scene is characterized in that the system includes:

Data pre-processing module, which processes the surveillance video information to obtain single-frame still images and multi-frame optical flow images;

A human target detection module that extracts the features of a single frame static map sent by the data preprocessing module through a convolution operation, uses a small convolution kernel convolution filter to predict the actual bounding box of the object in each bounding box, and calculates the actual Confidence, using the actual bounding box as the target bounding box, using SVM posterior discrimination to discriminate the actual confidence from the pre-set reliability to remove the invalid bounding box, modify the position of the target bounding box, and obtain a discrete person target bounding box ;as well as

The human key point detection module receives the discrete human target bounding box coordinates sent by the human target detection module, extracts features through a convolution operation to obtain a feature map, and obtains the key points and the degree of association of the part. The predictor is used as Generate partial position information maps for each part of the human body, and realize accurate detection of key points of the human body through the partial position information maps and two-dimensional vector fields;

Wherein, the human keypoint detection module adopts several stages of iteration, and uses the confidence map and extracted features obtained in the previous stage as the input of the next stage to continuously iterate between several stages to obtain accurate part positions Letter illustration.