CN111767792A - Multi-person key point detection network and method based on classroom scene - Google Patents

Multi-person key point detection network and method based on classroom scene

Info

Publication number
CN111767792A
Authority
CN
China
Prior art keywords
human body
key point
module
detection
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010439222.9A
Other languages
Chinese (zh)
Inventor
滕国伟 (Teng Guowei)
丁敏 (Ding Min)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202010439222.9A priority Critical patent/CN111767792A/en
Publication of CN111767792A publication Critical patent/CN111767792A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-person key point detection network and method based on a classroom scene. The network comprises a human body target detection module, a human body target region fusion module, a human body target region feature extraction module and a key point detection and integration module. The invention performs efficient feature fusion over multiple stages, improving the OpenPose and YOLOv3 networks respectively on the basis of a multi-scale feature fusion strategy. An Inception module based on dilated convolution replaces the feature extraction network of OpenPose so that local information with a larger receptive field can be obtained; a dense connection module is fused into the shallow network of YOLOv3 to fuse shallow and high-level features; a GIoU loss function replaces the bounding box regression loss function of YOLOv3 to improve detection precision; and region fusion is then performed through a human body prediction frame fusion strategy to output the detection regions. The two networks are cascaded into one framework for key point detection, effectively alleviating the difficulty of locating small-scale students in the back rows of a classroom and the problem of false key point detection.

Description

Multi-person key point detection network and method based on classroom scene
Technical Field
The invention relates to human body key point detection, in particular to a multi-person key point detection network and a multi-person key point detection method based on a classroom scene.
Background
Human body key point detection, also called human body posture estimation, is a fundamental problem in computer vision and a prerequisite task for human action recognition, behavior analysis, human-computer interaction and the like. It can be understood as estimating the positions of key points of the human body, such as the head, elbows, wrists and knees. Human body posture estimation can be divided into 2D/3D key point detection and single-person/multi-person key point detection; key point tracking can be performed after detection is finished, which is also called human body posture tracking. Human key point detection faces many challenges: flexible, small and nearly invisible joints, occlusion, clothing and lighting changes all add difficulty. The invention mainly relates to 2D multi-person key point detection, and aims to detect the key points of students in a classroom for subsequent posture recognition: given an RGB image, accurately locate the key points of multiple human bodies and determine the human body to which each key point belongs.
Currently, there are two main methods for multi-person key point detection:
(1) Top-down: first, target (human body) detection is carried out, and then single-person posture estimation is performed on each detected human body (with networks such as CPM, Stacked Hourglass and HRNet). The Top-down approach is necessarily constrained by the target detection task, as single-person pose estimation based on a bounding box is vulnerable to occlusion and small-scale human targets.
(2) Bottom-up: first, the key points of all people are detected, and then the key points are matched to the corresponding human bodies by an algorithm (such as the dynamic programming of OpenPose, the tag matching of Associative Embedding, or the greedy algorithm of PersonLab). Occlusion remains a challenge, and because human bodies appear at different scales in the image, extracting key point features is harder than in the Top-down method.
Generally, the Top-down method has higher precision but poorer real-time performance, and the Bottom-up method has lower precision than the Top-down method, but has higher speed and better real-time performance.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the problems of occlusion in the classroom scene, the difficulty of locating and detecting small-scale targets in the back rows, and false detection of key points in non-human regions, the invention provides a multi-person key point detection network and method based on a classroom scene; the network combines the Top-down and Bottom-up approaches.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-person key point detection network based on classroom scene comprises a human body target region detection module, a human body target region fusion module, a human body target region feature extraction module and a key point detection and integration module;
the human body target area detection module is sequentially connected with the human body target area fusion module, the human body target area feature extraction module and the key point detection and integration module.
The human body target area detection module is used for detecting the region of each student in the picture.
The human body target area fusion module is used for fusing the student regions detected by the human body target area detection module.
The human body target region feature extraction module is used for extracting features from the student regions fused by the human body target region fusion module.
The key point detection and integration module is used for predicting the confidence of key points and the part affinity within the regions where students exist, and then performing limb matching to obtain the final multi-person key point detection result.
The human body target area detection module is a YOLOv3 network with a dense connection module introduced into its shallow network: features of the input image are extracted by densely connected convolutions, and a GIoU loss function replaces the bounding box regression loss function of YOLOv3, so that shallow and deep features are fused better and faster, detection precision is improved, and the difficulty of detecting low-resolution students in the back rows of a classroom is alleviated.
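For illustration, a minimal PyTorch sketch of such a densely connected convolution block is given below; the layer count, growth rate and activation are assumptions chosen for the sketch, not values specified by the invention:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Densely connected convolution block for a shallow detection network.

    Sketch only: the layer count (4) and growth rate (32) are assumptions.
    Each layer receives the concatenation of all earlier feature maps, which
    is what lets shallow features propagate directly to deeper layers.
    """
    def __init__(self, in_ch, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.LeakyReLU(0.1, inplace=True),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1, bias=False)))
            ch += growth  # next layer sees all feature maps produced so far

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # dense connectivity
        return torch.cat(feats, dim=1)
```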
The human body target area fusion module fuses the human body frames detected by YOLO-DesNet. First, each human body frame is enlarged, ensuring that the boundary of the enlarged prediction frame does not exceed the boundary of the original image. When fusing any two human body prediction frames, the module first judges whether the two frames intersect; if they do, an IOU_concat value is computed (defined by analogy with the IoU), and when the IOU_concat of the two prediction frames exceeds 0.5, the two regions are fused.
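A minimal sketch of this fusion strategy follows. The IOU_concat definition (intersection over the smaller frame) and the 0.5 threshold come from the description above; the 10% enlargement factor and the iterative pairwise merging are assumptions for illustration:

```python
def iou_concat(a, b):
    """Intersection area divided by the area of the SMALLER of two boxes.
    Boxes are (x1, y1, x2, y2) in pixels."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0  # no intersection
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / min(area_a, area_b)

def fuse_boxes(boxes, img_w, img_h, expand=0.1, thresh=0.5):
    """Enlarge each box (clamped to the image), then repeatedly merge any
    pair whose IOU_concat exceeds the threshold into one enclosing box."""
    out = []
    for x1, y1, x2, y2 in boxes:
        dw, dh = expand * (x2 - x1), expand * (y2 - y1)
        out.append([max(0, x1 - dw), max(0, y1 - dh),
                    min(img_w, x2 + dw), min(img_h, y2 + dh)])
    merged = True
    while merged:
        merged = False
        for i in range(len(out)):
            for j in range(i + 1, len(out)):
                if iou_concat(out[i], out[j]) > thresh:
                    a, b = out[i], out.pop(j)
                    out[i] = [min(a[0], b[0]), min(a[1], b[1]),
                              max(a[2], b[2]), max(a[3], b[3])]
                    merged = True
                    break
            if merged:
                break
    return out
```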
The human body target area feature extraction module is an Inception network based on dilated convolution: by introducing dilated convolutions at different scales, local information with a larger receptive field is obtained and the network's perception of local information is improved. The input picture first passes through a 1 x 1 standard convolution that organizes information across channels, improving the expressive capacity of the network and providing more nonlinear transformations. The output features are then convolved again with 1 x 1 and 3 x 3 standard convolution kernels, increasing the network's adaptability to different human body scales. Next, dilated convolutions with different dilation rates convolve the output of the previous step, obtaining local information with a larger receptive field and improving the detection of small-scale human targets. The convolution features output by the different branches are added pixel-wise, and the summed features are convolved once more with a 1 x 1 standard convolution to eliminate the aliasing effect caused by convolving with kernels of different sizes. Finally, the fused convolution features are passed through a ReLU nonlinear activation.
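A minimal PyTorch sketch of such a dilated-convolution Inception block follows; the channel widths and the dilation rates (2 and 4) are illustrative assumptions rather than values fixed by the invention:

```python
import torch.nn as nn

class DilatedInceptionBlock(nn.Module):
    """Sketch of the dilated-convolution Inception block described above."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # 1 x 1 convolution: organize information across channels
        self.reduce = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        # 1 x 1 followed by 3 x 3: adapt to different human body scales
        self.branch = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1))
        # dilated 3 x 3 branches enlarge the receptive field (rates assumed)
        self.dil2 = nn.Conv2d(out_ch, out_ch, 3, padding=2, dilation=2)
        self.dil4 = nn.Conv2d(out_ch, out_ch, 3, padding=4, dilation=4)
        # final 1 x 1 convolution removes aliasing from mixed kernel sizes
        self.smooth = nn.Conv2d(out_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.reduce(x)
        b = self.branch(x)
        fused = b + self.dil2(b) + self.dil4(b)  # pixel-wise branch sum
        return self.act(self.smooth(fused))
```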
The key point detection and integration module is a cascaded multi-stage network that simultaneously predicts human body key point confidence maps and part relation (affinity) maps; a loss function is set after each stage, and the final stage outputs the key point confidence map and the part relation map, after which limb matching yields the final multi-person key point detection result.
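The cascade with per-stage supervision can be sketched as follows (an OpenPose-style arrangement; the stage modules and the L2 loss form are assumptions for illustration, not code from the invention):

```python
import torch
import torch.nn.functional as F

def cascade_forward(features, stages):
    """Run the cascaded stages. Stage 1 sees only the backbone features;
    each later stage also sees the previous confidence maps S and part
    affinity fields L (so its input width differs from stage 1's)."""
    S = L = None
    outputs = []
    for stage in stages:
        inp = features if S is None else torch.cat([features, S, L], dim=1)
        S, L = stage(inp)  # each stage returns (confidence maps, PAFs)
        outputs.append((S, L))
    return outputs

def intermediate_loss(outputs, gt_S, gt_L):
    """A loss term after every stage (intermediate supervision)."""
    return sum(F.mse_loss(S, gt_S) + F.mse_loss(L, gt_L) for S, L in outputs)
```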
A multi-person key point detection method based on a classroom scene comprises the following specific operation steps:
Step 1: human body target region detection, i.e., detect the region of each student in the picture; this detection does not need to be overly fine.
Step 2: human body target region fusion, i.e., fuse the student regions roughly detected by the human body target region detection module.
Step 3: human body target region feature extraction, i.e., extract features from the student regions fused by the human body target region fusion module.
Step 4: key point detection and integration, i.e., predict the confidence of key points and the part affinity within the regions where students exist, then perform limb matching to obtain the final multi-person key point detection result.
The specific steps of step 1 are as follows:
Step 1.1: extract features from the input image with 1 densely connected convolution and 3 residual convolutions, which better realizes multiplexing and fusion of multi-layer network features;
Step 1.2: deepen the feature extraction network with 3 groups of residual modules, improving the model's ability to select and extract deep image features;
Step 1.3: using a multi-scale pyramid structure, perform 2 upsampling operations with tensor splicing to same-size feature maps from earlier layers of the network, and perform regression prediction 3 times, realizing multi-scale detection of targets of different sizes;
Step 1.4: replace the bounding box regression loss function of YOLOv3 with the GIoU loss function;
Step 1.5: let the target confidence loss, the target category loss and the target bounding box regression loss participate in backpropagation simultaneously; set the number of iterations to 50000, the learning rate to 0.0001 and the weight decay to 0.0004 to train the network.
The specific steps of step 2 are as follows:
Step 2.1: first enlarge the human body frames detected in step 1, ensuring that the boundaries of the enlarged frames do not exceed the image boundary.
Step 2.2: judge from the coordinate relation of the prediction frames whether any two human body frames intersect; if so, compute their IOU_concat value. When the IOU_concat of two frames exceeds a certain threshold (set to 0.5), perform region fusion. Here, IOU_concat is defined as the ratio of the intersection of any two human body prediction frames to the smaller of the two prediction frames.
The specific steps of step 3 are as follows:
Step 3.1: for the input picture, use a 1 x 1 standard convolution to organize information across channels, improving the expressive capacity of the network and providing more nonlinear transformations.
Step 3.2: convolve the features output in step 3.1 twice, using 1 x 1 and 3 x 3 standard convolution kernels, increasing the network's adaptability to different human body scales.
Step 3.3: convolve the features output in step 3.2 with dilated convolutions of different dilation rates to obtain local information with a larger receptive field and improve detection of small-scale human targets.
Step 3.4: add the convolution features output by the different branches pixel-wise, and convolve the summed features once more with a 1 x 1 standard convolution to eliminate the aliasing effect caused by convolving with kernels of different sizes.
Step 3.5: pass the fused convolution features output in step 3.4 through a ReLU nonlinear activation to obtain the final extracted features.
The specific steps of step 4 are as follows:
Step 4.1: input the feature map output in step 3 into stage 1, and predict a key point confidence map S1 and a part affinity vector field L1;
Step 4.2: input S1 and L1 predicted in step 4.1, together with the original feature map from step 3, into stage 2 to obtain S2 and L2;
Step 4.3: each subsequent stage takes the S and L output by the previous stage plus the feature map from step 3 as input, up to stage 6, which yields the final prediction;
Step 4.4: apply non-maximum suppression (NMS) to the human body key point confidence maps obtained in step 4.3 to obtain a discrete set of key points, from which candidate limb segments are assembled;
Step 4.5: score the candidate limb segments of step 4.4 according to the part relation maps obtained in step 4.3, and perform maximum bipartite graph matching with the Hungarian algorithm to obtain the final key point detection result.
Compared with the prior art, the invention provides the following substantial technical progress:
1) The invention provides a multi-person key point detection network combining Top-down and Bottom-up. Aiming at the problems of occlusion in the classroom scene, the difficulty of locating and detecting small-scale students in the back rows, and detection of key points in non-human regions, the invention performs efficient feature fusion over multiple stages, improves the OpenPose and YOLOv3 networks respectively on the basis of a multi-scale feature fusion strategy, and fuses the two networks into one framework. The invention comprises 4 modules: a human body region detection module, a human body region fusion module, a human body region feature extraction module and a key point detection and integration module.
2) Borrowing the idea of dense connection, the invention uses densely connected convolution blocks in the shallow network, extracts features of the input image through densely connected convolutions, and replaces the bounding box regression loss function of YOLOv3 with the GIoU loss function, so that shallow features propagate better and faster to the deep network, detection precision is improved, and the difficulty of locating low-resolution students in the back rows of a classroom is alleviated.
3) The invention provides an Inception Net network based on dilated convolution (Inception-DCNet) and replaces the backbone of OpenPose (VGG-19) with Inception-DCNet, aiming at obtaining local information with a larger receptive field, improving the network's perception of local information, and improving the location and detection of small-target students in the back rows of a classroom.
Drawings
Fig. 1 is a schematic diagram of a multi-person key point detection network structure based on a classroom scene.
FIG. 2 is an effect diagram of multi-person key point detection in a classroom scene.
FIG. 3 is a schematic structural diagram of the YOLO-DesNet network with fused dense connection blocks of step 1.
Fig. 4 is a schematic structural diagram of the dilated-convolution-based Inception Net (Inception-DCNet) network of step 3.
Fig. 5 is a schematic network structure diagram of the confidence map of the predicted keypoint and the location relationship map in step 4.
Detailed description of the preferred embodiments
The invention is described in detail below with reference to the drawings and preferred embodiments:
the first embodiment is as follows:
in this embodiment, as shown in fig. 1, a multi-user key point detection network based on a classroom scene includes a human target area detection module 1, a human target area fusion module 2, a human target area feature extraction module 3, and a key point detection and integration module 4; the human body target area detection module 1 is sequentially connected with a human body target area fusion module 2, a human body target area feature extraction module 3 and a key point detection and integration module 4; the human body target area detection module 1 is used for detecting the area of each student in the picture; the human body target area fusion module 2 is used for fusing the areas of the students roughly detected in the human body target area detection module 1; the human body target area feature extraction module 3 is used for extracting features of the student areas fused in the human body target area fusion module 2; and the key point detection and integration module 4 is used for predicting the confidence coefficient and the position relation affinity of the key points in the areas where the students exist, and then performing limb matching to obtain the final multi-user key point detection result.
Example two:
this embodiment is substantially the same as the first embodiment, and is characterized in that:
in this embodiment, more specifically, the human target area detection module is configured to detect an area of each student in the picture, where the detection of the student target does not need to be too fine, and a detection frame is allowed to contain a plurality of students, see fig. 3.
And the human body target area fusion module is used for fusing the areas of the students detected in the human body target area detection module.
The human body target area feature extraction module is configured to perform feature extraction on the student areas fused in the human body target area fusion module, which is shown in fig. 4.
The key point detection and integration module is used for predicting the confidence coefficient and the position relation affinity of the key points in the area where the students exist, and then performing limb matching to obtain a final multi-person key point detection result, which is shown in an attached figure 5.
The human body target area detection module is a YOLO V3 network with a dense connection module introduced in a shallow network, features of an input image are extracted by using dense connection convolution, and a GIOU loss function is used for replacing a bounding box regression loss function of YOLO V3, so that the shallow feature and the deep feature can be better and faster fused, the detection precision is improved, and the problem that detection of low-resolution students in the back row of a classroom is difficult is solved, and the method is shown in figure 3.
The human body target area fusion module firstly amplifies the human body frame detected by the human body target area detection module, and ensures that the boundary of the amplified human body prediction frame does not exceed the boundary of the original image. When any two human body prediction frames are fused, whether intersection exists between the two prediction frames is judged firstly, if intersection exists, the IOU is defined according to the idea of referring to the IOUmaxWhen IOU of two human body prediction boxesmaxAbove 0.5, the two regions are fused.
The human body target area feature extraction module is an inclusion network based on cavity convolution, local information with a larger receptive field is obtained by introducing cavity convolutions with different scales, and the sensing capability of the network to the local information is improved. The input pictures are firstly subjected to cross-channel organization information by using 1-by-1 standard convolution, the expression capacity of the network is improved, and more nonlinear transformation is provided. And performing secondary convolution on the output characteristics by using the standard convolution checks of 1 x 1 and 3 x 3, and increasing the adaptability of the network to different human body scales. And then, carrying out convolution by using the output characteristics of the previous step again by using the empty convolution with different expansion rates to obtain local information of a larger receptive field and improve the detection performance of the small-size human body target. And adding convolution characteristics output by different branches according to pixel point levels, and convolving the added characteristics again by using the standard convolution of 1 x 1 to eliminate aliasing effects caused by convolution using convolution kernels with different sizes. And finally, carrying out nonlinear activation on the output fusion convolution characteristics through a ReLu function, and referring to the attached figure 4.
The key point detection and integration module is a cascaded multi-stage network, predicts a human body key point confidence map and a position relation map, sets a loss function after each stage, and finally outputs the key point confidence map and the position relation map and performs limb matching to obtain a final multi-person key point detection result, which is shown in the attached figure 5.
Example three:
As shown in fig. 1, a multi-person key point detection method based on a classroom scene is operated with the above network; the specific flow steps are as follows:
Step 1: human body target region detection, i.e., detect the region of each student in the picture; this detection does not need to be overly fine.
Step 2: human body target region fusion, i.e., fuse the student regions roughly detected by the human body target region detection module.
Step 3: human body target region feature extraction, i.e., extract features from the student regions fused by the human body target region fusion module.
Step 4: key point detection and integration, i.e., predict the confidence of key points and the part affinity within the regions where students exist, then perform limb matching to obtain the final multi-person key point detection result.
Example four:
the present embodiment is basically the same as the third embodiment, and the features are as follows:
as shown in fig. 3, the specific steps of step 1 are:
step 1.1: and performing 1-time dense connection convolution and 3-time residual convolution on the input image to extract features, so that multiplexing and fusion of network multi-layer features can be better realized.
Step 1.2: the structure of the feature extraction network is deepened through 3 groups of residual modules, and the selection and extraction capability of the model on deep features of the image is improved.
Step 1.3: and (3) carrying out 3 times of regression prediction by using a multi-scale pyramid structure through 2 times of upsampling and carrying out tensor splicing with the characteristic image with the same size in the upper layer of the network, thereby realizing multi-scale detection of targets with different sizes.
Step 1.4: the bounding box regression loss function of YOLO V3 was replaced with the GIOU loss function.
Step 1.5: and simultaneously participating in back propagation by the target confidence coefficient loss, the target category loss and the target boundary box regression loss, setting the iteration number to be 50000, the learning rate to be 0.0001 and the weight attenuation to be 0.0004, and helping the network to finish training.
The specific steps of step 2 are as follows:
Step 2.1: first enlarge the human body frames detected in step 1, ensuring that the boundaries of the enlarged frames do not exceed the image boundary.
Step 2.2: judge from the coordinate relation of the prediction frames whether any two human body frames intersect; if so, compute their IOU_concat value. When the IOU_concat of two frames exceeds a certain threshold (set to 0.5), perform region fusion. Here, IOU_concat is defined as the ratio of the intersection of any two human body prediction frames to the smaller of the two prediction frames.
As shown in fig. 4, the specific steps of step 3 are:
Step 3.1: for the input picture, use a 1 x 1 standard convolution to organize information across channels, improving the expressive capacity of the network and providing more nonlinear transformations.
Step 3.2: convolve the features output in step 3.1 twice, using 1 x 1 and 3 x 3 standard convolution kernels, increasing the network's adaptability to different human body scales.
Step 3.3: convolve the features output in step 3.2 with dilated convolutions of different dilation rates to obtain local information with a larger receptive field and improve detection of small-scale human targets.
Step 3.4: add the convolution features output by the different branches pixel-wise, and convolve the summed features once more with a 1 x 1 standard convolution to eliminate the aliasing effect caused by convolving with kernels of different sizes.
Step 3.5: pass the fused convolution features output in step 3.4 through a ReLU nonlinear activation to obtain the final extracted features.
As shown in fig. 5, the specific steps of step 4 are:
Step 4.1: input the feature map output in step 3 into stage 1, and predict a key point confidence map S1 and a part affinity vector field L1.
Step 4.2: input S1 and L1 predicted in step 4.1, together with the original feature map from step 3, into stage 2 to obtain S2 and L2.
Step 4.3: each subsequent stage takes the S and L output by the previous stage plus the feature map from step 3 as input, up to stage 6, which yields the final prediction.
Step 4.4: apply non-maximum suppression (NMS) to the human body key point confidence maps finally obtained in step 4.3 to obtain a discrete set of key points, from which candidate limb segments are assembled.
Step 4.5: score the candidate limb segments of step 4.4 according to the part relation maps obtained in step 4.3, and perform maximum bipartite graph matching with the Hungarian algorithm to obtain the final key point detection result.
The invention provides a multi-person key point detection network combining Top-down and Bottom-up, which comprises 4 modules: a human body region detection module, a human body region fusion module, a human body region feature extraction module and a key point detection and integration module. Aiming at the problems of occlusion in the classroom scene, the difficulty of locating and detecting small-scale students in the back rows, and detection of key points in non-human regions, the invention improves the OpenPose and YOLOv3 networks respectively on the basis of a multi-scale feature fusion strategy and fuses the two networks into one framework. Multiple stages are used for efficient feature fusion and larger local receptive-field information is obtained, achieving good results on false key point detection and on locating and detecting small-target students.
The embodiments of the present invention have been described with reference to the accompanying drawings, but the present invention is not limited to these embodiments. Various changes and modifications can be made according to the purpose of the invention; any changes, modifications, substitutions, combinations or simplifications made according to the spirit and principle of the technical solution of the present invention shall be equivalent substitutions and shall fall within the protection scope of the present invention, as long as they meet the purpose of the invention and do not depart from the technical principle and inventive concept of the multi-person key point detection network and method based on the classroom scene.

Claims (10)

1. A multi-person key point detection network based on a classroom scene comprises a human body target region detection module (1), a human body target region fusion module (2), a human body target region feature extraction module (3) and a key point detection and integration module (4); the method is characterized in that:
the human body target region detection module (1) is sequentially connected with the human body target region fusion module (2), the human body target region feature extraction module (3) and the key point detection and integration module (4);
the human body target area detection module (1) is used for detecting the area of each student in the picture;
the human body target region fusion module (2) is used for fusing the regions of the students roughly detected in the human body target region detection module (1);
the human body target region feature extraction module (3) is used for extracting features of the student regions fused in the human body target region fusion module (2);
and the key point detection and integration module (4) is used for predicting the confidence coefficient and the position relation affinity of the key points in the areas where the students exist, and then performing limb matching to obtain the final multi-person key point detection result.
2. The multi-person keypoint detection network based on a classroom scene of claim 1, wherein: the human body target area detection module (1) is a YOLOv3 network with a dense connection module introduced into its shallow network, and a GIoU loss function replaces the bounding box regression loss function of YOLOv3, so that shallow and deep features are fused better and faster, detection precision is improved, and the difficulty of detecting low-resolution students in the back rows of a classroom is alleviated.
3. The multi-person keypoint detection network based on a classroom scene of claim 1, wherein: the human body target region fusion module (2) is used for fusing the human body frame regions detected in the human body target region detection module (1) and aims to reduce the situation that key points are detected at non-human positions subsequently.
4. The multi-person keypoint detection network based on a classroom scene of claim 1, wherein: the human body target region feature extraction module (3) is an Inception Net network based on dilated convolution, aiming at obtaining local information with a larger receptive field and improving the detection performance for small-scale students.
5. The multi-person keypoint detection network based on a classroom scene of claim 1, wherein: the key point detection and integration module (4) is a cascaded multi-stage network, a human body key point confidence map and a position relation map are predicted at the same time, a loss function is set after each stage, and finally the key point confidence map and the position relation map are output and limb matching is carried out to obtain a final multi-person key point detection result.
6. A multi-person key point detection method based on a classroom scene, operated with the multi-person key point detection network based on the classroom scene as claimed in claim 1, characterized by the following specific operation steps:
Step 1: human body target region detection: roughly detect the region of each student in the picture;
Step 2: human body detection region fusion: perform region fusion on the student regions detected in step 1;
Step 3: human body target region feature extraction: extract features from the fused student target regions obtained in step 2;
Step 4: key point detection: predict the confidence of key points and the part affinity within the regions where students exist, then perform limb matching to obtain the final key point detection result.
7. The method for detecting multi-person key points based on a classroom scene as claimed in claim 6, wherein the specific steps of step 1 are:
Step 1.1: extract features from the input image with 1 densely connected convolution and 3 residual convolutions, which better realizes multiplexing and fusion of multi-layer network features;
Step 1.2: deepen the feature extraction network with 3 groups of residual modules, improving the model's ability to select and extract deep image features;
Step 1.3: using a multi-scale pyramid structure, perform 2 upsampling operations with tensor splicing to same-size feature maps from earlier layers of the network, and perform regression prediction 3 times, realizing multi-scale detection of targets of different sizes;
Step 1.4: replace the bounding box regression loss function of YOLOv3 with the GIoU loss function;
Step 1.5: let the target confidence loss, the target category loss and the target bounding box regression loss participate in backpropagation simultaneously; set the number of iterations to 50000, the learning rate to 0.0001 and the weight decay to 0.0004 to train the network.
8. The method for detecting multi-person key points based on a classroom scene as claimed in claim 6, wherein the specific steps of step 2 are:
Step 2.1: first enlarge the human body frames detected in step 1, ensuring that the boundaries of the enlarged frames do not exceed the image boundary;
Step 2.2: first judge from the coordinate relation of the human body prediction frames whether any two human body frames intersect; if so, calculate their IOU_concat value; when the IOU_concat of two frames is greater than a certain threshold, perform region fusion; here, the IOU_concat value is defined as the ratio of the intersection of any two human body prediction frames to the smaller of the two prediction frames.
9. The method for detecting multi-person key points based on a classroom scene as claimed in claim 6, wherein step 3 comprises the following steps:
Step 3.1: for the input picture, use a 1 x 1 standard convolution to organize information across channels, improving the expressive capacity of the network and providing more nonlinear transformations;
Step 3.2: convolve the features output in step 3.1 twice, using 1 x 1 and 3 x 3 standard convolution kernels, increasing the network's adaptability to different human body scales;
Step 3.3: convolve the features output in step 3.2 with dilated convolutions of different dilation rates to obtain local information with a larger receptive field and improve detection of small-size human targets;
Step 3.4: add the convolution features output by the different branches pixel-wise, and convolve the summed features once more with a 1 x 1 standard convolution to eliminate the aliasing effect caused by convolving with kernels of different sizes;
Step 3.5: pass the fused convolution features output in step 3.4 through a ReLU nonlinear activation to obtain the final extracted features.
10. The method for detecting multi-person key points based on a classroom scene as claimed in claim 6, wherein the specific steps of step 4 are:
Step 4.1: input the feature map output in step 3 into stage 1, and predict a key point confidence map S1 and a part affinity vector field L1;
Step 4.2: input S1 and L1 predicted in step 4.1, together with the original feature map from step 3, into stage 2 to obtain S2 and L2;
Step 4.3: each subsequent stage takes the S and L output by the previous stage plus the feature map from step 3 as input, up to stage 6, which yields the final prediction;
Step 4.4: apply non-maximum suppression (NMS) to the human body key point confidence maps obtained in step 4.3 to obtain a discrete set of key points, from which candidate limb segments are assembled;
Step 4.5: score the candidate limb segments of step 4.4 according to the part relation maps obtained in step 4.3, and perform maximum bipartite graph matching with the Hungarian algorithm to obtain the final key point detection result.
CN202010439222.9A 2020-05-22 2020-05-22 Multi-person key point detection network and method based on classroom scene Pending CN111767792A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010439222.9A CN111767792A (en) 2020-05-22 2020-05-22 Multi-person key point detection network and method based on classroom scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010439222.9A CN111767792A (en) 2020-05-22 2020-05-22 Multi-person key point detection network and method based on classroom scene

Publications (1)

Publication Number Publication Date
CN111767792A true CN111767792A (en) 2020-10-13

Family

ID=72719526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010439222.9A Pending CN111767792A (en) 2020-05-22 2020-05-22 Multi-person key point detection network and method based on classroom scene

Country Status (1)

Country Link
CN (1) CN111767792A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507904A (en) * 2020-12-15 2021-03-16 重庆邮电大学 Real-time classroom human body posture detection method based on multi-scale features
CN112949379A (en) * 2020-12-30 2021-06-11 南京佑驾科技有限公司 Safety belt detection method and system based on vision
CN112966762A (en) * 2021-03-16 2021-06-15 南京恩博科技有限公司 Wild animal detection method and device, storage medium and electronic equipment
CN113158756A (en) * 2021-02-09 2021-07-23 上海领本智能科技有限公司 Posture and behavior analysis module and method based on HRNet deep learning
CN113297910A (en) * 2021-04-25 2021-08-24 云南电网有限责任公司信息中心 Distribution network field operation safety belt identification method
CN113537014A (en) * 2021-07-06 2021-10-22 北京观微科技有限公司 Improved darknet network-based ground-to-air missile position target detection and identification method
CN115272648A (en) * 2022-09-30 2022-11-01 华东交通大学 Multi-level receptive field expanding method and system for small target detection
CN115471773A (en) * 2022-09-16 2022-12-13 北京联合大学 Student tracking method and system for intelligent classroom

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657631A (en) * 2018-12-25 2019-04-19 上海智臻智能网络科技股份有限公司 Human posture recognition method and device
CN110532984A (en) * 2019-09-02 2019-12-03 北京旷视科技有限公司 Critical point detection method, gesture identification method, apparatus and system
CN110781765A (en) * 2019-09-30 2020-02-11 腾讯科技(深圳)有限公司 Human body posture recognition method, device, equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657631A (en) * 2018-12-25 2019-04-19 上海智臻智能网络科技股份有限公司 Human posture recognition method and device
CN110532984A (en) * 2019-09-02 2019-12-03 北京旷视科技有限公司 Critical point detection method, gesture identification method, apparatus and system
CN110781765A (en) * 2019-09-30 2020-02-11 腾讯科技(深圳)有限公司 Human body posture recognition method, device, equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
RONG FU ET AL: "Learning Behavior Analysis in Classroom Based on Deep Learning", 10th International Conference on Intelligent Control and Information Processing *
YAYUN QI ET AL: "Vehicle Detection Under Unmanned Aerial Vehicle Based on Improved YOLOv3", 2019 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) *
ZHE CAO ET AL: "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", arXiv:1611.08050v2 *
TANG LIN ET AL: "Research on human pose detection algorithms in crowded conditions", China Master's Theses Full-text Database, Information Science and Technology series *
DONG HONGYI: "Deep Learning PyTorch Object Detection in Practice", 31 January 2020 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507904A (en) * 2020-12-15 2021-03-16 重庆邮电大学 Real-time classroom human body posture detection method based on multi-scale features
CN112507904B (en) * 2020-12-15 2022-06-03 重庆邮电大学 Real-time classroom human body posture detection method based on multi-scale features
CN112949379A (en) * 2020-12-30 2021-06-11 南京佑驾科技有限公司 Safety belt detection method and system based on vision
CN113158756A (en) * 2021-02-09 2021-07-23 上海领本智能科技有限公司 Posture and behavior analysis module and method based on HRNet deep learning
CN112966762A (en) * 2021-03-16 2021-06-15 南京恩博科技有限公司 Wild animal detection method and device, storage medium and electronic equipment
CN112966762B (en) * 2021-03-16 2023-12-26 南京恩博科技有限公司 Wild animal detection method and device, storage medium and electronic equipment
CN113297910A (en) * 2021-04-25 2021-08-24 云南电网有限责任公司信息中心 Distribution network field operation safety belt identification method
CN113537014A (en) * 2021-07-06 2021-10-22 北京观微科技有限公司 Improved darknet network-based ground-to-air missile position target detection and identification method
CN115471773A (en) * 2022-09-16 2022-12-13 北京联合大学 Student tracking method and system for intelligent classroom
CN115471773B (en) * 2022-09-16 2023-09-15 北京联合大学 Intelligent classroom-oriented student tracking method and system
CN115272648A (en) * 2022-09-30 2022-11-01 华东交通大学 Multi-level receptive field expanding method and system for small target detection
CN115272648B (en) * 2022-09-30 2022-12-20 华东交通大学 Multi-level receptive field expanding method and system for small target detection

Similar Documents

Publication Publication Date Title
CN111767792A (en) Multi-person key point detection network and method based on classroom scene
US20210326597A1 (en) Video processing method and apparatus, and electronic device and storage medium
CN111898709B (en) Image classification method and device
CN111126472B (en) SSD (solid State disk) -based improved target detection method
CN111340814B (en) RGB-D image semantic segmentation method based on multi-mode self-adaptive convolution
CN110188239B (en) Double-current video classification method and device based on cross-mode attention mechanism
CN108399362A (en) A kind of rapid pedestrian detection method and device
CN110059598B (en) Long-term fast-slow network fusion behavior identification method based on attitude joint points
US11494938B2 (en) Multi-person pose estimation using skeleton prediction
CN109492627B (en) Scene text erasing method based on depth model of full convolution network
CN108664885B (en) Human body key point detection method based on multi-scale cascade Hourglass network
CN111104930B (en) Video processing method, device, electronic equipment and storage medium
CN110322509B (en) Target positioning method, system and computer equipment based on hierarchical class activation graph
CN113095106A (en) Human body posture estimation method and device
WO2021249114A1 (en) Target tracking method and target tracking device
CN113673354B (en) Human body key point detection method based on context information and joint embedding
CN110705566A (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN112084952B (en) Video point location tracking method based on self-supervision training
CN112329861B (en) Layered feature fusion method for mobile robot multi-target detection
CN111401192A (en) Model training method based on artificial intelligence and related device
Zhou et al. Applying (3+2+1)D residual neural network with frame selection for Hong Kong sign language recognition
CN114764941A (en) Expression recognition method and device and electronic equipment
CN115223020A (en) Image processing method, image processing device, electronic equipment and readable storage medium
CN114764870A (en) Object positioning model processing method, object positioning device and computer equipment
CN116580211B (en) Key point detection method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201013)