CN111767792A - Multi-person key point detection network and method based on classroom scene - Google Patents

Multi-person key point detection network and method based on classroom scene

Info

Publication number
CN111767792A
Authority
CN
China
Prior art keywords
human body
key point
module
detection
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010439222.9A
Other languages
Chinese (zh)
Inventor
滕国伟 (Teng Guowei)
丁敏 (Ding Min)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202010439222.9A priority Critical patent/CN111767792A/en
Publication of CN111767792A publication Critical patent/CN111767792A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-person key point detection network and method based on a classroom scene. The network comprises a human body target detection module, a human body target region fusion module, a human body target region feature extraction module and a key point detection and integration module. The invention performs efficient feature fusion over multiple stages, improving the OpenPose and YOLOv3 networks respectively on the basis of a multi-scale feature fusion strategy. An Inception module based on dilated convolution replaces the feature extraction network of OpenPose so that local information with a larger receptive field can be obtained; a dense connection module is fused into the shallow network of YOLOv3 to fuse shallow and high-level features; a GIoU loss function replaces the bounding box regression loss function of YOLOv3 to improve detection precision; and region fusion is then performed through a human body prediction frame fusion strategy to output the detection regions. The two networks are cascaded into one framework for key point detection, effectively alleviating the difficulty of locating small-scale students in the back rows of a classroom and the problem of false key point detection.

Description

Multi-person key point detection network and method based on classroom scene
Technical Field
The invention relates to human body key point detection, in particular to a multi-person key point detection network and a multi-person key point detection method based on a classroom scene.
Background
Human body key point detection, also called human body posture estimation, is a fundamental problem in computer vision and a prerequisite task for human action recognition, behavior analysis, human-computer interaction and the like. It can be understood as estimating the positions of key points of the human body, such as the head, elbows, wrists and knees. Human body posture estimation can be divided into 2D/3D key point detection and single-person/multi-person key point detection; key point tracking can be performed after detection is finished, which is also called human body posture tracking. Human key point detection faces many challenges: flexible, small and nearly invisible joints, occlusion, clothing and lighting changes all add difficulty. The invention mainly relates to 2D multi-person key point detection, and aims to detect the key points of students in a classroom for subsequent posture recognition: given an RGB image, accurately locate the key points of multiple human bodies and determine the human body to which each key point belongs.
Currently, there are two main methods for multi-person key point detection:
(1) Top-down: first, target (human body) detection is carried out, and then single-person posture estimation is performed on each detected human body (with networks such as CPM, Stacked Hourglass and HRNet). The Top-down approach is necessarily constrained by the target detection task, as single-person pose estimation based on a bounding box is vulnerable to occlusion and small-scale human targets.
(2) Bottom-up: first, the key points of all people are detected, and then the key points are matched to the corresponding human bodies by an algorithm (such as the dynamic programming of OpenPose, the tag matching of Associative Embedding, or the greedy algorithm of PersonLab). Occlusion remains a challenge, and because human bodies appear at different scales in the image, extracting key point features is harder than in the Top-down method.
Generally, the Top-down method has higher precision but poorer real-time performance, and the Bottom-up method has lower precision than the Top-down method, but has higher speed and better real-time performance.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the problems of occlusion in the classroom scene, the difficulty of locating and detecting small-scale targets in the back rows, and false detection of key points in non-human regions, the invention provides a multi-person key point detection network and method based on a classroom scene; the network combines the Top-down and Bottom-up approaches.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-person key point detection network based on classroom scene comprises a human body target region detection module, a human body target region fusion module, a human body target region feature extraction module and a key point detection and integration module;
the human body target area detection module is sequentially connected with the human body target area fusion module, the human body target area feature extraction module and the key point detection and integration module.
The human body target area detection module is used for detecting the region of each student in the picture.
The human body target area fusion module is used for fusing the student regions detected by the human body target area detection module.
The human body target region feature extraction module is used for extracting features from the student regions fused by the human body target region fusion module.
The key point detection and integration module is used for predicting the confidence of key points and the part affinity within the regions where students exist, and then performing limb matching to obtain the final multi-person key point detection result.
The human body target area detection module is a YOLOv3 network with a dense connection module introduced into its shallow network: features of the input image are extracted by densely connected convolutions, and a GIoU loss function replaces the bounding box regression loss function of YOLOv3, so that shallow and deep features are fused better and faster, detection precision is improved, and the difficulty of detecting low-resolution students in the back rows of a classroom is alleviated.
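For illustration, a minimal PyTorch sketch of such a densely connected convolution block is given below; the layer count, growth rate and activation are assumptions chosen for the sketch, not values specified by the invention:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Densely connected convolution block for a shallow detection network.

    Sketch only: the layer count (4) and growth rate (32) are assumptions.
    Each layer receives the concatenation of all earlier feature maps, which
    is what lets shallow features propagate directly to deeper layers.
    """
    def __init__(self, in_ch, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.LeakyReLU(0.1, inplace=True),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1, bias=False)))
            ch += growth  # next layer sees all feature maps produced so far

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # dense connectivity
        return torch.cat(feats, dim=1)
```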
The human body target area fusion module fuses the human body frames detected by YOLO-DesNet. First, each human body frame is enlarged, ensuring that the boundary of the enlarged prediction frame does not exceed the boundary of the original image. When fusing any two human body prediction frames, the module first judges whether the two frames intersect; if they do, an IOU_concat value is computed (defined by analogy with the IoU), and when the IOU_concat of the two prediction frames exceeds 0.5, the two regions are fused.
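A minimal sketch of this fusion strategy follows. The IOU_concat definition (intersection over the smaller frame) and the 0.5 threshold come from the description above; the 10% enlargement factor and the iterative pairwise merging are assumptions for illustration:

```python
def iou_concat(a, b):
    """Intersection area divided by the area of the SMALLER of two boxes.
    Boxes are (x1, y1, x2, y2) in pixels."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0  # no intersection
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / min(area_a, area_b)

def fuse_boxes(boxes, img_w, img_h, expand=0.1, thresh=0.5):
    """Enlarge each box (clamped to the image), then repeatedly merge any
    pair whose IOU_concat exceeds the threshold into one enclosing box."""
    out = []
    for x1, y1, x2, y2 in boxes:
        dw, dh = expand * (x2 - x1), expand * (y2 - y1)
        out.append([max(0, x1 - dw), max(0, y1 - dh),
                    min(img_w, x2 + dw), min(img_h, y2 + dh)])
    merged = True
    while merged:
        merged = False
        for i in range(len(out)):
            for j in range(i + 1, len(out)):
                if iou_concat(out[i], out[j]) > thresh:
                    a, b = out[i], out.pop(j)
                    out[i] = [min(a[0], b[0]), min(a[1], b[1]),
                              max(a[2], b[2]), max(a[3], b[3])]
                    merged = True
                    break
            if merged:
                break
    return out
```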
The human body target area feature extraction module is an Inception network based on dilated convolution: by introducing dilated convolutions at different scales, local information with a larger receptive field is obtained and the network's perception of local information is improved. The input picture first passes through a 1 x 1 standard convolution that organizes information across channels, improving the expressive capacity of the network and providing more nonlinear transformations. The output features are then convolved again with 1 x 1 and 3 x 3 standard convolution kernels, increasing the network's adaptability to different human body scales. Next, dilated convolutions with different dilation rates convolve the output of the previous step, obtaining local information with a larger receptive field and improving the detection of small-scale human targets. The convolution features output by the different branches are added pixel-wise, and the summed features are convolved once more with a 1 x 1 standard convolution to eliminate the aliasing effect caused by convolving with kernels of different sizes. Finally, the fused convolution features are passed through a ReLU nonlinear activation.
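A minimal PyTorch sketch of such a dilated-convolution Inception block follows; the channel widths and the dilation rates (2 and 4) are illustrative assumptions rather than values fixed by the invention:

```python
import torch.nn as nn

class DilatedInceptionBlock(nn.Module):
    """Sketch of the dilated-convolution Inception block described above."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # 1 x 1 convolution: organize information across channels
        self.reduce = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        # 1 x 1 followed by 3 x 3: adapt to different human body scales
        self.branch = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1))
        # dilated 3 x 3 branches enlarge the receptive field (rates assumed)
        self.dil2 = nn.Conv2d(out_ch, out_ch, 3, padding=2, dilation=2)
        self.dil4 = nn.Conv2d(out_ch, out_ch, 3, padding=4, dilation=4)
        # final 1 x 1 convolution removes aliasing from mixed kernel sizes
        self.smooth = nn.Conv2d(out_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.reduce(x)
        b = self.branch(x)
        fused = b + self.dil2(b) + self.dil4(b)  # pixel-wise branch sum
        return self.act(self.smooth(fused))
```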
The key point detection and integration module is a cascaded multi-stage network that simultaneously predicts human body key point confidence maps and part relation (affinity) maps; a loss function is set after each stage, and the final stage outputs the key point confidence map and the part relation map, after which limb matching yields the final multi-person key point detection result.
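The cascade with per-stage supervision can be sketched as follows (an OpenPose-style arrangement; the stage modules and the L2 loss form are assumptions for illustration, not code from the invention):

```python
import torch
import torch.nn.functional as F

def cascade_forward(features, stages):
    """Run the cascaded stages. Stage 1 sees only the backbone features;
    each later stage also sees the previous confidence maps S and part
    affinity fields L (so its input width differs from stage 1's)."""
    S = L = None
    outputs = []
    for stage in stages:
        inp = features if S is None else torch.cat([features, S, L], dim=1)
        S, L = stage(inp)  # each stage returns (confidence maps, PAFs)
        outputs.append((S, L))
    return outputs

def intermediate_loss(outputs, gt_S, gt_L):
    """A loss term after every stage (intermediate supervision)."""
    return sum(F.mse_loss(S, gt_S) + F.mse_loss(L, gt_L) for S, L in outputs)
```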
A multi-person key point detection method based on a classroom scene comprises the following specific operation steps:
Step 1: human body target region detection, i.e., detect the region of each student in the picture; this detection does not need to be overly fine.
Step 2: human body target region fusion, i.e., fuse the student regions roughly detected by the human body target region detection module.
Step 3: human body target region feature extraction, i.e., extract features from the student regions fused by the human body target region fusion module.
Step 4: key point detection and integration, i.e., predict the confidence of key points and the part affinity within the regions where students exist, then perform limb matching to obtain the final multi-person key point detection result.
The specific steps of step 1 are as follows:
Step 1.1: extract features from the input image with 1 densely connected convolution and 3 residual convolutions, which better realizes multiplexing and fusion of multi-layer network features;
Step 1.2: deepen the feature extraction network with 3 groups of residual modules, improving the model's ability to select and extract deep image features;
Step 1.3: using a multi-scale pyramid structure, perform 2 upsampling operations with tensor splicing to same-size feature maps from earlier layers of the network, and perform regression prediction 3 times, realizing multi-scale detection of targets of different sizes;
Step 1.4: replace the bounding box regression loss function of YOLOv3 with the GIoU loss function;
Step 1.5: let the target confidence loss, the target category loss and the target bounding box regression loss participate in backpropagation simultaneously; set the number of iterations to 50000, the learning rate to 0.0001 and the weight decay to 0.0004 to train the network.
The specific steps of step 2 are as follows:
Step 2.1: first enlarge the human body frames detected in step 1, ensuring that the boundaries of the enlarged frames do not exceed the image boundary.
Step 2.2: judge from the coordinate relation of the prediction frames whether any two human body frames intersect; if so, compute their IOU_concat value. When the IOU_concat of two frames exceeds a certain threshold (set to 0.5), perform region fusion. Here, IOU_concat is defined as the ratio of the intersection of any two human body prediction frames to the smaller of the two prediction frames.
The specific steps of step 3 are as follows:
Step 3.1: for the input picture, use a 1 x 1 standard convolution to organize information across channels, improving the expressive capacity of the network and providing more nonlinear transformations.
Step 3.2: convolve the features output in step 3.1 twice, using 1 x 1 and 3 x 3 standard convolution kernels, increasing the network's adaptability to different human body scales.
Step 3.3: convolve the features output in step 3.2 with dilated convolutions of different dilation rates to obtain local information with a larger receptive field and improve detection of small-scale human targets.
Step 3.4: add the convolution features output by the different branches pixel-wise, and convolve the summed features once more with a 1 x 1 standard convolution to eliminate the aliasing effect caused by convolving with kernels of different sizes.
Step 3.5: pass the fused convolution features output in step 3.4 through a ReLU nonlinear activation to obtain the final extracted features.
The specific steps of step 4 are as follows:
Step 4.1: input the feature map output in step 3 into stage 1, and predict a key point confidence map S1 and a part affinity vector field L1;
Step 4.2: input S1 and L1 predicted in step 4.1, together with the original feature map from step 3, into stage 2 to obtain S2 and L2;
Step 4.3: each subsequent stage takes the S and L output by the previous stage plus the feature map from step 3 as input, up to stage 6, which yields the final prediction;
Step 4.4: apply non-maximum suppression (NMS) to the human body key point confidence maps obtained in step 4.3 to obtain a discrete set of key points, from which candidate limb segments are assembled;
Step 4.5: score the candidate limb segments of step 4.4 according to the part relation maps obtained in step 4.3, and perform maximum bipartite graph matching with the Hungarian algorithm to obtain the final key point detection result.
Compared with the prior art, the invention provides the following substantial technical progress:
1) The invention provides a multi-person key point detection network combining Top-down and Bottom-up. Aiming at the problems of occlusion in the classroom scene, the difficulty of locating and detecting small-scale students in the back rows, and detection of key points in non-human regions, the invention performs efficient feature fusion over multiple stages, improves the OpenPose and YOLOv3 networks respectively on the basis of a multi-scale feature fusion strategy, and fuses the two networks into one framework. The invention comprises 4 modules: a human body region detection module, a human body region fusion module, a human body region feature extraction module and a key point detection and integration module.
2) Borrowing the idea of dense connection, the invention uses densely connected convolution blocks in the shallow network, extracts features of the input image through densely connected convolutions, and replaces the bounding box regression loss function of YOLOv3 with the GIoU loss function, so that shallow features propagate better and faster to the deep network, detection precision is improved, and the difficulty of locating low-resolution students in the back rows of a classroom is alleviated.
3) The invention provides an Inception Net network based on dilated convolution (Inception-DCNet) and replaces the backbone of OpenPose (VGG-19) with Inception-DCNet, aiming at obtaining local information with a larger receptive field, improving the network's perception of local information, and improving the location and detection of small-target students in the back rows of a classroom.
Drawings
Fig. 1 is a schematic diagram of a multi-person key point detection network structure based on a classroom scene.
FIG. 2 is an effect diagram of multi-person key point detection in a classroom scene.
FIG. 3 is a schematic structural diagram of the YOLO-DesNet network with fused dense connection blocks of step 1.
Fig. 4 is a schematic structural diagram of the dilated-convolution-based Inception Net (Inception-DCNet) network of step 3.
Fig. 5 is a schematic network structure diagram of the confidence map of the predicted keypoint and the location relationship map in step 4.
Detailed description of the preferred embodiments
The invention is described in detail below with reference to the drawings and preferred embodiments:
the first embodiment is as follows:
in this embodiment, as shown in fig. 1, a multi-user key point detection network based on a classroom scene includes a human target area detection module 1, a human target area fusion module 2, a human target area feature extraction module 3, and a key point detection and integration module 4; the human body target area detection module 1 is sequentially connected with a human body target area fusion module 2, a human body target area feature extraction module 3 and a key point detection and integration module 4; the human body target area detection module 1 is used for detecting the area of each student in the picture; the human body target area fusion module 2 is used for fusing the areas of the students roughly detected in the human body target area detection module 1; the human body target area feature extraction module 3 is used for extracting features of the student areas fused in the human body target area fusion module 2; and the key point detection and integration module 4 is used for predicting the confidence coefficient and the position relation affinity of the key points in the areas where the students exist, and then performing limb matching to obtain the final multi-user key point detection result.
Example two:
this embodiment is substantially the same as the first embodiment, and is characterized in that:
in this embodiment, more specifically, the human target area detection module is configured to detect an area of each student in the picture, where the detection of the student target does not need to be too fine, and a detection frame is allowed to contain a plurality of students, see fig. 3.
And the human body target area fusion module is used for fusing the areas of the students detected in the human body target area detection module.
The human body target area feature extraction module is configured to perform feature extraction on the student areas fused in the human body target area fusion module, which is shown in fig. 4.
The key point detection and integration module is used for predicting the confidence coefficient and the position relation affinity of the key points in the area where the students exist, and then performing limb matching to obtain a final multi-person key point detection result, which is shown in an attached figure 5.
The human body target area detection module is a YOLO V3 network with a dense connection module introduced in a shallow network, features of an input image are extracted by using dense connection convolution, and a GIOU loss function is used for replacing a bounding box regression loss function of YOLO V3, so that the shallow feature and the deep feature can be better and faster fused, the detection precision is improved, and the problem that detection of low-resolution students in the back row of a classroom is difficult is solved, and the method is shown in figure 3.
The human body target area fusion module firstly amplifies the human body frame detected by the human body target area detection module, and ensures that the boundary of the amplified human body prediction frame does not exceed the boundary of the original image. When any two human body prediction frames are fused, whether intersection exists between the two prediction frames is judged firstly, if intersection exists, the IOU is defined according to the idea of referring to the IOUmaxWhen IOU of two human body prediction boxesmaxAbove 0.5, the two regions are fused.
The human body target area feature extraction module is an inclusion network based on cavity convolution, local information with a larger receptive field is obtained by introducing cavity convolutions with different scales, and the sensing capability of the network to the local information is improved. The input pictures are firstly subjected to cross-channel organization information by using 1-by-1 standard convolution, the expression capacity of the network is improved, and more nonlinear transformation is provided. And performing secondary convolution on the output characteristics by using the standard convolution checks of 1 x 1 and 3 x 3, and increasing the adaptability of the network to different human body scales. And then, carrying out convolution by using the output characteristics of the previous step again by using the empty convolution with different expansion rates to obtain local information of a larger receptive field and improve the detection performance of the small-size human body target. And adding convolution characteristics output by different branches according to pixel point levels, and convolving the added characteristics again by using the standard convolution of 1 x 1 to eliminate aliasing effects caused by convolution using convolution kernels with different sizes. And finally, carrying out nonlinear activation on the output fusion convolution characteristics through a ReLu function, and referring to the attached figure 4.
The key point detection and integration module is a cascaded multi-stage network, predicts a human body key point confidence map and a position relation map, sets a loss function after each stage, and finally outputs the key point confidence map and the position relation map and performs limb matching to obtain a final multi-person key point detection result, which is shown in the attached figure 5.
Example three:
As shown in fig. 1, a multi-person key point detection method based on a classroom scene is operated with the above network; the specific flow steps are as follows:
Step 1: human body target region detection, i.e., detect the region of each student in the picture; this detection does not need to be overly fine.
Step 2: human body target region fusion, i.e., fuse the student regions roughly detected by the human body target region detection module.
Step 3: human body target region feature extraction, i.e., extract features from the student regions fused by the human body target region fusion module.
Step 4: key point detection and integration, i.e., predict the confidence of key points and the part affinity within the regions where students exist, then perform limb matching to obtain the final multi-person key point detection result.
Example four:
the present embodiment is basically the same as the third embodiment, and the features are as follows:
as shown in fig. 3, the specific steps of step 1 are:
step 1.1: and performing 1-time dense connection convolution and 3-time residual convolution on the input image to extract features, so that multiplexing and fusion of network multi-layer features can be better realized.
Step 1.2: the structure of the feature extraction network is deepened through 3 groups of residual modules, and the selection and extraction capability of the model on deep features of the image is improved.
Step 1.3: and (3) carrying out 3 times of regression prediction by using a multi-scale pyramid structure through 2 times of upsampling and carrying out tensor splicing with the characteristic image with the same size in the upper layer of the network, thereby realizing multi-scale detection of targets with different sizes.
Step 1.4: the bounding box regression loss function of YOLO V3 was replaced with the GIOU loss function.
Step 1.5: and simultaneously participating in back propagation by the target confidence coefficient loss, the target category loss and the target boundary box regression loss, setting the iteration number to be 50000, the learning rate to be 0.0001 and the weight attenuation to be 0.0004, and helping the network to finish training.
The specific steps of step 2 are as follows:
Step 2.1: first enlarge the human body frames detected in step 1, ensuring that the boundaries of the enlarged frames do not exceed the image boundary.
Step 2.2: judge from the coordinate relation of the prediction frames whether any two human body frames intersect; if so, compute their IOU_concat value. When the IOU_concat of two frames exceeds a certain threshold (set to 0.5), perform region fusion. Here, IOU_concat is defined as the ratio of the intersection of any two human body prediction frames to the smaller of the two prediction frames.
As shown in fig. 4, the specific steps of step 3 are:
Step 3.1: for the input picture, use a 1 x 1 standard convolution to organize information across channels, improving the expressive capacity of the network and providing more nonlinear transformations.
Step 3.2: convolve the features output in step 3.1 twice, using 1 x 1 and 3 x 3 standard convolution kernels, increasing the network's adaptability to different human body scales.
Step 3.3: convolve the features output in step 3.2 with dilated convolutions of different dilation rates to obtain local information with a larger receptive field and improve detection of small-scale human targets.
Step 3.4: add the convolution features output by the different branches pixel-wise, and convolve the summed features once more with a 1 x 1 standard convolution to eliminate the aliasing effect caused by convolving with kernels of different sizes.
Step 3.5: pass the fused convolution features output in step 3.4 through a ReLU nonlinear activation to obtain the final extracted features.
As shown in fig. 5, the specific steps of step 4 are:
Step 4.1: input the feature map output in step 3 into stage 1, and predict a key point confidence map S1 and a part affinity vector field L1.
Step 4.2: input S1 and L1 predicted in step 4.1, together with the original feature map from step 3, into stage 2 to obtain S2 and L2.
Step 4.3: each subsequent stage takes the S and L output by the previous stage plus the feature map from step 3 as input, up to stage 6, which yields the final prediction.
Step 4.4: apply non-maximum suppression (NMS) to the human body key point confidence maps finally obtained in step 4.3 to obtain a discrete set of key points, from which candidate limb segments are assembled.
Step 4.5: score the candidate limb segments of step 4.4 according to the part relation maps obtained in step 4.3, and perform maximum bipartite graph matching with the Hungarian algorithm to obtain the final key point detection result.
The invention provides a multi-person key point detection network combining Top-down and Bottom-up, which comprises 4 modules: a human body region detection module, a human body region fusion module, a human body region feature extraction module and a key point detection and integration module. Aiming at the problems of occlusion in the classroom scene, the difficulty of locating and detecting small-scale students in the back rows, and detection of key points in non-human regions, the invention improves the OpenPose and YOLOv3 networks respectively on the basis of a multi-scale feature fusion strategy and fuses the two networks into one framework. Multiple stages are used for efficient feature fusion and larger local receptive-field information is obtained, achieving good results on false key point detection and on locating and detecting small-target students.
The embodiments of the present invention have been described with reference to the accompanying drawings, but the present invention is not limited to these embodiments. Various changes and modifications can be made according to the purpose of the invention; any changes, modifications, substitutions, combinations or simplifications made according to the spirit and principle of the technical solution of the present invention shall be equivalent substitutions and shall fall within the protection scope of the present invention, as long as they meet the purpose of the invention and do not depart from the technical principle and inventive concept of the multi-person key point detection network and method based on the classroom scene.

Claims (10)

1. A multi-person key point detection network based on a classroom scene comprises a human body target region detection module (1), a human body target region fusion module (2), a human body target region feature extraction module (3) and a key point detection and integration module (4); the method is characterized in that:
the human body target region detection module (1) is sequentially connected with the human body target region fusion module (2), the human body target region feature extraction module (3) and the key point detection and integration module (4);
the human body target area detection module (1) is used for detecting the area of each student in the picture;
the human body target region fusion module (2) is used for fusing the regions of the students roughly detected in the human body target region detection module (1);
the human body target region feature extraction module (3) is used for extracting features of the student regions fused in the human body target region fusion module (2);
and the key point detection and integration module (4) is used for predicting the confidence coefficient and the position relation affinity of the key points in the areas where the students exist, and then performing limb matching to obtain the final multi-person key point detection result.
2. The multi-person keypoint detection network based on a classroom scene of claim 1, wherein: the human body target area detection module (1) is a YOLOv3 network with a dense connection module introduced into its shallow network, and a GIoU loss function replaces the bounding box regression loss function of YOLOv3, so that shallow and deep features are fused better and faster, detection precision is improved, and the difficulty of detecting low-resolution students in the back rows of a classroom is alleviated.
3. The multi-person keypoint detection network based on a classroom scene of claim 1, wherein: the human body target region fusion module (2) is used for fusing the human body frame regions detected in the human body target region detection module (1) and aims to reduce the situation that key points are detected at non-human positions subsequently.
4. The multi-person keypoint detection network based on a classroom scene of claim 1, wherein: the human body target region feature extraction module (3) is an Inception Net network based on dilated convolution, aiming at obtaining local information with a larger receptive field and improving the detection performance for small-scale students.
5. The multi-person keypoint detection network based on a classroom scene of claim 1, wherein: the key point detection and integration module (4) is a cascaded multi-stage network, a human body key point confidence map and a position relation map are predicted at the same time, a loss function is set after each stage, and finally the key point confidence map and the position relation map are output and limb matching is carried out to obtain a final multi-person key point detection result.
6. A multi-person key point detection method based on a classroom scene, operated with the multi-person key point detection network based on the classroom scene as claimed in claim 1, characterized by the following specific operation steps:
Step 1: human body target region detection: roughly detect the region of each student in the picture;
Step 2: human body detection region fusion: perform region fusion on the student regions detected in step 1;
Step 3: human body target region feature extraction: extract features from the fused student target regions obtained in step 2;
Step 4: key point detection: predict the confidence of key points and the part affinity within the regions where students exist, then perform limb matching to obtain the final key point detection result.
7. The method for detecting multi-person key points based on a classroom scene as claimed in claim 6, wherein the specific steps of step 1 are:
Step 1.1: extract features from the input image with 1 densely connected convolution and 3 residual convolutions, which better realizes multiplexing and fusion of multi-layer network features;
Step 1.2: deepen the feature extraction network with 3 groups of residual modules, improving the model's ability to select and extract deep image features;
Step 1.3: using a multi-scale pyramid structure, perform 2 upsampling operations with tensor splicing to same-size feature maps from earlier layers of the network, and perform regression prediction 3 times, realizing multi-scale detection of targets of different sizes;
Step 1.4: replace the bounding box regression loss function of YOLOv3 with the GIoU loss function;
Step 1.5: let the target confidence loss, the target category loss and the target bounding box regression loss participate in backpropagation simultaneously; set the number of iterations to 50000, the learning rate to 0.0001 and the weight decay to 0.0004 to train the network.
8. The method for detecting multi-person key points based on a classroom scene as claimed in claim 6, wherein the specific steps of step 2 are:
Step 2.1: first enlarge the human body frames detected in step 1, ensuring that the boundaries of the enlarged frames do not exceed the image boundary;
Step 2.2: first judge from the coordinate relation of the human body prediction frames whether any two human body frames intersect; if so, calculate their IOU_concat value; when the IOU_concat of two frames is greater than a certain threshold, perform region fusion; here, the IOU_concat value is defined as the ratio of the intersection of any two human body prediction frames to the smaller of the two prediction frames.
9. The method for detecting multi-person key points based on a classroom scene as claimed in claim 6, wherein step 3 comprises the following steps:
Step 3.1: for the input picture, use a 1 x 1 standard convolution to organize information across channels, improving the expressive capacity of the network and providing more nonlinear transformations;
Step 3.2: convolve the features output in step 3.1 twice, using 1 x 1 and 3 x 3 standard convolution kernels, increasing the network's adaptability to different human body scales;
Step 3.3: convolve the features output in step 3.2 with dilated convolutions of different dilation rates to obtain local information with a larger receptive field and improve detection of small-size human targets;
Step 3.4: add the convolution features output by the different branches pixel-wise, and convolve the summed features once more with a 1 x 1 standard convolution to eliminate the aliasing effect caused by convolving with kernels of different sizes;
Step 3.5: pass the fused convolution features output in step 3.4 through a ReLU nonlinear activation to obtain the final extracted features.
10. The method for detecting multi-person key points based on a classroom scene as claimed in claim 6, wherein the specific steps of step 4 are:
Step 4.1: input the feature map output in step 3 into stage 1, and predict a key point confidence map S1 and a part affinity vector field L1;
Step 4.2: input S1 and L1 predicted in step 4.1, together with the original feature map from step 3, into stage 2 to obtain S2 and L2;
Step 4.3: each subsequent stage takes the S and L output by the previous stage plus the feature map from step 3 as input, up to stage 6, which yields the final prediction;
Step 4.4: apply non-maximum suppression (NMS) to the human body key point confidence maps obtained in step 4.3 to obtain a discrete set of key points, from which candidate limb segments are assembled;
Step 4.5: score the candidate limb segments of step 4.4 according to the part relation maps obtained in step 4.3, and perform maximum bipartite graph matching with the Hungarian algorithm to obtain the final key point detection result.
CN202010439222.9A 2020-05-22 2020-05-22 Multi-person key point detection network and method based on classroom scene Pending CN111767792A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010439222.9A CN111767792A (en) 2020-05-22 2020-05-22 Multi-person key point detection network and method based on classroom scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010439222.9A CN111767792A (en) 2020-05-22 2020-05-22 Multi-person key point detection network and method based on classroom scene

Publications (1)

Publication Number Publication Date
CN111767792A true CN111767792A (en) 2020-10-13

Family

ID=72719526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010439222.9A Pending CN111767792A (en) 2020-05-22 2020-05-22 Multi-person key point detection network and method based on classroom scene

Country Status (1)

Country Link
CN (1) CN111767792A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507904A (en) * 2020-12-15 2021-03-16 重庆邮电大学 Real-time classroom human body posture detection method based on multi-scale features
CN112949379A (en) * 2020-12-30 2021-06-11 南京佑驾科技有限公司 Safety belt detection method and system based on vision
CN112966762A (en) * 2021-03-16 2021-06-15 南京恩博科技有限公司 Wild animal detection method and device, storage medium and electronic equipment
CN113158756A (en) * 2021-02-09 2021-07-23 上海领本智能科技有限公司 Posture and behavior analysis module and method based on HRNet deep learning
CN113297910A (en) * 2021-04-25 2021-08-24 云南电网有限责任公司信息中心 Distribution network field operation safety belt identification method
CN113537014A (en) * 2021-07-06 2021-10-22 北京观微科技有限公司 Improved darknet network-based ground-to-air missile position target detection and identification method
CN115272648A (en) * 2022-09-30 2022-11-01 华东交通大学 Multi-level receptive field expanding method and system for small target detection
CN115471773A (en) * 2022-09-16 2022-12-13 北京联合大学 Student tracking method and system for intelligent classroom

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657631A (en) * 2018-12-25 2019-04-19 上海智臻智能网络科技股份有限公司 Human posture recognition method and device
CN110532984A (en) * 2019-09-02 2019-12-03 北京旷视科技有限公司 Critical point detection method, gesture identification method, apparatus and system
CN110781765A (en) * 2019-09-30 2020-02-11 腾讯科技(深圳)有限公司 Human body posture recognition method, device, equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657631A (en) * 2018-12-25 2019-04-19 上海智臻智能网络科技股份有限公司 Human posture recognition method and device
CN110532984A (en) * 2019-09-02 2019-12-03 北京旷视科技有限公司 Critical point detection method, gesture identification method, apparatus and system
CN110781765A (en) * 2019-09-30 2020-02-11 腾讯科技(深圳)有限公司 Human body posture recognition method, device, equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
RONG FU ET AL: "Learning Behavior Analysis in Classroom Based on Deep Learning", 10th International Conference on Intelligent Control and Information Processing *
YAYUN QI ET AL: "Vehicle Detection Under Unmanned Aerial Vehicle Based on Improved YOLOv3", 2019 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) *
ZHE CAO ET AL: "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", arXiv:1611.08050v2 *
TANG LIN ET AL: "Research on human pose detection algorithms in crowded conditions", China Master's Theses Full-text Database, Information Science and Technology series *
DONG HONGYI: "Deep Learning PyTorch Object Detection in Practice", 31 January 2020 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507904A (en) * 2020-12-15 2021-03-16 重庆邮电大学 Real-time classroom human body posture detection method based on multi-scale features
CN112507904B (en) * 2020-12-15 2022-06-03 重庆邮电大学 Real-time classroom human body posture detection method based on multi-scale features
CN112949379A (en) * 2020-12-30 2021-06-11 南京佑驾科技有限公司 Safety belt detection method and system based on vision
CN113158756A (en) * 2021-02-09 2021-07-23 上海领本智能科技有限公司 Posture and behavior analysis module and method based on HRNet deep learning
CN112966762A (en) * 2021-03-16 2021-06-15 南京恩博科技有限公司 Wild animal detection method and device, storage medium and electronic equipment
CN112966762B (en) * 2021-03-16 2023-12-26 南京恩博科技有限公司 Wild animal detection method and device, storage medium and electronic equipment
CN113297910A (en) * 2021-04-25 2021-08-24 云南电网有限责任公司信息中心 Distribution network field operation safety belt identification method
CN113537014A (en) * 2021-07-06 2021-10-22 北京观微科技有限公司 Improved darknet network-based ground-to-air missile position target detection and identification method
CN115471773A (en) * 2022-09-16 2022-12-13 北京联合大学 Student tracking method and system for intelligent classroom
CN115471773B (en) * 2022-09-16 2023-09-15 北京联合大学 Intelligent classroom-oriented student tracking method and system
CN115272648A (en) * 2022-09-30 2022-11-01 华东交通大学 Multi-level receptive field expanding method and system for small target detection
CN115272648B (en) * 2022-09-30 2022-12-20 华东交通大学 Multi-level receptive field expanding method and system for small target detection

Similar Documents

Publication Publication Date Title
CN111767792A (en) Multi-person key point detection network and method based on classroom scene
US20210326597A1 (en) Video processing method and apparatus, and electronic device and storage medium
CN111898709B (en) Image classification method and device
CN111126472B (en) SSD (solid State disk) -based improved target detection method
CN111340814B (en) RGB-D image semantic segmentation method based on multi-mode self-adaptive convolution
CN110188239B (en) Double-current video classification method and device based on cross-mode attention mechanism
CN108399362A (en) A kind of rapid pedestrian detection method and device
CN110059598B (en) Long-term fast-slow network fusion behavior identification method based on attitude joint points
US11494938B2 (en) Multi-person pose estimation using skeleton prediction
CN109492627B (en) Scene text erasing method based on depth model of full convolution network
CN108664885B (en) Human body key point detection method based on multi-scale cascade Hourglass network
CN111104930B (en) Video processing method, device, electronic equipment and storage medium
CN110322509B (en) Target positioning method, system and computer equipment based on hierarchical class activation graph
CN113095106A (en) Human body posture estimation method and device
WO2021249114A1 (en) Target tracking method and target tracking device
CN113673354B (en) Human body key point detection method based on context information and joint embedding
CN110705566A (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN112084952B (en) Video point location tracking method based on self-supervision training
CN112329861B (en) Layered feature fusion method for mobile robot multi-target detection
CN111401192A (en) Model training method based on artificial intelligence and related device
Zhou et al. Applying (3+2+1)D residual neural network with frame selection for Hong Kong sign language recognition
CN114764941A (en) Expression recognition method and device and electronic equipment
CN115223020A (en) Image processing method, image processing device, electronic equipment and readable storage medium
CN114764870A (en) Object positioning model processing method, object positioning device and computer equipment
CN116580211B (en) Key point detection method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201013)