WO2022022368A1 - Deep-learning-based apparatus and method for monitoring behavioral norms in jail - Google Patents

Deep-learning-based apparatus and method for monitoring behavioral norms in jail

Info

Publication number: WO2022022368A1
Authority: WO — WIPO (PCT)
Prior art keywords: detection, behavior, classifier, human, behavioral
Application number: PCT/CN2021/107746
Other languages: French (fr), Chinese (zh)
Inventors: 杨景翔, 许根, 黄业鹏, 吕立, 王菊, 徐刚, 肖江剑
Original assignees: 宁波环视信息科技有限公司; 中国科学院宁波材料技术与工程研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Priority claimed from Chinese patent application CN202010736024.9, filed on July 28, 2020 (published as CN114092846A).
Application filed by 宁波环视信息科技有限公司 and 中国科学院宁波材料技术与工程研究所.
Publication of WO2022022368A1.


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition

Definitions

  • The present application relates to the field of machine learning research, and in particular to a deep-learning-based device and method for detecting behavioral norms in prisons.
  • Behavior analysis methods based on video-stream feature points and single-frame image features have achieved remarkable results in the traditional single-view or single-person setting. However, in areas with heavy pedestrian traffic such as streets, airports, and stations, where complex problems such as human occlusion, illumination changes, and viewpoint changes arise, these methods alone often fail to meet practical requirements, and their robustness can be poor.
  • To address these defects, the present application proposes a deep-learning-based device and method for detecting behavioral norms in prisons, which uses a deep learning network to analyze human behavior and improves the robustness of the classification model; in particular, a deep learning network is well suited to training and learning on big data and can give full play to this advantage.
  • The embodiment of the present application provides a deep-learning-based detection device for behavioral norms in prisons, whose behavior detection algorithms are designed entirely in accordance with the code of conduct for detainees in detention centers.
  • The detection process includes setting a detection trigger time period and a detection area for each standardized behavior detection; behavior detection is triggered only within the set time period and the set detection area, and the corresponding recognition algorithms are not run at other times or in other areas, which reduces the execution complexity of the system and improves its stability.
  • The detection time period and detection area are entirely user-defined and set in accordance with the standard code of conduct, which satisfies the needs of code-of-conduct detection well.
  • Specifically, this application proposes a deep-learning-based behavioral norm detection device for prisons, including a head count detection module and a behavioral norm detection module, wherein:
  • the head count detection module is used for imperceptible roll call and/or crowd density identification, and includes a target detection and segmentation process;
  • the behavioral norm detection module is used for real-time calculation and discrimination of personnel behavior, and includes a training process that obtains a classifier from a training sample set and a recognition process that uses the classifier to identify test samples.
  • An embodiment of the present application also proposes a deep-learning-based method for detecting behavioral norms in prisons, the method including the following steps:
  • head count detection, used for imperceptible roll call and/or crowd density identification, including a target detection and segmentation process;
  • behavioral norm detection, used for real-time calculation and discrimination of personnel behavior, including a training process that obtains a classifier from a training sample set and a recognition process that uses the classifier to identify test samples.
  • The advantages of this application are as follows: the CNN method yields global high-level features, and after STN feature enhancement these are highly robust to real-life video; SPPE is then used to obtain human posture information, SDTN maps the pose back to the human detection frame so that the network optimizes itself, PP-NMS is used to resolve redundant detections, and the corresponding classifier is trained on the pose estimation results.
  • Features derived from global features are more comprehensive, making the behavior description more complete and more widely applicable.
  • FIG. 1 is a schematic flowchart of the target detection and segmentation process of the head count detection module of the present application;
  • FIG. 2 is a schematic flowchart of the training process of the behavioral norm detection module of the present application;
  • FIG. 3 is a schematic flowchart of the discrimination process of the behavioral norm detection module of the present application;
  • FIG. 4 is a simplified flowchart of the extraction and modeling of underlying features;
  • FIG. 5 is a processing flowchart of a general CNN.
  • The deep-learning-based prison behavioral norm detection device provided by the embodiments of the present application uses the CNN method to perform feature extraction on underlying features, obtaining global features rather than the key points obtained by traditional methods; it uses the STN method to enhance the obtained global features instead of modeling the obtained features directly;
  • the device uses the SDTN method to remap the obtained pose features, further improving the accuracy of the detection frame.
  • For key points at multiple scales, a deconvolution layer performs the key point regression operation, which can effectively improve the accuracy of multi-person key point detection.
  • The device also takes into account the connectivity of multiple key points and establishes a directed field connecting the key points;
  • connected key point pairs are explicitly matched according to the connectivity of human key points and the structure of the human body.
  • An embodiment of the present application provides a deep-learning-based behavioral norm detection device for prisons, including a head count detection module and a behavioral norm detection module, wherein:
  • the head count detection module is used for imperceptible roll call and/or crowd density identification, and includes a target detection and segmentation process;
  • the behavioral norm detection module is used for real-time calculation and discrimination of personnel behavior, and includes a training process that obtains a classifier from a training sample set and a recognition process that uses the classifier to identify test samples.
  • The target detection and segmentation process of the head count detection module includes the following steps:
  • step S1) use a labeling tool to annotate human heads in the images, generating one JSON file per image, and extract the feature information of the annotated images through a convolutional neural network;
  • step S2) from the feature information obtained in step S1), extract ROIs (regions of interest) using a region proposal network, then use region-of-interest pooling to bring these ROIs to a fixed size;
  • step S3) perform bounding box regression and classification prediction on the ROIs obtained in step S2) through fully connected layers, sampling at different points of the feature map and applying bilinear interpolation;
  • the head count detection module specifically includes:
  • a target detection unit, used for imperceptible real-time detection and counting of detainees;
  • a density analysis unit, used for real-time accurate density detection and abnormality alarms in dormitories and exercise yards;
  • the target detection unit operates as follows: first, collect videos of five groups of people whose heads are visible in the video images, recorded in different environments according to the specification, with four groups of videos used as the training data set and one group as the validation data set; then process the four groups' video frame images according to steps S1) to S5) to obtain a human head detection model; finally, load this model for the remaining group's video frame images and perform the final real-time personnel detection and counting;
  • the training process of the behavioral norm detection module includes the following steps:
  • step S6) input the target detection frames obtained in step S5) into the STN (spatial transformer network) for an enhancement operation, extracting high-quality single-person regions from inaccurate candidate frames;
  • step S7) apply SPPE (a single-person pose estimator) to each single-person region frame enhanced in step S6) to estimate that person's posture skeleton;
  • step S8) remap the single-person pose obtained in step S7) back to the image coordinate system through the SDTN (spatial de-transformer network) to obtain a more accurate human target detection frame, and perform the human pose estimation operation again; then resolve the redundant detection problem through PP-NMS (parametric pose non-maximum suppression) to obtain the human skeleton information under this behavior;
  • step S9) for the multi-scale key points obtained in step S8), perform the key point regression operation through a deconvolution layer, which is equivalent to an upsampling pass and can improve the accuracy of the target key points; considering the connectivity of multiple key points, establish a directed field connecting the key points and match connected key point pairs according to the connectivity and structure of human body parts, reducing misconnections and obtaining the final human skeleton information;
  • step S10) perform feature extraction on the final human skeleton information obtained in step S9) and input it into the classifier as a training sample of this behavior class;
  • the identification process of the behavioral norm detection module includes the following steps:
  • step S14) input the human skeleton feature information obtained in step S13) into the classifier for identification to obtain the video behavior category.
  • The identification process includes setting the detection trigger time period and detection area for each standardized behavior detection and using the classifier for identification; the detection time and detection area are set manually, strictly following the code of conduct for detainees in the detention center.
  • Within the trigger time period, the corresponding behavior recognition operation is carried out in the set detection area;
  • when a violation is identified, an alarm message is issued; outside the detection trigger time period, the corresponding behavior recognition operation is not performed.
  • The detection time period and detection area are entirely user-defined and set in accordance with the standard code of conduct, which satisfies the needs of code-of-conduct detection well.
  • The PP-NMS operation specifically includes: selecting the pose with maximum confidence as a reference and eliminating area frames close to this reference according to an elimination criterion, repeating the process until all redundant identification frames are eliminated and each identification frame appears exactly once;
  • obtaining the human skeleton information in step S8) further includes: using an augmented data set that imitates the formation process of human region frames by learning the descriptive information of different poses in the output results, thereby generating a larger training set.
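As an illustration of this augmentation idea, the sketch below is one plausible reading (an assumption, not the patent's exact procedure): it fits a Gaussian to the offsets between detected and ground-truth person boxes and samples new candidate frames from it, imitating how the detector's region frames form around a pose.

```python
import numpy as np

def fit_offset_distribution(gt_boxes, det_boxes):
    """Fit a Gaussian to the relative offsets (dx, dy, dw, dh) between
    ground-truth (x, y, w, h) boxes and the detector's matched boxes."""
    gt, det = np.asarray(gt_boxes, float), np.asarray(det_boxes, float)
    offsets = np.stack([
        (det[:, 0] - gt[:, 0]) / gt[:, 2],   # x offset, relative to width
        (det[:, 1] - gt[:, 1]) / gt[:, 3],   # y offset, relative to height
        np.log(det[:, 2] / gt[:, 2]),        # log width ratio
        np.log(det[:, 3] / gt[:, 3]),        # log height ratio
    ], axis=1)
    return offsets.mean(axis=0), np.cov(offsets, rowvar=False)

def augment_boxes(gt_box, mean, cov, n_samples=20, rng=None):
    """Sample extra person-region frames around one ground-truth box,
    enlarging the training set as the snippet above describes."""
    rng = rng or np.random.default_rng()
    x, y, w, h = gt_box
    boxes = []
    for dx, dy, dlw, dlh in rng.multivariate_normal(mean, cov, size=n_samples):
        boxes.append((x + dx * w, y + dy * h, w * np.exp(dlw), h * np.exp(dlh)))
    return boxes
```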
  • Yet another embodiment of the present application provides a deep-learning-based method for detecting behavioral norms in prisons, the method including the following steps:
  • head count detection, used for imperceptible roll call and/or crowd density identification, including a target detection and segmentation process;
  • behavioral norm detection, used for real-time calculation and discrimination of personnel behavior, including a training process that obtains a classifier from a training sample set and a recognition process that uses the classifier to identify test samples.
  • The target detection and segmentation process specifically includes the following steps:
  • step S1) use a labeling tool to annotate human heads in the images, generating one JSON file per image, and extract the feature information of the annotated images through a convolutional neural network;
  • step S2) from the feature information obtained in step S1), extract ROIs (regions of interest) using a region proposal network, then use region-of-interest pooling to bring these ROIs to a fixed size;
  • step S3) perform bounding box regression and classification prediction on the ROIs obtained in step S2) through fully connected layers, sampling at different points of the feature map and applying bilinear interpolation;
  • the head count detection specifically includes the following steps:
  • target detection, used for imperceptible real-time detection and counting of detainees;
  • density analysis, used for real-time accurate density detection and abnormality alarms in dormitories and exercise yards;
  • the target detection includes the following steps: first, collect videos of five groups of people whose heads are visible in the video images, recorded in different environments according to the specification, with four groups of videos used as the training data set and one group as the validation data set; then process the four groups' video frame images according to steps S1) to S5) to obtain a human head detection model; finally, load this model for the remaining group's video frame images and perform the final real-time personnel detection and counting.
  • The deep-learning-based method for detecting behavioral norms in prisons of the present application includes a head count detection module and a behavioral norm detection module.
  • The head count detection module is used for imperceptible roll call of detainees and crowd density identification in the prison; the behavioral norm detection module performs real-time calculation and discrimination of behaviors covering washing order, housekeeping, dining order, sleeping order, wake-up order, television education order, safety duty rotation norms, conduct assessment norms, "three-positioning" supervision norms, and the head-holding norm when leaving the cell.
  • The head count detection module specifically includes: a target detection unit, used for imperceptible real-time detection and counting of detainees;
  • and a density analysis unit, used for real-time accurate density detection and abnormality alarms in dormitories and exercise yards.
  • The behavioral norm detection module specifically includes:
  • a washing order comparison unit, used to define the toilet and queueing areas in the cell and to calculate in real time whether only 2 people are in the toilet and whether the other people are waiting in the designated area;
  • a housekeeping norm unit, used to define the bed and wall-side waiting areas in the dormitory and to calculate in real time whether exactly 4 people remain at the beds doing housekeeping and whether the other personnel are waiting in the wall-side area;
  • a dining order comparison unit, used during dormitory meal times to calculate in real time whether anyone is abnormally not seated while eating;
  • a sleeping order comparison unit, used during dormitory rest times to calculate in real time whether anyone is sleeping with a covered head or getting up in violation of the rules;
  • a wake-up order norm unit, used at the dormitory wake-up deadline to calculate in real time whether anyone is still in bed;
  • a television education order comparison unit, used during television education time to calculate in real time whether anyone is abnormally not seated watching television education, issuing an alarm when too many people are walking around;
  • a safety duty rotation norm unit, used to define the safety duty area in the cell and to calculate in real time whether 2 people are present in that area, judging it a violation if they remain motionless in the same position for a long time;
  • a conduct norm assessment unit, used during dormitory exercise time to calculate the tidiness of the queue in real time and score it;
  • a "three-positioning" supervision unit, used when a fight occurs in the cell to calculate in real time whether personnel carry out the "three-positioning" operation as required;
  • a head-holding-on-exit norm unit, used to define the cordon area in the cell and to calculate in real time whether a person leaving the cell holds their head with both hands within the cordon area as required.
  • The head count detection module includes a target detection and segmentation process;
  • the behavioral norm detection module includes a training process that obtains a classifier from a training sample set and a recognition process that uses the classifier to identify test samples.
  • The corresponding behavior detection algorithms are designed in full accordance with the code of conduct for detainees in the detention center.
  • The identification process includes setting the detection trigger time period and detection area for each standardized behavior detection and using the classifier for identification; the detection time and detection area are set manually, strictly following the code of conduct for detainees in the detention center.
  • The corresponding behavior identification operation is performed in the set detection area, and an alarm message is issued when a violation is identified.
  • Behavior detection is triggered only within the set time period and the set detection area; the corresponding recognition algorithms are not run at other times or in other areas, which reduces the execution complexity of the system and improves its stability.
  • The detection time period and detection area are entirely user-defined and set in accordance with the standard code of conduct, which satisfies the needs of code-of-conduct detection well.
  • As shown in FIG. 1, the target detection and segmentation process of the head count detection module includes the following steps:
  • RPN: Region Proposal Network.
  • step S3) perform bounding box regression and classification prediction on the ROIs obtained in step S2) through fully connected layers, sampling at different points of the feature map and applying bilinear interpolation.
  • The data set covers four different environments; 10 people are divided into five groups, and each group repeats the specified actions three times. Four of these groups are used as the training data set and the remaining group as the test data set.
  • To complete target detection, first collect videos of five groups whose heads are visible, recorded in different environments according to the specification; four groups of videos serve as the training data set and one group as the validation data set. The four groups' video frame images are processed according to steps S1) to S5) above to obtain the human head detection model; this model is then loaded for the remaining group's video frame images, and the final real-time personnel detection and counting is carried out. To complete density detection, a final density calculation step is also required.
  • The training process of the behavioral norm detection module is shown in FIG. 2 and includes the following steps:
  • step S6) input the target detection frames obtained in step S5) into STN (Spatial Transformer Networks) for the enhancement operation, extracting high-quality single-person regions from inaccurate candidate frames.
  • STN: Spatial Transformer Networks.
  • step S8) remap the single-person pose obtained in step S7) back to the image coordinate system through the SDTN (Spatial De-Transformer Network) to obtain a more accurate human target detection frame, and perform the human pose estimation operation again; then resolve the redundant detection problem through PP-NMS (Parametric Pose Non-Maximum Suppression) to obtain the human skeleton information under this behavior.
  • SDTN: Spatial De-Transformer Network.
  • step S9) for the multi-scale key points obtained in step S8), perform the key point regression operation through a deconvolution layer, which is equivalent to an upsampling pass and can improve the accuracy of the target key points.
  • A directed field connecting the key points is established, and connected key point pairs are explicitly matched according to the connectivity and structure of human body parts to reduce misconnections and obtain the final human skeleton information.
  • step S10) perform feature extraction on the final human skeleton information obtained in step S9) and input it into the classifier as a training sample of this behavior class;
  • the identification process of the behavioral norm detection module is shown in FIG. 3 and includes the following steps:
  • step S14) input the human skeleton feature information obtained in step S13) into the classifier for identification to obtain the video behavior category.
  • In step S5), preferably two convolutional layers are used to extract detection results from the different feature maps.
  • In step S8), PP-NMS operates as follows:
  • the pose with the highest confidence is selected as the reference, and area frames close to this reference are eliminated according to the elimination criterion; the process is repeated until all redundant identification frames have been eliminated and each identification frame appears exactly once.
  • Obtaining the human skeleton information in step S8) also includes the augmentation operations described above.
  • This application preferably adopts the detention center data set.
  • The data set covers four different environments; 10 people are divided into five groups, and each group repeats the specified actions three times. Four of these groups are used as the training data set and the remaining group as the test data set.
  • FIG. 4 shows a simplified flowchart of low-level feature extraction and modeling.
  • The pose estimation framework adopted is RMPE (Regional Multi-Person Pose Estimation).
  • RMPE: Regional Multi-Person Pose Estimation.
  • The outputs of specific convolutional layers are each convolved with two 3×3 convolution kernels, and all generated bounding boxes are collected and filtered through NMS to obtain the target detection frames; the detection frames are then input to STN and SPPE, where the human pose is detected automatically, after which regression through SDTN and PP-NMS establishes a directed field connecting the key points, reducing misconnections to obtain the final human pose skeleton features.
  • The technical solution of the present application adopts a two-layer convolution operation to extract the underlying features, and then uses a non-maximum suppression method to eliminate redundancy in the detection results.
  • The detection frames after redundancy elimination are input into the STN layer to enhance the features.
  • The function of the STN network is to make the obtained features robust to translation, rotation, and scale changes.
  • The feature image output by STN is used for SPPE single-person pose estimation, and the pose estimation result is then mapped back to the image coordinate system through SDTN, which can extract high-quality human regions from inaccurate region frames.
  • The problem of redundant detection is solved by PP-NMS.
  • Key point regression is carried out through the deconvolution layer to improve key point accuracy; the directed field connecting the key points is established and misconnections are reduced, so as to obtain the final human skeleton information.
  • CNN is an efficient recognition method that has been developed in recent years and has attracted wide attention.
  • While studying neurons responsible for local sensitivity and direction selection in the cat cerebral cortex, Hubel and Wiesel discovered that a unique network structure can effectively reduce the complexity of feedback neural networks; the CNN was subsequently proposed on this basis.
  • CNN has become a research hotspot in many scientific fields, especially pattern classification, because the network avoids complex image pre-processing and can take the original image directly as input; it has therefore been widely applied.
  • The basic structure of a CNN includes two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, and local features are extracted; once a local feature is extracted, its positional relationship to other features is also determined. The second is the feature mapping layer: each computing layer of the network consists of multiple feature maps, each feature map is a plane, and all neurons in a plane share equal weights.
  • The feature mapping layer is used to extract the global underlying features from the video frame images, which are then processed more deeply.
  • The layer used in the technical solution of this application is the feature map obtained after convolution.
  • The detection result is obtained by convolving the feature map; the detection values include the class confidence and the position of the bounding box, each produced with a 3×3 convolution.

Abstract

Disclosed are a deep-learning-based apparatus and method for monitoring behavioral norms in a jail. The deep-learning-based apparatus for monitoring behavioral norms in a jail comprises: a people counting and detection module and a behavioral norm monitoring module, wherein the people counting and detection module comprises a target detection and segmentation process, and is used for imperceptible roll call of people and crowd density recognition; and the behavioral norm monitoring module comprises a training process of obtaining a classifier by using a training sample set and a recognition process of recognizing a test sample by using the classifier, and is used for performing real-time calculation and discrimination on behaviors of people. In this way, according to the present application, behavioral norm recognition can be effectively performed, regarding the requirements of a jail, on detainees, and abnormal behaviors are detected and alarms for same are provided, thereby reinforcing the security protection of the jail and improving the working efficiency of correctional officers.

Description

Device and method for detecting behavioral norms in prisons based on deep learning
This application is based on and claims priority from Chinese patent application No. 202010736024.9, filed on July 28, 2020 and entitled "Deep-learning-based device and method for detecting behavioral norms in prisons".
Technical Field
The present application relates to the field of machine learning research, and in particular to a deep-learning-based device and method for detecting behavioral norms in prisons.
Background
With the rapid development of information technology, computer vision has entered its best period of development alongside the emergence of concepts such as VR, AR, and artificial intelligence, and video behavior analysis, the most important topic in the field of computer vision, is attracting more and more scholars at home and abroad. Video behavior analysis occupies a large share of fields such as video surveillance, human-computer interaction, medical care, and video retrieval. In the currently popular driverless car projects, for example, video behavior analysis is very challenging. Owing to the complexity and diversity of human actions, together with self-occlusion, multiple scales, and viewpoint rotation and translation across multiple views, recognizing behavior in video is very difficult. How to accurately identify and analyze human behavior from multiple angles in real life has always been a very important research topic, and society's requirements for behavior analysis keep rising.
Traditional research methods include the following:
Based on video-stream feature points: extract the spatio-temporal feature points in the extracted video frame images, then model and analyze these feature points, and finally classify them.
Based on single-frame image features: extract the behavioral features of people in a single frame through algorithms or a depth camera, then describe and model these features, train on them, and classify the video behavior.
Behavior analysis methods based on video-stream feature points and single-frame image features have achieved remarkable results in the traditional single-view or single-person setting. However, with the emergence of a series of complex problems such as human occlusion, illumination changes, and viewpoint changes in areas with heavy pedestrian traffic such as streets, airports, and stations, simply using these two analysis methods in real life often fails to meet people's requirements, and the robustness of the algorithms can also be poor.
Summary of the Invention
In order to overcome the above defects of the prior art, the present application proposes a deep-learning-based device and method for detecting behavioral norms in prisons, which uses a deep learning network to analyze human behavior and improves the robustness of the classification model; in particular, a deep learning network is well suited to training and learning on big data and can give full play to this advantage.
The technical solution of the present application is realized as follows:
The embodiment of the present application provides a deep-learning-based detection device for behavioral norms in prisons, whose behavior detection algorithms are designed entirely in accordance with the code of conduct for detainees in detention centers. The detection process includes setting a detection trigger time period and a detection area for each standardized behavior detection; behavior detection is triggered only within the set time period and the set detection area, and the corresponding recognition algorithms are not run at other times or in other areas. This reduces the execution complexity of the system and improves its stability. The detection time period and detection area are entirely user-defined and set in accordance with the standard code of conduct, which satisfies the needs of code-of-conduct detection well.
Specifically, the present application proposes a deep-learning-based behavioral norm detection device for prisons, including a head count detection module and a behavioral norm detection module, wherein:
the head count detection module is used for imperceptible roll call and/or crowd density identification, and includes a target detection and segmentation process;
the behavioral norm detection module is used for real-time calculation and discrimination of personnel behavior, and includes a training process that obtains a classifier from a training sample set and a recognition process that uses the classifier to identify test samples.
Specifically, an embodiment of the present application also proposes a deep-learning-based method for detecting behavioral norms in prisons, characterized in that the method includes the following steps:
head count detection, used for imperceptible roll call and/or crowd density identification, including a target detection and segmentation process;
behavioral norm detection, used for real-time calculation and discrimination of personnel behavior, including a training process that obtains a classifier from a training sample set and a recognition process that uses the classifier to identify test samples.
The advantages of the present application are as follows: the CNN method yields global high-level features, and after STN feature enhancement these are highly robust to real-life video; SPPE is then used to obtain human posture information, SDTN maps the pose back to the human detection frame so that the network optimizes itself, PP-NMS is used to resolve redundant detections, and the corresponding classifier is trained on the pose estimation results. Features derived from global features are more comprehensive, making the behavior description more complete and more widely applicable.
Brief Description of the Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely some embodiments of the present application, and those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the target detection and segmentation process of the head count detection module of the present application;
FIG. 2 is a schematic flowchart of the training process of the behavioral norm detection module of the present application;
FIG. 3 is a schematic flowchart of the discrimination process of the behavioral norm detection module of the present application;
FIG. 4 is a simplified flowchart of the extraction and modeling of underlying features;
FIG. 5 is a processing flowchart of a general CNN.
Detailed Description
In view of the deficiencies in the prior art, the inventors of the present case, through long-term research and extensive practice, have been able to propose the technical solution of the present application. The technical solution, its implementation process, and its principles are further explained below.
The deep-learning-based prison behavioral norm detection device provided by the embodiments of the present application uses the CNN method to perform feature extraction on underlying features, obtaining global features rather than the key points obtained by traditional methods; it uses the STN method to enhance the obtained global features instead of modeling the obtained features directly; it uses the SDTN method to remap the obtained pose features, further improving the accuracy of the detection frame. In addition, for key points at multiple scales, a deconvolution layer performs the key point regression operation, which can effectively improve the accuracy of multi-person key point detection; the device also takes into account the connectivity of multiple key points and establishes a directed field connecting the key points, so that connected key point pairs are explicitly matched according to the connectivity of human key points and the structure of the human body.
An embodiment of the present application provides a deep-learning-based behavioral norm detection device for prisons, including a head count detection module and a behavioral norm detection module, wherein:
the head count detection module is used for imperceptible roll call and/or crowd density identification, and includes a target detection and segmentation process;
the behavioral norm detection module is used for real-time calculation and discrimination of personnel behavior, and includes a training process that obtains a classifier from a training sample set and a recognition process that uses the classifier to identify test samples.
Further, the target detection and segmentation process of the head count detection module includes the following steps:
S1) use a labeling tool to annotate human heads in the images, generating one JSON file per image, and extract the feature information of the annotated images through a convolutional neural network;
S2) from the feature information obtained in step S1), extract ROIs (regions of interest) using a region proposal network, then use region-of-interest pooling to bring these ROIs to a fixed size;
S3) perform bounding box regression and classification prediction on the ROIs obtained in step S2) through fully connected layers, sampling at different points of the feature map and applying bilinear interpolation;
S4) finally, run the segmentation mask network: take the positive regions selected by the ROI classifier as input and generate their masks; enlarge the predicted masks to the size of the ROI bounding boxes to give the final mask results, one mask per target; adding the predicted segmentation mask to each ROI yields, as output, the objects present in the image together with high-quality segmentation masks.
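Steps S1) to S4) follow the overall shape of a Mask R-CNN-style detector (RPN, ROI pooling/align, box regression and classification, and a mask branch). As a minimal sketch of how such a head detector could be assembled from off-the-shelf components — the two-class setup, thresholds, and input sizes here are illustrative assumptions, not part of the patent:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_head_detector(num_classes: int = 2):  # background + "head"
    """Mask R-CNN with RPN, ROI align, box regression/classification,
    and a mask branch, mirroring steps S1)-S4)."""
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    # Replace the box head for the two-class (background / head) problem.
    in_feat = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_feat, num_classes)
    # Replace the mask head so it predicts one mask per detected head.
    in_feat_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feat_mask, 256, num_classes)
    return model

model = build_head_detector()
model.eval()
with torch.no_grad():
    heads = model([torch.rand(3, 480, 640)])[0]  # boxes, labels, scores, masks
print(int((heads["scores"] > 0.5).sum()), "heads detected")
```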
Further, the head count detection module specifically includes:
a target detection unit, used for imperceptible real-time detection and counting of detainees;
a density analysis unit, used for real-time accurate density detection and abnormality alarms in dormitories and exercise yards;
the target detection unit operates as follows: first, collect videos of five groups of people whose heads are visible in the video images, recorded in different environments according to the specification, with four groups of videos used as the training data set and one group as the validation data set; then process the four groups' video frame images according to steps S1) to S5) to obtain a human head detection model; finally, load this model for the remaining group's video frame images and perform the final real-time personnel detection and counting;
Further, the training process of the behavioral norm detection module includes the following steps:
S5) input the video frame images of a given behavior and let the images pass through the convolutional neural network to extract features; convolve the outputs of 6 specific convolutional layers in the network with two 3×3 convolution kernels each; then collect all the generated bounding boxes and pass them to NMS (non-maximum suppression) to obtain a series of target detection frames.
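Step S5) resembles an SSD-style multi-scale detection head. The sketch below (channel counts and anchor numbers are illustrative assumptions) shows the two 3×3 convolutions per feature map — one for class confidence, one for box offsets — followed by NMS over the collected boxes:

```python
import torch
import torch.nn as nn
from torchvision.ops import nms

class DetectionHead(nn.Module):
    """Two 3x3 convolutions per feature map: class confidence + box offsets."""
    def __init__(self, channels_per_level, num_anchors=4, num_classes=2):
        super().__init__()
        self.cls_convs = nn.ModuleList(
            nn.Conv2d(c, num_anchors * num_classes, 3, padding=1)
            for c in channels_per_level)
        self.box_convs = nn.ModuleList(
            nn.Conv2d(c, num_anchors * 4, 3, padding=1)
            for c in channels_per_level)

    def forward(self, feature_maps):
        scores, boxes = [], []
        for fmap, cls_conv, box_conv in zip(feature_maps, self.cls_convs, self.box_convs):
            scores.append(cls_conv(fmap).flatten(2))  # raw per-anchor scores
            boxes.append(box_conv(fmap).flatten(2))   # raw per-anchor offsets
        return scores, boxes

def filter_detections(all_boxes, all_scores, iou_threshold=0.5):
    """Collect decoded boxes from all 6 levels, keep the survivors of NMS.
    `all_boxes` is (N, 4) in (x1, y1, x2, y2) form, `all_scores` is (N,)."""
    keep = nms(all_boxes, all_scores, iou_threshold)
    return all_boxes[keep], all_scores[keep]
```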
S6) input the target detection frames obtained in step S5) into the STN (spatial transformer network) for an enhancement operation, extracting high-quality single-person regions from inaccurate candidate frames;
S7) apply SPPE (a single-person pose estimator) to each single-person region frame enhanced in step S6) to estimate that person's posture skeleton;
S8) remap the single-person pose obtained in step S7) back to the image coordinate system through the SDTN (spatial de-transformer network) to obtain a more accurate human target detection frame, and perform the human pose estimation operation again; then resolve the redundant detection problem through PP-NMS (parametric pose non-maximum suppression) to obtain the human skeleton information under this behavior;
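In the RMPE framework these steps follow, the STN applies a learned affine transform to crop a clean single-person region, and the SDTN applies the inverse transform to map the estimated pose back. A minimal sketch of this pair using PyTorch's grid sampler — the localization network here is an assumed placeholder, not the patent's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Warp a candidate region with a learned 2x3 affine theta; the SDTN
    step is then just the inverse affine applied to the estimated pose."""
    def __init__(self):
        super().__init__()
        # Placeholder localization net: predicts theta from pooled features.
        self.loc = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                                 nn.Linear(3 * 8 * 8, 6))

    def forward(self, region):                 # region: (B, 3, H, W)
        theta = self.loc(region).view(-1, 2, 3)
        grid = F.affine_grid(theta, region.size(), align_corners=False)
        return F.grid_sample(region, grid, align_corners=False), theta

def sdtn(pose_xy, theta):
    """Map pose key points (B, K, 2) back to the original coordinate
    system by inverting the affine transform used by the STN."""
    a, t = theta[:, :, :2], theta[:, :, 2:]            # (B,2,2), (B,2,1)
    return torch.einsum("bij,bkj->bki", torch.inverse(a),
                        pose_xy - t.transpose(1, 2))
```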
S9) for the multi-scale key points obtained in step S8), perform the key point regression operation through a deconvolution layer, which is equivalent to an upsampling pass and can improve the accuracy of the target key points; considering the connectivity of multiple key points, establish a directed field connecting the key points and match connected key point pairs according to the connectivity and structure of human body parts, reducing misconnections and obtaining the final human skeleton information;
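A common way to realize the deconvolution-based key point regression of step S9) is a transposed-convolution head that upsamples the feature map into one heatmap per key point, reading each key point off as the arg-max of its heatmap. A sketch under those assumptions (channel counts and the 17-joint layout are illustrative):

```python
import torch
import torch.nn as nn

class KeypointHead(nn.Module):
    """Deconvolution (transposed conv) head: upsample features into one
    heatmap per key point, as in the key point regression of step S9)."""
    def __init__(self, in_channels=256, num_keypoints=17):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_channels, 128, kernel_size=4,
                                         stride=2, padding=1)  # 2x upsampling
        self.head = nn.Conv2d(128, num_keypoints, kernel_size=1)

    def forward(self, features):               # features: (B, 256, H, W)
        heatmaps = self.head(torch.relu(self.deconv(features)))
        B, K, H, W = heatmaps.shape
        flat = heatmaps.view(B, K, -1).argmax(dim=2)
        return torch.stack((flat % W, flat // W), dim=2)  # (B, K, 2) x,y
```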
S10) perform feature extraction on the final human skeleton information obtained in step S9) and input it into the classifier as a training sample of this behavior class;
S11) repeat the above steps to obtain classifiers for the various behaviors.
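The patent does not fix a particular classifier. As one plausible reading of steps S10)-S11), the sketch below flattens normalized skeleton key points into a feature vector and fits a scikit-learn SVM over the behavior classes; both the normalization and the choice of SVM are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def skeleton_features(keypoints):
    """keypoints: (K, 2) joint coordinates. Normalize away translation and
    scale so the classifier sees the pose, not where it sits in the frame."""
    kp = np.asarray(keypoints, float)
    kp -= kp.mean(axis=0)                        # remove translation
    scale = np.linalg.norm(kp, axis=1).max() or 1.0
    return (kp / scale).ravel()                  # flatten to a vector

def train_behavior_classifier(skeleton_samples, labels):
    """skeleton_samples: list of (K, 2) arrays; labels: behavior class ids."""
    X = np.stack([skeleton_features(s) for s in skeleton_samples])
    return SVC(kernel="rbf", probability=True).fit(X, labels)
```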
Further, the identification process of the behavioral norm detection module includes the following steps:
S12) following the code of conduct for detainees in the detention center, set the detection trigger time period and detection area for each specific behavior detection requirement, and store them locally in JSON form;
S13) when detecting, first read the JSON file; within the set detection trigger time period, take the video frame images of a given behavior, use only the images inside the detection area, and perform human pose estimation on them according to steps S5) to S10) to obtain the human skeleton feature information within the detection area; during the remaining time periods, only play the video frame images without performing any behavior recognition operation;
S14) input the human skeleton feature information obtained in step S13) into the classifier for identification to obtain the video behavior category.
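The JSON schema of step S12) is not spelled out in the patent. The sketch below assumes one plausible layout (all field names and values are illustrative) and shows the gating logic of step S13): recognition runs only inside the configured time window and detection area:

```python
import json
from datetime import datetime

# Assumed config layout: one rule per standardized behavior.
RULES_JSON = """
{"washing_order": {"start": "06:30", "end": "07:00",
                   "area": [[120, 80], [470, 80], [470, 330], [120, 330]]}}
"""

def rule_active(rule, now=None):
    """True only inside the configured detection trigger time period."""
    now = now or datetime.now()
    start = datetime.strptime(rule["start"], "%H:%M").time()
    end = datetime.strptime(rule["end"], "%H:%M").time()
    return start <= now.time() <= end

def crop_detection_area(frame, area):
    """Keep only the image inside the (assumed rectangular) detection area."""
    xs = [p[0] for p in area]
    ys = [p[1] for p in area]
    return frame[min(ys):max(ys), min(xs):max(xs)]

rules = json.loads(RULES_JSON)
rule = rules["washing_order"]
# if rule_active(rule): run steps S5)-S10) on crop_detection_area(frame, rule["area"])
# else: just display the frame; no recognition is performed.
```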
Further, the identification process includes setting the detection trigger time period and detection area for each standardized behavior detection and using the classifier for identification; the detection time and detection area are set manually, strictly following the code of conduct for detainees in the detention center. Within the detection trigger time period, the corresponding behavior recognition operation is carried out in the set detection area, and an alarm message is issued when a violation is identified; outside the detection trigger time period, no behavior recognition operation is performed. The detection time period and detection area are entirely user-defined and set in accordance with the standard code of conduct, which satisfies the needs of code-of-conduct detection well.
Further, the PP-NMS operation in step S8) specifically includes: selecting the pose with maximum confidence as a reference and eliminating area frames close to this reference according to an elimination criterion, repeating the process until all redundant identification frames are eliminated and each identification frame appears exactly once;
obtaining the human skeleton information in step S8) further includes: using an augmented data set that imitates the formation process of human region frames by learning the descriptive information of different poses in the output results, thereby generating a larger training set.
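A minimal sketch of the PP-NMS loop just described, assuming a simple pose distance (mean key point distance) as the elimination criterion; the actual parametric criterion in RMPE is learned, so this is illustrative only:

```python
import numpy as np

def pose_distance(p, q):
    """Assumed elimination criterion: mean Euclidean distance between
    corresponding key points of two poses of shape (K, 2)."""
    return np.linalg.norm(np.asarray(p) - np.asarray(q), axis=1).mean()

def pp_nms(poses, confidences, threshold=20.0):
    """Keep the most confident pose, drop poses close to it, and repeat
    until every remaining identification frame appears exactly once."""
    order = list(np.argsort(confidences)[::-1])
    kept = []
    while order:
        ref = order.pop(0)
        kept.append(ref)
        order = [i for i in order
                 if pose_distance(poses[i], poses[ref]) > threshold]
    return kept  # indices of the unique poses
```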
Yet another embodiment of the present application provides a deep-learning-based method for detecting behavioral norms in prisons, the method including the following steps:
head count detection, used for imperceptible roll call and/or crowd density identification, including a target detection and segmentation process;
behavioral norm detection, used for real-time calculation and discrimination of personnel behavior, including a training process that obtains a classifier from a training sample set and a recognition process that uses the classifier to identify test samples.
Further, the target detection and segmentation process specifically includes the following steps:
S1) use a labeling tool to annotate human heads in the images, generating one JSON file per image, and extract the feature information of the annotated images through a convolutional neural network;
S2) from the feature information obtained in step S1), extract ROIs (regions of interest) using a region proposal network, then use region-of-interest pooling to bring these ROIs to a fixed size;
S3) perform bounding box regression and classification prediction on the ROIs obtained in step S2) through fully connected layers, sampling at different points of the feature map and applying bilinear interpolation;
S4) finally, run the segmentation mask network: take the positive regions selected by the ROI classifier as input and generate their masks; enlarge the predicted masks to the size of the ROI bounding boxes to give the final mask results, one mask per target; adding the predicted segmentation mask to each ROI yields, as output, the objects present in the image together with high-quality segmentation masks.
Further, the head count detection specifically includes the following steps:
target detection, used for imperceptible real-time detection and counting of detainees;
density analysis, used for real-time accurate density detection and abnormality alarms in dormitories and exercise yards (a density sketch follows below);
the target detection includes the following steps: first, collect videos of five groups of people whose heads are visible in the video images, recorded in different environments according to the specification, with four groups of videos used as the training data set and one group as the validation data set; then process the four groups' video frame images according to steps S1) to S5) to obtain a human head detection model; finally, load this model for the remaining group's video frame images and perform the final real-time personnel detection and counting.
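The patent does not give the density formula. A straightforward reading of the density analysis unit is heads per unit area with an alarm threshold, as sketched below; the area value and the limit are assumed for illustration:

```python
def crowd_density(num_heads: int, area_m2: float) -> float:
    """People per square metre in the monitored zone."""
    return num_heads / area_m2

def density_alarm(num_heads: int, area_m2: float, limit: float = 1.5) -> bool:
    """Raise an abnormality alarm when density exceeds an assumed limit."""
    return crowd_density(num_heads, area_m2) > limit

# e.g. 28 detected heads in a 15 m^2 exercise yard -> density ~1.87 -> alarm.
assert density_alarm(28, 15.0)
```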
The technical solution, its implementation process, and its principles are further explained below with reference to the accompanying drawings.
As shown in FIGS. 1-3, the deep-learning-based method for detecting behavioral norms in prisons of the present application includes a head count detection module and a behavioral norm detection module.
The head count detection module is used for imperceptible roll call of detainees and crowd density identification in the prison; the behavioral norm detection module performs real-time calculation and discrimination of behaviors covering dormitory washing order, housekeeping, dining order and sleeping order, wake-up order, television education order, safety duty rotation norms, conduct assessment norms, "three-positioning" supervision norms, and the head-holding norm when leaving the cell.
The head count detection module specifically includes: a target detection unit, used for imperceptible real-time detection and counting of detainees, and a density analysis unit, used for real-time accurate density detection and abnormality alarms in dormitories and exercise yards.
The behavioral norm detection module specifically includes the following units (a zone-occupancy sketch follows this list):
A washing order comparison unit, used to define the toilet and queueing areas in the cell and to calculate in real time whether only 2 people are in the toilet and whether the other people are waiting in the designated area.
A housekeeping norm unit, used to define the bed and wall-side waiting areas in the dormitory and to calculate in real time whether exactly 4 people remain at the beds doing housekeeping and whether the other personnel are waiting in the wall-side area.
A dining order comparison unit, used during dormitory meal times to calculate in real time whether anyone is abnormally not seated while eating.
A sleeping order comparison unit, used during dormitory rest times to calculate in real time whether anyone is sleeping with a covered head or getting up in violation of the rules.
A wake-up order norm unit, used at the dormitory wake-up deadline to calculate in real time whether anyone is still in bed.
A television education order comparison unit, used during television education time to calculate in real time whether anyone is abnormally not seated watching television education, issuing an alarm when too many people are walking around.
A safety duty rotation norm unit, used to define the safety duty area in the cell and to calculate in real time whether 2 people are present in that area, judging it a violation if they remain motionless in the same position for a long time.
A conduct norm assessment unit, used during dormitory exercise time to calculate the tidiness of the queue in real time and score it.
A "three-positioning" supervision unit, used when a fight occurs in the cell to calculate in real time whether personnel carry out the "three-positioning" operation as required.
A head-holding-on-exit norm unit, used to define the cordon area in the cell and to calculate in real time whether a person leaving the cell holds their head with both hands within the cordon area as required.
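Most of these units reduce to counting detected people inside configured zones. The sketch below shows that shared primitive for the washing order rule, using a point-in-polygon test on detected head positions; the polygon test and the exact encoding of the rule ("at most 2 in the toilet, everyone else in the waiting area") are assumptions:

```python
def point_in_polygon(pt, poly):
    """Ray-casting point-in-polygon test; poly is a list of (x, y) vertices."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

def washing_order_ok(head_positions, toilet_zone, waiting_zone):
    """Assumed encoding of the washing order rule: at most 2 people in the
    toilet zone, and everyone else inside the designated waiting zone."""
    in_toilet = [p for p in head_positions if point_in_polygon(p, toilet_zone)]
    others = [p for p in head_positions if p not in in_toilet]
    return (len(in_toilet) <= 2 and
            all(point_in_polygon(p, waiting_zone) for p in others))
```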
Further, the head count detection module includes a target detection and segmentation process, and the behavioral norm detection module includes a training process that obtains a classifier from a training sample set and a recognition process that uses the classifier to identify test samples. The behavior detection algorithms are designed entirely in accordance with the code of conduct for detainees in detention facilities. The recognition process includes setting a detection trigger time period and a detection area for each normative behavior, strictly following that code of conduct, and using the classifier for identification. Within a trigger time period, the corresponding behavior recognition is performed on the configured detection area, and an alarm is issued when a violation is recognized; outside the trigger time period, no behavior recognition is performed. Because detection runs only in the configured time periods and areas, the system's execution complexity is reduced and its stability improved. The detection time periods and areas are fully user-defined according to the standard code of conduct, which satisfies the requirements of behavioral norm detection well.
Further, the target detection and segmentation process of the head count detection module, shown in Figure 1, includes the following steps:
S1) Annotate human heads in the images with a labeling tool, producing one JSON file per image; extract the feature information of the annotated images with a CNN (Convolutional Neural Network).
S2) Pass the feature information from step S1) through an RPN (Region Proposal Network) to extract ROIs (Regions Of Interest), then apply ROI Pooling to bring all ROIs to a fixed size.
S3) Perform bounding box regression and classification prediction on the ROIs from step S2) through fully connected layers, sampling at different points of the feature map and applying bilinear interpolation.
S4) Finally, run the segmentation mask network: take the positive regions selected by the ROI classifier as input and generate their masks. The predicted masks are scaled up to the size of the ROI bounding boxes to give the final masking result, one mask per target. Adding the predicted segmentation mask to each ROI yields, as output, the objects present in the image together with high-quality segmentation masks.
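Steps S1) to S4) describe an RPN, ROI pooling, and box/mask head pipeline of the Mask R-CNN family. Since the patent's own network and head-annotation data are not public, the sketch below runs torchvision's off-the-shelf Mask R-CNN purely as a stand-in for that architecture family:

    import torch
    import torchvision

    # Pretrained Mask R-CNN as a stand-in for the S1)-S4) detection/segmentation net.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    image = torch.rand(3, 480, 640)           # one video frame, CHW, values in [0, 1]
    with torch.no_grad():
        out = model([image])[0]               # dict with boxes, labels, scores, masks

    keep = out["scores"] > 0.5                # confidence-filtered detections
    boxes, masks = out["boxes"][keep], out["masks"][keep]

A head-specific model would be obtained by fine-tuning such a network on the JSON head annotations produced in step S1).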
The dataset covers four different environments, with 10 people divided into five groups; each group repeats the routine three times as required by the norms. Four of the groups are used as the training dataset and the remaining group as the test dataset.
Specifically, to perform target detection, videos are first collected of the five groups exposing their heads in the video images in the different environments as required by the norms; the videos of four groups serve as the training dataset and one group's videos as the validation dataset. The video frames of the four groups are processed through steps S1) to S4) above, yielding a head detection model; that model is then loaded and applied to the remaining group's video frames for final real-time detection and counting of people. To perform density detection, only a final density calculation step is additionally required.
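The text does not spell out the density calculation itself; one plausible minimal reading, assumed here purely for illustration, is head count per unit region area compared against an alarm threshold:

    def crowd_density(num_heads, region_area_m2):
        """People per square metre within a monitored region."""
        return num_heads / region_area_m2

    def density_alarm(num_heads, region_area_m2, threshold=4.0):
        """Raise an alarm flag when density exceeds a configured threshold."""
        return crowd_density(num_heads, region_area_m2) > threshold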
The training process of the behavioral norm detection module, shown in Figure 2, includes the following steps:
S5) Input the video frames of a given behavior and extract features with a CNN; convolve the outputs of 6 specific convolutional layers of the network with two 3*3 convolution kernels each, then gather all generated bounding boxes and pass them through NMS (Non-Maximum Suppression) to obtain a series of target detection boxes.
S6) Feed the target detection boxes from step S5) into an STN (Spatial Transformer Network) for a refinement operation that extracts high-quality single-person regions from inaccurate candidate boxes.
S7) Apply an SPPE (Single Person Pose Estimator) to each refined single-person region from step S6) to estimate that person's pose skeleton.
S8) Remap the single-person poses from step S7) back to the image coordinate system through an SDTN (Spatial De-Transformer Network), obtaining more accurate human detection boxes, and run pose estimation again. Then resolve redundant detections with PP-NMS (Parametric Pose Non-Maximum Suppression) to obtain the human skeleton information for the behavior.
S9) Apply a keypoint regression operation to the multi-scale keypoints from step S8) through deconvolution layers, which amounts to an upsampling pass and improves the precision of the target keypoints. Considering the connectivity of multiple keypoints, build a directed field connecting them and match connected keypoint pairs explicitly according to the connectivity and structure of human body parts, reducing false connections and yielding the final human skeleton information.
S10) Extract features from the final human skeleton information obtained in step S9) and feed them into the classifier as training samples for that class of behavior.
S11) Repeat the above steps to obtain classifiers for the various behaviors.
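The classifier in steps S10) and S11) is not named in the text. The sketch below assumes, for illustration only, a support vector machine trained on keypoint coordinates normalised for translation and scale; the COCO keypoint indexing and every name here are assumptions, not the patent's method:

    import numpy as np
    from sklearn.svm import SVC

    def skeleton_features(keypoints):
        """Flatten a (17, 2) keypoint array, centred on the mid-hip point and
        scaled by shoulder-to-hip length, giving translation/scale invariance."""
        kp = np.asarray(keypoints, dtype=float)
        centre = kp[11:13].mean(axis=0)                 # mid-hip (COCO indices)
        scale = np.linalg.norm(kp[5] - kp[11]) + 1e-6   # left shoulder to left hip
        return ((kp - centre) / scale).ravel()

    # X: one feature row per pose sample; y: behaviour label per sample.
    # Random data stands in for the skeletons produced by steps S5)-S9).
    X = np.stack([skeleton_features(np.random.rand(17, 2)) for _ in range(100)])
    y = np.random.choice(["washing", "violation"], size=100)
    clf = SVC(probability=True).fit(X, y)

One such classifier would be trained per behavior class, mirroring step S11).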
The recognition process of the behavioral norm detection module, shown in Figure 2, includes the following steps:
S12) Following the code of conduct for detainees and the specific behavior detection requirements, set the detection trigger time period and detection area and store them locally as JSON.
S13) During detection, first read the JSON file. Within the configured trigger time period, take the video frames of a behavior, keep only the imagery inside the detection area, and run steps S5) to S10) above for human pose estimation, obtaining the human skeleton feature information within the detection area. During the remaining time periods only the video frames are played, with no behavior recognition performed.
S14) Input the human skeleton feature information from step S13) into the classifier for recognition, obtaining the video behavior category.
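The JSON schema for the trigger time period and detection area in steps S12) and S13) is not given in the text, so the sketch below assumes a hypothetical schema and shows the corresponding time gate:

    import json
    from datetime import datetime, time

    # Hypothetical configuration format; the patent only says "stored as JSON".
    CONFIG = json.loads("""
    {
      "behavior": "washing_order",
      "trigger_start": "06:30",
      "trigger_end":   "07:00",
      "region": [[100, 80], [540, 80], [540, 420], [100, 420]]
    }
    """)

    def in_trigger_period(cfg, now=None):
        """True only inside the configured detection trigger time period."""
        now = (now or datetime.now()).time()
        start = time.fromisoformat(cfg["trigger_start"])
        end = time.fromisoformat(cfg["trigger_end"])
        return start <= now <= end

Mirroring step S13), a frame would be cropped to CONFIG["region"] and analysed only when in_trigger_period(CONFIG) is true; otherwise it is merely displayed.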
In the above technical solution, step S5) preferably uses two convolutional layers on the different feature maps to extract the detection results.
In the above technical solution, PP-NMS in step S8) operates as follows:
First select the pose with the highest confidence as the reference and eliminate the region boxes close to that reference according to an elimination criterion; this process is repeated until the redundant detection boxes are eliminated and every detection box appears only once.
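As a rough illustration of this loop, a greedy pose NMS might look like the sketch below. Note that the actual elimination criterion in RMPE is a learned combination of keypoint distance and confidence, whereas a plain mean keypoint distance is assumed here:

    import numpy as np

    def pose_nms(poses, scores, dist_thresh=20.0):
        """poses: (N, K, 2) keypoint arrays; scores: (N,) confidences.
        Returns the indices of the poses that survive suppression."""
        poses, scores = np.asarray(poses, float), np.asarray(scores, float)
        order, keep = np.argsort(-scores), []
        while order.size:
            ref = order[0]                     # highest-confidence remaining pose
            keep.append(ref)
            d = np.linalg.norm(poses[order] - poses[ref], axis=-1).mean(axis=-1)
            order = order[d >= dist_thresh]    # drop near-duplicate poses
        return keep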
In the above technical solution, obtaining the human skeleton information in step S8) further includes the following operation:
Use an augmented dataset that imitates the formation process of human region boxes by learning the descriptive information of different poses in the output results, thereby producing a larger training set.
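RMPE learns this box-offset distribution from data; the sketch below substitutes a simple Gaussian jitter of ground-truth boxes, an explicitly assumed stand-in for that augmentation:

    import numpy as np

    def jitter_boxes(boxes, n_per_box=10, sigma=0.05, rng=None):
        """boxes: (N, 4) as (x1, y1, x2, y2); returns (N * n_per_box, 4)
        noisy copies that imitate imperfect detector proposals."""
        rng = rng or np.random.default_rng(0)
        boxes = np.asarray(boxes, float)
        wh = boxes[:, 2:] - boxes[:, :2]                 # widths and heights
        scale = np.tile(wh, 2)                           # per-coordinate noise scale
        noise = rng.normal(0.0, sigma, (len(boxes), n_per_box, 4))
        return (boxes[:, None, :] + noise * scale[:, None, :]).reshape(-1, 4)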
This application preferably uses a detention center dataset covering four different environments, with 10 people divided into five groups; each group repeats the routine three times as required by the norms. Four of the groups are used as the training dataset and the remaining group as the test dataset.
Specifically, to recognize the behavior "washing", videos are first collected of four groups entering the washing area according to the norms and one group entering it in violation; the washing videos of the four groups serve as the training dataset and those of the one group as the validation dataset. The washing video frames of a given group are processed through steps S5) to S10) above, yielding the human pose skeleton feature information of the "washing" video behavior; this is used as that group's training sample for the "washing" behavior norm and input to the classifier for training. After training on the samples of several different groups, the "washing" behavior classifier is obtained. Classifiers for the various video behaviors can be constructed in the same way.
For discrimination, steps S12) to S14) above are executed. The detection trigger time period and detection area are set first; if the current time falls within the trigger period, the video frames of one group in the test samples are segmented according to the configured detection area, and only the imagery inside that area is processed through steps S5) to S10) to obtain the human pose skeleton feature information within the detection area. After passing through the data augmentation set, it is input into the classifier, which identifies the behavior category. The discrimination process for the other environments is the same.
Figure 3 shows a simplified flowchart of low-level feature extraction and modeling.
The pose estimation framework adopted in the technical solution of this application is RMPE (Regional Multi-Person Pose Estimation). As shown in Figure 3, features are first extracted from the input image with a CNN, and the outputs of 6 specific convolutional layers of the network are each convolved with two 3*3 convolution kernels; all generated bounding boxes are gathered and filtered by NMS into target detection boxes, which are then fed into the STN and SPPE to detect human poses automatically. Regression then follows through the SDTN and PP-NMS, and a directed field connecting keypoints is built to reduce false connections, yielding the final human pose skeleton features.
The technical solution of this application uses a two-layer convolution operation to extract the low-level features and then removes redundancy from the detection results with non-maximum suppression. The detection boxes remaining after redundancy elimination are fed into the STN layer to refine the features; the STN makes the resulting features robust to translation, rotation, and scale changes. The feature images output by the STN then undergo SPPE single-person pose estimation, after which the pose estimation results are mapped back to the image coordinate system through the SDTN, allowing high-quality human regions to be extracted from imprecise region boxes. PP-NMS then resolves the redundant detection problem. Finally, keypoint regression through deconvolution layers improves keypoint precision, and a directed field connecting the keypoints is built to reduce false connections, yielding the final human skeleton information.
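The STN refinement referred to above applies a predicted affine warp to each candidate region. A minimal sketch, assuming PyTorch's affine_grid and grid_sample as the sampling mechanism and a hand-set transform in place of the learned one:

    import torch
    import torch.nn.functional as F

    feat = torch.rand(1, 256, 32, 32)          # one candidate-region feature map
    theta = torch.tensor([[[1.0, 0.0, 0.1],    # 2x3 affine: slight x-translation
                           [0.0, 1.0, 0.0]]])
    grid = F.affine_grid(theta, feat.shape, align_corners=False)
    warped = F.grid_sample(feat, grid, align_corners=False)

In the full network, theta is regressed by a small localisation sub-network, and the SDTN applies the inverse transform to map poses back to image coordinates.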
In the above technical solution, the CNN is an efficient recognition method developed in recent years that has attracted wide attention. In the 1960s, while studying neurons responsible for local sensitivity and orientation selection in the cat's visual cortex, Hubel and Wiesel found that their unique network structure could effectively reduce the complexity of feedback neural networks, which later led to the CNN. Today the CNN is a research focus in many scientific fields, particularly pattern classification, where it is widely used because it avoids complex image preprocessing and can take raw images directly as input.
Generally, the basic structure of a CNN includes two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, and the local features are extracted; once a local feature is extracted, its positional relationship to the other features is also determined. The second is the feature mapping layer: each computational layer of the network consists of multiple feature maps, each feature map is a plane, and all neurons in a plane share the same weights.
The technical solution of this application uses the feature mapping layers to extract the global low-level features of the video frames, which are then processed at deeper levels.
The generalized processing flow of a CNN is shown in Figure 4.
The layers used by the technical solution of this application are the feature maps obtained after convolution. Six feature maps are extracted, with sizes (38,38), (19,19), (10,10), (5,5), (3,3), and (1,1), and several prior boxes of different scales or aspect ratios are placed at each cell of each feature map. The detection results are obtained by convolving these feature maps; the detection values include the class confidences and the bounding box positions, each produced by one 3*3 convolution.
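As a hedged sketch of the two parallel 3*3 prediction convolutions on one of the six feature maps, one branch outputting class confidences and the other box offsets for every prior box (channel counts and the class set are illustrative assumptions):

    import torch
    import torch.nn as nn

    num_classes, priors_per_cell = 2, 4        # e.g. head vs. background
    cls_head = nn.Conv2d(512, priors_per_cell * num_classes, 3, padding=1)
    loc_head = nn.Conv2d(512, priors_per_cell * 4, 3, padding=1)

    fmap = torch.rand(1, 512, 38, 38)          # the largest of the six feature maps
    cls_scores = cls_head(fmap)                # (1, priors * classes, 38, 38)
    box_offsets = loc_head(fmap)               # (1, priors * 4, 38, 38)

The same pair of heads would be applied at each of the six scales, and all resulting boxes gathered and filtered by NMS as described in step S5).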
It should be understood that the above embodiments merely illustrate the technical concept and characteristics of the present application; their purpose is to enable those familiar with the art to understand and implement its content, and they do not limit the protection scope of the present application. All equivalent changes or modifications made according to the spirit of the present application shall fall within its protection scope.

Claims (10)

  1. A deep-learning-based apparatus for detecting behavioral norms in a detention facility, characterized by comprising: a head count detection module and a behavioral norm detection module; wherein:
    the head count detection module is used for non-sensing roll call and/or crowd density identification, and includes a target detection and segmentation process;
    the behavioral norm detection module is used for real-time computation and discrimination of personnel behavior, and includes a training process that obtains a classifier from a training sample set and a recognition process that uses the classifier to identify test samples.
  2. The deep-learning-based apparatus for detecting behavioral norms in a detention facility according to claim 1, characterized in that the target detection and segmentation process of the head count detection module comprises the following steps:
    S1) annotating human heads in the images with a labeling tool, producing one JSON file per image, and extracting the feature information of the annotated images with a convolutional neural network;
    S2) passing the feature information obtained in step S1) through a region proposal network to extract ROIs, i.e. regions of interest, then applying region-of-interest pooling to bring all of these ROIs to a fixed size;
    S3) performing bounding box regression and classification prediction on the ROIs obtained in step S2) through fully connected layers, sampling at different points of the feature map, and applying bilinear interpolation;
    S4) finally running the segmentation mask network: taking the positive regions selected by the ROI classifier as input and generating their masks; scaling the predicted masks up to the size of the ROI bounding boxes to give the final masking result, one mask per target; and adding the predicted segmentation mask to each ROI, the output being the objects present in the image together with high-quality segmentation masks.
  3. The deep-learning-based apparatus for detecting behavioral norms in a detention facility according to claim 2, characterized in that the head count detection module specifically comprises:
    a target detection unit for non-sensing real-time detection and counting of detainees;
    a density analysis unit for real-time, accurate density detection and anomaly alarms in cell blocks and exercise yards;
    wherein the target detection unit operates through the following steps: first collecting videos of five groups exposing their heads in the video images in different environments as required by the norms, the videos of four groups serving as the training dataset and one group's videos as the validation dataset; then processing the video frames of the four groups through steps S1) to S4) to obtain a head detection model; and finally loading the head detection model for the remaining group's video frames to perform final real-time detection and counting of people.
  4. The deep-learning-based apparatus for detecting behavioral norms in a detention facility according to claim 1, characterized in that the training process of the behavioral norm detection module comprises the following steps:
    S5) inputting the video frames of a given behavior, extracting features with a convolutional neural network, convolving the outputs of 6 specific convolutional layers of the network with two 3*3 convolution kernels each, then gathering all generated bounding boxes and passing them through NMS, i.e. non-maximum suppression, to obtain a series of target detection boxes;
    S6) feeding the target detection boxes obtained in step S5) into an STN, i.e. a spatial transformer network, for a refinement operation that extracts high-quality single-person regions from inaccurate candidate boxes;
    S7) applying an SPPE, i.e. a single-person pose estimator, to each refined single-person region from step S6) to estimate that person's pose skeleton;
    S8) remapping the single-person poses obtained in step S7) back to the image coordinate system through an SDTN, i.e. a spatial de-transformer network, thereby obtaining more accurate human detection boxes, and running pose estimation again; then resolving redundant detections with PP-NMS, i.e. parametric pose non-maximum suppression, to obtain the human skeleton information for the behavior;
    S9) applying a keypoint regression operation to the multi-scale keypoints obtained in step S8) through deconvolution layers, which amounts to an upsampling pass and improves the precision of the target keypoints; considering the connectivity of multiple keypoints, building a directed field connecting them and matching connected keypoint pairs explicitly according to the connectivity and structure of human body parts, reducing false connections and yielding the final human skeleton information;
    S10) extracting features from the final human skeleton information obtained in step S9) and feeding them into the classifier as training samples for that class of behavior;
    S11) repeating the above steps to obtain classifiers for the various behaviors.
  5. The deep-learning-based apparatus for detecting behavioral norms in a detention facility according to claim 4, characterized in that the recognition process of the behavioral norm detection module comprises the following steps:
    S12) following the code of conduct for detainees and the specific behavior detection requirements, setting the detection trigger time period and detection area and storing them locally as JSON;
    S13) during detection, first reading the JSON file; within the configured trigger time period, taking the video frames of a behavior, keeping only the imagery inside the detection area, and running steps S5) to S10) for human pose estimation to obtain the human skeleton feature information within the detection area; during the remaining time periods only playing the video frames, with no behavior recognition performed;
    S14) inputting the human skeleton feature information obtained in step S13) into the classifier for recognition, obtaining the video behavior category.
  6. The deep-learning-based apparatus for detecting behavioral norms in a detention facility according to claim 1, characterized in that the recognition process includes setting a detection trigger time period and a detection area for each normative behavior and using the classifier for identification, including manually setting the detection time and detection area in strict accordance with the code of conduct for detainees; within a trigger time period, the corresponding behavior recognition is performed on the configured detection area, and an alarm is issued when a violation is recognized; outside the trigger time period, no behavior recognition is performed; the detection time periods and areas are fully user-defined according to the standard code of conduct, which satisfies the requirements of behavioral norm detection well.
  7. The deep-learning-based apparatus for detecting behavioral norms in a detention facility according to claim 4, characterized in that the PP-NMS operation in step S8) specifically comprises: selecting the pose with the highest confidence as the reference and eliminating the region boxes close to that reference according to an elimination criterion, repeating the process until the redundant detection boxes are eliminated and every detection box appears only once;
    and in that obtaining the human skeleton information in step S8) further comprises: using an augmented dataset that imitates the formation process of human region boxes by learning the descriptive information of different poses in the output results, thereby producing a larger training set.
  8. A deep-learning-based method for detecting behavioral norms in a detention facility, characterized in that the method comprises the following steps:
    head count detection, used for non-sensing roll call and/or crowd density identification, the head count detection including a target detection and segmentation process;
    behavioral norm detection, used for real-time computation and discrimination of personnel behavior, the behavioral norm detection including a training process that obtains a classifier from a training sample set and a recognition process that uses the classifier to identify test samples.
  9. The deep-learning-based method for detecting behavioral norms in a detention facility according to claim 8, characterized in that the target detection and segmentation process specifically comprises the following steps:
    S1) annotating human heads in the images with a labeling tool, producing one JSON file per image, and extracting the feature information of the annotated images with a convolutional neural network;
    S2) passing the feature information obtained in step S1) through a region proposal network to extract ROIs, i.e. regions of interest, then applying region-of-interest pooling to bring all of these ROIs to a fixed size;
    S3) performing bounding box regression and classification prediction on the ROIs obtained in step S2) through fully connected layers, sampling at different points of the feature map, and applying bilinear interpolation;
    S4) finally running the segmentation mask network: taking the positive regions selected by the ROI classifier as input and generating their masks; scaling the predicted masks up to the size of the ROI bounding boxes to give the final masking result, one mask per target; and adding the predicted segmentation mask to each ROI, the output being the objects present in the image together with high-quality segmentation masks.
  10. The deep-learning-based method for detecting behavioral norms in a detention facility according to claim 9, characterized in that the head count detection specifically comprises the following steps:
    target detection, for non-sensing real-time detection and counting of detainees;
    density analysis, for real-time, accurate density detection and anomaly alarms in cell blocks and exercise yards;
    wherein the target detection comprises the following steps: first collecting videos of five groups exposing their heads in the video images in different environments as required by the norms, the videos of four groups serving as the training dataset and one group's videos as the validation dataset; then processing the video frames of the four groups through steps S1) to S4) to obtain a head detection model; and finally loading the head detection model for the remaining group's video frames to perform final real-time detection and counting of people.
PCT/CN2021/107746 2020-07-28 2021-07-22 Deep-learning-based apparatus and method for monitoring behavioral norms in jail WO2022022368A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010736024.9A CN114092846A (en) 2020-07-28 2020-07-28 Prison behavior specification detection device and method based on deep learning
CN202010736024.9 2020-07-28

Publications (1)

Publication Number Publication Date
WO2022022368A1 (en)

Family

ID=80037108

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/107746 WO2022022368A1 (en) 2020-07-28 2021-07-22 Deep-learning-based apparatus and method for monitoring behavioral norms in jail

Country Status (1)

Country Link
WO (1) WO2022022368A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416254A (en) * 2018-01-17 2018-08-17 上海鹰觉科技有限公司 A kind of statistical system and method for stream of people's Activity recognition and demographics
CN109800665A (en) * 2018-12-28 2019-05-24 广州粤建三和软件股份有限公司 A kind of Human bodys' response method, system and storage medium
CN109886085A (en) * 2019-01-03 2019-06-14 四川弘和通讯有限公司 People counting method based on deep learning target detection

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114740774A (en) * 2022-04-07 2022-07-12 青岛沃柏斯智能实验科技有限公司 Behavior analysis control system for safe operation of fume hood
CN115205929B (en) * 2022-06-23 2023-07-28 池州市安安新材科技有限公司 Authentication method and system for avoiding misoperation of workbench of electric spark cutting machine tool
CN115205929A (en) * 2022-06-23 2022-10-18 池州市安安新材科技有限公司 Authentication method and system for avoiding false control of electric spark cutting machine tool workbench
CN115482491A (en) * 2022-09-23 2022-12-16 湖南大学 Bridge defect identification method and system based on transformer
CN115273154A (en) * 2022-09-26 2022-11-01 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Thermal infrared pedestrian detection method and system based on edge reconstruction and storage medium
CN115273154B (en) * 2022-09-26 2023-01-17 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Thermal infrared pedestrian detection method and system based on edge reconstruction and storage medium
CN115294661A (en) * 2022-10-10 2022-11-04 青岛浩海网络科技股份有限公司 Pedestrian dangerous behavior identification method based on deep learning
CN115841651A (en) * 2022-12-13 2023-03-24 广东筠诚建筑科技有限公司 Constructor intelligent monitoring system based on computer vision and deep learning
CN115841651B (en) * 2022-12-13 2023-08-22 广东筠诚建筑科技有限公司 Constructor intelligent monitoring system based on computer vision and deep learning
CN115988181A (en) * 2023-03-08 2023-04-18 四川三思德科技有限公司 Personnel monitoring system and method based on infrared image algorithm
CN115953741A (en) * 2023-03-14 2023-04-11 江苏实点实分网络科技有限公司 Edge computing system and method based on embedded algorithm
CN115995119B (en) * 2023-03-23 2023-07-28 山东特联信息科技有限公司 Gas cylinder filling link illegal behavior identification method and system based on Internet of things
CN115995119A (en) * 2023-03-23 2023-04-21 山东特联信息科技有限公司 Gas cylinder filling link illegal behavior identification method and system based on Internet of things
CN116631050B (en) * 2023-04-20 2024-02-13 北京电信易通信息技术股份有限公司 Intelligent video conference-oriented user behavior recognition method and system
CN116206265A (en) * 2023-05-05 2023-06-02 昆明轨道交通四号线土建项目建设管理有限公司 Protection alarm device and method for rail transit operation maintenance
CN116665419B (en) * 2023-05-09 2024-01-16 三峡高科信息技术有限责任公司 Intelligent fault early warning system and method based on AI analysis in power production operation
CN116665419A (en) * 2023-05-09 2023-08-29 三峡高科信息技术有限责任公司 Intelligent fault early warning system and method based on AI analysis in power production operation
CN116260990A (en) * 2023-05-16 2023-06-13 合肥高斯智能科技有限公司 AI asynchronous detection and real-time rendering method and system for multipath video streams
CN116343343A (en) * 2023-05-31 2023-06-27 杭州电子科技大学 Intelligent evaluation method for crane lifting command action based on cloud end architecture
CN116343343B (en) * 2023-05-31 2023-07-25 杭州电子科技大学 Intelligent evaluation method for crane lifting command action based on cloud end architecture
CN116665309B (en) * 2023-07-26 2023-11-14 山东睿芯半导体科技有限公司 Method, device, chip and terminal for identifying walking gesture features
CN116665309A (en) * 2023-07-26 2023-08-29 山东睿芯半导体科技有限公司 Method, device, chip and terminal for identifying walking gesture features
CN117275069A (en) * 2023-09-26 2023-12-22 华中科技大学 End-to-end head gesture estimation method based on learnable vector and attention mechanism
CN117115926A (en) * 2023-10-25 2023-11-24 天津大树智能科技有限公司 Human body action standard judging method and device based on real-time image processing
CN117115926B (en) * 2023-10-25 2024-02-06 天津大树智能科技有限公司 Human body action standard judging method and device based on real-time image processing
CN117253176A (en) * 2023-11-15 2023-12-19 江苏海内软件科技有限公司 Safe production Al intelligent detection method based on video analysis and computer vision
CN117253176B (en) * 2023-11-15 2024-01-26 江苏海内软件科技有限公司 Safe production Al intelligent detection method based on video analysis and computer vision
CN117351434A (en) * 2023-12-06 2024-01-05 山东恒迈信息科技有限公司 Working area personnel behavior specification monitoring and analyzing system based on action recognition
CN117351434B (en) * 2023-12-06 2024-04-26 山东恒迈信息科技有限公司 Working area personnel behavior specification monitoring and analyzing system based on action recognition

Similar Documents

Publication Publication Date Title
WO2022022368A1 (en) Deep-learning-based apparatus and method for monitoring behavioral norms in jail
Gong et al. A real-time fire detection method from video with multifeature fusion
CN109819208A (en) A kind of dense population security monitoring management method based on artificial intelligence dynamic monitoring
US9001199B2 (en) System and method for human detection and counting using background modeling, HOG and Haar features
Pantic et al. Automatic analysis of facial expressions: The state of the art
Sun et al. Articulated part-based model for joint object detection and pose estimation
Lin et al. Estimation of number of people in crowded scenes using perspective transformation
CN109190479A (en) A kind of video sequence expression recognition method based on interacting depth study
CN110717389B (en) Driver fatigue detection method based on generation countermeasure and long-short term memory network
CN106909938B (en) Visual angle independence behavior identification method based on deep learning network
CN107330371A (en) Acquisition methods, device and the storage device of the countenance of 3D facial models
CN108345894B (en) A kind of traffic incidents detection method based on deep learning and entropy model
CN110427834A (en) A kind of Activity recognition system and method based on skeleton data
CN104504395A (en) Method and system for achieving classification of pedestrians and vehicles based on neural network
CN112183472A (en) Method for detecting whether test field personnel wear work clothes or not based on improved RetinaNet
CN111860297A (en) SLAM loop detection method applied to indoor fixed space
Elbasi Reliable abnormal event detection from IoT surveillance systems
Zambanini et al. Detecting falls at homes using a network of low-resolution cameras
Wu et al. An eye localization, tracking and blink pattern recognition system: Algorithm and evaluation
CN114782979A (en) Training method and device for pedestrian re-recognition model, storage medium and terminal
Hung et al. Fall detection with two cameras based on occupied area
Juang et al. Human posture classification using interpretable 3-D fuzzy body voxel features and hierarchical fuzzy classifiers
CN112766145B (en) Method and device for identifying dynamic facial expressions of artificial neural network
CN107025439A (en) Lip-region feature extraction and normalization method based on depth data
Alsaedi et al. Design and Simulation of Smart Parking System Using Image Segmentation and CNN

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21848547

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21848547

Country of ref document: EP

Kind code of ref document: A1