CN115035599A - Armed personnel identification method and armed personnel identification system integrating equipment and behavior characteristics

Info

Publication number: CN115035599A
Application number: CN202210641120.4A
Authority: CN (China)
Legal status: Pending
Prior art keywords: behavior, frame, image, personnel, armed
Other languages: Chinese (zh)
Inventors: 赵小川, 董忆雪, 徐凯, 王子彻, 樊迪, 邵佳星, 何云峰
Current Assignee: China North Computer Application Technology Research Institute
Original Assignee: China North Computer Application Technology Research Institute
Application filed by China North Computer Application Technology Research Institute

Classifications

    • G06V 40/20 - Recognition of biometric, human-related or animal-related patterns in image or video data; Movements or behaviour, e.g. gesture recognition
    • G06N 3/084 - Computing arrangements based on biological models; Neural networks; Learning methods; Backpropagation, e.g. using gradient descent
    • G06V 10/25 - Image preprocessing; Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/34 - Image preprocessing; Smoothing or thinning of the pattern; Morphological operations; Skeletonisation
    • G06V 10/762 - Pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V 10/764 - Pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/82 - Pattern recognition or machine learning using neural networks
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Abstract

An armed personnel identification method and system fusing equipment and behavior characteristics, wherein the method comprises the following steps: acquiring an equipment detection data set, and training an equipment detection model based on the equipment detection data set; acquiring armed personnel behavior video stream data, and constructing an armed personnel behavior recognition training sample set based on the video stream data; training an armed personnel behavior recognition model based on the armed personnel behavior recognition training sample set; extracting each frame of image in the video stream to be identified and inputting it into the trained equipment detection model to obtain an equipment detection result for each frame of image; extracting the bone joint point data of each frame of image in the video stream to be identified; inputting the bone joint point data into the trained armed personnel behavior recognition model to obtain a personnel behavior recognition result for each frame of image in the video stream; and calculating the danger coefficient of the personnel in each frame of image based on the equipment detection result and the behavior recognition result, and if the danger coefficient is higher than a preset threshold value, judging the personnel to be armed personnel.

Description

Armed personnel identification method and armed personnel identification system integrating equipment and behavior characteristics
Technical Field
The invention relates to the technical field of armed personnel identification, in particular to an armed personnel identification method and system integrating equipment and behavior characteristics.
Background
In actual anti-terrorism reconnaissance work, armed personnel have no clear definition, and it is difficult to distinguish them accurately by target detection alone. Armed personnel not only have distinctive equipment features in their outward appearance, but also exhibit a number of clearly characteristic behaviors, such as standing shooting and half-squat shooting. Existing identification methods perform identification in only a single direction, so the identification accuracy is low and armed personnel are difficult to identify accurately.
Disclosure of Invention
In view of the foregoing analysis, embodiments of the present invention provide a method and a system for identifying armed personnel by fusing equipment and behavior characteristics, so as to solve the problem that the existing identification method has a low accuracy and is difficult to accurately identify armed personnel.
On one hand, the embodiment of the invention provides an armed personnel identification method integrating equipment and behavior characteristics, which comprises the following steps:
acquiring an equipment detection data set, and training an equipment detection model based on the equipment detection data set;
acquiring armed personnel behavior video stream data, and constructing an armed personnel behavior recognition training sample set based on the video stream data; training an armed personnel behavior recognition model based on the armed personnel behavior recognition training sample set;
extracting each frame of image in a video stream to be identified, inputting the image into a trained equipment detection model, and obtaining an equipment detection result of each frame of image; extracting the bone joint point data of each frame of image in the video stream to be identified; inputting the bone joint point data into a trained armed personnel behavior recognition model to obtain a personnel behavior recognition result of each frame of image in a video stream; and calculating the danger coefficient of the personnel in each frame of image based on the equipment detection result and the behavior recognition result, and if the danger coefficient is higher than a preset threshold value, judging the personnel to be armed personnel.
Based on a further improvement of the method, the danger coefficient of the personnel in each frame of image is calculated in the following manner:
Danger = P_i · IoU_i + P_act
wherein P_i denotes the confidence of the detected i-th type of equipment, IoU_i denotes the intersection-over-union of the detected i-th equipment and the human ROI, and P_act denotes the degree of risk of the behavior recognition result.
Further, the degree of risk of the behavior recognition result is calculated according to the following formula:
P_act = Σ_{j=1}^{n} β_j · P_j
wherein P_j denotes the confidence of the j-th behavior class, β_j denotes the risk factor of the j-th behavior class, and n denotes the number of behavior categories.
Further, the equipment detection model is a dynamic neural network model; the dynamic neural network model comprises a first sub-network and a second sub-network, the first sub-network being used for detecting a human in an image; when the first sub-network detects that the image contains a human, a human ROI is extracted and transmitted to the second sub-network; the second sub-network is used for carrying out equipment detection by adopting classifiers at different network depths according to different resolutions of the image; and the dynamic neural network model is trained based on the equipment detection data set to obtain a trained armed personnel equipment detection model.
Further, constructing an armed personnel behavior recognition training sample set based on the video stream data, comprising:
extracting the bone joint point data of each frame of image in the video stream data; adding a behavior label for each frame of image; corresponding the behavior labels to the skeletal joint point data to obtain an initial training sample set;
and performing label smoothing on the behavior labels in the initial training sample set to obtain a behavior recognition training sample set.
Further, performing label smoothing on the behavior labels in the initial training sample set to obtain a behavior recognition training sample set, including:
performing integral smoothing on all behavior labels in the initial training sample set;
determining a behavior conversion frame in an initial training sample set, and performing intra-group behavior label smoothing on a group of images before the behavior conversion frame;
the skeletal joint point data comprises a confidence level for a skeletal joint point; and performing confidence smoothing on the behavior label of each image based on the confidence of the skeletal joint point to obtain a behavior recognition training sample set.
Further, the following formula is used to perform overall smoothing on all behavior labels in the initial training sample set:
Lab_smooth = (1 - ε) · Lab + ε / (K - 1) · (1 - Lab)
wherein Lab denotes a sample behavior label, K denotes the number of classes, and ε denotes the smoothing parameter.
Further, performing intra-group behavior tag smoothing on a group of images before the behavior transition frame, including:
for each behavior conversion frame, determining an active index and a target index of a group of images before conversion according to the tag value of the image of the frame before the behavior conversion frame and the tag value of the behavior conversion frame;
calculating the label value corresponding to the active index in the labels of the k images before the behavior conversion frame according to the formula Labels[j][active index] = maximum label value × (i - j)/k;
calculating the label value corresponding to the target index in the labels of the k images before the behavior conversion frame so that it increases correspondingly, transitioning smoothly toward the maximum label value at the behavior conversion frame;
wherein j = i - k, i - (k - 1), ..., i - 1, the i-th frame is the behavior conversion frame, and Labels[j][active index] denotes the label value corresponding to the active index in the behavior label of the j-th frame image; Labels[j][target index] denotes the label value corresponding to the target index in the behavior label of the j-th frame image, the active index is the index of the maximum label value in the behavior label of the (i-1)-th frame image, and the target index is the index of the maximum label value in the behavior label of the i-th frame image.
Further, confidence smoothing the behavior label of each image based on the confidence of the skeletal joint points, comprising:
for each image frame, setting the confidence of the main skeletal joint points to 1 and leaving the confidences of the other skeletal joint points unchanged, and calculating the mean of the confidences of all skeletal joint points;
and multiplying the average value of the confidence degrees by the label value of the frame image to obtain a smooth label of the frame image based on the confidence degrees.
On the other hand, the embodiment of the invention provides an armed personnel identification system integrating equipment and behavior characteristics, which comprises the following modules:
the equipment detection model training module is used for acquiring an equipment detection data set and training an equipment detection model based on the equipment detection data set;
the behavior recognition model training module is used for acquiring armed personnel behavior video stream data and constructing an armed personnel behavior recognition training sample set based on the video stream data; and training an armed personnel behavior recognition model based on the armed personnel behavior recognition training sample set;
the armed personnel identification module is used for extracting each frame of image in the video stream to be identified and inputting the frame of image into the trained equipment detection model to obtain an equipment detection result of each frame of image; extracting the bone joint point data of each frame of image in the video stream to be identified; inputting the bone joint point data into a trained armed personnel behavior recognition model to obtain a personnel behavior recognition result of each frame of image in a video stream; and calculating the danger coefficient of the personnel in each frame of image based on the equipment detection result and the behavior recognition result, and if the danger coefficient is higher than a preset threshold value, judging the personnel to be armed personnel.
Compared with the prior art, the equipment detection model and the armed personnel behavior recognition model are constructed and trained, and the equipment characteristics and the behavior characteristics are combined, so that armed personnel can be recognized efficiently and accurately, and technical support is provided for accurate and efficient anti-terrorism.
In the invention, the technical schemes can be combined with each other to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a flow chart of an armed personnel identification method incorporating equipment and behavioral characteristics in accordance with an embodiment of the invention;
FIG. 2 is a block diagram of an armed personnel identification system fusing equipment and behavior characteristics according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating armed forces behavior classification according to an embodiment of the present invention;
FIG. 4 illustrates skeletal joint points identified by a gesture recognition algorithm in accordance with an embodiment of the present invention;
FIG. 5 shows partial label data before intra-group label smoothing, in accordance with an embodiment of the present invention;
FIG. 6 shows partial label data after intra-group label smoothing, according to an embodiment of the present invention;
FIG. 7 shows partial label data after confidence smoothing, according to an embodiment of the present invention;
FIG. 8 is a time-space diagram of a skeletal joint, in accordance with an embodiment of the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
One embodiment of the invention discloses an armed personnel identification method integrating equipment and behavior characteristics, as shown in fig. 1, comprising the following steps:
s1, acquiring an equipment detection data set, and training an equipment detection model based on the equipment detection data set;
s2, acquiring armed personnel behavior video stream data, and constructing an armed personnel behavior recognition training sample set based on the video stream data; training an armed personnel behavior recognition model based on the armed personnel behavior recognition training sample set;
s3, extracting each frame of image in the video stream to be identified, inputting the image into the trained equipment detection model, and obtaining the equipment detection result of each frame of image; extracting the bone joint point data of each frame of image in the video stream to be identified; inputting the bone joint point data into a trained armed personnel behavior recognition model to obtain a personnel behavior recognition result of each frame of image in a video stream; and calculating the danger coefficient of the personnel in each frame of image based on the equipment detection result and the behavior recognition result, and if the danger coefficient is higher than a preset threshold value, judging the personnel to be armed personnel.
According to the method, the equipment detection model and the armed personnel behavior recognition model are constructed and trained, and the equipment characteristics and the behavior characteristics are combined, so that armed personnel can be recognized efficiently and accurately, and technical support is provided for accurate and efficient anti-terrorism.
In the actual anti-terrorism reconnaissance work, the equipment carried by the armed personnel is small and difficult to distinguish in a long distance range, and in order to accurately identify the equipment at different distances, in step S1, images of the armed personnel at different distances are collected by a variable-focus high-resolution camera to construct an equipment detection data set.
Marking the acquired images with personnel and equipment to form an equipment detection data set, and specifically comprising the following steps:
and S11, labeling the personnel and equipment in the armed personnel image by using the labeling frame.
Specifically, marking the personnel and equipment in the image is to mark the positions of the personnel and equipment in the image by using a marking frame and mark the type of the corresponding equipment.
And S12, performing data set enhancement on the marked image by adopting marking frame scale distortion, marking frame mirror image turning, image random zooming, image random cutting and/or image random arrangement to obtain an enhanced data set, and taking the data set before enhancement and the enhanced data set as equipment detection data sets.
In order to increase the size of the data set and improve the detection capability of the model, the data of the marked image is enhanced. Specifically, the data enhancement comprises the step of carrying out data set enhancement on the marked image by adopting marking frame scale distortion, marking frame mirror image turning, image random zooming, image random cutting and/or image random arrangement to obtain an enhanced data set.
In implementation, the scale warping method comprises the following steps: and the coordinate origin of the labeling frame is unchanged, the image in the labeling frame is covered at the original target position after corresponding scale transformation, and the coordinate of the labeling frame is transformed along with the scale transformation to obtain a new image and corresponding labeling information.
When in implementation, the mirror image turning method comprises the following steps: and (3) keeping the original coordinate of the marking frame unchanged, covering the image in the marking frame at the original target position after carrying out corresponding overturning transformation, keeping the coordinate of the marking frame unchanged, and obtaining a new image and corresponding marking information.
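The labeling-frame-level augmentations described above (scale warping and mirror flipping inside the labeling frame) can be sketched as follows. This is a minimal illustration, assuming images as NumPy arrays and boxes as (x1, y1, x2, y2) pixel coordinates; the resizing helper and clipping behaviour are illustrative details, not taken from the patent.

```python
import numpy as np
import cv2

def scale_warp_box(image, box, scale=1.2):
    """Keep the labeling frame's origin, rescale the patch inside it and
    paste it back at the original target position; the box coordinates are
    transformed with the same scale (clipped to the image bounds)."""
    x1, y1, x2, y2 = box
    patch = image[y1:y2, x1:x2]
    new_w = min(int((x2 - x1) * scale), image.shape[1] - x1)
    new_h = min(int((y2 - y1) * scale), image.shape[0] - y1)
    warped = cv2.resize(patch, (new_w, new_h))
    out = image.copy()
    out[y1:y1 + new_h, x1:x1 + new_w] = warped
    return out, (x1, y1, x1 + new_w, y1 + new_h)

def mirror_flip_box(image, box):
    """Flip the patch inside the labeling frame horizontally and paste it
    back at the original position; the box coordinates stay unchanged."""
    x1, y1, x2, y2 = box
    out = image.copy()
    out[y1:y2, x1:x2] = image[y1:y2, x1:x2][:, ::-1]
    return out, box
```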
In practice, armed personnel equipment includes: headgear, firearms (including rifles, pistols and rocket launchers), explosives and knives; together with the human body itself, this gives a total of 5 types of targets. The final detection data set comprises the 5 target types of person, headgear, firearm, knife and explosive, and contains more than 17,000 live-action images.
For a self-constructed dataset, the size of the initial anchor box of the detection model needs to be determined. Therefore, after the equipment detection data set is constructed based on the armed personnel image, before the dynamic neural network model is constructed, the method further comprises the following steps: and determining an initial anchor frame of the dynamic neural network model according to the labeling frames of all the images in the data set.
Specifically, determining an initial anchor frame of the dynamic neural network model according to the labeling frames of all the images in the dataset includes:
s13, scaling the images in the data set to a specified size in an equal proportion to obtain a scaled labeling frame;
Illustratively, all images of the self-constructed dataset have a size of 1920 × 1080; the larger of the width and height of each picture is scaled to the specified size, for example 1080 × 1080, and the smaller side is scaled correspondingly; the relative coordinates of the labeling frame corresponding to the scaled image are converted into absolute coordinates, and the size of the scaled labeling frame, i.e., its length and width, is calculated.
In order to screen out invalid data, the converted labeling frames are screened: all labeling frames whose width and height are not less than 2 pixels are retained, and the remaining labeling frames are deleted.
And S14, clustering the zoomed labeling boxes, performing variation on each type of labeling boxes by adopting a genetic algorithm, and selecting the optimal labeling box as an initial anchor box based on the prediction accuracy.
All the scaled labeling frames are clustered; in implementation, a k-means clustering algorithm can be adopted to cluster the labeling frames into km different classes.
In the task of object recognition, it is often desirable to detect small objects in a large feature map, since the large feature map contains more information about small objects. Thus, the anchor box on the large feature map is typically set to a small value, while the anchor box of the small feature map is set to a larger value. In implementation, if the dynamic neural network includes a four-level classifier, 4 sets of initialization anchor frames are required to be set, and each set of anchor frame includes 3 pairs of values. Therefore, in the k-means clustering algorithm, the number km of classes is 12.
For each class of labeling frames, mutation is carried out by adopting a genetic algorithm, and the optimal anchor frame is selected as the initial anchor frame corresponding to the class based on a fitness function. Specifically, random variation is performed on the labeling frames, that is, the length and width of a labeling frame are randomly varied to generate a cluster of candidate anchor frames for the class. For the generated cluster of anchor frames, the prediction accuracy of each anchor frame is calculated, and the anchor frame with the highest prediction accuracy is selected as the initial anchor frame corresponding to the class.
The prediction accuracy of each anchor frame is calculated by the following formula:
p_ij = n_ij / m_i
wherein p_ij denotes the prediction accuracy of the j-th anchor frame of the i-th class, n_ij denotes the number of i-th class labeling frames whose intersection-over-union with the j-th anchor frame of the i-th class is larger than the threshold, and m_i denotes the number of i-th class labeling frames.
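The anchor initialization described above can be sketched as follows. This is a minimal illustration under simple assumptions (boxes stored as NumPy (width, height) pairs, Euclidean k-means, random multiplicative mutation); the function names, mutation scheme and IoU threshold are illustrative, not taken from the patent.

```python
import numpy as np

def wh_iou(anchor, boxes):
    # IoU between one (w, h) anchor and an array of (w, h) boxes, both centred at the origin
    inter = np.minimum(anchor[0], boxes[:, 0]) * np.minimum(anchor[1], boxes[:, 1])
    union = anchor[0] * anchor[1] + boxes[:, 0] * boxes[:, 1] - inter
    return inter / union

def kmeans_wh(boxes, km=12, iters=50, seed=0):
    # plain k-means on the (w, h) pairs of the scaled labeling frames
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), km, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(boxes[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for c in range(km):
            if np.any(labels == c):
                centers[c] = boxes[labels == c].mean(axis=0)
    return centers, labels

def prediction_accuracy(anchor, class_boxes, iou_thr=0.5):
    # p_ij = n_ij / m_i: fraction of the class's labeling frames whose IoU with this anchor exceeds the threshold
    return float((wh_iou(anchor, class_boxes) > iou_thr).mean())

def evolve_anchor(center, class_boxes, n_mut=100, sigma=0.1, seed=0):
    # genetic-style mutation: randomly perturb width/height of the cluster
    # center and keep the candidate with the highest prediction accuracy
    rng = np.random.default_rng(seed)
    candidates = center * (1 + sigma * rng.standard_normal((n_mut, 2)))
    candidates = np.vstack([center, candidates])
    scores = [prediction_accuracy(c, class_boxes) for c in candidates]
    return candidates[int(np.argmax(scores))]

# boxes: (N, 2) array of scaled labeling-frame (w, h); km = 12 -> 4 levels x 3 anchors
boxes = np.abs(np.random.randn(1000, 2)) * 100 + 10   # placeholder data
centers, labels = kmeans_wh(boxes, km=12)
init_anchors = np.array([
    evolve_anchor(centers[c], boxes[labels == c] if np.any(labels == c) else boxes)
    for c in range(12)])
```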
Specifically, the equipment detection model in step S1 is a dynamic neural network model; the dynamic neural network model comprises a first sub-network and a second sub-network, the first sub-network being for detecting a human in an image; when the first sub-network detects that the image contains the human, extracting a human ROI and transmitting the human ROI to a second sub-network; the second sub-network is used for carrying out equipment detection by adopting classifiers at different network depths according to different resolutions of the image; and training the dynamic neural network model based on the equipment detection data set to obtain a trained armed personnel equipment detection model.
In implementation, the first sub-network may adopt a YOLOv3-tiny structure, which is lightweight, easy to deploy and highly accurate, and is therefore suitable for simple personnel detection tasks. If the first sub-network detects that the image contains a human, the human ROI (region of interest) is cut out and transmitted to the second sub-network; if no human is contained, forward propagation is stopped, which improves the detection efficiency of the model.
The second sub-network is used for rig detection with classifiers at different network depths according to different resolutions of the image, i.e. at different network depths according to the resolution of the human ROI received from the first sub-network.
In practice, the second sub-network may employ an improved YOLOv5l model. The second sub-network comprises a backbone network unit, a Neck network unit and a prediction unit; the backbone network unit is used for extracting feature maps of different scales from the human ROI; the Neck network unit is used for performing up-sampling and feature fusion on the feature maps of different scales extracted by the backbone network unit to obtain tensor data of different scales; the prediction unit comprises a plurality of shallow classifiers used for carrying out target detection according to the tensor data of different scales.
The Neck network unit comprises a plurality of stages of CSP and CBL combination blocks; shallow classifiers are connected behind the CSP and CBL combination blocks of different stages, and each shallow classifier is used for carrying out target detection according to the tensor data of the current scale.
Shallow classifiers are adopted at different network depths (corresponding to CSP and CBL combination blocks at different stages) to identify and detect samples with different resolutions, which improves model efficiency and reduces computational redundancy. Specifically, the Neck network unit comprises a plurality of stages of CSP and CBL combination blocks, where a higher stage number represents a deeper network depth; the classifiers of the prediction unit are connected behind the CSP and CBL combination blocks at different stages, so that samples with different resolutions can be rapidly identified and detected, which greatly reduces the amount of computation and saves computing resources.
The CBL block comprises a convolution layer, a batch normalization layer and a Leaky ReLU layer. The CSP block is a CSP2_x structure block, which divides the input into two branches: one branch first passes through a CBL block, then through x residual structures, and is then convolved; the other branch is directly convolved; the two branches are then concatenated (concat) and output after passing through a BN layer and an activation layer.
Illustratively, for convenience of description, the classifiers connected after the last 4 CSP and CBL combination blocks of the Neck network unit are respectively denoted as the first, second, third and fourth shallow classifiers. The output of each stage of CSP and CBL combination block is divided into two paths: one path is connected to the shallow classifier corresponding to that stage, and the other path continues to propagate forward through the CBL layer and the concat layer into the next stage of CSP and CBL combination block. The resolution corresponding to the output of each CSP and CBL combination block is judged first; if the resolution is in the preset range, the feature map is input into the shallow classifier corresponding to the current stage for detection and identification, and forward propagation stops; otherwise, the feature map continues to propagate forward so that deeper features are extracted. For example, after the 4th-from-last CSP and CBL combination block, the resolution of the input image is judged first; if the resolution is greater than or equal to 400, the feature map is input into the first shallow classifier for detection and identification; if the image resolution is less than 400, forward propagation continues to extract deeper features. After the 3rd-from-last CSP and CBL combination block, the resolution of the input image is judged first; if the resolution is less than 400 and greater than or equal to 200, the feature map is input into the second shallow classifier for detection and identification; otherwise, forward propagation continues to extract deeper features. After the 2nd-from-last CSP and CBL combination block, the resolution of the input image is judged first; if the resolution is less than 200 and greater than or equal to 50, the feature map is input into the third shallow classifier for detection and identification; otherwise, the features continue to propagate forward. After the last CSP and CBL combination block, no judgment is needed, and the features are input into the corresponding classifier for detection and identification. By connecting classifiers at different depths, samples with different resolutions exit early at different depths, and only samples with a resolution less than 50 enter the deep layers of the network for computation. This greatly improves the operating efficiency of the model and reduces redundant computation.
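The resolution-gated routing just described can be sketched as follows. This is a simplified illustration assuming four stage modules and four classifier heads passed in as callables and a human-ROI tensor in (C, H, W) layout; the thresholds (400, 200, 50) follow the description above, while the class and attribute names are hypothetical.

```python
import torch
import torch.nn as nn

class DynamicRigDetector(nn.Module):
    """Sketch of resolution-gated early exit: higher-resolution ROIs are
    classified by shallower heads; only ROIs below 50 px reach the deepest head."""

    def __init__(self, stages, heads):
        super().__init__()
        # stages: 4 CSP+CBL combination blocks; heads: 4 shallow classifiers
        self.stages = nn.ModuleList(stages)
        self.heads = nn.ModuleList(heads)
        self.thresholds = [400, 200, 50]          # resolution gates for the first three heads

    def forward(self, roi):
        resolution = max(roi.shape[-2], roi.shape[-1])   # longer side of the human ROI
        x = roi
        for level, stage in enumerate(self.stages):
            x = stage(x)
            # exit early if the ROI resolution falls in this level's range
            if level < 3 and resolution >= self.thresholds[level]:
                return self.heads[level](x)
        return self.heads[3](x)                   # deepest head for ROIs below 50 px
```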
Specifically, the shallow classifier comprises a convolution layer, a concat layer and a sigmoid layer which are sequentially connected; the convolutional layer is used for extracting image features, the concat layer is used for splicing the features, and the sigmoid layer is used for classifying; the sizes and the numbers of convolution kernels of the shallow classifiers in different levels are different.
In implementation, the convolution layer of the first shallow classifier consists of 128 convolution kernels of size 1 with stride 1; the convolution layer of the second shallow classifier consists of 128 convolution kernels of size 3 with stride 2; the convolution layer of the third shallow classifier consists of 256 convolution kernels of size 3 with stride 2; and the convolution layer of the fourth shallow classifier consists of 384 convolution kernels of size 3 with stride 2.
During implementation, the concat layer of the shallow classifier is used to splice the features output by its convolution layer with the features output by a certain layer in the backbone network, so that the features extracted by the backbone network and the features extracted by the Neck network are fused and the classification becomes more accurate. In implementation, a layer in the backbone network whose output feature dimension is the same as that of the convolution layer of the shallow classifier is selected, and its output features are spliced with the output features of the convolution layer of the current classifier.
The convolution kernels with different scales are arranged in the classifiers with different depths, so that the features of different depths are extracted, and for the image with higher resolution, the deep features are not required to be extracted to accurately detect and identify, thereby reducing the operation amount and improving the detection efficiency.
After the dynamic neural network model is built, the dynamic neural network model is trained based on the equipment detection data set, and the trained armed personnel equipment detection model is obtained. Specifically, when model training is performed, the loss of the detection frame of the model is calculated by the following formula:
L_box = 1 - IOU + dis_2² / dis_C²
wherein dis_2 denotes the distance between the central points of the prediction frame and the labeling frame, dis_C denotes the diagonal length of the minimum circumscribed rectangle of the prediction frame and the labeling frame, and IOU denotes the intersection-over-union of the prediction frame and the labeling frame.
The classification loss of the model adopts a cross entropy loss function:
L_cls = -(1/N) · Σ_i Σ_{c=1}^{M} y_ic · log(p_ic)
where M denotes the number of categories (for example, four categories of equipment are identified, so M = 4), p_ic denotes the confidence that sample i belongs to class c, y_ic is a 0-1 variable that takes the value 1 when the true class of sample i is c and 0 otherwise, and N denotes the number of samples in a batch.
The overall loss function of the finally obtained model combines the detection frame loss and the classification loss:
L = L_box + L_cls
and according to the overall loss of the model, performing back propagation and optimizing the model parameters, thereby obtaining a trained armed personnel equipment detection model.
A large number of invalid and inefficient redundant structures and parameters exist in the trained equipment identification model, so the inference efficiency needs to be further improved. Pruning is one of the methods for improving inference efficiency: by cutting inefficient branches and parameters, a model with smaller scale, higher memory utilization, lower energy consumption and faster inference speed can be generated efficiently with minimal loss of inference accuracy.
The armed personnel equipment detection model is constructed by adopting the dynamic neural network model, and images with different resolutions can be identified by adopting different network depths, so that redundant calculation is reduced, and the detection efficiency is improved.
Specifically, training the dynamic neural network model based on the equipment detection data set to obtain a trained armed personnel equipment detection model, and pruning the armed personnel equipment detection model by adopting the following steps:
s15, carrying out sparsification on each channel of the model, and calculating a scale factor of each channel after the sparsification;
because different channels in the network have different influences on the effect of model identification, the purpose of the sparsification processing is to approximate the coefficient (also called scale factor) of the BN layer of the channel with smaller influence to 0 to obtain the sparsified scale factor. During training, a regular term is added to the scale factor of the BN layer in each channel, and for the scale factor with a smaller value, the scale factor is closer to 0 after training. Thereby realizing the purpose of thinning the scale factor.
S16, if the scale factor is smaller than a preset threshold value, cutting the channel; otherwise, the channel is reserved.
Illustratively, if the pruning percentage is set to 55%, 55% of the channels are to be pruned. A threshold is determined according to this percentage and all scale factors in the model, and all channels whose scale factors are smaller than the threshold are pruned (i.e., the corresponding scale factors are set to 0), thereby pruning the model.
And S17, retraining the pruned model to obtain a trained armed personnel equipment detection model.
The number of channels of the pruned model is reduced, so the model parameters are reduced, and the recognition precision of the pruned model inevitably drops. Therefore, fine-tuning training is needed to compensate for the precision loss caused by pruning. When the precision reaches a preset value, training is finished and the pruned armed personnel equipment detection model is obtained.
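The sparsification-and-pruning procedure (steps S15 to S17) can be sketched as follows, assuming a PyTorch model whose prunable channels are gated by BatchNorm2d scale factors; the 55% ratio and the L1 regular term follow the description above, while the helper names and the regularization coefficient are illustrative.

```python
import torch
import torch.nn as nn

def bn_l1_penalty(model, lam=1e-4):
    # regular term added to the training loss so that unimportant
    # BN scale factors (gamma) are driven towards zero (sparsification, S15)
    return lam * sum(m.weight.abs().sum()
                     for m in model.modules() if isinstance(m, nn.BatchNorm2d))

def prune_by_scale_factor(model, prune_ratio=0.55):
    # collect all BN scale factors, derive the threshold from the
    # pruning percentage, and zero out channels below it (S16)
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_ratio)
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            mask = (m.weight.data.abs() >= threshold).float()
            m.weight.data.mul_(mask)   # zeroed channels are effectively cut
            m.bias.data.mul_(mask)
    return threshold

# Training with sparsification (S15):  loss = task_loss + bn_l1_penalty(model)
# After pruning (S16), fine-tune the model (S17) until the accuracy recovers
# to the preset value.
```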
Since no armed personnel behavior data set currently exists, the data set needs to be self-constructed. In practice, armed personnel behavior videos can be recorded at a frame rate of 30 fps and a resolution of 640 × 480 to obtain armed personnel behavior video stream data. Based on the acquired video stream data, the videos are annotated with behaviors frame by frame through manual annotation.
Specifically, in step S2, constructing an armed personnel behavior recognition training sample set based on the video stream data includes:
s21, extracting the bone joint point data of each frame of image in the video stream data; adding a behavior label for each frame of image; corresponding the behavior label to the skeletal joint point data to obtain an initial training sample set;
specifically, armed personnel behavior tags are manually added to each frame of image. Armed forces act in 6 categories (see fig. 3): standing, walking, squatting, standing and shooting, and squatting. For the purpose of normalizing the annotation result, the firing angle is constrained to be + -30 deg. for horizontal firing, and the behavior satisfying the firing angle is defined as the firing behavior.
And extracting the bone joint point data of the person in the image by adopting a gesture recognition algorithm. In implementation, the AlphaPose gesture recognition algorithm can be used for extracting the bone joint point data (see the attached figure 4) of armed personnel in the video frame by frame, wherein the bone joint point data comprises the coordinates and the confidence coefficient of the bone joint point. A total of 14 skeletal joint points were extracted: face center, neck, right shoulder, left shoulder, right elbow, left elbow, right wrist, left wrist, right hip, left hip, right knee, left knee, right ankle, left ankle.
The bone joint point data of each frame of image is associated with its behavior label to generate an initial training sample set. Because the pose recognition algorithm may produce empty data, such data is removed, i.e., image frames without bone joint point data are removed to ensure the validity of the data.
And S22, performing label smoothing on the behavior labels in the initial training sample set to obtain a behavior recognition training sample set.
For multi-class models, labels are typically encoded in one-hot form; for example, the first class "standing" is encoded as (1, 0, 0, 0, 0, 0), and so on. One-hot labels cannot guarantee the generalization ability of the model, so the network easily overfits. In order to solve this problem, after the initial training sample set is obtained, label smoothing is performed on the behavior labels in the initial sample set to obtain the behavior recognition training sample set. Specifically, label smoothing includes:
s221, performing integral smoothing on all behavior labels in the initial training sample set
Specifically, the following formula is adopted to perform overall smoothing on all behavior labels in the initial training sample set:
Lab_smooth = (1 - ε) · Lab + ε / (K - 1) · (1 - Lab)
wherein Lab denotes the sample behavior label, K denotes the number of classes, and ε denotes the smoothing parameter.
In practice, ε may be 0.1. After the overall label smoothing, a small probability is assigned to the lower-probability categories, leaving a certain generalization space for learning. For example, the label (1, 0, 0, 0, 0, 0) becomes (0.90, 0.02, 0.02, 0.02, 0.02, 0.02) after smoothing.
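A minimal sketch of the overall label smoothing, using the formula reconstructed above (the ε/(K - 1) form that reproduces the (0.90, 0.02, ...) example); the function name is illustrative.

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """one_hot: (N, K) one-hot behavior labels.
    Returns (1 - eps) * label + eps / (K - 1) * (1 - label)."""
    k = one_hot.shape[1]
    return (1.0 - eps) * one_hot + eps / (k - 1) * (1.0 - one_hot)

labels = np.eye(6)[[0, 3]]          # e.g. "standing" and "standing shooting"
print(smooth_labels(labels))        # first row -> (0.90, 0.02, 0.02, 0.02, 0.02, 0.02)
```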
S222, determining a behavior conversion frame in the initial training sample set, and performing intra-group behavior label smoothing on a group of images before the behavior conversion frame;
specifically, if the behavior label of the i-1 th frame is different from the behavior label of the i-th frame, the i-th frame is the behavior conversion frame. For each behavior transition frame, an intra-group smoothing is performed on a group of images before it. The method specifically comprises the following steps:
for each behavior conversion frame, determining an active index and a target index of a group of images before conversion according to the tag value of the image of the frame before the behavior conversion frame and the tag value of the behavior conversion frame;
calculating the label value corresponding to the active index in the labels of the k images before the behavior conversion frame according to the formula Labels[j][active index] = maximum label value × (i - j)/k;
calculating the label value corresponding to the target index in the labels of the k images before the behavior conversion frame so that it increases correspondingly, transitioning smoothly toward the maximum label value at the behavior conversion frame;
wherein j = i - k, i - (k - 1), ..., i - 1, the i-th frame is the behavior conversion frame, and Labels[j][active index] denotes the label value corresponding to the active index in the behavior label of the j-th frame image; Labels[j][target index] denotes the label value corresponding to the target index in the behavior label of the j-th frame image, the active index is the index of the maximum label value in the behavior label of the (i-1)-th frame image, and the target index is the index of the maximum label value in the behavior label of the i-th frame image.
Specifically, the process of intra-group behavior label smoothing is described taking the behavior labels in FIG. 5 as an example. The last row in FIG. 5 is the behavior conversion frame, which is assumed to be the i-th frame. The group before the i-th frame, i.e., k behavior labels, is extracted for intra-group label smoothing. In implementation, k may be determined according to the action duration and the smoothing precision requirement, taking a portion of the continuous behavior labels before the behavior conversion frame for intra-group smoothing; for example, k is taken as 7, i.e., the 7 behavior labels before the i-th frame are smoothed within the group. In the behavior label of the (i-1)-th frame image, the index of the maximum label value is 0, so the active index is 0. In the behavior label of the i-th frame image, the index of the maximum label value is 1, so the target index is 1.
Therefore, for the j-th frame image, j = i - k, i - (k - 1), ..., i - 1, the label value at the index 0 position (the active index) is calculated according to Labels[j][0] = maximum label value × (i - j)/k, the label value at the index 1 position (the target index) is increased correspondingly so that it transitions smoothly toward the maximum label value at the conversion frame, and the label values at the other index positions are unchanged.
For example, for the label (0.90, 0.02, 0.02, 0.02, 0.02, 0.02), the maximum label value is 0.90 and the minimum label value is 0.02. The label values after intra-group smoothing are shown in FIG. 6. Because the behavior of a person changes continuously, smoothing the behavior labels before the conversion frame allows them to transition smoothly to the behavior conversion frame, so that the final labels reflect the actual behavior of the person; this enhances the generalization space of subsequent learning and provides a data basis for accurately identifying the actions of armed personnel.
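The intra-group smoothing can be sketched as follows. The active-index update follows the formula given above; the ramp used for the target index is an assumed linear interpolation consistent with the smooth transition described in the text, not the patent's exact formula, and the helper names are illustrative.

```python
import numpy as np

def find_conversion_frames(labels):
    # frame i is a behavior conversion frame if its argmax differs from frame i-1
    arg = labels.argmax(axis=1)
    return [i for i in range(1, len(arg)) if arg[i] != arg[i - 1]]

def smooth_transition_group(labels, i, k=7):
    """labels: (T, K) per-frame smoothed behavior labels; frame i is a
    behavior conversion frame.  Labels[j][active] = max_label * (i - j) / k;
    the target-index ramp below is an assumed linear interpolation."""
    active = int(labels[i - 1].argmax())      # index of the max label in frame i-1
    target = int(labels[i].argmax())          # index of the max label in frame i
    max_val = labels[i - 1].max()
    min_val = labels[i - 1].min()
    for j in range(i - k, i):
        labels[j, active] = max_val * (i - j) / k
        # assumed counterpart: ramp the target index up as j approaches i
        labels[j, target] = min_val + (max_val - min_val) * (k - (i - j)) / k
    return labels
```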
S223, the bone joint point data comprises confidence degrees of the bone joint points; and performing confidence smoothing on the behavior label of each image based on the confidence of the skeletal joint point to obtain a behavior recognition training sample set.
Specifically, for each image frame, the confidence of the main skeletal joint points is set to 1, the confidences of the other skeletal joint points are unchanged, and the mean of the confidences of all skeletal joint points is calculated; the main skeletal joint points comprise the neck, left shoulder, right shoulder, left hip and right hip.
And multiplying the mean value of the confidence degrees by the label value of the frame image to obtain a smooth label of the frame image based on the confidence degrees. Thereby further enhancing the generalization space of learning. The confidence smoothed behavior tag data is shown in fig. 7.
After the behavior label is smoothed, the method also comprises the step of carrying out normalization processing on the bone joint point coordinate data in the initial training sample set during implementation. Specifically, the bone joint point coordinates can be normalized by using the maximum/minimum values of the bone joint point coordinates of each group, and all the bone joint point coordinates are normalized to be within a range of (-1, 1).
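A sketch of the confidence-based smoothing and the coordinate normalization, assuming each frame's joints are stored as (x, y, confidence) triples in the 14-joint layout above; the index list for the main joints is an assumed ordering, not taken from the patent.

```python
import numpy as np

MAIN_JOINTS = [1, 2, 3, 8, 9]   # assumed indices of neck, shoulders and hips in the 14-joint layout

def confidence_smooth(label, joints):
    """label: (K,) smoothed behavior label of one frame; joints: (14, 3) of (x, y, confidence)."""
    conf = joints[:, 2].copy()
    conf[MAIN_JOINTS] = 1.0                 # main joints are fully trusted
    return conf.mean() * label              # scale the label by the mean confidence

def normalize_coords(joints_seq):
    """joints_seq: (T, 14, 3).  Normalize x, y into (-1, 1) using the
    per-group min/max of the joint coordinates."""
    xy = joints_seq[..., :2]
    lo, hi = xy.min(), xy.max()
    joints_seq[..., :2] = 2.0 * (xy - lo) / (hi - lo + 1e-9) - 1.0
    return joints_seq
```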
In implementation, the training sample set constructed by the application contains 29757 frames of labeled data for training.
Specifically, in step S2, the armed personnel behavior recognition model is a model constructed based on a space-time graph convolutional network, and specifically includes:
s23, constructing a bone joint point space-time diagram by taking the bone joint points as nodes, taking natural connection relations among the bone joint points as space edges and taking the connection relations of the same bone joint points in two continuous frames as time edges;
the sequence of skeletal points of the human body can generally be represented by the coordinates of the joints of the human body in each frame of image. In order to better utilize graph convolution to extract dynamic information of human skeleton points, edges between nodes of a graph not only comprise space edges representing natural connection between human joints, but also comprise time edges connecting the same joint points on continuous time steps, and traditional graph convolution is expanded to a time neighborhood. The constructed bone joint point space-time diagram is shown in fig. 8.
The human skeleton space-time diagram has the structure G = (V, E); the bone joint points, as the nodes of the space-time diagram, are connected through spatial edges and temporal edges. The information of the space-time diagram includes the number N of skeletal joint points, the number T of frames contained in the input video stream, and the feature matrix v_ti corresponding to each joint point. The feature matrices of all joint points in the space-time diagram can be represented as:
V = { v_ti | t = 1, 2, ..., T; i = 1, 2, ..., N }
wherein v_ti denotes the feature matrix of the i-th joint point of the t-th frame, which contains the coordinates and confidence of the joint point. The nodes in the space-time diagram are connected through spatial edges and temporal edges, which are expressed respectively as:
E_s = { (v_ti, v_tj) | (i, j) ∈ H }
E_t = { (v_ti, v_(t+1)i) }
wherein H is the set of naturally connected human joint point pairs. By constructing the human skeleton joint point space-time diagram, the trajectory information of human behavior changing over time is described.
S24, constructing a space-time graph convolution neural network, wherein the space-time graph convolution network comprises a plurality of space-time graph convolution blocks which are connected in sequence;
each space-time map convolution block comprises a space map convolution layer and a time map convolution layer which are sequentially connected; the space map convolution layer is used for performing map convolution on the input features to extract space domain features of the bone joint point space-time map; the time map convolutional layer is used for performing standard two-dimensional convolution on the input features to extract time domain features of the bone joint point space-time map;
specifically, the spatial map convolution layer is used for performing graph convolution on input features to extract spatial features of a bone joint point space-time map, and includes:
performing subset division on the neighborhood of each node in the skeleton joint point space-time diagram by adopting a distance-based division method; constructing an adjacency matrix of each node based on the divided subsets;
in a conventional convolutional neural network, the sampling function can be understood as the size of the convolution kernel, i.e., the range covered each time a convolution operation (feature extraction) is performed. For example, when a convolution operation is performed on a certain pixel, a convolution kernel of 3 × 3 actually calculates and aggregates information of the pixel and 8 adjacent pixels.
In the space-time graph convolution network, nodes play the role of the image pixels of traditional convolution, and the sampling function specifies the range of neighboring nodes involved when the graph convolution operation is performed on each node. The present application adopts a distance-based partition method to divide the neighborhood of each node in the bone joint point space-time diagram into subsets. In the present application, the neighborhood set is divided into two subsets according to first-order neighboring nodes (directly connected nodes): 1) d = 0 denotes the root node; 2) d = 1 denotes the neighborhood subset at distance 1 from the root node. Therefore, in the present invention, the number of divided subsets is K = 2 and there are two types of weight functions. The process of mapping the points in the neighborhood to the divided subsets so that they have the same label can be expressed as l_ti : B(v_ti) → {0, ..., K - 1}, and the weight function w can then be expressed as w(v_tj, v_ti) = w(l_ti(v_tj)). B(v_ti) denotes the set of neighboring nodes of the i-th joint point of the t-th frame, and l_ti denotes the subset label of a neighboring node. The connections between human skeletal joint points in a single frame can be expressed as an adjacency matrix A, and the identity matrix I represents self-connections. For the partition strategy based on joint distance, the adjacency matrix is decomposed into several matrices A_j such that
A + I = Σ_j A_j, j = 0, 1.
In the distance-based partition strategy: A_0 = I, A_1 = A.
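The single-frame skeleton graph and its distance-based partition can be sketched as follows. The 14-joint connection list H is written out from the joint set above under an assumed ordering (it is not specified in the patent), and the symmetric normalization mirrors the Λ^(-1/2) A Λ^(-1/2) form used in the spatial graph convolution below.

```python
import numpy as np

# assumed ordering of the 14 joints:
# 0 face centre, 1 neck, 2 r-shoulder, 3 l-shoulder, 4 r-elbow, 5 l-elbow,
# 6 r-wrist, 7 l-wrist, 8 r-hip, 9 l-hip, 10 r-knee, 11 l-knee, 12 r-ankle, 13 l-ankle
H = [(0, 1), (1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7),
     (1, 8), (1, 9), (8, 10), (9, 11), (10, 12), (11, 13)]
N = 14

A = np.zeros((N, N))
for i, j in H:                      # spatial edges: natural joint connections
    A[i, j] = A[j, i] = 1.0
I = np.eye(N)

# distance-based partition: A_0 = I (root), A_1 = A (distance-1 neighbours), so A + I = sum_j A_j
A_parts = [I, A]

def normalize(adj, eps=1e-6):
    # symmetric normalization Lambda^(-1/2) adj Lambda^(-1/2)
    deg = adj.sum(axis=1) + eps
    d_inv_sqrt = np.diag(deg ** -0.5)
    return d_inv_sqrt @ adj @ d_inv_sqrt

A_norm = [normalize(a) for a in A_parts]
```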
In practice, the constructed space-time graph convolution network comprises a plurality of sequentially connected space-time graph convolution blocks, for example 9 sequentially connected blocks. The first three space-time graph convolution blocks have 64 output channels, the next three have 128 output channels, and the last three have 256 output channels.
Each space-time map convolution block comprises a space map convolution layer and a time map convolution layer which are sequentially connected.
The spatial graph convolution layer performs the graph convolution operation according to the formula
f_out = Σ_j Λ_j^(-1/2) (A_j ⊗ M) Λ_j^(-1/2) f_in W_j
to extract spatial-domain features, wherein f_in denotes the input features of the spatial graph convolution layer, f_out denotes its output features, A_j is the adjacency matrix representation of the j-th subset, Λ_j is the degree matrix of the adjacency matrix of the j-th subset, W_j denotes the weight of the j-th subset, M denotes the importance mask matrix of the nodes, and ⊗ denotes bit-wise (element-wise) multiplication.
When the human body is moving, some joints often move together (such as the wrists and elbows) and may appear in various parts of the body, so the modeling of these joints should assign them different importance. Therefore, the method adds a learnable mask M to each spatial graph convolution layer, which weighs the contribution of a node's features to its neighboring nodes in the skeletal joint point space-time diagram based on importance weights learned from the information of the edges of the graph. That is, the spatial graph convolution layer includes an importance mask unit for adaptively adjusting the importance of each node to its neighboring nodes.
The importance mask unit comprises a batch normalization layer, a ReLU layer, a dropout layer, a convolution layer and a Sigmoid layer which are sequentially connected.
The batch normalization layer is used to give the importance mask matrix asymmetry; the ReLU layer performs the nonlinear transformation; the dropout layer is used to prevent overfitting; the convolution layer has a 1 × 1 convolution kernel and keeps the mask matrix consistent with the corresponding dimensions of the graph convolution layer; the Sigmoid layer maps the output into the range [0, 1].
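One plausible reading of the importance mask unit as a small PyTorch module is sketched below, purely for illustration; the class name, dropout rate and the (N, C, T, V) tensor layout are assumptions, and PyTorch itself is not mandated by the patent.

```python
import torch
import torch.nn as nn

class ImportanceMaskUnit(nn.Module):
    """BatchNorm -> ReLU -> Dropout -> 1x1 Conv -> Sigmoid, as listed above."""
    def __init__(self, channels, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm2d(channels),                      # normalization of the mask features
            nn.ReLU(inplace=True),                         # nonlinear transformation
            nn.Dropout(dropout),                           # prevents overfitting
            nn.Conv2d(channels, channels, kernel_size=1),  # 1x1 conv keeps dimensions aligned
            nn.Sigmoid(),                                  # maps the output into [0, 1]
        )

    def forward(self, x):                                  # x: (N, C, T, V) feature tensor
        return self.net(x)
```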
The temporal graph convolution layer performs a standard two-dimensional convolution on the input features to extract the time-domain features of the skeletal joint point space-time graph. The neighborhood of node v_ti is extended to include temporally connected nodes, which can be expressed as

B(v_ti) = { v_qj | d(v_tj, v_ti) ≤ K, |q − t| ≤ ⌊Γ/2⌋ },

where the parameter Γ controls the time span included in the neighborhood graph and is called the temporal kernel size. Since the time axis is ordered, the label mapping function constructed above can be modified as

l_ST(v_qj) = l_ti(v_tj) + (q − t + ⌊Γ/2⌋) × K,

where v_tj represents the feature matrix of the j-th joint point of the t-th frame and v_qj represents the feature matrix of the j-th joint point of the q-th frame.
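For illustration, the temporal graph convolution can be sketched as a standard 2-D convolution whose kernel spans Γ frames and a single joint; the default kernel size, the surrounding normalization layers and the (N, C, T, V) layout are assumptions of this sketch, not values specified by the patent.

```python
import torch.nn as nn

def temporal_conv_layer(channels, temporal_kernel_size=9):
    """Standard 2-D convolution over (time, joint) feature maps.

    Input layout assumed to be (N, C, T, V): batch, channels, frames, joints.
    """
    pad = (temporal_kernel_size - 1) // 2          # keep the number of frames unchanged
    return nn.Sequential(
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels,
                  kernel_size=(temporal_kernel_size, 1),   # Γ frames x 1 joint
                  padding=(pad, 0)),
        nn.BatchNorm2d(channels),
    )
```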
And after the space-time diagram convolutional network is constructed, training the space-time diagram convolutional network based on the behavior recognition training sample set obtained in the step S22 to obtain a trained armed personnel behavior recognition model.
In practice, the batch_size may be set to 32 and 30 epochs are trained; the loss function uses BCE loss and the optimizer uses Adadelta. The initial learning rate is set to 0.01 and is multiplied by 0.1 every 10 epochs. The accuracy of the armed personnel behavior recognition model trained by the invention on the constructed data set can reach 99.2%.
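The training schedule described above could be configured roughly as follows; this is a sketch only, the model object is a placeholder, and the full training loop of the patent is not disclosed at this level of detail.

```python
import torch
import torch.nn as nn

def make_training_setup(model):
    criterion = nn.BCELoss()                                   # BCE loss on the smoothed labels
    optimizer = torch.optim.Adadelta(model.parameters(), lr=0.01)
    # multiply the learning rate by 0.1 every 10 epochs
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    return criterion, optimizer, scheduler

BATCH_SIZE = 32
NUM_EPOCHS = 30
```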
According to the method, a training data set is constructed from the armed personnel behavior video stream data, and the labels in the data set are smoothed, which provides the training model with data that retain a certain generalization margin and prevents the model from overfitting. The armed personnel behavior recognition model is built with a space-time graph convolutional network, so that features are extracted in both the time domain and the space domain, richer deep features are obtained, and the accuracy of behavior recognition is improved.
Specifically, after the equipment detection model and the armed personnel behavior recognition model are trained, step S3 is executed: for the video stream to be identified, each frame of image is extracted and input into the trained equipment detection model to obtain the armed personnel equipment detection result of each frame of image; the armed personnel skeletal joint point data of each frame of image in the video stream to be identified are extracted and input into the trained armed personnel behavior recognition model to obtain the personnel behavior recognition result of each frame of image in the video stream.
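A high-level sketch of this two-branch, per-frame inference flow is given below for illustration; the model objects and the pose extractor are hypothetical callables standing in for components the patent does not specify at code level.

```python
def identify_frame(frame, equipment_model, pose_extractor, behavior_model):
    """Run both branches of step S3 on a single video frame (assumed interfaces)."""
    equipment_dets = equipment_model(frame)     # per-class confidences and IoU with the human ROI
    joints = pose_extractor(frame)              # skeletal joint point data for this frame
    behavior_probs = behavior_model(joints)     # per-class behavior confidences
    return equipment_dets, behavior_probs
```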
In step S3, based on the equipment detection result and the behavior recognition result, the risk coefficient of the person in each frame of image is calculated as follows:
Danger = P_i · IoU_i + P_act, where P_i denotes the confidence of the detected i-th type of equipment, IoU_i denotes the intersection-over-union between the detected i-th equipment and the human ROI, and P_act denotes the degree of risk of the behavior recognition result.
Specifically, the risk of the behavior recognition result is calculated according to the following formula:
P_act = Σ_{j=1}^{K} P_j · β_j,

where P_j denotes the confidence of the j-th behavior class, β_j denotes the risk coefficient of the j-th behavior class, and K is the number of behavior classes.
In implementation, the degree of risk of the behavior recognition result may instead be taken as the risk coefficient corresponding to the behavior class with the highest probability in the behavior recognition classification.
The specific risk coefficient of each behavior class may be set according to how dangerous that behavior is.
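The fused danger score can be computed as in the following sketch; the numeric confidences, risk coefficients and the decision threshold are illustrative assumptions only, since the patent does not fix their values.

```python
def behavior_risk(behavior_probs, risk_coeffs):
    """P_act = sum_j P_j * beta_j over the K behavior classes."""
    return sum(p * beta for p, beta in zip(behavior_probs, risk_coeffs))

def danger_score(equip_conf, equip_iou, behavior_probs, risk_coeffs):
    """Danger = P_i * IoU_i + P_act for one detected piece of equipment."""
    return equip_conf * equip_iou + behavior_risk(behavior_probs, risk_coeffs)

# hypothetical numbers: equipment detected with confidence 0.9 overlapping the
# person ROI (IoU 0.8); behavior confidences and risk coefficients are made up
score = danger_score(0.9, 0.8,
                     behavior_probs=[0.05, 0.15, 0.80],
                     risk_coeffs=[0.1, 0.5, 0.9])
THRESHOLD = 1.0                     # the preset threshold is not specified in the patent
is_armed = score > THRESHOLD
```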
By fusing the equipment features and the behavior features, the method judges armed personnel from two aspects, which finally allows armed personnel to be identified efficiently and accurately.
One embodiment of the invention discloses an armed personnel identification system integrating equipment and behavior characteristics, as shown in fig. 2, comprising the following modules:
the equipment detection model training module is used for acquiring an equipment detection data set and training an equipment detection model based on the equipment detection data set;
the behavior recognition model training module is used for acquiring armed personnel behavior video stream data and constructing an armed personnel behavior recognition training sample set based on the video stream data; and training an armed personnel behavior recognition model based on the armed personnel behavior recognition training sample set;
the armed personnel identification module is used for extracting each frame of image in the video stream to be identified and inputting the frame of image into the trained equipment detection model to obtain an equipment detection result of each frame of image; extracting the bone joint point data of each frame of image in the video stream to be identified; inputting the bone joint point data into a trained armed personnel behavior recognition model to obtain a personnel behavior recognition result of each frame of image in a video stream; and calculating the danger coefficient of the personnel in each frame of image based on the equipment detection result and the behavior recognition result, and if the danger coefficient is higher than a preset threshold value, judging the personnel to be armed personnel.
The method embodiment and the system embodiment are based on the same principle; related parts may be referenced mutually, and the same technical effects can be achieved. For the specific implementation process, reference is made to the foregoing embodiments, which are not described herein again.
Those skilled in the art will appreciate that all or part of the flow of the methods of the above embodiments may be implemented by a computer program that instructs related hardware and is stored in a computer-readable storage medium. The computer-readable storage medium may be a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (10)

1. An armed personnel identification method integrating equipment and behavior characteristics, characterized by comprising the following steps:
acquiring an equipment detection data set, and training an equipment detection model based on the equipment detection data set;
acquiring armed personnel behavior video stream data, and constructing an armed personnel behavior recognition training sample set based on the video stream data; training an armed personnel behavior recognition model based on the armed personnel behavior recognition training sample set;
extracting each frame of image in the video stream to be identified, inputting the frame of image into the trained equipment detection model, and obtaining an equipment detection result of each frame of image; extracting the bone joint point data of each frame of image in the video stream to be identified; inputting the bone joint point data into a trained armed personnel behavior recognition model to obtain a personnel behavior recognition result of each frame of image in a video stream; and calculating the danger coefficient of the personnel in each frame of image based on the equipment detection result and the behavior recognition result, and if the danger coefficient is higher than a preset threshold value, judging the personnel to be armed personnel.
2. The armed personnel identification method with fusion of equipment and behavioral characteristics according to claim 1, wherein the risk coefficient of personnel per frame of image is calculated as follows:
Danger = P_i · IoU_i + P_act, wherein P_i denotes the confidence of the detected i-th type of equipment, IoU_i denotes the intersection-over-union between the detected i-th equipment and the human ROI, and P_act denotes the degree of risk of the behavior recognition result.
3. The armed personnel identification method incorporating equipment and behavioral characteristics according to claim 2, wherein the risk of a behavioral identification result is calculated according to the following formula:
P_act = Σ_{j=1}^{K} P_j · β_j,

wherein P_j represents the confidence of the j-th behavior class, β_j represents the risk coefficient of the j-th behavior class, and K represents the number of behavior classes.
4. The armed personnel identification method fusing equipment and behavioral characteristics according to claim 1, wherein the equipment detection model is a dynamic neural network model; the dynamic neural network model comprises a first sub-network and a second sub-network, the first sub-network being used for detecting a human in an image; when the first sub-network detects that the image contains a human, a human ROI is extracted and transmitted to the second sub-network; the second sub-network is used for carrying out equipment detection by adopting classifiers at different network depths according to different resolutions of the image; and the dynamic neural network model is trained based on the equipment detection data set to obtain a trained armed personnel equipment detection model.
5. The armed personnel identification method with fusion of equipment and behavior features according to claim 1, wherein constructing an armed personnel behavior identification training sample set based on the video stream data comprises:
extracting the bone joint point data of each frame of image in the video stream data; adding a behavior label to each frame of image; and associating the behavior label with the bone joint point data to obtain an initial training sample set;
and performing label smoothing on the behavior labels in the initial training sample set to obtain a behavior recognition training sample set.
6. The armed personnel identification method integrating equipment and behavioral characteristics according to claim 5, wherein performing label smoothing on the behavioral labels in the initial training sample set to obtain a behavioral identification training sample set comprises:
performing integral smoothing on all behavior labels in the initial training sample set;
determining a behavior conversion frame in an initial training sample set, and performing intra-group behavior label smoothing on a group of images before the behavior conversion frame;
the bone joint point data comprises a confidence of a bone joint point; and performing confidence smoothing on the behavior label of each image based on the confidence of the skeletal joint point to obtain a behavior recognition training sample set.
7. The armed personnel identification method with integration of equipment and behavioral characteristics according to claim 6, wherein the following formula is used to globally smooth all behavior labels in the initial training sample set:

Label = Label × (1 − ε) + ε / K,

wherein Label represents the sample behavior label, K denotes the number of classes and ε denotes the smoothing parameter.
8. The armed personnel identification method of fusion equipment and behavioral features according to claim 6, wherein the intra-group behavioral tag smoothing of a group of images prior to a behavioral transition frame comprises:
for each behavior conversion frame, determining an active index and a target index for the group of images before the conversion according to the label value of the image of the frame immediately before the behavior conversion frame and the label value of the behavior conversion frame;

calculating the label value corresponding to the active index in the labels of the k images before the behavior conversion frame according to the formula Labels[j][active index] = maximum label value × (i − j) / k;

calculating the label value corresponding to the target index in the labels of the k images before the behavior conversion frame according to the formula Labels[j][target index] = maximum label value × (k − (i − j)) / k;

wherein j = i − k, i − (k − 1), …, i − 1, and the i-th frame is the behavior conversion frame; Labels[j][active index] represents the label value corresponding to the active index in the behavior label of the j-th frame image; Labels[j][target index] represents the label value corresponding to the target index in the behavior label of the j-th frame image; the active index is the index at which the maximum label value is located in the behavior label of the (i − 1)-th frame image, and the target index is the index at which the maximum label value is located in the behavior label of the i-th frame image.
9. The armed personnel behavior recognition method of claim 6, wherein confidence smoothing of the behavior tags of each image based on confidence of skeletal joint points comprises:
setting the confidence of the main skeletal joint point to 1 for each image frame, keeping the confidences of the other skeletal joint points unchanged, and calculating the mean value of the confidences of all skeletal joint points;
and multiplying the average value of the confidence degrees by the label value of the frame image to obtain a smooth label of the frame image based on the confidence degrees.
10. An armed personnel identification system fusing equipment and behavior characteristics, comprising the following modules:
the equipment detection model training module is used for acquiring an equipment detection data set and training an equipment detection model based on the equipment detection data set;
the behavior recognition model training module is used for acquiring armed personnel behavior video stream data and constructing an armed personnel behavior recognition training sample set based on the video stream data; and training an armed personnel behavior recognition model based on the armed personnel behavior recognition training sample set;
the armed personnel identification module is used for extracting each frame of image in the video stream to be identified and inputting the frame of image into the trained equipment detection model to obtain an equipment detection result of each frame of image; extracting the bone joint point data of each frame of image in the video stream to be identified; inputting the bone joint point data into a trained armed personnel behavior recognition model to obtain a personnel behavior recognition result of each frame of image in a video stream; and calculating the danger coefficient of the personnel in each frame of image based on the equipment detection result and the behavior recognition result, and if the danger coefficient is higher than a preset threshold value, judging the personnel to be armed personnel.
CN202210641120.4A 2022-06-08 2022-06-08 Armed personnel identification method and armed personnel identification system integrating equipment and behavior characteristics Pending CN115035599A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210641120.4A CN115035599A (en) 2022-06-08 2022-06-08 Armed personnel identification method and armed personnel identification system integrating equipment and behavior characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210641120.4A CN115035599A (en) 2022-06-08 2022-06-08 Armed personnel identification method and armed personnel identification system integrating equipment and behavior characteristics

Publications (1)

Publication Number Publication Date
CN115035599A true CN115035599A (en) 2022-09-09

Family

ID=83123698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210641120.4A Pending CN115035599A (en) 2022-06-08 2022-06-08 Armed personnel identification method and armed personnel identification system integrating equipment and behavior characteristics

Country Status (1)

Country Link
CN (1) CN115035599A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024066044A1 (en) * 2022-09-27 2024-04-04 深圳先进技术研究院 Dangerous behavior recognition method and system based on super-resolution reconstruction and related device
CN116503737A (en) * 2023-05-10 2023-07-28 中国人民解放军61646部队 Ship detection method and device based on space optical image
CN116503737B (en) * 2023-05-10 2024-01-09 中国人民解放军61646部队 Ship detection method and device based on space optical image


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination