WO2022141962A1 - Intrusion detection method and apparatus, device, storage medium, and program product - Google Patents

Intrusion detection method and apparatus, device, storage medium, and program product

Info

Publication number
WO2022141962A1
WO2022141962A1 (PCT/CN2021/087835)
Authority: WO, WIPO (PCT)
Prior art keywords: detection frame, intrusion, detection, preset, frame
Application number: PCT/CN2021/087835
Other languages: French (fr), Chinese (zh)
Inventors: 朱铖恺, 赵永磊, 武伟, 路少卿, 闫俊杰
Original Assignee: 深圳市商汤科技有限公司
Application filed by 深圳市商汤科技有限公司
Publication of WO2022141962A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition

Definitions

  • the present disclosure relates to the field of intelligent detection, and in particular, to an intrusion detection method and apparatus, device, storage medium and program product.
  • Pedestrians and non-motor vehicles often enter expressways, by mistake or intentionally, which disrupts the normal driving of vehicles on the road and poses a serious threat to traffic safety.
  • such long-tail events are characterized by a low probability of occurrence within any given period of time, while the image data collected by the camera over that period is massive.
  • deep-learning target detection methods in the related art achieve the target accuracy by increasing training data and model capacity. However, this requires substantial manpower to annotate detection frames and greatly increases the hardware cost of running the algorithm, which is an urgent problem for deploying long-tail event algorithms such as pedestrian intrusion detection.
  • embodiments of the present application provide an intrusion detection method, apparatus, device, and storage medium.
  • an embodiment of the present application provides an intrusion detection method, including: obtaining a to-be-processed image from a to-be-processed video stream; detecting objects in the to-be-processed image to obtain at least one object detection frame; determining whether a preset intrusion object exists in the object detection frame; in the case of determining that a preset intrusion object exists in the object detection frame, identifying the to-be-processed image to obtain an intrusion detection area; and determining, based on the position of the preset intrusion object and the intrusion detection area, whether an intrusion event occurs.
  • the detecting objects in the to-be-processed image to obtain at least one object detection frame includes: detecting objects in the to-be-processed image to obtain at least one object detection frame, the position of each object detection frame, and the category of the object in each object detection frame; when it is determined, based on the category of the object, that the preset intrusion object exists in any of the object detection frames, determining that the preset intrusion object exists in the to-be-processed image; and determining the position of the preset intrusion object based on the position of the object detection frame in which it exists.
  • the approach of first obtaining the object detection frames and then determining whether a preset intrusion object exists in them can effectively detect preset intrusion objects in the image to be processed, and the detection frame can also be used subsequently to determine whether the intrusion object lies within the intrusion detection area.
  • the detector can be used to accurately determine whether the intrusion object is included in the to-be-processed image and the position of the intrusion object.
  • the detecting objects in the image to be processed to obtain at least one object detection frame, the position of each object detection frame, and the category of the object in each object detection frame includes: extracting features from the image to be processed based on a deep convolutional network to obtain a first feature map; generating candidate target regions in the first feature map based on a region generation network to obtain a second feature map, where the second feature map includes at least one detection frame and the position and confidence of each detection frame; performing position-sensitive candidate-region pooling on the first feature map and the second feature map based on a pooling layer; and determining, based on the at least one detection frame and the confidence of each detection frame, a detection frame that satisfies a preset condition as an object detection frame, together with the category of the object in that frame.
  • in this way, object detection in the image to be processed can be realized, obtaining at least one object detection frame, the position of each object detection frame, and the category of the object in each object detection frame.
  • the determining, based on the at least one detection frame and the confidence of each detection frame, a detection frame that meets a preset condition as an object detection frame includes: using a non-maximum suppression algorithm, based on the confidence of each detection frame and the intersection-over-union between the detection frames, to determine a detection frame satisfying a preset condition as an object detection frame.
  • in this way, using the non-maximum suppression algorithm, a single most suitable object detection frame can ultimately be determined for each object in the image to be processed.
  • using the non-maximum suppression algorithm to determine, based on the confidence of each detection frame and the intersection-over-union between the detection frames, a detection frame satisfying a preset condition as an object detection frame includes: determining, based on the confidences, the detection frame with the highest confidence among the at least one detection frame as a target detection frame; determining the target detection frame as one of the object detection frames; computing the intersection-over-union between the target detection frame and each other detection frame, where the other detection frames are the detection frames among the at least one detection frame other than the target detection frame; deleting, from the at least one detection frame, the other detection frames whose intersection-over-union is greater than a threshold, to obtain a candidate detection frame set; removing the target detection frame from the candidate detection frame set; determining the detection frame with the highest remaining confidence as a new target detection frame; determining the new target detection frame as one of the object detection frames; and computing the intersection-over-union between the new target detection frame and each remaining detection frame, repeating until no detection frames remain.
  • in this way, the non-maximum suppression algorithm suppresses detection frames whose intersection-over-union exceeds the threshold, so that the detection frames satisfying the preset condition are determined as object detection frames and each object in the image to be processed is ultimately assigned a single best detection frame.
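  • The iterative procedure above can be sketched in a few lines of Python. The box format (x1, y1, x2, y2), the scores, and the 0.5 IoU threshold below are illustrative assumptions, not values from the application.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-confidence frame, drop frames overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-confidence remaining detection frame
        keep.append(best)
        # suppress the "other detection frames" whose IoU exceeds the threshold
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

  • For example, with boxes [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)] and scores [0.9, 0.8, 0.7], the second box overlaps the first with IoU 0.81 and is suppressed, so the kept indices are [0, 2].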
  • the determining whether the preset intrusion object exists in the object detection frame includes: using a first-level classifier to perform a first classification, based on the category of the object, on the object detection frames corresponding to the target category, to obtain a first classification result; using a second-level classifier cascaded with the first-level classifier to perform, based on the first classification result, a second classification on the object detection frames that meet the preset condition, to obtain a second classification result; and, when it is determined based on the second classification result that the preset intrusion object exists in any of the object detection frames, determining that the preset intrusion object exists in the object detection frame.
  • the first classification can be regarded as a preliminary judgment
  • the second classification can be regarded as a re-judgment. This two-stage approach of a preliminary judgment followed by a re-judgment can effectively improve classification efficiency and reduce the misjudgment rate.
  • the first-level classifier and the second-level classifier have the following relationship: the classification accuracy of the first-level classifier is lower than that of the second-level classifier; the number of convolutional layers in the first-level classifier is less than that in the second-level classifier; and the confidence of the first-level classifier is lower than that of the second-level classifier.
  • using a classifier composed of the first-level classifier and the second-level classifier can effectively improve classification efficiency and confidence and reduce the misjudgment rate while maintaining classification accuracy.
  • the identifying the to-be-processed image to obtain the intrusion detection area includes: using a convolutional neural network model to perform semantic segmentation on the to-be-processed image to obtain the intrusion detection area.
  • the convolutional neural network is used for semantic segmentation of the image to be processed, which automatically identifies the intrusion detection area without manual labeling of the area, making large-scale online application convenient.
  • the determining whether an intrusion event occurs based on the position of the preset intrusion object and the intrusion detection area includes: judging, based on the position of the preset intrusion object, whether the preset intrusion object is located in the intrusion detection area; in response to the preset intrusion object being located in the intrusion detection area, determining that an intrusion event occurs; or, in response to the preset intrusion object being located outside the intrusion detection area, determining that no intrusion event occurs.
  • determining whether the preset intrusion object is located in the intrusion detection area based on the position of the intrusion object can effectively improve the accuracy of that determination.
  • the determining whether an intrusion event occurs based on the position of the preset intrusion object and the intrusion detection area includes: determining the object detection frame in which the preset intrusion object exists as a target detection frame; determining the center point of the bottom edge of the target detection frame as the position of the preset intrusion object; and determining, based on the relative positional relationship between that center point and the intrusion detection area, whether an intrusion event occurs.
  • the method further comprises: in response to the occurrence of the intrusion event, outputting an alarm identification.
  • the intrusion object can be quickly guided to leave the intrusion detection area according to the alarm identification, thereby effectively preventing the intrusion object from entering the intrusion detection area.
  • the method further includes: in response to the occurrence of the intrusion event, recording the intrusion event based on the category of the intrusion object and the intrusion detection area to obtain an intrusion record, and storing the intrusion record or sending it to an associated terminal.
  • intrusion events can be recorded, high-frequency intrusion locations can be identified from the intrusion records, and preventive measures can be strengthened accordingly.
  • an embodiment of the present application provides an intrusion detection device, including: an obtaining module, configured to obtain the to-be-processed image from a to-be-processed video stream; a detection module, configured to detect objects in the to-be-processed image to obtain at least one object detection frame; a first determination module, configured to determine whether a preset intrusion object exists in the object detection frame; an identification module, configured to identify, in the case of determining that a preset intrusion object exists in the object detection frame, the to-be-processed image to obtain an intrusion detection area; and a second determination module, configured to determine, based on the position of the preset intrusion object and the intrusion detection area, whether an intrusion event occurs.
  • an embodiment of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program that can be run on the processor, and the processor, when executing the program, implements the above intrusion detection method.
  • an embodiment of the present application provides a storage medium storing executable instructions which, when executed by a processor, implement the above intrusion detection method.
  • an embodiment of the present application provides a computer program product, including one or more instructions suitable for being loaded and executed by a processor to implement the above intrusion detection method.
  • the object in the image to be processed is detected to obtain at least one object detection frame, and then it is determined whether a preset intrusion object exists in the object detection frame.
  • the provided detection model and classification model are decoupled, and the classification model can be customized for special scenarios during the implementation of the algorithm to quickly achieve the expected performance.
  • since the detection model and the classification model are decoupled, optimizing for false positives in a new scenario only requires adding false-positive data to train a new classifier, which can then be cascaded with the existing detection model; this suits the rapid upgrade iterations of algorithm deployment. Filtering false positives in a cascaded manner can greatly improve the detection accuracy of long-tail events.
  • FIG. 1 is a schematic flowchart of the implementation of an intrusion detection method provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a cascaded classification model provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a semantic segmentation model provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a detection model provided by an embodiment of the present application.
  • FIG. 5 is a display diagram of a smart transportation platform provided by an embodiment of the present application.
  • FIG. 6 is a pedestrian/vehicle intrusion diagram provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an implementation flowchart of an intrusion detection method provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an intrusion detection device according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a hardware entity of a computer device according to an embodiment of the present application.
  • the embodiment of the present application proposes an intrusion detection method to be applied to a computer device.
  • the computer device may include a removable device or a non-removable device.
  • the functions implemented by the method may be implemented by calling a program code by a processor in the computer device.
  • the program code can be stored in a computer storage medium, and it can be seen that the computer device includes at least a processor and a storage medium.
  • Step S101 obtaining the to-be-processed image from the to-be-processed video stream
  • the video stream acquired by an image acquisition device can be used as input, and the image to be processed is acquired from the video stream. Because of the length of the acquisition period, the acquired video stream data is massive in most cases.
  • the image capturing device may be a camera.
  • the current road image acquisition system can be reused, effectively avoiding the limitations of dedicated hardware; the image to be processed can also be obtained by means of timed snapshots, so that pedestrians and non-motor vehicles entering the expressway can be identified and warned, assisting the traffic police in maintaining expressway order and improving road-network safety.
  • Step S102 detecting objects in the to-be-processed image to obtain at least one object detection frame
  • an object detection model may be used to detect objects in the image to be processed to obtain at least one object detection frame.
  • the target detection model can be trained based on one of Faster R-CNN (Faster Regions with Convolutional Neural Network), YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), or similar networks.
  • the two-step target detection methods represented by Faster R-CNN have the advantage of high detection accuracy but the disadvantage of slow detection speed; the single-step target detection methods represented by YOLO and SSD have the advantage of being faster than the two-step methods.
  • the input of any of the above three types of object detection models may be images to be processed, and at least one object detection frame is output after processing.
  • Step S103 determining whether the preset intrusion object exists in the object detection frame
  • a cascaded classifier model can be used to determine whether a preset intrusion object exists in the object detection frame.
  • the cascaded classifier model may include multiple levels of classifiers, each level completing its own classification task. In this way, the classification result determined by the cascaded classifier model is more accurate than that determined by a single-level classifier model, and classification efficiency is effectively improved.
  • the preset intrusion object may be a pedestrian or a non-motor vehicle.
  • Step S104 in the case of determining that a preset intrusion object exists in the object detection frame, identify the to-be-processed image to obtain an intrusion detection area;
  • the intrusion detection method provided by the embodiments of the present application can be applied to identifying pedestrians or non-motor vehicles that inadvertently or intentionally enter expressways, and can also be applied to other long-tail events such as detecting lost children at kindergarten entrances, people falling into lakes or other waters, or prison breaks. Since such long-tail events have a low probability of occurring within any given period, the image data collected by the camera is massive; if the target area were identified for every image, the computing-power requirement on the system would be very high.
  • the system only recognizes the to-be-processed images that are judged to have intrusion objects, and the recognition method can use a semantic segmentation model.
  • the intrusion detection method provided by this embodiment of the present application is applied to the pedestrian/non-motor vehicle accidental intrusion occurring on a high-speed road as an example for description.
  • Step S105 Determine whether an intrusion event occurs based on the preset position of the intrusion object and the intrusion detection area.
  • the position of the identified preset intrusion object and the intrusion detection area obtained in step S104 are processed to determine whether there is a preset intrusion object in the intrusion detection area.
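  • Taken together, steps S101 to S105 form a pipeline. The sketch below is illustrative: the detector, cascade classifier, segmenter, and area test are passed in as placeholder callables, and all names are hypothetical stand-ins rather than the patent's actual models.

```python
def detect_intrusions(frames, detect_objects, is_intrusion_object,
                      segment_area, in_area):
    """Yield (frame_index, box) for every detected intrusion event."""
    for idx, frame in enumerate(frames):                # S101: images from the video stream
        boxes = detect_objects(frame)                   # S102: object detection frames
        intruders = [b for b in boxes
                     if is_intrusion_object(frame, b)]  # S103: classify each frame
        if not intruders:
            continue                                    # no intruder: skip segmentation
        area = segment_area(frame)                      # S104: intrusion detection area
        for box in intruders:                           # S105: position vs. detection area
            if in_area(box, area):
                yield idx, box
```

  • Because segmentation runs only when step S103 finds a candidate intruder, most frames of the massive video stream never reach the expensive segmentation stage, which reflects the computing-power saving described in the embodiments.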
  • the object in the image to be processed is detected to obtain at least one object detection frame, and then it is determined whether a preset intrusion object exists in the object detection frame.
  • the provided detection model and classification model are decoupled, and the classification model can be customized for special scenarios during the implementation of the algorithm to quickly achieve the expected performance.
  • since the detection model and the classification model are decoupled, optimizing for false positives in a new scenario only requires adding false-positive data to train a new classifier, which can then be cascaded with the existing detection model; this suits rapid upgrade iterations during algorithm deployment, and filtering false positives in a cascaded manner can greatly improve the detection accuracy of long-tail events.
  • the cascaded classification model includes a first-level classifier 220 and a second-level classifier 230, wherein:
  • the first-level classifier 220 includes a first residual network 221 and a first fully connected layer 222, wherein the first residual network 221 performs feature extraction on the image content in the input object detection frame 210 to obtain a feature map P1;
  • the first fully connected layer 222 performs the first classification based on the feature map P1 to obtain a first classification result; based on the first classification result, the object detection frames that do not meet the requirements are filtered out.
  • the second-level classifier 230 includes a second residual network 231 and a second fully connected layer 232, wherein the second residual network 231 performs feature extraction on the image content in the object detection frame when the first classification result satisfies the condition, to obtain a feature map P2; the second fully connected layer 232 performs the second classification based on the feature map P2 to obtain a second classification result; based on the second classification result, object detection frames that do not meet the requirements are filtered out.
  • the second classification result 240 represents the classification result of the object detection frame after being classified by the cascade model.
  • the first residual network 221 in the first-level classifier 220 can use a classification model with a fast classification speed, such as the residual network ResNet18, which can filter out most of the negative samples;
  • the second residual network 231 in the second-level classifier 230 can use a slower but more accurate classification model, such as the residual network ResNet50, to improve accuracy; since it runs only on the candidates that survive the first stage, the overall speed is not much slower while the accuracy improves considerably.
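  • The two-stage filtering can be sketched as below. The scoring callables stand in for the ResNet18- and ResNet50-based classifiers, and the thresholds 0.3 and 0.7 are illustrative assumptions, not values from the application.

```python
def cascade_classify(crops, fast_score, accurate_score, t1=0.3, t2=0.7):
    """Return indices of crops judged to contain a preset intrusion object."""
    # Stage 1 (fast, e.g. a ResNet18-sized model): cheap preliminary judgment
    # that filters out most negative samples.
    survivors = [i for i, c in enumerate(crops) if fast_score(c) >= t1]
    # Stage 2 (slower but more accurate, e.g. a ResNet50-sized model):
    # re-judges only the survivors, so overall speed stays high.
    return [i for i in survivors if accurate_score(crops[i]) >= t2]

# Toy scorers standing in for the two networks:
crops = ["person", "shadow", "bicycle", "lamp_post"]
fast = {"person": 0.9, "shadow": 0.4, "bicycle": 0.8, "lamp_post": 0.1}.get
accurate = {"person": 0.95, "shadow": 0.2, "bicycle": 0.85, "lamp_post": 0.05}.get
result = cascade_classify(crops, fast, accurate)  # stage 1 drops lamp_post, stage 2 drops shadow
```

  • With these toy scorers, only the person and bicycle crops survive both stages, illustrating how the coarse judgment discards most negatives cheaply while the fine judgment keeps the misjudgment rate low.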
  • Step S201 obtaining the to-be-processed image from the to-be-processed video stream
  • Step S202 detecting the object in the to-be-processed image to obtain at least one object detection frame, the position of each of the object detection frames, and the category of the object in each of the object detection frames;
  • the detection model input is a high-speed road image
  • the output is an object detection frame.
  • the objects can be workers, pedestrians, animals, cars, motorcycles, electric bicycles, and the like.
  • in the feature extraction stage, a deep convolutional network is used to extract features from the expressway image, and a region generation network is used to extract candidate object detection frames;
  • in the detection stage, position-sensitive candidate-region pooling, i.e., category classification and coordinate regression, is performed on the candidate frame features obtained in the feature extraction stage, yielding the position of each object detection frame and the category of the object in each object detection frame.
  • Step S203 when it is determined that the preset intrusion object exists in any of the object detection frames based on the category of the object, determine that the preset intrusion object exists in the to-be-processed image;
  • the detection result includes, but is not limited to, at least one of the following: workers, pedestrians, animals, cars, motorcycles, electric bicycles, and the like on the high-speed road.
  • objects classified as pedestrians or non-motor vehicles are preset intrusion objects; that is, when a pedestrian or non-motor vehicle is present in any object detection frame, it is determined that a preset intrusion object exists in the image to be processed.
  • Step S204 determining the position of the preset intrusion object based on the position of the object detection frame where the preset intrusion object exists;
  • the position of the object detection frame where there is a preset intrusion object may be represented by position coordinates. Based on the position coordinates of the object detection frame, the position of the intrusion object in the image to be processed can be determined.
  • Step S205 using a first-level classifier to perform a first classification on the object detection frame corresponding to the target category based on the category of the object, to obtain a first classification result;
  • a cascaded classification model as shown in FIG. 2 may be used, and the first-level classifier 220 may include a first residual network 221 and a first fully connected layer 222 .
  • the first residual network 221 can use the ResNet18 network
  • 18 represents the depth of the network, i.e., 18 layers with weights, including convolutional layers and fully connected layers but excluding pooling layers and batch normalization (Batch Normalization, BN) layers.
  • the ResNet18 network performs feature extraction on the object detection frame to obtain a feature map; the first fully connected layer 222 then performs the first classification based on the feature map to obtain the first classification result, i.e., object detection frames that do not meet the requirements are filtered out in a first filtering pass.
  • the first-level classifier 220 completes the initial judgment of the image, which may also be called coarse judgment; coarse judgment is efficient but has a relatively high misjudgment rate.
  • Step S206 using the second-level classifier cascaded with the first-level classifier, and based on the first classification result, perform a second classification on the object detection frame that meets the preset condition, and obtain a second classification result;
  • the first-level classifier and the second-level classifier have the following relationship: the classification accuracy of the first-level classifier is lower than that of the second-level classifier; the number of convolutional layers in the first-level classifier is less than that in the second-level classifier; and the confidence of the first-level classifier is lower than that of the second-level classifier.
  • the object detection frame is classified using the second-level classifier 230, wherein the second-level classifier 230 includes a second residual network 231 and the second fully connected layer 232.
  • the second residual network 231 can use the ResNet50 network.
  • the ResNet50 network performs feature extraction on the object detection frame to obtain a feature map; the second fully connected layer 232 performs the second classification based on the feature map to obtain the second classification result, i.e., object detection frames that do not meet the requirements are filtered out in a second filtering pass.
  • What the second-level classifier 230 completes is the re-judgment of the object detection frame, which may also be called fine judgment.
  • the fine judgment is characterized by high classification accuracy and low misjudgment rate.
  • Step S207 when it is determined that the preset intrusion object exists in any of the object detection frames based on the second classification result, determine that the preset intrusion object exists in the object detection frame;
  • Step S208 in the case of determining that there is a preset intrusion object in the object detection frame, identify the to-be-processed image to obtain an intrusion detection area;
  • Step S209 determining the object detection frame in which the preset intrusion object exists as the target detection frame
  • Step S210 determining the center point of the bottom edge of the target detection frame as the preset position of the intrusion object
  • the center point of the bottom edge of the target detection frame corresponds to a position coordinate
  • the position coordinate is determined as the position of the intrusion object.
  • alternatively, the position coordinate corresponding to the center point of any edge of the target detection frame may be determined as the position of the intrusion object.
  • Step S211 based on the relative positional relationship between the center point of the bottom edge of the target detection frame and the intrusion detection area, determine whether an intrusion event occurs.
  • the position coordinates corresponding to the center point of the bottom edge of the target detection frame may be compared with the position coordinates of the intrusion detection area: if the coordinates fall within the intrusion area, it is determined that an intrusion event has occurred; if they do not, it is determined that no intrusion event has occurred.
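  • A minimal sketch of that comparison, assuming the intrusion detection area is given as a binary mask (1 = expressway, 0 = elsewhere) from the segmentation step; the coordinates and the toy mask below are illustrative.

```python
def bottom_center(box):
    """Bottom-edge center of an (x1, y1, x2, y2) box, with y growing downward."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) // 2, y2)

def is_intrusion(box, road_mask):
    """True if the target detection frame's bottom-center lies in the road area."""
    x, y = bottom_center(box)
    if 0 <= y < len(road_mask) and 0 <= x < len(road_mask[0]):
        return road_mask[y][x] == 1
    return False

# Toy 4x4 mask: top two rows are off-road (0), bottom two rows are road (1).
mask = [[0, 0, 0, 0],
        [0, 0, 0, 0],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]
```

  • With this mask, a box whose bottom edge reaches row 2, such as (0, 0, 2, 2), counts as an intrusion, while one whose bottom edge stops at row 1 does not.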
  • the classified object detection frames are obtained, and only the image regions cropped from the matching object detection frames are input into the cascade classifier. In this way, not all images to be processed need to be identified, which significantly reduces the computing-power requirement on the hardware device.
  • the to-be-processed image is identified to obtain the intrusion detection area.
  • the cascade classifier is used for classification in two steps.
  • the first-level classification can be regarded as a preliminary judgment
  • the second-level classification can be regarded as a secondary judgment.
  • this two-stage classification method of a preliminary judgment followed by a secondary judgment can effectively improve classification efficiency and reduce the misjudgment rate.
  • filtering false positives in a cascaded manner can greatly improve the detection accuracy of long-tail events.
  • the semantic segmentation model includes a multi-layer convolution network 302, a multi-layer deconvolution network 303, and an image 304 on which semantic segmentation has been completed, wherein:
  • the multi-layer convolutional network 302, when the number of layers is 5, is a 5-layer convolutional network, which is used to downsample the image to be processed by a factor of 32 while encoding the image to be processed.
  • the multi-layer deconvolution network 303, when the number of layers is 4, is a 4-layer deconvolution network, which is used to upsample the encoding result by a factor of 32 and to decode and semantically understand the encoding result.
  • the image to be processed 301 is input into the convolutional neural network model to obtain the semantically segmented image 304, that is, the expressway area (gray, labeled 1) and the non-expressway area (black, labeled 0) are obtained.
  • marking the expressway area as gray and the non-expressway area as black achieves a visualization effect; marking the expressway area as 1 and the non-expressway area as 0 makes it possible to use the different labels of the areas to quickly identify the location of the intruding object.
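The label-to-color mapping used for visualization can be sketched as follows; the grayscale values (128 for gray, 0 for black) and the function name are illustrative assumptions, only the label convention (1 = expressway, 0 = non-expressway) comes from the text:

```python
import numpy as np

GRAY, BLACK = 128, 0  # assumed display values, not mandated by the method

def labels_to_visualization(label_map):
    """Map segmentation labels to grayscale pixels for display:
    expressway area (label 1) -> gray, non-expressway area (label 0) -> black."""
    return np.where(label_map == 1, GRAY, BLACK).astype(np.uint8)
```

The same integer label map is what the later point-in-area lookup uses; the color image is purely for human inspection.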
  • the detection model includes: a deep convolutional network 402, a region generation network (Region Proposal Network, RPN) 403, a position-sensitive candidate region pooling layer (Position Sensitive Regions of Interest Pooling, PSROIPooling) 404, bounding box regression result 405, and classification result 406.
  • the deep convolutional network 402 is used to perform feature extraction on the to-be-processed image 301 (which is the same image as the to-be-processed image 301 shown in FIG. 3 ) to obtain a first feature map.
  • the region generation network (Region Proposal Network, RPN) 403 is used to generate candidate target regions (object detection frames) on the first feature map to obtain a second feature map; the second feature map includes at least one detection frame and the position and confidence of each detection frame.
  • the position-sensitive candidate region pooling (Position Sensitive Regions of Interest Pooling, PSROIPooling) layer 404 is used to perform position-sensitive candidate region pooling on the simultaneously input first feature map and at least one object detection frame to obtain a frame regression result 405 and a classification result 406. In this embodiment, the classification result 406 realizes the prediction of the detection result, where the detection result includes but is not limited to at least one of the following: staff, pedestrians, animals, vehicles on the expressway, motorcycles, electric bicycles, etc.; the frame regression result 405 predicts the precise coordinates of the detection frame corresponding to the detection result.
  • At least one object detection frame cropped from the to-be-processed image is obtained, and the position, confidence level and category of the object in the object detection frame are determined for each of the object detection frames.
  • Step S401 obtaining the to-be-processed image from the to-be-processed video stream
  • Step S402 performing feature extraction on the to-be-processed image based on a deep convolutional network to obtain a first feature map
  • a detector based on Faster-RCNN (Faster Region-based Convolutional Neural Network) can be used to detect the image to be processed; the input of the Faster-RCNN network is the image to be processed, and at least one object detection frame is output after processing;
  • the first stage uses the deep convolution network 402 to perform feature extraction, wherein the deep convolution network 402 includes: vector convolution operation 1 (conv1), vector convolution operation 2 (conv2), dense vector convolution operation 3 (dense conv3) and dense vector convolution operation 4 (dense conv4); these four deep convolutional layers are used to perform feature extraction on the image.
  • Step S403 based on the region generation network, generate candidate target regions in the first feature map to obtain a second feature map;
  • the second feature map includes at least one detection frame and the position and confidence of each detection frame;
  • candidate target regions are generated in the first feature map to obtain the second feature map.
  • the second feature map includes at least one detection frame, the position of each detection frame, and the confidence level of the detected target.
  • Step S404 in the process of performing position-sensitive candidate region pooling on the first feature map and the second feature map based on the pooling layer, determine, based on the at least one detection frame and the confidence of each detection frame, the detection frame that satisfies the preset condition as the object detection frame, and determine the category of the object in the object detection frame;
  • the first feature map and the second feature map are subjected to position-sensitive candidate region pooling, that is, the first feature map and the second feature map are simultaneously input into the position-sensitive candidate region pooling layer 404 to obtain the frame regression result 405 and the classification result 406; from these, the confidence of the detected target and the position of the detection frame are obtained, the detection frame satisfying the preset condition is determined as the object detection frame, and the category of the object in the object detection frame is determined.
  • Step S405 when it is determined that the preset intrusion object exists in any of the object detection frames based on the category of the object, determine that the preset intrusion object exists in the to-be-processed image;
  • Step S406 determining the position of the preset intrusion object based on the position of the object detection frame where the preset intrusion object exists;
  • Step S407 determining whether the preset intrusion object exists in the object detection frame
  • Step S408 In the case of determining that a preset intrusion object exists in the object detection frame, use a convolutional neural network model to perform semantic segmentation on the to-be-processed image to obtain the intrusion detection area;
  • the convolutional neural network model shown in FIG. 3 includes a multi-layer convolution network 302 and a multi-layer deconvolution network 303 .
  • the multi-layer convolution network 302, when the number of layers is 5, is a 5-layer convolution network, which is used to downsample the image to be processed by a factor of 32 and to encode the image to be processed at the same time;
  • the multi-layer deconvolution network 303, when the number of layers is 4, is a 4-layer deconvolution network, which is used to upsample the encoding result by a factor of 32 and to decode and semantically understand the encoding result.
  • inputting the to-be-processed image 301 into the convolutional neural network model yields the expressway area (gray, labeled 1) and the non-expressway area (black, labeled 0).
  • marking the expressway area as gray and the non-expressway area as black achieves a visualization effect; marking the expressway area as 1 and the non-expressway area as 0 makes it possible to use the different labels of the areas to quickly identify the location of the intruding object.
  • Step S409 judging whether the preset intrusion object is located in the intrusion detection area based on the position of the preset intrusion object;
  • the position of the intrusion object may correspond to a set of position coordinates.
  • a set of position coordinates of the object detection frame is used as the position coordinates of the intrusion object, and the coordinate that best represents the position of the intrusion object is selected from the set of position coordinates as the position of the intrusion object.
  • the position coordinates of the object are compared with the position coordinates of the intrusion detection area to determine whether the intrusion object is located in the intrusion detection area.
  • Step S410 determining that the intrusion event occurs in response to the preset intrusion object being located within the intrusion detection area; or determining that the intrusion event has not occurred in response to the preset intrusion object being located outside the intrusion detection area.
  • the first feature map is obtained based on the deep convolutional network; the second feature map is obtained based on the region generation network; in the pooling process, based on the confidence of at least one detection frame and each detection frame, a detection frame that satisfies a preset condition is determined as an object detection frame, and the category of the object in the object detection frame is determined. In this way, the obtained object detection frame is a detection frame containing a preset intrusion object. If no object detection frame containing an intrusion object is detected, subsequent processing is not required, which can effectively improve the detection efficiency of long-tail events.
  • in the presence of a preset intrusion object, a convolutional neural network model is used to semantically segment the image to be processed, obtaining an intrusion detection area and a non-intrusion detection area distinguished by colors and labels.
  • marking the intrusion detection areas with different colors achieves a visualization effect; marking the areas with different labels makes it possible to quickly identify the location of the intrusion object.
  • determining whether the preset intrusion object is located in the intrusion detection area based on the position of the preset intrusion object can effectively improve the accuracy of that determination.
  • Step S421 obtaining the to-be-processed image from the to-be-processed video stream
  • Step S422 performing feature extraction on the to-be-processed image based on a deep convolutional network to obtain a first feature map
  • Step S423 based on the region generation network, generate candidate target regions in the first feature map to obtain a second feature map;
  • the second feature map includes at least one detection frame and the position and confidence of each detection frame;
  • Step S424 in the process of performing position-sensitive candidate region pooling on the first feature map and the second feature map based on the pooling layer, a non-maximum suppression algorithm is used to determine, based on the confidence of each detection frame and the intersection ratio between the detection frames in the at least one detection frame, the detection frame that meets the preset condition as the object detection frame.
  • This includes: based on the confidence of each detection frame, determining the detection frame with the highest confidence among the at least one detection frame as a target detection frame, and determining the target detection frame as one of the object detection frames;
  • determining the intersection ratio between the target detection frame and each other detection frame, where the other detection frames refer to the detection frames among the at least one detection frame other than the target detection frame; deleting the other detection frames whose intersection ratio is greater than the threshold from the at least one detection frame to obtain a candidate detection frame set;
  • determining the detection frame with the highest confidence in the candidate detection frame set, excluding the target detection frame, as a new target detection frame, and determining the new target detection frame as one of the object detection frames; determining the intersection ratio between the new target detection frame and each new other detection frame,
  • where each new other detection frame refers to a detection frame in the candidate detection frame set other than the new target detection frame; deleting the new other detection frames whose intersection ratio is greater than the threshold from the candidate detection frame set to obtain a new candidate detection frame set; and so on, until the object detection frames are obtained.
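The iterative procedure above is standard greedy non-maximum suppression; a minimal sketch follows. The box format (x1, y1, x2, y2) and the 0.5 threshold are illustrative assumptions:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union (intersection ratio) of boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS as in step S424: repeatedly keep the highest-confidence
    frame as the target detection frame and delete the remaining frames
    whose IoU with it exceeds the threshold."""
    order = np.argsort(scores)[::-1].tolist()  # indices, highest confidence first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```

Each pass of the loop corresponds to one "new target detection frame / new candidate detection frame set" iteration described in the steps above.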
  • Step S425 in the case that the preset intrusion object exists in any of the object detection frames based on the category of the object, determine that the preset intrusion object exists in the to-be-processed image;
  • Step S427 determining whether the preset intrusion object exists in the object detection frame
  • Step S429 Determine whether an intrusion event occurs based on the position of the preset intrusion object and the intrusion detection area.
  • Step S430 outputting an alarm identifier in response to the occurrence of the intrusion event
  • Step S431 in response to the occurrence of the intrusion event, record the intrusion event based on the category of the intrusion object and the intrusion detection area to obtain an intrusion record;
  • Step S432 Store the intrusion record or send it to an associated terminal.
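Steps S430-S432 (alarm, record, store/send) could be sketched as below. The record fields and JSON serialization are hypothetical choices for illustration; the method itself only requires that the category and intrusion detection area be recorded:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class IntrusionRecord:
    """Hypothetical intrusion record; field names are illustrative."""
    category: str   # e.g. "pedestrian" or "non-motor vehicle"
    area: str       # identifier of the intrusion detection area
    timestamp: str

def make_intrusion_record(category, area, now=None):
    """Record the intrusion event based on the intrusion object's category
    and the intrusion detection area (step S431)."""
    now = now or datetime.now().isoformat(timespec="seconds")
    return IntrusionRecord(category=category, area=area, timestamp=now)

def serialize_record(record):
    """Serialize the record for storage or for sending to an associated
    terminal (step S432)."""
    return json.dumps(asdict(record))
```

Such records are what make the later analysis of high-frequency intrusion points possible.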
  • the confidence level of the detected object and the position of the detection frame are obtained first. Then, a non-maximum suppression algorithm is used to merge the detection frames whose intersection ratio is greater than the threshold, and the detection frame that meets the preset condition is determined as the object detection frame. In this way, using the non-maximum suppression algorithm, a most suitable object detection frame can finally be determined for each object in the image to be processed.
  • an alarm flag is output; the intrusion event is recorded based on the category of the intrusion object and the intrusion detection area, and the intrusion record is obtained; the intrusion record is stored or sent to the associated terminal.
  • the intrusion object can be quickly guided to leave the intrusion detection area according to the alarm identification, thereby effectively preventing the intrusion object from entering the intrusion detection area.
  • Intrusion events can also be recorded, and high-frequency points of intrusion by intrusion objects can be found according to the intrusion records, and preventive measures can be strengthened.
  • Pedestrians/non-motor vehicles often enter by mistake or intentionally on expressways, which affects the normal driving of vehicles on the road and has a great impact on traffic safety.
  • Video patrol needs to conduct real-time, active detection of pedestrians/non-motor vehicles on the road.
  • If pedestrians/non-motor vehicles are found within the driving range of the expressway, relevant early warnings should be issued in time, and the traffic police department should be notified to respond promptly and guide and urge the pedestrians/non-motor vehicles to leave the driving area of the expressway, eliminating hidden dangers of road driving and improving the road driving safety index.
  • the early video patrol system mainly relied on human judges polling the video from the image acquisition equipment to detect pedestrians who strayed onto the expressway, and taking corresponding measures.
  • Although this scheme can detect pedestrians entering by mistake, the research-and-judgment efficiency is low, omissions are likely to occur, and the real-time performance of polling is poor.
  • the target detection algorithm has been greatly improved. It is used to pre-screen pedestrians appearing in images and videos, which greatly improves the work efficiency of judges.
  • data-driven target detection solutions based on deep learning have been proposed, further improving the accuracy and recall rate of pedestrian intrusion detection. How the algorithm accuracy can reach or even surpass manual judgment has become a research hotspot.
  • pedestrian intrusion is not a common event on expressways, which places high requirements on the accuracy of detection algorithms; for example, 99% accuracy means that only one false positive is allowed within a hundred incidents.
  • Although the target detection method based on deep learning can theoretically achieve the target accuracy by increasing the training data and model capacity, it requires a lot of manpower to label the detection frames and greatly increases the hardware cost of running the algorithm, which is an urgent problem to be solved in deploying pedestrian intrusion detection algorithms.
  • pedestrian intrusion detection requires a prohibited-intrusion area to be set, and methods based on manually demarcating the area entail a large amount of redundant operation and maintenance work in large-scale applications.
  • Fig. 5 is a display diagram of an intelligent transportation platform provided by an embodiment of the present application. As shown in Fig. 5, an intelligent transportation platform 501 is used to display the intrusion images of pedestrians/non-motor vehicles identified on the Kuaigao Expressway using the intrusion detection method.
  • FIG. 6 is a pedestrian/vehicle intrusion diagram provided by an embodiment of the present application.
  • the pedestrian/vehicle intrusion view 601 is displayed after clicking the pedestrian/non-motor vehicle intrusion image shown in FIG. 5, showing a magnified image of the pedestrian/non-motor vehicle intrusion and image details, such as the time and location of the intrusion.
  • FIG. 7 is a schematic diagram of the implementation flow of an intrusion detection method provided by an embodiment of the present application.
  • As shown in FIG. 7, the time axis involves six moments, from T1 to T6 in sequence, and the workflow is described as follows:
  • Step S700 input the image to be processed into the detector at time T1 to obtain at least one candidate pedestrian detection frame
  • the image to be processed may be an original image or an image after preprocessing the original image.
  • the processing process of step S700 is divided into a feature extraction stage and a detection stage:
  • In the feature extraction stage, the deep convolution network 402 shown in FIG. 4 is used, including vector convolution operation 1 (conv1), vector convolution operation 2 (conv2), dense vector convolution operation 3 (dense conv3) and dense vector convolution operation 4 (dense conv4).
  • the above four depth convolutional layers can be used to perform feature extraction on the image to be processed to obtain a first feature map, and based on the region generation network 403, a candidate target region is generated in the first feature map to obtain a second feature map.
  • the second feature map includes at least one detection frame, the position of each detection frame, and the confidence level of the detected target.
  • the first feature map and the second feature map are subjected to position-sensitive candidate region pooling, that is, they are simultaneously input into the position-sensitive candidate region pooling layer 404 to obtain the frame regression result 405 and the classification result 406; after this processing, the confidence of the detected target and the position of the detection frame can be obtained, the detection frame that meets the preset conditions is determined as the object detection frame, and the category of the object in the object detection frame is determined.
  • the detection frames whose intersection ratio is greater than the threshold are merged, and the object detection frame (candidate pedestrian detection frame) that meets the requirements is output.
  • Step S701 cutting out each candidate pedestrian detection frame at time T2, inputting it to the cascade classifier, and obtaining a classification result;
  • the cascade classifier can be obtained through training, for example, in the training phase: collect 300,000 small pictures of detection alarms, including 60,000 positive samples and 240,000 negative samples.
  • the pedestrian/non-pedestrian binary classification is performed on these data, and then the stochastic gradient descent algorithm is used to train the ResNet18 network and the ResNet50 network respectively.
  • the first residual network 211 can use the ResNet18 network to realize the rough judgment of the image, which has the characteristics of high judgment efficiency and high error rate;
  • the second residual network 231 can use the ResNet50 network to realize fine judgment of the image, with a low misjudgment rate.
  • the classification improves the accuracy step by step, and finally obtains high-precision pedestrian detection results.
  • Step S702 at time T3, determine whether the cascaded classification model has acquired a valid image
  • with the cascade classification model, it can be judged whether the required valid image exists in the image to be processed. For example, to judge pedestrians prohibited from intruding on an expressway, at time T3 it can be determined whether the cascade classification model has obtained an image of an intruding pedestrian. When no valid image is obtained, there is no need to identify the target area (expressway area) of the image to be processed, and the process ends. In this way, the target area does not need to be identified for every input original image, which significantly saves the computing power of the hardware device.
  • Step S703 when it is determined that the cascaded classifier has acquired a valid image, input the to-be-processed image into the semantic segmentation model at time T4 to obtain a pedestrian prohibited entry area on a high expressway;
  • the input of the semantic segmentation model is the to-be-processed image 301, which is first downsampled by a factor of 32 through a 5-layer convolutional network (conv1, conv2, conv3, conv4 and conv5) 302 while being encoded; the encoding result is then upsampled by a factor of 32 through 4 layers of deconvolution (dconv1, dconv2, dconv3 and dconv4) 303 and is decoded and semantically understood, obtaining the expressway area (gray, labeled 1) and the non-expressway area (black, labeled 0).
  • marking the expressway area as gray and the non-expressway area as black achieves a visualization effect; marking the expressway area as 1 and the non-expressway area as 0 makes it possible to use the different labels of the areas to quickly identify the location of the intruding object.
  • Step S704 at time T5, according to the pedestrian classification result and the pedestrian prohibited entry area of the expressway, determine whether the pedestrian has entered the prohibited area;
  • the pedestrian detection frame and the semantic segmentation map can be obtained.
  • the semantic segmentation result is a two-dimensional matrix G.
  • let the upper-left point of the pedestrian detection frame be (x1, y1) and the lower-right point be (x2, y2); the semantic segmentation result is the two-dimensional matrix G.
  • the center point of the bottom edge of the pedestrian detection frame, ((x1 + x2)/2, y2), can be selected as the pedestrian positioning point to determine whether the pedestrian is intruding, that is, the following formula (1): the pedestrian is determined to be intruding when G[y2][(x1 + x2)/2] = 1.
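Formula (1) amounts to a single lookup in the segmentation matrix; a minimal sketch follows, assuming integer pixel coordinates that lie within G and the label convention 1 = expressway, 0 = non-expressway:

```python
def pedestrian_intrudes(G, box):
    """Formula (1): take the center of the bottom edge of the pedestrian
    detection frame (x1, y1, x2, y2) as the positioning point and look it
    up in the segmentation matrix G (1 = expressway, 0 = non-expressway)."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) // 2          # column index of the bottom-edge center
    return G[y2][cx] == 1        # intruding iff the point lies in the expressway area
```

Using the bottom-edge center rather than the frame center reflects where the pedestrian's feet touch the road surface, which is what the segmented area actually describes.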
  • Step S705, at time T6, output the result of pedestrian intrusion.
  • the images of pedestrian/non-motor vehicle intrusions identified on Kuaigao Expressway using the intrusion detection method are shown.
  • the enlarged image and image details of the pedestrian/non-motor vehicle intrusion are displayed, such as the time and location of the pedestrian/non-motor vehicle intrusion.
  • the embodiment of the present application proposes an intrusion detection method for cascading event detection.
  • pedestrian detection is performed based on the detection model at time T1
  • candidate targets are filtered through the cascade classifier at time T2, and then judged at time T3.
  • the to-be-processed image with an intrusion object will be semantically segmented to determine whether the target appears in the prohibited intrusion area.
  • at time T5, the object detection frame and the intrusion detection area are input simultaneously for judgment, and the judgment is completed at time T6.
  • the scheme realizes fully automatic pedestrian intrusion detection on high-speed roads without significantly increasing the computing power requirement. In this way, no-entry areas are identified using semantic segmentation without human annotation.
  • the algorithm modules for detection, classification, and segmentation can be upgraded independently. For long-tail event detection, the algorithm's computing power requirements are reduced.
  • the embodiments of the present application provide an intrusion detection apparatus; the modules included in the apparatus and the submodules included in each module can be implemented by a processor in a computer device, or, of course, by a specific logic circuit; in the course of implementation, the processor may be a central processing unit (CPU), a microprocessor (MPU), a digital signal processor (DSP), a field programmable gate array (FPGA), or the like.
  • FIG. 8 is a schematic diagram of the structure and composition of the intrusion detection device according to an embodiment of the present application. As shown in FIG. 8 , the device 800 includes:
  • Obtaining module 810 configured to obtain the to-be-processed image from the to-be-processed video stream
  • a detection module 820 configured to detect objects in the to-be-processed image to obtain at least one object detection frame
  • a first determining module 830 configured to determine whether the preset intrusion object exists in the object detection frame
  • the identification module 840 is configured to identify the to-be-processed image when it is determined that a preset intrusion object exists in the object detection frame to obtain an intrusion detection area;
  • the second determination module 850 is configured to determine whether an intrusion event occurs based on the position of the preset intrusion object and the intrusion detection area.
  • the detection module 820 includes a detection sub-module, a first determination sub-module and a second determination sub-module, wherein the detection sub-module is configured to detect an object in the image to be processed to obtain at least one object detection frame, the position of each of the object detection frames, and the category of the object in each of the object detection frames; the first determination sub-module is configured to determine, based on the category of the object, whether the preset intrusion object exists in any one of the object detection frames;
  • the second determination sub-module is configured to determine the position of the preset intrusion object based on the position of the object detection frame in which the preset intrusion object exists.
  • the detection sub-module includes a deep convolutional network, a region generation network and a pooling layer, wherein the deep convolutional network is configured to perform feature extraction on the to-be-processed image to obtain the first feature map;
  • the region generation network is configured to generate candidate target regions in the first feature map to obtain a second feature map
  • the second feature map includes at least one detection frame and the position and confidence of each detection frame;
  • the pooling layer is configured to perform position-sensitive candidate region pooling on the first feature map and the second feature map, to determine, based on the at least one detection frame and the confidence of each detection frame, the detection frame that satisfies the preset condition as the object detection frame, and to determine the category of the object in the object detection frame.
  • the detection sub-module includes a non-maximum suppression algorithm unit, which uses a non-maximum suppression algorithm to determine, based on the confidence of each of the detection frames and the intersection ratio between the detection frames in the at least one detection frame, the detection frame that meets the preset condition as the object detection frame.
  • the non-maximum value suppression algorithm unit includes a first determination subunit, a second determination subunit, a deletion subunit, a third determination subunit, and a fourth determination subunit, wherein the first determination subunit is configured to determine, based on the confidence of each of the detection frames, the detection frame with the highest confidence among the at least one detection frame as a target detection frame, and to determine the target detection frame as an object detection frame; the second determination subunit is configured to determine the intersection ratio between the target detection frame and each other detection frame, wherein the other detection frames refer to the detection frames among the at least one detection frame other than the target detection frame;
  • the deletion subunit is configured to delete other detection frames whose intersection ratio is greater than a threshold from the at least one detection frame to obtain a candidate detection frame set;
  • the third determination subunit is configured to determine the detection frame with the highest confidence in the candidate detection frame set, except the target detection frame, as a new target detection frame, and to determine the new target detection frame as one of the object detection frames;
  • the fourth determination subunit is configured to determine the intersection ratio of the new target detection frame and each new other detection frame; wherein, the each new other detection frame refers to the a detection frame other than the new target detection frame in the candidate detection frame set;
  • the deletion subunit is further configured to delete other new detection frames whose intersection ratio is greater than a threshold from the candidate detection frame set, A new set of candidate detection frames is obtained; and by analogy, the object detection frame is obtained.
  • the first determination module 830 includes a first classification sub-module, a second classification sub-module and a third determination sub-module, wherein the first classification sub-module uses a first-level classifier to perform, based on the category of the object, a first classification on the object detection frame corresponding to the target category to obtain a first classification result; the second classification sub-module adopts a second-level classifier cascaded with the first-level classifier and performs, based on the first classification result, a second classification on the object detection frame that meets the preset condition to obtain a second classification result; the third determination sub-module determines, based on the second classification result, whether the preset intrusion object exists in any one of the object detection frames.
  • the first-level classifier and the second-level classifier have the following relationship: the classification accuracy of the first-level classifier is lower than the classification accuracy of the second-level classifier; the first-level classification The number of convolutional layers in the classifier is less than the number of convolutional layers in the second-level classifier; the confidence level of the first-level classifier is lower than that of the second-level classifier.
  • the identification module 840 is further configured to use a convolutional neural network model to perform semantic segmentation on the to-be-processed image to obtain the intrusion detection area.
  • the second determination module 850 includes a determination sub-module and a fourth determination sub-module, wherein the determination sub-module is configured to determine, based on the position of the preset intrusion object, whether the preset intrusion object is located in the intrusion detection area; and the fourth determination sub-module is configured to determine that the intrusion event occurs in response to the preset intrusion object being located inside the intrusion detection area, or to determine that the intrusion event does not occur in response to the preset intrusion object being located outside the intrusion detection area.
  • the second determination module 850 further includes a fifth determination sub-module, a sixth determination sub-module and a seventh determination sub-module, wherein the fifth determination sub-module is configured to determine the object detection frame in which the preset intrusion object exists as the target detection frame; the sixth determination sub-module is configured to determine the center point of the bottom edge of the target detection frame as the position of the preset intrusion object; and the seventh determination sub-module is configured to determine whether an intrusion event occurs based on the relative positional relationship between the center point of the bottom edge of the target detection frame and the intrusion detection area.
  • the intrusion detection apparatus further includes an output module configured to output an alarm identification in response to the occurrence of the intrusion event.
  • the intrusion detection apparatus further includes a recording module and a sending module, wherein the recording module is configured to, in response to the occurrence of the intrusion event, record the intrusion event based on the category of the intrusion object and the intrusion detection area, to obtain an intrusion record;
  • the sending module is configured to store or send the intrusion record to an associated terminal.
  • if the above-mentioned intrusion detection method is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the technical solutions of the embodiments of the present application, in essence or in the parts contributing over the related art, may be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions to enable a computer device (which may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, a robot, a server, etc.) to execute all or part of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
  • the embodiments of the present application are not limited to any specific combination of hardware and software.
  • the embodiments of the present application provide a computer-readable storage medium, which may be a volatile or non-volatile storage medium, on which a computer program is stored; when the computer program is executed by a processor, the steps in the intrusion detection method provided in the foregoing embodiments are implemented.
  • FIG. 9 is a schematic diagram of a hardware entity of the computer device according to the embodiment of the present application.
  • as shown in FIG. 9, the hardware entity of the computer device 900 includes a memory 901 and a processor 902; the memory 901 stores a computer program that can run on the processor 902, and when the processor 902 executes the program, the steps in the intrusion detection method provided in the above embodiments are implemented.
  • the memory 901 is configured to store instructions and applications executable by the processor 902, and can also cache data to be processed or already processed by the processor 902 and by various modules in the computer device 900 (e.g., image data, audio data, voice communication data and video communication data); it can be implemented by a flash memory (FLASH) or a random access memory (RAM).
  • the disclosed apparatus and method may be implemented in other manners.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined, or may be integrated into another system, or some features may be ignored or not implemented.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be in electrical, mechanical, or other forms.
  • the units described above as separate components may or may not be physically separated, and a component displayed as a unit may or may not be a physical unit; it may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist separately as one unit, or two or more units may be integrated into one unit; the above integrated unit can be implemented either in the form of hardware or in the form of hardware plus software functional units.
  • the aforementioned program can be stored in a computer-readable storage medium; when the program is executed, the steps of the above method embodiments are performed; and the aforementioned storage medium includes media that can store program code, such as a removable storage device, a read-only memory (ROM), a magnetic disk, or an optical disk.
  • if the above-mentioned integrated units of the present application are implemented in the form of software function modules and sold or used as independent products, they may also be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions to enable a computer device (which may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, a robot, a server, etc.) to execute all or part of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes various media that can store program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
  • the to-be-processed image is obtained from the to-be-processed video stream, so that images in the video stream collected by an image capture device can be used as input to analyze the video stream, which can effectively improve the utilization rate of the image capture device; first, the objects in the to-be-processed image are detected to obtain at least one object detection frame, and then it is determined whether a preset intrusion object exists in the object detection frame. In this way, the detection model and the classification model are decoupled, and the classification model can be customized for special scenarios during algorithm deployment to quickly reach the expected performance, removing the dependence on the accuracy of a single detection model and greatly improving the speed and accuracy of the algorithm.
  • since the detection model and the classification model are decoupled, to optimize false positives in a new scenario it is only necessary to add false-positive data to train a new classifier and cascade it with the existing detection model, which suits rapid upgrade iteration during algorithm deployment; filtering false positives in a cascaded manner can greatly improve the detection accuracy of long-tail events. When it is determined that the preset intrusion object exists in the object detection frame, the to-be-processed image is identified to obtain an intrusion detection area, and whether an intrusion event occurs is determined based on the position of the preset intrusion object and the intrusion detection area.
  • the identification of the intrusion detection area is performed only on the to-be-processed images confirmed to contain intrusion objects, without identifying all the to-be-processed images, which can significantly reduce the computing power required of hardware devices, thereby efficiently and fully automatically detecting whether an intrusion object has entered the intrusion detection area, without manually marking the area, which facilitates large-scale online application.
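The compute-saving gating described above (the segmentation step runs only on frames in which a preset intrusion object has already been confirmed) can be sketched as follows. This is an illustrative sketch only: `detect`, `classify`, `segment` and `is_inside` are hypothetical stand-ins for the detection model, cascaded classifier, segmentation model and region test described in this disclosure, not their actual implementations.

```python
def process_frame(frame, detect, classify, segment, is_inside):
    """Gated intrusion check for one frame.

    detect(frame)        -> list of (box, category) candidates   (assumed interface)
    classify(box)        -> True if the box holds a preset intrusion object
    segment(frame)       -> intrusion detection area             (assumed interface)
    is_inside(box, area) -> True if the box position lies in the area
    """
    intruders = [box for box, cat in detect(frame) if classify(box)]
    if not intruders:
        # No intrusion object: the costly segmentation step is skipped entirely.
        return False
    # Segmentation runs only on frames that contain a confirmed intrusion object.
    area = segment(frame)
    return any(is_inside(box, area) for box in intruders)
```

Counting calls to `segment` over a video stream makes the compute saving measurable: frames without intrusion objects never trigger segmentation.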

Abstract

An invasion detection method and apparatus, a device, a storage medium, and a program product. The method comprises: obtaining an image to be processed from a video stream to be processed (S101); detecting objects in said image, and obtaining at least one object detection box (S102); determining whether a preset invading object is present in the object detection box (S103); when it is determined that the preset invading object is present in the object detection box, performing recognition on said image, and obtaining an invasion detection area (S104); and determining, on the basis of the location of the preset invading object and the invasion detection area, whether an invasion event has occurred (S105).

Description

Intrusion detection method, apparatus, device, storage medium and program product
CROSS-REFERENCE TO RELATED APPLICATIONS
The present disclosure is based on, and claims priority to, the Chinese patent application No. 202011620177.3, filed on December 31, 2020 and entitled "Intrusion Detection Method, Device, Equipment and Storage Medium", the entire contents of which are hereby incorporated into the present disclosure by reference.
Technical Field
The present disclosure relates to the field of intelligent detection, and in particular, to an intrusion detection method and apparatus, a device, a storage medium and a program product.
Background
Pedestrians and non-motor vehicles often enter expressways by mistake or intentionally, affecting the normal driving of vehicles on the road and greatly impacting traffic safety. Such long-tail events are characterized by a low probability of occurrence within a given time period, while the image data collected by cameras is massive. Deep-learning target detection methods in the related art achieve the target accuracy by increasing training data and model capacity, but this requires a lot of manpower for detection-frame annotation and greatly increases the hardware cost of running the algorithm, which is an urgent problem for deploying long-tail event algorithms such as pedestrian intrusion.
Summary of the Invention
In view of this, embodiments of the present application provide an intrusion detection method, apparatus, device, and storage medium.
The technical solutions of the embodiments of the present application are implemented as follows:
In a first aspect, an embodiment of the present application provides an intrusion detection method, including: obtaining a to-be-processed image from a to-be-processed video stream; detecting objects in the to-be-processed image to obtain at least one object detection frame; determining whether a preset intrusion object exists in the object detection frame; in the case of determining that the preset intrusion object exists in the object detection frame, identifying the to-be-processed image to obtain an intrusion detection area; and determining whether an intrusion event occurs based on the position of the preset intrusion object and the intrusion detection area.
In some embodiments, detecting the objects in the to-be-processed image to obtain at least one object detection frame includes: detecting the objects in the to-be-processed image to obtain at least one object detection frame, the position of each object detection frame, and the category of the object in each object detection frame; in a case where it is determined, based on the category of the object, that the preset intrusion object exists in any of the object detection frames, determining that the preset intrusion object exists in the to-be-processed image; and determining the position of the preset intrusion object based on the position of the object detection frame in which the preset intrusion object exists.
In this way, the method of first obtaining the object detection frames and then determining that a preset intrusion object exists in an object detection frame can effectively detect the preset intrusion object in the to-be-processed image, and the detection frame can also be used to subsequently judge whether the intrusion object is located in the intrusion detection area. This enables the detector to accurately determine whether the to-be-processed image includes an intrusion object, as well as the position of that object.
In some embodiments, detecting the objects in the to-be-processed image to obtain at least one object detection frame, the position of each object detection frame, and the category of the object in each object detection frame includes: performing feature extraction on the to-be-processed image based on a deep convolutional network to obtain a first feature map; generating candidate target regions in the first feature map based on a region proposal network to obtain a second feature map, the second feature map including at least one detection frame and the position and confidence of each detection frame; and, in the process of performing position-sensitive candidate-region pooling on the first feature map and the second feature map in a pooling layer, determining, based on the at least one detection frame and the confidence of each detection frame, the detection frames that meet a preset condition as object detection frames, and determining the category of the object in each object detection frame.
In this way, based on the deep convolutional network, the region proposal network and the pooling layer, the objects in the to-be-processed image can be detected to obtain at least one object detection frame, the position of each object detection frame, and the category of the object in each object detection frame.
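As a rough illustration of the final selection step of such a pipeline (not the disclosed implementation), the sketch below assumes the proposal stage yields candidates carrying a confidence and per-class scores, and keeps those meeting a preset confidence threshold as object detection frames with the arg-max category. The proposal dictionary layout is an assumption made for the example.

```python
def select_object_frames(proposals, conf_threshold=0.5):
    """Keep proposals whose confidence meets the preset condition and
    attach the highest-scoring category to each surviving frame.

    proposals: list of dicts {"box": (x1, y1, x2, y2), "confidence": float,
               "class_scores": {category: score}} -- assumed structure.
    """
    frames = []
    for p in proposals:
        if p["confidence"] >= conf_threshold:
            # The object's category is the class with the highest score.
            category = max(p["class_scores"], key=p["class_scores"].get)
            frames.append({"box": p["box"], "category": category,
                           "confidence": p["confidence"]})
    return frames
```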
In some embodiments, determining the detection frames that meet the preset condition as object detection frames based on the at least one detection frame and the confidence of each detection frame includes: using a non-maximum suppression algorithm to determine the detection frames that meet the preset condition as object detection frames, based on the confidence of each detection frame and the intersection-over-union between the detection frames in the at least one detection frame.
In this way, by using the non-maximum suppression algorithm, one most suitable object detection frame can finally be determined for each object in the to-be-processed image.
In some embodiments, using the non-maximum suppression algorithm to determine the detection frames that meet the preset condition as object detection frames, based on the confidence of each detection frame and the intersection-over-union between the detection frames, includes: determining, based on the confidence of each detection frame, the detection frame with the highest confidence among the at least one detection frame as the target detection frame; determining the target detection frame as an object detection frame; determining the intersection-over-union between the target detection frame and each other detection frame, where the other detection frames refer to the detection frames among the at least one detection frame other than the target detection frame; deleting the other detection frames whose intersection-over-union is greater than a threshold from the at least one detection frame to obtain a candidate detection frame set; determining the detection frame with the highest confidence in the candidate detection frame set, other than the target detection frame, as a new target detection frame; determining the new target detection frame as an object detection frame; determining the intersection-over-union between the new target detection frame and each new other detection frame, where each new other detection frame refers to a detection frame in the candidate detection frame set other than the new target detection frame; deleting the new other detection frames whose intersection-over-union is greater than the threshold from the candidate detection frame set to obtain a new candidate detection frame set; and so on, until the object detection frames are obtained.
In this way, by using the non-maximum suppression algorithm and suppressing the detection frames whose intersection-over-union is greater than the threshold, the detection frames that meet the preset condition can be determined as object detection frames, so that one most suitable object detection frame is finally determined for each object in the to-be-processed image.
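The greedy procedure described above is standard non-maximum suppression; a minimal reference sketch (boxes as `(x1, y1, x2, y2)` tuples, written for illustration rather than as the patented implementation) is:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5):
    """Greedy non-maximum suppression: pick the highest-confidence box,
    delete every remaining box whose IoU with it exceeds the threshold,
    and repeat on the remainder. Returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep
```

Each iteration mirrors one round of the claim language: the kept index is the "target detection frame", and the filtered `order` list is the new "candidate detection frame set".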
In some embodiments, determining whether the preset intrusion object exists in the object detection frame includes: using a first-level classifier to perform, based on the category of the object, a first classification on the object detection frames corresponding to the target category to obtain a first classification result; using a second-level classifier cascaded with the first-level classifier to perform, based on the first classification result, a second classification on the object detection frames that meet the preset condition to obtain a second classification result; and, when it is determined based on the second classification result that the preset intrusion object exists in any of the object detection frames, determining that the preset intrusion object exists in the object detection frame.
In this way, the first classification can be regarded as a preliminary judgment and the second classification as a re-judgment; this two-stage classification, a preliminary judgment followed by a second judgment, can effectively improve classification efficiency and reduce the misjudgment rate.
In some embodiments, the first-level classifier and the second-level classifier have the following relationship: the classification accuracy of the first-level classifier is lower than that of the second-level classifier; the number of convolutional layers in the first-level classifier is smaller than that in the second-level classifier; and the confidence of the first-level classifier is lower than that of the second-level classifier.
In this way, the classifier composed of the first-level classifier and the second-level classifier can effectively improve classification efficiency and confidence and reduce the misjudgment rate while ensuring classification accuracy.
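A minimal sketch of such a two-level cascade follows. The `coarse` and `fine` classifiers and both thresholds are hypothetical placeholders: the cheap, shallower first-level classifier filters out easy negatives with a low threshold, and only its survivors are passed to the deeper, higher-precision second-level classifier.

```python
def cascade_classify(crops, coarse, fine, coarse_threshold=0.3, fine_threshold=0.7):
    """Two-stage cascade sketch.

    coarse(crop) / fine(crop) -> intrusion-object probability (assumed interfaces).
    Returns the indices of crops confirmed as the preset intrusion object.
    """
    # Stage 1: cheap classifier rejects obvious negatives (preliminary judgment).
    survivors = [i for i, c in enumerate(crops) if coarse(c) >= coarse_threshold]
    # Stage 2: expensive classifier confirms the remainder (re-judgment).
    return [i for i in survivors if fine(crops[i]) >= fine_threshold]
```

Because the second stage sees only the first stage's survivors, most frames pay only the cheap classifier's cost, which is the efficiency gain described above.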
In some embodiments, identifying the to-be-processed image to obtain the intrusion detection area includes: performing semantic segmentation on the to-be-processed image using a convolutional neural network model to obtain the intrusion detection area.
In this way, using a convolutional neural network to perform semantic segmentation on the to-be-processed image realizes automatic identification of the intrusion detection area, without manually marking the area, which facilitates large-scale online application.
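Assuming the segmentation network's head outputs per-pixel class scores, the intrusion detection area can be read off as the set of pixels whose arg-max class is the area's class. The grid-of-score-lists representation below is a toy stand-in for the CNN output, chosen only to illustrate the post-processing step:

```python
def segment_region(score_map, region_class):
    """Turn per-pixel class scores into a binary mask of the intrusion
    detection area.

    score_map: 2-D grid where each cell is a list of per-class scores
               (assumed CNN head output); region_class: class id of the area.
    """
    mask = []
    for row in score_map:
        # Arg-max class per pixel; mark pixels belonging to the area class.
        mask.append([1 if max(range(len(s)), key=s.__getitem__) == region_class else 0
                     for s in row])
    return mask
```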
In some embodiments, determining whether an intrusion event occurs based on the position of the preset intrusion object and the intrusion detection area includes: judging, based on the position of the preset intrusion object, whether the preset intrusion object is located in the intrusion detection area; determining that the intrusion event occurs in response to the preset intrusion object being located inside the intrusion detection area; or determining that the intrusion event does not occur in response to the preset intrusion object being located outside the intrusion detection area.
In this way, judging whether the preset intrusion object is located in the intrusion detection area based on its position can effectively improve the accuracy of determining that the preset intrusion object is located in the intrusion detection area.
In some embodiments, determining whether an intrusion event occurs based on the position of the preset intrusion object and the intrusion detection area includes: determining the object detection frame in which the preset intrusion object exists as the target detection frame; determining the center point of the bottom edge of the target detection frame as the position of the preset intrusion object; and determining whether an intrusion event occurs based on the relative positional relationship between the center point of the bottom edge of the target detection frame and the intrusion detection area.
In this way, whether an intrusion event occurs is determined based on the relative positional relationship between the center point of the bottom edge of the target detection frame and the intrusion detection area: the position coordinates corresponding to this center point are compared with the position coordinates of the intrusion detection area, and when the coordinates are determined to fall within the intrusion area, an intrusion event is determined to have occurred, which can effectively improve the accuracy of determining intrusion events.
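The bottom-edge midpoint test can be sketched with a standard ray-casting point-in-polygon check. Representing the intrusion detection area as a polygon of `(x, y)` vertices is an assumption made for this illustration; image coordinates are taken with y growing downward, so the bottom edge has the larger y.

```python
def bottom_center(box):
    """Bottom-edge midpoint of a detection frame (x1, y1, x2, y2),
    used as the position of the intrusion object."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, max(y1, y2))

def point_in_polygon(pt, polygon):
    """Ray-casting test: is the point inside the polygon describing
    the intrusion detection area? polygon: list of (x, y) vertices."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edges crossed by a horizontal ray to the right of the point.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def intrusion_occurred(box, area_polygon):
    """An intrusion event occurs when the bottom-edge midpoint of the
    target detection frame falls inside the intrusion detection area."""
    return point_in_polygon(bottom_center(box), area_polygon)
```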
In some embodiments, the method further includes: outputting an alarm identification in response to the occurrence of the intrusion event.
In this way, the intrusion object can be quickly guided to leave the intrusion detection area according to the alarm identification, effectively preventing the intrusion object from entering the intrusion detection area.
In some embodiments, the method further includes: in response to the occurrence of the intrusion event, recording the intrusion event based on the category of the intrusion object and the intrusion detection area to obtain an intrusion record; and storing the intrusion record or sending it to an associated terminal.
In this way, intrusion events can be recorded, high-frequency intrusion locations can be found from the intrusion records, and preventive measures can be strengthened accordingly.
In a second aspect, an embodiment of the present application provides an intrusion detection apparatus, including: an obtaining module configured to obtain the to-be-processed image from a to-be-processed video stream; a detection module configured to detect the objects in the to-be-processed image to obtain at least one object detection frame; a first determination module configured to determine whether the preset intrusion object exists in the object detection frame; an identification module configured to identify the to-be-processed image to obtain an intrusion detection area when it is determined that the preset intrusion object exists in the object detection frame; and a second determination module configured to determine whether an intrusion event occurs based on the position of the preset intrusion object and the intrusion detection area.
In a third aspect, an embodiment of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program that can run on the processor, and the processor implements the above intrusion detection method when executing the program.
In a fourth aspect, an embodiment of the present application provides a storage medium storing executable instructions that, when executed by a processor, implement the above intrusion detection method.
In a fifth aspect, an embodiment of the present application provides a computer program product, including one or more instructions suitable for being loaded and executed by a processor to perform the above intrusion detection method.
The embodiments of the present application have the following advantages:
1) The to-be-processed image is obtained from the to-be-processed video stream, so that images in the video stream collected by an image capture device can be used as input to analyze the video stream, which can effectively improve the utilization rate of the image capture device.
2) The objects in the to-be-processed image are first detected to obtain at least one object detection frame, and it is then determined whether a preset intrusion object exists in the object detection frame. In this way, the detection model and the classification model are decoupled, and the classification model can be customized for special scenarios during algorithm deployment to quickly reach the expected performance, removing the dependence on the accuracy of a single detection model and greatly improving the speed and accuracy of the algorithm. Further, since the detection model and the classification model are decoupled, to optimize false positives in a new scenario it is only necessary to add false-positive data to train a new classifier and cascade it with the existing detection model, which suits rapid upgrade iteration during algorithm deployment; filtering false positives in a cascaded manner can greatly improve the detection accuracy of long-tail events.
3) When it is determined that a preset intrusion object exists in the object detection frame, the to-be-processed image is identified to obtain an intrusion detection area, and whether an intrusion event occurs is determined based on the position of the preset intrusion object and the intrusion detection area. In this way, the identification of the intrusion detection area is performed only on the to-be-processed images confirmed to contain intrusion objects, without identifying all the to-be-processed images, which can significantly reduce the computing power required of hardware devices, thereby efficiently and fully automatically detecting whether an intrusion object has entered the intrusion detection area, without manually marking the area, which facilitates large-scale online application.
Description of Drawings
FIG. 1 is a schematic flowchart of an implementation of an intrusion detection method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a cascaded classification model provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a semantic segmentation model provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a detection model provided by an embodiment of the present application;
FIG. 5 is a display diagram of a smart transportation platform provided by an embodiment of the present application;
FIG. 6 is a pedestrian/vehicle intrusion diagram provided by an embodiment of the present application;
FIG. 7 is a schematic flowchart of an implementation of an intrusion detection method provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an intrusion detection apparatus according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a hardware entity of a computer device according to an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the specific technical solutions of the invention are described in further detail below with reference to the accompanying drawings in the embodiments of the present application. The following embodiments are used to illustrate the present application but are not intended to limit its scope.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application.
It should be understood that some embodiments described herein are only used to explain the technical solutions of the embodiments of the present application and are not used to limit their technical scope.
An embodiment of the present application provides an intrusion detection method applied to a computer device. The computer device may be a movable device or a non-movable device, and the functions implemented by the method may be realized by a processor in the computer device calling program code. The program code may be stored in a computer storage medium; the computer device therefore includes at least a processor and a storage medium.
FIG. 1 is a schematic flowchart of an implementation of an intrusion detection method provided by an embodiment of the present application. As shown in FIG. 1, the method includes:
Step S101: Obtain the to-be-processed image from a to-be-processed video stream.
In some embodiments, a video stream acquired by an image acquisition device may be used as input, and the to-be-processed image is obtained from the video stream. Depending on the acquisition period, such video streams usually contain massive amounts of data. Here, the image acquisition device may be a camera. In practice, the existing road image acquisition system can be reused, which effectively avoids the limitation of dedicated hardware; to-be-processed images can also be obtained by timed snapshots, so that pedestrians/non-motor vehicles entering an expressway are identified and warned about, assisting the traffic police in maintaining expressway order and improving road-network safety.
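The timed-snapshot acquisition described above can be sketched as a simple frame sampler. The function name and the fixed-interval policy are illustrative assumptions, not part of the embodiment:

```python
def sample_frames(frames, fps, interval_s):
    """Yield one frame every `interval_s` seconds from a stream decoded at
    `fps` frames per second, approximating a timed-snapshot policy."""
    step = max(1, int(fps * interval_s))
    for i, frame in enumerate(frames):
        if i % step == 0:
            yield frame
```

Sampling at such an interval keeps the downstream detection workload bounded even though the raw stream is massive.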
Step S102: Detect objects in the to-be-processed image to obtain at least one object detection frame.
In some embodiments, a target detection model may be used to detect objects in the to-be-processed image to obtain at least one object detection frame. In practice, the target detection model may be a trained network based on one of Faster-Regions with Convolutional Neural Network (Faster-RCNN), You Only Look Once (YOLO), Single Shot MultiBox Detector (SSD), and the like. Two-step target detection methods, represented by Faster R-CNN, offer high detection accuracy but slow detection speed; single-step target detection methods, represented by YOLO and SSD, detect faster than two-step methods.
In practice, the input of any of the above three types of target detection models may be the to-be-processed image, and at least one object detection frame is output after processing.
Step S103: Determine whether the preset intrusion object exists in the object detection frame.
In some embodiments, a cascaded classifier model may be used to determine whether a preset intrusion object exists in the object detection frame. The cascaded classifier model may include multiple levels of classifiers, each level completing a corresponding classification task. In this way, the classification result determined by the cascaded classifier model is more accurate than that of a single-level classifier model, and classification efficiency is effectively improved.
In practice, for example in an expressway scenario, the preset intrusion object may be a pedestrian or a non-motor vehicle.
Step S104: When it is determined that a preset intrusion object exists in the object detection frame, recognize the to-be-processed image to obtain an intrusion detection area.
The intrusion detection method provided by the embodiments of the present application can be applied to recognizing pedestrians/non-motor vehicles entering an expressway by mistake or intentionally, and also to other long-tail events such as detecting a lost child at a kindergarten entrance, a person falling into the water at a lakeside or waterside, or a prison break. Such long-tail events are characterized by a low probability of occurring within a given period, while the image data collected by the camera is massive; if the target area were recognized in every image, the computing power required of the system would be high. Here, the system recognizes only the to-be-processed images judged to contain an intrusion object, and the recognition may use a semantic segmentation model. In this embodiment, applying the intrusion detection method provided by this embodiment of the present application to pedestrians/non-motor vehicles entering an expressway by mistake is taken as an example for description.
Step S105: Determine whether an intrusion event occurs based on the position of the preset intrusion object and the intrusion detection area.
In the same to-be-processed image, the position of the recognized preset intrusion object and the intrusion detection area obtained in step S104 are compared through computation to determine whether the preset intrusion object is inside the intrusion detection area.
The embodiments of the present application have the following advantages:
1) The to-be-processed image is obtained from a to-be-processed video stream, so that images from the video stream collected by an image acquisition device can be used as input and the video stream can be analyzed, effectively improving the utilization of the image acquisition device.
2) Objects in the to-be-processed image are first detected to obtain at least one object detection frame, and it is then determined whether a preset intrusion object exists in the object detection frame. In this way, the detection model and the classification model are decoupled: during algorithm deployment, the classification model can be customized for special scenarios to quickly reach the expected performance, which removes the dependence on the accuracy of a single detection model and greatly improves both the speed and the accuracy of the algorithm. Further, because the detection model and the classification model are decoupled, optimizing against false positives in a new scenario only requires training a new classifier with the false-positive data and cascading it with the existing detection model, which suits rapid upgrade iteration during deployment; filtering false positives in a cascaded manner can greatly improve the detection accuracy of long-tail events.
3) When it is determined that a preset intrusion object exists in the object detection frame, the to-be-processed image is recognized to obtain an intrusion detection area, and whether an intrusion event occurs is determined based on the position of the preset intrusion object and the intrusion detection area. In this way, the intrusion detection area is recognized only in to-be-processed images confirmed to contain an intrusion object, rather than in all to-be-processed images, which significantly reduces the computing-power requirements on hardware devices. Efficient, fully automatic detection of whether an intrusion object has entered the intrusion detection area is thereby achieved without manually annotating the area, which facilitates large-scale online deployment.
An embodiment of the present application provides a cascaded classification model. As shown in FIG. 2, the cascaded classification model includes a first-level classifier 220 and a second-level classifier 230. Specifically:
The first-level classifier 220 includes a first residual network 221 and a first fully connected layer 222. The first residual network 221 performs feature extraction on the picture content in the input object detection frame 210 to obtain a feature map P1; the first fully connected layer 222 performs a first classification based on the feature map P1 to obtain a first classification result, and object detection frames that do not meet the requirements are filtered out based on the first classification result.
The second-level classifier 230 includes a second residual network 231 and a second fully connected layer 232. When the first classification result satisfies the condition, the second residual network 231 performs feature extraction on the picture content in the object detection frame to obtain a feature map P2; the second fully connected layer 232 performs a second classification based on the feature map P2 to obtain a second classification result, and object detection frames that do not meet the requirements are filtered out based on the second classification result.
The second classification result 240 represents the object detection frame classification result after cascaded-model classification.
In the cascaded classification model shown in FIG. 2, the first residual network 221 in the first-level classifier 220 may use a fast classification model, for example the residual network ResNet18, which can filter out most negative samples; the second residual network 231 in the second-level classifier 230 may use a slower but more accurate classification model, for example the residual network ResNet50, to improve accuracy. The overall speed therefore does not slow down much, while accuracy improves considerably.
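As a minimal sketch of this coarse-then-fine filtering, with the two scoring functions standing in for the ResNet18 and ResNet50 classifiers (function names and thresholds are illustrative assumptions):

```python
def cascade_filter(crops, fast_score, accurate_score, t1=0.3, t2=0.5):
    """Two-level cascade: a fast first-level classifier cheaply discards
    most negative samples, and a slower, more accurate second-level
    classifier re-judges only the survivors."""
    survivors = [c for c in crops if fast_score(c) >= t1]      # coarse judgment
    return [c for c in survivors if accurate_score(c) >= t2]   # fine judgment
```

Only crops passing both levels are treated as containing the preset intrusion object; samples rejected at the first level never reach the expensive second-level model, which is what keeps the overall speed high.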
An embodiment of the present application provides an intrusion detection method, which includes:
Step S201: Obtain the to-be-processed image from a to-be-processed video stream.
Step S202: Detect objects in the to-be-processed image to obtain at least one object detection frame, the position of each object detection frame, and the category of the object in each object detection frame.
In some embodiments, in an expressway scenario, the detection model takes an expressway image as input and outputs object detection frames. Here, the objects may be workers, pedestrians, animals, cars, motorcycles, electric bicycles, and the like. In the feature-extraction stage, a deep convolutional network extracts features from the expressway image, and a region proposal network extracts candidate object detection frames; in the detection stage, position-sensitive region-of-interest pooling is performed on the candidate detection-frame features obtained in the feature-extraction stage, that is, category classification and coordinate regression, to obtain the position of each object detection frame and the category of the object in each frame.
Step S203: When it is determined, based on the category of the object, that the preset intrusion object exists in any object detection frame, determine that the preset intrusion object exists in the to-be-processed image.
In some embodiments, in an expressway scenario, the detection result includes but is not limited to at least one of: workers on the expressway, pedestrians, animals, cars, motorcycles, electric bicycles, and the like. Here, objects whose category is pedestrian or non-motor vehicle may be determined as preset intrusion objects; that is, when a pedestrian or non-motor vehicle exists in any object detection frame, it is determined that a preset intrusion object exists in the to-be-processed image.
Step S204: Determine the position of the preset intrusion object based on the position of the object detection frame in which the preset intrusion object exists.
In some embodiments, the position of the object detection frame in which the preset intrusion object exists may be represented by position coordinates. The position of the intrusion object in the to-be-processed image can be determined based on the position coordinates of the object detection frame.
Step S205: Using the first-level classifier, perform a first classification on the object detection frames corresponding to the target category based on the category of the object, to obtain a first classification result.
In some embodiments, the cascaded classification model shown in FIG. 2 may be used, where the first-level classifier 220 may include a first residual network 221 and a first fully connected layer 222. The first residual network 221 may use the ResNet18 network, where 18 denotes the depth of the network, that is, 18 weighted layers, including convolutional layers and fully connected layers but excluding pooling and batch normalization (BN) layers. The ResNet18 network performs feature extraction on the object detection frame to obtain a feature map; the first fully connected layer 222 performs the first classification based on the feature map to obtain the first classification result, that is, object detection frames that do not meet the requirements are filtered out in a first filtering step. Here the first-level classifier 220 completes a preliminary judgment of the image, which may also be called a coarse judgment; coarse judgment is efficient but has a high misjudgment rate.
Step S206: Using the second-level classifier cascaded with the first-level classifier, perform, based on the first classification result, a second classification on the object detection frames that satisfy a preset condition, to obtain a second classification result.
In some embodiments, the first-level classifier and the second-level classifier have the following relationships: the classification accuracy of the first-level classifier is lower than that of the second-level classifier; the number of convolutional layers in the first-level classifier is smaller than that in the second-level classifier; and the confidence of the first-level classifier is lower than that of the second-level classifier.
In some embodiments, as shown in FIG. 2, when the first classification result satisfies the condition, the object detection frame is classified using the second-level classifier 230, which includes a second residual network 231 and a second fully connected layer 232. The second residual network 231 may use the ResNet50 network. The ResNet50 network performs feature extraction on the object detection frame to obtain a feature map; the second fully connected layer 232 performs a second classification based on the feature map to obtain the second classification result, that is, object detection frames that do not meet the requirements are filtered out in a second filtering step. The second-level classifier 230 completes a re-judgment of the object detection frame, which may also be called a fine judgment; fine judgment features high classification accuracy and a low misjudgment rate.
Step S207: When it is determined, based on the second classification result, that the preset intrusion object exists in any object detection frame, determine that the preset intrusion object exists in the object detection frame.
Step S208: When it is determined that a preset intrusion object exists in the object detection frame, recognize the to-be-processed image to obtain an intrusion detection area.
Step S209: Determine the object detection frame in which the preset intrusion object exists as the target detection frame.
Step S210: Determine the center point of the bottom edge of the target detection frame as the position of the preset intrusion object.
In some embodiments, the center point of the bottom edge of the target detection frame corresponds to a position coordinate, and this position coordinate is determined as the position of the intrusion object.
In other embodiments, the position coordinate corresponding to the center point of any edge of the target detection frame may also be determined as the position of the intrusion object.
Step S211: Determine whether an intrusion event occurs based on the relative positional relationship between the center point of the bottom edge of the target detection frame and the intrusion detection area.
In practice, the position coordinate corresponding to the center point of the bottom edge of the target detection frame may be compared with the position coordinates of the intrusion detection area. If the position coordinate is determined to belong to the intrusion area, it is determined that an intrusion event has occurred; if the position coordinate is determined not to belong to the intrusion area, it is determined that no intrusion event has occurred.
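A minimal sketch of steps S210 and S211, assuming detection frames in (x1, y1, x2, y2) pixel coordinates and the intrusion detection area given as a binary mask (1 = intrusion area, 0 = outside) as produced by the segmentation model; the function names are illustrative:

```python
def bottom_center(box):
    """Midpoint of the bottom edge of an (x1, y1, x2, y2) detection frame."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, y2)

def intrusion_occurred(box, area_mask):
    """True when the frame's bottom-center pixel lies inside the
    intrusion area (mask value 1)."""
    x, y = bottom_center(box)
    return area_mask[int(y)][int(x)] == 1
```

The bottom-center point approximates where the object touches the ground, which is why comparing that single coordinate against the area mask suffices to decide the intrusion event.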
In the embodiments of the present application, classified object detection frames are obtained, and only the images cropped from object frames of the matching categories are input to the cascaded classification. In this way, not all to-be-processed images need to be recognized, which significantly reduces the computing-power requirements on hardware devices.
In the embodiments of the present application, when the cascaded classification confirms an object detection frame containing an intrusion object, the to-be-processed image is recognized to obtain the intrusion detection area. In this way, the cascaded classifier classifies in two passes: the first-level classification can be regarded as a preliminary judgment and the second-level classification as a secondary judgment. This two-pass method of preliminary judgment followed by secondary judgment effectively improves classification efficiency and lowers the misjudgment rate. For similar long-tail event detection, for example smoke-and-fire detection, filtering false positives in a cascaded manner can greatly improve the detection accuracy of long-tail events.
In the embodiments of the present application, whether an intrusion event occurs is determined based on the relative positional relationship between the center point of the bottom edge of the target detection frame and the intrusion detection area: the position coordinate corresponding to that center point is compared with the position coordinates of the intrusion detection area, and when the coordinate is determined to belong to the intrusion area, it is determined that an intrusion event has occurred. This effectively improves the accuracy of determining intrusion events.
An embodiment of the present application provides a semantic segmentation model. As shown in FIG. 3, the semantic segmentation model includes a multi-layer convolutional network 302, a multi-layer deconvolutional network 303, and an image 304 on which semantic segmentation is completed. Specifically:
The multi-layer convolutional network 302, with the number of layers set to 5, is a 5-layer convolutional network used to downsample the to-be-processed image by a factor of 32 while encoding it.
The multi-layer deconvolutional network 303, with the number of layers set to 4, is a 4-layer deconvolutional network used to upsample the encoding result by a factor of 32 and to perform decoding and semantic understanding on it.
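The 32x encoder downsampling follows from five stride-2 stages (2^5 = 32). As an illustrative check of the resulting feature-map size (the helper name and the stride-2 assumption are ours, not stated in the embodiment):

```python
def encoder_output_size(h, w, num_stride2_layers=5):
    """Spatial size after `num_stride2_layers` stride-2 convolutions;
    five such layers give the 32x downsampling used by the encoder."""
    f = 2 ** num_stride2_layers
    return (h // f, w // f)
```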
Taking an expressway scene as the to-be-processed image for example, the to-be-processed image 301 is input into the convolutional neural network model to obtain the image 304 on which semantic segmentation is completed, that is, the expressway area (gray, labeled 1) and the non-expressway area (black, labeled 0) are obtained. Marking the expressway area gray and the non-expressway area black achieves a visualization effect; labeling the expressway area 1 and the non-expressway area 0 makes it possible to use the different labels to quickly identify where an intrusion object is located.
An embodiment of the present application provides a detection model. As shown in FIG. 4, the detection model includes a deep convolutional network 402, a Region Proposal Network (RPN) 403, a Position Sensitive Regions of Interest Pooling (PSROIPooling) layer 404, a bounding-box regression result 405, and a classification result 406. Specifically:
The deep convolutional network 402 is used to perform feature extraction on the to-be-processed image 301 (the same image as the to-be-processed image 301 shown in FIG. 3) to obtain a first feature map.
The Region Proposal Network (RPN) 403 is used to generate candidate target regions (object detection frames) on the first feature map to obtain a second feature map, where the second feature map includes at least one detection frame and the position and confidence of each detection frame.
The Position Sensitive Regions of Interest Pooling (PSROIPooling) layer 404 is used to perform position-sensitive region-of-interest pooling on the simultaneously input first feature map and at least one object detection frame, to obtain the bounding-box regression result 405 and the classification result 406. In this embodiment, the classification result 406 realizes prediction of the detection result, where the detection result includes but is not limited to at least one of: workers on the expressway, pedestrians, animals, cars, motorcycles, electric bicycles, and the like; the bounding-box regression result 405 predicts the precise coordinates of the detection frame corresponding to the detection result.
In this way, after the to-be-processed image is detected by the detection model, at least one object detection frame cropped from the to-be-processed image is obtained, and the position and confidence of each object detection frame and the category of the object in each frame are determined.
本申请实施例提供的一种入侵检测方法,该方法包括:An intrusion detection method provided by an embodiment of the present application includes:
步骤S401、从待处理的视频流中获得所述待处理图像;Step S401, obtaining the to-be-processed image from the to-be-processed video stream;
步骤S402、基于深度卷积网络,对所述待处理图像进行特征提取,得到第一特征图;Step S402, performing feature extraction on the to-be-processed image based on a deep convolutional network to obtain a first feature map;
在一些实施例中,可以使用如图4所示的,基于快速区域卷积神经网络(Faster-Regions with Convolutional Neural Network,Faster-RCNN)的检测器进行待处理图像的检测,Faster-RCNN网络输入的是待处理图像,经过处理输出至少一个对象检测框;In some embodiments, as shown in FIG. 4 , a detector based on Faster-Regions with Convolutional Neural Network (Faster-RCNN) can be used to detect images to be processed, and the Faster-RCNN network inputs is the image to be processed, and at least one object detection frame is output after processing;
如图4所示,阶段一(特征提取阶段)使用深度卷积网络402进行特征提取,其中,深度卷积网络402包括:向量卷积运算1(conv1)、向量卷积运算2(conv2)、稠密向量卷积运算3(dense conv3)和稠密向量卷积运算4(dense conv4),使用以上4个深度卷积网络,对图像进行特征提取。As shown in FIG. 4 , the first stage (feature extraction stage) uses the deep convolution network 402 to perform feature extraction, wherein the deep convolution network 402 includes: vector convolution operation 1 (conv1), vector convolution operation 2 (conv2), Dense vector convolution operation 3 (dense conv3) and dense vector convolution operation 4 (dense conv4) use the above four deep convolutional networks to perform feature extraction on images.
步骤S403、基于区域生成网络,在所述第一特征图中生成候选目标区域,得到第二特征图;所述第二特征图包括至少一个检测框和每一所述检测框的位置、置信度;Step S403, based on the region generation network, generate candidate target regions in the first feature map to obtain a second feature map; the second feature map includes at least one detection frame and the position and confidence of each detection frame ;
如图4所示,基于区域生成网络403,在第一特征图中生成候选目标区域,得到第二特征图。第二特征图包括至少一个检测框和每一所述检测框的位置、检出目标的置信度。As shown in FIG. 4 , based on the region generation network 403 , candidate target regions are generated in the first feature map to obtain the second feature map. The second feature map includes at least one detection frame, the position of each detection frame, and the confidence level of the detected target.
Step S404: During position-sensitive candidate-region pooling of the first feature map and the second feature map by the pooling layer, determine, based on the at least one detection frame and the confidence of each detection frame, a detection frame that satisfies a preset condition as an object detection frame, and determine the category of the object in the object detection frame;
As shown in FIG. 4, the first feature map and the second feature map are subjected to position-sensitive candidate-region pooling; that is, both feature maps are input into the position-sensitive candidate-region pooling layer 404 to obtain the bounding-box regression result 405 and the classification result 406. This yields the confidence of each detected target and the position of its detection frame; a detection frame that satisfies the preset condition is determined as an object detection frame, and the category of the object in the object detection frame is determined.
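The selection of object detection frames from the pooled outputs can be sketched as a simple confidence filter over the regression and classification results. This is a minimal illustration, not the patent's implementation: the box layout `(x1, y1, x2, y2)`, the 0.5 threshold standing in for the "preset condition", and the function name are all assumptions.

```python
# Minimal sketch: keep only detections whose confidence satisfies an assumed
# "preset condition" (score >= 0.5). Boxes are (x1, y1, x2, y2) tuples.

def select_object_frames(boxes, scores, labels, score_thresh=0.5):
    """Return object detection frames whose confidence passes the threshold."""
    kept = []
    for box, score, label in zip(boxes, scores, labels):
        if score >= score_thresh:
            kept.append({"box": box, "score": score, "category": label})
    return kept

detections = select_object_frames(
    boxes=[(10, 20, 50, 120), (200, 40, 230, 90)],
    scores=[0.92, 0.31],
    labels=["pedestrian", "vehicle"],
)
print(detections)  # only the high-confidence pedestrian box survives
```

In a deployed detector the threshold would be tuned per category; the point here is only that low-confidence frames are discarded before any later stage runs.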
Step S405: When it is determined, based on the category of the object, that the preset intrusion object exists in any of the object detection frames, determine that the preset intrusion object exists in the to-be-processed image;
Step S406: Determine the position of the preset intrusion object based on the position of the object detection frame in which the preset intrusion object exists;
Step S407: Determine whether the preset intrusion object exists in the object detection frame;
Step S408: When it is determined that a preset intrusion object exists in the object detection frame, perform semantic segmentation on the to-be-processed image using a convolutional neural network model to obtain the intrusion detection area;
During implementation, the convolutional neural network model shown in FIG. 3 includes a multi-layer convolutional network 302 and a multi-layer deconvolutional network 303. When its layer count is 5, the multi-layer convolutional network 302 is a five-layer convolutional network that downsamples the to-be-processed image by a factor of 32 while encoding it; when its layer count is 4, the multi-layer deconvolutional network 303 is a four-layer deconvolutional network that upsamples the encoding result by a factor of 32, decoding it and performing semantic understanding.
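The 32× down-sampling can be checked with simple shape arithmetic: five stride-2 convolution stages divide each spatial dimension by 2^5 = 32. The sketch below only tracks spatial sizes and, for simplicity, mirrors the five encoding stages with five doubling steps on the decoding side so that the original resolution is restored; it is a sanity-check sketch under those assumptions, not the network itself.

```python
# Sketch: track spatial sizes through stride-2 encoding and mirrored decoding.
# Assumes a square-friendly input whose sides are divisible by 32.

def encoder_decoder_sizes(h, w, down_stages=5):
    """Return the (H, W) sizes seen at every stage of encode/decode."""
    sizes = [(h, w)]
    for _ in range(down_stages):      # each encoding stage halves H and W
        h, w = h // 2, w // 2
        sizes.append((h, w))
    for _ in range(down_stages):      # decoding doubles back to the input size
        h, w = h * 2, w * 2
        sizes.append((h, w))
    return sizes

sizes = encoder_decoder_sizes(512, 512)
print(sizes[5])   # (16, 16): the 32x downsampled bottleneck
print(sizes[-1])  # (512, 512): resolution restored for per-pixel labels
```

Restoring the input resolution is what makes per-pixel region labels (1 for the expressway area, 0 otherwise) possible.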
Taking a to-be-processed image of an expressway scene as an example, feeding the input image 301 into the convolutional neural network model yields the expressway area (gray, labeled 1) and the non-expressway area (black, labeled 0). Marking the expressway area gray and the non-expressway area black provides a visualization; labeling them 1 and 0 respectively makes it possible to use the differently labeled regions to quickly identify the location of an intruding object.
Step S409: Determine, based on the position of the preset intrusion object, whether the preset intrusion object is located within the intrusion detection area;
During implementation, the position of the intrusion object may correspond to a set of position coordinates. For example, the set of position coordinates of the object detection frame may be used as the position coordinates of the intrusion object; from this set, the coordinate that best represents the position of the intrusion object is selected and compared with the position coordinates of the intrusion detection area, to determine whether the intrusion object is located within the intrusion detection area.
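The comparison between the representative coordinate and the intrusion detection area can be sketched as a lookup into a binary region mask. The mask values, the choice of the bottom-edge center as the representative point, and the helper names are illustrative assumptions, not details fixed by the patent.

```python
# Sketch: decide intrusion by looking up one representative coordinate of the
# detection frame in a 0/1 region mask (1 = intrusion detection area).

def bottom_center(box):
    """Bottom-edge center point: an illustrative choice of the coordinate
    that best represents where the object touches the ground."""
    x1, y1, x2, y2 = box
    return (x1 + x2) // 2, y2

def point_in_region(mask, point):
    """True if the point lies inside the mask and on a cell labeled 1."""
    x, y = point
    return 0 <= y < len(mask) and 0 <= x < len(mask[0]) and mask[y][x] == 1

mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
print(point_in_region(mask, bottom_center((1, 0, 2, 2))))  # True
```

The same lookup generalizes to any other representative coordinate drawn from the frame's coordinate set.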
Step S410: In response to the preset intrusion object being located within the intrusion detection area, determine that the intrusion event has occurred; or, in response to the preset intrusion object being located outside the intrusion detection area, determine that the intrusion event has not occurred.
In the embodiments of the present application, the first feature map is obtained based on a deep convolutional network, and the second feature map is obtained based on the region proposal network; during position-sensitive candidate-region pooling of the two feature maps by the pooling layer, a detection frame that satisfies a preset condition is determined as an object detection frame, based on the at least one detection frame and the confidence of each detection frame, and the category of the object in the object detection frame is determined. The resulting object detection frame is a frame that contains the preset intrusion object; if no object detection frame containing an intrusion object is detected, no subsequent processing is required, which effectively improves the detection efficiency of long-tail events.
In the embodiments of the present application, when a preset intrusion object exists, a convolutional neural network model is used to semantically segment the to-be-processed image, yielding an intrusion detection area and a non-intrusion detection area distinguished by color and by label. Marking the intrusion detection areas in different colors provides a visualization; marking them with different labels makes it possible to use the differently labeled regions to quickly identify the location of the intruding object. By introducing a convolutional neural network model to identify the intrusion detection area in the to-be-processed image, the area is identified automatically, without manual annotation, which facilitates large-scale online deployment: because there is no need to pre-annotate the intrusion detection area, the system is easy to deploy and bring online.
In the embodiments of the present application, judging whether the preset intrusion object is located within the intrusion detection area based on its position effectively improves the accuracy of determining that the preset intrusion object is located in the intrusion detection area.
An intrusion detection method provided by an embodiment of the present application includes:
Step S421: Obtain the to-be-processed image from the to-be-processed video stream;
Step S422: Perform feature extraction on the to-be-processed image based on a deep convolutional network to obtain a first feature map;
Step S423: Based on the region proposal network, generate candidate target regions in the first feature map to obtain a second feature map; the second feature map includes at least one detection frame and the position and confidence of each detection frame;
Step S424: During position-sensitive candidate-region pooling of the first feature map and the second feature map by the pooling layer, use a non-maximum suppression algorithm to determine, based on the confidence of each detection frame and the intersection-over-union (IoU) between detection frames in the at least one detection frame, a detection frame that satisfies a preset condition as an object detection frame, and determine the category of the object in the object detection frame;
During implementation, using a non-maximum suppression algorithm to determine a detection frame satisfying the preset condition as an object detection frame, based on the confidence of each detection frame and the IoU between detection frames in the at least one detection frame, includes: based on the confidence of each detection frame, determining the detection frame with the highest confidence among the at least one detection frame as a target detection frame; determining the target detection frame as one object detection frame; determining the IoU between the target detection frame and each other detection frame, where the other detection frames are the detection frames in the at least one detection frame other than the target detection frame; deleting the other detection frames whose IoU is greater than a threshold from the at least one detection frame to obtain a candidate detection frame set; determining the detection frame with the highest confidence in the candidate detection frame set, excluding the target detection frame, as a new target detection frame; determining the new target detection frame as another object detection frame; determining the IoU between the new target detection frame and each new other detection frame, where each new other detection frame is a detection frame in the candidate detection frame set other than the new target detection frame; deleting the new other detection frames whose IoU is greater than the threshold from the candidate detection frame set to obtain a new candidate detection frame set; and so on, until the object detection frames are obtained.
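The procedure spelled out above is standard greedy non-maximum suppression and can be sketched as follows. The IoU threshold of 0.5 and the box format are assumed values for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Repeatedly take the highest-confidence box as the target detection
    frame and delete remaining boxes whose IoU with it exceeds the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return kept  # indices of the surviving object detection frames

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the overlapping box 1 is suppressed
```

This greedy loop guarantees that each object keeps exactly one highest-confidence frame, which is the property the embodiment relies on.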
Step S425: When it is determined, based on the category of the object, that the preset intrusion object exists in any of the object detection frames, determine that the preset intrusion object exists in the to-be-processed image;
Step S426: Determine the position of the preset intrusion object based on the position of the object detection frame in which the preset intrusion object exists;
Step S427: Determine whether the preset intrusion object exists in the object detection frame;
Step S428: When it is determined that a preset intrusion object exists in the object detection frame, identify the to-be-processed image to obtain an intrusion detection area;
Step S429: Determine whether an intrusion event occurs based on the position of the preset intrusion object and the intrusion detection area.
Step S430: In response to the occurrence of the intrusion event, output an alarm identifier;
During implementation, taking the identification of pedestrian intrusion events on expressways as an example, outputting an alarm identifier can quickly guide the intruding object out of the dangerous area and prevent traffic accidents. The identification results can also be used to discover locations with high-frequency intrusions and to strengthen preventive measures.
Step S431: In response to the occurrence of the intrusion event, record the intrusion event based on the category of the intrusion object and the intrusion detection area to obtain an intrusion record;
Step S432: Store the intrusion record or send it to an associated terminal.
In the embodiments of the present application, in the process of detecting a preset intrusion object, the confidence of each detected object and the position of its detection frame are obtained first. A non-maximum suppression algorithm is then used to merge detection frames whose IoU is greater than the threshold, and detection frames satisfying the preset condition are determined as object detection frames. In this way, non-maximum suppression finally assigns each object in the to-be-processed image a single, most suitable object detection frame.
In the embodiments of the present application, when the object detection frame is determined to be within the intrusion detection area, an alarm identifier is output; the intrusion event is recorded based on the category of the intrusion object and the intrusion detection area to obtain an intrusion record; and the intrusion record is stored or sent to an associated terminal. In this way, the intruding object can be quickly guided out of the intrusion detection area according to the alarm identifier, effectively preventing intrusion objects from entering the area. Intrusion events can also be recorded, locations with high-frequency intrusions can be discovered from the records, and preventive measures can be strengthened.
Pedestrians and non-motor vehicles often enter expressways by mistake or intentionally, affecting the normal movement of vehicles on the road and severely impacting traffic safety. Video patrols need to perform real-time, active detection of pedestrians and non-motor vehicles on the road; when a pedestrian or non-motor vehicle is found within the driving range of an expressway, a timely warning should be issued and the traffic police department notified to respond promptly, guiding and urging the pedestrian or non-motor vehicle to leave the expressway driving area, thereby eliminating road-safety hazards and improving the road-safety index.
Early video patrol systems relied mainly on analysts polling the recordings of image acquisition devices to find pedestrians who had strayed onto the expressway, and then taking corresponding measures. Although this approach can effectively find intruding pedestrians, analysis efficiency is low, omissions are relatively likely, and the polling is not very timely. With the development of computer vision technology, object detection algorithms have improved greatly; using them to pre-screen pedestrians appearing in images and videos has greatly improved analysts' work efficiency. In recent years, object detection algorithms have adopted data-driven, deep-learning-based approaches that further improve the precision and recall of pedestrian intrusion detection. How algorithm accuracy can reach or even surpass manual analysis has become a research hotspot.
Normally, pedestrian intrusion is not a common event on expressways, which places high demands on detection accuracy: for example, 99% accuracy means that only one false alarm is allowed per hundred events. Although deep-learning-based object detection can in theory reach the target accuracy by increasing training data and model capacity, doing so requires substantial manual effort to annotate detection frames and greatly increases the hardware cost of running the algorithm, which is an urgent problem for deploying algorithms for long-tail events such as pedestrian intrusion. On the other hand, pedestrian intrusion requires a prohibited area to be defined, and manually demarcating this area entails a large amount of redundant operation and maintenance work in large-scale applications.
FIG. 5 is a display diagram of a smart transportation platform provided by an embodiment of the present application. As shown in FIG. 5, the smart transportation platform 501 is used to display pedestrian/non-motor-vehicle intrusion images identified on expressways using the intrusion detection method.
FIG. 6 is a pedestrian/vehicle intrusion view provided by an embodiment of the present application. As shown in FIG. 6, the pedestrian/vehicle intrusion view 601 is displayed after clicking a pedestrian/non-motor-vehicle intrusion image shown in FIG. 5, and shows an enlarged image of the intrusion together with image details, such as the time and location of the pedestrian/non-motor-vehicle intrusion.
To illustrate that the solution provided by the embodiments of the present application can significantly save the computing power of hardware devices, the judgment of a pedestrian entering a prohibited area on an expressway is described in chronological order as an example. FIG. 7 is a schematic flowchart of an intrusion detection method provided by an embodiment of the present application. As shown in FIG. 7, the time axis involves six moments, T1 to T6 in sequence, and the workflow is as follows:
Step S700: At time T1, input the to-be-processed image into the detector to obtain at least one candidate pedestrian detection frame;
The to-be-processed image may be the original image or an image obtained by preprocessing the original image. The processing of step S700 is divided into a feature extraction stage and a detection stage:
Feature extraction stage: the deep convolutional network 402 shown in FIG. 4 is used, comprising vector convolution operation 1 (conv1), vector convolution operation 2 (conv2), dense vector convolution operation 3 (dense conv3), and dense vector convolution operation 4 (dense conv4). These four convolutional stages extract features from the to-be-processed image to obtain the first feature map; based on the region proposal network 403, candidate target regions are generated in the first feature map to obtain the second feature map. The second feature map includes at least one detection frame, the position of each detection frame, and the confidence of the detected target.
Detection stage: as shown in FIG. 4, the first and second feature maps are subjected to position-sensitive candidate-region pooling, i.e., both are input into the position-sensitive candidate-region pooling layer 404 to obtain the bounding-box regression result 405 and the classification result 406. After this processing, the confidence of each detected target and the position of its detection frame are obtained; detection frames satisfying the preset condition are determined as object detection frames, and the category of the object in each object detection frame is determined.
Finally, the non-maximum suppression algorithm merges detection frames whose IoU is greater than the threshold and outputs the object detection frames (candidate pedestrian detection frames) that meet the requirements.
Step S701: At time T2, crop out each candidate pedestrian detection frame and input it into the cascade classifier to obtain a classification result;
Here, the cascade classifier can be obtained through training. For example, in the training stage, 300,000 small detection-alarm images are collected, of which 60,000 are positive samples and 240,000 are negative samples. These data are first annotated with pedestrian/non-pedestrian binary labels, and then a ResNet18 network and a ResNet50 network are trained separately using stochastic gradient descent. As shown in FIG. 2, the first residual network 211 may use the ResNet18 network to make a coarse judgment on the image, which is efficient but has a higher misjudgment rate; the second residual network 231 may use the ResNet50 network to make a fine judgment, with a low misjudgment rate. By cascading multiple classifiers, classification accuracy is improved stage by stage while maintaining high recall, finally yielding high-precision pedestrian detection results.
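The coarse-to-fine cascade can be sketched as two stages in which a candidate reaches the expensive fine classifier only if the cheap coarse one does not reject it. The stub scoring functions, field names, and thresholds below are illustrative stand-ins for the trained ResNet18/ResNet50 models, not the patent's implementation.

```python
# Sketch of a two-stage cascade: a fast, permissive coarse classifier filters
# candidates, and only survivors reach the slower, more precise classifier.

def coarse_score(crop):   # stand-in for the fast ResNet18 stage
    return crop["coarse"]

def fine_score(crop):     # stand-in for the precise ResNet50 stage
    return crop["fine"]

def cascade_classify(crops, coarse_thresh=0.3, fine_thresh=0.7):
    confirmed = []
    for crop in crops:
        if coarse_score(crop) < coarse_thresh:
            continue                        # rejected cheaply; recall stays high
        if fine_score(crop) >= fine_thresh:
            confirmed.append(crop["name"])  # high-precision pedestrian
    return confirmed

crops = [
    {"name": "a", "coarse": 0.9, "fine": 0.95},  # true pedestrian
    {"name": "b", "coarse": 0.8, "fine": 0.2},   # coarse-stage false alarm
    {"name": "c", "coarse": 0.1, "fine": 0.9},   # rejected at the cheap stage
]
print(cascade_classify(crops))  # ['a']
```

The design point is cost: most candidates never incur the fine model's inference cost, which is why the cascade keeps high recall without high compute.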
Step S702: At time T3, determine whether the cascade classification model has acquired a valid image;
The cascade classification model can determine whether the required valid image exists in the to-be-processed image. For example, when detecting pedestrians prohibited from entering an expressway, at time T3 it can be determined whether the cascade classification model has acquired an image containing an intruding pedestrian. When no valid image is acquired, there is no need to identify the target area (the expressway area) in the to-be-processed image, and the process ends. In this way, target-area identification does not have to be performed on every input original image, which significantly saves the computing power of hardware devices.
Step S703: When it is determined that the cascade classifier has acquired a valid image, at time T4 input the to-be-processed image into the semantic segmentation model to obtain the pedestrian-prohibited area of the expressway;
Here, because the appearance of a pedestrian on an expressway is a long-tail event, i.e., an event with an extremely low probability of occurring in a fixed scene, the to-be-processed image is input into the semantic segmentation model to obtain the pedestrian-prohibited area of the expressway only after it is determined that the cascade classifier has acquired a valid image. Since semantic segmentation then only needs to be performed on valid images, the algorithm's demand for computing power in similar long-tail event analysis is effectively reduced.
Referring to FIG. 3, the input of the semantic segmentation model is the to-be-processed image 301. It first passes through a five-layer convolutional network (conv1, conv2, conv3, conv4 and conv5) 302, which downsamples the image by a factor of 32 while encoding it; it then passes through four deconvolution layers (dconv1, dconv2, dconv3 and dconv4) 303, which upsample the encoding result by a factor of 32, decoding it and performing semantic understanding, to obtain the expressway area (gray, labeled 1) and the non-expressway area (black, labeled 0). Marking the expressway area gray and the non-expressway area black provides a visualization; labeling them 1 and 0 respectively makes it possible to use the differently labeled regions to quickly identify the location of an intruding object.
Step S704: At time T5, based on the pedestrian classification result and the pedestrian-prohibited area of the expressway, determine whether the pedestrian has entered the prohibited area;
After steps S701 to S704 are completed, the pedestrian detection frame and the semantic segmentation map are obtained.
Let the top-left point of the pedestrian detection frame be (x1, y1) and the bottom-right point be (x2, y2), and let the semantic segmentation result be a two-dimensional matrix G. Here, because of the perspective relationship, judging whether a pedestrian has entered the area requires choosing a pedestrian anchor point lying as close as possible to the ground plane.
In theory, the center point of the bottom edge of the pedestrian detection frame can be selected as the pedestrian anchor point, and the pedestrian is judged to be intruding when this point falls in the region labeled 1, i.e., formula (1):

    G[y2][(x1 + x2) / 2] = 1    (1)
In actual use, to obtain a more robust result, the mean of the segmentation labels over a region near the center point of the pedestrian's bottom edge is computed to judge whether the pedestrian has entered the prohibited area. The formula for judging pedestrian intrusion is formula (2):

    (1 / |Ω|) · Σ_{(x, y) ∈ Ω} G[y][x] > τ    (2)

where Ω is a neighborhood of the bottom-edge center point ((x1 + x2) / 2, y2) and τ is a preset decision threshold.
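The region-mean decision can be sketched as averaging the mask values in a small window around the bottom-edge center point. The window half-width of 1 and the 0.5 decision threshold are illustrative assumptions standing in for Ω and τ.

```python
def intrusion_by_region_mean(G, box, half_width=1, tau=0.5):
    """Average the segmentation labels in a window around the bottom-edge
    center point (cx, y2) of the detection frame; the pedestrian is judged
    to have intruded if the mean exceeds the threshold tau."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) // 2
    h, w = len(G), len(G[0])
    vals = [
        G[y][x]
        for y in range(max(0, y2 - half_width), min(h, y2 + half_width + 1))
        for x in range(max(0, cx - half_width), min(w, cx + half_width + 1))
    ]
    return sum(vals) / len(vals) > tau

G = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
print(intrusion_by_region_mean(G, (1, 0, 3, 2)))  # True
```

Averaging over a window rather than testing a single pixel tolerates small segmentation errors at the region boundary, which is the robustness property formula (2) provides over formula (1).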
Step S705: At time T6, output the pedestrian intrusion result.
As shown in FIG. 5, pedestrian/non-motor-vehicle intrusion images identified on the expressway using the intrusion detection method are displayed. After clicking a pedestrian/non-motor-vehicle intrusion image shown in FIG. 5, as shown in FIG. 6, an enlarged image of the intrusion and image details, such as the time and location of the pedestrian/non-motor-vehicle intrusion, are displayed.
The embodiments of the present application propose an intrusion detection method with cascaded event detection. First, at time T1, pedestrian detection is performed based on the detection model; second, at time T2, candidate targets are filtered a second time by the cascade classifier; at time T3, it is judged whether an intrusion object exists in the to-be-processed image, and if there is no intrusion object, the original image is not semantically segmented; at time T4, a to-be-processed image containing an intrusion object is semantically segmented to determine whether the target appears in the prohibited area; at time T5, the object detection frame and the intrusion detection area are input together for judgment; and the judgment is completed at time T6. By cascading multiple algorithm modules, the scheme achieves fully automatic pedestrian intrusion detection on expressways without significantly increasing the demand for computing power. Semantic segmentation is used to identify the prohibited area, so no manual annotation is required. The multi-model cascade implementation allows the detection, classification, and segmentation algorithm modules to be upgraded independently, and reduces the algorithm's demand for computing power for long-tail event detection.
Based on the foregoing embodiments, an embodiment of the present application provides an intrusion detection apparatus. The apparatus includes the modules described below and the sub-modules included in each module, and may be implemented by a processor in a computer device; it may, of course, also be implemented by a specific logic circuit. During implementation, the processor may be a central processing unit (CPU), a microprocessor (MPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), or the like.
An embodiment of the present application provides an intrusion detection apparatus. FIG. 8 is a schematic diagram of the structure of the intrusion detection apparatus according to an embodiment of the present application. As shown in FIG. 8, the apparatus 800 includes:
an obtaining module 810, configured to obtain the to-be-processed image from the to-be-processed video stream;
a detection module 820, configured to detect objects in the to-be-processed image to obtain at least one object detection frame;
a first determination module 830, configured to determine whether the preset intrusion object exists in the object detection frame;
an identification module 840, configured to, when it is determined that a preset intrusion object exists in the object detection frame, identify the to-be-processed image to obtain an intrusion detection area;
a second determination module 850, configured to determine whether an intrusion event occurs based on the position of the preset intrusion object and the intrusion detection area.
In some embodiments, the detection module 820 includes a detection sub-module, a first determination sub-module, and a second determination sub-module. The detection sub-module is configured to detect objects in the image to be processed to obtain at least one object detection frame, the position of each object detection frame, and the category of the object in each object detection frame. The first determination sub-module is configured to determine that the preset intrusion object exists in the image to be processed when it is determined, based on the category of the object, that the preset intrusion object exists in any of the object detection frames. The second determination sub-module is configured to determine the position of the preset intrusion object based on the position of the object detection frame in which the preset intrusion object exists.
In some embodiments, the detection sub-module includes a deep convolutional network, a region proposal network, and a pooling layer. The deep convolutional network is configured to perform feature extraction on the image to be processed to obtain a first feature map. The region proposal network is configured to generate candidate target regions in the first feature map to obtain a second feature map; the second feature map includes at least one detection frame and the position and confidence of each detection frame. The pooling layer is configured to, in the process of performing position-sensitive candidate-region pooling on the first feature map and the second feature map, determine a detection frame satisfying a preset condition as an object detection frame based on the at least one detection frame and the confidence of each detection frame, and determine the category of the object in the object detection frame.
In some embodiments, the detection sub-module includes a non-maximum suppression unit that uses a non-maximum suppression algorithm to determine a detection frame satisfying a preset condition as an object detection frame, based on the confidence of each detection frame and the intersection-over-union between the detection frames in the at least one detection frame.
In some embodiments, the non-maximum suppression unit includes a first determination sub-unit, a second determination sub-unit, a deletion sub-unit, a third determination sub-unit, and a fourth determination sub-unit. The first determination sub-unit is configured to determine, based on the confidence of each detection frame, the detection frame with the highest confidence among the at least one detection frame as a target detection frame, and to determine the target detection frame as one object detection frame. The second determination sub-unit is configured to determine the intersection-over-union between the target detection frame and each other detection frame, where the other detection frames are the detection frames, among the at least one detection frame, other than the target detection frame. The deletion sub-unit is configured to delete the other detection frames whose intersection-over-union is greater than a threshold from the at least one detection frame to obtain a candidate detection frame set. The third determination sub-unit is configured to determine the detection frame with the highest confidence in the candidate detection frame set, other than the target detection frame, as a new target detection frame, and to determine the new target detection frame as one object detection frame. The fourth determination sub-unit is configured to determine the intersection-over-union between the new target detection frame and each new other detection frame, where each new other detection frame is a detection frame in the candidate detection frame set other than the new target detection frame. The deletion sub-unit is further configured to delete the new other detection frames whose intersection-over-union is greater than the threshold from the candidate detection frame set to obtain a new candidate detection frame set; and so on, until the object detection frames are obtained.
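The greedy procedure these sub-units describe is standard non-maximum suppression. A minimal sketch, assuming each detection frame is an `(x1, y1, x2, y2)` tuple paired with a confidence score (the representation is illustrative, not mandated by the embodiment):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-confidence box as the
    target detection frame and delete remaining boxes whose IoU with
    it exceeds the threshold, until no candidates remain."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)  # highest-confidence box -> target frame
        kept.append(boxes[best])
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return kept
```

Each iteration corresponds to one round of the "determine target frame, compute IoU, delete overlapping frames" cycle described above.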
In some embodiments, the first determination module 830 includes a first classification sub-module, a second classification sub-module, and a third determination sub-module. The first classification sub-module uses a first-level classifier to perform a first classification, based on the category of the object, on the object detection frames corresponding to a target category, to obtain a first classification result. The second classification sub-module uses a second-level classifier cascaded with the first-level classifier to perform, based on the first classification result, a second classification on the object detection frames satisfying a preset condition, to obtain a second classification result. The third determination sub-module determines that the preset intrusion object exists in the object detection frame when it is determined, based on the second classification result, that the preset intrusion object exists in any of the object detection frames. The first-level classifier and the second-level classifier have the following relationships: the classification accuracy of the first-level classifier is lower than that of the second-level classifier; the number of convolutional layers in the first-level classifier is smaller than that in the second-level classifier; and the confidence of the first-level classifier is lower than that of the second-level classifier.
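The two-level cascade can be sketched as follows: the shallow, cheap first-level classifier discards easy negatives, and only candidates it accepts reach the deeper, more accurate second-level classifier. The classifier callables and the thresholds `t1`/`t2` here are illustrative assumptions, not values from the embodiment.

```python
def cascade_filter(crops, first_level, second_level, t1=0.3, t2=0.7):
    """Two-level cascade classification sketch.

    first_level/second_level are hypothetical scoring functions
    returning a confidence in [0, 1]; t1 and t2 are illustrative
    acceptance thresholds for each stage.
    """
    confirmed = []
    for crop in crops:
        if first_level(crop) < t1:    # cheap early rejection
            continue
        if second_level(crop) >= t2:  # accurate confirmation
            confirmed.append(crop)
    return confirmed
```

Because most candidates are rejected by the first stage, the expensive second-level classifier runs on only a small fraction of the crops, which is why the cascade filters false positives without a large compute cost.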
In some embodiments, the identification module 840 is further configured to perform semantic segmentation on the image to be processed by using a convolutional neural network model, to obtain the intrusion detection area.
In some embodiments, the second determination module 850 includes a judgment sub-module and a fourth determination sub-module. The judgment sub-module is configured to judge, based on the position of the preset intrusion object, whether the preset intrusion object is located within the intrusion detection area. The fourth determination sub-module is configured to determine that the intrusion event has occurred in response to the preset intrusion object being located within the intrusion detection area, or to determine that the intrusion event has not occurred in response to the preset intrusion object being located outside the intrusion detection area.
In some embodiments, the second determination module 850 further includes a fifth determination sub-module, a sixth determination sub-module, and a seventh determination sub-module. The fifth determination sub-module is configured to determine the object detection frame in which the preset intrusion object exists as a target detection frame. The sixth determination sub-module is configured to determine the center point of the bottom edge of the target detection frame as the position of the preset intrusion object. The seventh determination sub-module is configured to determine whether an intrusion event has occurred based on the relative positional relationship between the center point of the bottom edge of the target detection frame and the intrusion detection area.
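The judgment in these sub-modules reduces to a point-in-region test on the bottom-edge center of the target detection frame (a pedestrian's foot position). A sketch, assuming the intrusion detection area is provided as a binary mask from the segmentation step; the mask representation as a 2D list of 0/1 values is an assumption for illustration.

```python
def bottom_center(box):
    """Bottom-edge center of an (x1, y1, x2, y2) detection frame,
    used as the foot position of the intrusion object."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) // 2, y2)

def is_intrusion(box, area_mask):
    """True if the bottom-center point of the target detection frame
    falls inside the intrusion detection area; area_mask is a 2D
    list of 0/1 values indexed as [row][col]."""
    x, y = bottom_center(box)
    h, w = len(area_mask), len(area_mask[0])
    return 0 <= y < h and 0 <= x < w and area_mask[y][x] == 1
```

Using the bottom center rather than the box center avoids false alarms for objects whose upper body overlaps the area in the image while their feet remain outside it.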
In some embodiments, the intrusion detection apparatus further includes an output module configured to output an alarm identifier in response to the occurrence of the intrusion event.
Based on the foregoing embodiments, the intrusion detection apparatus further includes a recording module and a sending module. The recording module is configured to record the intrusion event, in response to its occurrence, based on the category of the intrusion object and the intrusion detection area, to obtain an intrusion record. The sending module is configured to store the intrusion record or send it to an associated terminal.
The descriptions of the above apparatus embodiments are similar to those of the above method embodiments and have similar beneficial effects. For technical details not disclosed in the apparatus embodiments of the present application, refer to the descriptions of the method embodiments of the present application.
It should be noted that, in the embodiments of the present application, if the above intrusion detection method is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present application, in essence or in the parts contributing to the related art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, a robot, a server, or the like) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc. Thus, the embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, an embodiment of the present application provides a computer-readable storage medium, which may be a volatile or non-volatile storage medium, on which a computer program is stored. When the computer program is executed by a processor, the steps of the intrusion detection method provided in the above embodiments are implemented.
Correspondingly, an embodiment of the present application provides a computer device. FIG. 9 is a schematic diagram of a hardware entity of the computer device according to an embodiment of the present application. As shown in FIG. 9, the hardware entity of the device 900 includes a memory 901 and a processor 902. The memory 901 stores a computer program executable on the processor 902, and when the processor 902 executes the program, the steps of the intrusion detection method provided in the above embodiments are implemented.
The memory 901 is configured to store instructions and applications executable by the processor 902, and may also cache data (for example, image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 902 and the modules in the computer device 900. It may be implemented by flash memory (FLASH) or random access memory (RAM).
It should be pointed out here that the descriptions of the above storage medium and device embodiments are similar to those of the above method embodiments and have similar beneficial effects. For technical details not disclosed in the storage medium and device embodiments of the present application, refer to the descriptions of the method embodiments of the present application.
It should be understood that references throughout the specification to "one embodiment" or "an embodiment" mean that a particular feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the present application. Thus, the appearances of "in one embodiment" or "in an embodiment" in various places throughout the specification do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above processes do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application. The above sequence numbers of the embodiments of the present application are for description only and do not represent the superiority or inferiority of the embodiments.
It should be noted that, herein, the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus including a series of elements includes not only those elements, but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.
In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other manners. The device embodiments described above are merely illustrative. For example, the division of the units is only a logical functional division, and there may be other division manners in actual implementation; for instance, multiple units or components may be combined, or may be integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may all be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium, and when the program is executed, the steps of the above method embodiments are performed. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a read-only memory (ROM), a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of the present application is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present application, in essence or in the parts contributing to the related art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, a robot, a server, or the like) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disc.
The methods disclosed in the several method embodiments provided in this application may be arbitrarily combined, provided there is no conflict, to obtain new method embodiments.
The above are only embodiments of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions readily conceivable by those skilled in the art within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Industrial Applicability
In the embodiments of the present disclosure, the image to be processed is obtained from a video stream to be processed, so that images from the video stream collected by the image capture device can be used as input for analyzing the video stream, which effectively improves the utilization of the image capture device. Objects in the image to be processed are first detected to obtain at least one object detection frame, and it is then determined whether a preset intrusion object exists in the object detection frame. In this way, the provided detection model and classification model are decoupled: during algorithm deployment, the classification model can be customized for special scenarios to quickly reach the expected performance, removing the dependence on the accuracy of a single detection model, so that the speed and accuracy of the algorithm are greatly improved. Further, because the detection model and classification model are decoupled, optimizing against false positives in a new scenario only requires adding false-positive data to train a new classifier and cascading it with the existing detection model, which suits the rapid upgrade iterations of algorithm deployment. Filtering false positives in a cascaded manner can greatly improve the detection accuracy of long-tail events. When it is determined that a preset intrusion object exists in the object detection frame, the image to be processed is identified to obtain an intrusion detection area, and whether an intrusion event has occurred is determined based on the position of the preset intrusion object and the intrusion detection area. In this way, intrusion detection area identification is performed only on the images to be processed that are confirmed to contain an intrusion object, rather than on all images to be processed, which significantly reduces the computing power required of the hardware. This achieves efficient, fully automatic detection of whether an intrusion object has intruded into the intrusion detection area, eliminates the need for manual annotation of the area, and facilitates large-scale online deployment.

Claims (27)

  1. An intrusion detection method, the method being executed by a computer device, the method comprising:
    obtaining the image to be processed from a video stream to be processed;
    detecting objects in the image to be processed to obtain at least one object detection frame;
    determining whether a preset intrusion object exists in the object detection frame;
    identifying the image to be processed to obtain an intrusion detection area when it is determined that a preset intrusion object exists in the object detection frame; and
    determining, based on the position of the preset intrusion object and the intrusion detection area, whether an intrusion event has occurred.
  2. The method according to claim 1, wherein the detecting objects in the image to be processed to obtain at least one object detection frame comprises:
    detecting objects in the image to be processed to obtain at least one object detection frame, the position of each object detection frame, and the category of the object in each object detection frame;
    determining that the preset intrusion object exists in the image to be processed when it is determined, based on the category of the object, that the preset intrusion object exists in any of the object detection frames; and
    determining the position of the preset intrusion object based on the position of the object detection frame in which the preset intrusion object exists.
  3. The method according to claim 2, wherein the detecting objects in the image to be processed to obtain at least one object detection frame, the position of each object detection frame, and the category of the object in each object detection frame comprises:
    performing feature extraction on the image to be processed based on a deep convolutional network to obtain a first feature map;
    generating candidate target regions in the first feature map based on a region proposal network to obtain a second feature map, the second feature map comprising at least one detection frame and the position and confidence of each detection frame; and
    in the process of performing position-sensitive candidate-region pooling on the first feature map and the second feature map based on a pooling layer, determining, based on the at least one detection frame and the confidence of each detection frame, a detection frame satisfying a preset condition as an object detection frame, and determining the category of the object in the object detection frame.
  4. The method according to claim 3, wherein the determining, based on the at least one detection frame and the confidence of each detection frame, a detection frame satisfying a preset condition as an object detection frame comprises:
    determining, by using a non-maximum suppression algorithm, a detection frame satisfying a preset condition as an object detection frame, based on the confidence of each detection frame and the intersection-over-union between the detection frames in the at least one detection frame.
  5. The method according to claim 4, wherein the determining, by using a non-maximum suppression algorithm, a detection frame satisfying a preset condition as an object detection frame based on the confidence of each detection frame and the intersection-over-union between the detection frames in the at least one detection frame comprises:
    determining, based on the confidence of each detection frame, the detection frame with the highest confidence among the at least one detection frame as a target detection frame, and determining the target detection frame as one object detection frame;
    determining the intersection-over-union between the target detection frame and each other detection frame, wherein the other detection frames are the detection frames, among the at least one detection frame, other than the target detection frame;
    deleting the other detection frames whose intersection-over-union is greater than a threshold from the at least one detection frame to obtain a candidate detection frame set;
    determining the detection frame with the highest confidence in the candidate detection frame set, other than the target detection frame, as a new target detection frame, and determining the new target detection frame as one object detection frame;
    determining the intersection-over-union between the new target detection frame and each new other detection frame, wherein each new other detection frame is a detection frame in the candidate detection frame set other than the new target detection frame;
    deleting the new other detection frames whose intersection-over-union is greater than the threshold from the candidate detection frame set to obtain a new candidate detection frame set; and
    so on, until the object detection frames are obtained.
  6. The method according to claim 1, wherein the determining whether the preset intrusion object exists in the object detection frame comprises:
    performing, by using a first-level classifier, a first classification on the object detection frames corresponding to a target category based on the category of the object, to obtain a first classification result;
    performing, by using a second-level classifier cascaded with the first-level classifier, a second classification on the object detection frames satisfying a preset condition based on the first classification result, to obtain a second classification result; and
    determining that the preset intrusion object exists in the object detection frame when it is determined, based on the second classification result, that the preset intrusion object exists in any of the object detection frames.
  7. The method according to claim 6, wherein the first-level classifier and the second-level classifier have the following relationship:
    the classification accuracy of the first-level classifier is lower than that of the second-level classifier;
    the first-level classifier has fewer convolutional layers than the second-level classifier;
    the confidence of the first-level classifier is lower than that of the second-level classifier.
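Claims 6 and 7 describe a classifier cascade: a shallow, low-precision first stage cheaply rejects most candidate frames, so the deeper, more accurate second stage only runs on the few survivors. A hedged sketch, where `coarse` and `fine` stand in for the two classifiers and both names and thresholds are hypothetical:

```python
def cascade_classify(crops, coarse, fine, coarse_threshold=0.3, fine_threshold=0.5):
    # Stage 1: a cheap, low-confidence classifier scores every crop and
    # discards obvious negatives.
    survivors = [c for c in crops if coarse(c) >= coarse_threshold]
    # Stage 2: a slower, higher-precision classifier confirms only the
    # crops that passed the first stage.
    return [c for c in survivors if fine(c) >= fine_threshold]
```

This is why the first-level classifier may have lower accuracy and confidence: its only job is to keep the expensive second stage off the vast majority of non-intrusion frames.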
  8. The method according to any one of claims 1 to 7, wherein identifying the to-be-processed image to obtain an intrusion detection area comprises:
    performing semantic segmentation on the to-be-processed image with a convolutional neural network model to obtain the intrusion detection area.
  9. The method according to any one of claims 1 to 8, wherein determining, based on the position of the preset intrusion object and the intrusion detection area, whether an intrusion event occurs comprises:
    judging, based on the position of the preset intrusion object, whether the preset intrusion object is located within the intrusion detection area;
    in response to the preset intrusion object being located within the intrusion detection area, determining that the intrusion event occurs; or, in response to the preset intrusion object being located outside the intrusion detection area, determining that the intrusion event does not occur.
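Since claim 8 obtains the intrusion detection area by semantic segmentation, the area is naturally a per-pixel binary mask, and the containment test of claim 9 reduces to a bounds-checked pixel lookup. A minimal sketch under that assumption:

```python
def is_inside(mask, point):
    # `mask` is a 2-D binary array (rows of 0/1) from the segmentation
    # model; `point` is (x, y) in pixel coordinates. Points outside the
    # image are treated as outside the intrusion detection area.
    x, y = int(point[0]), int(point[1])
    h, w = len(mask), len(mask[0])
    return 0 <= x < w and 0 <= y < h and mask[y][x] == 1
```

If the area were instead stored as a polygon, the same decision could be made with a point-in-polygon test; the mask lookup is simply the cheaper form when segmentation output is already available.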
  10. The method according to any one of claims 1 to 7, wherein determining, based on the position of the preset intrusion object and the intrusion detection area, whether an intrusion event occurs comprises:
    determining the object detection frame in which the preset intrusion object exists as a target detection frame;
    determining the center point of the bottom edge of the target detection frame as the position of the preset intrusion object;
    determining, based on the relative positional relationship between the center point of the bottom edge of the target detection frame and the intrusion detection area, whether an intrusion event occurs.
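The bottom-edge midpoint of claim 10 approximates the point where a pedestrian or vehicle touches the ground plane, which is the natural point to test against a road-surface region. A one-line sketch, assuming the `(x1, y1, x2, y2)` box format with y increasing downward:

```python
def object_position(box):
    # Midpoint of the bottom edge of an (x1, y1, x2, y2) detection box:
    # roughly where the detected object meets the ground.
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, y2)
```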
  11. The method according to any one of claims 1 to 10, further comprising:
    outputting an alarm identifier in response to the occurrence of the intrusion event.
  12. The method according to any one of claims 1 to 11, further comprising:
    recording, in response to the occurrence of the intrusion event, the intrusion event based on the category of the intrusion object and the intrusion detection area, to obtain an intrusion record;
    storing the intrusion record or sending it to an associated terminal.
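The intrusion record of claim 12 can be sketched as a small serializable structure. The field names below are hypothetical; the claim only requires that the object category and the intrusion detection area be recorded:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class IntrusionRecord:
    # Illustrative fields: the claim mandates recording the intrusion
    # object's category and the detection area; a timestamp is a common
    # but assumed addition.
    category: str
    area_id: str
    timestamp: float

    def serialize(self):
        # JSON is an assumed wire format for storing the record or
        # sending it to an associated terminal.
        return json.dumps(asdict(self))
```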
  13. An intrusion detection apparatus, comprising:
    an obtaining module, configured to obtain a to-be-processed image from a to-be-processed video stream;
    a detection module, configured to detect objects in the to-be-processed image to obtain at least one object detection frame;
    a first determination module, configured to determine whether a preset intrusion object exists in the object detection frame;
    an identification module, configured to identify the to-be-processed image to obtain an intrusion detection area when it is determined that the preset intrusion object exists in the object detection frame;
    a second determination module, configured to determine, based on the position of the preset intrusion object and the intrusion detection area, whether an intrusion event occurs.
  14. The apparatus according to claim 13, wherein the detection module comprises:
    a detection submodule, configured to detect objects in the to-be-processed image to obtain at least one object detection frame, the position of each object detection frame, and the category of the object in each object detection frame;
    a first determination submodule, configured to determine that the preset intrusion object exists in the to-be-processed image when it is determined, based on the categories of the objects, that the preset intrusion object exists in any of the object detection frames;
    a second determination submodule, configured to determine the position of the preset intrusion object based on the position of the object detection frame in which the preset intrusion object exists.
  15. The apparatus according to claim 14, wherein the detection submodule comprises:
    a deep convolutional network, configured to perform feature extraction on the to-be-processed image to obtain a first feature map;
    a region proposal network, configured to generate candidate target regions in the first feature map to obtain a second feature map, the second feature map comprising at least one detection frame and the position and confidence of each detection frame;
    a pooling layer, configured to determine, in the process of performing position-sensitive candidate-region pooling on the first feature map and the second feature map, a detection frame satisfying a preset condition as an object detection frame based on the at least one detection frame and the confidence of each detection frame, and to determine the category of the object in the object detection frame.
  16. The apparatus according to claim 15, wherein the detection submodule comprises a non-maximum suppression unit, configured to determine, using a non-maximum suppression algorithm, a detection frame satisfying a preset condition as an object detection frame based on the confidence of each detection frame and the intersection-over-union between the detection frames in the at least one detection frame.
  17. The apparatus according to claim 16, wherein the non-maximum suppression unit comprises:
    a first determination subunit, configured to determine, based on the confidence of each detection frame, the detection frame with the highest confidence among the at least one detection frame as a target detection frame, and to determine the target detection frame as one of the object detection frames;
    a second determination subunit, configured to determine the intersection-over-union between the target detection frame and each other detection frame, where each other detection frame refers to a detection frame in the at least one detection frame other than the target detection frame;
    a deletion subunit, configured to delete the other detection frames whose intersection-over-union is greater than a threshold from the at least one detection frame, to obtain a candidate detection frame set;
    a third determination subunit, configured to determine the detection frame with the highest confidence in the candidate detection frame set, other than the target detection frame, as a new target detection frame, and to determine the new target detection frame as one of the object detection frames;
    a fourth determination subunit, configured to determine the intersection-over-union between the new target detection frame and each new other detection frame, where each new other detection frame refers to a detection frame in the candidate detection frame set other than the new target detection frame;
    the deletion subunit being further configured to delete the new other detection frames whose intersection-over-union is greater than the threshold from the candidate detection frame set, to obtain a new candidate detection frame set;
    and so on, until the object detection frames are obtained.
  18. The apparatus according to claim 13, wherein the first determination module comprises:
    a first classification submodule, configured to perform, with a first-level classifier, a first classification on the object detection frames corresponding to a target category based on the categories of the objects, to obtain a first classification result;
    a second classification submodule, configured to perform, with a second-level classifier cascaded with the first-level classifier, a second classification on the object detection frames satisfying a preset condition based on the first classification result, to obtain a second classification result;
    a third determination submodule, configured to determine, when it is determined based on the second classification result that the preset intrusion object exists in any of the object detection frames, that the preset intrusion object exists in the object detection frame.
  19. The apparatus according to claim 18, wherein the first-level classifier and the second-level classifier have the following relationship: the classification accuracy of the first-level classifier is lower than that of the second-level classifier; the first-level classifier has fewer convolutional layers than the second-level classifier; and the confidence of the first-level classifier is lower than that of the second-level classifier.
  20. The apparatus according to any one of claims 13 to 19, wherein the identification module is further configured to perform semantic segmentation on the to-be-processed image with a convolutional neural network model to obtain the intrusion detection area.
  21. The apparatus according to any one of claims 13 to 20, wherein the second determination module comprises:
    a judgment submodule, configured to judge, based on the position of the preset intrusion object, whether the preset intrusion object is located within the intrusion detection area;
    a fourth determination submodule, configured to determine that the intrusion event occurs in response to the preset intrusion object being located within the intrusion detection area, or to determine that the intrusion event does not occur in response to the preset intrusion object being located outside the intrusion detection area.
  22. The apparatus according to any one of claims 13 to 19, wherein the second determination module further comprises:
    a fifth determination submodule, configured to determine the object detection frame in which the preset intrusion object exists as a target detection frame;
    a sixth determination submodule, configured to determine the center point of the bottom edge of the target detection frame as the position of the preset intrusion object;
    a seventh determination submodule, configured to determine, based on the relative positional relationship between the center point of the bottom edge of the target detection frame and the intrusion detection area, whether an intrusion event occurs.
  23. The apparatus according to any one of claims 13 to 22, further comprising an output module, configured to output an alarm identifier in response to the occurrence of the intrusion event.
  24. The apparatus according to any one of claims 13 to 23, further comprising:
    a recording module, configured to record, in response to the occurrence of the intrusion event, the intrusion event based on the category of the intrusion object and the intrusion detection area, to obtain an intrusion record;
    a sending module, configured to store the intrusion record or send it to an associated terminal.
  25. A computer device, comprising a memory and a processor, the memory storing a computer program runnable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 12.
  26. A storage medium storing executable instructions which, when executed, cause a processor to implement the method according to any one of claims 1 to 12.
  27. A computer program product comprising one or more instructions adapted to be loaded by a processor to perform the method according to any one of claims 1 to 12.
PCT/CN2021/087835 2020-12-31 2021-04-16 Invasion detection method and apparatus, device, storage medium, and program product WO2022141962A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011620177.3A CN112668496A (en) 2020-12-31 2020-12-31 Intrusion detection method, device, equipment and storage medium
CN202011620177.3 2020-12-31

Publications (1)

Publication Number Publication Date
WO2022141962A1 (en) 2022-07-07

Family

ID=75412094

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/087835 WO2022141962A1 (en) 2020-12-31 2021-04-16 Invasion detection method and apparatus, device, storage medium, and program product

Country Status (2)

Country Link
CN (1) CN112668496A (en)
WO (1) WO2022141962A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887806B (en) * 2021-05-09 2023-04-07 电子科技大学 Long-tail cascade popularity prediction model, training method and prediction method
CN113255533B (en) * 2021-05-31 2022-06-21 中再云图技术有限公司 Method for identifying forbidden zone intrusion behavior, storage device and server
CN113344900B (en) * 2021-06-25 2023-04-18 北京市商汤科技开发有限公司 Airport runway intrusion detection method, airport runway intrusion detection device, storage medium and electronic device
CN113469021A (en) * 2021-06-29 2021-10-01 深圳市商汤科技有限公司 Video processing apparatus, electronic device, and computer-readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105303163A (en) * 2015-09-22 2016-02-03 深圳市华尊科技股份有限公司 Method and detection device for target detection
CN106156785A (en) * 2015-04-07 2016-11-23 佳能株式会社 Method for checking object and body detection device
CN107784289A (en) * 2017-11-02 2018-03-09 深圳市共进电子股份有限公司 A kind of security-protecting and monitoring method, apparatus and system
CN111126317A (en) * 2019-12-26 2020-05-08 腾讯科技(深圳)有限公司 Image processing method, device, server and storage medium
CN111813997A (en) * 2020-09-08 2020-10-23 平安国际智慧城市科技股份有限公司 Intrusion analysis method, device, equipment and storage medium
CN112052787A (en) * 2020-09-03 2020-12-08 腾讯科技(深圳)有限公司 Target detection method and device based on artificial intelligence and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697424A (en) * 2018-12-19 2019-04-30 浙江大学 A kind of high-speed railway impurity intrusion detection device and method based on FPGA and deep learning
CN111160125B (en) * 2019-12-11 2023-06-30 北京交通大学 Railway foreign matter intrusion detection method based on railway monitoring


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115909215A (en) * 2022-12-09 2023-04-04 厦门农芯数字科技有限公司 Edge intrusion early warning method and system based on target detection
CN115909215B (en) * 2022-12-09 2023-07-14 厦门农芯数字科技有限公司 Edge intrusion early warning method and system based on target detection
CN116030423A (en) * 2023-03-29 2023-04-28 浪潮通用软件有限公司 Regional boundary intrusion detection method, equipment and medium

Also Published As

Publication number Publication date
CN112668496A (en) 2021-04-16


Legal Events

Code 121: the EPO has been informed by WIPO that EP was designated in this application (ref document number: 21912727; country of ref document: EP; kind code of ref document: A1).
Code NENP: non-entry into the national phase (ref country code: DE).
Code 32PN: public notification in the EP bulletin as the address of the addressee cannot be established (free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.11.2023)).