CN111860140A - Target event detection method and device, computer equipment and storage medium
- Publication number: CN111860140A
- Application number: CN202010521931.1A
- Authority: CN (China)
- Prior art keywords: target, action, preset, frame, video frame
- Legal status: Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
A target event detection method and apparatus, a computer device and a storage medium. The method comprises: acquiring a current video frame and identifying a preset alert area in it; performing target detection in the current video frame and obtaining a target detection frame when a target is detected; if the target detection frame and the preset alert area satisfy a preset positional relationship condition, sequentially identifying the continued interactions between the target and the preset alert area in each subsequent consecutive video frame; and determining whether a target event occurs based on an interaction sequence formed by an initial action and each continued interaction, where the initial action is the action corresponding to the target in the target detection frame. By detecting the positional relationship between the target and the alert area and combining it with the change in the target's action state over a period of time, the method improves the accuracy of target event detection and reduces false positives.
Description
Technical Field
The present application relates to the field of computer technology, and in particular to a target event detection method and apparatus, a computer device, and a storage medium.
Background
With the development of computer vision technology, more and more security services use computer vision to identify and evaluate targets or events in a monitored scene, generating an alarm to prompt the user when certain conditions are met.
In the related art, however, a detection frame of a human body is usually obtained first, its intersection with a region labeled as a preset alert area is calculated, and whether a target event involving interaction between the human body and the preset alert area has occurred is decided solely from that intersection relationship; if so, an alarm is raised directly. While this enables alarms for conventional target events, it can also produce a large number of false positives for people who are merely present near the preset alert area.
Disclosure of Invention
In view of the above, there is a need for a target event detection method and apparatus, a computer device, and a storage medium capable of reducing such false positives.
A target event detection method, the method comprising:
acquiring a current video frame, and identifying a preset alert area from the current video frame;
performing target detection in the current video frame, and obtaining a target detection frame when a target is detected in the current video frame;
if the target detection frame and the preset alert area satisfy a preset positional relationship condition, sequentially identifying the continued interactions between the target and the preset alert area in each subsequent consecutive video frame; and
determining whether a target event occurs based on an interaction sequence formed by an initial action and each continued interaction, wherein the initial action is the action corresponding to the target in the target detection frame.
A target event detection apparatus, the apparatus comprising:
an area identification module, configured to acquire a current video frame and identify a preset alert area from the current video frame;
a target detection module, configured to perform target detection in the current video frame and obtain a target detection frame when a target is detected in the current video frame;
an action identification module, configured to sequentially identify, if the target detection frame and the preset alert area satisfy the preset positional relationship condition, the continued interactions between the target and the preset alert area in each subsequent consecutive video frame; and
a target event detection module, configured to determine whether a target event occurs based on an interaction sequence formed by an initial action and each continued interaction, wherein the initial action is the action corresponding to the target in the target detection frame.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the steps of the method above.
A computer-readable storage medium on which a computer program is stored which, when executed by a processor, implements the steps of the method above.
According to the above target event detection method and apparatus, computer device, and storage medium, after the current video frame is acquired, a preset alert area is identified in it and target detection is performed; a target detection frame is obtained when a target is detected, and if the target detection frame and the preset alert area satisfy the preset positional relationship condition, the continued interactions between the target and the preset alert area are sequentially identified in each subsequent consecutive video frame; whether a target event occurs is then determined from the interaction sequence formed by the initial action corresponding to the target in the target detection frame and each continued interaction. In this way, the positional relationship between the target and the alert area is detected and combined with the change in the target's action state over a period of time, which improves the accuracy of target event detection and reduces false positives.
Drawings
FIG. 1 is a diagram of an application environment of a target event detection method in one embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a method for target event detection in one embodiment;
FIG. 3 is a diagram illustrating three representative wall areas determined from target points in an exemplary embodiment;
FIG. 4 is a flow diagram illustrating the classification of a current video frame by ResNet in one embodiment;
FIG. 5 is a schematic flow chart diagram illustrating a method for detecting a target event in another embodiment;
FIG. 6 is a schematic flow chart diagram illustrating a method for detecting a target event in another embodiment;
FIG. 7 is a flow diagram of determining a predicted target detection frame of a target in the next video frame in one embodiment;
FIG. 8 is a block diagram of a target event detection device in one embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The target event detection method provided by the present application can be applied in the environment shown in FIG. 1, in which the terminal 102 communicates with the terminal 104 via a network. After the terminal 104 acquires the current video frame from the terminal 102, it identifies a preset alert area in the current video frame, performs target detection on the current video frame, and obtains a target detection frame when a target is detected; if the target detection frame and the preset alert area satisfy a preset positional relationship condition, it sequentially identifies the continued interactions between the target and the preset alert area in each subsequent consecutive video frame, and determines whether a target event occurs from the interaction sequence formed by the initial action corresponding to the target in the target detection frame and each continued interaction. The terminal 102 may be, but is not limited to, any of various video capture or video imaging devices; in other embodiments it may also be a device connected to a video capture device that stores the captured video data. The terminal 104 may be, but is not limited to, a personal computer, laptop, smartphone, tablet, or portable wearable device.
In other embodiments, the target event detection method may also be used in a terminal-server scenario, where the terminal may be any of various video capture or imaging devices, or a device connected to one that stores captured video data, and the server may be an independent server or a cluster of multiple servers.
In one embodiment, as shown in FIG. 2, a target event detection method is provided. Taking its application to the terminal 104 in FIG. 1 as an example, the method comprises steps S210 to S240.
Step S210: obtain a current video frame, and identify a preset alert area from the current video frame.
The target event detection method in this embodiment can be applied to the monitoring picture captured by a video capture device such as a surveillance camera, to determine whether a target event occurs in that picture. In security services, certain locations must be monitored for specific target events, and when such an event occurs, warning information can be generated to alert the relevant staff. In one embodiment, the terminal 104 obtains video frames directly from the surveillance camera or similar device for analysis; in another embodiment, the terminal 104 first stores the video frame data obtained from the camera, and the video frame acquisition module in its target event detection apparatus then reads frames from the storage module for analysis. It can be understood that in yet another embodiment, the picture taken by the surveillance camera may be stored by a device connected to the camera, in which case the terminal 104 obtains the video frames from that device.
In one embodiment, the current video frame is the video frame at the current time. It can be understood that the picture in the current video frame shows the specific monitored location, and the occurrence of a target event is to be detected within a preset alert area in that picture. In a specific embodiment, the method detects whether a human body climbs over a wall, and the preset alert area is the wall area.
Further, in one embodiment, determining the preset alert area from the current video frame comprises: acquiring target point position information, and determining the preset alert area from it. The target point positions may be input by the user, for example by clicking the selected points on the display with a mouse or by entering the coordinates directly; there are at least four target points. In this embodiment, the user inputs the boundary points of the preset alert area as the target points, and after acquiring their positions the server connects the target points in sequence to obtain the preset alert area.
In a specific embodiment in which the preset alert area is a wall area, taking four target points as an example, the server connects the first two target points and labels them as the wall head line (the top of the wall), connects the last two and labels them as the wall foot line, and visualizes both lines on the interface; the first two and last two points are distinguished by the order in which the user entered them. The terminal records three areas: a double-line area in which the first two points form the wall head line and the last two the wall foot line; a double-line area in which the last two points form the wall head line and the first two the wall foot line; and the preset alert area enclosed by the four target points, as illustrated in FIG. 3, which shows the three wall areas determined from the target points in this embodiment.
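For illustration only, the sketch below shows one way the wall head line, wall foot line, and the three recorded regions could be derived from four user-selected points; the function names, the region encoding, and the shapely dependency are assumptions of this example, not taken from the patent.

```python
# Illustrative sketch (not the patent's implementation): deriving the wall
# head line, wall foot line, and the three recorded regions from four
# user-selected points.
from shapely.geometry import Polygon

def build_alert_regions(points):
    """points: four (x, y) image coordinates in the order the user clicked
    them; the first two delimit the wall head line, the last two the wall
    foot line."""
    p1, p2, p3, p4 = points
    head_line = (p1, p2)
    foot_line = (p3, p4)
    # The two double-line regions differ only in which line is treated as
    # the wall head line.
    region_front = {"head": head_line, "foot": foot_line}
    region_back = {"head": foot_line, "foot": head_line}
    # Preset alert area enclosed by the four points; the vertex order
    # p1, p2, p4, p3 is an assumption chosen so the quadrilateral does not
    # self-intersect when the two lines are roughly parallel.
    alert_area = Polygon([p1, p2, p4, p3])
    return region_front, region_back, alert_area
```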
In another embodiment, determining the preset alert area from the current video frame comprises: recognizing preset boundary position points in the current video frame, and determining the preset alert area from the recognized points. In this embodiment, recognizable markers are placed in advance at the boundary points of the preset alert area, so that the boundary position points can be obtained by recognizing these markers in the current video frame, and the position of the preset alert area is then determined from them.
Step S220: perform target detection in the current video frame, and obtain a target detection frame when a target is detected in the current video frame.
Target detection, also called target extraction, is a form of image segmentation based on the geometric and statistical characteristics of the target; it unifies segmentation and recognition of the target, and its accuracy and real-time performance are important capabilities of the whole system. The target in the current video frame can be recognized by a classification network within a neural network. When a target is detected in the current video frame, a frame enclosing the target is obtained, namely the target detection frame. In one embodiment, obtaining the target detection frame includes obtaining its coordinate position, such as the coordinates of its four vertices.
In this embodiment, the target is the executing subject of the target event to be detected; in a specific embodiment in which the target event is a human body climbing over a wall, the target is the human body. In one embodiment, one or more targets may be detected in the current video frame.
Further, in one embodiment, performing target detection in the current video frame and generating a detection frame when a target is detected can be implemented by a neural network for target detection; any target detection neural network may be used for this purpose.
Step S230: if the target detection frame and the preset alert area satisfy the preset positional relationship condition, sequentially identify the continued interactions between the target and the preset alert area in each subsequent consecutive video frame.
The positional relationship condition that the target detection frame and the preset alert area must satisfy can be set according to the actual situation. In one embodiment, the condition is judged to be met when the effective area of the target detection frame intersects the preset alert area, the effective area being a preset portion of the target detection frame.
Here, intersecting means that the effective area of the target detection frame overlaps the region covered by the preset alert area. In one embodiment, the effective area is the 20% of the target detection frame nearest its lower edge; when the target is a human body, this portion corresponds to the feet of a standing person.
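As a concrete illustration of this test, the sketch below builds the effective area from the bottom 20% of a detection frame and intersects it with the alert polygon; the axis-aligned (x1, y1, x2, y2) box layout with y increasing downward, and the shapely usage, are assumptions of the example.

```python
# Minimal sketch of the effective-area test: only the bottom 20% of the
# target detection frame (the part corresponding to a standing person's
# feet) is intersected with the preset alert area.
from shapely.geometry import Polygon, box

def meets_position_condition(det_box, alert_area: Polygon, ratio=0.2) -> bool:
    x1, y1, x2, y2 = det_box          # y grows downward in image coordinates
    valid_box = box(x1, y2 - (y2 - y1) * ratio, x2, y2)
    return valid_box.intersects(alert_area)
```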
The subsequent consecutive video frames are the consecutive video frames after the current video frame. The continued interactions between the target and the preset alert area comprise the interactions between them, with the target's actions across the subsequent consecutive frames forming a continuous sequence recorded as the continued interactions. In one embodiment, for each subsequent consecutive video frame it is likewise detected whether the target detection frame and the preset alert area satisfy the positional relationship condition, and if so, the step of identifying the continued interaction is executed for that frame. In one embodiment, sequentially identifying the continued interactions comprises: if, in each subsequent consecutive video frame, the target detection frame corresponding to the target and the preset alert area continue to satisfy the positional relationship condition, sequentially identifying the action corresponding to the target in each of those frames and recording it as a continued interaction between the target and the preset alert area.
In one embodiment, identifying the continued interaction in each video frame can be realized by a classification neural network. In a specific embodiment, ResNet (Residual Network) is used to extract and classify the image features within the detection frame and determine whether a target appears in it; the residual network is trained in advance. Residual networks are easy to optimize and can gain accuracy from increased depth, so a more accurate classification result can be obtained.
In one embodiment, sequentially identifying the continued interactions further comprises: sequentially extracting image features from the target within the target detection frame in each subsequent consecutive video frame, performing action classification on those features, and obtaining the action classification result corresponding to the target in each frame; the continued interactions between the target and the preset alert area are then obtained from the per-frame classification results.
Further, in one embodiment, the target in the target detection frame is classified by a preset classifier that provides several preset action categories. For example, when the target is a human body, the classification result is the action the human body performs in the current frame; the preset categories may specifically include standing, stooping, squatting, raising the hands, raising the legs, and others. The target detection frame is input into the preset classifier, which outputs the action category corresponding to the target. In one embodiment, the preset classifier is implemented with ResNet (a residual network).
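A minimal inference sketch for such a preset classifier is shown below, assuming a torchvision ResNet fine-tuned on the six action categories; the class list, crop convention, and preprocessing are assumptions of the example.

```python
# Sketch of classifying the action of the target inside a detection frame,
# assuming `model` is a ResNet fine-tuned for the six action categories.
import torch
from torchvision import transforms

ACTIONS = ["standing", "stooping", "squatting",
           "raising hands", "raising legs", "other"]

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def classify_action(frame, det_box, model):
    """frame: HxWx3 uint8 array; det_box: integer (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = det_box
    crop = frame[y1:y2, x1:x2]               # image patch inside the frame
    batch = preprocess(crop).unsqueeze(0)    # 1x3x224x224 input tensor
    with torch.no_grad():
        logits = model(batch)
    return ACTIONS[int(logits.argmax(dim=1))]
```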
In this embodiment, an effective area is set for the target detection frame; when this effective area intersects the preset alert area, a valid action is considered detected, tracking of the target begins, and the continued interactions between the target and the preset alert area are identified in each subsequent consecutive video frame.
In a specific embodiment, target tracking can be implemented with an IoU Tracker. IoU (Intersection over Union) is the ratio of the intersection to the union of the predicted bounding box and the ground-truth bounding box. In this embodiment, the IoU is computed between the target detection frames of two adjacent video frames to decide whether the same target appears in both: the "predicted" frame is the target detection frame in the later of the two adjacent frames, and the "ground-truth" frame is the one in the earlier frame.
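The IoU itself is a standard quantity; a minimal sketch is given below for reference.

```python
# Intersection over Union of two axis-aligned boxes given as
# (x1, y1, x2, y2); a standard formulation shown for reference.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)   # overlap rectangle
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0
```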
In this embodiment, the IoU tracker determines whether the same target from the current video frame is present in each subsequent consecutive frame; once that is confirmed, each subsequent frame is processed to identify the continued interaction between the target and the preset alert area.
The interaction of the same target with the preset alert area in a subsequent frame is the continued interaction of this embodiment. In one embodiment, the positional relationship between the target and the preset alert area is checked in each subsequent frame, and when they satisfy the interaction condition, the action corresponding to the target is determined and recorded as the continued interaction; that is, whenever the target and the preset alert area are in an interacting position, the identified action of the target is a continued interaction. In one embodiment, the target event is a human body climbing over a wall, i.e., the interaction between the human body and the wall across subsequent consecutive frames.
In one embodiment, sequentially identifying the continued interactions comprises: acquiring the next video frame, and, when the next frame is detected to contain the same target as the previous frame, identifying the continued interaction between that target and the preset alert area; then returning to the step of acquiring the next video frame. By acquiring the next frame in a loop and detecting whether the same target is present, the target is tracked, and the continued interactions between the target and the preset alert area in each subsequent consecutive frame of the current video frame are obtained.
In one embodiment, similarly to detecting a target in the current video frame, after the next frame is acquired a pending target detection frame is determined in it; when the image features extracted from the pending frame show that it contains a pending target, and that this pending target and the tracked target are the same, the target is judged to be detected in the next frame. There may be one or more pending target detection frames in the next frame.
In this embodiment, each detection frame found in the next video frame is marked as a pending target detection frame. When image features extracted from a pending frame confirm that it contains a pending target, that pending target is matched against the target in the current frame to decide whether they are the same. In a specific embodiment, the IoU between the pending target and the target in the current frame is calculated, and when it exceeds the IoU threshold, the two are judged to be the same target.
In a specific embodiment, when multiple targets (target detection frames) are detected in the current video frame and multiple pending target detection frames exist in the next frame, the Hungarian algorithm is used to assign each pending detection frame to a target detection frame, i.e., to determine, for each target in the next frame, the corresponding target in the current frame. The Hungarian algorithm is a combinatorial optimization algorithm that solves the assignment problem in polynomial time.
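A minimal matching sketch follows, assuming scipy's linear_sum_assignment as the Hungarian solver and reusing the iou() helper from the earlier sketch; the 0.3 threshold is an illustrative assumption.

```python
# Sketch of assigning pending detection frames in the next video frame to
# the targets of the previous frame with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(prev_boxes, next_boxes, iou_threshold=0.3):
    """Returns {prev_index: next_index} for pairs whose IoU clears the
    threshold; unmatched entries on either side are simply absent."""
    cost = np.zeros((len(prev_boxes), len(next_boxes)))
    for i, pb in enumerate(prev_boxes):
        for j, nb in enumerate(next_boxes):
            cost[i, j] = -iou(pb, nb)   # maximizing IoU = minimizing -IoU
    rows, cols = linear_sum_assignment(cost)
    return {i: j for i, j in zip(rows, cols) if -cost[i, j] > iou_threshold}
```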
In other embodiments, the target event may be a different event, and the corresponding preset action sequence is configured accordingly.
Step S240: determine whether a target event occurs based on the interaction sequence formed by the initial action and each continued interaction.
The initial action is the action corresponding to the target in the target detection frame of the current video frame, and the continued interactions are the target's actions in each subsequent consecutive frame. In one embodiment, the initial action and the continued interactions are arranged in chronological order to obtain the interaction sequence corresponding to the target.
In one embodiment, determining whether a target event occurs comprises: forming the interaction sequence from the initial action and the continued interactions in chronological order, and judging that the target event occurs when the interaction sequence matches the preset action sequence of the target event.
A target event is an event composed of a series of actions; therefore, in this embodiment, the interaction sequence is compared with the preset action sequence of the target event, and if they are consistent, the target event of interaction between the target and the preset alert area is judged to have occurred. The preset action sequence can be configured according to the actual situation.
In one embodiment, the target event is a human body climbing over a wall: the interaction sequence is checked against the preset action sequence for this event, and the event is judged to occur if they match. The crossing can be decomposed into going up the wall, being on the wall, and coming down the wall, and each stage can in turn be decomposed into finer actions. Specifically, a change from standing to climbing, a change from sitting to climbing, or a climbing action alone can be taken as going up the wall; stooping or sitting can be taken as being on the wall; and coming down the wall can correspond to changes from stooping to climbing, stooping to sitting, stooping to standing, or sitting to standing. Table 1 lists the action sequences corresponding to the three stages; when the interaction sequence contains some combination of an up-wall, an on-wall, and a down-wall action sequence, the wall-crossing target event can be judged to occur. It can be understood that in other embodiments the preset action sequence of the target event can be set differently.
| | Up the wall | On the wall | Down the wall |
| --- | --- | --- | --- |
| Action sequence | Stand → climb | Stoop | Stoop → climb |
| Action sequence | Sit → climb | Sit | Stoop → sit |
| Action sequence | Climb | | Stoop → stand |
| Action sequence | | | Sit → stand |

TABLE 1
In this embodiment, the target event is decomposed into several actions with a preset order, and the interaction sequence is compared with this preset action sequence to determine whether the actions of the target event have occurred; if the interaction sequence is consistent with the preset action sequence, the target event is judged to occur. In other words, it is not enough for the target to interact with the preset alert area at a single moment; the change of that interaction over a period of time is also required before a target event is declared, which improves detection accuracy and reduces false positives.
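One way to realize this comparison is sketched below: the event fires when an up-wall pattern, an on-wall pattern, and a down-wall pattern from Table 1 appear in order in the interaction sequence. The label strings and the matching strategy are assumptions of the example.

```python
# Sketch of matching an interaction sequence against the preset action
# sequences of Table 1; labels and matching strategy are illustrative.
UP_WALL = [("stand", "climb"), ("sit", "climb"), ("climb",)]
ON_WALL = [("stoop",), ("sit",)]
DOWN_WALL = [("stoop", "climb"), ("stoop", "sit"),
             ("stoop", "stand"), ("sit", "stand")]

def find_pattern(actions, patterns, start):
    """Index just past the first pattern matched at or after `start`,
    or -1 if none matches."""
    for i in range(start, len(actions)):
        for pat in patterns:
            if tuple(actions[i:i + len(pat)]) == pat:
                return i + len(pat)
    return -1

def is_wall_crossing(actions):
    """actions: chronological action labels for one target."""
    pos = find_pattern(actions, UP_WALL, 0)
    if pos >= 0:
        pos = find_pattern(actions, ON_WALL, pos)
    if pos >= 0:
        pos = find_pattern(actions, DOWN_WALL, pos)
    return pos >= 0
```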
After the current video frame is acquired, a preset alert area is identified in it and a target detection frame is determined; when the target detection frame meets the event start condition, the continued interactions between the target and the preset alert area are sequentially identified in each subsequent consecutive video frame, and whether a target event occurs is determined from the interaction sequence formed by the initial action corresponding to the target in the target detection frame and each continued interaction. The event start condition is that the target detection frame and the preset alert area satisfy the preset positional relationship. In this way, the positional relationship between the target and the alert area is detected and combined with the change in the target's action state over a period of time, which improves the accuracy of target event detection and reduces false positives.
In one embodiment, after the target detection frame is obtained upon detecting a target in the current video frame, the method further comprises: inputting the target detection frame into a preset neural network for classification to obtain a classification result; when the classification result indicates a specific target, proceeding to the step of sequentially identifying the continued interactions between the target and the preset alert area in each subsequent consecutive video frame if the target detection frame and the preset alert area satisfy the positional relationship condition. In this embodiment, the initial action is the action corresponding to the specific target in the target detection frame as determined from the classification result, and the continued interactions are those between the specific target and the preset alert area in each subsequent consecutive frame, likewise determined from classification results.
The preset neural network may be a classification network trained in advance. The classification result may be used to decide whether the detection frame contains a human body or only background, or to recognize whether it contains a specific target (e.g., a specific person). Further, in one embodiment, recognizing whether the target is a specific target and classifying its action can be accomplished by the same residual network, as shown in FIG. 4, a flow diagram of detecting and classifying the current video frame with ResNet: the current video frame is input into ResNet, which outputs two classification results, namely the probability that the detection frame in the current video frame contains the target, and the probabilities of the action categories corresponding to the target. That is, a single preset classification network determines both whether the target detection frame contains the specific target and what action the target is performing.
In this embodiment, classifying the target detection frame with a preset neural network yields both whether the corresponding target is the specific target and the action that target is performing, and classification by a neural network gives a more accurate result. In a specific embodiment in which the target event is a human body climbing over a wall, the preset neural network determines whether the target detection frame contains a human or only background, and, when it contains a human, the corresponding action.
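A minimal sketch of such a two-output network follows, with a torchvision ResNet-18 backbone; the backbone choice and layer sizes are assumptions of the example, not the patent's specification.

```python
# Sketch of the two-attribute classifier: a shared ResNet backbone with one
# fully connected head for human-vs-background and one for the action class.
import torch.nn as nn
from torchvision.models import resnet18

class TwoHeadClassifier(nn.Module):
    def __init__(self, num_actions: int = 6):
        super().__init__()
        backbone = resnet18(weights=None)     # train from scratch or load weights
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()           # keep the feature extractor only
        self.backbone = backbone
        self.human_head = nn.Linear(feat_dim, 2)             # human / background
        self.action_head = nn.Linear(feat_dim, num_actions)  # action category

    def forward(self, x):
        feats = self.backbone(x)
        return self.human_head(feats), self.action_head(feats)
```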
Further, as shown in FIG. 5, in one embodiment, if the target detection frame and the preset alert area satisfy the positional relationship condition, the method further comprises step S510: constructing a target sequence and storing in it the initial action corresponding to the target in the target detection frame. In this embodiment, sequentially identifying the continued interactions comprises step S520: acquiring each subsequent consecutive video frame, determining the continued interaction between the target and the preset alert area in it, and storing that continued interaction in the target sequence corresponding to the target.
Mathematically, a sequence is a collection of objects (or events) arranged in a line, so that every element is either before or after every other element. In this embodiment, the actions identified in each video frame are stored as the elements of a target sequence, which thus contains the initial action followed by each continued interaction in order. After the target sequence is constructed, the action of the target in the target detection frame of the current video frame is identified, recorded as the initial action, and stored in the sequence.
In one embodiment, when multiple targets are detected in the current video frame, a target sequence is constructed for each of them; the continued interaction of each target with the preset alert area is identified separately in the subsequent consecutive frames, and each detected continued interaction is stored in the target sequence of the same target.
In this embodiment, by constructing a target sequence for each target and storing its initial action in the current frame and its continued interactions in each subsequent frame, the interaction sequence can be obtained directly by reading the stored actions from the target sequence in chronological order.
In another embodiment, as shown in FIG. 6, after the target sequence is constructed the method further comprises step S610: storing the target detection frame and the corresponding initial action in the target sequence. In this embodiment, after sequentially identifying the continued interactions, the method further comprises step S620: when the target is detected in the next video frame, determining the predicted target detection frame of the target in that frame and storing it in the target sequence corresponding to the target.
In one embodiment, storing the target detection frame in the target sequence means storing its position information. When multiple targets exist in the current video frame, they can be distinguished by the positions of their detection frames; when the continued interactions of the targets are detected in subsequent frames, each target's continued interaction can be identified separately based on its position and stored in the corresponding target sequence.
Further, in one embodiment, as shown in FIG. 7, determining the predicted target detection frame of the target in the next video frame comprises step S710: obtaining the target detection frame of the target in the last video frame recorded in the target sequence and taking it as the reference target detection frame; and step S720: calculating the IoU between the reference target detection frame and each pending target detection frame in the next video frame, and taking the pending frame whose IoU exceeds the IoU threshold as the predicted target detection frame in the next frame. A pending target detection frame is simply a target detection frame found in the next video frame.
The last video frame of the target in the target sequence is the most recent frame for which an initial action or continued interaction was stored for the target. For clarity, the target detection frame of the target in that frame is marked as the reference target detection frame, and each detection frame in the next frame as a pending target detection frame; there may be several pending frames. The IoU between each pending frame and the reference frame is calculated, and the pending frame whose IoU exceeds the threshold is taken as the predicted target detection frame of the corresponding target in the next frame. In a specific embodiment, let the target in the previous frame be target A with reference detection frame X, for which continued interactions are being identified across subsequent frames; given pending detection frames 1, 2, 3, ... n in the next frame, the IoU between each of them and X is calculated, and if the IoU of pending frame 3 exceeds the threshold, pending frame 3 is determined to be the predicted target detection frame of target A in the next frame.
In this embodiment, constructing a target sequence per target and storing the detection frame (position information) and the actions (initial action and continued interactions) identified in each video frame makes it possible, when several targets are detected in one frame, to track each of them individually and decide separately whether each is executing the actions of a target event.
In one embodiment, when no continued interaction between the target and the preset alert area is identified in the subsequent consecutive video frames, the last update time of the continued interactions in the target sequence is obtained; if the interval between that time and the current time exceeds a preset time threshold, updating of the target sequence ends.
No continued interaction is identified when the target is not detected in the subsequent frames, or when it is detected but no interaction with the preset alert area exists.
The last update time is the time at which the most recent continued interaction was added to the target sequence. In this embodiment, when no continued interaction is identified, the target sequence is examined to decide whether to keep tracking the target: if the interval between the last update time and the current time exceeds the preset threshold, the target may have left the preset alert area or stopped executing the target event, so updating ends and the target's continued interactions are no longer identified in subsequent frames. Otherwise, identification of continued interactions with the preset alert area continues for each subsequent consecutive frame.
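The bookkeeping this implies can be as simple as the sketch below; the dataclass layout and the 5-second threshold are assumptions of the example.

```python
# Sketch of a per-target sequence with the last-update timeout check.
import time
from dataclasses import dataclass, field

@dataclass
class TargetSequence:
    boxes: list = field(default_factory=list)    # detection frame positions
    actions: list = field(default_factory=list)  # initial + continued actions
    last_update: float = field(default_factory=time.time)

    def append(self, box, action):
        self.boxes.append(box)
        self.actions.append(action)
        self.last_update = time.time()

    def expired(self, timeout_s: float = 5.0) -> bool:
        # End updating when no continued interaction has been added within
        # the preset time threshold.
        return time.time() - self.last_update > timeout_s
```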
In one embodiment, the number of actions stored in the target sequence is checked every preset time period, and when it exceeds a preset threshold, the step of determining whether a target event occurs from the interaction sequence is entered. In another embodiment, the existing interaction sequence in the target sequence can simply be examined after every preset time period to determine whether a target event occurs; the period can be set according to the actual situation. In yet another embodiment, whenever the target sequence exists, the interaction sequence formed by the initial action and each newly added continued interaction is examined, i.e., the interaction sequence is checked continuously as long as the target sequence exists, which improves detection timeliness and avoids the problems caused by late detection. Further, in one embodiment, when the target event is determined to occur, alarm information is generated and sent to the corresponding preset supervisor, prompting the relevant personnel to take appropriate measures.
In the following embodiment, the target event detection method is described in detail using the detection of a human body climbing over a wall as an example.
A current video frame is acquired and the preset alert area is identified in it. When a human body is detected in the current frame, a human body detection frame (the target detection frame above) is obtained, and a preset classifier extracts image features from the target inside it and outputs two attribute classification results: whether the target in the frame is a human body, and the action corresponding to that target. The first output further filters out non-human targets; the second, produced when the target is a human body, classifies the extracted human-body features and outputs the action category. In one embodiment, the preset classifier provides six action categories: standing, stooping, squatting, raising the hands, raising the legs, and others. In a specific embodiment, the classifier uses ResNet (a residual network) as its backbone, and the head network uses two fully connected layers to predict the two attributes (the probability that the detection frame contains a human body, and the probability of each action category).
Meanwhile, the positional relationship between the human body detection frame and the preset alert area is judged. When the effective area of the detection frame (the 20% nearest its lower edge, called the valid box) is detected to intersect the preset alert area, the event start condition is judged to be met; a target sequence corresponding to the detection frame is constructed, and the frame's position information and the corresponding initial action are stored in it. If several human body detection frames in the current video frame intersect the preset alert area, a target sequence is constructed for each of them.
An IoU Tracker is used to track the targets whose sequences have been constructed. The subsequent next video frame is acquired; the human body detection frames in the current frame are recorded as candidate frames and those in the next frame as prediction frames. Each prediction frame is matched against the candidate frames by IoU calculation, and the Hungarian algorithm selects the target sequence containing the candidate frame matched to each prediction frame. If the IoU between a prediction frame and the candidate frame of its currently matched target sequence exceeds the set threshold, the match is considered successful and the prediction frame is added to that target's sequence; otherwise the match fails. If a target sequence matches no prediction frame, whether it has ended is decided from its last update time: when the interval between the last update time and the current time exceeds the preset time threshold, tracking of that sequence ends. Further, in one embodiment, if a prediction frame matches no target sequence, a new target sequence is created for it so the new target can be recorded and tracked.
From each target sequence, the actions recorded for the corresponding human body detection frames yield the interaction sequence generated between that human body and the alert area, and whether a wall-crossing target event has occurred is then determined from the change of the target's action state within that sequence.
This target event detection method not only detects the positional relationship between the target and the alert area but also combines it with the change of the target's action state over a period of time to decide whether a human body has climbed over the wall, which improves detection accuracy and reduces false positives, while remaining efficient, accurate, and robust.
It should be understood that although the steps in the flow charts of FIGS. 2-7 are shown in an order indicated by arrows, they are not necessarily performed in that order; unless explicitly stated herein, there is no strict ordering restriction, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-7 may comprise several sub-steps or stages, which need not be performed at the same moment or in sequence, but may be performed in turn or alternately with other steps or with sub-steps of other steps.
In one embodiment, as shown in fig. 8, there is provided a target event detecting apparatus including: an area identification module 810, an object detection module 820, an action identification module 830, and an object event detection module 840, wherein:
the region identification module 810 is configured to obtain a current video frame and identify a preset alert region from the current video frame;
a target detection module 820, configured to perform target detection in the current video frame, and obtain a target detection frame when a target is detected in the current video frame;
the action identification module 830 is configured to, if the target detection frame and the preset alert area satisfy the preset position relationship condition, sequentially identify a continuation interactive action between the target and the preset alert area in subsequent consecutive video frames;
the target event detection module 840 is configured to determine whether a target event occurs based on an interactive action sequence formed by an initial action and each continuation interactive action, where the initial action is the action corresponding to the target in the target detection frame.
After acquiring the current video frame, the target event detection apparatus identifies the preset alert area in the current video frame and performs target detection on the current video frame; when a target is detected in the current video frame, a target detection frame is obtained. If the target detection frame and the preset alert area satisfy the preset positional relationship condition, the continuation interactive actions between the target and the preset alert area are identified in each subsequent consecutive video frame, and whether a target event occurs is determined according to the interactive action sequence formed by the initial action corresponding to the target in the target detection frame and each continuation interactive action. In this way, in addition to detecting the positional relationship between the target and the alert area, whether the target event occurs is determined in combination with the target's action state changes over a period of time, which improves the accuracy of target event detection and reduces misjudgment.
In one embodiment, the above apparatus further comprises: the target sequence construction module is used for constructing a target sequence and storing the initial action corresponding to the target detection frame in the target sequence; in this embodiment, the action recognition module 830 is configured to acquire each subsequent continuous video frame, determine a continuation interactive action between the target and the preset alert area in each subsequent continuous video frame, and store the continuation interactive action in the target sequence corresponding to the target.
In one embodiment, the target sequence construction module of the apparatus is further configured to store the target detection frame and the corresponding initial action in the target sequence; in this embodiment, the target detection module 820 is further configured to, when the target is detected in the next video frame, determine a predicted target detection frame of the target in the next video frame, and store the predicted target detection frame in the target sequence corresponding to the target.
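For concreteness, the target sequence described in these embodiments might be represented as follows; the field names and types are illustrative assumptions rather than a layout mandated by the application.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class TargetSequence:
    boxes: List[Box] = field(default_factory=list)    # detection frame, then predicted frames
    actions: List[str] = field(default_factory=list)  # initial action, then continuation interactive actions
    last_update: float = 0.0                          # timestamp of the most recent update

    def record(self, box: Box, action: str, timestamp: float) -> None:
        """Append one frame's box and action and refresh the update time."""
        self.boxes.append(box)
        self.actions.append(action)
        self.last_update = timestamp
```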
In one embodiment, the target detection module 820 of the above apparatus comprises: an acquisition unit, configured to acquire the target detection frame of the target in the last video frame of the target sequence and determine it as a reference target detection frame; and an intersection ratio calculation unit, configured to calculate the intersection ratio (IoU) between the reference target detection frame and each to-be-determined target detection frame in the next video frame, and to determine the to-be-determined target detection frame whose intersection ratio is greater than the intersection ratio threshold as the predicted target detection frame of the target in the next video frame, where the to-be-determined target detection frames are the target detection frames detected in the next video frame.
In one embodiment, the above apparatus further comprises: an update time obtaining module, configured to obtain, when no continuation interactive action between the target and the preset alert area is identified in the subsequent consecutive video frames, the last update time corresponding to the continuation interactive actions in the target sequence; and a judging module, configured to end the updating of the target sequence if the interval between the last update time and the current time exceeds a preset time threshold.
In one embodiment, the target event detection module 840 includes: an interactive action sequence determining unit, configured to determine the interactive action sequence from the initial action and each continuation interactive action in chronological order; and a judging unit, configured to judge that the target event occurs when the interactive action sequence conforms to the preset action sequence of the target event.
In an embodiment, the action recognition module 830 of the apparatus is specifically configured to, if the consecutive target detection frames corresponding to the target and the preset alert area satisfy the preset positional relationship condition in the subsequent consecutive video frames, sequentially identify the action corresponding to the target in each subsequent consecutive video frame and determine it as the continuation interactive action between the target and the preset alert area.
In one embodiment, the apparatus further includes a classification module, configured to input the target detection frame into a preset neural network for classification to obtain a classification result. In this embodiment, when the classification result is a specific target, the action recognition module 830 executes the step of sequentially identifying the continuation interactive actions between the target and the preset alert area in each subsequent consecutive video frame if the target detection frame and the preset alert area satisfy the preset positional relationship condition. In this embodiment, the initial action is the initial action corresponding to the specific target in the target detection frame determined based on the classification result, and the continuation interactive actions are the continuation interactive actions between the specific target and the preset alert area in each subsequent consecutive video frame determined based on the classification result.
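As a hedged illustration of this classification step, the sketch below crops the region of the target detection frame and classifies the crop with a small image classifier. It assumes PyTorch/torchvision; the network, the preprocessing, and the class index for the specific target are placeholders rather than the preset neural network of the application.

```python
import torch
from torchvision import models, transforms
from PIL import Image

SPECIFIC_TARGET_INDEX = 0  # hypothetical class index for the specific target

# Placeholder two-class network; in practice this would be a trained model.
model = models.resnet18(weights=None, num_classes=2)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def is_specific_target(frame: Image.Image, box) -> bool:
    """Crop the target detection frame out of the image and classify the crop."""
    x1, y1, x2, y2 = map(int, box)
    crop = frame.crop((x1, y1, x2, y2))
    with torch.no_grad():
        logits = model(preprocess(crop).unsqueeze(0))
    return logits.argmax(dim=1).item() == SPECIFIC_TARGET_INDEX
```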
For specific limitations of the target event detection apparatus, reference may be made to the limitations of the target event detection method above, which are not repeated here. Each module in the target event detection apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or independent of, a processor in the computer device, or may be stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a target event detection method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring a current video frame, and identifying a preset warning area from the current video frame;
carrying out target detection in the current video frame, and obtaining a target detection frame when a target is detected in the current video frame;
if the target detection frame and the preset alert area meet the preset position relation condition, sequentially identifying the continuous interactive action of the target and the preset alert area in each subsequent continuous video frame;
and determining whether a target event occurs or not based on an interactive action sequence formed by the initial action and each continuous interactive action, wherein the initial action is an action corresponding to the target in the target detection frame.
In one embodiment, the processor, when executing the computer program, further performs the steps of: constructing a target sequence, and storing the initial action corresponding to the target in the target detection frame in the target sequence; and acquiring subsequent continuous video frames, determining the continuous interactive action between the target and the preset alert area in the subsequent continuous video frames, and storing the continuous interactive action into the target sequence corresponding to the target.
In one embodiment, the processor, when executing the computer program, further performs the steps of: storing the target detection frame and the corresponding initial action in a target sequence; and when the target is detected in the next video frame, determining a predicted target detection frame of the target in the next video frame, and storing the predicted target detection frame in the target sequence corresponding to the target.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a target detection frame of a target in a previous video frame in a target sequence, and determining the target detection frame as a reference target detection frame; and calculating the intersection ratio of the reference target detection frame and each to-be-determined target detection frame in the next video frame, and determining the to-be-determined target detection frame with the intersection ratio larger than the intersection ratio threshold as a predicted target detection frame of the target in the next video frame.
In one embodiment, the processor, when executing the computer program, further performs the steps of: when the continuous interactive action of the target and the preset alert area is not identified in the subsequent continuous video frames, obtaining the last update time corresponding to the continuous interactive action in the target sequence; and if the interval between the last update time and the current time exceeds a preset time threshold, ending the updating of the target sequence.
In one embodiment, the processor, when executing the computer program, further performs the steps of: determining the interactive action sequence from the initial action and each continuous interactive action in chronological order; and when the interactive action sequence conforms to the preset action sequence of the target event, judging that the target event occurs.
In one embodiment, the processor, when executing the computer program, further performs the steps of: when a partial area at a preset position in the target detection frame intersects the preset alert area, judging that the target detection frame and the preset alert area satisfy the preset position relation condition.
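One way such a positional relationship condition could be tested is sketched below: the bottom slice of the detection frame is intersected with the alert-area polygon. The choice of the bottom slice, the fraction 0.25, and the use of shapely are illustrative assumptions; the description only requires that a partial area at a preset position intersect the alert area.

```python
from shapely.geometry import Polygon, box as rect

def position_condition_met(det_box, alert_polygon, bottom_frac=0.25):
    """True when the bottom slice of the detection frame intersects the
    preset alert area (image y coordinates grow downward)."""
    x1, y1, x2, y2 = det_box
    bottom = rect(x1, y2 - bottom_frac * (y2 - y1), x2, y2)
    return bottom.intersects(Polygon(alert_polygon))

# e.g. position_condition_met((100, 50, 180, 300),
#                             [(0, 280), (640, 280), (640, 480), (0, 480)])
```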
In one embodiment, the processor, when executing the computer program, further performs the steps of: if the consecutive target detection frames corresponding to the target and the preset alert area satisfy the preset position relation condition in each subsequent continuous video frame, sequentially identifying the action corresponding to the target in each subsequent continuous video frame, and determining the action as the continuous interactive action of the target and the preset alert area.
In one embodiment, the processor, when executing the computer program, further performs the steps of: inputting the target detection frame into a preset neural network for classification to obtain a classification result; and when the classification result is a specific target, sequentially identifying the continuous interactive action of the target and the preset alert area in each subsequent continuous video frame if the target detection frame and the preset alert area satisfy the preset position relation condition.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring a current video frame, and identifying a preset warning area from the current video frame;
carrying out target detection in the current video frame, and obtaining a target detection frame when a target is detected in the current video frame;
if the target detection frame and the preset alert area meet the preset position relation condition, sequentially identifying the continuous interactive action of the target and the preset alert area in each subsequent continuous video frame;
and determining whether a target event occurs or not based on an interactive action sequence formed by the initial action and each continuous interactive action, wherein the initial action is an action corresponding to the target in the target detection frame.
In one embodiment, the computer program when executed by the processor further performs the steps of: constructing a target sequence, and storing the initial action corresponding to the target in the target detection frame in the target sequence; and acquiring subsequent continuous video frames, determining the continuous interactive action between the target and the preset alert area in the subsequent continuous video frames, and storing the continuous interactive action into the target sequence corresponding to the target.
In one embodiment, the computer program when executed by the processor further performs the steps of: storing the target detection frame and the corresponding initial action in a target sequence; and when the target is detected in the next video frame, determining a predicted target detection frame of the target in the next video frame, and storing the predicted target detection frame in the target sequence corresponding to the target.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring a target detection frame of a target in a previous video frame in a target sequence, and determining the target detection frame as a reference target detection frame; and calculating the intersection ratio of the reference target detection frame and each to-be-determined target detection frame in the next video frame, and determining the to-be-determined target detection frame with the intersection ratio larger than the intersection ratio threshold as a predicted target detection frame of the target in the next video frame.
In one embodiment, the computer program when executed by the processor further performs the steps of: when the continuous interactive action of the target and the preset alert area is not identified in the subsequent continuous video frames, obtaining the last update time corresponding to the continuous interactive action in the target sequence; and if the interval between the last update time and the current time exceeds a preset time threshold, ending the updating of the target sequence.
In one embodiment, the computer program when executed by the processor further performs the steps of: determining the interactive action sequence from the initial action and each continuous interactive action in chronological order; and when the interactive action sequence conforms to the preset action sequence of the target event, judging that the target event occurs.
In one embodiment, the computer program when executed by the processor further performs the steps of: when a partial area at a preset position in the target detection frame intersects the preset alert area, judging that the target detection frame and the preset alert area satisfy the preset position relation condition.
In one embodiment, the computer program when executed by the processor further performs the steps of: if the consecutive target detection frames corresponding to the target and the preset alert area satisfy the preset position relation condition in each subsequent continuous video frame, sequentially identifying the action corresponding to the target in each subsequent continuous video frame, and determining the action as the continuous interactive action of the target and the preset alert area.
In one embodiment, the computer program when executed by the processor further performs the steps of: inputting the target detection frame into a preset neural network for classification to obtain a classification result; and when the classification result is a specific target, sequentially identifying the continuous interactive action of the target and the preset alert area in each subsequent continuous video frame if the target detection frame and the preset alert area satisfy the preset position relation condition.
It will be understood by those skilled in the art that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other media used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory or optical storage. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of these technical features, they should all be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is specific and detailed, but should not therefore be construed as limiting the scope of the invention. It should be noted that several variations and modifications can be made by those of ordinary skill in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (12)
1. A method for target event detection, the method comprising:
acquiring a current video frame, and identifying a preset warning area from the current video frame;
carrying out target detection in the current video frame, and obtaining a target detection frame when a target is detected in the current video frame;
if the target detection frame and the preset warning area meet a preset position relation condition, sequentially identifying the continuous interactive action of the target and the preset warning area in each subsequent continuous video frame;
and determining whether a target event occurs or not based on an interactive action sequence formed by an initial action and each continuous interactive action, wherein the initial action is an action corresponding to the target in the target detection frame.
2. The method according to claim 1, wherein, if the target detection frame and the preset alert area satisfy the preset position relation condition, the method further comprises: constructing a target sequence, and storing the initial action corresponding to the target in the target detection frame in the target sequence;
and wherein sequentially identifying the continuous interactive action of the target and the preset alert area in each subsequent continuous video frame comprises:
acquiring subsequent continuous video frames, determining the continuous interactive action between the target and the preset alert area in the subsequent continuous video frames, and storing the continuous interactive action to the target sequence corresponding to the target.
3. The method of claim 2, wherein after the constructing the target sequence, further comprising: storing the target detection frame and the corresponding initial action in a target sequence;
and after sequentially identifying the continuous interactive action of the target and the preset alert area in each subsequent continuous video frame, the method further comprises:
when the target is detected in the next video frame, determining a predicted target detection frame of the target in the next video frame, and storing the predicted target detection frame in the target sequence corresponding to the target.
4. The method of claim 3, wherein determining the predicted target detection block for the target in the next video frame comprises:
acquiring a target detection frame of the target in the last video frame in the target sequence, and determining the target detection frame as a reference target detection frame;
calculating the intersection ratio of the reference target detection frame and each to-be-determined target detection frame in the next video frame, and determining the to-be-determined target detection frame with the intersection ratio larger than an intersection ratio threshold value as a predicted target detection frame of the target in the next video frame; and the undetermined target detection frame is a target detection frame in the next video frame.
5. The method of claim 2, wherein:
when the continuous interactive action of the target and the preset warning area is not identified in the subsequent continuous video frames, obtaining the last updating time corresponding to the continuous interactive action in the target sequence;
and if the interval between the last update time and the current time exceeds the preset time threshold, ending the updating of the target sequence.
6. The method of any one of claims 1 to 5, wherein determining whether a target event occurs based on the initial action and the interaction sequence of each of the continued interactions comprises:
determining an interactive action sequence from the initial action and each continuation interactive action in chronological order;
and when the interactive action sequence conforms to the preset action sequence of the target event, judging that the target event occurs.
7. The method according to any one of claims 1 to 5, characterized in that:
and when a partial area at a preset position in the target detection frame intersects the preset warning area, judging that the target detection frame and the preset warning area satisfy the preset position relation condition.
8. The method according to claim 1, wherein said sequentially identifying the continuous interactive action of the target and the preset alert area in each subsequent continuous video frame comprises:
if the consecutive target detection frames corresponding to the target and the preset alert area satisfy the preset position relation condition in the subsequent continuous video frames, sequentially identifying the action corresponding to the target in the subsequent continuous video frames, and determining the action as the continuous interactive action of the target and the preset alert area.
9. The method of claim 1, wherein after obtaining an object detection frame when an object is detected in the current video frame, further comprising:
inputting the target detection box into a preset neural network for classification to obtain a classification result;
when the classification result is a specific target, entering the step of sequentially identifying the continuous interactive action of the target and the preset alert area in each subsequent continuous video frame if the target detection frame and the preset alert area satisfy the preset position relation condition;
wherein the initial action is the initial action corresponding to the specific target in the target detection frame determined based on the classification result, and the continuous interactive action is the continuous interactive action between the specific target and the preset alert area in each subsequent continuous video frame determined based on the classification result.
10. An apparatus for target event detection, the apparatus comprising:
the area identification module is used for acquiring a current video frame and identifying a preset warning area from the current video frame;
the target detection module is used for carrying out target detection in the current video frame and obtaining a target detection frame when a target is detected in the current video frame;
The action identification module is used for sequentially identifying the continuous interactive action of the target and the preset alert region in each subsequent continuous video frame if the target detection frame and the preset alert region meet the preset position relation condition;
and the target event detection module is used for determining whether a target event occurs or not based on an interactive action sequence formed by an initial action and each continuous interactive action, wherein the initial action is an action corresponding to the target in the target detection frame.
11. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 9.