CN112926481A

CN112926481A - Abnormal behavior detection method and device

Info

Publication number: CN112926481A
Application number: CN202110254248.0A
Authority: CN
Inventors: 巩海军; 潘华东; 殷俊; 张兴明; 彭志蓉; 彭闯
Original assignee: Zhejiang Dahua Technology Co Ltd
Current assignee: Zhejiang Dahua Technology Co Ltd
Priority date: 2021-03-05
Filing date: 2021-03-05
Publication date: 2021-06-08
Anticipated expiration: 2041-03-05
Also published as: CN112926481B

Abstract

The application discloses a method and a device for detecting abnormal behaviors. Human head detection is performed on a first video frame in a video stream to determine each human head in the first video frame; for any human head, it is determined whether the enlarged detection area of that human head contains a set number of human heads; if so, the enlarged detection area is determined to be an effective detection area; for any effective detection area, N detection areas to be checked corresponding to the effective detection area are determined from N second video frames located after the first video frame in the video stream; and whether abnormal behavior exists in the effective detection area is determined according to the trend, in the time dimension, of the number of human heads in the effective detection area and in the N detection areas to be checked. The method can detect abnormal behaviors in real time, addresses the technical problems that target detection is inaccurate and that an occluded target cannot be detected, and addresses the perspective problem of target size.

Description

Abnormal behavior detection method and device

Technical Field

The embodiment of the application relates to the technical field of image processing, in particular to a method and a device for detecting abnormal behaviors.

Background

With rapid economic development, rising living standards, and the continued expansion of urban commercial districts, more and more people shop, dine, and seek entertainment in these districts, making some large commercial districts representative of a city's image and among its most economically active areas. However, crowded public places of this kind may conceal serious safety hazards. If the people in these scenes can be monitored in real time, abnormal crowd behavior can be discovered promptly and corresponding measures taken in time, so that major accidents are avoided.

At present, common crowd anomaly analysis and monitoring systems are based on human body detection. In this detection mode, each individual in a public place must be identified as a human body, which is difficult in itself and more difficult still in a crowded scene; moreover, human body detection places requirements on the size of the human body, so people far from the monitoring equipment may not be identified, which reduces the accuracy of crowd anomaly analysis.

Disclosure of Invention

The application provides a method and a device for detecting abnormal behaviors, which address the technical problems that target detection is inaccurate and that a target cannot be detected when it is occluded, and which address the perspective problem of target size.

In a first aspect, an embodiment of the present application provides a method for detecting an abnormal behavior, where the method includes: carrying out human head detection on a first video frame in a video stream, and determining each human head in the first video frame; determining whether an enlarged detection area of the human head contains a set number of human heads for any human head; if the human heads with the set number exist, determining the expanded detection area as an effective detection area; for any effective detection area, respectively determining N detection areas to be checked corresponding to the effective detection area from N second video frames positioned after the first video frame in the video stream; and determining whether the effective detection area has abnormal behaviors or not according to the variation trends of the number of the heads of the effective detection area and the N detection areas to be checked in the time dimension.

Based on this scheme, when detecting people in a crowd, human head detection relies on fewer features than human body detection. Applying human head detection to the crowd captured by the lens in a public place therefore improves the accuracy of target detection, makes crowded scenes detectable, and avoids the technical problem that an occluded target cannot be detected. In addition, when a lens monitors a crowd in a public place, there is a perspective problem: near targets appear large and far targets appear small. For this reason, the embodiment of the application generates an enlarged detection area for each human head and takes the enlarged detection areas that meet the preset requirement as effective detection areas, so that whether abnormal behavior occurs in an effective detection area can finally be determined from the trend of the number of people in that area across the current video frame and several subsequent video frames. Whether the crowd is close to the lens or far from it, generating an enlarged detection area for each human head represents the changing state of near and far targets within their crowds with the same fidelity, so the method can overcome the perspective problem of near and far targets. Moreover, by extracting the region of interest (namely, the effective detection area) as the object of study, the method avoids analyzing every element of the video frame, so it is suitable for real-time abnormal behavior analysis of video streams of different resolutions, and its effect and speed in extracting the region of interest are significantly better than those of an optical flow method.

In one possible implementation, before the determining whether the valid detection region has abnormal behavior, the method further includes: if the number of the heads of the effective detection area and the N detection areas to be checked is determined to be increased or decreased in the time dimension, forming effective image sequences of the effective detection area and the N detection areas to be checked; the determining whether the valid detection zone has abnormal behavior comprises: and inputting the effective image sequence into a 3D convolutional neural network model, and determining whether abnormal behaviors exist in the effective detection area.

Based on this scheme, in a public place a crowd anomaly can manifest as a sudden gathering or dispersal of people within a short time. Taking an effective detection area as the object of study, it is determined whether the number of heads in that area increases or decreases across the current video frame and several subsequent video frames, so that events showing either trend can, to a large extent, be identified as crowd anomalies. Furthermore, the effective image sequences showing an increasing or decreasing number of people are additionally classified with a 3D convolutional neural network model, which can greatly improve recognition of crowd anomalies and avoid misjudgments that may arise in the preceding crowd anomaly recognition step.

In one possible implementation, the determining that the enlarged detection area is a valid detection area if there are a set number of human heads includes: if the human heads with the set number exist, determining whether a human head detection area meeting a first combination requirement with the expanded detection area exists; and merging the human head detection area meeting the first merging requirement and the enlarged detection area into the effective detection area.

Based on this scheme, for the enlarged detection area generated for a human head, if the number of heads it contains meets the set requirement, the human head detection areas that meet the first merging requirement can additionally be merged into it, forming an effective detection area in a new form. In other words, the influence of a relatively large group of people on its surroundings is extended, so that the dynamic changes among people in a public place can be described more accurately, which is significant for determining crowd anomalies.

In one possible implementation, after the merging the human head detection area satisfying the first merging requirement and the enlarged detection area into the effective detection area, the method further includes: any two valid detection regions that meet the second merge requirement are merged into one valid detection region.

Based on this scheme, the human head detection areas meeting the first merging requirement are merged into an enlarged detection area whose head count meets the set requirement, forming an effective detection area in a new form. Furthermore, if any two effective detection areas of this form meet the second merging requirement, they can be merged again to obtain an effective detection area in yet another form. In other words, for a relatively dense crowd, finding one or more other relatively dense crowds in the environment that are associated with it makes it possible to establish deeper relationships that may exist between different crowds in a public place, which is significant for determining crowd anomalies.

In one possible implementation, the first merging requirement is that an intersection with the enlarged detection area exists; the second merging requirement is that the intersection-over-union ratio of any two effective detection areas meets a set condition.

Based on this scheme, when one or more human head detection areas intersect an enlarged detection area whose head count meets the set requirement, that is, when individuals associated with a relatively dense crowd are scattered around it, those relatively scattered individuals can be brought into the relatively dense crowd and considered together with it as a whole. In addition, consider any two enlarged detection areas that have each been merged with human head detection areas (the number of heads in each meeting the set requirement), that is, any two effective detection areas. If the ratio of the overlapping area of the two effective detection areas to their total area meets the set condition, this indicates a certain degree of mutual influence between the two relatively dense crowds, which in turn indicates that the two crowds behave consistently in the current scene, environment, and atmosphere. The two relatively dense crowds can then be regarded as a single dense crowd, so that abnormal behavior detection for dense crowds in public places based on such effective detection areas can be more accurate.

In one possible implementation, the enlarged detection area of the human head is determined by: expanding the human head detection area where the human head is located by a set proportion, centered on the center of that human head detection area, to obtain the enlarged detection area of the human head.

Based on this scheme, when a lens monitors a crowd in a public place, there is a perspective problem: people close to the lens look larger and people far from the lens look smaller. This poses a challenge for identifying crowd anomalies both near the lens and far from it; that is, a single method is needed that identifies crowd anomalies near the lens and far from the lens with the same accuracy. Therefore, the embodiment of the application enlarges the human head detection areas of both near and far heads by the same set proportion, each enlargement being centered on the center of the human head detection area where the head is located, so that the crowd near the lens and the crowd far from the lens are treated with the same influence within the same view. In this way the method can overcome the perspective problem of near and far targets.
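The center-preserving enlargement described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `Box` type, the function name `expand_box`, and the ratio value 3.0 are assumptions introduced here, and the application does not state a concrete proportion.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def expand_box(box: Box, ratio: float) -> Box:
    """Scale a head detection box about its own center by `ratio`."""
    cx = (box.x1 + box.x2) / 2.0
    cy = (box.y1 + box.y2) / 2.0
    half_w = (box.x2 - box.x1) * ratio / 2.0
    half_h = (box.y2 - box.y1) * ratio / 2.0
    return Box(cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# The same ratio is applied to a large (near) head and a small (far) head,
# so each enlarged area scales with the apparent size of its own head.
near = expand_box(Box(100, 100, 140, 140), ratio=3.0)
far = expand_box(Box(300, 50, 310, 60), ratio=3.0)
print(near)
print(far)
```

Because the enlargement is relative to each head box rather than a fixed pixel margin, a far head yields a proportionally smaller area than a near head, which is the property the scheme relies on.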

In one possible implementation, the 3D convolutional neural network model includes parallel spatial and temporal convolutional layers, a first attention layer after the spatial convolutional layer and a second attention layer after the temporal convolutional layer.

Based on this scheme, the effective image sequence is input into a 3D convolutional neural network model that includes a parallel spatial convolutional layer and temporal convolutional layer, with a first attention layer after the spatial convolutional layer and a second attention layer after the temporal convolutional layer. The attention mechanism focuses on the relationship between target spatial position and temporal order across the frames, so the 3D convolutional neural network model classifies effective image sequences more effectively.

In a second aspect, an embodiment of the present application provides an apparatus for detecting abnormal behavior, where the apparatus includes: an effective detection area determining unit, configured to perform human head detection on a first video frame in a video stream and determine each human head in the first video frame; determine, for any human head, whether an enlarged detection area of the human head contains a set number of human heads; and, if the set number of human heads is present, determine the enlarged detection area as an effective detection area; a to-be-checked detection area determining unit, configured to determine, for any effective detection area, N detection areas to be checked corresponding to the effective detection area from N second video frames located after the first video frame in the video stream; and an abnormality determining unit, configured to determine whether abnormal behavior exists in the effective detection area according to the trend, in the time dimension, of the number of human heads in the effective detection area and in the N detection areas to be checked.

Based on this scheme, when detecting people in a crowd, human head detection relies on fewer features than human body detection. Applying human head detection to the crowd captured by the lens in a public place therefore improves the accuracy of target detection, makes crowded scenes detectable, and avoids the technical problem that an occluded target cannot be detected. In addition, when a lens monitors a crowd in a public place, there is a perspective problem: near targets appear large and far targets appear small. For this reason, the embodiment of the application generates an enlarged detection area for each human head and takes the enlarged detection areas that meet the preset requirement as effective detection areas, so that whether abnormal behavior occurs in an effective detection area can finally be determined from the trend of the number of people in that area across the current video frame and several subsequent video frames. Whether the crowd is close to the lens or far from it, generating an enlarged detection area for each human head represents the changing state of near and far targets within their crowds with the same fidelity, so the perspective problem of near and far targets can be overcome. Moreover, by extracting the region of interest (namely, the effective detection area) as the object of study, the scheme avoids analyzing every element of the video frame, so it is suitable for real-time abnormal behavior analysis of video streams of different resolutions, and its effect and speed in extracting the region of interest are significantly better than those of an optical flow method.

In one possible implementation, the abnormality determining unit is further configured to: if it is determined that the number of human heads in the effective detection area and the N detection areas to be checked increases or decreases in the time dimension, form an effective image sequence from the effective detection area and the N detection areas to be checked; the abnormality determining unit is specifically configured to: input the effective image sequence into a 3D convolutional neural network model, and determine whether abnormal behavior exists in the effective detection area.

In one possible implementation, the effective detection area determining unit is specifically configured to: if the set number of human heads is present, determine whether there is a human head detection area that meets a first merging requirement with the enlarged detection area; and merge the human head detection area meeting the first merging requirement and the enlarged detection area into the effective detection area.

In one possible implementation, the effective detection area determination unit is further configured to: any two valid detection regions that meet the second merge requirement are merged into one valid detection region.

In one possible implementation, the apparatus further comprises an enlarged detection area determining unit; the enlarged detection area determining unit is configured to: expand the human head detection area where the human head is located by a set proportion, centered on the center of that human head detection area, to obtain the enlarged detection area of the human head.

In a third aspect, an embodiment of the present application provides a computing device, including:

a memory for storing a computer program;

a processor for calling the computer program stored in said memory and executing, according to the obtained program, the method according to any implementation of the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program for causing a computer to execute the method according to any one of the first aspect.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 illustrates a method for detecting abnormal behavior provided by an embodiment of the present application;

Fig. 2 is a schematic diagram illustrating determination of an effective detection area provided by an embodiment of the present application;

Fig. 3 shows a BasicBlock in a P3D network structure provided by an embodiment of the present application;

Fig. 4 is a schematic diagram of a temporal attention mechanism module provided by an embodiment of the present application;

Fig. 5 illustrates a device for detecting abnormal behavior provided by an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In public places such as large urban commercial districts, closely monitoring the people present and tracking how the crowd changes is very important for promptly identifying the safety hazards a crowd may conceal. Once a hazard is identified, corresponding measures can be taken immediately, so that major accidents are avoided.

Monitoring systems for crowd anomaly analysis currently on the market avoid many of the shortcomings of traditional manual monitoring, but they are usually based on human body detection. In this detection mode, accurately identifying each person in a public place is difficult, and identifying people in a crowded scene is more difficult still; moreover, this detection mode cannot solve the perspective problem that near targets appear large and far targets appear small.

In view of the above technical problem, an embodiment of the present application provides a method for detecting an abnormal behavior, as shown in fig. 1, the method includes the following steps:

step 101, performing human head detection on a first video frame in a video stream, and determining each human head in the first video frame; determining whether an enlarged detection area of the human head contains a set number of human heads for any human head; and if the human heads with the set number exist, determining the expanded detection area as an effective detection area.

In places such as large urban commercial districts, cameras are ubiquitous. Through their lenses, people in public places can be monitored in real time, producing video stream data that can be analyzed for various purposes. One such purpose is to analyze the people in a public place from the video stream data acquired in real time, so that abnormal behavior is discovered promptly, an early warning is given, and serious incidents are avoided.

In this step, consider video stream data L collected in real time by a lens (the video stream data L is generated in real time and is also used to analyze abnormal crowd behavior in real time). For example, the video frame collected by the lens at time T1 may be taken as the first video frame, and the human heads in that frame are detected, for example with a head detection model based on YOLOv5 or CenterNet. Each detected human head is marked in some form, for example with a box drawn in a set color. For any human head, an enlarged detection area of that head can then be determined, and the number of heads contained in the enlarged detection area is counted and compared with a preset head-count threshold G. If the number of heads in the enlarged detection area is greater than the threshold G, the enlarged detection area in the first video frame is taken as an effective detection area and studied in the subsequent steps; otherwise, the enlarged detection area is discarded in this step and is not studied further.

It should be noted that the threshold G can be set by those skilled in the art according to actual requirements.
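Step 101 can be sketched as follows, under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, a head counts as "contained" when its box center lies inside the enlarged area (the application does not define the containment rule), and the names `expand_box`, `count_heads`, and `effective_regions` as well as the ratio and threshold values are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def expand_box(box: Box, ratio: float) -> Box:
    """Enlarge a head box about its center by the set proportion `ratio`."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    hw, hh = (x2 - x1) * ratio / 2.0, (y2 - y1) * ratio / 2.0
    return (cx - hw, cy - hh, cx + hw, cy + hh)


def count_heads(region: Box, heads: List[Box]) -> int:
    """Count heads whose box center falls inside `region` (assumed rule)."""
    rx1, ry1, rx2, ry2 = region
    total = 0
    for x1, y1, x2, y2 in heads:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        if rx1 <= cx <= rx2 and ry1 <= cy <= ry2:
            total += 1
    return total


def effective_regions(heads: List[Box], ratio: float, g: int) -> List[Box]:
    """Keep each enlarged area that contains more than G heads."""
    kept = []
    for head in heads:
        enlarged = expand_box(head, ratio)
        if count_heads(enlarged, heads) > g:
            kept.append(enlarged)
    return kept


# Three heads close together and one isolated head; G = 2, ratio = 4.
heads = [(0, 0, 10, 10), (12, 0, 22, 10), (24, 0, 34, 10), (200, 200, 210, 210)]
regions = effective_regions(heads, ratio=4.0, g=2)
print(regions)
```

In this toy input only the enlarged area around the middle head contains three head centers, so it is the only area kept; the enlarged areas around the other heads are discarded, as the step describes.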

Step 102, for any effective detection area, determining N detection areas to be checked corresponding to the effective detection area from N second video frames located after the first video frame in the video stream.

In this step, given that the number of heads in the enlarged detection area was found in step 101 to be greater than the threshold G, that is, that the enlarged detection area in the first video frame is an effective detection area, the effective detection area is taken as the object of study and information is acquired from the N second video frames that follow the first video frame in the video stream data L. Specifically, for each of the N second video frames, the image information at the same position as the effective detection area is extracted from that second video frame; this image information is one detection area to be checked. In this way, N detection areas to be checked corresponding to the effective detection area are obtained.
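Extracting the same position from each later frame can be sketched as below. This is only an illustration: frames are modeled as nested lists of pixel values, the region uses integer pixel coordinates with exclusive upper bounds, and the names `crop` and `regions_to_check` are assumptions, not terms from the application.

```python
from typing import List, Sequence, Tuple

Frame = List[List[int]]             # toy stand-in for an image: rows of pixels
Region = Tuple[int, int, int, int]  # (x1, y1, x2, y2), x2 and y2 exclusive


def crop(frame: Frame, region: Region) -> Frame:
    """Return the pixels of `frame` that lie inside `region`."""
    x1, y1, x2, y2 = region
    return [row[x1:x2] for row in frame[y1:y2]]


def regions_to_check(second_frames: Sequence[Frame], region: Region) -> List[Frame]:
    """Crop the same fixed position from each of the N second video frames."""
    return [crop(frame, region) for frame in second_frames]


# Three toy 4x6 frames whose pixel values encode (frame, row, column).
frames = [[[t * 100 + y * 10 + x for x in range(6)] for y in range(4)]
          for t in range(1, 4)]
crops = regions_to_check(frames, region=(1, 1, 3, 3))
print(crops[0])
```

The region is fixed by the first video frame and is not re-detected in later frames, which is what makes the head counts comparable over time.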

Note that in the embodiments of the present application, N ≥ 7.

When the N second video frames located after the first video frame are extracted, they may be taken from adjacent video frames, or taken at a fixed interval of X frames (X = 1, 2, ...); this is not specifically limited in the present application.

For example, suppose the frames of the video stream data L are numbered in sequence as the 1st frame, the 2nd frame, the 3rd frame, and so on, the first video frame is the 1st frame, and N is 7. In one mode, the 7 second video frames are the 2nd, 3rd, 4th, 5th, 6th, 7th, and 8th frames; in another mode, the 7 second video frames are the 3rd, 5th, 7th, 9th, 11th, 13th, and 15th frames (i.e., the fixed value X is 1 in this case).
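The two sampling modes in this example can be reproduced with a short sketch. The function name `second_frame_indices` is hypothetical, and representing the adjacent-frame mode as an interval of 0 is a convention chosen here, not one stated in the application.

```python
from typing import List


def second_frame_indices(first: int, n: int, interval: int = 0) -> List[int]:
    """Frame numbers of the N second video frames after frame `first`.

    `interval` is the number of frames skipped between consecutive picks:
    0 takes adjacent frames, 1 takes every other frame, and so on.
    """
    step = interval + 1
    return [first + step * k for k in range(1, n + 1)]


print(second_frame_indices(first=1, n=7))              # adjacent frames
print(second_frame_indices(first=1, n=7, interval=1))  # X = 1
```

With `first=1` and `n=7`, the first call yields frames 2 through 8 and the second yields frames 3, 5, 7, 9, 11, 13, and 15, matching the two modes described above.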

Step 103, determining whether the effective detection area has abnormal behaviors or not according to the variation trends of the number of the human heads respectively included in the effective detection area and the N detection areas to be checked in the time dimension.

In this step, the first video frame and the N second video frames are arranged in time order. The number of heads in the effective detection area in the first video frame is obtained and denoted $Y$, and the numbers of heads in the N detection areas to be checked in the N second video frames are obtained and denoted, in time order, $Z_1, Z_2, Z_3, \ldots, Z_N$ ($N$ values in total). Finally, the $N+1$ values $Y, Z_1, Z_2, Z_3, \ldots, Z_N$ are analyzed, that is, the change in the number of people in the effective detection area over this period is analyzed, and a conclusion is given as to whether abnormal crowd behavior exists in the effective detection area.

Some of the above steps will be described in detail with reference to examples.

In one implementation of step 103, before determining whether there is an abnormal behavior in the valid detection region, the method further includes: if the number of the heads of the effective detection area and the N detection areas to be checked is determined to be increased or decreased in the time dimension, forming effective image sequences of the effective detection area and the N detection areas to be checked; the determining whether the valid detection zone has abnormal behavior comprises: and inputting the effective image sequence into a 3D convolutional neural network model, and determining whether abnormal behaviors exist in the effective detection area.

For example, in public places, crowd anomalies often manifest as a sudden gathering or dispersal of people within a short time. Therefore, once the $N+1$ values $Y, Z_1, Z_2, Z_3, \ldots, Z_N$ have been obtained in step 103, whether abnormal behavior exists in the effective detection area can, on the one hand, be judged directly from the trend of this group of values: if the values increase, a crowd gathering event is very likely occurring in the effective detection area; if the values decrease, a crowd dispersal event is very likely occurring in the effective detection area. On the other hand, in order to identify more accurately the events judged abnormal from the trend of the values alone, the embodiment of the application may form a group of effective image sequences from the effective detection area and the N detection areas to be checked that correspond to a group of values showing either of the two trends. This group of effective image sequences can then be input into the 3D convolutional neural network model for classification, which can greatly improve recognition of crowd anomalies and avoid misjudgments that may arise in the preceding crowd anomaly recognition step.
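The trend test on the head counts can be sketched as follows. The application only says the values "increase" or "decrease"; treating this as strictly monotonic over all N + 1 values is an assumption made here, and the function name and return labels are hypothetical.

```python
from typing import Optional, Sequence


def head_count_trend(counts: Sequence[int]) -> Optional[str]:
    """Classify the sequence Y, Z_1, ..., Z_N of head counts.

    Returns "gathering" if every value exceeds the previous one,
    "dispersing" if every value is below the previous one, else None.
    Strict monotonicity is an assumed reading of "increase"/"decrease".
    """
    pairs = list(zip(counts, counts[1:]))
    if not pairs:
        return None
    if all(later > earlier for earlier, later in pairs):
        return "gathering"
    if all(later < earlier for earlier, later in pairs):
        return "dispersing"
    return None


print(head_count_trend([3, 4, 5, 7, 8, 9, 11, 12]))
print(head_count_trend([12, 11, 9, 8, 7, 5, 4, 3]))
print(head_count_trend([5, 6, 5, 6, 5, 6, 5, 6]))
```

Only sequences for which this test returns a trend would be assembled into an effective image sequence and passed to the 3D convolutional neural network model; a noisier real deployment might prefer a tolerance-based test, which the application leaves open.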

In one implementation of step 101, the determining that the enlarged detection area is an effective detection area if there are a set number of human heads includes: if the human heads with the set number exist, determining whether a human head detection area meeting a first combination requirement with the expanded detection area exists; and merging the human head detection area meeting the first merging requirement and the enlarged detection area into the effective detection area.

In certain implementations of the present application, the first merge requirement is the presence of an intersection with the enlarged detection zone.

Next, refer to fig. 2, a schematic diagram of determining an effective detection area provided by an embodiment of the present application. Fig. 2 includes a rectangular frame 1 drawn with a thicker line and a rectangular frame 2 drawn with a thinner line, each of which is an enlarged detection area. For rectangular frame 2, if the number of human heads inside it is greater than the threshold G, the human head detection areas that intersect it (i.e., that meet the first merging requirement) can be merged with it to form an effective detection area in another form; in fig. 2 the resulting effective detection area is an irregular polygon. As shown in fig. 2, 3 small rectangular frames intersect rectangular frame 2, one on its lower side and two on its right side; each is a human head detection area identified by the human head detection method.
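The first merging requirement can be sketched as below. As assumptions for illustration, boxes are `(x1, y1, x2, y2)` tuples, boxes that merely touch along an edge are not treated as intersecting, and the merged effective detection area is represented as a list of rectangles whose union is the irregular polygon; the names `intersects` and `merge_intersecting_heads` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def intersects(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_intersecting_heads(enlarged: Box, head_boxes: List[Box]) -> List[Box]:
    """Enlarged area plus every head box that intersects it.

    The returned rectangles together describe the merged effective
    detection area (an irregular polygon when heads straddle the border).
    """
    return [enlarged] + [h for h in head_boxes if intersects(enlarged, h)]


enlarged = (0, 0, 100, 100)
head_boxes = [
    (90, 40, 110, 60),     # straddles the right edge: merged
    (40, 95, 60, 115),     # straddles the lower edge: merged
    (150, 150, 170, 170),  # far away: not merged
]
effective_area = merge_intersecting_heads(enlarged, head_boxes)
print(effective_area)
```

Head boxes lying entirely inside the enlarged area also satisfy the intersection test; including them does not change the outline of the union.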

Optionally, after the human head detection area that meets the first merging requirement is merged with the enlarged detection area to form the effective detection area, the method further includes: merging any two effective detection areas that meet a second merging requirement into one effective detection area.

In some implementations of the present application, the second merging requirement is that the intersection-over-union (IoU) of any two effective detection areas satisfies a set condition.

Continuing the foregoing example and still referring to FIG. 2, the rectangular frame 2 and the 3 small rectangular frames intersecting it together form an irregular polygon, denoted irregular polygon 1. As an example, if the number of human heads in the rectangular frame 1 is also greater than the threshold G, the human head detection areas intersecting the rectangular frame 1 can likewise be merged with it to form an effective detection area in another form; as shown in FIG. 2, this effective detection area is also an irregular polygon, denoted irregular polygon 2. In FIG. 2, 2 small rectangular frames intersect the rectangular frame 1, one on its left side and one on its upper side; each is a human head detection area identified by the human head detection method.

In FIG. 2, irregular polygon 1 and irregular polygon 2 intersect, and their intersection is shown hatched. The intersection-over-union (IoU, i.e., the area of the intersection divided by the area of the union) of the two irregular polygons is calculated and compared with a preset threshold M. If the calculated IoU is greater than the threshold M, the two irregular polygons are merged. The minimum coordinate value (point P in FIG. 2) and the maximum coordinate value (point Q in FIG. 2) of the irregular polygon generated by the merging are then selected, and the maximum circumscribed rectangle of that polygon is obtained from the coordinates of points P and Q. The rectangular frame drawn with a dotted line in FIG. 2 is this maximum circumscribed rectangle, and it is therefore an effective detection area in yet another form.

It is noted that the threshold M can be set by those skilled in the art according to actual requirements.
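The second merging step can be sketched in the same spirit. The sketch below assumes that each irregular polygon is stored as a list of axis-aligned rectangles with integer pixel coordinates and computes areas by counting covered pixels; the names `region_iou` and `merge_if_overlapping` are hypothetical, and a production system would more likely use a polygon or mask library.

```python
from typing import List, Optional, Set, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), integer pixels
Region = List[Box]               # an irregular polygon as a union of boxes


def covered_cells(region: Region) -> Set[Tuple[int, int]]:
    """Set of unit pixels covered by at least one box of the region."""
    cells = set()
    for x0, y0, x1, y1 in region:
        for x in range(x0, x1):
            for y in range(y0, y1):
                cells.add((x, y))
    return cells


def region_iou(a: Region, b: Region) -> float:
    """Intersection area divided by union area of two regions."""
    ca, cb = covered_cells(a), covered_cells(b)
    union = len(ca | cb)
    return len(ca & cb) / union if union else 0.0


def merge_if_overlapping(a: Region, b: Region, m: float) -> Optional[Box]:
    """Merge two regions when their IoU exceeds the threshold m.

    Returns the rectangle spanned by the minimum corner P and the
    maximum corner Q of the merged region, or None if no merge occurs.
    """
    if region_iou(a, b) <= m:
        return None
    boxes = a + b
    p = (min(bx[0] for bx in boxes), min(bx[1] for bx in boxes))
    q = (max(bx[2] for bx in boxes), max(bx[3] for bx in boxes))
    return (p[0], p[1], q[0], q[1])
```

For two 4×4 squares offset horizontally by 2 pixels, the intersection covers 8 pixels and the union 24, so the IoU is 1/3; with M = 0.3 the squares are merged into the rectangle (0, 0, 6, 4), while with M = 0.5 they are left separate.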

In one implementation of step 101, the enlarged detection area of the human head is determined as follows: the human head detection area in which the human head is located is expanded by a set proportion, with the center of that human head detection area as the center of expansion, to obtain the enlarged detection area of the human head.

For example, when a crowd in a public place is monitored through a camera lens, there is a perspective problem in which near objects appear large and far objects appear small: a person close to the lens looks larger and a person far from the lens looks smaller, and when a crowd far from the lens is very dense, it is quite possible that individual persons cannot be identified. This clearly poses a challenge for identifying crowd abnormal events both near the lens and far from the lens, namely, what method should be followed so that crowd abnormal events at near and far positions under the same lens can be identified with equal accuracy.

Therefore, the embodiment of the present application expands the human head detection areas of heads at both near and far positions by the same set proportion, with each expansion centered on the center of the human head detection area in which the head is located. In this way, the crowd near the lens and the crowd far from the lens under the same lens are treated with the same influence during identification; that is, the method can overcome the perspective problem of targets at near and far positions.

For example, when the human head detection area is represented by a regular rectangular frame, the width and height of the rectangular frame may both be multiplied by the same coefficient S (S greater than 1) while the center of the rectangular frame is kept fixed, which yields an enlarged rectangular frame, that is, the enlarged detection area of the human head. Thus, for a human head detection area closer to the lens, the enlarged detection area generated by this operation may contain 3 people, while for a human head detection area farther from the lens, the enlarged detection area generated by the same operation may contain 7 people.

It is noted that the coefficient S can be set by those skilled in the art according to actual requirements.
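The expansion by the coefficient S can be written down directly. The following is a minimal sketch; the function name `expand_box` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions, and clipping to the frame boundary, which the text does not discuss, is omitted.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def expand_box(box: Box, s: float) -> Box:
    """Scale width and height by s (s > 1) about the box center."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w = (x1 - x0) * s / 2
    half_h = (y1 - y0) * s / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)
```

Because the enlargement is proportional to the size of the head box, a small head far from the lens receives a correspondingly small enlarged area and a large head near the lens a correspondingly large one, which is how the same coefficient S serves both near and far positions.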

It should be noted that the figures in the above example, 3 people in the enlarged detection area near the lens and 7 people in the enlarged detection area far from the lens, assume that the crowd is relatively dense both near and far from the lens. The example mainly illustrates that, with the scheme of the present application, individual persons in the crowd far from the lens can also be accurately identified in the current scene.

In some implementations of the present application, the 3D convolutional neural network model includes parallel spatial and temporal convolutional layers, a first attention mechanism layer located after the spatial convolutional layer and a second attention mechanism layer located after the temporal convolutional layer.

For example, in the embodiment of the present application, P3D may be used as the backbone network structure, and the input of the P3D network is a group of image sequences (i.e., the effective image sequence described above). FIG. 3 shows a BasicBlock of the P3D network structure, in which the addition operator denotes point-by-point addition of feature maps and the attention block represents a temporal attention mechanism module; a temporal attention mechanism module is introduced after the convolution operation in the time dimension and after the convolution operation in the space dimension. The temporal attention mechanism focuses on the relationship between the temporal order across multiple frames of images and the spatial position of the target, and achieves good results in the classification of image sequences.

The temporal attention mechanism module is shown in FIG. 4, in which the multiplication operator denotes point-by-point multiplication of feature maps, T represents the number of time-dimension feature maps or the number of space-dimension feature maps, H represents the height of the feature maps, W represents the width of the feature maps, and C represents the number of channels. First, the dimension is reduced by a convolution calculation to reduce the amount of computation, yielding a feature map of T×H×W×C/2, which is stretched into vector form to obtain A, B and D. The A and B vector matrices are multiplied and the product is normalized by a softmax function; the normalized result is multiplied by D to obtain a one-dimensional vector E, which is reshaped into a feature map of T×H×W×C/2. A convolution operation is then performed on this feature map, and the result is added to the input feature map to obtain the output feature map of T×H×W×C.
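The core of the attention computation, normalizing the product of A and B with softmax and then multiplying by D, can be sketched in plain Python. This is only an illustration under stated assumptions: A, B and D are taken to be matrices with one row per (t, h, w) position and C/2 columns, as in a non-local-style attention block, which is one reading of the description; the convolutions and the final addition to the input feature map are left out, and the names `softmax` and `attention` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows: T*H*W positions, columns: C/2 channels


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(a: Matrix, b: Matrix, d: Matrix) -> Matrix:
    """Compute E = softmax(A @ B^T) @ D (assumed layout, see lead-in)."""
    # Pairwise similarity between every position in A and every position in B.
    weights = [
        softmax([sum(x * y for x, y in zip(row_a, row_b)) for row_b in b])
        for row_a in a
    ]
    channels = len(d[0])
    # Each output row is a weighted sum of the rows of D.
    return [
        [sum(w * row_d[c] for w, row_d in zip(weight_row, d)) for c in range(channels)]
        for weight_row in weights
    ]
```

In a full module, E would then be reshaped to T×H×W×C/2, passed through the convolution described above, and added to the input feature map.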

Finally, the obtained feature map passes through a pooling layer, is stretched into a column vector in the order of the effective image sequence, and is classified by a fully connected layer to obtain a classification result, namely the abnormal behavior event, which is either a crowd gathering event or a crowd dispersal event. If abnormal behavior occurs, an alarm is raised and a notification is issued.

Based on the same concept, the embodiment of the present application further provides a device for detecting abnormal behavior, as shown in fig. 5, the device includes:

an effective detection region determining unit 501, configured to perform human head detection on a first video frame in a video stream, and determine each human head in the first video frame; determining whether an enlarged detection area of the human head contains a set number of human heads for any human head; and if the human heads with the set number exist, determining the expanded detection area as an effective detection area.

A to-be-checked detection area determining unit 502, configured to determine, for any effective detection area, N detection areas to be checked corresponding to the effective detection area from N second video frames located after the first video frame in the video stream.

An anomaly determination unit 503, configured to determine whether there is an abnormal behavior in the effective detection area according to a trend of change in the number of human heads included in the effective detection area and the N detection areas to be checked in the time dimension.

Further, for the apparatus, the anomaly determination unit 503 is further configured to: if it is determined that the number of human heads in the effective detection area and in the N detection areas to be checked increases or decreases in the time dimension, form an effective image sequence from the effective detection area and the N detection areas to be checked. The anomaly determination unit 503 is specifically configured to: input the effective image sequence into a 3D convolutional neural network model, and determine whether abnormal behavior exists in the effective detection area.

Further, for the apparatus, the effective detection area determination unit 501 is specifically configured to: if the set number of human heads exist, determine whether there is a human head detection area that meets a first merging requirement with respect to the enlarged detection area; and merge the human head detection area that meets the first merging requirement with the enlarged detection area to form the effective detection area.

Further, for the apparatus, the effective detection area determination unit 501 is further configured to: merge any two effective detection areas that meet a second merging requirement into one effective detection area.

Further, for the apparatus, the first merging requirement is that an intersection exists with the enlarged detection area; the second merging requirement is that the intersection-over-union (IoU) of any two effective detection areas satisfies a set condition.

Further, the apparatus also includes an enlarged detection area determination unit 504, configured to: expand the human head detection area in which the human head is located by a set proportion, with the center of that human head detection area as the center of expansion, to obtain the enlarged detection area of the human head.

Further to the apparatus, the 3D convolutional neural network model includes a spatial convolutional layer and a temporal convolutional layer in parallel, a first attention mechanism layer located after the spatial convolutional layer, and a second attention mechanism layer located after the temporal convolutional layer.

The embodiment of the present application provides a computing device, which may specifically be a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), and the like. The computing device may include a Central Processing Unit (CPU), memory, input/output devices, etc., the input devices may include a keyboard, mouse, touch screen, etc., and the output devices may include a Display device, such as a Liquid Crystal Display (LCD), a Cathode Ray Tube (CRT), etc.

The memory may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides the processor with the program instructions and data stored in the memory. In an embodiment of the application, the memory may be configured to store program instructions of the method for detecting abnormal behavior.

The processor is configured to call the program instructions stored in the memory and to execute the method for detecting abnormal behavior according to the obtained program.

The embodiment of the application provides a computer-readable storage medium, which stores computer-executable instructions, wherein the computer-executable instructions are used for enabling a computer to execute a method for detecting abnormal behaviors.

It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method for detecting abnormal behavior, comprising:

carrying out human head detection on a first video frame in a video stream, and determining each human head in the first video frame; determining whether an enlarged detection area of the human head contains a set number of human heads for any human head; if the human heads with the set number exist, determining the expanded detection area as an effective detection area;

for any effective detection area, respectively determining N detection areas to be checked corresponding to the effective detection area from N second video frames positioned after the first video frame in the video stream;

and determining whether the effective detection area has abnormal behaviors or not according to the variation trends of the number of the heads of the effective detection area and the N detection areas to be checked in the time dimension.

2. The method of claim 1,

before the determining whether the abnormal behavior exists in the effective detection area, the method further includes:

if the number of the heads of the effective detection area and the N detection areas to be checked is determined to be increased or decreased in the time dimension, forming effective image sequences of the effective detection area and the N detection areas to be checked;

the determining whether the valid detection zone has abnormal behavior comprises:

and inputting the effective image sequence into a 3D convolutional neural network model, and determining whether abnormal behaviors exist in the effective detection area.

3. The method of claim 1,

the determining that the enlarged detection area is an effective detection area if the set number of human heads exist comprises:

if the set number of human heads exist, determining whether there is a human head detection area that meets a first merging requirement with respect to the enlarged detection area;

and merging the human head detection area that meets the first merging requirement with the enlarged detection area to form the effective detection area.

4. The method of claim 3,

after the human head detection area that meets the first merging requirement is merged with the enlarged detection area to form the effective detection area, the method further comprises:

merging any two effective detection areas that meet a second merging requirement into one effective detection area.

5. The method of claim 4,

the first merging requirement is that an intersection exists with the enlarged detection area;

the second merging requirement is that the intersection-over-union (IoU) of any two effective detection areas satisfies a set condition.

6. The method of claim 1,

the enlarged detection area of the human head is determined as follows:

expanding the human head detection area in which the human head is located by a set proportion, with the center of that human head detection area as the center of expansion, to obtain the enlarged detection area of the human head.

7. The method of any one of claims 2 to 6,

the 3D convolutional neural network model comprises a spatial convolutional layer and a temporal convolutional layer in parallel, a first attention mechanism layer located after the spatial convolutional layer, and a second attention mechanism layer located after the temporal convolutional layer.

8. An abnormal behavior detection device, comprising:

an effective detection area determining unit, configured to perform human head detection on a first video frame in a video stream and determine each human head in the first video frame; determine, for any human head, whether an enlarged detection area of the human head contains a set number of human heads; and if the set number of human heads exist, determine the enlarged detection area as an effective detection area;

a to-be-checked detection area determining unit, configured to determine, for any effective detection area, N detection areas to be checked corresponding to the effective detection area from N second video frames located after the first video frame in the video stream, respectively;

and an anomaly determining unit, configured to determine whether abnormal behavior exists in the effective detection area according to the variation trend, in the time dimension, of the number of human heads respectively included in the effective detection area and the N detection areas to be checked.

9. A computer device, comprising:

a memory for storing a computer program;

a processor for calling a computer program stored in said memory, for executing the method according to any one of claims 1-7 in accordance with the obtained program.

10. A computer-readable storage medium, characterized in that the storage medium stores a program which, when run on a computer, causes the computer to carry out the method according to any one of claims 1 to 7.