
Behavior recognition method and device, equipment and storage medium

Info

Publication number
CN113920585A
Authority
CN
China
Prior art keywords
video sequence
frame
image
identified
determining
Prior art date
Legal status
Pending
Application number
CN202111234621.2A
Other languages
Chinese (zh)
Inventor
苏海昇
Current Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202111234621.2A
Publication of CN113920585A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks


Abstract

The embodiment of the application discloses a behavior identification method, which comprises the following steps: acquiring a detection frame of each object in each frame of image in the video sequence to be identified; determining a first identification result of the video sequence to be identified based on the distribution condition of the detection frames in each frame of image; under the condition that the first recognition result meets a first condition, performing behavior recognition on the track sequence of the group object to obtain a fourth recognition result; wherein the track sequence of the group object is generated based on the video sequence to be identified; the population objects include at least two target objects having a spatial distance less than a distance threshold; and determining a target recognition result of the video sequence to be recognized at least based on the first recognition result or the fourth recognition result. The embodiment of the application also provides a behavior recognition device, equipment and a storage medium.

Description

Behavior recognition method and device, equipment and storage medium
Technical Field
The present application relates to the field of computer vision, and relates to, but is not limited to, behavior recognition methods and apparatus, devices, and storage media.
Background
Behavior recognition in video sources is an important application in the field of computer vision, and it is extremely difficult to recognize abnormal events in video sources. Possible challenges include scarcity of annotation data due to small probability events, large inter/intra-class variance, differences in subjective definition of anomalous events, low resolution of video capture devices, etc.
The problems that exist in practice mainly include: the number of action execution subjects of an event is variable (according to the number of action execution subjects, events can generally be divided into single-object behaviors and group-object behaviors); different types of behavior events require different spatial receptive fields for identification (behavior subjects at different positions and with different fields of view may exist in an identification scene, and the location where a behavior event actually needs to be detected usually occupies only a small area of the picture); and behavior subjects that do not produce a preset behavior may overlap in spatial position with behavior subjects that do. All of the above problems interfere with the behavior recognition result.
Disclosure of Invention
The embodiment of the application provides a behavior identification method, a behavior identification device, equipment and a storage medium.
The technical scheme of the embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a behavior identification method, where the method includes:
acquiring a detection frame of each object in each frame of image in the video sequence to be identified;
determining a first identification result of the video sequence to be identified based on the distribution condition of the detection frames in each frame of image;
under the condition that the first recognition result meets a first condition, performing behavior recognition on the track sequence of the group object to obtain a fourth recognition result; wherein the track sequence of the group object is generated based on the video sequence to be identified; the population objects include at least two target objects having a spatial distance less than a distance threshold;
and determining a target recognition result of the video sequence to be recognized at least based on the first recognition result or the fourth recognition result.
In some possible embodiments, the determining the target recognition result of the video sequence to be recognized based on at least the first recognition result or the fourth recognition result includes: under the condition that the first identification result represents that abnormal behaviors do not occur in the video sequence to be identified, determining a target identification result of the video sequence to be identified based on the first identification result; or, when the first recognition result represents that abnormal behaviors occur in the video sequence to be recognized, determining a target recognition result of the video sequence to be recognized based on the fourth recognition result.
Therefore, when the first identification result obtained by logic pre-judgment represents that the video sequence to be identified has no abnormal behavior, the first identification result is directly used as the final target identification result, so that a video sequence that obviously contains no abnormal behavior can be settled directly; meanwhile, when logic judgment indicates that abnormal behavior may exist in the video sequence to be identified, the neural network is further used for identification, and the fourth identification result is used as the target identification result, thereby improving the robustness of behavior identification.
In some possible embodiments, the determining, based on the distribution of the detection boxes in each frame of image, a first identification result of the video sequence to be identified includes: determining a dense area in each frame of image based on the distribution of the detection frames in each frame of image; the dense area comprises a central detection frame whose number of overlaps with other detection frames in the frame is greater than or equal to a first time threshold; and determining a first identification result of the video sequence to be identified based on the number of objects included in the dense area in each frame of image.
Therefore, the inference of the local dense region is guided by a heuristic method of detection frame density distribution and dense group position estimation, which can enlarge the effective perception region of the model and reduce the retrieval range of irrelevant backgrounds. At the same time, combining logic pre-judgment alleviates the insufficient robustness and accuracy of pure neural network identification.
In some possible embodiments, in a case that the first recognition result represents that an abnormal behavior occurs in the video sequence to be recognized, performing behavior recognition on a track sequence of a group object to obtain the fourth recognition result includes: under the condition that the first identification result represents that abnormal behaviors occur in the video sequence to be identified, determining the area change value of the dense region between adjacent frames in the video sequence to be identified; determining a second identification result of the video sequence to be identified based on the area change value and the change threshold of the dense region; and under the condition that the second recognition result represents that the video sequence to be recognized does not have abnormal behaviors, performing behavior recognition on the track sequence of the group object to obtain a fourth recognition result.
Therefore, aiming at the condition that the first identification result represents abnormal behaviors in the video sequence to be identified, the second identification result is further determined based on the area change value of the dense region between the adjacent frames, and the robustness and the accuracy of the neural network identification scheme are improved by combining a two-stage heuristic logic method.
In some possible embodiments, when the second recognition result represents that the video sequence to be recognized has not undergone an abnormal behavior, performing behavior recognition on a track sequence of a group object to obtain the fourth recognition result, including: under the condition that the second identification result represents that the video sequence to be identified has no abnormal behavior, determining a surrounding frame area in each frame of image in the video sequence to be identified based on the coverage area of the dense area in each frame of image in the video sequence to be identified; the positions of the surrounding frame areas in each frame of image are consistent, the sizes of the surrounding frame areas are the same, and the surrounding frame areas surround the dense area; determining a third identification result of the video sequence to be identified based on the pixel change condition of the surrounding frame region between adjacent frames in the video sequence to be identified; and under the condition that the third recognition result represents that the video sequence to be recognized does not have abnormal behaviors, performing behavior recognition on the track sequence of the group object to obtain a fourth recognition result.
Therefore, after heuristic logic judgment is carried out on the video sequence to be recognized through area change of the dense area and pixel change of the bounding box area after the union set is taken, behavior recognition is carried out by means of the behavior recognition model, and robustness and accuracy of the neural network recognition scheme can be improved.
In some possible embodiments, the method further comprises: determining that the video sequence to be recognized has abnormal behaviors under the condition that the second recognition result represents that the video sequence to be recognized has abnormal behaviors; or determining that the video sequence to be recognized has abnormal behaviors when the third recognition result represents that the video sequence to be recognized has abnormal behaviors.
Therefore, under the condition that the second recognition result or the third recognition result judged by the logic rule represents that the video sequence to be recognized has abnormal behaviors, the second recognition result or the third recognition result is directly used as the target recognition result of the video sequence to be recognized, so that the recognition flow is shortened and the behavior recognition efficiency is improved for the video sequence to be recognized which obviously has the abnormal behaviors.
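To make the cascade of the four recognition results concrete, the following is a minimal Python sketch of the decision flow described in the embodiments above. The function names and return labels are illustrative placeholders, not part of the claimed method; the four checks are supplied by the caller.

```python
from typing import Any, Callable, Sequence

Frame = Any  # one video frame, e.g. an H x W x C array

def recognize(frames: Sequence[Frame],
              passes_count_check: Callable[[Sequence[Frame]], bool],
              area_change_exceeds: Callable[[Sequence[Frame]], bool],
              pixel_change_exceeds: Callable[[Sequence[Frame]], bool],
              model_predict: Callable[[Sequence[Frame]], str]) -> str:
    """Heuristic pre-checks first, the neural model only as a last stage."""
    if not passes_count_check(frames):       # first result: final when negative
        return "no abnormal behavior"
    if area_change_exceeds(frames):          # second result: final when positive
        return "abnormal behavior"
    if pixel_change_exceeds(frames):         # third result: final when positive
        return "abnormal behavior"
    return model_predict(frames)             # fourth result decides the rest
```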
In some possible embodiments, the determining a third recognition result of the video sequence to be recognized based on the pixel variation of the bounding box region between adjacent frames in the video sequence to be recognized includes: determining a third identification result of the video sequence to be identified as abnormal behavior under the condition that the pixel difference value of the surrounding frame region between adjacent frames in the video sequence to be identified is greater than or equal to a pixel threshold value; and under the condition that the pixel difference value of the surrounding frame region between the adjacent frames in the video sequence to be identified is smaller than the pixel threshold value, determining that the third identification result of the video sequence to be identified is that no abnormal behavior occurs.
Therefore, whether the video sequence to be recognized has abnormal behaviors or not is judged by comparing the pixel change degree of the surrounding frame area between the adjacent frames in the video sequence to be recognized, and a third recognition result is obtained. Therefore, the video sequence to be recognized with obviously abnormal behaviors is quickly determined through logic rules, the recognition process is shortened, and the accuracy of the recognition method is enhanced.
In some possible embodiments, the determining, based on the area variation value and the variation threshold of the dense region, a second recognition result of the video sequence to be recognized includes: determining a second identification result of the video sequence to be identified as abnormal behavior under the condition that the area change values of the dense regions between the adjacent frames are greater than or equal to the change threshold; and under the condition that the area change value of the dense region between the adjacent frames is smaller than the change threshold value, determining that the second identification result of the video sequence to be identified is not abnormal.
Therefore, whether the video sequence to be recognized has abnormal behaviors or not is judged by comparing the area change degree of the dense region between the adjacent frames in the video sequence to be recognized, and a second recognition result is obtained. Therefore, the video sequence to be recognized, in which abnormal behaviors obviously occur, can be quickly determined according to the severe motion condition, the recognition process is shortened, and the accuracy of the recognition method is enhanced.
In some possible embodiments, the determining, based on the number of objects included in the dense region in each frame of image, a first recognition result of the video sequence to be recognized includes: under the condition that the number of objects included in the dense area in each frame of image is more than or equal to 2, determining that a first identification result of the video sequence to be identified is abnormal behavior; and determining that the first identification result of the video sequence to be identified has no abnormal behavior under the condition that the number of objects included in the dense area in each frame of image is less than 2.
In this way, whether the number of objects included in the dense area in each frame of image in the video sequence to be recognized is greater than or equal to 2 is determined, whether the video sequence to be recognized has abnormal behavior is determined, and a first recognition result is obtained. Therefore, the video sequence to be recognized, which obviously does not have abnormal behaviors of group objects, is quickly determined aiming at the condition that the number of the included objects does not meet the requirement, the recognition process is shortened, and the accuracy of the recognition method is enhanced.
In some possible embodiments, the determining the dense region in each frame of image based on the distribution of the detection boxes in each frame of image includes: generating an adjacency matrix corresponding to each frame of image according to the intersection-over-union (IoU) between every two detection frames in each frame of image; taking a detection frame whose matching count in the adjacency matrix is greater than or equal to a second time threshold as the central detection frame in each frame of image; wherein the second time threshold is greater than the first time threshold; and expanding the central detection frame in each frame of image outwards according to a specific proportion to obtain the dense area in each frame of image.
In this way, the distribution of the detection frames in each frame of image is used to determine the central detection frame, and then the central detection frame predicted to be prone to group object behaviors is expanded outwards, so that the expanded dense region includes all objects which are centered on the central detection frame and surround the group object behaviors as much as possible. Therefore, the behavior receptive field of the action execution main body is reduced, and the problem of small effective perception range when the video sequence to be recognized in the full visual field is recognized can be solved.
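As an illustration of this embodiment, the following Python sketch builds the IoU adjacency matrix, picks the detection frame with the most overlaps as the central detection frame, and expands it by a fixed proportion. The match threshold and expansion ratio are assumed values for demonstration; the patent text leaves the concrete thresholds open.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def dense_region(boxes, match_threshold=2, expand_ratio=0.5):
    """Expand the box that overlaps the most others into a dense region."""
    n = len(boxes)
    if n == 0:
        return None
    adjacency = np.zeros((n, n), dtype=bool)   # adjacency matrix from pairwise IoU
    for i in range(n):
        for j in range(i + 1, n):
            adjacency[i, j] = adjacency[j, i] = iou(boxes[i], boxes[j]) > 0
    matches = adjacency.sum(axis=1)
    center = int(matches.argmax())             # candidate central detection frame
    if matches[center] < match_threshold:      # no sufficiently dense cluster
        return None
    x1, y1, x2, y2 = boxes[center]
    dx, dy = (x2 - x1) * expand_ratio, (y2 - y1) * expand_ratio
    return (x1 - dx, y1 - dy, x2 + dx, y2 + dy)
```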
In some possible embodiments, the method further comprises: and generating a track sequence of the group objects based on the dense areas in each frame of image.
In this way, the dense region in each frame of image is processed to obtain a track sequence of the group object, so as to perform behavior recognition of the group object through the behavior recognition model. Meanwhile, the track sequence of the group object reduces the behavior receptive field of the action execution main body, and the problem of small effective perception range when the video sequence to be recognized in the whole visual field is recognized can be solved.
In some possible embodiments, the generating the trajectory sequence of the group object based on the dense region in each frame of image includes: determining a surrounding frame area in each frame of image based on the coverage area of the dense area in each frame of image in the video sequence to be identified; the positions of the surrounding frame areas in each frame of image are consistent, the sizes of the surrounding frame areas are the same, and the surrounding frame areas surround the dense area; generating a trajectory sequence of the group object based on the timestamp of each frame of image and the bounding box region in each frame of image.
In this way, by determining the bounding box area in each frame of image and further generating the track sequence of the group object, the receptive field of the group object behavior event can be reduced, and the accuracy and efficiency of behavior identification are further improved. Meanwhile, the relative position loss of the group object behaviors is avoided, and the performance improvement is good for the detection of the behaviors which are similar in space but different in movement rhythm.
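A minimal sketch of this step, assuming frames are numpy arrays indexed as frame[y, x]: the per-frame dense regions are merged into one spatial union, and the same bounding-box region is cropped from every frame together with its timestamp.

```python
def union_box(regions):
    """Smallest box enclosing every per-frame dense region (x1, y1, x2, y2)."""
    xs1, ys1, xs2, ys2 = zip(*regions)
    return min(xs1), min(ys1), max(xs2), max(ys2)

def trajectory_sequence(frames, timestamps, dense_regions):
    """Cut the same bounding-box region out of every frame, keeping timestamps."""
    x1, y1, x2, y2 = union_box(dense_regions)
    x1, y1 = max(0, int(x1)), max(0, int(y1))   # clamp to the image
    x2, y2 = int(x2), int(y2)
    return [(t, f[y1:y2, x1:x2]) for t, f in zip(timestamps, frames)]
```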
In a second aspect, an embodiment of the present application provides an apparatus for identifying a behavior, including:
the acquisition module is used for acquiring a detection frame of each object in each frame of image in the video sequence to be identified;
a first determining module, configured to determine a first identification result of the video sequence to be identified based on a distribution of the detection boxes in each frame of image;
the recognition module is used for performing behavior recognition on the track sequence of the group object under the condition that the first recognition result meets a first condition to obtain a fourth recognition result; wherein the track sequence of the group object is generated based on the video sequence to be identified; the population objects include at least two target objects having a spatial distance less than a distance threshold;
a second determining module, configured to determine a target recognition result of the video sequence to be recognized based on at least the first recognition result or the fourth recognition result.
In a third aspect, an embodiment of the present application provides an apparatus, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the steps in the behavior recognition method when executing the program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the behavior recognition method.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
in the embodiment of the application, firstly, a detection frame of each object in each frame of image is obtained, then, a first identification result of a video sequence to be identified is determined based on the distribution condition of the detection frame in each frame of image, and then, under the condition that the first identification result meets a first condition, behavior identification is carried out on a track sequence of a group object generated based on the video sequence to be identified, so that a fourth identification result is obtained; and finally, determining a target recognition result of the video sequence to be recognized at least based on the first recognition result or the fourth recognition result. The embodiment of the application provides a behavior identification method based on heuristic logic and deep learning fusion, which can support the abnormal behavior detection requirement of group objects and improve the robustness and accuracy of a pure neural network identification scheme. Meanwhile, the method is different from a behavior recognition method based on a full-image sequence, the video classification problem is converted into the recognition problem of the group object track sequence, the perception capability of the model on the target area in the video source is effectively improved, and the retrieval range and the calculation amount are greatly reduced.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without inventive efforts, wherein:
fig. 1A is a schematic diagram of a network architecture of a behavior recognition method according to an embodiment of the present application;
fig. 1B is a schematic flowchart of a behavior recognition method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a behavior recognition method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a behavior recognition method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a behavior recognition method according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of generating a trajectory sequence of group objects according to an embodiment of the present disclosure;
fig. 6 is a logic flow diagram of a multi-person behavior recognition method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a trajectory sequence of multiple persons according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram illustrating a component of an apparatus for behavior recognition according to an embodiment of the present disclosure;
fig. 9 is a hardware entity diagram of an apparatus for behavior recognition according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The following examples are intended to illustrate the present application but are not intended to limit the scope of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
It should be noted that the terms "first/second/third" referred to in the embodiments of the present application are only used for distinguishing similar objects and do not represent a specific ordering of the objects. It should be understood that "first/second/third" may be interchanged where permitted, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated or described herein.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which embodiments of the present application belong. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Behavior detection in video is an important problem in the field of computer vision, and has wide application in the field of smart cities, such as detection of illegal behaviors, traffic accidents, some unusual events and the like. Most devices that capture video sources simply record the motion at each moment and do not have the ability to automatically recognize it (often requiring special personnel to be responsible for manual viewing). Due to the huge amount of video, it is obviously not realistic to filter the content in the video only by human power. There is a need for techniques that utilize computer vision and deep learning to automatically detect behavioral events occurring in a video.
Fig. 1A is a schematic diagram of a network architecture of a behavior recognition method according to an embodiment of the present application, as shown in fig. 1A, the network architecture includes: camera 101, object detection module 102, logic recognition module 103, and model recognition module 104: the object detection module 102, the logic identification module 103, and the model identification module 104 may be disposed in the server 100, and in order to support an exemplary application, the camera 101 establishes a communication connection with the server 100 through a network. The method comprises the steps that a video under a specific scene is collected through a camera 101, then a video sequence 11 to be identified, namely a multi-frame image containing a group object, is obtained through sampling, and the video sequence 11 to be identified is input into an object detection module 102; the object detection module 102 may fully utilize a correlation detection algorithm, such as an inter-frame difference method, a background subtraction method, an optical flow method, etc., to achieve the positioning and analysis of each object in the video sequence 11 to be recognized, so as to obtain a multi-frame image 12 with a detection result (labeling a detection frame of each object); then, the logic identification module 103 conducts reasoning of a dense area guided by a method based on detection frame density distribution and heuristic dense group position estimation on the multi-frame image 12 with the detection result, preliminarily detects abnormal behaviors of a group object which is obvious in the video sequence 11 to be identified, or filters some unlikely situations from geometric information of some apparent layers, improves the identification efficiency of the video sequence, and simultaneously improves the accuracy of behavior identification; finally, generating a track sequence 13 of group objects based on the dense area in each frame of image for the video sequence 11 to be identified which does not accord with the logic detection rule; inputting the trajectory sequences 13 together into the model identification module 104; the model identification module 104 may perform behavior identification on the track sequence 13 by fully using the relevant video understanding model, and finally output the identification result of the video layer. Therefore, for the video sequence to be identified which does not conform to the logic detection rule, the neural network is used for further confirmation and judgment, and the robustness of the whole set of identification scheme is enhanced. Based on the network architecture, a group object behavior identification method framework based on logic rules and deep learning can be designed.
The embodiment of the application provides a behavior identification method which is applied to a server, terminal equipment or other equipment. The terminal device includes, but is not limited to, a mobile phone, a notebook computer, a tablet computer, a handheld internet device, a multimedia device, a streaming media device, a mobile internet device, a wearable device, or other types of devices.
Fig. 1B is a schematic flow chart of a behavior recognition method according to an embodiment of the present application, and as shown in fig. 1B, the method at least includes the following steps:
step S110, acquiring a detection frame of each object in each frame of image in a video sequence to be identified;
here, the video sequence to be recognized may be obtained by sampling a specific video source. For example, videos shot at a fixed angle often cover a wide field of view and contain much information, such as pedestrians, vehicles, animals, buildings, and other complex background information.
It is understood that the video sequence to be identified is a frame sequence composed of multiple frames of images, where each frame of image may or may not include at least one object, and the objects included in different frames of images are not necessarily the same, but all frames of images of the video sequence to be identified include at least two objects that may have group object behavior events. The object may be a pedestrian, or may be a moving vehicle, an animal, or the like, and is determined according to an actual scene in implementation, which is not limited in the embodiment of the present application.
The detection and positioning analysis of objects in video images can be realized by related image or video processing technology; for example, an object detection algorithm preprocesses the video sequence to be identified to obtain multiple frames of images with detection frames, and the detection frames of the objects in each frame of image are then extracted. The detection algorithm may be template matching, or a method for detecting moving objects in video such as the inter-frame difference method, the background subtraction method, or the optical flow method, which is not limited in this embodiment of the present application.
For different objects appearing in the video sequence to be identified, the detection frame of each object and the unique object identification associated with the detection frame can be obtained after object detection. The detection frames of different objects are distinguished by corresponding object identifiers, so that the main object with behavior can be automatically identified and processed in time.
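As one illustration of such a detection algorithm, the sketch below uses the inter-frame difference method with OpenCV 4.x to locate moving objects; a deployed system would more likely use a trained detector plus a tracker to assign the per-object identifiers mentioned above. The thresholds are assumed values.

```python
import cv2

def moving_object_boxes(prev_frame, frame, diff_threshold=25, min_area=500):
    """Inter-frame difference: boxes (x, y, w, h) around regions that moved."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev_gray, cur_gray)
    _, mask = cv2.threshold(diff, diff_threshold, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, None, iterations=2)        # merge nearby fragments
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours
            if cv2.contourArea(c) >= min_area]         # drop tiny noise blobs
```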
Step S120, determining a first identification result of the video sequence to be identified based on the distribution condition of the detection frames in each frame of image;
here, each frame image may be divided into a sparse region and a dense region based on the distribution of the detection frames in each frame image, where the sparse region includes the detection frames independent in spatial position or no detection frame, and the dense region includes at least two detection frames overlapping each other.
It can be understood that the dense region in each frame of image can be regarded as a target region where group object behaviors are likely to occur, so that the dense region can be analyzed, whether abnormal behaviors of group objects such as "shelving", "charging", "surrounding", and the like exist in the dense region is preliminarily judged, and a first identification result of the video sequence to be identified is determined.
Therefore, the inference of the local dense region is guided by a heuristic dense group position estimation method, the effective perception region of the model can be increased, and the retrieval range of irrelevant backgrounds is reduced.
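A minimal sketch of this partition, assuming detection boxes are (x1, y1, x2, y2) tuples: a frame's boxes that intersect at least one other box form the dense set, and the remainder the sparse set.

```python
def overlaps(a, b):
    """True if boxes a and b, given as (x1, y1, x2, y2), intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def split_sparse_dense(boxes):
    """Boxes overlapping at least one other box form the dense set."""
    dense, sparse = [], []
    for i, a in enumerate(boxes):
        if any(i != j and overlaps(a, b) for j, b in enumerate(boxes)):
            dense.append(a)
        else:
            sparse.append(a)
    return sparse, dense
```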
Step S130, performing behavior recognition on the track sequence of the group object under the condition that the first recognition result meets a first condition to obtain a fourth recognition result;
here, the track sequence of the group object is generated based on the video sequence to be identified, and the track sequence is an active area sequence of the group object in each frame of image in the video sequence to be identified; wherein the population object includes at least two target objects having a spatial distance less than a distance threshold.
In implementation, firstly, the video sequence to be recognized for which step S120 determines that the first recognition result indicates possible abnormal behavior is preprocessed to generate a track sequence of group objects; then the track sequence of the group objects is input into the trained behavior recognition model for further recognition and judgment to obtain a fourth recognition result: if the discrimination score output by the behavior recognition model is higher than a given threshold, it is determined that abnormal behavior exists in the track sequence, and a specific group object behavior type is output.
Exemplarily, with a pedestrian as the detection object, the abnormal behavior types of the group object may include behaviors such as "trampling", "two-person mutual fighting", "multi-person group fighting", "charging", "holding", and "embracing"; if the discrimination score output by the model does not meet the given threshold, it is determined that the fourth recognition result is that no abnormal group object behavior occurs in the video sequence to be recognized.
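The following sketch shows the thresholding step described here. The model interface (returning a score and a label) and the 0.5 threshold are assumptions for illustration; any video-understanding classifier trained on group behaviors could fill the role.

```python
def fourth_result(track_clip, model, score_threshold=0.5):
    """Classify a group-object trajectory sequence with a video model.

    `model` is any callable returning (score, label); this interface and the
    0.5 threshold are illustrative assumptions, not fixed by the method.
    """
    score, label = model(track_clip)
    if score >= score_threshold:
        return label                     # e.g. "multi-person group fighting"
    return "no abnormal behavior"
```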
Step S140, determining a target recognition result of the video sequence to be recognized based on at least the first recognition result or the fourth recognition result.
Here, the first recognition result is a result of judging whether an abnormal behavior occurs in the video sequence to be recognized based on the logic rule, and the fourth recognition result is a result of further recognition through the neural network when the logic rule judges that abnormal behavior may occur in the video sequence to be recognized. A target identification result of the video sequence to be identified is determined by combining the first identification result obtained by the heuristic logic judgment method and the fourth identification result obtained by the deep learning method, so that the accuracy and efficiency of identifying abnormal behavior are improved.
In some embodiments, in a case that the first recognition result indicates that no abnormal behavior occurs in the video sequence to be recognized, a target recognition result of the video sequence to be recognized is determined based on the first recognition result. Therefore, when the first identification result of the video sequence to be identified represents that the video sequence to be identified has no abnormal behavior through logic pre-judgment, the first identification result is directly used as the final target identification result, the video sequence to be identified which obviously has no abnormal behavior can be directly determined, and the identification process is shortened.
In other embodiments, in a case where the first recognition result indicates that an abnormal behavior occurs in the video sequence to be recognized, the target recognition result of the video sequence to be recognized is determined based on the fourth recognition result. Therefore, for the situation that abnormal behaviors possibly exist in the video sequence to be identified through logic judgment, the neural network is further combined for identification, and the fourth identification result is used as a target identification result, so that the robustness of behavior identification is improved.
It should be noted that, in step S110, the behavior body can be flexibly and accurately located and analyzed by using the relevant detection algorithm; in step S120, whether the video sequence to be recognized has an obvious abnormal behavior is preliminarily determined according to the distribution of the detection frames in each frame of image, or some unlikely situations are filtered from some apparent layer geometric information, so as to improve the recognition efficiency of the video sequence and improve the accuracy of behavior recognition; in the behavior recognition process in the step S130, the existing video understanding model can be fully utilized, so that a behavior recognition algorithm framework based on logic rules and deep learning is constructed, and the accuracy and robustness of behavior recognition are improved. The video sequence to be recognized of the whole image is converted into the track sequence of the group object, namely the accurate behavior occurrence region, the image range of the input model is reduced, the model is enabled to recognize only the region where the group object is located in each frame of image, and the efficiency and the precision of behavior recognition can be improved. Therefore, the video information acquisition method and device can be suitable for more video information which is large in coverage view and contains more information, such as video sources acquired in outdoor urban street scenes and indoor rail transit scenes.
In some possible implementations, after the target recognition result of the video sequence to be recognized is obtained through the two-stage analysis of logic pre-judgment and model recognition, an alarm notice can be sent to a relevant department or platform, so that behaviors endangering the safety of the user and others are handled in time. One possible implementation is as follows: under the condition that abnormal behaviors of group objects exist in the video sequence to be recognized, determining the spatial position and the behavior category of the abnormal behaviors; determining alarm content according to the spatial position and the behavior category; and sending an alarm notification to the terminal equipment corresponding to the spatial position according to the alarm content, so that a manager holding the terminal equipment handles the abnormal behavior.
It can be understood that different location areas have corresponding managers holding terminal devices, and when a terminal device receives the notification of the alarm system, the group object behaviors occurring in the location area can be quickly located and handled. After behaviors endangering the safety of the user and others occur in outdoor urban street scenes, indoor rail transit scenes and the like, the system can automatically identify the behavior subjects and give an alarm, providing an efficient and convenient detection capability for personnel with related requirements.
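A schematic of this alarm flow, with lookup_device and send standing in for whatever messaging API the platform actually provides (both are hypothetical helpers, not part of the claimed method):

```python
from dataclasses import dataclass

@dataclass
class AbnormalEvent:
    position: tuple    # spatial position where the behavior occurred
    category: str      # behavior category, e.g. "two-person mutual fighting"

def send_alarm(event: AbnormalEvent, lookup_device, send) -> None:
    """Notify the terminal device responsible for the event's position."""
    device = lookup_device(event.position)   # manager's terminal for the area
    send(device, f"{event.category} detected at {event.position}")
```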
In the embodiment of the application, a detection frame of each object in each frame of image is firstly obtained, a first identification result of a video sequence to be identified is determined based on the distribution condition of the detection frame in each frame of image, and then under the condition that the first identification result meets a first condition, a track sequence of group objects generated based on the video sequence to be identified is subjected to behavior identification to obtain a fourth identification result. The embodiment of the application provides a behavior identification method based on heuristic logic and deep learning fusion, which can support the abnormal behavior detection requirement of group objects and improve the robustness and accuracy of a pure neural network identification scheme. Meanwhile, the method is different from a behavior recognition method based on a full-image sequence, the video classification problem is converted into the recognition problem of the group object track sequence, the perception capability of the model on the target area in the video source is effectively improved, and the retrieval range and the calculation amount are greatly reduced.
Fig. 2 is a schematic flow chart of a behavior recognition method according to an embodiment of the present application, and as shown in fig. 2, the method at least includes the following steps:
step S210, acquiring a detection frame of each object in each frame of image in a video sequence to be identified;
step S220, determining a dense area in each frame of image based on the distribution condition of the detection frames in each frame of image;
here, the dense region (patch) includes a center detection frame overlapping with other detection frames in each frame of the image by a number of times equal to or greater than a first threshold. That is, for each frame of image in the video sequence to be identified, the embodiment of the present application is expected to find the detection frame in which the detection frame overlapping with other detection frames is more as the central detection frame to generate the dense region including the group object.
The dense area at least comprises two objects, wherein one object is an object associated with the central detection frame; and the object identification associated with the center detection box is taken as the object identification associated with the dense area.
Step S230, determining a first identification result of the video sequence to be identified based on the number of objects included in the dense region in each frame of image;
here, whether the abnormal behavior of the video sequence to be recognized may occur is preliminarily determined by determining whether the number of objects in the dense area in each frame of image meets the requirement, and a first recognition result is obtained, where the first recognition result includes two cases, that is, the abnormal behavior of the video sequence to be recognized and the non-abnormal behavior of the video sequence to be recognized. It is generally considered that the number of objects involved in the occurrence of a group object behavior is two or more. Therefore, the inference of the local dense region is guided by a heuristic detection frame density distribution and dense group position estimation method, the effective perception region of the model can be increased, and the retrieval range of irrelevant backgrounds is reduced.
In some embodiments, in a case that the number of objects included in the dense region in each frame of image is greater than or equal to 2, determining that a first identification result of the video sequence to be identified is abnormal behavior; in other embodiments, in a case that the number of objects included in the dense region in each frame of image is less than 2, it is determined that the first recognition result of the video sequence to be recognized is that no abnormal behavior occurs.
In this way, whether the number of objects included in the dense area in each frame of image in the video sequence to be recognized is greater than or equal to 2 is determined, whether the video sequence to be recognized has abnormal behavior is determined, and a first recognition result is obtained. Therefore, the video sequence to be recognized, which obviously does not have abnormal behaviors of group objects, is quickly determined aiming at the condition that the number of the included objects does not meet the requirement, the recognition process is shortened, and the accuracy of the recognition method is enhanced.
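A sketch of this check, reading "each frame of image" as all frames of the sequence (whether a single qualifying frame would suffice is left open by the text):

```python
def first_result(dense_area_counts):
    """dense_area_counts: object count in the dense area of each frame."""
    if dense_area_counts and all(c >= 2 for c in dense_area_counts):
        return "abnormal behavior"       # hand the sequence to later stages
    return "no abnormal behavior"        # final: recognition stops here
```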
Step S240, under the condition that the first recognition result represents that abnormal behaviors occur in the video sequence to be recognized, performing behavior recognition on track sequences of group objects to obtain a fourth recognition result;
here, for the case where it is determined that the video sequence to be recognized has abnormal behavior based on the number of objects included in the dense area in each frame of image, it is necessary to further recognize a specific behavior type by using the neural network, so as to obtain a fourth recognition result of the video sequence to be recognized. Therefore, the problem of insufficient robustness and accuracy of pure neural network identification is improved by combining logic prejudgment.
Step S250, determining a target recognition result of the video sequence to be recognized based on at least the first recognition result or the fourth recognition result.
In some embodiments, in a case that the first recognition result represents that no abnormal behavior occurs in the video sequence to be recognized, determining a target recognition result of the video sequence to be recognized based on the first recognition result; in other embodiments, in a case where the first recognition result indicates that an abnormal behavior occurs in the video sequence to be recognized, the target recognition result of the video sequence to be recognized is determined based on the fourth recognition result.
In the embodiment of the application, the track sequence obviously without abnormal behaviors is preliminarily detected based on the number of objects included in the dense area in each frame of image, so that continuous detection in the behavior recognition model is avoided, and the recognition efficiency of the video sequence to be recognized is improved. For the situation that abnormal behaviors occur in a video sequence to be recognized is determined based on the number of objects included in a dense area, a neural network is further required to be used for recognizing specific behavior types, so that the accuracy and robustness of the whole recognition process are enhanced.
Fig. 3 is a schematic flow chart of a behavior recognition method according to an embodiment of the present application, and as shown in fig. 3, the method at least includes the following steps:
step S310, acquiring a detection frame of each object in each frame of image in the video sequence to be identified;
step S320, determining a first identification result of the video sequence to be identified based on the distribution condition of the detection frames in each frame of image;
step S330, determining the area change value of the dense region between adjacent frames in the video sequence to be identified under the condition that the first identification result represents that abnormal behaviors occur in the video sequence to be identified;
here, for the case where the first recognition result represents that an abnormal behavior occurs in the video sequence to be recognized, the area change values of the dense regions between the adjacent frames are further compared.
Illustratively, the area of the dense region in the (n-1)-th frame is 0.6 square centimeters, and the area of the same dense region in the n-th frame is 0.8 square centimeters, so that the area change value of the dense region between the (n-1)-th frame and the n-th frame is 0.2 square centimeters, which indicates that the position range of the group object has increased and the motion amplitude is changing drastically.
Step S340, determining a second identification result of the video sequence to be identified based on the area change value and the change threshold of the dense region;
and comparing whether the area change value of the dense region between adjacent frames is greater than or equal to a change threshold value, and judging whether the video sequence to be identified has abnormal behaviors to obtain a second identification result, wherein the second identification result comprises two conditions of the abnormal behavior of the video sequence to be identified and the non-abnormal behavior of the video sequence to be identified. In general, when the area of a dense region is greatly changed, a plurality of objects included in the dense region move violently, and abnormal behavior of a group of objects tends to occur.
If the area change values of the dense areas between the continuous adjacent frames in the video sequence to be recognized are smaller than the change threshold, judging that the second recognition result of the video sequence to be recognized is that no abnormal behavior occurs; and if the area change value of the dense region between any two adjacent frames is larger than the change threshold, judging that the second identification result of the video sequence to be identified is abnormal behavior.
Wherein, the change threshold is a preset empirical value. For example, the change threshold of the dense region area may be set to a specific percentage of the pedestrian detection frame when the detection object is a pedestrian and the detection behavior is a fighting type, or the like, which may be set in advance according to the object type and the behavior type. The embodiments of the present application do not limit this.
In some embodiments, the determining a second recognition result of the video sequence to be recognized based on the area variation value and the variation threshold of the dense region includes: determining a second identification result of the video sequence to be identified as abnormal behavior under the condition that the area change values of the dense regions between the adjacent frames are greater than or equal to the change threshold; and under the condition that the area change value of the dense region between the adjacent frames is smaller than the change threshold value, determining that the second identification result of the video sequence to be identified is not abnormal.
Therefore, whether the video sequence to be recognized has abnormal behaviors or not is judged by comparing the area change degree of the dense region between the adjacent frames in the video sequence to be recognized, and a second recognition result is obtained. Therefore, the video sequence to be recognized, in which abnormal behaviors obviously occur, can be quickly determined according to the severe motion condition, the recognition process is shortened, and the accuracy of the recognition method is enhanced.
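A sketch of this second-stage check: any adjacent-frame area jump at or above the change threshold is conclusive, matching the rule above.

```python
def second_result(areas, change_threshold):
    """areas: dense-region area in each successive frame of the sequence."""
    for prev, cur in zip(areas, areas[1:]):
        if abs(cur - prev) >= change_threshold:   # violent motion between frames
            return "abnormal behavior"            # final: conclusive on its own
    return "no abnormal behavior"                 # hand over to the next stage
```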
Step S350, under the condition that the second recognition result represents that the video sequence to be recognized does not have abnormal behaviors, performing behavior recognition on track sequences of group objects to obtain a fourth recognition result;
here, for the case where it is determined that no abnormal behavior occurs in the video sequence to be recognized based on the area of the dense region in each frame of image, it is necessary to further recognize a specific behavior type using a neural network, so as to obtain a fourth recognition result of the video sequence to be recognized. Therefore, the problem of insufficient robustness and accuracy of pure neural network identification is improved by combining logic prejudgment.
Step S360, determining a target recognition result of the video sequence to be recognized based on at least the first recognition result or the fourth recognition result.
Here, the target recognition result of the video sequence to be recognized is determined based on the first recognition result, the second recognition result, or the fourth recognition result.
In some embodiments, in a case that the first recognition result represents that no abnormal behavior occurs in the video sequence to be recognized, determining a target recognition result of the video sequence to be recognized based on the first recognition result; in some embodiments, in a case that the second recognition result represents that an abnormal behavior occurs in the video sequence to be recognized, determining a target recognition result of the video sequence to be recognized based on the second recognition result; in some embodiments, in a case where the first recognition result indicates that an abnormal behavior occurs in the video sequence to be recognized and the second recognition result indicates that an abnormal behavior does not occur in the video sequence to be recognized, based on the fourth recognition result, a target recognition result of the video sequence to be recognized is determined.
In the embodiment of the application, aiming at the condition that the first identification result represents abnormal behaviors in the video sequence to be identified, the second identification result is further determined based on the area change value of the dense region between adjacent frames, so that the robustness and the accuracy of the neural network identification scheme are improved by combining a two-stage heuristic logic method.
Fig. 4 is a schematic flow chart of a behavior recognition method according to an embodiment of the present application, and as shown in fig. 4, the method at least includes the following steps:
step S410, acquiring a detection frame of each object in each frame of image in the video sequence to be identified;
step S420, determining a first identification result of the video sequence to be identified based on the distribution condition of the detection frames in each frame of image;
step S430, determining an area change value of the dense region between adjacent frames in the video sequence to be identified under the condition that the first identification result represents that the abnormal behavior occurs in the video sequence to be identified;
step S440, determining a second identification result of the video sequence to be identified based on the area change value and the change threshold of the dense region;
the implementation process of the steps S410 to S440 is similar to the implementation process of the steps S310 to S340, and for the technical details not disclosed in the embodiment of the present application, please refer to the description of the previous embodiment for understanding.
Step S450, under the condition that the second identification result represents that the video sequence to be identified has no abnormal behavior, determining an enclosing frame area in each frame of image in the video sequence to be identified based on the coverage area of the dense area in each frame of image in the video sequence to be identified;
here, for the case where the second recognition result represents that the video sequence to be recognized does not send abnormal behavior, a bounding box region of a larger range than the dense region is further determined.
The range of the dense region in different frame images in the video sequence to be identified can be merged at the spatial position to obtain a wider range of the surrounding frame, and each frame image is intercepted to obtain the surrounding frame region in the corresponding frame image.
Step S460, determining a third identification result of the video sequence to be identified based on the pixel change condition of the surrounding frame area between adjacent frames in the video sequence to be identified;
Here, for the case where the area change of the dense region has not preliminarily determined that abnormal behavior occurs in the video sequence to be recognized, whether the pixels of the wider bounding box region change between adjacent frames is further compared, so that whether abnormal behavior occurs in the video sequence to be recognized is determined through multi-stage logic rules, yielding the third recognition result.
In implementation, the pixel variation between adjacent frames in the video sequence to be recognized can be determined by image processing algorithms in the related art. Adjacent frame images are usually set to the same size; a grayscale image has only a single channel, so pixels at corresponding positions are subtracted directly, while for a color image the components of each color channel are subtracted separately.
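For illustration, a minimal sketch of this frame-differencing step using OpenCV follows; the function name is an assumption, and cv2.absdiff subtracts per pixel and, for color images, per channel, as described above.

```python
import cv2

def frame_difference(frame_a, frame_b):
    """Absolute pixel difference between two same-sized adjacent frames."""
    assert frame_a.shape == frame_b.shape, "adjacent frames must have the same size"
    diff = cv2.absdiff(frame_a, frame_b)  # |a - b| per pixel / per channel
    if diff.ndim == 3:                    # color image: merge channel differences
        diff = diff.sum(axis=2)
    return diff                           # H x W map of absolute differences
```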
In some possible embodiments, the determining a third recognition result of the video sequence to be recognized based on the pixel variation of the bounding box region between adjacent frames in the video sequence to be recognized includes: determining a third identification result of the video sequence to be identified as abnormal behavior under the condition that the pixel difference value of the surrounding frame region between adjacent frames in the video sequence to be identified is greater than or equal to a pixel threshold value; and under the condition that the pixel difference value of the surrounding frame region between the adjacent frames in the video sequence to be identified is smaller than the pixel threshold value, determining that the third identification result of the video sequence to be identified is that no abnormal behavior occurs.
In some possible embodiments, the determining a third recognition result of the video sequence to be recognized based on the pixel variation of the bounding box region between adjacent frames in the video sequence to be recognized includes: determining a third identification result of the video sequence to be identified as abnormal behavior under the condition that the accumulated sum of the pixel difference values of the surrounding frame area between the adjacent frames in the video sequence to be identified is greater than or equal to a pixel threshold value; and under the condition that the accumulated sum of the pixel difference values of the surrounding frame areas between the adjacent frames in the video sequence to be identified is smaller than the pixel threshold value, determining that the third identification result of the video sequence to be identified is that no abnormal behavior occurs.
Here, it is determined whether the pixel difference value or the accumulated sum of the pixel difference values satisfies the pixel threshold value for every two adjacent frames of the video sequence to be identified, and it is determined whether an abnormal behavior occurs in the corresponding video sequence to be identified. Therefore, the video sequence to be recognized with obviously abnormal behaviors is quickly determined through logic rules, the recognition process is shortened, and the accuracy of the recognition method is enhanced.
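A sketch of both decision rules follows, reusing the frame_difference helper from the previous sketch; whether each adjacent pair or the accumulated sum is compared against the pixel threshold is a configuration choice, and the names are assumptions.

```python
def third_recognition_result(crops, pixel_threshold, use_cumulative=False):
    """crops: the bounding box region cut from every frame, in order.
    Returns True if abnormal behavior is judged to occur."""
    diffs = [float(frame_difference(a, b).sum())
             for a, b in zip(crops, crops[1:])]
    if use_cumulative:
        return sum(diffs) >= pixel_threshold             # accumulated-sum variant
    return any(d >= pixel_threshold for d in diffs)      # per-pair variant
```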
Step S470, under the condition that the third recognition result represents that the video sequence to be recognized does not have abnormal behaviors, performing behavior recognition on the track sequence of the group object to obtain a fourth recognition result;
Here, for the case where it is determined, based on the pixel change of the bounding box region in each frame of image, that no abnormal behavior occurs in the video sequence to be recognized, the specific behavior type needs to be further recognized by a neural network, so as to obtain the fourth recognition result of the video sequence to be recognized. In this way, combining logic pre-judgment alleviates the insufficient robustness and accuracy of purely neural-network recognition.
Step S480, determining a target recognition result of the video sequence to be recognized based on at least the first recognition result or the fourth recognition result.
Here, the target recognition result of the video sequence to be recognized is determined based on the first recognition result, the second recognition result, the third recognition result, or the fourth recognition result.
In some embodiments, when the first recognition result represents that no abnormal behavior occurs in the video sequence to be recognized, the target recognition result of the video sequence to be recognized is determined based on the first recognition result; in some embodiments, when the second recognition result represents that abnormal behavior occurs, the target recognition result is determined based on the second recognition result; in some embodiments, when the third recognition result represents that abnormal behavior occurs, the target recognition result is determined based on the third recognition result; in some embodiments, when the first recognition result indicates that abnormal behavior occurs while both the second and third recognition results indicate that it does not, the target recognition result of the video sequence to be recognized is determined based on the fourth recognition result.
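The cascade above can be summarized by the following sketch, assuming each recognition result has already been reduced to a boolean meaning "abnormal behavior occurs"; in practice the later stages would only be computed when actually reached.

```python
def target_recognition_result(first, second, third, fourth):
    if not first:
        return False  # stage 1 finds no crowding -> no abnormal behavior
    if second:
        return True   # sharp dense-region area change -> abnormal behavior
    if third:
        return True   # strong bounding-box pixel change -> abnormal behavior
    return fourth     # otherwise defer to the behavior recognition model
```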
In the embodiment of the application, for the case where the second recognition result represents that no abnormal behavior occurs in the video sequence to be recognized, a third recognition result of the video sequence to be recognized is determined; and when the third recognition result also represents that no abnormal behavior occurs, the trajectory sequence of the group object is recognized by the behavior recognition model. Therefore, after heuristic logic judgment is performed on the video sequence to be recognized through the area change of the dense region and the pixel change of the merged bounding box region, behavior recognition is performed by the behavior recognition model, which can improve the robustness and accuracy of the neural-network recognition scheme.
Fig. 5 is a schematic flowchart of generating a trajectory sequence of a group object according to an embodiment of the present application, and as shown in fig. 5, the method at least includes the following steps:
step S510, determining a dense area in each frame of image based on the distribution condition of the detection frames in each frame of image;
Here, the dense region includes a central detection frame whose count of overlaps with other detection frames in each frame of image is greater than or equal to a first count threshold. The central detection frames in different frame images may belong to the same object or to different objects, and the central detection frame of an object may appear in every frame image of the video sequence to be identified or only in some of the frames.
In implementation, the central detection frame that matches other detection frames most frequently in each frame of image can be determined by constructing an adjacency matrix, and it is generally considered that group object behavior is likely to occur around the spatial position of the central detection frame. Therefore, the dense region of each frame of image can be obtained by extending a certain range outward with the central detection frame as the center.
In some embodiments, the dense regions in each frame of image are determined by: determining a central detection frame in each frame of image based on the distribution condition of the detection frames in each frame of image; and expanding the central detection frame in each frame of image outwards according to a specific proportion to obtain a dense area in each frame of image.
The specific proportion is an empirical value, generally set to 2; it may also be determined according to the actually captured application scene, so that the expanded dense region surrounds, as far as possible, all subjects performing the group object behavior.
In this way, the central detection frame is determined from the distribution of the detection frames in each frame of image and then expanded outward, so that the expanded dense region, centered on the central detection frame, contains as far as possible all objects involved in the group object behavior. This narrows the behavioral receptive field around the acting subjects and alleviates the small effective perception range encountered when recognizing a full-field-of-view video sequence.
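For illustration, a minimal sketch of the outward expansion follows; the box format and the clipping behavior are assumptions, and the default ratio of 2 reflects the empirical value above.

```python
def expand_box(box, ratio=2.0, image_w=None, image_h=None):
    """Expand an (x1, y1, x2, y2) box about its center by `ratio`,
    optionally clipped to the image bounds."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w, half_h = (x2 - x1) * ratio / 2.0, (y2 - y1) * ratio / 2.0
    nx1, ny1, nx2, ny2 = cx - half_w, cy - half_h, cx + half_w, cy + half_h
    if image_w is not None:
        nx1, nx2 = max(0.0, nx1), min(float(image_w), nx2)
    if image_h is not None:
        ny1, ny2 = max(0.0, ny1), min(float(image_h), ny2)
    return (nx1, ny1, nx2, ny2)
```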
In some embodiments, the number of detection frames included in each frame of image is greater than or equal to 2, and determining the central detection frame in each frame of image based on the distribution of the detection frames includes: generating an adjacency matrix corresponding to each frame of image according to the intersection-over-union between every two detection frames in the image; and taking a detection frame whose matching count in the adjacency matrix is greater than or equal to a second count threshold as the central detection frame of the image; wherein the second count threshold is greater than the first count threshold.
Here, the Intersection over Union (IoU) between two detection frames is the area of their overlap divided by the area of their union. The value at (i, j) in the adjacency matrix is the IoU of detection frame i and detection frame j in the frame image. The number of entries in row i with a value greater than 0 (excluding the diagonal) is taken as the matching count of detection frame i. If a detection frame j overlaps many other detection frames, many objects exist around its position and a group object behavior is likely to occur there, so detection frame j can be used as the central detection frame of the frame image in which it is located.
Therefore, for the case where each frame of image includes a plurality of detection frames, the intersection-over-union between every two detection frames is calculated and the matching count of each detection frame is tallied, so that the central detection frame overlapping the most other detection frames can be screened out accurately. This facilitates the subsequent generation of the dense region and the trajectory sequence of the group object, and effectively extracts the useful information in the video to be recognized.
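A minimal sketch of the adjacency-matrix construction and center-frame selection follows; the count threshold value and the function names are illustrative assumptions.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def center_detection_box(boxes, count_threshold=2):
    """Return the index of the box overlapping the most other boxes,
    or None if no box reaches the count threshold."""
    n = len(boxes)
    if n == 0:
        return None
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            adj[i, j] = adj[j, i] = iou(boxes[i], boxes[j])
    matches = (adj > 0).sum(axis=1)  # matching count per box; the diagonal stays 0
    best = int(matches.argmax())
    return best if matches[best] >= count_threshold else None
```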
In some other embodiments, when each frame of image includes at least two detection frames, the detection frame of each object is first expanded outward by a fixed ratio to obtain expanded detection frames; at least two first detection frames are screened from the expanded detection frames, the first detection frames being those with the largest areas among the expanded detection frames; then the intersection-over-union between the at least two first detection frames is determined. For example, when the fixed ratio is 1.5, the length and width of each object's detection frame are expanded by 1.5 times. This enlarges the overlap between nearby detection frames, so that the intersection-over-union between mutually overlapping detection frames can be calculated more reliably.
Step S520, generating a track sequence of the group object based on the dense region in each frame of image.
Here, a minimum bounding box containing the same group object in every frame image is determined based on the dense regions, and the bounding box region is extracted from each frame image according to this minimum bounding box to form the trajectory sequence of the group object. This facilitates the subsequent behavior recognition of the group object by the behavior recognition model.
The sequence of trajectories for the population object may be generated by:
step S5201, determining a surrounding frame area in each frame of image based on a coverage area of a dense area in each frame of image in the video sequence to be identified;
Here, a union may be taken over the coverage of the dense regions in the video sequence to be recognized: combining the extreme values of the boundary coordinates of each dense region yields a minimum bounding box that can contain all detection frames overlapping the central detection frame. An image segmentation technique or another image processing technique in the related art is then used to crop, from each frame of image, the bounding box region at the spatial position of this minimum bounding box; the embodiment of the present application does not limit this.
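A sketch of the union step, assuming the (x1, y1, x2, y2) box format:

```python
def union_bounding_box(dense_regions):
    """Minimum box covering every per-frame dense region, taken from the
    extreme values of the region boundary coordinates."""
    x1 = min(r[0] for r in dense_regions)
    y1 = min(r[1] for r in dense_regions)
    x2 = max(r[2] for r in dense_regions)
    y2 = max(r[3] for r in dense_regions)
    return (x1, y1, x2, y2)
```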
It should be noted that after the region corresponding to the central detection frame is adaptively expanded outward by the specific ratio, the resulting dense region is still rectangular. For convenience of subsequent processing, the bounding box region obtained by merging the per-frame dense regions may be size-normalized. For example, the long side of the bounding box region is resized to 224 pixels, the short side is scaled proportionally with it, and the short side, being less than 224 pixels after scaling, is padded with black borders up to 224 pixels.
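A sketch of this normalization with OpenCV follows; centering the padding is an assumption for illustration.

```python
import cv2

def normalize_crop(crop, long_side=224):
    """Resize the long side to 224, scale the short side proportionally,
    and pad the short side with black borders up to 224."""
    h, w = crop.shape[:2]
    scale = long_side / max(h, w)
    resized = cv2.resize(crop, (round(w * scale), round(h * scale)))
    pad_h = long_side - resized.shape[0]
    pad_w = long_side - resized.shape[1]
    top, left = pad_h // 2, pad_w // 2
    return cv2.copyMakeBorder(resized, top, pad_h - top, left, pad_w - left,
                              cv2.BORDER_CONSTANT, value=0)
```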
Step S5202, generating a track sequence of the group object based on the timestamp of each frame of image and the bounding box area in each frame of image.
Here, each frame of image in the video sequence to be recognized carries a respective timestamp, which may be a timestamp set when the image is collected by a collection device such as a camera, or a timestamp set in a process of subsequently sampling an original image sequence, or a timestamp set in another implementable manner, which is not limited in this application.
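A sketch of step S5202 follows, reusing the normalize_crop helper from the previous sketch; the parallel lists of frames and timestamps are an assumed interface.

```python
import numpy as np

def build_trajectory_sequence(frames, timestamps, bbox):
    """Crop the same bounding box region from every frame and stack the
    crops in timestamp order into one trajectory sequence."""
    x1, y1, x2, y2 = (int(v) for v in bbox)
    order = sorted(range(len(frames)), key=lambda i: timestamps[i])
    crops = [normalize_crop(frames[i][y1:y2, x1:x2]) for i in order]
    return np.stack(crops)  # shape (T, 224, 224, C)
```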
Replacing the full-image video sequence with the trajectory sequence generated from the bounding box regions in timestamp order reduces the receptive field of the group behavior event and improves the accuracy and efficiency of behavior recognition. Meanwhile, the relative positions within the group object behavior are preserved, which brings a clear performance gain for detecting behaviors that are spatially similar but differ in motion rhythm.
In the embodiment of the application, dense areas are determined by using the distribution condition of the detection frames in the obtained frame images, and then the dense areas in each frame image are preprocessed to obtain a track sequence of the group object, so that the problem of identifying the group object behaviors is solved. Meanwhile, the track sequence of the group object reduces the behavior receptive field of the action execution main body, and the problem of small effective perception range when the video sequence to be recognized in the whole visual field is recognized can be solved.
The above behavior recognition method is described below with reference to a specific embodiment, but it should be noted that the specific embodiment is only for better describing the present application and is not to be construed as limiting the present application.
The embodiment of the present application is described taking the group object as a group of pedestrians and the abnormal group behavior as fighting. The traditional behavior recognition method usually applies full-image data enhancement or other preprocessing to the input video sequence before feeding it to a classification model for prediction. However, this approach only suits human-centered video behavior recognition, and such data are typical of public academic video datasets. An online video stream usually contains more information and covers a larger field of view, while the position of the target event and the scale of the human body are random. Simply using the full image as model input is therefore clearly unreasonable. Furthermore, relying solely on a neural network for recognition brings uncertainty and inaccuracy.
For identification of fighting abnormal behaviors in an online video stream, it is necessary to be able to locate a general fighting area in a whole image (especially a high visual angle) in a video sequence, increase the effective sensing range of a machine for an input video stream, and be able to accurately identify fighting events covering postures of 'two people fighting', 'multi-people fighting', 'charging up', 'holding up', and 'surrounding'. Meanwhile, the indoor and outdoor general scenes such as urban streets, rail transit and the like are supported, so that the automatic analysis of the behavior events in the video content provides convenience and evidence obtaining capability for related departments.
The embodiment of the application provides a method for recognizing abnormal fighting behavior based on the fusion of heuristics and deep learning. First, at the data-processing level, unlike the traditional video behavior recognition method based on full-image sequences, the embodiment of the application proposes guiding inference over a local dense region based on the density distribution of detection frames and heuristic estimation of the dense group position, which enlarges the effective perception region of the model and narrows the search range over irrelevant background.
Second, at the level of the behavior recognition algorithm: the minimum bounding box of the group object in the video sequence is determined through a preprocessing step, and the bounding box region in each frame of image is determined. For a video sequence meeting the pedestrian-count requirement, the degree of change of the region area between adjacent frames is calculated; if the area changes sharply, it is determined that fighting occurs in the corresponding video sequence. Otherwise, the pixel difference between adjacent frames is further calculated: when the pixel change is large, that is, the objects move violently, it is directly output that fighting occurs in the corresponding video sequence; when the pixel change is small, that is, the motion intensity does not meet the threshold, the sequence needs to be further fed into a behavior recognition model for final judgment.
Fig. 6 is a logic flow diagram of a multi-person behavior recognition method according to an embodiment of the present application, and as shown in fig. 6, the logic flow includes the following steps:
step S601, extracting a detection frame of each pedestrian in a video sequence to be identified respectively;
here, a full-image video sequence shot by the image acquisition device is acquired, and a detection frame of a pedestrian in the video image is acquired by calling an upstream detection component.
Step S602, expanding the detection frames of each pedestrian according to a fixed proportion and sequencing the detection frames according to areas;
Here, all detection results of the current frame are expanded outward by a fixed ratio (1.5 times by default) and then sorted in descending order of area.
Step S603, constructing an adjacency matrix according to the intersection-over-union among the detection frames, and determining the central detection frame with the most matches;
After sorting, several detection frames with the largest areas are selected, their pairwise intersection-over-union is determined, and the adjacency matrix is constructed; based on the adjacency matrix, the detection frame with the most matches is taken as the central detection frame.
Step S604, restoring the resolution of the central detection frame and calculating a dense area containing pedestrians;
Here, the central detection frame obtained above is restored to the original resolution, a larger expansion ratio is defined, and the frame is adaptively expanded outward according to this ratio to obtain a larger rectangular frame as the dense region of the current frame.
It is generally considered that a multi-person behavior easily occurs in a dense area, and the center detection box is the center of the multi-person behavior. Therefore, the rectangular frame obtained by expanding the central detection frame outward, i.e., the dense region, should surround the group in which the multi-person behavior occurs as much as possible.
Step S605, judging whether the number of the people in the dense area is more than or equal to 2;
here, if the number of pedestrians in the dense area is 2 or more, step S607 is performed; otherwise, step S606 is executed, and the determination is ended.
Step S606, determining that no fighting takes place;
step S607, judging whether the area of the dense region between the adjacent frames is changed violently;
here, if the area of the dense region between the adjacent frames is changed drastically, step S608 is executed to end the determination; otherwise, the step S609 is continued.
Step S608, determining that fighting occurs;
Step S609, obtaining a larger bounding box region by taking the union of the dense regions of all frames in the sequence, and calculating the pixel difference of the bounding box region between adjacent frames;
The process of cropping the larger bounding box region from each frame of image according to the dense regions is as follows: first, the bounding box region in each frame of image is cropped according to the minimum bounding box; then the long side of the bounding box region is resized to 224 pixels, the short side is scaled proportionally, and the short side is padded with black borders above and below up to 224 pixels.
Step S610, judging whether the accumulated sum of the pixel difference values is larger than a pixel threshold value;
here, if the accumulated sum of the pixel difference values of each frame is greater than the pixel threshold, step S611 is executed to end the determination; otherwise, the step S612 is continuously executed.
Step S611, determining that fighting occurs;
and step S612, inputting the trained behavior recognition network for further recognition.
The designated bounding box regions are cropped from each frame to form the motion trajectory sequence of the group of pedestrians, which is fed into the neural network to recognize fighting events. The trajectory sequence obtained by the above procedure is shown in Fig. 7; it contains 5 bounding box region images, each of which contains a dense region composed of two subjects performing the abnormal behavior.
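For illustration, the logic flow of steps S601 to S612 can be condensed into the following sketch, reusing the helper functions from the earlier sketches; `detector` (pedestrian boxes per frame) and `model` (classifier over a trajectory sequence) are assumed interfaces, and both thresholds are placeholders.

```python
def recognize_fighting(frames, timestamps, detector, model,
                       change_threshold=0.3, pixel_threshold=1e6):
    dense_regions, counts = [], []
    for frame in frames:
        boxes = detector(frame)                                     # S601
        expanded = [expand_box(b, ratio=1.5) for b in boxes]        # S602
        center = center_detection_box(expanded)                     # S603
        if center is None:
            return False                                            # S606
        region = expand_box(boxes[center], ratio=2.0)               # S604
        dense_regions.append(region)
        counts.append(sum(iou(region, b) > 0 for b in boxes))       # pedestrians inside
    if min(counts) < 2:                                             # S605
        return False                                                # S606
    if second_recognition_result(dense_regions, change_threshold):  # S607
        return True                                                 # S608
    bbox = union_bounding_box(dense_regions)                        # S609
    seq = build_trajectory_sequence(frames, timestamps, bbox)
    total = sum(float(frame_difference(a, b).sum())                 # S609
                for a, b in zip(seq, seq[1:]))
    if total > pixel_threshold:                                     # S610
        return True                                                 # S611
    return bool(model(seq))                                         # S612
```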
According to the embodiment of the application, the perception capability of the model to the dense area in the video sequence is effectively improved through a heuristic and extensible estimation scheme of the position of the multi-person dense group, and the retrieval range and the calculation amount are greatly reduced; meanwhile, robustness and accuracy of the pure neural network identification scheme are improved by combining a two-stage heuristic logic method.
The method and device of the embodiment of the application are suitable for outdoor urban street scenes, indoor rail transit scenes and other scenes. After abnormal group object behavior occurs in such a scene, the device collecting the video source can automatically identify the region and type of the event and raise an alarm, providing efficient and convenient detection capability for personnel with related requirements.
Based on the foregoing embodiments, an embodiment of the present application further provides a behavior recognition apparatus, where the behavior recognition apparatus includes modules, sub-modules included in the modules, and units included in the sub-modules, and may be implemented by a processor in a device; of course, the implementation can also be realized through a specific logic circuit; in the implementation process, the Processor may be a Central Processing Unit (CPU), a microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
Fig. 8 is a schematic structural diagram of an identification apparatus for behaviors provided in an embodiment of the present application, and as shown in fig. 8, the identification apparatus 800 includes an obtaining module 810, a first determining module 820, an identifying module 830, and a second determining module 840, where:
the obtaining module 810 is configured to obtain a detection frame of each object in each frame of image in the video sequence to be identified;
the first determining module 820 is configured to determine a first identification result of the video sequence to be identified based on a distribution of the detection boxes in each frame of image;
the identification module 830 is configured to perform behavior identification on the track sequence of the group object to obtain a fourth identification result when the first identification result meets a first condition; wherein the track sequence of the group object is generated based on the video sequence to be identified; the population objects include at least two target objects having a spatial distance less than a distance threshold;
the second determining module 840 is configured to determine a target recognition result of the video sequence to be recognized based on at least the first recognition result or the fourth recognition result.
In some possible embodiments, the second determining module 840 includes: a fifth determining submodule, configured to determine, based on the first recognition result, a target recognition result of the video sequence to be recognized when the first recognition result represents that no abnormal behavior occurs in the video sequence to be recognized; or, when the first recognition result represents that abnormal behaviors occur in the video sequence to be recognized, determining a target recognition result of the video sequence to be recognized based on the fourth recognition result.
In some possible implementations, the first determining module 820 includes: the first determining submodule, configured to determine a dense area in each frame of image based on the distribution of the detection frames in each frame of image; the dense area comprises a central detection frame whose count of overlaps with other detection frames in each frame of image is greater than or equal to a first count threshold; and the second determining submodule, configured to determine a first identification result of the video sequence to be identified based on the number of objects included in the dense area in each frame of image.
In some possible embodiments, the identifying module 830 includes: a third determining submodule, configured to determine an area change value of the dense region between adjacent frames in the video sequence to be identified, when the first identification result represents that an abnormal behavior occurs in the video sequence to be identified; a fourth determining submodule, configured to determine a second recognition result of the video sequence to be recognized based on the area change value and the change threshold of the dense region; and the recognition submodule is used for performing behavior recognition on the track sequences of the group objects under the condition that the second recognition result represents that the video sequence to be recognized does not have abnormal behaviors, so as to obtain a fourth recognition result.
In some possible embodiments, the identifier module comprises: the first determining unit is used for determining a surrounding frame area in each frame of image in the video sequence to be identified based on the coverage area of the dense area in each frame of image in the video sequence to be identified under the condition that the second identification result represents that the video sequence to be identified has no abnormal behavior; the positions of the surrounding frame areas in each frame of image are consistent, the sizes of the surrounding frame areas are the same, and the surrounding frame areas surround the dense area; the second determining unit is used for determining a third identification result of the video sequence to be identified based on the pixel change condition of the surrounding frame area between adjacent frames in the video sequence to be identified; and the identification unit is used for performing behavior identification on the track sequences of the group objects under the condition that the third identification result represents that the video sequence to be identified does not have abnormal behaviors, so as to obtain a fourth identification result.
In some possible embodiments, the identification sub-module further includes a fifth determining unit, configured to determine, when the second identification result represents that the video sequence to be identified has abnormal behavior, that a target identification result of the video sequence to be identified is the abnormal behavior; or, under the condition that the third recognition result represents that the video sequence to be recognized has abnormal behaviors, determining that the target recognition result of the video sequence to be recognized is the abnormal behavior.
In some possible embodiments, the second determining unit is further configured to determine that a third recognition result of the video sequence to be recognized is abnormal behavior if a pixel difference value of the bounding box region between adjacent frames in the video sequence to be recognized is greater than or equal to a pixel threshold value; and under the condition that the pixel difference value of the surrounding frame region between the adjacent frames in the video sequence to be identified is smaller than the pixel threshold value, determining that the third identification result of the video sequence to be identified is that no abnormal behavior occurs.
In some possible embodiments, the fourth determining sub-module is further configured to determine that the second recognition result of the video sequence to be recognized is that abnormal behavior occurs if the area change value of the dense region between adjacent frames is greater than or equal to the change threshold; and to determine that the second recognition result of the video sequence to be recognized is that no abnormal behavior occurs if the area change value of the dense region between adjacent frames is smaller than the change threshold.
In some possible embodiments, the second determining sub-module is further configured to determine that the first identification result of the video sequence to be identified is abnormal behavior if the number of objects included in the dense region in each frame of image is greater than or equal to 2; and determining that the first identification result of the video sequence to be identified has no abnormal behavior under the condition that the number of objects included in the dense area in each frame of image is less than 2.
In some possible embodiments, the first determining sub-module includes: the first generating unit, configured to generate an adjacency matrix corresponding to each frame of image according to the intersection-over-union between every two detection frames in each frame of image; a third determining unit, configured to use a detection frame whose matching count in the adjacency matrix is greater than or equal to a second count threshold as the central detection frame in each frame of image; wherein the second count threshold is greater than the first count threshold; and the expanding unit, configured to expand the central detection frame in each frame of image outward according to a specific proportion to obtain a dense area in each frame of image.
In some possible embodiments, the first determining module 820 further includes a generating sub-module, configured to generate the trajectory sequence of the group object based on the dense region in each frame of the image.
In some possible embodiments, the generating sub-module comprises: a fourth determining unit, configured to determine, based on a coverage of a dense region in each frame of image in the video sequence to be identified, a bounding box region in each frame of image; the positions of the surrounding frame areas in each frame of image are consistent, the sizes of the surrounding frame areas are the same, and the surrounding frame areas surround the dense area; and the second generation unit is used for generating the track sequence of the group object based on the time stamp of each frame of image and the bounding box area in each frame of image.
Here, it should be noted that: the above description of the apparatus embodiments, similar to the above description of the method embodiments, has similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that, in the embodiment of the present application, if the behavior recognition method is implemented in the form of a software functional module and is sold or used as a standalone product, the behavior recognition method may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling a device (which may be a smartphone with a camera, a tablet computer, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps in the behavior recognition method in any of the above embodiments. Correspondingly, in an embodiment of the present application, a chip is further provided, where the chip includes a programmable logic circuit and/or program instructions, and when the chip runs, the chip is configured to implement the steps in any one of the behavior recognition methods in the foregoing embodiments. Correspondingly, in an embodiment of the present application, there is also provided a computer program product, which is used to implement the steps in any of the behavior recognition methods in the foregoing embodiments when the computer program product is executed by a processor of a device.
Based on the same technical concept, the embodiment of the present application provides a behavior recognition device, which is used for implementing the behavior recognition method described in the above method embodiment. Fig. 9 is a hardware entity diagram of an behavior recognition apparatus according to an embodiment of the present application, and as shown in fig. 9, the recognition apparatus 900 includes a memory 910 and a processor 920, where the memory 910 stores a computer program that is executable on the processor 920, and the processor 920 executes the computer program to implement steps in any behavior recognition method according to an embodiment of the present application.
The Memory 910 is configured to store instructions and applications executable by the processor 920, and may also buffer data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 920 and modules in the device, and may be implemented by a FLASH Memory (FLASH) or a Random Access Memory (RAM).
The processor 920 implements the steps of any of the behavior recognition methods described above when executing the program. The processor 920 generally controls the overall operation of the identification device 900.
The Processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is understood that the electronic device implementing the above-mentioned processor function may be other electronic devices, and the embodiments of the present application are not particularly limited.
The computer storage medium/Memory may be a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM), and the like; or may be various devices including one or any combination of the above memories, such as a mobile phone, a computer, a tablet device, a personal digital assistant, etc.
Here, it should be noted that: the above description of the storage medium and device embodiments is similar to the description of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing an automatic test line of a device to perform all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A method of behavior recognition, the method comprising:
acquiring a detection frame of each object in each frame of image in the video sequence to be identified;
determining a first identification result of the video sequence to be identified based on the distribution condition of the detection frames in each frame of image;
under the condition that the first recognition result meets a first condition, performing behavior recognition on the track sequence of the group object to obtain a fourth recognition result; wherein the track sequence of the group object is generated based on the video sequence to be identified; the population objects include at least two target objects having a spatial distance less than a distance threshold;
and determining a target recognition result of the video sequence to be recognized at least based on the first recognition result or the fourth recognition result.
2. The method of claim 1, wherein the determining the target recognition result of the video sequence to be recognized based on at least the first recognition result or the fourth recognition result comprises:
under the condition that the first identification result represents that no abnormal behavior occurs in the video sequence to be identified, determining a target identification result of the video sequence to be identified based on the first identification result; or,
under the condition that the first identification result represents that abnormal behavior occurs in the video sequence to be identified, determining a target identification result of the video sequence to be identified based on the fourth identification result.
3. The method according to claim 1 or 2, wherein the determining a first recognition result of the video sequence to be recognized based on the distribution of the detection boxes in each frame of image comprises:
determining a dense area in each frame of image based on the distribution condition of the detection frames in each frame of image; the dense area comprises a central detection frame whose count of overlaps with other detection frames in each frame of image is greater than or equal to a first count threshold;
and determining a first identification result of the video sequence to be identified based on the number of objects included in the dense area in each frame of image.
4. The method according to any one of claims 1 to 3, wherein performing behavior recognition on the trajectory sequence of the group object to obtain a fourth recognition result in the case that the first recognition result satisfies a first condition includes:
under the condition that the first identification result represents that abnormal behaviors occur in the video sequence to be identified, determining the area change value of the dense region between adjacent frames in the video sequence to be identified;
determining a second identification result of the video sequence to be identified based on the area change value and the change threshold of the dense region;
and under the condition that the second recognition result represents that the video sequence to be recognized does not have abnormal behaviors, performing behavior recognition on the track sequence of the group object to obtain a fourth recognition result.
5. The method according to claim 4, wherein performing behavior recognition on the track sequence of the group object to obtain the fourth recognition result when the second recognition result indicates that no abnormal behavior occurs in the video sequence to be recognized, includes:
under the condition that the second identification result represents that the video sequence to be identified has no abnormal behavior, determining an enclosing frame area in each frame of image in the video sequence to be identified based on the coverage range of the dense area in each frame of image in the video sequence to be identified; the positions of the surrounding frame areas in each frame of image are consistent, the sizes of the surrounding frame areas are the same, and the surrounding frame areas surround the dense area;
determining a third identification result of the video sequence to be identified based on the pixel change condition of the surrounding frame region between adjacent frames in the video sequence to be identified;
and under the condition that the third recognition result represents that the video sequence to be recognized does not have abnormal behaviors, performing behavior recognition on the track sequence of the group object to obtain a fourth recognition result.
6. The method of claim 5, wherein the method further comprises:
determining the target recognition result of the video sequence to be recognized as the abnormal behavior under the condition that the second recognition result represents that the abnormal behavior occurs in the video sequence to be recognized; or,
under the condition that the third identification result represents that the video sequence to be identified has abnormal behaviors, determining that the target identification result of the video sequence to be identified is the abnormal behavior.
7. The method according to claim 5 or 6, wherein the determining a third recognition result of the video sequence to be recognized based on the pixel change condition of the surrounding frame area between the adjacent frames in the video sequence to be recognized comprises:
determining a third identification result of the video sequence to be identified as abnormal behavior under the condition that the pixel difference value of the surrounding frame region between adjacent frames in the video sequence to be identified is greater than or equal to a pixel threshold value; or,
under the condition that the pixel difference value of the surrounding frame region between the adjacent frames in the video sequence to be identified is smaller than the pixel threshold value, determining that the third identification result of the video sequence to be identified is that no abnormal behavior occurs.
8. The method according to any one of claims 4 to 7, wherein the determining a second recognition result of the video sequence to be recognized based on the area variation value and the variation threshold of the dense region comprises:
determining a second identification result of the video sequence to be identified as abnormal behavior under the condition that the area change value of the dense region between the adjacent frames is greater than or equal to the change threshold; or,
under the condition that the area change value of the dense region between the adjacent frames is smaller than the change threshold, determining that the second identification result of the video sequence to be identified is that no abnormal behavior occurs.
9. The method according to any one of claims 3 to 8, wherein the determining a first recognition result of the video sequence to be recognized based on the number of objects included in the dense region in each frame of image comprises:
under the condition that the number of objects included in the dense area in each frame of image is more than or equal to 2, determining that a first identification result of the video sequence to be identified is abnormal behavior;
and determining that the first identification result of the video sequence to be identified has no abnormal behavior under the condition that the number of objects included in the dense area in each frame of image is less than 2.
10. The method according to any one of claims 3 to 9, wherein the determining the dense region in each frame of image based on the distribution of the detection boxes in each frame of image comprises:
generating an adjacency matrix corresponding to each frame of image according to the intersection-over-union between every two detection frames in each frame of image;
taking a detection frame whose matching count in the adjacency matrix is greater than or equal to a second count threshold as the central detection frame in each frame of image; wherein the second count threshold is greater than the first count threshold;
and expanding the central detection frame in each frame of image outwards according to a specific proportion to obtain a dense area in each frame of image.
11. The method of any of claims 3 to 10, further comprising:
and generating a track sequence of the group objects based on the dense areas in each frame of image.
12. The method of claim 11, wherein generating the sequence of trajectories of the population object based on the dense regions in each frame of image comprises:
determining a surrounding frame area in each frame of image based on the coverage area of the dense area in each frame of image in the video sequence to be identified; the positions of the surrounding frame areas in each frame of image are consistent, the sizes of the surrounding frame areas are the same, and the surrounding frame areas surround the dense area;
generating a trajectory sequence of the group object based on the timestamp of each frame of image and the bounding box region in each frame of image.
13. An apparatus for identifying a behavior, the apparatus comprising an obtaining module, a first determining module, an identifying module, and a second determining module, wherein:
the acquisition module is used for acquiring a detection frame of each object in each frame of image in the video sequence to be identified;
the first determining module is configured to determine a first identification result of the video sequence to be identified based on a distribution of the detection frames in each frame of image;
the identification module is used for performing behavior identification on the track sequence of the group object under the condition that the first identification result meets a first condition to obtain a fourth identification result; wherein the track sequence of the group object is generated based on the video sequence to be identified; the population objects include at least two target objects having a spatial distance less than a distance threshold;
the second determining module is configured to determine a target recognition result of the video sequence to be recognized based on at least the first recognition result or the fourth recognition result.
14. An apparatus for behavior recognition comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 12 when executing the program.
15. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 12.
CN202111234621.2A 2021-10-22 2021-10-22 Behavior recognition method and device, equipment and storage medium Pending CN113920585A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111234621.2A CN113920585A (en) 2021-10-22 2021-10-22 Behavior recognition method and device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111234621.2A CN113920585A (en) 2021-10-22 2021-10-22 Behavior recognition method and device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113920585A true CN113920585A (en) 2022-01-11

Family

ID=79242497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111234621.2A Pending CN113920585A (en) 2021-10-22 2021-10-22 Behavior recognition method and device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113920585A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926973A (en) * 2022-04-06 2022-08-19 珠海市横琴渤商数字科技有限公司 Video monitoring method, device, system, server and readable storage medium
WO2023138154A1 (en) * 2022-01-24 2023-07-27 上海商汤智能科技有限公司 Object recognition method, network training method and apparatus, device, medium, and program


Similar Documents

Publication Publication Date Title
CN108053427B (en) Improved multi-target tracking method, system and device based on KCF and Kalman
CN108062349B (en) Video monitoring method and system based on video structured data and deep learning
CN108009473B (en) Video structuralization processing method, system and storage device based on target behavior attribute
CN108052859B (en) Abnormal behavior detection method, system and device based on clustering optical flow characteristics
Bertini et al. Multi-scale and real-time non-parametric approach for anomaly detection and localization
CN104303193B (en) Target classification based on cluster
Albiol et al. Detection of parked vehicles using spatiotemporal maps
CN103824070B (en) A kind of rapid pedestrian detection method based on computer vision
Avgerinakis et al. Recognition of activities of daily living for smart home environments
WO2022227490A1 (en) Behavior recognition method and apparatus, device, storage medium, computer program, and program product
CN113920585A (en) Behavior recognition method and device, equipment and storage medium
CN111582060B (en) Automatic line drawing perimeter alarm method, computer equipment and storage device
CN110852179B (en) Suspicious personnel invasion detection method based on video monitoring platform
CN113111838A (en) Behavior recognition method and device, equipment and storage medium
CN110765903A (en) Pedestrian re-identification method and device and storage medium
KR101030257B1 (en) Method and System for Vision-Based People Counting in CCTV
CN111723773A (en) Remnant detection method, device, electronic equipment and readable storage medium
CN112270381A (en) People flow detection method based on deep learning
CN114885119A (en) Intelligent monitoring alarm system and method based on computer vision
Nayak et al. Deep learning based loitering detection system using multi-camera video surveillance network
KR20170006356A (en) Method for customer analysis based on two-dimension video and apparatus for the same
CN113052139A (en) Deep learning double-flow network-based climbing behavior detection method and system
CN116311166A (en) Traffic obstacle recognition method and device and electronic equipment
CN113963438A (en) Behavior recognition method and device, equipment and storage medium
CN111008601A (en) Fighting detection method based on video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination