CN113111838A - Behavior recognition method and device, equipment and storage medium - Google Patents

Behavior recognition method and device, equipment and storage medium

Info

Publication number
CN113111838A
CN113111838A (application CN202110447987.1A)
Authority
CN
China
Prior art keywords
frame
detection
sequence
image
dense
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110447987.1A
Other languages
Chinese (zh)
Inventor
苏海昇
王栋梁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202110447987.1A
Publication of CN113111838A
Priority to PCT/CN2021/129384 (WO2022227480A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/48 Matching video sequences

Abstract

Embodiments of the present application disclose a behavior recognition method, which includes: obtaining a detection result for each object in each frame image of a video sequence to be recognized, where the video sequence to be recognized contains a single object and/or a group object, and a group object comprises at least two objects whose spatial distance is smaller than a distance threshold; generating at least one first sequence and/or at least one second sequence according to the detection results of the objects, where each first sequence and each second sequence contains either the same single object or a group object; and performing behavior recognition on each first sequence and/or each second sequence using a trained multi-task recognition network model to obtain a behavior recognition result for each single object and/or a behavior recognition result for the group object. Embodiments of the present application also provide a behavior recognition apparatus, a device, and a storage medium.

Description

Behavior recognition method and device, equipment and storage medium
Technical Field
The present application relates to the field of computer vision, and in particular, but not exclusively, to behavior recognition methods and apparatuses, devices, and storage media.
Background
Behavior recognition in video is an important application in the field of computer vision. In practice it faces several difficulties: the number of subjects performing an action in an event is variable (events can generally be divided into single-object behaviors and group-object behaviors according to the number of acting subjects); different types of behavior events require different spatial receptive fields (subjects at different positions and scales may appear in the scene, while the region where the behavior of interest actually occurs usually occupies only a small part of the picture); and subjects that do not perform a preset behavior may overlap in spatial position with subjects that do. All of these issues interfere with the behavior recognition result.
Disclosure of Invention
The embodiment of the application provides a behavior identification method, a behavior identification device, equipment and a storage medium.
The technical scheme of the embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a behavior identification method, where the method includes:
obtaining a detection result for each object in each frame image of a video sequence to be recognized, where the video sequence to be recognized contains a single object and/or a group object, and the group object comprises at least two objects whose spatial distance is smaller than a distance threshold; generating at least one first sequence and/or at least one second sequence according to the detection result of each object, where each first sequence and each second sequence contains either the same single object or a group object; and performing behavior recognition on each first sequence and/or each second sequence using a trained multi-task recognition network model to obtain a behavior recognition result for each single object and/or a behavior recognition result for the group object.
In some possible embodiments, the detection result of the object at least includes a detection frame of the object and an object identifier associated with the detection frame; generating at least one first sequence according to the detection result of each object, including: determining a sparse detection frame in each frame image according to the distribution condition of the detection frame of each object in the corresponding frame image; wherein the number of times of overlapping between the sparse detection frame and the other detection frames except the sparse detection frame in each frame of image is less than a first time threshold; and generating the at least one first sequence according to the sparse detection frame in each frame of image and the object identification associated with the sparse detection frame.
In this way, firstly, the detection frames of the sparse region, namely the sparse detection frames, are determined by using the distribution of the detection frames in the frame image, and then the sparse detection frames in each frame image are preprocessed according to the object identification to obtain at least one first sequence. Therefore, the problem of identifying the behavior of a single object in a video sequence is solved, the identification difficulty of a behavior identification model can be reduced, and the calculation amount of the model is reduced.
In some possible embodiments, the generating at least one second sequence according to the detection result of each object includes: determining a dense detection frame in each frame image according to the distribution condition of the detection frame of each object in the corresponding frame image; the overlapping times of the dense detection frames and other detection frames except the dense detection frames in each frame of image are greater than or equal to a second time threshold value, and the second time threshold value is greater than the first time threshold value; and generating the at least one second sequence according to the dense detection frame in each frame of image and the object identification associated with the dense detection frame.
In this way, the detection frames of the dense region, namely the dense detection frames, are determined using the distribution of the detection frames in each frame image, and the dense detection frames in each frame image are then preprocessed to obtain at least one second sequence, which addresses the recognition of group object behaviors. Combined with the preprocessing of the sparse detection frames, the receptive field can be adaptively adjusted for different types of action-executing subjects, which alleviates the problem that the effective perception range is small when the full-view video sequence to be recognized is processed as a whole.
In some possible embodiments, the determining a sparse detection frame in each frame image according to a distribution of the detection frame of each object in the corresponding frame image includes: determining the number of detection frames included in each frame image; and in the case that the number of detection frames is 1, determining that detection frame as the sparse detection frame of the corresponding frame image.
Therefore, for the condition that each frame of image comprises one detection frame, the detection frame is directly used as a sparse detection frame, so that the subsequent generation of a first sequence of single object track behaviors is facilitated, the operation can be simplified, and useful information can be effectively extracted.
In some possible embodiments, determining the sparse detection frame and the dense detection frame in each frame image according to the distribution of the detection frame of each object in the corresponding frame image includes: determining the number of detection frames included in each frame image; in the case that the number of detection frames is greater than or equal to 2, generating an adjacency matrix for each frame image according to the intersection over union between every two detection frames in the image; taking a detection frame whose matching count in the adjacency matrix is zero as a sparse detection frame of the frame image; and taking a detection frame whose matching count in the adjacency matrix is greater than or equal to a third time threshold as a dense detection frame of the frame image; wherein the third time threshold is greater than the second time threshold.
Therefore, when a frame image includes a plurality of detection frames, by calculating the intersection over union between every two detection frames and counting the matching count of each detection frame, sparse detection frames that do not overlap with any other detection frame and dense detection frames that overlap with other detection frames most often can be accurately screened out, which facilitates the subsequent generation of first sequences of single-object trajectory behaviors and second sequences of group-object trajectory behaviors and effectively extracts useful information from the video to be recognized.
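As an illustration of this screening step, the following is a minimal sketch in Python, assuming boxes are given as (x1, y1, x2, y2) coordinates; the function names, the NumPy representation, and the default `dense_threshold` (a stand-in for the third time threshold) are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def iou_matrix(boxes):
    """Pairwise intersection over union for boxes given as an (N, 4) array of (x1, y1, x2, y2)."""
    n = len(boxes)
    m = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(i + 1, n):
            x1 = max(boxes[i][0], boxes[j][0]); y1 = max(boxes[i][1], boxes[j][1])
            x2 = min(boxes[i][2], boxes[j][2]); y2 = min(boxes[i][3], boxes[j][3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            area_i = (boxes[i][2] - boxes[i][0]) * (boxes[i][3] - boxes[i][1])
            area_j = (boxes[j][2] - boxes[j][0]) * (boxes[j][3] - boxes[j][1])
            m[i, j] = m[j, i] = inter / (area_i + area_j - inter + 1e-9)
    return m

def split_sparse_dense(boxes, dense_threshold=2):
    """Classify each box index as sparse (never overlaps) or dense (overlaps often)."""
    if len(boxes) == 1:
        return [0], []                          # a lone box is treated as sparse
    adj = iou_matrix(np.asarray(boxes, dtype=np.float32))
    match_counts = (adj > 0).sum(axis=1)        # matching count per detection frame
    sparse = [i for i, c in enumerate(match_counts) if c == 0]
    dense = [i for i, c in enumerate(match_counts) if c >= dense_threshold]
    return sparse, dense
```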
In some possible embodiments, the video sequence to be recognized is composed of frame images in a frame sequence buffer, and the generating the at least one first sequence according to a sparse detection box in each frame image and an object identifier associated with the sparse detection box includes: aiming at all frame images of the frame sequence buffer area, taking a union set of sparse detection frames associated with each object identifier at a spatial position to obtain a minimum bounding frame of a single object corresponding to each object identifier; according to the size of the smallest enclosing frame, a first area image corresponding to the spatial position of the smallest enclosing frame in each frame of image is intercepted; and sequentially connecting the first area images according to the time stamp of each frame image to obtain a first sequence corresponding to each object identifier.
Therefore, by calculating, for each object identifier, the minimum bounding box of the corresponding single object, and cropping each frame image with that minimum bounding box to obtain the first sequence, the loss of the behavior subject's relative position can be avoided, which brings a clear performance improvement for detecting behaviors that are spatially similar but have different motion rhythms.
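A sketch of how the first sequence could be assembled from this minimum bounding box, assuming frames are stored as (timestamp, image) pairs with images as NumPy arrays in H x W x C layout; the names and the data layout are assumptions for illustration.

```python
import numpy as np

def min_bounding_box(track_boxes):
    """Union over time of one object's sparse detection frames -> minimum enclosing box."""
    arr = np.asarray(track_boxes, dtype=np.float32)
    return arr[:, 0].min(), arr[:, 1].min(), arr[:, 2].max(), arr[:, 3].max()

def build_first_sequence(frames, track_boxes):
    """Crop the same enclosing region from every buffered frame, in timestamp order."""
    x1, y1, x2, y2 = (int(v) for v in min_bounding_box(track_boxes))
    ordered = sorted(frames, key=lambda f: f[0])             # frames: [(timestamp, image), ...]
    return [image[y1:y2, x1:x2] for _, image in ordered]     # the first sequence for one object id
```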
In some possible embodiments, the video sequence to be recognized is composed of frame images in a frame sequence buffer, each frame image in the frame sequence buffer includes a dense detection box, and the generating the at least one second sequence according to the dense detection box in each frame image and an object identifier associated with the dense detection box includes: expanding a specific proportion outwards for the dense detection frame in each frame of image in the frame sequence buffer area to obtain a dense area in each frame of image and an object identifier associated with the dense area; wherein the dense area includes at least two objects therein; merging the dense areas in each frame image in the frame sequence buffer area to obtain a dense surrounding frame of the frame sequence buffer area; intercepting a second area image corresponding to the spatial position of the dense enclosure frame in each frame of image according to the size of the dense enclosure frame; and sequentially connecting the second area images according to the time stamp of each frame image to generate a second sequence.
In this way, the dense detection box in which the group object behavior is predicted to easily occur is expanded outward, and the expanded dense region surrounds all the objects of the group behavior centering on the object corresponding to the dense detection box as much as possible. Meanwhile, by determining the dense surrounding frame of the frame sequence buffer area and intercepting each frame image by the dense surrounding frame to obtain a second sequence, the receptive field of the group object behavior event can be reduced, and the accuracy and efficiency of behavior identification are improved.
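A possible implementation of this expand-then-merge step, under the assumption of one dense detection frame per frame and images indexed as image[y, x]; the default expansion ratio of 1.5 mirrors the proportion example given later in the description, but is still only a placeholder here.

```python
import numpy as np

def expand_box(box, ratio, img_w, img_h):
    """Scale a detection frame outward about its centre by `ratio`, clipped to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * ratio, (y2 - y1) * ratio
    return (max(0, cx - w / 2), max(0, cy - h / 2), min(img_w, cx + w / 2), min(img_h, cy + h / 2))

def build_second_sequence(frames, dense_boxes, ratio=1.5):
    """Expand each frame's dense detection frame, merge the regions into one enclosing box,
    then crop every frame with that box, keeping timestamp order."""
    img_h, img_w = frames[0][1].shape[:2]
    regions = np.asarray([expand_box(b, ratio, img_w, img_h) for b in dense_boxes], np.float32)
    x1, y1 = int(regions[:, 0].min()), int(regions[:, 1].min())
    x2, y2 = int(regions[:, 2].max()), int(regions[:, 3].max())
    ordered = sorted(frames, key=lambda f: f[0])
    return [image[y1:y2, x1:x2] for _, image in ordered]
```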
In some possible embodiments, the group objects included in the images of each second sequence include at least one identical object.
Therefore, each generated second sequence contains at least the same object, so group object behaviors can be detected continuously, a subsequent detection system can automatically identify the objects in which the behavior occurs, and convenience is provided for departments with related requirements.
In some possible embodiments, the video sequence to be recognized is composed of frame images in a frame sequence buffer, each frame image in the frame sequence buffer includes at least one dense detection box, and the generating the at least one second sequence according to the dense detection box in each frame image and an object identifier associated with the dense detection box includes: expanding a specific proportion outwards for each dense detection frame in each frame of image to obtain a dense area corresponding to each dense detection frame and an object identifier associated with the dense area; wherein the dense area includes at least two objects therein; according to the object identification associated with the dense area, taking a union set of the dense areas comprising the same object to obtain at least one dense enclosure frame comprising the same object; intercepting a third area image corresponding to the spatial position of each dense surrounding frame in each frame of image according to the size of each dense surrounding frame; and sequentially connecting the third area images according to the time stamp of each frame image to generate at least one second sequence comprising the same object.
Therefore, the dense surrounding frames comprising the same object in the frame sequence buffer area are determined through the object identification associated with the dense area, each frame of image is intercepted by the dense surrounding frames to obtain a second sequence comprising the same object, the receptive field of group object behavior events can be reduced, and the accuracy and efficiency of behavior recognition are improved. Meanwhile, the relative position loss of the group object behaviors is avoided, and the performance improvement is good for the detection of the behaviors which are similar in space but different in movement rhythm.
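One way the per-object dense enclosing boxes could be formed when a frame contains several dense detection frames; the (timestamp, box, object_ids) record format is an assumption made for illustration.

```python
from collections import defaultdict
import numpy as np

def dense_enclosures_by_object(dense_regions):
    """dense_regions: [(timestamp, expanded_box, object_ids_inside), ...].
    Regions that contain the same object id are united into one enclosing box per id,
    each of which later yields one second sequence."""
    per_object = defaultdict(list)
    for _, box, object_ids in dense_regions:
        for object_id in object_ids:
            per_object[object_id].append(box)
    enclosures = {}
    for object_id, boxes in per_object.items():
        arr = np.asarray(boxes, dtype=np.float32)
        enclosures[object_id] = (arr[:, 0].min(), arr[:, 1].min(), arr[:, 2].max(), arr[:, 3].max())
    return enclosures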
In some possible embodiments, the generating the at least one second sequence according to the dense detection boxes in each frame of image and the object identifiers associated with the dense detection boxes further includes: determining whether a merge tag of the frame sequence buffer is true; the merging tag is a tag for determining whether a dense region of the frame images in the frame sequence buffer area is merged or not according to the background similarity of the frame images in the frame sequence buffer area; the merging the dense regions in each frame image in the frame sequence buffer area to obtain the dense surrounding frame of the frame sequence buffer area comprises: and under the condition that the merging label is true, merging the dense areas in each frame of image in the frame sequence buffer area to obtain a dense surrounding frame of the frame sequence buffer area.
Therefore, by setting the merge tag, the dense surrounding frame is obtained by merging the dense regions in each frame image under the condition that the backgrounds of the frame images in the frame sequence buffer area are similar, namely the regional image of each frame image is intercepted by a larger region, so that the receptive field of the behavior event of the group object can be reduced, the accuracy and the efficiency of behavior identification are further improved, and the method is suitable for the shooting scene with a fixed background.
In some possible embodiments, the generating the at least one second sequence according to the dense detection boxes in each frame of image and the object identifiers associated with the dense detection boxes further includes: under the condition that the merging label is false, respectively intercepting each frame of image according to the dense area of each frame of image to obtain a fourth area image with the length and the width of specific pixels; and sequentially connecting the fourth area images according to the time stamp of each frame image to generate at least one second sequence.
In this way, by setting the merge tag, the fourth region image of each frame image is respectively intercepted by the dense region of each frame under the condition that the backgrounds of the frame images in the frame sequence buffer area are not similar, and the region image of each frame is used for replacing the original video sequence to be identified, so that the receptive field of the group object behavior event can be reduced, the accuracy and the efficiency of behavior identification are further improved, and the method is suitable for the shooting scene with changed backgrounds.
In some possible embodiments, the generating the at least one second sequence according to the dense detection boxes in each frame image and the object identifiers associated with the dense detection boxes further includes: when a first frame image exists in the frame sequence buffer, taking the region image whose center point is the center of that frame image and whose length and width equal the specific pixels as the fourth region image of that frame image; wherein a first frame image is a frame image that does not include a dense detection frame.
Therefore, when the input video sequence to be recognized has the condition that a certain frame does not contain a dense detection frame, the fourth area image with the fixed size is intercepted by taking the central position of the frame as a reference, and then the fourth area image intercepted by other frames and the fourth area image are sent to the model for recognition, so that the problem that the target is lost in the group object behaviors which are continuously detected is solved, the accuracy of behavior recognition is improved, and the robustness of the model is enhanced.
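The three branches above (merge tag true, merge tag false, and a frame without any dense detection frame) could be combined per frame roughly as follows; the fixed crop size and the function signature are assumptions, not values from the patent.

```python
def crop_group_region(frame, dense_box, merge_tag, merged_box, crop_size=320):
    """Return the group-branch crop for one frame.
    merge_tag True  -> static background: crop with the merged enclosing box.
    merge_tag False -> crop a fixed crop_size x crop_size patch centred on the dense region,
                       or on the frame centre when this frame has no dense detection frame."""
    h, w = frame.shape[:2]
    if merge_tag:
        x1, y1, x2, y2 = (int(v) for v in merged_box)
        return frame[y1:y2, x1:x2]
    if dense_box is None:                                    # the "first frame image" case
        cx, cy = w // 2, h // 2
    else:
        x1, y1, x2, y2 = dense_box
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
    half = crop_size // 2
    x1, y1 = max(0, cx - half), max(0, cy - half)
    return frame[y1:y1 + crop_size, x1:x1 + crop_size]
```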
In some possible embodiments, the video sequence to be identified is composed of frame images in a frame sequence buffer, and the generating at least one first sequence according to the detection result of each object further includes: determining a first object identification number associated with a sparse detection frame in each frame of image in the frame sequence buffer; under the condition that the number of the first object identifications is smaller than a specific threshold value, selecting a candidate detection frame from the dense area in each frame of image; wherein the number of times of overlapping of the candidate detection box and other detection boxes is less than the first time threshold; and generating the at least one first sequence according to the candidate detection frame and the object identification associated with the candidate detection frame.
Therefore, in the process of preprocessing a video sequence to be recognized to generate a first sequence, under the condition that the first sequence generated based on the sparse detection frame does not meet a specific threshold, the accuracy of single object behavior recognition by using limited calculation power is ensured by selecting a candidate detection frame with a smaller overlapping frequency in a dense area and generating the first sequence, and the efficiency and the performance of the model are improved.
In some possible embodiments, in the case that the number of the first object identifiers is smaller than a certain threshold, selecting a candidate detection box from the dense area in the image of each frame includes: determining the specific threshold according to the number of second object identifications and specific processing parameters associated with the dense detection frames in each frame of image in the frame sequence buffer area; determining a third object identification number associated with the candidate detection frame according to the first object identification number and the specific threshold value; and selecting the detection frame with the overlapping times with other detection frames being less than the first time threshold value from the dense area in each frame of image as the candidate detection frame according to the third object identification number.
Therefore, a specific threshold value can be determined according to the number of the object identifications associated with the intensive detection frames in each frame of image in the frame sequence buffer area, so that the number of the object identifications associated with the candidate detection frames is determined, the first sequence and the second sequence generated by the video sequence to be identified meet the maximum number of parallel computing paths allowed on the line, the acquisition of effective information is promoted, and the efficiency and the performance of the model are further promoted.
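A rough sketch of this candidate selection under an assumed "maximum parallel streams" processing parameter; all names, the record format, and the default limit are illustrative, since the patent only describes the quantities abstractly.

```python
def select_candidate_boxes(sparse_ids, dense_ids, dense_area_records,
                           first_time_threshold=1, max_parallel_streams=8):
    """sparse_ids / dense_ids: object ids already covered by sparse / dense detection frames.
    dense_area_records: [(object_id, overlap_count, box), ...] for boxes inside dense regions.
    The specific threshold is derived from the dense ids and the processing parameter, and the
    remaining quota is filled with barely-overlapping boxes from the dense regions."""
    specific_threshold = max_parallel_streams - len(dense_ids)
    quota = max(0, specific_threshold - len(sparse_ids))     # extra single-object tracks allowed
    candidates = [(object_id, box)
                  for object_id, overlap_count, box in dense_area_records
                  if overlap_count < first_time_threshold and object_id not in sparse_ids]
    return candidates[:quota]
```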
In some possible embodiments, the multi-task recognition network model is trained by using a first sample data set of single object behaviors and a second sample data set of group object behaviors; the multi-task identification network model comprises a backbone network and a classification network corresponding to a task to be identified; the task to be identified comprises at least one of the following: a task type for a single object behavior and a task type for a group object behavior.
In this way, the basic network structure (a plurality of tasks share the backbone network) is trained respectively by using the first sample data set of the single object behavior and the second sample data set of the group object behavior, so that a multi-task recognition network model capable of simultaneously performing task recognition of the single object behavior and task recognition of the group object behavior is obtained. Therefore, accuracy of multi-task identification is improved, and video memory consumption caused by task increase is greatly reduced.
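A compact PyTorch-style sketch of a shared backbone with one classification head per task; the layer sizes, class counts, and the tiny 3D-conv backbone are placeholders, not the architecture claimed in the patent.

```python
import torch
import torch.nn as nn

class MultiTaskRecognizer(nn.Module):
    """Shared video backbone with one classification head per task
    (single-object behaviors and group-object behaviors)."""
    def __init__(self, num_single_classes=5, num_group_classes=3):
        super().__init__()
        # stand-in 3D-conv backbone; any video feature extractor could be shared here
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.single_head = nn.Linear(64, num_single_classes)   # first classification network
        self.group_head = nn.Linear(64, num_group_classes)     # second classification network

    def forward(self, clip, task):
        feat = self.backbone(clip)                 # clip: (B, 3, T, H, W)
        return self.single_head(feat) if task == "single" else self.group_head(feat)
```

In such a setup, batches drawn from the first sample data set and the second sample data set would update the shared backbone together with only their own head.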
In some possible embodiments, the multi-task recognition network model includes a first classification network for performing behavior recognition on the single object and a second classification network for recognizing the group object, and the performing, by using the trained multi-task recognition network model, behavior recognition on each of the first sequences and/or each of the second sequences to obtain a behavior recognition result of each of the single objects and/or a behavior recognition result of the group object includes: in response to the condition that each first sequence comprises the single object, identifying the at least one first sequence by using the first classification network to obtain a behavior identification result of each single object; and/or, in response to the condition that each second sequence comprises the group object, identifying the at least one second sequence by using the second classification network to obtain a behavior identification result of the group object.
In some possible embodiments, the method further comprises: under the condition that single object behaviors and/or group object behaviors exist in the video sequence to be recognized, determining the spatial positions and behavior categories of the single object behaviors and/or the group object behaviors; wherein the behavior categories include categories of the individual object behaviors and/or categories of the group object behaviors; determining alarm content according to the spatial position and the behavior category; and sending an alarm notification to the terminal equipment corresponding to the spatial position according to the alarm content.
Therefore, when a single object behavior and/or a group object behavior is detected, the system can automatically identify the type and the area of the corresponding behavior and give an alarm, so that the automatic analysis of the behavior event in the video content provides convenience and evidence obtaining capability for related departments, and the method is suitable for indoor and outdoor general scenes.
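A trivial sketch of the alarm step; the lookup and delivery functions are placeholders for whatever notification channel a deployment actually uses.

```python
def notify_alarms(recognition_results, terminal_for_position, send):
    """recognition_results: [(spatial_position, behavior_category), ...] for detected events.
    terminal_for_position maps a spatial position to the area manager's terminal device;
    send(device, message) is the (assumed) delivery function."""
    for position, category in recognition_results:
        message = f"Behavior alert: '{category}' detected at {position}"
        send(terminal_for_position(position), message)
```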
In a second aspect, an embodiment of the present application provides an apparatus for identifying a behavior, including an obtaining module, a generating module, and an identifying module, where:
the acquisition module is used for acquiring the detection result of each object in each frame of image in the video sequence to be identified; wherein, the video sequence to be identified comprises a single object and/or a group object; the group of objects comprises at least two objects with a spatial distance smaller than a distance threshold;
the generating module is configured to generate at least one first sequence and/or at least one second sequence according to a detection result of each object, where the first sequence and the second sequence each include one of: the same single object, the group object;
and the recognition module is used for respectively carrying out behavior recognition on each first sequence and/or each second sequence by utilizing a trained multi-task recognition network model.
In a third aspect, an embodiment of the present application provides an apparatus, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the steps in the behavior recognition method when executing the program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the behavior recognition method.
The embodiments of the present application provide a behavior recognition method, apparatus, device, and storage medium. A joint recognition method compatible with both single-object behaviors and group-object behaviors is provided, which can support different types of behavior detection requirements at the same time. Meanwhile, the video classification problem is converted into the recognition of different types of behavior sequences, which reduces the recognition difficulty of the behavior recognition model and improves the extraction of useful information.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without inventive efforts, wherein:
fig. 1A is a schematic diagram of a network architecture of a behavior recognition method according to an embodiment of the present application;
fig. 1B is a schematic flowchart of a behavior recognition method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of another behavior identification method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of another behavior identification method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another behavior recognition method according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another behavior recognition method according to an embodiment of the present application;
fig. 6 is a schematic flowchart of another behavior recognition method according to an embodiment of the present application;
FIG. 7 is a flowchart illustrating a training process of a multi-task recognition network model according to an embodiment of the present disclosure;
FIG. 8A is a system block diagram of a single-person and multi-person behavior joint recognition algorithm provided in an embodiment of the present application;
FIG. 8B is a block diagram of a system for single model training provided by an embodiment of the present application;
FIG. 8C is a block diagram of a system for multi-task recognition network model training according to an embodiment of the present disclosure;
FIG. 8D is a flow chart of adaptive sparse/dense region partitioning logic as provided by an embodiment of the present application;
fig. 9 is a schematic structural diagram illustrating a component of an apparatus for behavior recognition according to an embodiment of the present disclosure;
fig. 10 is a hardware entity diagram of an apparatus for behavior recognition according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The following examples are intended to illustrate the present application but are not intended to limit the scope of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
It should be noted that the terms "first/second/third" referred to in the embodiments of the present application are only used to distinguish similar objects and do not represent a specific ordering of the objects. It should be understood that "first/second/third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which embodiments of the present application belong. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Behavior detection in video is an important problem in the field of computer vision, and has wide application in the field of smart cities, such as detection of illegal behaviors, traffic accidents, some unusual events and the like. Most devices that capture video sources simply record the motion at each moment and do not have the ability to automatically recognize it (often requiring special personnel to be responsible for manual viewing). Due to the huge amount of video, it is obviously not realistic to filter the content in the video only by human power. There is a need for techniques that utilize computer vision and deep learning to automatically detect behavioral events occurring in a video.
Fig. 1A is a schematic diagram of a network architecture of a behavior recognition method according to an embodiment of the present application. As shown in fig. 1A, the network architecture includes: a camera 101, an object detection module 102, a preprocessing module 103 and a video recognition module 104. The object detection module 102, the preprocessing module 103 and the video recognition module 104 may be disposed in the server 100, and to support an exemplary application, the camera 101 establishes a communication connection with the server 100 through a network. A video of a specific scene is collected by the camera 101, a video sequence 11 to be recognized, i.e. multiple frames of images containing target objects, is then obtained by sampling, and the video sequence 11 to be recognized is input into the object detection module 102. The object detection module 102 may use relevant detection algorithms, such as the inter-frame difference method, background subtraction, or optical flow, to locate and analyze the target objects in the video sequence 11 to be recognized, obtaining multiple frames of images 12 with detection results (detection frames and object identifiers labeling the target objects). The preprocessing module 103 then processes the multi-frame images 12 with detection results and generates at least one first sequence 13 and/or at least one second sequence 14 based on the detection frame and object identifier of each object in each frame image, where a first sequence is a trajectory sequence containing the same object and a second sequence is a behavior sequence containing group objects. All first sequences 13 and second sequences 14 are input together into the video recognition module 104. The video recognition module 104 may use a relevant video understanding model to perform behavior recognition on each first sequence 13 and/or each second sequence 14, and finally output the recognition result at the video level. Based on this network architecture, a behavior recognition framework comprising two stages, object localization and behavior recognition, can be designed. The object localization stage can flexibly use any existing object detection algorithm, and the behavior recognition stage trains a multi-task recognition network model on a sample data set of single-object behaviors and a sample data set of group-object behaviors.
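The data flow of this architecture can be summarised by the following sketch, where `detector`, `preprocess` and `recognizer` stand in for the object detection module 102, the preprocessing module 103 and the video recognition module 104; the callable signatures are assumptions.

```python
def recognize_behaviors(video_frames, detector, preprocess, recognizer):
    """End-to-end flow: per-frame detection -> sparse/dense preprocessing -> sequence recognition."""
    detections = [detector(frame) for frame in video_frames]           # detection frames + object ids
    first_seqs, second_seqs = preprocess(video_frames, detections)     # single-object / group sequences
    single_results = [recognizer(seq, task="single") for seq in first_seqs]
    group_results = [recognizer(seq, task="group") for seq in second_seqs]
    return single_results, group_results
```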
The embodiment of the application provides a behavior identification method which is applied to a server, terminal equipment or other equipment. The terminal device includes, but is not limited to, a mobile phone, a notebook computer, a tablet computer, a handheld internet device, a multimedia device, a streaming media device, a mobile internet device, a wearable device, or other types of devices.
Fig. 1B is a schematic flow chart of a behavior recognition method according to an embodiment of the present application, and as shown in fig. 1B, the method at least includes the following steps:
step S110, obtaining a detection result of each object in each frame of image in the video sequence to be identified.
Here, the video sequence to be identified includes a single object and/or a group object; the group objects include at least two objects whose spatial distance is smaller than a distance threshold, that is, the video sequence to be recognized may include a single object behavior event, may also include a group object behavior event, and may also include both of the two events. In an implementation, the video sequence to be identified may be obtained by sampling a specific video source. For example, videos shot at a fixed angle often cover a wide field of view and contain much information, such as pedestrians, vehicles, animals, buildings, and other complex background information.
It is understood that the video sequence to be identified is a frame sequence composed of a plurality of frame images, wherein each frame image may or may not contain at least one object, and the objects contained in different frame images are not necessarily the same, but all the frame images of the video sequence to be identified contain at least two objects. The object may be a pedestrian, or may be a moving vehicle, an animal, or the like, and is determined according to an actual scene in implementation, which is not limited in the embodiment of the present application.
The detection and positioning analysis of the object in the video image can be realized by related image or video processing technology, for example, the object detection algorithm preprocesses the video sequence to be identified to obtain multiple frames of images with detection frames, and then the detection frames of the object in each frame of image are extracted. The detection algorithm, such as template matching, may be implemented by an inter-frame difference method, a background subtraction method, an optical flow method, and the like for detecting a moving object in a video, which is not limited in this embodiment of the present application.
Here, the detection result of the object includes at least a detection frame of the object and an object identification associated with the detection frame. For different objects appearing in the video sequence to be identified, the detection frame of each object and the unique object identification associated with the detection frame can be obtained after object detection. The detection frames of different objects are distinguished by corresponding object identifiers, so that the main object with behavior can be automatically identified and processed in time.
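For concreteness, a detection result per object could be represented as below; the field names and types are illustrative, not mandated by the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    box: Tuple[float, float, float, float]   # detection frame as (x1, y1, x2, y2)
    object_id: int                           # identifier kept consistent for the same object across frames

FrameDetections = List[Detection]            # all detection results for one frame image
```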
Step S120, generating at least one first sequence and/or at least one second sequence according to the detection result of each object.
Here, each of the first sequences and each of the second sequences includes one of: the same single object, the same group object. In some embodiments, each of the first sequences includes the same single object, i.e., a sequence of positions of a single behavior body in each frame of image in the video sequence to be identified; each image of the second sequence comprises group objects, namely an active area sequence of the group objects in each frame of image in the video sequence to be identified. In other embodiments, each of the first sequences comprises a population of objects and each of the second sequences comprises the same single object. In implementation, what type of object the generated sequence includes may be determined according to actual situations, and the embodiments of the present application are not limited herein.
It can be understood that, for the distribution situation of the detection frames in each frame of image in the video sequence to be identified, each frame of image can be divided into a sparse region and a dense region, and at least one first sequence can be generated by processing the sparse region where a single object behavior is easy to occur; processing dense areas prone to population object behavior may generate at least one second sequence. The number of the first sequences can be determined according to the number of object identifications associated with a single object in a sparse region in the video sequence to be identified, and the number of the second sequences can be determined according to the number of dense regions in each frame of image in the video sequence to be identified.
Step S130, using the trained multi-task recognition network model to perform behavior recognition on each first sequence and/or each second sequence, respectively, so as to obtain a behavior recognition result of each single object and/or a behavior recognition result of the group objects.
And sending the at least one first sequence and/or the at least one second sequence obtained in the previous step into a multitask identification network model, respectively carrying out behavior identification on each first sequence and/or each second sequence in the network model, and acquiring the discrimination scores of different behaviors. For example, with a pedestrian as a detection object, the individual object behavior may include falling, climbing, issuing a leaflet, and the like, and the group object behavior may include stepping, hugging, and the like.
That is to say, different types of behavior events in the video sequence to be recognized are recognized separately. A first discrimination score is determined for each first sequence after it passes through the network model, and when at least one first discrimination score is higher than a given first threshold, the behavior of the single object corresponding to that first sequence is output, such as an elderly person falling, a child lying on the ground, or a puppy falling into water. Likewise, a second discrimination score is determined for each second sequence after it passes through the network model, and when at least one second discrimination score is higher than a given second threshold, the group object behavior corresponding to that second sequence is output, such as a group of students crowding onto a road or a group of pedestrians trampling. When no discrimination score is higher than the first threshold or the second threshold, it is determined that no single-object behavior or group-object behavior exists in the video sequence to be recognized.
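The thresholding described above could look roughly like this, with per-sequence score dictionaries and the two thresholds treated as assumed inputs.

```python
def decide_events(first_scores, second_scores, first_threshold=0.5, second_threshold=0.5):
    """first_scores / second_scores: one {behavior_name: score} dict per first / second sequence.
    A behavior is reported only when its discrimination score exceeds the matching threshold;
    empty outputs mean no behavior event exists in the video sequence to be recognized."""
    single_events = [(i, name, s) for i, scores in enumerate(first_scores)
                     for name, s in scores.items() if s > first_threshold]
    group_events = [(i, name, s) for i, scores in enumerate(second_scores)
                    for name, s in scores.items() if s > second_threshold]
    return single_events, group_events
```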
It should be noted that the multi-task recognition network model in step S130 can be implemented by mixing the data sets of several different tasks together for training and optimizing a shared backbone network, i.e. the feature extractor; the fully connected layers of the multi-task recognition model can then be distilled from the single-task network models, so that the accuracy of the multi-task recognition model is as close as possible to, or higher than, that of each single-task model.
In some embodiments, the multitask recognition network model comprises a first classification network for performing behavior recognition on the single objects and a second classification network for recognizing the group objects, and in response to the condition that each first sequence comprises the single objects, the first classification network is used for recognizing the at least one first sequence, and a behavior recognition result of each single object is obtained; and/or, in response to the condition that each second sequence comprises the group object, identifying the at least one second sequence by using the second classification network to obtain a behavior identification result of the group object. Wherein the first sequence comprises single objects, that is to say the first sequence is formed by sparse detection boxes and is identified by a first classification network, and the second sequence comprises group objects, that is to say the second sequence is formed on the basis of dense detection boxes and is identified by a second classification network. In this way, the objects included in the first sequence or the second sequence are single objects or group objects, and the corresponding classification networks are used for identification, so that the efficiency of behavior identification can be improved.
In the embodiment of the application, firstly, a detection result of each object in each frame of image in a video sequence to be identified is obtained; then, generating at least one first sequence and/or at least one second sequence according to the detection result of each object; finally, respectively carrying out behavior recognition on each first sequence and/or each second sequence by utilizing a trained multi-task recognition network model; therefore, the joint identification method compatible with the single object behaviors and the group object behaviors is provided, and different types of behavior detection requirements can be simultaneously supported. Meanwhile, the video classification problem is converted into the recognition problem of different types of behavior sequences, so that the recognition difficulty of a behavior recognition model is reduced, and the acquisition of effective information is improved.
Fig. 2 is a schematic flow chart of another behavior identification method provided in an embodiment of the present application, and as shown in fig. 2, the method at least includes the following steps:
step S210, obtaining a detection result of each object in each frame of image in the video sequence to be identified.
Here, the video sequence to be recognized includes at least two objects; the detection result of the object at least comprises a detection frame of the object and an object identifier associated with the detection frame.
Step S220, determining a sparse detection frame and a dense detection frame in each frame of image according to the distribution of the detection frame of each object in the corresponding frame of image.
Here, the number of times of overlapping between the sparse detection frame and the detection frames other than the sparse detection frame in each frame image is smaller than a first time threshold; the overlapping times of the dense detection frames and other detection frames except the dense detection frames in each frame of image are greater than or equal to a second time threshold value, wherein the second time threshold value is greater than the first time threshold value.
For each frame of image in a video sequence to be identified, the embodiment of the present application is expected to find out detection frames in which single objects are scattered as sparse detection frames to generate a sequence including the same single object, and find out detection frames in which the detection frames overlap with other detection frames more as dense detection frames to generate a sequence including group objects.
In some embodiments, the sparse detection boxes and the dense detection boxes in each frame of image may be determined by:
the first method is as follows: determining the number of detection frames included in each frame of image; and determining the detection frame as a sparse detection frame in the corresponding frame image under the condition that the number of the detection frames is 1.
Here, in the case where only 1 detection frame is included in each frame image, it is described that the detection frames are independent in spatial position, and the object corresponding to the detection frame is likely to have a single object behavior. The independent detection frame is directly used as a sparse detection frame, so that the subsequent generation of a sequence of single object track behaviors is facilitated, the operation can be simplified, and useful information can be effectively extracted.
It should be noted that the sparse detection frames in different frame images may be of the same object or of different objects, and the sparse detection frame of an object may appear in each frame image in the video sequence to be identified or may appear in only a part of frames.
The second method comprises the following steps: determining the number of detection frames included in each frame image; in the case that the number of detection frames is greater than or equal to 2, generating an adjacency matrix for each frame image according to the intersection over union between every two detection frames in the image; taking a detection frame whose matching count in the adjacency matrix is zero as a sparse detection frame of the frame image; taking a detection frame whose matching count in the adjacency matrix is greater than or equal to the third time threshold as a dense detection frame of the frame image; wherein the third time threshold is greater than the second time threshold.
Here, the Intersection over Union (IoU) between two detection frames is the area where the two regions overlap divided by the area of their union. The value at position (i, j) in the adjacency matrix represents the intersection over union of detection frame i and detection frame j in the frame image. The number of entries in row i whose value is greater than 0 (excluding the entry for detection frame i itself) is taken as the matching count of detection frame i.
If row i of the adjacency matrix contains no other detection frame with an intersection over union greater than 0, the matching count of detection frame i is 0, which means that detection frame i has no overlapping area with any other detection frame in its frame image, and it can be used as a sparse detection frame of that frame image. If the intersection over union between a detection frame j and several other detection frames is greater than 0, detection frame j overlaps many times with other detection frames in its frame image; since the detection frames of multiple objects exist around the position of detection frame j, a group object behavior is likely to occur there, and detection frame j can be used as a dense detection frame of its frame image.
In some other embodiments, when each frame image includes at least two detection frames, the detection frame of each object is first expanded outward by a specific proportion to obtain expanded detection frames; at least two first detection frames are screened out from the expanded detection frames, the area of each first detection frame being larger than that of the other expanded detection frames; then, the intersection over union between the at least two first detection frames is determined. For example, when the specific proportion is 1.5, the length and width of each object's detection frame are expanded by a factor of 1.5. This increases the overlap between nearby detection frames, so the intersection over union between mutually overlapping detection frames can be calculated more effectively.
Step S230, generating the at least one first sequence according to the sparse detection frame in each frame of image and the object identifier associated with the sparse detection frame.
Here, the sparse detection frames in different frame images may be of the same object or of different objects, and the sparse detection frame of an object may appear in each frame image in the video sequence to be identified or may appear in only a part of frames. And connecting the sparse detection frames which are associated with each object identifier and appear in all the video sequences to be identified or larger regions containing the sparse detection frames according to a time sequence to generate each first sequence. That is, the same single object is included in the first sequence.
In some other embodiments, the sparse detection frame associated with each object identifier is selected from the at least two first detection frames, and the sparse detection frame associated with each object identifier is first scaled inward according to the specific proportion to obtain a sparse detection frame with an original size; and then generating the at least one first sequence according to the original-size sparse detection frame and the object identification associated with the sparse detection frame. In this way, for the candidate detection frames selected from the at least two first detection frames, before each first sequence is generated, the candidate detection frames need to be scaled inwards according to the original proportion, so that the trajectory sequence finally entering the behavior recognition stage is ensured to be the original motion trajectory of a single object, and the generation of extra calculation amount is avoided.
Step S240, generating the at least one second sequence according to the dense detection box in each frame of image and the object identifier associated with the dense detection box.
Here, the dense detection frames in different frame images may be of the same object or of different objects, and the dense detection frame of a certain object may appear in each frame image in the video sequence to be identified or may appear in only a part of frames. Under the condition that only one dense detection frame exists in each frame of image in the video sequence to be identified, connecting the dense detection frames in each frame of image in the video sequence to be identified and the areas of all detection frames overlapped with the dense detection frames according to a time sequence to generate a second sequence; and under the condition that at least two dense detection frames exist in each frame of image in the video sequence to be identified, respectively connecting the dense detection frames in each frame of image belonging to the same object identifier and the areas of all detection frames overlapped with the dense detection frames according to a time sequence to generate at least one second sequence. That is, the second sequence includes population objects.
In some other embodiments, each of the dense detection frames is selected from the at least two first detection frames, and each of the dense detection frames is first shrunk in the specific proportion to obtain a dense detection frame with an original size; and then generating the at least one second sequence according to the dense detection boxes with the original sizes and the object identifications associated with the dense detection boxes.
Step S250, using the trained multi-task recognition network model to perform behavior recognition on each first sequence and each second sequence, respectively, so as to obtain a behavior recognition result of each single object and/or a behavior recognition result of the group objects.
The implementation process of step S250 is similar to the implementation process of step S130, and is not described herein for avoiding redundancy.
In some possible implementations, after the recognition result is obtained through the multi-task recognition network model, an alarm notification may be sent to a relevant department or platform, so that behaviors endangering the safety of the user or others can be handled in time. One possible implementation is as follows: when single-object behaviors and/or group-object behaviors exist in the video sequence to be recognized, the spatial positions and behavior categories of the single-object behaviors and/or group-object behaviors are determined, where the behavior categories include categories of the single-object behaviors and/or categories of the group-object behaviors; alarm content is determined according to the spatial position and the behavior category; and an alarm notification is sent, according to the alarm content, to the terminal device corresponding to the spatial position, so that a manager holding the terminal device can handle the behavior.
It can be understood that different location areas have corresponding managers holding terminal devices, and once a terminal device receives the notification from the alarm system, single-object behaviors occurring in that location area can be quickly located and handled. When behaviors endangering the safety of the user or others occur in outdoor urban street scenes, indoor rail transit scenes and the like, the system can automatically identify the behavior subjects and give an alarm, providing an efficient and convenient detection capability for personnel with related requirements.
In the embodiment of the application, the sparse detection frames and the dense detection frames are first determined from the distribution of detection frames in the acquired frame images; the sparse detection frames in each frame image are then preprocessed to obtain at least one first sequence, which addresses the recognition of single-object behaviors, while the dense detection frames in each frame image are preprocessed to obtain at least one second sequence, which addresses the recognition of group-object behaviors. This provides preprocessing logic that accommodates both single-object and group-object behavior detection, adaptively adjusts the receptive field for different types of behaviors, i.e., behaviors with different numbers of executing subjects, and mitigates the problem of a small effective perception range when recognizing a full-view video sequence to be recognized.
Fig. 3 is a schematic flow chart of another behavior recognition method according to an embodiment of the present application, and as shown in fig. 3, the process of "generating the at least one first sequence according to the sparse detection frame in each frame of image and the object identifier associated with the sparse detection frame" in the step S230 may at least include the following steps:
step S310, aiming at all frame images of the frame sequence buffer area, taking a union set of sparse detection frames associated with each object identification in a spatial position to obtain a minimum bounding frame of a single object corresponding to each object identification.
Here, the spatial position of the sparse detection frame associated with each object identifier may shift across different frame images; by taking the union, the sparse detection frames at different spatial positions are merged into a larger area, which serves as the minimum bounding box of the single object corresponding to each object identifier. For example, the pedestrian detection frames belonging to the same ID (Identity) in the frame sequence buffer are merged by spatial position to obtain the smallest bounding box of that pedestrian in the frame sequence buffer.
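A possible sketch of the spatial union described here, assuming the sparse detection frames of one object identifier are collected as (x1, y1, x2, y2) tuples; the function name is illustrative only.

```python
def min_bounding_box(boxes):
    # boxes: sparse detection frames of one object ID across the buffered frames.
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes)
    y2 = max(b[3] for b in boxes)
    return (x1, y1, x2, y2)
```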
Step S320, capturing a first area image corresponding to the spatial position of the smallest bounding box in each frame of image according to the size of the smallest bounding box.
Cropping the first area image in each frame of image with the minimum bounding box avoids losing the relative position of the behavior subject, which improves the detection of single-object behaviors that are spatially similar but differ in motion rhythm.
It should be noted that, when the first region image is cropped, there may be frame images in which no sparse detection frame corresponding to the object identifier appears. In this case, the candidate detection frames belonging to that object identifier in the other frame images of the video frame sequence may be merged to obtain a minimum bounding box for the object identifier, and that minimum bounding box may be used to crop the pure background region as the first region image.
And step S330, sequentially connecting the first area images according to the time stamp of each frame of image to obtain a first sequence corresponding to each object identifier.
Here, the first region images belonging to each object identifier are connected in chronological order, i.e. a first sequence corresponding to each of said object identifiers is obtained.
In the embodiment of the application, the screened sparse detection frames that do not overlap with other detection frames are processed to generate the first sequences of single-object trajectories, so the problem of recognizing single-object behaviors in the full-view video sequence to be identified is converted into the problem of recognizing the first sequences, which simplifies the computation and effectively extracts useful information. Meanwhile, cropping each frame image with the minimum bounding box to obtain the first sequence avoids losing the relative position of the behavior subject, improving the detection of behaviors that are spatially similar but differ in motion rhythm.
Fig. 4 is a schematic flowchart of another behavior recognition method according to an embodiment of the present application, and as shown in fig. 4, the step S240 "generating the at least one second sequence according to the dense detection frame in each frame of image and the object identifier associated with the dense detection frame" includes at least the following steps:
step S410, the dense detection frames in each frame of image in the frame sequence buffer area are expanded outward by a specific proportion, and the dense area in each frame of image and the object identification related to the dense area are obtained.
Here, the specific ratio may be an empirical value or may be determined according to an application scene picture actually photographed so that the expanded dense region (patch) can surround all execution subjects in which the group object behavior occurs as much as possible.
It should be noted that after the region corresponding to the dense detection box is adaptively extended outward by a specific proportion, a rectangular region is obtained. For convenience of subsequent processing, the sizes of the dense regions expanded from different dense detection frames can be normalized. For example, the long side of the dense region is resized to 224 pixels, the short side is scaled by the same ratio as the long side, and the short side is then padded with black bars at the top and bottom if it falls short of 224 pixels after scaling.
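One way the size normalization above could be realized, sketched with OpenCV and NumPy under the assumption of a 3-channel BGR patch; the interpolation mode and centered padding are illustrative choices.

```python
import cv2
import numpy as np

def normalize_patch(patch, target=224):
    # Resize so the longer side equals `target`, keep the aspect ratio,
    # then pad the shorter side with black to obtain a target x target image.
    h, w = patch.shape[:2]
    scale = target / max(h, w)
    resized = cv2.resize(patch, (max(1, int(round(w * scale))),
                                 max(1, int(round(h * scale)))))
    out = np.zeros((target, target, 3), dtype=resized.dtype)
    top = (target - resized.shape[0]) // 2
    left = (target - resized.shape[1]) // 2
    out[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return out
```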
Here, the dense area includes at least two objects, one of the objects being an object associated with the dense detection box; and the object identification associated with the dense detection box is taken as the object identification associated with the dense area.
Step S420, determining whether the merge tag of the frame sequence buffer is true.
Here, the merge tag is a preset hyperparameter that indicates whether the dense regions of the frame images in the frame sequence buffer are merged; its value is set to true or false according to the background similarity of the frame images in the frame sequence buffer.
For example, for a video sequence shot at a fixed angle, where the backgrounds of the frame images are consistent, the merge tag is set to true; for an image sequence shot by a handheld device, where the background of each frame image changes greatly, the merge tag is set to false.
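The embodiment does not prescribe how background similarity is measured; one plausible sketch, purely as an assumption, compares grayscale histograms of the first and last buffered frames and sets the merge tag when they correlate strongly (the 0.9 threshold is illustrative).

```python
import cv2

def decide_merge_tag(first_frame, last_frame, threshold=0.9):
    # Heuristic: treat the buffer as "fixed camera" when the two frames'
    # grayscale histograms are highly correlated.
    g1 = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(last_frame, cv2.COLOR_BGR2GRAY)
    h1 = cv2.calcHist([g1], [0], None, [64], [0, 256])
    h2 = cv2.calcHist([g2], [0], None, [64], [0, 256])
    return cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL) >= threshold
```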
Step S430, under the condition that the merge tag is true, merging the dense regions in each frame image in the frame sequence buffer area to obtain a dense enclosure frame of the frame sequence buffer area.
Here, the ranges of dense regions in different frame images in the frame sequence buffer are merged to obtain a larger range of dense bounding boxes.
Step S440, capturing a second area image corresponding to the spatial position of the dense enclosure frame in each frame of image according to the size of the dense enclosure frame.
Here, the image segmentation technique or other image processing techniques in the related art may be adopted to intercept the second region image corresponding to the spatial position of the dense surrounding frame in each frame of image, which is not limited in this embodiment of the application.
And step S450, sequentially connecting the second region images according to the time stamp of each frame of image, and generating a second sequence.
Here, each frame of image in the video sequence to be recognized carries a respective timestamp, which may be a timestamp set when the image is collected by a collection device such as a camera, or a timestamp set in a process of subsequently sampling an original image sequence, or a timestamp set in another implementable manner, which is not limited in this application.
Step S460, in a case that the merge flag is false, respectively intercepting each frame of image according to the dense region of each frame of image, to obtain a fourth region image with a length and a width both of which are specific pixels.
Here, the merge flag is false, that is, the dense region in each frame of image in the frame sequence buffer is not merged, which is suitable for the case that the background change of each frame of image in the video sequence to be identified is large.
Here, the specific pixel is an empirical value, for example, 224 × 224, and represents a rectangular image having a length and a width of 224 pixel units.
In some other embodiments, for a case where a frame image missing dense detection frame exists in the frame sequence buffer, a rectangular frame of a fixed size at the center position of the frame image may be used as the fourth region image.
The step S460 can also be implemented by the following embodiments:
when a first frame image exists in the frame sequence buffer, a region image centered on the center point of the first frame image, with the specific pixels as its length and width, is taken as the fourth region image of the first frame image; wherein the first frame image is a frame image that does not include a dense detection frame.
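A minimal sketch of this center-crop fallback, assuming a NumPy image in height-width-channel layout; the function name is an assumption.

```python
def center_crop(frame, size=224):
    # Take a size x size window around the image center as the fourth region image.
    h, w = frame.shape[:2]
    top = max(0, h // 2 - size // 2)
    left = max(0, w // 2 - size // 2)
    return frame[top:top + size, left:left + size]
```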
Step S470, sequentially connecting the fourth region images according to the timestamp of each frame image, and generating the at least one second sequence.
Step S430 is not a mandatory step and may be omitted in some special application scenarios. For example, in a fixed-angle shooting scene the merge tag usually defaults to true, i.e., the dense regions of each frame image in the video sequence to be identified are merged by default. Steps S440 to S450 and steps S460 to S470 are two parallel schemes; in actual implementation, only one of them is executed according to the value of the merge tag. In this way, the second area image of each frame image obtained in step S440, or the fourth area image of each frame image obtained in step S460, is input into the network model instead of the original video sequence to be recognized, so that the spatial relative-position information of the behavior group is maintained while the receptive field of the group-object behavior event is reduced.
In the embodiment of the application, the dense detection box where group-object behavior is predicted to be likely is expanded outward, so that the expanded dense region surrounds all objects involved in the group behavior as far as possible. Meanwhile, by determining the dense bounding box of the frame sequence buffer and cropping each frame image with it to obtain the second sequence, the receptive field of group-object behavior events is reduced, improving the accuracy and efficiency of behavior recognition.
In some other embodiments, the video sequence to be identified is composed of frame images in a frame sequence buffer, each frame image in the frame sequence buffer includes at least one dense detection box, and each image in the second sequence includes at least one same object in group objects. Fig. 5 is a schematic flowchart of another behavior recognition method according to an embodiment of the present application, and as shown in fig. 5, the step S240 "generating the at least one second sequence according to the dense detection frame in each frame of image and the object identifier associated with the dense detection frame" includes at least the following steps:
step S510, extending each dense detection frame in each frame of image outward by a specific proportion to obtain a dense region corresponding to each dense detection frame and an object identifier associated with the dense region.
Here, the dense area includes at least two objects, one of the objects being an object associated with the dense detection box; and the object identification associated with the dense detection box is taken as the object identification associated with the dense area.
Step S520, according to the object identification associated with the dense area, a union set of the dense areas including the same object is obtained, and at least one dense enclosure frame including the same object is obtained.
Here, the dense areas including the same object are associated with the same object identifier, and the dense areas corresponding to each object identifier are merged to obtain a dense enclosure frame corresponding to each object identifier. Different object identifications result in different dense bounding boxes.
Step S530, according to the size of each dense enclosure frame, capturing a third area image corresponding to the spatial position of the dense enclosure frame in each frame of image.
Cropping the third area image according to the dense bounding box determined by the same object identifier ensures that the generated second sequence contains at least that same object, so that group-object behaviors can be detected continuously and a subsequent detection system can automatically identify the objects involved in the behavior, providing convenience for departments with related requirements.
And step S540, sequentially connecting the third area images according to the timestamp of each frame image, and generating the at least one second sequence including the same object.
Here, the third region image in the second sequence including the same object is cut out according to the dense bounding box corresponding to the same object identifier, thereby generating the second sequence corresponding to the object identifier. Different second sequences may be generated according to the dense bounding boxes corresponding to different object identifications.
In the embodiment of the application, the dense bounding boxes containing the same object in the frame sequence buffer are determined through the object identifiers associated with the dense regions, and each frame image is cropped with these dense bounding boxes to obtain second sequences containing the same object, which reduces the receptive field of group-object behavior events and improves the accuracy and efficiency of behavior recognition. Meanwhile, the relative positions of the behavior group are preserved, which notably benefits the detection of behaviors that are spatially similar but differ in motion rhythm.
In some other embodiments, the video sequence to be recognized is composed of frame images in a frame sequence buffer, fig. 6 is a flowchart of a further behavior recognition method provided in this embodiment, and as shown in fig. 6, the "generating at least one first sequence according to the detection result of each object" includes at least the following steps:
step S610, determining a sparse detection frame in each frame image according to a distribution of the detection frame of each object in the corresponding frame image.
Step S620, generating the at least one first sequence according to the sparse detection frame in each frame of image and the object identifier associated with the sparse detection frame.
Here, the implementation process of steps S610 to S620 is similar to the implementation process of steps S220 to S230 in the above embodiment, and is not repeated herein to avoid repetition.
Step S630, determining a first object identification number associated with the sparse detection frame in each frame image in the frame sequence buffer.
Here, each sparse detection frame is associated with a corresponding object identifier, and the sparse detection frames in each frame image in the frame sequence buffer are counted to determine the number of the first object identifiers.
In step S640, in the case that the number of the first object identifiers is smaller than a specific threshold, a candidate detection frame is selected from the dense area in each frame of image.
Here, the specific threshold is determined by the maximum number of parallel computing paths allowed on the line. When the number of first object identifiers is smaller than the specific threshold, the number of first sequences generated from the sparse detection frames and their associated object identifiers in the frame sequence buffer is also smaller than the specific threshold; detection frames in which single-object behaviors are likely to occur can then be screened out of the dense region of each frame image for subsequent processing, so that the total number of first sequences finally generated reaches the maximum threshold.
Here, the number of times of overlapping of the candidate detection frame with the other detection frame is smaller than the first time threshold. That is, the candidate detection boxes are relatively independent detection boxes in the dense area, and a single object behavior may also occur.
In some possible embodiments, the candidate detection box may be determined by: determining the specific threshold according to the number of second object identifications and specific processing parameters associated with the dense detection frames in each frame of image in the frame sequence buffer area; determining a third object identification number associated with the candidate detection frame according to the first object identification number and the specific threshold value; and selecting the detection frame with the overlapping times with other detection frames being less than the first time threshold value from the dense area in each frame of image as the candidate detection frame according to the third object identification number.
Here, assume the specific threshold is L and the number of first object identifiers is M, i.e., M first sequences are determined from the sparse detection boxes, where L is an integer greater than or equal to 2 and M is any value from 1 to L-1. Detection frames whose number of overlaps with other detection frames is smaller than the first time threshold can be selected from the dense region as candidate detection frames, and the number of object identifiers associated with the candidate detection frames is L-M, for the subsequent generation of L-M first sequences from the candidate detection frames.
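To make the bookkeeping above concrete, the sketch below fills the remaining L - M first-sequence slots with the least-overlapping detection frames from the dense region; the data layout, the function name and the overlap threshold of 1 are assumptions.

```python
def select_candidates(dense_boxes, overlap_counts, first_id_count, budget,
                      overlap_threshold=1):
    # dense_boxes: {object_id: (x1, y1, x2, y2)} inside the dense region.
    # overlap_counts: {object_id: number of times its box overlaps other boxes}.
    remaining = budget - first_id_count            # i.e. L - M extra first sequences
    if remaining <= 0:
        return {}
    isolated_ids = sorted(
        (oid for oid, n in overlap_counts.items() if n < overlap_threshold),
        key=overlap_counts.get,
    )
    return {oid: dense_boxes[oid] for oid in isolated_ids[:remaining]}
```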
Step S650, generating the at least one first sequence according to the candidate detection box and the object identifier associated with the candidate detection box.
Here, the at least one first sequence may be generated in the same manner as the at least one first sequence is generated from the sparse detection boxes and their associated object identifiers in step S620.
In the embodiment of the application, in the process of preprocessing the video sequence to be recognized to generate first sequences, when the first sequences generated from the sparse detection frames do not reach the specific threshold, candidate detection frames with fewer overlaps are selected from the dense region to generate additional first sequences. This makes effective use of the available computing power, ensures the accuracy of single-object behavior recognition, and improves the efficiency and performance of the model.
In other embodiments, the multitask recognition network model is obtained by training a built behavior recognition model by using a first sample data set of single object behaviors and a second sample data set of group object behaviors; the built behavior recognition model comprises a backbone network and at least one classification network, and the multi-task recognition network model comprises a multi-task backbone network obtained by training the backbone network and a multi-task classification network corresponding to a task to be recognized obtained by training the classification network; the task to be identified comprises at least one of the following: a task type for a single object behavior and a task type for a group object behavior.
The built behavior recognition model is a behavior recognition model comprising a backbone network and N classification networks (in one-to-one correspondence with N tasks to be recognized). A backbone network is the part of the network used for feature extraction: it extracts image information at the front end for use by the subsequent networks. Commonly used backbones include VGGNet (a deep convolutional neural network), Residual Neural Network (ResNet) and Inception network models; these backbones have strong feature-extraction capability, and model parameters pre-trained on large data sets can be loaded and then fine-tuned together with one's own network. The classification network may be implemented by fully connected layers (FC) or a head network, which acts as the "classifier" of the whole convolutional neural network: the convolution layers, pooling layers and activation layers map the original data to a hidden-layer feature space, while the fully connected layer maps the learned distributed feature representation to the sample label space; in practice, the fully connected layer can be implemented with convolution operations. Training the built behavior recognition model with the sample data sets of the N tasks to be recognized yields the N classification networks in the multi-task behavior recognition model, in one-to-one correspondence with the N tasks to be recognized.
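The shared-backbone, multi-head structure described above can be sketched in PyTorch as follows; ResNet-18, the 2D image input and the two heads are illustrative assumptions (the embodiment operates on video sequences and does not prescribe a specific backbone or head count).

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiTaskRecognizer(nn.Module):
    def __init__(self, num_classes_per_task):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # Shared backbone: everything except the final fully connected layer.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        # One classification head (fully connected layer) per task to be recognized.
        self.heads = nn.ModuleList(
            nn.Linear(resnet.fc.in_features, n) for n in num_classes_per_task
        )

    def forward(self, x):
        feat = self.backbone(x).flatten(1)            # shared features
        return [head(feat) for head in self.heads]    # one logit vector per task

# Example: one head for a single-object task, one for a group-object task.
model = MultiTaskRecognizer(num_classes_per_task=[2, 2])
single_logits, group_logits = model(torch.randn(4, 3, 224, 224))
```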
Fig. 7 is a schematic flowchart of a training process of a multi-task recognition network model according to an embodiment of the present application, where as shown in fig. 7, the process includes the following steps:
step S710, the constructed behavior recognition model is respectively trained by utilizing the first sample data set and the second sample data set, and a first video recognition network of a single object and a second video recognition network of a group object are obtained.
Here, the first sample data set is a sequence of image frames that may contain single-object behaviors. For example, when the object is a person, behaviors that endanger one's own safety and involve only one executing subject, such as climbing or a single person falling into water, are generally regarded as single-object behaviors.
Here, the second sample data set may be a sequence of image frames containing group-object behaviors. For example, when the object is a person, behaviors that endanger the safety of others and oneself and involve multiple executing subjects are generally regarded as group-object behaviors. It should be noted that the first sample data set and the second sample data set should be image frame sequences derived from the same sample video sequence.
Here, the first video identification network and the second video identification network each include the backbone network and at least one of the classification networks. The first video identification network is used for identifying single object behaviors, and at least one classification network in the first video identification network is respectively used for identifying task types corresponding to each single object behavior, for example, in single-person behaviors of climbing, falling, stealing, property damage and the like, each behavior corresponds to the respective classification network, but shares the same backbone network. And the second video identification network is used for identifying the task type corresponding to the group object behaviors. For example, in behaviors such as climbing and stepping on by multiple persons, each behavior corresponds to a respective classification network, but shares the same backbone network.
Step S720, utilizing the first video identification network and the second video identification network to train the backbone network to obtain the multitask backbone network.
Here, the multitask backbone network may be a backbone network suitable for a plurality of different tasks to be recognized, so that the backbone network in the built behavior recognition model needs to be trained by using sample data sets of different behavior types to obtain a universal, fixed multitask backbone network capable of recognizing different tasks to be recognized.
In some possible embodiments, the backbone network is trained to obtain a multitask backbone network by the following procedures: acquiring a first feature vector of the first video identification network and a second feature vector of the second video identification network; and training a backbone network in the built behavior recognition model by taking the first characteristic vector and the second characteristic vector as target characteristics to obtain the multi-task backbone network.
Here, the first feature vector of the first video recognition network and the second feature vector of the second video recognition network are used as target features to train the backbone network of the built behavior recognition model, so that the features of all samples of each type of event can be captured. In implementation, the feature results output by the individual single models (the first video recognition network and the second video recognition network) are applied as supervisory signals (an L2 loss or a KL loss) on the output of the multi-task backbone network, and training proceeds jointly with a cross-entropy (CE) loss to obtain the multi-task backbone network.
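A hedged sketch of the supervision described here: the multi-task backbone's features are pulled toward the single models' features with an L2 term (a KL term could be used instead) while the heads are trained with cross entropy; the weighting factor alpha is an assumption.

```python
import torch.nn.functional as F

def distillation_step(student_feat, teacher_feat, logits, labels, alpha=1.0):
    # student_feat: features produced by the multi-task backbone.
    # teacher_feat: features produced by the corresponding single model (no gradient).
    # logits/labels: classification output and ground truth for the same samples.
    feat_loss = F.mse_loss(student_feat, teacher_feat.detach())  # L2-style feature supervision
    ce_loss = F.cross_entropy(logits, labels)                    # standard cross entropy
    return ce_loss + alpha * feat_loss
```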
The loss function, also called the error function, measures how well the algorithm is doing: it is a non-negative real-valued function that quantifies the inconsistency between the model's predicted values and the true values. The smaller the loss, the more robust the model. The loss function is the core of the empirical risk function and an important component of the structural risk function; in general, the better the loss function, the better the model performance. The cross-entropy function describes the gap between the predicted value and the actual value: the larger the cross entropy, the larger the gap. Under a Bernoulli distribution, the cross-entropy function can be derived from the maximum likelihood function.
The core idea of knowledge distillation is to transfer knowledge, i.e., to obtain from a large trained model a small model that is better suited to inference. Through knowledge distillation, the features of the multi-task recognition network model are aligned with the features of the first video recognition network or of the second video recognition network, so that a better backbone network is obtained and the requirement of feature consistency is met.
Step S730, training a first behavior recognition model by using the first sample data set and the second sample data set, to obtain the multitask recognition network model.
Here, the first behavior recognition model includes the multitasking backbone network and the at least one classification network.
And training the first behavior recognition model by utilizing the first sample data set and the second sample data set to obtain the multitask recognition network model.
In the embodiment of the application, a basic network structure in which multiple tasks share one backbone network is trained with a first sample data set of single-object behaviors and a second sample data set of group-object behaviors, yielding a multi-task recognition network model that can simultaneously recognize single-object behavior tasks and group-object behavior tasks. Features for different tasks to be recognized can thus be extracted on one backbone network, effectively improving the accuracy of the backbone network's feature extraction. Meanwhile, the precision of the multi-task backbone network obtained by knowledge distillation is not lower than that of the backbone network in the first video recognition network or the second video recognition network.
The above behavior recognition method is described below with reference to a specific embodiment, but it should be noted that the specific embodiment is only for better describing the present application and is not to be construed as limiting the present application.
The present embodiment is described taking pedestrians as the objects. The number of action executors of an event in an acquired video source is not fixed (events can generally be divided into single-person actions and multi-person actions), and different types of event recognition require different spatial receptive fields; in a city street scene there may be ordinary pedestrians at different positions in the field of view, while the location where a behavior event that really needs to be detected occurs usually occupies only a small area of the picture; and a normal pedestrian may spatially overlap a pedestrian exhibiting a specific behavior, causing interference at the recognition level. Therefore, there is an urgent need for a joint single-person/multi-person behavior detection method compatible with indoor and outdoor general scenes such as urban streets and rail transit, so that automatic analysis of behavior events in video content can provide convenience and evidence-collection capability for related departments.
The embodiment of the application provides a single-person and multi-person behavior joint detection method used in an urban street scene, and a first sequence is used as a single-person track sequence, and a second sequence is used as a multi-person dense area sequence for explanation. In the embodiment of the application, the sparse region and the dense region are automatically divided mainly according to the crowd density distribution condition, and a unified detection framework which gives consideration to single-person behavior detection preprocessing logic based on a track sequence and multi-person behavior detection preprocessing logic based on the dense region is provided.
FIG. 8A is a system block diagram of the joint single-person and multi-person behavior recognition algorithm provided in an embodiment of the present application. As shown in fig. 8A, a video sequence 81a to be recognized is first processed by a preprocessor 82a to generate N2 track sequences 822a and N1 dense region sequences 821a; all track sequences 822a and all dense region sequences 821a are then input together into the multi-task recognition network model 83a, which processes them and outputs the recognition results of single-person behaviors such as climbing and falling and of multi-person behaviors such as trampling and hugging.
Based on the framework shown in fig. 8A, the overall scheme implementation includes four stages of single model training, single model testing, all-in-one training, and all-in-one testing. The event detection models for different types (single-person/multi-person) and different tasks (treading, climbing, falling and the like) are relatively independent, and training and single-model testing are respectively carried out on respective data set labels.
As shown in fig. 8B, the single-model training process for multi-person behavior is as follows: multi-person behavior training data 811b is acquired and passed through the preprocessor 82b to generate N1 dense region sequences 821b; the N1 dense region sequences 821b are then input into the recognition network model 83b, which processes them and outputs the recognition result of the multi-person behavior. The single-model training process for single-person behavior is as follows: single-person behavior training data 812b is acquired and passed through the preprocessor 82b to generate N2 track sequences 822b; the N2 track sequences 822b are then input into the recognition network model 83b, which processes them and outputs the recognition result of the single-person behavior.
Multi-task combination is then performed so that the different detection task requirements are fulfilled within one network model, i.e., the all-in-one model training process. Generally speaking, directly mixing the data sets of several different tasks to train an all-in-one model while optimizing the backbone network makes it difficult for the model's accuracy on each task to match that of the corresponding single model. To alleviate this problem and make the all-in-one model feasible, knowledge distillation is required: the single models of each task obtained in the single-model training process are used to distill the head branches (fully connected layers) of the different tasks in the all-in-one model, so that the accuracy of the all-in-one model matches, or even exceeds, that of the single models, thereby fully exploiting the mutual promotion among the tasks.
As shown in fig. 8C, in the first stage the backbone network is trained: the N1 dense region sequences 811c and the N2 track sequences 812c are input into the backbone network of the all-in-one model, i.e., the multi-task recognition network model 82c, to obtain a multi-person behavior feature vector f(N1) and a single-person behavior feature vector f(N2). To ensure that the precision of each type of single model is preserved when the all-in-one model is trained, knowledge distillation is applied: the feature results output by the single model of multi-person behavior and the single model of single-person behavior are applied as supervisory signals (an L2 loss or a KL loss) on the network output of the multi-task recognition network model 82c, trained together with the standard cross entropy. In the second stage, the fully connected layers are trained: the N1 dense region sequences 811c and the N2 track sequences 812c are input into the fully connected layers of the multi-task recognition network model 82c to obtain the classification result fc1 of multi-person behaviors and the classification result fc2 of single-person behaviors.
After the all-in-one model is trained, an on-line all-in-one test deployment is needed. Given that the type of event the online video may contain is often unknown, it cannot be determined directly whether to use the single-person preprocessing logic based on track sequences or the multi-person preprocessing logic based on dense regions. Therefore, the embodiment of the application provides adaptive sparse/dense region division logic, which calculates the number of matches with surrounding pedestrians according to the distribution of pedestrians in the video picture. As shown in fig. 8D, the process includes the following steps:
step S801, extracting a detection result of a pedestrian in the video sequence to be processed.
Here, a full-picture video sequence captured by the video-source acquisition device is first obtained as the video sequence to be processed, and an upstream structured detection component is then called to extract the detection frames of the pedestrians in the video sequence to be processed and the identifier of the detection frame corresponding to each pedestrian.
And step S802, constructing a space density map of a detection frame in the video sequence to be processed according to the detection result.
Here, the spatial density map may be represented as an adjacency matrix whose elements are the intersection-over-union between every two detection frames. In implementation, based on the density distribution of the detection frames of different pedestrians, all detection frames in the current frame can be expanded outward by a specific proportion (1.5 times by default) and then sorted in descending order of area; the overlap ratio between every two of the several (for example, 10) largest detection frames is calculated, and the number of matches of each detection frame is counted.
And step S803, determining a dense area and a sparse area of each frame of image in the video sequence to be processed according to the space density map of the detection frame.
Here, a detection frame whose number of matches in the previous step is 0 is defined as a sparse detection frame, and the region corresponding to it is a sparse region; the detection frame with the most matches is taken as the dense detection frame, and the regions corresponding to the dense detection frame and all detection frames overlapping it form the dense region. For example, the dense detection frame is restored to the original resolution, a larger expansion ratio is defined, and adaptive outward expansion is performed according to that ratio to obtain a larger rectangular frame as the dense region of the current frame. Multi-person behaviors are generally considered likely to occur in a dense region, with the dense detection frame at their center; therefore, the rectangular frame obtained by expanding the dense detection frame outward, i.e., the dense region, should surround the group in which the multi-person behavior occurs as far as possible.
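The pairwise-IoU adjacency and match counting of steps S802 and S803 could look like the sketch below, reusing the expand_box and iou helpers from the earlier sketch; the 1.5x ratio, the top-10 limit and the zero-match rule follow the description above, while the variable names are assumptions.

```python
def split_sparse_dense(boxes, ratio=1.5, top_k=10):
    # boxes: list of (x1, y1, x2, y2) pedestrian detections in the current frame.
    expanded = [expand_box(b, ratio) for b in boxes]
    # Keep only the top_k largest expanded boxes, sorted by area in descending order.
    order = sorted(range(len(boxes)),
                   key=lambda i: (expanded[i][2] - expanded[i][0])
                                 * (expanded[i][3] - expanded[i][1]),
                   reverse=True)[:top_k]
    matches = {i: 0 for i in order}
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            i, j = order[a], order[b]
            if iou(expanded[i], expanded[j]) > 0:    # non-zero adjacency-matrix entry
                matches[i] += 1
                matches[j] += 1
    sparse = [i for i, n in matches.items() if n == 0]
    dense_center = (max(matches, key=matches.get)
                    if matches and max(matches.values()) > 0 else None)
    return sparse, dense_center
```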
Step S804, generating N2 track sequences for single-person behavior recognition according to the sparse region.
Here, the sparse detection frames of the sparse region are restored to the original resolution, and the pedestrians belonging to the sparse detection frames are connected according to the results of the preceding and following frames (the detection frames and their associated identifiers) to form each pedestrian's single-person behavior trajectory as a track sequence; meanwhile, the value of N2 is determined according to the number of identifiers corresponding to the sparse detection frames.
Step S805, generating N1 dense region sequences for multi-person behavior recognition according to the dense region.
Here, a merge tag is defined to indicate whether the dense regions determined in the video sequence to be identified are merged. If the merge tag is true, the dense regions determined in the preceding and following frame images are merged into one large dense bounding box, which is used to crop each frame image, and the cropped image sequence, i.e., the dense region sequence, is finally sent to the multi-task recognition network model for recognition. If the merge tag is false, the dense region determined in each frame image is cropped separately, and the cropped image sequence, i.e., the dense region sequence, is finally sent to the model for recognition.
The value of N1 is determined by the number of dense detection frames in each frame image of the video sequence to be identified (the default is 1), and the sum of N1 and N2 is a specific value determined by the maximum number of parallel computing paths allowed on the line, for example at most 32.
If there are two dense detection frames in each frame image of the video sequence to be identified, i.e., N1 is 2, the value of N2 is determined from the sum of N1 and N2. If the number of sparse regions determined in step S803 is smaller than N2, isolated single detection frames in the dense region are selected as sparse detection frames to generate track sequences.
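A small arithmetic sketch of the N1/N2 budget just described; the cap of 32 parallel paths is the example value from the text, and the function name is an assumption.

```python
def allocate_sequences(num_dense_boxes, num_sparse_ids, max_paths=32):
    # N1: one dense region sequence per dense detection frame (default 1).
    n1 = max(1, num_dense_boxes)
    n2_budget = max_paths - n1                     # slots left for track sequences
    n2_from_sparse = min(num_sparse_ids, n2_budget)
    shortfall = n2_budget - n2_from_sparse         # filled from the dense region, if needed
    return n1, n2_from_sparse, shortfall
```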
Step S806, the N1 dense region sequences and the N2 track sequences are input together into the multi-task recognition network model for recognition.
Here, after feature extraction by the backbone network, the N1 dense region sequences based on the dense regions are sent to the corresponding multi-person task head branches, i.e., fully connected layers, for multi-person action recognition; likewise, after feature extraction, the N2 track sequences based on the sparse regions are sent to the corresponding single-person task head branches, i.e., fully connected layers, for single-person action recognition.
In step S805, the process of cropping the region image from each frame image according to the dense region is as follows: the region image in each frame image is first cropped according to the size of the dense region; the long side of the region image is then resized to 224 pixels, the short side is scaled by the same ratio, and the top and bottom are padded with black bars where the result falls short of 224 pixels. If no pedestrian exists in a certain frame image, then when the merge tag is true the region image of the current frame is cropped with the dense bounding box merged from the other frame images, and when the merge tag is false a 224 x 224 region image is cropped centered on the center point of the current frame image. If there is no pedestrian detection result at all in the input video segment, a 224 x 224 region image is cropped at the image center point of every frame image.
The embodiment of the application is applicable to outdoor urban street scenes and indoor rail transit scenes. When events of different types or involving different numbers of people occur in these scenes, the device collecting the video source can automatically identify the area and type of the event and give an alarm, providing an efficient and convenient detection capability for personnel with related requirements.
At the data-processing level, the embodiment of the application achieves automatic switching between the two different single-person/multi-person preprocessing schemes: sparse and dense regions are divided automatically according to the density of the crowd, which satisfies different detection requirements such as single-person and multi-person behavior detection, forms a unified framework, and provides support for subsequent multi-task training. At the model-training level, the steps of single-model training/testing and all-in-one training/testing ensure that the model precision under multiple tasks is not lower than that of single-task training, greatly improving the efficiency and performance of the model. Multi-task behavior detection is realized without training a separate model for every requirement, which would cause redundancy in model capacity; compared with the prior art, new requirements are easier to extend.
Based on the foregoing embodiments, an embodiment of the present application further provides a behavior recognition apparatus, where the behavior recognition apparatus includes modules, sub-modules included in the modules, and units included in the sub-modules, and may be implemented by a processor in a device; of course, the implementation can also be realized through a specific logic circuit; in the implementation process, the Processor may be a Central Processing Unit (CPU), a microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
Fig. 9 is a schematic structural diagram of a device for identifying behaviors provided in an embodiment of the present application, and as shown in fig. 9, the device 900 includes an obtaining module 910, a generating module 920, and an identifying module 930, where:
the obtaining module 910 is configured to obtain a detection result of each object in each frame of image in the video sequence to be identified; wherein, the video sequence to be identified comprises a single object and/or a group object; the group of objects comprises at least two objects with a spatial distance smaller than a distance threshold;
the generating module 920 is configured to generate at least one first sequence and/or at least one second sequence according to a detection result of each object, where the first sequence and the second sequence each include one of: the same single object, the group object;
the identifying module 930 is configured to perform behavior identification on each of the first sequences and/or each of the second sequences by using a trained multi-task identification network model, so as to obtain a behavior identification result of each of the individual objects and/or a behavior identification result of the group object.
In some possible embodiments, the detection result of the object includes at least a detection box of the object and an object identifier associated with the detection box; the generating module 920 includes a first determining sub-module and a first generating sub-module, wherein: the first determining submodule is used for determining a sparse detection frame in each frame of image according to the distribution condition of the detection frame of each object in the corresponding frame of image; wherein the number of times of overlapping between the sparse detection frame and the other detection frames except the sparse detection frame in each frame of image is less than a first time threshold; the first generation submodule is configured to generate the at least one first sequence according to the sparse detection frame in each frame of image and the object identifier associated with the sparse detection frame.
In some possible embodiments, the generating module 920 includes the first determining submodule and a second generating submodule, where: the first determining submodule is further configured to determine a dense detection frame in each frame of image according to a distribution condition of the detection frame of each object in the corresponding frame of image; the overlapping times of the dense detection frames and other detection frames except the dense detection frames in each frame of image are greater than or equal to a second time threshold value, and the second time threshold value is greater than the first time threshold value; and the second generation submodule is used for generating the at least one second sequence according to the dense detection frame in each frame of image and the object identification associated with the dense detection frame.
In some possible embodiments, the first determination submodule comprises a first determination unit and a second determination unit, wherein: the first determining unit is used for determining the number of detection frames included in each frame of image; the second determining unit is configured to, when the number of detection frames is 1, determine that the frame corresponds to a sparse detection frame in the frame image.
In some possible embodiments, the first determination submodule comprises the first determination unit, a third determination unit, a fourth determination unit and a fifth determination unit, wherein: the third determining unit is configured to generate an adjacency matrix corresponding to each frame of image according to an intersection ratio between every two detection frames in each frame of image when the number of the detection frames is greater than or equal to 2; the fourth determining unit is configured to use a detection frame with a zero matching number in the adjacency matrix as a sparse detection frame in each frame of image; the fifth determining unit is configured to use a detection frame with a matching frequency greater than or equal to a third frequency threshold in the adjacency matrix as a dense detection frame in each frame of image; wherein the third count threshold is greater than the second count threshold.
In some possible embodiments, the video sequence to be identified is composed of frame images in a frame sequence buffer, and the first generation submodule includes a first merging unit, a first truncating unit, and a first connecting unit, where: the first merging unit is configured to, for all frame images of the frame sequence buffer, merge a sparse detection frame associated with each object identifier at a spatial position to obtain a minimum bounding frame of a single object corresponding to each object identifier; the first intercepting unit is used for intercepting a first area image corresponding to the spatial position of the smallest surrounding frame in each frame of image according to the size of the smallest surrounding frame; the first connecting unit is configured to sequentially connect the first area images according to the timestamp of each frame of image, so as to obtain a first sequence corresponding to each object identifier.
In some possible embodiments, the video sequence to be identified is composed of frame images in a frame sequence buffer, and the second generation submodule includes a first expansion unit, a second merging unit, a second truncation unit, and a second concatenation unit, where: the first expansion unit is used for expanding a dense detection frame in each frame of image in the frame sequence buffer area outwards by a specific proportion to obtain a dense area in each frame of image and an object identifier associated with the dense area; wherein the dense area includes at least two objects therein; the second merging unit is used for merging the dense regions in each frame image in the frame sequence buffer area to obtain a dense surrounding frame of the frame sequence buffer area; the second intercepting unit is used for intercepting a second area image corresponding to the spatial position of the dense enclosure frame in each frame of image according to the size of the dense enclosure frame; and the second connecting unit is used for sequentially connecting the second area images according to the time stamp of each frame of image to generate a second sequence.
In some possible embodiments, the group objects included in the images in each of the second sequences include at least one object.
In some possible embodiments, the video sequence to be identified is composed of frame images in a frame sequence buffer, each frame image in the frame sequence buffer including at least one dense detection box, and the second generation submodule includes a second expansion unit, a third merging unit, a third truncation unit, and a third connection unit, where: the second expansion unit is configured to expand a specific proportion outwards for each dense detection frame in each frame of image, so as to obtain a dense area corresponding to each dense detection frame and an object identifier associated with the dense area; wherein the dense area includes at least two objects therein; the third merging unit is configured to merge dense regions including the same object according to the object identifier associated with the dense region to obtain at least one dense enclosure frame including the same object; the third capturing unit is configured to capture a third area image corresponding to the spatial position of each dense enclosure frame in each frame of image according to the size of each dense enclosure frame; the third connecting unit is configured to sequentially connect the third area images according to the timestamp of each frame of image, and generate the at least one second sequence including the same object.
In some possible embodiments, the second generation submodule further includes a sixth determination unit configured to determine whether a merge tag of the frame sequence buffer is true; the merging tag is a tag for determining whether a dense region of the frame images in the frame sequence buffer area is merged or not according to the background similarity of the frame images in the frame sequence buffer area; correspondingly, the second merging unit is further configured to, if the merge flag is true, merge the dense regions in each frame of image in the frame sequence buffer to obtain a dense bounding box of the frame sequence buffer.
In some possible embodiments, the second generation submodule further includes a fourth truncation unit and a fourth connection unit, wherein: the fourth intercepting unit is used for respectively intercepting each frame of image according to the dense area of each frame of image under the condition that the merging label is false to obtain a fourth area image with the length and the width of specific pixels; the fourth connecting unit is configured to sequentially connect the fourth area images according to the timestamp of each frame of image, and generate the at least one second sequence.
In some possible embodiments, the fourth clipping unit is further configured to, in a case where a first frame image exists in the frame sequence buffer, take a region image centered on the center point of the first frame image, with the specific pixels as its length and width, as the fourth region image of the first frame image; wherein the first frame image is a frame image that does not include a dense detection frame.
In some possible embodiments, the video sequence to be identified is composed of frame images in a frame sequence buffer, and the generating module 920 further includes a second determining submodule, a filtering submodule, and a third generating submodule, where: the second determining submodule is used for determining the first object identification number associated with the sparse detection frame in each frame of image in the frame sequence buffer; the screening submodule is used for selecting a candidate detection frame from the dense area in each frame of image under the condition that the number of the first object identifications is smaller than a specific threshold value; wherein the number of times the candidate detection box overlaps other detection boxes is less than the first time threshold; the third generating sub-module is configured to generate the at least one first sequence according to the candidate detection box and the object identifier associated with the candidate detection box.
In some possible embodiments, the screening submodule includes a sixth determining unit, a seventh determining unit, and a selecting unit, where: the sixth determining unit is configured to determine the specific threshold according to a second number of object identifiers associated with the dense detection boxes in each frame image in the frame sequence buffer and a specific processing parameter; the seventh determining unit is configured to determine, according to the first number of object identifiers and the specific threshold, a third number of object identifiers associated with the candidate detection boxes; and the selecting unit is configured to select, according to the third number of object identifiers, detection boxes whose number of overlaps with other detection boxes is less than the first time threshold from the dense region in each frame image as the candidate detection boxes.
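A possible reading of this screening logic is sketched below; the formulas used here for the specific threshold (a fraction of the dense-box identifier count) and for the third identifier count (the shortfall relative to that threshold), as well as the parameter names, are assumptions, since the embodiment leaves the exact derivation open.

```python
def select_candidate_boxes(sparse_ids, dense_boxes, overlap_counts,
                           first_time_threshold, processing_param=0.5):
    """sparse_ids: object identifiers already covered by sparse detection boxes.
    dense_boxes: dict mapping object identifier -> detection box inside a dense region.
    overlap_counts: dict mapping object identifier -> overlap count with other boxes.
    Returns extra candidate boxes when too few objects are covered by sparse boxes."""
    first_count = len(sparse_ids)                               # identifiers from sparse boxes
    second_count = len(dense_boxes)                             # identifiers from dense boxes
    specific_threshold = int(second_count * processing_param)   # assumed formula
    if first_count >= specific_threshold:
        return {}                                               # enough single-object tracks already
    third_count = specific_threshold - first_count              # how many candidates to pick (assumed)
    candidates = {
        oid: box for oid, box in dense_boxes.items()
        if overlap_counts.get(oid, 0) < first_time_threshold and oid not in sparse_ids
    }
    # keep at most third_count candidates, preferring the least-overlapping boxes
    picked = sorted(candidates.items(), key=lambda kv: overlap_counts.get(kv[0], 0))[:third_count]
    return dict(picked)
```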
In some possible embodiments, the multi-task recognition network model is trained using a first sample data set of single-object behaviors and a second sample data set of group-object behaviors; the multi-task recognition network model includes a backbone network and a classification network corresponding to a task to be recognized, where the task to be recognized includes at least one of the following: a task type for single-object behaviors and a task type for group-object behaviors.
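The shared-backbone, per-task-head structure described above can be sketched in PyTorch as follows; the backbone stub, the feature dimension, and the class counts are placeholders rather than details taken from the embodiment.

```python
import torch
import torch.nn as nn

class MultiTaskRecognitionModel(nn.Module):
    """Shared backbone with one classification head per task to be recognized:
    one head for single-object behaviors, one for group-object behaviors."""

    def __init__(self, backbone: nn.Module, feat_dim: int = 512,
                 num_single_classes: int = 5, num_group_classes: int = 4):
        super().__init__()
        self.backbone = backbone                       # shared spatio-temporal feature extractor
        self.single_head = nn.Linear(feat_dim, num_single_classes)
        self.group_head = nn.Linear(feat_dim, num_group_classes)

    def forward(self, clip: torch.Tensor, task: str) -> torch.Tensor:
        feat = self.backbone(clip)                     # (N, feat_dim) clip-level feature
        if task == "single":
            return self.single_head(feat)              # logits for single-object behaviors
        return self.group_head(feat)                   # logits for group-object behaviors
```

At inference time each first sequence would be routed to the single-object head and each second sequence to the group-object head, which is the split described in the next embodiment.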
In some possible embodiments, the multi-task recognition network model includes a first classification network for performing behavior recognition on single objects and a second classification network for performing behavior recognition on group objects, and the recognition module 930 includes a first recognition submodule and a second recognition submodule, where: the first recognition submodule is configured to, when each first sequence includes a single object, recognize the at least one first sequence using the first classification network to obtain a behavior recognition result of each single object; and the second recognition submodule is configured to, when each second sequence includes a group object, recognize the at least one second sequence using the second classification network to obtain a behavior recognition result of the group object.
In some possible embodiments, the recognition apparatus 900 further includes a first training module, a second training module, and a third training module, where: the first training module is configured to train a constructed behavior recognition model with the first sample data set and the second sample data set, respectively, to obtain a first video recognition network for single objects and a second video recognition network for group objects, where the first video recognition network and the second video recognition network each include the backbone network and at least one classification network; the second training module is configured to train the backbone network using the first video recognition network and the second video recognition network to obtain the multi-task backbone network; and the third training module is configured to train a first behavior recognition model with the first sample data set and the second sample data set to obtain the multi-task recognition network model, where the first behavior recognition model includes the multi-task backbone network and the at least one classification network.
In some possible embodiments, the second training module includes an acquiring subunit and a training subunit, where: the acquiring subunit is configured to acquire a first feature vector of the first video recognition network and a second feature vector of the second video recognition network; and the training subunit is configured to train the backbone network in the constructed behavior recognition model using the first feature vector and the second feature vector as target features, to obtain the multi-task backbone network.
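One way to realize this second training stage, in which the backbone is fitted to the feature vectors of the two single-task networks, is a feature-distillation loss such as the sketch below; the use of an L2 loss with equal weights and the assumption that each teacher exposes a `backbone` attribute are choices made for illustration, not details from the embodiment.

```python
import torch
import torch.nn.functional as F

def backbone_distillation_loss(student_backbone, single_teacher, group_teacher,
                               single_clips, group_clips):
    """Train the multi-task backbone so that its features match the feature vectors
    produced by the single-object and group-object video recognition networks."""
    with torch.no_grad():
        target_single = single_teacher.backbone(single_clips)   # first feature vector
        target_group = group_teacher.backbone(group_clips)      # second feature vector
    pred_single = student_backbone(single_clips)
    pred_group = student_backbone(group_clips)
    # treat the teachers' features as regression targets (assumed L2, equal weights)
    return F.mse_loss(pred_single, target_single) + F.mse_loss(pred_group, target_group)
```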
In some possible embodiments, the recognition apparatus 900 further includes a first determining module, a second determining module, and an alarm module, where: the first determining module is configured to determine, when a single-object behavior and/or a group-object behavior exists in the video sequence to be recognized, the spatial position and the behavior category of the single-object behavior and/or the group-object behavior, where the behavior categories include categories of single-object behaviors and/or categories of group-object behaviors; the second determining module is configured to determine alarm content according to the spatial position and the behavior category; and the alarm module is configured to send an alarm notification to the terminal device corresponding to the spatial position according to the alarm content.
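A minimal sketch of this alarm path is shown below; the alarm-content format, the mapping from spatial position to terminal device, and the `send_notification` transport are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Alarm:
    spatial_position: Tuple[float, float, float, float]  # e.g. (x1, y1, x2, y2) in the frame
    behavior_category: str                                # single-object or group-object category
    content: str

def raise_alarm(spatial_position, behavior_category,
                terminal_lookup: Callable[[tuple], str],
                send_notification: Callable[[str, str], None]) -> Alarm:
    """Compose alarm content from the spatial position and behavior category,
    then notify the terminal device associated with that position."""
    content = f"Detected {behavior_category} at region {spatial_position}"
    terminal_id = terminal_lookup(spatial_position)   # assumed mapping, e.g. a camera/zone table
    send_notification(terminal_id, content)
    return Alarm(spatial_position, behavior_category, content)
```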
Here, it should be noted that: the above description of the apparatus embodiments, similar to the above description of the method embodiments, has similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that, in the embodiments of the present application, if the behavior recognition method is implemented in the form of a software functional module and sold or used as a standalone product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling a device (which may be a smartphone with a camera, a tablet computer, or the like) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disc. Thus, the embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps in the behavior recognition method in any of the above embodiments. Correspondingly, in an embodiment of the present application, a chip is further provided, where the chip includes a programmable logic circuit and/or program instructions, and when the chip runs, the chip is configured to implement the steps in any one of the behavior recognition methods in the foregoing embodiments. Correspondingly, in an embodiment of the present application, there is also provided a computer program product, which is used to implement the steps in any of the behavior recognition methods in the foregoing embodiments when the computer program product is executed by a processor of a device.
Based on the same technical concept, an embodiment of the present application provides a behavior recognition device for implementing the behavior recognition method described in the above method embodiments. Fig. 10 is a hardware entity diagram of a behavior recognition device according to an embodiment of the present application. As shown in Fig. 10, the recognition device 1000 includes a memory 1010 and a processor 1020, where the memory 1010 stores a computer program executable on the processor 1020, and the processor 1020, when executing the computer program, implements the steps of any of the behavior recognition methods of the embodiments of the present application.
The Memory 1010 is configured to store instructions and applications executable by the processor 1020, and may also buffer data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 1020 and modules in the device, and may be implemented by a FLASH Memory (FLASH) or a Random Access Memory (RAM).
The processor 1020, when executing the program, performs the steps of any of the behavior recognition methods described above. The processor 1020 generally controls the overall operation of the device 1000.
The Processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is understood that the electronic device implementing the above-mentioned processor function may be other electronic devices, and the embodiments of the present application are not particularly limited.
The computer storage medium/memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); or it may be any device including one or any combination of the above memories, such as a mobile phone, a computer, a tablet device, or a personal digital assistant.
Here, it should be noted that: the above description of the storage medium and device embodiments is similar to the description of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a device to perform all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (18)

1. A method of behavior recognition, the method comprising:
obtaining a detection result of each object in each frame of image in a video sequence to be identified; wherein, the video sequence to be identified comprises a single object and/or a group object; the group of objects comprises at least two objects with a spatial distance smaller than a distance threshold;
generating at least one first sequence and/or at least one second sequence according to the detection result of each object, wherein each first sequence and each second sequence comprise one of the following: the same single object, the group object;
and respectively carrying out behavior recognition on each first sequence and/or each second sequence by utilizing a trained multi-task recognition network model to obtain a behavior recognition result of each single object and/or a behavior recognition result of the group objects.
2. The method of claim 1, wherein the detection result of the object comprises at least a detection box of the object and an object identification associated with the detection box;
generating at least one first sequence according to the detection result of each object, including:
determining a sparse detection frame in each frame image according to the distribution condition of the detection frame of each object in the corresponding frame image; wherein the number of times of overlapping between the sparse detection frame and the other detection frames except the sparse detection frame in each frame of image is less than a first time threshold;
and generating the at least one first sequence according to the sparse detection frame in each frame of image and the object identification associated with the sparse detection frame.
3. The method of claim 2, wherein said generating at least one second sequence based on the detection of each of said objects comprises:
determining a dense detection frame in each frame image according to the distribution condition of the detection frame of each object in the corresponding frame image; the overlapping times of the dense detection frames and other detection frames except the dense detection frames in each frame of image are more than or equal to a second time threshold value; the second time threshold is greater than the first time threshold;
and generating the at least one second sequence according to the dense detection frame in each frame of image and the object identification associated with the dense detection frame.
4. The method according to claim 2 or 3, wherein the determining the sparse detection frame in each frame image according to the distribution of the detection frame of each object in the corresponding frame image comprises:
determining the number of detection frames included in each frame of image;
and determining the detection frame as a sparse detection frame in the corresponding frame image under the condition that the number of the detection frames is 1.
5. The method according to claim 3, wherein the determining the sparse detection frame and the dense detection frame in each frame image according to the distribution of the detection frame of each object in the corresponding frame image comprises:
determining the number of detection frames included in each frame of image;
under the condition that the number of the detection frames is more than or equal to 2, generating an adjacent matrix corresponding to each frame of image according to the intersection and parallel ratio between every two detection frames in each frame of image;
taking the detection frame with zero matching times in the adjacent matrix as a sparse detection frame in each frame of image;
taking the detection frame with the matching times larger than or equal to the third time threshold value in the adjacent matrix as a dense detection frame in each frame of image; wherein the third count threshold is greater than the second count threshold.
6. The method according to any one of claims 2 to 5, wherein the video sequence to be identified is composed of frame images in a frame sequence buffer, and the generating of the at least one first sequence from the sparse detection box in each frame image and the object identification associated with the sparse detection box comprises:
aiming at all frame images of the frame sequence buffer area, taking a union set of sparse detection frames associated with each object identifier at a spatial position to obtain a minimum bounding frame of a single object corresponding to each object identifier;
according to the size of the smallest enclosing frame, a first area image corresponding to the spatial position of the smallest enclosing frame in each frame of image is intercepted;
and sequentially connecting the first area images according to the time stamp of each frame image to obtain a first sequence corresponding to each object identifier.
7. The method according to claim 3 or 5, wherein the video sequence to be identified is composed of frame images in a frame sequence buffer, each frame image in the frame sequence buffer comprising a dense detection box, and wherein the generating the at least one second sequence based on the dense detection box in each frame image and the object identifier associated with the dense detection box comprises:
expanding a specific proportion outwards for the dense detection frame in each frame of image in the frame sequence buffer area to obtain a dense area in each frame of image and an object identifier associated with the dense area; wherein the dense area includes at least two objects therein;
merging the dense areas in each frame image in the frame sequence buffer area to obtain a dense surrounding frame of the frame sequence buffer area;
intercepting a second area image corresponding to the spatial position of the dense enclosure frame in each frame of image according to the size of the dense enclosure frame;
and sequentially connecting the second area images according to the time stamp of each frame image to generate a second sequence.
8. The method according to claim 3 or 5, wherein the group objects included in the images in each of the second sequences include at least one object, the video sequence to be recognized is composed of frame images in a frame sequence buffer, each frame image in the frame sequence buffer includes at least one dense detection box, and the generating the at least one second sequence according to the dense detection box in each frame image and the object identifier associated with the dense detection box includes:
expanding a specific proportion outwards for each dense detection frame in each frame of image to obtain a dense area corresponding to each dense detection frame and an object identifier associated with the dense area; wherein the dense area includes at least two objects therein;
according to the object identification associated with the dense area, taking a union set of the dense areas comprising the same object to obtain at least one dense enclosure frame comprising the same object;
intercepting a third area image corresponding to the spatial position of each dense surrounding frame in each frame of image according to the size of each dense surrounding frame;
and sequentially connecting the third area images according to the time stamp of each frame image to generate at least one second sequence comprising the same object.
9. The method of claim 7, wherein the generating the at least one second sequence according to the dense detection boxes in the each frame of image and object identifications associated with the dense detection boxes, further comprises:
determining whether a merge tag of the frame sequence buffer is true; the merging tag is a tag for determining whether a dense region of the frame images in the frame sequence buffer area is merged or not according to the background similarity of the frame images in the frame sequence buffer area;
the merging the dense regions in each frame image in the frame sequence buffer area to obtain the dense surrounding frame of the frame sequence buffer area comprises:
and under the condition that the merging label is true, merging the dense areas in each frame of image in the frame sequence buffer area to obtain a dense surrounding frame of the frame sequence buffer area.
10. The method of claim 9, wherein the generating the at least one second sequence based on the dense detection boxes in the each frame of image and object identifications associated with the dense detection boxes further comprises:
under the condition that the merging label is false, respectively intercepting each frame of image according to the dense area of each frame of image to obtain a fourth area image with the length and the width of specific pixels;
and sequentially connecting the fourth area images according to the time stamp of each frame image to generate at least one second sequence.
11. The method of claim 10, wherein the generating the at least one second sequence based on the dense detection boxes in the each frame of image and object identifications associated with the dense detection boxes, further comprises:
when a first frame image exists in the frame sequence buffer, taking a region image that is centered on the center point of the first frame image and whose length and width are the specific pixels as a fourth region image of the first frame image;
wherein the first frame image is a frame image that does not include a dense detection frame.
12. The method according to any of claims 3 to 11, wherein said video sequence to be identified is composed of frame images in a frame sequence buffer, said generating at least one first sequence based on the detection result of each of said objects, further comprising:
determining a first object identification number associated with a sparse detection frame in each frame of image in the frame sequence buffer;
under the condition that the number of the first object identifications is smaller than a specific threshold value, selecting a candidate detection frame from the dense area in each frame of image; wherein the number of times the candidate detection box overlaps other detection boxes is less than the first time threshold;
and generating the at least one first sequence according to the candidate detection frame and the object identification associated with the candidate detection frame.
13. The method of claim 12, wherein in the case that the first object identification number is less than a certain threshold, selecting a candidate detection box from the dense area in each frame image comprises:
determining the specific threshold according to the number of second object identifications and specific processing parameters associated with the dense detection frames in each frame of image in the frame sequence buffer area;
determining a third object identification number associated with the candidate detection frame according to the first object identification number and the specific threshold value;
and selecting the detection frame with the overlapping times with other detection frames being less than the first time threshold value from the dense area in each frame of image as the candidate detection frame according to the third object identification number.
14. The method of any one of claims 1 to 13, wherein the multi-task recognition network model is trained using a first sample data set of individual object behaviors and a second sample data set of group object behaviors;
the multi-task identification network model comprises a backbone network and a classification network corresponding to a task to be identified, wherein the task to be identified comprises at least one of the following: a task type for a single object behavior and a task type for a group object behavior.
15. The method of any of claims 1 to 14, wherein the multi-tasking recognition network model comprises a first classification network that identifies behavior of the individual objects and a second classification network that identifies the group objects,
the performing behavior recognition on each first sequence and/or each second sequence by using the trained multi-task recognition network model to obtain a behavior recognition result of each single object and/or a behavior recognition result of the group object includes:
in response to the condition that each first sequence comprises the single object, identifying the at least one first sequence by using the first classification network to obtain a behavior identification result of each single object; and/or the presence of a gas in the gas,
and in response to the condition that each second sequence comprises the group object, identifying the at least one second sequence by using the second classification network to obtain a behavior identification result of the group object.
16. An apparatus for behavior recognition, the apparatus comprising an acquisition module, a generation module, and a recognition module, wherein:
the acquisition module is used for acquiring the detection result of each object in each frame of image in the video sequence to be identified; wherein, the video sequence to be identified comprises a single object and/or a group object; the group of objects comprises at least two objects with a spatial distance smaller than a distance threshold;
the generating module is configured to generate at least one first sequence and/or at least one second sequence according to a detection result of each object, where the first sequence and the second sequence each include one of: the same single object, the group object;
the recognition module is used for respectively carrying out behavior recognition on each first sequence and/or each second sequence by utilizing a trained multi-task recognition network model to obtain a behavior recognition result of each single object and/or a behavior recognition result of the group objects.
17. An apparatus for behavior recognition comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor when executing the program implements the steps of the method of any one of claims 1 to 15.
18. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 15.
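To make the sparse/dense partition in claims 4 and 5 concrete, the sketch below builds the intersection-over-union adjacency matrix for one frame and counts, per detection box, how many other boxes it matches; the IoU value used to declare a match is an assumption, since the claims only refer to count thresholds.

```python
import numpy as np

def iou_matrix(boxes):
    """boxes: (N, 4) array of (x1, y1, x2, y2). Returns the N x N IoU matrix."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / np.maximum(union, 1e-9)

def partition_boxes(boxes, iou_match=0.1, third_time_threshold=2):
    """Adjacency = IoU above an assumed match value. A box with zero matches is sparse;
    a box whose match count reaches the third time threshold is dense."""
    if len(boxes) == 1:
        return [0], []                      # a single box is treated as sparse (claim 4)
    adj = iou_matrix(np.asarray(boxes, dtype=float)) > iou_match
    np.fill_diagonal(adj, False)            # a box does not match itself
    matches = adj.sum(axis=1)
    sparse = [i for i, m in enumerate(matches) if m == 0]
    dense = [i for i, m in enumerate(matches) if m >= third_time_threshold]
    return sparse, dense
```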
CN202110447987.1A 2021-04-25 2021-04-25 Behavior recognition method and device, equipment and storage medium Pending CN113111838A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110447987.1A CN113111838A (en) 2021-04-25 2021-04-25 Behavior recognition method and device, equipment and storage medium
PCT/CN2021/129384 WO2022227480A1 (en) 2021-04-25 2021-11-08 Behavior recognition method and apparatus, device, storage medium, computer program, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110447987.1A CN113111838A (en) 2021-04-25 2021-04-25 Behavior recognition method and device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113111838A true CN113111838A (en) 2021-07-13

Family

ID=76720116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110447987.1A Pending CN113111838A (en) 2021-04-25 2021-04-25 Behavior recognition method and device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113111838A (en)
WO (1) WO2022227480A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985259B (en) * 2018-08-03 2022-03-18 百度在线网络技术(北京)有限公司 Human body action recognition method and device
CN111598091A (en) * 2020-05-20 2020-08-28 北京字节跳动网络技术有限公司 Image recognition method and device, electronic equipment and computer readable storage medium
CN111985385B (en) * 2020-08-14 2023-08-29 杭州海康威视数字技术股份有限公司 Behavior detection method, device and equipment
CN112131944B (en) * 2020-08-20 2023-10-17 深圳大学 Video behavior recognition method and system
CN113111838A (en) * 2021-04-25 2021-07-13 上海商汤智能科技有限公司 Behavior recognition method and device, equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190251366A1 (en) * 2017-01-06 2019-08-15 Sportlogiq Inc. Systems and Methods for Behaviour Understanding from Trajectories
CN110378259A (en) * 2019-07-05 2019-10-25 桂林电子科技大学 A kind of multiple target Activity recognition method and system towards monitor video
CN111626199A (en) * 2020-05-27 2020-09-04 多伦科技股份有限公司 Abnormal behavior analysis method for large-scale multi-person carriage scene
CN112329685A (en) * 2020-11-16 2021-02-05 常州大学 Method for detecting crowd abnormal behaviors through fusion type convolutional neural network
CN112580523A (en) * 2020-12-22 2021-03-30 平安国际智慧城市科技股份有限公司 Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jianchao Wu et al.: "Learning Actor Relation Graphs for Group Activity Recognition", arXiv, 24 April 2019 (2019-04-24), pages 1-11 *
戎炜; 蒋哲远; 谢昭; 吴克伟: "Group Behavior Recognition Based on Clustering Relation Network" (基于聚类关联网络的群组行为识别), Journal of Computer Applications (计算机应用), 25 May 2020 (2020-05-25)
李定 (Li Ding): "Research on Group Behavior Recognition Methods Based on Deep Learning" (基于深度学习的群体行为识别方法研究), China Master's Theses Full-text Database, Information Science and Technology, vol. 2019, no. 08, 15 August 2019 (2019-08-15), pages 138-1056 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022227480A1 (en) * 2021-04-25 2022-11-03 上海商汤智能科技有限公司 Behavior recognition method and apparatus, device, storage medium, computer program, and program product
CN113591731A (en) * 2021-08-03 2021-11-02 重庆大学 Knowledge distillation-based weak surveillance video time sequence behavior positioning method
CN113591731B (en) * 2021-08-03 2023-09-05 重庆大学 Weak supervision video time sequence behavior positioning method based on knowledge distillation
WO2023047162A1 (en) * 2021-09-22 2023-03-30 Sensetime International Pte. Ltd. Object sequence recognition method, network training method, apparatuses, device, and medium
WO2023184804A1 (en) * 2022-03-31 2023-10-05 上海商汤智能科技有限公司 Model control method and apparatus, and device, storage medium and computer program product

Also Published As

Publication number Publication date
WO2022227480A1 (en) 2022-11-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40051717

Country of ref document: HK