CN114677754A - Behavior recognition method and device, electronic equipment and computer readable storage medium


Info

Publication number
CN114677754A
CN114677754A
Authority
CN
China
Prior art keywords
image frame
activation
region
center
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210239612.0A
Other languages
Chinese (zh)
Inventor
苏海昇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202210239612.0A
Publication of CN114677754A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the disclosure provide a behavior recognition method and apparatus, an electronic device, and a computer-readable storage medium. The method includes: classifying and recognizing the important region of each image frame in at least one image frame of an obtained image frame sequence to obtain a recognition result corresponding to the at least one image frame, where the important region is determined according to the detection frame regions where target objects are located in the at least one image frame; in the case that the recognition result indicates that a preset abnormal event exists, determining the activation center of the class activation map corresponding to the recognition result of each image frame to obtain at least one activation center of the at least one image frame, where one activation center characterizes the abnormal position in the class activation map of the corresponding image frame; and identifying the abnormal target object of each image frame based on the at least one activation center. The disclosed method and apparatus can improve recognition accuracy.

Description

Behavior recognition method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to terminal technologies, and in particular, to a behavior recognition method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Anomaly detection in video is an important problem in computer vision, with wide application in intelligent recognition, such as detecting illegal behaviors, traffic accidents, and other unusual events. Thousands of cameras are deployed worldwide, yet most of them merely record what happens at each moment and have no ability to recognize it automatically; dedicated personnel are usually needed for manual review. Given the huge volume of video, filtering its content by human effort alone is clearly unrealistic. Techniques are therefore needed that use computer vision and deep learning to automatically detect abnormal events occurring in video.
Identifying abnormal events in video is extremely difficult. The challenges include the scarcity of annotated data (abnormal events are low-probability events), large inter-class and intra-class variance, subjective differences in how abnormal events are defined, low resolution of video images, and so on. As humans, we can identify an anomaly through common sense: for example, a crowd gathering on a street that normally has little traffic may indicate an anomaly, as may a violent event such as a fight. For a machine, however, there is no such common sense, only visual features; generally, the stronger the visual features, the better the anomaly detection performance that can be expected.
Recognition methods for abnormal events exist in the related art, but their recognition accuracy is low.
Disclosure of Invention
The embodiment of the disclosure provides a behavior recognition method, a behavior recognition device, an electronic device and a computer-readable storage medium, which can improve recognition accuracy.
The technical scheme of the embodiment of the disclosure is realized as follows:
the embodiment of the disclosure provides a behavior identification method, which includes:
classifying and identifying important regions of each image frame in at least one image frame of the obtained image frame sequence to obtain an identification result corresponding to the at least one image frame; the important region is determined according to the detection frame region where the target object is located in the at least one image frame;
under the condition that the identification result represents that a preset abnormal event exists, determining an activation center of a class activation map corresponding to the identification result of each image frame to obtain at least one activation center of the at least one image frame; one activation center characterizes the abnormal position in the class activation map of a corresponding one of the image frames;
identifying an abnormal target object for the respective image frames based on the at least one activation center.
An embodiment of the present disclosure provides a behavior recognition apparatus, including: the identification unit is used for classifying and identifying important areas of each image frame in at least one image frame of the obtained image frame sequence to obtain an identification result corresponding to the at least one image frame; the important area is determined according to the detection frame area where the target object is located in the at least one image frame;
the determining unit is used for determining the activation center of a class activation map corresponding to the recognition result of each image frame under the condition that the recognition result represents that a preset abnormal event exists, so as to obtain at least one activation center of the at least one image frame; one activation center characterizes the abnormal position in the class activation map of a corresponding one of the image frames;
the identification unit is further configured to identify an abnormal target object of each image frame based on the at least one activation center.
An embodiment of the present disclosure provides an electronic device, including: a memory for storing an executable computer program; a processor for implementing the above-described behavior recognition method when executing the executable computer program stored in the memory.
The embodiment of the present disclosure provides a computer-readable storage medium, which stores a computer program for causing a processor to execute the method for recognizing behavior described above.
In the behavior recognition method and apparatus, the device, and the computer-readable storage medium provided by the embodiments of the disclosure, the important region of each image frame in at least one image frame of an obtained image frame sequence is classified and recognized to obtain a recognition result corresponding to the at least one image frame, where the important region is determined according to the detection frame regions where the target objects are located in the at least one image frame; in the case that the recognition result indicates that a preset abnormal event exists, the activation center of the class activation map corresponding to the recognition result of each image frame is determined to obtain at least one activation center of the at least one image frame, where one activation center characterizes the abnormal position in the class activation map of the corresponding image frame; and the abnormal target object of each image frame is identified based on the at least one activation center. With this technical solution, on the one hand, because classification and recognition are performed on the determined important regions of the image frames rather than on the full-image data of each image frame, the search range during recognition is effectively reduced and interference with recognition is also reduced, which improves recognition accuracy. On the other hand, the abnormal center (activation center) of each image frame is located by determining the abnormal position in the class activation map corresponding to the recognition result of that image frame, and the abnormal target object in the corresponding image frame is identified according to the obtained abnormal center; since the class activation map corresponding to the recognition result can characterize the target objects related to the recognition result in the image frame, the identified abnormal target object of each image frame is more accurate, improving the recognition accuracy for abnormal target objects. The behavior recognition method provided by the embodiments of the disclosure can therefore improve recognition accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 is an alternative flow chart of a behavior recognition method provided in the embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating an exemplary class activation map and activation center according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram illustrating an effect of displaying an abnormal target object in an exemplary image frame according to an embodiment of the present disclosure;
fig. 4 is an alternative flow chart of a behavior recognition method provided by the embodiment of the present disclosure;
fig. 5 is an alternative flow chart of a behavior recognition method provided by the embodiment of the present disclosure;
FIG. 6 is another illustration of an exemplary class activation map provided by an embodiment of the disclosure;
fig. 7A is an alternative flow chart of a behavior recognition method provided by the embodiment of the disclosure;
fig. 7B is an alternative flow chart of a behavior recognition method provided by the embodiment of the present disclosure;
Fig. 8 is a schematic diagram illustrating a class activation map and a target activation center corresponding to an exemplary image frame according to an embodiment of the present disclosure;
fig. 9 is an alternative flow chart of a behavior recognition method provided in the embodiment of the present disclosure;
fig. 10 is an alternative flow chart of a behavior recognition method provided in the embodiment of the present disclosure;
fig. 11 is an alternative flow chart of a behavior recognition method provided in the embodiment of the present disclosure;
FIG. 12 is a schematic illustration of the summarized region position corresponding to two exemplary different second regions according to an embodiment of the present disclosure;
fig. 13 is a schematic structural diagram of a behavior recognition apparatus according to an embodiment of the present disclosure;
fig. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To further clarify the objects, technical solutions and advantages of the present disclosure, the present disclosure will be described in further detail with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present disclosure, and all other embodiments that can be obtained by a person of ordinary skill in the art without making an inventive effort fall within the scope of protection of the present disclosure.
The inventors of the present disclosure found that a conventional behavior recognition method usually applies full-image data enhancement or other preprocessing to the input video sequence and then feeds it into a behavior recognition model for prediction. However, this approach is only suitable for human-centered video behavior recognition, the kind of data typically found in public academic video datasets. Video captured by surveillance cameras contains more information and covers a larger field of view, while the position of the target event and the scale of the human body are random; simply feeding the full image into the model is therefore clearly unreasonable, as much irrelevant information in the picture interferes with behavior recognition, affecting both the accuracy of behavior recognition and the accuracy of identifying the behavior performer.
In view of this, the inventors of the present disclosure considered that a local region, rather than the entire image, may be input to the behavior recognition model for behavior classification, so that interference from most of the irrelevant information in the picture is reduced and the accuracy of behavior recognition is improved. On this basis, however, the search range of the behavior recognition model is narrowed, and how to subsequently locate the specific behavior performer accurately must also be considered.
To this end, the embodiments of the present disclosure provide a behavior recognition method capable of improving the recognition accuracy of abnormal behavior recognition and an abnormal target object (behavior performer) associated with the abnormal behavior recognition. Before further detailed description of the embodiments of the present disclosure, terms and expressions referred to in the embodiments of the present disclosure are explained, and the terms and expressions referred to in the embodiments of the present disclosure are applied to the following explanations.
1) CNN (Convolutional Neural Network): a deep neural network with a convolutional structure. It is essentially an input-to-output mapping that can learn a large number of mapping relations between inputs and outputs; training the convolutional network with known patterns gives the network the mapping capability between input-output pairs. CNNs are commonly used in the field of image recognition.
2) CAM (Class Activation Map): also called a class activation heat map, or class activation map for short; a two-dimensional grid of feature scores associated with a particular output class, where each position of the grid indicates how important that position is for the class. For a picture input to a CNN model and classified as "dog", the degree to which each position in the picture resembles the "dog" class can be presented as a heat map. The class activation map helps to understand which part of a picture leads the convolutional neural network to its final decision.
An exemplary application of the electronic device provided by the embodiment of the present disclosure is described below, and the electronic device provided by the embodiment of the present disclosure may be implemented as various types of user terminals (hereinafter, referred to as terminals) such as AR glasses, a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, and a portable game device), and may also be implemented as a server.
Next, an exemplary application when the electronic device is implemented as a terminal will be explained. Fig. 1 is an alternative flow chart of a behavior recognition method provided in an embodiment of the present disclosure, which will be described with reference to the steps shown in fig. 1.
S101, classifying and identifying important regions of each image frame in at least one image frame of the obtained image frame sequence to obtain an identification result corresponding to the at least one image frame; the important area is determined according to the detection frame area where the target object is located in at least one image frame.
The electronic device may, under the condition that an important region of each image frame in at least one image frame of the image frame sequence is obtained, simultaneously input the important region of each image frame in the at least one image frame into the behavior recognition model for classification recognition, and obtain a recognition result corresponding to the at least one image frame; the important area of each image frame is determined according to the detection frame areas where all the target objects in the at least one image frame are located; for example, in the case that each important region of each image frame in 8 image frames of the image frame sequence is obtained, the electronic device may input the important regions of each image frame in the 8 image frames into the behavior recognition model for classification recognition at the same time, so as to obtain a recognition result corresponding to the 8 image frames.
Here, the electronic device may crop important regions of each image frame of the at least one image frame, and classify and identify the cropped important regions.
In the embodiment of the present disclosure, the detection frames are used to mark the positions of target objects in the image frame, where each target object corresponds to one detection frame, for example, there are two detection frames corresponding to two target objects one to one in fig. 3 described below.
In some embodiments, the electronic device may directly obtain the at least one image frame and the important region of each image frame in the image frame sequence from other devices, so as to perform classification recognition according to the important regions of the respective image frames in the at least one image frame, thereby obtaining a recognition result corresponding to the at least one image frame.
In the embodiment of the disclosure, the behavior recognition model is a pre-trained neural network model; for example, Two-Stream model, 3D (3-dimensional) ConvNets model, 2D (2-dimensional) ConvNets, etc., which are not limited in the embodiments of the present disclosure. The electronic device can determine whether abnormal events such as fighting, crowding, traffic accidents and the like exist in the at least one image frame by classifying and identifying important areas of each image frame in the at least one image frame.
In some embodiments, the image frame sequence may be a video stream captured by a capturing device such as a camera set in a specific scene, and the at least one image frame may be a continuous frame in the video stream; the specific scene may be a street, a mall, a scenic spot, etc., which is not limited by the embodiment of the present disclosure.
S102, under the condition that the recognition result represents that a preset abnormal event exists, determining an activation center of a class activation graph corresponding to the recognition result of each image frame to obtain at least one activation center of at least one image frame; an activation center characterizes the location of an anomaly in the class activation map of a corresponding one of the image frames.
When the electronic device identifies that a preset abnormal event exists in the at least one image frame (i.e., identifies the category of the preset abnormal event), the electronic device may determine the activation center of the class activation map corresponding to each image frame in the at least one image frame, so as to obtain the at least one activation center corresponding to the at least one image frame; here, the class activation map is the one associated with the category corresponding to the recognition result, and one activation center characterizes the abnormal position in the class activation map of the corresponding image frame. For example, as shown in fig. 2, in the case that a preset abnormal event of fighting is recognized from 8 consecutive image frames y1 through y8, 8 class activation maps w1 through w8 corresponding to the fighting event may be generated, corresponding one-to-one to the image frames y1 through y8. In the class activation maps w1 through w8, regions r1 through r8 are marked, each representing the region of the corresponding image frame that provides the most information when the model identifies the category of the preset abnormal event; for example, region r1 in class activation map w1 represents the region of image frame y1 that provides the most information for the model to identify the category of the fighting event. The activation centers of the 8 class activation maps are also determined, shown as j1 through j8 in fig. 2 (since the 8 class activation maps have the same sizes as the corresponding 8 image frames, the activation centers are marked directly in the corresponding image frames in fig. 2, at the positions marked with five-pointed stars).
In the embodiment of the present disclosure, the preset abnormal event may be, for example, a fighting event, a crowd busy event, a traffic accident, and the like, which is not limited in the embodiment of the present disclosure.
S103, identifying abnormal target objects of the image frames based on the at least one activation center.
In the embodiment of the disclosure, in the case that the activation center of each image frame in the at least one image frame is determined, for each image frame, it may be determined whether an abnormal target object exists in the image frame according to the activation center of the image frame, and in the case that the abnormal target object exists, the electronic device may identify the existing abnormal target object.
In some embodiments, since each activation center corresponds to one image frame, for each activation center the electronic device may determine the position of the activation center in the corresponding image frame, and in the case that the position falls within any detection frame region in that image frame, determine the target object corresponding to that detection frame region as an abnormal target object of the image frame; in this way, for the at least one activation center, the abnormal target object of each image frame corresponding to the at least one activation center can be obtained. For example, as shown in fig. 3, for image frame A, in the case that the activation center 31 falls within the two different human detection boxes 32 and 33, the two pedestrians corresponding to the human detection boxes 32 and 33 can be determined as abnormal target objects of image frame A.
Here, for each activation center, in a case where the position of the activation center in a corresponding one of the image frames does not belong to any one of the detection frame regions in the image frame, it may be determined that an abnormal target object does not exist in the image frame.
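The patent gives no code, but the activation-center-in-detection-box check described above can be sketched minimally in Python as follows; the (x1, y1, x2, y2) box format and all names are illustrative assumptions, not from the patent:

```python
def find_abnormal_objects(activation_center, detection_boxes):
    """Return ids of target objects whose detection box contains the activation center.

    activation_center: (x, y) pixel coordinates of the activation center in the frame.
    detection_boxes: dict mapping object id -> (x1, y1, x2, y2) box corners (assumed format).
    """
    cx, cy = activation_center
    abnormal = []
    for obj_id, (x1, y1, x2, y2) in detection_boxes.items():
        if x1 <= cx <= x2 and y1 <= cy <= y2:
            abnormal.append(obj_id)
    return abnormal  # empty list: no abnormal target object in this frame

# e.g. an activation center falling inside two pedestrian boxes, cf. fig. 3
boxes = {"person_32": (100, 50, 180, 260), "person_33": (170, 60, 250, 270)}
print(find_abnormal_objects((175, 150), boxes))  # ['person_32', 'person_33']
```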
In the embodiment of the disclosure, because classification and recognition are performed on the determined important regions of the image frames, the search range during recognition is effectively reduced compared with recognition on the full-image data of each image frame, and interference with recognition is reduced at the same time, improving recognition accuracy. On the other hand, the abnormal center (activation center) of each image frame is located by determining the abnormal position in the class activation map corresponding to the recognition result of that image frame, and the abnormal target object in the corresponding image frame is recognized from the obtained abnormal center; since the class activation map corresponding to the recognition result can characterize the target objects related to the recognition result in the image frame, the subsequently recognized abnormal target object of each image frame is more accurate, improving the recognition accuracy for abnormal target objects, while the narrowed search range of the model does not affect the subsequent recognition of the abnormal target object. The technical solution of the embodiment of the disclosure can therefore improve recognition accuracy.
Fig. 4 is an optional flowchart of the behavior recognition method according to an embodiment of the disclosure, and as shown in fig. 4, the S102 may be implemented by S1021 to S1023, which will be described with reference to fig. 4.
S1021, under the condition that the identification result represents that a preset abnormal event exists, determining a group of activation values corresponding to the identification result of each image frame; the set of activation values is used to generate an activation-like map of the image frame, and each pixel location in the activation-like map has a one-to-one correspondence with an activation value in the set of activation values.
When the electronic device identifies, through the behavior recognition model, that a preset abnormal event exists in the at least one image frame, then for each image frame the electronic device may determine, through a class activation mapping method, a group of activation values of the image frame corresponding to the preset abnormal event. The group of activation values is used to draw the class activation map of the image frame corresponding to the preset abnormal event, and each pixel position in the class activation map corresponds one-to-one to an activation value in the group. When drawing the class activation map from the group of activation values, if, for example, one activation value indicates that the color of its corresponding pixel is red, red is generated at the pixel position corresponding to that activation value; drawing the class activation map in this way means that the contributions of different pixels to the recognition result are represented by colors of different intensity in the class activation map.
In the embodiment of the present disclosure, the size of the class activation map of the image frame drawn according to the set of activation values is consistent with the size of the image frame, so that it is convenient to accurately position the original abnormal target object in the image frame according to the abnormal position in the class activation map of the image frame.
In some embodiments, fig. 5 is an optional flowchart of the behavior recognition method provided in the embodiment of the present disclosure, and as shown in fig. 5, the above S1021 may be implemented by S201 to S203, which will be described with reference to fig. 5.
S201, generating a group of initial class activation mapping values corresponding to the recognition result of each image frame.
For each image frame, the electronic device, upon obtaining a recognition result of the image frame, may generate a set of initial class activation mapping values for the image frame and corresponding to the recognition result.
In some embodiments, the above S201 may be implemented as follows: for each image frame, generating a group of initial class activation mapping values of the image frame corresponding to the recognition result, based on the weights of the classification layer corresponding to the recognition result and the feature maps output by the convolution layers in the behavior recognition model; the recognition result is obtained by classifying and recognizing the important region with the behavior recognition model, and the behavior recognition model includes a classification layer and a plurality of convolution layers. In some embodiments, the electronic device may classify and recognize the important region of each image frame in the at least one image frame using a behavior recognition model that includes a classification layer and a plurality of convolution layers, so as to identify the event category of a preset abnormal event. For each image frame, the last convolution layer in the behavior recognition model can generate y feature maps of the important region of the image frame, each feature map mainly extracting features related to one of the y categories; for example, the y feature maps can be denoted $A_1, A_2, \ldots, A_y$. The classification layer correspondingly comprises y neurons, one neuron per category, and each neuron corresponds to y weight values, e.g. $w_1, w_2, \ldots, w_y$. When the event category of the preset abnormal event is category c, the group of initial class activation mapping values corresponding to the image frame may be calculated with the following formula (1):

$$M_c = \sum_{i=1}^{y} w_i^c A_i \qquad (1)$$

where $i$ denotes the ith neuron, $w_i^c$ denotes the weight value of class c corresponding to the ith neuron, $A_i$ denotes the feature map corresponding to the ith neuron, and $M_c$ characterizes the group of initial class activation mapping values.
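For illustration only (the patent contains no source code), a minimal NumPy sketch of formula (1) might look as follows; the array shapes and names are assumptions:

```python
import numpy as np

def initial_cam(feature_maps, class_weights):
    """Formula (1): weighted sum of the last conv layer's feature maps for class c.

    feature_maps: array of shape (y, h, w), i.e. A_1 ... A_y.
    class_weights: array of shape (y,), i.e. w_1^c ... w_y^c.
    Returns the group of initial class activation mapping values, shape (h, w).
    """
    return np.tensordot(class_weights, feature_maps, axes=1)

# e.g. y = 8 feature maps of size 7x7
cam = initial_cam(np.random.rand(8, 7, 7), np.random.rand(8))
print(cam.shape)  # (7, 7)
```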
S202, performing up-sampling processing on the initial class activation mapping values to obtain a group of intermediate activation values; a set of intermediate activation values is used to generate an activation-like map corresponding to the important regions of the image frame.
And S203, carrying out fusion processing on the intermediate activation value and the original pixel value of the image frame to obtain a group of activation values.
Here, for one image frame, the class activation map that can be drawn from the obtained group of initial class activation mapping values has the same size as the feature map generated by the last convolution layer; the electronic device may therefore perform an upsampling operation on the group of initial class activation mapping values to generate a group of intermediate activation values, and the class activation map that can be drawn from the generated group of intermediate activation values has the same size as the important region of the image frame, i.e., the class activation map corresponding to the important region of the image frame is obtained. On this basis, the electronic device may further fuse the group of intermediate activation values with the group of original pixel values of the image frame to obtain the group of activation values corresponding to the image frame; the class activation map that can be drawn from this group of activation values has the same size as the original image frame, so that the original image frame is superimposed with the class activation map corresponding to its important region, and a class activation map corresponding to the image frame and consistent with its size is finally obtained. For example, fig. 6 shows that, when a fighting event is identified based on 8 consecutive image frames y10 through y80 (not shown in fig. 6), 8 class activation maps w10 through w80 corresponding to the fighting event are generated, corresponding one-to-one to the 8 image frames; in the class activation maps, regions r10 through r80 are marked, each representing the region of the corresponding image frame that provides the most information when the model identifies the category of the preset abnormal event.
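A minimal sketch of S202/S203 using OpenCV, under the assumption of the common resize-then-overlay realization of upsampling and fusion; the patent does not fix the exact operations, so the colormap and blending weight are assumptions:

```python
import cv2
import numpy as np

def fuse_cam_with_frame(cam, frame, region_xyxy, alpha=0.5):
    """S202: upsample the initial CAM to the important region; S203: fuse with the frame.

    cam: (h, w) initial class activation mapping values.
    frame: full original image frame (uint8 BGR).
    region_xyxy: (x1, y1, x2, y2) of the important region in the frame (assumed format).
    """
    x1, y1, x2, y2 = region_xyxy
    cam_up = cv2.resize(cam.astype(np.float32), (x2 - x1, y2 - y1))     # S202: upsampling
    cam_norm = (cam_up - cam_up.min()) / (cam_up.max() - cam_up.min() + 1e-8)
    heat = cv2.applyColorMap((cam_norm * 255).astype(np.uint8), cv2.COLORMAP_JET)

    fused = frame.copy()                                                # S203: fuse with original pixels
    fused[y1:y2, x1:x2] = cv2.addWeighted(frame[y1:y2, x1:x2], 1 - alpha, heat, alpha, 0)
    return fused  # a class activation map at the full image-frame size
```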
The following proceeds to the explanation of S1022 to S1023 in fig. 4:
and S1022, determining the maximum activation value in a group of activation values of each image frame, and determining the pixel position corresponding to the maximum activation value as an abnormal position.
For each image frame, the electronic device may determine a maximum activation value of a set of activation values corresponding to the image frame and determine a pixel location corresponding to the maximum activation value as an anomaly location for the image frame. Since the class activation map is a two-dimensional image, the anomaly location (pixel location) may be represented using two-dimensional coordinates, e.g., X (i, j), to characterize the location of the anomaly location in the class activation map or in the image frame.
S1023, based on the abnormal position of the image frame, obtaining the activation center of the image frame, and for at least one image frame, correspondingly obtaining at least one activation center.
For each image frame, the electronic device may determine an abnormal position in the class activation map of the image frame as an activation center of the class activation map of the image frame, so that for the at least one image frame, at least one activation center may be obtained correspondingly.
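S1022/S1023 amount to an argmax over the activation values of each frame; a one-function sketch (NumPy, names assumed):

```python
import numpy as np

def frame_activation_center(activation_values):
    """Pixel position of the maximum activation value, i.e. the activation center.

    activation_values: (H, W) array, the group of activation values of one image frame.
    Returns (i, j) = (row, column) coordinates in the class activation map / image frame.
    """
    return np.unravel_index(np.argmax(activation_values), activation_values.shape)
```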
Fig. 7A is an optional flowchart schematic diagram of the behavior recognition method provided by the embodiment of the present disclosure, as shown in fig. 7A, S103 may be implemented by S1031 to S1032, and will be described with reference to fig. 7A by taking fig. 1 as an example.
And S1031, determining a target activation center according to the at least one activation center.
Because the positions of the activation centers in the class activation maps of different image frames may drift, the electronic device may, when all the activation centers corresponding to the at least one image frame are determined, determine a target activation center from the obtained activation centers; this alleviates the drift of the activation centers, so that the abnormal target object of each image frame determined with the target activation center is more accurate, improving the recognition accuracy for abnormal target objects.
In some embodiments, the electronic device may calculate an average of all activation centers corresponding to the at least one image frame, and use the average as the target activation center. Here, since each activation center is actually a two-dimensional coordinate of one pixel, the electronic device may obtain the target activation center by calculating an average value of coordinates of the two-dimensional coordinates, for example, an average value of x coordinates and an average value of y coordinates may be calculated, respectively.
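Since each activation center is a two-dimensional coordinate, the coordinate-wise average described above reduces to a single NumPy call; a sketch with made-up coordinates:

```python
import numpy as np

# per-frame activation centers (x, y); values are illustrative only
centers = np.array([(120, 84), (122, 86), (119, 85), (121, 84)])
target_center = centers.mean(axis=0)  # average of x coordinates and of y coordinates
print(target_center)  # [120.5   84.75]
```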
And S1032, identifying abnormal target objects in each image frame according to the target activation center.
Under the condition that a target activation center corresponding to the at least one image frame is determined, the electronic equipment can determine whether an abnormal target object exists in the image frame or not according to the target activation center for each image frame, and under the condition that the abnormal target object exists, the existing abnormal target object is identified.
Here, for each image frame, the electronic device may determine the position of the target activation center in the image frame, and in the case that the position falls within any detection frame region in the image frame, determine the target object corresponding to that detection frame region as an abnormal target object of the image frame; in this way, for the at least one image frame, the abnormal target object of each image frame can be obtained. For example, as described above for fig. 3, the target objects corresponding to detection boxes 32 and 33 are abnormal target objects of image frame A.
Fig. 7B is an optional schematic flow chart of the behavior recognition method according to the embodiment of the present disclosure, and as shown in fig. 7B, the above S103 may also be implemented through S1033 to S1034, which will be described with reference to fig. 7B by taking fig. 1 as an example.
And S1033, clustering at least one activation center in an unsupervised clustering mode to obtain a target activation center.
S1034, identifying abnormal target objects in each image frame according to the target activation center.
Under the condition that the electronic equipment determines at least one activation center corresponding to at least one image frame one to one, a target activation center can be determined according to the at least one obtained activation center in an unsupervised clustering mode; therefore, the drifting phenomenon of the activation center can be relieved, so that the abnormal target object of each image frame in the at least one image frame determined by the target activation center is more accurate, and the identification accuracy of the abnormal target object is improved.
In some embodiments, the electronic device may adopt a K-Means clustering method to cluster all the activation centers corresponding to the at least one image frame, so that drift points may be eliminated in an unsupervised clustering manner, so that the abnormal target object of each image frame in the at least one image frame determined subsequently by using the target activation center is more accurate, and the accuracy of identifying the abnormal target object is improved.
Here, the unsupervised clustering method may also be another clustering method, which is not specifically limited in this embodiment of the disclosure.
In some embodiments, S1031 may also be implemented by S1033 described above.
In some embodiments, S1033 may be implemented by S301-S302:
s301, clustering at least one activation center in an unsupervised clustering mode to obtain a center cluster with the largest number of the activation centers.
The electronic device may cluster all the activation centers corresponding to the at least one image frame in an unsupervised clustering manner, thereby obtaining a plurality of clusters (cluster), where each cluster includes at least one activation center, and then may select a cluster including the largest number of activation centers from the plurality of clusters, thereby obtaining a center cluster.
S302, determining the average value of the activation centers contained in the center cluster, and determining the average value as a target activation center.
In the case where the electronic device determines the center cluster, the average of all the activation centers included in the center cluster may be calculated and used as the target activation center. As before, since each activation center is in fact the two-dimensional coordinates of one pixel, the electronic device may obtain the target activation center, e.g., X(i, j), by averaging the coordinates. Fig. 8 illustrates that, in the case that a preset abnormal event of fighting is identified from 8 consecutive image frames y11 through y88, 8 class activation maps w11 through w88 corresponding to the fighting event may be generated, corresponding one-to-one to the image frames; in the class activation maps, regions r11 through r88 are marked, each representing the region of the corresponding image frame that provides the most information when the model identifies the category of the preset abnormal event. The target activation center of the 8 class activation maps is also determined, shown as j11 through j88 in fig. 8 (since the 8 class activation maps have the same sizes as the corresponding 8 image frames, the target activation center is marked directly in the corresponding image frames in fig. 8, at the positions marked with five-pointed stars).
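S301/S302 can be sketched with scikit-learn's K-Means; the library choice and the number of clusters are assumptions, since the patent only requires some unsupervised clustering method:

```python
import numpy as np
from sklearn.cluster import KMeans

def target_center_by_clustering(centers, n_clusters=2):
    """S301: cluster the activation centers; S302: average the biggest cluster.

    centers: (n, 2) array of per-frame activation centers.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(centers)
    biggest = np.bincount(labels).argmax()           # the center cluster (most members)
    return centers[labels == biggest].mean(axis=0)   # drift points in other clusters are dropped
```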
In some embodiments, for the at least one image frame in S101 and the important region of each image frame in the at least one image frame, the electronic device may also obtain the important regions itself by some method, such as the method shown in fig. 9. Fig. 9 is an optional schematic flow diagram of the behavior recognition method according to the embodiment of the present disclosure; as shown in fig. 9, before the foregoing S101, S001-S003 may be further performed, which will be described with reference to fig. 9 by taking fig. 1 as an example.
S001, performing target detection on a current image frame of the image frame sequence, and determining at least one detection frame area corresponding to at least one existing target object.
The electronic device can obtain, from a collection device such as a camera arranged in a specific scene, a video stream of the scene collected in real time, and perform real-time target detection on the video stream frame by frame; for example, in the image frame A of fig. 3 described above, there are two detection frame regions 32 and 33, corresponding one-to-one to two target objects.
In the embodiment of the present disclosure, the target detection network may be a CNN network, or may be another network, and the like, which is not limited in the embodiment of the present disclosure.
For one image frame, there may be a target object or no target object, and the embodiment of the present disclosure is described with respect to a case where there is a target object in an image frame.
And S002, determining a central detection frame area in the at least one detection frame area.
And S003, under the condition that the at least one image frame is detected, determining the important area of each image frame in the at least one image frame based on at least one central detection frame area corresponding to the at least one image frame.
For each image frame, the electronic device may determine a detection frame region as a central detection frame region when all detection frame regions in the image frame are obtained, so that at least one central detection frame region may be correspondingly obtained when the electronic device obtains at least one image frame, and determine an important region of each image frame in the at least one image frame according to the at least one central detection frame region.
In an embodiment of the present disclosure, the electronic device may determine that at least one image frame has been detected when a preset condition is met. For example, the electronic device may determine that the preset condition is met when the detection duration reaches a preset duration, and take the image frames detected within that duration as the detected image frames; the preset duration may be set according to actual needs. For instance, with a preset duration of 6 seconds, the electronic device takes the image frames detected in every 6 seconds as the detected image frames; that is, the electronic device obtains at least one image frame every 6 seconds and determines the important region of each of those image frames. Alternatively, the electronic device may determine that the preset condition is met when the number of detected image frames reaches a preset number, which may also be set according to actual needs; for instance, with a preset number of 8, every time the electronic device detects 8 image frames it takes those 8 image frames as the detected image frames, i.e., every 8 detected image frames, the important region of each of the 8 image frames is determined.
In some embodiments, fig. 10 is an optional flowchart of the behavior recognition method provided in the embodiments of the present disclosure, and as shown in fig. 10, the above-mentioned S002 may be implemented by S401-S404, which will be described with reference to fig. 10.
S401, expanding at least one detection frame area by a first preset proportion to obtain at least one first area corresponding to the at least one detection frame area.
For all the detection frame regions in each image frame, the electronic device may enlarge the areas of all the detection frame regions by a first preset ratio, thereby obtaining the first regions corresponding to all the detection frame regions; for example, in the case that there are N detection frame regions in an image frame, the electronic device may enlarge the areas of the N detection frame regions by the first preset ratio, thereby obtaining N first regions, where N is an integer greater than 0. Enlarging the area of a detection frame region by the first preset ratio means that the range covered by the detection frame region is enlarged by the first preset ratio, so that a first region covering a larger range is obtained.
In the embodiment of the present disclosure, the first preset ratio may be set according to actual needs, for example, the first preset ratio may be 1.5, which is not limited in the embodiment of the present disclosure.
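A sketch of the box enlargement in S401; the patent does not state whether the ratio scales the side lengths or the area, so scaling both side lengths about the box center is an assumption (border clamping omitted for brevity):

```python
def expand_box(box, ratio=1.5):
    """Enlarge a detection frame region by `ratio`, keeping its center fixed.

    box: (x1, y1, x2, y2) corners (assumed format).
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) * ratio / 2, (y2 - y1) * ratio / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)
```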
S402, determining a matching result between any two first regions based on the area intersection ratio between any two first regions in at least one first region; the matching result represents whether any two first areas are matched.
For the first regions corresponding to all the detection frame regions in each image frame, the electronic device may calculate the area intersection-over-union between any two first regions, and determine whether the two first regions match according to the relation between the obtained intersection-over-union and a preset threshold. For example, in the case that one image frame corresponds to 3 first regions (first region 1, first region 2, and first region 3), the electronic device may calculate the area intersection-over-union between first region 1 and first region 2, between first region 1 and first region 3, and between first region 2 and first region 3; in the case that the intersection-over-union between first region 1 and first region 2 is less than the preset threshold, it is determined that first region 1 and first region 2 do not match; in the case that the intersection-over-union between first region 1 and first region 3 is greater than or equal to the preset threshold, it is determined that first region 1 matches first region 3; and in the case that the intersection-over-union between first region 2 and first region 3 is greater than or equal to the preset threshold, it is determined that first region 2 matches first region 3.
In the embodiment of the present disclosure, the preset threshold may be set according to actual needs, for example, may be 0, and the embodiment of the present disclosure is not limited herein.
And S403, determining the matching times of each first region based on the matching result, and determining the first region with the maximum matching times.
For each image frame, when the matching results between all the first regions of the image frame are determined, the electronic device may determine, from the matching results, the number of other first regions that each first region matches, use this number as its matching count, sort the matching counts of all the first regions in descending or ascending order, and determine from the sorted result the one or more first regions with the largest matching count. Continuing the example above: since it was determined from the area intersection-over-union that first region 1 does not match first region 2, first region 1 matches first region 3, and first region 2 matches first region 3, the region matched by first region 1 is first region 3, the region matched by first region 2 is first region 3, and the regions matched by first region 3 are first region 1 and first region 2; the matching count of first region 1 is thus 1, that of first region 2 is 1, and that of first region 3 is 2, so first region 3 has the largest matching count.
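S402/S403 reduce to pairwise IoU plus a vote count. A sketch under the assumption that "matches" means IoU strictly above the preset threshold (with the example threshold of 0, a non-strict comparison would match everything):

```python
def iou(a, b):
    """Area intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def matching_counts(first_regions, threshold=0.0):
    """For each first region, count how many other first regions it matches."""
    return [
        sum(1 for j, other in enumerate(first_regions)
            if j != i and iou(region, other) > threshold)
        for i, region in enumerate(first_regions)
    ]

counts = matching_counts([(0, 0, 10, 10), (20, 20, 30, 30), (5, 5, 25, 25)])
print(counts, counts.index(max(counts)))  # [1, 1, 2] 2 -> the third region has the most matches
```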
S404, determining a dense center according to the first area with the maximum matching times, and determining a detection frame area corresponding to the dense center as a center detection frame area.
For each image frame, the electronic device may determine a dense center of the image frame according to the first region with the largest matching frequency when obtaining the first region with the largest matching frequency of the image frame, and determine a detection frame region corresponding to the dense center as a center detection frame region of the image frame. For example, in the example of fig. 3, when the detection frame region corresponding to the dense center is 33, the detection frame region 33 may be the center detection frame region of the image frame a.
In some embodiments, the determining the dense center according to the first region with the largest matching times in S404 may be implemented by: determining a first area with the maximum matching times as a dense center under the condition that the first area with the maximum matching times exists; determining a first area with the largest area in two or more first areas with the largest matching times as a dense center under the condition that the two or more first areas with the largest matching times exist; in this way, the accuracy of the resulting dense center can be improved.
In some embodiments, fig. 11 is an optional schematic flow chart of the behavior recognition method provided in the embodiments of the present disclosure, and as shown in fig. 11, S003 described above may be implemented by S501 to S503, which will be described with reference to fig. 11.
S501, under the condition that at least one image frame is detected, at least one central detection frame area corresponding to the at least one image frame is enlarged by a second preset proportion, and at least one second area corresponding to the at least one detection frame area is obtained.
For all the detection frame regions of each image frame in the at least one image frame, the electronic device may enlarge the areas of all the detection frame regions by a second preset ratio, thereby obtaining the second regions corresponding to all the detection frame regions; for example, in the case that there are N detection frame regions in an image frame, the electronic device may enlarge the areas of the N detection frame regions by the second preset ratio, thereby obtaining N second regions, where N is an integer greater than 0. Enlarging the area of a detection frame region by the second preset ratio means that the range covered by the detection frame region is enlarged by the second preset ratio, so that a second region covering a larger range is obtained.
In the embodiment of the present disclosure, the second preset ratio may be set according to an actual need, for example, the second preset ratio may be 2, which is not limited in the embodiment of the present disclosure.
S502, determining a summarized region position according to the position, in its image frame, of each second region of the at least one second region.
In some embodiments, for all the obtained second regions corresponding to the at least one image frame, the electronic device may determine, according to the position information of each second region in its corresponding image frame, a summarized region position corresponding to all the second regions, such that the region corresponding to the summarized region position contains every one of the second regions corresponding to the at least one image frame.
Here, the summarized region position may be region coordinates, and may be the union of the region coordinates of all the second regions corresponding to the at least one image frame, for example the minimum union. For example, in the case that 2 second regions corresponding to 2 image frames are obtained, with the region coordinates of the first second region m11 in the 1st image frame being (x11, y11), (x12, y12), the region coordinates of the second second region m12 in the 2nd image frame being (x21, y21), (x22, y22), and x11 < x12 < x21 < x22, y21 < y22 < y11 < y12, the minimum coordinate union of the region coordinates of the two second regions is: (x11, y21), (x22, y12); the region corresponding to this minimum coordinate union in the 1st image frame includes the first second region m11, and the region corresponding to it in the 2nd image frame includes the second second region m12. For example, fig. 12 shows the region 112 corresponding to the region coordinates (x11, y11), (x12, y12) of the second region m11 in the 1st image frame, the region 113 corresponding to the region coordinates (x21, y21), (x22, y22) of the second region m12 in the 2nd image frame, and the region 111 corresponding to the minimum coordinate union of the two second regions, where region 111 includes both region 112 and region 113.
S503, for each image frame, determining the region corresponding to the summarized region position in the image frame as the important region of the image frame; the important region of each image frame includes the second region corresponding to that image frame.
In a case where the summarized region position corresponding to all the second regions of the at least one image frame is obtained, for each image frame of the at least one image frame, the electronic device may determine the region corresponding to the summarized region position in the image frame and use the determined region as the important region of the image frame, where the important region of the image frame includes the second region of that image frame.
Fig. 13 is a schematic structural diagram of the behavior recognition apparatus provided in the embodiment of the present disclosure; as shown in fig. 13, the behavior recognition apparatus 1 includes: the identification unit 11, configured to classify and identify the important region of each image frame in at least one image frame of the obtained image frame sequence to obtain an identification result corresponding to the at least one image frame, where the important region is determined according to the detection frame region in which the target object is located in the at least one image frame; and the determining unit 12, configured to determine, in a case where the identification result indicates that a preset abnormal event exists, an activation center of the class activation map corresponding to the identification result of each image frame, to obtain at least one activation center of the at least one image frame, where an activation center characterizes the abnormal position in the class activation map of a corresponding image frame. The identification unit 11 is further configured to identify an abnormal target object of each image frame based on the at least one activation center.
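As a purely structural illustration, the cooperation of the two units can be sketched as follows; the class name, the classifier's return format, and the activation_center / find_abnormal_object helpers sketched further below are all assumptions, not the apparatus itself.

```python
# Hypothetical structural sketch of the apparatus wiring; not the
# disclosed implementation.

class BehaviorRecognizer:
    def __init__(self, classifier):
        self.classifier = classifier            # identification unit: classifies important regions

    def run(self, important_regions, boxes_per_frame):
        result = self.classifier(important_regions)
        if not result.get("abnormal"):          # no preset abnormal event detected
            return None
        # determining unit: one activation center per frame's class activation map
        centers = [activation_center(cam) for cam in result["cams"]]
        # identification unit again: abnormal target object per frame
        return [find_abnormal_object(c, boxes)
                for c, boxes in zip(centers, boxes_per_frame)]
```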
In some embodiments of the present disclosure, the determining unit 12 is further configured to determine a set of activation values corresponding to the recognition result for each image frame; the set of activation values is used for generating a class activation map of each image frame, and each pixel position in the class activation map has a one-to-one correspondence with one activation value in the set of activation values; determining the maximum activation value in the group of activation values of each image frame, and determining the pixel position corresponding to the maximum activation value as the abnormal position; based on the abnormal position of each image frame, obtaining an activation center of each image frame, and correspondingly obtaining the at least one activation center for the at least one image frame.
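A minimal sketch of locating the abnormal position follows, assuming the class activation map is available as a 2-D NumPy array whose entries are the activation values, aligned one-to-one with pixel positions as described above; the function name is illustrative.

```python
import numpy as np

# Sketch: the pixel position of the maximum activation value is taken
# as the abnormal position, i.e. the activation center of the frame.

def activation_center(cam: np.ndarray):
    """Pixel position (x, y) of the maximum activation value."""
    idx = np.argmax(cam)                        # flat index of the maximum activation value
    y, x = np.unravel_index(idx, cam.shape)     # back to (row, col) coordinates
    return int(x), int(y)
```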
In some embodiments of the present disclosure, the identifying unit 11 is further configured to determine a position of each activation center in a corresponding one of the image frames; and under the condition that the position is located in any detection frame area in the image frame, determining a target object corresponding to the detection frame area as an abnormal target object in the image frame, and correspondingly obtaining the abnormal target object of each image frame for the at least one activation center.
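The containment test described above might look like the following sketch; the (x1, y1, x2, y2) box format and the names are illustrative assumptions.

```python
# Sketch: check whether the activation center falls inside any
# detection frame region of the frame.

def find_abnormal_object(center, boxes):
    """Index of the detection frame region containing the activation center, if any."""
    cx, cy = center
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if x1 <= cx <= x2 and y1 <= cy <= y2:
            return i      # the target object of this box is the abnormal target object
    return None           # the center falls outside every detection frame region
```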
In some embodiments of the present disclosure, the identifying unit 11 is further configured to determine a target activation center according to the at least one activation center; and identifying abnormal target objects in each image frame according to the target activation center.
In some embodiments of the present disclosure, the identifying unit 11 is further configured to cluster the at least one activation center in an unsupervised clustering manner to obtain a target activation center; and identifying abnormal target objects in each image frame according to the target activation center.
In some embodiments of the present disclosure, the identifying unit 11 is further configured to cluster the at least one activation center in an unsupervised clustering manner, so as to obtain a center cluster with the largest number of activation centers; determining an average of the activation centers contained in the center cluster, and determining the average as the target activation center.
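The disclosure does not fix a particular unsupervised clustering algorithm; the sketch below uses DBSCAN purely as one possible choice, and the eps/min_samples values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Sketch: cluster per-frame activation centers, take the cluster with
# the most members, and return the average as the target activation center.

def target_activation_center(centers, eps=20.0, min_samples=2):
    """Mean of the largest cluster of per-frame activation centers."""
    pts = np.asarray(centers, dtype=float)               # shape (n_frames, 2)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
    valid = labels[labels != -1]                         # drop noise points
    if valid.size == 0:
        return pts.mean(axis=0)                          # fallback: mean of all centers
    largest = np.bincount(valid).argmax()                # cluster with the most centers
    return pts[labels == largest].mean(axis=0)           # average = target activation center
```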
In some embodiments of the present disclosure, the determining unit 12 is further configured to generate a set of initial class activation mapping values corresponding to the recognition result for each image frame; perform up-sampling processing on the set of initial class activation mapping values to obtain a set of intermediate activation values, where the set of intermediate activation values is used for generating the class activation map corresponding to the important region of each image frame; and perform fusion processing on the set of intermediate activation values and the original pixel values of each image frame to obtain the set of activation values.
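A minimal sketch of the up-sampling and fusion steps follows, assuming the initial class activation mapping values arrive as a low-resolution 2-D float array and the frame as a uint8 BGR image; bilinear resizing and a weighted color overlay stand in for the fusion, whose exact form the text leaves open.

```python
import numpy as np
import cv2

# Sketch: upsample the initial CAM values to frame size (intermediate
# activation values), then fuse them with the original pixels.

def fuse_cam_with_frame(cam, frame, alpha=0.5):
    """Upsample a CAM to frame size and blend it with the original pixels."""
    h, w = frame.shape[:2]
    cam_up = cv2.resize(cam.astype(np.float32), (w, h),
                        interpolation=cv2.INTER_LINEAR)           # intermediate activation values
    cam_norm = (cam_up - cam_up.min()) / (np.ptp(cam_up) + 1e-8)  # scale to [0, 1]
    heat = cv2.applyColorMap(np.uint8(255 * cam_norm), cv2.COLORMAP_JET)
    return cv2.addWeighted(frame, 1 - alpha, heat, alpha, 0)      # fused with original pixels
```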
In some embodiments of the present disclosure, the behavior recognition apparatus 1 further includes a detection unit 13. Before the important regions of each image frame in at least one image frame of the obtained image frame sequence are classified and identified to obtain the recognition result corresponding to the at least one image frame, the detection unit 13 is configured to perform target detection on a current image frame in the image frame sequence and determine at least one detection frame region corresponding to at least one existing target object; the determining unit 12 is further configured to determine a central detection frame region in the at least one detection frame region, and to determine, in a case where at least one image frame is detected, the important region of each image frame in the at least one image frame based on at least one central detection frame region corresponding to the at least one image frame.
In some embodiments of the present disclosure, the determining unit 12 is further configured to enlarge the at least one detection frame region by a first preset ratio to obtain at least one first region corresponding to the at least one detection frame region; determine a matching result between any two first regions based on the area intersection ratio between the two first regions in the at least one first region, where the matching result represents whether the two first regions match; determine the matching times of each first region based on the matching results, and determine the first region with the largest number of matching times; and determine a dense center according to the first region with the largest number of matching times, and determine the detection frame region corresponding to the dense center as the central detection frame region.
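A sketch of the matching procedure follows, assuming first regions are (x1, y1, x2, y2) boxes obtained by the first-ratio expansion; the IoU threshold is an assumption, since the text only says matching is based on the area intersection ratio, and the area tie-break anticipates the rule described two paragraphs below.

```python
# Sketch: count IoU-based matches per first region and pick the dense
# center; threshold and names are illustrative assumptions.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-8)

def dense_center_index(first_regions, iou_thresh=0.1):
    """Index of the first region with the most matches; ties go to the larger area."""
    n = len(first_regions)
    counts = [sum(iou(first_regions[i], first_regions[j]) > iou_thresh
                  for j in range(n) if j != i) for i in range(n)]
    best = max(counts)
    tied = [i for i, c in enumerate(counts) if c == best]
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return max(tied, key=lambda i: area(first_regions[i]))
```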
In some embodiments of the present disclosure, the determining unit 12 is further configured to enlarge at least one central detection frame region corresponding to the at least one image frame by a second preset ratio, so as to obtain at least one second region corresponding to the at least one central detection frame region; determine the summarized region position according to the position of each second region in the image frame to which it belongs; and, for each image frame, determine the region corresponding to the summarized region position in that image frame as the important region of that image frame, where the important region of each image frame includes the second region corresponding to that image frame.
In some embodiments of the present disclosure, the determining unit 12 is further configured to determine, in a case that there is one first region with the largest number of matching times, the first region with the largest number of matching times as the dense center; and determining a first region having a largest area among the two or more first regions having the largest number of matching times as the dense center, when there are two or more first regions having the largest number of matching times.
An embodiment of the present disclosure further provides an electronic device, fig. 14 is a schematic structural diagram of the electronic device provided in the embodiment of the present disclosure, and as shown in fig. 14, the electronic device 2 includes: a memory 21 and a processor 22, wherein the memory 21 and the processor 22 are connected by a bus 23; a memory 21 for storing an executable computer program; the processor 22 is configured to implement the method provided by the embodiment of the present disclosure, for example, the behavior recognition method provided by the embodiment of the present disclosure, when the executable computer program stored in the memory 21 is executed.
An embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program, the computer program being configured to cause the processor 22 to execute a method provided by the embodiments of the present disclosure, for example, the behavior recognition method provided by the embodiments of the present disclosure.
In some embodiments of the present disclosure, the storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments of the present disclosure, a computer program may be written in any form of programming language (including compiled or interpreted languages) in the form of software, software modules, scripts or code and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, computer programs may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
By way of example, a computer program can be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, with the above technical solution, on the one hand, because the determined important regions of the image frames are classified and identified, compared with identification based on data enhancement of the whole image of each image frame, the search range during identification is effectively reduced and interference during identification is reduced, thereby improving the identification accuracy. On the other hand, the abnormal center (activation center) of each image frame is located by determining the abnormal position in the class activation map corresponding to the recognition result of each image frame, and the abnormal target object in the corresponding image frame is recognized according to the obtained abnormal center; since the class activation map corresponding to the recognition result can characterize the target object related to the recognition result in the image frame, the abnormal target object recognized in each image frame is more accurate, which improves the recognition accuracy of the abnormal target object. Therefore, the behavior recognition method provided by the embodiments of the present disclosure can improve the recognition accuracy.
The above description is only an example of the present disclosure, and is not intended to limit the scope of the present disclosure. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present disclosure are included in the protection scope of the present disclosure.

Claims (14)

1. A method of behavior recognition, comprising:
classifying and identifying important regions of each image frame in at least one image frame of the obtained image frame sequence to obtain an identification result corresponding to the at least one image frame; the important region is determined according to the detection frame region where the target object is located in the at least one image frame;
under the condition that the identification result represents that a preset abnormal event exists, determining an activation center of a class activation map corresponding to the identification result of each image frame to obtain at least one activation center of the at least one image frame; an activation center characterizes the abnormal position in the class activation map of a corresponding one of the image frames;
identifying an abnormal target object for the respective image frames based on the at least one activation center.
2. The method of claim 1, wherein the determining the activation center of the class activation map corresponding to the recognition result for each image frame, to obtain at least one activation center of the at least one image frame, comprises:
determining a set of activation values for each image frame corresponding to the recognition result; the set of activation values is used for generating a class activation map of each image frame, and each pixel position in the class activation map has a one-to-one correspondence with one activation value in the set of activation values;
determining the maximum activation value in the group of activation values of each image frame, and determining the pixel position corresponding to the maximum activation value as the abnormal position;
obtaining an activation center of each image frame based on the abnormal position of each image frame, and correspondingly obtaining the at least one activation center for the at least one image frame.
3. The method according to claim 1 or 2, wherein said identifying abnormal target objects in said respective image frames based on said at least one activation center comprises:
determining a position of each activation center in a corresponding one of the image frames;
and under the condition that the position is located in any detection frame area in the image frame, determining a target object corresponding to the detection frame area as an abnormal target object in the image frame, and correspondingly obtaining the abnormal target object of each image frame for the at least one activation center.
4. The method according to any one of claims 1-3, wherein said identifying abnormal target objects in said respective image frames based on said at least one activation center comprises:
determining a target activation center according to the at least one activation center;
and identifying abnormal target objects in each image frame according to the target activation center.
5. The method according to any one of claims 1-4, wherein said identifying abnormal target objects in said respective image frames based on said at least one activation center comprises:
clustering the at least one activation center in an unsupervised clustering mode to obtain a target activation center;
and identifying abnormal target objects in each image frame according to the target activation center.
6. The method of claim 5, wherein clustering the at least one activation center by unsupervised clustering to obtain a target activation center comprises:
clustering the at least one activation center in an unsupervised clustering mode to obtain a center cluster with the largest number of the activation centers;
determining an average value of the activation centers contained in the center cluster, and determining the average value as the target activation center.
7. The method of claim 2, wherein said determining a set of activation values for each image frame corresponding to said recognition result comprises:
generating a set of initial class activation mapping values for each image frame corresponding to the recognition result;
performing up-sampling processing on the set of initial class activation mapping values to obtain a group of intermediate activation values; the set of intermediate activation values is used for generating the class activation map corresponding to the important region of each image frame;
and carrying out fusion processing on the group of intermediate activation values and the original pixel value of each image frame to obtain the group of activation values.
8. The method according to any one of claims 1 to 7, wherein before the classifying and identifying the important region of each image frame in at least one image frame of the obtained image frame sequence to obtain the identification result corresponding to the at least one image frame, the method further comprises:
performing target detection on a current image frame in the image frame sequence, and determining at least one detection frame region corresponding to at least one existing target object;
determining a central detection frame region of the at least one detection frame region;
determining the important region of each image frame in the at least one image frame based on at least one central detection frame region corresponding to the at least one image frame when the at least one image frame is detected.
9. The method of claim 8, wherein the determining a central detection frame region of the at least one detection frame region comprises:
expanding the at least one detection frame region by a first preset ratio to obtain at least one first region corresponding to the at least one detection frame region;
determining a matching result between any two first regions based on an area intersection ratio between the two first regions in the at least one first region; the matching result represents whether the two first regions are matched or not;
determining the matching times of each first region based on the matching result, and determining the first region with the maximum matching times;
and determining a dense center according to the first region with the maximum matching times, and determining a detection frame region corresponding to the dense center as the central detection frame region.
10. The method according to claim 8 or 9, wherein the determining the important region of each image frame of the at least one image frame based on at least one central detection frame region corresponding to the at least one image frame comprises:
expanding at least one central detection frame region corresponding to the at least one image frame by a second preset ratio to obtain at least one second region corresponding to the at least one central detection frame region;
determining the summarized region position according to the position of each second region in the image frame to which the second region belongs;
for each image frame, determining a corresponding region of the summarized region position in the image frame as the important region of the image frame; the important region of each image frame comprises a second region corresponding to each image frame.
11. The method according to claim 9, wherein determining the dense center according to the first region with the largest number of matching times comprises:
in the case that there is one first region with the largest number of matching times, determining the first region with the largest number of matching times as the dense center;
and determining a first region having a largest area among the two or more first regions having the largest number of matching times as the dense center, when there are two or more first regions having the largest number of matching times.
12. A behavior recognition apparatus, comprising:
the identification unit is used for classifying and identifying important regions of each image frame in at least one image frame of the obtained image frame sequence to obtain an identification result corresponding to the at least one image frame; the important region is determined according to the detection frame region where the target object is located in the at least one image frame;
the determining unit is used for determining the activation center of a class activation map corresponding to the recognition result of each image frame under the condition that the recognition result represents that a preset abnormal event exists, so as to obtain at least one activation center of the at least one image frame; an activation center characterizes the abnormal position in the class activation map of a corresponding one of the image frames;
the identification unit is further configured to identify an abnormal target object of each image frame based on the at least one activation center.
13. An electronic device, comprising:
a memory for storing an executable computer program;
a processor for implementing the method of any one of claims 1 to 11 when executing an executable computer program stored in the memory.
14. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the method of any one of claims 1 to 11.
CN202210239612.0A 2022-03-11 2022-03-11 Behavior recognition method and device, electronic equipment and computer readable storage medium Pending CN114677754A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210239612.0A CN114677754A (en) 2022-03-11 2022-03-11 Behavior recognition method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210239612.0A CN114677754A (en) 2022-03-11 2022-03-11 Behavior recognition method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114677754A true CN114677754A (en) 2022-06-28

Family

ID=82071674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210239612.0A Pending CN114677754A (en) 2022-03-11 2022-03-11 Behavior recognition method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114677754A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114896307A (en) * 2022-06-30 2022-08-12 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment
CN114896307B (en) * 2022-06-30 2022-09-27 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment
US11861838B1 (en) * 2023-06-07 2024-01-02 BrightHeart SAS Systems and methods for system agnostic automated detection of cardiovascular anomalies and/or other features
CN116524414A (en) * 2023-06-26 2023-08-01 广州英码信息科技有限公司 Method, system and computer readable storage medium for identifying racking behavior
CN116524414B (en) * 2023-06-26 2023-10-17 广州英码信息科技有限公司 Method, system and computer readable storage medium for identifying racking behavior

Similar Documents

Publication Publication Date Title
CN111709409B (en) Face living body detection method, device, equipment and medium
WO2021022970A1 (en) Multi-layer random forest-based part recognition method and system
WO2021043112A1 (en) Image classification method and apparatus
CN114677754A (en) Behavior recognition method and device, electronic equipment and computer readable storage medium
CN107862270B (en) Face classifier training method, face detection method and device and electronic equipment
CN106845383A People's head inspecting method and device
Masurekar et al. Real time object detection using YOLOv3
CN112052837A (en) Target detection method and device based on artificial intelligence
CN114241548A (en) Small target detection algorithm based on improved YOLOv5
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN111753698A (en) Multi-mode three-dimensional point cloud segmentation system and method
CN112257728B (en) Image processing method, image processing apparatus, computer device, and storage medium
Lyu et al. Small object recognition algorithm of grain pests based on SSD feature fusion
CN107808376A A kind of detection method of raising one's hand based on deep learning
CN112115775A (en) Smoking behavior detection method based on computer vision in monitoring scene
CN112733802A (en) Image occlusion detection method and device, electronic equipment and storage medium
CN109670517A (en) Object detection method, device, electronic equipment and target detection model
CN113850136A (en) Yolov5 and BCNN-based vehicle orientation identification method and system
CN113781519A (en) Target tracking method and target tracking device
CN114519819B (en) Remote sensing image target detection method based on global context awareness
CN115719428A (en) Face image clustering method, device, equipment and medium based on classification model
CN115311723A (en) Living body detection method, living body detection device and computer-readable storage medium
CN115018886A (en) Motion trajectory identification method, device, equipment and medium
CN114782883A (en) Abnormal behavior detection method, device and equipment based on group intelligence
CN112308093A (en) Air quality perception method based on image recognition, model training method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination