CN111753590A - Behavior identification method and device and electronic equipment - Google Patents

Behavior identification method and device and electronic equipment

Info

Publication number
CN111753590A
Authority
CN
China
Prior art keywords
target
behavior recognition
network
image
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910245567.8A
Other languages
Chinese (zh)
Other versions
CN111753590B (en)
Inventor
王轩瀚
周纪强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201910245567.8A priority Critical patent/CN111753590B/en
Publication of CN111753590A publication Critical patent/CN111753590A/en
Application granted granted Critical
Publication of CN111753590B publication Critical patent/CN111753590B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a behavior identification method and device and electronic equipment. The method comprises the following steps: for a plurality of video frames in a video to be analyzed, acquiring global image features of each video frame; determining the image region where each target in the video frame is located based on the global image features; extracting, from the global image features, the region image features of the image region where each target is located as the target features of that target; determining a behavior recognition result of the target based on the target features; performing consistency processing on the behavior recognition results of all targets in the video frame to obtain the behavior recognition result of the video frame; and performing consistency processing on the behavior recognition results of the plurality of video frames to obtain the behavior recognition result of the video to be analyzed. The amount of computation required for behavior recognition can be effectively reduced, so that real-time behavior recognition is easier to achieve.

Description

Behavior identification method and device and electronic equipment
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a behavior recognition method and apparatus, and an electronic device.
Background
In some application scenarios, to determine the behavior of targets in a video, behavior analysis may be performed on the video. In the related art, pose estimation may be performed on the video frames based on the optical flow information of the video so as to track the key points of a target, a pose sequence of the target over a plurality of consecutive video frames is thereby obtained, and the behavior of the target is determined based on the pose sequence.
However, this method relies on the optical flow information of the video, which is often very large and may require considerable computing resources, so it is difficult to achieve real-time behavior recognition.
Disclosure of Invention
An object of the embodiments of the present application is to provide a behavior recognition method, so as to reduce the amount of computation required for implementing behavior recognition, thereby implementing real-time behavior recognition. The specific technical scheme is as follows:
in a first aspect of embodiments of the present application, a behavior recognition method is provided, where the method includes:
aiming at a plurality of video frames in a video to be analyzed, acquiring global image characteristics of the video frames;
determining an image area where each target is located in the video frame based on the global image characteristics;
extracting the regional image characteristics of the image region where each target is located from the global image characteristics to serve as the target characteristics of the target;
determining a behavior recognition result of the target based on the target characteristics;
carrying out consistency processing on the behavior recognition results of all targets in the video frame to obtain the behavior recognition result of the video frame;
and carrying out consistency processing on the behavior recognition results of the plurality of video frames to obtain the behavior recognition result of the video to be analyzed.
With reference to the first aspect, in a first possible implementation manner, the determining, based on the target feature, a behavior recognition result of the target includes:
extracting key point features of the target from the target features;
performing regression on the key point features to obtain a key point thermodynamic diagram of the target, wherein the thermodynamic diagram is used for expressing the probability that each pixel point in an image area where the target is located is a key point;
splicing the thermodynamic diagram and the target feature to obtain a fusion image feature;
and performing regression on the fusion image characteristics to obtain a behavior recognition result of the target.
With reference to the first aspect, in a second possible implementation manner, the determining, based on the global image feature, an image area where each target in the video frame is located includes:
and performing single regression on the global image characteristics, and determining the image area where each target in the video frame is located.
With reference to the first aspect, in a third possible implementation manner, the acquiring, for a plurality of video frames in a video to be analyzed, a global image feature of the video frame includes:
inputting a plurality of video frames in the video to be analyzed into a global feature sub-network in a behavior recognition network to obtain the output of the global feature sub-network as the global image feature of the video frames;
the determining the image area where each target is located in the video frame based on the global image feature includes:
inputting the global image characteristics into a target detection sub-network in the behavior recognition network to obtain the output of the target detection sub-network, wherein the output is used as an image area where each target in the video frame is located;
the extracting, from the global image features, the region image features of the image region where each target is located, as the target features of the target, includes:
inputting the global image features and the image area where each target is located into an area feature sub-network in the behavior recognition network to obtain the output of the area feature sub-network as the target features of the targets;
the determining the behavior recognition result of the target based on the target feature comprises the following steps:
inputting the target characteristics into a posture estimation sub-network in the behavior recognition network to obtain the output of the posture estimation sub-network as a posture estimation result of the target;
and inputting the target characteristics and the attitude estimation result into a behavior recognition sub-network in the behavior recognition network to obtain the output of the behavior recognition sub-network as the behavior recognition result of the target.
With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner, the behavior recognition network is pre-trained by:
inputting a sample video frame marked with a target area, a target posture and a target behavior into the behavior recognition network, obtaining the output of the target detection subnetwork as a pre-estimated image area, obtaining the output of the posture estimation subnetwork as a pre-estimated posture result, and obtaining the output of the behavior recognition subnetwork as a pre-estimated behavior recognition result;
calculating the loss of the behavior recognition network based on the target area, the target posture, the target behavior, the estimated image area, the estimated posture result and the estimated behavior recognition result;
based on the loss, network parameters of the behavior recognition network are adjusted.
In a second aspect of embodiments of the present application, there is provided a behavior recognition apparatus, the apparatus including:
the global feature extraction module is used for acquiring global image features of a plurality of video frames in a video to be analyzed;
the image area determining module is used for determining the image area where each target in the video frame is located based on the global image characteristics;
the regional characteristic extraction module is used for extracting regional image characteristics of an image region where each target is located from the global image characteristics to serve as target characteristics of the target;
the target behavior identification module is used for determining a behavior identification result of the target based on the target characteristics;
the single-frame behavior recognition module is used for carrying out consistency processing on the behavior recognition results of all targets in the video frame to obtain the behavior recognition result of the video frame;
and the video behavior identification module is used for carrying out consistency processing on the behavior identification results of the plurality of video frames to obtain the behavior identification result of the video to be analyzed.
With reference to the second aspect, in a first possible implementation manner, the target behavior identification module is specifically configured to extract a key point feature of the target from the target feature;
performing regression on the key point features to obtain a key point thermodynamic diagram of the target, wherein the thermodynamic diagram is used for expressing the probability that each pixel point in an image area where the target is located is a key point;
splicing the thermodynamic diagram and the target feature to obtain a fusion image feature;
and performing regression on the fusion image characteristics to obtain a behavior recognition result of the target.
With reference to the second aspect, in a second possible implementation manner, the image region determining module is specifically configured to perform a single regression on the global image features to determine an image region where each target in the video frame is located.
With reference to the second aspect, in a third possible implementation manner, the global feature extraction module is specifically configured to input a plurality of video frames in the video to be analyzed to a global feature sub-network in a behavior recognition network, so as to obtain an output of the global feature sub-network, where the output is used as a global image feature of the video frame;
the image area determining module is specifically configured to input the global image feature to a target detection subnetwork in the behavior recognition network, to obtain an output of the target detection subnetwork, where the output is used as an image area where each target in the video frame is located;
the local feature extraction module is specifically configured to input the global image features and the image area where each target is located into a regional feature sub-network in the behavior recognition network, to obtain an output of the regional feature sub-network, where the output is used as a target feature of the target;
the target behavior recognition module is specifically configured to input the target feature into an attitude estimation subnetwork in the behavior recognition network, and obtain an output of the attitude estimation subnetwork as an attitude estimation result of the target;
and inputting the target characteristics and the attitude estimation result into a behavior recognition sub-network in the behavior recognition network to obtain the output of the behavior recognition sub-network as the behavior recognition result of the target.
With reference to the second aspect, in a fourth possible implementation manner, the apparatus further includes a network training module, configured to perform pre-training to obtain the behavior recognition network by:
inputting a sample video frame marked with a target area, a target posture and a target behavior into the behavior recognition network, obtaining the output of the target detection subnetwork as a pre-estimated image area, obtaining the output of the posture estimation subnetwork as a pre-estimated posture result, and obtaining the output of the behavior recognition subnetwork as a pre-estimated behavior recognition result;
calculating the loss of the behavior recognition network based on the target area, the target posture, the target behavior, the estimated image area, the estimated posture result and the estimated behavior recognition result;
based on the loss, network parameters of the behavior recognition network are adjusted.
The behavior recognition method and device and the electronic equipment provided by the embodiments of the present application can, for a video to be analyzed that satisfies spatio-temporal consistency, exploit this spatio-temporal consistency: consistency processing is performed on all targets in a single video frame and on the plurality of video frames in the video to be analyzed, and the key points of the targets do not need to be tracked based on the optical flow information of the video. The amount of computation required for behavior recognition can therefore be effectively reduced, and real-time behavior recognition is easier to achieve. Of course, not all of the advantages described above need to be achieved at the same time when practising any one product or method of the present application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a behavior recognition method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a behavior recognition network according to an embodiment of the present application;
fig. 3 is a schematic flow chart of an end-to-end behavior recognition method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a behavior recognition network according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a training method for a behavior recognition network according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a behavior recognition apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a schematic flow chart of a behavior recognition method provided in an embodiment of the present application, and the method may include:
s101, aiming at a plurality of video frames in a video to be analyzed, obtaining global image characteristics of the video frames.
The plurality of video frames may be part of video frames in the video to be analyzed, or all video frames in the video to be analyzed. For example, a plurality of video frames may be screened from the video to be analyzed according to a preset screening condition (e.g., read from the video to be analyzed according to a preset number of frames at intervals), or all the video frames in the video frames to be analyzed may be regarded as a plurality of video frames.
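Purely as an illustration (not part of the original disclosure), the frame screening described above could look like the following sketch; the interval value and the use of OpenCV for decoding are assumptions.

```python
import cv2  # assumption: OpenCV is used for decoding; any video decoder works


def sample_frames(video_path, frame_stride=8):
    """Read every `frame_stride`-th frame from the video to be analyzed.

    Passing frame_stride=1 corresponds to using all video frames.
    """
    capture = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % frame_stride == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames
```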
S102, determining an image area where each target is located in the video frame based on the global image characteristics.
For convenience of discussion, in this embodiment, the position of the image region where a target is located may be expressed in the form {x, y, w, h}, where x is the horizontal pixel coordinate of the center point of the image region, y is the vertical pixel coordinate of the center point, w is the pixel width of the image region, and h is the pixel height of the image region. In other alternative embodiments, the position of the image region where a target is located may also be expressed in other forms (for example, by the vertex coordinates of the image region).
In an alternative embodiment, the global image features may be regressed to determine candidate regions in the image where the target may exist, and the regression is performed again to screen the candidate regions to determine the image region where the target is located. In another alternative embodiment, a single regression may also be performed on the global image feature to directly determine the image area where the target is located in the image, so as to reduce the calculation resource occupied by and the time cost spent on determining the image area where the target is located in the image.
S103, extracting the regional image characteristics of the image region where each target is located from the global image characteristics as the target characteristics of the target.
In this embodiment, the region image features of the image region may be extracted from the global image features by a Region of Interest (RoI) Pooling algorithm based on the position of the input image region.
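A minimal sketch of this extraction step, assuming PyTorch-style tensors and torchvision's RoI pooling/align operator; the output size, the spatial scale and the box-format conversion are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_align  # RoI pooling/align over a feature map


def extract_target_features(global_features, boxes_xywh, output_size=7, spatial_scale=1.0 / 16):
    """Crop region image features for each target from the global image features.

    global_features: (1, C, H, W) feature map of one video frame.
    boxes_xywh:      (N, 4) target regions as {x, y, w, h} (centre + size) in pixels.
    """
    # Convert the {x, y, w, h} centre form to the (x1, y1, x2, y2) corner form.
    x, y, w, h = boxes_xywh.unbind(dim=1)
    boxes_xyxy = torch.stack([x - w / 2, y - h / 2, x + w / 2, y + h / 2], dim=1)
    # roi_align expects one box tensor per image in the batch.
    return roi_align(global_features, [boxes_xyxy], output_size, spatial_scale=spatial_scale)
```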
And S104, determining a behavior recognition result of the target based on the target characteristics.
The target features may be input to a classifier that is trained in advance to obtain a behavior recognition result. In this embodiment, the behavior recognition result may be represented in the form of behavior classes and corresponding confidences. For example, the behavior recognition result may be {80% rope skipping, 10% tug-of-war, 10% high jump}, meaning that the behavior of the target is one of rope skipping, tug-of-war and high jump, with a confidence of 80% that the target is rope skipping, a confidence of 10% that the target is in a tug-of-war, and a confidence of 10% that the target is doing a high jump.
In an alternative embodiment, the key point features of the target may be extracted from the target features, and the key point features are regressed to obtain a key point thermodynamic diagram of the target, where the key points may be different according to the target. For ease of discussion, taking the target as a person, the key points may include a plurality of joint positions of the person, such as head, left shoulder point, right shoulder point, hand, knee, foot, and the like. The thermodynamic diagram is used for representing the probability that each pixel point in the image region where the target is located is a key point, namely representing the probability distribution condition of the key point in the image region. The distribution of a plurality of joint positions of the person can represent the posture of the person, so that the thermodynamic diagram of the key points can be used as the posture estimation result of the person. And splicing the thermodynamic diagram and the target characteristics to obtain fusion image characteristics, and then regressing the obtained fusion characteristics to obtain a behavior recognition result of the target.
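The keypoint-heatmap and feature-fusion steps described above could be arranged as in the following sketch; the channel counts, the number of key points and the layer layout are assumptions made only for illustration.

```python
import torch
import torch.nn as nn


class PoseAwareBehaviorHead(nn.Module):
    """Regress a keypoint heatmap, splice it with the target features,
    then regress a behavior class distribution from the fused features."""

    def __init__(self, in_channels=256, num_keypoints=17, num_behaviors=3):
        super().__init__()
        self.keypoint_features = nn.Conv2d(in_channels, 128, kernel_size=3, padding=1)
        self.heatmap_head = nn.Conv2d(128, num_keypoints, kernel_size=1)
        self.behavior_head = nn.Sequential(
            nn.Conv2d(in_channels + num_keypoints, 256, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(256, num_behaviors),
        )

    def forward(self, target_features):
        kp = torch.relu(self.keypoint_features(target_features))
        heatmaps = torch.sigmoid(self.heatmap_head(kp))        # per-pixel keypoint probability
        fused = torch.cat([target_features, heatmaps], dim=1)  # splice heatmaps with target features
        return heatmaps, self.behavior_head(fused)             # behavior logits per target
```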
And S105, carrying out consistency processing on the behavior recognition results of all the targets in the video frame to obtain the behavior recognition result of the video frame.
The consistency processing is used for keeping the behavior recognition results of all the targets in the video frame consistent, and the consistency processing can be different according to different application scenes. Illustratively, the behavior recognition results of all the targets in the video frame may be added and averaged, and the obtained average value is used as the behavior recognition result of the video frame. In other alternative embodiments, other algorithms (such as weighted average and median) may also be used to keep the behavior recognition results of all the targets in the video frame consistent, which is not limited in this embodiment.
Depending on the actual application scenario, one or more targets may be included in the image frame to be analyzed. If only one object is included in the image frame to be analyzed, the behavior recognition result of the image frame to be analyzed may represent the behavior of the object. If a plurality of targets are included in the image frame to be analyzed, the behavior recognition result of the image frame to be analyzed may be a group behavior (such as parade and party) for representing the plurality of targets.
It can be understood that, if the video to be analyzed is obtained by shooting a monitoring scene with a spatial scale smaller than a preset spatial scale threshold, it can be considered that the behaviors of different targets in the video to be analyzed are the same, that is, the behaviors of the targets in the video to be analyzed satisfy spatial consistency, and therefore, after all target behavior recognition results are subjected to consistency processing, the behavior recognition result of the video frame can be obtained.
And S106, performing consistency processing on the behavior recognition results of the plurality of video frames to obtain the behavior recognition result of the video to be analyzed.
For the consistency processing, reference may be made to the related description in the foregoing S105, which is not described herein again. For example, the behavior recognition results of the plurality of video frames may be averaged, and the obtained average value may be used as the behavior recognition result of the video to be analyzed.
It can be understood that, if the duration of the video to be analyzed is smaller than the preset duration threshold, the time span of the behavior of the same target in the video to be analyzed may be considered to be smaller, so that the behavior of the same target may not change, that is, the behavior of the target in the video to be analyzed satisfies the time consistency. Therefore, after the behavior recognition results of the plurality of video frames are subjected to consistency processing, the behavior recognition result of the video to be analyzed can be obtained.
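Taking simple averaging as the consistency processing (one of the options mentioned above), S105 and S106 could be sketched as follows; the tensor layout is an assumption.

```python
import torch


def frame_result(per_target_results):
    """S105: average the behavior recognition results of all targets in one frame.

    per_target_results: (num_targets, num_classes) confidence per target.
    """
    return per_target_results.mean(dim=0)


def video_result(per_frame_results):
    """S106: average the behavior recognition results of the sampled frames.

    per_frame_results: (num_frames, num_classes) confidence per frame.
    """
    return per_frame_results.mean(dim=0)
```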
With this embodiment, for a video to be analyzed that satisfies spatio-temporal consistency, the spatio-temporal consistency of the video can be exploited: consistency processing is performed on all targets in a single video frame and on the plurality of video frames in the video to be analyzed, and the key points of the targets do not need to be tracked based on the optical flow information of the video, so the amount of computation required for behavior recognition can be effectively reduced and real-time behavior recognition is easier to achieve.
In an alternative embodiment, behavior recognition of each target in a video frame can be achieved by using a behavior recognition network trained in advance. In other alternative embodiments, behavior recognition of each target in the video frame may be implemented based on other machine learning algorithms.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a behavior recognition network provided in an embodiment of the present application, including:
a global feature subnetwork 110, a target detection subnetwork 120, a regional feature subnetwork 130, an attitude estimation subnetwork 140, and a behavior recognition subnetwork 150. The global feature sub-network 110 is used to extract global image features of an input image frame, and input the extracted global image features to the target detection sub-network 120 and the regional feature sub-network.
And the object detection sub-network 120 is used for determining an image area where the object in the image is located based on the input global image feature, and inputting the position of the image area to the area feature sub-network 130. And the regional characteristic sub-network 130 is used for extracting regional image characteristics of the image region from the global image characteristics and inputting the regional image characteristics to the posture estimation sub-network 140 and the behavior recognition sub-network 150. And the posture estimation sub-network 140 is configured to perform regression on the input region image features to obtain a posture estimation result of the target, and input the posture estimation result to the behavior recognition sub-network 150. In this embodiment, the pose estimation result may be represented in the form of a thermodynamic diagram of the key points, and in other alternative embodiments, the pose estimation result may be represented in other forms (for example, the positions and categories of the key points).
And the behavior recognition sub-network 150 is configured to perform regression on the input regional image features and the posture estimation result to obtain a behavior recognition result of each target in the input image frame, and output the behavior recognition result. Referring to fig. 3, a flow of performing end-to-end behavior recognition through the behavior recognition network is shown in fig. 3, where fig. 3 is a schematic flow diagram of an end-to-end behavior recognition method provided in an embodiment of the present application, and the flow diagram may include:
s301, acquiring an image frame to be analyzed.
The image frame to be analyzed may be one image frame or a plurality of image frames according to different practical application scenarios.
S302, inputting the image frame to be analyzed into a behavior recognition network to obtain the output of the behavior recognition network as the behavior recognition result of the image frame to be analyzed.
The behavior recognition network comprises a global feature sub-network, a target detection sub-network, a regional feature sub-network, an attitude estimation sub-network and a behavior recognition sub-network. The principle of each sub-network can be referred to the related description, and will not be described herein.
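A high-level sketch of this end-to-end inference, assuming the five sub-networks are exposed as callable modules; the attribute names and signatures below are hypothetical, not the patent's.

```python
import torch


def recognize_behavior(frames, net):
    """Run the behavior recognition network end to end on the frames to be analyzed.

    `net` is assumed to expose the five sub-networks described in the text.
    """
    frame_results = []
    for frame in frames:
        global_features = net.global_feature_subnet(frame)                           # 110
        target_boxes = net.target_detection_subnet(global_features)                  # 120
        target_features = net.region_feature_subnet(global_features, target_boxes)   # 130
        pose_heatmaps = net.pose_estimation_subnet(target_features)                  # 140
        per_target = net.behavior_recognition_subnet(target_features, pose_heatmaps) # 150
        frame_results.append(per_target.mean(dim=0))  # consistency over targets
    return torch.stack(frame_results).mean(dim=0)     # consistency over frames
```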
By adopting the embodiment, the image characteristics can be extracted for the target detection subnetwork, the attitude estimation subnetwork and the behavior recognition subnetwork through the global characteristic subnetwork, and the characteristic extraction subnetwork does not need to be respectively and independently arranged for the target detection subnetwork, the attitude estimation subnetwork and the behavior recognition subnetwork, so that the structural complexity of the behavior recognition network can be effectively reduced.
On the other hand, since the global feature subnetwork extracts image features for the target detection subnetwork, the posture estimation subnetwork, and the behavior recognition subnetwork, the global image features extracted by the global feature subnetwork need to meet the requirements of all three sub-networks. The image features required by the three sub-networks are not identical; for example, a part of the image features required by the behavior recognition subnetwork may not be required by the target detection subnetwork, yet these image features are still input to the target detection subnetwork, so this part of the image features can be regarded as a noise signal input to the target detection subnetwork.
In the related art, the target detection network, the posture estimation network and the behavior recognition network are three independent neural networks, and the feature extraction sub-networks in the three neural networks usually extract only the image features required by their own network. Compared with the related art, the signal-to-noise ratio of the global image features in the embodiment of the present application is lower, and this lower signal-to-noise ratio causes the behavior recognition network provided by the embodiment of the present application to assign some probability to incorrect behavior categories, so that the generalization capability of the behavior recognition network can be effectively improved.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a behavior recognition network according to an embodiment of the present application, in which a global feature sub-network 110 includes a shallow image spatial feature sub-network 111, a first middle image spatial feature sub-network 112, a second middle image spatial feature sub-network 113, a first deep image semantic feature sub-network 114, and a second deep image semantic feature sub-network 115.
The shallow image spatial feature sub-network 111 is configured to extract a shallow image spatial feature from an input image frame, and input the shallow image spatial feature to the first middle image spatial feature sub-network 112. A first middle image spatial feature subnetwork 112 for further extracting a first middle image spatial feature from the input shallow image spatial feature and inputting the extracted first middle image spatial feature to a second middle image spatial feature subnetwork 113. A second middle image spatial feature subnetwork 113 for further extracting a second middle image spatial feature from the input first middle image spatial feature and inputting the extracted second middle image spatial feature to the first deep image semantic feature subnetwork 114. A first deep image semantic feature sub-network 114 for further extracting the first deep semantic features from the inputted second mid-level image spatial features and inputting the extracted first deep semantic features to a second deep image semantic feature sub-network 115. A second deep image semantic feature sub-network 115 for further extracting second deep semantic features from the input first deep image semantic features.
It will be appreciated that one image feature is further extracted based on another image feature being more abstract, e.g. the second mid-level image spatial feature is more abstract than the first mid-level image spatial feature. In this embodiment, the global image features include a shallow image spatial feature, a first mid-level image spatial feature, a second mid-level image spatial feature, a first deep image semantic feature, and a second deep image semantic feature. The shallow image spatial feature, the first middle image spatial feature and the second middle image spatial feature are used for representing texture features and color information of the input image frame, and the first deep image semantic feature and the second deep image semantic feature are used for representing semantic features of each image area in the input image frame.
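The five cascaded feature levels could, for example, be successive stages of an ordinary convolutional backbone; the sketch below uses arbitrary channel counts and strides purely for illustration.

```python
import torch.nn as nn


def conv_stage(in_ch, out_ch, stride=2):
    """One feature stage: the output is a further abstraction of its input."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class GlobalFeatureSubnet(nn.Module):
    def __init__(self):
        super().__init__()
        self.shallow = conv_stage(3, 64)    # 111: shallow image spatial features
        self.mid1 = conv_stage(64, 128)     # 112: first middle-layer image spatial features
        self.mid2 = conv_stage(128, 256)    # 113: second middle-layer image spatial features
        self.deep1 = conv_stage(256, 512)   # 114: first deep image semantic features
        self.deep2 = conv_stage(512, 512)   # 115: second deep image semantic features

    def forward(self, image):
        shallow = self.shallow(image)
        mid1 = self.mid1(shallow)
        mid2 = self.mid2(mid1)
        deep1 = self.deep1(mid2)
        deep2 = self.deep2(deep1)
        # The global image features are the collection of all five levels.
        return shallow, mid1, mid2, deep1, deep2
```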
The target detection subnetwork 120 may be an RPN (Region Proposal Network). The structure of the target detection subnetwork 120 can vary according to actual needs. For example, in an alternative embodiment, the target detection subnetwork may be one or more convolutional layers extending from the global feature sub-network, used for determining the image region in which the target is located by performing a single regression on the global image features.
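For the single-regression variant, the detection head could be a small stack of convolutional layers densely predicting a box {x, y, w, h} and an objectness score; the sketch below is an assumption, not the exact head of the patent.

```python
import torch.nn as nn


class SingleRegressionDetector(nn.Module):
    """Predict, in one pass, a box {x, y, w, h} and an objectness score per feature-map cell."""

    def __init__(self, in_channels=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 5, kernel_size=1),  # 4 box values + 1 objectness score
        )

    def forward(self, global_features):
        return self.head(global_features)
```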
The regional feature sub-network 130 includes a first region feature extractor 131, a second region feature extractor 132, a third region feature extractor 133, and a fourth region feature extractor 134. The first region feature extractor 131 is configured to obtain the first middle-layer image spatial features and extract, from them, the region image features of the image region where a target is located. The second region feature extractor 132 is configured to obtain the second middle-layer image spatial features and extract, from them, the region image features of the image region where the target is located. The third region feature extractor 133 is configured to obtain the first deep image semantic features and extract, from them, the region image features of the image region where the target is located. The fourth region feature extractor 134 is configured to obtain the second deep image semantic features and extract, from them, the region image features of the image region where the target is located. In other alternative embodiments, the number of region feature extractors included in the regional feature sub-network may differ according to actual requirements; for example, a further region feature extractor may be included, configured to obtain the shallow image spatial features and extract, from them, the region image features of the image region where the target is located.
It can be understood that, since the region feature extractor extracts the region image features of the image region where the object is located, these region image features may be regarded as the object features of the object, and taking the object as a pedestrian as an example, the region image features extracted by the region feature extractor may be regarded as the pedestrian features.
The pose estimation sub-network 140 includes a keypoint feature sub-network 141 and a thermodynamic diagram estimation sub-network 142, where the keypoint feature sub-network 141, which may be a plurality of continuously stacked convolutional layers (the number of convolutional layers may be different according to the actual application scenario) extending from the target detection sub-network 120, is used to extract keypoint features in the image region from the region image features. The thermodynamic diagram estimating sub-network 142 may be a plurality of deconvolution layers extending from the keypoint feature sub-network 141, and is configured to perform regression on the keypoint features to obtain a thermodynamic diagram of the keypoints in the image region, which is used as a target posture estimation result.
The behavior recognition subnetwork 150 comprises a behavior classification subnetwork 151, a fusion subnetwork 152 and a video behavior recognition subnetwork 153. The behavior classification sub-network 151 is configured to splice, for each target, the thermodynamic diagrams of the key points of the target output by the thermodynamic diagram estimation sub-network 142 with the region image features of the target to obtain a fused image feature of the target, and perform regression on the obtained fused image feature of the target to obtain a behavior recognition result of the target.
And the fusion sub-network 152 is used for calculating the average value of the behavior recognition results of each target in each input video frame as the behavior recognition result of the video frame. For example, assuming that a video frame includes 3 targets, respectively labeled as person A, person B, and person C, the behavior recognition result of person A is {80% behavior class 1, 10% behavior class 2, and 10% behavior class 3}, the behavior recognition result of person B is {70% behavior class 1, 20% behavior class 2, and 10% behavior class 3}, and the behavior recognition result of person C is {90% behavior class 1, 0% behavior class 2, and 10% behavior class 3}, then the behavior recognition result of the video frame may be {80% behavior class 1, 10% behavior class 2, and 10% behavior class 3}. The fusion sub-network 152 also calculates the average value of the behavior recognition results of all the video frames as the behavior recognition result of the video to be analyzed.
The video behavior recognition sub-network 153 is configured to determine a behavior of a target in the video to be analyzed based on a behavior recognition result of the video to be analyzed. In this embodiment, the behavior category with the highest confidence may be determined as the behavior of the target in the video to be analyzed.
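As a trivial illustration of this final step (the class names and confidences are taken from the example above):

```python
behavior_classes = ["behavior class 1", "behavior class 2", "behavior class 3"]
video_confidences = [0.80, 0.10, 0.10]  # behavior recognition result of the video to be analyzed
predicted = behavior_classes[video_confidences.index(max(video_confidences))]
# predicted == "behavior class 1", the class with the highest confidence
```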
Referring to fig. 5, fig. 5 is a schematic flowchart illustrating a method for training a behavior recognition network according to an embodiment of the present application, where the method may include:
s501, inputting the sample image frame marked with the target area, the target posture and the target behavior into a behavior recognition network, and acquiring the image area determined by the target detection subnetwork, the posture estimation result output by the posture estimation subnetwork and the behavior recognition result output by the behavior recognition subnetwork.
The representation modes of the target area and the target posture may be different according to different practical application scenes. For convenience of discussion, assume that the target region is represented by { x, y, w, h }, and the target pose is represented by the position of each keypoint.
Similarly, the representation modes of the image area, the attitude estimation result and the behavior recognition result may be different according to different actual application scenes.
S502, calculating the loss of the behavior recognition network based on the target area, the target posture, the target behavior, the image area, the posture estimation result and the behavior estimation result.
It is to be understood that, among them, the target region, the target pose, and the target behavior may be regarded as true values, and the image region, the pose estimation result, and the behavior estimation result may be regarded as output values of the behavior recognition network, so that the loss of the behavior recognition network may be calculated based on the true values and the output values using a preset objective function. Different objective functions can be selected in different application scenarios, which is not limited in this embodiment.
For example, in an alternative embodiment, the objective function may be as follows:
L = αL_loc + βL_cls + λL_kps + νL_act
[The per-term formulas for L_loc, L_cls, L_kps and L_act appear only as equation images in the original publication and are not reproduced here.]
wherein α, β, λ and ν are preset weighting coefficients; t_i is the target area of the i-th target and v_i is the image area of the i-th target; D is the number of sample image frames; p_j is the confidence, output by the target detection sub-network, that the target class of the target in the d-th sample image frame is j; k is the number of key points; the thermodynamic diagram of the j-th key point is output by the pose estimation sub-network; C is the total number of behavior classes; and p_i is the probability estimated by the behavior recognition sub-network for the i-th behavior class.
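The weighted combination of the four loss terms could be computed as sketched below; since the per-term formulas are only available as images, the individual criteria (smooth L1, cross entropy, mean squared error) chosen here are assumptions, as are the dictionary keys.

```python
import torch.nn.functional as F


def behavior_network_loss(pred, target, alpha=1.0, beta=1.0, lam=1.0, nu=1.0):
    """L = alpha*L_loc + beta*L_cls + lam*L_kps + nu*L_act (weights are preset coefficients)."""
    l_loc = F.smooth_l1_loss(pred["boxes"], target["boxes"])           # image area vs. labelled target area
    l_cls = F.cross_entropy(pred["obj_logits"], target["obj_labels"])  # detection classification
    l_kps = F.mse_loss(pred["heatmaps"], target["heatmaps"])           # estimated vs. labelled pose
    l_act = F.cross_entropy(pred["act_logits"], target["act_labels"])  # behavior recognition
    return alpha * l_loc + beta * l_cls + lam * l_kps + nu * l_act
```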
S503, based on the loss, adjusting the network parameters of the behavior recognition network.
It will be appreciated that the tasks performed by the target detection sub-network, the attitude estimation sub-network, and the behavior recognition sub-network are interrelated, and therefore, the three sub-networks can be trained jointly, with faster convergence than if the three sub-networks were trained independently. In the related art, the target detection network, the posture estimation network and the behavior recognition network are three independent neural networks, and the three independent neural networks need to be trained separately. Therefore, the time cost for training the neural network for behavior recognition can be effectively reduced by adopting the embodiment.
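Joint training then amounts to back-propagating the single combined loss through all sub-networks; the optimizer choice and the data-loader interface in this sketch are assumptions.

```python
import torch


def train(behavior_net, data_loader, loss_fn, epochs=10, lr=1e-4):
    """Jointly adjust the parameters of the detection, pose and behavior sub-networks."""
    optimizer = torch.optim.Adam(behavior_net.parameters(), lr=lr)
    for _ in range(epochs):
        for frames, annotations in data_loader:      # S501: labelled sample image frames
            pred = behavior_net(frames)
            loss = loss_fn(pred, annotations)        # S502: combined weighted loss
            optimizer.zero_grad()
            loss.backward()                          # S503: adjust network parameters
            optimizer.step()
```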
Referring to fig. 6, fig. 6 is a schematic structural diagram of a behavior recognition apparatus provided in an embodiment of the present application, and the behavior recognition apparatus may include:
the global feature extraction module 601 is configured to, for a plurality of video frames in a video to be analyzed, obtain global image features of the video frames;
an image area determining module 602, configured to determine, based on the global image features, an image area in which each target in the video frame is located;
a region feature extraction module 603, configured to extract, from the global image features, a region image feature of an image region where each target is located, as a target feature of the target;
a target behavior recognition module 604, configured to determine a behavior recognition result of the target based on the target feature;
the single-frame behavior recognition module 605 performs consistency processing on the behavior recognition results of all targets in the video frame to obtain the behavior recognition result of the video frame;
the video behavior recognition module 606 performs consistency processing on the behavior recognition results of the plurality of video frames to obtain a behavior recognition result of the video to be analyzed.
In an alternative embodiment, the target behavior identification module 604 is specifically configured to extract a key point feature of the target from the target feature;
performing regression on the characteristics of the key points to obtain a key point thermodynamic diagram of the target, wherein the thermodynamic diagram is used for expressing the probability that each pixel point in an image area where the target is located is the key point;
splicing the thermodynamic diagram with the target characteristics to obtain fused image characteristics;
and (5) performing regression on the fusion image characteristics to obtain a behavior recognition result of the target.
In an alternative embodiment, the image region determining module 602 is specifically configured to perform a single regression on the global image features to determine the image region where each target in the video frame is located.
In an optional embodiment, the global feature extraction module 601 is specifically configured to input a plurality of video frames in a video to be analyzed to a global feature sub-network in a behavior recognition network, to obtain an output of the global feature sub-network, where the output is used as a global image feature of the video frame;
the image area determining module 602 is specifically configured to input the global image features into a target detection subnetwork in the behavior recognition network, to obtain an output of the target detection subnetwork, where the output is used as an image area where each target in the video frame is located;
the local feature extraction module 603 is specifically configured to input the global image features and the image region where each target is located into a regional feature subnetwork in the behavior recognition network, to obtain an output of the regional feature subnetwork, where the output is used as a target feature of the target;
a target behavior recognition module 604, configured to input target features into a posture estimation sub-network in a behavior recognition network, to obtain an output of the posture estimation sub-network as a posture estimation result of the target; and inputting the target characteristics and the attitude estimation result into a behavior recognition sub-network in the behavior recognition network to obtain the output of the behavior recognition sub-network as the behavior recognition result of the target.
In an optional embodiment, the apparatus further includes a network training module, configured to obtain the behavior recognition network by performing pre-training in the following manner:
inputting a sample video frame marked with a target area, a target posture and a target behavior into a behavior recognition network, obtaining the output of a target detection sub-network as a pre-estimated image area, obtaining the output of a posture estimation sub-network as a pre-estimated posture result, and obtaining the output of a behavior recognition sub-network as a pre-estimated behavior recognition result;
calculating the loss of the behavior recognition network based on the target area, the target posture, the target behavior, the pre-estimated image area, the pre-estimated posture result and the pre-estimated behavior recognition result;
based on the loss, network parameters of the behavior recognition network are adjusted.
An embodiment of the present application further provides an electronic device, as shown in fig. 7, which may include:
a memory 701 for storing a computer program;
the processor 702 is configured to implement the following steps when executing the program stored in the memory 701:
aiming at a plurality of video frames in a video to be analyzed, acquiring global image characteristics of the video frames;
determining an image area where each target is located in the video frame based on the global image characteristics;
extracting the regional image characteristics of the image region where each target is located from the global image characteristics to serve as the target characteristics of the target;
determining a behavior recognition result of the target based on the target characteristics;
carrying out consistency processing on the behavior recognition results of all targets in the video frame to obtain the behavior recognition result of the video frame;
and carrying out consistency processing on the behavior recognition results of the plurality of video frames to obtain the behavior recognition result of the video to be analyzed.
In an alternative embodiment, determining the behavior recognition result of the target based on the target feature includes:
extracting key point features of the target from the target features;
performing regression on the characteristics of the key points to obtain a key point thermodynamic diagram of the target, wherein the thermodynamic diagram is used for expressing the probability that each pixel point in an image area where the target is located is the key point;
splicing the thermodynamic diagram with the target characteristics to obtain fused image characteristics;
and (5) performing regression on the fusion image characteristics to obtain a behavior recognition result of the target.
In an alternative embodiment, determining the image area where each target in the video frame is located based on the global image feature includes:
and performing single regression on the global image characteristics, and determining the image area where each target is located in the video frame.
In an optional embodiment, for a plurality of video frames in a video to be analyzed, acquiring global image features of the video frames includes:
inputting a plurality of video frames in a video to be analyzed into a global feature sub-network in a behavior recognition network to obtain the output of the global feature sub-network as the global image feature of the video frames;
determining the image area where each target is located in the video frame based on the global image characteristics, including:
inputting the global image characteristics into a target detection subnetwork in the behavior recognition network to obtain the output of the target detection subnetwork, and taking the output as an image area where each target is located in the video frame;
extracting the regional image characteristics of the image region where each target is located from the global image characteristics, wherein the regional image characteristics are taken as the target characteristics of the target, and the method comprises the following steps:
inputting the global image characteristics and the image area where each target is located into an area characteristic sub-network in the behavior recognition network to obtain the output of the area characteristic sub-network as the target characteristics of the target;
determining a behavior recognition result of the target based on the target characteristics, comprising:
inputting the target characteristics into a posture estimation sub-network in the behavior recognition network to obtain the output of the posture estimation sub-network as a posture estimation result of the target;
and inputting the target characteristics and the attitude estimation result into a behavior recognition sub-network in the behavior recognition network to obtain the output of the behavior recognition sub-network as the behavior recognition result of the target.
In an alternative embodiment, the behavior recognition network is pre-trained by:
inputting a sample video frame marked with a target area, a target posture and a target behavior into a behavior recognition network, obtaining the output of a target detection sub-network as a pre-estimated image area, obtaining the output of a posture estimation sub-network as a pre-estimated posture result, and obtaining the output of a behavior recognition sub-network as a pre-estimated behavior recognition result;
calculating the loss of the behavior recognition network based on the target area, the target posture, the target behavior, the pre-estimated image area, the pre-estimated posture result and the pre-estimated behavior recognition result;
based on the loss, network parameters of the behavior recognition network are adjusted.
The Memory mentioned in the above electronic device may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
In yet another embodiment provided by the present application, a computer-readable storage medium is further provided, which stores instructions that, when executed on a computer, cause the computer to perform any of the behavior recognition methods in the above embodiments.
In yet another embodiment provided by the present application, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the behavior recognition methods of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the application to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiments of the apparatus, the electronic device, the computer-readable storage medium, and the computer program product, since they are substantially similar to the method embodiments, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiments.
The above description is only for the preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (10)

1. A method of behavior recognition, the method comprising:
for each of a plurality of video frames in a video to be analyzed, acquiring global image features of the video frame;
determining an image area where each target is located in the video frame based on the global image features;
extracting, from the global image features, regional image features of the image area where each target is located as target features of the target;
determining a behavior recognition result of the target based on the target features;
performing consistency processing on the behavior recognition results of all targets in the video frame to obtain a behavior recognition result of the video frame;
and performing consistency processing on the behavior recognition results of the plurality of video frames to obtain a behavior recognition result of the video to be analyzed.
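As a purely illustrative aid to the method of claim 1, the sketch below shows one plausible reading of the two "consistency processing" steps as majority voting, first over the targets of a frame and then over the frames of the video; the claim itself does not fix a concrete voting or fusion rule, so this choice is an assumption.

```python
# Illustrative only: majority voting over per-target labels, then over
# per-frame labels, is one plausible (assumed) realization of the claimed
# "consistency processing"; it is not the only possible one.
from collections import Counter
from typing import List


def consistency_vote(labels: List[str]) -> str:
    """Return the most frequent behavior label among the inputs."""
    return Counter(labels).most_common(1)[0][0]


def recognize_video(frame_target_labels: List[List[str]]) -> str:
    """frame_target_labels[i] holds the per-target labels of frame i."""
    frame_labels = [consistency_vote(t) for t in frame_target_labels if t]
    return consistency_vote(frame_labels)


# Example: three frames, each with per-target recognition results.
print(recognize_video([["run", "run"], ["run", "walk"], ["run"]]))  # -> "run"
```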
2. The method of claim 1, wherein determining the behavior recognition result of the target based on the target features comprises:
extracting key point features of the target from the target features;
performing regression on the key point features to obtain a key point heatmap of the target, wherein the heatmap indicates, for each pixel in the image area where the target is located, the probability that the pixel is a key point;
concatenating the heatmap with the target features to obtain fused image features;
and performing regression on the fused image features to obtain the behavior recognition result of the target.
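The following PyTorch-style sketch illustrates the key point heatmap branch of claim 2; the layer sizes, the number of key points, and the sigmoid activation are illustrative assumptions rather than details recited in the claim.

```python
# Sketch of claim 2 (hypothetical sizes): regress a key point heatmap from
# the target features, concatenate it back onto the features, then regress
# the behavior class from the fused tensor.
import torch
import torch.nn as nn


class KeypointBehaviorHead(nn.Module):
    def __init__(self, in_ch: int = 256, num_keypoints: int = 17,
                 num_behaviors: int = 10):
        super().__init__()
        # Per-pixel probability of being a key point (sigmoid is an assumption).
        self.keypoint_head = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, num_keypoints, 1), nn.Sigmoid())
        # Behavior classification from the channel-wise concatenation.
        self.behavior_head = nn.Sequential(
            nn.Conv2d(in_ch + num_keypoints, in_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, num_behaviors))

    def forward(self, target_feat: torch.Tensor):
        # target_feat: (N, C, H, W) regional features, one target per row.
        heatmap = self.keypoint_head(target_feat)          # (N, K, H, W)
        fused = torch.cat([target_feat, heatmap], dim=1)   # concatenation
        return self.behavior_head(fused), heatmap          # logits, heatmap
```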
3. The method of claim 1, wherein determining the image area where each target is located in the video frame based on the global image features comprises:
performing a single regression on the global image features to determine the image area where each target in the video frame is located.
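A minimal sketch of the single-regression detection described in claim 3, assuming a grid-cell box encoding in the style of common single-stage detectors; the specific output encoding is an assumption.

```python
# Sketch of claim 3: one regression pass over the global feature map yields
# per-cell box offsets and an objectness score (encoding assumed).
import torch
import torch.nn as nn


class SingleRegressionDetector(nn.Module):
    def __init__(self, in_ch: int = 256):
        super().__init__()
        # 4 box offsets plus 1 objectness score per feature-map cell.
        self.head = nn.Conv2d(in_ch, 5, kernel_size=1)

    def forward(self, global_feat: torch.Tensor):
        out = self.head(global_feat)        # (N, 5, H, W)
        boxes = out[:, :4]                  # per-cell box offsets
        objectness = out[:, 4].sigmoid()    # per-cell target score
        return boxes, objectness            # decoded/thresholded downstream
```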
4. The method according to claim 1, wherein the obtaining global image features of a plurality of video frames in the video to be analyzed comprises:
inputting a plurality of video frames in the video to be analyzed into a global feature sub-network in a behavior recognition network to obtain the output of the global feature sub-network as the global image features of the video frames;
the determining the image area where each target is located in the video frame based on the global image features comprises:
inputting the global image features into a target detection sub-network in the behavior recognition network to obtain the output of the target detection sub-network as the image area where each target in the video frame is located;
the extracting, from the global image features, the regional image features of the image area where each target is located as the target features of the target comprises:
inputting the global image features and the image area where each target is located into a regional feature sub-network in the behavior recognition network to obtain the output of the regional feature sub-network as the target features of the target;
the determining the behavior recognition result of the target based on the target features comprises:
inputting the target features into a pose estimation sub-network in the behavior recognition network to obtain the output of the pose estimation sub-network as a pose estimation result of the target;
and inputting the target features and the pose estimation result into a behavior recognition sub-network in the behavior recognition network to obtain the output of the behavior recognition sub-network as the behavior recognition result of the target.
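The sketch below shows one way the five sub-networks of claim 4 could be chained in a single forward pass; all class names are hypothetical, and realizing the regional feature sub-network with RoIAlign over the shared global feature map is an assumption, not something the claim mandates.

```python
# Sketch of chaining the five sub-networks of claim 4 in one forward pass.
import torch.nn as nn
import torchvision


class BehaviorRecognitionNetwork(nn.Module):
    def __init__(self, global_net, detect_net, pose_net, behavior_net,
                 roi_size: int = 7):
        super().__init__()
        self.global_net = global_net      # video frames -> global features
        self.detect_net = detect_net      # global features -> per-frame boxes
        self.pose_net = pose_net          # target features -> pose estimate
        self.behavior_net = behavior_net  # (features, pose) -> behavior
        self.roi_size = roi_size

    def forward(self, frames):
        feat = self.global_net(frames)    # (N, C, H, W)
        # List of (M_i, 4) box tensors, one per frame, in feature-map coords.
        boxes = self.detect_net(feat)
        # Regional feature sub-network realized here by cropping each target's
        # region from the shared global features, so no backbone is re-run
        # per target (an assumed implementation choice).
        target_feat = torchvision.ops.roi_align(feat, boxes,
                                                output_size=self.roi_size)
        pose = self.pose_net(target_feat)
        behavior = self.behavior_net(target_feat, pose)
        return boxes, pose, behavior
```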
5. The method of claim 4, wherein the behavior recognition network is pre-trained by:
inputting a sample video frame annotated with a target area, a target pose and a target behavior into the behavior recognition network, obtaining the output of the target detection sub-network as a predicted image area, obtaining the output of the pose estimation sub-network as a predicted pose result, and obtaining the output of the behavior recognition sub-network as a predicted behavior recognition result;
calculating a loss of the behavior recognition network based on the target area, the target pose, the target behavior, the predicted image area, the predicted pose result and the predicted behavior recognition result;
and adjusting network parameters of the behavior recognition network based on the loss.
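A minimal sketch of one joint training step consistent with claim 5, assuming the network outputs and annotations have already been aligned into tensors; the three individual loss terms and their equal weighting are assumptions, since the claim only requires that a loss be computed from the annotated and predicted quantities.

```python
# Sketch of one training step for the behavior recognition network:
# compute a combined loss over detection, pose, and behavior predictions,
# back-propagate it, and adjust the network parameters.
import torch.nn.functional as F


def train_step(model, optimizer, frames, gt_boxes, gt_pose, gt_behavior):
    optimizer.zero_grad()
    pred_boxes, pred_pose, pred_behavior = model(frames)
    loss = (F.smooth_l1_loss(pred_boxes, gt_boxes)          # detection loss
            + F.mse_loss(pred_pose, gt_pose)                # pose loss
            + F.cross_entropy(pred_behavior, gt_behavior))  # behavior loss
    loss.backward()                  # back-propagate the combined loss
    optimizer.step()                 # adjust the network parameters
    return loss.item()
```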
6. An apparatus for behavior recognition, the apparatus comprising:
the global feature extraction module is used for acquiring global image features of a plurality of video frames in a video to be analyzed;
the image area determining module is used for determining the image area where each target in the video frame is located based on the global image features;
the regional feature extraction module is used for extracting, from the global image features, regional image features of the image area where each target is located as the target features of the target;
the target behavior recognition module is used for determining a behavior recognition result of the target based on the target features;
the single-frame behavior recognition module is used for performing consistency processing on the behavior recognition results of all targets in the video frame to obtain a behavior recognition result of the video frame;
and the video behavior recognition module is used for performing consistency processing on the behavior recognition results of the plurality of video frames to obtain a behavior recognition result of the video to be analyzed.
7. The apparatus according to claim 6, wherein the target behavior recognition module is specifically configured to extract key point features of the target from the target features;
perform regression on the key point features to obtain a key point heatmap of the target, wherein the heatmap indicates, for each pixel in the image area where the target is located, the probability that the pixel is a key point;
concatenate the heatmap with the target features to obtain fused image features;
and perform regression on the fused image features to obtain the behavior recognition result of the target.
8. The apparatus according to claim 6, wherein the image area determining module is specifically configured to perform a single regression on the global image features to determine the image area where each target in the video frame is located.
9. The apparatus according to claim 6, wherein the global feature extraction module is specifically configured to input a plurality of video frames in the video to be analyzed into a global feature sub-network in a behavior recognition network, and obtain the output of the global feature sub-network as the global image features of the video frames;
the image area determining module is specifically configured to input the global image features into a target detection sub-network in the behavior recognition network, and obtain the output of the target detection sub-network as the image area where each target in the video frame is located;
the regional feature extraction module is specifically configured to input the global image features and the image area where each target is located into a regional feature sub-network in the behavior recognition network, and obtain the output of the regional feature sub-network as the target features of the target;
the target behavior recognition module is specifically configured to input the target features into a pose estimation sub-network in the behavior recognition network, and obtain the output of the pose estimation sub-network as a pose estimation result of the target;
and input the target features and the pose estimation result into a behavior recognition sub-network in the behavior recognition network to obtain the output of the behavior recognition sub-network as the behavior recognition result of the target.
10. The apparatus of claim 9, further comprising a network training module configured to pre-train the behavior recognition network by:
inputting a sample video frame annotated with a target area, a target pose and a target behavior into the behavior recognition network, obtaining the output of the target detection sub-network as a predicted image area, obtaining the output of the pose estimation sub-network as a predicted pose result, and obtaining the output of the behavior recognition sub-network as a predicted behavior recognition result;
calculating a loss of the behavior recognition network based on the target area, the target pose, the target behavior, the predicted image area, the predicted pose result and the predicted behavior recognition result;
and adjusting network parameters of the behavior recognition network based on the loss.
CN201910245567.8A 2019-03-28 2019-03-28 Behavior recognition method and device and electronic equipment Active CN111753590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910245567.8A CN111753590B (en) 2019-03-28 2019-03-28 Behavior recognition method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111753590A true CN111753590A (en) 2020-10-09
CN111753590B CN111753590B (en) 2023-10-17

Family

ID=72671835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910245567.8A Active CN111753590B (en) 2019-03-28 2019-03-28 Behavior recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111753590B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030007682A1 (en) * 2001-05-02 2003-01-09 Takamasa Koshizen Image recognizing apparatus and method
CN101833650A (en) * 2009-03-13 2010-09-15 清华大学 Video copy detection method based on contents
US8885887B1 (en) * 2012-01-23 2014-11-11 Hrl Laboratories, Llc System for object detection and recognition in videos using stabilization
CN104794446A (en) * 2015-04-22 2015-07-22 中南民族大学 Human body action recognition method and system based on synthetic descriptors
CN105550678A (en) * 2016-02-03 2016-05-04 武汉大学 Human body motion feature extraction method based on global remarkable edge area
CN107563345A (en) * 2017-09-19 2018-01-09 桂林安维科技有限公司 A kind of human body behavior analysis method based on time and space significance region detection
CN107784282A (en) * 2017-10-24 2018-03-09 北京旷视科技有限公司 The recognition methods of object properties, apparatus and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
UGUR HALICI et al.: "Mixture of Poses for human behavior understanding", 2013 6th International Congress on Image and Signal Processing (CISP), pages 12-16 *
LI Yonggang et al.: "Spatio-temporally consistent video event recognition based on deep residual dual unidirectional DLSTM", Chinese Journal of Computers, vol. 41, no. 12, pages 2852-2864 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580629A (en) * 2020-12-23 2021-03-30 深圳市捷顺科技实业股份有限公司 License plate character recognition method based on deep learning and related device
CN112784765A (en) * 2021-01-27 2021-05-11 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for recognizing motion
CN114359802A (en) * 2021-12-30 2022-04-15 南京景瑞康分子医药科技有限公司 Method and device for processing image sequence
CN114528923A (en) * 2022-01-25 2022-05-24 山东浪潮科学研究院有限公司 Video target detection method, device, equipment and medium based on time domain context
CN114528923B (en) * 2022-01-25 2023-09-26 山东浪潮科学研究院有限公司 Video target detection method, device, equipment and medium based on time domain context

Also Published As

Publication number Publication date
CN111753590B (en) 2023-10-17

Similar Documents

Publication Publication Date Title
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
WO2020221278A1 (en) Video classification method and model training method and apparatus thereof, and electronic device
CN108140032B (en) Apparatus and method for automatic video summarization
CN111738231B (en) Target object detection method and device, computer equipment and storage medium
RU2693906C2 (en) Rule-based analysis of video importance
WO2020098606A1 (en) Node classification method, model training method, device, apparatus, and storage medium
CN110276406B (en) Expression classification method, apparatus, computer device and storage medium
CN111311475A (en) Detection model training method and device, storage medium and computer equipment
CN111753590B (en) Behavior recognition method and device and electronic equipment
CN114389966B (en) Network traffic identification method and system based on graph neural network and stream space-time correlation
CN112749726B (en) Training method and device for target detection model, computer equipment and storage medium
CN111225234A (en) Video auditing method, video auditing device, equipment and storage medium
CN113591527A (en) Object track identification method and device, electronic equipment and storage medium
Vergara et al. On the fusion of non-independent detectors
GB2409029A (en) Face detection
CN112101114B (en) Video target detection method, device, equipment and storage medium
CN112131944B (en) Video behavior recognition method and system
CN113065379B (en) Image detection method and device integrating image quality and electronic equipment
JP2021516825A (en) Methods, systems and programs for integration and automatic switching of crowd estimation methods
CN115797735A (en) Target detection method, device, equipment and storage medium
CN111860196A (en) Hand operation action scoring device and method and computer readable storage medium
CN111950507B (en) Data processing and model training method, device, equipment and medium
CN115187884A (en) High-altitude parabolic identification method and device, electronic equipment and storage medium
CN115705706A (en) Video processing method, video processing device, computer equipment and storage medium
CN113762041A (en) Video classification method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant