CN111753590B - Behavior recognition method and device and electronic equipment

Info

Publication number
CN111753590B
CN111753590B
Authority
CN
China
Prior art keywords
target
network
behavior recognition
image
sub
Prior art date
Legal status
Active
Application number
CN201910245567.8A
Other languages
Chinese (zh)
Other versions
CN111753590A (en)
Inventor
王轩瀚
周纪强
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201910245567.8A priority Critical patent/CN111753590B/en
Publication of CN111753590A publication Critical patent/CN111753590A/en
Application granted granted Critical
Publication of CN111753590B publication Critical patent/CN111753590B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the application provide a behavior recognition method and device and electronic equipment. The method comprises the following steps: acquiring global image features of a plurality of video frames in a video to be analyzed; determining the image area where each target in a video frame is located based on the global image features; extracting, from the global image features, the regional image features of the image area where each target is located as the target features of that target; determining a behavior recognition result of the target based on the target features; performing consistency processing on the behavior recognition results of all targets in the video frame to obtain the behavior recognition result of the video frame; and performing consistency processing on the behavior recognition results of the plurality of video frames to obtain the behavior recognition result of the video to be analyzed. The amount of computation required for behavior recognition can be effectively reduced, making real-time behavior recognition easier to achieve.

Description

Behavior recognition method and device and electronic equipment
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a behavior recognition method and apparatus, and an electronic device.
Background
In some application scenarios, behavior analysis may be performed on a video in order to determine the behavior of targets in the video. In the related art, this may be done based on the optical flow information of the video: pose estimation is performed on the video frames to track the key points of a target, a pose sequence of the target over several consecutive video frames is obtained, and the behavior of the target is determined based on that pose sequence.
However, this approach relies on the optical flow information of the video, which is often very large and can consume considerable computing resources, making real-time behavior recognition difficult to achieve.
Disclosure of Invention
The embodiment of the application aims to provide a behavior recognition method, which is used for reducing the calculation amount required by realizing behavior recognition and further realizing real-time behavior recognition. The specific technical scheme is as follows:
in a first aspect of an embodiment of the present application, there is provided a behavior recognition method, including:
acquiring global image characteristics of a plurality of video frames in a video to be analyzed;
determining an image area where each target in the video frame is located based on the global image features;
extracting the regional image characteristics of the image region where each target is located from the global image characteristics as target characteristics of the target;
determining a behavior recognition result of the target based on the target characteristics;
consistency processing is carried out on the behavior recognition results of all targets in the video frame, so that the behavior recognition results of the video frame are obtained;
and carrying out consistency processing on the behavior recognition results of the plurality of video frames to obtain the behavior recognition results of the video to be analyzed.
With reference to the first aspect, in a first possible implementation manner, the determining, based on the target feature, a behavior recognition result of the target includes:
extracting key point features of the target from the target features;
performing regression on the key point features to obtain a key point heatmap of the target, wherein the heatmap represents the probability that each pixel in the image area where the target is located is a key point;
concatenating the heatmap with the target features to obtain fused image features;
and performing regression on the fused image features to obtain the behavior recognition result of the target.
With reference to the first aspect, in a second possible implementation manner, the determining, based on the global image feature, an image area where each target in the video frame is located includes:
and carrying out single regression on the global image characteristics to determine an image area where each target in the video frame is located.
With reference to the first aspect, in a third possible implementation manner, the obtaining, for a plurality of video frames in the video to be analyzed, global image features of the video frames includes:
inputting a plurality of video frames in the video to be analyzed into a global feature sub-network in a behavior recognition network to obtain the output of the global feature sub-network as the global image feature of the video frames;
the determining, based on the global image feature, an image area where each target in the video frame is located includes:
inputting the global image features into a target detection sub-network in the behavior recognition network to obtain the output of the target detection sub-network, wherein the output is used as an image area where each target in the video frame is located;
the extracting the regional image feature of the image region where each target is located from the global image features as the target feature of the target comprises the following steps:
inputting the global image features and the image region where each target is located into a region feature sub-network in the behavior recognition network to obtain the output of the region feature sub-network as the target feature of the target;
the determining the behavior recognition result of the target based on the target characteristics comprises the following steps:
inputting the target features into a pose estimation sub-network in the behavior recognition network to obtain the output of the pose estimation sub-network as a pose estimation result of the target;
and inputting the target features and the pose estimation result into a behavior recognition sub-network in the behavior recognition network to obtain the output of the behavior recognition sub-network as the behavior recognition result of the target.
With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner, the behavior recognition network is pre-trained by:
inputting a sample video frame annotated with a target area, a target pose and a target behavior into the behavior recognition network, obtaining the output of the target detection sub-network as a predicted image area, obtaining the output of the pose estimation sub-network as a predicted pose result, and obtaining the output of the behavior recognition sub-network as a predicted behavior recognition result;
calculating the loss of the behavior recognition network based on the target area, the target pose, the target behavior, the predicted image area, the predicted pose result and the predicted behavior recognition result;
Based on the loss, network parameters of the behavior recognition network are adjusted.
In a second aspect of the embodiments of the present application, there is provided a behavior recognition apparatus, the apparatus comprising:
the global feature extraction module is used for acquiring global image features of a plurality of video frames in the video to be analyzed;
the image area determining module is used for determining an image area where each target in the video frame is located based on the global image characteristics;
the regional characteristic extraction module is used for extracting regional image characteristics of the image region where each target is located from the global image characteristics and taking the regional image characteristics as target characteristics of the target;
the target behavior recognition module is used for determining a behavior recognition result of the target based on the target characteristics;
the single-frame behavior recognition module is used for carrying out consistency processing on the behavior recognition results of all targets in the video frame to obtain the behavior recognition results of the video frame;
and the video behavior recognition module is used for carrying out consistency processing on the behavior recognition results of the plurality of video frames to obtain the behavior recognition results of the video to be analyzed.
With reference to the second aspect, in a first possible implementation manner, the target behavior recognition module is specifically configured to extract key point features of the target from the target features;
perform regression on the key point features to obtain a key point heatmap of the target, wherein the heatmap represents the probability that each pixel in the image area where the target is located is a key point;
concatenate the heatmap with the target features to obtain fused image features;
and perform regression on the fused image features to obtain the behavior recognition result of the target.
With reference to the second aspect, in a second possible implementation manner, the image area determining module is specifically configured to determine the image area where each target in the video frame is located by performing a single regression on the global image features.
With reference to the second aspect, in a third possible implementation manner, the global feature extraction module is specifically configured to input a plurality of video frames in the video to be analyzed to a global feature sub-network in a behavior recognition network, so as to obtain an output of the global feature sub-network, where the output is used as a global image feature of the video frames;
the image area determining module is specifically configured to input the global image feature to a target detection sub-network in the behavior recognition network, and obtain an output of the target detection sub-network as an image area where each target in the video frame is located;
the regional feature extraction module is specifically configured to input the global image features and the image area where each target is located into a regional feature sub-network in the behavior recognition network, so as to obtain the output of the regional feature sub-network as the target features of the target;
the target behavior recognition module is specifically configured to input the target features into a pose estimation sub-network in the behavior recognition network, so as to obtain the output of the pose estimation sub-network as a pose estimation result of the target;
and inputting the target features and the pose estimation result into a behavior recognition sub-network in the behavior recognition network to obtain the output of the behavior recognition sub-network as the behavior recognition result of the target.
With reference to the second aspect, in a fourth possible implementation manner, the apparatus further includes a network training module, configured to perform pre-training to obtain the behavior recognition network by:
inputting a sample video frame annotated with a target area, a target pose and a target behavior into the behavior recognition network, obtaining the output of the target detection sub-network as a predicted image area, obtaining the output of the pose estimation sub-network as a predicted pose result, and obtaining the output of the behavior recognition sub-network as a predicted behavior recognition result;
calculating the loss of the behavior recognition network based on the target area, the target pose, the target behavior, the predicted image area, the predicted pose result and the predicted behavior recognition result;
based on the loss, network parameters of the behavior recognition network are adjusted.
According to the behavior recognition method and device and the electronic equipment provided by the embodiments of the application, for a video to be analyzed that satisfies spatio-temporal consistency, the spatio-temporal consistency of the video is exploited by performing consistency processing on all targets in a single video frame and performing consistency processing across the plurality of video frames of the video to be analyzed. As a result, the key points of the targets do not need to be tracked based on the optical flow information of the video, the amount of computation required for behavior recognition can be effectively reduced, and real-time behavior recognition is easier to achieve. Of course, it is not necessary for any product or method embodying the application to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a behavior recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a behavior recognition network according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of an end-to-end behavior recognition method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of another architecture of a behavior recognition network according to an embodiment of the present application;
FIG. 5 is a flowchart of a training method of a behavior recognition network according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a behavior recognition device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, fig. 1 is a schematic flow chart of a behavior recognition method according to an embodiment of the present application, which may include:
S101, acquiring global image characteristics of a plurality of video frames in a video to be analyzed.
The plurality of video frames may be some of the video frames in the video to be analyzed, or all of them. The plurality of video frames may be selected from the video to be analyzed according to a preset filtering condition (for example, reading frames from the video at a preset frame interval), or all video frames of the video to be analyzed may be used as the plurality of video frames.
S102, determining an image area where each target in the video frame is located based on the global image characteristics.
For ease of discussion, in this embodiment the position of the image region in which a target is located is expressed as {x, y, w, h}, where x is the horizontal pixel coordinate of the center point of the region, y is the vertical pixel coordinate of the center point, w is the pixel width of the region and h is its pixel height. In other alternative embodiments, the position of the image region in which a target is located may be expressed in other forms (e.g., using the vertex coordinates of the region).
In an alternative embodiment, the global image features may be regressed to determine candidate regions in the image where the target may exist, and regressed again to screen the candidate regions to determine the image region in which the target is located. In another alternative embodiment, a single regression may be performed on the global image feature to directly determine the image area where the target is located in the image, so as to reduce the computational resources and the time cost spent in determining the image area where the target is located in the image.
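For illustration only, the following is a minimal PyTorch-style sketch of such a single-regression detection head; the module name, channel counts and thresholding strategy are assumptions made for the example and not part of the embodiment. A single convolution pass over the global feature map predicts, at every spatial location, an objectness score and a box encoded as {x, y, w, h}, so no separate candidate-region refinement step is needed.

```python
import torch
import torch.nn as nn

class SingleRegressionHead(nn.Module):
    """Hypothetical single-regression detection head (illustrative only).

    One convolution over the global image features predicts, per location,
    an objectness score and a box {x, y, w, h}, avoiding a second
    candidate-screening regression.
    """

    def __init__(self, in_channels: int = 256):
        super().__init__()
        self.objectness = nn.Conv2d(in_channels, 1, kernel_size=1)  # target vs. background
        self.box = nn.Conv2d(in_channels, 4, kernel_size=1)         # x, y, w, h per location

    def forward(self, global_features: torch.Tensor):
        scores = torch.sigmoid(self.objectness(global_features))
        boxes = self.box(global_features)
        return scores, boxes

# Usage: locations whose objectness exceeds a threshold are kept as target regions.
head = SingleRegressionHead(256)
scores, boxes = head(torch.randn(1, 256, 45, 80))  # global features of one video frame
```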
S103, extracting the regional image characteristics of the image region where each target is located from the global image characteristics as the target characteristics of the target.
In this embodiment, the regional image features of an image region may be extracted from the global image features by a region-of-interest pooling (RoI Pooling) algorithm, based on the position of the image region provided as input.
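As an illustration of this step, the sketch below uses torchvision's RoI Align (an interpolation-based variant of region-of-interest pooling), assuming torchvision is available; the function name, spatial scale and output size are assumptions chosen for the example.

```python
import torch
from torchvision.ops import roi_align

def extract_target_features(global_features, boxes_xywh, spatial_scale=1.0 / 16, output_size=7):
    """Extract the regional image features of each detected target.

    global_features: (1, C, H, W) global feature map of one video frame.
    boxes_xywh:      (N, 4) center-form boxes {x, y, w, h} in image pixels.
    Returns an (N, C, output_size, output_size) tensor of target features.
    """
    x, y, w, h = boxes_xywh.unbind(dim=1)
    corners = torch.stack([x - w / 2, y - h / 2, x + w / 2, y + h / 2], dim=1)
    rois = torch.cat([torch.zeros(len(corners), 1), corners], dim=1)  # prepend batch index 0
    return roi_align(global_features, rois, output_size=output_size,
                     spatial_scale=spatial_scale)
```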
S104, determining a behavior recognition result of the target based on the target characteristics.
The target features may be input into a pre-trained classifier to obtain the behavior recognition result. In this embodiment, the behavior recognition result may be represented as behavior categories with corresponding confidences. An exemplary behavior recognition result may be {80% rope skipping, 10% tug-of-war, 10% high jump}, meaning that the behavior of the target is one of rope skipping, tug-of-war and high jump, with a confidence of 80% that the behavior is rope skipping, 10% that it is tug-of-war, and 10% that it is the high jump.
In an alternative embodiment, the key point features of the target may be extracted from the target features, and regression may be performed on the key point features to obtain a key point heatmap of the target, where the key points may differ depending on the target. For ease of discussion, taking a person as an example, the key points may include a plurality of joint positions of the person, such as the head, left and right shoulder points, hands, knees and feet. The heatmap represents the probability that each pixel in the image area where the target is located is a key point, i.e., the probability distribution of the key points over the image area. Since the distribution of the joint positions of a person characterizes the person's pose, the key point heatmap can be used as the pose estimation result of that person. The heatmap is concatenated with the target features to obtain fused image features, and regression is then performed on the fused features to obtain the behavior recognition result of the target.
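A minimal sketch of this alternative embodiment is given below, assuming a PyTorch implementation; the layer sizes, the number of key points and the number of behavior categories are illustrative assumptions. The head regresses per-pixel key point heatmaps from the target features, concatenates the heatmaps with the target features along the channel dimension, and regresses the fused features into behavior confidences.

```python
import torch
import torch.nn as nn

class TargetBehaviorHead(nn.Module):
    """Illustrative head: regress key point heatmaps, fuse them with the
    target features by channel-wise concatenation, then classify the behavior."""

    def __init__(self, in_channels=256, num_keypoints=17, num_behaviors=3):
        super().__init__()
        self.keypoint_head = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, num_keypoints, 1), nn.Sigmoid())   # per-pixel key point probability
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_channels + num_keypoints, num_behaviors))

    def forward(self, target_features):
        heatmaps = self.keypoint_head(target_features)           # (N, K, h, w)
        fused = torch.cat([target_features, heatmaps], dim=1)    # concatenate heatmaps and features
        confidences = self.classifier(fused).softmax(dim=-1)     # behavior confidences per target
        return heatmaps, confidences
```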
S105, carrying out consistency processing on the behavior recognition results of all targets in the video frame to obtain the behavior recognition result of the video frame.
The consistency processing is used to make the behavior recognition results of all targets in the video frame consistent, and the specific processing may differ depending on the application scenario. For example, the behavior recognition results of all targets in the video frame may be summed and averaged, and the resulting average taken as the behavior recognition result of the video frame, as sketched below. In other alternative embodiments, other algorithms (such as weighted averaging or taking the median) may be used to make the behavior recognition results of all targets in the video frame consistent, which is not limited in this embodiment.
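The following is a minimal sketch of the averaging approach, using hypothetical confidence values; the function names are assumptions introduced only for this example.

```python
import torch

def frame_result(per_target_scores: torch.Tensor) -> torch.Tensor:
    """Average the behavior confidences of all targets in one video frame."""
    return per_target_scores.mean(dim=0)

def video_result(per_frame_scores) -> torch.Tensor:
    """Average the per-frame results over all analysed frames (used in S106)."""
    return torch.stack(per_frame_scores).mean(dim=0)

# Three targets, behavior categories {rope skipping, tug-of-war, high jump}.
targets = torch.tensor([[0.8, 0.1, 0.1],
                        [0.7, 0.2, 0.1],
                        [0.9, 0.0, 0.1]])
print(frame_result(targets))  # tensor([0.8000, 0.1000, 0.1000])
```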
Depending on the actual application scenario, the image frame to be analyzed may contain one or more targets. If only one object is included in the image frame to be analyzed, the behavior recognition result of the image frame to be analyzed may represent the behavior of the object. If a plurality of targets are included in the image frame to be analyzed, the behavior recognition result of the image frame to be analyzed may be group behaviors (such as parades, pairs) for representing the plurality of targets.
It can be understood that if the video to be analyzed is obtained by shooting a monitoring scene with a spatial scale smaller than a preset spatial scale threshold, the behaviors of different targets in the video to be analyzed can be considered to be the same, i.e. the behaviors of the targets in the video to be analyzed meet the spatial consistency, so that the behavior recognition result of the video frame can be obtained after the consistency processing is performed on all the target behavior recognition results.
S106, performing consistency processing on the behavior recognition results of the plurality of video frames to obtain the behavior recognition result of the video to be analyzed.
For consistency processing, reference may be made to the description of the foregoing S105, and the details are not repeated here. For example, the behavior recognition results of a plurality of video frames may be averaged, and the obtained average value is used as the behavior recognition result of the video frame to be analyzed.
It can be understood that if the duration of the video to be analyzed is shorter than a preset duration threshold, the time span of the behavior of the same target in the video can be considered small, so the behavior of the same target can be assumed not to change, i.e., the behavior of the targets in the video to be analyzed satisfies temporal consistency. Therefore, after the behavior recognition results of the plurality of video frames are subjected to consistency processing, the behavior recognition result of the video to be analyzed can be obtained.
With this embodiment, for a video to be analyzed that satisfies spatio-temporal consistency, the spatio-temporal consistency of the video is exploited by performing consistency processing on all targets in a single video frame and performing consistency processing across the plurality of video frames of the video to be analyzed. The key points of the targets therefore do not need to be tracked based on the optical flow information of the video, the amount of computation required for behavior recognition can be effectively reduced, and real-time behavior recognition is easier to achieve.
In an alternative embodiment, the behavior recognition of the individual objects in the video frame may be implemented using a pre-trained behavior recognition network. In other alternative embodiments, behavior recognition of each object in the video frame may be implemented based on other machine learning algorithms.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a behavior recognition network according to an embodiment of the present application, including:
a global feature sub-network 110, a target detection sub-network 120, a regional feature sub-network 130, a pose estimation sub-network 140 and a behavior recognition sub-network 150. The global feature sub-network 110 is configured to extract the global image features of an input image frame and input the extracted global image features to the target detection sub-network 120 and the regional feature sub-network 130.
The target detection sub-network 120 is configured to determine the image area where each target is located in the image based on the input global image features, and to input the position of that image area to the regional feature sub-network 130. The regional feature sub-network 130 is configured to extract the regional image features of the image area from the global image features and to input the regional image features to the pose estimation sub-network 140 and the behavior recognition sub-network 150. The pose estimation sub-network 140 is configured to perform regression on the input regional image features to obtain a pose estimation result of the target, and to input the pose estimation result to the behavior recognition sub-network 150. In this embodiment the pose estimation result may be expressed in the form of key point heatmaps; in other alternative embodiments it may be expressed in other forms (e.g., the locations and categories of the key points).
The behavior recognition sub-network 150 is configured to perform regression on the input regional image features and the pose estimation result, obtain the behavior recognition result of each target in the input image frame, and output the behavior recognition result. The flow of end-to-end behavior recognition through the behavior recognition network is shown in fig. 3, which is a schematic flow diagram of an end-to-end behavior recognition method provided by an embodiment of the present application and may include:
S301, acquiring an image frame to be analyzed.
The image frame to be analyzed can be one image frame or a plurality of image frames according to different practical application scenes.
S302, inputting the image frames to be analyzed into a behavior recognition network to obtain the output of the behavior recognition network, and taking the output as the behavior recognition result of the image frames to be analyzed.
The behavior recognition network comprises a global feature sub-network, a target detection sub-network, a regional feature sub-network, a pose estimation sub-network and a behavior recognition sub-network. The principle of each sub-network has been described above and is not repeated here.
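For orientation, the following is a minimal sketch of how the five sub-networks could be wired together in a single forward pass; all module names are hypothetical stand-ins for the components described with reference to fig. 2, and the actual network of the embodiment may differ.

```python
import torch.nn as nn

class BehaviorRecognitionNetwork(nn.Module):
    """Illustrative end-to-end wiring of the five sub-networks (assumed interfaces)."""

    def __init__(self, backbone, detector, roi_extractor, pose_head, behavior_head):
        super().__init__()
        self.backbone = backbone              # global feature sub-network 110
        self.detector = detector              # target detection sub-network 120
        self.roi_extractor = roi_extractor    # regional feature sub-network 130
        self.pose_head = pose_head            # pose estimation sub-network 140
        self.behavior_head = behavior_head    # behavior recognition sub-network 150

    def forward(self, frame):
        global_features = self.backbone(frame)
        boxes = self.detector(global_features)                     # image area of each target
        target_features = self.roi_extractor(global_features, boxes)
        heatmaps = self.pose_head(target_features)                 # pose estimation result
        behaviors = self.behavior_head(target_features, heatmaps)  # behavior recognition result
        return boxes, heatmaps, behaviors
```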
With this embodiment, the global feature sub-network extracts the image features used by the target detection sub-network, the pose estimation sub-network and the behavior recognition sub-network, so no separate feature extraction sub-network needs to be provided for each of these three sub-networks, which effectively reduces the structural complexity of the behavior recognition network.
On the other hand, since the global feature sub-network extracts image features for the target detection sub-network, the pose estimation sub-network and the behavior recognition sub-network, the global image features it extracts must satisfy the requirements of all three sub-networks. The image features required by the three sub-networks are not identical; for example, some of the image features required by the behavior recognition sub-network may not be required by the target detection sub-network, yet they are still input to the target detection sub-network, so this part of the image features can be regarded as a noise signal for the target detection sub-network.
In the related art, the target detection network, the pose estimation network and the behavior recognition network are three independent neural networks, and the feature extraction sub-network in each of them extracts only the image features required by that network. Compared with the related art, the global image features in the embodiment of the application therefore have a lower signal-to-noise ratio. During training, this lower signal-to-noise ratio causes the behavior recognition network to distribute some probability onto behavior categories other than the annotated one, which has a regularizing effect, so the generalization capability of the behavior recognition network can be effectively improved.
Referring to fig. 4, fig. 4 is a schematic diagram of another structure of a behavior recognition network according to an embodiment of the present application, in which the global feature sub-network 110 includes a shallow image spatial feature sub-network 111, a first middle-layer image spatial feature sub-network 112, a second middle-layer image spatial feature sub-network 113, a first deep image semantic feature sub-network 114, and a second deep image semantic feature sub-network 115.
The shallow image spatial feature sub-network 111 is configured to extract shallow image spatial features from the input image frame and input them to the first middle-layer image spatial feature sub-network 112. The first middle-layer image spatial feature sub-network 112 further extracts first middle-layer image spatial features from the shallow image spatial features and inputs them to the second middle-layer image spatial feature sub-network 113. The second middle-layer image spatial feature sub-network 113 further extracts second middle-layer image spatial features and inputs them to the first deep image semantic feature sub-network 114. The first deep image semantic feature sub-network 114 further extracts first deep image semantic features from the second middle-layer image spatial features and inputs them to the second deep image semantic feature sub-network 115. The second deep image semantic feature sub-network 115 further extracts second deep image semantic features from the first deep image semantic features.
It will be appreciated that an image feature further extracted from another image feature is more abstract than it; for example, the second middle-layer image spatial features are more abstract than the first middle-layer image spatial features. In this embodiment, the global image features include the shallow image spatial features, the first and second middle-layer image spatial features, and the first and second deep image semantic features. The shallow and middle-layer image spatial features characterize the texture and color information of the input image frame, while the first and second deep image semantic features characterize the semantic features of the various image areas in the input image frame.
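A compact sketch of such a staged global feature sub-network is shown below; the channel widths and the use of single strided 3x3 convolutions per stage are assumptions, and in practice each stage would typically contain several convolution layers.

```python
import torch.nn as nn

def conv_stage(in_ch, out_ch):
    """One down-sampling stage (illustrative; a real stage would be deeper)."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU())

class GlobalFeatureSubNetwork(nn.Module):
    """Sketch of sub-network 110: each stage refines the previous one, and all
    intermediate outputs together form the global image features."""

    def __init__(self):
        super().__init__()
        self.shallow = conv_stage(3, 64)    # shallow image spatial features (111)
        self.mid1 = conv_stage(64, 128)     # first middle-layer spatial features (112)
        self.mid2 = conv_stage(128, 256)    # second middle-layer spatial features (113)
        self.deep1 = conv_stage(256, 512)   # first deep semantic features (114)
        self.deep2 = conv_stage(512, 512)   # second deep semantic features (115)

    def forward(self, frame):
        s = self.shallow(frame)
        m1 = self.mid1(s)
        m2 = self.mid2(m1)
        d1 = self.deep1(m2)
        d2 = self.deep2(d1)
        return {"shallow": s, "mid1": m1, "mid2": m2, "deep1": d1, "deep2": d2}
```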
The target detection sub-network 120 may be an RPN (Region Proposal Network). The structure of the target detection sub-network 120 may differ depending on the actual requirements. For example, in an alternative embodiment, the target detection sub-network may be one or more convolution layers extending from the global feature sub-network, used to determine the image region in which a target is located by a single regression over the global image features.
The regional feature sub-network 130 includes a first region feature extractor 131, a second region feature extractor 132, a third region feature extractor 133 and a fourth region feature extractor 134. The first region feature extractor 131 is configured to obtain the first middle-layer image spatial features and extract from them the regional image features of the image region where the target is located. The second region feature extractor 132 is configured to obtain the second middle-layer image spatial features and extract from them the regional image features of the image region where the target is located. The third region feature extractor 133 is configured to obtain the first deep image semantic features and extract from them the regional image features of the image region where the target is located. The fourth region feature extractor 134 is configured to obtain the second deep image semantic features and extract from them the regional image features of the image region where the target is located. In other optional embodiments, the number of region feature extractors included in the regional feature sub-network may differ according to actual requirements; illustratively, a further region feature extractor may be included for acquiring the shallow image spatial features and extracting from them the regional image features of the image region where the target is located.
It will be appreciated that, since the region image features of the image region where the target is located are extracted by the region feature extractor, these region image features may be regarded as target features of the target, and, taking the target as a pedestrian as an example, the region image features extracted by the region feature extractor may be regarded as pedestrian features.
The pose estimation sub-network 140 includes a key point feature sub-network 141 and a heatmap estimation sub-network 142. The key point feature sub-network 141 may be a plurality of consecutively stacked convolution layers (the number of convolution layers may differ according to the actual application scenario) extending from the target detection sub-network 120, used to extract the key point features in the image region from the regional image features. The heatmap estimation sub-network 142 may be a plurality of deconvolution layers extending from the key point feature sub-network 141, used to perform regression on the key point features to obtain heatmaps of the key points in the image area as the pose estimation result of the target.
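The sketch below illustrates this convolution-plus-deconvolution layout; the layer counts, channel widths and sigmoid output are assumptions made for the example.

```python
import torch.nn as nn

class PoseEstimationSubNetwork(nn.Module):
    """Sketch of sub-network 140: stacked convolutions (141) extract key point
    features, stacked deconvolutions (142) regress them into key point heatmaps."""

    def __init__(self, in_channels=256, num_keypoints=17):
        super().__init__()
        self.keypoint_features = nn.Sequential(               # key point feature sub-network 141
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU())
        self.heatmap_head = nn.Sequential(                    # heatmap estimation sub-network 142
            nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, num_keypoints, 1), nn.Sigmoid())   # per-pixel key point probability

    def forward(self, target_features):
        # (N, C, h, w) regional image features -> (N, num_keypoints, 2h, 2w) heatmaps
        return self.heatmap_head(self.keypoint_features(target_features))
```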
The behavior recognition sub-network 150 includes a behavior classification sub-network 151, a fusion sub-network 152 and a video behavior recognition sub-network 153. The behavior classification sub-network 151 is configured to, for each target, concatenate the key point heatmap of that target output by the heatmap estimation sub-network 142 with the regional image features of the target to obtain the fused image features of the target, and to perform regression on the fused image features to obtain the behavior recognition result of the target.
The fusion sub-network 152 is configured to calculate, for each input video frame, the average of the behavior recognition results of the targets in that video frame as the behavior recognition result of the video frame. For example, assuming one video frame contains 3 targets, denoted person A, person B and person C, with behavior recognition results {80% behavior class 1, 10% behavior class 2, 10% behavior class 3}, {70% behavior class 1, 20% behavior class 2, 10% behavior class 3} and {90% behavior class 1, 0% behavior class 2, 10% behavior class 3} respectively, the behavior recognition result of the video frame may be {80% behavior class 1, 10% behavior class 2, 10% behavior class 3}. The average of the behavior recognition results of all the video frames is likewise calculated and used as the behavior recognition result of the video to be analyzed.
The video behavior recognition sub-network 153 is configured to determine a behavior of a target in the video to be analyzed based on a behavior recognition result of the video to be analyzed. In this embodiment, the behavior of the target in the video to be analyzed may be determined based on the behavior category with the highest confidence.
Referring to fig. 5, fig. 5 is a schematic flow chart of a training method of a behavior recognition network according to an embodiment of the present application, which may include:
S501, inputting a sample image frame annotated with a target area, a target pose and a target behavior into the behavior recognition network, and acquiring the image area determined by the target detection sub-network, the pose estimation result output by the pose estimation sub-network, and the behavior recognition result output by the behavior recognition sub-network.
The representations of the target area and the target pose may differ depending on the actual application scenario. For ease of discussion, assume that the target area is represented as {x, y, w, h} and the target pose is represented by the location of each key point.
Similarly, the representations of the image area, the pose estimation result and the behavior recognition result may differ depending on the actual application scenario.
S502, calculating the loss of the behavior recognition network based on the target area, the target pose, the target behavior, the image area, the pose estimation result and the behavior recognition result.
It can be understood that the target area, the target pose and the target behavior can be regarded as ground-truth values, while the image area, the pose estimation result and the behavior recognition result can be regarded as the output values of the behavior recognition network, so the loss of the behavior recognition network can be calculated from the ground-truth values and the output values using a preset objective function. Different objective functions may be chosen in different application scenarios, which is not limited in this embodiment.
Illustratively, in an alternative embodiment, the objective function may be as follows:
L = αL_loc + βL_cls + λL_kps + νL_act
where α, β, λ and ν are preset weighting coefficients and L_loc, L_cls, L_kps and L_act denote the localization, detection classification, key point and behavior classification losses, respectively. In the definitions of these sub-losses, t_i is the annotated target area of the i-th target, v_i is the predicted image area of the i-th target, D is the number of sample image frames, p_j is the confidence output by the target detection sub-network that the class of the target in the d-th sample image frame is j, k is the number of key points, the key point term uses the heatmap of the j-th key point output by the pose estimation sub-network, C is the total number of behavior categories, and p_i is the probability predicted by the behavior recognition sub-network for the i-th behavior class.
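A minimal sketch of how such a weighted objective could be combined for joint training is shown below; the default coefficient values and the way the individual sub-losses are computed are assumptions not specified by the formula above.

```python
def total_loss(loss_loc, loss_cls, loss_kps, loss_act,
               alpha=1.0, beta=1.0, lam=1.0, nu=1.0):
    """Weighted sum of the four sub-losses: L = a*L_loc + b*L_cls + l*L_kps + v*L_act.

    loss_loc: localization loss of the target detection sub-network,
    loss_cls: detection classification loss,
    loss_kps: key point heatmap loss of the pose estimation sub-network,
    loss_act: behavior classification loss of the behavior recognition sub-network.
    """
    return alpha * loss_loc + beta * loss_cls + lam * loss_kps + nu * loss_act

# Joint training step (S503): one backward pass updates all sub-networks together.
# loss = total_loss(l_loc, l_cls, l_kps, l_act)
# loss.backward()
# optimizer.step()
```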
And S503, adjusting network parameters of the behavior recognition network based on the loss.
It will be appreciated that the tasks performed by the target detection sub-network, the pose estimation sub-network and the behavior recognition sub-network are interrelated; therefore, the three sub-networks can be trained jointly, with a faster convergence rate than training them separately. In the related art, the target detection network, the pose estimation network and the behavior recognition network are three independent neural networks that must be trained separately. Therefore, this embodiment can effectively reduce the time cost of training the neural networks used for behavior recognition.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a behavior recognition device according to an embodiment of the present application, which may include:
the global feature extraction module 601 is configured to obtain global image features of a plurality of video frames in a video to be analyzed, for the video frames;
an image region determining module 602, configured to determine, based on the global image features, an image region in which each target in the video frame is located;
the region feature extraction module 603 is configured to extract, from the global image features, a region image feature of an image region where each target is located, as a target feature of the target;
a target behavior recognition module 604, configured to determine a behavior recognition result of the target based on the target feature;
the single-frame behavior recognition module 605 performs consistency processing on the behavior recognition results of all targets in the video frame to obtain the behavior recognition result of the video frame;
the video behavior recognition module 606 performs consistency processing on the behavior recognition results of the plurality of video frames to obtain behavior recognition results of the video to be analyzed.
In an alternative embodiment, the target behavior recognition module 604 is specifically configured to extract key point features of the target from the target features;
perform regression on the key point features to obtain a key point heatmap of the target, wherein the heatmap represents the probability that each pixel in the image area where the target is located is a key point;
concatenate the heatmap with the target features to obtain fused image features;
and perform regression on the fused image features to obtain the behavior recognition result of the target.
In an alternative embodiment, the image region determining module 602 is specifically configured to determine the image region in which each target in the video frame is located by performing a single regression on the global image features.
In an alternative embodiment, the global feature extraction module 601 is specifically configured to input a plurality of video frames in a video to be analyzed to a global feature sub-network in the behavior recognition network, to obtain an output of the global feature sub-network, and to use the output as a global image feature of the video frames;
the image area determining module 602 is specifically configured to input global image features to a target detection sub-network in the behavior recognition network, and obtain an output of the target detection sub-network as an image area where each target in the video frame is located;
the regional feature extraction module 603 is specifically configured to input the global image features and the image area where each target is located into a regional feature sub-network in the behavior recognition network, and obtain the output of the regional feature sub-network as the target features of the target;
the target behavior recognition module 604 is specifically configured to input the target features into a pose estimation sub-network in the behavior recognition network, and obtain the output of the pose estimation sub-network as a pose estimation result of the target; and to input the target features and the pose estimation result into a behavior recognition sub-network in the behavior recognition network, and obtain the output of the behavior recognition sub-network as the behavior recognition result of the target.
In an alternative embodiment, the apparatus further comprises a network training module for pre-training to obtain the behavior recognition network by:
inputting a sample video frame annotated with a target area, a target pose and a target behavior into the behavior recognition network, obtaining the output of the target detection sub-network as a predicted image area, obtaining the output of the pose estimation sub-network as a predicted pose result, and obtaining the output of the behavior recognition sub-network as a predicted behavior recognition result;
calculating the loss of the behavior recognition network based on the target area, the target pose, the target behavior, the predicted image area, the predicted pose result and the predicted behavior recognition result;
based on the losses, network parameters of the behavior recognition network are adjusted.
The embodiment of the application also provides an electronic device, as shown in fig. 7, which may include:
A memory 701 for storing a computer program;
the processor 702 is configured to execute the program stored in the memory 701, and implement the following steps:
acquiring global image characteristics of a plurality of video frames in a video to be analyzed;
determining an image area where each target in the video frame is located based on the global image features;
extracting the regional image characteristics of the image region where each target is located from the global image characteristics as target characteristics of the target;
determining a behavior recognition result of the target based on the target characteristics;
consistency processing is carried out on the behavior recognition results of all targets in the video frame, so that the behavior recognition results of the video frame are obtained;
and carrying out consistency processing on the behavior recognition results of the plurality of video frames to obtain the behavior recognition results of the video to be analyzed.
In an alternative embodiment, determining a behavior recognition result of the target based on the target feature includes:
extracting key point features of the target from the target features;
performing regression on the key point features to obtain a key point heatmap of the target, wherein the heatmap represents the probability that each pixel in the image area where the target is located is a key point;
concatenating the heatmap with the target features to obtain fused image features;
and performing regression on the fused image features to obtain the behavior recognition result of the target.
In an alternative embodiment, determining the image region in which each object in the video frame is located based on the global image features includes:
and carrying out single regression on the global image characteristics to determine the image area where each target in the video frame is located.
In an alternative embodiment, for a plurality of video frames in a video to be analyzed, acquiring global image features of the video frames includes:
inputting a plurality of video frames in the video to be analyzed into a global feature sub-network in a behavior recognition network to obtain the output of the global feature sub-network as the global image feature of the video frames;
determining, based on the global image feature, an image region in which each target in the video frame is located, including:
inputting the global image characteristics into a target detection sub-network in a behavior recognition network to obtain the output of the target detection sub-network, wherein the output is used as an image area where each target in the video frame is located;
extracting the regional image characteristics of the image region where each target is located from the global image characteristics as the target characteristics of the target, wherein the regional image characteristics comprise:
Inputting the global image features and the image region where each target is located into a region feature sub-network in a behavior recognition network to obtain the output of the region feature sub-network as the target features of the targets;
determining a behavior recognition result of the target based on the target features, including:
inputting the target features into a pose estimation sub-network in the behavior recognition network to obtain the output of the pose estimation sub-network as a pose estimation result of the target;
and inputting the target features and the pose estimation result into a behavior recognition sub-network in the behavior recognition network to obtain the output of the behavior recognition sub-network as the behavior recognition result of the target.
In an alternative embodiment, the behavior recognition network is pre-trained by:
inputting a sample video frame annotated with a target area, a target pose and a target behavior into the behavior recognition network, obtaining the output of the target detection sub-network as a predicted image area, obtaining the output of the pose estimation sub-network as a predicted pose result, and obtaining the output of the behavior recognition sub-network as a predicted behavior recognition result;
calculating the loss of the behavior recognition network based on the target area, the target pose, the target behavior, the predicted image area, the predicted pose result and the predicted behavior recognition result;
Based on the losses, network parameters of the behavior recognition network are adjusted.
The Memory mentioned in the electronic device may include a random access Memory (Random Access Memory, RAM) or may include a Non-Volatile Memory (NVM), such as at least one magnetic disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processing, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
In yet another embodiment of the present application, a computer readable storage medium is provided, in which instructions are stored, which when run on a computer, cause the computer to perform any of the behavior recognition methods of the above embodiments.
In yet another embodiment of the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the behavior recognition methods of the above embodiments.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present application, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, digital Subscriber Line (DSL)), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for embodiments of the apparatus, electronic device, computer readable storage medium, computer program product, the description is relatively simple as it is substantially similar to the method embodiments, where relevant see also part of the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (8)

1. A method of behavior recognition, the method comprising:
acquiring, for a plurality of video frames in a video to be analyzed, a global image feature of each of the plurality of video frames;
determining, based on the global image features, an image region in which each target in each of the plurality of video frames is located;
extracting, from the global image features, the regional image feature of the image region in which each target is located as the target feature of that target;
determining a behavior recognition result of each target based on the target features;
performing consistency processing on the behavior recognition results of all targets in each of the plurality of video frames to obtain a behavior recognition result of each of the plurality of video frames;
performing consistency processing on the behavior recognition results of the plurality of video frames to obtain a behavior recognition result of the video to be analyzed;
wherein the determining a behavior recognition result of each target based on the target features comprises:
extracting a keypoint feature of each target from the target features;
performing regression on the keypoint features to obtain a keypoint heatmap of each target, wherein the heatmap represents the probability that each pixel in the image region in which the target is located is a keypoint;
concatenating the heatmap with the target feature to obtain a fused image feature;
and performing regression on the fused image feature to obtain the behavior recognition result of each target.
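Purely as an illustration of the steps recited at the end of claim 1 (keypoint-feature extraction, heatmap regression, concatenation with the target feature, behavior regression, and the two consistency-processing stages), the following Python/PyTorch sketch shows one possible reading. The layer shapes, the keypoint and class counts, and the use of majority voting for consistency processing are assumptions of this example, not details disclosed by the application.

    # Minimal sketch, assuming PyTorch and illustrative shapes; not the patented implementation.
    import torch
    import torch.nn as nn
    from collections import Counter

    NUM_KEYPOINTS, NUM_ACTIONS = 17, 10   # assumed values for the example

    class PoseGuidedBehaviorHead(nn.Module):
        def __init__(self, in_channels=256):
            super().__init__()
            self.keypoint_conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)  # keypoint features
            self.heatmap_head = nn.Conv2d(in_channels, NUM_KEYPOINTS, 1)            # heatmap regression
            self.classifier = nn.Sequential(                                        # behavior regression
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(in_channels + NUM_KEYPOINTS, NUM_ACTIONS),
            )

        def forward(self, target_features):
            # target_features: (num_targets, C, H, W) regional features of one frame
            kp = torch.relu(self.keypoint_conv(target_features))
            heatmap = torch.sigmoid(self.heatmap_head(kp))         # per-pixel keypoint probability
            fused = torch.cat([target_features, heatmap], dim=1)   # channel-wise concatenation
            return self.classifier(fused)                          # one score vector per target

    def consistency(labels):
        """Assumed consistency processing: majority vote over a list of labels."""
        return Counter(labels).most_common(1)[0][0]

    head = PoseGuidedBehaviorHead()
    frame_results = []
    for _ in range(8):                                    # 8 analyzed frames
        logits = head(torch.randn(3, 256, 7, 7))          # 3 targets per frame (dummy features)
        target_labels = logits.argmax(dim=1).tolist()     # behavior result per target
        frame_results.append(consistency(target_labels))  # frame-level result
    video_result = consistency(frame_results)             # video-level result
    print(video_result)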
2. The method according to claim 1, wherein the determining, based on the global image features, an image region in which each target in each of the plurality of video frames is located comprises:
performing a single regression on the global image features to determine the image region in which each target in each of the plurality of video frames is located.
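As an illustration of the "single regression" of claim 2, the sketch below shows an assumed single-shot detection head that regresses, in one pass over the global image features, box offsets and an objectness score per grid cell. The anchor count and the box encoding are assumptions of the example, not the application's design.

    # Assumed single-shot (one-regression) detection head; shapes are illustrative.
    import torch
    import torch.nn as nn

    class SingleRegressionDetector(nn.Module):
        def __init__(self, in_channels=256, num_anchors=3):
            super().__init__()
            # One convolution regresses, per grid cell and anchor,
            # 4 box offsets + 1 objectness score in a single pass.
            self.head = nn.Conv2d(in_channels, num_anchors * 5, 1)

        def forward(self, global_feature):
            # global_feature: (B, C, H, W) -> (B, num_anchors*5, H, W)
            return self.head(global_feature)

    det = SingleRegressionDetector()
    out = det(torch.randn(1, 256, 20, 20))
    print(out.shape)  # torch.Size([1, 15, 20, 20])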
3. The method according to claim 1, wherein the acquiring, for the plurality of video frames in the video to be analyzed, a global image feature of each of the plurality of video frames comprises:
inputting the plurality of video frames in the video to be analyzed into a global feature sub-network in a behavior recognition network, and obtaining the output of the global feature sub-network as the global image feature of each of the plurality of video frames;
the determining, based on the global image features, an image region in which each target in each of the plurality of video frames is located comprises:
inputting the global image features into a target detection sub-network in the behavior recognition network, and obtaining the output of the target detection sub-network as the image region in which each target in each of the plurality of video frames is located;
the extracting, from the global image features, the regional image feature of the image region in which each target is located as the target feature of that target comprises:
inputting the global image features and the image region in which each target is located into a region feature sub-network in the behavior recognition network, and obtaining the output of the region feature sub-network as the target feature of each target;
the determining a behavior recognition result of each target based on the target features comprises:
inputting the target features into a pose estimation sub-network in the behavior recognition network, and obtaining the output of the pose estimation sub-network as a pose estimation result of each target;
and inputting the target features and the pose estimation results into a behavior recognition sub-network in the behavior recognition network, and obtaining the output of the behavior recognition sub-network as the behavior recognition result of each target.
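The sub-network chain of claim 3 (global feature sub-network, target detection sub-network, region feature sub-network, pose estimation sub-network, behavior recognition sub-network) could be wired as in the following sketch. The backbone layers, the use of RoIAlign for the region feature sub-network, and all shapes are assumptions of the example standing in for whatever sub-networks the application actually uses; the detected boxes are supplied externally here to keep the example short.

    # Assumed wiring of the sub-networks; not the patented architecture.
    import torch
    import torch.nn as nn
    from torchvision.ops import roi_align

    class BehaviorRecognitionNet(nn.Module):
        def __init__(self, num_keypoints=17, num_actions=10):
            super().__init__()
            # Global feature sub-network: shared backbone over the whole frame (stride 4).
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(),
            )
            # Pose estimation sub-network: regional feature -> keypoint heatmap.
            self.pose_head = nn.Conv2d(256, num_keypoints, 1)
            # Behavior recognition sub-network: regional feature + heatmap -> class scores.
            self.action_head = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(256 + num_keypoints, num_actions),
            )

        def forward(self, frame, boxes):
            # frame: (1, 3, H, W); boxes: (N, 5) rows of [batch_idx, x1, y1, x2, y2]
            global_feat = self.backbone(frame)                      # global image features
            # Region feature sub-network: crop per-target features from the global features.
            target_feat = roi_align(global_feat, boxes, output_size=(7, 7),
                                    spatial_scale=0.25)
            heatmap = torch.sigmoid(self.pose_head(target_feat))    # pose estimation result
            fused = torch.cat([target_feat, heatmap], dim=1)
            return self.action_head(fused)                          # per-target behavior scores

    net = BehaviorRecognitionNet()
    boxes = torch.tensor([[0, 10.0, 10.0, 60.0, 120.0]])            # one detected target
    scores = net(torch.randn(1, 3, 224, 224), boxes)
    print(scores.shape)  # torch.Size([1, 10])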
4. The method according to claim 3, wherein the behavior recognition network is pre-trained by:
inputting a sample video frame annotated with a target region, a target pose and a target behavior into the behavior recognition network, obtaining the output of the target detection sub-network as a predicted image region, obtaining the output of the pose estimation sub-network as a predicted pose result, and obtaining the output of the behavior recognition sub-network as a predicted behavior recognition result;
calculating a loss of the behavior recognition network based on the target region, the target pose, the target behavior, the predicted image region, the predicted pose result and the predicted behavior recognition result;
and adjusting network parameters of the behavior recognition network based on the loss.
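One way to read the pre-training procedure of claim 4 is as a joint loss over the three annotated quantities. The sketch below assumes the network returns predicted boxes, heatmaps and behavior logits for a labelled sample frame, and uses an unweighted sum of smooth-L1, MSE and cross-entropy losses; the loss choices, their weighting, and the dummy network are assumptions of the example, not the application's choices.

    # Assumed joint training step; not the patented training procedure.
    import torch
    import torch.nn.functional as F

    def training_step(net, optimizer, frame, gt_boxes, gt_heatmaps, gt_actions):
        # Forward pass through all sub-networks of the behavior recognition network.
        pred_boxes, pred_heatmaps, pred_logits = net(frame)

        # One loss term per annotated quantity: target region, target pose, target behavior.
        loss_det = F.smooth_l1_loss(pred_boxes, gt_boxes)      # predicted vs. labelled regions
        loss_pose = F.mse_loss(pred_heatmaps, gt_heatmaps)     # predicted vs. labelled heatmaps
        loss_act = F.cross_entropy(pred_logits, gt_actions)    # predicted vs. labelled behaviors
        loss = loss_det + loss_pose + loss_act                 # assumed unweighted sum

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                       # adjust the network parameters
        return loss.item()

    # Tiny dummy network so the step can be executed stand-alone.
    class DummyNet(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.w = torch.nn.Linear(3 * 32 * 32, 4 + 17 * 64 + 10)
        def forward(self, frame):
            out = self.w(frame.flatten(1))
            return out[:, :4], out[:, 4:4 + 17 * 64].view(-1, 17, 8, 8), out[:, -10:]

    net, frame = DummyNet(), torch.randn(2, 3, 32, 32)
    opt = torch.optim.SGD(net.parameters(), lr=0.01)
    gt = (torch.randn(2, 4), torch.randn(2, 17, 8, 8), torch.randint(0, 10, (2,)))
    print(training_step(net, opt, frame, *gt))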
5. A behavior recognition apparatus, the apparatus comprising:
a global feature extraction module, configured to acquire, for a plurality of video frames in a video to be analyzed, a global image feature of each of the plurality of video frames;
an image region determination module, configured to determine, based on the global image features, an image region in which each target in each of the plurality of video frames is located;
a region feature extraction module, configured to extract, from the global image features, the regional image feature of the image region in which each target is located as the target feature of that target;
a target behavior recognition module, configured to determine a behavior recognition result of each target based on the target features;
a single-frame behavior recognition module, configured to perform consistency processing on the behavior recognition results of all targets in each of the plurality of video frames to obtain a behavior recognition result of each of the plurality of video frames;
a video behavior recognition module, configured to perform consistency processing on the behavior recognition results of the plurality of video frames to obtain a behavior recognition result of the video to be analyzed;
wherein the target behavior recognition module is specifically configured to extract a keypoint feature of each target from the target features;
perform regression on the keypoint features to obtain a keypoint heatmap of each target, wherein the heatmap represents the probability that each pixel in the image region in which the target is located is a keypoint;
concatenate the heatmap with the target feature to obtain a fused image feature;
and perform regression on the fused image feature to obtain the behavior recognition result of each target.
6. The apparatus according to claim 5, wherein the image region determination module is specifically configured to perform a single regression on the global image features to determine the image region in which each target in each of the plurality of video frames is located.
7. The apparatus according to claim 5, wherein the global feature extraction module is specifically configured to input the plurality of video frames in the video to be analyzed into a global feature sub-network in a behavior recognition network, and obtain the output of the global feature sub-network as the global image feature of each of the plurality of video frames;
the image region determination module is specifically configured to input the global image features into a target detection sub-network in the behavior recognition network, and obtain the output of the target detection sub-network as the image region in which each target in each of the plurality of video frames is located;
the region feature extraction module is specifically configured to input the global image features and the image region in which each target is located into a region feature sub-network in the behavior recognition network, and obtain the output of the region feature sub-network as the target feature of each target;
the target behavior recognition module is specifically configured to input the target features into a pose estimation sub-network in the behavior recognition network, and obtain the output of the pose estimation sub-network as a pose estimation result of each target;
and input the target features and the pose estimation results into a behavior recognition sub-network in the behavior recognition network, and obtain the output of the behavior recognition sub-network as the behavior recognition result of each target.
8. The apparatus according to claim 7, further comprising a network training module configured to pre-train the behavior recognition network by:
inputting a sample video frame annotated with a target region, a target pose and a target behavior into the behavior recognition network, obtaining the output of the target detection sub-network as a predicted image region, obtaining the output of the pose estimation sub-network as a predicted pose result, and obtaining the output of the behavior recognition sub-network as a predicted behavior recognition result;
calculating a loss of the behavior recognition network based on the target region, the target pose, the target behavior, the predicted image region, the predicted pose result and the predicted behavior recognition result;
and adjusting network parameters of the behavior recognition network based on the loss.
CN201910245567.8A 2019-03-28 2019-03-28 Behavior recognition method and device and electronic equipment Active CN111753590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910245567.8A CN111753590B (en) 2019-03-28 2019-03-28 Behavior recognition method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910245567.8A CN111753590B (en) 2019-03-28 2019-03-28 Behavior recognition method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111753590A CN111753590A (en) 2020-10-09
CN111753590B true CN111753590B (en) 2023-10-17

Family

ID=72671835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910245567.8A Active CN111753590B (en) 2019-03-28 2019-03-28 Behavior recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111753590B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580629A (en) * 2020-12-23 2021-03-30 深圳市捷顺科技实业股份有限公司 License plate character recognition method based on deep learning and related device
CN112784765B (en) * 2021-01-27 2022-06-14 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for recognizing motion
CN114359802A (en) * 2021-12-30 2022-04-15 南京景瑞康分子医药科技有限公司 Method and device for processing image sequence
CN114528923B (en) * 2022-01-25 2023-09-26 山东浪潮科学研究院有限公司 Video target detection method, device, equipment and medium based on time domain context

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4159794B2 (en) * 2001-05-02 2008-10-01 本田技研工業株式会社 Image processing apparatus and method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833650A (en) * 2009-03-13 2010-09-15 清华大学 Video copy detection method based on contents
US8885887B1 (en) * 2012-01-23 2014-11-11 Hrl Laboratories, Llc System for object detection and recognition in videos using stabilization
CN104794446A (en) * 2015-04-22 2015-07-22 中南民族大学 Human body action recognition method and system based on synthetic descriptors
CN105550678A (en) * 2016-02-03 2016-05-04 武汉大学 Human body motion feature extraction method based on global remarkable edge area
CN107563345A (en) * 2017-09-19 2018-01-09 桂林安维科技有限公司 A kind of human body behavior analysis method based on time and space significance region detection
CN107784282A (en) * 2017-10-24 2018-03-09 北京旷视科技有限公司 The recognition methods of object properties, apparatus and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Mixture of Poses for human behavior understanding; Ugur Halici et al.; 2013 6th International Congress on Image and Signal Processing (CISP); pp. 12-16 *
Spatio-temporally consistent video event recognition based on deep residual dual-unidirectional DLSTM; Li Yonggang et al.; Chinese Journal of Computers (计算机学报); Vol. 41, No. 12; pp. 2852-2864 *

Also Published As

Publication number Publication date
CN111753590A (en) 2020-10-09

Similar Documents

Publication Publication Date Title
CN111753590B (en) Behavior recognition method and device and electronic equipment
CN111507343B (en) Training of semantic segmentation network and image processing method and device thereof
Shen et al. Multiobject tracking by submodular optimization
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
CN106845487B (en) End-to-end license plate identification method
CN107529650B (en) Closed loop detection method and device and computer equipment
US9767570B2 (en) Systems and methods for computer vision background estimation using foreground-aware statistical models
WO2020098606A1 (en) Node classification method, model training method, device, apparatus, and storage medium
Ma et al. Salient object detection via multiple instance joint re-learning
Kalsotra et al. Background subtraction for moving object detection: explorations of recent developments and challenges
EP1542155A1 (en) Object detection
CN108960114A (en) Human body recognition method and device, computer readable storage medium and electronic equipment
CN109859236B (en) Moving object detection method, system, computing device and storage medium
CN112749726B (en) Training method and device for target detection model, computer equipment and storage medium
CN113591527A (en) Object track identification method and device, electronic equipment and storage medium
EP1542152A1 (en) Object detection
EP1542154A2 (en) Object detection
Jiang et al. A self-attention network for smoke detection
CN112101114B (en) Video target detection method, device, equipment and storage medium
Jie et al. Anytime recognition with routing convolutional networks
JP2018185724A (en) Device, program and method for tracking object using pixel change processing image
CN115797735A (en) Target detection method, device, equipment and storage medium
Lecca et al. Comprehensive evaluation of image enhancement for unsupervised image description and matching
Feng et al. A novel saliency detection method for wild animal monitoring images with WMSN
CN111079930A (en) Method and device for determining quality parameters of data set and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant