WO2017129020A1 - Human behaviour recognition method and apparatus in video, and computer storage medium

Info

Publication number: WO2017129020A1
Authority: WIPO (PCT)
Application number: PCT/CN2017/071574
Other languages: French (fr), Chinese (zh)
Inventors: 姜育刚, 张殿凯, 沈琳, 瞿广财, 赵瑞伟, 雷晨雨
Original Assignee: 中兴通讯股份有限公司 (ZTE Corporation)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Definitions

  • the present invention relates to the field of video recognition technologies, and in particular, to a method, device and computer storage medium for human behavior recognition in video.
  • the existing video behavior analysis technology mainly includes three steps of detection, tracking and recognition.
  • the traditional method mainly extracts manually defined visual features, such as color histograms, SIFT and HoG, and then detects, tracks and classifies the target based on these features.
  • because the descriptive power of such manually defined features is limited, the recognition performance that can be achieved is often limited.
  • deep network models are also used to perform behavior detection and recognition in pictures or videos, since deep network models can learn better representations; examples include temporal models such as 3D-CNN, RCNN and two-stream networks.
  • these existing deep-network-based video classification methods are mainly general-purpose algorithms.
  • the prior art therefore has certain deficiencies and room for improvement; for example, in a monitoring scene, the behaviors of different types of people should be treated differently during identification.
  • some behaviors can be quickly identified from static images, such as fighting and cycling.
  • other behaviors are more regular over time, and analysis of continuous image frames is more helpful in distinguishing them, such as walking and (slow) running.
  • using a single model, as in the prior art, cannot take both aspects into account simultaneously, which affects real-time performance and accuracy.
  • embodiments of the present invention provide a method, an apparatus, and a computer storage medium for human body behavior recognition in a video.
  • for the human body region whose predicted value is a human body category, a behavior category score of the target in the human body region is calculated;
  • according to the behavior category score, the corresponding behavior category is output.
  • outputting the corresponding behavior category according to the behavior category score includes:
  • if the behavior category score is not higher than a threshold of a preset behavior category, calculating and outputting the corresponding behavior category in combination with the human body running track information.
  • calculating the behavior category score of the target in the human body region whose predicted value is a human body category includes:
  • calculating the behavior category score of the target of the human body region by combining the background area information corresponding to the background image and the neighboring target information.
  • calculating and outputting the corresponding behavior category in combination with the human body running track information includes:
  • weighting and summing the behavior category score and the result of the sequential superposition, and outputting the corresponding behavior category.
  • calculating the predicted value corresponding to the human body region according to the human body region and filtering the human body region whose predicted value is a non-human body category includes:
  • if the predicted value is a non-human body category, filtering the human body region whose predicted value is a non-human body category out of the acquired human body regions;
  • if the predicted value is a human body category, performing the step of calculating the behavior category score of the target in the human body region of the human body category.
  • detecting the human body region in the to-be-identified video and acquiring the human body running track information in the human body region includes: acquiring the to-be-identified video, detecting the human body region in it, and tracking pedestrians in the human body region to obtain the human body running track information.
  • the embodiment of the invention further provides a device for recognizing human behavior in a video, the device comprising:
  • the detecting module is configured to detect a human body region in the to-be-identified video, and acquire information about the human body running track in the human body region;
  • the filtering module is configured to calculate a predicted value corresponding to the human body region according to the human body region, and filter the human body region whose predicted value is a non-human body category, to obtain the human body region whose predicted value is a human body category;
  • a calculation module configured to calculate, according to the human body region whose predicted value is a human body category, the behavior category score of the target in the human body region of the human body category;
  • An output module configured to output a corresponding behavior category according to the behavior category score.
  • the output module is configured to output the behavior category if the behavior category score is higher than a threshold of a preset behavior category; and, if the behavior category score is not higher than the threshold of the preset behavior category, to calculate and output the corresponding behavior category in combination with the human body running track information.
  • the calculation module is configured to acquire a background image of the human body region whose predicted value is a human body category, obtain description information of the background image, calculate the background area information corresponding to the background image according to that description information, calculate the neighboring target information corresponding to the background image, and calculate the behavior category score of the target of the human body region by combining the background area information corresponding to the background image with the neighboring target information.
  • the output module is configured to acquire a current-time image of the to-be-identified video and a tracking area image corresponding to the human body running track information; sequentially superimpose the current-time image and the tracking area image; weight and sum the behavior category score and the sequentially superimposed result; and output the corresponding behavior category.
  • the filtering module is configured to acquire the human body region and perform analysis, and output a predicted value corresponding to the human body region; if the predicted value is a non-human body category, the human body region whose predicted value is a non-human body category is filtered out of the acquired human body regions; if the predicted value is a human body category, the behavior category score of the target in the human body region of the human body category is calculated.
  • the detecting module is configured to acquire the to-be-identified video, detect a human body region in the to-be-identified video, and track pedestrians in the human body region to obtain the human body running track information in the human body region.
  • Embodiments of the present invention also provide a computer storage medium comprising a set of instructions that, when executed, cause at least one processor to perform a method of human behavior recognition in the video described above.
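  • To make the division of labor concrete, the following is a minimal Python sketch of how the four modules above could be wired together; the class, method and parameter names are illustrative assumptions, not from the patent:

```python
class BehaviorRecognizer:
    """Sketch of the device embodiment: detecting, filtering,
    calculation and output modules chained into one pipeline."""

    def __init__(self, detecting, filtering, calculation, output):
        self.detecting = detecting      # detects human regions and tracks them
        self.filtering = filtering      # drops regions predicted non-human (M1)
        self.calculation = calculation  # scores behavior categories (M2)
        self.output = output            # thresholds or fuses with track info (M3)

    def run(self, video):
        regions, tracks = self.detecting(video)
        human_regions = self.filtering(regions)
        scores = self.calculation(human_regions)
        return self.output(scores, tracks)
```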
  • Embodiments of the present invention provide a method, a device, and a computer storage medium for recognizing human behavior in a video.
  • the human body region in the video to be identified is detected and the human body running track information in the human body region is acquired; the predicted value corresponding to the human body region is calculated according to the human body region; the human body regions whose predicted value is a non-human body category are filtered out, obtaining the human body region whose predicted value is a human body category; that region is calculated to obtain the behavior category score of the target in it; and, according to the behavior category score, the corresponding behavior category is output. This solves the prior-art problems of poor human behavior recognition performance in video, poor real-time performance and low accuracy, and realizes real-time and accurate video recognition.
  • FIG. 1 is a schematic flow chart of a first embodiment of a method for human behavior recognition in the video of the present invention
  • FIG. 2 is a schematic structural diagram of a deep network model based on non-sequential input in an embodiment of the present invention
  • FIG. 3 is a schematic structural diagram of a behavior recognition network model based on non-sequential input, fusion background and neighboring target features according to an embodiment of the present invention
  • FIG. 4 is a schematic structural diagram of a behavior recognition network model based on time series input, fusion background and neighboring target features according to an embodiment of the present invention
  • FIG. 5 is a schematic flowchart of a step of outputting a corresponding behavior category according to the behavior category score in the embodiment of the present invention
  • FIG. 6 is a schematic flowchart of a step of calculating, according to an embodiment of the present invention, the behavior category score of a target in a human body region whose predicted value is a human body category;
  • FIG. 7 is a schematic flowchart of a step of calculating and outputting a corresponding behavior category in combination with the human body running track information according to an embodiment of the present invention
  • FIG. 8 is a schematic flowchart of a step of filtering a body region that is predicted to be a non-human body type according to the predicted value corresponding to the human body region calculated according to the human body region in the embodiment of the present invention
  • FIG. 9 is a schematic flowchart of a step of detecting a human body region in a video to be recognized and acquiring human body running track information in the human body region according to an embodiment of the present invention
  • FIG. 10 is a schematic diagram of functional modules of a first embodiment of a device for human behavior recognition in the video of the present invention.
  • the human body region in the video to be identified is detected, and the human body running track information in the human body region is acquired; the predicted value corresponding to the human body region is calculated according to the human body region, and the human body region whose predicted value is a non-human body category is filtered out, obtaining the human body region whose predicted value is a human body category; that human body region is calculated to obtain the behavior category score of the target in it; according to the behavior category score, the corresponding behavior category is output.
  • a first embodiment of the present invention provides a method for human behavior recognition in a video, including:
  • Step S1 detecting a human body region in the to-be-identified video, and acquiring human body running track information in the human body region.
  • the executor of the method of the embodiment of the present invention may be a video monitoring device or a video identification device.
  • this embodiment takes a video monitoring device as an example, but is of course not limited thereto; any other device capable of recognizing human behavior in video may be used.
  • the video monitoring device detects the human body region in the to-be-identified video, and acquires the human body running track information in the human body region.
  • the video monitoring device obtains the to-be-identified video and detects the human body region in it; in a specific implementation, the video surveillance device can obtain the original video to be identified through a front-end video capture device and detect the human body region in the video using a detector based on traditional feature classification.
  • after acquiring the to-be-identified video and detecting the human body region in it, the video monitoring device tracks the pedestrians in the human body region to obtain the human body running track information; in a specific implementation, a pedestrian tracking algorithm based on detection area matching can be used to track pedestrians in the picture and obtain the motion trajectory information of the human body in the picture.
  • the result of human body detection and tracking can be saved in the form of a target ID and a sequence of detection area images, namely O(i, t) = (I_t(i), R_t(i)), where O(i, t) represents the information of target i at time t, I_t(i) is the image content of the target detected at time t, and R_t(i) is the position of the target at time t; R_t(i) records the horizontal and vertical coordinates of the upper-left corner of the detection area together with its width and height, as a vector (x, y, w, h).
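  • As a concrete illustration, here is a minimal Python sketch of how such a detection-and-tracking record could be stored; the class and field names are illustrative assumptions:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TargetObservation:
    """O(i, t): the record of target i at time t."""
    target_id: int     # i, the tracked target's ID
    time: int          # t, the frame index
    image: np.ndarray  # I_t(i): image content of the detection area
    region: tuple      # R_t(i) = (x, y, w, h): upper-left corner, width, height

# A track is the time-ordered list of observations sharing one target ID.
tracks: dict[int, list[TargetObservation]] = {}
```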
  • Step S2 Calculate a predicted value corresponding to the human body region according to the human body region, and filter the human body region whose predicted value is a non-human body category, and obtain the human body region whose predicted value is a human body category.
  • the video monitoring device calculates the predicted value corresponding to the human body region according to the human body region and filters out the human body regions whose predicted value is a non-human body category, obtaining the human body region whose predicted value is a human body category.
  • the video monitoring device acquires and analyzes the human body region and outputs a predicted value corresponding to it; the predicted value is either a human body category or a non-human body category. In a specific implementation, after acquiring a human body region in the current frame, the video monitoring device inputs the image of the region into the background-filtering M1 network model for analysis.
  • the structure of the M1 network model is shown in FIG. 2: it is a deep convolutional network model based on single-frame image input. The input is the detected foreground area image, followed by several Convolution Layers (CONV) with ReLU and pooling layers, and then several Fully Connected Layers (FC) for deep feature calculation.
  • the last layer of the M1 network has dimension 2; after a sigmoid transformation, the two outputs correspond to the category scores for the human body category and the non-human body category.
  • if the predicted value is a non-human body category, the human body region is filtered out of the acquired human body regions; through this classification by the M1 network model, regions that the earlier detection and tracking algorithms misdetected as the human body category can be removed. Since the network at this stage is computed only on the foreground images generated by the detection step (instead of the entire image), it introduces no significant computational overhead and improves detection accuracy while satisfying the real-time requirements of the entire system.
  • the number of convolutional layers and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitoring screen and the hardware performance of the deployed device.
  • a deep network model with a relatively simple structure is used to further filter the detected foreground regions. In the earlier detection step, the algorithm intentionally lowers the threshold for foreground prediction so that as many foreground areas as possible are returned, minimizing the missed-detection rate. Since the filtering network is computed only on the foreground images generated by the detection step (instead of the entire image), its computational overhead is small, detection accuracy is improved, and the real-time requirements of the entire system are well satisfied.
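  • The following PyTorch-style sketch illustrates such a single-frame background-filtering network; the number of layers, channel counts and input size are illustrative assumptions (the patent itself notes that layer counts should be tuned to the monitoring picture size and deployment hardware):

```python
import torch
import torch.nn as nn

class M1BackgroundFilter(nn.Module):
    """Single-frame CNN that scores a detected foreground crop
    as human body category vs. non-human body category."""
    def __init__(self):
        super().__init__()
        # Several CONV + ReLU + pooling blocks (depth is adjustable).
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Several fully connected layers; the final layer is 2-dimensional.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),  # assumes 64x64 input crops
            nn.Linear(128, 2),
        )

    def forward(self, crop):  # crop: (N, 3, 64, 64) foreground image
        logits = self.classifier(self.features(crop))
        return torch.sigmoid(logits)  # scores for (human, non-human)
```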
  • step S3: the human body region whose predicted value is the human body category is calculated to obtain the behavior category score of the target in the human body region of the human body category.
  • after the filtering step, the video monitoring device calculates, for the human body region whose predicted value is the human body category, the behavior category score of the target in that region.
  • the video monitoring device obtains a background image of the human body region whose predicted value is a human body category and obtains description information of the background image; in a specific implementation, if the prediction obtained from the M1 network model is the human body category (i.e., a foreground in the image), the video monitoring device can use the non-sequential-input behavior recognition network model M2, which has a more complex structure and stronger recognition ability and fuses background and neighboring target features, to identify the behavior of each human body region in a single frame image.
  • the structure of the M2 network model is shown in FIG. 3: its hidden layer incorporates the background image of the current human target and the hidden-layer feature information of adjacent targets. The feature fusion occurs at the first fully connected layer of the network, as shown by the first FC layer in FIG. 3. The background image of the area where the target is located can be obtained from a preset pure background image by taking the portion corresponding to the position of the detection area.
  • the complete background image can be obtained from a preset standard background image or via a dynamically updated background model. Denote the background image obtained for target i at time t as B_t(i); then for a target area, its description information can be expressed as the pair (I_t(i), B_t(i)), where I_t(i) and B_t(i) share the same location area R_t(i).
  • the video monitoring device calculates the background area information corresponding to the background image according to its description information, and calculates the neighboring target information corresponding to the background image. In a specific implementation, the background image passes through several convolutional layers to obtain its visual feature description, and then through a fully connected layer to obtain its first hidden layer feature, whose dimension equals that of the first hidden layer feature obtained from the target image.
  • the feature calculation process of the first hidden layer can be expressed as FC_1(I_t(i)) = f_1(c_m(...c_1(I_t(i)))), where c(·) represents a convolution operation on an image and f(·) represents the matrix multiplication and bias operation of a fully connected layer; similarly, for the background image, FC_1(B_t(i)) = f_1(c_m(...c_1(B_t(i)))).
  • some of the features of the first hidden layer of the model come from adjacent targets, mainly from target features in the regions neighboring the current region.
  • the range of neighboring regions can be determined by setting a threshold.
  • the central location of the current target is computed from its detection region R_t(i) = (x, y, w, h), i.e., the point (x + w/2, y + h/2).
  • the video monitoring device calculates the behavior category score of the target of the human body region by combining the background area information corresponding to the background image with the neighboring target information. In a specific implementation, denoting the set of first-fully-connected-layer features computed for all adjacent target regions as {FC_1(I_t(j))}, the maximum of these feature values is taken separately in each dimension.
  • the first-fully-connected-layer feature of the behavior recognition network model is then the fusion of FC_1(I_t(i)), FC_1(B_t(i)) and this element-wise maximum over neighbors.
  • the fused feature passes through the subsequent fully connected layers, so that in the recognition process the entire network model naturally utilizes the background information and context information of the current target.
  • the output of the M2 network model is a multi-dimensional vector whose length is the number of behavior categories to be identified; the score on each dimension of the output represents the prediction probability of that category.
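  • A minimal PyTorch-style sketch of this hidden-layer fusion follows; concatenating the three FC1 features and the layer sizes are assumptions (the patent states only that the fusion happens at the first fully connected layer):

```python
import torch
import torch.nn as nn

class M2FusionHead(nn.Module):
    """Sketch of M2 fusion at the first FC layer: the target crop's FC1
    feature, its background crop's FC1 feature, and the element-wise max
    over neighboring targets' FC1 features."""
    def __init__(self, conv: nn.Module, fc1: nn.Linear, num_classes: int, hidden: int = 256):
        super().__init__()
        self.conv, self.fc1 = conv, fc1      # shared conv stack + first FC layer
        self.tail = nn.Sequential(           # subsequent fully connected layers
            nn.Linear(fc1.out_features * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),  # one score per behavior category
        )

    def fc1_feat(self, img):                 # FC_1(x) = f_1(c_m(...c_1(x)))
        return self.fc1(self.conv(img).flatten(1))

    def forward(self, target_img, bg_img, neighbor_imgs):
        h_t = self.fc1_feat(target_img)      # FC_1(I_t(i))
        h_b = self.fc1_feat(bg_img)          # FC_1(B_t(i))
        if neighbor_imgs:                    # element-wise max over neighbors
            h_n = torch.stack([self.fc1_feat(n) for n in neighbor_imgs]).max(dim=0).values
        else:
            h_n = torch.zeros_like(h_t)      # no neighbors: zero feature (assumption)
        fused = torch.cat([h_t, h_b, h_n], dim=-1)
        return torch.softmax(self.tail(fused), dim=-1)  # per-category probabilities
```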
  • Step S4 outputting a corresponding behavior category according to the behavior category score.
  • after completing the calculation of the behavior category score of the target in the human body region whose predicted value is the human body category, the video monitoring device outputs the corresponding behavior category according to that score.
  • specifically, for some categories with obvious static characteristics, if the output category score is higher than a certain threshold, that category is directly output as the final prediction result.
  • embodiments of the present invention treat the different behavior types in surveillance video according to their different static and dynamic characteristics, analyzing the images with sequential (multi-frame image) and non-sequential (single-frame image) input networks of different structures, and finally fusing the outputs of the two networks to obtain the final behavior recognition result. In particular, for behavior categories with clear static characteristics, such as fighting and cycling, the embodiment mainly relies on a sufficiently complex non-sequential-input network model for fast prediction, because these motion features are obvious and, once they appear, can generally be judged accurately from a single frame. For behavior categories that are difficult to judge from a single frame image, such as walking and jogging, the recognition proceeds as follows.
  • the corresponding behavior category is calculated and output in combination with the human body running track information.
  • the video monitoring device acquires the current time image of the to-be-identified video and the tracking area image corresponding to the human body running track information.
  • in a specific implementation, the video monitoring device can acquire the tracking area image corresponding to the current-time image and the human body running track information, and use the sequential superposition of the same target's images at the current and previous times as the input of the multi-frame sequential-input behavior recognition network model M3, which fuses background and neighboring target features, for further category prediction.
  • the structure of the M3 network model is shown in Figure 4. Because the time-ordered superposition of target motion pictures is used as the input of the network, the M3 network model has a stronger ability to capture motion information and has obvious advantages for recognizing behaviors with obvious dynamic features.
  • the video monitoring device sequentially superimposes the current-time image and the tracking area images; in a specific implementation, the video monitoring device uses the M3 network model with the motion trajectory information, taking as the model input the sequential superposition of the same target's tracking area images at the current time and several previous times, i.e., the stacked sequence (I_{t-n}(i), ..., I_{t-1}(i), I_t(i)).
  • the middle layer of the M3 network model will simultaneously fuse the depth features of the background region sequence of the current target and the hidden features of other target historical sequences in the current target neighbor region.
  • the information of the adjacent target is beneficial to improve the prediction accuracy of the algorithm.
  • the location of the hidden layer feature fusion of the M3 network model is also in the first fully connected layer of the network, as shown by the first FC layer in Figure 4.
  • for the background, the M3 network model likewise takes the sequence of background regions along the target's trajectory as input.
  • the acquisition of the adjacent target features is also consistent with the M2 network model.
  • the distance to the target at the current time, compared against a preset threshold, is used as the selection criterion for neighboring targets, and the maximum value and weighted mean of their FC1 features are computed to form the description of the neighboring-target features. After fusion, the result is input to the subsequent fully connected layers for further recognition calculation.
  • the M3 network model output is also a multi-dimensional vector, the length of the vector is the number of behavior categories to be identified, and the score on each dimension of the output is the prediction probability on the category.
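  • A minimal sketch of building this sequential input by stacking a target's tracking-area crops over time; the window length, resizing and padding policy are illustrative assumptions:

```python
import numpy as np
import cv2

def build_m3_input(track: list[np.ndarray], n_frames: int = 8, size: int = 64) -> np.ndarray:
    """Stack the same target's tracking-area images at the current and
    previous moments into one tensor, the sequential input of M3."""
    crops = track[-n_frames:]                       # most recent crops I_{t-k}(i) .. I_t(i)
    crops = [cv2.resize(c, (size, size)) for c in crops]
    while len(crops) < n_frames:                    # pad short tracks by repeating oldest crop
        crops.insert(0, crops[0])
    return np.stack(crops, axis=0)                  # shape: (n_frames, size, size, 3)
```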
  • after sequentially superimposing the current-time image and the tracking area images, the video monitoring device weights and sums the behavior category score and the result of the sequential superposition, and outputs the corresponding behavior category; in a specific implementation, the video monitoring device fuses the processing results of the M2 and M3 network models to obtain a comprehensive behavior category prediction for the target to be detected.
  • the fusion method can be a weighted sum of the two networks' results, with the weights obtained by fitting on the training set.
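  • A minimal sketch of this weighted fusion; the single scalar weight and the argmax decision are assumptions (the patent says only that the weights are fitted on a training set):

```python
import numpy as np

def fuse_scores(m2_scores: np.ndarray, m3_scores: np.ndarray, w: float = 0.5) -> int:
    """Weighted sum of the M2 (single-frame) and M3 (sequential) category
    scores; w would be chosen by fitting on a training set."""
    combined = w * m2_scores + (1.0 - w) * m3_scores
    return int(np.argmax(combined))  # index of the predicted behavior category
```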
  • the embodiment of the present invention, drawing on the characteristics of behaviors appearing in surveillance video, designs a fusion method based on hidden-layer features in the single-frame-input and multi-frame-input networks, combining the current target's foreground, background image information and neighboring target information.
  • this enriches the information available to the classification network, so that the deep model used for classification can simultaneously utilize the background area of the current target and the behavior information of other targets in the adjacent area; this is very valuable auxiliary information for behavior recognition in surveillance video and improves the behavior recognition performance of the entire system.
  • the embodiment of the invention provides a method for human body behavior recognition in a video, which improves the real-time performance and accuracy of video recognition.
  • referring to FIG. 5, the process of outputting a corresponding behavior category according to the behavior category score is as follows.
  • step S4 includes:
  • Step S41 If the behavior category score is higher than a threshold of the preset behavior category, the behavior category is output.
  • after completing the calculation of the behavior category score of the target in the human body region whose predicted value is the human body category, the video monitoring device outputs the corresponding behavior category according to that score.
  • specifically, for some categories with obvious static characteristics, if the output category score is higher than a certain threshold, that category is directly output as the final prediction result.
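  • A minimal sketch of this two-branch decision (steps S41 and S42); the per-category threshold layout and the fallback signature are illustrative assumptions:

```python
def decide(category_scores, thresholds, temporal_branch):
    """Step S41: if the best static (M2) score clears its preset
    per-category threshold, output that category directly.
    Step S42: otherwise fall back to the trajectory-based (M3) branch."""
    best = max(range(len(category_scores)), key=lambda k: category_scores[k])
    if category_scores[best] > thresholds[best]:
        return best
    return temporal_branch()
```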
  • step S42 if the behavior category score is not higher than the threshold of the preset behavior category, the corresponding behavior category is calculated and output according to the human body running track information.
  • the corresponding behavior category is calculated and output in combination with the human body running track information.
  • the video monitoring device acquires the current time image of the to-be-identified video and the tracking area image corresponding to the human body running track information.
  • in a specific implementation, the video monitoring device can acquire the tracking area image corresponding to the current-time image and the human body running track information, and use the sequential superposition of the same target's images at the current and previous times as the input of the multi-frame sequential-input behavior recognition network model M3, which fuses background and neighboring target features, for further category prediction.
  • the structure of the M3 network model is shown in Figure 4. Because the time-ordered superposition of target motion pictures is used as the input of the network, the M3 network model has a stronger ability to capture motion information and has obvious advantages for recognizing behaviors with obvious dynamic features.
  • after acquiring the current-time image of the to-be-identified video and the tracking area image corresponding to the human body running track information, the video monitoring device sequentially superimposes the current-time image and the tracking area images; in a specific implementation, the video monitoring device uses the M3 network model with the motion trajectory information, taking the sequential superposition of the same target's tracking area images at the current and previous times as the input of the model.
  • the middle layer of the M3 network model will simultaneously fuse the depth features of the background region sequence of the current target and the hidden features of other target historical sequences in the current target neighbor region.
  • the information of the adjacent target is beneficial to improve the prediction accuracy of the algorithm.
  • the location of the hidden layer feature fusion of the M3 network model is also in the first fully connected layer of the network, as shown by the first FC layer in Figure 4.
  • for the background, the M3 network model likewise takes the sequence of background regions along the target's trajectory as input.
  • the acquisition of the adjacent target features is also consistent with the M2 network model.
  • the distance to the target at the current time, compared against a preset threshold, is used as the selection criterion for neighboring targets, and the maximum value and weighted mean of their FC1 features are computed to form the description of the neighboring-target features. After fusion, the result is input to the subsequent fully connected layers for further recognition calculation.
  • the M3 network model output is also a multi-dimensional vector, the length of the vector is the number of behavior categories to be identified, and the score on each dimension of the output is the prediction probability on the category.
  • after sequentially superimposing the current-time image and the tracking area images, the video monitoring device weights and sums the behavior category score and the result of the sequential superposition, and outputs the corresponding behavior category; in a specific implementation, the video monitoring device fuses the processing results of the M2 and M3 network models to obtain a comprehensive behavior category prediction for the target to be detected.
  • the fusion method can be a weighted sum of the two networks' results, with the weights obtained by fitting on the training set.
  • the embodiment of the present invention provides a method for human body behavior recognition in a video, which better improves the real-time performance and accuracy of video recognition.
  • referring to FIG. 6, the human body region whose predicted value is the human body category is calculated to obtain the behavior category score of the target in it.
  • step S3 includes:
  • Step S31 Obtain a background image of the human body region whose predicted value is a human body category, and obtain description information of the background image.
  • after the non-human-target filtering step outputs the predicted value corresponding to the human body region and the human body regions whose predicted value is a non-human body category are filtered out, the video monitoring device acquires the background image of the human body region whose predicted value is the human body category and obtains the description information of the background image.
  • the video monitoring device can use the non-sequential-input behavior recognition network model M2, which has a more complex structure and stronger recognition ability and is based on neighboring target features, to identify the behavior of each human body region in a single frame image.
  • the structure of the network model is shown in Figure 3.
  • the hidden layer of the M2 network model incorporates the background image of the current human target and the hidden-layer feature information of adjacent targets. The feature fusion occurs at the first fully connected layer of the network, as shown by the first FC layer in Figure 3; the background image of the target region can be obtained from a preset pure background image by taking the portion corresponding to the position of the detection area.
  • the complete background image can be obtained from a preset standard background image or via a dynamically updated background model. Denote the background image obtained for target i at time t as B_t(i); then for a target area, its description information can be expressed as the pair (I_t(i), B_t(i)), where I_t(i) and B_t(i) share the same location area R_t(i).
  • Step S32 Calculate background area information corresponding to the background image according to the description information of the background image, and calculate neighboring target information corresponding to the background image.
  • the video monitoring device calculates the background area information corresponding to the background image according to its description information, and calculates the neighboring target information corresponding to the background image.
  • the background image passes through several convolutional layers to obtain its visual feature description, and then through a fully connected layer to obtain its first hidden layer feature, whose dimension equals that of the first hidden layer feature obtained from the target image.
  • the feature calculation process of the first hidden layer can be expressed as FC_1(I_t(i)) = f_1(c_m(...c_1(I_t(i)))), where c(·) represents a convolution operation on an image and f(·) represents the matrix multiplication and bias operation of a fully connected layer; similarly, for the background image, FC_1(B_t(i)) = f_1(c_m(...c_1(B_t(i)))).
  • some of the features composing the model's first hidden layer come from adjacent targets, mainly from target features in the regions neighboring the current region.
  • the range of neighboring regions can be determined by setting a threshold.
  • the central location of the current target is computed from its detection region R_t(i) = (x, y, w, h), i.e., the point (x + w/2, y + h/2).
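  • A minimal sketch of selecting adjacent targets by center distance against a preset threshold; the function and parameter names are illustrative:

```python
import numpy as np

def neighbor_ids(regions: dict[int, tuple], i: int, dist_thresh: float) -> list[int]:
    """Select adjacent targets: those whose center lies within dist_thresh
    of target i's center; centers come from R_t = (x, y, w, h)."""
    def center(r):
        x, y, w, h = r
        return np.array([x + w / 2.0, y + h / 2.0])
    ci = center(regions[i])
    return [j for j, r in regions.items()
            if j != i and np.linalg.norm(center(r) - ci) < dist_thresh]
```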
  • Step S33 calculating the behavior category score of the target of the human body region in combination with the background region information corresponding to the background image and the neighboring target information.
  • after calculating the background area information corresponding to the background image according to its description information and calculating the neighboring target information corresponding to the background image, the video monitoring device combines the background area information with the neighboring target information and calculates the behavior category score of the target of the human body region.
  • in a specific implementation, denoting the set of first-fully-connected-layer features computed for all adjacent target regions as {FC_1(I_t(j))}, the maximum of these feature values is taken separately in each dimension.
  • the first-fully-connected-layer feature of the behavior recognition network model is then the fusion of FC_1(I_t(i)), FC_1(B_t(i)) and this element-wise maximum over neighbors.
  • the fused feature passes through the subsequent fully connected layers, so that in the recognition process the entire network model naturally utilizes the background information and context information of the current target.
  • the output of the M2 network model is a multi-dimensional vector whose length is the number of behavior categories to be identified; the score on each dimension of the output represents the prediction probability of that category.
  • the embodiment of the present invention provides a method for human body behavior recognition in a video, which better improves the real-time performance and accuracy of video recognition.
  • FIG. 7 is a schematic flowchart of the step of calculating and outputting a corresponding behavior category in combination with the human body running track information according to an embodiment of the present invention.
  • step S42 includes:
  • Step S421 Acquire a current time image of the video and a tracking area image corresponding to the human body running track information.
  • the video monitoring device acquires a current time image of the to-be-identified video and a tracking area image corresponding to the human body running track information.
  • in a specific implementation, the video monitoring device can acquire the tracking area image corresponding to the current-time image and the human body running track information, and use the sequential superposition of the same target's previous-time images as the input of the multi-frame sequential-input behavior recognition network model M3, based on background and adjacent target features, for further category prediction.
  • the structure of the M3 network model is shown in Figure 4. Because the time-ordered superposition of target motion pictures is used as the input of the network, the M3 network model has a stronger ability to capture motion information and has obvious advantages for recognizing behaviors with obvious dynamic features.
  • Step S422 sequentially superimposing the current time image and the tracking area image.
  • the video monitoring device sequentially superimposes the current time image and the tracking area image.
  • the video monitoring device uses the M3 network model with the motion trajectory information, taking the sequential superposition of the same target's tracking images at the current and previous times as the input of the model.
  • the middle layer of the M3 network model will simultaneously fuse the depth features of the background region sequence of the current target and the hidden features of other target historical sequences in the current target neighbor region.
  • the information of the adjacent target is beneficial to improve the prediction accuracy of the algorithm.
  • the location of the hidden layer feature fusion of the M3 network model is also in the first fully connected layer of the network, as shown by the first FC layer in Figure 4.
  • for the background, the M3 network model likewise takes the sequence of background regions along the target's trajectory as input.
  • the acquisition of the adjacent target features is also consistent with the M2 network model.
  • the distance to the target at the current time, compared against a preset threshold, is used as the selection criterion for neighboring targets, and the maximum value and weighted mean of their FC1 features are computed to form the description of the neighboring-target features. After fusion, the result is input to the subsequent fully connected layers for further recognition calculation.
  • the M3 network model output is also a multi-dimensional vector, the length of the vector is the number of behavior categories to be identified, and the score on each dimension of the output is the prediction probability on the category.
  • Step S423 weighting and summing the behavior category score and the result of performing the sequential superposition, and outputting the corresponding behavior category.
  • the video monitoring device weights and sums the behavior category score and the sequentially superimposed result, and outputs the corresponding behavior category.
  • the video monitoring device combines the processing results of the M2 and M3 network models to obtain a comprehensive behavior category prediction for the target to be detected; the fusion method may be a weighted sum of the two networks' results, and the weights may be obtained by fitting on the training set.
  • the embodiment of the present invention provides a method for human body behavior recognition in a video, which better improves the real-time performance and accuracy of video recognition.
  • FIG. 8 is a schematic flowchart of the step of calculating the predicted value corresponding to the human body region according to the human body region and filtering the human body region whose predicted value is a non-human body category.
  • step S2 includes:
  • Step S21 acquiring the human body region and performing analysis, and outputting a predicted value corresponding to the human body region.
  • after detecting the human body region in the to-be-identified video and acquiring the human body running track information in the human body region, the video monitoring device acquires the human body region, performs analysis, and outputs the predicted value corresponding to the human body region.
  • the video monitoring device inputs the image of the human body region into the background filtering network M1 network model for analysis, and the structure of the M1 network model is as shown in FIG. 2 .
  • the M1 network model is a deep convolutional network model based on single-frame image input; the input of the network is the detected foreground area image, followed by several Convolution Layers (CONV) with ReLU and pooling layers, and then several Fully Connected Layers (FC) for deep feature calculation.
  • the last layer of the network has dimension 2; after a sigmoid transformation, the two outputs correspond to the category scores for the human body category and the non-human body category.
  • Step S22 If the predicted value is a non-human body category, the human body region whose predicted value is a non-human body category is filtered from the acquired human body region.
  • through classification by the M1 network model, the video monitoring device can filter out regions that the earlier detection and tracking algorithms misdetected as the human body category. Since the network at this stage is computed only on the foreground images generated by the detection step (instead of the entire image), it introduces no significant computational overhead and improves detection accuracy while satisfying the real-time requirements of the entire system. At the same time, the number of convolutional and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitoring picture and the hardware performance of the deployed device.
  • if the predicted value is a human body category, the step of calculating the behavior category score of the target in the human body region of the human body category is performed.
  • the video monitoring device performs the above step S3 to calculate the behavior category score of the target in the human body region in which the predicted value is the human body category.
  • the embodiment of the present invention provides a method for human body behavior recognition in a video, which better improves the real-time performance and accuracy of video recognition.
  • referring to FIG. 9, the human body region in the video to be identified is detected and the human body running track information in the human body region is acquired.
  • step S1 includes:
  • Step S11 Acquire the video to be identified, and detect a human body region in the target video.
  • the video monitoring device acquires the to-be-identified video and detects the human body region in the target video.
  • the video surveillance device can obtain the original video to be identified through the front-end video capture device, and detect the human body region in the video by using a detector based on the traditional feature classification.
  • Step S12 Tracking pedestrians in the human body region to obtain human body running track information in the human body region.
  • the video monitoring device tracks the pedestrian in the human body region to obtain the human body running track information in the human body region.
  • the video monitoring device may track the pedestrians in the picture using a pedestrian tracking algorithm based on detection area matching, thereby obtaining the motion trajectory information of the human body in the picture.
  • the result of human body detection and tracking can be saved in the form of a target ID and a sequence of detection area images, namely O(i, t) = (I_t(i), R_t(i)), where O(i, t) represents the information of target i at time t, I_t(i) is the image content of the target detected at time t, and R_t(i) is the position of the target at time t; R_t(i) records the horizontal and vertical coordinates of the upper-left corner of the detection area together with its width and height, as a vector (x, y, w, h).
  • the embodiment of the present invention provides a method for human body behavior recognition in a video, which better improves the real-time performance and accuracy of video recognition.
  • the present invention also provides a corresponding apparatus embodiment.
  • a first embodiment of the present invention provides a device for recognizing human behavior in a video, including:
  • the detecting module 100 is configured to detect a human body region in the video to be identified and acquire the human body running track information in the human body region.
  • the device in the embodiment of the present invention may be a video monitoring device or a video identification device.
  • this embodiment takes a video monitoring device as an example, but is of course not limited thereto; any other device capable of recognizing human behavior in video may be used.
  • the detecting module 100 detects a human body region in the video to be identified, and acquires human body running track information in the human body region.
  • the video monitoring device obtains the to-be-identified video and detects the human body region in it; in a specific implementation, the video surveillance device can obtain the original video to be identified through a front-end video capture device and detect the human body region in the video using a detector based on traditional feature classification.
  • the detecting module 100 tracks the pedestrians in the human body region to obtain the human body running track information in the human body region; in a specific implementation, the video monitoring device can use a pedestrian tracking algorithm based on detection area matching to track pedestrians in the picture and obtain the motion trajectory information of the human body in the picture.
  • the result of human body detection and tracking can be saved in the form of a target ID and a sequence of detection area images, namely O(i, t) = (I_t(i), R_t(i)), where O(i, t) represents the information of target i at time t, I_t(i) is the image content of the target detected at time t, and R_t(i) is the position of the target at time t; R_t(i) records the horizontal and vertical coordinates of the upper-left corner of the detection area together with its width and height, as a vector (x, y, w, h).
  • the filtering module 200 is configured to calculate a predicted value corresponding to the human body region according to the human body region, and filter the human body region whose predicted value is a non-human body category to obtain a human body region whose predicted value is a human body category.
  • the filtering module 200 calculates the predicted value corresponding to the human body region according to the human body region and filters out the human body regions whose predicted value is a non-human body category, obtaining the human body region whose predicted value is a human body category.
  • the video monitoring device acquires and analyzes the human body region and outputs a predicted value corresponding to it; the predicted value is either a human body category or a non-human body category. In a specific implementation, after acquiring a human body region in the current frame, the video monitoring device inputs the image of the region into the background-filtering M1 network model for analysis.
  • the structure of the M1 network model is shown in Figure 2.
  • the M1 network model is a deep convolutional network model based on single-frame image input. The input of the network is the detected foreground area image, followed by several Convolution Layers (CONV) with ReLU and pooling layers, and then several Fully Connected Layers (FC) for deep feature calculation. The last layer of the network is the output layer; its dimension is 2, and after a sigmoid transformation the two outputs correspond to the category scores for the human body category and the non-human body category.
  • the filtering module 200 filters the human body regions whose predicted value is a non-human body category out of the acquired human body regions; through classification by the M1 network model, regions that the earlier detection and tracking algorithms misdetected as the human body category can be filtered out.
  • the number of convolutional layers and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitoring screen and the hardware performance of the deployed device.
  • a deep network model with a relatively simple structure is used to further filter the detected foreground regions. In the earlier detection step, the algorithm intentionally lowers the threshold for foreground prediction so that as many foreground areas as possible are returned, minimizing the missed-detection rate. Since the filtering network is computed only on the foreground images generated by the detection step (instead of the entire image), its computational overhead is small, detection accuracy is improved, and the real-time requirements of the entire system are well satisfied.
  • the calculation module 300 is configured to calculate, for the human body region whose predicted value is a human body category, the behavior category score of the target in the human body region of the human body category.
  • the calculation module 300 calculates, for the human body region whose predicted value is the human body category, the behavior category score of the target in that region.
  • the video monitoring device obtains a background image of the human body region whose predicted value is a human body category and obtains description information of the background image; in a specific implementation, if the prediction obtained from the M1 network model is the human body category (i.e., a foreground in the image), the video monitoring device can use the non-sequential-input behavior recognition network model M2, which has a more complex structure and stronger recognition ability and fuses background and neighboring target features, to identify the behavior of each human body region in a single frame image.
  • the structure of the M2 network model is shown in FIG. 3: its hidden layer incorporates the background image of the current human target and the hidden-layer feature information of adjacent targets. The feature fusion occurs at the first fully connected layer of the network, as shown by the first FC layer in FIG. 3. The background image of the area where the target is located can be obtained from a preset pure background image by taking the portion corresponding to the position of the detection area.
  • the complete background image can be obtained from a preset standard background image or via a dynamically updated background model. Denote the background image obtained for target i at time t as B_t(i); then for a target area, its description information can be expressed as the pair (I_t(i), B_t(i)), where I_t(i) and B_t(i) share the same location area R_t(i).
  • the calculation module 300 calculates the background area information corresponding to the background image according to its description information, and calculates the neighboring target information corresponding to the background image. In a specific implementation, the background image passes through several convolutional layers to obtain its visual feature description, and then through a fully connected layer to obtain its first hidden layer feature, whose dimension equals that of the first hidden layer feature obtained from the target image.
  • the feature calculation process of the first hidden layer can be expressed as FC_1(I_t(i)) = f_1(c_m(...c_1(I_t(i)))), where c(·) represents a convolution operation on an image and f(·) represents the matrix multiplication and bias operation of a fully connected layer; similarly, for the background image, FC_1(B_t(i)) = f_1(c_m(...c_1(B_t(i)))).
  • some of the features of the first hidden layer of the model come from adjacent targets, mainly from target features in the regions neighboring the current region.
  • the range of neighboring regions can be determined by setting a threshold.
  • the central location of the current target is computed from its detection region R_t(i) = (x, y, w, h), i.e., the point (x + w/2, y + h/2).
  • the calculation module 300 calculates the behavior category score of the target of the human body region by combining the background area information corresponding to the background image with the neighboring target information. In a specific implementation, denoting the set of first-fully-connected-layer features computed for all adjacent target regions as {FC_1(I_t(j))}, the maximum of these feature values is taken separately in each dimension.
  • the first-fully-connected-layer feature of the behavior recognition network model is then the fusion of FC_1(I_t(i)), FC_1(B_t(i)) and this element-wise maximum over neighbors.
  • the fused feature passes through the subsequent fully connected layers, so that in the recognition process the entire network model naturally utilizes the background information and context information of the current target.
  • the output of the M2 network model is a multi-dimensional vector whose length is the number of behavior categories to be identified; the score on each dimension of the output represents the prediction probability of that category.
  • the output module 400 is configured to output a corresponding behavior category according to the behavior category score.
  • In a specific implementation, the output module 400 outputs the corresponding behavior category according to the behavior category score.
  • If the behavior category score is higher than the threshold of the preset behavior category, the behavior category is output; that is, if the output score on some category with obvious static characteristics is higher than a certain threshold, that category is directly output as the final prediction result.
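  • In code, this early-exit rule could be sketched as follows; the category names and the threshold value are placeholders, not values given by the patent:

```python
def decide_from_single_frame(m2_scores, static_threshold=0.8):
    """Return a category directly if the M2 (single-frame) model's best score
    falls on a statically obvious category and exceeds the threshold;
    otherwise return None to defer to the sequential M3 model."""
    static_categories = {"fighting", "cycling"}   # behaviors judged per frame
    best = max(m2_scores, key=m2_scores.get)      # m2_scores: {category: score}
    if best in static_categories and m2_scores[best] > static_threshold:
        return best
    return None
```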
  • The embodiments of the present invention distinguish among the types of behaviors that appear in surveillance video and, according to their different static and dynamic characteristics, analyze the images with sequential (multi-frame image) and non-sequential (single-frame image) input networks of different structures, finally combining the outputs of the two different networks to obtain the final behavior recognition result.
  • In particular, for behavior categories with clear static characteristics, such as fighting and cycling, the embodiments of the present invention rely mainly on a sufficiently complex non-sequential input network model for fast prediction, because such actions are visually obvious and, once they appear, can generally be judged accurately from a single frame image. For behavior categories that are difficult to judge from a single frame image, such as walking and jogging, the sequential processing described below is used.
  • If the behavior category score is not higher than the threshold of the preset behavior category, the output module 400 combines the human body running track information to calculate and output the corresponding behavior category.
  • the video monitoring device acquires the current time image of the to-be-identified video and the tracking area image corresponding to the human body running track information.
  • In a specific implementation, the video monitoring device can acquire the current time image and the tracking area images corresponding to the human body running track information, and use the ordered superposition of the same target's images at previous times as the input of the M3 network model, a multi-frame sequential-input behavior recognition network that fuses background and neighboring target features, for further category prediction.
  • The structure of the M3 network model is shown in FIG. 4. Because the time-ordered overlay of the target's motion pictures is used as the input of the network, the M3 network model has a stronger ability to capture motion information, which gives it a clear advantage in recognizing behaviors with obvious dynamic characteristics.
  • The output module 400 sequentially superimposes the current time image and the tracking area images. In a specific implementation, the video monitoring device uses the M3 network model and exploits the motion trajectory information by taking the tracking area images of the same target at the current time and the preceding times, in temporal order, as the input of the model, namely the ordered sequence (I_(t-k)^(i), ..., I_(t-1)^(i), I_t^(i)).
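  • A minimal sketch of assembling this sequential input by stacking the same target's time-ordered patches along the channel axis; the stacking axis is an assumption of this example, since the patent only specifies an ordered superposition:

```python
import numpy as np

def stack_track_patches(patches):
    """Given the same target's tracking-area patches ordered from earliest
    to current time, e.g. k+1 arrays of shape (H, W, 3) resized upstream
    to a common size, stack them along the channel axis into
    (H, W, 3*(k+1)) as the M3 network input."""
    return np.concatenate(list(patches), axis=-1)
```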
  • the middle layer of the M3 network model will simultaneously fuse the depth features of the background region sequence of the current target and the hidden features of other target historical sequences in the current target neighbor region.
  • Introducing the information of neighboring targets helps to improve the prediction accuracy of the algorithm.
  • the location of the hidden layer feature fusion of the M3 network model is also in the first fully connected layer of the network, as shown by the first FC layer in Figure 4.
  • For the background, the M3 network model likewise takes the sequence of background regions along the target's trajectory as input.
  • the acquisition of the adjacent target features is also consistent with the M2 network model.
  • The distance between targets at the current time, compared against the preset threshold, is used as the selection criterion for neighboring targets, and the per-dimension maximum and weighted mean of their FC1 features are calculated to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
  • the M3 network model output is also a multi-dimensional vector, the length of the vector is the number of behavior categories to be identified, and the score on each dimension of the output is the prediction probability on the category.
  • After the current time image and the tracking area images are sequentially superimposed, the output module 400 performs a weighted summation of the behavior category score and the result produced from the sequential superposition, and outputs the corresponding behavior category. In a specific implementation, the video monitoring device fuses the processing results of the M2 network model and the M3 network model to obtain a comprehensive behavior category prediction for the target to be detected.
  • The fusion method can be a weighted sum of the two groups of network outputs, and the weights can be obtained by fitting on the training set.
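  • A sketch of this weighted fusion of the two score vectors; the default weight here is a placeholder that would in practice be fitted on the training set:

```python
import numpy as np

def fuse_scores(m2_scores: np.ndarray, m3_scores: np.ndarray, w: float = 0.5):
    """Weighted sum of the M2 and M3 category score vectors; returns the
    index of the winning category and the fused scores."""
    fused = w * m2_scores + (1.0 - w) * m3_scores
    return int(np.argmax(fused)), fused
```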
  • The invention takes into account the characteristics of the behaviors that appear in surveillance video and designs a hidden-layer feature fusion method for both the single-frame-input and the multi-frame-input networks. The new implicit feature, formed by combining the current target foreground, the background image information and the neighboring target information, enriches the information available to the classification network, so that the depth model used for classification can simultaneously utilize the background region of the current target and the behavior information of other targets in the adjacent area. This provides very valuable auxiliary information for behavior recognition in surveillance video and enhances the behavior recognition performance of the entire system.
  • The embodiment of the invention provides a device for human behavior recognition in video, which improves the real-time performance and accuracy of video recognition.
  • The output module 400 is further configured to output the behavior category if the behavior category score is higher than a threshold of the preset behavior category, and, if the behavior category score is not higher than the threshold of the preset behavior category, to calculate and output the corresponding behavior category in combination with the human body running track information.
  • In a specific implementation, the output module 400 outputs the corresponding behavior category according to the behavior category score.
  • If the behavior category score is higher than the threshold of the preset behavior category, the behavior category is output; that is, if the output score on some category with obvious static characteristics is higher than a certain threshold, that category is directly output as the final prediction result.
  • If the behavior category score is not higher than the threshold of the preset behavior category, the output module 400 combines the human body running track information to calculate and output the corresponding behavior category.
  • the video monitoring device acquires the current time image of the to-be-identified video and the tracking area image corresponding to the human body running track information.
  • In a specific implementation, the video monitoring device can acquire the current time image and the tracking area images corresponding to the human body running track information, and use the ordered superposition of the same target's images at previous times as the input of the M3 network model, a multi-frame sequential-input behavior recognition network that fuses background and neighboring target features, for further category prediction.
  • The structure of the M3 network model is shown in FIG. 4. Because the time-ordered overlay of the target's motion pictures is used as the input of the network, the M3 network model has a stronger ability to capture motion information, which gives it a clear advantage in recognizing behaviors with obvious dynamic characteristics.
  • The output module 400 sequentially superimposes the current time image and the tracking area images. In a specific implementation, the video monitoring device uses the M3 network model and exploits the motion trajectory information by taking the tracking area images of the same target at the current time and the preceding times, in temporal order, as the input of the model, namely the ordered sequence (I_(t-k)^(i), ..., I_(t-1)^(i), I_t^(i)).
  • the middle layer of the M3 network model will simultaneously fuse the depth features of the background region sequence of the current target and the hidden features of other target historical sequences in the current target neighbor region.
  • Introducing the information of neighboring targets helps to improve the prediction accuracy of the algorithm.
  • the location of the hidden layer feature fusion of the M3 network model is also in the first fully connected layer of the network, as shown by the first FC layer in Figure 4.
  • For the background, the M3 network model likewise takes the sequence of background regions along the target's trajectory as input.
  • the acquisition of the adjacent target features is also consistent with the M2 network model.
  • The distance between targets at the current time, compared against the preset threshold, is used as the selection criterion for neighboring targets, and the per-dimension maximum and weighted mean of their FC1 features are calculated to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
  • the M3 network model output is also a multi-dimensional vector, the length of the vector is the number of behavior categories to be identified, and the score on each dimension of the output is the prediction probability on the category.
  • After the current time image and the tracking area images are sequentially superimposed, the output module 400 performs a weighted summation of the behavior category score and the result produced from the sequential superposition, and outputs the corresponding behavior category. In a specific implementation, the video monitoring device fuses the processing results of the M2 network model and the M3 network model to obtain a comprehensive behavior category prediction for the target to be detected.
  • The fusion method can be a weighted sum of the two groups of network outputs, and the weights can be obtained by fitting on the training set.
  • The embodiment of the invention provides a device for human behavior recognition in video, which improves the real-time performance and accuracy of video recognition.
  • The calculation module 300 is further configured to acquire a background image of the human body region whose predicted value is the human body category and obtain description information of the background image; to calculate, according to the description information of the background image, the background region information corresponding to the background image, and to calculate the neighboring target information corresponding to the background image; and to combine the background region information and the neighboring target information corresponding to the background image to obtain the behavior category score of the target of the human body region.
  • In a specific implementation, the calculation module 300 acquires the background image of the human body region whose predicted value is the human body category, and obtains the description information of the background image.
  • If the prediction result of the M1 network model is the human body category, the video monitoring device can use the M2 network model, a structurally more complex and more discriminative non-sequential-input behavior recognition network that fuses neighboring target features, to recognize the behavior of each human body region in a single frame image.
  • The structure of the M2 network model is shown in FIG. 3. Its hidden layer incorporates the background image of the current human target and the hidden-layer feature information of neighboring targets; the feature fusion takes place at the first fully connected layer of the network, shown as the first FC layer in FIG. 3. The background image of the target region can be obtained from a preset clean background image, by taking the portion corresponding to the position of the detection region.
  • The complete background image can be obtained from a preset standard background image or from a dynamically updated background model. Denote the background image obtained for a target i at time t as B_t^(i); then for a target region, its description information can be expressed as:
  • O(i,t) = (I_t^(i), R_t^(i), B_t^(i));
  • where I_t^(i) and B_t^(i) share the same location region R_t^(i).
  • The calculation module 300 calculates the background region information corresponding to the background image according to the description information of the background image, and calculates the neighboring target information corresponding to the background image.
  • In a specific implementation, the background image passes through several convolutional layers to obtain its visual feature description, and then through a fully connected layer to obtain its corresponding first hidden-layer feature, whose dimensionality is the same as that of the first hidden-layer feature obtained from the target image.
  • For the target image, the feature calculation process of its first hidden layer can be expressed as:
  • FC_1(I_t^(i)) = f_1(c_m(...c_1(I_t^(i))));
  • where c(·) denotes a convolution operation on an image and f(·) denotes the matrix multiplication and bias operations of a fully connected layer. Similarly, for the background image, the feature of its first hidden layer is:
  • FC_1(B_t^(i)) = f_1(c_m(...c_1(B_t^(i))));
  • some of the features of the first hidden layer of the model are features from adjacent targets, which are mainly from the target features in the adjacent regions of the current region.
  • the range of neighboring regions can be determined by setting a threshold.
  • The central location of the current target is:
  • (x_c^(i), y_c^(i)) = (x_t^(i) + w_t^(i)/2, y_t^(i) + h_t^(i)/2);
  • where x_t^(i) and y_t^(i) are the horizontal and vertical coordinates of the upper-left corner of the target region and w_t^(i) and h_t^(i) are its width and height. The center points of the other foreground targets in the same picture are computed in the same way; when the Euclidean distance d_ij between two centers is smaller than a certain threshold D, or the two regions overlap, that foreground target is counted as a valid neighboring target of the current target.
  • The calculating module 300 combines the background region information corresponding to the background image with the neighboring target information to calculate the behavior category score of the target of the human body region. In a specific implementation, the video monitoring device can denote the set of first fully connected layer features computed from all valid neighboring target regions as {FC_1(I_t^(j))}, and separately take the per-dimension maximum FC_1^max and the weighted mean FC_1^avg of these feature values as the two components of the neighboring target description; if the current target has no neighboring target in the picture, these values are all set to zero.
  • After the background region information and the neighboring target information are combined, the feature of the first fully connected layer of the behavior recognition network model can be expressed as the concatenation [FC_1(I_t^(i)), FC_1(B_t^(i)), N_t^(i)]. This fused feature passes through the subsequent fully connected layers, so that the entire network model naturally exploits the background information and context information of the current target during recognition.
  • the M2 network model output is a multi-dimensional vector, the length of the vector is the number of behavior categories to be identified, and the score on each dimension of the output represents the prediction probability on the category.
  • The embodiment of the invention provides a device for human behavior recognition in video, which improves the real-time performance and accuracy of video recognition.
  • The output module 400 is further configured to acquire the current time image of the to-be-identified video and the tracking area images corresponding to the human body running track information; to sequentially superimpose the current time image and the tracking area images; and to perform a weighted summation of the behavior category score and the result of the sequential superposition, outputting the corresponding behavior category.
  • In a specific implementation, the output module 400 acquires the current time image of the video to be identified and the tracking area images corresponding to the human body running track information.
  • The video monitoring device can acquire the current time image and the tracking area images corresponding to the human body running track information, and use the ordered superposition of the same target's images at previous times as the input of the M3 network model, the multi-frame sequential-input behavior recognition network that fuses background and neighboring target features, for further category prediction.
  • The structure of the M3 network model is shown in FIG. 4. Because the time-ordered overlay of the target's motion pictures is used as the input of the network, the M3 network model has a stronger ability to capture motion information, which gives it a clear advantage in recognizing behaviors with obvious dynamic characteristics.
  • After acquiring the current time image of the video and the tracking area images corresponding to the human body running track information, the output module 400 sequentially superimposes the current time image and the tracking area images. In a specific implementation, the video monitoring device uses the M3 network model and exploits the motion trajectory information by taking the sequential superposition of the same target's tracking images at the current time and the preceding times as the input of the model, namely the ordered sequence (I_(t-k)^(i), ..., I_(t-1)^(i), I_t^(i)).
  • the middle layer of the M3 network model will simultaneously fuse the depth features of the background region sequence of the current target and the hidden features of other target historical sequences in the current target neighbor region.
  • Introducing the information of neighboring targets helps to improve the prediction accuracy of the algorithm.
  • the location of the hidden layer feature fusion of the M3 network model is also in the first fully connected layer of the network, as shown by the first FC layer in Figure 4.
  • For the background, the M3 network model likewise takes the sequence of background regions along the target's trajectory as input.
  • the acquisition of the adjacent target features is also consistent with the M2 network model.
  • The distance between targets at the current time, compared against the preset threshold, is used as the selection criterion for neighboring targets, and the per-dimension maximum and weighted mean of their FC1 features are calculated to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
  • the M3 network model output is also a multi-dimensional vector, the length of the vector is the number of behavior categories to be identified, and the score on each dimension of the output is the prediction probability on the category.
  • The output module 400 performs a weighted summation of the behavior category score and the result of the sequential superposition, and outputs the corresponding behavior category.
  • In a specific implementation, the video monitoring device fuses the processing results of the M2 network model and the M3 network model to obtain a comprehensive behavior category prediction for the target to be detected; the fusion method may be a weighted sum of the two groups of network outputs, and the weights can be obtained by fitting on the training set.
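  • The patent leaves the fitting procedure open; one simple possibility, shown here purely as an assumption, is a grid search over candidate weights on held-out training predictions:

```python
import numpy as np

def fit_fusion_weight(m2_val, m3_val, labels, grid=np.linspace(0.0, 1.0, 21)):
    """Pick the M2/M3 fusion weight that maximizes accuracy on a validation
    split. m2_val and m3_val are (num_samples, num_categories) score arrays,
    labels the ground-truth category indices."""
    best_w, best_acc = 0.5, -1.0
    for w in grid:
        pred = (w * m2_val + (1.0 - w) * m3_val).argmax(axis=1)
        acc = float((pred == labels).mean())
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w
```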
  • The embodiment of the invention provides a device for human behavior recognition in video, which improves the real-time performance and accuracy of video recognition.
  • The filtering module 200 is further configured to acquire the human body region, analyze it, and output the predicted value corresponding to the human body region; if the predicted value is a non-human body category, the human body region whose predicted value is a non-human body category is filtered out of the acquired human body regions; if the predicted value is the human body category, the behavior category score of the target in the human body region whose predicted value is the human body category is calculated.
  • the filtering module 200 acquires the human body region and performs analysis, and outputs the predicted value corresponding to the human body region.
  • the video monitoring device inputs the image of the human body region into the background filtering network M1 network model for analysis, and the structure of the M1 network model is as shown in FIG. 2 .
  • The M1 network model is a deep convolutional network model based on single-frame image input. The input of the network is the detected foreground region image, followed by several convolution layers (CONV) with accompanying ReLU and pooling layers, and then several fully connected layers (FC) for deep feature computation.
  • The output dimension of the last layer of the network is 2; after a sigmoid transformation, the two dimensions correspond to the behavior category scores of the human body category and the non-human body category, respectively.
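  • The following PyTorch sketch shows a network of this shape; the specific layer counts and widths are illustrative assumptions, which matches the patent's remark that they can be tuned to the monitoring picture size and deployment hardware:

```python
import torch.nn as nn

class M1BackgroundFilter(nn.Module):
    """Single-frame CNN over a detected foreground patch with a 2-way
    output (human / non-human category scores) after a sigmoid."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(             # CONV + ReLU + pooling
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Sequential(           # fully connected layers
            nn.Flatten(), nn.Linear(32 * 4 * 4, 64), nn.ReLU(),
            nn.Linear(64, 2), nn.Sigmoid(),
        )

    def forward(self, patch):
        return self.classifier(self.features(patch))
```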
  • The filtering module 200 filters the human body regions whose predicted value is a non-human body category out of the acquired human body regions. In a specific implementation, after classification by the M1 network model, the video monitoring device can filter out regions that the earlier detection and tracking algorithms misdetected as the human body category. Since the network at this stage is computed only on the foreground images produced by the detection step (rather than on the whole image), it does not introduce significant computational overhead, and it improves detection accuracy while satisfying the real-time requirements of the entire system. Moreover, the numbers of convolutional layers and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitoring picture and the hardware performance of the deployed device.
  • If the predicted value is the human body category, the filtering module 200 performs the step of calculating the behavior category score of the target in the human body region whose predicted value is the human body category.
  • The embodiment of the invention provides a device for human behavior recognition in video, which improves the real-time performance and accuracy of video recognition.
  • The detecting module 100 is further configured to acquire the to-be-identified video and detect the human body region in the to-be-identified video, and to track the human body in the human body region to obtain the human body running track information in the human body region.
  • the detecting module 100 acquires a video to be identified, and detects a human body region in the target video.
  • the video surveillance device can obtain the original video to be identified through the front-end video capture device, and detect the human body region in the video by using a detector based on the traditional feature classification.
  • the detecting module 100 tracks the pedestrian in the human body region to obtain the human body running track information in the human body region.
  • In a specific implementation, the video monitoring device may track the pedestrians in the picture using a tracking algorithm based on detection region matching, thereby obtaining the motion track information of the human bodies in the picture.
  • The result of human body detection and tracking can be saved in the form of a target ID together with a sequence of detection area images, namely:
  • O(i,t) = (I_t^(i), R_t^(i));
  • where O(i,t) represents the information of target i at time t, I_t^(i) is the image content detected for the target at time t, and R_t^(i) is the position of the region where the target is located at time t; R_t^(i) records, in vector form (x, y, w, h), the horizontal and vertical coordinates of the upper-left corner of the region together with its width and height.
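  • For illustration, such a record could be represented by a small Python data class; the field names are assumptions of this sketch:

```python
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class TrackObservation:
    """One detection-and-tracking record O(i, t)."""
    target_id: int                       # i
    time: int                            # t
    image: np.ndarray                    # I_t^(i): cropped detection-area patch
    region: Tuple[int, int, int, int]    # R_t^(i) = (x, y, w, h)
```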
  • The embodiment of the invention provides a device for human behavior recognition in video, which improves the real-time performance and accuracy of video recognition.
  • In practical applications, the detection module 100, the filtering module 200, the calculation module 300, and the output module 400 in the device for human behavior recognition in video can be implemented by a central processing unit (CPU), a micro control unit (MCU), a digital signal processor (DSP), or a field-programmable gate array (FPGA).
  • embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention can take the form of a hardware embodiment, a software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage and optical storage, etc.) including computer usable program code.
  • The computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device then provide the steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • an embodiment of the present invention further provides a computer storage medium, the computer storage medium comprising a set of instructions, when the instruction is executed, causing at least one processor to perform the method for recognizing human behavior in the video.
  • The solution provided by the embodiments of the present invention detects the human body region in the to-be-identified video and acquires the human body running track information in the human body region; calculates the predicted value corresponding to the human body region and filters out the human body regions whose predicted value is a non-human body category, obtaining the human body regions whose predicted value is the human body category; calculates, for the human body regions whose predicted value is the human body category, the behavior category score of the target in each such region; and outputs the corresponding behavior category according to the behavior category score, thereby improving the real-time performance and accuracy of video recognition.
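  • Putting the pieces together, the per-target decision flow described above can be sketched as follows; every interface, threshold and weight in this sketch is an illustrative assumption rather than part of the patent:

```python
import numpy as np

def recognize_target(m1, m2, m3, target, static_threshold=0.8, w=0.5):
    """M1 filters non-human regions; M2 scores the single frame; statically
    obvious behaviors are emitted immediately; otherwise M3 scores the
    stacked track and the two score vectors are fused by a weighted sum.
    m1/m2/m3 are callables returning NumPy score vectors."""
    human_score, _ = m1(target.patch)            # (human, non-human) scores
    if human_score < 0.5:
        return None                              # filtered as background
    m2_scores = m2(target.patch, target.background, target.neighbors)
    if m2_scores.max() > static_threshold:
        return int(m2_scores.argmax())           # fast single-frame verdict
    m3_scores = m3(target.track, target.background_track, target.neighbors)
    return int((w * m2_scores + (1.0 - w) * m3_scores).argmax())
```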

Abstract

A human behaviour recognition method and apparatus in a video, and a computer storage medium. The method comprises: detecting a human region in a video to be recognised, and acquiring human moving track information in the human region (S1); according to the human region, obtaining a prediction value corresponding to the human region by means of calculation, and filtering a human region, the prediction value of which is a non-human category, so as to obtain a human region, the prediction value of which is a human category (S2); calculating the human region, the prediction value of which is a human category, so as to obtain a behaviour category score of a target in the human region, the prediction value of which is a human category (S3); and according to the behaviour category score, outputting a corresponding behaviour category (S4).

Description

视频中人体行为识别的方法、装置和计算机存储介质Method, device and computer storage medium for human behavior recognition in video 技术领域Technical field
本发明涉及视频识别技术领域,尤其涉及一种视频中人体行为识别的方法、装置和计算机存储介质。The present invention relates to the field of video recognition technologies, and in particular, to a method, device and computer storage medium for human behavior recognition in video.
背景技术Background technique
现有的视频行为分析技术主要包括检测、跟踪和识别三个步骤。传统的方法主要是提取一些人工定义的视觉特征,比如颜色直方图、SIFT、HoG等,然后根据这些特征进行目标的检测、跟踪和分类等。然而由于这些传统特征的计算方法是通过人为定义的,特征的描述能力比较有限。实际应用中如果全部依赖传统的方法实现检测、跟踪及识别系统,所能达到的识别性能往往比较有限。The existing video behavior analysis technology mainly includes three steps of detection, tracking and recognition. The traditional method is mainly to extract some manually defined visual features, such as color histogram, SIFT, HoG, etc., and then based on these features to detect, track and classify the target. However, since the calculation methods of these traditional features are artificially defined, the ability to describe features is limited. In practical applications, if all rely on traditional methods to implement detection, tracking and identification systems, the recognition performance that can be achieved is often limited.
与传统方法相对的是使用深度网络模型完成图片或视频中的行为检测与识别。通过深度网络的模型能够学习到更好的特征描述,目前已经有一些使用基于深度学习的方法在视频分析中的工作成果,包括3D-CNN、RCNN、two-streams等时序模型的应用。这些现有的基于深度网络的视频分类方法主要是一些通用的算法,在对于监控视频中的人体行为识别这一特定的应用场景,现有技术存在一定的不足与改善空间,例如,在监控的场景中对于不同类型的人的行为,在识别的过程中应该区别对待。有些行为通过静态的画面就能够迅速识别,比如打架、骑车等,有些动作则时序性上的规律较强,借助连续图像帧分析更有助于区分,比如走路与(慢)跑等行为。现有技术中使用单一的模型不能同时兼顾以上两个方面,影响实时性和准确性。In contrast to traditional methods, deep network models are used to perform behavior detection and recognition in pictures or videos. The model of the deep network can learn better characterization. At present, there are some work results in the video analysis using the deep learning-based method, including the application of time series models such as 3D-CNN, RCNN, and two-streams. These existing deep network-based video classification methods are mainly general-purpose algorithms. In the specific application scenario for human behavior recognition in surveillance video, the prior art has certain deficiencies and improvement spaces, for example, in monitoring. The behavior of different types of people in the scene should be treated differently in the process of identification. Some behaviors can be quickly identified through static images, such as fighting, cycling, etc. Some actions are more regular in timing, and continuous image frame analysis is more helpful in distinguishing, such as walking and (slow) running. The use of a single model in the prior art cannot simultaneously take into account the above two aspects, affecting real-time and accuracy.
发明内容Summary of the invention
为解决现有存在的技术问题,本发明实施例提供一种视频中人体行为识别的方法、装置和计算机存储介质。 In order to solve the existing technical problems, embodiments of the present invention provide a method, an apparatus, and a computer storage medium for human body behavior recognition in a video.
本发明实施例的技术方案是这样实现的:The technical solution of the embodiment of the present invention is implemented as follows:
本发明实施例提供的视频中人体行为识别的方法,包括:A method for human behavior recognition in a video provided by an embodiment of the present invention includes:
检测待识别视频中的人体区域,获取所述人体区域中的人体运行轨迹信息;Detecting a human body region in the to-be-identified video, and acquiring human body running track information in the human body region;
根据所述人体区域计算得到所述人体区域对应的预测值,对所述预测值为非人体类别的人体区域进行过滤,得到所述预测值为人体类别的人体区域;Calculating a predicted value corresponding to the human body region according to the human body region, and filtering the human body region whose predicted value is a non-human body category, and obtaining the human body region whose predicted value is a human body category;
对所述预测值为人体类别的人体区域进行计算得到所述预测值为人体类别的人体区域中的目标的行为类别得分;Calculating, by the body region of the human body category, the predicted value is a behavior category score of the target in the human body region of the human body category;
根据所述行为类别得分,输出相应的行为类别。According to the behavior category score, the corresponding behavior category is output.
优选地,根据所述行为类别得分,输出相应的行为类别,包括:Preferably, according to the behavior category score, the corresponding behavior category is output, including:
若所述行为类别得分高于预设行为类别的阈值,则输出所述行为类别;Outputting the behavior category if the behavior category score is higher than a threshold of a preset behavior category;
若所述行为类别得分不高于预设行为类别的阈值,则结合所述人体运行轨迹信息,计算并输出相应的行为类别。If the behavior category score is not higher than a threshold of the preset behavior category, the corresponding behavior category is calculated and output in combination with the human body running trajectory information.
优选地,所述对所述预测值为人体类别的人体区域进行计算得到所述预测值为人体类别的人体区域中的目标的行为类别得分,包括:Preferably, the calculating, by the body area of the human body category, the predicted value is a behavior category score of the target in the human body region of the human body category, including:
获取所述预测值为人体类别的人体区域的背景图像,得到所述背景图像的描述信息;Obtaining a background image of the human body region whose predicted value is a human body category, and obtaining description information of the background image;
根据所述背景图像的描述信息,计算所述背景图像对应的背景区域信息,并计算所述背景图像对应的邻近目标信息;Calculating background region information corresponding to the background image according to the description information of the background image, and calculating neighboring target information corresponding to the background image;
结合所述背景图像对应的背景区域信息和邻近目标信息,计算得到所述人体区域的目标的行为类别得分。Combining the background region information corresponding to the background image and the neighboring target information, the behavior category score of the target of the human body region is calculated.
优选地,所述结合所述人体运行轨迹信息,计算并输出相应的行为类别,包括:Preferably, the combining the human body running track information, calculating and outputting a corresponding behavior category, including:
获取所述待识别视频的当前时刻图像和所述人体运行轨迹信息对应的跟踪区域图像;Obtaining a current time image of the to-be-identified video and a tracking area image corresponding to the human body running track information;
将所述当前时刻图像和所述跟踪区域图像进行顺序叠加;And sequentially superimposing the current time image and the tracking area image;
对所述行为类别得分和所述进行顺序叠加后的结果进行加权求和,输出对应的行为类别。The behavior category score and the result of the sequential superposition are weighted and summed, and the corresponding behavior category is output.
优选地,所述根据所述人体区域计算得到所述人体区域对应的预测值, 对所述预测值为非人体类别的人体区域进行过滤,包括:Preferably, the predicting value corresponding to the human body region is calculated according to the human body region, Filtering the human body area whose predicted value is non-human body category, including:
获取所述人体区域并进行分析,输出所述人体区域对应的预测值;Obtaining and analyzing the human body region, and outputting a predicted value corresponding to the human body region;
若所述预测值为非人体类别,则将所述预测值为非人体类别的人体区域从所述获取的人体区域中进行过滤;If the predicted value is a non-human body category, filtering the human body region whose predicted value is a non-human body category from the acquired human body region;
若所述预测值为人体类别,则执行计算所述预测值为人体类别的人体区域中的目标的行为类别得分的步骤。If the predicted value is a human body category, a step of calculating the behavior category score of the target in the human body region of the human body category is performed.
优选地,所述检测待识别视频中的人体区域,获取所述人体区域中的人体运行轨迹信息,包括:Preferably, the detecting the human body region in the to-be-identified video and acquiring the human body running track information in the human body region includes:
获取所述待识别视频,对所述待识别视频中的人体区域进行检测;Obtaining the to-be-identified video, and detecting a human body region in the to-be-identified video;
对所述人体区域中的行人进行跟踪,得到所述人体区域中的人体运行轨迹信息。Tracking pedestrians in the human body region to obtain human body running track information in the human body region.
本发明实施例还提出一种视频中人体行为识别的装置,所述装置包括:The embodiment of the invention further provides a device for recognizing human behavior in a video, the device comprising:
检测模块,配置为检测待识别视频中的人体区域,获取所述人体区域中的人体运行轨迹信息;The detecting module is configured to detect a human body region in the to-be-identified video, and acquire information about the human body running track in the human body region;
过滤模块,配置为根据所述人体区域计算得到所述人体区域对应的预测值,对所述预测值为非人体类别的人体区域进行过滤,得到所述预测值为人体类别的人体区域;The filtering module is configured to calculate a predicted value corresponding to the human body region according to the human body region, and filter the human body region whose predicted value is a non-human body category, to obtain the human body region whose predicted value is a human body category;
计算模块,配置为对所述预测值为人体类别的人体区域进行计算得到所述预测值为人体类别的人体区域中的目标的行为类别得分;a calculation module configured to calculate, according to the human body region whose predicted value is a human body category, the behavior category score of the target in the human body region of the human body category;
输出模块,配置为根据所述行为类别得分,输出相应的行为类别。An output module configured to output a corresponding behavior category according to the behavior category score.
优选地,所述输出模块,配置为若所述行为类别得分高于预设行为类别的阈值,则输出所述行为类别;若所述行为类别得分不高于预设行为类别的阈值,则结合所述人体运行轨迹信息,计算并输出相应的行为类别。Preferably, the output module is configured to output the behavior category if the behavior category score is higher than a threshold of a preset behavior category; if the behavior category score is not higher than a threshold of a preset behavior category, The human body runs track information, and calculates and outputs a corresponding behavior category.
优选地,所述计算模块,配置为获取所述预测值为人体类别的人体区域的背景图像,得到所述背景图像的描述信息;根据所述背景图像的描述信息,计算所述背景图像对应的背景区域信息,并计算所述背景图像对应的邻近目标信息;结合所述背景图像对应的背景区域信息和邻近目标信息,计算得到所述人体区域的目标的行为类别得分。Preferably, the calculating module is configured to acquire a background image of the human body region whose predicted value is a human body category, to obtain description information of the background image, and calculate corresponding to the background image according to the description information of the background image. The background area information is calculated, and the neighboring target information corresponding to the background image is calculated; and the behavior category score of the target of the human body area is calculated according to the background area information corresponding to the background image and the neighboring target information.
优选地,所述输出模块,配置为获取所述待识别视频的当前时刻图像 和所述人体运行轨迹信息对应的跟踪区域图像;将所述当前时刻图像和所述跟踪区域图像进行顺序叠加;对所述行为类别得分和所述进行顺序叠加后的结果进行加权求和,输出对应的行为类别。Preferably, the output module is configured to acquire an image of a current moment of the to-be-identified video a tracking area image corresponding to the human body running track information; sequentially superimposing the current time image and the tracking area image; weighting and summing the behavior category score and the sequentially superimposed result, and outputting The corresponding behavior category.
优选地,所述过滤模块,配置为获取所述人体区域并进行分析,输出所述人体区域对应的预测值;若所述预测值为非人体类别,则将所述预测值为非人体类别的人体区域从所述获取的人体区域中进行过滤;若所述预测值为人体类别,则计算所述预测值为人体类别的人体区域中的目标的行为类别得分。Preferably, the filtering module is configured to acquire the human body region and perform analysis, and output a predicted value corresponding to the human body region; if the predicted value is a non-human body category, the predicted value is a non-human body category. The human body region is filtered from the acquired human body region; if the predicted value is a human body category, the predicted value is calculated as a behavioral category score of the target in the human body region of the human body category.
优选地,所述检测模块,配置为获取所述待识别视频,对所述待识别视频中的人体区域进行检测;对所述人体区域中的行人进行跟踪,得到所述人体区域中的人体运行轨迹信息。Preferably, the detecting module is configured to acquire the to-be-identified video, detect a human body region in the to-be-identified video, and track a pedestrian in the human body region to obtain a human body operation in the human body region. Track information.
本发明实施例还提供一种计算机存储介质,所述计算机存储介质包括一组指令,当执行所述指令时,引起至少一个处理器执行上述的视频中人体行为识别的方法。Embodiments of the present invention also provide a computer storage medium comprising a set of instructions that, when executed, cause at least one processor to perform a method of human behavior recognition in the video described above.
本发明实施例提供了一种视频中人体行为识别的方法、装置和计算机存储介质,通过检测待识别视频中的人体区域,获取人体区域中的人体运行轨迹信息;根据人体区域计算得到人体区域对应的预测值,对预测值为非人体类别的人体区域进行过滤,得到预测值为人体类别的人体区域;对预测值为人体类别的人体区域进行计算得到预测值为人体类别的人体区域中的目标的行为类别得分;根据行为类别得分,输出相应的行为类别,解决了现有技术中识别视频中人体行为性能较差,实时性和准确性较低的问题。实现了提升视频识别的实时性和准确性。Embodiments of the present invention provide a method, a device, and a computer storage medium for recognizing a human body in a video. The human body region in the human body region is acquired by detecting a human body region in the video to be identified, and the human body region corresponding to the human body region is calculated according to the human body region. The predicted value is used to filter the human body region whose predicted value is non-human body type, and the human body region whose predicted value is the human body type is obtained; and the human body region whose predicted value is the human body type is calculated to obtain the target value in the human body region whose predicted value is the human body category. The behavior category score; according to the behavior category score, the corresponding behavior category is output, which solves the problem that the human behavior performance in the identified video in the prior art is poor, real-time and low accuracy. Realize the real-time and accuracy of video recognition.
附图说明DRAWINGS
图1是本发明视频中人体行为识别的方法第一实施例的流程示意图;1 is a schematic flow chart of a first embodiment of a method for human behavior recognition in the video of the present invention;
图2是本发明实施例中基于非时序输入深度网络模型结构示意图;2 is a schematic structural diagram of a network model based on a non-sequential input depth in an embodiment of the present invention;
图3是本发明实施例中基于非时序输入,融合背景与邻近目标特征的行为识别网络模型结构示意图; 3 is a schematic structural diagram of a behavior recognition network model based on non-sequential input, fusion background and neighboring target features according to an embodiment of the present invention;
图4是本发明实施例中基于时序输入,融合背景与邻近目标特征的行为识别网络模型结构示意图;4 is a schematic structural diagram of a behavior recognition network model based on time series input, fusion background and neighboring target features according to an embodiment of the present invention;
图5是本发明实施例中根据所述行为类别得分,输出相应的行为类别的步骤的一种流程示意图;FIG. 5 is a schematic flowchart of a step of outputting a corresponding behavior category according to the behavior category score in the embodiment of the present invention; FIG.
图6是本发明实施例中对所述预测值为人体类别的人体区域进行计算得到所述预测值为人体类别的人体区域中的目标的行为类别得分的步骤的一种流程示意图;6 is a schematic flowchart of a step of calculating, according to an embodiment of the present invention, a behavior type score of a target in a human body region in which a predicted value is a human body region;
图7是本发明实施例中结合所述人体运行轨迹信息,计算并输出相应的行为类别的步骤的一种流程示意图;FIG. 7 is a schematic flowchart of a step of calculating and outputting a corresponding behavior category in combination with the human body running track information according to an embodiment of the present invention; FIG.
图8是本发明实施例中根据所述人体区域计算得到所述人体区域对应的预测值,对所述预测值为非人体类别的人体区域进行过滤的步骤的一种流程示意图;FIG. 8 is a schematic flowchart of a step of filtering a body region that is predicted to be a non-human body type according to the predicted value corresponding to the human body region calculated according to the human body region in the embodiment of the present invention; FIG.
图9是本发明实施例中检测待识别视频中的人体区域,获取所述人体区域中的人体运行轨迹信息的步骤的一种流程示意图;9 is a schematic flowchart of a step of detecting a human body region in a video to be recognized and acquiring human body running track information in the human body region according to an embodiment of the present invention;
图10是本发明视频中人体行为识别的装置第一实施例的功能模块示意图。FIG. 10 is a schematic diagram of functional modules of a first embodiment of a device for human behavior recognition in the video of the present invention.
具体实施方式detailed description
本发明目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。The implementation, functional features, and advantages of the present invention will be further described in conjunction with the embodiments.
应当理解,此处所描述的具体实施例仅仅用以解释本发明,并不用于限定本发明。It is understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
在本发明的各种实施例中:检测待识别视频中的人体区域,获取人体区域中的人体运行轨迹信息;根据人体区域计算得到人体区域对应的预测值,对预测值为非人体类别的人体区域进行过滤,得到预测值为人体类别的人体区域;对预测值为人体类别的人体区域进行计算得到预测值为人体类别的人体区域中的目标的行为类别得分;根据行为类别得分,输出相应的行为类别。In various embodiments of the present invention, the human body area in the video to be identified is detected, and the human body running track information in the human body area is acquired; the predicted value corresponding to the human body area is calculated according to the human body area, and the predicted value is a non-human body type. The region is filtered to obtain a human body region whose predicted value is a human body category; the body region region whose predicted value is a human body category is calculated to obtain a behavior category score of a target in a human body region whose predicted value is a human body category; according to the behavior category score, the corresponding output is output. Behavior category.
由此,解决了现有技术中识别视频中人体行为性能较差,实时性和准确性较低的问题。实现了提升视频识别的实时性和准确性。 Therefore, the problem that the performance of the human body in the recognized video in the prior art is poor, real-time and low accuracy is solved. Realize the real-time and accuracy of video recognition.
如图1所示,本发明第一实施例提出一种视频中人体行为识别的方法,包括:As shown in FIG. 1 , a first embodiment of the present invention provides a method for human behavior recognition in a video, including:
步骤S1,检测待识别视频中的人体区域,获取所述人体区域中的人体运行轨迹信息。Step S1: detecting a human body region in the to-be-identified video, and acquiring human body running track information in the human body region.
本发明实施例方法的执行主体可以为一种视频监控设备或视频识别设备,本实施例以视频监控设备进行举例,当然也不限定于其他能够实现识别视频中人体行为的设备。The executor of the method of the embodiment of the present invention may be a video monitoring device or a video identification device. This embodiment is exemplified by a video monitoring device, and is of course not limited to other devices capable of realizing human behavior in the video.
具体地,视频监控设备检测待识别视频中的人体区域,获取人体区域中的人体运行轨迹信息。Specifically, the video monitoring device detects the human body region in the to-be-identified video, and acquires the human body running track information in the human body region.
其中,视频监控设备获取待识别视频,对目标视频中的人体区域进行检测;在具体实现时,视频监控设备可以通过前端视频采集设备来获取待识别的原始视频,并使用基于传统特征分类的检测器对视频中的人体区域进行检测。The video monitoring device obtains the to-be-identified video and detects the human body region in the target video. In a specific implementation, the video surveillance device can obtain the original video to be identified through the front-end video capture device, and use the detection based on the traditional feature classification. The device detects the human body area in the video.
其中,在完成获取待识别视频,对目标视频中的人体区域进行检测后,视频监控设备对人体区域中的行人进行跟踪,得到人体区域中的人体运行轨迹信息;在具体实现时,视频监控设备可使用基于检测区域匹配的跟踪算法对画面中的行人进行跟踪,从而得到画面中的人体的运动轨迹信息。After the acquisition of the to-be-identified video and the detection of the human body region in the target video, the video monitoring device tracks the pedestrians in the human body region to obtain the human body running track information in the human body region; in specific implementation, the video monitoring device A pedestrian tracking algorithm based on detection area matching can be used to track pedestrians in the picture to obtain motion trajectory information of the human body in the picture.
其中,人体检测与跟踪的结果可以以目标ID与检测区域图像序列的形式保存,即:The result of human body detection and tracking can be saved in the form of a target ID and a sequence of detection area images, namely:
O(i,t)=It (i),Rt (i)O (i, t) = I t (i), R t (i);
其中O(i,t)代表目标i在t时刻的信息,It (i)是该目标在t时刻检测到的图像内容,Rt (i)是该目标在t时刻所在区域的位置,Rt (i)中使用向量(x,y,w,h)的形式记录区域的左上角横、纵坐标位置与长、宽值。Where O(i,t) represents the information of the target i at time t, I t (i) is the image content detected by the target at time t, and R t (i) is the position of the target at the time t, R In t (i) , the horizontal and vertical coordinate positions and the length and width values of the upper left corner of the recording area are used in the form of vectors (x, y, w, h).
步骤S2,根据所述人体区域计算得到所述人体区域对应的预测值,对所述预测值为非人体类别的人体区域进行过滤,得到所述预测值为人体类别的人体区域。Step S2: Calculate a predicted value corresponding to the human body region according to the human body region, and filter the human body region whose predicted value is a non-human body category, and obtain the human body region whose predicted value is a human body category.
具体地,在完成检测待识别视频中的人体区域,获取人体区域中的人体运行轨迹信息后,视频监控设备根据人体区域计算得到人体区域对应的预测值,对预测值为非人体类别的人体区域进行过滤,得到预测值为人体 类别的人体区域。Specifically, after detecting the human body region in the to-be-identified video and acquiring the human body running track information in the human body region, the video monitoring device calculates the predicted value corresponding to the human body region according to the human body region, and the human body region with the predicted value is a non-human body category. Filtering to get the predicted value of the human body The human body area of the category.
其中,视频监控设备获取人体区域并进行分析,输出人体区域对应的预测值,预测值包括人体类别和非人体类别;在具体实现时,当获取到当前帧中某一个人体区域后,视频监控设备将该人体区域的图像输入到背景过滤网络M1网络模型中进行分析,M1网络模型的结构如图2所示,M1网络模型是一个基于单帧图像输入的深度卷积网络模型;其中,网络的输入为检测到的前景区域图像,后接若干个附带ReLU层和pooling层的卷积层(Convolution Layers,CONV),再接上若干个全连通层(Fully Connection Layers,FC)进行深度的特征计算,M1网络的最后一层输出层的维数为2维,经过sigmoid变换后分别对应人体类别与非人体类别上的行为类别得分。The video monitoring device acquires and analyzes the human body region, and outputs a predicted value corresponding to the human body region, and the predicted value includes a human body category and a non-human body category; in a specific implementation, after acquiring a certain human body region in the current frame, the video monitoring device The image of the human body region is input into the background filtering network M1 network model for analysis. The structure of the M1 network model is shown in FIG. 2, and the M1 network model is a deep convolution network model based on single frame image input; The input is the detected foreground area image, followed by several Convolution Layers (CONV) with ReLU layer and pooling layer, and then connected with several Fully Connection Layers (FC) for depth feature calculation. The dimension of the last layer of the M1 network is 2 dimensions. After sigmoid transformation, it corresponds to the behavior category scores of the human body category and the non-human body category.
其中,若预测值为非人体类别,则将预测值为非人体类别的人体区域从获取的人体区域中进行过滤;通过M1网络模型的分类后,可以过滤掉前期检测与跟踪算法误测为人体类别的区域。由于此时的网络仅在检测环节产生的前景图像上进行计算(而非整张图像上),所以并不会产生明显的计算开销,在提高检测准确率的同时,能够满足整个系统实时性上的要求。同时,M1网络模型中的卷积层、全连通层的个数可以根据监控画面的大小与所部署设备的硬件性能等因素进行调整。Wherein, if the predicted value is a non-human body type, the human body region whose predicted value is a non-human body category is filtered from the acquired human body region; after the classification by the M1 network model, the pre-detection and tracking algorithm may be filtered out to be misdetected as a human body. The area of the category. Since the network at this time is only calculated on the foreground image generated by the detection link (instead of the entire image), it does not generate significant computational overhead, and can improve the detection accuracy while satisfying the real-time performance of the entire system. Requirements. At the same time, the number of convolutional layers and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitoring screen and the hardware performance of the deployed device.
其中,本发明实施例在检测与跟踪环节后首先使用了一个结构相对简单的深度网络模型对检测到的前景区域进行进一步的过滤处理;在前期的检测环节,实现时有意降低算法对于前景预测的阈值,使算法尽可能返回更多的前景区域,尽量减少漏检率的产生。由于此时的网络仅在检测环节产生的前景图像上进行计算(而非整张图像上),所以大大减少了算法的计算开销,在提高检测准确率的同时,很好地满足了整个系统实时性上的要求。In the embodiment of the present invention, after the detection and tracking step, a deep network model with relatively simple structure is used to further filter the detected foreground region; in the early detection link, the algorithm intentionally reduces the algorithm for foreground prediction. Thresholds allow the algorithm to return as many foreground areas as possible, minimizing the rate of missed detection. Since the network at this time is only calculated on the foreground image generated by the detection link (instead of the entire image), the computational overhead of the algorithm is greatly reduced, and the detection accuracy is improved, and the entire system is well satisfied. Sexual requirements.
步骤S3,对所述预测值为人体类别的人体区域进行计算得到所述预测值为人体类别的人体区域中的目标的行为类别得分。In step S3, the human body region whose predicted value is the human body type is calculated to obtain the behavior category score of the target in the human body region of the human body category.
具体地,在完成根据人体区域计算得到人体区域对应的预测值,对预测值为非人体类别的人体区域进行过滤,得到预测值为人体类别的人体区域后,视频监控设备对预测值为人体类别的人体区域进行计算得到预测值 为人体类别的人体区域中的目标的行为类别得分。Specifically, after the predicted value corresponding to the human body region is calculated according to the human body region, the human body region whose predicted value is not the human body type is filtered, and after the predicted human body region of the human body category is obtained, the video monitoring device predicts the human body region. The body area is calculated to obtain a predicted value The behavior category score for the target in the body region of the human body category.
其中,视频监控设备获取预测值为人体类别的人体区域的背景图像,得到背景图像的描述信息;在具体实现时,如果M1网络模型得到的预测结果是人体类别(即画面中的前景),视频监控设备可以使用一个结构更复杂、识别能力更强的基于邻近目标特征的非时序输入行为识别M2网络模型对单帧图像内的每个人体区域进行行为的识别,该网络模型的结构如图3所示;M2网络模型的隐藏层中加入了当前人体目标所在背景图像和邻近目标隐藏层的特征信息,特征融合的位置在于网络的第一个全连通层,如图3中的第一个FC层所示;其中目标所在区域的背景图像可以从预先设定的纯净的背景图像中获得,只要取其中对应检测区域位置的部分即可。完整的背景图像可以通过预先设定的标准背景图像获得,或通过动态更新的背景模型获得。记某一目标i在t时刻得到的背景图像为Bt (i),那么对于一个目标区域,可以将它的描述信息表示为:The video monitoring device obtains a background image of the human body region whose predicted value is a human body type, and obtains description information of the background image; in a specific implementation, if the predicted result obtained by the M1 network model is a human body category (ie, a foreground in the image), the video The monitoring device can identify the behavior of each human body region in a single frame image by using a non-sequential input behavior based on neighboring target features with more complex structure and more recognizable capability. The structure of the network model is shown in FIG. 3. As shown in the figure, the hidden layer of the M2 network model adds the background image of the current human target and the feature information of the adjacent target hidden layer. The location of the feature fusion lies in the first fully connected layer of the network, as shown in the first FC in FIG. The layer is shown; the background image of the area where the target is located can be obtained from a preset pure background image, as long as the portion corresponding to the position of the detection area is taken. The complete background image can be obtained from a pre-set standard background image or via a dynamically updated background model. Note that the background image obtained by a target i at time t is B t (i) , then for a target area, its description information can be expressed as:
O(i,t)=It (i),Rt (i),Bt (i)O(i,t)=I t (i) , R t (i) , B t (i) ;
其中,It (i)和Bt (i)共用同一个位置区域Rt (i)Where I t (i) and B t (i) share the same location area R t (i) .
其中,在完成获取预测值为人体类别的人体区域的背景图像,得到背景图像的描述信息后,视频监控设备根据背景图像的描述信息,计算背景图像对应的背景区域信息,并计算背景图像对应的邻近目标信息;在具体实现时,背景图像会经过若干个卷积层得到它的视觉特征描述,然后经过一个全连通层得到它对应的第一个隐含层特征,它的维度与目标图像得到的第一个隐含层的维度相同。对于目标图像,它的第一个隐含层的特征计算过程可以表示为:After the background image of the body region of the human body category is obtained and the description information of the background image is obtained, the video monitoring device calculates the background region information corresponding to the background image according to the description information of the background image, and calculates the background image corresponding to the background image. Adjacent to the target information; in the specific implementation, the background image will obtain its visual feature description through several convolutional layers, and then obtain a corresponding first hidden layer feature through a fully connected layer, and its dimension and target image are obtained. The first hidden layer has the same dimensions. For the target image, the feature calculation process of its first hidden layer can be expressed as:
FC1(It (i))=f1(cm(...c1It (i)));FC 1 (I t (i) )=f 1 (c m (...c 1 I t (i) ));
其中,c(·)代表对于图像的卷积运算,f(·)代表全连接层的矩阵乘法操作与偏置量操作。类似的,对于背景位置图像,记它的第一个隐含层的特征为:Where c(·) represents a convolution operation for an image, and f(·) represents a matrix multiplication operation and an offset amount operation of the fully connected layer. Similarly, for a background position image, the characteristics of its first hidden layer are:
FC1(Bt (i))=f1(cm(...c1Bt (i)));FC 1 (B t (i) )=f 1 (c m (...c 1 B t (i) ));
Part of the feature composition of the first hidden layer of this model also comes from neighboring targets, chiefly the target features found in the regions adjacent to the current region. The extent of the adjacent neighborhood can be determined by setting a threshold. The center position of the current target is recorded as:
(x_c^(i), y_c^(i)) = (x_t^(i) + w_t^(i)/2, y_t^(i) + h_t^(i)/2);

where x_t^(i) is the abscissa of the upper-left corner of the target region, y_t^(i) is the ordinate of the upper-left corner of the target region, w_t^(i) is the width of the target region, and h_t^(i) is the height of the target region. The center position points (x_c^(j), y_c^(j)) of the other foreground targets in the same frame are computed at the same time. When the Euclidean distance d_ij between (x_c^(i), y_c^(i)) and (x_c^(j), y_c^(j)) is less than a certain threshold D, or the two regions intersect, that foreground target is counted among the valid neighboring targets of the current target.
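A minimal sketch of this neighbor-selection rule, assuming regions are stored as (x, y, w, h) rows as defined earlier; the function name and array layout are illustrative:

```python
import numpy as np

def valid_neighbors(regions, i, D):
    """Return the indices of valid neighboring targets of target i.

    regions: array of shape (K, 4), one (x, y, w, h) row per foreground
    target in the frame (upper-left corner plus width and height).
    A target j != i is a valid neighbor when the Euclidean distance d_ij
    between the two region centers is below the threshold D, or when the
    two boxes intersect.
    """
    x, y, w, h = regions[i]
    cx, cy = x + w / 2.0, y + h / 2.0
    neighbors = []
    for j, (xj, yj, wj, hj) in enumerate(regions):
        if j == i:
            continue
        d_ij = np.hypot(xj + wj / 2.0 - cx, yj + hj / 2.0 - cy)
        intersect = (x < xj + wj and xj < x + w and
                     y < yj + hj and yj < y + h)  # axis-aligned overlap test
        if d_ij < D or intersect:
            neighbors.append(j)
    return neighbors
```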
After the background region information corresponding to the background image and the neighboring target information have been calculated from the description information of the background image, the video monitoring device combines the two to calculate the behavior category score of the target in the human body region. In a specific implementation, the video monitoring device can record the set of first fully connected layer features computed from all neighboring target regions as {FC_1(I_t^(j))}, j ∈ N(i), and separately compute, over every dimension of these feature values, the maximum

F_max = max_{j ∈ N(i)} FC_1(I_t^(j))

and the weighted average

F_avg = Σ_{j ∈ N(i)} w_j · FC_1(I_t^(j)),

as components of the feature description of the neighboring targets. Concatenating these two groups of features yields the overall feature representation of the neighboring target description:

F_nb = [F_max, F_avg].

If the current target has no neighboring target in the frame, the values of F_nb are all set to zero. After the background region information and the neighboring target information are combined, the feature of the first fully connected layer of the behavior recognition network model can be expressed as:

FC_1' = [FC_1(I_t^(i)), FC_1(B_t^(i)), F_nb].
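A sketch of this aggregation step, assuming the notation above (F_max, F_avg, and F_nb are reconstructed symbol names, and normalization of the weights is an assumption):

```python
import numpy as np

def neighbor_feature(fc1_neighbors, weights, dim):
    """Build F_nb = [F_max, F_avg] from the FC1 features of the valid
    neighboring targets; returns zeros when there are no neighbors.

    fc1_neighbors: list of 1-D arrays of length dim (possibly empty).
    weights: one weight per neighbor for the weighted average.
    """
    if not fc1_neighbors:
        return np.zeros(2 * dim)                 # no neighbors: all zeros
    feats = np.stack(fc1_neighbors)              # shape (K, dim)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalized weights (assumed)
    f_max = feats.max(axis=0)                    # per-dimension maximum
    f_avg = (w[:, None] * feats).sum(axis=0)     # weighted average
    return np.concatenate([f_max, f_avg])

# The fused first-layer feature is then the concatenation
# [FC1(I_t), FC1(B_t), F_nb], fed to the remaining FC layers.
```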
This fused feature then passes through the subsequent fully connected layers, so that the whole network model naturally exploits the background region information and the context information of the current target during recognition.

The output of the M2 network model is a multi-dimensional vector whose length equals the number of behavior categories to be recognized; the score on each output dimension represents the predicted probability of that category.
Step S4: output the corresponding behavior category according to the behavior category score.
Specifically, after the behavior category scores of the targets in the human body regions whose predicted value is the human category have been calculated, the video monitoring device outputs the corresponding behavior category according to those scores.

If the behavior category score is higher than the threshold of a preset behavior category, that behavior category is output. In particular, if the score output at this stage on a category with obvious static characteristics exceeds a certain threshold, that category is output directly as the final prediction result.
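A sketch of this early-exit decision in the cascade, assuming per-category thresholds; the names and the dict-based interface are illustrative:

```python
def cascade_decision(m2_scores, static_categories, thresholds):
    """If the M2 (single-frame) score of any category with obvious static
    characteristics exceeds its preset threshold, output that category
    immediately; otherwise return None and defer to the temporal M3 model."""
    for c in static_categories:                 # e.g. fighting, cycling
        if m2_scores[c] > thresholds[c]:
            return c                            # early exit: final prediction
    return None                                 # fall through to M3 / fusion
```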
For the different types of behavior occurring in surveillance video, the embodiments of the present invention employ, according to their different static and dynamic characteristics, sequential (multi-frame image) and non-sequential (single-frame image) input networks of different structures to analyze the extracted images, and finally fuse the outputs of the two different networks to obtain the final behavior recognition result. Specifically, for behavior categories with clear static characteristics, such as fighting or cycling, the embodiments rely mainly on a sufficiently complex non-sequential input network model for fast prediction: the features of these actions are obvious, so once they appear they can generally be judged accurately from a single frame. For behavior categories that are difficult to judge from a single frame, such as walking and jogging, a deep network that takes temporally stacked images as input is used for further analysis, providing more reliable recognition performance than a network fed a single static image. In addition, the fusion strategy between the sequential-input and non-sequential-input deep classification models is designed around the idea of a cascade classifier, which improves the operating efficiency of the whole classification system and meets the requirement of real-time behavior recognition.
If the behavior category score is not higher than the threshold of the preset behavior category, the corresponding behavior category is calculated and output in combination with the human body trajectory information.
The video monitoring device acquires the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information. In a specific implementation, the video monitoring device can take the superposition of the images of the same target at previous times as the input of the M3 network model, a multi-frame sequential-input behavior recognition model based on background and neighboring target features, for further category prediction. The structure of the M3 network model is shown schematically in FIG. 4. Because it takes a temporally ordered stack of target action frames as network input, the M3 network model has a stronger ability to capture motion information and has a clear advantage in recognizing behaviors with obvious dynamic characteristics.

After the current-time image and the tracking region images corresponding to the human body trajectory information have been acquired, the video monitoring device superimposes them in temporal order. In a specific implementation, the video monitoring device uses the M3 network model and, exploiting the trajectory information, takes the sequential superposition of the tracking region images of the same target at the current time and several previous times as the model input, i.e.:
S_t^(i) = [I_{t-n}^(i), ..., I_{t-1}^(i), I_t^(i)];
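A minimal sketch of this temporal stacking, assuming the crops are stacked along the channel axis; the axis choice and the number of frames n are assumptions:

```python
import numpy as np

def stack_track_crops(crops):
    """Stack the tracked region crops of one target in temporal order as
    the M3 input: oldest frame first, current frame last.

    crops: list of n arrays shaped (H, W, 3); returns shape (H, W, 3 * n).
    """
    return np.concatenate(crops, axis=-1)
```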
The middle layers of the M3 network model simultaneously fuse the deep features of the background region sequence of the current target and the hidden-layer features of the historical sequences of other targets in the current target's neighborhood; the information from neighboring targets helps improve the prediction accuracy of the algorithm.
The hidden-layer feature fusion of the M3 network model likewise takes place at the first fully connected layer of the network, as shown by the first FC layer in FIG. 4. For the background region, the M3 network model also takes the sequence of background regions along the trajectory, [B_{t-n}^(i), ..., B_t^(i)], as input. The acquisition of neighboring target features is essentially the same as in the M2 network model: the inter-target distances at the current time and a preset threshold serve as the selection criteria for neighboring objects, and the maxima and weighted means of their FC1 features are computed to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
The output of the M3 network model is likewise a multi-dimensional vector whose length equals the number of behavior categories to be recognized; the score on each output dimension is the predicted probability of that category.
After the current-time image and the tracking region images have been superimposed in order, the video monitoring device performs a weighted summation of the behavior category scores and the result produced from the sequential superposition, and outputs the corresponding behavior category. In a specific implementation, the video monitoring device fuses the processing results of the M2 network model and the M3 network model to obtain a combined behavior category prediction for the target to be detected; the fusion method can be a weighted sum of the two sets of network results, with the weights obtained by fitting on a training set.
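A sketch of this score-level fusion; alpha is the assumed M2 weight, fitted on a training set as described:

```python
import numpy as np

def fuse_scores(m2_scores, m3_scores, alpha):
    """Weighted sum of the M2 and M3 category score vectors; returns the
    index of the predicted category and the fused scores."""
    fused = alpha * np.asarray(m2_scores) + (1.0 - alpha) * np.asarray(m3_scores)
    return int(np.argmax(fused)), fused
```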
Combining the characteristics of the behaviors that appear in surveillance video, the embodiments of the present invention design a fusion method based on hidden-layer features in single-frame-input and multi-frame-input networks, using the combination of the current target's foreground, its background image information, and the neighboring target information as new implicit features. This enriches the information available to the classification network, so that the deep model used for classification can exploit both the information of the background region where the current target is located and the behavior information of other targets in the adjacent region. Such auxiliary information is very valuable for behavior recognition in surveillance video and improves the behavior recognition performance of the whole system.
Through the above solution, the embodiment of the present invention provides a method for human behavior recognition in video that improves the real-time performance and accuracy of video recognition.
In an embodiment, in order to further improve the real-time performance and accuracy of video recognition, FIG. 5 is a schematic flowchart of the step of outputting the corresponding behavior category according to the behavior category score in a specific implementation of the present invention.

As an implementation, the foregoing step S4 includes:
Step S41: if the behavior category score is higher than the threshold of a preset behavior category, output that behavior category.

Specifically, after the behavior category scores of the targets in the human body regions whose predicted value is the human category have been calculated, the video monitoring device outputs the corresponding behavior category according to those scores.

If the behavior category score is higher than the threshold of the preset behavior category, the behavior category is output; in particular, if the score output at this stage on a category with obvious static characteristics exceeds a certain threshold, that category is output directly as the final prediction result.
Step S42: if the behavior category score is not higher than the threshold of the preset behavior category, calculate and output the corresponding behavior category in combination with the human body trajectory information.

Specifically, if the behavior category score is not higher than the threshold of the preset behavior category, the corresponding behavior category is calculated and output in combination with the human body trajectory information.
The video monitoring device acquires the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information. In a specific implementation, it can take the superposition of the images of the same target at previous times as the input of the M3 network model, the multi-frame sequential-input behavior recognition model based on background and neighboring target features, for further category prediction. The structure of the M3 network model is shown schematically in FIG. 4. Because it takes a temporally ordered stack of target action frames as network input, the M3 network model has a stronger ability to capture motion information and has a clear advantage in recognizing behaviors with obvious dynamic characteristics.

After the current-time image and the tracking region images corresponding to the human body trajectory information have been acquired, the video monitoring device superimposes them in temporal order. In a specific implementation, the video monitoring device uses the M3 network model and, exploiting the trajectory information, takes the sequential superposition of the tracking region images of the same target at the current time and several previous times as the model input, i.e.:

S_t^(i) = [I_{t-n}^(i), ..., I_{t-1}^(i), I_t^(i)];
The middle layers of the M3 network model simultaneously fuse the deep features of the background region sequence of the current target and the hidden-layer features of the historical sequences of other targets in the current target's neighborhood; the information from neighboring targets helps improve the prediction accuracy of the algorithm.

The hidden-layer feature fusion of the M3 network model likewise takes place at the first fully connected layer of the network, as shown by the first FC layer in FIG. 4. For the background region, the M3 network model also takes the sequence of background regions along the trajectory, [B_{t-n}^(i), ..., B_t^(i)], as input. The acquisition of neighboring target features is essentially the same as in the M2 network model: the inter-target distances at the current time and a preset threshold serve as the selection criteria for neighboring objects, and the maxima and weighted means of their FC1 features are computed to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
The output of the M3 network model is likewise a multi-dimensional vector whose length equals the number of behavior categories to be recognized; the score on each output dimension is the predicted probability of that category.

After the current-time image and the tracking region images have been superimposed in order, the video monitoring device performs a weighted summation of the behavior category scores and the result produced from the sequential superposition, and outputs the corresponding behavior category. In a specific implementation, the video monitoring device fuses the processing results of the M2 and M3 network models to obtain a combined behavior category prediction for the target to be detected; the fusion method can be a weighted sum of the two sets of network results, with the weights obtained by fitting on a training set.

Through the above solution, the embodiment of the present invention provides a method for human behavior recognition in video that further improves the real-time performance and accuracy of video recognition.
In an embodiment, in order to further improve the real-time performance and accuracy of video recognition, FIG. 6 is a schematic flowchart of the step of calculating the behavior category scores of the targets in the human body regions whose predicted value is the human category in a specific implementation of the present invention.

As an implementation, the foregoing step S3 includes:
Step S31: acquire the background image of the human body region whose predicted value is the human category, and obtain the description information of the background image.

Specifically, after the non-human-target filtering algorithm has been applied, the predicted values corresponding to the human body regions output, and the regions whose predicted value is the non-human category filtered out, the video monitoring device acquires the background image of each human body region whose predicted value is the human category and obtains its description information.

In a specific implementation, if the prediction produced by the M1 network model is the human category (i.e., foreground in the frame), the video monitoring device can use the structurally more complex, more discriminative M2 network model, a non-sequential-input behavior recognition model based on neighboring target features, to recognize the behavior of each human body region within a single frame; the structure of this network model is shown in FIG. 3. The hidden layer of the M2 network model incorporates the background image of the current human target and the hidden-layer feature information of neighboring targets, and feature fusion takes place at the first fully connected layer of the network, as shown by the first FC layer in FIG. 3. The background image of the region where the target is located can be obtained from a preset clean background image by taking the part corresponding to the position of the detection region. A complete background image can be obtained from a preset standard background image, or from a dynamically updated background model. Denoting the background image obtained for a target i at time t as B_t^(i), the description information of a target region can be expressed as:

O(i, t) = (I_t^(i), R_t^(i), B_t^(i));

where I_t^(i) and B_t^(i) share the same location region R_t^(i).
Step S32: calculate the background region information corresponding to the background image according to the description information of the background image, and calculate the neighboring target information corresponding to the background image.

Specifically, after the background image of the human body region whose predicted value is the human category has been acquired and its description information obtained, the video monitoring device calculates the background region information corresponding to the background image according to that description information, and also calculates the corresponding neighboring target information.

In a specific implementation, the background image passes through several convolutional layers to obtain its visual feature description, and then through a fully connected layer to obtain its corresponding first hidden-layer feature, whose dimension is the same as that of the first hidden layer obtained from the target image. For the target image, the feature computation of its first hidden layer can be expressed as:

FC_1(I_t^(i)) = f_1(c_m(... c_1(I_t^(i))));

where c(·) denotes a convolution operation on the image, and f(·) denotes the matrix multiplication and bias operations of a fully connected layer. Similarly, for the background image at the same position, the feature of its first hidden layer is:

FC_1(B_t^(i)) = f_1(c_m(... c_1(B_t^(i)))).
Part of the feature composition of the first hidden layer of this model also comes from neighboring targets, chiefly the target features found in the regions adjacent to the current region. The extent of the adjacent neighborhood can be determined by setting a threshold. The center position of the current target is recorded as:

(x_c^(i), y_c^(i)) = (x_t^(i) + w_t^(i)/2, y_t^(i) + h_t^(i)/2);

where x_t^(i) is the abscissa of the upper-left corner of the target region, y_t^(i) is the ordinate of the upper-left corner of the target region, w_t^(i) is the width of the target region, and h_t^(i) is the height of the target region. The center position points (x_c^(j), y_c^(j)) of the other foreground targets in the same frame are computed at the same time. When the Euclidean distance d_ij between (x_c^(i), y_c^(i)) and (x_c^(j), y_c^(j)) is less than a certain threshold D, or the two regions intersect, that foreground target is counted among the valid neighboring targets of the current target.
Step S33: calculate the behavior category score of the target in the human body region by combining the background region information corresponding to the background image with the neighboring target information.

Specifically, after the background region information corresponding to the background image and the neighboring target information have been calculated from the description information of the background image, the video monitoring device combines the two to calculate the behavior category score of the target in the human body region.
In a specific implementation, the video monitoring device can record the set of first fully connected layer features computed from all neighboring target regions as {FC_1(I_t^(j))}, j ∈ N(i), and separately compute, over every dimension of these feature values, the maximum

F_max = max_{j ∈ N(i)} FC_1(I_t^(j))

and the weighted average

F_avg = Σ_{j ∈ N(i)} w_j · FC_1(I_t^(j)),

as components of the feature description of the neighboring targets. Concatenating these two groups of features yields the overall feature representation of the neighboring target description:

F_nb = [F_max, F_avg].

If the current target has no neighboring target in the frame, the values of F_nb are all set to zero. After the background region information and the neighboring target information are combined, the feature of the first fully connected layer of the behavior recognition network model can be expressed as:

FC_1' = [FC_1(I_t^(i)), FC_1(B_t^(i)), F_nb].
This fused feature then passes through the subsequent fully connected layers, so that the whole network model naturally exploits the background region information and the context information of the current target during recognition.

The output of the M2 network model is a multi-dimensional vector whose length equals the number of behavior categories to be recognized; the score on each output dimension represents the predicted probability of that category.

Through the above solution, the embodiment of the present invention provides a method for human behavior recognition in video that further improves the real-time performance and accuracy of video recognition.
In an embodiment, in order to further improve the real-time performance and accuracy of video recognition, FIG. 7 is a schematic flowchart of the step of calculating and outputting the corresponding behavior category in combination with the human body trajectory information in a specific implementation of the present invention.

As an implementation, the foregoing step S42 includes:
Step S421: acquire the current-time image of the video and the tracking region images corresponding to the human body trajectory information.

Specifically, the video monitoring device acquires the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information.

In a specific implementation, the video monitoring device can take the superposition of the images of the same target at previous times as the input of the M3 network model, the multi-frame sequential-input behavior recognition network model based on background and neighboring target features, for further category prediction. The structure of the M3 network model is shown schematically in FIG. 4. Because it takes a temporally ordered stack of target action frames as network input, the M3 network model has a stronger ability to capture motion information and has a clear advantage in recognizing behaviors with obvious dynamic characteristics.
Step S422: superimpose the current-time image and the tracking region images in temporal order.

Specifically, after the current-time image of the video and the tracking region images corresponding to the human body trajectory information have been acquired, the video monitoring device superimposes them in temporal order.

In a specific implementation, the video monitoring device uses the M3 network model and, exploiting the trajectory information, takes the sequential superposition of the tracking region images of the same target at the current time and several previous times as the model input, i.e.:

S_t^(i) = [I_{t-n}^(i), ..., I_{t-1}^(i), I_t^(i)];
The middle layers of the M3 network model simultaneously fuse the deep features of the background region sequence of the current target and the hidden-layer features of the historical sequences of other targets in the current target's neighborhood; the information from neighboring targets helps improve the prediction accuracy of the algorithm.

The hidden-layer feature fusion of the M3 network model likewise takes place at the first fully connected layer of the network, as shown by the first FC layer in FIG. 4. For the background region, the M3 network model also takes the sequence of background regions along the trajectory, [B_{t-n}^(i), ..., B_t^(i)], as input. The acquisition of neighboring target features is essentially the same as in the M2 network model: the inter-target distances at the current time and a preset threshold serve as the selection criteria for neighboring objects, and the maxima and weighted means of their FC1 features are computed to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
The output of the M3 network model is likewise a multi-dimensional vector whose length equals the number of behavior categories to be recognized; the score on each output dimension is the predicted probability of that category.
Step S423: perform a weighted summation of the behavior category scores and the result of the sequential superposition, and output the corresponding behavior category.

Specifically, after the current-time image and the tracking region images have been superimposed in order and the multi-frame stacked input has been processed, the video monitoring device performs a weighted summation of the behavior category scores and the result produced from the sequential superposition, and outputs the corresponding behavior category.

In a specific implementation, the video monitoring device fuses the processing results of the M2 network model and the M3 network model to obtain a combined behavior category prediction for the target to be detected; the fusion method can be a weighted sum of the two sets of network results, with the weights obtained by fitting on a training set.

Through the above solution, the embodiment of the present invention provides a method for human behavior recognition in video that further improves the real-time performance and accuracy of video recognition.
In an embodiment, in order to further improve the real-time performance and accuracy of video recognition, FIG. 8 is a schematic flowchart of the step of calculating the predicted values corresponding to the human body regions from those regions and filtering out the regions whose predicted value is the non-human category in a specific implementation of the present invention.

As an implementation, the foregoing step S2 includes:
Step S21: acquire the human body regions, analyze them, and output the predicted value corresponding to each human body region.

Specifically, after the human body regions in the video to be recognized have been detected and the human body trajectory information in those regions acquired, the video monitoring device acquires and analyzes the human body regions and outputs the predicted value corresponding to each region.

In a specific implementation, when a human body region in the current frame has been acquired, the video monitoring device feeds the image of that region into the background-filter M1 network model for analysis; the structure of the M1 network model is shown in FIG. 2. The M1 network model is a deep convolutional network model based on single-frame image input: the network input is the detected foreground region image, followed by several convolutional layers (CONV) with attached ReLU and pooling layers, and then several fully connected layers (FC) for deep feature computation. The output layer, the last layer of the network, is 2-dimensional; after a sigmoid transformation its two components correspond to the scores on the human category and the non-human category, respectively.
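A minimal sketch of such a single-frame background-filter model in PyTorch; the layer counts and channel widths are illustrative assumptions rather than the configuration disclosed here:

```python
import torch
import torch.nn as nn

class M1BackgroundFilter(nn.Module):
    """Sketch of M1: CONV + ReLU + pooling blocks, then FC layers, with a
    2-dimensional output that scores the human and non-human categories
    after a sigmoid transformation."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Sequential(
            nn.Linear(32 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, 2),                 # [human, non-human]
        )

    def forward(self, x):                      # x: detected foreground crops
        logits = self.classifier(self.features(x).flatten(1))
        return torch.sigmoid(logits)           # category scores in [0, 1]
```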
Step S22: if the predicted value is the non-human category, filter that human body region out of the acquired human body regions.

Specifically, if the predicted value is the non-human category, the human body region with that predicted value is filtered out of the acquired regions. In a specific implementation, the classification by the M1 network model allows the video monitoring device to filter out regions that the earlier detection and tracking algorithms misreported as the human category. Because the network at this stage computes only on the foreground images produced by the detection step (rather than on the whole image), it adds no significant computational overhead; it improves detection accuracy while still meeting the real-time requirements of the whole system. Meanwhile, the numbers of convolutional layers and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitored picture and the hardware capability of the deployed device.
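A sketch of applying the M1 scores to drop false positives, assuming the model above and an illustrative decision threshold tau:

```python
def filter_regions(m1, crops, tau=0.5):
    """Keep only the detected regions that M1 scores as human.

    m1: a scoring model such as M1BackgroundFilter above.
    crops: batch tensor of detected foreground region images.
    Returns a boolean mask over the batch.
    """
    scores = m1(crops)          # shape (N, 2): [human, non-human] scores
    return scores[:, 0] > tau   # True where the human score clears tau
```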
If the predicted value is the human category, the step of calculating the behavior category scores of the targets in the human body regions whose predicted value is the human category is performed.

Specifically, if the predicted value is the human category, the video monitoring device performs the foregoing step S3 to calculate the behavior category scores of the targets in those human body regions.

Through the above solution, the embodiment of the present invention provides a method for human behavior recognition in video that further improves the real-time performance and accuracy of video recognition.
In an embodiment, in order to further improve the real-time performance and accuracy of video recognition, FIG. 9 is a schematic flowchart of the step of detecting the human body regions in the video to be recognized and acquiring the human body trajectory information in those regions in a specific implementation of the present invention.

As an implementation, the foregoing step S1 includes:
Step S11: acquire the video to be recognized, and detect the human body regions in the target video.

Specifically, the video monitoring device acquires the video to be recognized and detects the human body regions in the target video.

In a specific implementation, the video monitoring device can obtain the original video to be recognized through a front-end video capture device, and detect the human body regions in the video using a detector based on traditional feature classification.
Step S12: track the pedestrians in the human body regions to obtain the human body trajectory information in those regions.

Specifically, after the video to be recognized has been acquired and the human body regions in the target video detected, the video monitoring device tracks the pedestrians in the human body regions to obtain the human body trajectory information in those regions.

In a specific implementation, the video monitoring device can track the pedestrians in the picture using a tracking algorithm based on detection region matching, thereby obtaining the motion trajectory information of the human bodies in the picture.
The results of human body detection and tracking can be saved in the form of a target ID together with a sequence of detection region images, i.e.:
O(i, t) = (I_t^(i), R_t^(i));
where O(i, t) denotes the information of target i at time t, I_t^(i) is the image content detected for that target at time t, and R_t^(i) is the position of the region where the target is located at time t; R_t^(i) records, as a vector (x, y, w, h), the abscissa and ordinate of the region's upper-left corner together with its width and height.
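A sketch of how such records might be stored; the dataclass and the dict keyed by target ID are illustrative choices, not structures specified in this disclosure:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TargetObservation:
    """One record O(i, t) = (I_t, R_t): the crop detected for target i at
    time t plus its region vector (x, y, w, h)."""
    target_id: int
    t: int
    image: np.ndarray                              # I_t: detection-region pixels
    region: tuple[float, float, float, float]      # R_t: (x, y, w, h)

# A track is then the time-ordered list of observations for one target ID.
tracks: dict[int, list[TargetObservation]] = {}
```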
Through the above solution, the embodiment of the present invention provides a method for human behavior recognition in video that further improves the real-time performance and accuracy of video recognition.
Based on the implementation of the foregoing method embodiments for human behavior recognition in video, the present invention further provides corresponding apparatus embodiments.

As shown in FIG. 10, a first embodiment of the present invention provides an apparatus for human behavior recognition in video, including:
a detection module 100, configured to detect the human body regions in the video to be recognized and to acquire the human body trajectory information in those regions.

The executing entity of the apparatus in this embodiment of the present invention may be a video monitoring device or a video recognition device; this embodiment takes a video monitoring device as an example, but is of course not limited to it, and other devices capable of recognizing human behavior in video may also be used.
Specifically, the detection module 100 detects the human body regions in the video to be recognized and acquires the human body trajectory information in those regions.

The video monitoring device acquires the video to be recognized and detects the human body regions in the target video. In a specific implementation, the video monitoring device can obtain the original video to be recognized through a front-end video capture device, and detect the human body regions in the video using a detector based on traditional feature classification.

After the video to be recognized has been acquired and the human body regions in the target video detected, the detection module 100 tracks the pedestrians in the human body regions to obtain the human body trajectory information in those regions. In a specific implementation, the video monitoring device can track the pedestrians in the picture using a tracking algorithm based on detection region matching, thereby obtaining the motion trajectory information of the human bodies in the picture.
The results of human body detection and tracking can be saved in the form of a target ID together with a sequence of detection region images, i.e.:
O(i, t) = (I_t^(i), R_t^(i));
where O(i, t) denotes the information of target i at time t, I_t^(i) is the image content detected for that target at time t, and R_t^(i) is the position of the region where the target is located at time t; R_t^(i) records, as a vector (x, y, w, h), the abscissa and ordinate of the region's upper-left corner together with its width and height.
a filtering module 200, configured to calculate the predicted value corresponding to each human body region from that region and to filter out the regions whose predicted value is the non-human category, obtaining the human body regions whose predicted value is the human category.

Specifically, after the human body regions in the video to be recognized have been detected and the human body trajectory information in those regions acquired, the filtering module 200 calculates the predicted value corresponding to each human body region, filters out the regions whose predicted value is the non-human category, and obtains the regions whose predicted value is the human category.
The video monitoring device acquires and analyzes the human body regions and outputs the predicted value corresponding to each region; the predicted value is either the human category or the non-human category. In a specific implementation, when a human body region in the current frame has been acquired, the video monitoring device feeds the image of that region into the background-filter M1 network model for analysis; the structure of the M1 network model is shown in FIG. 2. The M1 network model is a deep convolutional network model based on single-frame image input: the network input is the detected foreground region image, followed by several convolutional layers (CONV) with attached ReLU and pooling layers, and then several fully connected layers (FC) for deep feature computation. The output layer, the last layer of the network, is 2-dimensional; after a sigmoid transformation its two components correspond to the scores on the human category and the non-human category, respectively.

If the predicted value is the non-human category, the filtering module 200 filters that human body region out of the acquired regions. The classification by the M1 network model makes it possible to filter out regions that the earlier detection and tracking algorithms misreported as the human category. Because the network at this stage computes only on the foreground images produced by the detection step (rather than on the whole image), it adds no significant computational overhead; it improves detection accuracy while still meeting the real-time requirements of the whole system. Meanwhile, the numbers of convolutional layers and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitored picture and the hardware capability of the deployed device.

In this embodiment of the present invention, a deep network model of relatively simple structure is first used after the detection and tracking stage to further filter the detected foreground regions. In the earlier detection stage, the algorithm's threshold for foreground prediction is deliberately lowered so that the algorithm returns as many foreground regions as possible, minimizing missed detections. Because the network at this stage computes only on the foreground images produced by the detection step (rather than on the whole image), the computational overhead of the algorithm is greatly reduced; it improves detection accuracy while readily satisfying the real-time requirements of the whole system.
a calculation module 300, configured to calculate, for the human body regions whose predicted value is the human category, the behavior category scores of the targets in those regions.

Specifically, after the predicted values corresponding to the human body regions have been calculated from those regions and the regions whose predicted value is the non-human category filtered out, leaving the regions whose predicted value is the human category, the calculation module 300 calculates the behavior category scores of the targets in those regions.

The video monitoring device acquires the background image of each human body region whose predicted value is the human category and obtains its description information. In a specific implementation, if the prediction produced by the M1 network model is the human category (i.e., foreground in the frame), the video monitoring device can use the structurally more complex, more discriminative M2 network model, a non-sequential-input behavior recognition model based on neighboring target features, to recognize the behavior of each human body region within a single frame; the structure of this network model is shown in FIG. 3. The hidden layer of the M2 network model incorporates the background image of the current human target and the hidden-layer feature information of neighboring targets, and feature fusion takes place at the first fully connected layer of the network, as shown by the first FC layer in FIG. 3. The background image of the region where the target is located can be obtained from a preset clean background image by taking the part corresponding to the position of the detection region. A complete background image can be obtained from a preset standard background image, or from a dynamically updated background model. Denoting the background image obtained for a target i at time t as B_t^(i), the description information of a target region can be expressed as:
O(i, t) = (I_t^(i), R_t^(i), B_t^(i));

where I_t^(i) and B_t^(i) share the same location region R_t^(i).
After the background image of the human body region whose predicted value is the human category has been acquired and its description information obtained, the calculation module 300 calculates the background region information corresponding to the background image according to that description information, and also calculates the corresponding neighboring target information. In a specific implementation, the background image passes through several convolutional layers to obtain its visual feature description, and then through a fully connected layer to obtain its corresponding first hidden-layer feature, whose dimension is the same as that of the first hidden layer obtained from the target image. For the target image, the feature computation of its first hidden layer can be expressed as:
FC_1(I_t^(i)) = f_1(c_m(... c_1(I_t^(i))));

where c(·) denotes a convolution operation on the image, and f(·) denotes the matrix multiplication and bias operations of a fully connected layer. Similarly, for the background image at the same position, the feature of its first hidden layer is:

FC_1(B_t^(i)) = f_1(c_m(... c_1(B_t^(i)))).
Part of the feature composition of the first hidden layer of this model also comes from neighboring targets, chiefly the target features found in the regions adjacent to the current region. The extent of the adjacent neighborhood can be determined by setting a threshold. The center position of the current target is recorded as:

(x_c^(i), y_c^(i)) = (x_t^(i) + w_t^(i)/2, y_t^(i) + h_t^(i)/2);

where x_t^(i) is the abscissa of the upper-left corner of the target region, y_t^(i) is the ordinate of the upper-left corner of the target region, w_t^(i) is the width of the target region, and h_t^(i) is the height of the target region. The center position points (x_c^(j), y_c^(j)) of the other foreground targets in the same frame are computed at the same time. When the Euclidean distance d_ij between (x_c^(i), y_c^(i)) and (x_c^(j), y_c^(j)) is less than a certain threshold D, or the two regions intersect, that foreground target is counted among the valid neighboring targets of the current target.
After computing the background region information corresponding to the background image according to its description information, together with the corresponding neighboring target information, the calculation module 300 combines the two to compute the behavior category score of the target in the human body region. In a specific implementation, the video monitoring device records the set of first fully connected layer features computed over all neighboring target regions, written here as $\{FC_1(I_t^{(j)})\}_{j \in N(i)}$ where $N(i)$ denotes the valid neighboring targets of target i, and computes the per-dimension maximum of these feature values:
$FC_1^{\max} = \max_{j \in N(i)} FC_1(I_t^{(j)})$ (element-wise);
and the weighted average:
$FC_1^{\mathrm{avg}} = \sum_{j \in N(i)} w_j\, FC_1(I_t^{(j)})$, with $\sum_j w_j = 1$;
as the components of the feature description of the neighboring targets. Concatenating the above two groups of features yields the overall feature representation describing the neighboring targets, namely:
$FC_1^{\mathrm{ctx}} = \big[FC_1^{\max},\; FC_1^{\mathrm{avg}}\big]$;
If the current target has no neighboring target in the frame, the values of $FC_1^{\mathrm{ctx}}$ are all set to zero. After integrating the background region information and the neighboring target information, the feature of the first fully connected layer of the behavior recognition network model can be expressed as:
$FC_1 = \big[FC_1(I_t^{(i)}),\; FC_1(B_t^{(i)}),\; FC_1^{\mathrm{ctx}}\big]$;
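The sketch below assembles the fused FC1 feature as just described: a per-dimension maximum and a weighted average over the neighbors' FC1 vectors form the context part, which is zeroed when no neighbor exists and is then concatenated with the target and background FC1 features. The uniform weighting is an assumption, since the description leaves the weights unspecified:

```python
import numpy as np

def fused_fc1(fc1_target, fc1_background, neighbor_fc1, weights=None):
    """Concatenate target, background, and neighbor-context FC1 features."""
    dim = fc1_target.shape[0]
    if len(neighbor_fc1) == 0:
        ctx = np.zeros(2 * dim)                 # no neighbors: all zeros
    else:
        feats = np.stack(neighbor_fc1)          # (num_neighbors, dim)
        if weights is None:                     # assumed: uniform weights
            weights = np.full(len(feats), 1.0 / len(feats))
        fc1_max = feats.max(axis=0)             # per-dimension maximum
        fc1_avg = weights @ feats               # weighted average
        ctx = np.concatenate([fc1_max, fc1_avg])
    return np.concatenate([fc1_target, fc1_background, ctx])

rng = np.random.default_rng(0)
f_t, f_b = rng.random(256), rng.random(256)
neighbors = [rng.random(256) for _ in range(3)]
print(fused_fc1(f_t, f_b, neighbors).shape)     # (1024,)
```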
This feature then passes through the subsequent fully connected layers, so that the whole network model naturally exploits the background region information and context information of the current target during recognition.
The output of the M2 network model is a multi-dimensional vector whose length is the number of behavior categories to be recognized; the score on each output dimension represents the predicted probability of that category.
The output module 400 is configured to output the corresponding behavior category according to the behavior category score.
Specifically, after the behavior category score of the target in the human body region predicted as the human body category has been computed, the output module 400 outputs the corresponding behavior category according to that score.
If the behavior category score is higher than the threshold of a preset behavior category, that behavior category is output; that is, when scoring the above behavior categories, if the output score on a category with obvious static characteristics exceeds a certain threshold, that category is directly output as the final prediction result.
For the different behavior types that appear in surveillance video, the embodiments of the present invention employ temporal (multi-frame image) and non-temporal (single-frame image) input networks of different structures, chosen according to the behaviors' static and dynamic characteristics, to analyze the extracted images, and finally fuse the outputs of the two networks to obtain the final behavior recognition result. Specifically, for behavior categories with clear static characteristics, such as fighting or cycling, the embodiments rely mainly on a sufficiently complex non-temporal input network model for fast prediction, because such actions have obvious features and, once they appear, can generally be judged accurately from a single frame. For behavior categories that are hard to judge from a single frame, such as walking versus jogging, a deep network that takes temporally stacked images as input is used for further analysis, providing more reliable recognition performance than a network with a single static image input. In addition, the fusion strategy for the temporal-input and non-temporal-input deep classification models adopts the idea of a cascade classifier, improving the operating efficiency of the whole classification system and meeting the requirement of real-time behavior recognition.
If the behavior category score is not higher than the threshold of the preset behavior category, the output module 400 combines the human body trajectory information to compute and output the corresponding behavior category.
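A minimal sketch of this cascade rule (the threshold values and the choice of "static" categories are assumptions): if the single-frame M2 score on a statically obvious category clears its threshold, that category is emitted at once; otherwise the trajectory-based temporal path is taken:

```python
def cascade_predict(m2_scores, static_thresholds, m3_fallback):
    """m2_scores: dict category -> probability from the single-frame model.
    static_thresholds: per-category thresholds for static classes (assumed).
    m3_fallback: callable running the temporal model when needed."""
    for category, threshold in static_thresholds.items():
        if m2_scores.get(category, 0.0) > threshold:
            return category                 # fast path: static cue suffices
    return m3_fallback()                    # slow path: temporal analysis

scores = {"fighting": 0.92, "cycling": 0.03, "walking": 0.40}
thresholds = {"fighting": 0.8, "cycling": 0.8}       # assumed values
print(cascade_predict(scores, thresholds, lambda: "walking"))  # -> fighting
```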
The video monitoring device acquires the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information. In a specific implementation, the video monitoring device can acquire these images and use the superposition of the same target's images at previous moments as the input of the M3 network model, a multi-frame temporal-input behavior recognition model based on background and neighboring target features, for further category prediction. The structure of the M3 network model is shown in Figure 4. Since temporally ordered target action frames are superimposed as the network input, the M3 network model has a stronger ability to capture motion information and offers clear advantages for recognizing behaviors with obvious dynamic characteristics.
After acquiring the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information, the output module 400 sequentially superimposes them. In a specific implementation, the video monitoring device uses the M3 network model and, exploiting the motion trajectory information, takes the sequential superposition of the same target's tracking region images at the current moment and several previous moments as the model input, namely:
$\big\{I_{t-k}^{(i)}, \ldots, I_{t-1}^{(i)}, I_t^{(i)}\big\}$;
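A minimal sketch of this superposition step (the crop size, stack depth, and channel-wise stacking axis are assumptions; the description does not fix how the frames are concatenated):

```python
import numpy as np

def stack_track(crops):
    """Stack k RGB crops (each H x W x 3) of one target, oldest first,
    into a single (H, W, 3k) temporal input for the M3 model."""
    return np.concatenate(crops, axis=2)

track = [np.zeros((64, 64, 3), dtype=np.float32) for _ in range(5)]
print(stack_track(track).shape)   # (64, 64, 15)
```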
The middle layers of the M3 network model simultaneously fuse the deep features of the background region sequence in which the current target lies and the hidden features of the historical sequences of other targets in the current target's neighborhood; the neighboring target information helps improve the prediction accuracy of the algorithm.
The hidden layer feature fusion of the M3 network model likewise takes place at the first fully connected layer of the network, as shown by the first FC layer in Figure 4. For the background region of the M3 network model, the background region sequence along its trajectory, $\{B_{t-k}^{(i)}, \ldots, B_t^{(i)}\}$, is also taken as input. The acquisition of neighboring target features is basically consistent with the M2 network model: the inter-target distances at the current moment and a preset threshold serve as the selection criteria for neighboring objects, and the maximum and weighted mean of their FC1 features are computed to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
The output of the M3 network model is likewise a multi-dimensional vector whose length is the number of behavior categories to be recognized; the score on each output dimension is the predicted probability of that category.
After sequentially superimposing the current-time image and the tracking region images, the output module 400 performs a weighted summation of the behavior category score and the result obtained from the superimposed input, and outputs the corresponding behavior category. In a specific implementation, the video monitoring device fuses the processing results of the M2 and M3 network models to obtain a comprehensive behavior category prediction for the target to be detected; the fusion method can be a weighted sum of the two sets of network results, and the weights can be obtained by fitting on the training set.
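A minimal sketch of this fusion step; the weight value below is an assumption standing in for one fitted on the training set:

```python
import numpy as np

def fuse_scores(m2_scores, m3_scores, alpha=0.4):
    """Weighted sum of the M2 (single-frame) and M3 (temporal) category
    score vectors; alpha is assumed here and would be fitted in practice."""
    fused = alpha * np.asarray(m2_scores) + (1.0 - alpha) * np.asarray(m3_scores)
    return int(fused.argmax()), fused

m2 = [0.1, 0.6, 0.3]
m3 = [0.2, 0.2, 0.6]
best, fused = fuse_scores(m2, m3)
print(best, fused.round(2))   # 2 [0.16 0.36 0.48]
```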
Combining the characteristics of behaviors that appear in surveillance video, the present invention designs a fusion method based on hidden layer features in single-frame-input and multi-frame-input networks, using the combination of the current target's foreground, background image information, and neighboring target information as new implicit features. This enriches the information available to the classification network, so that the deep model used for classification can simultaneously exploit the information of the background region where the current target lies and the behavior information of other targets in the neighboring region, which is highly valuable auxiliary information for behavior recognition in surveillance video and improves the behavior recognition performance of the whole system.
Through the above scheme, the embodiment of the present invention provides an apparatus for human behavior recognition in video, improving the real-time performance and accuracy of video recognition.
In an embodiment, to better improve the real-time performance and accuracy of video recognition, the output module 400 is further configured to output the behavior category if the behavior category score is higher than the threshold of the preset behavior category, and, if the score is not higher than that threshold, to combine the human body trajectory information to compute and output the corresponding behavior category.
Specifically, after the behavior category score of the target in the human body region predicted as the human body category has been computed, the output module 400 outputs the corresponding behavior category according to that score.
If the behavior category score is higher than the threshold of a preset behavior category, that behavior category is output; that is, when scoring the above behavior categories, if the output score on a category with obvious static characteristics exceeds a certain threshold, that category is directly output as the final prediction result.
If the behavior category score is not higher than the threshold of the preset behavior category, the output module 400 combines the human body trajectory information to compute and output the corresponding behavior category.
The video monitoring device acquires the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information. In a specific implementation, the video monitoring device can acquire these images and use the superposition of the same target's images at previous moments as the input of the M3 network model, a multi-frame temporal-input behavior recognition model based on background and neighboring target features, for further category prediction. The structure of the M3 network model is shown in Figure 4. Since temporally ordered target action frames are superimposed as the network input, the M3 network model has a stronger ability to capture motion information and offers clear advantages for recognizing behaviors with obvious dynamic characteristics.
After acquiring the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information, the output module 400 sequentially superimposes them. In a specific implementation, the video monitoring device uses the M3 network model and, exploiting the motion trajectory information, takes the sequential superposition of the same target's tracking region images at the current moment and several previous moments as the model input, namely:
$\big\{I_{t-k}^{(i)}, \ldots, I_{t-1}^{(i)}, I_t^{(i)}\big\}$;
The middle layers of the M3 network model simultaneously fuse the deep features of the background region sequence in which the current target lies and the hidden features of the historical sequences of other targets in the current target's neighborhood; the neighboring target information helps improve the prediction accuracy of the algorithm.
The hidden layer feature fusion of the M3 network model likewise takes place at the first fully connected layer of the network, as shown by the first FC layer in Figure 4. For the background region of the M3 network model, the background region sequence along its trajectory, $\{B_{t-k}^{(i)}, \ldots, B_t^{(i)}\}$, is also taken as input. The acquisition of neighboring target features is basically consistent with the M2 network model: the inter-target distances at the current moment and a preset threshold serve as the selection criteria for neighboring objects, and the maximum and weighted mean of their FC1 features are computed to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
The output of the M3 network model is likewise a multi-dimensional vector whose length is the number of behavior categories to be recognized; the score on each output dimension is the predicted probability of that category.
After sequentially superimposing the current-time image and the tracking region images, the output module 400 performs a weighted summation of the behavior category score and the result obtained from the superimposed input, and outputs the corresponding behavior category. In a specific implementation, the video monitoring device fuses the processing results of the M2 and M3 network models to obtain a comprehensive behavior category prediction for the target to be detected; the fusion method can be a weighted sum of the two sets of network results, and the weights can be obtained by fitting on the training set.
Through the above scheme, the embodiment of the present invention provides an apparatus for human behavior recognition in video, improving the real-time performance and accuracy of video recognition.
In an embodiment, to better improve the real-time performance and accuracy of video recognition, the calculation module 300 is further configured to acquire the background image of the human body region whose predicted value is the human body category and obtain its description information; to compute, according to that description information, the background region information corresponding to the background image and the corresponding neighboring target information; and to combine the background region information and the neighboring target information to compute the behavior category score of the target in the human body region.
Specifically, after the predicted value corresponding to the human body region has been computed from that region, and human body regions whose predicted value is a non-human category have been filtered out so that only regions predicted as the human body category remain, the calculation module 300 acquires the background image of each such region and obtains its description information.
In a specific implementation, if the prediction obtained by the M1 network model is the human body category (i.e., foreground in the frame), the video monitoring device can use M2, a structurally more complex and more discriminative non-temporal-input behavior recognition network model based on neighboring target features, to recognize the behavior of each human body region within a single frame; the structure of this network model is shown in Figure 3. The hidden layer of the M2 network model incorporates the background image of the current human target's location and the hidden layer features of neighboring targets; the feature fusion takes place at the first fully connected layer of the network, as shown by the first FC layer in Figure 3. The background image of the target's region can be taken from a preset clean background image, simply by cropping the part corresponding to the position of the detection region. The complete background image can be obtained from a preset standard background image or from a dynamically updated background model. Denoting the background image obtained for a target i at time t as $B_t^{(i)}$, the description information of a target region can be expressed as:
$O(i,t) = \big(I_t^{(i)},\, R_t^{(i)},\, B_t^{(i)}\big)$;
where $I_t^{(i)}$ and $B_t^{(i)}$ share the same location region $R_t^{(i)}$.
After obtaining the background image of the human body region predicted as the human body category and its description information, the calculation module 300 computes, according to that description information, the background region information corresponding to the background image and the corresponding neighboring target information.
In a specific implementation, the background image passes through several convolutional layers to obtain its visual feature description, and then through a fully connected layer to obtain its corresponding first hidden layer feature, whose dimension equals that of the first hidden layer obtained from the target image. For the target image, the feature computation of its first hidden layer can be expressed as:
$FC_1(I_t^{(i)}) = f_1(c_m(\cdots c_1(I_t^{(i)})))$;
where c(·) denotes a convolution operation on the image and f(·) denotes the matrix multiplication and bias operations of the fully connected layer. Similarly, for the background position image, the feature of its first hidden layer is:
$FC_1(B_t^{(i)}) = f_1(c_m(\cdots c_1(B_t^{(i)})))$;
In addition, part of the feature composition of the model's first hidden layer comes from neighboring targets, chiefly the target features found in the regions adjacent to the current region. The range of the neighborhood can be determined by setting a threshold. The center position of the current target is recorded as:
$(x_c^{(i)}, y_c^{(i)}) = \big(x_t^{(i)} + w_t^{(i)}/2,\; y_t^{(i)} + h_t^{(i)}/2\big)$;
where $x_t^{(i)}$ is the abscissa of the upper-left corner of the target region, $y_t^{(i)}$ is the ordinate of the upper-left corner, $w_t^{(i)}$ is the width of the target region, and $h_t^{(i)}$ is its height. The center positions $(x_c^{(j)}, y_c^{(j)})$ of the other foreground targets in the same frame are computed at the same time; when the Euclidean distance $d_{ij}$ between $(x_c^{(i)}, y_c^{(i)})$ and $(x_c^{(j)}, y_c^{(j)})$ is less than a certain threshold D, or the two regions intersect, that foreground is classified as a valid neighboring target of the current target.
After computing the background region information corresponding to the background image according to its description information, together with the corresponding neighboring target information, the calculation module 300 combines the two to compute the behavior category score of the target in the human body region.
In a specific implementation, the video monitoring device records the set of first fully connected layer features computed over all neighboring target regions, written here as $\{FC_1(I_t^{(j)})\}_{j \in N(i)}$ where $N(i)$ denotes the valid neighboring targets of target i, and computes the per-dimension maximum of these feature values:
$FC_1^{\max} = \max_{j \in N(i)} FC_1(I_t^{(j)})$ (element-wise);
and the weighted average:
$FC_1^{\mathrm{avg}} = \sum_{j \in N(i)} w_j\, FC_1(I_t^{(j)})$, with $\sum_j w_j = 1$;
as the components of the feature description of the neighboring targets. Concatenating the above two groups of features yields the overall feature representation describing the neighboring targets, namely:
$FC_1^{\mathrm{ctx}} = \big[FC_1^{\max},\; FC_1^{\mathrm{avg}}\big]$;
If the current target has no neighboring target in the frame, the values of $FC_1^{\mathrm{ctx}}$ are all set to zero. After integrating the background region information and the neighboring target information, the feature of the first fully connected layer of the behavior recognition network model can be expressed as:
$FC_1 = \big[FC_1(I_t^{(i)}),\; FC_1(B_t^{(i)}),\; FC_1^{\mathrm{ctx}}\big]$;
This feature then passes through the subsequent fully connected layers, so that the whole network model naturally exploits the background region information and context information of the current target during recognition.
The output of the M2 network model is a multi-dimensional vector whose length is the number of behavior categories to be recognized; the score on each output dimension represents the predicted probability of that category.
Through the above scheme, the embodiment of the present invention provides an apparatus for human behavior recognition in video, better improving the real-time performance and accuracy of video recognition.
In an embodiment, to better improve the real-time performance and accuracy of video recognition, the output module 400 is further configured to acquire the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information; to sequentially superimpose the current-time image and the tracking region images; and to perform a weighted summation of the behavior category score and the result obtained from the superimposed input, outputting the corresponding behavior category.
Specifically, the output module 400 acquires the current-time image of the video to be recognized and the tracking region images corresponding to the human body trajectory information.
In a specific implementation, the video monitoring device can acquire the current-time image and the tracking region images corresponding to the human body trajectory information, and use the superposition of the same target's images at previous moments as the input of the M3 network model, a multi-frame temporal-input behavior recognition network model based on background and neighboring target features, for further category prediction. The structure of the M3 network model is shown in Figure 4. Since temporally ordered target action frames are superimposed as the network input, the M3 network model has a stronger ability to capture motion information and offers clear advantages for recognizing behaviors with obvious dynamic characteristics.
After acquiring the current-time image of the video and the tracking region images corresponding to the human body trajectory information, the output module 400 sequentially superimposes them.
In a specific implementation, the video monitoring device uses the M3 network model and, exploiting the motion trajectory information, takes the sequential superposition of the same target's tracking region images at the current moment and several previous moments as the model input, namely:
$\big\{I_{t-k}^{(i)}, \ldots, I_{t-1}^{(i)}, I_t^{(i)}\big\}$;
The middle layers of the M3 network model simultaneously fuse the deep features of the background region sequence in which the current target lies and the hidden features of the historical sequences of other targets in the current target's neighborhood; the neighboring target information helps improve the prediction accuracy of the algorithm.
The hidden layer feature fusion of the M3 network model likewise takes place at the first fully connected layer of the network, as shown by the first FC layer in Figure 4. For the background region of the M3 network model, the background region sequence along its trajectory, $\{B_{t-k}^{(i)}, \ldots, B_t^{(i)}\}$, is also taken as input. The acquisition of neighboring target features is basically consistent with the M2 network model: the inter-target distances at the current moment and a preset threshold serve as the selection criteria for neighboring objects, and the maximum and weighted mean of their FC1 features are computed to form the neighboring target feature description. After fusion, the features are fed into the subsequent fully connected layers for further recognition computation.
The output of the M3 network model is likewise a multi-dimensional vector whose length is the number of behavior categories to be recognized; the score on each output dimension is the predicted probability of that category.
After sequentially superimposing the current-time image and the tracking region images, the output module 400 performs a weighted summation of the behavior category score and the result obtained from the superimposed input, and outputs the corresponding behavior category.
In a specific implementation, the video monitoring device fuses the processing results of the M2 and M3 network models to obtain a comprehensive behavior category prediction for the target to be detected; the fusion method can be a weighted sum of the two sets of network results, and the weights can be obtained by fitting on the training set.
Through the above scheme, the embodiment of the present invention provides an apparatus for human behavior recognition in video, better improving the real-time performance and accuracy of video recognition.
In an embodiment, to better improve the real-time performance and accuracy of video recognition, the filtering module 200 is further configured to acquire and analyze the human body region and output the predicted value corresponding to it; if the predicted value is a non-human category, to filter that region out of the acquired human body regions; and if the predicted value is the human body category, to compute the behavior category score of the target in that human body region.
Specifically, after the human body regions in the video to be recognized have been detected and the human body trajectory information in those regions acquired, the filtering module 200 acquires and analyzes each human body region and outputs the corresponding predicted value.
In a specific implementation, after a human body region in the current frame has been acquired, the video monitoring device feeds the image of that region into the background-filtering M1 network model for analysis; the structure of the M1 network model is shown in Figure 2. The M1 network model is a deep convolutional network model with single-frame image input: the network input is the detected foreground region image, followed by several convolutional layers (CONV) with attached ReLU and pooling layers, and then several fully connected layers (FC) for deep feature computation. The output layer, the last layer of the network, is 2-dimensional; after a sigmoid transform its two values correspond to the behavior category scores of the human body category and the non-human category.
If the predicted value is a non-human category, the filtering module 200 filters that human body region out of the acquired human body regions. In a specific implementation, the classification by the M1 network model lets the video monitoring device filter out regions that the earlier detection and tracking algorithms mistakenly labeled as the human body category. Since the network at this stage computes only on the foreground images produced by the detection step (rather than on the whole image), it incurs no significant computational overhead; it improves detection accuracy while meeting the real-time requirements of the whole system. Meanwhile, the numbers of convolutional and fully connected layers in the M1 network model can be adjusted according to factors such as the size of the monitored frame and the hardware performance of the deployed device.
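A rough sketch of such an M1 filter (the layer counts, sizes, and 64x64 crop resolution are illustrative assumptions) and of how its two-way scores would be used to drop non-human regions:

```python
import torch
import torch.nn as nn

class M1Filter(nn.Module):
    """Single-frame background-filtering net: conv+ReLU+pooling blocks,
    fully connected layers, and a 2-way output (human / non-human).
    Layer counts and sizes are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, 2),                 # human vs. non-human
        )

    def forward(self, crop):                   # crop: (N, 3, 64, 64)
        h = self.features(crop).flatten(1)
        return torch.sigmoid(self.classifier(h))

m1 = M1Filter()
regions = torch.rand(4, 3, 64, 64)             # detected foreground crops
scores = m1(regions)                           # (4, 2) category scores
kept = regions[scores[:, 0] > scores[:, 1]]    # keep human-class regions only
```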
If the predicted value is the human body category, the filtering module 200 computes the behavior category score of the target in that human body region.
Through the above scheme, the embodiment of the present invention provides an apparatus for human behavior recognition in video, better improving the real-time performance and accuracy of video recognition.
In an embodiment, to better improve the real-time performance and accuracy of video recognition, the detection module 100 is further configured to acquire the video to be recognized and detect the human body regions in it, and to track the human bodies in those regions to obtain the human body trajectory information.
Specifically, the detection module 100 acquires the video to be recognized and detects the human body regions in it.
In a specific implementation, the video monitoring device can obtain the original video to be recognized through a front-end video capture device, and detect the human body regions in the video using a detector based on traditional feature classification.
After acquiring the video to be recognized and detecting the human body regions in it, the detection module 100 tracks the pedestrians in those regions to obtain the human body trajectory information.
In a specific implementation, the video monitoring device can track the pedestrians in the frame using a tracking algorithm based on detection region matching, thereby obtaining the motion trajectory information of the human bodies in the frame.
The results of human body detection and tracking can be saved in the form of a target ID together with a sequence of detection region images, namely:
$O(i,t) = \big(I_t^{(i)},\, R_t^{(i)}\big)$;
where $O(i,t)$ denotes the information of target i at time t, $I_t^{(i)}$ is the image content detected for that target at time t, and $R_t^{(i)}$ is the position of the region the target occupies at time t; $R_t^{(i)}$ records, as a vector (x, y, w, h), the abscissa and ordinate of the region's upper-left corner together with its width and height.
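A minimal sketch of this record format (the field and class names are illustrative):

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class TrackRecord:
    """O(i, t): the appearance and location of target i at time t."""
    target_id: int          # i
    t: int                  # frame time
    image: Any              # I_t^(i): image content of the detected crop
    region: tuple           # R_t^(i): (x, y, w, h) of the region

track: List[TrackRecord] = [
    TrackRecord(7, 0, None, (120, 80, 40, 90)),
    TrackRecord(7, 1, None, (124, 81, 40, 90)),
]
```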
Through the above scheme, the embodiment of the present invention provides an apparatus for human behavior recognition in video, better improving the real-time performance and accuracy of video recognition.
It should be noted that, in practical applications, the detection module 100, the filtering module 200, the calculation module 300, and the output module 400 may be implemented by a central processing unit (CPU), a microcontroller unit (MCU), a digital signal processor (DSP), or a field-programmable gate array (FPGA) in the apparatus for human behavior recognition in video.
Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Based on this, an embodiment of the present invention further provides a computer storage medium including a set of instructions that, when executed, cause at least one processor to perform the above method for human behavior recognition in video.
The above are merely preferred embodiments of the present invention and do not limit the patent scope of the invention; any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.
Industrial Applicability
The solution provided by the embodiments of the present invention detects the human body regions in the video to be recognized and acquires the human body trajectory information in those regions; computes the predicted value corresponding to each human body region and filters out regions whose predicted value is a non-human category, leaving the regions predicted as the human body category; computes, for those regions, the behavior category score of the target in each; and outputs the corresponding behavior category according to the behavior category score, thereby improving the real-time performance and accuracy of video recognition.

Claims (13)

  1. A method for human behavior recognition in video, the method comprising:
    detecting a human body region in a video to be recognized, and acquiring human body trajectory information in the human body region;
    calculating a predicted value corresponding to the human body region according to the human body region, and filtering out human body regions whose predicted value is a non-human category, to obtain human body regions whose predicted value is a human body category;
    calculating, for a human body region whose predicted value is the human body category, a behavior category score of a target in the human body region; and
    outputting a corresponding behavior category according to the behavior category score.
  2. The method according to claim 1, wherein outputting the corresponding behavior category according to the behavior category score comprises:
    outputting the behavior category if the behavior category score is higher than a threshold of a preset behavior category; and
    if the behavior category score is not higher than the threshold of the preset behavior category, computing and outputting the corresponding behavior category in combination with the human body trajectory information.
  3. The method according to claim 2, wherein calculating, for the human body region whose predicted value is the human body category, the behavior category score of the target in the human body region comprises:
    acquiring a background image of the human body region whose predicted value is the human body category, and obtaining description information of the background image;
    calculating, according to the description information of the background image, background region information corresponding to the background image, and calculating neighboring target information corresponding to the background image; and
    calculating the behavior category score of the target in the human body region by combining the background region information and the neighboring target information corresponding to the background image.
  4. The method according to claim 2, wherein computing and outputting the corresponding behavior category in combination with the human body trajectory information comprises:
    acquiring a current-time image of the video to be recognized and tracking region images corresponding to the human body trajectory information;
    sequentially superimposing the current-time image and the tracking region images; and
    performing a weighted summation of the behavior category score and the result of the sequential superimposition, and outputting the corresponding behavior category.
  5. The method according to claim 1, wherein calculating the predicted value corresponding to the human body region according to the human body region, and filtering out human body regions whose predicted value is a non-human category, comprises:
    acquiring and analyzing the human body region, and outputting the predicted value corresponding to the human body region;
    if the predicted value is a non-human category, filtering the human body region whose predicted value is the non-human category out of the acquired human body regions; and
    if the predicted value is the human body category, performing the step of calculating the behavior category score of the target in the human body region whose predicted value is the human body category.
  6. The method according to claim 1, wherein detecting the human body region in the video to be recognized and acquiring the human body trajectory information in the human body region comprises:
    acquiring the video to be recognized, and detecting the human body region in the video to be recognized; and
    tracking pedestrians in the human body region to obtain the human body trajectory information in the human body region.
  7. An apparatus for human behavior recognition in video, the apparatus comprising:
    a detection module configured to detect a human body region in a video to be recognized and acquire human body trajectory information in the human body region;
    a filtering module configured to calculate a predicted value corresponding to the human body region according to the human body region, and to filter out human body regions whose predicted value is a non-human category, obtaining human body regions whose predicted value is a human body category;
    a calculation module configured to calculate, for a human body region whose predicted value is the human body category, a behavior category score of a target in the human body region; and
    an output module configured to output a corresponding behavior category according to the behavior category score.
  8. The apparatus according to claim 7, wherein
    the output module is configured to output the behavior category if the behavior category score is higher than a threshold of a preset behavior category, and, if the behavior category score is not higher than the threshold of the preset behavior category, to compute and output the corresponding behavior category in combination with the human body trajectory information.
  9. The apparatus according to claim 8, wherein
    the calculation module is configured to acquire a background image of the human body region whose predicted value is the human body category and obtain description information of the background image; to calculate, according to the description information, background region information corresponding to the background image and neighboring target information corresponding to the background image; and to calculate the behavior category score of the target in the human body region by combining the background region information and the neighboring target information.
  10. The apparatus according to claim 7, wherein
    the output module is configured to acquire a current-time image of the video to be recognized and tracking region images corresponding to the human body trajectory information; to sequentially superimpose the current-time image and the tracking region images; and to perform a weighted summation of the behavior category score and the result of the sequential superimposition, outputting the corresponding behavior category.
  11. The apparatus according to claim 7, wherein
    the filtering module is configured to acquire and analyze the human body region and output the predicted value corresponding to the human body region; if the predicted value is a non-human category, to filter the human body region out of the acquired human body regions; and if the predicted value is the human body category, to calculate the behavior category score of the target in the human body region.
  12. The apparatus according to claim 7, wherein
    the detection module is configured to acquire the video to be recognized, detect the human body region in the video to be recognized, and track pedestrians in the human body region to obtain the human body trajectory information in the human body region.
  13. A computer storage medium comprising a set of instructions that, when executed, cause at least one processor to perform the method for human behavior recognition in video according to any one of claims 1 to 6.
PCT/CN2017/071574 2016-01-29 2017-01-18 Human behaviour recognition method and apparatus in video, and computer storage medium WO2017129020A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610067817.X 2016-01-29
CN201610067817.XA CN107025420A (en) 2016-01-29 2016-01-29 The method and apparatus of Human bodys' response in video

Publications (1)

Publication Number Publication Date
WO2017129020A1 true WO2017129020A1 (en) 2017-08-03

Family

ID=59397442

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/071574 WO2017129020A1 (en) 2016-01-29 2017-01-18 Human behaviour recognition method and apparatus in video, and computer storage medium

Country Status (2)

Country Link
CN (1) CN107025420A (en)
WO (1) WO2017129020A1 (en)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107808139B (en) * 2017-11-01 2021-08-06 电子科技大学 Real-time monitoring threat analysis method and system based on deep learning
CN108229407A (en) * 2018-01-11 2018-06-29 武汉米人科技有限公司 A kind of behavioral value method and system in video analysis
CN110321761B (en) * 2018-03-29 2022-02-11 中国科学院深圳先进技术研究院 Behavior identification method, terminal equipment and computer readable storage medium
CN109508698B (en) * 2018-12-19 2023-01-10 中山大学 Human behavior recognition method based on binary tree
CN111325292B (en) * 2020-03-11 2023-05-02 中国电子工程设计院有限公司 Object behavior recognition method and device


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859234A (en) * 2017-11-29 2019-06-07 Shenzhen TCL New Technology Co., Ltd. Video human trajectory tracking method, device and storage medium
CN112149454A (en) * 2019-06-26 2020-12-29 Hangzhou Hikvision Digital Technology Co., Ltd. Behavior recognition method, device and equipment
CN110414421A (en) * 2019-07-25 2019-11-05 University of Electronic Science and Technology of China Behavior recognition method based on consecutive frame images
CN110414421B (en) * 2019-07-25 2023-04-07 University of Electronic Science and Technology of China Behavior recognition method based on consecutive frame images
CN111061945A (en) * 2019-11-11 2020-04-24 Hanhai Information Technology (Shanghai) Co., Ltd. Recommendation method and device, electronic device and storage medium
CN111061945B (en) * 2019-11-11 2023-06-27 Hanhai Information Technology (Shanghai) Co., Ltd. Recommendation method and device, electronic device and storage medium
CN110826702A (en) * 2019-11-18 2020-02-21 Fang Yuming Abnormal event detection method based on a multi-task deep network
CN111242007A (en) * 2020-01-10 2020-06-05 Shanghai Chongming District Ecological Agriculture Science and Technology Innovation Center Farming behavior supervision method
CN112016461A (en) * 2020-08-28 2020-12-01 Shenzhen Xinyi Technology Co., Ltd. Multi-target behavior recognition method and system
CN112232142A (en) * 2020-09-27 2021-01-15 Zhejiang Dahua Technology Co., Ltd. Seat belt identification method and device, and computer-readable storage medium
CN112818881A (en) * 2021-02-07 2021-05-18 State Grid Fujian Electric Power Co., Ltd. Marketing Service Center Human behavior recognition method
CN112818881B (en) * 2021-02-07 2023-12-22 State Grid Fujian Electric Power Co., Ltd. Marketing Service Center Human behavior recognition method

Also Published As

Publication number Publication date
CN107025420A (en) 2017-08-08

Similar Documents

Publication Publication Date Title
WO2017129020A1 (en) Human behaviour recognition method and apparatus in video, and computer storage medium
Seemanthini et al. Human detection and tracking using HOG for action recognition
Wen et al. Detection, tracking, and counting meets drones in crowds: A benchmark
CN107967451B Method for crowd counting in still images
CN103824070B Rapid pedestrian detection method based on computer vision
US9569531B2 (en) System and method for multi-agent event detection and recognition
WO2017150032A1 (en) Method and system for detecting actions of object in scene
Avgerinakis et al. Recognition of activities of daily living for smart home environments
Cheng et al. A self-constructing cascade classifier with AdaBoost and SVM for pedestrian detection
CN104992453A (en) Target tracking method under complicated background based on extreme learning machine
CN107833239B (en) Optimization matching target tracking method based on weighting model constraint
Bour et al. Crowd behavior analysis from fixed and moving cameras
Yang et al. Single shot multibox detector with kalman filter for online pedestrian detection in video
David An intellectual individual performance abnormality discovery system in civic surroundings
Singh et al. A deep learning based technique for anomaly detection in surveillance videos
Fradi et al. Spatio-temporal crowd density model in a human detection and tracking framework
Yi et al. Human action recognition based on action relevance weighted encoding
Zaidi et al. Video anomaly detection and classification for human activity recognition
Hu et al. AVMSN: An audio-visual two stream crowd counting framework under low-quality conditions
JP2021089717A (en) Method of subject re-identification
Garcia-Cobo et al. Human skeletons and change detection for efficient violence detection in surveillance videos
Zhang et al. Visual Object Tracking via Cascaded RPN Fusion and Coordinate Attention.
Hashmi et al. GAIT analysis: 3D pose estimation and prediction in defence applications using pattern recognition
Rashidan et al. Detection of different classes moving object in public surveillance using artificial neural network (ANN)
Puchała et al. Feature engineering techniques for skeleton-based two-person interaction classification in video

Legal Events

Date Code Title Description

121 EP: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 17743635
    Country of ref document: EP
    Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

122 EP: PCT application non-entry in European phase
    Ref document number: 17743635
    Country of ref document: EP
    Kind code of ref document: A1