CN112183153A - Object behavior detection method and device based on video analysis - Google Patents

Object behavior detection method and device based on video analysis

Info

Publication number
CN112183153A
CN112183153A
Authority
CN
China
Prior art keywords
target object
video
video frame
key point
detecting
Prior art date
Legal status
Pending
Application number
CN201910585625.1A
Other languages
Chinese (zh)
Inventor
汤人杰
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Zhejiang Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201910585625.1A priority Critical patent/CN112183153A/en
Publication of CN112183153A publication Critical patent/CN112183153A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a device for detecting object behaviors based on video analysis, wherein the method comprises the following steps: converting an original video into a video image sequence containing a plurality of video frames, and detecting a target object contained in each video frame in the video image sequence; respectively determining the position information of the target object in each video frame, and determining the motion track of the target object according to the position information; detecting skeleton key point information of a target object contained in each video frame in a video image sequence according to the position information of the target object in each video frame; determining the action type of the target object according to a pre-trained bone recognition model and the bone key point information of the target object contained in each video frame obtained by detection; and detecting whether the behavior of the target object is abnormal or not according to the motion track of the target object and the action type of the target object. Therefore, the method and the device can comprehensively utilize the motion trail and the motion category of the target object to realize accurate prediction of abnormal behaviors.

Description

Object behavior detection method and device based on video analysis
Technical Field
The invention relates to the technical field of computer vision, in particular to a method and a device for detecting object behaviors based on video analysis.
Background
With the rapid development of the economy and of science and technology, surveillance systems have sprung up everywhere, and electronic eyes are deployed on streets large and small in cities across the country. People's awareness of security keeps rising, and safety monitoring through surveillance systems has become a ubiquitous and commonplace practice, which has greatly improved public security. However, current surveillance systems still rely on manual judgment or supervision of abnormal behaviors; with the continuous development of computer vision technology, highly automated and intelligent monitoring will inevitably play an important role in future surveillance systems.
Pedestrian abnormal behavior analysis starts with pedestrian detection, and existing pedestrian detection methods are relatively mature. For example, a background modeling method extracts the foreground moving target, performs feature extraction in the target region, and then uses a classifier to decide whether a pedestrian is present; statistical-learning-based methods train a pedestrian detection classifier on a large amount of data, using grayscale, color, texture, HOG (histogram of oriented gradients) features and the like as the main object features. The classifiers include neural networks, SVMs (support vector machines), deep learning models and the like.
However, the following problems still exist in the prior art:
(1) The inspection workload is large: detection currently still relies mainly on manpower, which consumes a large amount of human and material resources.
(2) The measurement standard is single and the detection accuracy is not high: a detection system using a single model usually yields only a single detection index; detection with a single index has certain efficiency advantages, but its accuracy is reduced.
Disclosure of Invention
In view of the above, the present invention is proposed to provide an object behavior detection method and apparatus based on video analysis that overcomes or at least partially solves the above problems.
According to an aspect of the present invention, there is provided a method for detecting object behaviors based on video analysis, including:
converting an original video into a video image sequence containing a plurality of video frames, and detecting a target object contained in each video frame in the video image sequence;
respectively determining the position information of the target object in each video frame, and determining the motion track of the target object according to the position information;
detecting skeleton key point information of a target object contained in each video frame in a video image sequence according to the position information of the target object in each video frame;
determining the action category of a target object according to a pre-trained bone recognition model and the bone key point information of the target object contained in each video frame obtained through detection;
and detecting whether the behavior of the target object is abnormal or not according to the motion track of the target object and the action type of the target object.
Optionally, detecting the target object contained in each video frame in the sequence of video images comprises:
detecting a candidate object contained in each video frame in the video image sequence;
and screening target objects contained in the video image sequence according to the candidate objects contained in each video frame.
Optionally, detecting, according to the position information of the target object in each video frame, bone keypoint information of the target object contained in each video frame in the video image sequence includes:
detecting bone key point information of candidate objects contained in each video frame in a video image sequence;
and screening the bone key point information of the target object from the bone key point information of the candidate object according to the position information of the target object in each video frame.
Optionally, detecting the bone key point information of the candidate object contained in each video frame in the video image sequence includes: detecting the bone key point information of the candidate object contained in each video frame in the video image sequence by using an OpenPose algorithm.
Optionally, determining the position information of the target object in each video frame includes:
and determining the position information of the target object in each video frame according to the tracking frame of the target object.
Optionally, detecting the target object contained in each video frame in the sequence of video images comprises:
the YOLOV3 algorithm is used to detect a target object contained in each video frame in a sequence of video images.
Optionally, determining the motion trajectory of the target object according to the position information includes:
estimating a predicted tracking result of the target object by using a Kalman filtering algorithm according to the position information of the target object in the earlier of two adjacent video frames;
and judging whether the predicted tracking result of the target object is matched with the actual detection result of the target object or not from the two aspects of the motion matching degree and the apparent matching degree.
according to an aspect of the present invention, there is provided an object behavior detection apparatus based on video analysis, including:
the target object detection module is suitable for converting an original video into a video image sequence containing a plurality of video frames and detecting a target object contained in each video frame in the video image sequence;
the motion track determining module is suitable for respectively determining the position information of the target object in each video frame and determining the motion track of the target object according to the position information;
the skeleton key point information detection module is suitable for detecting the skeleton key point information of the target object contained in each video frame in the video image sequence according to the position information of the target object in each video frame;
the action category determining module is suitable for determining the action category of the target object according to a bone recognition model trained in advance and the bone key point information of the target object contained in each video frame obtained through detection;
and the abnormal behavior judging module is suitable for detecting whether the behavior of the target object is abnormal or not according to the motion track of the target object and the action type of the target object.
Optionally, the target object detection module is adapted to:
detecting a candidate object contained in each video frame in the video image sequence;
and screening target objects contained in the video image sequence according to the candidate objects contained in each video frame.
Optionally, the bone key point information detection module is adapted to:
detecting bone key point information of candidate objects contained in each video frame in a video image sequence;
and screening the bone key point information of the target object from the bone key point information of the candidate object according to the position information of the target object in each video frame.
Optionally, the bone key point information detection module is adapted to:
and detecting the bone key point information of the candidate object contained in each video frame in the video image sequence by utilizing an OpenPose algorithm.
Optionally, the motion trajectory determination module is adapted to:
and determining the position information of the target object in each video frame according to the tracking frame of the target object.
Optionally, the target object detection module is adapted to:
the YOLOV3 algorithm is used to detect a target object contained in each video frame in a sequence of video images.
Optionally, the motion trajectory determination module is adapted to:
and determining the motion track of the target object according to the position information by utilizing a Kalman filtering algorithm and the motion matching degree index and the apparent matching degree index.
Optionally, the motion trajectory determination module is adapted to:
estimating a predicted tracking result of the target object by using a Kalman filtering algorithm according to the position information of the target object in the earlier of two adjacent video frames;
and judging whether the predicted tracking result of the target object is matched with the actual detection result of the target object or not from the two aspects of the motion matching degree and the apparent matching degree.
According to still another aspect of the present invention, there is provided an electronic device, including: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the object behavior detection method based on the video analysis.
According to yet another aspect of the present invention, a computer storage medium is provided, in which at least one executable instruction is stored, and the executable instruction causes a processor to perform operations corresponding to the object behavior detection method based on video analysis described above.
In summary, the invention discloses a method and a device for detecting object behaviors based on video analysis. First, an original video is converted into a video image sequence including a plurality of video frames, and a target object included in each video frame in the video image sequence is detected. And then, respectively determining the position information of the target object in each video frame, and determining the motion track of the target object according to the position information. Then, the bone key point information of the target object contained in each video frame in the video image sequence is detected according to the position information of the target object in each video frame. And then, determining the action type of the target object according to a pre-trained bone recognition model and the detected bone key point information of the target object contained in each video frame. And finally, detecting whether the behavior of the target object is abnormal or not according to the motion track of the target object and the action type of the target object. Therefore, the method and the device can comprehensively utilize the motion trail and the motion category of the target object to realize accurate prediction of abnormal behaviors.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a flowchart illustrating an object behavior detection method based on video analysis according to a first embodiment;
fig. 2 is a flowchart illustrating an object behavior detection method based on video analysis according to a second embodiment;
fig. 3 is a block diagram showing an object behavior detection apparatus based on video analysis according to a third embodiment;
FIG. 4 is a schematic diagram of an electronic device according to an embodiment of the invention;
FIG. 5 shows the tracking result of the tracking box of the target object;
FIG. 6 illustrates another tracking result of the tracking box of the target object;
FIG. 7 shows skeletal keypoint information extraction results;
FIG. 8 shows another skeletal keypoint information extraction result;
fig. 9 shows a stop motion category recognition result.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Example one
Fig. 1 shows a flowchart of an object behavior detection method based on video analysis according to an embodiment. As shown in fig. 1, the method comprises the steps of:
step S110: an original video is converted into a video image sequence including a plurality of video frames, and a target object included in each video frame in the video image sequence is detected.
The video frame refers to a single video image extracted from the original video, and the video image sequence refers to a set of a plurality of video frames arranged in sequence. A single video image may be extracted from the original video either frame by frame according to the video frame number, or at intervals of a preset number of video frames.
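As a minimal illustration of this frame-extraction step (assuming OpenCV is available; the function and parameter names are illustrative, not part of the patent):

```python
import cv2

def video_to_frames(video_path, stride=1):
    """Convert an original video into a video image sequence (a list of frames).

    stride=1 extracts frames one by one; stride=N extracts one frame out of
    every N, i.e. extraction at intervals of a preset number of frames.
    """
    capture = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:                     # end of the original video
            break
        if index % stride == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames
```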
Specifically, target objects contained in video frames are detected, wherein the number of the video frames may be multiple, and the number of the target objects contained in a single video frame may be multiple.
Step S120: and respectively determining the position information of the target object in each video frame, and determining the motion track of the target object according to the position information.
And determining the position information of the target object in each video frame according to the tracking frame of the target object. In specific implementation, the position information of the target object in the video frame is determined according to the abscissa x and the ordinate y of the central position of the tracking frame of the target object. It should be noted that, in this embodiment, the determination method of the position information of the target object is not particularly limited, and a person skilled in the art may determine the position information of the target object in other manners.
Specifically, according to the position information of the target object in each video frame, the motion trajectory of the target object in the corresponding time period of the video image sequence can be determined.
Step S130: and detecting the bone key point information of the target object contained in each video frame in the video image sequence according to the position information of the target object in each video frame.
Specifically, firstly, obtaining skeleton key point information of a target object in a video frame; then, the bone key point information corresponding to the target object is determined according to the position information of the target object in the video frame. For example, firstly, bone key point information of a target object a, a target object b and a target object c is acquired; then, the bone key point information corresponding to the target object a is determined according to the position information of the target object a. The skeleton key point information comprises position coordinates of each key point, and the position information comprises center position coordinates of a tracking frame of the target object. And when the distance between the position coordinate of each key point in the skeleton key point information and the center position coordinate of the tracking frame in the position information is within a preset range, judging that the skeleton key point information is successfully matched with the position information of the target object, namely determining the skeleton key point information as the skeleton key point information of the target object.
Step S140: and determining the action type of the target object according to a pre-trained bone recognition model and the detected bone key point information of the target object contained in each video frame.
Specifically, the skeleton key point information of the target object in each video frame is input into a skeleton recognition model, and the skeleton recognition model outputs the specific action category of the target object in each video frame.
Step S150: and detecting whether the behavior of the target object is abnormal or not according to the motion track of the target object and the action type of the target object.
Specifically, the motion speed of the target object is determined according to the motion track of the target object, and whether the motion speed of the target object exceeds a preset motion speed threshold value is judged; and/or judging whether the action type of the target object is an abnormal action type.
In summary, the method detects the motion trajectory and the motion category of the target object in each video frame, and comprehensively utilizes the motion trajectory and the motion category of the target object to realize accurate prediction of abnormal behaviors.
Example two
Fig. 2 shows a flowchart of an object behavior detection method based on video analysis according to a second embodiment. As shown in fig. 2, the method comprises the steps of:
step S200: the bone recognition model is trained in advance.
The bone recognition model is a model for recognizing a specific motion category by using bone key point information. The input to the bone recognition model is the bone key point information and the output of the bone recognition model is the specific action category.
Specifically, a human body structure data set is prepared, an ST-GCN model is adopted as a specific bone recognition model, and the human body structure data set is used as a training data set of the ST-GCN model. The input of the ST-GCN model is a human body structure data set, and the output is an action category label. The human body structure data set refers to skeleton key point coordinate information of specific actions in a video frame, and the action category labels specifically include: stop, walk, run, squat, stand, dance. The embodiment does not limit the specific meaning of the action category label, and those skilled in the art may define the specific meaning of the action category label by other methods. As shown in fig. 9, fig. 9 shows the stop motion category recognition result.
In a specific implementation, the method for training the ST-GCN model with the human body structure data set is as follows. First, a spatial-temporal graph is built on the human body structure data set, and multi-layer spatial-temporal graph convolution operations are performed on the graph to generate higher-level feature maps. Then, the human body structure data set is input into the network, and the data proportion of each node of the ST-GCN model is kept consistent. The ST-GCN model consists of nine spatial-temporal convolution layers: 64 channels in the first three layers, 128 channels in the middle three layers and 256 channels in the last three layers. Each of the nine layers has a temporal convolution kernel; residual connections are used, and Dropout is applied for feature regularization. Pooling is set at the fourth and seventh temporal convolution layers, the final 256 output channels are pooled globally, and classification is then performed with a Softmax classifier. Optimization is performed with stochastic gradient descent, with the learning rate set to 0.1, 100 epochs in total, and the learning rate reduced by 0.01 every 10 epochs.
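The configuration above maps naturally onto a small PyTorch sketch. This is a hedged, heavily simplified stand-in: the block structure, adjacency matrix, 18-key-point assumption and dummy data are illustrative, not the ST-GCN implementation referenced by the patent.

```python
import torch
from torch import nn

class STBlock(nn.Module):
    """One simplified spatial-temporal block: a 1x1 'spatial' convolution mixed
    through a fixed skeleton adjacency matrix, a temporal convolution over the
    frame axis, a residual connection, and Dropout for feature regularization."""
    def __init__(self, in_ch, out_ch, adjacency, t_stride=1):
        super().__init__()
        self.register_buffer("A", adjacency)                        # (V, V) joint graph
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(9, 1),
                                  stride=(t_stride, 1), padding=(4, 0))
        self.residual = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=(t_stride, 1))
        self.drop, self.relu = nn.Dropout(0.5), nn.ReLU()

    def forward(self, x):                                            # x: (N, C, T, V)
        res = self.residual(x)
        x = torch.einsum("nctv,vw->nctw", self.spatial(x), self.A)   # graph mixing
        return self.relu(self.drop(self.temporal(x)) + res)

class SimpleSTGCN(nn.Module):
    """Nine blocks: 64 channels in the first three, 128 in the middle three,
    256 in the last three; temporal down-sampling at the 4th and 7th blocks;
    global pooling followed by a linear classifier (Softmax via the loss)."""
    def __init__(self, adjacency, in_ch=3, num_classes=6):
        super().__init__()
        channels = [64, 64, 64, 128, 128, 128, 256, 256, 256]
        blocks, prev = [], in_ch
        for i, ch in enumerate(channels):
            blocks.append(STBlock(prev, ch, adjacency, t_stride=2 if i in (3, 6) else 1))
            prev = ch
        self.blocks, self.head = nn.Sequential(*blocks), nn.Linear(256, num_classes)

    def forward(self, x):                                            # x: (N, C, T, V)
        x = self.blocks(x).mean(dim=(2, 3))                          # global pooling
        return self.head(x)

# Illustrative training setup: SGD with learning rate 0.1, reduced by 0.01 every
# 10 epochs, cross-entropy (Softmax) loss; dummy tensors stand in for the dataset.
adjacency = torch.eye(18)                    # placeholder graph over 18 key points
model = SimpleSTGCN(adjacency)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
skeletons = torch.randn(8, 3, 64, 18)        # (batch, channels, frames, key points)
labels = torch.randint(0, 6, (8,))           # stop/walk/run/squat/stand/dance
for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(skeletons), labels)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 10 == 0:
        for group in optimizer.param_groups:
            group["lr"] -= 0.01              # learning-rate reduction every 10 epochs
```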
Step S210: an original video is converted into a video image sequence including a plurality of video frames, and a target object included in each video frame in the video image sequence is detected.
The video frame refers to a single video image extracted from the original video, and the video image sequence refers to a set of a plurality of video frames arranged in sequence. A single video image may be extracted from the original video either frame by frame according to the video frame number, or at intervals of a preset number of video frames.
Specifically, the YOLOV3 algorithm is used to detect the target object contained in each video frame in the video image sequence. When a target object contained in a video frame is detected, the detected target object is identified with a tracking frame. The tracking frame is rectangular: the parameters describing its size are the height h of the tracking frame and the aspect ratio R_a of the tracking frame; the parameters describing its position in the video frame are the abscissa x and the ordinate y of the center of the tracking frame; and the parameter describing its movement speed in the video image sequence is the speed v. It should be noted that the size of the tracking frame is determined by the size of the target object in the video frame and changes as the size of the target object in the video frame changes. As shown in fig. 5 and 6, fig. 5 shows one tracking result of the tracking frame of the target object, and fig. 6 shows another tracking result of the tracking frame of the target object.
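As an illustration of how a detection box maps to the tracking-frame parameters listed above (center abscissa x, ordinate y, aspect ratio R_a and height h), assuming some YOLOv3 detector returns person boxes as (left, top, width, height); the detector call in the comment is a hypothetical placeholder, not a specific library API:

```python
def box_to_track_state(left, top, width, height):
    """Convert a detector bounding box into the tracking-frame description:
    center coordinates (x, y), aspect ratio R_a and height h."""
    x = left + width / 2.0
    y = top + height / 2.0
    aspect_ratio = width / float(height)
    return x, y, aspect_ratio, height

# Hypothetical usage, where yolo_v3_detect stands for any YOLOv3 inference call:
# for frame in frames:
#     for (left, top, w, h) in yolo_v3_detect(frame):
#         state = box_to_track_state(left, top, w, h)
```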
Step S220: and respectively determining the position information of the target object in each video frame, and determining the motion track of the target object according to the position information.
And determining the position information of the target object in each video frame according to the tracking frame of the target object. In specific implementation, the position information of the target object in the video frame is determined according to the abscissa x and the ordinate y of the central position of the tracking frame of the target object.
Specifically, a Kalman filtering algorithm, a motion matching degree index and an apparent matching degree index are utilized, and the motion track of the target object is determined according to the position information of the target object in each video frame.
In a specific implementation, first, the predicted tracking result of the target object is estimated by using a Kalman filtering algorithm according to the position information of the target object in the earlier of two adjacent video frames. Estimating the predicted tracking result of the target object by using the Kalman filtering algorithm specifically includes: obtaining the abscissa x and the ordinate y of the center position of the tracking frame of the target object i in the earlier of the two adjacent video frames, together with the height h, the aspect ratio R_a and the movement speed v of that tracking frame; and, based on these 5 parameters of the tracking frame of the target object i in the earlier video frame, using the Kalman filtering algorithm to estimate the predicted tracking result (x_i, y_i, h_i, R_ai, v_i) of the tracking frame of the target object i in the later of the two adjacent video frames.
Then, it is determined whether the predicted tracking result of the target object matches the actual detection result of the target object, in terms of both the motion matching degree and the apparent matching degree. Before this judgment, the actual detection result (x_j, y_j, h_j, R_aj, v_j) of the tracking frame of the target object j in the later of the two adjacent video frames is acquired, where there may be multiple target objects j in the later video frame. Judging whether the predicted tracking result of the target object matches the actual detection result specifically includes the following steps. The first step: evaluate the motion matching degree between the actual detection result of the tracking frame of the target object j in the later video frame and the predicted tracking result of the tracking frame of the target object i in the earlier video frame, with the following calculation formula:

d^(1)(i, j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i)

where d_j is the actual detection result (x_j, y_j, h_j, R_aj, v_j) of the tracking frame of the target object j in the later video frame, y_i is the predicted tracking result (x_i, y_i, h_i, R_ai, v_i) of the tracking frame of the target object i in the earlier video frame, S_i is the covariance matrix predicted by the Kalman filtering algorithm, T denotes transposition, and d^(1)(i, j) is the Mahalanobis distance between the actual detection result and the predicted tracking result. The Mahalanobis distance is used to evaluate the motion matching degree between the actual detection result and the predicted tracking result.

A threshold function is defined for the motion matching degree as follows:

b_ij^(1) = T[d^(1)(i, j) ≤ t^(1)]

where t^(1) is a threshold; in particular, the 0.95 quantile of the chi-squared distribution may be used as the threshold t^(1). b_ij^(1) denotes the motion matching result between the actual detection result and the predicted tracking result, and takes the value 1 or 0: a value of 1 indicates that the motion matching between the actual detection result and the predicted tracking result succeeds, and a value of 0 indicates that it fails.
The second step: if the motion matching between the actual detection result and the predicted tracking result succeeds, further evaluate the apparent matching degree between the actual detection result and the predicted tracking result; if the motion matching fails, the apparent matching degree is not evaluated.
Specifically, evaluating the apparent matching degree between the actual detection result and the predicted tracking result includes: computing the appearance descriptor r_j of the tracking frame of the target object j in the later video frame, and computing the appearance descriptors r_k^(i) of the tracking frame of the target object i in 100 video frames, where k denotes the video frame number (for example, k may be 101, ...). The descriptors r_k^(i) from the 100 video frames are stored in the set R_i.

The apparent matching degree between the actual detection result and the predicted tracking result is calculated as follows:

d^(2)(i, j) = min{ 1 - r_j^T r_k^(i) | r_k^(i) ∈ R_i }

where T denotes transposition, d^(2)(i, j) is the minimum cosine distance between the feature vector of the actual detection result and the feature vectors of the predicted tracking results, r_j is the appearance descriptor of the tracking frame of the target object j, r_k^(i) is the descriptor of the tracking frame of the target object i in the video frame numbered k, and R_i is the set of r_k^(i). The minimum cosine distance between the feature vectors of the actual detection result and the predicted tracking result is used to evaluate the apparent matching degree.
Similarly, a threshold function is defined for the apparent matching degree as follows:

b_ij^(2) = T[d^(2)(i, j) ≤ t^(2)]

where t^(2) is a threshold. b_ij^(2) takes the value 1 or 0: a value of 1 indicates that the apparent matching between the actual detection result and the predicted tracking result succeeds, and a value of 0 indicates that it fails.
The third step: compute a weighted combination of the motion matching degree and the apparent matching degree, with the following formula:

c_i,j = λ d^(1)(i, j) + (1 - λ) d^(2)(i, j)

b_ij^(1) directly reflects whether the motion matching between the actual detection result and the predicted tracking result succeeds, and b_ij^(2) directly reflects whether the apparent matching between the actual detection result and the predicted tracking result succeeds.

c_i,j is used to quantify the motion matching degree and the apparent matching degree together: when c_i,j falls within a preset range, the tracking frame of the target object j in the later video frame is successfully matched with the tracking frame of the target object i in the earlier video frame. By repeating the above process, the tracking frame of the target object in each video frame is determined; the position information of the target object in each video frame is then determined from the position information of its tracking frame, and the motion track of the target object is thereby obtained.
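The matching computation described above can be summarized in a short sketch: the Mahalanobis distance d1 for motion matching, the minimum cosine distance d2 over the stored appearance descriptors for apparent matching, and the weighted combination c_ij. The thresholds, the weight λ and the gating order shown here are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def motion_distance(detection, prediction, covariance):
    """d1(i, j): squared Mahalanobis distance between the actual detection d_j
    and the Kalman-predicted track state y_i, using the predicted covariance."""
    diff = detection - prediction
    return float(diff.T @ np.linalg.inv(covariance) @ diff)

def appearance_distance(det_descriptor, track_descriptors):
    """d2(i, j): minimum cosine distance min(1 - r_j^T r_k) over the descriptors
    stored for track i (descriptors assumed to be L2-normalised)."""
    return float(min(1.0 - det_descriptor @ r for r in track_descriptors))

def is_match(d1, d2, t1=9.49, t2=0.3, lam=0.5, c_max=5.0):
    """Gate on motion matching (b1), then on appearance matching (b2), then on
    the combined cost c_ij = lam * d1 + (1 - lam) * d2; thresholds are placeholders."""
    if d1 > t1:                      # b1 = 0: motion matching failed
        return False
    if d2 > t2:                      # b2 = 0: apparent matching failed
        return False
    return lam * d1 + (1.0 - lam) * d2 <= c_max
```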
Further, in order to improve matching efficiency, before determining the position information of the target object in each video frame, the target object is screened from candidate objects contained in each video frame. In specific implementation, firstly, candidate objects contained in each video frame in a video image sequence are detected; then, screening target objects included in the video image sequence according to the candidate objects included in each video frame, wherein screening the target objects included in the video image sequence specifically includes: and judging the interval time from the last successful matching of the candidate object to the current moment, if the interval time exceeds a preset threshold, indicating that the motion trail of the candidate object is terminated, and taking the candidate object as a target object.
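A small sketch of the screening rule in the previous paragraph, where a candidate whose interval since its last successful match exceeds a preset threshold is treated as having a terminated trajectory; the dictionary field name and threshold are assumptions:

```python
def screen_target_objects(candidates, current_frame, max_gap=30):
    """Return the candidates whose motion trajectory is considered terminated,
    i.e. whose gap since the last successful match exceeds the preset threshold."""
    return [c for c in candidates
            if current_frame - c["last_matched_frame"] > max_gap]
```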
Step S230: and detecting the bone key point information of the target object contained in each video frame in the video image sequence according to the position information of the target object in each video frame.
The bone key point information refers to information related to the posture of the target object obtained by using the OpenPose algorithm, and specifically includes the following two parts: the confidence S and the body-part affinity L.
Specifically, the first step: detect the bone key point information of the candidate objects contained in each video frame in the video image sequence, which specifically includes: detecting the bone key point information of the candidate objects contained in each video frame in the video image sequence by using the OpenPose algorithm. In a specific implementation, the process of detecting the bone key point information is as follows. Step one: input the video frame into the VGG-19 model, and output a feature map F through the first ten convolutional layers of the VGG-19 model. Step two: the feature map F is fed into two branches respectively, and the first branch is used to evaluate the confidence S. Evaluating the confidence S in the first branch specifically includes: after the feature map F is imported into the first branch, the body part corresponding to a certain key point in the feature map F is predicted; for example, key point 01 in the feature map F is predicted to belong to the left shoulder of the body. After the body part corresponding to the key point is predicted, the probability that the key point corresponds to that body part is evaluated; for example, the probability that key point 01 in the feature map F belongs to the left shoulder of the body is evaluated, and the probability obtained by the evaluation is the confidence S. The calculation formula used by the first branch to evaluate the confidence S is as follows:

S_1 = ρ_1(F)
S_t = ρ_t(F, S_{t-1}, L_{t-1})

where S_t denotes the confidence output at the t-th stage, F denotes the feature map, and L denotes the body-part affinity vector.
Step three: the feature map F is passed into a second branch, which is used to predict the body-part affinity L, which refers to the degree of association between different body parts. For example, the key point 01 in the feature map F belongs to the left shoulder portion of the body, the key point 02 belongs to the neck portion of the body, the left shoulder portion of the key point 01 and the neck portion of the key point 02 are associated with each other by the second branch prediction, and a set of vectors [ (x) representing the trunk between the left shoulder portion and the neck portion is further output1,y1),(x2,y2)]Wherein (x)1,y1) Indicates the position of the starting point of the trunk (i.e., the neck region), (x)2,y2) Indicating the location of the end point of the torso (i.e., the left shoulder region). The vector representing the trunk is the affinity L of the body part, and the vector contains the position information of each key point. Wherein L is1=φ1(F),
Figure BDA0002114475400000121
LtA body-part affinity vector representing the torso t.
A label is defined for the confidence S as follows:

S*_{j,k}(p) = exp(-||p - x_{j,k}||_2^2 / σ^2)   (1)

S*_j(p) = max_k S*_{j,k}(p)   (2)

where p is the current position and x_{j,k} is the coordinate of the jth part of the kth person. To prevent values that are too small from overwhelming the training, a constant σ is set. Formula (1) means that the closer the current position p is to the jth part of the kth person, the higher the score; formula (2) takes the maximum over k, i.e. finds the position with the highest score.

L*_{c,k}(p) = v, if the point p lies on the cth trunk of the kth person; otherwise 0   (3)

Formula (3) indicates that when the current position p is on the cth trunk of the kth person, the label of the current position p takes the unit vector of that trunk; otherwise it takes 0.

v = (x_{j2,k} - x_{j1,k}) / ||x_{j2,k} - x_{j1,k}||_2   (4)

Formula (4) divides the vector by its modulus to obtain the unit vector.

0 ≤ v · (p - x_{j1,k}) ≤ l_{c,k}  and  |v⊥ · (p - x_{j1,k})| ≤ σ_l   (5)

Formula (5) determines whether the point p lies on the cth trunk.
It should be noted that the bone information is extracted and rendered for the objects passing through the video. Motion changes may cause part of the bone information to fail to be acquired; at the positions where acquisition fails, zeros are padded: the coordinate position is set to 0 and the score value is also set to 0. This does not greatly interfere with the recognition of subsequent actions. As shown in fig. 7 and 8, fig. 7 shows one skeletal key point information extraction result, and fig. 8 shows another skeletal key point information extraction result.
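As an illustration of the confidence labels of formulas (1) and (2) above, a hedged NumPy sketch (the array shapes and the value of σ are assumptions, not the patent's code):

```python
import numpy as np

def confidence_label(p, part_coords, sigma=1.0):
    """Formulas (1) and (2): per-person score exp(-||p - x_jk||^2 / sigma^2) for
    one body part j, then the maximum over all persons k.

    p           -- (2,) current position
    part_coords -- (K, 2) coordinates x_jk of part j for each of the K persons
    """
    scores = np.exp(-np.sum((part_coords - p) ** 2, axis=1) / sigma ** 2)
    return float(scores.max())
```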
The second step is that: and screening the bone key point information of the target object from the bone key point information of the candidate object according to the position information of the target object in each video frame. The bone key point information of the screening target object specifically comprises the following steps: the method comprises the following steps: and extracting key point position information in skeleton key point information of the target object, and extracting position information of the target object in each video frame, wherein the skeleton key point information comprises position coordinates of each key point, and the position information comprises center position coordinates of a tracking frame of the target object.
Step two: and when the distance between the position coordinate of each key point in the skeleton key point information and the center position coordinate of the tracking frame in the position information is within a preset range, judging that the skeleton key point information is successfully matched with the position information of the target object. It should be noted that, in a video frame exceeding a preset frame number, if the matching between the key point position information in the skeleton key point information and the position information of the target object is successful, it indicates that the skeleton key point information and the target object are successfully bound, that is, the skeleton key point information is determined as the skeleton key point information of the target object.
Further, considering that a target object at a short distance may completely occlude a target object at a long distance, the extreme values of the key point position information in the bone key point information are compared with the coordinates of the four vertices of the tracking frame of the target object, so as to prevent binding errors. The extreme values of the key point position information refer to the maximum abscissa x_max, the minimum abscissa x_min, the maximum ordinate y_max and the minimum ordinate y_min among the key point positions. If the extreme values of the key point position information exceed the coordinates of the four vertices of the tracking frame of the target object, the bone key point information cannot be bound to that target object.
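A brief sketch of the binding rule described above: the skeleton is bound to a target object only when every key point lies within a preset range of the tracking-frame center, and binding is refused when the key-point extreme values fall outside the four vertices of the tracking frame. Variable names and the distance threshold are assumptions:

```python
def bind_skeleton_to_track(keypoints, center, box, max_dist=150.0):
    """keypoints: list of (x, y) key point coordinates;
    center: (cx, cy) of the tracking frame;
    box: (x_min, y_min, x_max, y_max) vertices of the tracking frame."""
    cx, cy = center
    # every key point must lie within a preset range of the tracking-frame center
    for x, y in keypoints:
        if ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 > max_dist:
            return False
    # occlusion guard: key-point extremes must stay inside the tracking frame
    xs, ys = [x for x, _ in keypoints], [y for _, y in keypoints]
    x_min, y_min, x_max, y_max = box
    if min(xs) < x_min or max(xs) > x_max or min(ys) < y_min or max(ys) > y_max:
        return False
    return True
```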
Step S240: and determining the action type of the target object according to a pre-trained bone recognition model and the detected bone key point information of the target object contained in each video frame.
Specifically, the skeleton key point information of the target object in each video frame is input into a skeleton recognition model, and the skeleton recognition model outputs the specific action category of the target object in each video frame.
Step S250: and detecting whether the behavior of the target object is abnormal or not according to the motion track of the target object and the action type of the target object.
Specifically, the motion speed of the target object is determined according to the motion track of the target object, and whether the motion speed of the target object exceeds a preset motion speed threshold value is judged; and/or judging whether the action type of the target object is an abnormal action type.
The specific abnormal behavior judgment rule may adopt at least one of the following:
Firstly, the action type of the target object in a video frame is detected, and whether the behavior of the target object is abnormal is judged according to the action type of the target object, where there may be multiple video frames and multiple target objects in a single video frame. For example, for a video frame numbered j and a target object i in that frame, a pedestrian suddenly stopping while in motion is set as one of the abnormal action types. If the total number of target objects with an abnormal action type in 20 video frames exceeds a preset abnormal-quantity threshold, it is judged that abnormal behavior occurs. The judgment formula is as follows:

Σ_j Σ_i T[A_{i,j} = walk] > σ

where j is the frame number of the video frame, i denotes a target object in a single video frame, A_{i,j} denotes the detected action type of target object i in video frame j, walk is an abnormal action type label, and σ is the preset abnormal-quantity threshold.
Secondly, the movement speed of the target object is determined according to the motion track of the target object, and whether the behavior of the target object is abnormal is judged according to the movement speed of the target object. For example, the speed V of the target object is determined from the displacement speed of its tracking frame; if the total number of target objects whose speed exceeds the preset speed threshold V_n exceeds a preset abnormal-quantity threshold, it is judged that abnormal behavior occurs. The judgment formula is as follows:

Σ_j Σ_i T[V_{i,j} > V_n] > σ

where j is the frame number of the video frame, i denotes a target object in a single video frame, V_{i,j} denotes the movement speed of target object i in video frame j, V_n is the preset speed threshold, and σ is the preset abnormal-quantity threshold.
Thirdly, the movement speed and the action category of the target object are used in combination; for example, when a preset number of pedestrians exhibit a preset abnormal behavior category, an abnormal behavior alert is raised.
It should be noted that, the specific meaning of the abnormal behavior determination rule is not specifically limited in this embodiment, and those skilled in the art may use other methods to define the specific meaning of the abnormal behavior determination rule.
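The first two judgment rules above amount to simple counting checks; a hedged sketch follows (the thresholds, the abnormal label and the data layout are illustrative assumptions, not values given by the patent):

```python
def abnormal_by_action(actions_per_frame, abnormal_label="walk", sigma=5):
    """Rule 1: count, over the recent video frames, the target objects whose
    action type equals the abnormal label; abnormal if the count exceeds sigma."""
    count = sum(1 for frame in actions_per_frame
                for label in frame if label == abnormal_label)
    return count > sigma

def abnormal_by_speed(speeds_per_frame, v_n=8.0, sigma=5):
    """Rule 2: count the target objects whose movement speed exceeds the preset
    speed threshold V_n; abnormal if the count exceeds sigma."""
    count = sum(1 for frame in speeds_per_frame
                for v in frame if v > v_n)
    return count > sigma
```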
In summary, in this way, firstly, the Kalman filtering algorithm together with the motion matching degree index and the apparent matching degree index is used to determine the tracking frame of the target object in each video frame, the position information of the target object is determined according to the position information of the tracking frame, and the motion track of the target object is thereby determined. Then, the OpenPose algorithm is used to detect the bone key point information of the target object contained in each video frame. Finally, the motion track of the target object is bound to the bone key point information of the target object, and whether the behavior of the target object is abnormal is evaluated. The advantages of this approach are: on one hand, automatic extraction, composition and analysis of the video content are realized, which greatly reduces the manpower and material resources required by the detection system; on the other hand, the motion track and the action category of the target object are comprehensively utilized to achieve accurate prediction of abnormal behaviors.
EXAMPLE III
Fig. 3 is a block diagram of an object behavior detection apparatus based on video analysis according to a third embodiment, where the apparatus includes:
a target object detection module 31 adapted to convert an original video into a video image sequence including a plurality of video frames, and detect a target object included in each video frame in the video image sequence;
a motion track determining module 32, adapted to determine position information of the target object in each video frame, respectively, and determine a motion track of the target object according to the position information;
a bone key point information detection module 33, adapted to detect bone key point information of a target object contained in each video frame in a video image sequence according to position information of the target object in each video frame;
the action category determining module 34 is adapted to determine an action category of the target object according to a bone recognition model trained in advance and detected bone key point information of the target object contained in each video frame;
the abnormal behavior determining module 35 is adapted to detect whether the behavior of the target object is abnormal according to the motion trajectory of the target object and the motion category of the target object.
Optionally, the target object detection module 31 is adapted to:
detecting a candidate object contained in each video frame in the video image sequence;
and screening target objects contained in the video image sequence according to the candidate objects contained in each video frame.
Optionally, the bone key point information detection module 33 is adapted to:
detecting bone key point information of candidate objects contained in each video frame in a video image sequence;
and screening the bone key point information of the target object from the bone key point information of the candidate object according to the position information of the target object in each video frame.
Optionally, the bone key point information detection module 33 is adapted to:
and detecting the bone key point information of the candidate object contained in each video frame in the video image sequence by utilizing an OpenPose algorithm.
Optionally, the motion trajectory determination module 32 is adapted to:
and determining the position information of the target object in each video frame according to the tracking frame of the target object.
Optionally, the target object detection module 31 is adapted to:
the YOLOV3 algorithm is used to detect a target object contained in each video frame in a sequence of video images.
Optionally, the motion trajectory determination module 32 is adapted to:
estimating a predicted tracking result of the target object by using a Kalman filtering algorithm according to the position information of the target object in the earlier of two adjacent video frames;
and judging whether the predicted tracking result of the target object is matched with the actual detection result of the target object or not from the two aspects of the motion matching degree and the apparent matching degree.
The embodiment of the application provides a non-volatile computer storage medium, wherein at least one executable instruction is stored in the computer storage medium, and the computer executable instruction can execute an object behavior detection method based on video analysis in any method embodiment.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the electronic device.
As shown in fig. 4, the electronic device may include: a processor (processor)402, a Communications Interface 404, a memory 406, and a Communications bus 408.
Wherein:
the processor 402, communication interface 404, and memory 406 communicate with each other via a communication bus 408.
A communication interface 404 for communicating with network elements of other devices, such as clients or other servers.
The processor 402 is configured to execute the program 410, and may specifically execute the relevant steps in the above-described embodiments of the object behavior detection method based on video analysis.
In particular, program 410 may include program code comprising computer operating instructions.
The processor 402 may be a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The electronic device comprises one or more processors, which may be the same type of processor, such as one or more CPUs, or different types of processors, such as one or more CPUs and one or more ASICs.
And a memory 406 for storing a program 410. Memory 406 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 410 may be specifically configured to cause the processor 402 to perform the operations in the above-described method embodiments.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components in a system according to embodiments of the present invention. The present invention may also be embodied as apparatus or system programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several systems, several of these systems may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims (10)

1. An object behavior detection method based on video analysis comprises the following steps:
converting an original video into a video image sequence comprising a plurality of video frames, and detecting a target object contained in each video frame in the video image sequence;
respectively determining the position information of the target object in each video frame, and determining the motion track of the target object according to the position information;
detecting skeleton key point information of the target object contained in each video frame in the video image sequence according to the position information of the target object in each video frame;
determining the action category of the target object according to a pre-trained bone recognition model and the detected bone key point information of the target object contained in each video frame;
and detecting whether the behavior of the target object is abnormal or not according to the motion track of the target object and the action category of the target object.
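By way of illustration only, the following Python sketch outlines the pipeline recited in claim 1; it is not code disclosed by the application. The callables detector, pose_estimator and action_model, and the simple is_abnormal rule (with its restricted-action set and restricted zone), are hypothetical placeholders for the detection, skeleton-recognition and judging steps described above.

import cv2


def video_to_frames(path, stride=1):
    """Convert the original video into a sequence of video frames."""
    cap = cv2.VideoCapture(path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames


def is_abnormal(trajectory, action, restricted_actions=("fall", "fight"), restricted_zone=None):
    """Toy judging rule: abnormal if the action category is in a restricted set,
    or the motion track enters an (assumed) restricted zone (x0, y0, x1, y1)."""
    if action in restricted_actions:
        return True
    if restricted_zone is not None:
        x0, y0, x1, y1 = restricted_zone
        return any(x0 <= x <= x1 and y0 <= y <= y1 for x, y in trajectory)
    return False


def detect_behavior(video_path, detector, pose_estimator, action_model):
    """detector(frame) -> list of [x0, y0, x1, y1] boxes; pose_estimator(frame, box) -> key points;
    action_model(keypoint_sequence) -> action category string."""
    frames = video_to_frames(video_path)
    trajectory, keypoint_sequence = [], []
    for frame in frames:
        boxes = detector(frame)                      # detect the target object in this frame
        if not boxes:
            continue
        x0, y0, x1, y1 = boxes[0]                    # single-target case for brevity
        trajectory.append(((x0 + x1) / 2.0, (y0 + y1) / 2.0))   # per-frame position -> motion track
        keypoint_sequence.append(pose_estimator(frame, boxes[0]))
    action = action_model(keypoint_sequence)         # action category from the skeleton model
    return action, is_abnormal(trajectory, action)

The sketch deliberately combines the motion track and the action category in a single rule at the end, mirroring the final judging step of the claim; a real system would substitute trained models for the placeholder callables.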
2. The method of claim 1, wherein said detecting a target object contained in each video frame in the sequence of video images comprises:
detecting a candidate object contained in each video frame in the sequence of video images;
and screening out the target object contained in the video image sequence from the candidate objects contained in the respective video frames.
3. The method according to claim 1 or 2, wherein the detecting of the skeleton key point information of the target object contained in each video frame in the video image sequence according to the position information of the target object in each video frame comprises:
detecting bone key point information of candidate objects contained in each video frame in the video image sequence;
and screening the bone key point information of the target object from the bone key point information of the candidate object according to the position information of the target object in each video frame.
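As a small illustrative sketch of the screening step in claim 3: candidate skeletons whose key points mostly fall inside the target object's bounding box are kept. The 0.5 ratio threshold and the (x, y, score) key point format are assumptions made only for illustration.

def select_target_keypoints(candidate_skeletons, target_box, min_ratio=0.5):
    """Pick the candidate skeleton that best overlaps the target object's box.
    candidate_skeletons: list of skeletons, each a list of (x, y, score) key points."""
    x0, y0, x1, y1 = target_box
    best, best_ratio = None, 0.0
    for skeleton in candidate_skeletons:
        inside = sum(1 for (x, y, s) in skeleton if x0 <= x <= x1 and y0 <= y <= y1)
        ratio = inside / max(len(skeleton), 1)
        if ratio >= min_ratio and ratio > best_ratio:
            best, best_ratio = skeleton, ratio
    return best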
4. The method of claim 3, wherein the detecting of the bone key point information of the candidate objects contained in each video frame in the video image sequence comprises:
and detecting the bone key point information of the candidate object contained in each video frame in the video image sequence by utilizing an OpenPose algorithm.
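One possible realization of claim 4 uses the official OpenPose Python bindings (pyopenpose). The model_folder path is an assumption, and the exact emplaceAndPop signature varies between OpenPose builds, so this is a hedged sketch rather than the applicant's actual implementation.

import pyopenpose as op  # requires a local OpenPose build with its Python API enabled

params = {"model_folder": "openpose/models/"}   # assumed location of the OpenPose models
wrapper = op.WrapperPython()
wrapper.configure(params)
wrapper.start()


def detect_candidate_keypoints(frame):
    """Return pose key points for all candidate objects in one video frame,
    as an array of shape (num_people, num_keypoints, 3): x, y, confidence."""
    datum = op.Datum()
    datum.cvInputData = frame
    wrapper.emplaceAndPop(op.VectorDatum([datum]))  # older builds accept a plain list here
    return datum.poseKeypoints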
5. The method of claim 1, wherein the determining the position information of the target object in each video frame comprises:
and determining the position information of the target object in each video frame according to the tracking frame of the target object.
6. The method of claim 1, wherein said detecting a target object contained in each video frame in the sequence of video images comprises:
the YOLOV3 algorithm is used to detect a target object contained in each video frame in the sequence of video images.
7. The method of claim 1, wherein the determining of the motion track of the target object according to the position information comprises:
estimating a predicted tracking result of the target object by using a Kalman filtering algorithm according to the position information of the target object in the earlier of two adjacent video frames;
and judging, in terms of both a motion matching degree and an appearance matching degree, whether the predicted tracking result of the target object matches the actual detection result of the target object.
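By way of illustration of claim 7, the sketch below predicts the target object's next position with a hand-rolled constant-velocity Kalman prediction step, then combines a motion matching degree (IoU between predicted and detected boxes, used here as a simple stand-in for a Mahalanobis gate) with an appearance matching degree (cosine similarity of appearance features), in the spirit of the cited Deep SORT tracker. The 0.5 weighting and the process-noise value are assumptions.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Constant-velocity model over the target object's centre: state = [cx, cy, vx, vy].
F = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])


def kalman_predict(x, P, Q=None):
    """One Kalman prediction step: x' = F x, P' = F P F^T + Q."""
    Q = np.eye(4) * 1e-2 if Q is None else Q
    return F @ x, F @ P @ F.T + Q


def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_tracks(pred_boxes, track_feats, det_boxes, det_feats, w=0.5):
    """Match predicted tracking results to actual detections by combining the
    motion matching degree and the appearance matching degree into one cost."""
    cost = np.zeros((len(pred_boxes), len(det_boxes)))
    for i in range(len(pred_boxes)):
        for j in range(len(det_boxes)):
            motion_cost = 1.0 - iou(pred_boxes[i], det_boxes[j])
            denom = np.linalg.norm(track_feats[i]) * np.linalg.norm(det_feats[j]) + 1e-12
            appearance_cost = 1.0 - float(np.dot(track_feats[i], det_feats[j]) / denom)
            cost[i, j] = w * motion_cost + (1.0 - w) * appearance_cost
    rows, cols = linear_sum_assignment(cost)      # Hungarian assignment over the combined cost
    return list(zip(rows.tolist(), cols.tolist()))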
8. An object behavior detection apparatus based on video analysis, comprising:
the target object detection module is suitable for converting an original video into a video image sequence containing a plurality of video frames and detecting a target object contained in each video frame in the video image sequence;
the motion track determining module is suitable for respectively determining the position information of the target object in each video frame and determining the motion track of the target object according to the position information;
the skeleton key point information detection module is suitable for detecting skeleton key point information of the target object contained in each video frame in the video image sequence according to the position information of the target object in each video frame;
the action category determining module is suitable for determining the action category of the target object according to a bone recognition model trained in advance and the detected bone key point information of the target object contained in each video frame;
and the abnormal behavior judging module is suitable for detecting whether the behavior of the target object is abnormal or not according to the motion track of the target object and the action category of the target object.
9. An electronic device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute operations corresponding to the object behavior detection method based on video analysis according to any one of claims 1-7.
10. A computer storage medium having at least one executable instruction stored therein, the executable instruction causing a processor to perform operations corresponding to the object behavior detection method based on video analysis according to any one of claims 1-7.
CN201910585625.1A 2019-07-01 2019-07-01 Object behavior detection method and device based on video analysis Pending CN112183153A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910585625.1A CN112183153A (en) 2019-07-01 2019-07-01 Object behavior detection method and device based on video analysis


Publications (1)

Publication Number Publication Date
CN112183153A 2021-01-05

Family

ID=73914280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910585625.1A Pending CN112183153A (en) 2019-07-01 2019-07-01 Object behavior detection method and device based on video analysis

Country Status (1)

Country Link
CN (1) CN112183153A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105718857A (en) * 2016-01-13 2016-06-29 兴唐通信科技有限公司 Human body abnormal behavior detection method and system
CN109460702A (en) * 2018-09-14 2019-03-12 华南理工大学 Passenger's abnormal behaviour recognition methods based on human skeleton sequence

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NICOLAI WOJKE et al.: "Simple Online and Realtime Tracking with a Deep Association Metric", arXiv:1703.07402v1, pages 1-5 *
SIJIE YAN et al.: "Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition", arXiv:1801.07455v2, pages 1-10 *
ZHE CAO et al.: "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", arXiv:1611.08050v2, pages 1-9 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766192A (en) * 2021-01-25 2021-05-07 北京市地铁运营有限公司地铁运营技术研发中心 Intelligent train monitoring system
CN112822460A (en) * 2021-02-01 2021-05-18 深圳市瑞驰文体发展有限公司 Billiard game video monitoring method and system
CN112818320A (en) * 2021-02-26 2021-05-18 上海交通大学 Smartphone pattern password conjecture method based on video
CN112926541B (en) * 2021-04-09 2022-11-08 济南博观智能科技有限公司 Sleeping post detection method and device and related equipment
CN112926541A (en) * 2021-04-09 2021-06-08 济南博观智能科技有限公司 Sleeping post detection method and device and related equipment
CN113221704A (en) * 2021-04-30 2021-08-06 陕西科技大学 Animal posture recognition method and system based on deep learning and storage medium
CN113096158A (en) * 2021-05-08 2021-07-09 北京灵汐科技有限公司 Moving object identification method and device, electronic equipment and readable storage medium
CN113473124A (en) * 2021-05-28 2021-10-01 北京达佳互联信息技术有限公司 Information acquisition method and device, electronic equipment and storage medium
CN113473124B (en) * 2021-05-28 2024-02-06 北京达佳互联信息技术有限公司 Information acquisition method, device, electronic equipment and storage medium
CN113378799A (en) * 2021-07-21 2021-09-10 山东大学 Behavior recognition method and system based on target detection and attitude detection framework
CN113763429A (en) * 2021-09-08 2021-12-07 广州市健坤网络科技发展有限公司 Pig behavior recognition system and method based on video
CN114254492A (en) * 2021-12-08 2022-03-29 新国脉文旅科技有限公司 Passenger flow behavior track destination simulation method based on passenger flow portrayal
CN114285960A (en) * 2022-01-29 2022-04-05 北京卡路里信息技术有限公司 Video processing method and device
CN114285960B (en) * 2022-01-29 2024-01-30 北京卡路里信息技术有限公司 Video processing method and device
CN114677625A (en) * 2022-03-18 2022-06-28 北京百度网讯科技有限公司 Object detection method, device, apparatus, storage medium and program product
CN114677625B (en) * 2022-03-18 2023-09-08 北京百度网讯科技有限公司 Object detection method, device, apparatus, storage medium, and program product
CN114757855A (en) * 2022-06-16 2022-07-15 广州三七极耀网络科技有限公司 Method, device, equipment and storage medium for correcting action data
CN115083022A (en) * 2022-08-22 2022-09-20 深圳比特微电子科技有限公司 Pet behavior identification method and device and readable storage medium

Similar Documents

Publication Publication Date Title
CN112183153A (en) Object behavior detection method and device based on video analysis
CN112990432B (en) Target recognition model training method and device and electronic equipment
CN109978893B (en) Training method, device, equipment and storage medium of image semantic segmentation network
CN110633745B (en) Image classification training method and device based on artificial intelligence and storage medium
CN107563372B (en) License plate positioning method based on deep learning SSD frame
CN108230291B (en) Object recognition system training method, object recognition method, device and electronic equipment
CN111814661B (en) Human body behavior recognition method based on residual error-circulating neural network
CN110738101A (en) Behavior recognition method and device and computer readable storage medium
CN107633226B (en) Human body motion tracking feature processing method
CN109002755B (en) Age estimation model construction method and estimation method based on face image
Chen et al. Research on recognition of fly species based on improved RetinaNet and CBAM
CN109658442B (en) Multi-target tracking method, device, equipment and computer readable storage medium
CN104537647A (en) Target detection method and device
Ahmad et al. Overhead view person detection using YOLO
CN109919223B (en) Target detection method and device based on deep neural network
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
Li et al. Robust vehicle detection in high-resolution aerial images with imbalanced data
CN114821014A (en) Multi-mode and counterstudy-based multi-task target detection and identification method and device
CN105303163A (en) Method and detection device for target detection
CN113378675A (en) Face recognition method for simultaneous detection and feature extraction
Sismananda et al. Performance comparison of yolo-lite and yolov3 using raspberry pi and motioneyeos
CN111274964A (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
CN114359787A (en) Target attribute identification method and device, computer equipment and storage medium
CN113870254A (en) Target object detection method and device, electronic equipment and storage medium
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210105)