WO2021129569A1 - Human action recognition method - Google Patents

Human action recognition method

Info

Publication number
WO2021129569A1
WO2021129569A1 (PCT/CN2020/137991; CN2020137991W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
action
pixel
formula
video
Prior art date
Application number
PCT/CN2020/137991
Other languages
French (fr)
Chinese (zh)
Inventor
井焜
高朋
许野平
刘辰飞
陈英鹏
张朝瑞
席道亮
Original Assignee
神思电子技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 神思电子技术股份有限公司 filed Critical 神思电子技术股份有限公司
Publication of WO2021129569A1 publication Critical patent/WO2021129569A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/30Noise filtering


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)

Abstract

A human action recognition method, comprising: first preprocessing the images, including neighborhood construction and filtering; then performing image channel transformation, target contour enhancement, and differential image extraction; applying threshold processing and foreground processing to the foreground image; and finally performing model training, or action recognition and action localization, on the basis of a three-dimensional convolutional network. The method solves the problem in existing action recognition methods of reduced model detection precision in large scenes with small targets and complex backgrounds, and also achieves action detection and action localization in arbitrary continuous, unbounded video streams, improving the precision of human action recognition and its robustness in different application scenarios and improving the generalization capability of the model.

Description

Human action recognition method
Technical field
The invention relates to a human body action recognition method and belongs to the technical field of human action recognition.
Background art
Action recognition performs the task of action classification by extracting action features from consecutive video frames, helping to avoid potentially dangerous behaviors in practice. It has a wide range of practical application scenarios and has therefore long been an active research direction in computer vision. Existing deep-learning-based action recognition methods achieve high classification accuracy in small scenes with large targets. However, in real-time monitoring with complex (noisy) backgrounds and small targets, existing human action recognition methods suffer from low recognition accuracy and produce large numbers of missed detections and false alarms.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides a human body action recognition method that solves the problem of low action recognition accuracy in large scenes with small targets and complex backgrounds and, with a small amount of computation, also accurately localizes and classifies actions in continuous videos of arbitrary length.
To solve the above technical problem, the technical solution adopted by the present invention is a human body action recognition method comprising the following steps:
S01) Decode the video and preprocess each frame of the video; the preprocessing includes minimum-neighborhood selection and filter design, and a Kalman filter is used to filter the image;
S02) Convert the image format of the preprocessed image according to formula 21, so that the output image is converted from a three-channel RGB image into a single-channel grayscale (GRAY) image:
Gray(m,n) = 0.299·r(m,n) + 0.587·g(m,n) + 0.441·b(m,n)      (21),
where Gray(m,n) is the gray value of the filter-output grayscale image at pixel (m,n), and r(m,n), g(m,n), b(m,n) are the corresponding three-channel pixel values of the color image at pixel (m,n);
S03) Perform target contour enhancement on the image according to formula 31 (given only as an image in the source document), to remove noise from the grayscale image while improving the contour definition of the target in the image, where Pixel(m,n) denotes the pixel value computed after contour enhancement of the preprocessed output grayscale image at pixel (m,n), Gray(m,n) is the pixel value at (m,n) of the single-channel grayscale image obtained from formula 21, w(m,n,i,j) is a weight, and i, j denote the neighborhood size;
the weight w(m,n,i,j) consists of two parts, the spatial distance d(m,n,i,j) and the pixel distance r(m,n,i,j), and is computed as:
w(m,n,i,j) = d(m,n,i,j)·r(m,n,i,j)      (32),
where d(m,n,i,j) and r(m,n,i,j) are defined by formulas 33 and 34 (given only as images in the source document), with δ_d = 0.7 and δ_r = 0.2;
S04) Every 8 frames, select three images I_t, I_{t-8}, I_{t-16} from the image sequence and denote the resulting foreground picture by D; the pixel values of the three pictures at pixel (m,n) are I_t(m,n), I_{t-8}(m,n), I_{t-16}(m,n) respectively, and the foreground image is:
D(m,n) = |I_t(m,n) - I_{t-8}(m,n)| ∩ |I_{t-8}(m,n) - I_{t-16}(m,n)|      (41),
A threshold operation is then applied to the foreground image D(m,n) according to formula 42 (given only as an image in the source document), where the threshold T is computed as:
T = Min(T_{t/t-8}, T_{t-8/t-16})      (43),
In formula 43, T_{t/t-8} and T_{t-8/t-16} take the values satisfying formulas 44 and 45 respectively (given only as images in the source document), where A is the number of pixels in the whole picture and δ = 0.6;
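Purely for illustration and not part of the patent text, the frame-sampling rule of step S04 can be sketched in Python as follows; the helper name select_triplets and the representation of frames as grayscale arrays are assumptions of this sketch.

```python
def select_triplets(gray_frames, step=8):
    """Illustrative sketch of the sampling rule in S04: starting at index
    2*step, every `step` frames pick the triple (I_t, I_{t-step}, I_{t-2*step})
    from the preprocessed grayscale frame sequence."""
    triplets = []
    for t in range(2 * step, len(gray_frames), step):
        triplets.append((gray_frames[t],
                         gray_frames[t - step],
                         gray_frames[t - 2 * step]))
    return triplets
```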
S05) Perform erosion and dilation operations on the foreground image D(m,n);
S06) Convert the acquired grayscale foreground image D(m,n) into a three-channel image, combine the frames into a continuous picture sequence, and input the sequence into a three-dimensional convolutional network for training and detection.
Further, the specific steps by which the three-dimensional convolutional network detects the continuous picture sequence are as follows:
S61) The input of the three-dimensional convolutional network is a collection of video frame images with 3 channels, video length L, video frame image height H, and video frame image width W; after forward propagation through the three-dimensional convolutional network, the output is a collection of feature maps with 2048 channels and a reduced video length, frame height, and frame width (the exact downsampled sizes are given only as images in the source document);
S62) A multi-scale window is predefined centered at uniformly distributed time positions; each time position specifies K anchor segments, and each anchor segment has a different fixed scale. By applying a 3D max-pooling filter (its kernel size is given only as an image in the source document), the spatial dimensions are downsampled to 1×1 to generate a time-only feature map collection C_tpn, which has 2048 channels, the downsampled video length, a frame height of 1, and a frame width of 1; the 2048-dimensional feature vector at each time position in C_tpn is used to predict the relative offsets {σC_k, σl_k} of the center position and length {C_k, l_k}, k ∈ {1,...,K}, of each anchor segment;
S63) Classification uses a softmax loss function and regression uses a smooth L1 loss function (the L1 loss function itself is given only as an image in the source document), where N_cls and N_reg denote the batch size and the number of proposal boxes, λ is the loss trade-off parameter and is set to 1, k is the index of a proposal box within the batch, and a_k is the probability predicted for the proposal box or action; the ground-truth action value of the real action box, the relative offsets predicted with respect to the anchor segment or proposal box, and the coordinate transformation from the ground-truth video segment to the anchor segment or proposal, together with the formula that computes this transformation, are likewise given only as images in the source document; c_k and l_k are the center position and length of the anchor or proposal, and the corresponding ground-truth quantities represent the center position and length of the ground-truth action segment of the video.
Further, the L1 loss function is applied to both the temporal proposal subnet and the action classification subnet. In the proposal subnet, the binary classification loss L_cls predicts whether a proposal box contains an action, and the regression loss L_reg optimizes the relative displacement between the proposal and the ground truth. In the action classification subnet, the multi-class classification loss L_cls predicts a specific action category for the proposal box, where the number of categories is the number of actions plus one background class, and the regression loss L_reg optimizes the relative displacement between the action and the ground truth.
Further, in step S01 the minimum neighborhood width of the two-dimensional image is set to 9, i.e., one pixel and the 8 pixels around it are taken as the minimum filtering neighborhood, and the Kalman filter based on this minimum filtering neighborhood is designed as follows:
S11) The gray value X(m,n) of pixel (m,n) is expressed linearly as:
X(m,n) = F(m|i,n|j)·X^T(m|i,n|j) + Φ(m,n)      (11),
where T denotes transposition and Φ(m,n) is the noise term; F(m|i,n|j) and X(m|i,n|j) are defined by formulas 12 and 13 (given only as images in the source document),
so that formula 11 can be rewritten as formula 14 (given only as an image in the source document), where x(m+i,n+j) is the pixel value of each point in the image (a known quantity) and c(m+i,n+j) is the weight of each point of the original video frame image (an unknown quantity);
S12) The criterion for computing c(m+i,n+j) is formula 15 (given only as an image in the source document); the value of c(m+i,n+j) must minimize formula 15, which yields formula 16 (given only as an image in the source document), in which A and B are respectively expressed as:
A = x(m+i,n+j)      (17),
B = x(m+i,n+j) - x(m+i-1,n+j);
S13) Let the observation equation be:
Z(m,n) = X(m,n) + V(m,n)      (18),
where V(m,n) is noise;
S14) Based on the minimum linear variance, the recursive formula of the two-dimensional discrete Kalman filter in the 3×3 neighborhood of pixel (m,n) is:
X(m,n) = F(m|i,n|j)·X^T(m|i,n|j) + K(m,n)·[Z(m,n) - F(m|i,n|j)·X^T(m|i,n|j)]      (19),
the one-step prediction variance equation is formula 110 (given only as an image in the source document), the gain equation is:
K(m,n) = P_{m/m-1}(m,n) / [P_{m/m-1}(m,n) + r(m,n)]      (111),
and the error variance matrix equation is:
P_{m/m}(m,n) = [1 - K(m,n)]^2·P_{m/m-1}(m,n) + K^2(m,n)·r(m,n)      (112).
The filter is constructed from formulas 19, 110, 111, and 112, completing the preprocessing of the input data.
Beneficial effects of the present invention: in the continuous-video action detection task, the present invention uses a background removal method to reduce the influence of the video background on detection accuracy. It solves the problem of reduced detection accuracy of existing action recognition models in large scenes, with small targets and complex backgrounds. At the same time, it achieves action detection and action localization in arbitrary continuous, unbounded video streams, improving the accuracy of human action recognition and its robustness in different application scenarios, and improving the generalization capability of the model. In addition, a three-dimensional convolutional neural network is used to encode the video stream and extract video action features, completing the action classification task and the action localization task at the same time.
Description of the drawings
Figure 1 is a flow chart of the present invention.
Detailed description of the embodiments
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by those of ordinary skill in the technical field to which the present invention belongs.
Example 1
This embodiment mainly targets large scenes with small targets. By preprocessing the training and test data, the impact of complex backgrounds on model detection accuracy is reduced and the action recognition accuracy of the model is improved. At the same time, only a single three-dimensional convolutional deep learning model is used to detect and precisely localize actions in continuous videos of arbitrary length, reducing the amount of computation.
As shown in Figure 1, this embodiment includes the following steps:
Step 1: image preprocessing:
Decode the video and preprocess each frame of the video; the preprocessing includes the following steps:
1) Minimum neighborhood selection
For a two-dimensional image, the minimum neighborhood width is 9, i.e., one pixel and the 8 pixels around it are taken as the minimum filtering neighborhood; that is, within the neighborhood window (i, j) of a pixel, i and j take integer values in the range [-1, 1].
2) Filter design
The gray value X(m,n) of pixel (m,n) is expressed linearly as:
X(m,n) = F(m|i,n|j)·X^T(m|i,n|j) + Φ(m,n)      (11),
where T denotes transposition and Φ(m,n) is the noise term; F(m|i,n|j) and X(m|i,n|j) are defined by formulas 12 and 13 (given only as images in the source document), so that formula 11 can be rewritten as formula 14 (given only as an image in the source document), where x(m+i,n+j) is the pixel value of each point of the original video frame image (a known quantity) and c(m+i,n+j) is the weight of each point of the original video frame image (an unknown quantity);
The criterion for computing c(m+i,n+j) is formula 15 (given only as an image in the source document), in which E denotes the expectation (matrix mean) operator; the value of c(m+i,n+j) must minimize formula 15, from which formula 16 is obtained (given only as an image in the source document), in which:
A = x(m+i,n+j)      (17),
B = x(m+i,n+j) - x(m+i-1,n+j);
Let the observation equation be:
Z(m,n) = X(m,n) + V(m,n)      (18),
where V(m,n) is white noise with zero mean and variance r(m,n);
Based on the minimum linear variance, the recursive formula of the two-dimensional discrete Kalman filter in the 3×3 neighborhood of pixel (m,n) is:
X(m,n) = F(m|i,n|j)·X^T(m|i,n|j) + K(m,n)·[Z(m,n) - F(m|i,n|j)·X^T(m|i,n|j)]      (19),
the one-step prediction variance equation is formula 110 (given only as an image in the source document), the gain equation is:
K(m,n) = P_{m/m-1}(m,n) / [P_{m/m-1}(m,n) + r(m,n)]      (111),
and the error variance matrix equation is:
P_{m/m}(m,n) = [1 - K(m,n)]^2·P_{m/m-1}(m,n) + K^2(m,n)·r(m,n)      (112).
The filter is constructed from formulas 19, 110, 111, and 112, completing the preprocessing of the input data.
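For illustration only, a minimal per-frame sketch of such a 3×3-neighborhood recursive (Kalman-style) filter is given below. The prediction matrices, the weights c(m+i,n+j), and formula 110 are defined only in the images of the source document, so this sketch substitutes a simple causal-neighborhood mean for the one-step prediction and treats the process and measurement variances q and r as assumed constants; it shows the structure of formulas 19, 111, and 112 rather than the exact filter.

```python
import numpy as np

def kalman_filter_frame(z, q=1e-2, r=1e-1):
    """Minimal sketch of a 3x3-neighborhood recursive (Kalman-style) filter.

    z: one grayscale frame (2-D array), the observation Z(m, n).
    q, r: assumed process / measurement noise variances (stand-ins for the
          quantities the patent derives in formulas 110-112).
    """
    h, w = z.shape
    x = z.astype(np.float64).copy()          # state estimate X(m, n)
    p = np.full((h, w), float(r))            # error variance P(m, n)
    for m in range(1, h - 1):
        for n in range(1, w - 1):
            # one-step prediction from already-filtered neighbors
            # (stand-in for F(m|i,n|j) . X^T(m|i,n|j))
            x_pred = (x[m - 1, n - 1:n + 2].mean() + x[m, n - 1]) / 2.0
            p_pred = p[m, n] + q                          # stand-in for formula 110
            k = p_pred / (p_pred + r)                     # gain, formula 111
            x[m, n] = x_pred + k * (z[m, n] - x_pred)     # update, formula 19
            p[m, n] = (1 - k) ** 2 * p_pred + k ** 2 * r  # formula 112
    return x
```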
Step 2: image format conversion:
Convert the image format of the preprocessed image according to formula 21, so that the output image is converted from a three-channel RGB image into a single-channel grayscale (GRAY) image:
Gray(m,n) = 0.299·r(m,n) + 0.587·g(m,n) + 0.441·b(m,n)      (21),
where Gray(m,n) is the gray value of the filter-output grayscale image at pixel (m,n), and r(m,n), g(m,n), b(m,n) are the corresponding three-channel pixel values of the color image at pixel (m,n).
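As a small illustration (not part of the patent text), formula 21 can be applied to an H×W×3 frame as follows; the coefficients are copied verbatim from the patent (the blue weight 0.441 differs from the 0.114 of the common ITU-R BT.601 conversion).

```python
import numpy as np

def to_gray(rgb):
    """Formula 21: convert an H x W x 3 RGB frame to a single-channel image,
    using the coefficients exactly as printed in the patent."""
    rgb = np.asarray(rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.441 * b
```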
Step 3: target contour enhancement, performed as follows:
The pixel value of the output grayscale image at (m,n) is given by formula 31 (given only as an image in the source document), where Pixel(m,n) denotes the pixel value computed after contour enhancement of the preprocessed output grayscale image at pixel (m,n), Gray(m,n) is the pixel value at (m,n) of the single-channel grayscale image obtained from formula 21, w(m,n,i,j) is a weight, and i, j denote the neighborhood size;
the weight w(m,n,i,j) consists of two parts, the spatial distance d(m,n,i,j) and the pixel distance r(m,n,i,j), and is computed as:
w(m,n,i,j) = d(m,n,i,j)·r(m,n,i,j)      (32),
where d(m,n,i,j) and r(m,n,i,j) are defined by formulas 33 and 34 (given only as images in the source document), with δ_d = 0.7 and δ_r = 0.2.
With this method, the noise in the grayscale image can be removed while the contour definition of the target in the image is improved.
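Formulas 31, 33 and 34 appear only as images in the source document, so the sketch below assumes the usual bilateral-filter form for this step: a Gaussian spatial kernel for d(m,n,i,j), a Gaussian range kernel for r(m,n,i,j), and a normalized weighted average over the 3×3 neighborhood for Pixel(m,n); only δ_d = 0.7 and δ_r = 0.2 are taken from the text. It illustrates the structure of the step, not the exact formulas.

```python
import numpy as np

def enhance_contour(gray, delta_d=0.7, delta_r=0.2, radius=1):
    """Sketch of step 3 (contour enhancement), assuming a bilateral-style
    weighting; `gray` is the single-channel image from formula 21, scaled
    to [0, 1]."""
    h, w = gray.shape
    out = np.zeros_like(gray, dtype=np.float64)
    for m in range(h):
        for n in range(w):
            num = den = 0.0
            for i in range(-radius, radius + 1):
                for j in range(-radius, radius + 1):
                    mi = min(max(m + i, 0), h - 1)
                    nj = min(max(n + j, 0), w - 1)
                    d = np.exp(-(i * i + j * j) / (2 * delta_d ** 2))                    # spatial distance
                    r = np.exp(-(gray[m, n] - gray[mi, nj]) ** 2 / (2 * delta_r ** 2))   # pixel distance
                    wgt = d * r                                                          # formula 32
                    num += wgt * gray[mi, nj]
                    den += wgt
            out[m, n] = num / den
    return out
```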
Step 4: Taking into account the amplitude of the action and the frame rate of the video, and to suppress holes as far as possible, every 8 frames three images I_t, I_{t-8}, I_{t-16} are selected from the image sequence, and the resulting foreground picture is denoted by D; the pixel values of the three pictures at pixel (m,n) are I_t(m,n), I_{t-8}(m,n), I_{t-16}(m,n) respectively, and the foreground image is:
D(m,n) = |I_t(m,n) - I_{t-8}(m,n)| ∩ |I_{t-8}(m,n) - I_{t-16}(m,n)|      (41),
A threshold operation is then applied to the foreground image D(m,n) according to formula 42 (given only as an image in the source document), where the threshold T is computed as:
T = Min(T_{t/t-8}, T_{t-8/t-16})      (43),
In formula 43, T_{t/t-8} and T_{t-8/t-16} take the values satisfying formulas 44 and 45 respectively (given only as images in the source document), where A is the number of pixels in the whole picture and δ = 0.6.
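A compact sketch of step 4 is given below for illustration. The intersection in formula 41 is realized here as a pixel-wise minimum of the two absolute differences, and the per-difference thresholds T_{t/t-8} and T_{t-8/t-16} (formulas 44 and 45, images in the source document) are replaced by δ times the mean absolute difference; both choices are assumptions of this sketch, not the patent's definitions.

```python
import numpy as np

def foreground_mask(i_t, i_t8, i_t16, delta=0.6):
    """Sketch of step 4: three-frame differencing with an adaptive threshold."""
    d1 = np.abs(i_t.astype(np.float64) - i_t8)
    d2 = np.abs(i_t8.astype(np.float64) - i_t16)
    d = np.minimum(d1, d2)                 # formula 41: |.| intersected with |.|
    a = d.size                             # A: number of pixels in the frame
    t1 = delta * d1.sum() / a              # placeholder for T_{t/t-8}   (formula 44)
    t2 = delta * d2.sum() / a              # placeholder for T_{t-8/t-16} (formula 45)
    t = min(t1, t2)                        # formula 43
    return np.where(d > t, 255, 0).astype(np.uint8)   # formula 42: threshold operation
```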
Step 5: On the basis of the previous step, remove holes and small noise from the foreground image D(x,y) by performing erosion and dilation operations.
Step 6: model training and testing
Convert the acquired grayscale foreground image D(x,y) into a three-channel image, combine the frames into a continuous picture sequence, and input the sequence into a three-dimensional convolutional network for training and detection.
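Steps 5 and 6 can be sketched together as follows, using OpenCV for the morphology; the 3×3 structuring element, a single erode/dilate pass, and the (3, L, H, W) layout of the output clip are assumptions made for illustration.

```python
import cv2
import numpy as np

def prepare_clip(foreground_frames, kernel_size=3):
    """Sketch of steps 5-6: clean each foreground mask D, convert it to three
    channels, and stack the frames into one clip for the 3-D ConvNet."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    cleaned = []
    for d in foreground_frames:
        d = cv2.erode(d, kernel, iterations=1)               # remove tiny noise
        d = cv2.dilate(d, kernel, iterations=1)              # fill small holes
        cleaned.append(cv2.cvtColor(d, cv2.COLOR_GRAY2BGR))  # 1 -> 3 channels
    clip = np.stack(cleaned, axis=0).astype(np.float32) / 255.0  # (L, H, W, 3)
    return np.transpose(clip, (3, 0, 1, 2))                      # (3, L, H, W)
```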
The input of the model is a series of frame images of size R^(3×L×H×W), i.e., a collection of video frame images with 3 channels, video length L, video frame image height H, and video frame image width W. The 3D-ConvNet architecture uses ResNet-50 as the backbone network; the deep network structure yields richer action features, and finally a feature map collection with 2048 channels and a reduced video length, frame height, and frame width is generated (the exact downsampled sizes are given only as images in the source document).
A multi-scale window is predefined centered at uniformly distributed time positions; each time position specifies K anchor segments, and each anchor segment has a different fixed scale. By applying a 3D max-pooling filter (its kernel size is given only as an image in the source document), the spatial dimensions are downsampled to 1×1 to generate a time-only feature map collection C_tpn. The 2048-dimensional feature vector at each time position in C_tpn is used to predict the relative offsets {σC_k, σl_k} of {C_k, l_k}, k ∈ {1,...,K}, for each anchor segment.
S63) Classification uses a softmax loss function and regression uses a smooth L1 loss function (the L1 loss function itself is given only as an image in the source document), where N_cls and N_reg denote the batch size and the number of proposal boxes, λ is the loss trade-off parameter and is set to 1, k is the index of a proposal box within the batch, and a_k is the probability predicted for the proposal box or action; the ground-truth action value of the real action box, the relative offsets predicted with respect to the anchor segment or proposal box, and the coordinate transformation from the ground-truth video segment to the anchor segment or proposal, together with the formula that computes this transformation, are likewise given only as images in the source document; c_k and l_k are the center position and length of the anchor or proposal, and the corresponding ground-truth quantities represent the center position and length of the ground-truth action segment of the video.
The above loss function is applied to both the temporal proposal subnet and the action classification subnet. In the proposal subnet, the binary classification loss L_cls predicts whether a proposal box contains an action, while the regression loss L_reg optimizes the relative displacement between the proposal box and the ground truth; in this subnet the loss is independent of the action category. In the action classification subnet, the multi-class classification loss L_cls predicts a specific action category for the proposal box, where the number of categories is the number of actions plus one background class, and the regression loss L_reg optimizes the relative displacement between the action and the ground truth. All four losses of the two subnets are optimized jointly.
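The shapes, the proposal head, and the joint objective described above can be illustrated with the PyTorch sketches below. The real backbone is a ResNet-50-based 3D-ConvNet; the stand-in here only reproduces an input of shape (3, L, H, W) being mapped to a 2048-channel feature map, with a temporal stride of 8, a spatial stride of 16, K = 8 anchors per time position, and simple 1×1 convolutional prediction heads all assumed for illustration (the exact downsampling factors and pooling kernel are given only as images in the source document).

```python
import torch
import torch.nn as nn

# Stand-in for the ResNet-50-based 3D backbone: a few strided Conv3d blocks
# that only reproduce the assumed output shape (2048, L/8, H/16, W/16).
backbone = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
    nn.Conv3d(64, 256, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
    nn.Conv3d(256, 1024, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
    nn.Conv3d(1024, 2048, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
)

K = 8                                                # assumed anchor segments per time position
spatial_pool = nn.AdaptiveMaxPool3d((None, 1, 1))    # 3D max-pooling down to 1x1 in space
cls_head = nn.Conv1d(2048, 2 * K, kernel_size=1)     # action / background score per anchor
reg_head = nn.Conv1d(2048, 2 * K, kernel_size=1)     # (sigma_c_k, sigma_l_k) per anchor

clip = torch.randn(1, 3, 32, 160, 160)               # (N, 3, L, H, W)
feat = backbone(clip)                                # (1, 2048, 4, 10, 10)
c_tpn = spatial_pool(feat).flatten(2)                # time-only feature map, (1, 2048, 4)
scores = cls_head(c_tpn)                             # (1, 2K, 4)
offsets = reg_head(c_tpn)                            # (1, 2K, 4)
print(feat.shape, c_tpn.shape, scores.shape, offsets.shape)
```

The exact loss expression and the ground-truth coordinate transformation appear only as images in the source document, so the second sketch uses the standard form suggested by the surrounding text: softmax/cross-entropy classification plus smooth-L1 regression on center/length offsets, balanced by λ = 1, with the usual (center, length) offset encoding assumed.

```python
import torch
import torch.nn.functional as F

def encode_offsets(c_gt, l_gt, c_anchor, l_anchor):
    """Assumed center/length offset encoding (the patent's coordinate
    conversion is given only as an image)."""
    return (c_gt - c_anchor) / l_anchor, torch.log(l_gt / l_anchor)

def joint_loss(scores, offsets, labels, target_offsets, lam=1.0):
    """Sketch of the joint loss used by both subnets: cross-entropy
    classification plus smooth-L1 regression on positive anchors.

    scores:         (N, num_classes) predicted class scores a_k
    offsets:        (N, 2) predicted (sigma_c, sigma_l)
    labels:         (N,) ground-truth labels, 0 = background
    target_offsets: (N, 2) encoded ground-truth offsets
    """
    cls_loss = F.cross_entropy(scores, labels)          # averaged over N_cls
    pos = labels > 0                                     # regress only positive samples
    if pos.any():
        reg_loss = F.smooth_l1_loss(offsets[pos], target_offsets[pos])
    else:
        reg_loss = scores.new_zeros(())
    return cls_loss + lam * reg_loss
```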
The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may be modified and varied in many ways. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (4)

1. A human body action recognition method, characterized by comprising the following steps:
    S01) decoding the video and preprocessing each frame of the video, the preprocessing including minimum-neighborhood selection and filter design, a Kalman filter being used to filter the image;
    S02) converting the image format of the preprocessed image according to formula 21, so that the output image is converted from a three-channel RGB image into a single-channel grayscale (GRAY) image:
    Gray(m,n) = 0.299·r(m,n) + 0.587·g(m,n) + 0.441·b(m,n)      (21),
    where Gray(m,n) is the gray value of the filter-output grayscale image at pixel (m,n), and r(m,n), g(m,n), b(m,n) are the corresponding three-channel pixel values of the color image at pixel (m,n);
    S03) performing target contour enhancement on the image according to formula 31 (given only as an image in the source document), to remove noise from the grayscale image while improving the contour definition of the target in the image, where Pixel(m,n) denotes the pixel value computed after contour enhancement of the preprocessed output grayscale image at pixel (m,n), Gray(m,n) is the pixel value at (m,n) of the single-channel grayscale image obtained from formula 21, w(m,n,i,j) is a weight, and i, j denote the neighborhood size;
    the weight w(m,n,i,j) consisting of two parts, the spatial distance d(m,n,i,j) and the pixel distance r(m,n,i,j), computed as:
    w(m,n,i,j) = d(m,n,i,j)·r(m,n,i,j)      (32),
    where d(m,n,i,j) and r(m,n,i,j) are defined by formulas 33 and 34 (given only as images in the source document), with δ_d = 0.7 and δ_r = 0.2;
    S04) every 8 frames, selecting three images I_t, I_{t-8}, I_{t-16} from the image sequence, denoting the resulting foreground picture by D, the pixel values of the three pictures at pixel (m,n) being I_t(m,n), I_{t-8}(m,n), I_{t-16}(m,n) respectively, so that the foreground image is:
    D(m,n) = |I_t(m,n) - I_{t-8}(m,n)| ∩ |I_{t-8}(m,n) - I_{t-16}(m,n)|      (41),
    and applying a threshold operation to the foreground image D(m,n) according to formula 42 (given only as an image in the source document), where the threshold T is computed as:
    T = Min(T_{t/t-8}, T_{t-8/t-16})      (43),
    T_{t/t-8} and T_{t-8/t-16} in formula 43 taking the values satisfying formulas 44 and 45 respectively (given only as images in the source document), where A is the number of pixels in the whole picture and δ = 0.6;
    S05) performing erosion and dilation operations on the foreground image D(m,n);
    S06) converting the acquired grayscale foreground image D(m,n) into a three-channel image, combining the frames into a continuous picture sequence, and inputting the sequence into a three-dimensional convolutional network for training and detection.
2. The human body action recognition method according to claim 1, characterized in that the specific steps by which the three-dimensional convolutional network detects the continuous picture sequence are:
    S61) the input of the three-dimensional convolutional network is a collection of video frame images with 3 channels, video length L, video frame image height H, and video frame image width W; after forward propagation through the three-dimensional convolutional network, the output is a collection of feature maps with 2048 channels and a reduced video length, frame height, and frame width (the exact downsampled sizes are given only as images in the source document);
    S62) a multi-scale window is predefined centered at uniformly distributed time positions, each time position specifies K anchor segments, and each anchor segment has a different fixed scale; by applying a 3D max-pooling filter (its kernel size is given only as an image in the source document), the spatial dimensions are downsampled to 1×1 to generate a time-only feature map collection C_tpn, which has 2048 channels, the downsampled video length, a frame height of 1, and a frame width of 1; the 2048-dimensional feature vector at each time position in C_tpn is used to predict the relative offsets {σC_k, σl_k} of the center position and length {C_k, l_k}, k ∈ {1,...,K}, of each anchor segment;
    S63) classification uses a softmax loss function and regression uses a smooth L1 loss function (the L1 loss function itself is given only as an image in the source document), where N_cls and N_reg denote the batch size and the number of proposal boxes, λ is the loss trade-off parameter and is set to 1, k is the index of a proposal box within the batch, and a_k is the probability predicted for the proposal box or action; the ground-truth action value of the real action box, the relative offsets predicted with respect to the anchor segment or proposal box, and the coordinate transformation from the ground-truth video segment to the anchor segment or proposal, together with the formula that computes this transformation, are likewise given only as images in the source document; c_k and l_k are the center position and length of the anchor or proposal, and the corresponding ground-truth quantities represent the center position and length of the ground-truth action segment of the video.
3. The human body action recognition method according to claim 2, characterized in that the L1 loss function is applied to both the temporal proposal subnet and the action classification subnet; in the proposal subnet, the binary classification loss L_cls predicts whether a proposal box contains an action and the regression loss L_reg optimizes the relative displacement between the proposal and the ground truth; in the action classification subnet, the multi-class classification loss L_cls predicts a specific action category for the proposal box, the number of categories being the number of actions plus one background class, and the regression loss L_reg optimizes the relative displacement between the action and the ground truth.
4. The human body action recognition method according to claim 1, characterized in that in step S01 the minimum neighborhood width of the two-dimensional image is set to 9, i.e., one pixel and the 8 pixels around it are taken as the minimum filtering neighborhood, and the Kalman filter based on this minimum filtering neighborhood is designed as follows:
    S11) the gray value X(m,n) of pixel (m,n) is expressed linearly as:
    X(m,n) = F(m|i,n|j)·X^T(m|i,n|j) + Φ(m,n)      (11),
    where T denotes transposition and Φ(m,n) is the noise term; F(m|i,n|j) and X(m|i,n|j) are defined by formulas 12 and 13 (given only as images in the source document), so that formula 11 can be rewritten as formula 14 (given only as an image in the source document), where x(m+i,n+j) is the pixel value of each point in the image (a known quantity) and c(m+i,n+j) is the weight of each point of the original video frame image (an unknown quantity);
    S12) the criterion for computing c(m+i,n+j) is formula 15 (given only as an image in the source document); the value of c(m+i,n+j) must minimize formula 15, which yields formula 16 (given only as an image in the source document), in which A and B are respectively expressed as:
    A = x(m+i,n+j)      (17),
    B = x(m+i,n+j) - x(m+i-1,n+j);
    S13) let the observation equation be:
    Z(m,n) = X(m,n) + V(m,n)      (18),
    where V(m,n) is noise;
    S14) based on the minimum linear variance, the recursive formula of the two-dimensional discrete Kalman filter in the 3×3 neighborhood of pixel (m,n) is:
    X(m,n) = F(m|i,n|j)·X^T(m|i,n|j) + K(m,n)·[Z(m,n) - F(m|i,n|j)·X^T(m|i,n|j)]      (19),
    the one-step prediction variance equation is formula 110 (given only as an image in the source document), the gain equation is:
    K(m,n) = P_{m/m-1}(m,n) / [P_{m/m-1}(m,n) + r(m,n)]      (111),
    and the error variance matrix equation is:
    P_{m/m}(m,n) = [1 - K(m,n)]^2·P_{m/m-1}(m,n) + K^2(m,n)·r(m,n)      (112);
    the filter is constructed from formulas 19, 110, 111, and 112, completing the preprocessing of the input data.
PCT/CN2020/137991 2019-12-25 2020-12-21 Human action recognition method WO2021129569A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911362989.X 2019-12-25
CN201911362989.XA CN111062355A (en) 2019-12-25 2019-12-25 Human body action recognition method

Publications (1)

Publication Number Publication Date
WO2021129569A1 true WO2021129569A1 (en) 2021-07-01

Family ID=70303695

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/137991 WO2021129569A1 (en) 2019-12-25 2020-12-21 Human action recognition method

Country Status (2)

Country Link
CN (1) CN111062355A (en)
WO (1) WO2021129569A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062355A (en) * 2019-12-25 2020-04-24 神思电子技术股份有限公司 Human body action recognition method
CN113033283B (en) * 2020-12-18 2022-11-22 神思电子技术股份有限公司 Improved video classification system
CN113362324B (en) * 2021-07-21 2023-02-24 上海脊合医疗科技有限公司 Bone health detection method and system based on video image

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160310A (en) * 2015-08-25 2015-12-16 西安电子科技大学 3D (three-dimensional) convolutional neural network based human body behavior recognition method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137362A1 (en) * 2016-11-14 2018-05-17 Axis Ab Action recognition in a video sequence
CN108108722A (en) * 2018-01-17 2018-06-01 深圳市唯特视科技有限公司 A kind of accurate three-dimensional hand and estimation method of human posture based on single depth image
CN108470139A (en) * 2018-01-25 2018-08-31 天津大学 A kind of small sample radar image human action sorting technique based on data enhancing
CN109271931A (en) * 2018-09-14 2019-01-25 辽宁奇辉电子系统工程有限公司 It is a kind of that gesture real-time identifying system is pointed sword at based on edge analysis
CN111062355A (en) * 2019-12-25 2020-04-24 神思电子技术股份有限公司 Human body action recognition method

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743339B (en) * 2021-09-09 2023-10-03 三峡大学 Indoor falling detection method and system based on scene recognition
CN113743339A (en) * 2021-09-09 2021-12-03 三峡大学 Indoor fall detection method and system based on scene recognition
CN114694075A (en) * 2022-04-07 2022-07-01 合肥工业大学 Dangerous behavior identification method based on deep reinforcement learning
CN114694075B (en) * 2022-04-07 2024-02-13 合肥工业大学 Dangerous behavior identification method based on deep reinforcement learning
CN114943904A (en) * 2022-06-07 2022-08-26 国网江苏省电力有限公司泰州供电分公司 Operation monitoring method based on unmanned aerial vehicle inspection
CN116582195A (en) * 2023-06-12 2023-08-11 浙江瑞通电子科技有限公司 Unmanned aerial vehicle signal spectrum recognition algorithm based on artificial intelligence
CN116582195B (en) * 2023-06-12 2023-12-26 浙江瑞通电子科技有限公司 Unmanned aerial vehicle signal spectrum identification method based on artificial intelligence
CN116527407A (en) * 2023-07-04 2023-08-01 贵州毅丹恒瑞医药科技有限公司 Encryption transmission method for fundus image
CN116527407B (en) * 2023-07-04 2023-09-01 贵州毅丹恒瑞医药科技有限公司 Encryption transmission method for fundus image
CN116580343A (en) * 2023-07-13 2023-08-11 合肥中科类脑智能技术有限公司 Small sample behavior recognition method, storage medium and controller
CN117095694A (en) * 2023-10-18 2023-11-21 中国科学技术大学 Bird song recognition method based on tag hierarchical structure attribute relationship
CN117095694B (en) * 2023-10-18 2024-02-23 中国科学技术大学 Bird song recognition method based on tag hierarchical structure attribute relationship
CN117541991A (en) * 2023-11-22 2024-02-09 无锡科棒安智能科技有限公司 Intelligent recognition method and system for abnormal behaviors based on security robot
CN117690062A (en) * 2024-02-02 2024-03-12 武汉工程大学 Method for detecting abnormal behaviors of miners in mine
CN117690062B (en) * 2024-02-02 2024-04-19 武汉工程大学 Method for detecting abnormal behaviors of miners in mine

Also Published As

Publication number Publication date
CN111062355A (en) 2020-04-24

Similar Documents

Publication Publication Date Title
WO2021129569A1 (en) Human action recognition method
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN107529650B (en) Closed loop detection method and device and computer equipment
CN109035172B (en) Non-local mean ultrasonic image denoising method based on deep learning
CN109685045B (en) Moving target video tracking method and system
CN111079764B (en) Low-illumination license plate image recognition method and device based on deep learning
CN108230354B (en) Target tracking method, network training method, device, electronic equipment and storage medium
CN117372706A (en) Multi-scale deformable character interaction relation detection method
CN107564041B (en) Method for detecting visible light image aerial moving target
Zhang et al. Spatial–temporal gray-level co-occurrence aware CNN for SAR image change detection
CN108647605B (en) Human eye gaze point extraction method combining global color and local structural features
CN113379789B (en) Moving target tracking method in complex environment
CN115049954A (en) Target identification method, device, electronic equipment and medium
CN113936034A (en) Apparent motion combined weak and small moving object detection method combined with interframe light stream
CN111160372B (en) Large target identification method based on high-speed convolutional neural network
CN116912338A (en) Pixel picture vectorization method for textile
CN115457365B (en) Model interpretation method and device, electronic equipment and storage medium
Zhang et al. Research on the algorithm of license plate recognition based on MPGAN Haze Weather
CN115223033A (en) Synthetic aperture sonar image target classification method and system
CN111008555B (en) Unmanned aerial vehicle image small and weak target enhancement extraction method
CN114463379A (en) Dynamic capturing method and device for video key points
CN113436251A (en) Pose estimation system and method based on improved YOLO6D algorithm
CN107016675A (en) A kind of unsupervised methods of video segmentation learnt based on non local space-time characteristic
CN111597967A (en) Infrared image multi-target pedestrian identification method
CN111862152A (en) Moving target detection method based on interframe difference and super-pixel segmentation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20907289

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20907289

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 521430725

Country of ref document: SA