WO2021129569A1 - Human action recognition method - Google Patents

Human action recognition method

Info

Publication number
WO2021129569A1
WO2021129569A1 (PCT/CN2020/137991; CN2020137991W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
action
pixel
formula
video
Prior art date
Application number
PCT/CN2020/137991
Other languages
French (fr)
Chinese (zh)
Inventor
井焜
高朋
许野平
刘辰飞
陈英鹏
张朝瑞
席道亮
Original Assignee
神思电子技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 神思电子技术股份有限公司 filed Critical 神思电子技术股份有限公司
Publication of WO2021129569A1 publication Critical patent/WO2021129569A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/30Noise filtering


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)

Abstract

A human action recognition method, comprising: first preprocessing the images, including neighborhood construction and filtering; then performing image channel transformation, target contour enhancement, and differential image extraction; applying threshold processing and foreground processing to the foreground image; and finally performing model training, or action recognition and action localization, on the basis of a three-dimensional convolutional network. The method solves the problem in existing action recognition methods of reduced model detection precision in large scenes with small targets and complex backgrounds, and also achieves action detection and action localization in arbitrary continuous, unbounded video streams, improving the precision of human action recognition and its robustness in different application scenarios and improving the generalization capability of the model.

Description

Human action recognition method
Technical field
The invention relates to a human body action recognition method and belongs to the technical field of human action recognition.
Background art
Action recognition performs the task of action classification by extracting action features from consecutive video frames, helping to avoid potentially dangerous behaviors in practice. It has a wide range of practical application scenarios and has therefore long been an active research direction in computer vision. Existing deep-learning-based action recognition methods achieve high classification accuracy in small scenes with large targets. However, in real-time monitoring with complex (noisy) backgrounds and small targets, existing human action recognition methods suffer from low recognition accuracy and produce large numbers of missed detections and false alarms.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides a human body action recognition method that solves the problem of low action recognition accuracy in large scenes with small targets and complex backgrounds and, with a small amount of computation, also accurately localizes and classifies actions in continuous videos of arbitrary length.
To solve the above technical problem, the technical solution adopted by the present invention is a human body action recognition method comprising the following steps:
S01) Decode the video and preprocess each frame of the video; the preprocessing includes minimum-neighborhood selection and filter design, and a Kalman filter is used to filter the image;
S02) Convert the image format of the preprocessed image according to formula 21, so that the output image is converted from a three-channel RGB image into a single-channel grayscale (GRAY) image:
Gray(m,n) = 0.299·r(m,n) + 0.587·g(m,n) + 0.441·b(m,n)      (21),
where Gray(m,n) is the gray value of the filter-output grayscale image at pixel (m,n), and r(m,n), g(m,n), b(m,n) are the corresponding three-channel pixel values of the color image at pixel (m,n);
S03) Perform target contour enhancement on the image according to formula 31 (given only as an image in the source document), to remove noise from the grayscale image while improving the contour definition of the target in the image, where Pixel(m,n) denotes the pixel value computed after contour enhancement of the preprocessed output grayscale image at pixel (m,n), Gray(m,n) is the pixel value at (m,n) of the single-channel grayscale image obtained from formula 21, w(m,n,i,j) is a weight, and i, j denote the neighborhood size;
the weight w(m,n,i,j) consists of two parts, the spatial distance d(m,n,i,j) and the pixel distance r(m,n,i,j), and is computed as:
w(m,n,i,j) = d(m,n,i,j)·r(m,n,i,j)      (32),
where d(m,n,i,j) and r(m,n,i,j) are defined by formulas 33 and 34 (given only as images in the source document), with δ_d = 0.7 and δ_r = 0.2;
S04) Every 8 frames, select three images I_t, I_{t-8}, I_{t-16} from the image sequence and denote the resulting foreground picture by D; the pixel values of the three pictures at pixel (m,n) are I_t(m,n), I_{t-8}(m,n), I_{t-16}(m,n) respectively, and the foreground image is:
D(m,n) = |I_t(m,n) - I_{t-8}(m,n)| ∩ |I_{t-8}(m,n) - I_{t-16}(m,n)|      (41),
A threshold operation is then applied to the foreground image D(m,n) according to formula 42 (given only as an image in the source document), where the threshold T is computed as:
T = Min(T_{t/t-8}, T_{t-8/t-16})      (43),
In formula 43, T_{t/t-8} and T_{t-8/t-16} take the values satisfying formulas 44 and 45 respectively (given only as images in the source document), where A is the number of pixels in the whole picture and δ = 0.6;
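Purely for illustration and not part of the patent text, the frame-sampling rule of step S04 can be sketched in Python as follows; the helper name select_triplets and the representation of frames as grayscale arrays are assumptions of this sketch.

```python
def select_triplets(gray_frames, step=8):
    """Illustrative sketch of the sampling rule in S04: starting at index
    2*step, every `step` frames pick the triple (I_t, I_{t-step}, I_{t-2*step})
    from the preprocessed grayscale frame sequence."""
    triplets = []
    for t in range(2 * step, len(gray_frames), step):
        triplets.append((gray_frames[t],
                         gray_frames[t - step],
                         gray_frames[t - 2 * step]))
    return triplets
```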
S05) Perform erosion and dilation operations on the foreground image D(m,n);
S06) Convert the acquired grayscale foreground image D(m,n) into a three-channel image, combine the frames into a continuous picture sequence, and input the sequence into a three-dimensional convolutional network for training and detection.
Further, the specific steps by which the three-dimensional convolutional network detects the continuous picture sequence are as follows:
S61) The input of the three-dimensional convolutional network is a collection of video frame images with 3 channels, video length L, video frame image height H, and video frame image width W; after forward propagation through the three-dimensional convolutional network, the output is a collection of feature maps with 2048 channels and a reduced video length, frame height, and frame width (the exact downsampled sizes are given only as images in the source document);
S62) A multi-scale window is predefined centered at uniformly distributed time positions; each time position specifies K anchor segments, and each anchor segment has a different fixed scale. By applying a 3D max-pooling filter (its kernel size is given only as an image in the source document), the spatial dimensions are downsampled to 1×1 to generate a time-only feature map collection C_tpn, which has 2048 channels, the downsampled video length, a frame height of 1, and a frame width of 1; the 2048-dimensional feature vector at each time position in C_tpn is used to predict the relative offsets {σC_k, σl_k} of the center position and length {C_k, l_k}, k ∈ {1,...,K}, of each anchor segment;
S63) Classification uses a softmax loss function and regression uses a smooth L1 loss function (the L1 loss function itself is given only as an image in the source document), where N_cls and N_reg denote the batch size and the number of proposal boxes, λ is the loss trade-off parameter and is set to 1, k is the index of a proposal box within the batch, and a_k is the probability predicted for the proposal box or action; the ground-truth action value of the real action box, the relative offsets predicted with respect to the anchor segment or proposal box, and the coordinate transformation from the ground-truth video segment to the anchor segment or proposal, together with the formula that computes this transformation, are likewise given only as images in the source document; c_k and l_k are the center position and length of the anchor or proposal, and the corresponding ground-truth quantities represent the center position and length of the ground-truth action segment of the video.
Further, the L1 loss function is applied to both the temporal proposal subnet and the action classification subnet. In the proposal subnet, the binary classification loss L_cls predicts whether a proposal box contains an action, and the regression loss L_reg optimizes the relative displacement between the proposal and the ground truth. In the action classification subnet, the multi-class classification loss L_cls predicts a specific action category for the proposal box, where the number of categories is the number of actions plus one background class, and the regression loss L_reg optimizes the relative displacement between the action and the ground truth.
Further, in step S01 the minimum neighborhood width of the two-dimensional image is set to 9, i.e., one pixel and the 8 pixels around it are taken as the minimum filtering neighborhood, and the Kalman filter based on this minimum filtering neighborhood is designed as follows:
S11) The gray value X(m,n) of pixel (m,n) is expressed linearly as:
X(m,n) = F(m|i,n|j)·X^T(m|i,n|j) + Φ(m,n)      (11),
where T denotes transposition and Φ(m,n) is the noise term; F(m|i,n|j) and X(m|i,n|j) are defined by formulas 12 and 13 (given only as images in the source document),
so that formula 11 can be rewritten as formula 14 (given only as an image in the source document), where x(m+i,n+j) is the pixel value of each point in the image (a known quantity) and c(m+i,n+j) is the weight of each point of the original video frame image (an unknown quantity);
S12) The criterion for computing c(m+i,n+j) is formula 15 (given only as an image in the source document); the value of c(m+i,n+j) must minimize formula 15, which yields formula 16 (given only as an image in the source document), in which A and B are respectively expressed as:
A = x(m+i,n+j)      (17),
B = x(m+i,n+j) - x(m+i-1,n+j);
S13) Let the observation equation be:
Z(m,n) = X(m,n) + V(m,n)      (18),
where V(m,n) is noise;
S14) Based on the minimum linear variance, the recursive formula of the two-dimensional discrete Kalman filter in the 3×3 neighborhood of pixel (m,n) is:
X(m,n) = F(m|i,n|j)·X^T(m|i,n|j) + K(m,n)·[Z(m,n) - F(m|i,n|j)·X^T(m|i,n|j)]      (19),
the one-step prediction variance equation is formula 110 (given only as an image in the source document), the gain equation is:
K(m,n) = P_{m/m-1}(m,n) / [P_{m/m-1}(m,n) + r(m,n)]      (111),
and the error variance matrix equation is:
P_{m/m}(m,n) = [1 - K(m,n)]^2·P_{m/m-1}(m,n) + K^2(m,n)·r(m,n)      (112).
The filter is constructed from formulas 19, 110, 111, and 112, completing the preprocessing of the input data.
Beneficial effects of the present invention: in the continuous-video action detection task, the present invention uses a background removal method to reduce the influence of the video background on detection accuracy. It solves the problem of reduced detection accuracy of existing action recognition models in large scenes, with small targets and complex backgrounds. At the same time, it achieves action detection and action localization in arbitrary continuous, unbounded video streams, improving the accuracy of human action recognition and its robustness in different application scenarios, and improving the generalization capability of the model. In addition, a three-dimensional convolutional neural network is used to encode the video stream and extract video action features, completing the action classification task and the action localization task at the same time.
Description of the drawings
Figure 1 is a flow chart of the present invention.
Detailed description of the embodiments
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by those of ordinary skill in the technical field to which the present invention belongs.
Example 1
This embodiment mainly targets large scenes with small targets. By preprocessing the training and test data, the impact of complex backgrounds on model detection accuracy is reduced and the action recognition accuracy of the model is improved. At the same time, only a single three-dimensional convolutional deep learning model is used to detect and precisely localize actions in continuous videos of arbitrary length, reducing the amount of computation.
As shown in Figure 1, this embodiment includes the following steps:
Step 1: image preprocessing:
Decode the video and preprocess each frame of the video; the preprocessing includes the following steps:
1) Minimum neighborhood selection
For a two-dimensional image, the minimum neighborhood width is 9, i.e., one pixel and the 8 pixels around it are taken as the minimum filtering neighborhood; that is, within the neighborhood window (i, j) of a pixel, i and j take integer values in the range [-1, 1].
2) Filter design
The gray value X(m,n) of pixel (m,n) is expressed linearly as:
X(m,n) = F(m|i,n|j)·X^T(m|i,n|j) + Φ(m,n)      (11),
where T denotes transposition and Φ(m,n) is the noise term; F(m|i,n|j) and X(m|i,n|j) are defined by formulas 12 and 13 (given only as images in the source document), so that formula 11 can be rewritten as formula 14 (given only as an image in the source document), where x(m+i,n+j) is the pixel value of each point of the original video frame image (a known quantity) and c(m+i,n+j) is the weight of each point of the original video frame image (an unknown quantity);
The criterion for computing c(m+i,n+j) is formula 15 (given only as an image in the source document), in which E denotes the expectation (matrix mean) operator; the value of c(m+i,n+j) must minimize formula 15, from which formula 16 is obtained (given only as an image in the source document), in which:
A = x(m+i,n+j)      (17),
B = x(m+i,n+j) - x(m+i-1,n+j);
Let the observation equation be:
Z(m,n) = X(m,n) + V(m,n)      (18),
where V(m,n) is white noise with zero mean and variance r(m,n);
Based on the minimum linear variance, the recursive formula of the two-dimensional discrete Kalman filter in the 3×3 neighborhood of pixel (m,n) is:
X(m,n) = F(m|i,n|j)·X^T(m|i,n|j) + K(m,n)·[Z(m,n) - F(m|i,n|j)·X^T(m|i,n|j)]      (19),
the one-step prediction variance equation is formula 110 (given only as an image in the source document), the gain equation is:
K(m,n) = P_{m/m-1}(m,n) / [P_{m/m-1}(m,n) + r(m,n)]      (111),
and the error variance matrix equation is:
P_{m/m}(m,n) = [1 - K(m,n)]^2·P_{m/m-1}(m,n) + K^2(m,n)·r(m,n)      (112).
The filter is constructed from formulas 19, 110, 111, and 112, completing the preprocessing of the input data.
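For illustration only, a minimal per-frame sketch of such a 3×3-neighborhood recursive (Kalman-style) filter is given below. The prediction matrices, the weights c(m+i,n+j), and formula 110 are defined only in the images of the source document, so this sketch substitutes a simple causal-neighborhood mean for the one-step prediction and treats the process and measurement variances q and r as assumed constants; it shows the structure of formulas 19, 111, and 112 rather than the exact filter.

```python
import numpy as np

def kalman_filter_frame(z, q=1e-2, r=1e-1):
    """Minimal sketch of a 3x3-neighborhood recursive (Kalman-style) filter.

    z: one grayscale frame (2-D array), the observation Z(m, n).
    q, r: assumed process / measurement noise variances (stand-ins for the
          quantities the patent derives in formulas 110-112).
    """
    h, w = z.shape
    x = z.astype(np.float64).copy()          # state estimate X(m, n)
    p = np.full((h, w), float(r))            # error variance P(m, n)
    for m in range(1, h - 1):
        for n in range(1, w - 1):
            # one-step prediction from already-filtered neighbors
            # (stand-in for F(m|i,n|j) . X^T(m|i,n|j))
            x_pred = (x[m - 1, n - 1:n + 2].mean() + x[m, n - 1]) / 2.0
            p_pred = p[m, n] + q                          # stand-in for formula 110
            k = p_pred / (p_pred + r)                     # gain, formula 111
            x[m, n] = x_pred + k * (z[m, n] - x_pred)     # update, formula 19
            p[m, n] = (1 - k) ** 2 * p_pred + k ** 2 * r  # formula 112
    return x
```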
Step 2: image format conversion:
Convert the image format of the preprocessed image according to formula 21, so that the output image is converted from a three-channel RGB image into a single-channel grayscale (GRAY) image:
Gray(m,n) = 0.299·r(m,n) + 0.587·g(m,n) + 0.441·b(m,n)      (21),
where Gray(m,n) is the gray value of the filter-output grayscale image at pixel (m,n), and r(m,n), g(m,n), b(m,n) are the corresponding three-channel pixel values of the color image at pixel (m,n).
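As a small illustration (not part of the patent text), formula 21 can be applied to an H×W×3 frame as follows; the coefficients are copied verbatim from the patent (the blue weight 0.441 differs from the 0.114 of the common ITU-R BT.601 conversion).

```python
import numpy as np

def to_gray(rgb):
    """Formula 21: convert an H x W x 3 RGB frame to a single-channel image,
    using the coefficients exactly as printed in the patent."""
    rgb = np.asarray(rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.441 * b
```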
Step 3: target contour enhancement, performed as follows:
The pixel value of the output grayscale image at (m,n) is given by formula 31 (given only as an image in the source document), where Pixel(m,n) denotes the pixel value computed after contour enhancement of the preprocessed output grayscale image at pixel (m,n), Gray(m,n) is the pixel value at (m,n) of the single-channel grayscale image obtained from formula 21, w(m,n,i,j) is a weight, and i, j denote the neighborhood size;
the weight w(m,n,i,j) consists of two parts, the spatial distance d(m,n,i,j) and the pixel distance r(m,n,i,j), and is computed as:
w(m,n,i,j) = d(m,n,i,j)·r(m,n,i,j)      (32),
where d(m,n,i,j) and r(m,n,i,j) are defined by formulas 33 and 34 (given only as images in the source document), with δ_d = 0.7 and δ_r = 0.2.
With this method, the noise in the grayscale image can be removed while the contour definition of the target in the image is improved.
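Formulas 31, 33 and 34 appear only as images in the source document, so the sketch below assumes the usual bilateral-filter form for this step: a Gaussian spatial kernel for d(m,n,i,j), a Gaussian range kernel for r(m,n,i,j), and a normalized weighted average over the 3×3 neighborhood for Pixel(m,n); only δ_d = 0.7 and δ_r = 0.2 are taken from the text. It illustrates the structure of the step, not the exact formulas.

```python
import numpy as np

def enhance_contour(gray, delta_d=0.7, delta_r=0.2, radius=1):
    """Sketch of step 3 (contour enhancement), assuming a bilateral-style
    weighting; `gray` is the single-channel image from formula 21, scaled
    to [0, 1]."""
    h, w = gray.shape
    out = np.zeros_like(gray, dtype=np.float64)
    for m in range(h):
        for n in range(w):
            num = den = 0.0
            for i in range(-radius, radius + 1):
                for j in range(-radius, radius + 1):
                    mi = min(max(m + i, 0), h - 1)
                    nj = min(max(n + j, 0), w - 1)
                    d = np.exp(-(i * i + j * j) / (2 * delta_d ** 2))                    # spatial distance
                    r = np.exp(-(gray[m, n] - gray[mi, nj]) ** 2 / (2 * delta_r ** 2))   # pixel distance
                    wgt = d * r                                                          # formula 32
                    num += wgt * gray[mi, nj]
                    den += wgt
            out[m, n] = num / den
    return out
```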
Step 4: Taking into account the amplitude of the action and the frame rate of the video, and to suppress holes as far as possible, every 8 frames three images I_t, I_{t-8}, I_{t-16} are selected from the image sequence, and the resulting foreground picture is denoted by D; the pixel values of the three pictures at pixel (m,n) are I_t(m,n), I_{t-8}(m,n), I_{t-16}(m,n) respectively, and the foreground image is:
D(m,n) = |I_t(m,n) - I_{t-8}(m,n)| ∩ |I_{t-8}(m,n) - I_{t-16}(m,n)|      (41),
A threshold operation is then applied to the foreground image D(m,n) according to formula 42 (given only as an image in the source document), where the threshold T is computed as:
T = Min(T_{t/t-8}, T_{t-8/t-16})      (43),
In formula 43, T_{t/t-8} and T_{t-8/t-16} take the values satisfying formulas 44 and 45 respectively (given only as images in the source document), where A is the number of pixels in the whole picture and δ = 0.6.
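A compact sketch of step 4 is given below for illustration. The intersection in formula 41 is realized here as a pixel-wise minimum of the two absolute differences, and the per-difference thresholds T_{t/t-8} and T_{t-8/t-16} (formulas 44 and 45, images in the source document) are replaced by δ times the mean absolute difference; both choices are assumptions of this sketch, not the patent's definitions.

```python
import numpy as np

def foreground_mask(i_t, i_t8, i_t16, delta=0.6):
    """Sketch of step 4: three-frame differencing with an adaptive threshold."""
    d1 = np.abs(i_t.astype(np.float64) - i_t8)
    d2 = np.abs(i_t8.astype(np.float64) - i_t16)
    d = np.minimum(d1, d2)                 # formula 41: |.| intersected with |.|
    a = d.size                             # A: number of pixels in the frame
    t1 = delta * d1.sum() / a              # placeholder for T_{t/t-8}   (formula 44)
    t2 = delta * d2.sum() / a              # placeholder for T_{t-8/t-16} (formula 45)
    t = min(t1, t2)                        # formula 43
    return np.where(d > t, 255, 0).astype(np.uint8)   # formula 42: threshold operation
```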
Step 5: On the basis of the previous step, remove holes and small noise from the foreground image D(x,y) by performing erosion and dilation operations.
Step 6: model training and testing
Convert the acquired grayscale foreground image D(x,y) into a three-channel image, combine the frames into a continuous picture sequence, and input the sequence into a three-dimensional convolutional network for training and detection.
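Steps 5 and 6 can be sketched together as follows, using OpenCV for the morphology; the 3×3 structuring element, a single erode/dilate pass, and the (3, L, H, W) layout of the output clip are assumptions made for illustration.

```python
import cv2
import numpy as np

def prepare_clip(foreground_frames, kernel_size=3):
    """Sketch of steps 5-6: clean each foreground mask D, convert it to three
    channels, and stack the frames into one clip for the 3-D ConvNet."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    cleaned = []
    for d in foreground_frames:
        d = cv2.erode(d, kernel, iterations=1)               # remove tiny noise
        d = cv2.dilate(d, kernel, iterations=1)              # fill small holes
        cleaned.append(cv2.cvtColor(d, cv2.COLOR_GRAY2BGR))  # 1 -> 3 channels
    clip = np.stack(cleaned, axis=0).astype(np.float32) / 255.0  # (L, H, W, 3)
    return np.transpose(clip, (3, 0, 1, 2))                      # (3, L, H, W)
```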
The input of the model is a series of frame images of size R^(3×L×H×W), i.e., a collection of video frame images with 3 channels, video length L, video frame image height H, and video frame image width W. The 3D-ConvNet architecture uses ResNet-50 as the backbone network; the deep network structure yields richer action features, and finally a feature map collection with 2048 channels and a reduced video length, frame height, and frame width is generated (the exact downsampled sizes are given only as images in the source document).
A multi-scale window is predefined centered at uniformly distributed time positions; each time position specifies K anchor segments, and each anchor segment has a different fixed scale. By applying a 3D max-pooling filter (its kernel size is given only as an image in the source document), the spatial dimensions are downsampled to 1×1 to generate a time-only feature map collection C_tpn. The 2048-dimensional feature vector at each time position in C_tpn is used to predict the relative offsets {σC_k, σl_k} of {C_k, l_k}, k ∈ {1,...,K}, for each anchor segment.
S63) Classification uses a softmax loss function and regression uses a smooth L1 loss function (the L1 loss function itself is given only as an image in the source document), where N_cls and N_reg denote the batch size and the number of proposal boxes, λ is the loss trade-off parameter and is set to 1, k is the index of a proposal box within the batch, and a_k is the probability predicted for the proposal box or action; the ground-truth action value of the real action box, the relative offsets predicted with respect to the anchor segment or proposal box, and the coordinate transformation from the ground-truth video segment to the anchor segment or proposal, together with the formula that computes this transformation, are likewise given only as images in the source document; c_k and l_k are the center position and length of the anchor or proposal, and the corresponding ground-truth quantities represent the center position and length of the ground-truth action segment of the video.
The above loss function is applied to both the temporal proposal subnet and the action classification subnet. In the proposal subnet, the binary classification loss L_cls predicts whether a proposal box contains an action, while the regression loss L_reg optimizes the relative displacement between the proposal box and the ground truth; in this subnet the loss is independent of the action category. In the action classification subnet, the multi-class classification loss L_cls predicts a specific action category for the proposal box, where the number of categories is the number of actions plus one background class, and the regression loss L_reg optimizes the relative displacement between the action and the ground truth. All four losses of the two subnets are optimized jointly.
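The shapes, the proposal head, and the joint objective described above can be illustrated with the PyTorch sketches below. The real backbone is a ResNet-50-based 3D-ConvNet; the stand-in here only reproduces an input of shape (3, L, H, W) being mapped to a 2048-channel feature map, with a temporal stride of 8, a spatial stride of 16, K = 8 anchors per time position, and simple 1×1 convolutional prediction heads all assumed for illustration (the exact downsampling factors and pooling kernel are given only as images in the source document).

```python
import torch
import torch.nn as nn

# Stand-in for the ResNet-50-based 3D backbone: a few strided Conv3d blocks
# that only reproduce the assumed output shape (2048, L/8, H/16, W/16).
backbone = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
    nn.Conv3d(64, 256, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
    nn.Conv3d(256, 1024, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
    nn.Conv3d(1024, 2048, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
)

K = 8                                                # assumed anchor segments per time position
spatial_pool = nn.AdaptiveMaxPool3d((None, 1, 1))    # 3D max-pooling down to 1x1 in space
cls_head = nn.Conv1d(2048, 2 * K, kernel_size=1)     # action / background score per anchor
reg_head = nn.Conv1d(2048, 2 * K, kernel_size=1)     # (sigma_c_k, sigma_l_k) per anchor

clip = torch.randn(1, 3, 32, 160, 160)               # (N, 3, L, H, W)
feat = backbone(clip)                                # (1, 2048, 4, 10, 10)
c_tpn = spatial_pool(feat).flatten(2)                # time-only feature map, (1, 2048, 4)
scores = cls_head(c_tpn)                             # (1, 2K, 4)
offsets = reg_head(c_tpn)                            # (1, 2K, 4)
print(feat.shape, c_tpn.shape, scores.shape, offsets.shape)
```

The exact loss expression and the ground-truth coordinate transformation appear only as images in the source document, so the second sketch uses the standard form suggested by the surrounding text: softmax/cross-entropy classification plus smooth-L1 regression on center/length offsets, balanced by λ = 1, with the usual (center, length) offset encoding assumed.

```python
import torch
import torch.nn.functional as F

def encode_offsets(c_gt, l_gt, c_anchor, l_anchor):
    """Assumed center/length offset encoding (the patent's coordinate
    conversion is given only as an image)."""
    return (c_gt - c_anchor) / l_anchor, torch.log(l_gt / l_anchor)

def joint_loss(scores, offsets, labels, target_offsets, lam=1.0):
    """Sketch of the joint loss used by both subnets: cross-entropy
    classification plus smooth-L1 regression on positive anchors.

    scores:         (N, num_classes) predicted class scores a_k
    offsets:        (N, 2) predicted (sigma_c, sigma_l)
    labels:         (N,) ground-truth labels, 0 = background
    target_offsets: (N, 2) encoded ground-truth offsets
    """
    cls_loss = F.cross_entropy(scores, labels)          # averaged over N_cls
    pos = labels > 0                                     # regress only positive samples
    if pos.any():
        reg_loss = F.smooth_l1_loss(offsets[pos], target_offsets[pos])
    else:
        reg_loss = scores.new_zeros(())
    return cls_loss + lam * reg_loss
```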
The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may be modified and varied in many ways. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (4)

1. A human body action recognition method, characterized by comprising the following steps:
    S01) decoding the video and preprocessing each frame of the video, the preprocessing including minimum-neighborhood selection and filter design, a Kalman filter being used to filter the image;
    S02) converting the image format of the preprocessed image according to formula 21, so that the output image is converted from a three-channel RGB image into a single-channel grayscale (GRAY) image:
    Gray(m,n) = 0.299·r(m,n) + 0.587·g(m,n) + 0.441·b(m,n)      (21),
    where Gray(m,n) is the gray value of the filter-output grayscale image at pixel (m,n), and r(m,n), g(m,n), b(m,n) are the corresponding three-channel pixel values of the color image at pixel (m,n);
    S03) performing target contour enhancement on the image according to formula 31 (given only as an image in the source document), to remove noise from the grayscale image while improving the contour definition of the target in the image, where Pixel(m,n) denotes the pixel value computed after contour enhancement of the preprocessed output grayscale image at pixel (m,n), Gray(m,n) is the pixel value at (m,n) of the single-channel grayscale image obtained from formula 21, w(m,n,i,j) is a weight, and i, j denote the neighborhood size;
    the weight w(m,n,i,j) consisting of two parts, the spatial distance d(m,n,i,j) and the pixel distance r(m,n,i,j), computed as:
    w(m,n,i,j) = d(m,n,i,j)·r(m,n,i,j)      (32),
    where d(m,n,i,j) and r(m,n,i,j) are defined by formulas 33 and 34 (given only as images in the source document), with δ_d = 0.7 and δ_r = 0.2;
    S04) every 8 frames, selecting three images I_t, I_{t-8}, I_{t-16} from the image sequence, denoting the resulting foreground picture by D, the pixel values of the three pictures at pixel (m,n) being I_t(m,n), I_{t-8}(m,n), I_{t-16}(m,n) respectively, so that the foreground image is:
    D(m,n) = |I_t(m,n) - I_{t-8}(m,n)| ∩ |I_{t-8}(m,n) - I_{t-16}(m,n)|      (41),
    and applying a threshold operation to the foreground image D(m,n) according to formula 42 (given only as an image in the source document), where the threshold T is computed as:
    T = Min(T_{t/t-8}, T_{t-8/t-16})      (43),
    T_{t/t-8} and T_{t-8/t-16} in formula 43 taking the values satisfying formulas 44 and 45 respectively (given only as images in the source document), where A is the number of pixels in the whole picture and δ = 0.6;
    S05) performing erosion and dilation operations on the foreground image D(m,n);
    S06) converting the acquired grayscale foreground image D(m,n) into a three-channel image, combining the frames into a continuous picture sequence, and inputting the sequence into a three-dimensional convolutional network for training and detection.
2. The human body action recognition method according to claim 1, characterized in that the specific steps by which the three-dimensional convolutional network detects the continuous picture sequence are:
    S61) the input of the three-dimensional convolutional network is a collection of video frame images with 3 channels, video length L, video frame image height H, and video frame image width W; after forward propagation through the three-dimensional convolutional network, the output is a collection of feature maps with 2048 channels and a reduced video length, frame height, and frame width (the exact downsampled sizes are given only as images in the source document);
    S62) a multi-scale window is predefined centered at uniformly distributed time positions, each time position specifies K anchor segments, and each anchor segment has a different fixed scale; by applying a 3D max-pooling filter (its kernel size is given only as an image in the source document), the spatial dimensions are downsampled to 1×1 to generate a time-only feature map collection C_tpn, which has 2048 channels, the downsampled video length, a frame height of 1, and a frame width of 1; the 2048-dimensional feature vector at each time position in C_tpn is used to predict the relative offsets {σC_k, σl_k} of the center position and length {C_k, l_k}, k ∈ {1,...,K}, of each anchor segment;
    S63) classification uses a softmax loss function and regression uses a smooth L1 loss function (the L1 loss function itself is given only as an image in the source document), where N_cls and N_reg denote the batch size and the number of proposal boxes, λ is the loss trade-off parameter and is set to 1, k is the index of a proposal box within the batch, and a_k is the probability predicted for the proposal box or action; the ground-truth action value of the real action box, the relative offsets predicted with respect to the anchor segment or proposal box, and the coordinate transformation from the ground-truth video segment to the anchor segment or proposal, together with the formula that computes this transformation, are likewise given only as images in the source document; c_k and l_k are the center position and length of the anchor or proposal, and the corresponding ground-truth quantities represent the center position and length of the ground-truth action segment of the video.
3. The human body action recognition method according to claim 2, characterized in that the L1 loss function is applied to both the temporal proposal subnet and the action classification subnet; in the proposal subnet, the binary classification loss L_cls predicts whether a proposal box contains an action and the regression loss L_reg optimizes the relative displacement between the proposal and the ground truth; in the action classification subnet, the multi-class classification loss L_cls predicts a specific action category for the proposal box, the number of categories being the number of actions plus one background class, and the regression loss L_reg optimizes the relative displacement between the action and the ground truth.
4. The human body action recognition method according to claim 1, characterized in that in step S01 the minimum neighborhood width of the two-dimensional image is set to 9, i.e., one pixel and the 8 pixels around it are taken as the minimum filtering neighborhood, and the Kalman filter based on this minimum filtering neighborhood is designed as follows:
    S11) the gray value X(m,n) of pixel (m,n) is expressed linearly as:
    X(m,n) = F(m|i,n|j)·X^T(m|i,n|j) + Φ(m,n)      (11),
    where T denotes transposition and Φ(m,n) is the noise term; F(m|i,n|j) and X(m|i,n|j) are defined by formulas 12 and 13 (given only as images in the source document), so that formula 11 can be rewritten as formula 14 (given only as an image in the source document), where x(m+i,n+j) is the pixel value of each point in the image (a known quantity) and c(m+i,n+j) is the weight of each point of the original video frame image (an unknown quantity);
    S12) the criterion for computing c(m+i,n+j) is formula 15 (given only as an image in the source document); the value of c(m+i,n+j) must minimize formula 15, which yields formula 16 (given only as an image in the source document), in which A and B are respectively expressed as:
    A = x(m+i,n+j)      (17),
    B = x(m+i,n+j) - x(m+i-1,n+j);
    S13) let the observation equation be:
    Z(m,n) = X(m,n) + V(m,n)      (18),
    where V(m,n) is noise;
    S14) based on the minimum linear variance, the recursive formula of the two-dimensional discrete Kalman filter in the 3×3 neighborhood of pixel (m,n) is:
    X(m,n) = F(m|i,n|j)·X^T(m|i,n|j) + K(m,n)·[Z(m,n) - F(m|i,n|j)·X^T(m|i,n|j)]      (19),
    the one-step prediction variance equation is formula 110 (given only as an image in the source document), the gain equation is:
    K(m,n) = P_{m/m-1}(m,n) / [P_{m/m-1}(m,n) + r(m,n)]      (111),
    and the error variance matrix equation is:
    P_{m/m}(m,n) = [1 - K(m,n)]^2·P_{m/m-1}(m,n) + K^2(m,n)·r(m,n)      (112);
    the filter is constructed from formulas 19, 110, 111, and 112, completing the preprocessing of the input data.
PCT/CN2020/137991 2019-12-25 2020-12-21 Human action recognition method WO2021129569A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911362989.X 2019-12-25
CN201911362989.XA CN111062355A (en) 2019-12-25 2019-12-25 Human body action recognition method

Publications (1)

Publication Number Publication Date
WO2021129569A1 true WO2021129569A1 (en) 2021-07-01

Family ID=70303695

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/137991 WO2021129569A1 (en) 2019-12-25 2020-12-21 Human action recognition method

Country Status (2)

Country Link
CN (1) CN111062355A (en)
WO (1) WO2021129569A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062355A (en) * 2019-12-25 2020-04-24 神思电子技术股份有限公司 Human body action recognition method
CN113033283B (en) * 2020-12-18 2022-11-22 神思电子技术股份有限公司 Improved video classification system
CN113362324B (en) * 2021-07-21 2023-02-24 上海脊合医疗科技有限公司 Bone health detection method and system based on video image

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160310A (en) * 2015-08-25 2015-12-16 西安电子科技大学 3D (three-dimensional) convolutional neural network based human body behavior recognition method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137362A1 (en) * 2016-11-14 2018-05-17 Axis Ab Action recognition in a video sequence
CN108108722A (en) * 2018-01-17 2018-06-01 深圳市唯特视科技有限公司 A kind of accurate three-dimensional hand and estimation method of human posture based on single depth image
CN108470139A (en) * 2018-01-25 2018-08-31 天津大学 A kind of small sample radar image human action sorting technique based on data enhancing
CN109271931A (en) * 2018-09-14 2019-01-25 辽宁奇辉电子系统工程有限公司 It is a kind of that gesture real-time identifying system is pointed sword at based on edge analysis
CN111062355A (en) * 2019-12-25 2020-04-24 神思电子技术股份有限公司 Human body action recognition method

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743339B (en) * 2021-09-09 2023-10-03 三峡大学 Indoor falling detection method and system based on scene recognition
CN113743339A (en) * 2021-09-09 2021-12-03 三峡大学 Indoor fall detection method and system based on scene recognition
CN114694075A (en) * 2022-04-07 2022-07-01 合肥工业大学 Dangerous behavior identification method based on deep reinforcement learning
CN114694075B (en) * 2022-04-07 2024-02-13 合肥工业大学 Dangerous behavior identification method based on deep reinforcement learning
CN114943904A (en) * 2022-06-07 2022-08-26 国网江苏省电力有限公司泰州供电分公司 Operation monitoring method based on unmanned aerial vehicle inspection
CN116582195A (en) * 2023-06-12 2023-08-11 浙江瑞通电子科技有限公司 Unmanned aerial vehicle signal spectrum recognition algorithm based on artificial intelligence
CN116582195B (en) * 2023-06-12 2023-12-26 浙江瑞通电子科技有限公司 Unmanned aerial vehicle signal spectrum identification method based on artificial intelligence
CN116527407A (en) * 2023-07-04 2023-08-01 贵州毅丹恒瑞医药科技有限公司 Encryption transmission method for fundus image
CN116527407B (en) * 2023-07-04 2023-09-01 贵州毅丹恒瑞医药科技有限公司 Encryption transmission method for fundus image
CN116580343A (en) * 2023-07-13 2023-08-11 合肥中科类脑智能技术有限公司 Small sample behavior recognition method, storage medium and controller
CN117095694A (en) * 2023-10-18 2023-11-21 中国科学技术大学 Bird song recognition method based on tag hierarchical structure attribute relationship
CN117095694B (en) * 2023-10-18 2024-02-23 中国科学技术大学 Bird song recognition method based on tag hierarchical structure attribute relationship
CN117541991A (en) * 2023-11-22 2024-02-09 无锡科棒安智能科技有限公司 Intelligent recognition method and system for abnormal behaviors based on security robot
CN117690062A (en) * 2024-02-02 2024-03-12 武汉工程大学 Method for detecting abnormal behaviors of miners in mine
CN117690062B (en) * 2024-02-02 2024-04-19 武汉工程大学 Method for detecting abnormal behaviors of miners in mine

Also Published As

Publication number Publication date
CN111062355A (en) 2020-04-24

Similar Documents

Publication Publication Date Title
WO2021129569A1 (en) Human action recognition method
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN107529650B (en) Closed loop detection method and device and computer equipment
CN109035172B (en) Non-local mean ultrasonic image denoising method based on deep learning
CN109685045B (en) Moving target video tracking method and system
CN111079764B (en) Low-illumination license plate image recognition method and device based on deep learning
CN108230354B (en) Target tracking method, network training method, device, electronic equipment and storage medium
CN117372706A (en) Multi-scale deformable character interaction relation detection method
CN107564041B (en) Method for detecting visible light image aerial moving target
Zhang et al. Spatial–temporal gray-level co-occurrence aware CNN for SAR image change detection
CN108647605B (en) Human eye gaze point extraction method combining global color and local structural features
CN113379789B (en) Moving target tracking method in complex environment
CN115049954A (en) Target identification method, device, electronic equipment and medium
CN113936034A (en) Apparent motion combined weak and small moving object detection method combined with interframe light stream
CN111160372B (en) Large target identification method based on high-speed convolutional neural network
CN116912338A (en) Pixel picture vectorization method for textile
CN115457365B (en) Model interpretation method and device, electronic equipment and storage medium
Zhang et al. Research on the algorithm of license plate recognition based on MPGAN Haze Weather
CN115223033A (en) Synthetic aperture sonar image target classification method and system
CN111008555B (en) Unmanned aerial vehicle image small and weak target enhancement extraction method
CN114463379A (en) Dynamic capturing method and device for video key points
CN113436251A (en) Pose estimation system and method based on improved YOLO6D algorithm
CN107016675A (en) A kind of unsupervised methods of video segmentation learnt based on non local space-time characteristic
CN111597967A (en) Infrared image multi-target pedestrian identification method
CN111862152A (en) Moving target detection method based on interframe difference and super-pixel segmentation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20907289

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20907289

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 521430725

Country of ref document: SA