WO2022134655A1 - End-to-end video action detection and positioning system - Google Patents

End-to-end video action detection and positioning system

Info

Publication number
WO2022134655A1
WO2022134655A1 (PCT/CN2021/116771; CN2021116771W)
Authority
WO
WIPO (PCT)
Prior art keywords
information
data
feature map
module
feature
Prior art date
Application number
PCT/CN2021/116771
Other languages
French (fr)
Chinese (zh)
Inventor
席道亮
许野平
刘辰飞
陈英鹏
张朝瑞
高朋
Original Assignee
神思电子技术股份有限公司
Priority date
Filing date
Publication date
Application filed by 神思电子技术股份有限公司
Publication of WO2022134655A1

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 18/00 Pattern recognition
                    • G06F 18/20 Analysing
                        • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                            • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
                            • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
                        • G06F 18/23 Clustering techniques
                        • G06F 18/25 Fusion techniques
                            • G06F 18/253 Fusion techniques of extracted features
            • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V 20/00 Scenes; Scene-specific elements
                    • G06V 20/40 Scenes; Scene-specific elements in video content
                        • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
                • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
                    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • H ELECTRICITY
        • H04 ELECTRIC COMMUNICATION TECHNIQUE
            • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
                • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
                    • H04N 19/40 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video transcoding, i.e. partial or full decoding of a coded input stream followed by re-encoding of the decoded output stream


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

An end-to-end video action detection and positioning system, which relates to the field of human action recognition. The positioning process of the system comprises: decoding a video; reorganizing data, i.e. setting a data sampling frequency, reading a video clip of fixed length, and recombining the data into an inputtable format for the next module; performing a computation operation on the input data; extracting spatial key information, i.e. processing the feature information extracted by a spatiotemporal information parsing unit module so that the features extracted by the network focus on the more useful spatial information in the image, filtering out background information, and enhancing the features of the positions in the image where actions occur; integrating and mining channel information, i.e. performing channel-level information integration on the data features obtained by the spatiotemporal information parsing unit module, mining motion information, focusing on motion information between frames and on the type of behavior that occurs; and outputting a prediction result, using a 1×1 convolution to output a feature map with the corresponding number of channels.

Description

An end-to-end video action detection and positioning system
Technical Field
The present invention relates to the technical field of human action recognition, and in particular to an end-to-end video action detection and positioning system.
Background Art
The statements herein merely provide background related to the present invention and do not necessarily constitute prior art.
Behavior recognition analyzes multiple consecutive frames of a given video clip to recognize its content, usually human actions such as fighting or falling to the ground. In practical applications it can identify dangerous behavior occurring in a scene, has a wide range of application scenarios, and is a long-standing research hotspot in computer vision. Current deep-learning-based behavior recognition algorithms can not only identify the type of action but also locate the spatial position where the action occurs, and they achieve high accuracy in multi-target, complex scenes.
In the paper "Learning Spatiotemporal Features with 3D Convolutional Networks", Du Tran et al. proposed a simple and effective method that trains deep 3D convolutional networks (3D ConvNets) on large-scale supervised video datasets. Compared with 2D ConvNets, this method is better suited to learning spatiotemporal features and to expressing the continuity between frames; on the UCF101 dataset it matched the accuracy of the best contemporary methods with fewer dimensions. Its simple 3D convolutional architecture is computationally efficient, fast in forward propagation, and easy to train and use. Its drawbacks are that the recognition target is a single person in a simple scene; in complex scenes the recognition accuracy is low, the false alarm rate is high, and the model has essentially no generalization ability, so it cannot be deployed in real complex environments, nor can it locate the position in the frame where the action occurs.
The paper "Two-Stream Convolutional Networks for Action Recognition in Videos" proposes a two-stream network for action classification. The method uses two parallel networks, a spatial stream ConvNet and a temporal stream ConvNet. The former is a classification network whose input is a static image and which captures appearance information; the latter takes dense optical flow over multiple consecutive frames as input and captures motion information. The classification scores of the two networks are finally fused through softmax. The method is accurate and can be applied to complex multi-person scenes, but its drawback is that the optical flow of the video clip to be detected must be computed in advance, so real-time detection is not possible, and the position where the action occurs still cannot be located.
Chinese patent 201810292563 discloses a video action classification model training method, device, and video action classification method. Its advantage is that it can obtain training image frames from multiple labeled training videos and, building on features learned from easier training video frames, learn the discriminative features between harder training image frames and easier ones, so that training videos can be classified more accurately. However, the method still cannot locate the spatial position and the starting time of the action in the frame.
Chinese patent 201810707711 discloses a video-based behavior recognition method, behavior recognition device, and terminal equipment. Its innovation lies in using a convolutional neural network together with a long short-term memory (LSTM) network for temporal modeling, adding temporal information between frames, which effectively addresses the complex background information and insufficient temporal modeling ability of existing behavior recognition methods. However, the method cannot be trained end to end and detects single RGB image frames separately, so its recognition accuracy is low in scenes with complex backgrounds.
Chinese patent 201210345589.X discloses a behavior recognition method based on an action subspace and a weighted behavior recognition model. Its advantages are that the input is the video sequence to be detected, the temporal information of the action is extracted, and background subtraction is used to remove the influence of background noise on the foreground; it can accurately identify human behaviors that change over time and with people entering or leaving the area, and it is robust to noise and other influencing factors. However, the method cannot make accurate judgments when multiple behaviors are present in the same scene.
Summary of the Invention
In view of the deficiencies of the prior art, the purpose of the present invention is to provide an end-to-end video action detection and positioning system that can locate the spatial position where an action occurs once the video sequence to be detected is input.
The present invention specifically adopts the following technical scheme:
An end-to-end video action detection and positioning system comprises a video decoding module and a data reorganization module, and the positioning process includes the following steps (a schematic sketch of the overall pipeline is given after this list):
(1) Video decoding: the video decoding module feeds the network video stream over the network line into the video decoding unit, which decodes the stream into individual RGB image frames on an SOC (system-on-chip) and passes them to the data reorganization module for data preprocessing;
(2) Data reorganization: a data sampling frequency is set, video clips of fixed length are read, and the data are recombined into an inputtable format for the next module;
(3) A computation operation is performed on the input data;
(4) Spatial key information extraction: the feature information extracted by the spatiotemporal information parsing unit module is processed so that the features extracted by the network focus on the more useful spatial information in the image, background information is filtered out, and the features at the positions where actions occur are enhanced;
(5) Channel information integration and mining: the data features produced by the spatiotemporal information parsing unit module undergo channel-level information integration to mine motion information, focusing on motion information between frames and on the type of behavior that occurs;
(6) Prediction result output: a 1x1 convolution outputs a feature map with the corresponding number of channels.
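The following minimal PyTorch-style sketch shows one way the six steps above could be wired together. The module interfaces, layer shapes, and the use of a generic 3D CNN as the spatiotemporal information parsing unit are illustrative assumptions, not the patent's reference implementation, and only a single output scale is shown.

```python
import torch
import torch.nn as nn

class ActionDetector(nn.Module):
    """Hypothetical wiring of steps (1)-(6); module internals are placeholders."""
    def __init__(self, backbone3d, spatial_module, channel_module,
                 head_in_channels, num_class, num_anchors=3):
        super().__init__()
        self.backbone3d = backbone3d          # step (3): spatiotemporal information parsing unit (e.g. a 3D CNN)
        self.spatial_module = spatial_module  # step (4): spatial key information extraction
        self.channel_module = channel_module  # step (5): channel information integration mining
        # step (6): 1x1 convolution giving 3*(NumClass+5) channels per location
        # (one scale shown; the description mentions four output layers)
        self.head = nn.Conv2d(head_in_channels, num_anchors * (num_class + 5), kernel_size=1)

    def forward(self, clip):                  # clip: (B, 3, D, H, W) with D = 8 or 16 decoded frames
        feat = self.backbone3d(clip)          # (B, C1, D'=1, H1, W1)
        feat = feat.squeeze(2)                # 4-D -> 3-D feature map per sample
        x_out = self.spatial_module(feat)     # spatially re-weighted key-information features
        fused = self.channel_module(x_out, feat)
        return self.head(fused)               # (B, 3*(NumClass+5), H1, W1)
```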
Preferably, the specific process of data reorganization is:
When prediction starts, a video clip of fixed length n is taken and processed into unit data Ydst, which is input to the spatiotemporal information parsing unit module; n equals 8 or 16. Before being input to the spatiotemporal information parsing unit module, every RGB image in the unit data Ydst must be resized to a fixed size.
Let a single image of the source video clip be denoted Xsrc and the fixed-size image input to the spatiotemporal information parsing unit module be denoted Xdst. After scaling, each pixel of Xdst is computed as follows:
(1) For each pixel in Xdst, let the floating-point coordinates obtained by the inverse transform be (i+u, j+v), where i and j are the integer parts of the coordinates and u and v are the fractional parts, i.e. floating-point numbers in the interval [0, 1);
(2) The pixel value f(i+u, j+v) is determined by the four surrounding pixels of the original image at coordinates (i, j), (i+1, j), (i, j+1), and (i+1, j+1), i.e.
f(i+u, j+v) = (1-u)(1-v)·f(i, j) + (1-u)v·f(i, j+1) + u(1-v)·f(i+1, j) + uv·f(i+1, j+1),
where f(i, j) denotes the pixel value of the source image at (i, j).
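A minimal NumPy sketch of this bilinear resampling step, assuming a single-channel image and a simple scaling inverse transform; the function name and boundary handling are assumptions for illustration.

```python
import numpy as np

def bilinear_resize(src: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Resize a single-channel image with the bilinear formula above."""
    in_h, in_w = src.shape
    dst = np.zeros((out_h, out_w), dtype=np.float64)
    for y in range(out_h):
        for x in range(out_w):
            # inverse transform of the destination coordinate into the source image
            fy = y * (in_h - 1) / max(out_h - 1, 1)
            fx = x * (in_w - 1) / max(out_w - 1, 1)
            i, j = int(fy), int(fx)                      # integer parts
            u, v = fy - i, fx - j                        # fractional parts in [0, 1)
            i1, j1 = min(i + 1, in_h - 1), min(j + 1, in_w - 1)
            dst[y, x] = ((1 - u) * (1 - v) * src[i, j]
                         + (1 - u) * v * src[i, j1]
                         + u * (1 - v) * src[i1, j]
                         + u * v * src[i1, j1])
    return dst
```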
Preferably, the computation operation on the input data includes the following processes:
(1) The video unit data Ydst is input to the spatiotemporal information parsing unit module as a series of RGB image frames R^(C×D×H×W), where C = 3 is the number of channels of each RGB frame, D is the number of images in each group of unit data Ydst (at most 16), and H and W are the width and height of each image in the group. The spatiotemporal information parsing unit module outputs a feature map (formula PCTCN2021116771-appb-000001), where C1, H1, and W1 denote the number of channels, the width, and the height of the output feature map. To match the output dimensions of the spatial key information extraction module, D' = 1 is enforced, and the four-dimensional output of the spatiotemporal information parsing unit module is then transformed into three-dimensional data by a dimension transform; the output feature map is expressed as (formula PCTCN2021116771-appb-000002).
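A short PyTorch sketch of the dimension handling described here, assuming a generic 3D CNN stands in for the spatiotemporal information parsing unit; the backbone layers and output sizes are placeholders, not the patent's architecture.

```python
import torch
import torch.nn as nn

# Placeholder backbone: any 3D CNN whose temporal output depth D' is 1.
backbone3d = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool3d((1, 56, 56)),   # force D' = 1, H1 = W1 = 56 (illustrative sizes)
)

clip = torch.randn(1, 3, 16, 224, 224)   # Ydst as R^(C x D x H x W) with C = 3, D = 16
feat4d = backbone3d(clip)                # (1, C1, D'=1, H1, W1)
feat3d = feat4d.squeeze(2)               # dimension transform: 4-D -> 3-D feature map R^(C1 x H1 x W1)
print(feat3d.shape)                      # torch.Size([1, 64, 56, 56])
```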
(2) A spatial key information extraction module is added so that the network pays more attention to the features of the object where the behavior occurs. The input of this module is (formula PCTCN2021116771-appb-000003) and the output feature map is (formula PCTCN2021116771-appb-000004).
Preferably, the spatial key information extraction includes the following processes:
(1) Let the output feature map of the spatiotemporal information parsing unit module have size (formula PCTCN2021116771-appb-000005). The feature map is input to the spatial key information extraction module to obtain R_f1 and R_f2 (formulas PCTCN2021116771-appb-000006 and PCTCN2021116771-appb-000007),
where f1() denotes the averaging operation on the feature matrix and f2() denotes the feature extraction operation on the matrix;
(2) R_f1 and R_f2 are added along the first dimension to obtain the merged spatial feature information R_f = R_f1 + R_f2;
(3) R_f undergoes spatial feature fusion: R_f is input to the fused-feature normalization unit, which enhances the spatial features, and normalizing the enhanced features makes the subsequent computation more efficient:
X = f_fuse(R_f)
X_out = f_normalize(X)
where X denotes the fused feature map, the fusion function f_fuse() integrates the information of the feature R_f, and the normalization function f_normalize() normalizes the enhanced features to the range 0 to 1.
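A sketch of the spatial key information extraction module under stated assumptions: the exact definitions of f1(), f2() and f_fuse() are given in formula images not reproduced in this text, so the code below assumes a channel-wise mean for f1(), a channel-wise max for f2(), a 7x7 convolution for f_fuse(), and a sigmoid for f_normalize(), which is one common pattern for spatial attention.

```python
import torch
import torch.nn as nn

class SpatialKeyInfo(nn.Module):
    """Sketch of spatial key information extraction (assumed f1/f2/f_fuse choices)."""
    def __init__(self, out_channels: int = 1):
        super().__init__()
        self.fuse = nn.Conv2d(1, out_channels, kernel_size=7, padding=3)

    def forward(self, x):                       # x: (B, C1, H1, W1) from the parsing unit
        r_f1 = x.mean(dim=1, keepdim=True)      # f1(): averaging over the feature matrix
        r_f2, _ = x.max(dim=1, keepdim=True)    # f2(): assumed max-based feature extraction
        r_f = r_f1 + r_f2                       # merge along the first dimension
        x_fused = self.fuse(r_f)                # f_fuse(): integrate the information of R_f
        x_out = torch.sigmoid(x_fused)          # f_normalize(): squash to the range 0..1
        return x_out                            # X_out, later merged by channel downstream
```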
Preferably, the channel information integration and mining includes the following steps:
(1) The data features obtained by the spatial key information extraction module are expressed as (formula PCTCN2021116771-appb-000008), and the features of the spatiotemporal information parsing unit module are expressed as (formula PCTCN2021116771-appb-000009). To reduce the information loss of the channel information integration mining module, X_out and (formula PCTCN2021116771-appb-000010) are input and their feature information is merged by channel, outputting a feature map Y;
(2) A channel compression unit vectorizes the feature map Y into Z (formula PCTCN2021116771-appb-000011), where f_vector() denotes the vectorization function and Z denotes the vectorized representation of the feature map; C3 denotes the sum of the channel counts, C3 = C1 + C2, and N denotes the vectorized size of each feature map, N = H1 * W1.
From the feature matrix Z and its transpose Z^T (T denotes the matrix transpose), the feature matrix I is generated; each element of this matrix is the value of the inner product of Z and Z^T, and the generated matrix I has dimension C3 x C3. The formula for generating matrix I is (formula PCTCN2021116771-appb-000012), where the parameters i and j index the rows and columns of the matrix Z, and n runs from zero up to a maximum of N. The following operation is then performed on this matrix to generate the feature map (formula PCTCN2021116771-appb-000013); the formula for computing matrix E is (formula PCTCN2021116771-appb-000014).
Each value in the feature map (formula PCTCN2021116771-appb-000015) lies between 0 and 1 and indicates the degree to which the j-th channel influences the i-th channel;
(3) To further account for the influence of the feature map E on the original feature map Z, Z* must be computed. First, matrix E is transposed and multiplied with Z:
Z* = E^T * Z
Z* is then restored to a three-dimensional output by a dimension transform (formula PCTCN2021116771-appb-000016), where the function f_reshape() expands the dimensions. The final output of the feature map is (formula PCTCN2021116771-appb-000017), computed as O = Z* + X_out.
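The channel integration in steps (1)-(3) resembles a Gram-matrix style channel self-attention block. The sketch below is one plausible reading: it assumes E is a row-wise softmax normalization of I (the text only states that its values lie between 0 and 1) and takes the residual term to be the concatenated input, since the channel counts must match for the addition O = Z* + X_out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelIntegration(nn.Module):
    """Sketch of channel information integration mining (one possible reading)."""
    def forward(self, x_out, feat):
        # (1) merge the two feature maps by channel: Y has C3 = C1 + C2 channels
        y = torch.cat([x_out, feat], dim=1)          # (B, C3, H1, W1)
        b, c3, h, w = y.shape
        z = y.view(b, c3, h * w)                     # f_vector(): Z with N = H1 * W1
        # (2) channel affinity I = Z Z^T, then normalize to E with values in 0..1
        i_mat = torch.bmm(z, z.transpose(1, 2))      # (B, C3, C3)
        e_mat = F.softmax(i_mat, dim=-1)             # assumed normalization
        # (3) re-weight channels: Z* = E^T Z, reshape back, residual addition
        z_star = torch.bmm(e_mat.transpose(1, 2), z) # (B, C3, N)
        z_star = z_star.view(b, c3, h, w)            # f_reshape(): restore 3-D layout
        return z_star + y                            # O = Z* + (concatenated input)
```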
Preferably, the prediction result output includes the following steps:
Three prediction boxes are generated for each feature point in the image, and the whole network is designed with four output layers; therefore, before network training, a clustering algorithm is applied to all bounding boxes (bboxes) in the dataset to generate 12 anchor boxes. The coordinate regression generates the final output size of each layer of the model according to the number of predicted categories, [(3 × (NumClass + 5)) × H × W], where NumClass is the number of predicted categories. During training, to adapt to the categories in the current dataset, the following loss function is used for category prediction, with the loss value loss_c computed as:
loss_c = -Σ a' * ln(a)
where a' denotes the true category value in the label and a denotes the category output value predicted by the model. The coordinate loss value loss_coord is computed as:
loss_coord = -y' * log(y) - (1 - y') * log(1 - y)
where y' denotes the real coordinate value in the label and y denotes the model's predicted coordinate output.
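A small PyTorch sketch of the two loss terms as written above: a cross-entropy term over class scores and a binary-cross-entropy-style term over normalized coordinates. The sum reduction and the assumption that the predictions are already probabilities in (0, 1) are illustrative choices.

```python
import torch

def class_loss(a_true: torch.Tensor, a_pred: torch.Tensor) -> torch.Tensor:
    # loss_c = -sum(a' * ln(a)); a_pred assumed to be probabilities (e.g. after softmax)
    return -(a_true * torch.log(a_pred.clamp_min(1e-7))).sum()

def coord_loss(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    # loss_coord = -y' * log(y) - (1 - y') * log(1 - y); y_pred assumed in (0, 1), e.g. after sigmoid
    y_pred = y_pred.clamp(1e-7, 1 - 1e-7)
    return (-(y_true * torch.log(y_pred)) - (1 - y_true) * torch.log(1 - y_pred)).sum()
```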
The present invention has the following beneficial effects:
The spatial key information extraction module and the channel information integration mining module improve the accuracy of behavior recognition and allow multiple behaviors to be recognized simultaneously in complex scenes.
Combining the idea of bounding-box regression from object detection networks with video classification increases the generalization ability of the model and improves the robustness of recognition across different scenes.
Description of the Drawings
The accompanying drawings, which form a part of the present invention, are provided for further understanding of the present invention; the exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention.
Figure 1 is a structural diagram of the end-to-end video action detection and positioning system.
Detailed Description of the Embodiments
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present invention. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It should also be noted that the terminology used herein is for the purpose of describing specific embodiments only and is not intended to limit the exemplary embodiments of the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well; furthermore, when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.
The specific embodiments of the present invention are further described below with reference to the accompanying drawings and specific examples:
With reference to Figure 1, an end-to-end video action detection and positioning system comprises a video decoding module and a data reorganization module, and the positioning process includes the following steps:
(1) Video decoding: the video decoding module feeds the network video stream over the network line into the video decoding unit, which decodes the stream into individual RGB image frames on an SOC (system-on-chip) and passes them to the data reorganization module for data preprocessing.
(2) Data reorganization: a data sampling frequency is set, video clips of fixed length are read, and the data are recombined into an inputtable format for the next module.
(3) A computation operation is performed on the input data.
(4) Spatial key information extraction: the feature information extracted by the spatiotemporal information parsing unit module is processed so that the features extracted by the network focus on the more useful spatial information in the image, background information is filtered out, and the features at the positions where actions occur are enhanced.
(5) Channel information integration and mining: the data features produced by the spatiotemporal information parsing unit module undergo channel-level information integration to mine motion information, focusing on motion information between frames and on the type of behavior that occurs.
(6) Prediction result output: a 1x1 convolution outputs a feature map with the corresponding number of channels.
The specific process of data reorganization is as follows:
When prediction starts, a video clip of fixed length n is taken and processed into unit data Ydst, which is input to the spatiotemporal information parsing unit module; n equals 8 or 16. Before being input to the spatiotemporal information parsing unit module, every RGB image in the unit data Ydst must be resized to a fixed size.
Let a single image of the source video clip be denoted Xsrc and the fixed-size image input to the spatiotemporal information parsing unit module be denoted Xdst. After scaling, each pixel of Xdst is computed as follows:
(1) For each pixel in Xdst, let the floating-point coordinates obtained by the inverse transform be (i+u, j+v), where i and j are the integer parts of the coordinates and u and v are the fractional parts, i.e. floating-point numbers in the interval [0, 1);
(2) The pixel value f(i+u, j+v) is determined by the four surrounding pixels of the original image at coordinates (i, j), (i+1, j), (i, j+1), and (i+1, j+1), i.e.
f(i+u, j+v) = (1-u)(1-v)·f(i, j) + (1-u)v·f(i, j+1) + u(1-v)·f(i+1, j) + uv·f(i+1, j+1),
where f(i, j) denotes the pixel value of the source image at (i, j).
Preferably, the computation operation on the input data includes the following processes:
(1) The video unit data Ydst is input to the spatiotemporal information parsing unit module as a series of RGB image frames R^(C×D×H×W), where C = 3 is the number of channels of each RGB frame, D is the number of images in each group of unit data Ydst (at most 16), and H and W are the width and height of each image in the group. The spatiotemporal information parsing unit module outputs a feature map (formula PCTCN2021116771-appb-000018), where C1, H1, and W1 denote the number of channels, the width, and the height of the output feature map. To match the output dimensions of the spatial key information extraction module, D' = 1 is enforced, and the four-dimensional output of the spatiotemporal information parsing unit module is then transformed into three-dimensional data by a dimension transform; the output feature map is expressed as (formula PCTCN2021116771-appb-000019).
(2) A spatial key information extraction module is added so that the network pays more attention to the features of the object where the behavior occurs. The input of this module is (formula PCTCN2021116771-appb-000020) and the output feature map is (formula PCTCN2021116771-appb-000021).
Spatial key information extraction includes the following processes:
(1) Let the output feature map of the spatiotemporal information parsing unit module have size (formula PCTCN2021116771-appb-000022). The feature map is input to the spatial key information extraction module to obtain R_f1 and R_f2 (formulas PCTCN2021116771-appb-000023 and PCTCN2021116771-appb-000024),
where f1() denotes the averaging operation on the feature matrix and f2() denotes the feature extraction operation on the matrix;
(2) R_f1 and R_f2 are added along the first dimension to obtain the merged spatial feature information R_f = R_f1 + R_f2;
(3) R_f undergoes spatial feature fusion: R_f is input to the fused-feature normalization unit, which enhances the spatial features, and normalizing the enhanced features makes the subsequent computation more efficient:
X = f_fuse(R_f)
X_out = f_normalize(X)
where X denotes the fused feature map, the fusion function f_fuse() integrates the information of the feature R_f, and the normalization function f_normalize() normalizes the enhanced features to the range 0 to 1.
Channel information integration and mining includes the following steps:
(1) The data features obtained by the spatial key information extraction module are expressed as (formula PCTCN2021116771-appb-000025), and the features of the spatiotemporal information parsing unit module are expressed as (formula PCTCN2021116771-appb-000026). To reduce the information loss of the channel information integration mining module, X_out and (formula PCTCN2021116771-appb-000027) are input and their feature information is merged by channel, outputting a feature map Y;
(2) A channel compression unit vectorizes the feature map Y into Z (formula PCTCN2021116771-appb-000028), where f_vector() denotes the vectorization function and Z denotes the vectorized representation of the feature map; C3 denotes the sum of the channel counts, C3 = C1 + C2, and N denotes the vectorized size of each feature map, N = H1 * W1.
From the feature matrix Z and its transpose Z^T (T denotes the matrix transpose), the feature matrix I is generated; each element of this matrix is the value of the inner product of Z and Z^T, and the generated matrix I has dimension C3 x C3. The formula for generating matrix I is (formula PCTCN2021116771-appb-000029), where the parameters i and j index the rows and columns of the matrix Z, and n runs from zero up to a maximum of N. The following operation is then performed on this matrix to generate the feature map (formula PCTCN2021116771-appb-000030); the formula for computing matrix E is (formula PCTCN2021116771-appb-000031).
Each value in the feature map (formula PCTCN2021116771-appb-000032) lies between 0 and 1 and indicates the degree to which the j-th channel influences the i-th channel;
(3) To further account for the influence of the feature map E on the original feature map Z, Z* must be computed. First, matrix E is transposed and multiplied with Z:
Z* = E^T * Z
Z* is then restored to a three-dimensional output by a dimension transform (formula PCTCN2021116771-appb-000033), where the function f_reshape() expands the dimensions. The final output of the feature map is (formula PCTCN2021116771-appb-000034), computed as O = Z* + X_out.
The prediction result output includes the following steps:
Three prediction boxes are generated for each feature point in the image, and the whole network is designed with four output layers; therefore, before network training, a clustering algorithm is applied to all bounding boxes (bboxes) in the dataset to generate 12 anchor boxes. The coordinate regression generates the final output size of each layer of the model according to the number of predicted categories, [(3 × (NumClass + 5)) × H × W], where NumClass is the number of predicted categories. During training, to adapt to the categories in the current dataset, the following loss function is used for category prediction, with the loss value loss_c computed as:
loss_c = -Σ a' * ln(a)
where a' denotes the true category value in the label and a denotes the category output value predicted by the model. The coordinate loss value loss_coord is computed as:
loss_coord = -y' * log(y) - (1 - y') * log(1 - y)
where y' denotes the real coordinate value in the label and y denotes the model's predicted coordinate output.
The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (6)

  1. An end-to-end video action detection and positioning system, comprising a video decoding module and a data reorganization module, characterized in that the positioning process includes the following steps:
    (1) Video decoding: the video decoding module feeds the network video stream over the network line into the video decoding unit, which decodes the stream into individual RGB image frames on an SOC (system-on-chip) and passes them to the data reorganization module for data preprocessing;
    (2) Data reorganization: a data sampling frequency is set, video clips of fixed length are read, and the data are recombined into an inputtable format for the next module;
    (3) A computation operation is performed on the input data;
    (4) Spatial key information extraction: the feature information extracted by the spatiotemporal information parsing unit module is processed so that the features extracted by the network focus on the more useful spatial information in the image, background information is filtered out, and the features at the positions where actions occur are enhanced;
    (5) Channel information integration and mining: the data features produced by the spatiotemporal information parsing unit module undergo channel-level information integration to mine motion information, focusing on motion information between frames and on the type of behavior that occurs;
    (6) Prediction result output: a 1x1 convolution outputs a feature map with the corresponding number of channels.
  2. The end-to-end video action detection and positioning system according to claim 1, characterized in that the specific process of data reorganization is:
    When prediction starts, a video clip of fixed length n is taken and processed into unit data Ydst, which is input to the spatiotemporal information parsing unit module; n equals 8 or 16. Before being input to the spatiotemporal information parsing unit module, every RGB image in the unit data Ydst must be resized to a fixed size;
    Let a single image of the source video clip be denoted Xsrc and the fixed-size image input to the spatiotemporal information parsing unit module be denoted Xdst; after scaling, each pixel of Xdst is computed as follows:
    (1) For each pixel in Xdst, let the floating-point coordinates obtained by the inverse transform be (i+u, j+v), where i and j are the integer parts of the coordinates and u and v are the fractional parts, i.e. floating-point numbers in the interval [0, 1);
    (2) The pixel value f(i+u, j+v) is determined by the four surrounding pixels of the original image at coordinates (i, j), (i+1, j), (i, j+1), and (i+1, j+1), i.e.
    f(i+u, j+v) = (1-u)(1-v)·f(i, j) + (1-u)v·f(i, j+1) + u(1-v)·f(i+1, j) + uv·f(i+1, j+1),
    where f(i, j) denotes the pixel value of the source image at (i, j).
  3. The end-to-end video action detection and positioning system according to claim 1, characterized in that the computation operation on the input data includes the following processes:
    (1) The video unit data Ydst is input to the spatiotemporal information parsing unit module as a series of RGB image frames R^(C×D×H×W), where C = 3 is the number of channels of each RGB frame, D is the number of images in each group of unit data Ydst (at most 16), and H and W are the width and height of each image in the group; the spatiotemporal information parsing unit module outputs a feature map (formula PCTCN2021116771-appb-100001), where C1, H1, and W1 denote the number of channels, the width, and the height of the output feature map; to match the output dimensions of the spatial key information extraction module, D' = 1 is enforced, and the four-dimensional output of the spatiotemporal information parsing unit module is then transformed into three-dimensional data by a dimension transform; the output feature map is expressed as (formula PCTCN2021116771-appb-100002);
    (2) A spatial key information extraction module is added so that the network pays more attention to the features of the object where the behavior occurs; the input of this module is (formula PCTCN2021116771-appb-100003) and the output feature map is (formula PCTCN2021116771-appb-100004).
  4. 如权利要求1所述的一种端到端的视频动作检测定位系统,其特征在于,空间关键信息提取包括以下过程:An end-to-end video action detection and positioning system as claimed in claim 1, wherein the extraction of key spatial information comprises the following process:
    (1) Let the output feature map of the spatiotemporal information parsing unit module have size R^(C1×H1×W1); this feature map is input to the spatial key information extraction module to obtain R_f1 and R_f2:
    R_f1 = f_1(R^(C1×H1×W1))
    R_f2 = f_2(R^(C1×H1×W1))
    where f_1() denotes the averaging operation on the feature matrix and f_2() denotes the feature extraction operation on the matrix;
    (2) R_f1 and R_f2 are added along the first dimension to obtain the merged spatial feature information
    R_f = R_f1 + R_f2
    (3) Spatial feature fusion is applied to R_f: R_f is input to the fused-feature normalization unit, which enhances the spatial features and normalizes the enhanced features so that subsequent computation is more efficient:
    X = f_fuse(R_f)
    X_out = f_normalize(X)
    where X denotes the fused feature map, the fusion function f_fuse() integrates the feature information, and the normalization function f_normalize() normalizes the enhanced features to the range 0 to 1.
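A minimal sketch of this spatial key information extraction, with concrete stand-ins for the operations the claim leaves abstract: f_1 is taken as a channel-wise mean, f_2 as a 1×1 convolution, f_fuse as a 3×3 convolution and f_normalize as a sigmoid. These specific layers, and the class and parameter names, are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class SpatialKeyInfoExtraction(nn.Module):
    """Sketch of R_f1 = f_1(X), R_f2 = f_2(X), R_f = R_f1 + R_f2,
    X = f_fuse(R_f), X_out = f_normalize(X). The concrete layers are assumed."""

    def __init__(self, c1: int, c2: int):
        super().__init__()
        self.f2 = nn.Conv2d(c1, 1, kernel_size=1)                  # assumed feature extraction
        self.f_fuse = nn.Conv2d(1, c2, kernel_size=3, padding=1)   # assumed fusion
        self.f_normalize = nn.Sigmoid()                            # squashes values to 0..1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, C1, H1, W1), the reshaped spatiotemporal feature map
        r_f1 = x.mean(dim=1, keepdim=True)    # f_1: averaging -> (batch, 1, H1, W1)
        r_f2 = self.f2(x)                     # f_2 -> (batch, 1, H1, W1)
        r_f = r_f1 + r_f2                     # merge along the first (channel) dimension
        fused = self.f_fuse(r_f)              # (batch, C2, H1, W1)
        return self.f_normalize(fused)        # X_out with values in [0, 1]
```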
  5. The end-to-end video action detection and positioning system according to claim 1, wherein the channel information integration mining comprises the following steps:
    (1) The data feature obtained by the spatial key information extraction module is denoted X_out ∈ R^(C2×H1×W1), and the feature of the spatiotemporal information parsing unit module is denoted R^(C1×H1×W1); to reduce the information loss of the channel information integration mining module, X_out and R^(C1×H1×W1) are input and their feature information is merged by channel, producing the output feature map Y;
    (2) The channel compression unit vectorizes the feature map Y into Z, Z = f_vector(Y), where the function f_vector() denotes the vectorization function and Z ∈ R^(C3×N) is the vectorized representation of the feature map; C3 denotes the sum of the channel counts, C3 = C1 + C2, and N denotes the vectorized size of each feature map, N = H1*W1;
    Multiplying the feature matrix Z by its transpose Z^T (T denotes matrix transposition) generates a feature matrix in which each element is the value of an inner product of rows of Z; the matrix I has dimensions C3×C3 and is computed as
    I(i, j) = Σ_n Z(i, n)·Z(j, n)
    where the parameters i and j index the rows and columns of the matrix Z and n runs from zero up to a maximum of N; the following operation is then applied to this matrix to generate the feature map E ∈ R^(C3×C3), computed according to the formula shown in image PCTCN2021116771-appb-100014;
    Each value of the feature map E ∈ R^(C3×C3) lies between 0 and 1, and it indicates the degree to which the j-th channel influences the i-th channel;
    (3) To further account for the influence of the feature map E on the original feature map Z, Z′ must be computed; first the matrix E is transposed and multiplied with Z:
    Z′ = E^T * Z
    Z′ is then restored to a three-dimensional output by a dimension transform:
    Z″ = f_reshape(Z′), Z″ ∈ R^(C3×H1×W1)
    where the function f_reshape() mainly expands the dimensions; the final output feature map is O ∈ R^(C3×H1×W1), computed as O = Z″ + X_out.
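The claim describes a channel-affinity mechanism: concatenate by channel, vectorize to Z ∈ R^(C3×N), form I = Z·Z^T, normalize it to E with values in [0, 1], compute Z′ = E^T·Z, reshape and add a residual. The normalization producing E appears only as a formula image in the published text, so a row-wise softmax is assumed below, and the residual is taken over the channel-concatenated map so that shapes line up; both choices are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def channel_info_integration(x_out: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
    """Sketch of the channel information integration mining step.
    x_out: (batch, C2, H1, W1) from the spatial key information extraction module.
    feat:  (batch, C1, H1, W1) from the spatiotemporal information parsing unit."""
    b, _, h1, w1 = x_out.shape
    y = torch.cat([x_out, feat], dim=1)        # merge by channel: (batch, C3, H1, W1)
    c3 = y.shape[1]
    z = y.view(b, c3, h1 * w1)                 # f_vector: (batch, C3, N), N = H1*W1
    i_mat = torch.bmm(z, z.transpose(1, 2))    # I = Z Z^T: (batch, C3, C3), inner products
    e = F.softmax(i_mat, dim=-1)               # assumed normalization; values in [0, 1]
    z_prime = torch.bmm(e.transpose(1, 2), z)  # Z' = E^T Z: (batch, C3, N)
    z_dprime = z_prime.view(b, c3, h1, w1)     # f_reshape back to 3-D feature maps
    return z_dprime + y                        # residual; the claim writes O = Z'' + X_out
```

Because E has shape C3×C3, its size does not grow with the spatial resolution H1×W1, which keeps this channel-wise step comparatively cheap.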
  6. The end-to-end video action detection and positioning system according to claim 1, wherein the prediction result output comprises the following steps:
    For each feature point in the image, 3 prediction boxes are generated, and the whole network model is designed with four output layers; therefore, before network training, a clustering algorithm is applied to all bounding boxes in the dataset to generate 12 preset boxes. The coordinate regression generates the final output size of each layer of the model according to the number of predicted categories, [(3×(NumClass+5))×H×W], where NumClass is the number of predicted categories. During training, to adapt to the categories in the current dataset, the following loss function is adopted for category prediction, with loss value loss_c computed as:
    loss_c = -∑ a′*ln(a)
    where a′ represents the true value in the label and a represents the category output value predicted by the model; the loss value of the coordinate loss function, loss_coord, is computed as:
    loss_coord = -y′*log(y) - (1-y′)*log(1-y)
    where y′ represents the true coordinate value in the label and y represents the output value of the coordinate predicted by the model.
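The two losses named in the claim can be written out directly. The sketch below assumes the predicted class scores have already been converted to probabilities and the predicted coordinates squashed into (0, 1); the tensor shapes, function names and the eps clamp are illustrative assumptions, not taken from the application.

```python
import torch

def category_loss(a: torch.Tensor, a_true: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """loss_c = -sum(a' * ln(a)); a are predicted class probabilities,
    a_true the one-hot labels (assumed shape: (num_boxes, NumClass))."""
    return -(a_true * torch.log(a.clamp_min(eps))).sum()

def coordinate_loss(y: torch.Tensor, y_true: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """loss_coord = -y'*log(y) - (1-y')*log(1-y); y are predicted coordinates
    in (0, 1), y_true the encoded ground-truth coordinates."""
    y = y.clamp(eps, 1 - eps)
    return (-(y_true * torch.log(y) + (1 - y_true) * torch.log(1 - y))).sum()
```

The 12 preset boxes mentioned in the claim (3 prediction boxes per feature point across four output layers) would typically be produced by clustering the training bounding boxes beforehand; that clustering step is not shown here.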
PCT/CN2021/116771 2020-12-25 2021-09-06 End-to-end video action detection and positioning system WO2022134655A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011560837.3 2020-12-25
CN202011560837.3A CN113158723B (en) 2020-12-25 2020-12-25 End-to-end video motion detection positioning system

Publications (1)

Publication Number Publication Date
WO2022134655A1 true WO2022134655A1 (en) 2022-06-30

Family

ID=76878004

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/116771 WO2022134655A1 (en) 2020-12-25 2021-09-06 End-to-end video action detection and positioning system

Country Status (2)

Country Link
CN (1) CN113158723B (en)
WO (1) WO2022134655A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158723B (en) * 2020-12-25 2022-06-07 神思电子技术股份有限公司 End-to-end video motion detection positioning system
CN115719508A (en) * 2021-08-23 2023-02-28 香港大学 Video motion detection method based on end-to-end framework and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10200666B2 (en) * 2015-03-04 2019-02-05 Dolby Laboratories Licensing Corporation Coherent motion estimation for stereoscopic video
CN111079646B (en) * 2019-12-16 2023-06-06 中山大学 Weak supervision video time sequence action positioning method and system based on deep learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180285692A1 (en) * 2017-03-28 2018-10-04 Ulsee Inc. Target Tracking with Inter-Supervised Convolutional Networks
CN108830252A (en) * 2018-06-26 2018-11-16 哈尔滨工业大学 A kind of convolutional neural networks human motion recognition method of amalgamation of global space-time characteristic
CN109447014A (en) * 2018-11-07 2019-03-08 东南大学-无锡集成电路技术研究所 A kind of online behavioral value method of video based on binary channels convolutional neural networks
CN110032942A (en) * 2019-03-15 2019-07-19 中山大学 Action identification method based on Time Domain Piecewise and signature differential
CN110059598A (en) * 2019-04-08 2019-07-26 南京邮电大学 The Activity recognition method of the long time-histories speed network integration based on posture artis
CN111259779A (en) * 2020-01-13 2020-06-09 南京大学 Video motion detection method based on central point trajectory prediction
CN113158723A (en) * 2020-12-25 2021-07-23 神思电子技术股份有限公司 End-to-end video motion detection positioning system

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115131710A (en) * 2022-07-05 2022-09-30 福州大学 Real-time action detection method based on multi-scale feature fusion attention
CN115580564A (en) * 2022-11-09 2023-01-06 深圳桥通物联科技有限公司 Dynamic calling device for communication gateway of Internet of things
CN115580564B (en) * 2022-11-09 2023-04-18 深圳桥通物联科技有限公司 Dynamic calling device for communication gateway of Internet of things
CN116189281A (en) * 2022-12-13 2023-05-30 北京交通大学 End-to-end human behavior classification method and system based on space-time self-adaptive fusion
CN116189281B (en) * 2022-12-13 2024-04-02 北京交通大学 End-to-end human behavior classification method and system based on space-time self-adaptive fusion
CN116030189A (en) * 2022-12-20 2023-04-28 中国科学院空天信息创新研究院 Target three-dimensional reconstruction method based on single-view remote sensing image
CN116030189B (en) * 2022-12-20 2023-07-04 中国科学院空天信息创新研究院 Target three-dimensional reconstruction method based on single-view remote sensing image
CN116503406A (en) * 2023-06-28 2023-07-28 中铁水利信息科技有限公司 Hydraulic engineering information management system based on big data
CN116503406B (en) * 2023-06-28 2023-09-19 中铁水利信息科技有限公司 Hydraulic engineering information management system based on big data
CN117788302A (en) * 2024-02-26 2024-03-29 山东全维地信科技有限公司 Mapping graphic processing system
CN117788302B (en) * 2024-02-26 2024-05-14 山东全维地信科技有限公司 Mapping graphic processing system

Also Published As

Publication number Publication date
CN113158723A (en) 2021-07-23
CN113158723B (en) 2022-06-07

Similar Documents

Publication Publication Date Title
WO2022134655A1 (en) End-to-end video action detection and positioning system
CN109389055B (en) Video classification method based on mixed convolution and attention mechanism
Yuan et al. High-order local ternary patterns with locality preserving projection for smoke detection and image classification
Ahmad et al. Human action recognition using shape and CLG-motion flow from multi-view image sequences
Charfi et al. Definition and performance evaluation of a robust SVM based fall detection solution
Avgerinakis et al. Recognition of activities of daily living for smart home environments
CN112434608B (en) Human behavior identification method and system based on double-current combined network
CN110826447A (en) Restaurant kitchen staff behavior identification method based on attention mechanism
CN111488805B (en) Video behavior recognition method based on salient feature extraction
Chenarlogh et al. A multi-view human action recognition system in limited data case using multi-stream CNN
CN111199212B (en) Pedestrian attribute identification method based on attention model
Luo et al. Traffic analytics with low-frame-rate videos
WO2023159898A1 (en) Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium
CN106326851B (en) A kind of method of number of people detection
Hermina et al. A Novel Approach to Detect Social Distancing Among People in College Campus
Sahoo et al. DISNet: A sequential learning framework to handle occlusion in human action recognition with video acquisition sensors
Islam et al. Representation for action recognition with motion vector termed as: SDQIO
Huo et al. 3DVSD: An end-to-end 3D convolutional object detection network for video smoke detection
Tong et al. D3-LND: A two-stream framework with discriminant deep descriptor, linear CMDT and nonlinear KCMDT descriptors for action recognition
Ming Hand fine-motion recognition based on 3D Mesh MoSIFT feature descriptor
Pavlov et al. Application for video analysis based on machine learning and computer vision algorithms
Su et al. A multiattribute sparse coding approach for action recognition from a single unknown viewpoint
Hatipoglu et al. A gender recognition system from facial images using SURF based BoW method
CN116824641A (en) Gesture classification method, device, equipment and computer storage medium
Xia et al. Human action recognition using high-order feature of optical flows

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908658

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.11.2023)