CN114972874A - Three-dimensional human body classification and generation method and system for complex action sequence - Google Patents
Three-dimensional human body classification and generation method and system for complex action sequence
- Publication number
- CN114972874A (application CN202210635201.3A)
- Authority
- CN
- China
- Prior art keywords
- action
- training
- model
- complex
- complex action
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/764 — Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/08 — Neural network learning methods
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V20/41 — Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V40/23 — Recognition of whole body movements, e.g. for sport training
Abstract
The invention discloses a three-dimensional human body classification and generation method and system for complex action sequences, applied to the field of virtual reality. The method comprises: acquiring complex action videos, preprocessing them, and constructing a data set; performing key-point recognition on the data set to obtain human body key points and action-sequence pose information, which serve as the training set; constructing a 3D-geometry-based complex action sequence classification and encoding model, and merging its input and output into one sequence for encoding and decoding training to construct a 3D-geometry-based complex action sequence generation model; and inputting the test set into the model to obtain action sequences for multiple test-set action categories. By encoding standard 3D geometric sequences into geometric parameters that carry time information, the invention not only strengthens the network's learning of how different action categories are distributed in the latent space, but also accurately identifies action types and generates plausible action sequences even for complex actions, improving recognition accuracy and action diversity.
Description
Technical Field

The invention relates to the field of virtual reality, and in particular to a three-dimensional human body classification and generation method and system for complex action sequences.

Background
In the field of 3D human body reconstruction, human motion prediction is a highly challenging task. Based on a CVAE (conditional variational autoencoder), the semantic labels of actions are used as prior conditions and fed into network training together with the action sequences, so that an unlimited number of 3D human action sequences can be generated from the labels, making the sequences look more realistic. Much earlier work starts from motion sequences and uses structured prediction to implicitly model the spatial structure of the human skeleton and apply it to the individual joints. Most current deep learning methods for action recognition use shallow convolutional networks. With convolutional neural networks, features and classifiers can be learned end to end with stochastic gradient descent, reducing the reliance on support vector machines and hand-crafted features. The features are learned directly from the training data through the convolution filters, which brings two main benefits: the feature extractor and classifier parameters are optimized in a simple, convenient end-to-end manner, and the extracted features adapt to the specific attributes of interest. Multi-label convolutional neural networks outperform support vector machines because they learn the relationships between attributes more comprehensively.
3D human action sequence generation aims to generate, under given conditions, a 3D human body that performs the expected action type. Deep-learning-based methods have long dominated this field. They fall roughly into two categories: the first focuses on extracting action features from 3D information to obtain the distributions of different actions in the latent space, and generates complex and diverse action sequences from these distributions; the second uses style transfer, predicting the skeleton from a character given the action data and binding skinning weights to generate a single human action in diverse styles.

Existing small 3D human action data sets (NTU RGB+D, HumanAct12, UESTC, BABEL) are built by capturing human motion sequences with ordinary or depth cameras and then extracting human key points, body shape, pose, and other information through VIBE; even the largest contains only tens of thousands of images. For different scene shooting angles, learning a camera-view conversion function model from feature changes is a complicated process.

The technology for 3D action generation is still at an early stage of development. When there are many action categories, existing models generate only relatively simple action types, or can generate the action sequence of a single action only. In complex action sequences, related actions overlap, which further lowers action accuracy and makes it even harder to generate the corresponding action sequences.

Therefore, how to provide a three-dimensional human body classification and generation method and system for complex action sequences that can accurately recognize and classify actions and generate the corresponding human body sequences even with many complex action types is a problem that those skilled in the art urgently need to solve.
发明内容SUMMARY OF THE INVENTION
有鉴于此,本发明提出了一种复杂动作序列的三维人体分类与生成方法、系统。通过采用VIBE对图片数据进行处理,得到每一帧图片中的人体关键点以及动作序列姿态信息,可以尽可能地减弱镜头扭曲对动作生成的影响;通过对标准三维几何序列采用两个线性全连接层,以及附带时间信息的长短记忆神经网络,得到含有时间信息的几何参数,再将其作为先验输入到基于三维几何的复杂动作序列生成模型中,可以增强网络对不同动作类别在隐空间分布的学习;并且对数据集中的复杂动作进行参数化处理,可以实现在复杂动作的情况下,依然能准确识别动作类型,生成合理的动作序列,提高了识别的准确率和动作的多样性。In view of this, the present invention proposes a three-dimensional human body classification and generation method and system for complex action sequences. By using VIBE to process the picture data, the key points of the human body and the pose information of the action sequence in each frame of the picture can be obtained, which can reduce the influence of lens distortion on the action generation as much as possible; by using two linear full connections for the standard 3D geometric sequence layer, and a long-short-term memory neural network with time information to obtain geometric parameters containing time information, and then input them as a priori into the complex action sequence generation model based on 3D geometry, which can enhance the network’s ability to understand the distribution of different action categories in the latent space. In addition, the parameterized processing of complex actions in the data set can still accurately identify the action type in the case of complex actions, generate a reasonable action sequence, and improve the accuracy of recognition and the diversity of actions.
为了实现上述目的,本发明采用如下技术方案:In order to achieve the above object, the present invention adopts the following technical solutions:
A three-dimensional human body classification and generation method for complex action sequences, comprising:

Step (1): acquiring complex action videos, preprocessing them, and constructing a data set.

Step (2): performing key-point recognition on the data set to obtain human body key points and action-sequence pose information, which serve as the training set.

Step (3): constructing a 3D-geometry-based complex action sequence classification and encoding model that takes the training set as input and outputs geometric parameters containing time information corresponding to the training-set action categories.

Step (4): merging the input and output of the classification and encoding model into one sequence for encoding and decoding training, and constructing a 3D-geometry-based complex action sequence generation model.

Step (5): inputting the test set of actions to be generated into the classification and encoding model to obtain geometric parameters containing time information corresponding to the test-set action categories, and then obtaining action sequences for multiple test-set action categories through the trained generation model.
Optionally, in step (1), preprocessing the complex action videos comprises: editing, truncating frames, converting the video data into picture data, and selecting picture data with rich motion features to construct the data set.

Optionally, in step (2), the key-point recognition is specifically: processing the picture data with VIBE to obtain the human body key points and action-sequence pose information in each frame.
Optionally, in step (3), the 3D-geometry-based classification and encoding model is built on categorical encoding, and the trained complex action sequence classification model is constructed through the following steps:

processing the training set and the corresponding labels with a dataset function to obtain the length of the training set;

iterating over the training set in the data iterator to obtain the tensors of the corresponding 3D geometric models;

migrating the categorical encoding onto the GPU and obtaining geometric parameters containing time information through two linear fully connected layers and an LSTM network that carries temporal information.
Optionally, in step (4), the 3D-geometry-based complex action sequence generation model is built on the transformer model and comprises two stages: encoder training and decoder training.

Optionally, the encoder training is specifically:

combining the training set and the corresponding geometric parameters containing time information into one sequence, passing it through a gated recurrent unit, and feeding it into the transformer encoder to obtain the Gaussian distributions of the different action categories in the latent space;

the decoder training is specifically:

learning the variance and mean of each action category in the latent space, obtaining the 3D human body information and action pose through the transformer decoder, and rendering the parameters with the SMPL model to obtain the complete human body sequence.
Optionally, the training loss function of the transformer-based 3D-geometry complex action sequence generation model is a reconstruction loss between the input geometry and the prediction; the original formula image is not reproduced in this text, but from the variables it presumably takes the form $L = \sum_t \lVert V_t - P_t \rVert_2^2$,

where $V_t$ denotes the 3D geometric information of the input data and $P_t$ denotes the predicted human joint points and human action-pose information.
Optionally, a DropPath module is further introduced while building the transformer-based generation model, so that the encoder training and decoder training stages proceed alternately.
The present invention also provides a three-dimensional human body classification and generation system for complex action sequences, comprising:

an acquisition module, configured to acquire complex action videos, preprocess them, and construct a data set;

a data recognition module, configured to process the picture data with VIBE and perform key-point recognition on the data set to obtain the human body key points and action-sequence pose information in each frame as the training set;

a first construction module, configured to encode and train, based on categorical encoding, the 3D geometric models corresponding to the training set, obtain geometric parameters containing time information, and construct the 3D-geometry-based complex action sequence classification and encoding model;

a second construction module, configured to merge the training set and the corresponding geometric parameters containing time information into one sequence and perform encoding and decoding training based on the transformer model until the training loss function of the generation model converges, thereby constructing the 3D-geometry-based complex action sequence generation model;

an input generation module, configured to input the test set of actions to be generated into the classification and encoding model, obtain geometric parameters containing time information corresponding to the test-set action categories, and obtain action sequences for multiple test-set action categories through the trained generation model.
It can be seen from the above technical solution that, compared with the prior art, the invention discloses a three-dimensional human body classification and generation method and system for complex action sequences. Processing the picture data with VIBE yields the human body key points and action-sequence pose information in every frame, which weakens the influence of lens distortion on action generation as much as possible. Passing the standard 3D geometric sequence through two linear fully connected layers and an LSTM network that carries temporal information yields geometric parameters containing time information; feeding these as a prior into the 3D-geometry-based generation model strengthens the network's learning of how different action categories are distributed in the latent space. Parameterizing the complex actions in the data set makes it possible to accurately recognize action types and generate plausible action sequences even for complex actions, improving recognition accuracy and action diversity.
Description of Drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a schematic flow chart of the method of the present invention.

Figure 2 is a schematic diagram of the single-action-type training process of the present invention.

Figure 3 is a schematic diagram of the system structure of the present invention.
Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Embodiment 1

Embodiment 1 of the present invention discloses a three-dimensional human body classification and generation method for complex action sequences, comprising:
Step (1): acquiring complex action videos and preprocessing them, including editing, truncating frames, converting the video data into picture data, and selecting picture data with rich motion features to construct the data set, specifically:

capturing action videos of multiple action types with a camera;

editing the action videos so that a human body is present in every frame;

performing a frame-truncation operation on each video, converting the video data into picture data, and constructing the data set from pictures with rich motion features.
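A minimal sketch of this frame-extraction step, assuming OpenCV is available; the sampling stride and output layout are illustrative choices, not specified by the invention:

```python
import cv2
from pathlib import Path

def video_to_frames(video_path: str, out_dir: str, stride: int = 2) -> int:
    """Truncate a clip into frames, keeping every `stride`-th frame."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    kept = idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:  # subsample so the data set keeps motion-rich frames manageable
            cv2.imwrite(str(out / f"frame_{kept:05d}.jpg"), frame)
            kept += 1
        idx += 1
    cap.release()
    return kept
```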
Step (2): processing the picture data with VIBE and performing key-point recognition on the data set to obtain the human body key points and action-sequence pose information in each frame as the training set.
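As a sketch of how this step could be driven, assuming the public VIBE demo script is invoked from the VIBE repository root; the flag names follow that repository's demo and may differ across versions, and the clip paths are hypothetical:

```python
import subprocess
from pathlib import Path

def run_vibe(video: Path, out_dir: Path) -> None:
    """Call VIBE's demo script on one clip; the per-frame SMPL pose/shape and
    keypoint estimates it writes become one training sample."""
    subprocess.run(
        ["python", "demo.py", "--vid_file", str(video), "--output_folder", str(out_dir)],
        check=True,  # fail loudly if VIBE cannot process the clip
    )

for clip in sorted(Path("data/clips").glob("*.mp4")):
    run_vibe(clip, Path("output") / clip.stem)
```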
Further, regarding the collection and construction of the data set: data are collected with the rear camera of an ordinary mobile phone. The captured videos must keep the human body fully in view, unoccluded and stationary in place (in-situ actions); if data are collected casually, much of the data is invalid. To solve this problem, videos are captured at high frequency over short periods to ensure that every video action is highly valid. For better experimental results, videos are captured at a resolution of 1080*1920 at an average of 30 frames per second, and the duration of each video is kept to roughly 15-20 seconds. In addition, on top of the 12 basic action types in the HumanAct12 data set, videos of five newly defined complex action types are recorded:
1. Walking, sitting down, drinking water

2. Squatting down, standing up, walking

3. Sitting down, answering the phone

4. Running, jumping

5. Eating, throwing objects
Further, to reduce the scale of the network, the input is simplified to the human body key points and action-sequence poses in the video. Therefore, to improve the quality of the data set, the VIBE method is used to process it in this embodiment.
Step (3): constructing the 3D-geometry-based complex action sequence classification and encoding model based on categorical encoding, specifically:

processing the training set and the corresponding labels with a dataset function to obtain the length of the training set;

iterating over the training set in the data iterator to obtain the tensors of the corresponding 3D geometric models;

migrating the categorical encoding onto the GPU and obtaining geometric parameters containing time information through two linear fully connected layers and an LSTM network that carries temporal information.

Taking the training set as input, the model outputs geometric parameters containing time information corresponding to the training-set action categories, as sketched below.
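A minimal PyTorch sketch of such a classification encoder, following the stated structure of two linear fully connected layers followed by an LSTM; the dimensions, the classification head, and reading the last time step are assumptions, since the text does not fix them (72 would correspond to a 24-joint SMPL pose, 17 to the 12 basic plus 5 complex action types):

```python
import torch
import torch.nn as nn

class ActionClassEncoder(nn.Module):
    """Two linear layers + LSTM: maps a pose sequence to time-aware
    geometric parameters and an action-class score (sizes are assumptions)."""
    def __init__(self, in_dim: int = 72, hid: int = 256, n_classes: int = 17):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, hid), nn.ReLU(),
            nn.Linear(hid, hid), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hid, hid, batch_first=True)
        self.cls = nn.Linear(hid, n_classes)

    def forward(self, poses: torch.Tensor):   # poses: (B, T, in_dim)
        h = self.fc(poses)                    # per-frame features
        geo, _ = self.lstm(h)                 # geometric parameters with time info: (B, T, hid)
        logits = self.cls(geo[:, -1])         # classify from the final time step
        return geo, logits

device = "cuda" if torch.cuda.is_available() else "cpu"   # "migrate onto the GPU"
model = ActionClassEncoder().to(device)
geo, logits = model(torch.randn(2, 60, 72, device=device))
```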
Step (4): merging the input and output of the 3D-geometry-based classification and encoding model into one sequence, performing encoding and decoding training based on the transformer model, and constructing the 3D-geometry-based complex action sequence generation model.

The encoder training is specifically:

combining the training set and the corresponding geometric parameters containing time information into one sequence, passing it through a gated recurrent unit, and feeding it into the transformer encoder to obtain the Gaussian distributions of the different action categories in the latent space.

The decoder training is specifically:

learning the variance and mean of each action category in the latent space, obtaining the 3D human body information and action pose through the transformer decoder, and rendering the parameters with the SMPL model to obtain the complete human body sequence.
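A compact sketch of this encode/decode stage as a transformer variational autoencoder (GRU front end, Gaussian latent, transformer decoder); the layer sizes, the temporal pooling used to form the latent, and the decoder queries are assumptions, and the SMPL rendering step is represented only by the returned pose parameters:

```python
import torch
import torch.nn as nn

class ActionTransformerVAE(nn.Module):
    """GRU -> transformer encoder -> latent Gaussian (mu, logvar);
    sampled latent + transformer decoder -> per-frame pose parameters."""
    def __init__(self, feat=256, latent=256, pose_dim=72, heads=4, layers=2):
        super().__init__()
        self.gru = nn.GRU(feat, feat, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(feat, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.to_mu = nn.Linear(feat, latent)
        self.to_logvar = nn.Linear(feat, latent)
        dec_layer = nn.TransformerDecoderLayer(feat, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, layers)
        self.from_z = nn.Linear(latent, feat)
        self.to_pose = nn.Linear(feat, pose_dim)

    def forward(self, seq: torch.Tensor, queries: torch.Tensor):
        # seq: (B, T, feat) merged sequence; queries: (B, T, feat), e.g. time embeddings
        h, _ = self.gru(seq)
        h = self.encoder(h)
        pooled = h.mean(dim=1)                                  # pool over time (assumption)
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterization
        memory = self.from_z(z).unsqueeze(1)                    # latent as decoder memory
        out = self.decoder(queries, memory)
        return self.to_pose(out), mu, logvar                    # poses would feed the SMPL renderer
```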
The training loss function of the transformer-based 3D-geometry complex action sequence generation model is a reconstruction loss between the input geometry and the prediction; the original formula image is not reproduced in this text, but from the variables it presumably takes the form $L = \sum_t \lVert V_t - P_t \rVert_2^2$,

where $V_t$ denotes the 3D geometric information of the input data and $P_t$ denotes the predicted human joint points and human action-pose information.
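A sketch of this loss under the assumed L2 form, together with the KL term a variational autoencoder usually adds; the text does not show whether a KL term is present, so both it and the weight `beta` are assumptions:

```python
import torch

def train_loss(V, P, mu=None, logvar=None, beta=1e-3):
    """Assumed per-frame reconstruction loss sum_t ||V_t - P_t||^2;
    V, P: (B, T, D). The KL term regularizes the latent Gaussian if given."""
    rec = ((V - P) ** 2).sum(dim=-1).sum(dim=-1).mean()
    if mu is not None:
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        return rec + beta * kl
    return rec
```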
A DropPath module is also introduced while building the transformer-based generation model, so that the encoder training and decoder training stages proceed alternately. Specifically: 1. Join layers are randomly dropped with a certain probability, while ensuring that at least one branch remains connected. 2. Global dropping: one branch is selected at random.
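A common DropPath (stochastic depth) sketch of the kind referred to here, dropping a whole residual branch per sample; the drop probability and where the module wraps the transformer branches are assumptions:

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Zero an entire residual branch per sample during training,
    rescaling the survivors so the expected output is unchanged."""
    def __init__(self, drop_prob: float = 0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.drop_prob == 0.0:
            return x
        keep = 1.0 - self.drop_prob
        shape = (x.shape[0],) + (1,) * (x.dim() - 1)   # broadcast over all but batch
        mask = (torch.rand(shape, device=x.device) < keep).to(x.dtype)
        return x * mask / keep
```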
Step (5): inputting the test set of actions to be generated into the 3D-geometry-based complex action sequence classification and encoding model to obtain geometric parameters containing time information corresponding to the test-set action categories (the latent-space encoding of the model corresponding to those categories), and then obtaining action sequences for multiple test-set action categories through the trained 3D-geometry-based generation model (the transformer variational autoencoder).
In addition, the model can be evaluated by training a detector separately for each individual action type. The evaluation data and results are shown in Table 1. First, the videos of each single action type in the data set are split out and trained separately, and the training results are output; then the recognition accuracy and generated-action accuracy of each action model are tallied to judge whether each sub-model recognizes and generates its corresponding action more accurately.

Table 1. Detection data and results for single actions
Embodiment 2

Embodiment 2 of the present invention discloses a three-dimensional human body classification and generation system for complex action sequences, comprising:

an acquisition module, configured to acquire complex action videos, preprocess them, and construct a data set;

a data recognition module, configured to process the picture data with VIBE and perform key-point recognition on the data set to obtain the human body key points and action-sequence pose information in each frame as the training set;

a first construction module, configured to encode and train, based on categorical encoding, the 3D geometric models corresponding to the training set, obtain geometric parameters containing time information, and construct the 3D-geometry-based complex action sequence classification and encoding model;

a second construction module, configured to merge the training set and the corresponding geometric parameters containing time information into one sequence and perform encoding and decoding training based on the transformer model until the training loss function of the generation model converges, thereby constructing the 3D-geometry-based complex action sequence generation model;

an input generation module, configured to input the test set of actions to be generated into the classification and encoding model, obtain geometric parameters containing time information corresponding to the test-set action categories (the latent-space encoding corresponding to those categories), and obtain action sequences for multiple test-set action categories through the trained generation model.
In addition, the system comprises a single-action detection module, configured to train a detector separately for each individual action type.

The invention discloses a three-dimensional human body classification and generation method and system for complex action sequences. Processing the picture data with VIBE yields the human body key points and action-sequence pose information in every frame, which weakens the influence of lens distortion on action generation as much as possible. Passing the standard 3D geometric sequence through two linear fully connected layers and an LSTM network that carries temporal information yields geometric parameters containing time information; feeding these as a prior into the 3D-geometry-based generation model strengthens the network's learning of how different action categories are distributed in the latent space. Parameterizing the complex actions in the data set makes it possible to accurately recognize action types and generate plausible action sequences even for complex actions, improving recognition accuracy and action diversity.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed therein, its description is relatively brief, and the relevant parts can be found in the description of the method.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210635201.3A | 2022-06-07 | 2022-06-07 | Three-dimensional human body classification and generation method and system for complex action sequence |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210635201.3A | 2022-06-07 | 2022-06-07 | Three-dimensional human body classification and generation method and system for complex action sequence |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN114972874A | 2022-08-30 |
Family
ID=82958738
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210635201.3A | Three-dimensional human body classification and generation method and system for complex action sequence | 2022-06-07 | 2022-06-07 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114972874A |

- 2022-06-07: Application CN202210635201.3A filed in China; published as CN114972874A, status Pending
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |