WO2022241583A1 - 一种基于多目视频的家庭场景动作捕捉方法 - Google Patents

一种基于多目视频的家庭场景动作捕捉方法

Info

Publication number
WO2022241583A1
Authority
WO
WIPO (PCT)
Prior art keywords
motion
key points
key point
key
human
Prior art date
Application number
PCT/CN2021/093969
Other languages
English (en)
French (fr)
Inventor
蔡洪斌
卢光辉
李一帆
王涵
卢平悦
黄娅婷
范云翼
王博洋
Original Assignee
电子科技大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 电子科技大学 filed Critical 电子科技大学
Priority to PCT/CN2021/093969 priority Critical patent/WO2022241583A1/zh
Publication of WO2022241583A1 publication Critical patent/WO2022241583A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects

Definitions

  • The invention belongs to the technical field of motion capture, and in particular relates to a family scene motion capture method based on multi-view video.
  • The multi-view video-based family scene motion capture technology involved in the present invention can capture the motion of family members in real time and generate three-dimensional virtual character animations, thereby protecting user privacy, providing viewers with multiple viewing angles, and enhancing the sense of immersion.
  • Human body motion capture technology is widely used in film and television, games, animation and other fields. The technology captures the motion characteristics of a real human body, drives a virtual character model, and generates 3D animation.
  • Optical human motion capture can be divided into marker-based and video-based approaches.
  • Marker-based motion capture requires the subject to wear specific sensors or markers that reflect infrared light so that key point information of the human body can be collected. The equipment for such methods is expensive and unsuitable for motion capture in daily life.
  • Video-based human motion capture requires no wearable equipment; the spatial positions of human key points are computed from the image sequences captured by multiple calibrated cameras, and the human pose is recovered.
  • Compared with monocular video, motion capture based on multi-view video is more robust to depth ambiguity and occlusion, and better meets the technical requirements of this patent.
  • The SMPL model (Skinned Multi-Person Linear model) is a parametric human body model that contains a large number of human body priors.
  • The SMPL model defines human body shape and pose through 10 shape parameters and 72 pose parameters.
  • Using the SMPL model, an objective function measuring the distance between the pose features extracted from video and the parametric human model features can be established, transforming the motion capture problem into an objective-function minimization problem.
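  • As an illustration of the parametric model described above, the following sketch instantiates an SMPL model and reads out its 3D joints; it assumes the third-party smplx Python package and a local model file, neither of which the patent prescribes.

```python
# Minimal sketch, assuming the third-party `smplx` package and a local SMPL model file.
import torch
import smplx

# 10 shape parameters (beta) and 72 pose parameters (theta = 3 root + 69 body axis-angle values).
betas = torch.zeros(1, 10)          # body shape
global_orient = torch.zeros(1, 3)   # root orientation (part of the 72 pose parameters)
body_pose = torch.zeros(1, 69)      # remaining 23 joints x 3 axis-angle values

model = smplx.create("models/smpl", model_type="smpl", gender="neutral")  # path is hypothetical
output = model(betas=betas, global_orient=global_orient, body_pose=body_pose)

joints = output.joints.detach()      # 3D joint positions, used when fitting to 2D detections
vertices = output.vertices.detach()  # mesh vertices for rendering the virtual character
print(joints.shape, vertices.shape)
```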
  • The present invention provides a family scene motion capture method based on multi-view video, which aims to generate real-time animation of the family scene using motion capture technology and to remain robust under occlusion.
  • The present invention comprises the following main steps:
  • Step 1, camera placement: place multiple calibrated cameras in the home to be monitored and obtain multi-angle video of the home in real time.
  • Step 2, family scene model construction and annotation: create a 3D virtual scene model based on the real family scene and add the necessary annotations to it.
  • This step includes:
  • Step 2.1, build a 3D model of the family scene to be monitored.
  • Step 2.2, annotate functional areas such as common walking passages and sitting areas in the 3D scene. In fixed functional areas such as sofas, tables and chairs, define the facial orientation of a person performing routine actions; these annotations assist the generation of common behavior animations.
  • Step 2.3, establish a family member action database: based on the parametric human body model SMPL, pre-create the models of each family member, a standard guest appearance model, and common action animations such as walking, standing and sitting (one possible data layout is sketched below).
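  • To make the annotations of steps 2.2-2.3 concrete, the following minimal sketch stores walking passages, sitting areas, per-furniture facing directions, and preset action clips; all type and field names are illustrative assumptions rather than the patent's data format.

```python
# Minimal sketch of the scene annotation and action database of step 2; all names are illustrative.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class FunctionalArea:
    name: str                                  # e.g. "sofa", "dining_chair", "walk_corridor"
    kind: str                                  # "walking_passage" or "sitting_area"
    polygon: np.ndarray = None                 # (K, 2) floor-plan outline in scene coordinates
    face_direction: np.ndarray = None          # unit vector: facing used for routine actions there

@dataclass
class ActionClip:
    name: str                                  # e.g. "walk", "stand", "sit"
    frames: np.ndarray = None                  # (T, 72) SMPL pose parameters per animation frame

@dataclass
class SceneAnnotation:
    areas: list = field(default_factory=list)          # FunctionalArea entries
    action_db: dict = field(default_factory=dict)      # action name -> ActionClip
    member_shapes: dict = field(default_factory=dict)  # member name -> 10 SMPL shape parameters
```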
  • Step 3, human body 2D key point detection: detect the 2D key point coordinates and PAFs (Part Affinity Fields) of the human bodies in the multi-view video.
  • This step includes:
  • Step 3.1, feed the current frame of each view into the OpenPose convolutional neural network to obtain the confidence map set S = (S_1, S_2, ..., S_J) and the PAF set L = (L_1, L_2, ..., L_C), where J is the number of key points in a single human skeleton, S_j (j ∈ {1,...,J}) is the confidence map of class-j key points, C is the number of bones in a single human skeleton, and L_c (c ∈ {1,...,C}) is the PAF of class-c bones.
  • Step 3.2, use non-maximum suppression to find in S_j the set of heat maps of all class-j key points, where the m-th heat map corresponds to the class-j key point of the m-th person in the scene, M is the number of people in the scene, and m ∈ {1,...,M}.
  • Step 3.3, compute the coordinates of the maximum point of each such heat map; these are the 2D coordinates of the class-j key point of the m-th person in the scene.
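  • A minimal sketch of steps 3.2-3.3: assuming the per-class confidence maps are available as NumPy arrays, peak extraction by local-maximum filtering is one common way to realise the non-maximum suppression described above (the window size and threshold are assumed values).

```python
# Minimal sketch: extract per-person 2D key point candidates from one confidence map S_j.
import numpy as np
from scipy.ndimage import maximum_filter

def keypoint_candidates(S_j, thresh=0.1, window=5):
    """Return (x, y, score) peaks of one confidence map (steps 3.2-3.3)."""
    local_max = maximum_filter(S_j, size=window) == S_j   # non-maximum suppression
    peaks = np.argwhere(local_max & (S_j > thresh))        # (row, col) of surviving maxima
    return [(int(c), int(r), float(S_j[r, c])) for r, c in peaks]

# Usage: S is the list of J confidence maps for one view; candidates[j] then holds up to M peaks.
# candidates = [keypoint_candidates(S_j) for S_j in S]
```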
  • Step 4, human skeleton assembly: assemble the detected multi-person 2D key points into multiple human 2D skeletons, and establish the associations between key points across different views as well as between the current frame and the previous frame.
  • This step includes:
  • Step 4.1, construct the initial key point association graph G: G = (V, E), V = D_j(c) ∪ D_t-1, E = E_P ∪ E_V ∪ E_T (1), where V is the vertex set of graph G and E is the edge set of graph G. D_j(c) denotes the candidate class-j key points in view c of the current frame t, with j ∈ {1,...,J}, c ∈ {1,...,N} and N the number of cameras. D_t-1 denotes the skeleton 3D key points obtained in frame t-1; if frame t-1 does not exist, this term is ignored.
  • In graph G, within the same view, key points of different classes in the human skeleton are pairwise connected by edges, denoted E_P.
  • Across different views, key points of the same class are pairwise connected by edges, denoted E_V.
  • In each view, every key point is connected to all key points of the same class in D_t-1, denoted E_T; if frame t-1 does not exist, this term is ignored.
  • The initial key point association graph G is shown in Figure 2; for clarity, only two views and two key point classes are drawn (one way to build G is sketched after step 4.2).
  • Step 4.2, solve the initial key point association graph G to obtain the real key point association graph G' that correctly represents the key point associations: G' = (V, E'), V = D_j(c) ∪ D_t-1, E' = E'_P ∪ E'_V ∪ E'_T (2). In G', key points in the same view are connected by the edges of the real human skeleton, denoted E'_P; across views, same-class key points of the same person are connected, denoted E'_V; in each view, every key point is connected to the same-class key point of the same person in D_t-1, denoted E'_T. Steps 4.1-4.10 constitute the process of solving for G'. The real key point association graph G' is shown in Figure 3; for clarity, only two views and two key point classes are drawn.
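  • The following sketch shows one way to assemble the initial association graph G of step 4.1 with the three edge sets E_P, E_V and E_T; the node keys and containers are illustrative assumptions rather than the patent's exact representation.

```python
# Minimal sketch of the initial key point association graph G of step 4.1.
import itertools
import networkx as nx

def build_initial_graph(candidates, prev_points_3d=None):
    """candidates[c][j]: list of 2D candidates of key point class j seen in view c;
    prev_points_3d[j]: list of 3D key points of class j from frame t-1 (may be None)."""
    G = nx.Graph()
    for c, per_class in enumerate(candidates):
        for j, pts in enumerate(per_class):
            for m, p in enumerate(pts):
                G.add_node(("2d", c, j, m), pos=p)
    if prev_points_3d is not None:
        for j, pts in enumerate(prev_points_3d):
            for k, p in enumerate(pts):
                G.add_node(("3d", j, k), pos=p)
    for u, v in itertools.combinations(G.nodes, 2):
        if u[0] == "2d" and v[0] == "2d":
            same_view, same_class = u[1] == v[1], u[2] == v[2]
            if same_view and not same_class:
                G.add_edge(u, v, kind="E_P")    # same view, different key point classes
            elif (not same_view) and same_class:
                G.add_edge(u, v, kind="E_V")    # different views, same key point class
        elif u[0] != v[0]:                      # one current-frame candidate, one frame t-1 point
            j_2d = u[2] if u[0] == "2d" else v[2]
            j_3d = u[1] if u[0] == "3d" else v[1]
            if j_2d == j_3d:
                G.add_edge(u, v, kind="E_T")    # candidate <-> same-class 3D point of frame t-1
    return G
```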
  • Step 4.3, assign weights to the E_P edges of graph G according to formulas (3) and (4) of the description, where L_c(x) denotes the PAF value at point x and x(u) denotes an interpolation point on the line segment connecting the two candidate key points of the edge (a sketch of the PAF score follows below).
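  • A minimal sketch of the PAF-based weight of step 4.3: the score of an E_P edge is approximated by sampling the part affinity field along the segment between the two candidates and averaging its projection onto the segment direction (the number of interpolation points is an assumed detail).

```python
# Minimal sketch of the PAF line-integral score used to weight E_P edges (step 4.3).
import numpy as np

def paf_edge_score(p1, p2, L_c, num_samples=10):
    """p1, p2: 2D candidate key points (x, y); L_c: (H, W, 2) part affinity field of bone class c."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    d = p2 - p1
    norm = np.linalg.norm(d)
    if norm < 1e-6:
        return 0.0
    d = d / norm
    scores = []
    for u in np.linspace(0.0, 1.0, num_samples):       # interpolation points x(u) on the segment
        x, y = np.round(p1 + u * (p2 - p1)).astype(int)
        scores.append(float(L_c[y, x] @ d))            # projection of the PAF onto the bone direction
    return float(np.mean(scores))
```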
  • Step 4.4, assign weights to the E_V edges of graph G according to formulas (5) and (6), where K_c denotes the intrinsic parameter matrix of camera c. The weight is derived from the distance between the line through the optical center of camera c_1 and its candidate key point and the line through the optical center of camera c_2 and its candidate key point; Z is a normalization coefficient that normalizes the weight to [0, 1].
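  • For the E_V weight of step 4.4, the affinity between two candidates seen from different views can be derived from the distance between their back-projected camera rays; the sketch below assumes known intrinsics K and camera poses (R, t) and is only one possible reading of the weighting described above.

```python
# Minimal sketch: distance between the back-projected rays of two cross-view candidates (step 4.4).
import numpy as np

def back_projected_ray(K, R, t, pixel):
    """Ray through the camera optical center and the pixel, in world coordinates (model x = K(RX + t))."""
    center = -R.T @ t                                    # camera optical center
    d_cam = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    d_world = R.T @ d_cam
    return center, d_world / np.linalg.norm(d_world)

def ray_ray_distance(o1, d1, o2, d2):
    """Shortest distance between two (possibly skew) lines given by origin and unit direction."""
    n = np.cross(d1, d2)
    n_norm = np.linalg.norm(n)
    if n_norm < 1e-9:                                    # parallel rays
        return float(np.linalg.norm(np.cross(o2 - o1, d1)))
    return float(abs((o2 - o1) @ n) / n_norm)

# The E_V edge weight can then be set to, e.g., max(0, 1 - distance / Z), with Z the normalization constant.
```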
  • Step 4.5, assign weights to the E_T edges of graph G according to formula (7). The weight is derived from the distance between the line through the camera optical center and the current class-i candidate key point and the corresponding class-i 3D key point candidate of frame t-1; T is a normalization coefficient that normalizes the weight to [0, 1].
  • Step 4.6, compute the human bone bundles. A human bone bundle denotes the subgraph of the real key point association graph G' composed of the class-i and class-j key points of the m-th person. A bone bundle is shown in Figure 4.
  • This step includes:
  • Step 4.6.1, in the initial key point association graph G, denote by G_ij the subgraph composed of all class-i key points and all class-j key points. In a multi-person scene, G_ij contains several human bone bundles. Among all candidate bone bundles generated from G_ij, the bundle g_c that maximizes objective function (10) is selected as a real bone bundle, where q(z) = p(z)·z, |V_c| denotes the number of points in g_c, and w_p, w_m, w_t, w_v are weight coefficients.
  • Step 4.6.2, remove g_c from G_ij and repeat step 4.6.1 until G_ij is empty.
  • Step 4.7, following step 4.6, traverse all bones of the human body to obtain the set B of human bone bundles.
  • Step 4.8, sort the human bone bundles in B by their formula (10) scores in descending order to form a queue Q.
  • Step 4.9, initialize the real key point association graph G' to be empty, with no edges or person labels assigned yet.
  • Step 4.10, pop the bone bundle at the head of queue Q. When it is added to G', all key points d it contains should be assigned the label of the same person. If two of its key points d_i, d_j have already been assigned different person labels in G', the bundle conflicts with G'. Check whether the bundle conflicts with G': a. if there is a conflict, split the bundle into bone bundles of different persons according to the person labels in G', compute the new bundle scores according to formula (10), and push them back into queue Q; b. if there is no conflict, add the bundle to G' and assign its key points the corresponding person label (the control flow is sketched below).
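  • Steps 4.8-4.10 amount to a greedy assembly: bone bundles are processed in descending score order, merged into G' when their key points do not contradict existing person labels, and otherwise split by label and re-queued. The sketch below captures that control flow with simplified data structures; the scoring and splitting functions are assumed to be supplied by the surrounding system.

```python
# Minimal sketch of the greedy bone-bundle assembly of steps 4.8-4.10.
import heapq

def assemble_people(bundles, score_fn, split_fn):
    """bundles: iterable of sets of key point ids; score_fn(b) -> float;
    split_fn(b, labels) -> list of sub-bundles split according to existing person labels."""
    labels = {}                                    # key point id -> person label (stored in G')
    next_person = 0
    queue = [(-score_fn(b), i, b) for i, b in enumerate(bundles)]
    heapq.heapify(queue)                           # queue Q, highest score first
    counter = len(queue)
    while queue:
        _, _, b = heapq.heappop(queue)
        persons = {labels[d] for d in b if d in labels}
        if len(persons) > 1:                       # conflict: bundle spans several labelled persons
            for sub in split_fn(b, labels):        # split by label, re-score, re-queue
                heapq.heappush(queue, (-score_fn(sub), counter, sub))
                counter += 1
            continue
        if persons:
            person = persons.pop()
        else:                                      # bundle starts a new person
            person = next_person
            next_person += 1
        for d in b:
            labels[d] = person                     # merge the bundle into G' under one person label
    return labels
```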
  • Step 5, reconstruction of actions already in the action database: for recognizable common actions, directly invoke the preset action animations in the action database to save computation.
  • This step includes:
  • Step 5.1, use the collected image sequence and 2D skeleton information to identify the identity and action of the current person.
  • Step 5.2, determine whether the current action is stored in the action database. If it is, generate the character animation with steps 5.3 and 5.4; otherwise, go to step 6.
  • Step 5.3, based on triangulation, compute the 3D coordinates of the root key point from its image coordinates acquired by two calibrated cameras (a triangulation sketch is given below, after step 5.5).
  • Step 5.4, align the root node of the character model in the initial frame of the database animation with the 3D coordinates computed in step 5.3, and determine the rotation of the root node with the help of the facial-orientation annotations of step 2.2. Then play the animation from the action database.
  • For walking-type actions, the same method can be used to compute the position of the root node at the end of the action, and the walking-passage annotations of step 2.2 determine the path of the motion.
  • Step 5.5, if a switch of the person's action is detected, return to step 5.2.
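  • Step 5.3 triangulates the root key point from two calibrated views. The following minimal sketch uses OpenCV's cv2.triangulatePoints as one possible implementation; the patent only specifies triangulation, not this particular routine.

```python
# Minimal sketch of step 5.3: triangulate the root key point from two calibrated cameras.
import numpy as np
import cv2

def triangulate_root(P1, P2, uv1, uv2):
    """P1, P2: 3x4 projection matrices K[R|t] of the two cameras; uv1, uv2: root key point pixels."""
    pts1 = np.asarray(uv1, float).reshape(2, 1)
    pts2 = np.asarray(uv2, float).reshape(2, 1)
    X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)   # homogeneous 4x1 result
    return (X_h[:3] / X_h[3]).ravel()                 # 3D coordinates of the root key point
```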
  • Step 6, real-time motion reconstruction: if the current action is not stored in the action database, fit the 3D model to the 2D human skeleton to reconstruct the person's 3D motion in real time.
  • This step includes:
  • Step 6.1, according to the identity recognized in step 5.1, retrieve the parametric human body model of the corresponding family member from the database. Fit the parametric human model to the motion of the 2D human skeleton assembled in step 4 by minimizing objective function (11): E(β, θ) = λ_J E_J + λ_shape E_shape + λ_temp E_temp + λ_θ E_θ, where λ_J, λ_shape, λ_temp, λ_θ are weight parameters. If the current person is a family member, keep the model's initial shape parameters β and optimize only the pose parameters θ. If the current person is a guest, optimize both the shape parameters β and the pose parameters θ in the first frame, and only the pose parameters θ in subsequent frames (a fitting sketch is given after the term definitions below).
  • a. E_J is the joint distance penalty term. For a single person, η_i,c denotes the confidence score of the person's class-i key point in view c, R_θ(J(β)_i) denotes the 3D coordinates of the class-i key point of the SMPL model, which is projected onto the image plane of camera c to obtain its 2D coordinates, J_i,c denotes the 2D coordinates of the class-i key point observed in view c, and ρ(·) is the Geman-McClure penalty function.
  • b. E_shape is the shape penalty term. For a single person, l_i,t denotes the length of the class-i bone in the current frame t, its prior mean length is computed from the person's first five frames, and C denotes the set of human bones.
  • c. E_temp is the temporal smoothing term, where α is a weight parameter, Δv_j,t denotes the forward motion trend of joint j at frame t, Δv_j,t = R_θ(J(β))_j,t-1 − R_θ(J(β))_j,t-2, and θ_i,t denotes the pose parameters of the class-i bone in frame t.
  • d. E_θ is the pose prior penalty term, where ∑_j g_j N(θ; μ_θ,j, Σ_θ,j) is a prior Gaussian mixture model over the pose parameters θ built from the CMU MoCap dataset.
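  • The following sketch mirrors the structure of objective (11) for one frame: a confidence-weighted reprojection term E_J with a Geman-McClure robustifier, a bone-length term E_shape, and a temporal term E_temp written here as a deviation from a constant-velocity prediction (one reading of the Δv trend above). The GMM pose prior E_θ is omitted for brevity; the joint regressor, projection function, weights and robustifier scale are placeholders or assumed constants rather than the patent's values.

```python
# Minimal sketch of an objective-(11) style fitting loss for one frame (step 6.1); helpers are assumed.
import torch

def geman_mcclure(r, sigma=100.0):
    """rho(.): robust penalty of a residual magnitude r (sigma is an assumed constant)."""
    return (r ** 2) / (r ** 2 + sigma ** 2)

def fitting_loss(theta, beta, joints_3d_fn, project_fn, obs_2d, conf,
                 bone_len_fn, bone_len_prior, prev_joints, prev_prev_joints,
                 lam_J=1.0, lam_shape=1.0, lam_temp=1.0):
    joints = joints_3d_fn(beta, theta)                  # R_theta(J(beta)): model 3D key points
    E_J = joints.new_zeros(())
    for c in range(len(obs_2d)):                        # sum over camera views
        proj = project_fn(joints, c)                    # 2D projection into view c
        r = torch.linalg.norm(proj - obs_2d[c], dim=-1)
        E_J = E_J + (conf[c] * geman_mcclure(r)).sum()  # confidence-weighted robust reprojection
    E_shape = ((bone_len_fn(joints) - bone_len_prior) ** 2).sum()
    v_trend = prev_joints - prev_prev_joints            # Delta v_{j,t}
    E_temp = ((joints - (prev_joints + v_trend)) ** 2).sum()
    return lam_J * E_J + lam_shape * E_shape + lam_temp * E_temp

# theta (a (72,) tensor with requires_grad=True) can then be optimised with torch.optim.Adam on this loss.
```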
  • Step 7, judge and handle occlusion during real-time motion reconstruction, i.e., the cases in which human key points are occluded so that 2D key points cannot be detected or are detected incorrectly.
  • This step includes:
  • Step 7.1, if the 2D human skeleton assembled in step 4 is incomplete in every view, or the confidence of some detected key points is lower than a preset threshold T in every view, those key points of the human body are considered occluded or in a blind spot of the cameras (a sketch is given after step 7.3).
  • Step 7.2, for occlusions lasting only a few consecutive frames, increase the weight coefficient λ_temp of the occluded key points in formula (11) during the real-time reconstruction of step 6, strengthening the dependence of the current 3D key point estimate on the key points of the previous frame.
  • Step 7.3, for occlusions lasting many consecutive frames, especially long-term occlusion of specific key points, the handling of step 7.2 tends to accumulate errors. In such cases the person is usually in a relatively static state; for example, the lower-body key points are occluded while sitting at a table. Based on the image recognition result, the standard pose model closest to the current pose (e.g., standard sitting, standing or lying pose) and its pose parameters θ are then retrieved from the action database.
  • Here ω_j denotes the axis-angle rotation of key point j in the skeletal joint chain relative to its parent key point.
  • According to formula (11), the parameters θ of the standard pose model are used as the initial values for pose regression; during regression only the parameters ω of high-confidence key points are optimized, while occluded key points keep their original parameters ω.
  • Fig. 1 shows the flow of the multi-view video-based family scene motion capture method of the present invention;
  • Fig. 2 shows an example of the initial key point association graph G in an embodiment of the present invention;
  • Fig. 3 shows an example of the real key point association graph G' in an embodiment of the present invention;
  • Fig. 4 shows an example of the bone bundle definition in an embodiment of the present invention.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present invention provides a family scene motion capture method based on multi-view video, which can perform multi-person motion capture in a home environment and help users interact with a remote family through electronic devices. The method comprises camera placement, family scene model construction and annotation, human body 2D key point detection, human skeleton assembly, reconstruction of actions already in the action database, real-time motion reconstruction, and judgment and handling of occlusion. Camera placement is the preparation for acquiring multi-angle video of the home. Scene model construction and annotation provide motion constraints and prior information for subsequent motion reconstruction. During actual motion reconstruction, the method uses human body 2D key point detection to determine the 2D coordinates of all human key points in a multi-person scene. Human skeleton assembly then connects the correct 2D key points into single-person 2D skeletons and establishes the associations between multi-view 2D skeleton points and the 3D skeleton points of the previous frame, providing temporal and spatial information for 3D key point prediction. Reconstruction of actions already in the action database exploits the relatively simple behaviour of people in home scenes and reduces the number of real-time reconstructions by using predefined character action animations. Real-time motion reconstruction handles actions not in the database by fitting the 3D model to the 2D key points, and finally presents the person's current 3D pose with the 3D model. Finally, the method also detects and corrects occlusion, reducing the reconstruction errors that occur when human key points are occluded, which makes the method more robust in family scenes. The invention can effectively adapt to multi-person motion capture in family scenes and, while preserving family privacy, provides users with a technical means of presenting a remote home scene locally.

Description

一种基于多目视频的家庭场景动作捕捉方法 技术领域
本发明属于动作捕捉的技术领域,尤其涉及一种基于多目视频的家庭场景动作捕捉方法。
背景技术
随着我国老龄化日渐严重,空巢老人越来越多。通过技术手段,将远程子女的家庭情景呈现在本地,可缓解独居老人内心的孤独感。然而,以家庭视频监控为代表的相关技术虽然实施简单,却有容易泄露家庭隐私、视角单一、缺乏沉浸感的缺点。本发明涉及的基于多目视频的家庭场景动作捕捉技术可实时捕获家庭成员动作信息,生成三维虚拟人物动画,从而保护使用者隐私,并为观看者提供多种观看视角,增强浸入感。
人体动作捕捉技术被广泛应用于影视、游戏、动画等领域。该技术通过捕捉真实人体的动作特征,驱动虚拟人物模型,产生三维动画。光学式人体动作捕捉技术可被分为基于标记点的人体动作捕捉技术和基于视频的人体动作捕捉技术。基于标记点的人体捕捉技术需要人体佩戴特定的传感器或可反射红外激光的光标,以采集人体关键点信息。但此类方法设备造价昂贵,且不适合日常生活中的动作捕捉。基于视频的人体动作捕捉技术无需佩戴设备,可根据多个标定相机拍摄的图像序列计算出人体关键点的空间位置,恢复出人体姿态。相比于基于单目视频的动作捕捉,基于多目视频的动作捕捉对深度歧义和遮挡问题有更好的鲁棒性,更符合本专利的技术需求。
SMPL模型(Skinned Multi-Person Linear model)是包含大量人体先验的人体参数化模型。SMPL模型通过10个形状参数和72个姿势参数对人体体态和姿势进行定义。利用SMPL模型,可以建立从视频提取的姿态特征和人体参数模型特征之间距离的目标函数,将动作捕捉问题转化为目标函数最小化问题。
发明内容
本发明提供了一种基于多目视频的家庭场景动作捕捉方法,旨在利用动作捕捉技术,生成家庭场景的实时动画,并在遮挡情景下具有鲁棒性。本发明包括以下主要步骤:
步骤1,相机放置,在待检测家庭中放置多个标定相机,实时获取家庭的多 角度视频。
步骤2,家庭场景模型构建及标注,根据真实家庭场景创建三维虚拟场景模型,对三维虚拟场景进行必要标注。
本步骤包括:
步骤2.1,对待检测家庭场景进行三维建模。
步骤2.2,在三维场景中标注常用行走通道、可坐区域等功能区。并在沙发、桌椅等固定的功能区中,对人物进行常规动作时的面部朝向进行定义,用来辅助人物常见行为动画的生成。
步骤2.3,建立家庭成员动作数据库,基于参数化人体模型SMPL预先创建各家庭成员模型、客人标准样貌模型,以及常见动作动画,如行走、站立、静坐等。
步骤3,人体2D关键点检测,检测多目视频中的人体2D关键点坐标和PAF(PartAffinityField)。
本步骤包括:
步骤3.1,将各角度视频的当前帧输入OpenPose卷积神经网络,得到置信度图集合S=(S 1,S 2,...,S J)和PAF集合L=(L 1,L 2,...,L C)。
其中J表示单个人体骨架中关键点个数,
Figure PCTCN2021093969-appb-000001
表示第j类关键点的置信度图,其中j∈{1,...,J}。C表示单个人体骨架中骨骼的个数,
Figure PCTCN2021093969-appb-000002
表示第c类骨骼的PAF,其中c∈{1,...,C}。
步骤3.2,利用非极大值抑制算法,找出S j中所有第j类关键点的热图集合
Figure PCTCN2021093969-appb-000003
Figure PCTCN2021093969-appb-000004
其中,
Figure PCTCN2021093969-appb-000005
表示场景中第m个人的第j类关键点的热图,M为场景中人物个数,m∈{1,...,M}。
步骤3.3,计算
Figure PCTCN2021093969-appb-000006
中最大值点的坐标
Figure PCTCN2021093969-appb-000007
即为场景中第m个人的第j类关键点2D坐标。
步骤4,人体骨架组装,对检测到的多人2D关键点进行组装,形成多组人体2D骨架,并建立不同视角中关键点之间的联系,以及当前帧与前一帧关键点之间的联系。
本步骤包括:
步骤4.1,构建初始关键点关联图G:
G=(V,E),V=D j(c)∪D t-1,E=E P∪E V∪E T  (1)
其中,V为图G的点集,E为图G的边集。
Figure PCTCN2021093969-appb-000008
表示在当前帧t中,视角c里第j类关键点中的第m个候选点,j∈{1,2,...,J},c∈{1,2,...,N},N为相机个数。D t-1表示t-1帧求出的骨骼3D关键点,若不存在t-1帧,则忽略这一项。在图G中,同一视角里,人体骨架中不同类的关键点两两之间有边相连,用E P表示。不同视角中,人体骨架相同类的关键点两两之间有边相连,用E V表示。每个视角中,每个关键点与D t-1中所有相同类的关键点相连,用E T表示,若不存在t-1帧,则忽略这一项。。
步骤4.2,目标是对初始关键点关联图G求解,得到能够正确表示关键点联系的真实关键点关联图G’:
G’=(V,E’),V=D j(c)∪D t-1,E’=E’ p∪E’ v∪E’ T  (2)
其中,在G’中,同一视角中的关键点以真实人体骨架对应的边相连,用E’ p表示;不同视角中,同一人物的同类关键点以边相连,用E’ v表示;每个视角中,每个关键点与D t-1中同一人物的同类关键点相连,用E’ T表示。步骤4.1-4.10即对G’求解过程。
步骤4.3,对图G的边
Figure PCTCN2021093969-appb-000009
进行权重赋值:
Figure PCTCN2021093969-appb-000010
Figure PCTCN2021093969-appb-000011
其中,
Figure PCTCN2021093969-appb-000012
表示以
Figure PCTCN2021093969-appb-000013
Figure PCTCN2021093969-appb-000014
为顶点的边。
Figure PCTCN2021093969-appb-000015
表示
Figure PCTCN2021093969-appb-000016
在G’中保留,
Figure PCTCN2021093969-appb-000017
表示
Figure PCTCN2021093969-appb-000018
在G’中不保留。L c(x)表示点x处的PAF值。x(u)表示在关键点
Figure PCTCN2021093969-appb-000019
Figure PCTCN2021093969-appb-000020
连线上的插值点。
步骤4.4,对图G的边
Figure PCTCN2021093969-appb-000021
进行权重赋值:
Figure PCTCN2021093969-appb-000022
Figure PCTCN2021093969-appb-000023
其中,
Figure PCTCN2021093969-appb-000024
K c表示相机c的内参矩阵,
Figure PCTCN2021093969-appb-000025
表示关键点在相机坐标系的坐标,
Figure PCTCN2021093969-appb-000026
表示相机c 1光心和
Figure PCTCN2021093969-appb-000027
所在的直线与相机c 2光心和
Figure PCTCN2021093969-appb-000028
所在的直线之间的直线距离。Z为归一化系数,将
Figure PCTCN2021093969-appb-000029
归一化到[0,1]。
步骤4.5,对图G的边
Figure PCTCN2021093969-appb-000030
进行权重赋值:
Figure PCTCN2021093969-appb-000031
其中,
Figure PCTCN2021093969-appb-000032
表示t-1帧第i类关键点的第k个候选点,
Figure PCTCN2021093969-appb-000033
为相机光心和
Figure PCTCN2021093969-appb-000034
两点所在直线与
Figure PCTCN2021093969-appb-000035
之间的距离,T为归一化系数,将
Figure PCTCN2021093969-appb-000036
归一化到[0,1]。
步骤4.6,计算人体骨骼束,人体骨骼束
Figure PCTCN2021093969-appb-000037
表示在真实关键点关联图G’中由第m个人的第i类和第m个人的第j类关键点组成的子图。
本步骤包括:
步骤4.6.1,将初始关键点关联图G中,所有第i类关键点和所有第j类关键点组成的子图记为
Figure PCTCN2021093969-appb-000038
在多人场景下,
Figure PCTCN2021093969-appb-000039
中包含多个人体骨骼束。从
Figure PCTCN2021093969-appb-000040
生成的所有候选骨骼束中计算出可令目标方程(10)最大的骨骼束g c,作为真实的骨骼束。
Figure PCTCN2021093969-appb-000041
Figure PCTCN2021093969-appb-000042
Figure PCTCN2021093969-appb-000043
其中,q(z)=p(z)·z,|V c|表示g c中点的个数,w p,w m,w t,w v为权重系数。
步骤4.6.2,令
Figure PCTCN2021093969-appb-000044
重复步骤4.6.1,直到
Figure PCTCN2021093969-appb-000045
为空。
步骤4.7,根据步骤4.6,遍历人体所有骨骼,求出人体骨骼束集合B。
步骤4.8,将人体骨骼束B按照公式(10)的得分,由大到小排列,构成队列Q。
步骤4.9,初始时,真实关键点关联图
Figure PCTCN2021093969-appb-000046
步骤4.10,从队列Q中,取出队首骨骼束
Figure PCTCN2021093969-appb-000047
在加入G’时,包含的所有关键点d应被赋予同一个人的标签。若
Figure PCTCN2021093969-appb-000048
且d i,d j在G’中已被赋予不同的人物标签,则
Figure PCTCN2021093969-appb-000049
与G’存在冲突。
判断
Figure PCTCN2021093969-appb-000050
与G’是否有冲突。
a.若有冲突,则将
Figure PCTCN2021093969-appb-000051
按照G’中的人物标签拆分成不同人物的骨骼束,根据公式(10)计算出新的骨骼束得分,将其重新加入队列Q。
b.若没有冲突,则
Figure PCTCN2021093969-appb-000052
并为
Figure PCTCN2021093969-appb-000053
赋予相应的人物标签。
步骤5,动作数据库中已有动作重建,对于可识别的常见动作,直接调用动作数据库中预置动作动画,节约运算开销。
本步骤包括:
步骤5.1,利用采集的图像序列和2D骨骼信息,识别当前人物身份与动作。
步骤5.2,判断当前人物动作是否已存储于动作数据库中。若已存储,则利用步骤5.3,5.4生成人物动画。若未存储,则进入步骤6。
步骤5.3,基于三角测量法,利用双目标定相机获取的人体根关键点图像坐标计算出根关键点的三维坐标。
步骤5.4,将动作数据库中动画初始帧的人物模型根结点对齐步骤5.3中计算出的三维坐标,并借助步骤2.2的面部方向标注确定根结点旋转方向。随后,播放动作数据库中的动画。在处理行走类动作时,可利用本步骤方法计算出动作结束时根结点的位置,并利用步骤2.2中对行走通道的标记确定运动过程路径。
步骤5.5,若检测到人物动作发生切换,则返回步骤5.2。
步骤6,实时动作重建,若当前动作未存储在动作数据库中,则利用三维模型拟合2D人体骨架,实时重建出人物三维动作。
本步骤包括,
步骤6.1,根据步骤5.1中人物身份识别结果,从数据库中调出相应家庭成员的参数化人体模型。通过最小化目标函数(11),令参数化人体模型与步骤3中组装的2D人体骨架动作拟合。若当前人物身份是家庭成员,则保持模型初始 形状参数β,只对姿势参数θ进行优化。若当前人物身份是客人,则在第一帧同时优化人体模型的形状参数β与姿势参数θ,后续帧只对姿势参数θ进行优化。
E(β,θ)=λ JE JshapeE shapetempE tempθE θ  (11)
其中,λ J,λ shape,λ temp,λ θ为权重参数。
a.E J为关节距离惩罚项:
Figure PCTCN2021093969-appb-000054
其中,对于单个人物,η i,c表示第c个视角中此人的第i类关键点的置信分数,R θ(J(β) i)表示SMPL模型中第i类关键点的3D坐标,
Figure PCTCN2021093969-appb-000055
表示第i类关键点向第c个相机的图像平面投影的2D坐标,J i,c表示第c个视角中第i类关键点的2D坐标,ρ(·)为Geman-McClure惩罚函数。
b.E shape为形状惩罚项:
Figure PCTCN2021093969-appb-000056
其中,对于单个人物,l i,t表示当前帧t的第i类骨骼的长度,
Figure PCTCN2021093969-appb-000057
为利用当前人物初始五帧图像计算出的第i类骨骼的平均长度先验,C表示人体骨骼集合。
c.E temp为时间平滑项:
Figure PCTCN2021093969-appb-000058
其中，α为权重参数，Δv j,t表示第t帧关节点j向前运动的趋势，Δv j,t =R θ (J(β)) j,t-1 -R θ (J(β)) j,t-2 ，θ i,t表示第t帧第i类骨骼的姿势参数。
d.E θ为动作惩罚项:
Figure PCTCN2021093969-appb-000059
其中,∑ j(g jN(θ;μ θ,jθ,j)为利用CMUMoCaP数据集建立的关于姿势参数θ 的先验高斯混合模型。
步骤7,判断并处理实时动作重建时的遮挡情况,判断并处理实时动作重建时人体关键点被遮挡,导致2D关键点无法识别或识别错误的问题。
本步骤包括:
步骤7.1,若步骤4组成的2D人体骨架在所有视角中都不完整,或者检测出的部分关键点在所有视角的置信度都低于预设阈值T,则认为该人体有部分关键点被遮挡,处于视角盲区。
步骤7.2,对于较短连续帧的遮挡,在步骤6进行实时重建时,增大式(11)中被遮挡关键点的权重系数λ temp,加强当前人体3D关键点估算对前一帧关键点的依赖。
步骤7.3,对于较长连续帧的遮挡,特别是特定关键点的长时间遮挡,步骤7.2的处理容易产生累积误差。此种情况下,人物一般处于较为静止的状态,例如,坐于桌前时下半身的关键点被遮挡。此时,根据图片识别结果,从动作数据库中调出最接近当前姿态的标准姿态模型,如标准坐姿、标准站姿、标准卧姿等,及其姿势参数θ。
Figure PCTCN2021093969-appb-000060
其中,ω j表示骨骼关节链中关键点j相对于父关键点的轴角旋转。
根据式(11),以标准姿态模型的参数θ为初始值进行动作回归,进行回归时只对置信度高的关键点的参数ω进行优化,被遮挡关键点保持原本的参数ω。
附图说明
图1示出了本发明一种基于多目视频的家庭场景动作捕捉方法;
图2示出了本发明实例的初始关键点关联图G示例;
图3示出了本发明实例的真实关键点关联图G'示例;
图4示出了本发明实例的骨骼束定义示例;
具体实施方式
下面结合附图和实施例对本发明优选实施方式进一步说明。
图1所示的流程图给出了本发明整个实施的具体过程:
步骤1,相机放置,在待检测家庭中放置多个标定相机,实时获取家庭的多 角度视频。
步骤2,家庭场景模型构建及标注,根据真实家庭场景创建三维虚拟场景模型,对三维虚拟场景进行必要标注。
本步骤包括:
步骤2.1,对待检测家庭场景进行三维建模。
步骤2.2,在三维场景中标注常用行走通道、可坐区域等功能区。并在沙发、桌椅等固定的功能区中,对人物进行常规动作时的面部朝向进行定义,用来辅助人物常见行为动画的生成。
步骤2.3,建立家庭成员动作数据库,基于参数化人体模型SMPL预先创建各家庭成员模型、客人标准样貌模型,以及常见动作动画,如行走、站立、静坐等。
步骤3,人体2D关键点检测,检测多目视频中的人体2D关键点坐标和PAF(PartAffinityField)。
本步骤包括:
步骤3.1,将各角度视频的当前帧输入OpenPose卷积神经网络,得到置信度图集合S=(S 1,S 2,...,S J)和PAF集合L=(L 1,L 2,...,L C)。
其中J表示单个人体骨架中关键点个数,
Figure PCTCN2021093969-appb-000061
表示第j类关键点的置信度图,其中j∈{1,...,J}。C表示单个人体骨架中骨骼的个数,
Figure PCTCN2021093969-appb-000062
表示第c类骨骼的PAF,其中c∈{1,...,C}。
步骤3.2,利用非极大值抑制算法,找出S j中所有第j类关键点的热图集合
Figure PCTCN2021093969-appb-000063
Figure PCTCN2021093969-appb-000064
其中,
Figure PCTCN2021093969-appb-000065
表示场景中第m个人的第j类关键点的热图,M为场景中人物个数,m∈{1,...,M}。
步骤3.3,计算
Figure PCTCN2021093969-appb-000066
中最大值点的坐标
Figure PCTCN2021093969-appb-000067
即为场景中第m个人的第j类关键点2D坐标。
步骤4,人体骨架组装,对检测到的多人2D关键点进行组装,形成多组人体2D骨架,并建立不同视角中关键点之间的联系,以及当前帧与前一帧关键点之间的联系。
本步骤包括:
步骤4.1,构建初始关键点关联图G:
G=(V,E),V=D j(c)∪D t-1,E=E P∪E V∪E T  (1)
其中,V为图G的点集,E为图G的边集。
Figure PCTCN2021093969-appb-000068
表示在当前帧t中,视角c里第j类关键点中的第m个候选点,j∈{1,2,...,J},c∈{1,2,...,N},N为相机个数。D t-1表示t-1帧求出的骨骼3D关键点,若不存在t-1帧,则忽略这一项。在图G中,同一视角里,人体骨架中不同类的关键点两两之间有边相连,用E P表示。不同视角中,人体骨架相同类的关键点两两之间有边相连,用E V表示。每个视角中,每个关键点与D t-1中所有相同类的关键点相连,用E T表示,若不存在t-1帧,则忽略这一项。初始关键点关联图G如图2所示,为了表述清晰,图2中只画出了两个视角、两类关键点的示意图。
步骤4.2,目标是对初始关键点关联图G求解,得到能够正确表示关键点联系的真实关键点关联图G’:
G’=(V,E’),V=D j(c)∪D t-1,E’=E’ p∪E’ v∪E’ T  (2)
其中,在G’中,同一视角中的关键点以真实人体骨架对应的边相连,用E’ p表示;不同视角中,同一人物的同类关键点以边相连,用E’ v表示;每个视角中,每个关键点与D t-1中同一人物的同类关键点相连,用E’ T表示。步骤4.1-4.10即对G’求解过程。
真实关键点关联图G’如图3所示,为了表述清晰,图3中只画出了两个视角、两类关键点的示意图。
步骤4.3,对图G的边
Figure PCTCN2021093969-appb-000069
进行权重赋值:
Figure PCTCN2021093969-appb-000070
Figure PCTCN2021093969-appb-000071
其中,
Figure PCTCN2021093969-appb-000072
表示以
Figure PCTCN2021093969-appb-000073
Figure PCTCN2021093969-appb-000074
为顶点的边。
Figure PCTCN2021093969-appb-000075
表示
Figure PCTCN2021093969-appb-000076
在G’中保留,
Figure PCTCN2021093969-appb-000077
表示
Figure PCTCN2021093969-appb-000078
在G’中不保留。L c(x)表示点x处的PAF值。x(u)表示在关键点
Figure PCTCN2021093969-appb-000079
Figure PCTCN2021093969-appb-000080
连线上的插值点。
步骤4.4,对图G的边
Figure PCTCN2021093969-appb-000081
进行权重赋值:
Figure PCTCN2021093969-appb-000082
Figure PCTCN2021093969-appb-000083
其中,
Figure PCTCN2021093969-appb-000084
K c表示相机c的内参矩阵,
Figure PCTCN2021093969-appb-000085
表示关键点在相机坐标系的坐标,
Figure PCTCN2021093969-appb-000086
表示相机c 1光心和
Figure PCTCN2021093969-appb-000087
所在的直线与相机c 2光心和
Figure PCTCN2021093969-appb-000088
所在的直线之间的直线距离。Z为归一化系数,将
Figure PCTCN2021093969-appb-000089
归一化到[0,1]。
步骤4.5,对图G的边
Figure PCTCN2021093969-appb-000090
进行权重赋值:
Figure PCTCN2021093969-appb-000091
其中,
Figure PCTCN2021093969-appb-000092
表示t-1帧第i类关键点的第k个候选点,
Figure PCTCN2021093969-appb-000093
为相机光心和
Figure PCTCN2021093969-appb-000094
两点所在直线与
Figure PCTCN2021093969-appb-000095
之间的距离,T为归一化系数,将
Figure PCTCN2021093969-appb-000096
归一化到[0,1]。
步骤4.6,计算人体骨骼束,人体骨骼束
Figure PCTCN2021093969-appb-000097
表示在真实关键点关联图G’中由第m个人的第i类和第m个人的第j类关键点组成的子图。一个骨骼束如图4所示。
本步骤包括:
步骤4.6.1,将初始关键点关联图G中,所有第i类关键点和所有第j类关键点组成的子图记为
Figure PCTCN2021093969-appb-000098
在多人场景下,
Figure PCTCN2021093969-appb-000099
中包含多个人体骨骼束。从
Figure PCTCN2021093969-appb-000100
生成的所有候选骨骼束中计算出可令目标方程(10)最大的骨骼束g c,作为真实的骨骼束。
Figure PCTCN2021093969-appb-000101
Figure PCTCN2021093969-appb-000102
Figure PCTCN2021093969-appb-000103
其中,q(z)=p(z)·z,|V c|表示g c中点的个数,w p,w m,w t,w v为权重系数。
步骤4.6.2,令
Figure PCTCN2021093969-appb-000104
重复步骤4.6.1,直到
Figure PCTCN2021093969-appb-000105
为空。
步骤4.7,根据步骤4.6,遍历人体所有骨骼,求出人体骨骼束集合B。
步骤4.8,将人体骨骼束B按照公式(10)的得分,由大到小排列,构成队列Q。
步骤4.9,初始时,真实关键点关联图
Figure PCTCN2021093969-appb-000106
步骤4.10,从队列Q中,取出队首骨骼束
Figure PCTCN2021093969-appb-000107
在加入G’时,包含的所有关键点d应被赋予同一个人的标签。若
Figure PCTCN2021093969-appb-000108
且d i,d j在G’中已被赋予不同的人物标签,则
Figure PCTCN2021093969-appb-000109
与G’存在冲突。
判断
Figure PCTCN2021093969-appb-000110
与G’是否有冲突。
a.若有冲突,则将
Figure PCTCN2021093969-appb-000111
按照G’中的人物标签拆分成不同人物的骨骼束,根据公式(10)计算出新的骨骼束得分,将其重新加入队列Q。
b.若没有冲突,则
Figure PCTCN2021093969-appb-000112
并为
Figure PCTCN2021093969-appb-000113
赋予相应的人物标签。
步骤5,动作数据库中已有动作重建,对于可识别的常见动作,直接调用动作数据库中预置动作动画,节约运算开销。
本步骤包括:
步骤5.1,利用采集的图像序列和2D骨骼信息,识别当前人物身份与动作。
步骤5.2,判断当前人物动作是否已存储于动作数据库中。若已存储,则利用步骤5.3,5.4生成人物动画。若未存储,则进入步骤6。
步骤5.3,基于三角测量法,利用双目标定相机获取的人体根关键点图像坐标计算出根关键点的三维坐标。
步骤5.4,将动作数据库中动画初始帧的人物模型根结点对齐步骤5.3中计算出的三维坐标,并借助步骤2.2的面部方向标注确定根结点旋转方向。随后,播放动作数据库中的动画。在处理行走类动作时,可利用本步骤方法计算出动作结束时根结点的位置,并利用步骤2.2中对行走通道的标记确定运动过程路径。
步骤5.5,若检测到人物动作发生切换,则返回步骤5.2。
步骤6,实时动作重建,若当前动作未存储在动作数据库中,则利用三维模型拟合2D人体骨架,实时重建出人物三维动作。
本步骤包括,
步骤6.1,根据步骤5.1中人物身份识别结果,从数据库中调出相应家庭成员的参数化人体模型。通过最小化目标函数(11),令参数化人体模型与步骤3中组装的2D人体骨架动作拟合。若当前人物身份是家庭成员,则保持模型初始形状参数β,只对姿势参数θ进行优化。若当前人物身份是客人,则在第一帧同时优化人体模型的形状参数β与姿势参数θ,后续帧只对姿势参数θ进行优化。
E(β,θ)=λ JE JshapeE shapetempE tempθE θ  (11)
其中,λ J,λ shape,λ temp,λ θ为权重参数。
a.E J为关节距离惩罚项:
Figure PCTCN2021093969-appb-000114
其中,对于单个人物,η i,c表示第c个视角中此人的第i类关键点的置信分数,R θ(J(β) i)表示SMPL模型中第i类关键点的3D坐标,
Figure PCTCN2021093969-appb-000115
表示第i类关键点向第c个相机的图像平面投影的2D坐标,J i,c表示第c个视角中第i类关键点的2D坐标,ρ(·)为Geman-McClure惩罚函数。
b.E shape为形状惩罚项:
Figure PCTCN2021093969-appb-000116
其中,对于单个人物,l i,t表示当前帧t的第i类骨骼的长度,
Figure PCTCN2021093969-appb-000117
为利用当前人物初始五帧图像计算出的第i类骨骼的平均长度先验,C表示人体骨骼集合。
c.E temp为时间平滑项:
Figure PCTCN2021093969-appb-000118
其中，α为权重参数，Δv j,t表示第t帧关节点j向前运动的趋势，Δv j,t =R θ (J(β)) j,t-1 -R θ (J(β)) j,t-2 ，θ i,t表示第t帧第i类骨骼的姿势参数。
d.E θ为动作惩罚项:
Figure PCTCN2021093969-appb-000119
其中,∑ j(g jN(θ;μ θ,jθ,j)为利用CMUMoCaP数据集建立的关于姿势参数θ的先验高斯混合模型。
步骤7,判断并处理实时动作重建时的遮挡情况,判断并处理实时动作重建时人体关键点被遮挡,导致2D关键点无法识别或识别错误的问题。
本步骤包括:
步骤7.1,若步骤4组成的2D人体骨架在所有视角中都不完整,或者检测出的部分关键点在所有视角的置信度都低于预设阈值T,则认为该人体有部分关键点被遮挡,处于视角盲区。
步骤7.2,对于较短连续帧的遮挡,在步骤6进行实时重建时,增大式(11)中被遮挡关键点的权重系数λ temp,加强当前人体3D关键点估算对前一帧关键点的依赖。
步骤7.3,对于较长连续帧的遮挡,特别是特定关键点的长时间遮挡,步骤7.2的处理容易产生累积误差。此种情况下,人物一般处于较为静止的状态,例如,坐于桌前时下半身的关键点被遮挡。此时,根据图片识别结果,从动作数据库中调出最接近当前姿态的标准姿态模型,如标准坐姿、标准站姿、标准卧姿等,及其姿势参数θ。
Figure PCTCN2021093969-appb-000120
其中,ω j表示骨骼关节链中关键点j相对于父关键点的轴角旋转。
根据式(11),以标准姿态模型的参数θ为初始值进行动作回归,进行回归时只对置信度高的关键点的参数ω进行优化,被遮挡关键点保持原本的参数ω。

Claims (6)

  1. 一种基于多目视频的家庭场景动作捕捉方法,其特征在于,包括以下步骤:
    步骤1,相机放置,在待检测家庭中放置多个标定相机,实时获取家庭的多角度视频。
    步骤2,家庭场景模型构建及标注,根据真实家庭场景创建三维虚拟场景模型,对三维虚拟场景进行必要标注。
    步骤3,人体2D关键点检测,检测多目视频中的人体2D关键点坐标和PAF(Part Affinity Field)。
    步骤4,人体骨架组装,对检测到的多人2D关键点进行组装,形成多组人体2D骨架,并建立不同视角中关键点之间的联系,以及当前帧与前一帧关键点之间的联系。
    步骤5,动作数据库中已有动作重建,对于可识别的常见动作,直接调用动作数据库中预置动作动画,节约运算开销。
    步骤6,实时动作重建,若当前动作未存储在动作数据库中,则利用三维模型拟合2D人体骨架,实时重建出人物三维动作。
    步骤7,判断并处理实时动作重建时的遮挡情况,判断并处理实时动作重建时人体关键点被遮挡,导致2D关键点无法识别或识别错误的问题。
  2. 根据权利要求1所述的一种基于多目视频的家庭场景动作捕捉方法,其特征在于,所述的步骤2中家庭场景模型构建及标注,构建与真实家庭场景对应的三维场景模型,对三维场景进行必要标注。所述的步骤2进一步包括:
    步骤2.1,对待检测家庭场景进行三维建模。
    步骤2.2,在三维场景中标注常用行走通道、可坐区域等功能区。并在沙发、桌椅等固定的功能区中,对人物进行常规动作时的面部朝向进行定义,用来辅助人物常见行为动画的生成。
    步骤2.3,建立家庭成员动作数据库,基于参数化人体模型SMPL预先创建各家庭成员模型、客人标准样貌模型,以及常见动作动画,如行走、站立、静坐等。
  3. 根据权利要求1所述的一种基于多目视频的家庭场景动作捕捉方法,其特征在于,所述的步骤4中人体骨架组装,对检测到的多人2D关键点进行组装,形成多组人体骨架。所述的步骤4进一步包括:
    步骤4.1,构建初始关键点关联图G:
    G=(V,E),V=D j(c)∪D t-1,E=E P∪E V∪E T  (1)
    其中,V为图G的点集,E为图G的边集。
    Figure PCTCN2021093969-appb-100001
    表示在当前帧t中,视角c里第j类关键点中的第m个候选点,j∈{1,2,...,J},c∈{1,2,...,N},N为相机个数。D t-1表示t-1帧求出的骨骼3D关键点,若不存在t-1帧,则忽略这一项。在图G中,同一视角里,人体骨架中不同类的关键点两两之间有边相连,用E P表示。不同视角中,人体骨架相同类的关键点两两之间有边相连,用E V表示。每个视角中,每个关键点与D t-1中所有相同类的关键点相连,用E T表示,若不存在t-1帧,则忽略这一项。初始关键点关联图G如图2所示,为了表述清晰,图2中只画出了两个视角、两类关键点的示意图。
    步骤4.2,目标是对初始关键点关联图G求解,得到能够正确表示关键点联系的真实关键点关联图G’:
    G’=(V,E’),V=D j(c)∪D t-1,E’=E’ p∪E’ v∪E’ T  (2)
    其中,在G’中,同一视角中的关键点以真实人体骨架对应的边相连,用E’ p表示;不同视角中,同一人物的同类关键点以边相连,用E’ v表示;每个视角中,每个关键点与D t-1中同一人物的同类关键点相连,用E’ T表示。步骤4.1-4.10即对G’求解过程。
    真实关键点关联图G’如图3所示,为了表述清晰,图3中只画出了两个视角、两类关键点的示意图。
    步骤4.3,对图G的边
    Figure PCTCN2021093969-appb-100002
    进行权重赋值:
    Figure PCTCN2021093969-appb-100003
    Figure PCTCN2021093969-appb-100004
    其中,
    Figure PCTCN2021093969-appb-100005
    表示以
    Figure PCTCN2021093969-appb-100006
    Figure PCTCN2021093969-appb-100007
    为顶点的边。
    Figure PCTCN2021093969-appb-100008
    表示
    Figure PCTCN2021093969-appb-100009
    在G’中保留,
    Figure PCTCN2021093969-appb-100010
    表示
    Figure PCTCN2021093969-appb-100011
    在G’中不保留。L c(x)表示点x处的PAF值。x(u)表示在关键点
    Figure PCTCN2021093969-appb-100012
    Figure PCTCN2021093969-appb-100013
    连线上的插值点。
    步骤4.4,对图G的边
    Figure PCTCN2021093969-appb-100014
    进行权重赋值:
    Figure PCTCN2021093969-appb-100015
    Figure PCTCN2021093969-appb-100016
    其中,
    Figure PCTCN2021093969-appb-100017
    K c表示相机c的内参矩阵,
    Figure PCTCN2021093969-appb-100018
    表示关键点在相机坐标系的坐标,
    Figure PCTCN2021093969-appb-100019
    表示相机c 1光心和
    Figure PCTCN2021093969-appb-100020
    所在的直线与相机c 2光心和
    Figure PCTCN2021093969-appb-100021
    所在的直线之间的直线距离。Z为归一化系数,将
    Figure PCTCN2021093969-appb-100022
    归一化到[0,1]。
    步骤4.5,对图G的边
    Figure PCTCN2021093969-appb-100023
    进行权重赋值:
    Figure PCTCN2021093969-appb-100024
    其中,
    Figure PCTCN2021093969-appb-100025
    表示t-1帧第i类关键点的第k个候选点,
    Figure PCTCN2021093969-appb-100026
    为相机光心和
    Figure PCTCN2021093969-appb-100027
    两点所在直线与
    Figure PCTCN2021093969-appb-100028
    之间的距离,T为归一化系数,将
    Figure PCTCN2021093969-appb-100029
    归一化到[0,1]。
    步骤4.6,计算人体骨骼束,人体骨骼束
    Figure PCTCN2021093969-appb-100030
    表示在真实关键点关联图G’中由第m个人的第i类和第m个人的第j类关键点组成的子图。一个骨骼束如图4所示。
    本步骤包括:
    步骤4.6.1,将初始关键点关联图G中,所有第i类关键点和所有第j类关键点组成的子图记为
    Figure PCTCN2021093969-appb-100031
    在多人场景下,
    Figure PCTCN2021093969-appb-100032
    中包含多个人体骨骼束。从
    Figure PCTCN2021093969-appb-100033
    生成的所有候选骨骼束中计算出可令目标方程(10)最大的骨骼束g c,作为真实的骨骼束。
    Figure PCTCN2021093969-appb-100034
    Figure PCTCN2021093969-appb-100035
    Figure PCTCN2021093969-appb-100036
    其中,q(z)=p(z)·z,|V c|表示g c中点的个数,w p,w m,w t,w v为权重系数。
    步骤4.6.2,令
    Figure PCTCN2021093969-appb-100037
    重复步骤4.6.1,直到
    Figure PCTCN2021093969-appb-100038
    为空。
    步骤4.7,根据步骤4.6,遍历人体所有骨骼,求出人体骨骼束集合B。
    步骤4.8,将人体骨骼束B按照公式(10)的得分,由大到小排列,构成队列Q。
    步骤4.9,初始时,真实关键点关联图
    Figure PCTCN2021093969-appb-100039
    步骤4.10,从队列Q中,取出队首骨骼束
    Figure PCTCN2021093969-appb-100040
    在加入G’时,包含的所有关键点d应被赋予同一个人的标签。若
    Figure PCTCN2021093969-appb-100041
    且d i,d j在G’中已被赋予不同的人物标签,则
    Figure PCTCN2021093969-appb-100042
    与G’存在冲突。
    判断
    Figure PCTCN2021093969-appb-100043
    与G’是否有冲突。
    a.若有冲突,则将
    Figure PCTCN2021093969-appb-100044
    按照G’中的人物标签拆分成不同人物的骨骼束,根据公式(10)计算出新的骨骼束得分,将其重新加入队列Q。
    b.若没有冲突,则
    Figure PCTCN2021093969-appb-100045
    并为
    Figure PCTCN2021093969-appb-100046
    赋予相应的人物标签。
  4. 根据权利要求1所述的一种基于多目视频的家庭场景动作捕捉方法,其特征在于,所述的步骤5中动作数据库中已有动作重建,对于可识别的常见动作,直接调用动作数据库中预置动作动画,节约运算开销。所述的步骤5进一步包括:
    步骤5.1,利用采集的图像序列和2D骨骼信息,识别当前人物身份与动作。
    步骤5.2,判断当前人物动作是否已存储于动作数据库中。若已存储,则利用步骤5.3,5.4生成人物动画。若未存储,则进入步骤6。
    步骤5.3,基于三角测量法,利用双目标定相机获取的人体根关键点图像坐标计算出根关键点的三维坐标。
    步骤5.4,将动作数据库中动画初始帧的人物模型根结点对齐步骤5.3中计算出的三维坐标,并借助步骤2.2的面部方向标注确定根结点旋转方向。随后,播放动作数据库中的动画。在处理行走类动作时,可利用本步骤方法计算出动作结束时根结点的位置,并利用步骤2.2中对行走通道的标记确定运动过程路径。
    步骤5.5,若检测到人物动作发生切换,则返回步骤5.2。
  5. 根据权利要求1所述的一种基于多目视频的家庭场景动作捕捉方法,其特征在于,所述的步骤6中实时动作重建,若当前动作未存储在动作数据库中,则利用三维模型拟合2D人体骨架,实时重建出人物三维动作。所述的步骤6中, 令参数化模型拟合2D人体骨架的目标方程的定义为:
    E(β,θ)=λ JE JshapeE shapetempE tempθE θ  (11)
    其中,λ J,λ shape,λ temp,λ θ为权重参数。
    a.E J为关节距离惩罚项:
    Figure PCTCN2021093969-appb-100047
    其中,对于单个人物,η i,c表示第c个视角中此人的第i类关键点的置信分数,R θ(J(β) i)表示SMPL模型中第i类关键点的3D坐标,
    Figure PCTCN2021093969-appb-100048
    表示第i类关键点向第c个相机的图像平面投影的2D坐标,J i,c表示第c个视角中第i类关键点的2D坐标,ρ(·)为Geman-McClure惩罚函数。
    b.E shape为形状惩罚项:
    Figure PCTCN2021093969-appb-100049
    其中,对于单个人物,l i,t表示当前帧t的第i类骨骼的长度,
    Figure PCTCN2021093969-appb-100050
    为利用当前人物初始五帧图像计算出的第i类骨骼的平均长度先验,C表示人体骨骼集合。
    c.E temp为时间平滑项:
    Figure PCTCN2021093969-appb-100051
    其中，α为权重参数，Δv j,t表示第t帧关节点j向前运动的趋势，Δv j,t =R θ (J(β)) j,t-1 -R θ (J(β)) j,t-2 ，θ i,t表示第t帧第i类骨骼的姿势参数。
    d.E θ为动作惩罚项:
    Figure PCTCN2021093969-appb-100052
    其中,Σ j(g jN(θ;μ θ,jθ,j)为利用CMUMoCaP数据集建立的关于姿势参数θ的先验高斯混合模型。
  6. 根据权利要求1所述的一种基于多目视频的家庭场景动作捕捉方法,其特征在于,所述的步骤7中判断并处理实时动作重建时的遮挡情况,判断并处理实时动作重建时人体关键点被遮挡,导致2D关键点无法识别或识别错误的问题。所述的步骤7进一步包括:
    步骤7.1,若步骤4组成的2D人体骨架在所有视角中都不完整,或者检测出的部分关键点在所有视角的置信度都低于预设阈值T,则认为该人体有部分关键点被遮挡,处于视角盲区。
    步骤7.2,对于较短连续帧的遮挡,在步骤6进行实时重建时,增大式(11)中被遮挡关键点的权重系数λ temp,加强当前人体3D关键点估算对前一帧关键点的依赖。
    步骤7.3,对于较长连续帧的遮挡,特别是特定关键点的长时间遮挡,步骤7.2的处理容易产生累积误差。此种情况下,人物一般处于较为静止的状态,例如,坐于桌前时下半身的关键点被遮挡。此时,根据图片识别结果,从动作数据库中调出最接近当前姿态的标准姿态模型,如标准坐姿、标准站姿、标准卧姿等,及其姿势参数θ。
    Figure PCTCN2021093969-appb-100053
    其中,ω j表示骨骼关节链中关键点j相对于父关键点的轴角旋转。
    根据式(11),以标准姿态模型的参数θ为初始值进行动作回归,进行回归时只对置信度高的关键点的参数ω进行优化,被遮挡关键点保持原本的参数ω。
PCT/CN2021/093969 2021-05-15 2021-05-15 一种基于多目视频的家庭场景动作捕捉方法 WO2022241583A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/093969 WO2022241583A1 (zh) 2021-05-15 2021-05-15 一种基于多目视频的家庭场景动作捕捉方法

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/093969 WO2022241583A1 (zh) 2021-05-15 2021-05-15 一种基于多目视频的家庭场景动作捕捉方法

Publications (1)

Publication Number Publication Date
WO2022241583A1 true WO2022241583A1 (zh) 2022-11-24

Family

ID=84140927

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/093969 WO2022241583A1 (zh) 2021-05-15 2021-05-15 一种基于多目视频的家庭场景动作捕捉方法

Country Status (1)

Country Link
WO (1) WO2022241583A1 (zh)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107845129A (zh) * 2017-11-07 2018-03-27 深圳狗尾草智能科技有限公司 三维重构方法及装置、增强现实的方法及装置
CN110020611A (zh) * 2019-03-17 2019-07-16 浙江大学 一种基于三维假设空间聚类的多人动作捕捉方法
US20210012100A1 (en) * 2019-07-10 2021-01-14 Hrl Laboratories, Llc Action classification using deep embedded clustering
CN110544302A (zh) * 2019-09-06 2019-12-06 广东工业大学 基于多目视觉的人体动作重建系统、方法和动作训练系统

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115565253A (zh) * 2022-12-08 2023-01-03 季华实验室 一种动态手势实时识别方法、装置、电子设备和存储介质
CN115565253B (zh) * 2022-12-08 2023-04-18 季华实验室 一种动态手势实时识别方法、装置、电子设备和存储介质
CN116403275A (zh) * 2023-03-14 2023-07-07 南京航空航天大学 基于多目视觉检测封闭空间中人员行进姿态的方法及系统
CN116403275B (zh) * 2023-03-14 2024-05-24 南京航空航天大学 基于多目视觉检测封闭空间中人员行进姿态的方法及系统
CN115984972A (zh) * 2023-03-20 2023-04-18 乐歌人体工学科技股份有限公司 基于运动视频驱动的人体姿态识别方法
CN115984972B (zh) * 2023-03-20 2023-08-11 乐歌人体工学科技股份有限公司 基于运动视频驱动的人体姿态识别方法
CN116403288A (zh) * 2023-04-28 2023-07-07 中南大学 运动姿态的识别方法、识别装置及电子设备
CN116880687A (zh) * 2023-06-07 2023-10-13 黑龙江科技大学 一种基于单目多算法的悬浮触控方法
CN116880687B (zh) * 2023-06-07 2024-03-19 黑龙江科技大学 一种基于单目多算法的悬浮触控方法
CN117911632A (zh) * 2024-03-19 2024-04-19 电子科技大学 一种人体节点三维虚拟角色动作重构方法、设备及计算机可读存储介质
CN117911632B (zh) * 2024-03-19 2024-05-28 电子科技大学 一种人体节点三维虚拟角色动作重构方法、设备及计算机可读存储介质

Similar Documents

Publication Publication Date Title
WO2022241583A1 (zh) 一种基于多目视频的家庭场景动作捕捉方法
EP3602494B1 (en) Robust mesh tracking and fusion by using part-based key frames and priori model
Wang et al. EM enhancement of 3D head pose estimated by point at infinity
Cheung et al. Shape-from-silhouette across time part ii: Applications to human modeling and markerless motion tracking
CN109242950B (zh) 多人紧密交互场景下的多视角人体动态三维重建方法
Ye et al. Accurate 3d pose estimation from a single depth image
Kumano et al. Pose-invariant facial expression recognition using variable-intensity templates
Rafi et al. A semantic occlusion model for human pose estimation from a single depth image
KR20210079542A (ko) 3d 골격 정보를 이용한 사용자 동작 인식 방법 및 시스템
CN111582036B (zh) 可穿戴设备下基于形状和姿态的跨视角人物识别方法
Argyros et al. Binocular hand tracking and reconstruction based on 2D shape matching
CN111832386A (zh) 一种估计人体姿态的方法、装置及计算机可读介质
Goto et al. Facial feature extraction for quick 3D face modeling
Rius et al. Action-specific motion prior for efficient Bayesian 3D human body tracking
Haker et al. Self-organizing maps for pose estimation with a time-of-flight camera
Okada et al. Virtual fashion show using real-time markerless motion capture
Lefevre et al. Structure and appearance features for robust 3d facial actions tracking
Leow et al. 3-D–2-D spatiotemporal registration for sports motion analysis
Muhlbauer et al. A model-based algorithm to estimate body poses using stereo vision
Zúniga et al. Fast and reliable object classification in video based on a 3D generic model
Raskin et al. Using gaussian processes for human tracking and action classification
Joo Sensing, Measuring, and Modeling Social Signals in Nonverbal Communication
Metaxas et al. Dynamically adaptive tracking of gestures and facial expressions
Pala et al. Person Re-Identification from Depth Cameras using Skeleton and 3D Face Data.
Kehl Markerless motion capture of complex human movements from multiple views

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21940042

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21940042

Country of ref document: EP

Kind code of ref document: A1