CN116721468A - Intelligent guided broadcast switching method based on multi-person gesture estimation action amplitude detection - Google Patents

Intelligent guided broadcast switching method based on multi-person gesture estimation action amplitude detection

Info

Publication number
CN116721468A
Authority
CN
China
Prior art keywords
amplitude
image
motion
joint point
gesture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310762371.2A
Other languages
Chinese (zh)
Inventor
帅千钧
何健强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China filed Critical Communication University of China
Priority to CN202310762371.2A priority Critical patent/CN116721468A/en
Publication of CN116721468A publication Critical patent/CN116721468A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Psychiatry (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an intelligent director switching method based on multi-person pose estimation and motion amplitude detection. First, the video signal is sampled into image frames, and each image is fed into a built and trained multi-person pose estimation model to detect keypoint coordinates and their association information. The motion amplitude detection module explicitly accounts for differences in scene depth of field and person scale: based on the keypoint coordinates and association information, a motion amplitude detection algorithm normalizes the measurements and computes pose feature values, from which the motion amplitude is judged. A threshold method determines whether a movement is large-amplitude: if any pose feature value exceeds the threshold set by the system, the movement is judged to be large-amplitude, otherwise small-amplitude. Finally, the intelligent director system switches shots according to the detection results. The invention focuses on improving the accuracy of human keypoint localization and motion amplitude detection, so as to achieve automatic or assisted program production.

Description

An intelligent director switching method based on multi-person pose estimation and motion amplitude detection

Technical Field

The invention relates to an intelligent director switching method based on multi-person pose estimation and motion amplitude detection, and belongs to the technical field of artificial intelligence.

Background Art

Performing arts programs are usually shot with multiple cameras covering the stage from different angles, and a director selects and switches among the camera feeds for broadcast; the director's choice of shots is crucial to the overall effect and expressiveness of the program. However, traditional directing requires an experienced directing team, consumes substantial manpower and material resources, and involves long production cycles for multi-camera shooting, switching, and post-editing, making the process cumbersome. In live performance scenarios in particular, the director must quickly assess multiple camera feeds to capture the performers' highlights, so shot selection and switching are especially important. Intelligent directing based on artificial intelligence aims to solve the problem of automated directing for various types of performing arts programs by providing automated, intelligent decision-making that switches shots according to the real-time state of the people in the scene. Applying this technology greatly reduces the production cycle and labor and material costs while improving production efficiency and quality. Unlike action recognition and action detection, which aim to identify the type of an action, the motion amplitude detection in this method aims to measure the extent of a person's body extension in a specific scene, a new application direction for artificial intelligence. The amplitude of a movement can shape an actor's performing style and help the actor convey more emotion and dynamism on stage. By detecting motion amplitude frame by frame, the real-time state of hosts and guests can be identified and fed back to the intelligent directing system, enabling automatic or assisted program production, which has significant research and application value.

With the development of deep learning, there are currently two main approaches to motion amplitude detection. The first is based on image classification: image classification techniques are used to recognize performers' movements, providing a simple form of motion amplitude detection; however, it lacks analysis of the pose characteristics of the movements, and its real-time performance and ability to analyze multiple people remain challenging. The second is based on multi-person pose estimation: multi-person pose estimation detects human joint points such as the shoulders, knees, and ankles, and since the amplitude of a performing movement is mainly expressed through the body joints below the head, analyzing the relative positions and movements of these joints enables detection of the movement's amplitude. This approach better suppresses interference from the background, focuses on the person, and offers fast operation and accurate detection.

Chinese patent CN201910530100.8 discloses a method for measuring the amplitude of human movement. Its steps are: parse a first video and a second video into frame sequences; extract the last frame of the first video and the first frame of the second video; find the corresponding joint points in the two frames with a joint point detection algorithm; compute the mean displacement of each pair of corresponding joint points; and normalize the mean displacement Dis to obtain the motion amplitude of the two frames. The beneficial effect is that the method yields the motion amplitude of the later frame relative to the earlier frame, which serves as the criterion for frame-interpolation positions: frames with an overly large motion amplitude are cropped, frames with a small motion amplitude are interpolated, and the spliced video is smoother after processing.

As that patent illustrates, current motion amplitude detection relies mainly on the relative displacement of joint points between adjacent video frames. In performing arts scenes, however, the wide variety of movements makes existing motion amplitude detection algorithms prone to misjudgment, so their reliability is low. In addition, the multi-person pose estimation algorithms currently used in artificial intelligence applications are two-stage models, either top-down or bottom-up. Top-down models suffer from large memory requirements, poor real-time performance, and high computational cost, while bottom-up models are strongly affected by complex backgrounds and prone to misjudgments and matching errors; existing multi-person pose estimation therefore still needs improvement. At present there is no motion amplitude detection algorithm based on multi-person pose estimation whose speed and accuracy both reach a level high enough for intelligent directing systems in performing arts scenes.

In summary, existing technical solutions for motion amplitude detection have the following shortcomings:

1. Heavy consumption of computing resources:

The models require manual operations such as cropping, non-maximum suppression, and grouping, and rely on an object detection model to assist motion amplitude detection, which consumes a large amount of computing resources.

2. Multi-person pose estimation models cannot balance speed and accuracy:

Almost all existing pose-based action analysis methods (including action recognition, action classification, and motion amplitude detection) use two-stage multi-person pose estimation models, either top-down or bottom-up, and cannot achieve fully end-to-end multi-person pose estimation with a balance of speed and accuracy.

3. Low accuracy of motion amplitude detection:

Motion amplitude detection is mostly implemented through image classification or the relative displacement of joint points between adjacent video frames and lacks analysis of human pose features, making it prone to misjudgment, unreliable, and inaccurate.

Summary of the Invention

To solve the above technical problems, the invention provides an intelligent director switching method based on multi-person pose estimation and motion amplitude detection. It addresses the problems that multi-person pose estimation models cannot balance speed and accuracy, which causes systems to consume large amounts of computing resources, and that motion amplitude detection accuracy is low, and it achieves excellent results. The invention applies the end-to-end multi-person pose estimation with Transformers (PETR) model to motion amplitude detection and explicitly accounts for differences in scene depth of field and person scale. Based on the joint point coordinates, a motion amplitude detection algorithm normalizes the measurements and computes pose feature values, from which the motion amplitude is judged, so as to achieve intelligent director switching and automatic or assisted program production, saving manpower and material resources and improving efficiency.

The technical solution adopted by the invention is an intelligent director switching method based on multi-person pose estimation and motion amplitude detection, comprising the following steps:

Step 1: extract frames from the performance video captured by the cameras to obtain images.

Step 2: build and train a multi-person pose estimation model.

Step 3: input the images from step 1 into the multi-person pose estimation model from step 2 and compute each pose feature with the motion amplitude detection algorithm.

Step 4: judge the motion amplitude from the pose features obtained in step 3.

Step 5: switch shots according to the motion amplitude judgment.

Step 1 specifically includes:

Multi-person pose and motion amplitude videos are captured by multiple cameras, and multi-person pose and motion amplitude images are obtained by extracting frames from these videos.

Step 1.1: select video frames at intervals of n frames; the other frames are discarded because their information is redundant. Here n is a positive integer greater than 1, for example 10.

Step 1.2: name each extracted frame according to the camera position information, so that each image represents the current program picture of one camera. For example, 1_01.jpg represents the first frame from camera No. 1.
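A minimal sketch of the frame extraction and naming convention of steps 1.1 and 1.2, assuming an OpenCV-based pipeline; the function name, output directory layout, and two-digit index are illustrative rather than part of the invention.

```python
# Keep every n-th frame of one camera's feed and name it "<camera>_<index>.jpg".
import cv2

def extract_frames(video_path: str, camera_id: int, out_dir: str, n: int = 10) -> list:
    cap = cv2.VideoCapture(video_path)
    saved, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % n == 0:                              # intermediate frames are redundant
            name = f"{out_dir}/{camera_id}_{frame_idx // n + 1:02d}.jpg"
            cv2.imwrite(name, frame)
            saved.append(name)
        frame_idx += 1
    cap.release()
    return saved
```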

Step 2 specifically includes:

The multi-person pose estimation model, i.e., the PETR model, mainly comprises a backbone network module, a position encoding module, a visual feature encoder module, a pose decoder module, and a joint point decoder module. Given an image obtained in step 1, the PETR model outputs the coordinates of the human joint points in the image and draws the human poses on the image according to the joint point association information.

Step 2.1: the backbone network module takes the image from step 1 as input and outputs multi-scale feature maps. The backbone is ResNet-50, a 50-layer residual network used to extract feature maps from the image; here it is used to extract high-resolution multi-scale feature maps.

Step 2.2: the visual feature encoder takes the multi-scale feature maps from step 2.1 and the positional encoding generated for each pixel by the position encoding module, and produces multi-scale feature tokens and pose queries.

Step 2.3: from the multi-scale feature tokens F obtained in step 2.2 and N randomly initialized pose queries Q_pose ∈ R^(N×D), the pose decoder obtains the poses of the N bodies, where each pose consists of the coordinates of the K joint points of the corresponding person and D denotes the dimension of the query key. The pose decoder estimates the pose coordinates layer by layer through all of its decoder layers, each layer refining the pose based on the prediction of the previous layer.

Step 2.4: the joint point decoder takes the K joint points of each human pose predicted by the pose decoder in step 2.3 as randomly initialized joint queries and refines the joint point positions and joint structure information. The feature information of the joint queries and keys is continuously updated through self-attention and deformable cross-attention; finally, the joint queries aggregate features according to the predicted offsets, further refining the joint point positions and joint structure information.

Step 2.5: training uses a set-based Hungarian loss intended to force a unique prediction for each ground-truth pose, reducing missed and false joint detections. The classification loss, denoted L_cls, is used in the classification head. To eliminate the relative error of the predictions, the PETR model uses an OKS loss, the regression loss between the joint point coordinates of the predicted poses and the ground-truth joint point coordinates. The L1 loss, denoted L_reg, and the OKS loss, denoted L_oks, are used as the loss functions of the pose regression head and the joint regression head respectively, so that the pose decoder and the joint point decoder converge better.

A heatmap-regression-based method is used to assist training and adds directional guidance: the closer to the target joint point, the larger the activation value, so the model can quickly approach the target joint point along the guided direction and converge faster. Accordingly, the PETR model uses a deformable transformer encoder to produce heatmap predictions and computes a variant of the focal loss, denoted L_hm, between the predicted and ground-truth heatmaps. The heatmap branch is only used to assist training and is removed at inference time. In summary, the overall loss function of the PETR model is:

L = L_cls + λ1·L_reg + λ2·L_oks + λ3·L_hm

where λ1, λ2, and λ3 are the corresponding loss weights.
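A small sketch of how the composite loss above combines the four terms; the individual loss values and the weights λ1 to λ3 are assumed to be supplied by the training code, since the text does not fix them.

```python
def petr_total_loss(l_cls, l_reg, l_oks, l_hm, lam1=1.0, lam2=1.0, lam3=1.0):
    """L = L_cls + λ1·L_reg + λ2·L_oks + λ3·L_hm; the heat-map term is dropped at inference."""
    return l_cls + lam1 * l_reg + lam2 * l_oks + lam3 * l_hm
```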

Step 2.6: the COCO keypoint dataset is used as the training set, and the PETR model is trained and tested on it. Specifically, the COCO keypoint data contain 200,000 images in total, of which 64,355 images are annotated with keypoints covering 250,000 person instances. Each person is annotated with 17 joint points, such as the head, shoulders, arms, hands, knees, and ankles. The PETR model is trained for 100 epochs with a learning rate of 2.5e-5; to help training converge, the learning rate is reduced to 2.5e-6 at epoch 80. With the trained PETR model, the human joint point coordinates are detected for each input image and the human poses are generated.

Step 3 specifically includes:

The input of this step is the human joint point coordinates from step 2, and the output is the pose feature values of the people in the image. To account for differences in scene depth of field and person scale, the motion amplitude detection algorithm consists of four detection conditions.

Step 3.1: the joint connection format used for pose estimation is the COCO human joint connection format, as shown in Figure 5. Let (x_i, y_i) denote the x and y coordinates of the i-th joint point of the human body. Compute the coordinates of the midpoint s of the left and right shoulder joints (joint numbers 5, 6), the midpoint h of the left and right hip joints (joint numbers 11, 12), and the midpoint k of the left and right knee joints (joint numbers 13, 14).

Step 3.2, detection condition 1: compute the ratio of the hand span to the length of the central trunk. The hand span is defined by the left and right wrist joints (joint numbers 9, 10), and the central trunk length by s and h. From the joint coordinates, compute the hand span d_hand and the central trunk length d_body. When the ratio of the hand span to the central trunk length is greater than or equal to a first threshold, the movement is judged to be large-amplitude, otherwise small-amplitude; based on experimental results the first threshold is set to 1.8, i.e., the movement is large-amplitude when d_hand / d_body ≥ 1.8.

Step 3.3, detection condition 2: compute the inclination angle of the trunk. Let x1 be the vector formed by s and h, and x2 the vector formed by h and k; the angle between the two vectors is the trunk inclination angle. When the trunk inclination angle angle_1 is less than or equal to a second threshold, the movement is judged to be large-amplitude, otherwise small-amplitude; based on experimental results the second threshold is set to 150°, i.e., the movement is large-amplitude when angle_1 ≤ 150°.

Step 3.4, detection condition 3: compute the opening angles between the arms and the trunk. The left-arm opening angle is formed by the vector x3 of the left elbow and left shoulder (joint numbers 7, 5) and the vector x4 of the left shoulder and left hip (joint numbers 5, 11); the right-arm opening angle is formed by the vector x5 of the right elbow and right shoulder (joint numbers 8, 6) and the vector x6 of the right shoulder and right hip (joint numbers 6, 12). When the left-arm opening angle angle_2 or the right-arm opening angle angle_3 is greater than or equal to a third threshold, the movement is judged to be large-amplitude, otherwise small-amplitude; based on experimental results the third threshold is set to 90°.

Step 3.5, detection condition 4: compute the opening angles between the legs and the trunk. The left-leg opening angle is formed by the vector x7 of the left knee and left hip (joint numbers 13, 11) and the vector x8 of the left hip and right hip (joint numbers 11, 12); the right-leg opening angle is formed by the vector x9 of the right knee and right hip (joint numbers 14, 12) and the vector x10 of the right hip and left hip (joint numbers 12, 11). When the left-leg opening angle angle_4 or the right-leg opening angle angle_5 is greater than or equal to a fourth threshold, the movement is judged to be large-amplitude, otherwise small-amplitude; based on experimental results the fourth threshold is set to 125°.
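As a concrete illustration of steps 3.1 to 3.5, the sketch below implements the four detection conditions in Python with NumPy, assuming the COCO keypoint ordering (5/6 shoulders, 7/8 elbows, 9/10 wrists, 11/12 hips, 13/14 knees) and the thresholds reported in the text. Because the original discrimination formulas are not reproduced here, the vector orientations are one reasonable reading of the description; the function names are illustrative.

```python
import numpy as np

def _angle(v1, v2):
    """Angle between two 2-D vectors, in degrees."""
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def is_large_amplitude(kpts, t_ratio=1.8, t_trunk=150.0, t_arm=90.0, t_leg=125.0):
    """kpts: (17, 2) array of (x, y) joint coordinates for one person."""
    kpts = np.asarray(kpts, dtype=float)
    s = (kpts[5] + kpts[6]) / 2               # shoulder midpoint
    h = (kpts[11] + kpts[12]) / 2             # hip midpoint
    k = (kpts[13] + kpts[14]) / 2             # knee midpoint

    # Condition 1: hand span normalised by central trunk length.
    d_hand = np.linalg.norm(kpts[9] - kpts[10])
    d_body = np.linalg.norm(s - h) + 1e-8
    if d_hand / d_body >= t_ratio:
        return True

    # Condition 2: trunk inclination, read here as the angle at the hip midpoint
    # between the upper trunk (h -> s) and the lower body (h -> k); roughly 180° when upright.
    if _angle(s - h, k - h) <= t_trunk:
        return True

    # Condition 3: arm-trunk opening angles at the shoulders.
    left_arm = _angle(kpts[7] - kpts[5], kpts[11] - kpts[5])
    right_arm = _angle(kpts[8] - kpts[6], kpts[12] - kpts[6])
    if left_arm >= t_arm or right_arm >= t_arm:
        return True

    # Condition 4: leg-trunk opening angles at the hips.
    left_leg = _angle(kpts[13] - kpts[11], kpts[12] - kpts[11])
    right_leg = _angle(kpts[14] - kpts[12], kpts[11] - kpts[12])
    return left_leg >= t_leg or right_leg >= t_leg
```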

Step 4 specifically includes:

The input of this step is the pose feature values of the people in the image output by step 3, and the output is the motion amplitude detection result. The amplitude of the pose feature variation is computed and a threshold method is used to decide whether the movement is large-amplitude.

Step 4.1: the outputs of step 3 are compared with the set thresholds. When any one of the pose feature values from the four detection conditions exceeds its threshold, the performing movement is judged to be large-amplitude; otherwise it is small-amplitude.

Step 5 specifically includes:

The input of this step is the motion amplitude detection result output by step 4. The intelligent directing system controls the cameras and switches the broadcast shot to the camera picture corresponding to an image containing a large-amplitude movement.

Step 5.1: as described in step 1.2, each image is named according to the camera position information, so each image represents the current program picture of one camera; for example, 1_01.jpg represents the first frame from camera No. 1. The intelligent directing system can therefore switch the shot to the camera whose image contains a large-amplitude movement, based on the image name and the detection result.
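A minimal sketch of the switching logic of step 5.1, assuming the naming convention of step 1.2; switch_to stands in for the directing system's switcher interface, which the text does not specify.

```python
def select_camera(frame_results: dict, switch_to) -> None:
    """frame_results maps image names such as '1_01.jpg' to the large-amplitude flag."""
    for image_name, is_large in frame_results.items():
        if is_large:
            camera_id = int(image_name.split("_")[0])   # camera id recovered from the file name
            switch_to(camera_id)                        # cut to the camera with a large-amplitude action
            break
```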

The invention applies a fully end-to-end multi-person pose estimation model to motion amplitude detection, gaining improvements in both functionality and performance, and focuses on improving the accuracy of human joint point localization and motion amplitude detection. Pose features are computed accurately and quickly from the joint point coordinates, and the motion amplitude is judged accordingly; the intelligent directing system thus achieves automatic or assisted program production, saving substantial manpower and material resources and improving efficiency.

Compared with the prior art, the beneficial effects of the invention are as follows:

(1) Fast operation: compared with motion amplitude detection based on image classification and on conventional pose estimation, the invention better suppresses background interference and focuses on the person. The model requires no manual operations and no auxiliary object detection model, greatly reducing the computing resources required; it runs fast with low memory consumption.

(2) Fully end-to-end, with speed and accuracy balanced: the invention applies the PETR model to motion amplitude detection. Compared with two-stage models such as top-down or bottom-up multi-person pose estimation, this model formulates human pose estimation as a hierarchical set prediction problem and handles person instances and fine-grained human joint point coordinates in a unified way, achieving fully end-to-end multi-person pose estimation in which both speed and accuracy reach a high level and are well balanced.

(3) High accuracy: traditional motion amplitude detection relies mainly on the relative displacement of joint points between adjacent video frames and covers few types of movement. The invention explicitly accounts for differences in scene depth of field and person scale, proposes a motion amplitude detection algorithm, and normalizes motion amplitude detection through pose feature analysis, yielding better and more accurate detection of performing movements in performing arts scenes.

Brief Description of the Drawings

Figure 1 is a flow chart of a specific embodiment of the invention;

Figure 2 is an overall structural diagram of the PETR model of the invention;

Figure 3 is a structural diagram of the pose decoder of the invention;

Figure 4 is a structural diagram of the joint point decoder of the invention;

Figure 5 is the COCO human joint point connection diagram used by the invention.

Detailed Description of Embodiments

The method is described in detail below with reference to the drawings and embodiments.

The flow chart of the embodiment is shown in Figure 1 and includes the following steps:

Step S10, video frame extraction;

Step S20, build and train a multi-person pose estimation model;

Step S30, compute pose features with the motion amplitude detection algorithm;

Step S40, judge the motion amplitude;

Step S50, according to the judgment result, the intelligent directing system switches the shot.

Intelligent directing based on artificial intelligence aims to solve problems in the production of various types of performing arts programs, including real-time state recognition of hosts and guests in different scenes and director switching. The core of the technology is to use multi-person pose estimation to recognize the dynamic changes of human joint points and thereby realize automatic director switching for a variety of programs. Applying this technology greatly reduces the production cycle and labor and material costs while improving production efficiency and quality. Action recognition and action detection aim to identify the type of an action; in contrast, motion amplitude detection aims to measure the extent of a person's body extension, a new application direction for artificial intelligence. The amplitude of a movement can shape an actor's performing style and help the actor convey more emotion and dynamism on stage. By detecting motion amplitude frame by frame, the real-time state of hosts and guests is identified and fed back to the intelligent directing system, enabling automatic or assisted program production.

The video frame extraction step S10 of the embodiment specifically includes:

The input of this step is video from multiple cameras, and the output is images obtained by extracting frames from the video. The extracted frames serve as input to the PETR model in step S20.

Step S100: the images are named as follows: each image is named according to the camera position information, so each image represents the current program picture of one camera; for example, 1_01.jpg represents the first frame from camera No. 1.

Step S110: the frame extraction in step S100 selects video frames at intervals of n frames; the other frames are discarded because their information is redundant. Here n is a positive integer greater than 1, for example 10.

Step S20 of building and training the multi-person pose estimation model specifically includes:

The input of this step is the images from step S10, and the output is the human joint point coordinates; the human poses are drawn on the images according to the joint point association information. The PETR model mainly comprises a backbone network module, a position encoding module, a visual feature encoder module, a pose decoder module, and a joint point decoder module; the overall structure of the PETR model is shown in Figure 2.

Step S200: the input of this step is an image from step S10, and the output is multi-scale feature maps. ResNet-50 serves as the backbone network of the PETR model; it is a 50-layer residual network that extracts feature maps from images, and the model of the invention uses it to extract the multi-scale feature maps of the last three stages. For an input image of height H and width W, the multi-scale feature maps of the last three stages are extracted with the ResNet-50 network.
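As an illustration of step S200, the sketch below extracts the feature maps of the last three ResNet-50 stages with torchvision; this assumes a standard torchvision backbone rather than the PETR reference implementation, and pretrained weights would normally be loaded in place of the untrained placeholder used here.

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# layer2/layer3/layer4 correspond to the last three stages (strides 8, 16, 32).
backbone = create_feature_extractor(
    resnet50(weights=None),                     # weights=None keeps the sketch self-contained
    return_nodes={"layer2": "c3", "layer3": "c4", "layer4": "c5"},
)

image = torch.randn(1, 3, 512, 512)             # placeholder H x W input image
feats = backbone(image)                         # dict of multi-scale feature maps
print({name: tuple(f.shape) for name, f in feats.items()})
```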

Step S210: the visual feature encoder of the PETR model. The input of this step is the multi-scale feature maps from step S200 and the positional encoding generated for each pixel by the position encoding module; the output is the multi-scale feature information and the pose query keys. Because a standard multi-head attention module has complexity quadratic in the input size, the visual feature encoder uses a deformable multi-head attention module for feature encoding. The multi-scale feature maps are encoded into multi-scale feature information F ∈ R^(L×256), where L is the total number of tokens. Finally, F is fed together with the pose query keys into the pose decoder for pose prediction.

Step S220: the pose decoder of the PETR model. The input of this step is the multi-scale features F obtained in step S210 and N randomly initialized pose queries Q_pose ∈ R^(N×D); the pose decoder outputs the poses of the N bodies, where each pose consists of the coordinates of the K joint points of the corresponding person and D denotes the dimension of the query key.

The structure of the pose decoder is shown in Figure 3. First, the pose queries are fed into self-attention for interaction among pose queries, i.e., pose-to-pose self-attention. Then each pose query extracts, layer by layer, K feature-map pixels from the multi-scale feature memory F as keys through deformable cross-attention, and features are aggregated as values according to the offsets predicted from the pose query. The cross-attention module outputs K derived joint point coordinates, which serve as the initial coordinates of the human pose joints. The pose queries carrying the attended key features are then fed into the multi-task prediction heads: the classification head predicts the confidence of each target through a linear mapping layer, and the pose regression head uses an MLP with a hidden width of 256 to predict the relative position offsets of the K derived points.

The pose decoder can consist of multiple decoding layers. Unlike other Transformer methods that use only the last decoder layer to predict pose coordinates, the pose decoder of PETR estimates the pose coordinates layer by layer through all decoder layers, each layer refining the pose based on the prediction of the previous layer.
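A minimal sketch of the two prediction heads described above, assuming PyTorch: a linear classification head for per-query confidence and an MLP with a hidden width of 256 regressing the offsets of the K derived joint points. This mirrors the textual description, not the PETR source code.

```python
import torch.nn as nn

class PosePredictionHeads(nn.Module):
    def __init__(self, dim: int = 256, num_joints: int = 17):
        super().__init__()
        self.cls_head = nn.Linear(dim, 1)            # confidence of each pose query
        self.reg_head = nn.Sequential(               # offsets (dx, dy) for the K derived points
            nn.Linear(dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_joints * 2),
        )

    def forward(self, query_feats):                  # query_feats: (N, dim) pose-query features
        return self.cls_head(query_feats), self.reg_head(query_feats)
```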

Step S230: the joint point decoder of the PETR model. The input of this step is the K joint points of each human pose predicted by the pose decoder in step S220, used as randomly initialized joint queries; the output is the further refined joint point positions and joint structure information. The feature information of the joint queries and keys is continuously updated through self-attention and deformable cross-attention, and finally the joint queries aggregate features according to the predicted offsets, further refining the joint point positions and joint structure information. Because the joint points of each human pose are independent of those of the other poses, all poses can be processed in parallel, which greatly reduces the time complexity of network prediction and inference. The structure of the joint point decoder is shown in Figure 4.

The joint queries first interact with each other through a self-attention module, i.e., joint-to-joint attention. Visual features are then extracted through a deformable cross-attention module, i.e., feature-to-joint attention. The joint prediction head then uses an MLP to predict the relative displacements ΔJ = (Δx, Δy) between the 2D joint points. Similar to the pose decoder, the joint point coordinates are refined progressively.
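A schematic sketch of the layer-by-layer refinement described for both decoders: each decoder layer predicts a small offset that is added to the coordinates from the previous layer. The decoder_layers and offset_head arguments are assumed components standing in for the actual attention layers and MLP head; the loop only illustrates the refinement pattern.

```python
def refine_joints(joints, query_feats, decoder_layers, offset_head):
    """joints: (N, K, 2) coordinates from the previous stage, refined layer by layer."""
    for layer in decoder_layers:
        query_feats = layer(query_feats, joints)     # attention conditioned on the current estimate
        joints = joints + offset_head(query_feats)   # add the predicted ΔJ = (Δx, Δy) per joint
    return joints
```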

Steps S200 to S230 above build the overall framework of the PETR model; the specific steps for training the PETR model are described below.

Step S240: training uses a set-based Hungarian loss intended to force a unique prediction for each ground-truth pose, reducing missed and false joint detections. The classification loss, denoted L_cls, is used in the classification head. To eliminate the relative error of the predictions, the PETR model uses an OKS loss, the regression loss between the joint point coordinates of the predicted poses and the ground-truth joint point coordinates. The L1 loss, denoted L_reg, and the OKS loss, denoted L_oks, are used as the loss functions of the pose regression head and the joint regression head respectively, so that the pose decoder and the joint point decoder converge better.

A heatmap-regression-based method is used to assist training and adds directional guidance: the closer to the target joint point, the larger the activation value, so the model can quickly approach the target joint point along the guided direction and converge faster. Accordingly, the PETR model uses a deformable transformer encoder to produce heatmap predictions and computes a variant of the focal loss, denoted L_hm, between the predicted and ground-truth heatmaps. The heatmap branch is only used to assist training and is removed at inference time. In summary, the overall loss function of the PETR model is:

L = L_cls + λ1·L_reg + λ2·L_oks + λ3·L_hm

where λ1, λ2, and λ3 are the corresponding loss weights.

Step S250: the COCO keypoint dataset is used as the training set and ResNet-50 as the backbone network. The model is trained for 100 epochs with a learning rate of 2.5e-5; to help training converge, the learning rate is reduced to 2.5e-6 at epoch 80. The PETR model is trained and tested accordingly. With the trained PETR model, the human joint point coordinates are detected for each input image and the human poses are generated.
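A sketch of the reported schedule (learning rate 2.5e-5, dropped to 2.5e-6 after epoch 80 of 100), assuming a PyTorch optimizer; the placeholder model, the choice of AdamW, and the epoch loop stand in for the actual PETR training code, which is not given in the text.

```python
import torch

model = torch.nn.Linear(8, 8)       # placeholder standing in for the PETR network
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-5)      # optimizer choice is an assumption
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80], gamma=0.1)

for epoch in range(100):
    # ... one training epoch over the COCO keypoint data would run here ...
    scheduler.step()                # after epoch 80 the learning rate becomes 2.5e-6
```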

Step S30, computing pose features with the motion amplitude detection algorithm, specifically includes:

The input of this step is the human joint point coordinates from step S20, and the output is the pose feature values of the people in the image. To account for differences in scene depth of field and person scale, the motion amplitude detection algorithm consists of four detection conditions.

Step S300: the joint connection format used for pose estimation in the invention is the COCO human joint connection format, as shown in Figure 5. Let (x_i, y_i) denote the x and y coordinates of the i-th joint point of the human body. Compute the coordinates of the midpoint s of the left and right shoulder joints (joint numbers 5, 6), the midpoint h of the left and right hip joints (joint numbers 11, 12), and the midpoint k of the left and right knee joints (joint numbers 13, 14).

Step S310, detection condition 1: compute the ratio of the hand span to the length of the central trunk. The hand span is defined by the left and right wrist joints (joint numbers 9, 10), and the central trunk length by s and h. From the joint coordinates, compute the hand span d_hand and the central trunk length d_body. When the ratio of the hand span to the central trunk length is greater than or equal to a first threshold, the movement is judged to be large-amplitude, otherwise small-amplitude; based on experimental results the first threshold is set to 1.8, i.e., the movement is large-amplitude when d_hand / d_body ≥ 1.8.

Step S320, detection condition 2: compute the inclination angle of the trunk. Let x1 be the vector formed by s and h, and x2 the vector formed by h and k; the angle between the two vectors is the trunk inclination angle. When the trunk inclination angle angle_1 is less than or equal to a second threshold, the movement is judged to be large-amplitude, otherwise small-amplitude; based on experimental results the second threshold is set to 150°, i.e., the movement is large-amplitude when angle_1 ≤ 150°.

Step S330, detection condition 3: compute the opening angles between the arms and the trunk. The left-arm opening angle is formed by the vector x3 of the left elbow and left shoulder (joint numbers 7, 5) and the vector x4 of the left shoulder and left hip (joint numbers 5, 11); the right-arm opening angle is formed by the vector x5 of the right elbow and right shoulder (joint numbers 8, 6) and the vector x6 of the right shoulder and right hip (joint numbers 6, 12). When the left-arm opening angle angle_2 or the right-arm opening angle angle_3 is greater than or equal to a third threshold, the movement is judged to be large-amplitude, otherwise small-amplitude; based on experimental results the third threshold is set to 90°.

Step S340, detection condition 4: compute the opening angles between the legs and the trunk. The left-leg opening angle is formed by the vector x7 of the left knee and left hip (joint numbers 13, 11) and the vector x8 of the left hip and right hip (joint numbers 11, 12); the right-leg opening angle is formed by the vector x9 of the right knee and right hip (joint numbers 14, 12) and the vector x10 of the right hip and left hip (joint numbers 12, 11). When the left-leg opening angle angle_4 or the right-leg opening angle angle_5 is greater than or equal to a fourth threshold, the movement is judged to be large-amplitude, otherwise small-amplitude; based on experimental results the fourth threshold is set to 125°.

The motion amplitude judgment step S40 specifically includes:

The input of this step is the pose feature values of the people in the image output by step S30, and the output is the motion amplitude detection result. The amplitude of the pose feature variation is computed and a threshold method is used to decide whether the movement is large-amplitude.

Step S400: the outputs of step S30 are compared with the set thresholds. When any one of the pose feature values from the four detection conditions exceeds the threshold set by the system, the performing movement is judged to be large-amplitude; otherwise it is small-amplitude.

Step S50, in which the intelligent directing system switches the shot according to the judgment result, specifically includes:

The input of this step is the motion amplitude detection result output by step S40; the intelligent directing system switches the broadcast shot to the camera picture corresponding to an image containing a large-amplitude movement.

Step S500: as described in step S100, each image is named according to the camera position information, so each image represents the current program picture of one camera; for example, 1_01.jpg represents the first frame from camera No. 1. The intelligent directing system can therefore switch the shot to the camera whose image contains a large-amplitude movement, based on the image name and the motion amplitude detection result.

Experimental results obtained with the invention are given below.

In the intelligent directing system of the invention, the interval between key frame images is approximately T1 = 0.3333 s, and processing one key frame image takes T2 = 0.3024 s. Because T1 > T2, the system can complete motion amplitude detection before the next image arrives, achieving real-time shot switching; the detection speed and accuracy of the invention therefore meet the requirements of practical application scenarios.

Table 1 shows the results of testing the invention on 600 performing arts test images.

Table 1. Test results for performing arts motion amplitude detection

Image category | Number of images | Accuracy (%)
Large-amplitude performing movements | 300 | 88.0
Small-amplitude performing movements | 300 | 86.33

The beneficial effects of the invention are as follows:

The invention applies a fully end-to-end multi-person pose estimation model to motion amplitude detection, gaining improvements in both functionality and performance, and focuses on improving the accuracy of human joint point localization and motion amplitude detection. Pose features are computed accurately and quickly from the joint point coordinates and the motion amplitude is judged accordingly, achieving automatic or assisted program production, saving substantial manpower and material resources, and improving efficiency. In addition, the invention has the following beneficial effects:

1. Fast operation: it better suppresses background interference and focuses on the person, requires no manual operations and no auxiliary object detection model, greatly reduces the computing resources required, and offers fast operation with low memory consumption.

2. Fully end-to-end, with speed and accuracy balanced: compared with methods that use two-stage models such as top-down or bottom-up multi-person pose estimation, this model formulates human pose estimation as a hierarchical set prediction problem and handles person instances and fine-grained human joint point coordinates in a unified way, achieving fully end-to-end multi-person pose estimation in which both speed and accuracy reach a high level and are well balanced.

3. High accuracy: traditional motion amplitude detection relies mainly on the relative displacement of joint points between adjacent video frames and covers few types of movement. The invention explicitly accounts for differences in scene depth of field and person scale, proposes a motion amplitude detection algorithm, and normalizes motion amplitude detection through pose feature analysis, yielding better and more accurate detection of performing movements in performing arts scenes.

Claims (6)

1. An intelligent guided broadcast switching method based on multi-person gesture estimation motion amplitude detection is characterized by comprising the following steps:
step 1, performing frame extraction based on a performance video acquired by a camera to acquire an image;
step 2, building and training a multi-person gesture estimation model;
step 3, inputting the image in the step 1 into the multi-person gesture estimation model in the step 2, and calculating each gesture characteristic through a motion amplitude detection algorithm;
step 4, judging the action amplitude of the gesture features obtained in the step 3;
and step 5, performing lens switching according to the judgment result of the action amplitude.
2. The intelligent guided broadcast switching method based on the multi-person gesture estimation motion amplitude detection of claim 1, wherein the step 1 specifically includes:
acquiring multi-person gesture motion amplitude videos from a plurality of cameras, and obtaining multi-person gesture motion amplitude images from those videos.
Step 1.1, selecting video frames of a multi-person gesture motion amplitude video at intervals of n frames, and directly discarding other video frames due to information redundancy; wherein n is a positive integer greater than 1.
Step 1.2, naming each video frame according to camera position information to obtain a plurality of images, wherein each image represents the current program picture of one camera position.
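A minimal sketch of steps 1.1 and 1.2: one frame out of every n is kept and the image name encodes the camera position. OpenCV (cv2) is assumed, and the file paths and camera identifiers are hypothetical examples, not values from the patent.

# Sketch of frame extraction and camera-based naming (steps 1.1-1.2).
import cv2

def extract_key_frames(video_path: str, camera_id: str, n: int = 10):
    """Yield (image_name, frame) pairs, keeping one frame out of every n."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % n == 0:                         # keep every n-th frame, drop the rest
            name = f"{camera_id}_{index:06d}.jpg"  # image name encodes the camera position
            yield name, frame
        index += 1
    cap.release()

# Example: one video per camera position.
for name, frame in extract_key_frames("cam02.mp4", camera_id="cam02", n=10):
    cv2.imwrite(name, frame)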
3. The intelligent guided broadcast switching method based on the multi-person gesture estimation motion amplitude detection of claim 2, wherein the step 2 specifically includes:
The multi-person gesture estimation model is a PETR model, which mainly includes a backbone network module, a position coding module, a visual feature encoder module, a gesture decoder module, and a joint point decoder module. Based on the image obtained in step 1, the PETR model outputs the coordinates of the human joint points in the image and draws the human gestures in the image according to the joint point information.
Step 2.1, the backbone network module takes the image from step 1 as input and outputs a multi-scale feature map. The backbone network module is ResNet-50, a 50-layer residual network used to extract feature maps from the image; here it extracts high-resolution multi-scale feature maps.
Step 2.2, the visual feature encoder, based on the multi-scale feature map obtained in step 2.1 and the position coding module, generates a position code for each pixel and produces multi-scale feature tokens and gesture queries;
Step 2.3, the gesture decoder, based on the multi-scale feature tokens F obtained in step 2.2 and N randomly initialized gesture queries Q_pose ∈ R^(N×D), acquires N body gestures, wherein the gesture of the i-th person consists of the coordinates of its K joint points and D represents the dimension of the query key. The gesture decoder estimates the gesture coordinates layer by layer through all decoder layers, each layer refining the gesture based on the predictions of the previous layer.
Step 2.4, the joint point decoder takes the K joint points of each human gesture predicted by the gesture decoder in step 2.3 as randomly initialized joint point queries and refines the joint point position information and joint point structure information. The feature information of the joint point queries and keys is continuously updated through self-attention and deformable cross-attention, and the joint point queries are finally aggregated through offsets, so that the joint point position and structure information can be further refined.
Step 2.5, training is performed with a set-based Hungarian loss; the classification loss function, denoted L_cls, is used in the classification head. To eliminate the relative error of the prediction results, the PETR model adopts an OKS loss, which is a regression loss computed between the joint point coordinates of the human gesture predicted by the model and the ground-truth joint point coordinates. The L_1 loss is denoted L_reg and the OKS loss is denoted L_oks; these loss functions are used for the gesture regression head and the joint point regression head, respectively;
The PETR model uses a deformable Transformer encoder to generate heatmap predictions and computes a variant of the focal loss between the predicted and ground-truth heatmaps, denoted L_hm. The heatmap branch is used only as a training aid and is removed in the inference phase. The overall loss function of the PETR model is expressed as:
L = L_cls + λ_1 L_reg + λ_2 L_oks + λ_3 L_hm
wherein λ_1, λ_2, and λ_3 represent the corresponding loss weights.
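A compact sketch of how the four loss terms might be combined during training. The individual loss values are assumed to be computed elsewhere, and the default weights are illustrative placeholders, not the values used by the actual PETR implementation.

# Sketch of the overall training loss L = L_cls + λ1·L_reg + λ2·L_oks + λ3·L_hm.
def total_loss(l_cls, l_reg, l_oks, l_hm,
               lambda_reg=1.0, lambda_oks=1.0, lambda_hm=1.0):
    # Weighted sum of classification, L1 regression, OKS and heatmap losses.
    return l_cls + lambda_reg * l_reg + lambda_oks * l_oks + lambda_hm * l_hm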
Step 2.6, constructing a COCO joint point dataset as the training set, and training and testing the PETR model. The COCO joint point data contain 200000 images in total, of which 64355 are joint point images including 250000 human joint point annotations; each person is annotated with 17 joint points. The PETR model is trained for 100 iteration epochs in total. The learning rate is 2.5e-5 and is reduced to 2.5e-6 after 80 epochs so that the PETR model training converges better. The trained PETR model detects the human joint point coordinates for each input image and generates the human gestures.
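A minimal sketch of the training schedule described in step 2.6 (100 epochs, learning rate 2.5e-5 dropped to 2.5e-6 after 80 epochs). PyTorch is assumed; the model, optimizer choice, and training-step body are placeholders, not the actual PETR training code.

# Sketch of the learning-rate schedule: 2.5e-5 for 80 epochs, then 2.5e-6.
import torch

model = torch.nn.Linear(10, 10)   # placeholder standing in for the PETR model
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-5)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80], gamma=0.1)

for epoch in range(100):
    # ... one pass over the COCO joint point training set would go here ...
    optimizer.step()      # placeholder for the actual optimization step
    scheduler.step()      # reduces the learning rate to 2.5e-6 at epoch 80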
4. The intelligent guided broadcast switching method based on the multi-person gesture estimation motion amplitude detection of claim 3, wherein the step 3 specifically includes:
The input of this step is the human joint point coordinates from step 2, and the output is the gesture feature values of the human bodies in the image. The algorithm explicitly accounts for differences in picture depth of field and person scale, and the motion amplitude detection algorithm consists of four detection conditions.
Step 3.1, the joint point connection format of the gesture estimation adopts the COCO human joint point connection format, where (x_i, y_i) denote the x and y coordinates of the i-th human joint point. The coordinates of the midpoint s of the left and right shoulder joint points, the midpoint h of the left and right hip joint points, and the midpoint k of the left and right knee joint points are calculated.
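A minimal sketch of step 3.1, computing the midpoints s, h, and k from a 17×2 array of joint point coordinates. The keypoint indices assume the usual COCO ordering; the input array layout is an assumption for illustration.

# Sketch: midpoints of shoulders (s), hips (h) and knees (k) from COCO keypoints.
import numpy as np

L_SHOULDER, R_SHOULDER = 5, 6
L_HIP, R_HIP = 11, 12
L_KNEE, R_KNEE = 13, 14

def midpoints(kpts: np.ndarray):
    """kpts: (17, 2) array of (x_i, y_i) joint point coordinates for one person."""
    s = (kpts[L_SHOULDER] + kpts[R_SHOULDER]) / 2.0   # shoulder midpoint
    h = (kpts[L_HIP] + kpts[R_HIP]) / 2.0             # hip midpoint
    k = (kpts[L_KNEE] + kpts[R_KNEE]) / 2.0           # knee midpoint
    return s, h, k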
Step 3.2, detection condition one: calculating the ratio of the opening length of the two hands to the length of the body-center trunk. The hand opening length is defined by the left and right wrist joint points, and the body-center trunk length is defined by s and h. The hand opening length d_hand and the trunk length d_body are computed from the joint point coordinates; when the ratio of the hand opening length to the trunk length is greater than or equal to a first threshold, the motion is judged to be a large-amplitude motion, otherwise a small-amplitude motion. The first threshold is set to 1.8 according to the experimental results, and the discrimination formula is d_hand / d_body ≥ 1.8 for a large-amplitude motion.
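A minimal sketch of detection condition one under the same assumptions as the step 3.1 sketch (COCO wrist indices, s and h passed in from that sketch).

# Sketch: hand opening length vs. trunk length, thresholded at 1.8.
import numpy as np

L_WRIST, R_WRIST = 9, 10

def condition_one(kpts: np.ndarray, s: np.ndarray, h: np.ndarray,
                  threshold: float = 1.8) -> bool:
    d_hand = np.linalg.norm(kpts[L_WRIST] - kpts[R_WRIST])  # opening length of the two hands
    d_body = np.linalg.norm(s - h)                          # body-center trunk length
    return d_body > 0 and d_hand / d_body >= threshold      # True = large-amplitude motion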
Step 3.3, detection condition two: calculating the body trunk inclination angle. The vector x_1 is formed by s and h, and the vector x_2 is formed by h and k; the angle between the two vectors is the body trunk inclination angle. When the body trunk inclination angle angle_1 is less than or equal to a second threshold, the motion is judged to be a large-amplitude motion, otherwise a small-amplitude motion. The second threshold is set to 150 degrees according to the experimental results, and the discrimination formula is angle_1 ≤ 150° for a large-amplitude motion.
Step 3.4, detection condition three: calculating the opening angles between the arms and the trunk. The left arm opening angle is formed by the vector x_3 of the left elbow and left shoulder and the vector x_4 of the left shoulder and left hip; the right arm opening angle is formed by the vector x_5 of the right elbow and right shoulder and the vector x_6 of the right shoulder and right hip. When the left arm opening angle angle_2 and the right arm opening angle angle_3 are greater than or equal to a third threshold, the motion is judged to be a large-amplitude motion, otherwise a small-amplitude motion. The third threshold is set to 90 degrees according to the experimental results, and the discrimination formula compares angle_2 and angle_3 against this threshold.
Step 3.5, detection condition four: calculating the opening angles between the legs and the trunk. The left leg opening angle is formed by the vector x_7 of the left knee and left hip and the vector x_8 of the left hip and right hip; the right leg opening angle is formed by the vector x_9 of the right knee and right hip and the vector x_10 of the right hip and left hip. When the left leg opening angle angle_4 and the right leg opening angle angle_5 are greater than or equal to a fourth threshold, the motion is judged to be a large-amplitude motion, otherwise a small-amplitude motion. The fourth threshold is set to 125 degrees according to the experimental results, and the discrimination formula compares angle_4 and angle_5 against this threshold.
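A minimal sketch of the angle-based detection conditions two to four. The COCO keypoint indices, the exact vector directions, and the use of both left and right angles (as written in steps 3.4 and 3.5) are assumptions for illustration; s, h, and k are the midpoints from the step 3.1 sketch.

# Sketch: trunk inclination and arm/leg opening angles vs. thresholds 150°, 90°, 125°.
import numpy as np

L_SHOULDER, R_SHOULDER = 5, 6
L_ELBOW, R_ELBOW = 7, 8
L_HIP, R_HIP = 11, 12
L_KNEE, R_KNEE = 13, 14

def angle_deg(u: np.ndarray, v: np.ndarray) -> float:
    """Angle in degrees between two 2D vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def angle_conditions(kpts, s, h, k):
    # Condition two: trunk inclination angle (vectors h->s and h->k), large if <= 150°.
    angle_1 = angle_deg(s - h, k - h)
    cond2 = angle_1 <= 150.0

    # Condition three: arm opening angles at the shoulders, large if >= 90°.
    angle_2 = angle_deg(kpts[L_ELBOW] - kpts[L_SHOULDER], kpts[L_HIP] - kpts[L_SHOULDER])
    angle_3 = angle_deg(kpts[R_ELBOW] - kpts[R_SHOULDER], kpts[R_HIP] - kpts[R_SHOULDER])
    cond3 = angle_2 >= 90.0 and angle_3 >= 90.0

    # Condition four: leg opening angles at the hips, large if >= 125°.
    angle_4 = angle_deg(kpts[L_KNEE] - kpts[L_HIP], kpts[R_HIP] - kpts[L_HIP])
    angle_5 = angle_deg(kpts[R_KNEE] - kpts[R_HIP], kpts[L_HIP] - kpts[R_HIP])
    cond4 = angle_4 >= 125.0 and angle_5 >= 125.0

    return cond2, cond3, cond4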
5. the intelligent guided broadcast switching method based on the multi-person gesture estimation motion amplitude detection of claim 4, wherein the step 4 specifically includes:
The input of this step is the gesture feature values of the human bodies in the image output by step 3, and the output is the motion amplitude detection result. Whether a motion is a large-amplitude motion is judged by calculating the magnitude of the change in the gesture features and applying a threshold method.
Step 4.1, the output of step 3 is compared with the set thresholds; when any one of the gesture feature values in the four detection conditions exceeds its set threshold, the performance motion is judged to be a large-amplitude motion, otherwise a small-amplitude motion.
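A minimal sketch of step 4.1: the four condition flags (assumed to come from the condition sketches above) are combined with a logical OR.

# Sketch: a motion is large-amplitude if any one of the four conditions holds.
def classify_amplitude(cond1: bool, cond2: bool, cond3: bool, cond4: bool) -> str:
    return "large" if (cond1 or cond2 or cond3 or cond4) else "small"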
6. The intelligent guided broadcast switching method based on the multi-person gesture estimation motion amplitude detection of claim 5, wherein the step 5 specifically includes:
The input of this step is the motion amplitude detection result output by step 4; the intelligent guided broadcast system controls the cameras and switches the shot to the camera picture corresponding to the large-amplitude motion image.
Step 5.1, as described in step 1.2, each image is named according to camera position information, and each image represents the current program picture of one camera position. The intelligent guided broadcast system switches the shot to the camera picture corresponding to the large-amplitude motion image according to the image name and the motion amplitude detection result.
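A minimal sketch of step 5: the camera position is recovered from the image name assigned in step 1.2, and the shot is switched to the camera whose image produced a large-amplitude detection. The image naming pattern and the switcher interface are hypothetical examples.

# Sketch: pick the camera whose image shows a large-amplitude motion.
def select_camera(results: dict[str, str], current: str) -> str:
    """results maps image names like 'cam02_000120.jpg' to 'large' or 'small'."""
    for image_name, amplitude in results.items():
        if amplitude == "large":
            return image_name.split("_")[0]   # camera position encoded in the image name
    return current                            # no large-amplitude motion: keep the current shot

# Example: the guided broadcast system would send this ID to the video switcher.
active = select_camera({"cam01_000120.jpg": "small", "cam02_000120.jpg": "large"}, "cam01")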
CN202310762371.2A 2023-06-27 2023-06-27 Intelligent guided broadcast switching method based on multi-person gesture estimation action amplitude detection Pending CN116721468A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310762371.2A CN116721468A (en) 2023-06-27 2023-06-27 Intelligent guided broadcast switching method based on multi-person gesture estimation action amplitude detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310762371.2A CN116721468A (en) 2023-06-27 2023-06-27 Intelligent guided broadcast switching method based on multi-person gesture estimation action amplitude detection

Publications (1)

Publication Number Publication Date
CN116721468A true CN116721468A (en) 2023-09-08

Family

ID=87867724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310762371.2A Pending CN116721468A (en) 2023-06-27 2023-06-27 Intelligent guided broadcast switching method based on multi-person gesture estimation action amplitude detection

Country Status (1)

Country Link
CN (1) CN116721468A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117860242A (en) * 2024-03-12 2024-04-12 首都儿科研究所 Infant walking action development detection method, equipment and device
CN117860242B (en) * 2024-03-12 2024-05-28 首都儿科研究所 Infant walking action development detection method, equipment and device

Similar Documents

Publication Publication Date Title
CN104601964B (en) Pedestrian target tracking and system in non-overlapping across the video camera room of the ken
CN111310659B (en) Human body action recognition method based on enhanced graph convolution neural network
CN111199207B (en) Two-dimensional multi-human body posture estimation method based on depth residual error neural network
Zheng et al. A joint relationship aware neural network for single-image 3D human pose estimation
CN113361542A (en) Local feature extraction method based on deep learning
Yang et al. Human exercise posture analysis based on pose estimation
Liao et al. Ai golf: Golf swing analysis tool for self-training
CN117671738B (en) Human body posture recognition system based on artificial intelligence
CN102629329A (en) Personnel indoor positioning method based on adaptive SIFI (scale invariant feature transform) algorithm
CN110781962A (en) Target detection method based on lightweight convolutional neural network
CN112906520A (en) Gesture coding-based action recognition method and device
Hammam et al. Real-time multiple spatiotemporal action localization and prediction approach using deep learning
CN112560620B (en) Target tracking method and system based on target detection and feature fusion
Zhang et al. Dynamic fry counting based on multi-object tracking and one-stage detection
CN114639117A (en) A method and device for cross-border specific pedestrian tracking
Nguyen et al. Combined YOLOv5 and HRNet for high accuracy 2D keypoint and human pose estimation
CN116721468A (en) Intelligent guided broadcast switching method based on multi-person gesture estimation action amplitude detection
Shi et al. Occlusion-aware graph neural networks for skeleton action recognition
Wang et al. GaitParsing: Human semantic parsing for gait recognition
Fu et al. Traffic police 3D gesture recognition based on spatial–temporal fully adaptive graph convolutional network
CN111898566A (en) Attitude estimation method, device, electronic device and storage medium
CN112633261A (en) Image detection method, device, equipment and storage medium
CN114529944B (en) A portrait scene recognition method combined with human body key point heat map features
CN114862904B (en) A twin network target continuous tracking method for underwater robots
CN117173777A (en) Learner front posture estimation method based on limb direction clue decoding network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination