CN116863509B - Method for detecting human-shaped outline and recognizing gesture by using improved polar mask - Google Patents

Method for detecting human-shaped outline and recognizing gesture by using improved polar mask

Info

Publication number
CN116863509B
Authority
CN
China
Prior art keywords
human
humanoid
polar
model
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311119512.5A
Other languages
Chinese (zh)
Other versions
CN116863509A (en)
Inventor
Wen Tingxi (温廷羲)
Tong Binbin (童斌斌)
Hou Qingfei (侯晴霏)
Chen Yuping (陈雨萍)
Xie Jianhua (谢建华)
Zeng Huanqiang (曾焕强)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Huanyutong Technology Co., Ltd.
Huaqiao University
Original Assignee
Fujian Huanyutong Technology Co., Ltd.
Huaqiao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Huanyutong Technology Co., Ltd. and Huaqiao University
Priority to CN202311119512.5A
Publication of CN116863509A
Application granted
Publication of CN116863509B
Legal status: Active


Classifications

    • G06V40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/20 — Movements or behaviour, e.g. gesture recognition
    • G06V10/26 — Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V10/806 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06N3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N3/096 — Transfer learning
    • Y02T10/40 — Engine management systems


Abstract

The invention detects the human contour and recognizes posture using an improved PolarMask model. First, polar-coordinate modeling of the human contour is designed around the characteristics of the human silhouette. An improved PolarMask model is then built as the human contour segmentation model: a channel attention module is added, and skip connections are added to the original YOLOv7-based feature pyramid network to compensate for detail information lost during feature fusion. Finally, a weak-label training strategy is adopted to train a preliminary segmentation model that can recognize a rectangular box containing the person's position and the person's posture class. During formal training, the pre-trained weights, which have learned contour-related information in advance, are used for transfer learning, so that the predicted human contour converges steadily while learning real contours, and the human contour and posture class are recognized accurately.

Description

Method for detecting human-shaped outline and recognizing gesture by using improved polar mask
Technical Field
The invention belongs to the field of image recognition, and in particular relates to a method for detecting the human contour and recognizing posture using an improved PolarMask.
Background
With the development of computer and imaging technology, human contour detection and posture recognition methods are receiving more and more attention. These techniques can play a key role in human-computer interaction, motion analysis, health monitoring, virtual reality, augmented reality, security monitoring and other fields, providing more intuitive, efficient and immersive solutions for various application scenarios and promoting the practical application of artificial intelligence. For human behavior recognition and intelligent monitoring in particular, most existing video-based methods rely on object recognition: they roughly identify a person's position and the current state of the body, such as standing or fallen. A target box containing the human figure is first obtained by object recognition and posture recognition, and the posture is then recognized from the coordinates of skeleton joints. Although such methods can identify the position and state of the human body to a certain extent, they cannot accurately recover the body's contour. A human contour recognition method is therefore needed that not only recognizes the human body accurately but also segments its contour precisely, providing more useful information for judging posture, improving the accuracy of posture recognition, and making the technique easier to apply in related fields.
The accuracy of the PolarMask-based instance segmentation method is far lower than that of other instance segmentation methods such as Mask R-CNN. In addition, because PolarMask is a deep learning model, a large dataset is required for training to achieve the expected results; segmentation-type annotation of a picture is very time-consuming and labor-intensive, taking about one minute per picture on average, so training the model requires substantial cost.
Disclosure of Invention
The invention aims to provide a method for human contour detection and posture recognition using an improved PolarMask, so as to achieve contour segmentation and posture recognition of the human body.
In the invention's method for detecting the human contour and recognizing posture with an improved PolarMask, an improved PolarMask model is adopted. Based on the characteristics of the human contour, the number of rays in each region inside the bounding box is allocated according to the ratio of the length to the width of the human bounding box identified by the PolarMask model, yielding the polar-coordinate modeling of the human contour. An improved PolarMask model is then constructed as the human contour segmentation model: before the feature pyramid network performs feature fusion, a channel attention module is added after the features of each scale, and skip connections are added to the original YOLOv7-based feature pyramid network to compensate for detail information lost during fusion. Finally, a weak-label dataset is used to pre-train the segmentation model, producing a preliminary model able to recognize a rectangular box containing the person's position and the person's posture class. During formal training, the pre-trained weights, which have learned contour-related information in advance, are used for transfer learning, so that the predicted human contour converges steadily while learning real contours, and the human contour and posture class are recognized accurately;
the improved polar mask model is constructed as a humanoid contour segmentation model, the original polar mask model is taken as a basis, and a Yolo V7 network structure is taken as a basis, so that a FPN network structure of the original polar mask model is replaced by a Yolo AT_FPN characteristic pyramid network, and a backbone network and a characteristic pyramid structure of the original polar mask model are improved; the humanoid contour segmentation model consists of an encoder and three decoders;
the encoder employs a yolat_fpn feature pyramid network that is modified based on the backbone network of yolat 7 as follows:
(1) The activation function of the original convolution module is replaced: the original SiLU is replaced with the nonlinear activation function GELU, widely used in natural language processing;
(2) A channel attention mechanism module is added before feature fusion is carried out in the feature pyramid;
(3) The multi-scale feature maps extracted by the original YOLOv7 backbone pass through the channel attention module to further extract important detail information; shallow and deep information are then linked by skip connections through 1×1 convolutions, compensating for the detail information lost during feature fusion;
the three decoders refer to three branches which are processed in parallel and are respectively a classification branch, a centrality branch and a polar coordinate mask branch, wherein the classification branch sequentially uses 4×4 Conv and 1×1 Conv to extract features, an H×W×N feature map is generated to predict N gestures, prediction of classification target types is realized, H, W respectively represents the length and the width of the input feature map, and N represents the type of gesture to be predicted; the centrality branch uses 4×4 Conv and 1×1 Conv shared by the classification branches to extract features, and generates a feature map of HxW×1 to predict polar coordinate center points; the polar mask branches are sequentially subjected to feature extraction by using 4×4 Conv and 1×1 Conv, and a feature map of h×w×60 is generated to predict the distances of 60 rays of the polar coordinates.
In the improved PolarMask model, based on the characteristics of the human contour, the number of rays in each region inside the bounding box is allocated according to the ratio of the length to the width of the human bounding box identified by the model. Specifically, the four vertices A, B, C, D of the identified bounding box and the body center O form four regions; the number of rays in each region is allocated by the length-to-width ratio of the box, giving the polar-coordinate modeling of the human contour, with the calculation shown in formula (2):
where O is the body center point; the four vertices A, B, C, D of the bounding box and the center O form four regions AOB, BOC, COD and AOD; Number_AOB is the number of rays in region AOB needed to construct the body contour, and Number_COD, Number_AOD and Number_BOC are the corresponding counts for regions COD, AOD and BOC; N is the total number of rays; y is the height of the bounding box; x is its width.
The channel attention module uses a SENet model, which comprises a squeeze stage and an excitation stage: global spatial information is compressed in the squeeze stage, feature learning is then performed along the channel dimension to form an attention weight for each channel, and the generated weights are finally applied to the corresponding channels in the excitation stage. Specifically:
The squeeze stage runs first, using global pooling to compress the H×W×C input into a 1×1×C output. The excitation stage follows with two fully connected layers: the first has C/r neurons, outputs 1×1×(C/r), and uses the ReLU activation; the second has C neurons, restores the output to 1×1×C, and uses the Sigmoid activation, where r is the reduction ratio of the first fully connected layer. The excitation stage learns the feature information of each channel to generate its attention weight, and the final 1×1×C channel attention weights are multiplied onto the corresponding channels of the original feature map.
A human profile detection and gesture recognition device employing an improved polar mask, said device comprising a processor and a memory; the memory is used for storing a computer program; the processor is used for executing any one of the methods for detecting the human-shaped outline and recognizing the gesture by using the improved polar mask according to the computer program.
A computer readable storage medium for storing a computer program for executing any one of the above methods for human profile detection and gesture recognition using the modified polar mask.
A chip for executing instructions for performing any of the above methods for human profile detection and gesture recognition using the modified polar mask.
The improved PolarMask model builds on the original PolarMask while drawing on the YOLOv7 network structure: skip connections and an attention module are added to the backbone, the improved PolarMask is applied to human contour instance segmentation and posture recognition, and box-type weak labels are used to pre-train the model, introducing transfer learning into the contour segmentation model. Compared with the prior art, the invention has the following technical effects:
(1) The human contour is segmented with the improved PolarMask model, and posture recognition is assisted by computing the distances and angles of the 60 rays. The model is pre-trained on an easily annotated box-type weak-label dataset, and the resulting pre-trained weights are used for transfer learning. If COCO-style instance segmentation labels were used for PolarMask pre-training, preparing the dataset would be very time-consuming and costly. The method instead pre-trains on box-type weak labels, whose annotation effort and time are far smaller than those of instance segmentation labels, reducing the required cost. After pre-training on the weak-label dataset, a good result can be reached with only a small number of instance segmentation labels for fine-tuning. Compared with other PolarMask-based methods, this saves cost to a great extent.
(2) The invention redesigns the polar-coordinate modeling, changing PolarMask's originally equally spaced rays into rays distributed non-uniformly according to the characteristics of the human figure: more rays represent complex parts of the contour and fewer rays represent simple parts. This removes the redundancy of uniformly distributed rays in the original model, represents the human contour with fewer rays, reduces model parameters and yields a more accurate contour.
(3) The invention replaces the backbone of the original PolarMask model, helping it extract features more effectively. The new backbone uses the YOLOv7 network structure, improves the activation function of the convolution module and adds skip connections to strengthen the network's extraction of detail information.
(4) The invention introduces the attention module into the human contour segmentation model, applying the channel attention network before feature fusion, so that the model focuses more on important feature information, which improves the segmentation accuracy of the network to a certain extent.
(5) By adopting transfer learning, the pre-trained weights start closer to the optimal convergence point, so subsequent formal training reaches convergence in a shorter time and the network converges to the optimum more easily, improving accuracy and efficiency; the generalization ability of the model is also improved to a certain extent.
Drawings
FIG. 1 is a diagram of a model structure of the present invention;
FIG. 2 is a modified view of a convolution module of the present invention;
FIG. 3 is a polar modeling improvement graph based on humanoid features of the present invention;
fig. 4 is a flowchart of a migration training method based on weak labels according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, it being apparent that the described embodiments are only some, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
The first embodiment of the invention is a method for detecting the human contour and recognizing posture with an improved PolarMask. The polar-coordinate modeling of the improved PolarMask model is redesigned by combining the characteristics of the human contour. The backbone of the original PolarMask is then rebuilt on the YOLOv7 backbone: the original convolution module is improved; before the feature pyramid network performs feature fusion, an attention module is added after the features of each scale; and skip connections are added to the original YOLOv7-based feature pyramid network to compensate for detail information lost during fusion. Finally, a weak-label training strategy is adopted: a weak-label dataset pre-trains the human contour segmentation model, the pre-trained weights are used for transfer learning during formal training, and the predicted contour converges steadily while learning real contours, so the human contour and posture class are recognized accurately. The method specifically comprises the following steps:
step 1, constructing an improved polar mask model as a humanoid contour segmentation model
As shown in fig. 1, the original PolarMask model is taken as the base and the YOLOv7 network structure as the reference; the backbone and feature pyramid of the original PolarMask are improved, and its FPN structure is replaced by the designed YOLOAT_FPN feature pyramid network. The human contour segmentation model consists of one encoder and three decoders;
The encoder employs the YOLOAT_FPN feature pyramid network, which modifies the YOLOv7 backbone as follows:
(1) The activation function of the original convolution module is replaced: the original SiLU is replaced with the nonlinear activation function GELU (Gaussian Error Linear Unit) used in natural language processing, as shown in fig. 2. GELU is a smooth, continuously differentiable nonlinear function that adapts better to gradient descent algorithms and converges more easily during training. Its mathematical form is given in formula (1):
GELU(x) = 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))    (1)
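As a quick illustration of formula (1), the tanh approximation of GELU can be computed directly; the comparison against SiLU below is a sketch for intuition only (plain Python, no deep learning framework):

```python
import math

def gelu(x: float) -> float:
    """Tanh approximation of GELU, as in formula (1)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def silu(x: float) -> float:
    """The original SiLU activation that GELU replaces: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

# Both are smooth, tend to 0 for large negative inputs and to x for
# large positive inputs, and differ slightly around zero.
for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={v:+.1f}  GELU={gelu(v):+.4f}  SiLU={silu(v):+.4f}")
```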
(2) A channel attention module is added before feature fusion in the feature pyramid, helping the model attend to important information and ignore unimportant information, thereby improving segmentation accuracy;
(3) The feature pyramid structure is improved with skip connections, giving the YOLOAT_FPN feature pyramid network that helps the model extract detail information. Although the original YOLOv7 structure performs multi-scale fusion through a feature pyramid, detail features are inevitably lost as the network deepens during fusion, and the importance of the shallow layers to recognition is ignored; the added skip connections compensate for this loss.
The three decoders are three parallel branches: a classification branch, a centerness branch and a polar mask branch. The classification branch extracts features with 4×4 Conv and 1×1 Conv in turn and generates an H×W×N feature map to predict N postures, realizing classification of the target class, where H and W are the height and width of the input feature map and N is the number of posture classes to predict; the centerness branch shares the 4×4 Conv and 1×1 Conv of the classification branch and generates an H×W×1 feature map to predict the polar-coordinate center point; the polar mask branch extracts features with 4×4 Conv and 1×1 Conv in turn and generates an H×W×60 feature map to predict the distances of the 60 polar rays. The invention also redesigns the arrangement of these 60 rays according to the characteristics of the human figure, making it better suited to human segmentation; the specific method is described in detail in step 2.
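To make the decoder outputs concrete: given a predicted center and the 60 predicted ray distances, the mask is recovered by walking each ray out from the center and connecting the endpoints. The sketch below assumes equally spaced angles by default (the original PolarMask scheme, which step 2 replaces with a non-uniform arrangement) and is illustrative only:

```python
import math

def contour_from_rays(center, distances, angles=None):
    """Recover contour vertices from a polar mask prediction: ray i
    leaves `center` at angle angles[i] and meets the contour at
    distance distances[i]; connecting the endpoints in order gives
    the predicted outline."""
    cx, cy = center
    n = len(distances)
    if angles is None:
        # default: n equally spaced rays, as in the original PolarMask
        angles = [2.0 * math.pi * i / n for i in range(n)]
    return [(cx + d * math.cos(a), cy + d * math.sin(a))
            for d, a in zip(distances, angles)]

# 60 unit-length rays from the origin trace out a unit circle
pts = contour_from_rays((0.0, 0.0), [1.0] * 60)
```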
The attention module uses a SENet (Squeeze-and-Excitation Network) model belonging to channel attention. It comprises a squeeze stage and an excitation stage: global spatial information is compressed in the squeeze stage, feature learning is then performed along the channel dimension to form an attention weight for each channel, and the generated weights are finally applied to the corresponding channels in the excitation stage. Specifically:
The squeeze stage runs first, using global pooling to compress the H×W×C input into a 1×1×C output. The excitation stage follows with two fully connected layers: the first has C/r neurons, outputs 1×1×(C/r), and uses the ReLU activation; the second has C neurons, restores the output to 1×1×C, and uses the Sigmoid activation, where r is the reduction ratio of the first fully connected layer, usually set to 16 for good results. The excitation stage learns the feature information of each channel to generate its attention weight, and the final 1×1×C channel attention weights are multiplied onto the corresponding channels of the original feature map.
Step 2: based on the characteristics of the human contour, the four vertices A, B, C, D of the human bounding box identified by the PolarMask model and the body center O form four regions; the number of rays in each region is allocated by the length-to-width ratio of the identified box, giving the polar-coordinate modeling of the human contour
The original PolarMask model emits N rays at equal angular intervals from the center of the object to be segmented and forms the contour by connecting the ray endpoints in turn, as shown in fig. 3 a). This contour modeling suits roughly circular objects. Applied directly to the human contour, it is highly redundant: it adds unnecessary computation and still cannot segment finer contours, because some parts of the human outline can be represented well by a few ray segments while more complex parts need more rays.
As shown in fig. 3 b), the number of rays in each region is allocated according to the ratio of the length to the width of the human bounding box identified by the PolarMask model; the specific calculation is given in formula (2):
where O is the body center point; the four vertices A, B, C, D of the bounding box and the center O form four regions AOB, BOC, COD and AOD; Number_AOB is the number of rays in region AOB needed to construct the body contour, as in fig. 3 b), and Number_COD, Number_AOD and Number_BOC are the corresponding counts for regions COD, AOD and BOC; N is the total number of rays; y is the height of the bounding box; x is its width.
The improved polar-coordinate modeling of the human contour both represents the contour more finely and reduces redundancy, and the non-uniform modeling also reduces model parameters: tests found that 90 uniformly distributed rays were originally needed to represent the human contour accurately, whereas the improved modeling represents it well with only 60 rays.
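Since formula (2) itself is not reproduced in this text, the sketch below uses an assumed proportional split for illustration: the 60 rays are divided between the two pairs of opposite regions according to the height-to-width ratio of the box, so a tall, narrow person box gets more rays in the regions covering its longer sides. Both the split rule and the pairing of regions with the height share are assumptions, not the patented allocation:

```python
def allocate_rays(x: float, y: float, n: int = 60):
    """Split n contour rays among the four regions AOB, BOC, COD, AOD
    formed by the bounding-box corners A, B, C, D and the body center O.

    x: box width, y: box height. The proportional rule below is a
    hypothetical stand-in for formula (2)."""
    per_pair = n // 2                       # two opposite regions share each count
    n_tall = round(per_pair * y / (x + y))  # regions assumed to face the long sides
    n_wide = per_pair - n_tall
    return {"AOB": n_tall, "COD": n_tall, "BOC": n_wide, "AOD": n_wide}

# a typical upright person box, twice as tall as wide
counts = allocate_rays(x=40.0, y=80.0)  # {'AOB': 20, 'COD': 20, 'BOC': 10, 'AOD': 10}
```

For a square box the split degenerates to the uniform case (15 rays per region), matching the intuition that non-uniformity only matters for elongated silhouettes.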
Step 3: the human contour segmentation model is pre-trained with a box-type weak-label dataset, yielding a preliminary model able to recognize a rectangular box containing the person's position and the person's posture class. During formal training, the pre-trained weights, which have learned contour-related information in advance, are used for transfer learning, and the human contour and posture class are finally recognized accurately;
because the improved polar mask model needs to predict the human-shaped bounding box and the center point in the training process, and the ray numbers of all areas in the image are distributed according to the predicted bounding box and the center point, the model is extremely important for predicting the bounding box and the center point, so that in order to predict the human-shaped outline more quickly and accurately, the invention adopts a transfer learning method based on weak labels, utilizes pre-training weights to learn the relevant information of the human-shaped outline in advance, and is more beneficial to converging the segmentation model to the optimal point, thereby improving the segmentation accuracy of the model. As shown in fig. 4.
Because some pictures contain multiple overlapping figures, directly using true labels would consume a great deal of manpower in label creation and collection. The method therefore uses a Box-type weak-label dataset to pre-train the humanoid contour segmentation model. Because Box labels are mostly weak labels in VOC format, they must be converted into COCO-format labels that the polar mask model can use for preliminary contour segmentation. The specific conversion is as follows: the two points representing an identification target's rectangular box in VOC format, namely the minimum point at the upper-left corner of the box and the maximum point at the lower-right corner, are converted into the four corner points of the "segmentation" polygon in COCO format.
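The conversion described above amounts to expanding the two VOC corner points into a four-point rectangular polygon. A minimal sketch under that reading (function names are illustrative, not from the patent):

```python
def voc_box_to_coco_segmentation(xmin, ymin, xmax, ymax):
    """Convert a VOC-style box (upper-left minimum point, lower-right
    maximum point) into a COCO 'segmentation' polygon: the four corner
    points listed as one flat [x1, y1, ..., x4, y4] ring."""
    return [[xmin, ymin,   # upper-left
             xmax, ymin,   # upper-right
             xmax, ymax,   # lower-right
             xmin, ymax]]  # lower-left

def voc_box_to_coco_bbox(xmin, ymin, xmax, ymax):
    """COCO's 'bbox' field is [x, y, width, height] rather than the
    two-corner VOC encoding."""
    return [xmin, ymin, xmax - xmin, ymax - ymin]
```

A box annotated in VOC as (10, 20)-(110, 220) thus becomes the degenerate rectangular mask [[10, 20, 110, 20, 110, 220, 10, 220]], which PolarMask-style pipelines can ingest like any other COCO polygon.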
According to the method, transfer learning with pre-training weights trained on the Box-type weak-label dataset can effectively improve the segmentation accuracy of the model. Such datasets are plentiful, and a humanoid contour is easier to annotate with a bounding box than with a segmentation-type mask. Obtaining a more accurate rectangular-frame prediction before the true labels are used helps the predicted humanoid contour converge correctly during training, so that the humanoid contour and gesture type are finally identified accurately.
Example II
The second embodiment of the invention provides a device for detecting a humanoid contour and recognizing a gesture using an improved polar mask. The device may be a terminal device or a server, or a terminal device or server that connects with other terminal devices or servers to implement the method of the embodiment of the invention.
The apparatus may include: a processor (e.g., a CPU), a memory, and a data acquisition device; the processor is connected to and controls the data acquisition device. The memory may store various instructions for performing the processing functions and implementing the processing steps described in the method of the previous embodiment.
Example III
The third embodiment of the present invention also provides a computer-readable storage medium having instructions stored therein, which when executed on a computer, cause the computer to perform the processing steps described in the method of the first embodiment.
Example IV
The fourth embodiment of the present invention further provides a chip for executing instructions, where the chip is configured to perform the processing steps described in the method of the foregoing embodiment.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The foregoing description of the embodiments illustrates the general principles of the invention and is not meant to limit the invention to the particular embodiments; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (5)

1. A method for detecting a humanoid contour and recognizing a gesture using an improved polar mask, characterized by comprising the following steps: adopting an improved polar mask model, allocating the number of rays of each region in the bounding box according to the ratio of the length to the width of the humanoid bounding box identified by the polar mask model, based on humanoid contour characteristics, to design the polar coordinate modeling of the humanoid contour; then constructing the improved polar mask model as a humanoid contour segmentation model, in which, before the feature pyramid network performs feature fusion, a channel attention mechanism module is added after each feature of a different scale, and skip connections are added to the original YOLOv7-based feature pyramid network to compensate for the detail information lost in the feature fusion process; finally, pre-training the humanoid contour segmentation model with a weak-label dataset, the pre-training producing a preliminary humanoid contour segmentation model capable of identifying a rectangular frame containing humanoid position information and the human gesture type; in the formal training process, pre-trained weights carrying humanoid-contour-related information are used for transfer learning, so that while the real humanoid contours are learned the predicted humanoid contour converges continuously, and the humanoid contour and gesture type are identified accurately;
the improved polar mask model is constructed as the humanoid contour segmentation model on the basis of the original polar mask model and the YOLOv7 network structure: the FPN structure of the original polar mask model is replaced with the YOLOAT_FPN feature pyramid network, improving the backbone network and the feature pyramid structure of the original polar mask model; the humanoid contour segmentation model consists of an encoder and three decoders;
the encoder employs the YOLOAT_FPN feature pyramid network, which is modified from the backbone network of YOLOv7 as follows:
(1) Replacing the activation function of the original convolution module: the original activation function SiLU is replaced with the nonlinear activation function GELU used in natural language processing;
(2) Adding a channel attention mechanism module before feature fusion is carried out in the feature pyramid; the channel attention mechanism module uses the SENet model, which comprises two stages, compression and excitation: the compression stage compresses the global spatial information, feature learning is then carried out in the channel dimension to form the attention weight of each channel, and finally the attention weights generated in the excitation stage are applied to the corresponding channels, specifically:
first, the compression stage uses global pooling to compress the H×W×C input into a 1×1×C output; then the excitation stage, which comprises two fully connected layers, is performed: the first fully connected layer has C/r neurons, outputs 1×1×(C/r), and uses the ReLU activation function; the second fully connected layer has C neurons, restores the output to 1×1×C, and uses the Sigmoid activation function, where r is the compression ratio of the first fully connected layer; the excitation stage generates the attention weight of each channel by learning the feature information of each channel, and the final 1×1×C channel attention weights are multiplied with the corresponding channels of the original feature map;
(3) The multi-scale feature maps extracted by the backbone network of the original YOLOv7 pass through the channel attention mechanism module to further extract important detail information, and shallow information and deep information are skip-connected through 1×1 convolution kernels to compensate for the detail information lost in feature fusion;
the three decoders refer to three branches processed in parallel, namely a classification branch, a centerness branch and a polar coordinate mask branch; the classification branch extracts features using 4×4 Conv and 1×1 Conv in turn and generates an H×W×N feature map to predict N gestures, realizing the prediction of the classification target type, where H and W respectively represent the length and width of the input feature map and N represents the number of gesture types to be predicted; the centerness branch extracts features using the 4×4 Conv and 1×1 Conv shared with the classification branch and generates an H×W×1 feature map to predict the polar coordinate center point; the polar mask branch extracts features using 4×4 Conv and 1×1 Conv in turn and generates an H×W×60 feature map to predict the distances of the 60 polar coordinate rays.
2. The method for detecting a humanoid contour and recognizing a gesture using an improved polar mask according to claim 1, characterized in that the polar coordinate modeling of the humanoid contour allocates, with the improved polar mask model, the number of rays of each region in the bounding box according to the ratio of the length to the width of the humanoid bounding box identified by the polar mask model, based on humanoid contour characteristics, specifically: the four vertices A, B, C, D of the humanoid bounding box identified by the polar mask model and the human body center O form four regions, the number of rays of each region is allocated according to the ratio of the length to the width of the identified bounding box to design the polar coordinate modeling of the humanoid contour, and the calculation formula is shown in formula (2):
wherein O is the human center point; the four vertices A, B, C, D of the humanoid bounding box and the human center point O form four regions, namely the AOB, BOC, COD and AOD regions; Number AOB represents the number of rays required to construct the body contour in the AOB region; Number COD represents the number of rays required in the COD region; Number AOD represents the number of rays required in the AOD region; Number BOC represents the number of rays required in the BOC region; N represents the total number of rays; y represents the height of the bounding box; x represents the width of the bounding box.
3. A device for detecting a humanoid contour and recognizing a gesture using an improved polar mask, characterized in that: the device includes a processor and a memory; the memory is used for storing a computer program; the processor is configured to execute, according to the computer program, the method for detecting a humanoid contour and recognizing a gesture using an improved polar mask according to any one of claims 1-2.
4. A computer-readable storage medium, characterized in that: the computer-readable storage medium is used for storing a computer program, the computer program being used for executing the method for detecting a humanoid contour and recognizing a gesture using an improved polar mask according to any one of claims 1-2.
5. A chip for executing instructions, characterized in that: the chip is used for executing the method for detecting a humanoid contour and recognizing a gesture using an improved polar mask according to any one of claims 1-2.
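As an illustration of the SENet compression and excitation stages recited in claim 1, the following is a minimal NumPy sketch of the forward pass; the weight arrays w1, b1, w2, b2 stand in for learned parameters and their names are assumptions, not values from the patent:

```python
import numpy as np

def se_block_forward(x, w1, b1, w2, b2):
    """Squeeze-and-excitation forward pass on one feature map.

    x  : (H, W, C) input feature map
    w1 : (C, C//r) weights of the first FC layer (squeeze to C/r)
    w2 : (C//r, C) weights of the second FC layer (restore to C)
    """
    # Compression: global average pooling, H x W x C -> 1 x 1 x C.
    z = x.mean(axis=(0, 1))                       # shape (C,)
    # Excitation, layer 1: FC(C -> C/r) followed by ReLU.
    h = np.maximum(z @ w1 + b1, 0.0)
    # Excitation, layer 2: FC(C/r -> C) followed by Sigmoid,
    # yielding one attention weight in (0, 1) per channel.
    s = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))      # shape (C,)
    # Re-weight: multiply each channel of x by its attention weight
    # (broadcasts over the H and W dimensions).
    return x * s
```

The 1×1×C vector s plays the role of the channel attention weights described in the claim; multiplying it back onto the feature map leaves the spatial layout untouched while emphasizing informative channels.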
CN202311119512.5A 2023-09-01 2023-09-01 Method for detecting human-shaped outline and recognizing gesture by using improved polar mask Active CN116863509B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311119512.5A CN116863509B (en) 2023-09-01 2023-09-01 Method for detecting human-shaped outline and recognizing gesture by using improved polar mask

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311119512.5A CN116863509B (en) 2023-09-01 2023-09-01 Method for detecting human-shaped outline and recognizing gesture by using improved polar mask

Publications (2)

Publication Number Publication Date
CN116863509A CN116863509A (en) 2023-10-10
CN116863509B true CN116863509B (en) 2024-02-20

Family

ID=88219371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311119512.5A Active CN116863509B (en) 2023-09-01 2023-09-01 Method for detecting human-shaped outline and recognizing gesture by using improved polar mask

Country Status (1)

Country Link
CN (1) CN116863509B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223027A (en) * 2021-04-13 2021-08-06 山东师范大学 Immature persimmon segmentation method and system based on PolarMask
CN116188785A (en) * 2023-05-04 2023-05-30 福建环宇通信息科技股份公司 Polar mask old man contour segmentation method using weak labels

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220309275A1 (en) * 2021-03-29 2022-09-29 Hewlett-Packard Development Company, L.P. Extraction of segmentation masks for documents within captured image

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223027A (en) * 2021-04-13 2021-08-06 山东师范大学 Immature persimmon segmentation method and system based on PolarMask
CN116188785A (en) * 2023-05-04 2023-05-30 福建环宇通信息科技股份公司 Polar mask old man contour segmentation method using weak labels

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cell image instance segmentation based on PolarMask using weak labels; Binbin Tong et al.; Computer Methods and Programs in Biomedicine; Vol. 231; pp. 1-10 *

Also Published As

Publication number Publication date
CN116863509A (en) 2023-10-10

Similar Documents

Publication Publication Date Title
Dvornik et al. On the importance of visual context for data augmentation in scene understanding
Wu et al. Object detection based on RGC mask R‐CNN
CN109977918B (en) Target detection positioning optimization method based on unsupervised domain adaptation
Yin et al. FD-SSD: An improved SSD object detection algorithm based on feature fusion and dilated convolution
US9202144B2 (en) Regionlets with shift invariant neural patterns for object detection
CN110738207A (en) character detection method for fusing character area edge information in character image
CN110991444B (en) License plate recognition method and device for complex scene
CN112651438A (en) Multi-class image classification method and device, terminal equipment and storage medium
CN109478239A (en) The method and object detection systems of object in detection image
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN109508675B (en) Pedestrian detection method for complex scene
CN113177560A (en) Universal lightweight deep learning vehicle detection method
CN112036260B (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN114758288A (en) Power distribution network engineering safety control detection method and device
CN111414916A (en) Method and device for extracting and generating text content in image and readable storage medium
CN112528845A (en) Physical circuit diagram identification method based on deep learning and application thereof
CN106874913A (en) A kind of vegetable detection method
CN112906520A (en) Gesture coding-based action recognition method and device
CN117079095A (en) Deep learning-based high-altitude parabolic detection method, system, medium and equipment
CN114764941A (en) Expression recognition method and device and electronic equipment
CN116188785A (en) Polar mask old man contour segmentation method using weak labels
CN116863509B (en) Method for detecting human-shaped outline and recognizing gesture by using improved polar mask
CN114511877A (en) Behavior recognition method and device, storage medium and terminal
CN113610015A (en) Attitude estimation method, device and medium based on end-to-end rapid ladder network
CN113158870A (en) Countermeasure type training method, system and medium for 2D multi-person attitude estimation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231212

Address after: 362000 North China Road, Dongcheng, Fengze District, Quanzhou City, Fujian Province, 269

Applicant after: HUAQIAO University

Applicant after: FUJIAN HUANYUTONG TECHNOLOGY CO.,LTD.

Address before: 362000, 7th Floor, Office Building, Haixi Electronic Information Industry Development Base, Keji Road, High tech Industrial Park (formerly Xunmei Industrial Zone), Fengze District, Quanzhou City, Fujian Province

Applicant before: FUJIAN HUANYUTONG TECHNOLOGY CO.,LTD.

GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Wen Tingxi

Inventor after: Tong Binbin

Inventor after: Hou Qingfei

Inventor after: Chen Yuping

Inventor after: Xie Jianhua

Inventor after: Zeng Huanqiang

Inventor before: Wen Tingxi

Inventor before: Tong Binbin

Inventor before: Hou Qingfei

Inventor before: Chen Yuping

Inventor before: Xie Jianhua

Inventor before: Zeng Huanqiang
