CN112949498B - Target key point detection method based on heterogeneous convolutional neural network - Google Patents
- Publication number: CN112949498B (application CN202110242260.XA)
- Authority: CN (China)
- Prior art keywords: convolution, target, heterogeneous, multiplied, network
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention discloses a target key point detection method based on a heterogeneous convolutional neural network. ResNet-50 is used as the skeleton of the backbone network, and the standard 3×3 convolutions in its bottleneck blocks are replaced with heterogeneous convolutions. An atrous spatial pyramid pooling (ASPP) layer is added after the last layer of the backbone network. Finally, feature pyramid fusion is performed on feature maps with resolutions of 8×8, 16×16, 32×32 and 64×64, a feature map with a resolution of 64×64 and 16 channels is output, a detection heat map is generated using a Gaussian kernel function, and the pose estimation result is output. The model thus combines an atrous spatial pyramid pooling layer with a feature pyramid fusion module, yielding a novel lightweight target key point detection algorithm that can perform multi-target pose estimation on pictures of any size.
Description
Technical Field
The invention belongs to the technical field of computer vision and digital image processing, and particularly relates to a target key point detection method based on a convolutional neural network.
Background
Human body posture estimation is a fundamental problem of human body behavior recognition, and the obtained skeleton structure can provide high-level semantics for human body motion recognition. Human body pose estimation itself has many applications in reality, for example: sports action specification, body correction, virtual reality games, video monitoring, robot motion control, and the like.
Existing human pose estimation methods fall into two categories: bottom-up and top-down. The present method is top-down; existing top-down methods achieve relatively high recall and accuracy. However, pursuing maximum model precision increases the parameter count and the floating point operation count of the model. Human pose estimation is being deployed in many practical applications, some of them on mobile phones and microcomputers, where storage and computing capacity are limited; reducing the parameter count and floating point operation count of the model is therefore an important requirement for improving human pose estimation.
To address the large parameter count and floating point operation count of such models, the method combines standard convolution and group convolution into a heterogeneous convolution, which reduces the parameter count and floating point operation count of the model while preserving accuracy and receptive field.
Disclosure of Invention
The invention discloses a target key point detection algorithm based on a heterogeneous convolutional neural network, characterized by a heterogeneous convolution built from standard convolution and group convolution. The network model of the invention is divided into three parts: a backbone network, an atrous spatial pyramid pooling (ASPP) part, and a feature pyramid module. ResNet-50 is used as the skeleton of the backbone network, and the standard 3×3 convolutions in its bottleneck blocks are replaced with heterogeneous convolutions. An atrous spatial pyramid pooling layer is added after the last layer of the backbone network. Finally, feature pyramid fusion is performed on feature maps with resolutions of 8×8, 16×16, 32×32 and 64×64, a feature map with a resolution of 64×64 and 16 channels is output, a detection heat map is generated using a Gaussian kernel function, and the pose estimation result is output.
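As an illustration of the heat-map step, the sketch below renders one 64×64 heat map by placing a 2D Gaussian at a key point location and then decodes the key point as the position of the maximum. The function names `gaussian_heatmap` and `argmax_2d` and the value `sigma=2.0` are assumptions made for this example; the text specifies only the 64×64 resolution and the use of a Gaussian kernel function.

```python
import math

def gaussian_heatmap(cx, cy, size=64, sigma=2.0):
    """Return a size x size heat map (list of rows) peaking at (cx, cy).

    sigma is an assumed value; the source text does not state it.
    """
    heatmap = []
    for y in range(size):
        row = []
        for x in range(size):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            row.append(math.exp(-d2 / (2.0 * sigma ** 2)))
        heatmap.append(row)
    return heatmap

def argmax_2d(heatmap):
    """Return (x, y) of the largest value, i.e. the decoded key point."""
    best = max(
        ((value, x, y) for y, row in enumerate(heatmap) for x, value in enumerate(row)),
        key=lambda t: t[0],
    )
    return best[1], best[2]

hm = gaussian_heatmap(cx=20, cy=41)
print(argmax_2d(hm))  # (20, 41)
```

A model with 16 output channels would produce one such map per key point; decoding each channel independently gives the 16 key point positions.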
The invention mainly provides a heterogeneous convolution that combines standard convolution and group convolution, and uses an atrous spatial pyramid pooling layer and a feature pyramid fusion module in the model. A novel lightweight target key point detection algorithm is constructed that can perform multi-target pose estimation (human key point detection) on pictures of any size. The method mainly comprises the following steps:
Step 1: input a picture, detect the targets in the picture using a trained Faster R-CNN target detector, obtain the coordinates of the target frames, and store the coordinates in a data structure.
Step 2: widen each target frame detected in step 1 by 20% based on its coordinates, and crop each widened target frame separately.
Step 3: input each single target frame cropped in step 2 into the heterogeneous convolutional neural network to detect the target key points.
Step 3-1: resize the single-target image to a resolution of 256×256.
Step 3-2: extract features using a ResNet50 network with a dilation rate of 2 as the skeleton network, so that the resolution of the feature map is 8×8. Each 3×3 standard convolution with a stride of 1 is replaced by a group convolution with 4 groups, and each 1×1 standard convolution with a stride of 1 is replaced by a group convolution with 16 groups, which reduces the parameter count and floating point operation count of the model while keeping a large receptive field.
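The number of parameters of a standard convolution is:
$$S_{cp} = N \times C \times K \times K$$
The number of parameters of a group convolution is:
$$G_{cp} = \frac{N \times C \times K \times K}{G}$$
The number of parameters of the heterogeneous convolution, $SH_p$, is given by a formula in $K_1$ and $K_2$ that is not preserved in this text.
Here $N$ is the number of channels of the input feature map, $C$ is the number of channels of the output feature map, $G$ is the number of groups of the group convolution, and $K$, $K_1$ and $K_2$ are all convolution kernel sizes. The three counts satisfy:
$$G_{cp} \le SH_p \le S_{cp}$$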
Compared with group convolution, the heterogeneous convolution effectively integrates the channels of the feature map; compared with standard convolution, the heterogeneous convolution improves the receptive field of the model.
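The parameter counts of a standard convolution and a group convolution can be checked numerically. The sketch below is only an illustration: the function names and the channel counts (64 in, 64 out) are hypothetical, bias terms are ignored, and the count for the heterogeneous convolution itself is not computed.

```python
def standard_conv_params(n, c, k):
    """Weights of a standard convolution: N x C x K x K (bias ignored)."""
    return n * c * k * k

def group_conv_params(n, c, k, g):
    """Weights of a group convolution with g groups (bias ignored).

    Each group maps n/g input channels to c/g output channels.
    """
    if n % g or c % g:
        raise ValueError("channel counts must be divisible by the group count")
    return (n // g) * (c // g) * k * k * g

# Hypothetical layer sizes, chosen only for illustration.
n, c = 64, 64
print(standard_conv_params(n, c, 3))   # 36864
print(group_conv_params(n, c, 3, 4))   # 9216
print(group_conv_params(n, c, 1, 16))  # 256
```

With these sizes, the 3×3 group convolution with 4 groups uses one quarter of the weights of the 3×3 standard convolution, which is the reduction by a factor of G expected for a group convolution.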
Step 3-3: pass the 8×8 feature map obtained in step 3-2 through the atrous spatial pyramid pooling layer.
Step 3-4: up-sample the feature map obtained in step 3-3 three times to obtain a feature map with a resolution of 64×64.
Step 3-5: using skip connections, concatenate the 16×16, 32×32 and 64×64 feature maps obtained from the backbone network in step 3-2 with the feature maps of corresponding resolution obtained by up-sampling in step 3-4.
Step 3-6: adjust the number of channels of the 64×64 feature map obtained in step 3-5 to the number of key points in the data set with a 1×1 convolution, so that the coordinates of the corresponding key points are output.
During training, the network is optimized by iterating stochastic gradient descent. The loss function used is the mean square error loss function; the formula itself is not preserved in this text. In it, $m$ is the number of key points, $y_i$ is the annotated ground-truth coordinate of a key point, $\hat{y}_i$ is the coordinate of the key point predicted by the model, $n$ is the number of training samples in each batch, and $i$ is the index of the current key point.
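A mean square error over key point coordinates can be sketched as follows. This is an illustration under stated assumptions, not the formula of the invention: the name `mse_keypoint_loss`, the nested-list data layout, and the choice to average over the n samples, the m key points and both coordinates are all assumptions of this example.

```python
def mse_keypoint_loss(pred, target):
    """Mean squared error over a batch of key point coordinates.

    pred and target have shape [n][m][2]: n samples, m key points, (x, y).
    Averaging over n, m and both coordinates is an assumption of this sketch.
    """
    n = len(target)
    m = len(target[0])
    total = 0.0
    for sample_pred, sample_true in zip(pred, target):
        for (px, py), (tx, ty) in zip(sample_pred, sample_true):
            total += (px - tx) ** 2 + (py - ty) ** 2
    return total / (n * m * 2)

# One sample with two key points; errors are 1 pixel in x and 2 pixels in y.
target = [[(10.0, 12.0), (30.0, 40.0)]]
pred = [[(11.0, 12.0), (30.0, 38.0)]]
print(mse_keypoint_loss(pred, target))  # 1.25
```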
Step 4: map the detected single-target human key points back to the picture from step 1, thereby obtaining the multi-person pose estimation result.
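The geometry of steps 2 and 4 can be sketched as below. The helper names `widen_box` and `to_image_coords`, the equal split of the 20% margin between both sides, and the clipping to the image border are assumptions of this example; the text states only that the target frame is widened by 20%, cropped, resized to 256×256, and that key points are mapped back to the original picture.

```python
def widen_box(box, image_w, image_h, ratio=0.2):
    """Widen (x1, y1, x2, y2) by `ratio` of its size, clipped to the image.

    Splitting the extra margin equally between both sides is an assumption.
    """
    x1, y1, x2, y2 = box
    dw = (x2 - x1) * ratio / 2.0
    dh = (y2 - y1) * ratio / 2.0
    return (max(0.0, x1 - dw), max(0.0, y1 - dh),
            min(float(image_w), x2 + dw), min(float(image_h), y2 + dh))

def to_image_coords(keypoints, box, input_size=256):
    """Map key points from the resized crop back to the original picture."""
    x1, y1, x2, y2 = box
    sx = (x2 - x1) / input_size
    sy = (y2 - y1) / input_size
    return [(x1 + kx * sx, y1 + ky * sy) for kx, ky in keypoints]

# Hypothetical detection in a 640x480 picture.
box = widen_box((100, 50, 200, 250), image_w=640, image_h=480)
print(box)
# The centre of the 256x256 crop maps to the centre of the widened box.
print(to_image_coords([(128, 128)], box))
```

Repeating `to_image_coords` for every detected target frame places all single-target key points in the coordinate system of the input picture.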
The invention provides a heterogeneous convolution based on standard convolution and group convolution. Compared with standard convolution, it significantly reduces the floating point operation count while keeping a receptive field of the same size. Compared with group convolution, it effectively integrates the convolved channels and improves the accuracy of the model. The proposed method is verified on the MPII dataset. Experimental results show that the model whose backbone network uses heterogeneous convolution in place of standard convolution, together with the atrous spatial pyramid pooling layer and the feature pyramid fusion module, improves precision by 1.2% and reduces the floating point operation count by 72.18% compared with the original ResNet-50 method.
Drawings
FIG. 1 is a diagram of the heterogeneous convolutional neural network model based on ResNet50.
FIG. 2 is a structural diagram of group convolution, standard convolution and heterogeneous convolution.
FIG. 3 shows human pose estimation detection results.
Detailed Description
The invention is compared with other algorithms in the following example.
We train the model on the training set of the MPII dataset and use the MPII validation set to test the validity of the algorithm. The experimental environment is Ubuntu 18.04.3 LTS with an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10 GHz × 32, 64 GB of memory and an RTX 2080 Ti graphics card; the software platform is CUDA 10.0.130, cuDNN 7.5, PyTorch 1.4 and Python 3.6.
During training, the batch size is set to 64 and the image resolution to 256×256. The initial learning rate is 0.001 and is decreased by 10% at the 170th and 200th epochs; training runs for a total of 210 epochs.
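The step schedule described above can be written as a small function. This is a sketch: the name `learning_rate` is hypothetical, and `factor=0.9` reads "decreased by 10%" literally, which is an assumption. A tenfold drop (`factor=0.1`) is another possible reading, so the factor is left as a parameter.

```python
def learning_rate(epoch, base_lr=0.001, milestones=(170, 200), factor=0.9):
    """Step schedule: multiply the rate by `factor` at each milestone epoch.

    factor=0.9 reads "decreased by 10%" literally; this is an assumption,
    and a tenfold drop (factor=0.1) is another possible reading.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops

# Print the rate before and after each milestone of the 210-epoch run.
for epoch in (0, 169, 170, 200, 209):
    print(epoch, learning_rate(epoch))
```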
To verify the accuracy and efficiency of the improved algorithm, we compare the model against pose estimation networks built on ResNet18 and ResNet50. The experimental results show that the method reduces the parameter count and floating point operation count of the model while maintaining accuracy. The experimental results are shown in Table 1.
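Table 1. Comparison of results on the MPII dataset
The evaluation metric is PCKh@0.5, in which the threshold coefficient is a constant and the reference length is 60% of the head diagonal in the ground truth. The table body and the defining formula are not preserved in this text.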
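For readers unfamiliar with the metric, PCKh can be sketched as the fraction of predicted key points that lie within a threshold distance of the ground truth. The function name `pckh`, its argument layout and the sample coordinates are assumptions of this example; the 0.5 coefficient and the 60% head-diagonal reference length follow the description above.

```python
import math

def pckh(pred, target, head_box, alpha=0.5, head_scale=0.6):
    """Fraction of key points within alpha * head size of the ground truth.

    head size = head_scale * diagonal of the ground-truth head box
    (x1, y1, x2, y2). Both defaults follow the description of PCKh@0.5.
    """
    x1, y1, x2, y2 = head_box
    head_size = head_scale * math.hypot(x2 - x1, y2 - y1)
    threshold = alpha * head_size
    hits = sum(
        1 for (px, py), (tx, ty) in zip(pred, target)
        if math.hypot(px - tx, py - ty) <= threshold
    )
    return hits / len(target)

# Hypothetical sample: head diagonal 50, so the threshold is 15 pixels.
target = [(50.0, 50.0), (80.0, 120.0)]
pred = [(53.0, 54.0), (80.0, 150.0)]  # errors of 5 and 30 pixels
print(pckh(pred, target, head_box=(40, 10, 70, 50)))  # 0.5
```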
Claims (2)
1. A target key point detection method based on a heterogeneous convolutional neural network, characterized by comprising the following steps:
step 1: inputting a picture, detecting the targets in the picture using a trained Faster R-CNN target detector, obtaining the coordinates of the target frames, and storing the coordinates in a data structure;
step 2: widening each target frame by 20% based on the coordinates obtained by the target detector in step 1, and cropping each widened target frame separately;
step 3: inputting each single target frame cropped in step 2 into the heterogeneous convolutional neural network to detect the target key points;
step 4: mapping the detected single-target human key points back to the picture in step 1, thereby obtaining the multi-person pose estimation result;
wherein step 3 specifically comprises the following steps: step 3-1: resizing the single-target image to a resolution of 256×256;
step 3-2: extracting features using a ResNet50 network with a dilation rate of 2 as the skeleton network, so that the resolution of the feature map is 8×8, wherein each 3×3 standard convolution with a stride of 1 is replaced by a group convolution with 4 groups, and each 1×1 standard convolution with a stride of 1 is replaced by a group convolution with 16 groups;
the number of parameters of a standard convolution is:
$$S_{cp} = N \times C \times K \times K$$
the number of parameters of a group convolution is:
$$G_{cp} = \frac{N \times C \times K \times K}{G}$$
the number of parameters of the heterogeneous convolution, $SH_p$, is given by a formula in $K_1$ and $K_2$ that is not preserved in this text;
wherein $N$ is the number of channels of the input feature map, $C$ is the number of channels of the output feature map, $G$ is the number of groups of the group convolution, and $K$, $K_1$ and $K_2$ are all convolution kernel sizes;
$$G_{cp} \le SH_p \le S_{cp}$$
compared with group convolution, the heterogeneous convolution effectively integrates the channels of the feature map, and compared with standard convolution, the heterogeneous convolution improves the receptive field of the model;
step 3-3: passing the 8×8 feature map obtained in step 3-2 through the atrous spatial pyramid pooling layer;
step 3-4: up-sampling the feature map obtained in step 3-3 three times to obtain a feature map with a resolution of 64×64;
step 3-5: using skip connections, concatenating the 16×16, 32×32 and 64×64 feature maps obtained from the backbone network in step 3-2 with the feature maps of corresponding resolution obtained by up-sampling in step 3-4;
step 3-6: adjusting the number of channels of the 64×64 feature map obtained in step 3-5 to the number of key points in the data set with a 1×1 convolution, so that the coordinates of the corresponding key points are output.
2. The target key point detection method based on the heterogeneous convolutional neural network according to claim 1, characterized in that: during training, the network is optimized by iterating stochastic gradient descent; the loss function used is the mean square error loss function, the formula of which is not preserved in this text;
wherein $m$ is the number of key points, $y_i$ is the annotated ground-truth coordinate of a key point, $\hat{y}_i$ is the coordinate of the key point predicted by the model, $n$ is the number of training samples in each batch, and $i$ is the index of the current key point.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110242260.XA CN112949498B (en) | 2021-03-04 | 2021-03-04 | Target key point detection method based on heterogeneous convolutional neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110242260.XA CN112949498B (en) | 2021-03-04 | 2021-03-04 | Target key point detection method based on heterogeneous convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112949498A (en) | 2021-06-11 |
CN112949498B (en) | 2023-11-14 |
Family
ID=76247778
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110242260.XA Active CN112949498B (en) | 2021-03-04 | 2021-03-04 | Target key point detection method based on heterogeneous convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112949498B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113762137A (en) * | 2021-09-02 | 2021-12-07 | 甘肃同兴智能科技发展有限责任公司 | Remote sensing image forest region carbon sink calculation method based on deep learning |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111144209A (en) * | 2019-11-25 | 2020-05-12 | 浙江工商大学 | Monitoring video head detection method based on heterogeneous multi-branch deep convolutional neural network |
CN111160085A (en) * | 2019-11-19 | 2020-05-15 | 天津中科智能识别产业技术研究院有限公司 | Human body image key point posture estimation method |
CN111402226A (en) * | 2020-03-13 | 2020-07-10 | 浙江工业大学 | Surface defect detection method based on cascade convolution neural network |
CN111738237A (en) * | 2020-04-29 | 2020-10-02 | 上海海事大学 | Target detection method of multi-core iteration RPN based on heterogeneous convolution |
CN111738111A (en) * | 2020-06-10 | 2020-10-02 | 杭州电子科技大学 | Road extraction method of high-resolution remote sensing image based on multi-branch cascade void space pyramid |
CN112184752A (en) * | 2020-09-08 | 2021-01-05 | 北京工业大学 | Video target tracking method based on pyramid convolution |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110321999B (en) * | 2018-03-30 | 2021-10-01 | 赛灵思电子科技(北京)有限公司 | Neural network computational graph optimization method |
CN108875904A (en) * | 2018-04-04 | 2018-11-23 | 北京迈格威科技有限公司 | Image processing method, image processing apparatus and computer readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112949498A (en) | 2021-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11908244B2 (en) | Human posture detection utilizing posture reference maps | |
CN114186632B (en) | Method, device, equipment and storage medium for training key point detection model | |
CN110765865B (en) | Underwater target detection method based on improved YOLO algorithm | |
CN112381061B (en) | Facial expression recognition method and system | |
CN113869282B (en) | Face recognition method, hyper-resolution model training method and related equipment | |
CN113837275B (en) | Improved YOLOv3 target detection method based on expanded coordinate attention | |
CN113011253B (en) | Facial expression recognition method, device, equipment and storage medium based on ResNeXt network | |
CN112966574A (en) | Human body three-dimensional key point prediction method and device and electronic equipment | |
CN113239825B (en) | High-precision tobacco beetle detection method in complex scene | |
CN113052185A (en) | Small sample target detection method based on fast R-CNN | |
CN112287802A (en) | Face image detection method, system, storage medium and equipment | |
CN111626379B (en) | X-ray image detection method for pneumonia | |
CN111325190A (en) | Expression recognition method and device, computer equipment and readable storage medium | |
CN111652054A (en) | Joint point detection method, posture recognition method and device | |
CN111274999A (en) | Data processing method, image processing method, device and electronic equipment | |
CN112949498B (en) | Target key point detection method based on heterogeneous convolutional neural network | |
CN111507184B (en) | Human body posture detection method based on parallel cavity convolution and body structure constraint | |
CN116052025A (en) | Unmanned aerial vehicle video image small target tracking method based on twin network | |
CN112364974A (en) | Improved YOLOv3 algorithm based on activation function | |
CN115410030A (en) | Target detection method, target detection device, computer equipment and storage medium | |
CN113487610B (en) | Herpes image recognition method and device, computer equipment and storage medium | |
CN111582057B (en) | Face verification method based on local receptive field | |
CN113221855A (en) | Small target detection method and system based on scale sensitive loss and feature fusion | |
CN111274985A (en) | Video text recognition network model, video text recognition device and electronic equipment | |
CN115761220A (en) | Target detection method for enhancing detection of occluded target based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||