CN112270213A - Improved HRnet based on attention mechanism

Info

Publication number: CN112270213A
Application number: CN202011084171.9A
Authority: CN
Inventors: 王聪; 乔元风; 蒋伟; 柯钦瑜; 黄勇; 李紫薇
Original assignee: Xuanwei Beijing Biotechnology Co ltd
Current assignee: Xuanwei Beijing Biotechnology Co ltd
Priority date: 2020-10-12
Filing date: 2020-10-12
Publication date: 2021-01-26

Abstract

An improved HRnet model based on an attention mechanism, characterized in that: when F is the input feature map, an attention mechanism module is added, and the attention mechanism module performs the following 2 operations: attention extraction on the channel dimension, and attention extraction on the spatial dimension.

The invention adopting this technical scheme has the following beneficial effects: an attention mechanism model is added to the original HRnet model, so that the improved HRnet can be used to detect human posture during cardiopulmonary resuscitation compressions and provides an accurate backbone network for instance segmentation models (for example, of the manikin chest and head) in cardiopulmonary resuscitation medical assessment, improving the detection accuracy of the model.

Description

Improved HRnet based on attention mechanism

Technical Field

The invention relates to an improved algorithm, in particular to an improved HRnet model based on an attention mechanism.

Background

Sudden cardiac arrest seriously threatens human life and health. Performing high-quality cardiopulmonary resuscitation (CPR) can significantly improve patient survival and is an important means of saving patients' lives. The American Heart Association (AHA) and the International Liaison Committee on Resuscitation (ILCOR) regard high-quality cardiopulmonary resuscitation as the core of resuscitation [1]. At present, conventional CPR training and assessment relies on a medical manikin, with an examiner making the judgment. This approach has several disadvantages: the examiner's judgment is subjective rather than objective; during assessment, the examinee's specific compression depth, frequency and the like depend on the quality of the manikin and are difficult for the examiner to judge; and during training, trainees need constant supervision and cooperation in order to correct and improve their own operation, which consumes a large amount of labor for training and examination.

In the prior art, because compression is a dynamic process, a captured image of the examinee performing compressions is not sufficient to judge whether the compression posture is qualified, which makes automatic judgment difficult.

Meanwhile, when image features are extracted, different segmentation models are needed depending on the actual conditions. For each model, because the volume of image data is large, the accuracy of the model must be ensured in order to achieve good human posture recognition. How to improve model accuracy is therefore an urgent problem to be solved.

Disclosure of Invention

The technical problem to be solved by the invention is how to improve the accuracy of the model; to this end, an improved HRnet model based on an attention mechanism is provided.

In order to solve the technical problems, the invention adopts the following technical scheme:

an improved HRnet model based on an attention mechanism, characterized in that: when F is used as the input feature map, an attention module (attention block) is added, and the attention block performs the following 2 operations:

an attention extraction operation on the channel dimension, that is, establishing a channel attention mechanism model,

an attention extraction operation on the spatial dimension, that is, establishing a spatial attention mechanism model.

The channel attention mechanism model is as follows: the original feature map X_in is passed through convolution operations with kernel sizes of 3×3 and 5×5 respectively to obtain a feature map U and a feature map V, which are then added to obtain a feature map F. The feature map F fuses information from multiple receptive fields and has the shape [C, H, W], where C denotes the channels, H the height and W the width. The average and the maximum are then taken along the H and W dimensions; these two pooling functions yield two one-dimensional vectors in total. The two one-dimensional vectors are added element-wise, so that the resulting channel information is a 1 × 1 × C one-dimensional vector representing the importance of the information in each channel. A linear transformation is applied to the 1 × 1 × C vector to map the original C dimensions to Z dimensions; 2 separate linear transformations are then applied to the Z-dimensional vector, each mapping the Z dimensions back to the original C dimensions, which completes the information extraction for the channel dimension. Softmax is then used for normalization, so that each channel corresponds to a score representing its importance, which is equivalent to a mask. The 2 masks obtained are multiplied by the corresponding feature maps U and V to obtain feature maps U' and V'; U' and V' are then added for information fusion to obtain the final output X_out.

The spatial attention mechanism model is as follows: the original feature map X_in is passed through a Pooling Feature stage comprising 3 pooling layers, namely average pooling, max pooling and strip pooling; a 1×1 convolution is applied to the pooled features to reduce the channel dimension and obtain a feature map with 1 channel; this feature map is passed through a Sigmoid function and multiplied element-wise with the input original feature map X_in to obtain the output X_out.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a diagram of the original HRnet model.

FIG. 2 is a diagram of the improved HRnet model according to the present invention.

FIG. 3 is a schematic diagram of an embodiment of the channel attention mechanism of the present invention.

FIG. 4 is a model diagram of the spatial attention mechanism.

FIG. 5 is an overall structure diagram of the improved HRnet.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same technical meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be further understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

In the present invention, terms such as "fixedly connected", "connected", and the like are to be understood in a broad sense, and mean either a fixed connection or an integrally connected or detachable connection; may be directly connected or indirectly connected through an intermediate. The specific meanings of the above terms in the present invention can be determined according to specific situations by persons skilled in the relevant scientific or technical field, and are not to be construed as limiting the present invention.

HRNet is used to detect human posture during cardiopulmonary resuscitation compressions and serves as the backbone network of instance segmentation models for the manikin chest, head and the like in cardiopulmonary resuscitation medical assessment. HRNet is optimized and improved here in order to raise the accuracy of these models.

As shown in FIG. 1, the original HRnet has 4 stages, and the 2nd, 3rd and 4th stages each consist of repeated modularized multi-resolution blocks. Before each multi-resolution block there is a transition layer, in which an additional feature map appears; no additional feature map appears within the multi-resolution block itself (multi-resolution group convolution + multi-resolution convolution). The invention improves and optimizes HRnet to raise the detection accuracy: an attention block is added in the convolution process from the multi-resolution group convolution to the multi-resolution convolution, so as to improve the feature expression capability of the network model. Attention not only tells the network model what to focus on, but also enhances the representation of specific regions. The structure is shown in FIG. 2, and the overall framework follows CBAM: Convolutional Block Attention Module.

In FIG. 2, an attention mechanism is introduced in both the channel and spatial dimensions. When F is used as the input feature map, an attention block (attention mechanism module) is added, and the attention block performs the following 2 operations:

the output is F',

an attention extraction operation on the channel dimension, that is, establishing a channel attention mechanism model,

an attention extraction operation on the spatial dimension, that is, establishing a spatial attention mechanism model.
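The two operations can be written compactly in the notation of CBAM, which the description names as the reference for the overall framework. This is only a sketch under that assumption: the symbols M_c (channel attention), M_s (spatial attention) and F'' are borrowed from CBAM and do not appear in the text above, and the channel-then-spatial ordering is CBAM's ordering rather than something stated explicitly here.

```latex
\begin{aligned}
F'  &= M_c(F) \otimes F, \\
F'' &= M_s(F') \otimes F',
\end{aligned}
```

Here \otimes denotes element-wise multiplication, with the attention values broadcast over the dimensions they do not cover.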

Specifically, as shown in FIG. 3, in the channel attention mechanism model the original feature map X_in is passed through convolution operations with kernel sizes of 3×3 and 5×5 respectively to obtain a feature map U and a feature map V, which are then added to obtain a feature map F. The feature map F fuses information from multiple receptive fields and has the shape [C, H, W], where C denotes the channels, H the height and W the width. The average and the maximum are then taken along the H and W dimensions; these two pooling functions yield two one-dimensional vectors in total. Global average pooling propagates feedback to every pixel of the feature map F, whereas global max pooling, during gradient backpropagation, yields a gradient only at the location of the maximum response in the feature map F, and can therefore serve as a supplement to global average pooling. The two vectors are then added element-wise, so that the resulting channel information is a 1 × 1 × C one-dimensional vector representing the importance of the information in each channel.
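The pooling step and the gradient behaviour described above can be illustrated with a minimal NumPy sketch. The function names `channel_descriptor` and `pooling_gradients` are hypothetical and chosen only for this illustration; the gradients are written out by hand for the sum of each pooled vector, which is an assumption made to keep the example free of any autodiff library.

```python
import numpy as np

def channel_descriptor(F):
    """Global average pooling plus global max pooling over H and W.

    F has shape [C, H, W]; the result has shape [C] (the 1 x 1 x C vector).
    """
    avg = F.mean(axis=(1, 2))   # one average per channel
    mx = F.max(axis=(1, 2))     # one maximum per channel
    return avg + mx             # element-wise addition of the two vectors

def pooling_gradients(F):
    """Gradient of sum(avg) and of sum(max) with respect to F.

    Average pooling spreads 1/(H*W) over every pixel; max pooling puts
    a gradient of 1 only at the (first) maximum of each channel.
    """
    C, H, W = F.shape
    grad_avg = np.full((C, H, W), 1.0 / (H * W))
    grad_max = np.zeros((C, H * W))
    grad_max[np.arange(C), F.reshape(C, -1).argmax(axis=1)] = 1.0
    return grad_avg, grad_max.reshape(C, H, W)
```

Every entry of `grad_avg` is non-zero, while `grad_max` has exactly one non-zero entry per channel, which is why the two pooling functions complement each other.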

Next, a linear transformation maps the original C dimensions to Z dimensions, and 2 separate linear transformations then map the Z dimensions back to the original C dimensions, which completes the information extraction for the channel dimension. Softmax is then used for normalization, so that each channel corresponds to a score representing its importance, which is equivalent to a mask. The 2 masks obtained are multiplied by the corresponding feature maps U and V to obtain feature maps U' and V'. U' and V' are then added for information fusion to obtain the final output X_out. Compared with the original feature map X_in, X_out has undergone information extraction and fuses information from multiple receptive fields.
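The channel attention steps can be sketched end to end in NumPy as follows. This is an illustrative sketch, not the patented implementation: the function names, the weight arguments (`w3`, `w5`, `w_reduce`, `w_u`, `w_v`) and the omission of biases and activations are all assumptions. In particular, the text does not say along which axis Softmax is applied; the sketch assumes it is applied across the two branches for each channel, so that the two masks sum to 1 per channel.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive zero-padded 'same' convolution (cross-correlation).

    x: [C_in, H, W]; w: [C_out, C_in, k, k] with odd k.
    """
    _, _, k, _ = w.shape
    p = k // 2
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((w.shape[0], H, W))
    for i in range(k):
        for j in range(k):
            # contribution of kernel tap (i, j) to every output pixel
            out += np.einsum("oc,chw->ohw", w[:, :, i, j],
                             xp[:, i:i + H, j:j + W])
    return out

def channel_attention(x_in, w3, w5, w_reduce, w_u, w_v):
    """x_in: [C, H, W]; w_reduce: [Z, C]; w_u, w_v: [C, Z]."""
    U = conv2d_same(x_in, w3)                      # 3x3 branch
    V = conv2d_same(x_in, w5)                      # 5x5 branch
    F = U + V                                      # fused feature map
    s = F.mean(axis=(1, 2)) + F.max(axis=(1, 2))   # 1 x 1 x C descriptor
    z = w_reduce @ s                               # C -> Z
    logits = np.stack([w_u @ z, w_v @ z])          # two Z -> C maps, [2, C]
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(logits)
    masks = e / e.sum(axis=0, keepdims=True)       # softmax across branches (assumed)
    U_prime = masks[0][:, None, None] * U
    V_prime = masks[1][:, None, None] * V
    return U_prime + V_prime, masks                # X_out and the two masks
```

With identical `w_u` and `w_v` both masks equal 0.5 and X_out reduces to the plain average of U and V; once the two transformations differ, each channel weights the 3×3 and 5×5 receptive fields differently.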

Considering the long-distance correlation between human joint points, the spatial attention mechanism model needs to capture long-range context information effectively. The overall attention mechanism model is shown in FIG. 4:

The original feature map X_in is passed through the Pooling Feature stage, which comprises 3 pooling layers, namely average pooling, max pooling and strip pooling. Strip pooling follows the paper "Strip Pooling: Rethinking Spatial Pooling for Scene Parsing" and mainly addresses long-distance correlation between targets. A 1×1 convolution is applied to the Pooling Feature output to reduce the channel dimension and obtain a feature map with 1 channel; this feature map is passed through a Sigmoid function and multiplied element-wise with the input original feature map X_in to obtain the output X_out.
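A minimal NumPy sketch of this spatial attention path is given below. Several details are assumptions, because the text does not specify them: average and max pooling are taken along the channel axis (as in CBAM), strip pooling is reduced to row and column averages broadcast back to [C, H, W] (the cited paper additionally uses 1D convolutions), the three results are concatenated along the channel axis, and the 1×1 convolution is a weight vector `w` of length C + 2. The names `strip_pool` and `spatial_attention` are hypothetical.

```python
import numpy as np

def strip_pool(x):
    """Simplified strip pooling: average each row and each column,
    then broadcast both strips back to [C, H, W] and add them."""
    row = x.mean(axis=2, keepdims=True)   # [C, H, 1], horizontal strips
    col = x.mean(axis=1, keepdims=True)   # [C, 1, W], vertical strips
    return row + col                      # broadcasts to [C, H, W]

def spatial_attention(x_in, w, b=0.0):
    """x_in: [C, H, W]; w: [C + 2] weights of the 1x1 convolution."""
    avg = x_in.mean(axis=0, keepdims=True)    # [1, H, W]
    mx = x_in.max(axis=0, keepdims=True)      # [1, H, W]
    sp = strip_pool(x_in)                     # [C, H, W]
    pooled = np.concatenate([avg, mx, sp], axis=0)   # [C + 2, H, W]
    logits = np.tensordot(w, pooled, axes=1) + b     # 1x1 conv -> [H, W]
    mask = 1.0 / (1.0 + np.exp(-logits))             # Sigmoid
    return x_in * mask[None], mask            # X_out and the spatial mask
```

Because each strip average aggregates a whole row or column, the mask value at one pixel depends on pixels that are far away along that row or column, which is the long-range context the description asks the spatial branch to capture.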

The improved HRNet overall structure is shown in fig. 5:

The Channel maps and the Attention Block are connected directly, without the Upsample and Strided conv modules.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Although the embodiments of the present invention have been described with reference to the accompanying drawings, this is not intended to limit the scope of the present invention, and those skilled in the art should understand that various modifications and variations can be made without inventive effort on the basis of the technical solution of the present invention.

Claims

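1. An improved HRnet based on attention mechanism, characterized in that: when F is used as the input feature map, an attention mechanism module is added, and the attention mechanism module performs the following 2 operations: attention extraction on the channel dimension, that is, establishing a channel attention mechanism model; and attention extraction on the spatial dimension, that is, establishing a spatial attention mechanism model.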

2. The improved HRnet based on attention mechanism according to claim 1, wherein the channel attention mechanism model is as follows: the original feature map X_in is passed through convolution operations with kernel sizes of 3×3 and 5×5 respectively to obtain a feature map U and a feature map V, which are then added to obtain a feature map F, wherein the feature map F fuses information from multiple receptive fields and has the shape [C, H, W], C denoting the channels, H the height and W the width; the average and the maximum are then taken along the H and W dimensions, these two pooling functions yielding two one-dimensional vectors in total; the two one-dimensional vectors are added element-wise, so that the resulting channel information is a 1 × 1 × C one-dimensional vector representing the importance of the information in each channel; a linear transformation is applied to the 1 × 1 × C vector to map the original C dimensions to Z dimensions, and 2 separate linear transformations are then applied to the Z-dimensional vector, each mapping the Z dimensions back to the original C dimensions, thereby completing the information extraction for the channel dimension; Softmax is then used for normalization, so that each channel corresponds to a score representing its importance, which is equivalent to a mask; the 2 masks obtained are multiplied by the corresponding feature maps U and V to obtain feature maps U' and V'; and U' and V' are then added for information fusion to obtain the final output X_out.

3. The improved HRnet based on attention mechanism according to claim 1, wherein the spatial attention mechanism model is as follows: the original feature map X_in is passed through a Pooling Feature stage comprising 3 pooling layers, namely average pooling, max pooling and strip pooling; a 1×1 convolution is applied to the pooled features to reduce the channel dimension and obtain a feature map with 1 channel; and this feature map is passed through a Sigmoid function and multiplied element-wise with the input original feature map X_in to obtain the output X_out.