CN114419732A - HRNet human body posture identification method based on attention mechanism optimization - Google Patents

HRNet human body posture identification method based on attention mechanism optimization

Info

Publication number
CN114419732A
Authority
CN
China
Prior art keywords
resolution
feature map
channel
feature
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210036567.9A
Other languages
Chinese (zh)
Inventor
杨金龙
冯雨
刘佳
张媛
刘建军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202210036567.9A
Publication of CN114419732A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Abstract

The invention discloses an HRNet human body posture recognition method based on attention mechanism optimization, belonging to the fields of human body posture recognition and deep learning. First, dilated convolution is added during cross-channel fusion of feature maps of different resolutions, so that the receptive field is enlarged without changing the size of the low-resolution feature map and without generating additional parameters or computation, ensuring that important information is not ignored during decision-making. Second, a new feature fusion strategy is proposed: a channel attention mechanism weights and fuses the feature maps of different resolutions, adaptively recalibrating channel-wise feature responses to enhance meaningful features and suppress weak ones, which accelerates convergence, optimizes human posture recognition performance, and further improves detection accuracy.

Description

HRNet human body posture identification method based on attention mechanism optimization
Technical Field
The invention relates to an attention mechanism optimization-based HRNet human body posture recognition method, and belongs to the field of human body posture recognition and deep learning processing.
Background
Human body posture estimation is an important research topic in computer vision. At present, the pose detection method based on Part Affinity Fields (PAF) can infer human posture by combining keypoints with a graph structure of the human body. The algorithm first decomposes the human structure into several nodes, models the relationships between the nodes with a human structure model, and finally partitions and connects the nodes into a whole, forming a complete human posture. However, with complex backgrounds and highly flexible human postures, the accuracy and efficiency of the graph structure model on similar actions drop rapidly, making it difficult to reach the level of practical application.
In recent years, many methods for learning and recognizing human behavior directly from image measurements have been explored in depth, but they still face a series of challenges, including foreground occlusion, background clutter, illumination, posture complexity, multi-person overlap, and high computational complexity. In 2013, Cordelia et al. matched features through descriptors and dense optical flow, computing the optical flow of every pixel in the image, which requires a large amount of computation and is difficult to apply to real-time detection. In 2016, Wang et al. classified human postures by improving the bag of visual words and fusion methods, but for similar actions could not capture the overall characteristics of the human body. Over the development of human posture estimation, the introduction of deep convolutional neural networks has achieved considerable results: Hiroaki et al. reached 96.1% accuracy on UCF50 by means of Linear Dynamic Systems (LDSs) in 2020, and in 2021 many deep convolutional networks were applied to detect human postures. Most existing methods preprocess the video frames before inputting them into a network composed of high-resolution to low-resolution subnetworks connected in series, and then recover the resolution. For example, the hourglass network restores high resolution through a symmetrical low-to-high process; SimpleBaseline uses several transposed convolutional layers to generate high-resolution representations. Deep learning has been a highly active research direction in machine learning in recent years, and many important results have been successfully applied to various image processing tasks. Vision-based intelligent human action recognition is among the most challenging directions in computer vision: it recognizes a person's actions in a video by detecting the person in the video sequence, extracting action features, and learning them. It is therefore important to study human action recognition methods based on deep learning.
In existing HRNet-based human body posture recognition schemes, feature extraction is realized by stacking convolutional and pooling layers. After the image is input to the network, the convolutional layers extract features and pooling aggregates them, giving the model a certain translation invariance and reducing the computation required by subsequent convolutional layers; finally, a fully connected layer outputs the classification result. However, stacking leads to an ever-increasing number of parameters and calculations, raising the computational cost. Prediction is usually made from the feature map output by the last layer, and the number of original-image pixels mapped to a single point on the feature map determines the upper limit of the object size the network can detect; the receptive field is therefore enlarged through down-sampling, but a consequence of down-sampling is that small targets are hard to detect. Existing human posture recognition schemes based on the HRNet network therefore suffer from problems such as high computational cost and low accuracy.
Disclosure of Invention
In order to solve the problems of high calculation cost and low precision of the existing human body posture recognition method, the invention provides an attention mechanism optimization-based HRNet human body posture recognition method.
The first object of the invention is to provide a human body posture recognition system based on an optimized HRNet network, characterized in that it comprises: a video stream acquisition module, an optimized HRNet network module, and a classification result output module, connected in sequence;
the optimized HRNet network module comprises: the system comprises a basic HRNet module, an expansion convolution module and an attention mechanism module; the expansion convolution module is positioned between the subnets with different resolutions in the basic HRNet module and is used for performing expansion convolution in the up-sampling or down-sampling process between the subnets with different resolutions so as to increase the receptive field; the attention mechanism module is used for carrying out weighted fusion on the feature maps with different resolutions.
A second object of the present invention is to provide a human body posture recognition method based on an optimized HRNet network, wherein the method is implemented based on the human body posture recognition system of claim 1, and the method comprises:
the method comprises the following steps: acquiring a video stream;
step two: the optimized HRNet network module acquires a human body posture picture to be recognized in a video stream, and inputs the picture into a channel with the original resolution to obtain a high-resolution subnet characteristic graph, namely an output characteristic graph in the first stage;
step three: down-sampling the high-resolution feature map to obtain a low-resolution subnet feature map with the resolution being 1/2 times of the original resolution, and increasing the receptive field by adopting expansion convolution in the sampling process;
step four: introducing an attention mechanism to perform cross-resolution feature fusion on the subnet feature maps with different resolutions, shrinking the space dimension after performing global average pooling on the feature map of each channel to obtain a corresponding weight value, and obtaining a weighted feature map, namely an output feature map of the second stage;
step five: taking the weighted feature map obtained in the fourth step as the input of the next stage, and repeating the second step to the fourth step until the output feature map of the fourth stage is output;
step six: and obtaining a final human body posture recognition result according to the output characteristic diagram of the fourth stage.
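For orientation only, the six steps can be read as the following control flow. This is a hypothetical sketch: the module and attribute names (stage1 through stage4, classifier) are illustrative assumptions, not the patent's code.

```python
# Hypothetical sketch of steps one to six; all names are illustrative
# placeholders, not the patent's implementation.
import torch

def recognize_posture(frame: torch.Tensor, model) -> int:
    """frame: a picture taken from the acquired video stream (step one)."""
    x = frame.unsqueeze(0)                    # step two: input at original resolution
    feats = model.stage1(x)                   # high-resolution subnet feature map
    for stage in (model.stage2, model.stage3, model.stage4):
        # steps three and four: dilated down-sampling plus attention-weighted
        # cross-resolution fusion, repeated stage by stage (step five)
        feats = stage(feats)
    return model.classifier(feats).argmax(1).item()  # step six: recognition result
```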
Optionally, the relation between the input and output feature-map sizes of the dilated convolution is:

$$W_2 = \left\lfloor \frac{W_1 + 2p - d(k-1) - 1}{s} \right\rfloor + 1$$

where W_1 is the input feature-map size, W_2 is the output feature-map size, d is the dilation rate, p is the pixel padding of the convolution, k is the convolution kernel size, and s is the stride.
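As an illustration, this size relation can be checked with a small helper function (a sketch assuming square feature maps and standard floor-division convolution arithmetic):

```python
def dilated_conv_output_size(w_in: int, k: int, s: int = 1, p: int = 0, d: int = 1) -> int:
    """Output spatial size of a dilated convolution.

    w_in: input feature-map size, k: kernel size, s: stride,
    p: padding, d: dilation rate. The effective kernel is d*(k-1)+1,
    so with p = d*(k-1)//2 and s = 1 the spatial size is preserved.
    """
    return (w_in + 2 * p - d * (k - 1) - 1) // s + 1

# e.g. a 3x3 kernel with dilation 2 and padding 2 keeps a 64-pixel map at 64:
assert dilated_conv_output_size(64, k=3, s=1, p=2, d=2) == 64
```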
Optionally, the calculation process of the step four includes:
the feature maps obtained by the dilated convolution are uniformly up-sampled or down-sampled and fused at the same resolution; the fusion strategy introduces a squeeze-and-excitation (SE) module that performs global average pooling on each channel's feature map and then shrinks the spatial dimension to obtain a corresponding weight value;
the SE module performs cross-channel feature fusion in a weighted manner. When cross-resolution interaction proceeds from stage n-1 to stage n, b denotes the index of the branch the fusion input belongs to, r denotes the index of the branch after fusion, and Down(H_{nr})^t and Upper(H_{nr})^t denote down-sampling and up-sampling the feature map H_{nr} t times, respectively. The newly fused branch result is:

$$f_{nr}(H_{(n-1)b}) = \begin{cases} H_{(n-1)b}, & b = r \\ \mathrm{Down}(H_{(n-1)b})^{\,r-b}, & b < r \\ \mathrm{Upper}(H_{(n-1)b})^{\,b-r}, & b > r \end{cases}$$

the SE module then takes each H_{nr} as its input X = [x_1, x_2, ……, x_{C'}], where x_{C'} is the set of feature maps of the C'-th channel; for each input X ∈ R^{H'×W'×C'}, mapped to features U ∈ R^{H×W×C}, the weight corresponding to each channel is computed separately;

wherein H_{nr} is the feature map of the r-th branch of the n-th stage, f_{nr}(·) denotes the sampling operation of the r-th branch in the n-th stage, H', W' and C' are the height, width and number of channels of the input original feature map, and H and W are the height and width of the output feature map of the SE module;
F_tr is a convolution operator; V = [v_1, v_2, ……, v_C] denotes the learned set of filter kernels, where v_c is the parameter set of the c-th filter,

$$v_c = [\,v_c^1, v_c^2, \ldots, v_c^{C'}\,]$$

finally, the feature map produced by this transformation, on which the SE module operates, is denoted U = [u_1, u_2, ……, u_C], where:

$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s$$

wherein u_c is the feature map of the c-th channel, u_c ∈ R^{H×W}, and v_c^s is a two-dimensional spatial kernel representing a single channel of v_c acting on the corresponding channel of the input X.
Optionally, the calculation process of the weighted feature map includes:
after the cross-channel feature map is obtained, the convolved feature map is processed to obtain a one-dimensional vector with as many elements as there are channels, serving as an evaluation score for each channel; the scores are then applied to the corresponding channels to obtain the result;

first, global spatial information is compressed into a channel descriptor through global pooling; the feature map of each channel has a local receptive field, so each unit of the output feature map U cannot exploit contextual information outside that region; a set of weight values z for the feature map is obtained by compressing the spatial dimensions of U, where the c-th element of z is computed as:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$

where F_{sq}(·) denotes the squeeze (compression) operation, and i and j index the height and width of the output feature map in the summation;

after the feature map is input, global pooling thus yields a one-dimensional vector, i.e. the weight corresponding to each channel; this pooled vector can be regarded as having a global receptive field, and in the excitation step it is adaptively recalibrated to capture channel-level dependencies;
then, through a Sigmoid activation function and a simple gating mechanism:

$$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2\,\delta(W_1 z))$$

where δ is the ReLU function, $W_1 \in \mathbb{R}^{\frac{C}{l} \times C}$, $W_2 \in \mathbb{R}^{C \times \frac{C}{l}}$, and F_{ex}(·) denotes the excitation operation;

to limit model complexity and aid generalization, the vector passes through two fully connected layers: a dimensionality-reduction layer with reduction ratio l, then a ReLU, then a dimensionality-increase layer that restores the channel dimension of the transformation output U. The final output weighted feature map is:
$$G_c = F_{scale}(u_c, s_c) = s_c\, u_c$$

where G = [G_1, G_2, ……, G_C], s_c is the weight value produced by the excitation step (the operations above that limit model complexity and aid generalization), and F_{scale}(u_c, s_c) denotes channel-wise multiplication between s_c and the feature map u_c ∈ R^{H×W}, i.e. multiplication with the corresponding positions of the original feature map, finally yielding the weighted feature map.
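The squeeze, excitation, and scaling steps above follow the standard SE design; a minimal PyTorch sketch is given below (the channel count and reduction ratio are illustrative assumptions, not the patent's configuration):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation sketch: global average pooling compresses each
    channel to one value, two FC layers (reduce by ratio l, then restore) with
    ReLU and Sigmoid produce the weights s, and the input is rescaled
    channel-wise: G_c = s_c * u_c."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # dimensionality reduction
            nn.ReLU(inplace=True),                       # delta in the formulas
            nn.Linear(channels // reduction, channels),  # dimensionality increase
            nn.Sigmoid(),                                # sigma gating
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))            # squeeze: z_c = average over H x W
        s = self.fc(z).view(n, c, 1, 1)   # excitation: s = sigmoid(W2 relu(W1 z))
        return u * s                      # scale: G_c = s_c * u_c
```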
Optionally, the convolution kernel size of the dilation convolution is:
d*(k-1)+1
where d is the expansion ratio and k represents the convolution kernel size before expansion.
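For example, in PyTorch a dilated convolution with this effective kernel size can be configured as follows (a hedged sketch; the channel width is an arbitrary illustrative choice):

```python
import torch
import torch.nn as nn

k, d = 3, 2                      # kernel size before dilation, dilation rate
effective_k = d * (k - 1) + 1    # = 5: the enlarged receptive field per layer

# padding = d*(k-1)//2 keeps the spatial size unchanged, so the low-resolution
# feature map is not shrunk while the receptive field grows.
conv = nn.Conv2d(32, 32, kernel_size=k, dilation=d, padding=d * (k - 1) // 2)

x = torch.randn(1, 32, 64, 64)
assert conv(x).shape == x.shape  # same size, larger receptive field
```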
Optionally, the training step of optimizing the HRNet network includes:
step 1: initializing the network parameters, including the categories represented by the tensor classes during training and the target classes of the training.
Step 2: initializing a data set;
storing the picture data of different categories in corresponding folders to generate classification serial numbers of different categories;
and step 3: initializing a training network;
inputting the picture into a channel with the original resolution to obtain a characteristic diagram of a first stage;
and 4, step 4: multi-resolution parallel training;
for the feature map obtained in the first stage, each subsequent stage adds a subnetwork that samples the resolution down to 1/2; the subnetworks from high to low resolution are connected in series, each subnetwork forms a stage composed of a series of convolutions, and a down-sampling layer between adjacent subnetworks halves the resolution, so that representations from high to low resolution are maintained throughout the process; taking the high-resolution subnetwork as the first stage, lower-resolution subnetworks are gradually added to form more stages, and the multi-resolution subnetworks are connected in parallel;
and 5: sampling a cross-resolution feature map;
sampling feature maps of subnets with different resolutions by adopting expansion convolution, leading the feature maps obtained after sampling to have larger receptive field, introducing an exchange unit into the parallel subnets, and leading each subnet to repeatedly receive information from other parallel subnets;
step 6: fusing the cross-resolution feature maps;
performing up-sampling or down-sampling on the feature map obtained by the expansion convolution, fusing the feature map to the same resolution, adopting a channel attention model for a feature map fusion strategy, introducing an attention mechanism to perform global average pooling on the feature map of each channel, and then shrinking the space dimension to obtain a corresponding weight value;
and 7: learning depth features to generate a model file;
the feature map obtained after step 6 serves as the input to the next stage; steps 3 to 6 are repeated, and after multi-layer feature extraction and deep-network feature learning, the final feature map is output as the classification learning result and the output model is saved.
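A condensed sketch of this training procedure might look as follows; OptimizedHRNet, the one-folder-per-class layout of step 2, and the hyper-parameter values are assumptions for illustration, not the patent's exact configuration:

```python
# Hypothetical training skeleton for steps 1-7.
import torch
from torch import nn, optim
from torchvision import datasets, transforms

def train(data_dir: str, num_classes: int, epochs: int = 50):
    # step 2: class folders become class indices
    dataset = datasets.ImageFolder(data_dir, transforms.Compose([
        transforms.Resize((256, 256)), transforms.ToTensor()]))
    loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

    model = OptimizedHRNet(num_classes)   # hypothetical model class (steps 3-6)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)

    for _ in range(epochs):               # step 7: learn depth features
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    torch.save(model.state_dict(), "optimized_hrnet.pth")  # save the model file
```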
Optionally, the method for acquiring a video stream includes video reading or camera shooting.
Optionally, the data set is: a KTH data set.
Optionally, the human body posture comprises: walking, jogging, running, boxing, waving and clapping.
The invention has the beneficial effects that:
the human body posture recognition method takes the high-resolution network HRNet as its basic framework and introduces dilated convolution and the squeeze-and-excitation (SE) attention module, improving the basic module of the HRNet network model: first, dilated convolution is added during cross-channel fusion of feature maps of different resolutions, so that the receptive field is enlarged without changing the size of the low-resolution feature map and without generating additional parameters or computation, ensuring that important information is not ignored during decision-making and preserving recognition accuracy; second, a new feature fusion strategy is proposed which, unlike HRNet's direct fusion after up-sampling, weights and fuses the feature maps of different resolutions through a channel attention mechanism, adaptively recalibrating channel-wise feature responses to enhance meaningful features and suppress weak ones. This effectively improves HRNet's recognition performance, converges faster while the parameter count stays essentially unchanged, saves computational cost, and further improves recognition accuracy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of human gesture recognition of the present invention.
Fig. 2 is a diagram of the overall network structure of HRNet.
Fig. 3 is a block diagram of an optimized HRNet network module according to the present invention.
Fig. 4 is a network configuration diagram of the second stage of HRNet.
Fig. 5 is a diagram of a BasicBlock network architecture.
FIG. 6 is a diagram of the original block (left) and the SE block (right).
Fig. 7 is a plot of MEI of a clap.
FIG. 8 is a diagram of a dense trajectory algorithm for human pose estimation.
Fig. 9 is a diagram of a HRNet network model.
Fig. 10 is a KTH data set video sequence chart.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The basic theoretical knowledge involved in the present invention is first introduced as follows:
1. human behavior recognition
The key to human behavior recognition is extracting robust behavior features. Unlike features in image space, the behavior features of people in video must describe not only their appearance in image space but also changes in appearance and posture; that is, behavior features extend from two-dimensional spatial features to three-dimensional spatio-temporal features. Many human behavior recognition methods based on RGB data have been proposed in recent years, including traditional methods with hand-crafted features and methods based on deep learning. Depth sensors such as Microsoft's Kinect device have also been applied, because depth data are more robust to the background environment. Research on these human behavior recognition methods each focuses on a specific aspect, such as depth-data-based methods, deep-learning-based methods, and 3D-convolution-based methods. In recent years many new behavior recognition methods have also emerged, such as those based on graph convolutional neural networks.
1.1 human posture detection based on spatio-temporal volumes
Methods based on spatio-temporal volumes are mainly template matching techniques, but unlike object recognition in image processing, they use three-dimensional spatio-temporal templates for human behavior recognition. The core of these methods is to construct a reasonable human behavior template and perform effective matching against it. The earliest approach used contours to describe the motion information of the human body and proposed the motion energy image (MEI) and motion history image (MHI) to represent behavior characteristics; fig. 7 shows the MEI and MHI of 3 classes of different behaviors. The central region of the human body is partitioned in the MHI using polar coordinates, and behaviors are expressed using a Motion Context (MC) descriptor based on the Scale-Invariant Feature Transform (SIFT). Histogram of oriented gradients (HOG) features of the image are extended to the spatio-temporal dimension, and 3-dimensional HOG features are used to describe human behavior in video.
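To make the MEI idea concrete, a minimal NumPy sketch follows (assuming a grayscale frame stack and a simple difference threshold; not the original authors' implementation):

```python
import numpy as np

def motion_energy_image(frames: np.ndarray, thresh: float = 25.0) -> np.ndarray:
    """frames: (T, H, W) grayscale sequence. The MEI is the union of
    thresholded frame differences: the region where any motion occurred
    within the window, as a binary mask."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))  # (T-1, H, W)
    return (diffs > thresh).any(axis=0).astype(np.uint8)
```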
1.2 based on trajectory features
Trajectory-based features use the trajectories of key points or joints of the human skeleton to represent behavior, for example the dense trajectories (DT) method and its improved version (IDT), shown in fig. 8. Dense point clouds are sampled in a video frame, the feature points are tracked by optical flow, and a motion trajectory is computed; motion boundary histograms (MBH), which capture the boundary information of the moving object more effectively, are extracted along the trajectory to describe human behavior. The IDT algorithm improves on this, for example by analyzing local motion trajectories through divisive clustering and using the clustering results to represent different motion levels when computing human behavior features.
2. HRNet deep network
HRNet is a series of subnetworks from high resolution to low resolution; each subnetwork constitutes a stage composed of a series of convolution operations, and between adjacent subnetworks the resolution is halved by a down-sampling layer, while the different branches convey information in parallel so that the high-resolution representation is maintained throughout the process. Taking the high-resolution subnetwork as the first stage, lower-resolution subnetworks are gradually added to form more stages, and the multi-resolution subnetworks are connected in parallel. The HRNet network model is shown in fig. 9.
In the multi-resolution subnetworks, let N_{sr} denote the subnetwork in the s-th stage with resolution index r. A high-to-low network with S stages can be expressed as:

N_{11} → N_{22} → N_{33} → N_{44}

The parallel subnetworks of each later stage comprise the resolutions of the previous stage plus one lower resolution; a network structure containing 4 parallel subnetworks is represented as:

N_{11} → N_{21} → N_{31} → N_{41}
    ↘ N_{22} → N_{32} → N_{42}
        ↘ N_{33} → N_{43}
            ↘ N_{44}
Information is repeatedly exchanged across the parallel multi-resolution subnetworks throughout the process, performing repeated multi-scale fusion. Exchange units are introduced in the parallel subnetworks so that each subnetwork repeatedly receives information from the other parallel subnetworks. An example of the information exchange scheme follows. The third stage is divided into three exchange blocks; each block is composed of 3 parallel convolution units, with one exchange unit spanning the parallel units:

C^1_{31} ↘      ↗ C^2_{31} ↘      ↗ C^3_{31} ↘
C^1_{32} → ε^1_3 → C^2_{32} → ε^2_3 → C^3_{32} → ε^3_3
C^1_{33} ↗      ↘ C^2_{33} ↗      ↘ C^3_{33} ↗

where C^b_{sr} denotes the convolution unit at resolution r in the b-th block of the s-th stage, and ε^b_s is the corresponding exchange unit. The input response maps are {X_1, X_2, ……, X_s} and the output response maps are {Y_1, Y_2, ……, Y_s}, with the same resolutions and widths as the inputs. The function a(X_i, k) up-samples or down-samples X_i from resolution i to resolution k: down-sampling uses a 3×3 convolution with stride 2 for 2× down-sampling, and two consecutive 3×3 convolutions with stride 2 for 4× down-sampling; for up-sampling, nearest-neighbor sampling followed by a 1×1 convolution is used to align the channel count. Each output is an aggregation of the input maps:

$$Y_k = \sum_{i=1}^{s} a(X_i, k)$$

The cross-stage exchange unit has an additional output map Y_{s+1} = a(Y_s, s+1).
For cross-channel feature fusion, when Stagen-1To StagenWhen cross-resolution interaction is carried out, r is used for representing fusion input of a stage, X is fusion output, D (H)t,U(H)tRespectively, downsampling and upsampling are performed t times on H. The branch result formula obtained by the new fusion is as follows:
fnr(H(n-1)r)=H(n-1)r x=r
fnr(H(n-1)r)=D(H(n-1)r)r-x x<r
fnr(H(n-1)r)=U(H(n-1)r)x-r x>r
Most convolutional neural network keypoint heatmap estimators resemble classification networks: a subnetwork reduces the resolution, a main body produces representations at the same (low) resolution, and a regressor estimates the heatmaps from which keypoint positions are estimated and transformed back to full resolution. The main body chiefly uses high-to-low and low-to-high frameworks, possibly augmented with multi-scale fusion and intermediate (deep) supervision. Most existing fusion schemes aggregate low-level and high-level representations. HRNet instead performs repeated multi-scale fusion to enhance the high-resolution representations with the low-resolution representations of the same depth and similar level, and vice versa, so that the high-resolution representations are also rich for pose estimation. The predicted feature map can therefore be more accurate.
The first embodiment is as follows:
the embodiment provides a human body posture recognition system based on an optimized HRNet network, and the recognition system comprises: the system comprises a video stream acquisition module, an optimized HRNet network module and a classification result output module; the video stream acquisition module, the optimized HRNet network module and the classification result output module are sequentially connected;
the optimized HRNet network module comprises: the system comprises a basic HRNet module, an expansion convolution module and an attention mechanism module; the expansion convolution module is positioned between subnets with different resolutions in the basic HRNet module and used for increasing the receptive field in the up-sampling or down-sampling process between subnets with different resolutions; the attention mechanism module is used for carrying out weighted fusion on the feature maps with different resolutions.
Example two:
the embodiment provides a human body posture recognition method based on an optimized HRNet network, which is realized based on the human body posture recognition system recorded in the first embodiment and comprises the following steps:
the method comprises the following steps: acquiring a video stream;
step two: the optimized HRNet network module acquires a human body posture picture to be recognized in a video stream, and inputs the picture into a channel with the original resolution to obtain a high-resolution subnet characteristic graph, namely an output characteristic graph in the first stage;
step three: down-sampling the high-resolution feature map to obtain a low-resolution subnet feature map with the resolution being 1/2 times of the original resolution, and increasing the receptive field by adopting expansion convolution in the sampling process;
step four: introducing an attention mechanism to perform cross-resolution feature fusion on the subnet feature maps with different resolutions, shrinking the space dimension after performing global average pooling on the feature map of each channel to obtain a corresponding weight value, and obtaining a weighted feature map, namely an output feature map of the second stage;
step five: taking the weighted feature map obtained in the fourth step as the input of the next stage, and repeating the second step to the fourth step until the output feature map of the fourth stage is output;
step six: and obtaining a final human body posture recognition result according to the output characteristic diagram of the fourth stage.
Example three:
the embodiment provides a human body posture recognition method based on an optimized HRNet network, which is realized based on the human body posture recognition system recorded in the first embodiment and comprises the following steps:
the method comprises the following steps: acquiring a video stream, wherein the method for acquiring the video stream is video reading;
step two: the optimized HRNet network module acquires a human body posture picture to be recognized in a video stream, and inputs the picture into a channel with the original resolution to obtain a high-resolution subnet characteristic graph, namely an output characteristic graph in the first stage;
step three: down-sampling the high-resolution feature map to obtain a low-resolution subnet feature map with the resolution being 1/2 times of the original resolution, and increasing the receptive field by adopting expansion convolution in the sampling process;
in this embodiment, the input and output feature-map sizes of the dilated convolution are related by:

$$W_2 = \left\lfloor \frac{W_1 + 2p - d(k-1) - 1}{s} \right\rfloor + 1$$

where W_1 is the input feature-map size, W_2 is the output feature-map size, d is the dilation rate, p is the pixel padding of the convolution, k is the convolution kernel size, and s is the stride.
Step four: introducing an attention mechanism to perform cross-resolution feature fusion on the subnet feature maps with different resolutions, shrinking the space dimension after performing global average pooling on the feature map of each channel to obtain a corresponding weight value, and obtaining a weighted feature map, namely an output feature map of the second stage;
the feature maps obtained by the dilated convolution are uniformly up-sampled or down-sampled and fused at the same resolution; the fusion strategy adopts a channel attention model, introducing a squeeze-and-excitation (SE) module that performs global average pooling on each channel's feature map and then shrinks the spatial dimension to obtain a corresponding weight value;
the SE module performs cross-channel feature fusion in a weighted manner. When cross-resolution interaction proceeds from stage n-1 to stage n, b denotes the index of the branch the fusion input belongs to, r denotes the index of the branch after fusion, and Down(H_{nr})^t and Upper(H_{nr})^t denote down-sampling and up-sampling the feature map H_{nr} t times, respectively. The newly fused branch result is:

$$f_{nr}(H_{(n-1)b}) = \begin{cases} H_{(n-1)b}, & b = r \\ \mathrm{Down}(H_{(n-1)b})^{\,r-b}, & b < r \\ \mathrm{Upper}(H_{(n-1)b})^{\,b-r}, & b > r \end{cases}$$

the SE module then takes each H_{nr} as its input X = [x_1, x_2, ……, x_{C'}], where x_{C'} is the set of feature maps of the C'-th channel; for each input X ∈ R^{H'×W'×C'}, mapped to features U ∈ R^{H×W×C}, the weight corresponding to each channel is computed separately;

wherein H_{nr} is the feature map of the r-th branch of the n-th stage, f_{nr}(·) denotes the sampling operation of the r-th branch in the n-th stage, H', W' and C' are the height, width and number of channels of the input original feature map, and H and W are the height and width of the output feature map of the SE module;
F_tr is a convolution operator; V = [v_1, v_2, ……, v_C] denotes the learned set of filter kernels, where v_c is the parameter set of the c-th filter,

$$v_c = [\,v_c^1, v_c^2, \ldots, v_c^{C'}\,]$$

finally, the feature map produced by this transformation, on which the SE module operates, is denoted U = [u_1, u_2, ……, u_C], where:

$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s$$

wherein u_c is the feature map of the c-th channel, u_c ∈ R^{H×W}, and v_c^s is a two-dimensional spatial kernel representing a single channel of v_c acting on the corresponding channel of the input X.
After the cross-channel characteristic diagram is obtained, processing the convolved characteristic diagram to obtain a one-dimensional vector with the same number as the number of channels as the evaluation score of each channel, and then respectively applying the score to the corresponding channels to obtain the result;
firstly, global space information is compressed into a channel descriptor through global pooling; the feature map of each channel has a local acceptance domain, so that each unit of the output feature map U cannot utilize the context information outside the region; obtaining a weight value set z of the feature map by compressing the space dimension of the feature map U, wherein the computing method of the c-th element of z is as follows:
Figure BDA0003464759220000111
wherein, Fsq() Representing compression of an image, uc(i, j) wherein i and j are subscripts to the accumulation of the height and width of the output profile;
obtaining a one-dimensional vector, namely the weight corresponding to each channel, by global pooling after inputting the characteristic diagram, regarding the one-dimensional vector after the global pooling as having a global receptive field, and obtaining a channel level dependency relationship by self-adaptive recalibration in an excitation link;
Then, through a Sigmoid activation function and a simple gating mechanism:

$$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2\,\delta(W_1 z))$$

where δ is the ReLU function, $W_1 \in \mathbb{R}^{\frac{C}{l} \times C}$, $W_2 \in \mathbb{R}^{C \times \frac{C}{l}}$, and F_{ex}(·) denotes the excitation operation.

To limit model complexity and aid generalization, the vector passes through two fully connected layers: a dimensionality-reduction layer with reduction ratio l, then a ReLU, then a dimensionality-increase layer that restores the channel dimension of the transformation output U. The final output weighted feature map is:
$$G_c = F_{scale}(u_c, s_c) = s_c\, u_c$$

where G = [G_1, G_2, ……, G_C], s_c is the weight value produced by the excitation step (the operations above that limit model complexity and aid generalization), and F_{scale}(u_c, s_c) denotes channel-wise multiplication between s_c and the feature map u_c ∈ R^{H×W}, i.e. multiplication with the corresponding positions of the original feature map, finally yielding the weighted feature map.
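Continuing the sketches from the disclosure section, the weighted fusion of step four can be exercised end to end (hypothetical shapes; SEBlock as sketched earlier):

```python
# Hypothetical end-to-end check of step four: fuse, then recalibrate with SE.
import torch

fused = torch.randn(1, 8, 128, 128)    # a cross-resolution fused feature map
se = SEBlock(channels=8, reduction=4)  # SEBlock as sketched in the disclosure
weighted = se(fused)                   # G_c = s_c * u_c
assert weighted.shape == fused.shape   # same shape, channel-wise re-weighted
```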
Step five: taking the weighted feature map obtained in the fourth step as the input of the next stage, and repeating the second step to the fourth step until the output feature map of the fourth stage is output;
step six: and obtaining a final human body posture recognition result according to the output characteristic diagram of the fourth stage.
The human body gestures recognized in the embodiment include: walking, jogging, running, boxing, waving and clapping.
In this embodiment, before human posture recognition, the model is first trained as follows:

Step 1: initialize the training parameters and set the training model parameters.

The relevant data in the configuration file are modified, a single GPU is used with default parameters, and the recognized categories are modified, comprising two parts: first, the categories represented by the tensor classes during training; second, the target classes. The training hyper-parameters are then modified: the batch size, the learning-rate decay, and the number of training iterations.
Step 2: initializing a data set
And storing the picture data of different categories in corresponding folders to generate classification serial numbers of different categories.
And step 3: initializing a training network
A human body posture picture is input into the network and preprocessed to size 3×256×256; the first-stage feature map, of size 8×128×128, is obtained in the channel with the maximum resolution and denoted H_{11}.
And 4, step 4: multiresolution parallel training
For the feature map H_{11} obtained in the first stage, the second-stage network down-samples the feature map to 1/2 size and connects subnetworks from high to low resolution in series, obtaining feature maps H_{21} and H_{22}, where H_{22} has size 32×64×64. Each subsequent stage then adds a subnetwork that further down-samples the resolution. Each subnetwork forms a stage composed of a series of convolutions, with a down-sampling layer between adjacent subnetworks halving the resolution, so that representations from high to low resolution are maintained throughout the process. Taking the high-resolution subnetwork as the first stage, lower-resolution subnetworks are gradually added to form more stages, and the multi-resolution subnetworks are connected in parallel. To obtain a branch of the fourth stage, H_{31} (8×128×128), H_{32} (32×64×64), and H_{33} (64×32×32) are used to produce H_{41} (8×128×128), H_{42} (32×64×64), H_{43} (64×32×32), and H_{44} (128×16×16). The aggregation is direct addition after size matching: H_{41} = f_1(H_{31}) + f_2(H_{32}) + f_3(H_{33}), where f_1(H_{31}) = H_{31} for the feature map of the same resolution, and f_2 and f_3 restore the size of the posture feature map and transform the channels through up-sampling.
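This aggregation with the concrete sizes above can be checked numerically (a sketch assuming nearest-neighbor up-sampling and hypothetical 1×1 convolutions for the channel transforms f_2 and f_3):

```python
# Illustrative check of H41 = f1(H31) + f2(H32) + f3(H33) with the quoted sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

H31 = torch.randn(1, 8, 128, 128)
H32 = torch.randn(1, 32, 64, 64)
H33 = torch.randn(1, 64, 32, 32)

f2 = nn.Conv2d(32, 8, kernel_size=1)  # channel transform for the 2x up-sampled map
f3 = nn.Conv2d(64, 8, kernel_size=1)  # channel transform for the 4x up-sampled map

H41 = (H31                                                   # f1 is the identity
       + f2(F.interpolate(H32, scale_factor=2, mode="nearest"))
       + f3(F.interpolate(H33, scale_factor=4, mode="nearest")))
assert H41.shape == (1, 8, 128, 128)
```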
And 5: cross-resolution feature map sampling
At the end of each stage, the feature maps trained at the different resolutions undergo cross-resolution fusion, restoring the size of the posture feature map and transforming the channels; information is repeatedly exchanged across the parallel multi-resolution subnetworks throughout the process, enabling repeated multi-scale fusion.
The classification result is usually predicted from the output of the last layer, so the number of original-image pixels mapped to a single point on the feature map determines the upper limit of the object size the network can detect. The receptive field is typically enlarged through down-sampling, but a consequence of down-sampling is that small targets are hard to detect; dilated convolution, a module specifically designed for dense prediction, is therefore introduced to further improve accuracy. Dilated convolution replaces the traditional 1×1 convolution in the sampling process, giving the feature map a larger receptive field. Exchange units are introduced in the parallel subnetworks so that each subnetwork repeatedly receives information from the other parallel subnetworks. The dilated-convolution input (W_1) and output (W_2) feature-map sizes are related by:

$$W_2 = \left\lfloor \frac{W_1 + 2p - d(k-1) - 1}{s} \right\rfloor + 1$$
for the classification task of human body gestures, multi-scale context information needs to be aggregated after sampling without losing resolution or analyzing rescaled images. By adding the pixel points with the value of 0 between the pixel points of each channel, the size of the kernel is increased in a direction-changing manner, and thus the receptive field is increased. The convolution kernel for normal convolution is denoted by k, and the increase in the size of the convolution kernel after introducing the expansion rate d is:
d*(k-1)+1
the pair H is obtained by f2 and f3 after the convolution kernel is expanded31,H32,H33Performing convolution operation to obtain H42,H43,H44.
Step 6: cross-resolution feature map fusion
The feature maps obtained by the dilated convolution are uniformly restored (up-sampled or down-sampled) to the same resolution and then fused. The fusion strategy adopts a channel attention model: a squeeze-and-excitation (SE) module is introduced to perform global average pooling on each channel's feature map and then shrink the spatial dimension to obtain the corresponding weight values.
The SE module performs cross-channel feature fusion in a weighted manner. When cross-resolution interaction proceeds from the (n-1)-th stage to the n-th stage, b denotes the index of the branch the fusion input belongs to, r denotes the index of the branch after fusion, and Down(H_{nr})^t and Upper(H_{nr})^t denote down-sampling and up-sampling the feature map H_{nr} t times, respectively. The newly fused branch result is:

$$f_{nr}(H_{(n-1)b}) = \begin{cases} H_{(n-1)b}, & b = r \\ \mathrm{Down}(H_{(n-1)b})^{\,r-b}, & b < r \\ \mathrm{Upper}(H_{(n-1)b})^{\,b-r}, & b > r \end{cases}$$

The SE module then takes each H_{nr} as its input X = [x_1, x_2, ……, x_{C'}], where x_{C'} is the set of feature maps of the C'-th channel; for each input X ∈ R^{H'×W'×C'}, mapped to features U ∈ R^{H×W×C}, the weight corresponding to each channel is computed separately;

wherein H_{nr} is the feature map of the r-th branch of the n-th stage, f_{nr}(·) denotes the sampling operation of the r-th branch in the n-th stage, H', W' and C' are the height, width and number of channels of the input original feature map, and H and W are the height and width of the output feature map of the SE module;
F_tr is a convolution operator; V = [v_1, v_2, ……, v_C] denotes the learned set of filter kernels, where v_c is the parameter set of the c-th filter,

$$v_c = [\,v_c^1, v_c^2, \ldots, v_c^{C'}\,]$$

finally, the feature map produced by this transformation, on which the SE module operates, is denoted U = [u_1, u_2, ……, u_C], where:

$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s$$

wherein u_c is the feature map of the c-th channel, u_c ∈ R^{H×W}, and v_c^s is a two-dimensional spatial kernel representing a single channel of v_c acting on the corresponding channel of the input X.
After the cross-channel feature map is obtained, the convolved feature map is processed to obtain a one-dimensional vector with as many elements as there are channels, serving as an evaluation score for each channel; the scores are then applied to the corresponding channels to obtain the result.

First, global spatial information is compressed into a channel descriptor through global pooling; the feature map of each channel has a local receptive field, so each unit of the output feature map U cannot exploit contextual information outside that region. A set of weight values z for the feature map is obtained by compressing the spatial dimensions of U, where the c-th element of z is computed as:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$

where F_{sq}(·) denotes the squeeze (compression) operation, and i and j index the height and width of the output feature map in the summation.

After the feature map is input, global pooling thus yields a one-dimensional vector, i.e. the weight corresponding to each channel; this pooled vector can be regarded as having a global receptive field, and in the excitation step it is adaptively recalibrated to capture channel-level dependencies.
therefore, through a Sigmoid activation function and a simple gating mechanism, the following conditions are satisfied:
Sigmoid=Fex(z,W)=σ(g(z,W))=σ(W2δ(W1z))
where, δ is the ReLU function,
Figure BDA0003464759220000142
Fex() Indicating an activation operation;
in order to limit the complexity of the model and help generalization, after passing through two fully-connected layers, namely a dimensionality reduction layer with a reduction ratio of l, passing through a ReLU function, then carrying out dimensionality increase, and then returning to the channel dimensionality of the conversion output feature graph U, wherein the final output weighted feature graph is as follows:
$$G_c = F_{scale}(u_c, s_c) = s_c\, u_c$$

where G = [G_1, G_2, ……, G_C], s_c is the weight value produced by the excitation step (the operations above that limit model complexity and aid generalization), and F_{scale}(u_c, s_c) denotes channel-wise multiplication between s_c and the feature map u_c ∈ R^{H×W}, i.e. multiplication with the corresponding positions of the original feature map, finally yielding the weighted feature map.
And 7: learning depth features to generate model files
The feature map obtained after step 6 serves as the input to the next stage; steps 3 to 6 are repeated, and after multi-layer feature extraction and deep-network feature learning, the final feature map is output as the classification learning result and the output model is saved.
Step 8: detect postures with the weight file to achieve posture classification

The weight file of the trained model is loaded to detect the human body posture.
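Steps 7 and 8 together might be exercised as follows (a hypothetical sketch; OptimizedHRNet and the file name are the same illustrative assumptions as in the training sketch above):

```python
# Hypothetical inference sketch for step 8: load saved weights, classify a posture.
import torch

ACTIONS = ["walking", "jogging", "running", "boxing", "waving", "clapping"]

model = OptimizedHRNet(num_classes=len(ACTIONS))   # hypothetical model class
model.load_state_dict(torch.load("optimized_hrnet.pth"))
model.eval()

with torch.no_grad():
    frame = torch.randn(1, 3, 256, 256)            # a preprocessed video frame
    pred = model(frame).argmax(dim=1).item()
print(ACTIONS[pred])
```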
To further illustrate the beneficial effects achievable by the human body posture recognition method, the following experiments were carried out:
1. experimental conditions and parameters
The dataset adopted by the invention is the KTH dataset, whose video sequences contain six different classes of human behavior: walking, jogging, running, boxing, waving, and clapping. Both the environment and the subjects vary, which helps assess the accuracy of the proposed method. Each action is performed multiple times by 25 different subjects in a variety of scenes, indoors and outdoors, at varying scales and in varying clothing. All motion sequences have a static and uniform background, a video frame rate of 25 frames per second, a resolution of 160 × 120 pixels, and an average action duration of about 4 seconds. There are 25 videos for each action in each of four different scenarios. Some snapshots of video sequences from the KTH dataset are shown in fig. 10.
2. Experiment and analysis of results
The method is implemented with PyCharm Community Edition 2021.1.2 x64 and runs on a workstation with a 12-core Intel Core i7-8700 processor at 3.2 GHz, 16 GB of memory, and an NVIDIA GeForce GTX 1080 Ti graphics card. Performance is compared and analyzed against the HRNet method proposed by Ke Sun, Jingdong Wang, et al. in "Deep High-Resolution Representation Learning for Human Pose Estimation" (2020). Because the original HRNet method lacks a channel attention mechanism attending to global features, SE is also added in the comparison experiments; the experimental results follow.
(1) Ablation analysis
To verify the degree to which the added channel attention module and the dilated convolution affect HRNet's feature extraction capability and human posture prediction accuracy, network structures with and without the attention mechanism were constructed for ablation experiments; the model without the attention mechanism performs cross-channel feature fusion with 1×1 convolutions. The experiments were trained on the KTH dataset and verified on the validation set, in both cases without loading pre-trained models.

The ablation experiments draw conclusions from two comparisons. First, performance over rounds 0-50 without loaded pre-trained models is compared; the experimental results are shown in table 1, with four groups of ablation experiments (A, B, C, D) conducted using a controlled-variable method. The model without the attention mechanism improves the basic module of the HRNet network model using only dilated convolution and, compared with the original HRNet model, adds no parameters or computational complexity. Both the model with the attention mechanism and the model without it use the dilated convolution module.
TABLE 1 ablation experiment
The experimental results show that when the squeeze-and-excitation SE module is not added to the HRNet network, the overall performance of the model drops by only 1.6%; after the SE module is added, the parameter count and computational complexity of the network model do not increase, the performance of the model is preserved, and the goal of improving accuracy is achieved. For the model with the attention mechanism, adding the dilated convolution has little impact on computational complexity while improving accuracy by 0.5%.
(2) Performance comparison
The performance of the proposed network model on the KTH dataset is shown in table 2. The KTH dataset for human-posture-based classification has few action types, but the actions are highly similar; with temporal correlation removed, the deep features of many actions are difficult to learn. Because the compared pose estimation algorithms differ, spanning histogram-based methods, bags of visual words, and current mainstream networks and their improved versions, the comparison is made only in terms of performance rather than computational complexity and parameters:
TABLE 2 comparative experiments
Tables 3 and 4 present different combinations of feature extraction and classifier techniques on the KTH dataset. The results show that, compared with the native HRNet network, the classification accuracy of every category is improved. Only a small fraction of errors remain in classifying ambiguous actions such as boxing, waving, and clapping. In addition, small portions of motions such as running, walking, and jogging are still misclassified.
TABLE 3 HRNet Classification results
TABLE 4 optimization of HRNet classification results
Overall, the recognition accuracy improves for all categories, generally by more than 1%. Small portions of the jogging, running, and walking classes are still misjudged: on one hand, the frame-to-frame continuity of the videos makes the similar preparatory motions of these postures easy to confuse; on the other hand, the three postures are difficult to distinguish from the data streams intercepted from the video frames.
In summary, the human body posture recognition method of the embodiments of the invention can effectively improve recognition accuracy while ensuring that the parameter count and computational complexity of the network model do not increase. In addition, the invention introduces a channel attention mechanism to weight and fuse feature maps of different resolutions, adaptively recalibrating channel-wise feature responses to enhance meaningful features and suppress weak ones, so that convergence is faster, computational cost can be further reduced, and recognition is accelerated.
Some steps in the embodiments of the present invention may be implemented by software, and the corresponding software program may be stored in a readable storage medium, such as an optical disc or a hard disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A human body posture recognition system based on an optimized HRNet network is characterized in that the recognition system comprises: the system comprises a video stream acquisition module, an optimized HRNet network module and a classification result output module; the video stream acquisition module, the optimized HRNet network module and the classification result output module are sequentially connected;
the optimized HRNet network module comprises: the system comprises a basic HRNet module, an expansion convolution module and an attention mechanism module; the expansion convolution module is positioned between the subnets with different resolutions in the basic HRNet module and is used for performing expansion convolution in the up-sampling or down-sampling process between the subnets with different resolutions so as to increase the receptive field; the attention mechanism module is used for carrying out weighted fusion on the feature maps with different resolutions.
2. A human body posture recognition method based on an optimized HRNet network, which is characterized in that the method is realized based on the human body posture recognition system of claim 1, and comprises the following steps:
the method comprises the following steps: acquiring a video stream;
step two: the optimized HRNet network module acquires a human body posture picture to be recognized in a video stream, and inputs the picture into a channel with the original resolution to obtain a high-resolution subnet characteristic graph, namely an output characteristic graph in the first stage;
step three: down-sampling the high-resolution feature map to obtain a low-resolution subnet feature map with the resolution being 1/2 times of the original resolution, and increasing the receptive field by adopting expansion convolution in the sampling process;
step four: introducing an attention mechanism to perform cross-resolution feature fusion on the subnet feature maps with different resolutions obtained in the step three, shrinking the space dimension after performing global average pooling on the feature map of each channel to obtain a corresponding weight value, and obtaining a weighted feature map, namely the output feature map of the second stage;
step five: taking the weighted feature map obtained in the fourth step as the input of the next stage, and repeating the second step to the fourth step until the output feature map of the fourth stage is output;
step six: and obtaining a final human body posture recognition result according to the output characteristic diagram of the fourth stage.
3. The human body posture recognition method according to claim 2, wherein the relation between the input and output feature map sizes of the expansion convolution is:
W_2 = (W_1 + 2p - d(k - 1) - 1)/s + 1
wherein W_1 is the input feature map size, W_2 is the output feature map size, d is the dilation rate, p is the pixel padding of the convolution, k is the convolution kernel size, and s is the stride.
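For illustration only, a small Python check of this output-size relation (the function name is hypothetical); note that with d = 2, k = 3, p = 2, s = 1 the spatial size is preserved while the receptive field grows:

```python
def dilated_conv_output_size(W1: int, k: int, d: int, p: int, s: int) -> int:
    """Output size of a dilated convolution per the relation in claim 3."""
    effective_k = d * (k - 1) + 1          # effective kernel size (see claim 6)
    return (W1 + 2 * p - effective_k) // s + 1

assert dilated_conv_output_size(W1=64, k=3, d=2, p=2, s=1) == 64
```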
4. The human body posture recognition method according to claim 2, wherein the calculation process of the fourth step comprises:
uniformly up-sampling or down-sampling the feature maps obtained by the expansion convolution and fusing the feature maps at the same resolution; a squeeze-and-excitation module SE is introduced into the feature map fusion strategy to perform global average pooling on the feature map of each channel and then shrink the spatial dimension to obtain a corresponding weight value;
the SE module performs cross-channel feature fusion in a weighted manner; in the resolution cross-interaction from stage n-1 to stage n, b denotes the index of the branch from which a fusion input comes and r denotes the index of the branch into which it is fused; Down(H_nr)^t and Upper(H_nr)^t denote down-sampling and up-sampling the feature map H_nr t times, respectively; the newly fused branch is given by (an illustrative sketch of this sampling rule follows claim 4):
f_nr(H_(n-1)b) = H_(n-1)b,              if b = r
f_nr(H_(n-1)b) = Down(H_(n-1)b)^(r-b),  if b < r
f_nr(H_(n-1)b) = Upper(H_(n-1)b)^(b-r), if b > r
the SE module then takes each fused H_nr as its input X, X = [x_1, x_2, ……, x_C'], where x_C' denotes the feature map of the C'-th channel; each input X ∈ R^(H'×W'×C') is transformed to the feature map U ∈ R^(H×W×C), and the weight corresponding to each feature is calculated;
wherein H_nr is the feature map of the r-th branch of the n-th stage, f_nr(·) denotes the sampling operation of the r-th branch in the n-th stage, H', W' and C' are respectively the height, width and number of channels of the input feature map, and H, W and C are those of the output feature map of the SE module;
F_tr denotes a convolution operator, implemented by the conv2D operation in the network layer; V = [v_1, v_2, ……, v_C] represents the learned set of filter kernels, where v_c = [v_c^1, v_c^2, ……, v_c^C'] is the parameter set of the c-th filter;
finally, the feature map output by the SE module is denoted U = [u_1, u_2, ……, u_C], in which:
u_c = v_c * X = Σ_{s=1}^{C'} v_c^s * x^s
wherein u_c is the feature map of the c-th channel output by the SE module, u_c ∈ R^(H×W), and v_c^s is a two-dimensional spatial kernel of a single channel acting on the corresponding channel of the input X.
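For illustration only, a sketch of the piecewise sampling rule f_nr of this claim, assuming average pooling for Down(·) and nearest-neighbour interpolation for Upper(·); the claimed method additionally applies expansion convolution during the sampling, which is elided here:

```python
import torch
import torch.nn.functional as F

def fuse_branch(h_prev: torch.Tensor, b: int, r: int) -> torch.Tensor:
    """Map the stage n-1 feature map of branch b onto branch r's resolution.

    Down(.)^t halves the resolution t times; Upper(.)^t doubles it t times.
    Hypothetical helper, not the claimed implementation.
    """
    if b == r:
        return h_prev                              # same branch: pass through
    if b < r:                                      # higher-res input: down-sample r-b times
        for _ in range(r - b):
            h_prev = F.avg_pool2d(h_prev, kernel_size=2)
        return h_prev
    for _ in range(b - r):                         # lower-res input: up-sample b-r times
        h_prev = F.interpolate(h_prev, scale_factor=2, mode="nearest")
    return h_prev

x = torch.randn(1, 32, 64, 48)                     # branch b=1 feature map
print(fuse_branch(x, b=1, r=3).shape)              # torch.Size([1, 32, 16, 12])
```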
5. The human body posture recognition method according to claim 4, wherein the calculation process of the weighted feature map comprises:
after the cross-channel feature map is obtained, the convolved feature map is processed to obtain a one-dimensional vector with as many elements as there are channels, serving as an evaluation score for each channel; the scores are then applied to the corresponding channels respectively to obtain the result;
firstly, global spatial information is compressed into a channel descriptor through global pooling; since the feature map of each channel has a local receptive field, each unit of the output feature map U cannot exploit context information outside its own region; a set of weight values z is obtained by compressing the spatial dimensions of the feature map U, and the c-th element of z is computed as:
z_c = F_sq(u_c) = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)
wherein F_sq(·) denotes the compression (squeeze) operation, and i and j are the indices over the height and width of the output feature map u_c;
after the input feature map passes through global pooling, a one-dimensional vector is obtained, namely the weight corresponding to each channel; this pooled vector can be regarded as having a global receptive field, and the channel-level dependencies are obtained by adaptive recalibration in the excitation step;
therefore, through a Sigmoid activation function and a simple gating mechanism, the weights satisfy:
s = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z))
wherein δ is the ReLU function, W_1 ∈ R^((C/l)×C) and W_2 ∈ R^(C×(C/l)) are the weights of the two fully-connected layers, σ is the Sigmoid function, and F_ex(·) denotes the excitation operation;
in order to limit model complexity and aid generalization, z passes through two fully-connected layers: a dimensionality-reduction layer with reduction ratio l, followed by a ReLU function, and then a dimensionality-increase layer that restores the channel dimension of the transformed output feature map U; the final weighted output feature map is:
G_c = F_scale(u_c, s_c) = s_c · u_c
wherein G = [G_1, G_2, ……, G_C], s_c denotes the weight value obtained after the operations that limit model complexity and aid generalization, and F_scale(u_c, s_c) is the channel-wise multiplication between s_c and the feature map u_c ∈ R^(H×W), that is, each position of the original feature map is multiplied by its channel weight, finally yielding the weighted feature map.
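For illustration only, a PyTorch sketch of the squeeze-excitation-scale pipeline of claims 4-5 (global average pooling, two fully-connected layers with reduction ratio l, Sigmoid gating, channel-wise rescaling); class and parameter names are hypothetical:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation sketch matching claims 4-5 (reduction ratio l)."""
    def __init__(self, channels: int, l: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)       # F_sq: global average pool
        self.excite = nn.Sequential(                 # F_ex: sigma(W_2 delta(W_1 z))
            nn.Linear(channels, channels // l),      # dimensionality reduction
            nn.ReLU(inplace=True),                   # delta
            nn.Linear(channels // l, channels),      # dimensionality increase
            nn.Sigmoid(),                            # sigma -> channel weights s
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = u.shape
        z = self.squeeze(u).view(n, c)               # z_c = mean over H x W
        s = self.excite(z).view(n, c, 1, 1)          # weight s_c per channel
        return u * s                                 # G_c = s_c * u_c (F_scale)

g = SEBlock(channels=64)(torch.randn(2, 64, 32, 24))
print(g.shape)                                       # torch.Size([2, 64, 32, 24])
```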
6. The human body posture recognition method according to claim 2, wherein the effective convolution kernel size of the expansion convolution is:
d*(k-1)+1
wherein d is the dilation rate and k represents the convolution kernel size before dilation.
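As a quick illustrative check of this relation for a 3×3 kernel:

```python
k = 3                              # kernel size before dilation
for d in (1, 2, 4):                # dilation rates
    print(d, d * (k - 1) + 1)      # -> 3, 5, 9: receptive field grows, parameters do not
```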
7. The human body posture recognition method according to claim 2, wherein the training step of optimizing the HRNet network comprises:
step 1: initializing network parameters, including: the categories represented by the tensor categories during training, and the target classes of the training defined in class.
Step 2: initializing a data set;
storing the picture data of different categories in corresponding folders to generate classification serial numbers of different categories;
and step 3: initializing a training network;
inputting the picture into a channel at the original resolution to obtain the feature map of the first stage;
and 4, step 4: multi-resolution parallel training;
for the feature map obtained in the first stage, each subsequent stage adds a subnetwork that samples the resolution down to 1/2; the subnetworks from high resolution to low resolution are connected in series, each subnetwork constitutes a stage and consists of a series of convolutions, and a down-sampling layer between adjacent subnetworks halves the resolution, so that representations from high to low resolution are maintained throughout the process; taking the high-resolution subnet as the first stage, lower-resolution subnets are gradually added to form more stages, and the multi-resolution subnets are connected in parallel;
and 5: sampling a cross-resolution feature map;
sampling the feature maps of the subnets with different resolutions by expansion convolution, so that the sampled feature maps have a larger receptive field; an exchange unit is introduced into the parallel subnets so that each subnet repeatedly receives information from the other parallel subnets;
step 6: fusing the cross-resolution feature maps;
performing up-sampling or down-sampling on the feature maps obtained by the expansion convolution and fusing them at the same resolution; the feature map fusion strategy adopts a channel attention model, introducing an attention mechanism to perform global average pooling on the feature map of each channel and then shrink the spatial dimension to obtain a corresponding weight value;
and 7: learning depth features to generate a model file;
the feature map obtained after step 6 continues to serve as the input of the next stage; steps 3 to 6 are repeated, and after multi-layer feature extraction and deep-network feature learning, the final feature map is output as the classification learning result and the resulting model file is saved.
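For illustration only, a skeleton of the training procedure of steps 1-7, with a stand-in model and random data in place of the optimized HRNet and the KTH data set (claim 9); all names are hypothetical:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data so the sketch runs; the actual network is the
# optimized HRNet of claim 1 with the six classes of claim 10.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 6))
data = DataLoader(TensorDataset(torch.randn(8, 3, 64, 64),
                                torch.randint(0, 6, (8,))), batch_size=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(2):                       # steps 3-6 repeated stage by stage
    for images, labels in data:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "optimized_hrnet.pth")   # step 7: save the model
```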
8. The human body posture recognition method according to claim 2, wherein the method of acquiring the video stream comprises video reading or camera shooting.
9. The human body posture recognition method according to claim 7, wherein the data set is: a KTH data set.
10. The human body posture recognition method according to claim 2, wherein the human body posture includes: walking, jogging, running, boxing, hand waving and hand clapping.
CN202210036567.9A 2022-01-11 2022-01-11 HRNet human body posture identification method based on attention mechanism optimization Pending CN114419732A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210036567.9A CN114419732A (en) 2022-01-11 2022-01-11 HRNet human body posture identification method based on attention mechanism optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210036567.9A CN114419732A (en) 2022-01-11 2022-01-11 HRNet human body posture identification method based on attention mechanism optimization

Publications (1)

Publication Number Publication Date
CN114419732A (en) 2022-04-29

Family

ID=81272805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210036567.9A Pending CN114419732A (en) 2022-01-11 2022-01-11 HRNet human body posture identification method based on attention mechanism optimization

Country Status (1)

Country Link
CN (1) CN114419732A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115169556A (en) * 2022-07-25 2022-10-11 美的集团(上海)有限公司 Model pruning method and device
CN115169556B (en) * 2022-07-25 2023-08-04 美的集团(上海)有限公司 Model pruning method and device
CN115761885A (en) * 2022-11-16 2023-03-07 之江实验室 Behavior identification method for synchronous and cross-domain asynchronous fusion drive
CN115761885B (en) * 2022-11-16 2023-08-29 之江实验室 Behavior recognition method for common-time and cross-domain asynchronous fusion driving
CN117690007A (en) * 2024-02-01 2024-03-12 成都大学 High-frequency workpiece image recognition method
CN117690007B (en) * 2024-02-01 2024-04-19 成都大学 High-frequency workpiece image recognition method
CN117711028A (en) * 2024-02-06 2024-03-15 深圳大学 Human body posture estimation method and system based on attention mechanism module

Similar Documents

Publication Publication Date Title
CN111291739B (en) Face detection and image detection neural network training method, device and equipment
Sigal Human pose estimation
CN112184752A (en) Video target tracking method based on pyramid convolution
CN111291809B (en) Processing device, method and storage medium
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN110705463A (en) Video human behavior recognition method and system based on multi-mode double-flow 3D network
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
Gu et al. Multiple stream deep learning model for human action recognition
Sanchez-Caballero et al. Exploiting the convlstm: Human action recognition using raw depth video-based recurrent neural networks
CN112329525A (en) Gesture recognition method and device based on space-time diagram convolutional neural network
Wang et al. A comprehensive overview of person re-identification approaches
Linda et al. Color-mapped contour gait image for cross-view gait recognition using deep convolutional neural network
Asadi-Aghbolaghi et al. Supervised spatio-temporal kernel descriptor for human action recognition from RGB-depth videos
Fan Research and realization of video target detection system based on deep learning
Afifi et al. Object depth estimation from a single image using fully convolutional neural network
Kumar Shukla et al. Comparative analysis of machine learning based approaches for face detection and recognition
Behera et al. Superpixel-based multiscale CNN approach toward multiclass object segmentation from UAV-captured aerial images
Kowdiki et al. Adaptive hough transform with optimized deep learning followed by dynamic time warping for hand gesture recognition
Gao et al. Evaluation of regularized multi-task leaning algorithms for single/multi-view human action recognition
Al-Faris et al. Multi-view region-adaptive multi-temporal DMM and RGB action recognition
Malik et al. Human action interpretation using convolutional neural network: a survey
Ahmad et al. Embedded deep vision in smart cameras for multi-view objects representation and retrieval
Le Deep learning-based for human segmentation and tracking, 3D human pose estimation and action recognition on monocular video of MADS dataset
Ahmed et al. Two person interaction recognition based on effective hybrid learning
Yu Deep learning methods for human action recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination