CN113609993A - Attitude estimation method, device and equipment and computer readable storage medium - Google Patents

Attitude estimation method, device and equipment and computer readable storage medium

Info

Publication number
CN113609993A
Authority
CN
China
Prior art keywords
human body
network
feature
block diagram
infrared image
Prior art date
Legal status
Pending
Application number
CN202110901833.5A
Other languages
Chinese (zh)
Inventor
刘振辉
徐召飞
王云奇
刘晴
Current Assignee
Iray Technology Co Ltd
Original Assignee
Iray Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Iray Technology Co Ltd filed Critical Iray Technology Co Ltd
Priority to CN202110901833.5A
Publication of CN113609993A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method, a device, equipment and a computer readable storage medium for attitude estimation, wherein the method comprises the following steps: training a target detection network by using an infrared image data set; training a feature extraction network by using the obtained human body candidate block diagrams; preprocessing the human body candidate block diagram to be recognized obtained with the human body detection network, performing feature extraction on the preprocessed candidate block diagram with the human body skeleton point extraction network, fusing the features and outputting human body skeleton point coordinates, and determining the posture of the human body. According to the technical scheme, image enhancement is carried out on the human body candidate block diagram to be recognized so that human body features become more obvious, and the human body skeleton point extraction network extracts features of different dimensions from the candidate block diagram and fuses them, so that weak feature information under the infrared image can be learned, the features carry better global information, and the overall accuracy of infrared image attitude estimation is improved.

Description

Attitude estimation method, device and equipment and computer readable storage medium
Technical Field
The present application relates to the technical field of attitude estimation, and more particularly, to an attitude estimation method, apparatus, device and computer-readable storage medium.
Background
In fields such as security, autonomous driving and robotics, attitude estimation is widely used either as an auxiliary input for decision-making or as the basis of the final judgment.
At present, most existing attitude estimation is applied in visible-light environments, but visible-light images cannot be acquired well in environments without strong supplementary lighting, such as at night or in foggy weather, so attitude estimation based on visible-light images cannot be realized well in such environments. Infrared images, by contrast, can still clearly reveal targets at night and in similar environments, so estimating the attitude from infrared images in these environments is of great significance. However, compared with visible-light images, infrared images have lower contrast and contain less detailed information about faces, limbs and the like, which makes a deep learning network difficult to converge; therefore, existing attitude estimation methods designed for visible-light images cannot be applied well to infrared images.
In summary, how to improve the accuracy of the pose estimation in the infrared image is a technical problem to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of the above, an object of the present application is to provide an attitude estimation method, apparatus, device and computer-readable storage medium, which are used to improve the accuracy of attitude estimation in infrared images.
In order to achieve the above purpose, the present application provides the following technical solutions:
an attitude estimation method, comprising:
acquiring an infrared image data set, and training a pre-constructed target detection network by using the infrared image data set to obtain a human body detection network;
obtaining a human body candidate block diagram according to the target detection network and the infrared image data set, preprocessing the human body candidate block diagram, and training a pre-constructed feature extraction network by using the preprocessed candidate block diagram to obtain a human body skeleton point extraction network; the pre-processing comprises image enhancement;
obtaining a candidate block diagram of a human body to be identified according to the infrared image to be identified and the human body detection network;
performing the preprocessing on the human body candidate block diagram to be identified, performing feature extraction of different dimensions on the preprocessed human body candidate block diagram to be identified by using the human body skeleton point extraction network, fusing the features of the different dimensions, and outputting human body skeleton point coordinates;
and mapping the human skeleton point coordinates to the infrared image to be recognized to determine the human body posture.
Preferably, the human skeletal point extraction network includes a backbone network, an up-sampling network, a down-sampling network, a feature plane output network, and a human skeletal point processing network, wherein:
the main network comprises a ResNet network, a preset number of downsampling blocks and a convolution layer connected with the downsampling blocks, wherein the downsampling blocks are used for downsampling a first feature plane output by the ResNet network, in addition, the feature dimension of the feature plane obtained by downsampling is expanded during downsampling so as to obtain a second feature plane, and the convolution layer connected with the downsampling blocks is used for changing the feature dimension of the second feature plane output by the downsampling blocks to a preset dimension so as to obtain a third feature plane;
the up-sampling network is used for up-sampling the last third feature plane to generate a current up-sampling feature plane, connecting the current up-sampling feature plane with the third feature planes with the same size, merging the current up-sampling feature plane with the third feature planes with the same size in a feature dimension, performing convolution, outputting a current merged feature plane, up-sampling the current merged feature plane to generate a new current up-sampling feature plane, and executing the steps of connecting the current up-sampling feature plane with the third feature planes with the same size and merging the current up-sampling feature plane with the feature dimension until the size of the obtained current merged feature plane is the same as the size of the first third feature plane;
the down-sampling network is used for down-sampling the last current combined feature plane to obtain a fourth feature plane, taking the last current combined feature plane as a fourth feature plane, and adjusting the size of each fourth feature plane to obtain a corresponding fifth feature plane;
and the characteristic plane output network is used for connecting the fifth characteristic planes and performing convolution to obtain a human skeleton point characteristic plane.
And the human skeleton point processing network is used for acquiring the human skeleton point coordinates from the human skeleton point feature plane.
Preferably, the human skeleton point processing network is specifically configured to perform dimension expansion on each human skeleton point feature plane, perform filtering, obtain a maximum value point from the filtered human skeleton point feature plane, determine the maximum value point as a human skeleton point, and determine coordinates of the human skeleton point.
Preferably, the training of the pre-constructed target detection network by using the infrared image dataset includes:
inputting the infrared image in the infrared image data set into a BlazeFaceNet network to obtain a preset characteristic plane; the preset characteristic plane is determined according to the distribution proportion of the human body in the infrared image;
extracting a human body detection frame with the length-width ratio within a preset range from the preset feature plane; wherein the preset range is determined according to the aspect ratio of the human body.
Preferably, obtaining a human body candidate block diagram according to the target detection network and the infrared image data set includes:
expanding the length and the width of the human body detection frame according to preset expansion coefficients respectively to obtain an expanded human body detection frame; wherein the preset expansion coefficient is greater than 1;
and extracting the human body candidate block diagram from the infrared image according to the expanded human body detection frame.
Preferably, the pretreatment further comprises:
and adjusting the size of the human body candidate block diagram to a preset size.
Preferably, the image enhancement of the human body candidate block diagram includes:
and performing at least one of pixel inversion, stretching contrast, histogram contrast stretching and random image block pixel disturbance on the human body candidate block diagram.
An attitude estimation device comprising:
the acquisition module is used for acquiring an infrared image data set, and training a pre-constructed target detection network by using the infrared image data set to obtain a human body detection network;
the training module is used for obtaining a human body candidate block diagram according to the target detection network and the infrared image data set, preprocessing the human body candidate block diagram, and training a pre-constructed feature extraction network by using the candidate block diagram obtained through preprocessing to obtain a human body skeleton point extraction network; the pre-processing comprises image enhancement;
a candidate block diagram obtaining module, configured to obtain a candidate block diagram of a human body to be identified according to the infrared image to be identified and the human body detection network;
the feature extraction module is used for preprocessing the human body candidate block diagram to be identified, extracting features of different dimensions from the preprocessed candidate block diagram by using the human body bone point extraction network, fusing the features of the different dimensions and outputting human body bone point coordinates;
and the mapping module is used for mapping the human skeleton point coordinates to the infrared image to be recognized to determine the human posture.
An attitude estimation device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the pose estimation method according to any of the above when executing said computer program.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the pose estimation method according to any of the above.
The application provides a method, a device, equipment and a computer readable storage medium for posture estimation, wherein the method comprises the following steps: acquiring an infrared image data set, and training a pre-constructed target detection network by using the infrared image data set to obtain a human body detection network; obtaining a human body candidate block diagram according to the target detection network and the infrared image data set, preprocessing the human body candidate block diagram, and training a pre-constructed feature extraction network by using the candidate block diagram obtained through preprocessing to obtain a human body skeleton point extraction network, the preprocessing comprising image enhancement; obtaining a human body candidate block diagram to be identified according to the infrared image to be identified and the human body detection network; preprocessing the human body candidate block diagram to be identified, performing feature extraction of different dimensions on the preprocessed candidate block diagram by using the human body bone point extraction network, fusing the features of the different dimensions, and outputting human body bone point coordinates; and mapping the human body bone point coordinates to the infrared image to be recognized to determine the human body posture.
The technical scheme disclosed by the application trains a pre-constructed target detection network and a pre-constructed feature extraction network by using an infrared image data set to obtain a human body detection network and a human body skeleton point extraction network correspondingly, obtains a human body candidate block diagram to be identified by using the human body detection network and the infrared image to be identified, and performs preprocessing including image enhancement on the human body candidate block diagram to be identified so as to simulate more infrared image information and make human body features more obvious, which effectively enhances the generalization capability of the network and improves its overall precision. The human body skeleton point extraction network then extracts features of different dimensions, such as high-dimensional and low-dimensional features, from the preprocessed candidate block diagram to be identified and fuses them, so that weak feature information under the infrared image can be learned and the features carry better global information, improving the estimation precision for parts such as the arms and thighs. The problems that infrared image attitude estimation has little available information and converges with difficulty are thereby solved, and the overall accuracy of infrared image attitude estimation is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of an attitude estimation method according to an embodiment of the present application;
fig. 2 is a schematic diagram of infrared images of persons in different scenes according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of human skeleton point labeling performed on an infrared image according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a human bone point extraction network for performing feature extraction and outputting coordinates of human bone points according to an embodiment of the present application;
fig. 5 is a schematic diagram of an infrared image to be recognized according to an embodiment of the present application;
fig. 6 is a schematic diagram of a head bone point feature plane extracted by using a human bone point extraction network according to an embodiment of the present application;
fig. 7 is a schematic diagram of a shoulder, elbow and wrist skeleton point feature plane extracted by using a human skeleton point extraction network according to an embodiment of the present application;
FIG. 8 is a schematic illustration of a hip, knee and ankle skeletal point feature plane provided by an embodiment of the present application;
fig. 9 is a schematic diagram of filtered feature plane convergence point information provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of human bone points obtained by performing pose estimation on FIG. 5 according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of an attitude estimation device according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an attitude estimation device according to an embodiment of the present application.
Detailed Description
Compared with a visible-light image, an infrared image can clearly reveal targets and accurately identify objects ahead in environments without strong supplementary lighting, such as at night or in rainy and foggy weather, so estimating the posture from infrared images in such environments is of great significance. However, when existing posture estimation methods designed for visible-light images (implemented with algorithms such as HRNet and RSN) are applied directly to infrared images, the infrared image offers less effective information such as contrast and detail than a visible-light image, the person whose posture is being recognized lacks good feature information, and the deep learning network is difficult to converge, so the estimation accuracy is low and the effect is poor.
Therefore, the application provides a posture estimation method, a posture estimation device, posture estimation equipment and a computer readable storage medium, which are used for improving the precision of infrared image posture estimation.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, which shows a flowchart of an attitude estimation method provided in an embodiment of the present application, an attitude estimation method provided in an embodiment of the present application may include:
s11: and acquiring an infrared image data set, and training a pre-constructed target detection network by using the infrared image data set to obtain a human body detection network.
Infrared images of people can be collected in different scenes (for example, different road junctions, different weather conditions and the like), and the images collected at the different road junctions are sorted, with people's clothing under different weather conditions collected comprehensively, so as to improve the diversity of the resulting infrared image data set. To adapt to more scenes, the collected infrared images include different forms of clothing under infrared (such as summer or winter clothing), and characteristic environment data, such as human bodies that are wet in rainy weather or carry patches of snow in snowy weather, are collected and sorted synchronously. For example, fig. 2 shows a schematic diagram of infrared images of people collected in different scenes according to an embodiment of the present application; in the left infrared image of fig. 2, wet areas appear on the human body.
After acquiring infrared images in different scenes, a plurality of human skeleton points can be used to label the infrared images. In this application, 17 human skeleton points are taken as an example: the nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle, so that the posture of the human body can be estimated from these 17 skeleton points; of course, other numbers of skeleton points may be used as long as the human body posture can still be estimated accurately from them. When the 17 human skeleton points are used to label an acquired infrared image, each labeled point can be assigned to one of the following 3 categories: 1: the skeleton point is visible; 2: the skeleton point is present but occluded; 3: the skeleton point does not appear in the infrared image. The labeled infrared image data can then be used to train the subsequent networks. For example, fig. 3 shows a schematic diagram of human skeleton point labeling performed on an infrared image according to an embodiment of the present application.
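To make the labeling convention concrete, the following is a minimal sketch, assuming a COCO-style record layout, of how one annotated infrared image could be stored; the field names, file name and coordinate values are illustrative only, while the keypoint order and the three visibility categories follow the description above.

```python
# Hypothetical annotation layout for one infrared image, assuming a COCO-style
# convention: 17 keypoints, each stored as (x, y, v) with
# v = 1 visible, v = 2 present but occluded, v = 3 not present in the image.
KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

annotation = {
    "image": "ir_000001.png",          # illustrative file name
    "keypoints": [(132, 48, 1)] * 17,  # placeholder (x, y, v) triples, one per name above
}
```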
The infrared image data set is obtained by acquiring infrared images in different scenes and labeling the human skeleton points in them. Then, a pre-constructed target detection network is trained with the acquired infrared image data set to obtain the human body detection network. The pre-constructed target detection network can be a lightweight target detection network, which reduces the amount of computation during training, speeds up training, and reduces the amount of computation of the trained human body detection network during detection, thereby increasing its detection speed. In addition, when the target detection network is trained, a softmax loss function and a cross-entropy loss function can be used as the loss functions, and training continues until the network converges.
In addition, during training on the infrared image data set, the target detection network may output human body detection frames; specifically, it may output the coordinates of the upper-left and lower-right corners of each detection frame, the class of the detection frame (for example, 0 represents a human body, 1 represents a car, etc.), and the probability of the detection frame (the larger the probability, the more likely the detection is a human body).
S12: obtaining a human body candidate block diagram according to the target detection network and the infrared image data set, preprocessing the human body candidate block diagram, and training a pre-constructed feature extraction network by using the candidate block diagram obtained through preprocessing to obtain a human body skeleton point extraction network; the pre-processing includes image enhancement.
Based on step S11, a human body candidate block diagram may be obtained according to the human body detection frames output during target detection network training and the infrared image data set; specifically, the human body candidate block diagram may be extracted from the corresponding infrared image in the infrared image data set according to the detection frames output during training. In addition, an infrared image has small contrast differences and weak color information, and differences in when people appear or what they wear show up as large temperature differences, reflected in the infrared image as large pixel differences across the human body. The extracted human body candidate block diagram is therefore subjected to preprocessing including image enhancement to obtain the candidate block diagram: the image enhancement simulates as much additional infrared image information as possible, which increases the generalization capability of the network in application and improves the accuracy of infrared image posture estimation, while the other preprocessing steps improve the quality of the human body candidate block diagram and thus also help improve the accuracy of infrared image posture estimation.
After the candidate block diagram is obtained through the preprocessing, the pre-constructed feature extraction network is trained with it to obtain the human skeleton point extraction network. The feature extraction network extracts features of different dimensions from the candidate block diagram, specifically high-dimensional and low-dimensional features, fuses the extracted features of different dimensions, and outputs human skeleton point coordinates. Through the extraction and fusion of features of different dimensions, weak feature information under the infrared image is learned, which alleviates the problems that little information is available for posture estimation under infrared images and that convergence is difficult; at the same time, the original inter-layer information interaction within the infrared image is broken up and recombined so that the features carry better global information, improving the estimation precision for parts such as the arms, knees and thighs and thus the overall posture estimation precision. In addition, when the pre-constructed feature extraction network is trained, the L2 loss function can be adopted as the loss function, and training continues until the network converges.
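As a minimal sketch of the training step described above, the following PyTorch-style loop applies an L2 (MSE) loss between the predicted skeleton-point feature planes and ground-truth heatmaps; the model, data loader, optimizer choice and hyperparameters are assumptions, not details given in this application.

```python
import torch
import torch.nn as nn

def train_keypoint_network(model, loader, epochs=10, lr=1e-3, device="cpu"):
    """Minimal training-loop sketch: L2 (MSE) loss between the predicted
    17-channel skeleton-point feature planes and ground-truth heatmaps.
    `model` and `loader` are assumed to be defined elsewhere."""
    model.to(device).train()
    criterion = nn.MSELoss()                      # the L2 loss mentioned in the text
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for crops, target_heatmaps in loader:     # preprocessed candidate crops
            crops = crops.to(device)
            target_heatmaps = target_heatmaps.to(device)
            pred = model(crops)                   # e.g. [N, 17, 64, 48] feature planes
            loss = criterion(pred, target_heatmaps)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```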
It should be noted that steps S11 and S12 are the training processes of the human body detection network and the human skeleton point extraction network, respectively. These two steps only need to be executed once to obtain the corresponding networks and do not need to be repeated every time posture estimation is performed; that is, steps S11 and S12 are carried out in advance, and once the corresponding networks are obtained through training, the subsequent processes can be executed with them.
S13: and obtaining a candidate block diagram of the human body to be identified according to the infrared image to be identified and the human body detection network.
After the human body detection network is obtained through the training in step S11 and the human skeleton point extraction network is obtained through the training in step S12, the infrared image to be recognized may be input into the human body detection network to obtain the human body detection frame to be recognized, and the human body candidate block diagram to be recognized may then be extracted from the infrared image to be recognized according to that detection frame.
S14: preprocessing the human body candidate block diagram to be recognized, performing feature extraction of different dimensions on the preprocessed candidate block diagram by using the human skeleton point extraction network, fusing the features of the different dimensions, and outputting human skeleton point coordinates.
After the human body candidate block diagram to be recognized is extracted, it is subjected to the same preprocessing as that performed on the human body candidate block diagram in step S12, so that the image enhancement simulates as much additional infrared image information as possible, increasing the generalization capability of the network in application and improving the accuracy of infrared image attitude estimation, while the other preprocessing improves the quality of the candidate block diagram and thus also helps improve that accuracy.
After the human body candidate block diagram to be recognized is preprocessed to obtain the candidate block diagram to be recognized, it is input into the human skeleton point extraction network, which performs feature extraction of different dimensions on it, specifically extraction of high-dimensional and low-dimensional features, fuses the extracted features of different dimensions, and outputs the human skeleton point coordinates after the fusion.
By extracting and fusing features of different dimensions from the candidate block diagram to be recognized, weak feature information under the infrared image can be learned, which alleviates the problems that little information is available for posture estimation under infrared images and that convergence is difficult; at the same time, the original inter-layer information interaction inside the candidate block diagram is broken up so that the features carry better global information, improving the estimation precision for parts such as the arms, knees and thighs and thus facilitating higher posture estimation precision.
S15: and mapping the coordinates of the human skeleton points to the infrared image to be recognized to determine the human posture.
After the human skeleton point extraction network is used for outputting the human skeleton point coordinates in the candidate block diagram to be identified, the human skeleton point coordinates are mapped to the corresponding infrared image to be identified so as to determine the human body posture.
The technical scheme disclosed by the application trains a pre-constructed target detection network and a pre-constructed feature extraction network by using an infrared image data set to obtain a human body detection network and a human body skeleton point extraction network correspondingly, obtains a human body candidate block diagram to be identified by using the human body detection network and the infrared image to be identified, and performs preprocessing including image enhancement on the human body candidate block diagram to be identified so as to simulate more infrared image information and make human body features more obvious, which effectively enhances the generalization capability of the network and improves its overall precision. The human body skeleton point extraction network then extracts features of different dimensions, such as high-dimensional and low-dimensional features, from the preprocessed candidate block diagram to be identified and fuses them, so that weak feature information under the infrared image can be learned and the features carry better global information, improving the estimation precision for parts such as the arms and thighs. The problems that infrared image attitude estimation has little available information and converges with difficulty are thereby solved, and the overall accuracy of infrared image attitude estimation is improved.
In the posture estimation method provided by the embodiment of the application, the human skeleton point extraction network may include a backbone network, an up-sampling network, a down-sampling network, a feature plane output network, and a human skeleton point processing network, wherein:
the main network can comprise a ResNet network, a preset number of down-sampling blocks and a convolution layer connected with the down-sampling blocks, wherein the down-sampling blocks are used for down-sampling a first characteristic plane output by the ResNet network, and expanding the characteristic dimensionality of the characteristic plane obtained by the down-sampling during the down-sampling to obtain a second characteristic plane, and the convolution layer connected with the down-sampling blocks is used for changing the characteristic dimensionality of the second characteristic plane output by the down-sampling blocks to a preset dimensionality to obtain a third characteristic plane;
the up-sampling network is used for up-sampling the last third feature plane to generate a current up-sampling feature plane, connecting the current up-sampling feature plane with the third feature planes with the same size, merging the current up-sampling feature plane with the third feature planes with the same size in a feature dimension, performing convolution, outputting a current merged feature plane, up-sampling the current merged feature plane to generate a new current up-sampling feature plane, and executing the steps of connecting the current up-sampling feature plane with the third feature planes with the same size and merging the current up-sampling feature plane with the feature dimension until the size of the obtained current merged feature plane is the same as the size of the first third feature plane;
the down-sampling network is used for down-sampling the last current combined feature plane to obtain a fourth feature plane, taking the last current combined feature plane as a fourth feature plane, and adjusting the size of each fourth feature plane to obtain a corresponding fifth feature plane;
and the characteristic plane output network is used for connecting the fifth characteristic planes and performing convolution to obtain the characteristic plane of the human skeleton point.
And the human skeleton point processing network is used for acquiring the coordinates of the human skeleton points from the human skeleton point characteristic plane.
In the present application, the structure of the feature extraction network is the same as that of the human skeleton point extraction network, and the specific structure of the human skeleton point extraction network is described herein, and the specific structure of the feature extraction network may refer to the specific description of the corresponding structure in the human skeleton point extraction network, and is not described herein again.
The human skeleton point extraction network can sequentially comprise a backbone network, an up-sampling network, a down-sampling network, a feature plane output network and a human skeleton point processing network, wherein:
1) Backbone network: it comprises a ResNet network (residual network), a preset number of downsampling blocks, and convolution layers connected to the downsampling blocks. The ResNet network is used to extract features from the infrared image (namely the candidate block diagram to be identified) and output a first feature plane; it can specifically be ResNet34, whose first half can consist of one 7 × 7 convolution followed by a max pooling layer. Each downsampling block comprises a convolution module for downsampling the first feature plane while doubling the feature dimensionality of the downsampled feature plane, so as to obtain a preset number of second feature planes; and the convolution layers connected to the respective downsampling blocks are used to change the feature dimensionality of the second feature planes output by the corresponding downsampling blocks to a preset dimensionality, so as to obtain a corresponding preset number of third feature planes. Feature extraction is performed on the infrared image through the residual network because residual connections provide better nonlinear learning, the intention being to learn the strong feature information available under the infrared image.
Specifically, reference may be made to figs. 4 to 8. Fig. 4 is a schematic diagram of the human skeleton point extraction network performing feature extraction and outputting human skeleton point coordinates according to an embodiment of the present application; fig. 5 is a schematic diagram of an infrared image to be recognized; fig. 6 is a schematic diagram of the head skeleton point feature planes extracted by the human skeleton point extraction network, where the first plane corresponds to the nose, the second to the left eye, the third to the right eye, the fourth to the left ear and the fifth to the right ear; fig. 7 is a schematic diagram of the shoulder, elbow and wrist skeleton point feature planes extracted by the human skeleton point extraction network, where the first plane corresponds to the left shoulder, the second to the right shoulder, the third to the left elbow, the fourth to the right elbow, the fifth to the left wrist and the sixth to the right wrist; and fig. 8 is a schematic diagram of the hip, knee and ankle skeleton point feature planes, where the first plane corresponds to the left hip, the second to the right hip, the third to the left knee, the fourth to the right knee, the fifth to the left ankle and the sixth to the right ankle.
Taking a candidate block diagram to be identified with an input size of [1 × 3 × 256 × 192] as an example, the first half of ResNet34 applies one 7 × 7 convolution followed by a max pooling layer and outputs a feature plane of [1 × 16 × 128 × 96]. This is followed by 4 downsampling blocks, each consisting of six 3 × 3 convolutions, the last of which has a stride of 2; each downsampling block contains a ResNet residual connection and doubles the feature dimensionality while downsampling. The 4 downsampling blocks respectively output: F1' [1 × 32 × 64 × 48], F2' [1 × 64 × 32 × 24], F3' [1 × 128 × 16 × 12] and F4' [1 × 256 × 8 × 6]. Each of these 4 feature planes is then passed through a 1 × 1 convolution, a 3 × 3 convolution and a 1 × 1 convolution that change the output feature dimensionality to 256, yielding 4 feature planes F1, F2, F3 and F4 with sizes F1: [1 × 256 × 64 × 48], F2: [1 × 256 × 32 × 24], F3: [1 × 256 × 16 × 12] and F4: [1 × 256 × 8 × 6].
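The following PyTorch sketch reproduces the backbone shapes quoted above (stem output [1 × 16 × 128 × 96], F1'–F4' with doubled channels, and 256-channel F1–F4); the exact placement of normalization, activations and residual connections is an assumption.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One downsampling block as described: six 3x3 convolutions with a residual
    connection, the last convolution using stride 2, and the channel count doubled."""
    def __init__(self, c_in, c_out):
        super().__init__()
        layers = []
        for i in range(6):
            stride = 2 if i == 5 else 1
            layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out, 3, stride, 1),
                       nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)
        self.skip = nn.Conv2d(c_in, c_out, 1, 2)   # residual path matching the output shape

    def forward(self, x):
        return self.body(x) + self.skip(x)

class Backbone(nn.Module):
    """Sketch of the backbone: a ResNet-style stem (7x7 conv + max pooling),
    four downsampling blocks, and 1x1-3x3-1x1 lateral convolutions mapping every
    scale to 256 channels, producing F1..F4. Channel/stride choices follow the
    numbers quoted in the text and are otherwise assumptions."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 16, 7, 1, 3), nn.ReLU(inplace=True),
                                  nn.MaxPool2d(2))             # -> [1, 16, 128, 96]
        chans = [16, 32, 64, 128, 256]
        self.blocks = nn.ModuleList(DownBlock(chans[i], chans[i + 1]) for i in range(4))
        self.laterals = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, 256, 1), nn.Conv2d(256, 256, 3, 1, 1),
                          nn.Conv2d(256, 256, 1))
            for c in chans[1:])

    def forward(self, x):                           # x: [1, 3, 256, 192]
        x = self.stem(x)
        feats = []
        for block, lateral in zip(self.blocks, self.laterals):
            x = block(x)                            # F1'..F4'
            feats.append(lateral(x))                # F1..F4, all with 256 channels
        return feats                                # sizes 64x48, 32x24, 16x12, 8x6
```

For a [1 × 3 × 256 × 192] input, `Backbone()(x)` returns four planes of sizes 64 × 48, 32 × 24, 16 × 12 and 8 × 6, matching F1–F4 above.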
2) Upsampling network: the last third feature plane (specifically, the third feature plane with the smallest size) is upsampled by bilinear interpolation to generate a current upsampled feature plane; the current upsampled feature plane is connected with the third feature plane of the same size, merged with it along the feature dimension, and convolved (specifically, with a 1 × 1 convolution) to output a current merged feature plane; the current merged feature plane is then upsampled by bilinear interpolation to generate a new current upsampled feature plane, and the steps of connecting the current upsampled feature plane with the third feature plane of the same size and merging along the feature dimension are repeated until the size of the resulting current merged feature plane equals the size of the first third feature plane (specifically, the third feature plane with the largest size).
Specifically, following the above example, bilinear interpolation upsampling is performed on F4: [1 × 256 × 8 × 6] to generate U3: [1 × 256 × 16 × 12]; U3 is merged with the same-size feature plane F3: [1 × 256 × 16 × 12] along the feature depth dimension (i.e., the feature dimension) and a 1 × 1 convolution is applied, outputting a merged feature plane of [1 × 256 × 16 × 12]. This merged plane is upsampled by bilinear interpolation to generate U2: [1 × 256 × 32 × 24], which is merged with F2: [1 × 256 × 32 × 24] along the feature dimension and passed through a 1 × 1 convolution, outputting a merged feature plane U1: [1 × 256 × 32 × 24]. Bilinear interpolation upsampling is then performed on U1 to generate a [1 × 256 × 64 × 48] feature plane, which is merged with the same-size feature plane F1: [1 × 256 × 64 × 48] along the feature depth dimension and convolved with a 1 × 1 convolution, finally outputting D1: [1 × 256 × 64 × 48].
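A minimal sketch of the upsampling network follows, assuming 2× bilinear upsampling and a 1 × 1 fusion convolution at each scale as described; it takes the list [F1, F2, F3, F4] produced by the backbone sketch above and returns D1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleMerge(nn.Module):
    """Sketch of the upsampling network: each 256-channel feature plane is
    bilinearly upsampled, concatenated with the same-size backbone plane along
    the channel dimension, and fused back to 256 channels with a 1x1
    convolution, ending at the 64x48 resolution of F1."""
    def __init__(self, num_scales=4, channels=256):
        super().__init__()
        self.fuse = nn.ModuleList(nn.Conv2d(2 * channels, channels, 1)
                                  for _ in range(num_scales - 1))

    def forward(self, feats):                       # feats = [F1, F2, F3, F4]
        x = feats[-1]                               # start from the smallest plane, F4
        for conv, skip in zip(self.fuse, reversed(feats[:-1])):   # F3, F2, F1
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                              align_corners=False)
            x = conv(torch.cat([x, skip], dim=1))   # merge along the channel dimension
        return x                                    # D1: [1, 256, 64, 48]
```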
The purpose of the above steps is to combine different semantic information (i.e., features) of high and low dimensions. The large number of upsampling and downsampling operations in the human skeleton point extraction network ultimately makes the learned feature points more accurate and allows a large amount of global information to be learned. The posture estimation network needs to exploit this global information when learning joint points, and points such as the hands and feet depend heavily on it, while local information is especially weak under an infrared image; extensive upsampling and downsampling are therefore an important means of posture estimation for infrared images.
3) Downsampling network: the final merged feature plane is downsampled to obtain fourth feature planes (the final merged feature plane itself is also taken as a fourth feature plane), and the sizes of the preset number of fourth feature planes are then adjusted to obtain a corresponding preset number of fifth feature planes.
Specifically, following the above example, convolution operations with a 3 × 3 kernel and a stride of 2 are applied to the D1: [1 × 256 × 64 × 48] feature plane, outputting the D2: [1 × 256 × 32 × 24], D3: [1 × 256 × 16 × 12] and D4: [1 × 256 × 8 × 6] feature planes; the sizes of D1: [1 × 256 × 64 × 48], D2: [1 × 256 × 32 × 24], D3: [1 × 256 × 16 × 12] and D4: [1 × 256 × 8 × 6] are then adjusted, outputting the O1: [1 × 256 × 64 × 48], O2: [1 × 64 × 64 × 48], O3: [1 × 16 × 64 × 48] and O4: [1 × 4 × 64 × 48] feature planes.
The candidate block diagram to be recognized thus undergoes one round of downsampling, one round of upsampling and another round of downsampling, and each round uses feature planes of a preset number of dimensionalities; this network makes it possible to learn the weak feature information under the infrared image, alleviates the problems that infrared image attitude estimation has little available information and converges with difficulty, and improves the overall accuracy of the attitude estimation. In addition, the size adjustment applied to the fourth feature planes replaces an ordinary convolution operation: it is essentially information recombination, adding and fusing partial information for skeleton points that are unclear in the infrared image, so that more favorable combined information can be obtained and the positions of unclear key points can be inferred. Compared with convolution, this reduces the amount of computation and makes the whole network lighter; at the same time, it directly breaks up and recombines the original inter-layer information interaction within the infrared image, so the features carry better global information. For points located at longer limb distances in the infrared image, such as the legs, knees or arms, the specific point information can be located from more global information over a larger range, so this structure provides more global information and improves the precision of the whole network.
4) Feature plane output network: the fifth feature planes are connected and convolved to obtain the human skeleton point feature planes.
Specifically, following the above example, the feature plane output network concatenates the four feature planes O1: [1 × 256 × 64 × 48], O2: [1 × 64 × 64 × 48], O3: [1 × 16 × 64 × 48] and O4: [1 × 4 × 64 × 48], then applies a 1 × 1 convolution, a 3 × 3 convolution and another 1 × 1 convolution to change the channels, and outputs a feature plane of [1 × 17 × 64 × 48]; this 1 × 17 × 64 × 48 feature plane is the final converged human skeleton point feature plane.
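The sketch below covers the downsampling network and the feature plane output network together: D1 is repeatedly downsampled with stride-2 3 × 3 convolutions, every plane is size-adjusted back to 64 × 48, and the concatenated result is reduced to 17 channels. The size adjustment is implemented here as a pixel-shuffle reshape, which reproduces the quoted O1–O4 shapes but is only one possible reading of the information recombination described above.

```python
import torch
import torch.nn as nn

class HeatmapHead(nn.Module):
    """Sketch of the downsampling network and feature-plane output network.
    The 'size adjustment' that trades channels for spatial resolution is
    implemented as a pixel-shuffle reshape, which yields the quoted shapes
    (256/64/16/4 channels at 64x48) but is an assumption about the exact
    recombination used."""
    def __init__(self, channels=256, num_keypoints=17):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1) for _ in range(3))
        self.shuffles = nn.ModuleList(nn.PixelShuffle(2 ** i) for i in (1, 2, 3))
        fused = channels + channels // 4 + channels // 16 + channels // 64   # 340
        self.head = nn.Sequential(nn.Conv2d(fused, channels, 1),
                                  nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.Conv2d(channels, num_keypoints, 1))

    def forward(self, d1):                          # d1: [1, 256, 64, 48]
        planes = [d1]
        x = d1
        for conv in self.down:                      # D2, D3, D4
            x = conv(x)
            planes.append(x)
        # O1 is D1 itself; O2..O4 are reshaped back to 64x48 with fewer channels
        outs = [planes[0]] + [s(p) for s, p in zip(self.shuffles, planes[1:])]
        return self.head(torch.cat(outs, dim=1))    # [1, 17, 64, 48] heatmaps
```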
5) Human skeleton point processing network: the human skeleton point coordinates are acquired from the human skeleton point feature planes.
Specifically, following the above example, the coordinates of the human bone points can be extracted from the 17 feature planes obtained above.
In the posture estimation method provided by the embodiment of the application, the human skeleton point processing network is specifically configured to perform dimension expansion on each human skeleton point feature plane, perform filtering, obtain the maximum value point from each filtered feature plane, determine that maximum value point as a human skeleton point, and determine the coordinates of the human skeleton point.
In this application, when acquiring human skeleton point coordinates from the human skeleton point feature planes, the human skeleton point processing network may first perform dimension expansion on each feature plane, specifically by padding 10 pixels around it, and then perform filtering to remove noise points, specifically 11 × 11 Gaussian filtering; a large Gaussian kernel is used mainly to remove the influence of abnormal noise points. Reference may be made to fig. 9, which shows a schematic diagram of the filtered feature plane convergence point information provided by an embodiment of the present application. The maximum value point is then searched for on each filtered human skeleton point feature plane; the maximum point on each feature plane is the corresponding human skeleton point, and for 17 skeleton points the maximum points on the 17 feature planes are the 17 human skeleton points. Finally, according to step S15, the human skeleton points on the corresponding feature planes are coordinate-mapped back to the original image, and the human skeleton point coordinate information in the original image is output. Reference may be made to fig. 10, which shows a schematic diagram of the human skeleton points obtained by performing pose estimation on fig. 5 according to an embodiment of the present application.
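A sketch of this decoding step under the stated parameters (10-pixel padding, 11 × 11 Gaussian filtering, per-plane maximum) might look as follows; the way the heatmap coordinate is mapped back through the detection box to the original image is an assumption.

```python
import cv2
import numpy as np

def decode_keypoints(heatmaps, box_xyxy):
    """Sketch of the skeleton-point processing step: each of the 17 feature
    planes is padded by 10 pixels, smoothed with an 11x11 Gaussian filter to
    suppress noisy responses, its maximum is taken as the skeleton point, and
    the coordinate is mapped back onto the original infrared image via the
    (expanded) detection box. The mapping details are assumptions."""
    x1, y1, x2, y2 = box_xyxy                      # detection box on the original image
    points = []
    for hm in heatmaps:                            # hm: a 64 x 48 feature plane
        padded = cv2.copyMakeBorder(hm, 10, 10, 10, 10, cv2.BORDER_CONSTANT, value=0)
        smooth = cv2.GaussianBlur(padded, (11, 11), 0)
        py, px = np.unravel_index(np.argmax(smooth), smooth.shape)
        hx, hy = px - 10, py - 10                  # undo the padding
        points.append((x1 + (hx / hm.shape[1]) * (x2 - x1),
                       y1 + (hy / hm.shape[0]) * (y2 - y1)))
    return points                                  # 17 (x, y) coordinates on the original image
```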
The posture estimation method provided by the embodiment of the application, which trains a pre-constructed target detection network by using an infrared image data set, may include:
inputting the infrared image in the infrared image data set into a BlazeFaceNet network to obtain a preset characteristic plane;
extracting a human body detection frame with the length-width ratio within a preset range from a preset feature plane; wherein the preset range is determined according to the aspect ratio of the human body.
When the infrared image data set is used to train the pre-constructed target detection network, the infrared images in the data set can be input into a BlazeFaceNet network. This network runs at sub-millisecond speed and is suitable for practical industrial applications; BlazeFaceNet is a publicly disclosed design, and it can be pruned and improved according to the task requirements of this application. During training, an infrared image from the data set is input, convolution operations are performed with convolution block structures, downsampling is performed after every two convolution blocks with the spatial dimensions compressed and the network depth doubled each time, a preset number of downsampling operations are accumulated, and a feature plane is finally output. Taking a 300 × 300 × 3 image as an example, convolution operations are performed with the convolution block structure, downsampling is performed after every two convolution blocks with the spatial size compressed and the network depth doubled each time, 6 downsampling operations are accumulated, and a feature plane of size 5 × 5 is finally output. Two feature planes of 10 × 10 and 5 × 5 are ultimately adopted; following the SSD idea, bounding boxes are generated at each point on each feature plane, the optimal bounding boxes are screened out, and the final pedestrian coordinate information is output.
Specifically, when the infrared images in the infrared image data set are input into the BlazeFaceNet network, the original feature planes of 19 × 19, 10 × 10, 5 × 5, 3 × 3, 2 × 2 and 1 × 1 are considered; since the size of the human body is mainly distributed at about 10%-20% of the infrared image, only the two feature planes of 10 × 10 and 5 × 5 are used in this application, that is, the preset feature planes of 10 × 10 and 5 × 5 are determined according to the distribution proportion of the human body size in the infrared images. Of course, if the distribution proportion of the human body size in the infrared image changes, the selection of the preset feature planes changes accordingly, which improves the accuracy of the feature planes.
After the preset feature planes are obtained, human body detection frames are extracted from them. Since this application mainly performs posture estimation on human bodies, the detection frames whose aspect ratio lies within the preset range can be extracted from the preset feature planes, the preset range being determined according to the aspect ratio of the human body. Specifically, the original scheme constructs 6 detection frames at each point on each feature plane: frames with aspect ratios of 1/2, 2, 1/3 and 3, plus two frames with an aspect ratio of 1. Considering that the aspect ratio of a human body is generally about 2/3, only the 1/2 and 1/3 frames and the two frames with an aspect ratio of 1 are adopted, which better fits human proportions, facilitates convergence of the human body detection network, reduces the pressure on the feature regression branch, and produces detection frames closer to real proportions, so the precision test results display better.
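The following sketch generates SSD-style prior boxes restricted to the four human-like shapes described above (aspect ratios 1/2, 1/3 and two of ratio 1) on the 10 × 10 and 5 × 5 feature planes; the scale values and the normalized box encoding are assumptions.

```python
import itertools
import math

def human_prior_boxes(feature_sizes=(10, 5), scales=(0.15, 0.3)):
    """Sketch of SSD-style prior boxes restricted to human-like shapes: at
    every cell of the 10x10 and 5x5 feature planes, four boxes are kept --
    aspect ratios 1/2 and 1/3 plus two square (ratio 1) boxes of different
    scales. The scale values themselves are assumptions."""
    boxes = []
    for fsize, scale in zip(feature_sizes, scales):
        extra = math.sqrt(scale * min(scale * 2, 1.0))      # second ratio-1 scale
        for i, j in itertools.product(range(fsize), repeat=2):
            cx, cy = (j + 0.5) / fsize, (i + 0.5) / fsize   # normalised centre
            for w, h in [(scale, scale), (extra, extra),                    # ratio 1 (x2)
                         (scale * math.sqrt(0.5), scale / math.sqrt(0.5)),  # ratio 1/2
                         (scale * math.sqrt(1 / 3), scale / math.sqrt(1 / 3))]:  # ratio 1/3
                boxes.append((cx, cy, w, h))
    return boxes   # 4 boxes per cell: (10*10 + 5*5) * 4 = 500 priors
```

With four boxes per cell instead of six, the regression branch has fewer, better-matched priors to fit, which is the convergence benefit described above.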
According to the posture estimation method provided by the embodiment of the application, the human body candidate block diagram is obtained according to the target detection network and the infrared image data set, and the method can comprise the following steps:
expanding the length and the width of the human body detection frame according to preset expansion coefficients respectively to obtain an expanded human body detection frame; wherein the preset expansion coefficient is larger than 1;
and extracting a human body candidate block diagram from the infrared image according to the expanded human body detection frame.
After the human body detection frame is obtained with the target detection network, its length and width are expanded according to the preset expansion coefficients, that is, the length and width of the detection frame are respectively multiplied by the preset expansion coefficients to obtain the expanded human body detection frame, and the human body candidate block diagram is then extracted from the infrared image according to the expanded detection frame.
The preset expansion coefficient is greater than 1, and may specifically be 1.2, and of course, the preset expansion coefficient may also be adjusted according to the size of the human body in the infrared image, so that the entire human body may be included in the human body candidate block diagram during extraction through expansion, and thus the accuracy of posture estimation is improved.
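A minimal sketch of this expansion-and-crop step, assuming the expansion is applied about the box center and the result is clipped to the image bounds:

```python
def crop_candidate(image, box_xyxy, expand=1.2):
    """Sketch: enlarge the detected human box by the preset expansion
    coefficient (1.2 here) about its centre, clip it to the image, and crop
    the human body candidate region from the infrared image (a NumPy array)."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box_xyxy
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    bw, bh = (x2 - x1) * expand, (y2 - y1) * expand   # length and width multiplied by 1.2
    nx1 = int(max(0, cx - bw / 2)); ny1 = int(max(0, cy - bh / 2))
    nx2 = int(min(w, cx + bw / 2)); ny2 = int(min(h, cy + bh / 2))
    return image[ny1:ny2, nx1:nx2]
```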
In the attitude estimation method provided in the embodiment of the present application, the preprocessing may further include:
and adjusting the size of the human body candidate block diagram to a preset size.
In the present application, the preprocessing of the extracted human body candidate block diagram to be recognized includes not only image enhancement but also size adjustment: the block diagram is first resized to a preset size, and the resized block diagram is then subjected to image enhancement. The preset size is determined according to the human body skeleton point extraction network; specifically, in this application the preset size is 256 × 192, so that the human body skeleton point extraction network can perform downsampling smoothly, which improves the accuracy of skeleton point extraction and thus the accuracy of posture estimation. A minimal resizing sketch is given below.
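The sketch uses OpenCV; the interpolation mode is an assumption, while the 256 × 192 target follows the description above:

```python
import cv2
import numpy as np

PRESET_SIZE = (192, 256)   # (width, height) expected by the skeleton point extraction network

def resize_candidate(candidate_block, preset_size=PRESET_SIZE):
    """Resize the human body candidate block diagram to the preset input size
    before image enhancement is applied."""
    return cv2.resize(candidate_block, preset_size, interpolation=cv2.INTER_LINEAR)

resized = resize_candidate(np.zeros((240, 96), dtype=np.uint16))
print(resized.shape)  # (256, 192)
```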
In the posture estimation method provided by the embodiment of the application, performing image enhancement on the human body candidate block diagram output by the target detection network may include:
and performing at least one of pixel inversion, contrast stretching, histogram contrast stretching and random image block pixel disturbance on the human body candidate block diagram.
Considering that people in the infrared image may appear at different times or wear different clothing, the resulting differences in body temperature are large and the pixel values of pedestrians vary considerably in the image. For example, in winter, when a pedestrian walks from outdoors to indoors, the clothing temperature on the surface of the human body is lower than the indoor background temperature; but in common scenes such as summer or at night, the human body temperature is generally higher than the background temperature, and the pedestrian is highlighted in the infrared thermal image. The image enhancement operations above simulate these appearance variations during training; a minimal sketch of them is given below.
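A minimal NumPy sketch of these enhancement operations on a single-channel 8-bit candidate block; the probabilities, percentiles and noise strength are illustrative assumptions, not values from the patent:

```python
import numpy as np

def enhance(block, rng=np.random.default_rng()):
    """Randomly apply pixel inversion, contrast stretching, histogram contrast
    stretching and random image block pixel disturbance to a candidate block."""
    img = block.astype(np.float32)

    if rng.random() < 0.5:                      # pixel inversion: hot/cold swap
        img = 255.0 - img

    if rng.random() < 0.5:                      # linear contrast stretching
        lo, hi = img.min(), img.max()
        if hi > lo:
            img = (img - lo) / (hi - lo) * 255.0

    if rng.random() < 0.5:                      # histogram (percentile) contrast stretching
        p2, p98 = np.percentile(img, (2, 98))
        if p98 > p2:
            img = np.clip((img - p2) / (p98 - p2) * 255.0, 0, 255)

    if rng.random() < 0.5:                      # random image block pixel disturbance
        h, w = img.shape[:2]
        ph, pw = max(1, h // 8), max(1, w // 8)
        y, x = rng.integers(0, h - ph + 1), rng.integers(0, w - pw + 1)
        img[y:y + ph, x:x + pw] += rng.normal(0.0, 10.0, size=(ph, pw))

    return np.clip(img, 0, 255).astype(np.uint8)
```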
An embodiment of the present application further provides an attitude estimation device, see fig. 11, which shows a schematic structural diagram of an attitude estimation device provided in an embodiment of the present application, and the attitude estimation device may include:
the acquisition module 111 is configured to acquire an infrared image dataset, train a pre-constructed target detection network by using the infrared image dataset, and obtain a human body detection network;
the training module 112 is used for obtaining a human body candidate block diagram according to the target detection network and the infrared image data set, preprocessing the human body candidate block diagram, and training a pre-constructed feature extraction network by using the candidate block diagram obtained through preprocessing to obtain a human body skeleton point extraction network; the preprocessing comprises image enhancement;
a candidate block diagram obtaining module 113, configured to obtain a candidate block diagram of a human body to be identified according to the infrared image to be identified and the human body detection network;
the feature extraction module 114 is configured to preprocess the human body candidate block diagram to be identified, perform feature extraction at different dimensions on the preprocessed candidate block diagram by using the human body skeleton point extraction network, fuse the features at the different dimensions, and output the coordinates of the human body skeleton points;
and the mapping module 115 is used for mapping the coordinates of the human skeleton points to the infrared image to be identified to determine the human posture.
In the posture estimation device provided by the embodiment of the application, the human skeleton point extraction network may comprise a backbone network, an up-sampling network, a down-sampling network, a feature plane output network and a human skeleton point processing network (a minimal sketch of the up-sampling fusion is given after the following description), wherein:
the backbone network may comprise a ResNet network, a preset number of down-sampling blocks and a convolution layer connected with the down-sampling blocks, wherein the down-sampling blocks are used for down-sampling the first feature plane output by the ResNet network and expanding the feature dimension of the down-sampled feature plane to obtain a second feature plane, and the convolution layer connected with the down-sampling blocks is used for changing the feature dimension of the second feature plane output by the down-sampling blocks to a preset dimension to obtain a third feature plane;
the up-sampling network is used for up-sampling the last third feature plane to generate a current up-sampling feature plane, connecting the current up-sampling feature plane with the third feature planes with the same size, merging the current up-sampling feature plane with the third feature planes with the same size in a feature dimension, performing convolution, outputting a current merged feature plane, up-sampling the current merged feature plane to generate a new current up-sampling feature plane, and executing the steps of connecting the current up-sampling feature plane with the third feature planes with the same size and merging the current up-sampling feature plane with the feature dimension until the size of the obtained current merged feature plane is the same as the size of the first third feature plane;
the down-sampling network is used for down-sampling the last current combined feature plane to obtain a fourth feature plane, taking the last current combined feature plane as a fourth feature plane, and adjusting the size of each fourth feature plane to obtain a corresponding fifth feature plane;
and the characteristic plane output network is used for connecting the fifth characteristic planes and performing convolution to obtain the characteristic plane of the human skeleton point.
And the human skeleton point processing network is used for acquiring the coordinates of the human skeleton points from the human skeleton point characteristic plane.
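As an illustrative PyTorch sketch of the up-sampling fusion referred to above (hypothetical channel counts and plane sizes; not the patent's exact network), each third feature plane is fused with the up-sampled coarser plane by concatenation along the feature dimension followed by a convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleFusion(nn.Module):
    """Minimal sketch of the up-sampling network: starting from the last (smallest)
    third feature plane, repeatedly up-sample, concatenate with the same-size
    third feature plane on the feature dimension, and convolve."""

    def __init__(self, channels=256, num_planes=4):
        super().__init__()
        # after concatenation the channel count doubles, so each fusion conv
        # maps 2*channels back to channels
        self.fuse = nn.ModuleList([
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
            for _ in range(num_planes - 1)
        ])

    def forward(self, third_planes):
        # third_planes: list ordered from the largest (first) to the smallest (last)
        merged = third_planes[-1]
        for conv, skip in zip(self.fuse, reversed(third_planes[:-1])):
            up = F.interpolate(merged, size=skip.shape[-2:], mode="nearest")
            merged = conv(torch.cat([up, skip], dim=1))   # merge on the feature dimension
        return merged                                     # same size as the first third plane

# Example with four third feature planes of 64x48, 32x24, 16x12 and 8x6 cells.
planes = [torch.randn(1, 256, 64 // 2 ** i, 48 // 2 ** i) for i in range(4)]
print(UpsampleFusion()(planes).shape)  # torch.Size([1, 256, 64, 48])
```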
In the posture estimation device provided by the embodiment of the application, the human skeleton point processing network is specifically used for performing dimension expansion on each human skeleton point feature plane, filtering it, obtaining the maximum value point from the filtered human skeleton point feature plane, determining the maximum value point as a human skeleton point, and determining the coordinates of the human skeleton point; a minimal decoding sketch is given below.
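The sketch below illustrates this decoding step; the Gaussian filter and its sigma stand in for the unspecified filtering, and the batch-dimension expansion is omitted:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def decode_keypoints(heatmaps, sigma=1.0):
    """Smooth each human skeleton point feature plane and take its maximum
    value point as the skeleton point location."""
    coords = []
    for plane in heatmaps:                       # heatmaps: array of shape (K, H, W)
        smoothed = gaussian_filter(plane, sigma=sigma)
        y, x = np.unravel_index(np.argmax(smoothed), smoothed.shape)
        coords.append((x, y, float(smoothed[y, x])))   # x, y and a confidence value
    return coords

# Example: 17 skeleton point feature planes of size 64 x 48.
kps = decode_keypoints(np.random.rand(17, 64, 48).astype(np.float32))
print(len(kps), kps[0])
```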
In an attitude estimation apparatus provided in an embodiment of the present application, the obtaining module 111 may include:
the acquisition unit is used for inputting the infrared images in the infrared image data set into a BlazeFaceNet network and acquiring a preset characteristic plane;
the first extraction unit is used for extracting a human body detection frame with the length-width ratio within a preset range from a preset feature plane; wherein the preset range is determined according to the aspect ratio of the human body.
In an attitude estimation apparatus provided in an embodiment of the present application, the training module 112 may include:
the expansion unit is used for expanding the length and the width of the human body detection frame according to preset expansion coefficients respectively to obtain an expanded human body detection frame; wherein the preset expansion coefficient is larger than 1;
the second extraction unit is used for extracting a human body candidate block diagram from the infrared image according to the expanded human body detection frame.
In an attitude estimation apparatus provided in an embodiment of the present application, the training module 112 may further include:
and the adjusting unit is used for adjusting the size of the human body candidate block diagram to a preset size.
In an attitude estimation apparatus provided in an embodiment of the present application, the training module 112 may include:
and the processing unit is used for performing at least one of pixel inversion, contrast stretching, histogram contrast stretching and random image block pixel disturbance on the human body candidate block diagram.
An embodiment of the present application further provides an attitude estimation device, see fig. 12, which shows a schematic structural diagram of an attitude estimation device provided in an embodiment of the present application, and the attitude estimation device may include:
a memory 121 for storing a computer program;
the processor 122, when executing the computer program stored in the memory 121, may implement the following steps:
acquiring an infrared image data set, and training a pre-constructed target detection network by using the infrared image data set to obtain a human body detection network; obtaining a human body candidate block diagram according to the target detection network and the infrared image data set, preprocessing the human body candidate block diagram, and training a pre-constructed feature extraction network by using the preprocessed candidate block diagram to obtain a human body skeleton point extraction network, the preprocessing comprising image enhancement; obtaining a human body candidate block diagram to be identified according to the infrared image to be identified and the human body detection network; preprocessing the human body candidate block diagram to be identified, performing feature extraction at different dimensions on the preprocessed candidate block diagram by using the human body skeleton point extraction network, fusing the features at the different dimensions, and outputting human body skeleton point coordinates; and mapping the human body skeleton point coordinates to the infrared image to be identified to determine the human body posture.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, can implement the following steps:
acquiring an infrared image data set, and training a pre-constructed target detection network by using the infrared image data set to obtain a human body detection network; obtaining a human body candidate block diagram according to the target detection network and the infrared image data set, preprocessing the human body candidate block diagram, and training a pre-constructed feature extraction network by using the preprocessed candidate block diagram to obtain a human body skeleton point extraction network, the preprocessing comprising image enhancement; obtaining a human body candidate block diagram to be identified according to the infrared image to be identified and the human body detection network; preprocessing the human body candidate block diagram to be identified, performing feature extraction at different dimensions on the preprocessed candidate block diagram by using the human body skeleton point extraction network, fusing the features at the different dimensions, and outputting human body skeleton point coordinates; and mapping the human body skeleton point coordinates to the infrared image to be identified to determine the human body posture.
The computer-readable storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
For a description of a relevant part in an attitude estimation device, an apparatus, and a computer-readable storage medium provided in the embodiments of the present application, reference may be made to a detailed description of a corresponding part in an attitude estimation method provided in the embodiments of the present application, and details are not repeated here.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. In addition, parts of the above technical solutions provided in the embodiments of the present application that are consistent with the implementation principles of corresponding technical solutions in the prior art are not described in detail, so as to avoid redundant description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An attitude estimation method, comprising:
acquiring an infrared image data set, and training a pre-constructed target detection network by using the infrared image data set to obtain a human body detection network;
obtaining a human body candidate block diagram according to the target detection network and the infrared image data set, preprocessing the human body candidate block diagram, and training a pre-constructed feature extraction network by using the preprocessed candidate block diagram to obtain a human body skeleton point extraction network; the pre-processing comprises image enhancement;
obtaining a candidate block diagram of a human body to be identified according to the infrared image to be identified and the human body detection network;
the human body candidate block diagram to be identified is subjected to the preprocessing, the human body skeleton point extraction network is utilized to perform feature extraction at different dimensions on the preprocessed human body candidate block diagram to be identified, the features at the different dimensions are fused, and the coordinates of human body skeleton points are output;
and mapping the human skeleton point coordinates to the infrared image to be recognized to determine the human body posture.
2. The pose estimation method according to claim 1, wherein the human skeletal point extraction network comprises a backbone network, an up-sampling network, a down-sampling network, a feature plane output network, and a human skeletal point processing network, wherein:
the backbone network comprises a ResNet network, a preset number of downsampling blocks and a convolution layer connected with the downsampling blocks, wherein the downsampling blocks are used for downsampling a first feature plane output by the ResNet network, in addition, the feature dimension of the feature plane obtained by downsampling is expanded during downsampling so as to obtain a second feature plane, and the convolution layer connected with the downsampling blocks is used for changing the feature dimension of the second feature plane output by the downsampling blocks to a preset dimension so as to obtain a third feature plane;
the up-sampling network is used for up-sampling the last third feature plane to generate a current up-sampling feature plane, connecting the current up-sampling feature plane with the third feature planes with the same size, merging the current up-sampling feature plane with the third feature planes with the same size in a feature dimension, performing convolution, outputting a current merged feature plane, up-sampling the current merged feature plane to generate a new current up-sampling feature plane, and executing the steps of connecting the current up-sampling feature plane with the third feature planes with the same size and merging the current up-sampling feature plane with the feature dimension until the size of the obtained current merged feature plane is the same as the size of the first third feature plane;
the down-sampling network is used for down-sampling the last current combined feature plane to obtain a fourth feature plane, taking the last current combined feature plane as a fourth feature plane, and adjusting the size of each fourth feature plane to obtain a corresponding fifth feature plane;
the characteristic plane output network is used for connecting the fifth characteristic planes and performing convolution to obtain a human skeleton point characteristic plane;
and the human skeleton point processing network is used for acquiring the human skeleton point coordinates from the human skeleton point feature plane.
3. The pose estimation method according to claim 2, wherein the human skeleton point processing network is specifically configured to perform dimension expansion on each human skeleton point feature plane, perform filtering, obtain a maximum point from the filtered human skeleton point feature plane, determine the maximum point as a human skeleton point, and determine the human skeleton point coordinates.
4. The pose estimation method of claim 1, wherein training a pre-constructed target detection network with the infrared image dataset comprises:
inputting the infrared image in the infrared image data set into a BlazeFaceNet network to obtain a preset characteristic plane; the preset characteristic plane is determined according to the distribution proportion of the human body in the infrared image;
extracting a human body detection frame with the length-width ratio within a preset range from the preset feature plane; wherein the preset range is determined according to the aspect ratio of the human body.
5. The pose estimation method of claim 4, wherein obtaining a human body candidate block diagram from the target detection network and the infrared image dataset comprises:
expanding the length and the width of the human body detection frame according to preset expansion coefficients respectively to obtain an expanded human body detection frame; wherein the preset expansion coefficient is greater than 1;
and extracting the human body candidate block diagram from the infrared image according to the expanded human body detection frame.
6. The pose estimation method of claim 5, wherein the preprocessing further comprises:
and adjusting the size of the human body candidate block diagram to a preset size.
7. The pose estimation method according to claim 1, wherein image enhancing the human body candidate block diagram comprises:
and performing at least one of pixel inversion, contrast stretching, histogram contrast stretching and random image block pixel disturbance on the human body candidate block diagram.
8. An attitude estimation device, characterized by comprising:
the acquisition module is used for acquiring an infrared image data set, and training a pre-constructed target detection network by using the infrared image data set to obtain a human body detection network;
the training module is used for obtaining a human body candidate block diagram according to the target detection network and the infrared image data set, preprocessing the human body candidate block diagram, and training a pre-constructed feature extraction network by using the candidate block diagram obtained through preprocessing to obtain a human body skeleton point extraction network; the pre-processing comprises image enhancement;
a candidate block diagram obtaining module, configured to obtain a candidate block diagram of a human body to be identified according to the infrared image to be identified and the human body detection network;
the feature extraction module is used for preprocessing the human body candidate block diagram to be identified, extracting features at different dimensions from the preprocessed human body candidate block diagram by using the human body skeleton point extraction network, fusing the features at the different dimensions and outputting human body skeleton point coordinates;
and the mapping module is used for mapping the human skeleton point coordinates to the infrared image to be recognized to determine the human posture.
9. An attitude estimation device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the pose estimation method according to any of claims 1 to 7 when executing said computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps of the pose estimation method according to any of the claims 1 to 7.
CN202110901833.5A 2021-08-06 2021-08-06 Attitude estimation method, device and equipment and computer readable storage medium Pending CN113609993A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110901833.5A CN113609993A (en) 2021-08-06 2021-08-06 Attitude estimation method, device and equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110901833.5A CN113609993A (en) 2021-08-06 2021-08-06 Attitude estimation method, device and equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113609993A true CN113609993A (en) 2021-11-05

Family

ID=78339676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110901833.5A Pending CN113609993A (en) 2021-08-06 2021-08-06 Attitude estimation method, device and equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113609993A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017133009A1 (en) * 2016-02-04 2017-08-10 广州新节奏智能科技有限公司 Method for positioning human joint using depth image of convolutional neural network
WO2018120964A1 (en) * 2016-12-30 2018-07-05 山东大学 Posture correction method based on depth information and skeleton information
CN108830150A (en) * 2018-05-07 2018-11-16 山东师范大学 One kind being based on 3 D human body Attitude estimation method and device
CN108647639A (en) * 2018-05-10 2018-10-12 电子科技大学 Real-time body's skeletal joint point detecting method
CN110363140A (en) * 2019-07-15 2019-10-22 成都理工大学 A kind of human action real-time identification method based on infrared image
WO2021051547A1 (en) * 2019-09-16 2021-03-25 平安科技(深圳)有限公司 Violent behavior detection method and system
CN110728209A (en) * 2019-09-24 2020-01-24 腾讯科技(深圳)有限公司 Gesture recognition method and device, electronic equipment and storage medium
CN111507184A (en) * 2020-03-11 2020-08-07 杭州电子科技大学 Human body posture detection method based on parallel cavity convolution and body structure constraint
CN113052027A (en) * 2021-03-12 2021-06-29 上海祐云信息技术有限公司 Human body state judgment method based on skeleton point detection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
丁圣勇; 樊勇兵; 陈楠: "Human skeleton point detection based on the UNet structure" (基于UNet结构的人体骨骼点检测), 广东通信技术 (Guangdong Communication Technology), no. 11, 15 November 2018 (2018-11-15) *
段俊臣; 梁美祥; 王瑞: "Human posture recognition based on human skeleton point detection and a multilayer perceptron" (基于人体骨骼点检测与多层感知机的人体姿态识别), 电子测量技术 (Electronic Measurement Technology), no. 12, 23 June 2020 (2020-06-23) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230377192A1 (en) * 2022-05-23 2023-11-23 Dell Products, L.P. System and method for detecting postures of a user of an information handling system (ihs) during extreme lighting conditions
US11836825B1 (en) * 2022-05-23 2023-12-05 Dell Products L.P. System and method for detecting postures of a user of an information handling system (IHS) during extreme lighting conditions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination