CN110135304A - Human body pose recognition method and device - Google Patents

Human body pose recognition method and device

Info

Publication number
CN110135304A
CN110135304A
Authority
CN
China
Prior art keywords
human body
human
pose
recognition result
coordinate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910363750.8A
Other languages
Chinese (zh)
Inventor
朱佳刚
黄冠
徐亮
Current Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN201910363750.8A
Publication of CN110135304A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands

Abstract

Disclosed are a human body pose recognition method and device, comprising: cropping a human figure region containing a human body from a video to be recognized; performing pose recognition on the human body in the human figure region to obtain a first pose recognition result and human skeleton data corresponding to the human body; recognizing the pose of the human body based on the human skeleton data to obtain a second pose recognition result; and determining the pose of the human body based on the first pose recognition result and the second pose recognition result. Because the process of determining the human body pose considers not only the first pose recognition result obtained from the human figure region but also, on that basis, the second pose recognition result obtained from the human skeleton data, and determines the pose jointly from both results, the application can reduce human skeleton noise without overfitting to the scene and other objects in the video, effectively improving the accuracy of human body pose recognition.

Description

Human body pose recognition method and device
Technical field
The application relates to the technical field of image recognition, and in particular to a human body pose recognition method and device.
Background technique
Human body pose recognition has become a significant research hotspot in fields such as computer vision, pattern recognition, and artificial intelligence, with a wide range of applications, including human-computer interaction fields such as virtual reality, biomechanics, gaming, and healthcare.
In the prior art, there are the following two methods for recognizing human body pose in video:
In the first recognition method, a depth sensor is used to estimate the human skeleton in the video so as to determine the human body pose. The skeleton obtained in this way contains considerable noise, so the recognized human body pose is inaccurate; moreover, because of the environmental limitations of depth sensors, the method is only applicable indoors.
In the second recognition method, images are randomly cropped from the video and pose recognition is performed on them. This method easily causes overfitting to the scene and objects in the video, which likewise leads to an inaccurate recognized human body pose.
In summary, whether the first or the second recognition method above is adopted, there is the technical problem that the human body pose recognition result is inaccurate.
Summary of the invention
To solve the above technical problem, the present application is proposed. Embodiments of the application provide a human body pose recognition method and device.
According to a first aspect of the application, a human body pose recognition method is provided, comprising:
cropping a human figure region containing a human body from a video to be recognized;
performing pose recognition on the human body in the human figure region to obtain a first pose recognition result and human skeleton data corresponding to the human body;
recognizing the pose of the human body based on the human skeleton data to obtain a second pose recognition result; and
determining the pose of the human body based on the first pose recognition result and the second pose recognition result.
According to a second aspect of the application, a human body pose recognition device is provided, comprising:
a cropping module for cropping a human figure region containing a human body from a video to be recognized;
a first output module for performing pose recognition on the human body in the human figure region to obtain a first pose recognition result and human skeleton data corresponding to the human body;
a second output module for recognizing the pose of the human body based on the human skeleton data to obtain a second pose recognition result; and
a determining module for determining the pose of the human body based on the first pose recognition result and the second pose recognition result.
According to a third aspect of the application, a computer-readable storage medium is provided. The storage medium stores a computer program, and the computer program is used to execute the human body pose recognition method of the first aspect above.
According to a fourth aspect of the application, an electronic device is provided, the electronic device comprising:
a processor; and
a memory for storing instructions executable by the processor;
the processor being configured to execute the human body pose recognition method of the first aspect above.
According to the human body pose recognition method and device of the application, a human figure region containing a human body is cropped from a video to be recognized, and pose recognition is performed on the human body in the human figure region to obtain a first pose recognition result and human skeleton data corresponding to the human body. The pose of the human body is then recognized based on the human skeleton data to obtain a second pose recognition result. Finally, the pose of the human body is determined by combining the first pose recognition result and the second pose recognition result. Because the process of determining the human body pose considers not only the first pose recognition result obtained from the human figure region but also, on that basis, the second pose recognition result obtained from the human skeleton data, and determines the pose jointly from both results, it can reduce human skeleton noise without overfitting to the scene and other objects in the video, effectively improving the accuracy of human body pose recognition.
Detailed description of the invention
The above and other objects, features, and advantages of the application will become more apparent from the more detailed description of the embodiments of the application in conjunction with the drawings. The drawings are provided for a further understanding of the embodiments of the application, constitute a part of the specification, serve to explain the application together with its embodiments, and do not limit the application. In the drawings, identical reference labels generally denote identical components or steps.
Fig. 1 is a system block diagram corresponding to the human body pose recognition method of the application.
Fig. 2 is a neural network structure diagram corresponding to the human body pose recognition method of the application.
Fig. 3 is a schematic flow chart of the human body pose recognition method provided by an embodiment of the application.
Fig. 4 is a schematic flow chart of the process in which the application uses the I3D model to perform pose recognition on the human body in the human figure region and output human skeleton data corresponding to the human body.
Fig. 5 is a schematic flow chart of the process in which the application uses a 3D convolutional neural network to obtain human skeleton coordinates corresponding to a human feature map.
Fig. 6 is a schematic flow chart of the process in which the application determines the coordinates of a human body key point in a human feature map according to a heat map.
Fig. 7 is a schematic flow chart of the process in which the application obtains a bias vector.
Fig. 8 is a schematic flow chart of the process in which the application uses a 2D convolutional neural network to recognize the pose of the human body based on the human skeleton data and obtain a second pose recognition result.
Fig. 9 is a structural schematic diagram of the human body pose recognition device of one embodiment of the application.
Fig. 10 is a structural schematic diagram of the second output module 903 of the application.
Fig. 11 is a structural schematic diagram of the first output module 902 of the application.
Fig. 12 is a structural schematic diagram of the human skeleton coordinate acquisition submodule 9022 of the application.
Fig. 13 is a structural schematic diagram of the human body key point determination unit 90222 of the application.
Fig. 14 is a structural schematic diagram of the human body pose recognition device of another embodiment of the application.
Fig. 15 is a structural schematic diagram of the pose recognition submodule 9031 of the application.
Fig. 16 is a structural schematic diagram of the human body pose recognition device of yet another embodiment of the application.
Fig. 17 is a structural schematic diagram of the determining module 904 of the application.
Fig. 18 is a structure diagram of the electronic device provided by an exemplary embodiment of the application.
In the figures: 11 is the human detection box, 12 is the I3D model, 13 is a deconvolution layer, 14 is the 2D convolutional neural network, 21 is a video image, 22 is the first pose recognition result, 23 is a human feature map, 24 is the first intermediate data, 25 is the second intermediate data, 26 is the third intermediate data, 27 is the fourth intermediate data, 28 is the fifth intermediate data, and 29 is the second pose recognition result.
Specific embodiment
Hereinafter, example embodiments according to the application will be described in detail with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the application rather than all of them; it should be understood that the application is not limited by the example embodiments described herein.
Application overview
As described above, the prior art has the technical problem that the recognition result is inaccurate when human body pose is recognized in video.
In view of the above technical problem, the human body pose recognition method and device provided by the application first crop a human figure region containing a human body from the video to be recognized. Next, pose recognition is performed on the human body in the human figure region to obtain a first pose recognition result and human skeleton data corresponding to the human body. The pose of the human body is then recognized based on the human skeleton data to obtain a second pose recognition result. Finally, the pose of the human body is determined based on the first pose recognition result and the second pose recognition result.
In this way, because the process of determining the human body pose considers not only the first pose recognition result obtained from the human figure region but also, on that basis, the second pose recognition result obtained from the human skeleton data, and determines the pose jointly from both results, it can reduce human skeleton noise without overfitting to the scene and other objects in the video, effectively improving the accuracy of human body pose recognition.
Having described the basic principle of the application, various non-limiting embodiments of the application will be specifically introduced below with reference to the drawings.
Exemplary system
Fig. 1 shows the system block diagram of the human body pose recognition method of the application. According to the block diagram: first, the video to be recognized is cropped to obtain a human figure region, which is the region inside the human detection box 11. Then, pose recognition is performed on the human body using the I3D model 12, which specifically includes: RGB-based action recognition, human pose estimation, and pose-based (skeleton-based) action recognition. Among these, human pose estimation is performed using the deconvolution layers 13, and skeleton-based classification is performed using the 2D convolutional neural network 14. After the first pose recognition result and the second pose recognition result are obtained through the above pose recognition process, the two results are fused, and the pose of the human body is finally determined.
Fig. 2 shows the neural network structure diagram corresponding to each pose recognition stage of the human body pose recognition method of the application. According to the diagram: after the human figure region is cropped out, suppose the data structure of the video image 21 contained in the human figure region is 8 × 7 × 7 × 2048, meaning: the time dimension is 8, the height dimension is 7, the width dimension is 7, and the feature dimension is 2048. In processing the video image 21, first, the video image 21 is input into the I3D model 12 for pose recognition to obtain the first pose recognition result 22 and the human skeleton data, where the feature dimension of the first pose recognition result 22 is 2048. The process of obtaining the human skeleton data comprises the following steps. First, 8 human feature maps 23 are split out, each with the data structure 7 × 7 × 2048 (height 7, width 7, features 2048). Then each human feature map 23 is upsampled twice by deconvolution.
The first deconvolution yields the first intermediate data 24, whose data structure is 14 × 14 × 256 (height 14, width 14, features 256). The second deconvolution yields the second intermediate data 25, whose data structure is 28 × 28 × 256 (height 28, width 28, features 256). Then, the third intermediate data 26, containing the heat map and offset data corresponding to each human body key point, is obtained from the second intermediate data; its data structure is 28 × 28 × (3 × 17) (height 28, width 28; the heat map and offset data together occupy 3 channels, and the number of human body key points is 17). The human skeleton data can then be obtained by analyzing the third intermediate data 26. Next, the human skeleton data are converted into a human skeleton tensor and a target confidence is appended, yielding the fourth intermediate data 27, whose data structure is 8 × 17 × 3 (time 8, key points 17; the x- and y-coordinates plus the target confidence together occupy 3 channels). The fourth intermediate data 27 are then converted into the fifth intermediate data 28 that the 2D convolutional neural network 14 can process, with data structure 8 × 17 × 512 (time 8, key points 17, features 512). Finally, the fifth intermediate data 28 are input into the 2D convolutional neural network 14 for pose recognition to obtain the second pose recognition result 29.
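The shape bookkeeping above can be checked with a short sketch. The transposed-convolution output formula and the padding value of 1 are assumptions here (the patent only states a 4 × 4 kernel with stride 2):

```python
def deconv_hw(size, kernel=4, stride=2, pad=1):
    # Output size of a transposed convolution: (in - 1) * stride - 2 * pad + kernel.
    return (size - 1) * stride - 2 * pad + kernel

h = w = 7                              # one human feature map 23: 7 x 7 x 2048
h, w = deconv_hw(h), deconv_hw(w)      # first deconvolution -> first intermediate data 24
assert (h, w) == (14, 14)
h, w = deconv_hw(h), deconv_hw(w)      # second deconvolution -> second intermediate data 25
assert (h, w) == (28, 28)
channels = 3 * 17                      # third intermediate data 26: 3 channels per key point, 17 key points
print(h, w, channels)                  # 28 28 51
```

With kernel 4, stride 2, and padding 1, each deconvolution exactly doubles the spatial size, matching the 7 → 14 → 28 progression described for Fig. 2.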
Exemplary method
Fig. 3 is a schematic flow chart of the human body pose recognition method provided by an exemplary embodiment of the application. This embodiment can be applied to an electronic device, i.e., a terminal device with visual analysis and processing capability, such as a mobile phone, a tablet computer, or a computer. As shown in Fig. 3, the method includes the following steps:
Step 301: crop a human figure region containing a human body from a video to be recognized.
Step 302: perform pose recognition on the human body in the human figure region to obtain a first pose recognition result 22 and human skeleton data corresponding to the human body.
Step 303: recognize the pose of the human body based on the human skeleton data to obtain a second pose recognition result 29.
Step 304: determine the pose of the human body based on the first pose recognition result 22 and the second pose recognition result 29.
According to the human body pose recognition method and device of the application, a human figure region containing a human body is cropped from the video to be recognized, and pose recognition is performed on the human body in the human figure region to obtain a first pose recognition result 22 and human skeleton data corresponding to the human body. The pose of the human body is then recognized based on the human skeleton data to obtain a second pose recognition result 29. Finally, the pose of the human body is determined by combining the first pose recognition result 22 and the second pose recognition result 29. Because the process of determining the human body pose considers not only the first pose recognition result 22 obtained from the human figure region but also, on that basis, the second pose recognition result 29 obtained from the human skeleton data, and determines the pose jointly from both results, it can reduce human skeleton noise without overfitting to the scene and other objects in the video, effectively improving the accuracy of human body pose recognition.
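Steps 301-304 can be sketched end to end as follows. Every callable and the averaging fusion below are hypothetical stand-ins, since the patent does not fix how the two recognition results are combined:

```python
def recognize_pose(video_frames, detector, i3d, skeleton_classifier, fuse):
    # Step 301: crop the human figure region from the video to be recognized.
    region = detector(video_frames)
    # Step 302: first pose recognition result plus human skeleton data.
    result_1, skeleton = i3d(region)
    # Step 303: second pose recognition result from the skeleton data.
    result_2 = skeleton_classifier(skeleton)
    # Step 304: determine the pose from both results.
    return fuse(result_1, result_2)

# Toy stand-ins: class-score lists, fused by elementwise averaging.
scores = recognize_pose(
    video_frames=None,
    detector=lambda v: v,
    i3d=lambda r: ([0.2, 0.8], "skeleton"),
    skeleton_classifier=lambda s: [0.4, 0.6],
    fuse=lambda a, b: [(x + y) / 2 for x, y in zip(a, b)],
)
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # 1
```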
Regarding step 301: in the application, human detection is performed on the video to be recognized by a human detector. The human region can be marked by the detector with a human detection box 11, see Fig. 1. The human detection box 11 is the minimum bounding rectangle containing the human body in the image. The region inside the human detection box 11 is the human figure region containing the human body that is cropped from the video to be recognized. How the human detector crops the human figure region from the video belongs to the prior art, and the application does not limit this.
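Cropping by a detection box reduces to simple slicing. The list-of-rows "image" below is a toy stand-in for a frame, and the (x0, y0, x1, y1) box format is an assumption:

```python
def crop_region(frame, box):
    # Crop the human figure region given a detection box (x0, y0, x1, y1);
    # `frame` is a row-major list of rows standing in for an image array.
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in frame[y0:y1]]

frame = [[10 * r + c for c in range(6)] for r in range(4)]  # 4 x 6 toy "image"
patch = crop_region(frame, (1, 1, 4, 3))
print(patch)  # [[11, 12, 13], [21, 22, 23]]
```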
Regarding step 302: the application implements it using the I3D (Inflated 3D ConvNet) model 12. The I3D model 12 is obtained by training on a collection of human body videos. In the application, the human figure region obtained in step 301 serves as the input of the I3D model 12, which outputs the first pose recognition result 22 and the human skeleton data. For each frame image in the video to be recognized, one first pose recognition result 22 and one set of human skeleton data are obtained. In Fig. 1, RGB-based action recognition corresponds to the process of obtaining the first pose recognition result 22, and human pose estimation corresponds to the process of obtaining the human skeleton data.
The following describes in detail how the I3D model 12 performs pose recognition on the human body in the human figure region and outputs the first pose recognition result 22:
The I3D model 12 of the application is obtained by inflating the 2D convolution kernels of ResNet-50 into 3D convolution kernels; the inflated network containing 3D convolution kernels is also called a 3D convolutional neural network. A 2D convolution kernel is height (H) × width (W): it covers only the height and width dimensions. Referring to Fig. 2, a 3D convolution kernel is time (T) × H × W: besides the height and width dimensions, it also covers the time dimension. Compared with a 2D kernel, a 3D kernel can capture not only human pose features but also temporal features, and thereby establish the relationship between pose features and temporal features, finally recognizing the pose of the human body at each moment in the video and improving the accuracy of the pose recognition result.
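Inflating a 2D kernel into a 3D one is commonly done by repeating the kernel T times along time and dividing by T, so the inflated filter initially responds to a static input exactly as the 2D filter did. The patent does not describe this bootstrapping; the sketch below shows the standard I3D trick as an assumption:

```python
def inflate_kernel(kernel_2d, t):
    # Repeat an H x W kernel T times along time and divide by T, so that a
    # constant-in-time input produces the same response as the 2D kernel.
    return [[[w / t for w in row] for row in kernel_2d] for _ in range(t)]

k2d = [[1.0, 2.0], [3.0, 4.0]]
k3d = inflate_kernel(k2d, t=4)
# Summing any spatial position over time recovers the original 2D weight:
print(sum(k3d[i][0][1] for i in range(4)))  # 2.0
```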
Further, the I3D model 12 of the application consists of two deconvolution layers 13 and two 1 × 1 convolutional layers, with the last convolutional layer connected to a global pooling layer and a fully connected layer as the output. Each of the two deconvolution layers 13 consists of 256 filters; the convolution kernel size of each deconvolution layer 13 is 4 × 4 with stride 2, and each deconvolution layer 13 is followed by a normalization layer and a ReLU activation function. The two 1 × 1 convolutional layers are placed in parallel.
Specifically, suppose the human figure region data set obtained in step 301 is denoted D = {(x_i, y_i)}, i ∈ {1, …, N}, where D contains N frame images covering n categories, and y_i ∈ {1, …, n} is the label of the i-th sample. The first pose recognition result 22 is then obtained according to the following formula (1):
L_r = argmax_{y ∈ {1, …, n}} Y_rgb(y)    (1)
where Y_rgb(y) is the y-th dimension of the I3D action-class prediction vector Y_rgb, and L_r is the first pose recognition result 22.
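Under the reading of formula (1) used here (an arg max over the class-score vector, which is an assumption since the original rendering of the equation is illegible), the first pose recognition result reduces to:

```python
def first_pose_result(y_rgb):
    # Pick the class whose score in the prediction vector Y_rgb is largest.
    return max(range(len(y_rgb)), key=lambda k: y_rgb[k])

y_rgb = [0.1, 0.05, 0.7, 0.15]  # illustrative class scores, n = 4
print(first_pose_result(y_rgb))  # 2
```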
Referring to Fig. 2: after the human figure region is cropped out, suppose the data structure of the video image 21 contained in the human figure region is 8 × 7 × 7 × 2048 (time 8, height 7, width 7, features 2048). Then the video image 21 is input into the I3D model 12 for pose recognition, obtaining the first pose recognition result 22, whose feature dimension is 2048.
Regarding how the I3D model 12 performs pose recognition on the human body in the human figure region and outputs the human skeleton data corresponding to the human body: as shown in Fig. 4, on the basis of the embodiment shown in Fig. 3, the process includes the following steps:
Step 401: split the human figure region into multiple human feature maps 23.
Step 402: obtain, using a 3D convolutional neural network, the human skeleton coordinates corresponding to each human feature map 23.
Step 403: take the human skeleton coordinates corresponding to each human feature map 23 as the human skeleton data.
After splitting out the multiple human feature maps 23, the application uses a 3D convolutional neural network to obtain the human skeleton coordinates corresponding to each human feature map 23, and takes the human skeleton coordinates as the human skeleton data for subsequent pose recognition. Because the 3D convolution kernels of the 3D convolutional neural network can capture temporal features in addition to human pose features, the accuracy of the pose recognition result can be improved.
Specifically, first, in step 401, the human figure region of the video to be recognized is split into several human feature maps 23, where each frame image in the video corresponds to one human feature map 23, namely the image contained in the human figure region of that frame. Next, in step 402, after the multiple human feature maps 23 are obtained, the human skeleton coordinates corresponding to each human feature map 23 are obtained through the processing of the 3D convolutional neural network. Finally, in step 403, the human skeleton coordinates corresponding to each human feature map 23 are taken as the human skeleton data. Here, the human skeleton is composed of several human body key points; a human body key point, also called a skeleton key point, reflects human bone information. In general, the human body key points include positions such as the head, neck, shoulders, elbows, hands, hips, knees, and feet. The human skeleton coordinates are the set of coordinates of the several human body key points that the skeleton contains. For example, if the preset human body key points include the head, neck, and shoulders, then the human skeleton coordinates contain the coordinates of the head, the neck, and the shoulders.
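The human skeleton coordinates can be illustrated as a mapping from key-point name to coordinate; the names and values below are hypothetical:

```python
# Human skeleton coordinates as a mapping from key-point name to (x, y).
skeleton = {
    "head": (14, 3),
    "neck": (14, 7),
    "left_shoulder": (10, 8),
    "right_shoulder": (18, 8),
}

def skeleton_tensor(skel, order):
    # Flatten the coordinate set into an ordered list of (x, y) pairs,
    # the form later converted into the human skeleton tensor.
    return [skel[name] for name in order]

print(skeleton_tensor(skeleton, ["head", "neck"]))  # [(14, 3), (14, 7)]
```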
Regarding step 402 and how the 3D convolutional neural network obtains the human skeleton coordinates corresponding to one human feature map 23: as shown in Fig. 5, on the basis of the embodiment shown in Fig. 4, the process includes the following steps:
Step 501: obtain, using the 3D convolutional neural network, the heat map corresponding to each human body key point in the human feature map 23.
Step 502: determine, according to each heat map, the coordinates of each human body key point in the human feature map 23.
Step 503: obtain the human skeleton coordinates corresponding to the human feature map 23 according to the coordinates of each human body key point in the human feature map 23.
By respectively obtaining the heat map corresponding to each human body key point, analyzing the heat maps to determine the coordinates of the key points in the human feature map 23, and then assembling the coordinates of all key points into the human skeleton coordinates, the application exploits the fact that a heat map reflects key-point features intuitively and accurately; the human skeleton coordinates determined from heat maps are therefore more accurate.
Specifically, in the application, the heat map (heatmap) corresponding to each human body key point in the human feature map 23 is obtained using the 3D convolutional neural network. Each human body key point corresponds to one heat map, each heat map indicates the likely position of the corresponding key point, and one heat map occupies one channel. For example, if a human feature map 23 contains P human body key points, then P heat maps are obtained and there are P channels. Referring to Fig. 2, in the process of obtaining the heat maps: first, suppose 8 human feature maps 23 are split out, each with the data structure 7 × 7 × 2048 (height 7, width 7, features 2048). Then each human feature map 23 is upsampled twice by deconvolution. The first deconvolution yields the first intermediate data 24, with data structure 14 × 14 × 256 (height 14, width 14, features 256). The second deconvolution yields the second intermediate data 25, with data structure 28 × 28 × 256 (height 28, width 28, features 256). Then, the third intermediate data 26, containing the heat map and offset data corresponding to each human body key point, is obtained from the second intermediate data; its data structure is 28 × 28 × (3 × 17) (height 28, width 28; the heat map and offset data together occupy 3 channels, and the number of human body key points is 17). Finally, the heat map corresponding to each human body key point is extracted from the third intermediate data 26.
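Splitting the 3 × 17 channels of the third intermediate data into per-key-point heat maps and offsets can be sketched as follows. The (heat, dx, dy) channel order is an assumption, since the patent only says the heat map and offset data together occupy 3 channels per key point:

```python
def split_channels(third_intermediate, num_keypoints=17):
    # Separate the 3 channels per key point into a heat map and two offset maps.
    heat, off_x, off_y = [], [], []
    for k in range(num_keypoints):
        h, dx, dy = third_intermediate[3 * k : 3 * k + 3]
        heat.append(h); off_x.append(dx); off_y.append(dy)
    return heat, off_x, off_y

# Toy data: 3 * 17 = 51 "maps", each reduced here to a single number.
data = list(range(51))
heat, off_x, off_y = split_channels(data)
print(heat[:3], off_x[:3], off_y[:3])  # [0, 3, 6] [1, 4, 7] [2, 5, 8]
```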
After the heat maps are obtained, regarding step 502 and how to determine the coordinates of a human body key point in the human feature map 23 according to a heat map: as shown in Fig. 6, on the basis of the embodiment shown in Fig. 5, the following steps may be included:
Step 601: using the pixel target trained for the heat map, obtain the probability that each pixel in the heat map belongs to the human body key point.
Step 602: determine the coordinates of the pixel corresponding to the maximum probability as the first coordinates of the human body key point.
Step 603: superimpose a bias vector on the first coordinates to obtain the second coordinates of the human body key point.
Step 604: amplify the second coordinates based on the proportional relationship between the heat map and the human feature map 23 to obtain the coordinates of the human body key point in the human feature map 23.
The application trains pixel targets in advance, determines from the trained pixel targets the probability that each pixel in the heat map belongs to the human body key point, takes the coordinates of the pixel with the largest probability as the first coordinates of the key point, and, taking coordinate bias into account, superimposes a bias vector on the first coordinates to obtain the second coordinates, which are used as the coordinates of the key point. This improves the accuracy of the human body key point coordinates, and thus indirectly improves the accuracy of the final pose recognition.
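Steps 601-604 can be condensed into one decoding sketch; the offset layout and the scalar scale factor below are assumptions:

```python
def decode_keypoint(heat, offset, scale):
    # Steps 601/602: arg-max the heat map for the first coordinates.
    best, (bx, by) = -1.0, (0, 0)
    for y, row in enumerate(heat):
        for x, p in enumerate(row):
            if p > best:
                best, (bx, by) = p, (x, y)
    # Step 603: superimpose the bias vector read from the offset maps.
    dx, dy = offset[by][bx]
    # Step 604: scale up to the human feature map.
    return ((bx + dx) * scale, (by + dy) * scale)

heat_map = [[0.1, 0.2], [0.9, 0.3]]                 # max at (x=0, y=1)
offsets = [[(0.0, 0.0), (0.0, 0.0)], [(0.5, -0.25), (0.0, 0.0)]]
print(decode_keypoint(heat_map, offsets, scale=4))  # (2.0, 3.0)
```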
In the specific implementation process, suppose the two-dimensional coordinate of each pixel i on the heatmap is xi, i ∈ {1, …, Q}, where i is the pixel index and Q is the total number of pixels on the heatmap, and the coordinate of the k-th human keypoint on the heatmap is lk. Let M be a preset trusted radius corresponding to the trusted region of the k-th human keypoint. Then, when ||xi − lk|| ≤ M, pixel xi lies within the circular region of radius M centered on the k-th human keypoint; that is, pixel xi is in the trusted region of the k-th human keypoint, and the expected probability output of the neural network for pixel xi is hk(xi) = 1. Here hk(xi) reflects the possibility that the k-th human keypoint is located at pixel xi; the larger its value, the higher the possibility.
Further, in the present application, a trained pixel target is trained in advance for each heatmap; the trained pixel target is a 0-or-1 map h̄k: when ||xi − lk|| ≤ M, h̄k(xi) = 1; otherwise h̄k(xi) = 0. By specifying h̄k(xi), a binary classification problem is solved for each pixel xi with respect to the human keypoint: minimizing the absolute difference between hk(xi) and h̄k(xi) supervises the network output hk(xi) to approach h̄k(xi) during training, so that the network learns the possibility that the k-th human keypoint is located at pixel xi, namely the probability that the pixel belongs to the human keypoint. Then the pixel with the maximum probability value is determined as the human keypoint, and the coordinate of the human keypoint is determined according to the coordinate of that pixel. Specifically, an argmax operation may be used to obtain the coordinate of the pixel with the maximum probability value; the argmax operation is given by the following formula (2):
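As a minimal sketch of the trained pixel target described above (a map that is 1 inside the trusted disk of radius M around the keypoint and 0 elsewhere); the grid size and the radius value are example numbers, not taken from the patent:

```python
import numpy as np

def pixel_target(lk, size=28, radius=2.0):
    """Binary target map: 1 inside the trusted disk of radius M around
    the keypoint coordinate lk = (row, col), 0 everywhere else."""
    rows, cols = np.mgrid[0:size, 0:size]
    dist = np.hypot(rows - lk[0], cols - lk[1])   # ||x_i - l_k||
    return (dist <= radius).astype(np.float32)

target = pixel_target((10, 10))
print(target[10, 10], target[0, 0], int(target.sum()))   # 1.0 0.0 13
```

Minimizing |hk(xi) − h̄k(xi)| against such a target supervises the heatmap channel toward this disk.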
The coordinate obtained by the above formula (2) is the first coordinate. After the first coordinate is obtained, there are the following two embodiments for determining the coordinate of the human keypoint. In the first embodiment, the first coordinate may be used directly as the coordinate of the human keypoint; however, this embodiment often carries a large error. The present application therefore provides a second embodiment, in which a bias vector is added to the first coordinate to obtain a second coordinate, and the second coordinate is used as the coordinate of the human keypoint. Since the second embodiment takes the coordinate bias into account, it improves the accuracy of the human keypoint coordinates and thus indirectly improves the accuracy of the final pose recognition.
Further, as to how the bias vector is obtained, as shown in Fig. 7, the following steps are included:
Step 701: obtain offset data corresponding to the human keypoint using the 3D convolutional neural network.
Step 702: normalize the offset data based on the difference between the predicted coordinate of the human keypoint and the coordinates of the pixels on the heatmap, obtaining the bias vector.
In the specific implementation process, while the 3D convolutional neural network obtains the heatmap corresponding to each human keypoint in the human feature map 23, it may also obtain the offset data (Offset) corresponding to each human keypoint in the human feature map 23. Each human keypoint corresponds to one set of offset data, which indicates the bias of each pixel position relative to the human keypoint. Each set of offset data occupies two channels, corresponding to the x direction and the y direction of the coordinate system; when there are P human keypoints, the offset data occupies 2P channels. Taking Fig. 2 as an example, the offset data is acquired at the same time as the heatmaps, so the process of obtaining the offset data is similar to that of the heatmaps. First, suppose that 8 human feature maps 23 are split out; the data structure of each human feature map 23 is 7 × 7 × 2048, denoting: height dimension 7, width dimension 7, feature dimension 2048. Then, each human feature map 23 is upsampled by two deconvolutions. The first deconvolution yields first intermediate data 24, whose data structure is 14 × 14 × 256, denoting: height dimension 14, width dimension 14, feature dimension 256. The second deconvolution yields second intermediate data 25, whose data structure is 28 × 28 × 256, denoting: height dimension 28, width dimension 28, feature dimension 256. Next, third intermediate data 26, which contains the heatmap and offset data corresponding to each human keypoint, is obtained from the second intermediate data; its data structure is 28 × 28 × (3 × 17), denoting: height dimension 28, width dimension 28, 3 channels per keypoint for the heatmap and offset data together, and 17 human keypoints. Finally, the offset data corresponding to each human keypoint is extracted from the third intermediate data 26.
After the offset data is obtained, it is normalized based on the difference between the predicted coordinate of the human keypoint and the coordinates of the pixels on the heatmap; that is, for each pixel position, a 2D offset vector toward the human keypoint is predicted: Fk(xi) = lk − xi. During training, a regression problem is solved for each pixel position and human keypoint by minimizing the absolute difference between Fk(xi) and lk − xi, and the solution gives the bias vector Fk(xi). In turn, in step 603, the bias vector Fk(xi) is superimposed on the first coordinate to obtain the second coordinate.
Further, since the heatmap is an image obtained by compressing the resolution of the human feature map 23 (in general, the spatial resolution of the human feature map 23 is 224 × 224, while that of the heatmap and offset data is 28 × 28), in step 604 the second coordinate is scaled up according to the scale relationship between the heatmap and the human feature map 23, yielding the coordinate of the human keypoint in the human feature map 23. This coordinate accurately reflects the position of the human keypoint in the human feature map 23.
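Putting steps 602 to 604 together, keypoint decoding (argmax over the heatmap, superimposing the bias vector, then scaling by the 224/28 resolution ratio) can be sketched as follows; the function name and the toy inputs are illustrative, not from the patent:

```python
import numpy as np

def decode_keypoint(heatmap, offset_x, offset_y, feat_size=224):
    """28x28 heatmap -> keypoint coordinate in the 224x224 human feature map."""
    r, c = np.unravel_index(np.argmax(heatmap), heatmap.shape)  # first coordinate
    x2 = c + offset_x[r, c]                 # second coordinate, x direction
    y2 = r + offset_y[r, c]                 # second coordinate, y direction
    scale = feat_size / heatmap.shape[0]    # 224 / 28 = 8
    return x2 * scale, y2 * scale

hm = np.zeros((28, 28)); hm[10, 12] = 1.0   # heatmap peak at row 10, col 12
ox = np.zeros((28, 28)); ox[10, 12] = 0.5   # predicted bias, x
oy = np.zeros((28, 28)); oy[10, 12] = -0.25 # predicted bias, y
print(decode_keypoint(hm, ox, oy))          # (100.0, 78.0)
```

Without the bias vector the decoded position would be quantized to the 28 × 28 grid; the offset recovers sub-cell precision before the ×8 upscaling.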
After the coordinate of each human keypoint in the human feature map 23 is obtained through the above process, step 503 is executed. In step 503, the human skeleton coordinates corresponding to the human feature map 23 comprise the coordinates of each human keypoint in the human feature map 23.
After the human skeleton data is obtained, the present application executes step 303, which corresponds to the skeleton-based classification in Fig. 1. For step 303, the present application uses the 2D convolutional neural network 14 to recognize the pose of the human body based on the human skeleton data, obtaining the second pose recognition result 29; since the human pose is captured from the human skeleton data, the human pose can be recognized accurately. As shown in Fig. 8, on the basis of the embodiment shown in Fig. 3, step 303 may include the following steps:
Step 801: convert the human skeleton data into a human skeleton tensor.
Step 802: add a target confidence into the human skeleton tensor, wherein the target confidence is obtained by performing max pooling on each heatmap.
Step 803: input the human skeleton tensor with the added target confidence into the 2D convolutional neural network 14 to obtain the second pose recognition result 29.
In the present application, the human skeleton data is converted into a human skeleton tensor and combined with the target confidence obtained by max pooling the heatmaps; the human skeleton tensor with the added target confidence serves as the input of the 2D convolutional neural network 14, which then yields the second pose recognition result 29. This improves the accuracy of the human pose recognized from the human skeleton data.
In the specific implementation process, the human skeleton data containing the human skeleton coordinates is first converted into a human skeleton tensor of shape 2 × T × K, where K is the number of human keypoints and T is the number of image frames in the video to be identified. Then an additional target confidence is added to the human skeleton tensor; the target confidence is obtained by performing max pooling on each heatmap. The human skeleton tensor with the added target confidence is then input into the 2D convolutional neural network 14 (i.e., a 2D CNN) for pose recognition. Since the pose sequence dimension of the input is small, the pooling operations usually used in a 2D CNN are removed, convolutional layers of stride 2 are replaced with stride 1, and finally a global pooling layer and a fully connected layer are attached. The final prediction is optimized with a cross-entropy loss, and the second pose recognition result 29 is obtained through the following formula (3):
Wherein, Ypaction(yi) is the yi-th dimension of the class prediction vector Ypaction of the 2D CNN, and Lpaction corresponds to the second pose recognition result 29.
Taking Fig. 2 as an example, converting the human skeleton data into a human skeleton tensor and adding the target confidence yields fourth intermediate data 27, whose data structure is 8 × 17 × 3, denoting: time dimension 8, 17 human keypoints, and 3 channels for the x/y coordinates plus the target confidence. Then, the fourth intermediate data 27 is converted into fifth intermediate data 28 that the 2D convolutional neural network 14 can process, whose data structure is 8 × 17 × 512, denoting: time dimension 8, 17 human keypoints, feature dimension 512. Finally, the fifth intermediate data 28 is input into the 2D convolutional neural network 14 for pose recognition, obtaining the second pose recognition result 29.
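The construction of the fourth intermediate data 27 above, stacking the x/y skeleton coordinates with a max-pooled heatmap confidence into an 8 × 17 × 3 tensor, can be sketched as follows; random arrays stand in for real network outputs:

```python
import numpy as np

T, K = 8, 17                                # frames, human keypoints
skeleton = np.random.rand(2, T, K)          # 2 x T x K human skeleton tensor (x, y)
heatmaps = np.random.rand(T, K, 28, 28)     # one 28x28 heatmap per keypoint per frame

# target confidence: global max pooling over each heatmap
confidence = heatmaps.max(axis=(2, 3))      # shape (T, K)

# stack coordinates and confidence into the 8 x 17 x 3 input tensor
tensor = np.concatenate(
    [skeleton.transpose(1, 2, 0), confidence[..., None]], axis=-1)
print(tensor.shape)                          # (8, 17, 3)
```

The resulting tensor is what the modified 2D CNN (pooling removed, strides set to 1) consumes for classification.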
In order to measure the gap between the predicted output of the neural network and the actual value, and to correct the prediction according to that gap so that it approaches the actual value, the human pose recognition method of the present application further comprises the following steps:
obtaining a first task loss in the process of performing pose recognition on the human body in the human figure region; and obtaining a second task loss in the process of recognizing the pose of the human body based on the human skeleton data.
Specifically, in the process of executing step 302, in which pose recognition is performed on the human body in the human figure region, the first task loss is obtained using the following formula (4):
Wherein, Lh(θ) is the first task loss, θ denotes the learnable parameters of the 3D convolutional neural network, R is the smooth L1 loss, and K is the number of human keypoints. The smooth L1 loss can be obtained through the following formula (5):
Meanwhile, in the process of executing step 303, in which the pose of the human body is recognized based on the human skeleton data, the second task loss is obtained using the following formula (6):
Wherein, Lo(θ) is the second task loss. The offset loss function of formula (6) is the sum of the smooth L1 losses at each pixel position, and the offset loss is solved only for positions within the circle of radius M around each human keypoint.
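Formula (5) is not reproduced legibly above; assuming the standard smooth L1 definition (0.5x² for |x| < 1, |x| − 0.5 otherwise), the keypoint supervision of formula (4) can be sketched as a mean smooth L1 between predicted and target heatmaps:

```python
import numpy as np

def smooth_l1(x):
    """Standard smooth L1 (assumed form of formula (5))."""
    a = np.abs(x)
    return np.where(a < 1.0, 0.5 * a * a, a - 0.5)

def heatmap_loss(pred, target):
    """First task loss Lh (formula (4), up to normalization): smooth L1
    between predicted and target heatmaps, averaged over all pixels."""
    return float(smooth_l1(pred - target).mean())

print(float(smooth_l1(np.array(0.5))), float(smooth_l1(np.array(2.0))))  # 0.125 1.5
```

The offset loss of formula (6) would apply the same smooth_l1 to Fk(xi) − (lk − xi), summed only over pixels in the radius-M trusted disk.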
Further, after the first task loss and the second task loss are obtained, for step 304, the pose of the human body is determined based on the first pose recognition result 22, the second pose recognition result 29, the first task loss, and the second task loss.
Specifically, the target pose estimation loss can be obtained from the first task loss and the second task loss through the following formula (7):
Lp = λhLh(θ) + λoLo(θ)    formula (7)
Wherein, Lp is the target pose estimation loss, and λh and λo are balance weights used to balance the first task loss and the second task loss; both λh and λo take the value 0.5. The pose of the human body is then obtained through the following formula (8):
L = λ1Lr + λ2Lp + λ3Lpaction    formula (8)
Wherein, L is the total loss used to determine the pose of the human body, and λ1, λ2 and λ3 are the weights of the three task losses respectively, each defaulting to 1.
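Formulas (7) and (8) combine into one weighted sum; a minimal sketch, with illustrative variable names (l_r stands for the loss term Lr associated with the first pose recognition branch in formula (8)):

```python
def total_loss(l_r, l_h, l_o, l_paction,
               lam_h=0.5, lam_o=0.5, lam1=1.0, lam2=1.0, lam3=1.0):
    """Formula (7): Lp = λh*Lh + λo*Lo, then
    formula (8): L = λ1*Lr + λ2*Lp + λ3*Lpaction."""
    l_p = lam_h * l_h + lam_o * l_o
    return lam1 * l_r + lam2 * l_p + lam3 * l_paction

print(round(total_loss(0.2, 0.4, 0.6, 0.8), 6))   # 1.5  (= 0.2 + 0.5 + 0.8)
```

With the stated defaults (λh = λo = 0.5, λ1 = λ2 = λ3 = 1), the two pose losses are averaged and the three task terms are summed unweighted.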
In the present application, the first pose recognition result 22 and the second pose recognition result 29 are corrected using the first task loss and the second task loss, so that the finally determined human pose is more accurate.
Exemplary means
Based on the same inventive concept, an embodiment of the present application further provides a human pose recognition device. As shown in Fig. 9, the device includes:
a cutting module 901, configured to cut out the human figure region containing the human body from the video to be identified;
a first output module 902, configured to perform pose recognition on the human body in the human figure region to obtain the first pose recognition result and the human skeleton data corresponding to the human body;
a second output module 903, configured to recognize the pose of the human body based on the human skeleton data to obtain the second pose recognition result;
a determining module 904, configured to determine the pose of the human body based on the first pose recognition result and the second pose recognition result.
Wherein, as shown in Fig. 10, the second output module 903 comprises:
a pose recognition submodule 9031, configured to recognize the pose of the human body based on the human skeleton data using the 2D convolutional neural network, obtaining the second pose recognition result.
Wherein, as shown in Fig. 11, the first output module 902 comprises:
a splitting submodule 9021, configured to split the human figure region into multiple human feature maps;
a human skeleton coordinate obtaining submodule 9022, configured to obtain, using the 3D convolutional neural network, the human skeleton coordinates respectively corresponding to each human feature map;
a determining submodule 9023, configured to take the human skeleton coordinates corresponding to each human feature map as the human skeleton data.
Wherein, as shown in Fig. 12, the human skeleton coordinate obtaining submodule 9022 comprises:
a heatmap obtaining unit 90221, configured to obtain, using the 3D convolutional neural network, the heatmap corresponding to each human keypoint in the human feature map;
a human keypoint determining unit 90222, configured to determine, according to each heatmap, the coordinate of each human keypoint in the human feature map;
a skeleton coordinate obtaining unit 90223, configured to obtain the human skeleton coordinates corresponding to the human feature map according to the coordinate of each human keypoint in the human feature map.
Wherein, as shown in Fig. 13, the human keypoint determining unit 90222 comprises:
a probability obtaining subunit 902221, configured to obtain, using the trained pixel target corresponding to the heatmap, the probability that each pixel in the heatmap belongs to the human keypoint;
a coordinate determining subunit 902222, configured to determine the coordinate of the pixel corresponding to the maximum probability as the first coordinate of the human keypoint;
a superimposing subunit 902223, configured to superimpose the bias vector on the first coordinate to obtain the second coordinate of the human keypoint;
a scaling subunit 902224, configured to scale up the second coordinate based on the scale relationship between the heatmap and the human feature map, obtaining the coordinate of the human keypoint in the human feature map.
Wherein, as shown in Fig. 14, the device further comprises:
an offset data obtaining module 905, configured to obtain the offset data corresponding to the human keypoint using the 3D convolutional neural network;
a data processing module 906, configured to normalize the offset data based on the difference between the predicted coordinate of the human keypoint and the coordinates of the pixels on the heatmap, obtaining the bias vector.
Wherein, as shown in Fig. 15, the pose recognition submodule 9031 comprises:
a data conversion unit 90311, configured to convert the human skeleton data into a human skeleton tensor;
an adding unit 90312, configured to add the target confidence into the human skeleton tensor, wherein the target confidence is obtained by performing max pooling on each heatmap;
a pose recognition unit 90313, configured to input the human skeleton tensor with the added target confidence into the 2D convolutional neural network, obtaining the second pose recognition result.
Wherein, as shown in Fig. 16, the device further comprises:
a loss obtaining module 907, configured to obtain the first task loss in the process of performing pose recognition on the human body in the human figure region, and to obtain the second task loss in the process of recognizing the pose of the human body based on the human skeleton data.
Wherein, as shown in Fig. 17, the determining module 904 comprises:
a determining submodule 9041, configured to determine the pose of the human body based on the first pose recognition result, the second pose recognition result, the first task loss, and the second task loss.
Example electronic device
In the following, an electronic device according to an embodiment of the present application is described with reference to Fig. 18.
Fig. 18 illustrates a block diagram of the electronic device according to an embodiment of the present application.
As shown in Fig. 18, the electronic device 1801 includes one or more processors 18011 and a memory 18012.
The processor 18011 may be a central processing unit (CPU) or another form of processing unit having data processing capability and/or instruction execution capability, and may control the other components in the electronic device 1801 to perform desired functions.
The memory 18012 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 18011 may run the program instructions to implement the human pose recognition method of the embodiments of the present application described above and/or other desired functions. Various contents such as input signals, signal components, and noise components may also be stored in the computer-readable storage medium.
In one example, the electronic device 1801 may further include an input device 18013 and an output device 18014, these components being interconnected through a bus system and/or other forms of connection mechanisms (not shown).
For example, the input device 18013 may be the above-mentioned microphone or microphone array, used for capturing the input signal of a sound source. When the electronic device is a stand-alone device, the input device may be a communication network connector.
In addition, the input device 18013 may also include, for example, a keyboard, a mouse, and the like.
The output device 18014 may output various information to the outside, including determined range information, direction information, and the like. The output device 18014 may include, for example, a display, a loudspeaker, a printer, a communication network, and remote output devices connected thereto.
Of course, for simplicity, Fig. 18 only illustrates some of the components of the electronic device 1801 that are related to the present application, omitting components such as buses and input/output interfaces. In addition, depending on the specific application, the electronic device 1801 may also include any other appropriate components.
Illustrative computer program product and computer readable storage medium
In addition to the above methods and devices, an embodiment of the present application may also be a computer program product comprising computer program instructions which, when run by a processor, cause the processor to execute the steps of the human pose recognition method according to the various embodiments of the present application described in the "Exemplary Methods" section of this specification.
The computer program product may be written in any combination of one or more programming languages to produce program code for performing the operations of the embodiments of the present application; the programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
In addition, an embodiment of the present application may also be a computer-readable storage medium on which computer program instructions are stored, the computer program instructions, when run by a processor, causing the processor to execute the steps of the human pose recognition method according to the various embodiments of the present application described in the "Exemplary Methods" section of this specification.
The computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any appropriate combination of the above.
The basic principles of the present application have been described above in conjunction with specific embodiments. However, it should be noted that the merits, advantages, effects, and the like mentioned in the present application are merely examples and not limitations; it must not be assumed that these merits, advantages, and effects are required by each embodiment of the present application. In addition, the specific details disclosed above are only for the purpose of illustration and ease of understanding, not for limitation; the above details do not restrict the present application to being implemented with those specific details.
The block diagrams of devices, apparatuses, equipment, and systems involved in the present application are only illustrative examples and are not intended to require or imply that connection, arrangement, and configuration must be carried out in the manner shown in the blocks. As those skilled in the art will appreciate, these devices, apparatuses, equipment, and systems may be connected, arranged, and configured in any manner. Words such as "include", "comprise", and "have" are open-ended terms meaning "including but not limited to" and may be used interchangeably therewith. The words "or" and "and" used herein refer to "and/or" and may be used interchangeably therewith, unless the context clearly indicates otherwise. The word "such as" used herein refers to the phrase "such as, but not limited to" and may be used interchangeably therewith.
It should also be noted that in the devices, apparatuses, and methods of the present application, each component or each step may be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalent solutions of the present application.
The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects are readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of the present application. Therefore, the present application is not intended to be limited to the aspects shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to restrict the embodiments of the present application to the forms disclosed herein. Although multiple exemplary aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, changes, additions, and sub-combinations thereof.

Claims (11)

1. A human pose recognition method, comprising:
cutting out a human figure region containing a human body from a video to be identified;
performing pose recognition on the human body in the human figure region to obtain a first pose recognition result and human skeleton data corresponding to the human body;
recognizing the pose of the human body based on the human skeleton data to obtain a second pose recognition result;
determining the pose of the human body based on the first pose recognition result and the second pose recognition result.
2. The human pose recognition method according to claim 1, wherein recognizing the pose of the human body based on the human skeleton data to obtain the second pose recognition result comprises:
recognizing the pose of the human body based on the human skeleton data using a 2D convolutional neural network, obtaining the second pose recognition result.
3. The human pose recognition method according to claim 1, wherein performing pose recognition on the human body in the human figure region to obtain the human skeleton data corresponding to the human body comprises:
splitting the human figure region into multiple human feature maps;
obtaining, using a 3D convolutional neural network, human skeleton coordinates respectively corresponding to each of the human feature maps;
taking the human skeleton coordinates corresponding to each of the human feature maps as the human skeleton data.
4. The human pose recognition method according to claim 3, wherein obtaining, using the 3D convolutional neural network, the human skeleton coordinates corresponding to one of the human feature maps comprises:
obtaining, using the 3D convolutional neural network, a heatmap corresponding to each human keypoint in the human feature map;
determining, according to each heatmap, a coordinate of each human keypoint in the human feature map;
obtaining the human skeleton coordinates corresponding to the human feature map according to the coordinate of each human keypoint in the human feature map.
5. The human pose recognition method according to claim 4, wherein determining, according to the heatmap, the coordinate of the human keypoint in the human feature map comprises:
obtaining, using a trained pixel target corresponding to the heatmap, a probability that each pixel in the heatmap belongs to the human keypoint;
determining a coordinate of the pixel corresponding to the maximum value of the probability as a first coordinate of the human keypoint;
superimposing a bias vector on the first coordinate to obtain a second coordinate of the human keypoint;
scaling up the second coordinate based on a scale relationship between the heatmap and the human feature map, obtaining the coordinate of the human keypoint in the human feature map.
6. The human pose recognition method according to claim 5, wherein before the superimposing of the bias vector on the first coordinate, the method further comprises:
obtaining offset data corresponding to the human keypoint using the 3D convolutional neural network;
normalizing the offset data based on a difference between a predicted coordinate of the human keypoint and the coordinates of the pixels on the heatmap, obtaining the bias vector.
7. The human pose recognition method according to claim 4, wherein recognizing the pose of the human body based on the human skeleton data to obtain the second pose recognition result comprises:
converting the human skeleton data into a human skeleton tensor;
adding a target confidence into the human skeleton tensor, wherein the target confidence is obtained by performing max pooling on each heatmap;
inputting the human skeleton tensor with the added target confidence into a 2D convolutional neural network to obtain the second pose recognition result.
8. The human pose recognition method according to claim 1, further comprising:
obtaining a first task loss in the process of performing pose recognition on the human body in the human figure region; and obtaining a second task loss in the process of recognizing the pose of the human body based on the human skeleton data;
wherein determining the pose of the human body based on the first pose recognition result and the second pose recognition result comprises:
determining the pose of the human body based on the first pose recognition result, the second pose recognition result, the first task loss, and the second task loss.
9. A human pose recognition device, comprising:
a cutting module, configured to cut out a human figure region containing a human body from a video to be identified;
a first output module, configured to perform pose recognition on the human body in the human figure region to obtain a first pose recognition result and human skeleton data corresponding to the human body;
a second output module, configured to recognize the pose of the human body based on the human skeleton data to obtain a second pose recognition result;
a determining module, configured to determine the pose of the human body based on the first pose recognition result and the second pose recognition result.
10. A computer-readable storage medium storing a computer program, the computer program being used to execute the human pose recognition method of any one of claims 1-8.
11. An electronic device, comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the human body pose recognition method according to any one of claims 1-8.
CN201910363750.8A 2019-04-30 2019-04-30 Human body pose recognition method and apparatus Pending CN110135304A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910363750.8A CN110135304A (en) 2019-04-30 2019-04-30 Human body pose recognition method and apparatus


Publications (1)

Publication Number Publication Date
CN110135304A true CN110135304A (en) 2019-08-16

Family

ID=67576034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910363750.8A Pending CN110135304A (en) 2019-04-30 2019-04-30 Human body pose recognition method and apparatus

Country Status (1)

Country Link
CN (1) CN110135304A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069423A (en) * 2015-07-29 2015-11-18 北京格灵深瞳信息技术有限公司 Human body posture detection method and device
CN109460707A (en) * 2018-10-08 2019-03-12 华南理工大学 A kind of multi-modal action identification method based on deep neural network
CN109376663A (en) * 2018-10-29 2019-02-22 广东工业大学 Human posture recognition method and related apparatus
CN109670474A (en) * 2018-12-28 2019-04-23 广东工业大学 Video-based human posture estimation method, apparatus and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIAGANG ZHU et al.: "Action Machine: Rethinking Action Recognition in Trimmed Videos", arXiv:1812.05770v1 [cs.CV] *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191599A (en) * 2019-12-27 2020-05-22 平安国际智慧城市科技股份有限公司 Gesture recognition method, device, equipment and storage medium
CN111191599B (en) * 2019-12-27 2023-05-30 平安国际智慧城市科技股份有限公司 Gesture recognition method, device, equipment and storage medium
JP7480302B2 (en) 2019-12-27 2024-05-09 ヴァレオ・シャルター・ウント・ゼンゾーレン・ゲーエムベーハー Method and apparatus for predicting the intentions of vulnerable road users
CN113743293A (en) * 2021-09-02 2021-12-03 泰康保险集团股份有限公司 Fall behavior detection method and device, electronic equipment and storage medium
CN113743293B (en) * 2021-09-02 2023-11-24 泰康保险集团股份有限公司 Fall behavior detection method and device, electronic equipment and storage medium
US20230377192A1 (en) * 2022-05-23 2023-11-23 Dell Products, L.P. System and method for detecting postures of a user of an information handling system (ihs) during extreme lighting conditions
US11836825B1 (en) * 2022-05-23 2023-12-05 Dell Products L.P. System and method for detecting postures of a user of an information handling system (IHS) during extreme lighting conditions

Similar Documents

Publication Publication Date Title
KR102266529B1 (en) Method, apparatus, device and readable storage medium for image-based data processing
US10902056B2 (en) Method and apparatus for processing image
CN110135304A (en) Human body pose recognition method and apparatus
US9349076B1 (en) Template-based target object detection in an image
CN113362382A (en) Three-dimensional reconstruction method and three-dimensional reconstruction device
CN111160269A (en) Face key point detection method and device
CN111931764B (en) Target detection method, target detection frame and related equipment
JP6571108B2 (en) Real-time 3D gesture recognition and tracking system for mobile devices
CN103324938A (en) Method for training attitude classifier and object classifier and method and device for detecting objects
CN111680678B (en) Target area identification method, device, equipment and readable storage medium
JP2020518051A (en) Face posture detection method, device and storage medium
CN111597884A (en) Facial action unit identification method and device, electronic equipment and storage medium
WO2023151237A1 (en) Face pose estimation method and apparatus, electronic device, and storage medium
CN104463240A (en) Method and device for controlling list interface
Fei et al. Flow-pose Net: An effective two-stream network for fall detection
CN111126515A (en) Model training method based on artificial intelligence and related device
Shi et al. DSFNet: a distributed sensors fusion network for action recognition
Zhang et al. A posture detection method for augmented reality–aided assembly based on YOLO-6D
Zhang Innovation of English teaching model based on machine learning neural network and image super resolution
US20210097377A1 (en) Method and apparatus for image recognition
CN110969138A (en) Human body posture estimation method and device
CN113239915B (en) Classroom behavior identification method, device, equipment and storage medium
Mao et al. A deep learning approach to track Arabidopsis seedlings’ circumnutation from time-lapse videos
Rawat et al. Indian Sign Language Recognition System for Interrogative Words Using Deep Learning
CN113642565B (en) Object detection method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190816