CN108389227A - A three-dimensional pose estimation method based on a multi-view deep perceptron framework - Google Patents

A three-dimensional pose estimation method based on a multi-view deep perceptron framework Download PDF

Info

Publication number
CN108389227A
Authority
CN
China
Prior art keywords
network
view
dimensional
perceptron
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201810171088.1A
Other languages
Chinese (zh)
Inventor
夏春秋 (Xia Chunqiu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Vision Technology Co Ltd
Original Assignee
Shenzhen Vision Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Vision Technology Co Ltd
Priority to CN201810171088.1A priority Critical patent/CN108389227A/en
Publication of CN108389227A publication Critical patent/CN108389227A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/593 Depth or shape recovery from multiple images from stereo images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image
    • G06T2207/10012 Stereo images

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present invention proposes a three-dimensional pose estimation method based on a multi-view deep perceptron framework. Its main components are: a view-specific perceptron network, a multi-view fusion network, hierarchical skip connections, data preprocessing, and training and evaluation. The process is as follows: a view-specific perceptron network extracts two-dimensional shape and hierarchical texture information from each view and produces a mapping per view; an hourglass network composed of an encoder and a decoder builds a high-resolution heatmap for each joint; the skip connections of the hourglass network are used to realise a multi-view fusion network, which combines the information from all available views and produces an accurate three-dimensional pose. By combining hierarchical information with estimated joint heatmaps to infer three-dimensional structure, the invention overcomes the limitations of direct-measurement and observation systems and estimates three-dimensional pose with higher accuracy.

Description

A three-dimensional pose estimation method based on a multi-view deep perceptron framework
Technical field
The present invention relates to the field of pose estimation, and in particular to a three-dimensional pose estimation method based on a multi-view deep perceptron framework.
Background technology
Human pose estimation has been a research hotspot in computer vision in recent years. A computer system extracts images of the human body from pictures or video, then analyses and quantifies the pose in order to judge the person's behaviour. Human pose estimation therefore has extremely broad applications. In abnormal-behaviour detection, the poses of people in surveillance video are detected and analysed in real time; when behaviours such as fighting or theft appear in the picture, the system can record them promptly and raise an alarm. In sports pose analysis, a video analysis system based on three-dimensional pose estimation processes video of athletes in training, computes the relevant kinematic parameters and performs expert analysis, providing targeted guidance for subsequent training; it can also be used for competition tactics analysis, where video of an opponent's matches is processed to extract data on the opponent's poses and help athletes formulate tactical plans. In motion-sensing interactive games, the player's pose and motion in front of the screen are monitored in real time and converted into data passed to the game system, realising human-computer interaction. Systems developed in recent years to capture three-dimensional human pose, such as direct-measurement and observation systems, are expensive, time-consuming and not very accurate.
The present invention proposes a three-dimensional pose estimation method based on a multi-view deep perceptron framework. A view-specific perceptron network extracts two-dimensional shape and hierarchical texture information from each view and produces a mapping per view; an hourglass network composed of an encoder and a decoder builds a high-resolution heatmap for each joint; the skip connections of the hourglass network are used to realise a multi-view fusion network, which combines the information from all available views and produces an accurate three-dimensional pose. By combining hierarchical information with estimated joint heatmaps to infer three-dimensional structure, the invention overcomes the limitations of direct-measurement and observation systems and estimates three-dimensional pose with higher accuracy.
Invention content
To address the high cost, long processing time and limited accuracy of existing systems, the object of the present invention is to provide a three-dimensional pose estimation method based on a multi-view deep perceptron framework, in which a view-specific perceptron network extracts two-dimensional shape and hierarchical texture information from each view and produces a mapping per view, an hourglass network composed of an encoder and a decoder builds a high-resolution heatmap for each joint, and the skip connections of the hourglass network are used to realise a multi-view fusion network that combines the information from all available views and produces an accurate three-dimensional pose.
To solve the above problems, the present invention provides a three-dimensional pose estimation method based on a multi-view deep perceptron framework, whose main components are:
(1) a view-specific perceptron network;
(2) a multi-view fusion network;
(3) hierarchical skip connections;
(4) data preprocessing;
(5) training and evaluation.
The three-dimensional pose estimation method based on the multi-view deep perceptron framework is composed of two networks: a "view-specific perceptron" network, which extracts two-dimensional shape and hierarchical texture information from each view, and a "multi-view fusion" network, which combines the information from all available views and produces an accurate three-dimensional pose.
The view-specific perceptron network extracts rich information from each view, including not only two-dimensional shape but also hierarchical texture information for the subsequent three-dimensional inference. Each two-dimensional body pose is represented by J heatmaps, where J is the number of body joints. Let I^i be the input RGB image of view i, T^i_s the s-th texture feature map of view i, and H^i_j the heatmap of joint j in view i. The mapping of the view-specific perceptron network f for view i is then:

f(I^i) = ({T^i_s}, {H^i_1, …, H^i_J})
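As a concrete illustration of the J-heatmap pose representation described above, the sketch below renders one Gaussian heatmap per joint. The 64-pixel map size and the sigma value are illustrative assumptions, not values given in the patent.

```python
import numpy as np

def render_heatmap(center, size=64, sigma=1.0):
    """Render one joint heatmap as a 2D Gaussian centred on the
    (x, y) joint location, as assumed for the ground-truth targets."""
    xs = np.arange(size)          # column (x) coordinates
    ys = xs[:, None]              # row (y) coordinates
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def pose_to_heatmaps(joints, size=64, sigma=1.0):
    """Stack J heatmaps, one per body joint, to represent a 2D pose."""
    return np.stack([render_heatmap(j, size, sigma) for j in joints])
```

Each map peaks at 1.0 on the annotated joint location and decays with distance from it.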
Intermediate supervision is enforced through a pixel-wise heatmap loss:

L_2D = Σ_j ‖Ĥ^i_j − H^i_j‖²

where ‖·‖ is the Euclidean distance and H^i_j is rendered from the annotated ground-truth two-dimensional pose by a Gaussian kernel whose mean equals the annotated ground truth and whose variance is fixed. An hourglass network composed of an encoder and a decoder is then used.
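A minimal sketch of the pixel-wise heatmap loss used for intermediate supervision, assuming it sums squared Euclidean distances over all J heatmaps (the exact reduction is not spelled out in the text):

```python
import numpy as np

def heatmap_loss(pred, target):
    """Pixel-wise squared-Euclidean loss between predicted and
    ground-truth heatmaps, summed over all joints and pixels.
    pred, target: arrays of shape (J, H, W)."""
    return float(np.sum((pred - target) ** 2))
```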
Further, regarding the hourglass network: the encoder processes the input image with convolution and pooling layers, producing low-resolution feature maps; the decoder processes the low-resolution feature maps with upsampling and convolutional layers to build a high-resolution heatmap for each joint. A key component of the hourglass network is its skip connections: the feature maps saved before each pooling layer are added directly to the corresponding part of the decoder, preventing the loss of high-resolution information from the encoder. These hierarchical skip connections carry rich texture information at different scales; it is therefore proposed to feed them to the multi-view fusion network for more effective three-dimensional reasoning. They provide a richer gradient signal and more three-dimensional cues than a combination of heatmaps and the raw input image alone.
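The encoder-decoder-with-skips idea can be sketched with plain array operations. Average pooling and nearest-neighbour upsampling here stand in for the convolutional stages, which is a deliberate simplification of the actual hourglass blocks:

```python
import numpy as np

def downsample(x):
    # 2x2 average pooling stands in for the encoder's conv + pool stage.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    # Nearest-neighbour upsampling stands in for the decoder stage.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def hourglass(x, depth=2):
    """Encoder-decoder with skip connections: the feature map saved
    before each pooling step is added back to the matching decoder
    level, so high-resolution detail is not lost."""
    skips = []
    for _ in range(depth):
        skips.append(x)                 # save pre-pool features (the skip)
        x = downsample(x)
    for _ in range(depth):
        x = upsample(x) + skips.pop()   # re-inject high-resolution information
    return x
```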
Regarding the multi-view fusion network: it fuses the information from multiple views to synthesise a three-dimensional pose estimate. The input of the network is the concatenation of the outputs of the view-specific perceptron networks for the N different views, and the output is a three-dimensional pose. Each three-dimensional pose skeleton P = {p_1, …, p_J}, with p_j in three-dimensional space, is defined as the set of joint coordinates. The mapping of the multi-view fusion network g is then:

P̂ = g(f(I^1), …, f(I^N))
Assuming that three-dimensional joint annotations are available for the training dataset, the loss function can be defined as:

L_3D = Σ_j ‖p_j − p̂_j‖²

where p_j and p̂_j are the three-dimensional coordinates of the annotated ground truth and of the estimated joint j, respectively.
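Under the assumption that the loss sums squared Euclidean distances over the J joints, it can be written as:

```python
import numpy as np

def pose3d_loss(gt, pred):
    """Sum over joints of squared Euclidean distance between the
    ground-truth and estimated 3D joint coordinates.
    gt, pred: arrays of shape (J, 3)."""
    gt, pred = np.asarray(gt), np.asarray(pred)
    return float(np.sum(np.linalg.norm(gt - pred, axis=1) ** 2))
```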
A bottom-up, data-driven method is proposed that generates the three-dimensional pose skeleton directly from the outputs of the view-specific perceptron networks; the multi-view fusion network is designed as an encoder.
Further, regarding the encoder: two types of encoder are tested. The first is composed of a series of convolutional layers with stride 2, each of which halves the resolution of the feature maps. The second is similar to the first half of an hourglass network, in which the standard convolutional layers are replaced by a stack of residual learning modules interleaved with max-pooling layers. The first and second architectures are called the simple encoder and the half-hourglass network, respectively. For both architectures, the encoder output is forwarded to a fully connected layer of output size 3 × J, which estimates the three-dimensional pose skeleton and against which the training loss is measured. It is found that the half-hourglass network, which benefits from residual modules and periodically inserted max-pooling layers, produces more accurate three-dimensional poses than the simple encoder network.
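A toy sketch of the "simple encoder" variant described above, assuming each stride-2 stage simply halves the feature-map resolution and using random placeholder weights for the final fully connected 3 × J layer (the patent specifies neither channel counts nor initialisation, so both are assumptions):

```python
import numpy as np

def strided_conv(x):
    # Stand-in for a stride-2 convolution: each stage halves the
    # spatial resolution of the feature map.
    return x[::2, ::2]

def simple_encoder(feature_map, num_joints, rng):
    # Apply stride-2 stages until the map collapses to 1 x 1, then a
    # fully connected layer of output size 3 * J regresses the pose.
    x = feature_map
    while min(x.shape) > 1:
        x = strided_conv(x)
    flat = x.ravel()
    w = rng.standard_normal((3 * num_joints, flat.size))  # placeholder weights
    return (w @ flat).reshape(num_joints, 3)              # (J, 3) coordinates
```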
Regarding the hierarchical skip connections: the skip connections of the hourglass network are used to realise the multi-view fusion network. In the proposed framework, each of the four skip connections produced by the encoder part of the hourglass network is processed by a residual module and added to the corresponding part of the half-hourglass network. To handle the multi-view setting, each skip connection is concatenated across views before being fed to the network.
Regarding data preprocessing: to prepare the training images, frames are downsampled from the videos. Each video contains 200 frames at a rate of 30 frames per second; only the odd-numbered frames are used, to prevent overfitting. All images are resized to 256 × 256 pixels and cropped so that the subject is centred. Three-dimensional joint annotations are provided by a motion-capture system.
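The odd-frame subsampling step can be sketched as follows (1-based frame indexing is an assumption; the patent does not say how frames are numbered):

```python
def select_training_frames(num_frames=200):
    """Keep only the odd-numbered frames of each 200-frame, 30 fps
    video, as described above, to reduce redundancy between
    consecutive frames and limit overfitting."""
    return [i for i in range(1, num_frames + 1) if i % 2 == 1]
```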
Further, regarding the three-dimensional joint annotations: 23 markers are selected to define 14 joints, comprising the head, neck, left/right shoulders, left/right elbows, left/right wrists, left/right hips, left/right knees and left/right ankles, and only the trajectories of these joints are used to train the network. The coordinates of each joint are normalised to the range zero to one over the entire dataset. After preprocessing, the data consist of the cropped image of each odd-numbered frame of each video together with the corresponding normalised three-dimensional joint annotations.
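One plausible reading of the zero-to-one normalisation is a min-max scaling per coordinate axis over the whole dataset; the exact scheme is not specified, so this is an assumption:

```python
import numpy as np

def normalize_joints(coords):
    """Normalise joint coordinates to [0, 1] per axis over the whole
    dataset. coords: array of shape (frames, J, 3)."""
    coords = np.asarray(coords, dtype=float)
    lo = coords.min(axis=(0, 1))          # per-axis minimum over all frames
    hi = coords.max(axis=(0, 1))          # per-axis maximum over all frames
    return (coords - lo) / (hi - lo)
```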
Regarding training and evaluation: a two-stage training strategy is proposed. In the first stage the hourglass model is fine-tuned on the weightlifting dataset with a learning rate of 0.00025 for 5 epochs. In the second stage the multi-view fusion model is trained from scratch on the weightlifting dataset using the dual-view images and the corresponding normalised three-dimensional pose skeletons, with a learning rate of 0.0005 for 50 epochs.
To assess the performance of the network in the single-view and dual-view settings, two experiments are carried out. First, the network is trained in the single-view setting, using the 90-degree and 135-degree views separately. Second, the network is trained using the two views jointly as the dual-view input. In all experiments, repetitions of the weightlifting task by all subjects are used as the training dataset, and further repetitions by all subjects as the test dataset.
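The two-stage schedule above can be summarised as a small configuration table; the key names are illustrative, while the numbers are taken from the text:

```python
# Hypothetical summary of the two-stage training schedule.
TRAINING_STAGES = [
    {"stage": 1, "model": "hourglass (fine-tuned)",
     "learning_rate": 0.00025, "epochs": 5},
    {"stage": 2, "model": "multi-view fusion (from scratch)",
     "learning_rate": 0.0005, "epochs": 50},
]

def total_epochs(stages):
    # Total number of training epochs across both stages.
    return sum(s["epochs"] for s in stages)
```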
Description of the drawings
Fig. 1 is the system flow chart of the three-dimensional pose estimation method based on a multi-view deep perceptron framework of the present invention.
Fig. 2 shows the multi-view deep perceptron framework of the three-dimensional pose estimation method of the present invention.
Fig. 3 shows the multi-view fusion network of the three-dimensional pose estimation method of the present invention.
Specific implementation mode
It should be noted that, in the absence of conflict, the embodiments of the present application and the features in those embodiments may be combined with one another. The invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is the system flow chart of the three-dimensional pose estimation method based on a multi-view deep perceptron framework of the present invention. It mainly comprises the view-specific perceptron network, the multi-view fusion network, hierarchical skip connections, data preprocessing, and training and evaluation.
The view-specific perceptron network extracts two-dimensional shape and hierarchical texture information from each view and produces a mapping per view; an hourglass network composed of an encoder and a decoder builds a high-resolution heatmap for each joint; the skip connections of the hourglass network are used to realise the multi-view fusion network, which combines the information from all available views and produces an accurate three-dimensional pose.
To prepare the training images, frames are downsampled from the videos. Each video contains 200 frames at a rate of 30 frames per second; only the odd-numbered frames are used, to prevent overfitting. All images are resized to 256 × 256 pixels and cropped so that the subject is centred. Three-dimensional joint annotations are provided by a motion-capture system.
23 markers are selected to define 14 joints, comprising the head, neck, left/right shoulders, left/right elbows, left/right wrists, left/right hips, left/right knees and left/right ankles, and only the trajectories of these joints are used to train the network. The coordinates of each joint are normalised to the range zero to one over the entire dataset. After preprocessing, the data consist of the cropped image of each odd-numbered frame of each video together with the corresponding normalised three-dimensional joint annotations.
A two-stage training strategy is proposed. In the first stage the hourglass model is fine-tuned on the weightlifting dataset with a learning rate of 0.00025 for 5 epochs. In the second stage the multi-view fusion model is trained from scratch on the weightlifting dataset using the dual-view images and the corresponding normalised three-dimensional pose skeletons, with a learning rate of 0.0005 for 50 epochs.
To assess the performance of the network in the single-view and dual-view settings, two experiments are carried out. First, the network is trained in the single-view setting, using the 90-degree and 135-degree views separately. Second, the network is trained using the two views jointly as the dual-view input. In all experiments, repetitions of the weightlifting task by all subjects are used as the training dataset, and further repetitions by all subjects as the test dataset.
Fig. 2 shows the multi-view deep perceptron framework of the three-dimensional pose estimation method of the present invention. The view-specific perceptron network extracts rich information from each view, including not only two-dimensional shape but also hierarchical texture information for the subsequent three-dimensional inference. Each two-dimensional body pose is represented by J heatmaps, where J is the number of body joints. Let I^i be the input RGB image of view i, T^i_s the s-th texture feature map of view i, and H^i_j the heatmap of joint j in view i. The mapping of the view-specific perceptron network f for view i is then:

f(I^i) = ({T^i_s}, {H^i_1, …, H^i_J})
Intermediate supervision is enforced through a pixel-wise heatmap loss:

L_2D = Σ_j ‖Ĥ^i_j − H^i_j‖²

where ‖·‖ is the Euclidean distance and H^i_j is rendered from the annotated ground-truth two-dimensional pose by a Gaussian kernel whose mean equals the annotated ground truth and whose variance is fixed. An hourglass network composed of an encoder and a decoder is then used.
The encoder processes the input image with convolution and pooling layers, producing low-resolution feature maps; the decoder processes the low-resolution feature maps with upsampling and convolutional layers to build a high-resolution heatmap for each joint. A key component of the hourglass network is its skip connections: the feature maps saved before each pooling layer are added directly to the corresponding part of the decoder, preventing the loss of high-resolution information from the encoder. These hierarchical skip connections carry rich texture information at different scales; it is therefore proposed to feed them to the multi-view fusion network for more effective three-dimensional reasoning. They provide a richer gradient signal and more three-dimensional cues than a combination of heatmaps and the raw input image alone.
The skip connections of the hourglass network are used to realise the multi-view fusion network. In the proposed framework, each of the four skip connections produced by the encoder part of the hourglass network is processed by a residual module and added to the corresponding part of the half-hourglass network. To handle the multi-view setting, each skip connection is concatenated across views before being fed to the network.
Fig. 3 shows the multi-view fusion network of the three-dimensional pose estimation method of the present invention. The multi-view fusion network fuses the information from multiple views to synthesise a three-dimensional pose estimate. The input of the network is the concatenation of the outputs of the view-specific perceptron networks for the N different views, and the output is a three-dimensional pose. Each three-dimensional pose skeleton P = {p_1, …, p_J}, with p_j in three-dimensional space, is defined as the set of joint coordinates. The mapping of the multi-view fusion network g is then:

P̂ = g(f(I^1), …, f(I^N))
Assuming that three-dimensional joint annotations are available for the training dataset, the loss function can be defined as:

L_3D = Σ_j ‖p_j − p̂_j‖²

where p_j and p̂_j are the three-dimensional coordinates of the annotated ground truth and of the estimated joint j, respectively.
A bottom-up, data-driven method is proposed that generates the three-dimensional pose skeleton directly from the outputs of the view-specific perceptron networks; the multi-view fusion network is designed as an encoder.
Two types of encoder are tested. The first is composed of a series of convolutional layers with stride 2, each of which halves the resolution of the feature maps. The second is similar to the first half of an hourglass network, in which the standard convolutional layers are replaced by a stack of residual learning modules interleaved with max-pooling layers. The first and second architectures are called the simple encoder and the half-hourglass network, respectively. For both architectures, the encoder output is forwarded to a fully connected layer of output size 3 × J, which estimates the three-dimensional pose skeleton and against which the training loss is measured. It is found that the half-hourglass network, which benefits from residual modules and periodically inserted max-pooling layers, produces more accurate three-dimensional poses than the simple encoder network.
It will be understood by those skilled in the art that the present invention is not limited to the details of the above embodiments and may be realised in other specific forms without departing from its spirit and scope. Moreover, those skilled in the art may make various modifications and variations to the invention without departing from its spirit and scope, and such improvements and modifications shall also be regarded as falling within the protection scope of the invention. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications falling within the scope of the invention.

Claims (10)

1. A three-dimensional pose estimation method based on a multi-view deep perceptron framework, characterised by mainly comprising a view-specific perceptron network (one); a multi-view fusion network (two); hierarchical skip connections (three); data preprocessing (four); and training and evaluation (five).
2. The three-dimensional pose estimation method based on a multi-view deep perceptron framework according to claim 1, characterised in that the method is composed of two networks: a "view-specific perceptron" network, which extracts two-dimensional shape and hierarchical texture information from each view, and a "multi-view fusion" network, which combines the information from all available views and produces an accurate three-dimensional pose.
3. The view-specific perceptron network (one) according to claim 1, characterised in that the view-specific perceptron network extracts rich information from each view, including not only two-dimensional shape but also hierarchical texture information for the subsequent three-dimensional inference; each two-dimensional body pose is represented by J heatmaps, where J is the number of body joints; let I^i be the input RGB image of view i, T^i_s the s-th texture feature map of view i, and H^i_j the heatmap of joint j in view i; the mapping of the view-specific perceptron network f for view i is then f(I^i) = ({T^i_s}, {H^i_1, …, H^i_J}).
Intermediate supervision is enforced through a pixel-wise heatmap loss L_2D = Σ_j ‖Ĥ^i_j − H^i_j‖², where ‖·‖ is the Euclidean distance and H^i_j is rendered from the annotated ground-truth two-dimensional pose by a Gaussian kernel whose mean equals the annotated ground truth and whose variance is fixed; an hourglass network composed of an encoder and a decoder is then used.
4. The hourglass network according to claim 3, characterised in that the encoder processes the input image with convolution and pooling layers, producing low-resolution feature maps, and the decoder processes the low-resolution feature maps with upsampling and convolutional layers to build a high-resolution heatmap for each joint; a key component of the hourglass network is its skip connections: the feature maps saved before each pooling layer are added directly to the corresponding part of the decoder, preventing the loss of high-resolution information from the encoder; these hierarchical skip connections carry rich texture information at different scales; it is therefore proposed to feed them to the multi-view fusion network for more effective three-dimensional reasoning; they provide a richer gradient signal and more three-dimensional cues than a combination of heatmaps and the raw input image alone.
5. The multi-view fusion network (two) according to claim 1, characterised in that the multi-view fusion network fuses the information from multiple views to synthesise a three-dimensional pose estimate; the input of the network is the concatenation of the outputs of the view-specific perceptron networks for the N different views, and the output is a three-dimensional pose; each three-dimensional pose skeleton P = {p_1, …, p_J}, with p_j in three-dimensional space, is defined as the set of joint coordinates; the mapping of the multi-view fusion network g is then P̂ = g(f(I^1), …, f(I^N)).
Assuming that three-dimensional joint annotations are available for the training dataset, the loss function can be defined as L_3D = Σ_j ‖p_j − p̂_j‖², where p_j and p̂_j are the three-dimensional coordinates of the annotated ground truth and of the estimated joint j, respectively.
A bottom-up, data-driven method is proposed that generates the three-dimensional pose skeleton directly from the outputs of the view-specific perceptron networks; the multi-view fusion network is designed as an encoder.
6. The encoder according to claim 5, characterised in that two types of encoder are tested: the first is composed of a series of convolutional layers with stride 2, each of which halves the resolution of the feature maps; the second is similar to the first half of an hourglass network, in which the standard convolutional layers are replaced by a stack of residual learning modules interleaved with max-pooling layers; the first and second architectures are called the simple encoder and the half-hourglass network, respectively; for both architectures, the encoder output is forwarded to a fully connected layer of output size 3 × J, which estimates the three-dimensional pose skeleton and against which the training loss is measured; it is found that the half-hourglass network, which benefits from residual modules and periodically inserted max-pooling layers, produces more accurate three-dimensional poses than the simple encoder network.
7. The hierarchical skip connections (three) according to claim 1, characterised in that the skip connections of the hourglass network are used to realise the multi-view fusion network; in the proposed framework, each of the four skip connections produced by the encoder part of the hourglass network is processed by a residual module and added to the corresponding part of the half-hourglass network; to handle the multi-view setting, each skip connection is concatenated across views before being fed to the network.
8. The data preprocessing (four) according to claim 1, characterised in that, to prepare the training images, frames are downsampled from the videos; each video contains 200 frames at a rate of 30 frames per second, and only the odd-numbered frames are used, to prevent overfitting; all images are resized to 256 × 256 pixels and cropped so that the subject is centred; three-dimensional joint annotations are provided by a motion-capture system.
9. The three-dimensional joint annotations according to claim 8, characterised in that 23 markers are selected to define 14 joints, comprising the head, neck, left/right shoulders, left/right elbows, left/right wrists, left/right hips, left/right knees and left/right ankles, and only the trajectories of these joints are used to train the network; the coordinates of each joint are normalised to the range zero to one over the entire dataset; after preprocessing, the data consist of the cropped image of each odd-numbered frame of each video together with the corresponding normalised three-dimensional joint annotations.
10. The training and evaluation (five) according to claim 1, characterised in that a two-stage training strategy is proposed: in the first stage the hourglass model is fine-tuned on the weightlifting dataset with a learning rate of 0.00025 for 5 epochs; in the second stage the multi-view fusion model is trained from scratch on the weightlifting dataset using the dual-view images and the corresponding normalised three-dimensional pose skeletons, with a learning rate of 0.0005 for 50 epochs.
To assess the performance of the network in the single-view and dual-view settings, two experiments are carried out: first, the network is trained in the single-view setting, using the 90-degree and 135-degree views separately; second, the network is trained using the two views jointly as the dual-view input; in all experiments, repetitions of the weightlifting task by all subjects are used as the training dataset, and further repetitions by all subjects as the test dataset.
CN201810171088.1A 2018-03-01 2018-03-01 A three-dimensional pose estimation method based on a multi-view deep perceptron framework Withdrawn CN108389227A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810171088.1A CN108389227A (en) 2018-03-01 2018-03-01 A three-dimensional pose estimation method based on a multi-view deep perceptron framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810171088.1A CN108389227A (en) 2018-03-01 2018-03-01 A three-dimensional pose estimation method based on a multi-view deep perceptron framework

Publications (1)

Publication Number Publication Date
CN108389227A true CN108389227A (en) 2018-08-10

Family

ID=63069225

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810171088.1A Withdrawn CN108389227A (en) 2018-03-01 2018-03-01 A three-dimensional pose estimation method based on a multi-view deep perceptron framework

Country Status (1)

Country Link
CN (1) CN108389227A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528826A (en) * 2016-11-18 2017-03-22 Guangdong Polytechnic Normal University Deep learning-based multi-view design patent image retrieval method
CN107609587A (en) * 2017-09-11 2018-01-19 Zhejiang University of Technology Multi-class multi-view data generation method based on deep convolutional generative adversarial networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RAHIL MEHRIZI: "Toward Marker-free 3D Pose Estimation in Lifting: A Deep Multi-view Solution", arXiv:1802.01741v1 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376673A (en) * 2018-10-31 2019-02-22 Nanjing Tech University Method for identifying unsafe behaviors of underground coal mine personnel based on human body posture estimation
CN109376673B (en) * 2018-10-31 2022-02-25 Nanjing Tech University Method for identifying unsafe behaviors of underground coal mine personnel based on human body posture estimation
CN110751039A (en) * 2019-09-18 2020-02-04 Ping An Technology (Shenzhen) Co., Ltd. Multi-view 3D human body posture estimation method and related device
WO2021051526A1 (en) * 2019-09-18 2021-03-25 Ping An Technology (Shenzhen) Co., Ltd. Multi-view 3D human pose estimation method and related apparatus
CN110751039B (en) * 2019-09-18 2023-07-25 Ping An Technology (Shenzhen) Co., Ltd. Multi-view 3D human body posture estimation method and related device
CN112257686A (en) * 2020-12-16 2021-01-22 Beijing Wodong Tianjun Information Technology Co., Ltd. Training method and device for human body posture recognition model and storage medium
CN112542230A (en) * 2020-12-23 2021-03-23 Nankai University Multi-modal interactive segmentation method and system for medical images
CN114036969A (en) * 2021-03-16 2022-02-11 Shanghai University 3D human body action recognition algorithm under multi-view conditions

Similar Documents

Publication Publication Date Title
CN108389227A (en) A three-dimensional pose estimation method based on a multi-view deep perceptron framework
Li et al. [Retracted] Intelligent Sports Training System Based on Artificial Intelligence and Big Data
Zhu et al. Visual7w: Grounded question answering in images
CN105631861B (en) Restore the method for 3 D human body posture from unmarked monocular image in conjunction with height map
CN104035557B (en) Kinect action identification method based on joint activeness
CN105787439A (en) Depth image human body joint positioning method based on convolution nerve network
CN103227888B (en) A kind of based on empirical mode decomposition with the video stabilization method of multiple interpretational criteria
Avola et al. Deep temporal analysis for non-acted body affect recognition
CN102521595A (en) Method for extracting image region of interest based on eye movement data and bottom-layer features
CN105912991A (en) Behavior identification method based on 3D point cloud and key bone nodes
Reveret et al. 3D visualization of body motion in speed climbing
CN109045664A (en) Diving scoring method, server and system based on deep learning
CN106447707A (en) Image real-time registration method and system
CN112001217A (en) Multi-person human body posture estimation algorithm based on deep learning
Nie et al. [Retracted] The Construction of Basketball Training System Based on Motion Capture Technology
CN103310191A (en) Human body action identification method for motion information imaging
Guo et al. PhyCoVIS: A visual analytic tool of physical coordination for cheer and dance training
Sarwar et al. Skeleton Based Keyframe Detection Framework for Sports Action Analysis: Badminton Smash Case
CN116503954A (en) Physical education testing method and system based on human body posture estimation
Li [Retracted] Tennis Technology Recognition and Training Attitude Analysis Based on Artificial Intelligence Sensor
Palanimeera et al. Yoga posture recognition by learning spatial-temporal feature with deep learning techniques
Shi et al. Sport training action correction by using convolutional neural network
Li et al. Analytical Model of Action Fusion in Sports Tennis Teaching by Convolutional Neural Networks
Ascenso Development of a non-invasive motion capture system for swimming biomechanics
Luo A sports digital training system based on middle and bottom visual information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20180810