CN106897697A

CN106897697A - A person and pose detection method based on a visual compiler

Info

Publication number: CN106897697A
Application number: CN201710103927.1A
Authority: CN
Inventors: 夏春秋
Original assignee: Shenzhen Vision Technology Co Ltd
Current assignee: Shenzhen Vision Technology Co Ltd
Priority date: 2017-02-24
Filing date: 2017-02-24
Publication date: 2017-06-27

Abstract

The present invention proposes a person and pose detection method based on a visual compiler. Its main components are: data synthesis from a scene description; learning a network from the synthetic data; defining the network using basic blocks; and joint localization with a pose network (Pose Net). The process is as follows. A scene description is first used as the input to the visual compiler, and the pedestrian detection and pose estimation systems are trained with calibrated ground-truth annotations; the network is then learned from the synthetic data. Next, two basic units, the residual module and the spatial belief module, are used to define the network, and pedestrians are finally localized with the pose network. The invention automatically obtains annotations for detection, body part locations, and segmentation masks; it localizes pedestrians with a camera, estimates their poses, and performs activity analysis; and it reduces the influence of illumination, occlusion, and similar factors on detection, effectively improving recognition efficiency.

Description

A person and pose detection method based on a visual compiler

Technical field

The present invention relates to the field of human pose detection, and more particularly to a person and pose detection method based on a visual compiler.

Background

The detection of human actions and poses is widely used in fields such as video surveillance, virtual reality, and intelligent human-computer interaction, and has become a research hotspot in computer vision; it can be used, for example, for intelligent monitoring of public places and for monitoring dangerous poses in crowds. Although research on human pose detection has made significant progress in recent years, the high complexity and variability of human poses mean that recognition accuracy and efficiency do not yet fully meet the requirements of the relevant industries. Differences in illumination, viewing angle, and background cause the same human behavior to differ in pose and characteristics. In addition, self-occlusion, partial occlusion, individual differences between people, and multi-person recognition all reflect the spatial complexity of human pose detection. Person and pose detection methods therefore require further research.

The present invention proposes a person and pose detection method based on a visual compiler. A scene description is first used as the input to the visual compiler, and the pedestrian detection and pose estimation systems are trained with calibrated ground-truth annotations; the network is then learned from the synthetic data. Next, two basic units, the residual module and the spatial belief module, are used to define the network, and pedestrians are finally localized with the pose network. The invention automatically obtains annotations for detection, body part locations, and segmentation masks; it localizes pedestrians with a camera, estimates their poses, and performs activity analysis; and it reduces the influence of illumination, occlusion, and similar factors on detection, effectively improving recognition efficiency.

Summary of the invention

To address the influence of illumination, occlusion, and similar factors on detection, an object of the present invention is to provide a person and pose detection method based on a visual compiler. A scene description is first used as the input to the visual compiler, and the pedestrian detection and pose estimation systems are trained with calibrated ground-truth annotations; the network is then learned from the synthetic data. Next, two basic units, the residual module and the spatial belief module, are used to define the network, and pedestrians are finally localized with the pose network.

To solve the above problems, the present invention provides a person and pose detection method based on a visual compiler, whose main components include:

(1) data synthesis from the scene description;

(2) learning a network from the synthetic data;

(3) defining the network using basic blocks;

(4) joint localization with the pose network (Pose Net).

Wherein the visual compiler is used to generate a scene-specific human detection and pose estimation system; its known information comprises:

(1) the intrinsic and extrinsic parameters of the camera;

(2) the rough physical geometric layout of the scene (regions where people walk, sit, or stand) and the regions of the scene that may be occluded (obstacles) or are physically nonexistent (walls);

(3) the poses and orientations of pedestrians in each region of the scene;

Together with a single image, the scene description serves as the input to the compiler, which synthesizes physically grounded and geometrically accurate people in the valid regions of the scene. The compiler learns a set of region-specific models for human detection, pose estimation, and segmentation. During inference, each of these region-specific models is run simultaneously on its corresponding region.
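The region-specific inference described above can be illustrated with a minimal sketch. It assumes rectangular regions and uses toy callables in place of the learned models; the function name `run_region_models`, the region names, and the region layout are hypothetical, and the loop runs the models one after another rather than simultaneously.

```python
import numpy as np

def run_region_models(image, regions, models):
    """Apply each region's model to its own crop and paste the result back.

    regions: dict name -> (top, left, height, width)   (hypothetical layout)
    models:  dict name -> callable mapping an HxWx3 crop to an HxW heatmap
    """
    out = np.zeros(image.shape[:2], dtype=np.float32)
    for name, (top, left, h, w) in regions.items():
        crop = image[top:top + h, left:left + w]
        out[top:top + h, left:left + w] = models[name](crop)
    return out

# Toy stand-ins for learned models: each returns a constant heatmap.
regions = {"walkway": (0, 0, 4, 8), "bench": (4, 0, 4, 8)}
models = {
    "walkway": lambda crop: np.full(crop.shape[:2], 0.25, dtype=np.float32),
    "bench": lambda crop: np.full(crop.shape[:2], 0.75, dtype=np.float32),
}
image = np.zeros((8, 8, 3), dtype=np.float32)
heatmap = run_region_models(image, regions, models)
```

Each model only ever sees pixels from its own region, which is what allows it to specialize to the pedestrian appearance expected there.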

Wherein the data synthesis from the scene description addresses the need for high-quality, calibrated ground-truth annotations to train the pedestrian detection and pose estimation systems. Without a complicated manual labeling process, the visual compiler uses the scene description to simulate pedestrian appearance suited to each region of the scene, and can therefore scale to a large number of scenes.

Further, given the scene description, the compiler first generates a planar 3D model of the scene that encloses the obstacles, i.e., it fits a ground plane, planar walls, and cuboids. The camera parameters are then used to account for lens characteristics (for example, perspective distortion in wide-angle cameras) and to render the scene with geometrically accurate people. In addition to rendering the appearance of a person at each "valid pedestrian location" in the scene, the rendering pipeline can precisely control variations in human appearance such as gender, height, width, orientation, and pose. The virtual human database contains 139 different models covering gender, clothing color, and ethnicity. The compiler can render orientations from 0 to 360 degrees, or it can be guided by any previously available information.

To generate calibrated ground-truth labels for the rendered images, the following labels are first associated with each 3D virtual model: the 3D positions of 27 body parts, the segmentation mask, and the center of the person used for detection. The 2D labels used for training are then extracted automatically from the 3D annotations and the camera projection parameters; this process yields consistent, noise-free labels. In addition, all variations in appearance, orientation, pose, and position can be sampled uniformly.
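The extraction of 2D labels from 3D annotations can be sketched with a standard pinhole projection. This is only an illustration under stated assumptions: lens distortion is ignored, and the intrinsic matrix `K`, the pose `R`, `t`, and the joint coordinates are made-up values rather than parameters from the text.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project Nx3 world points to Nx2 pixel coordinates with a pinhole model."""
    cam = points_3d @ R.T + t          # world -> camera coordinates
    uvw = cam @ K.T                    # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide

# Hypothetical camera: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)       # camera axes aligned with world axes
t = np.zeros(3)     # camera at the world origin

# Two made-up 3D joint positions, 5 units in front of the camera.
joints_3d = np.array([[0.0, 0.0, 5.0],
                      [0.5, -1.0, 5.0]])
joints_2d = project_points(joints_3d, K, R, t)
# joints_2d is [[320., 240.], [370., 140.]]
```

Because the 2D labels are computed from the same camera model used for rendering, they are exact for every rendered person, which is why no manual annotation noise enters the training data.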

Wherein, in learning the network from the synthetic data, the generated scene-specific data is used: the visual compiler produces a visual program in the form of a deep neural network, trained by standard procedures according to the scene description.

The visual program generated by the visual compiler jointly performs the following tasks: localizing pedestrians, locating the landmarks that define their pose, and segmenting the pixels that define them. To predict pedestrian location, pose, and segmentation mask, the network must model the holistic appearance of the pedestrian, the local appearance of the landmarks, and a prior over the valid spatial configurations of these parts. To capture appearance, both of the whole pedestrian and of the local landmarks, the network learns to map the RGB input to heatmaps for the precise localization of the pedestrian, the local landmarks, and the segmentation mask, which is a heatmap regression problem. The prior on the spatial relationships between part locations is learned by the spatial belief (SB) module, which considers the correlations among the heatmaps of the pedestrian, the local landmarks, and the segmentation mask. This specific instantiation of the visual program is called the pose network (Pose Net).
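In heatmap regression, the training target for a landmark is an image whose values peak at the landmark position. The text does not state how the ideal heatmaps are constructed; the sketch below assumes a 2D Gaussian, a common choice, and the function name and the value of `sigma` are hypothetical.

```python
import numpy as np

def gaussian_heatmap(height, width, center_xy, sigma=2.0):
    """Return an HxW target heatmap with a Gaussian bump at center_xy = (x, y)."""
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center_xy
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Target for a landmark at column 10, row 20 of a 32x32 map.
target = gaussian_heatmap(32, 32, center_xy=(10, 20))
peak = np.unravel_index(np.argmax(target), target.shape)  # (row, col)
```

The network is then trained so that its output map for that landmark matches `target` pixel by pixel.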

Further, conventional human pose estimation systems generally treat detection and pose estimation as independent, sequential tasks, with pose estimation following detection. These systems either expect ground-truth human detections or use an off-the-shelf detector for coarse detection. However, detection and part localization are highly complementary processes: detection strongly influences the pose estimation process, and accurate localization of parts strengthens the belief that a person is present at the corresponding location. The pose network model therefore couples these tasks, improving the efficiency of both pedestrian detection and pose estimation.

Wherein the network is defined using basic blocks, namely two basic units: the residual module and the spatial belief module. The residual unit is introduced to solve the vanishing gradient problem in training deep convolutional networks. This basic unit is used as the building block of the network, and the spatial belief (SB) module is defined on top of it.

Further, the spatial belief module maps the input features of the block to part localization beliefs (heatmaps), while processing both the input features and the part localization beliefs from the previous block. The image features and part localization beliefs generated by the block are concatenated to form the input of the next block. Given an input x to the SB module, the output y is given by:

where the operator denotes the concatenation operation, r = f_res(x) is the operation performed by the non-identity branch of the residual unit, and b = f_belief(x) denotes the mapping from the input x to the desired heatmaps (person detection, part detection, and segmentation mask) through a series of 1 × 1 convolutions. The SB unit lets the network take contextual information into account when estimating detection beliefs. The part localization belief b_i from the i-th SB unit propagates to the next, (i+1)-th, SB block, where it is processed by the non-identity path, capturing the correlations among the heatmaps of the individual parts. Applying the SB unit recursively, it can be seen that

Because of the concatenation operation, the identity shortcut and f_res(·) in each SB unit process the beliefs from all previous SB units. In addition, the detection belief maps generated in each SB unit also take into account the part localization beliefs of all previous SB units, each of which is computed with a different receptive field. The network therefore makes use of detection belief maps from multiple stages and with multiple receptive field sizes.
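The formula for y is not reproduced in the text, so the following is only a sketch of one reading that is consistent with the description: the block's image features (identity shortcut plus residual branch) are concatenated with its belief heatmaps along the channel axis. The 1 × 1 convolution is written as a per-pixel matrix product, a ReLU over a 1 × 1 convolution stands in for the residual branch, and the channel counts and random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, weight):
    """1x1 convolution on an HxWxC map: a per-pixel linear map over channels."""
    return x @ weight

def sb_unit(x, w_res, w_belief):
    """One spatial belief unit under the assumed form y = concat(x + r, b)."""
    r = np.maximum(conv1x1(x, w_res), 0.0)   # stand-in for the non-identity branch
    b = conv1x1(x, w_belief)                 # belief heatmaps from a 1x1 convolution
    features = x + r                         # identity shortcut plus residual branch
    return np.concatenate([features, b], axis=-1), b

H, W, C, n_maps = 8, 8, 16, 5   # hypothetical sizes; n_maps = number of heatmaps
x = rng.standard_normal((H, W, C))
channels = []
for _ in range(3):              # three SB units, as in the pose network
    c_in = x.shape[-1]
    w_res = rng.standard_normal((c_in, c_in)) * 0.1
    w_belief = rng.standard_normal((c_in, n_maps)) * 0.1
    x, b = sb_unit(x, w_res, w_belief)
    channels.append(x.shape[-1])
# channels is [21, 26, 31]: each unit appends its n_maps belief channels
```

Under this reading, the beliefs of every earlier unit remain in the input of every later unit, which matches the statement that each SB unit processes the beliefs from all previous SB units.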

Wherein, in the joint localization with the pose network (Pose Net), given an input image, the pose network jointly localizes pedestrians, producing the locations of body parts and pedestrians in the form of heatmaps. The network consists entirely of convolutional layers, which preserves spatial context while improving computational efficiency. To achieve accurate pedestrian localization and pose estimation, dense heatmap prediction is used throughout the network, preventing the information loss caused by subsampling (pooling).

The input image first passes through convolutional layers with 5 × 5 filters and 3 × 3 filters, following the design of residual networks for object recognition. These are followed by 3 SB units, each with convolution filters of large receptive field, which enlarge the receptive field of the network while still performing dense prediction. The SB units are followed by two 1 × 1 convolutional layers that map the image features to heatmaps. Finally, skip connections fuse information from multiple context regions, combining features with receptive fields at various scales. The bounding box for detection is inferred from the heatmaps localized around the joints, the body center, and the segmentation.
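The effect of stacking stride-1 convolutions on the receptive field can be worked out with simple arithmetic. The text does not say how the SB units obtain their large receptive fields; the sketch below assumes dilated 3 × 3 filters purely for illustration, and the dilation rates are hypothetical.

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions.

    layers: list of (kernel_size, dilation) pairs.
    """
    rf = 1
    for kernel, dilation in layers:
        rf += (kernel - 1) * dilation
    return rf

stem = [(5, 1), (3, 1)]                      # 5x5 then 3x3, no dilation
rf_stem = receptive_field(stem)              # 1 + 4 + 2 = 7
# Hypothetical dilation rates 2, 4, 8 for three later 3x3 layers.
rf_dilated = receptive_field(stem + [(3, 2), (3, 4), (3, 8)])  # 7 + 4 + 8 + 16 = 35
```

Because every layer keeps stride 1, the output stays at full resolution while the receptive field grows, which is the combination of dense prediction and large context the paragraph describes.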

By optimizing the network, the multi-task mean squared error loss L between the network predictions and the ideal heatmaps for pedestrian detection, part localization, and segmentation mask is minimized; it is defined as follows,

where α, β, and γ are hyperparameters that trade off the different loss terms.
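The loss formula itself is not reproduced in the text. One form consistent with the description is a weighted sum of three mean squared errors, one per task. In the sketch below, the assignment of α, β, and γ to the detection, part, and segmentation terms, the dictionary keys, and the function name are all assumptions.

```python
import numpy as np

def multitask_mse(pred, target, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of per-task mean squared errors between heatmaps."""
    def mse(a, b):
        return float(np.mean((a - b) ** 2))
    return (alpha * mse(pred["det"], target["det"])
            + beta * mse(pred["part"], target["part"])
            + gamma * mse(pred["seg"], target["seg"]))

shape = (4, 4)
target = {k: np.zeros(shape) for k in ("det", "part", "seg")}
pred = {
    "det": np.full(shape, 1.0),   # per-pixel error 1 -> MSE 1
    "part": np.full(shape, 2.0),  # per-pixel error 2 -> MSE 4
    "seg": np.zeros(shape),       # perfect prediction -> MSE 0
}
loss = multitask_mse(pred, target, alpha=1.0, beta=0.5, gamma=2.0)
# loss = 1.0 * 1 + 0.5 * 4 + 2.0 * 0 = 3.0
```

Raising one weight relative to the others makes the optimizer prioritize that task's heatmaps, which is the trade-off the hyperparameters control.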

Further, for the pose network, the visual compiler uses high-quality synthetic images of pedestrian appearance in the scene to learn a scene- and region-specific, spatially varying fully convolutional neural network for simultaneous pedestrian detection, pose estimation, and segmentation; the network can be trained from scratch on synthetic data.

Brief description of the drawings

Fig. 1 is a system flow chart of the person and pose detection method based on a visual compiler according to the present invention.

Fig. 2 shows the visual compiler of the person and pose detection method based on a visual compiler according to the present invention.

Fig. 3 shows the definition of the network using basic blocks in the person and pose detection method based on a visual compiler according to the present invention.

Fig. 4 shows the joint localization with the pose network (Pose Net) in the person and pose detection method based on a visual compiler according to the present invention.

Detailed description of the embodiments

It should be noted that, provided they do not conflict, the embodiments of the present application and the features of those embodiments may be combined with one another. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a system flow chart of the person and pose detection method based on a visual compiler according to the present invention. The method mainly comprises data synthesis from the scene description, learning a network from the synthetic data, defining the network using basic blocks, and joint localization with the pose network (Pose Net).

The data synthesis from the scene description addresses the need for high-quality, calibrated ground-truth annotations to train the pedestrian detection and pose estimation systems. Without a complicated manual labeling process, the visual compiler uses the scene description to simulate pedestrian appearance suited to each region of the scene, and can therefore scale to a large number of scenes.

Given the scene description, the compiler first generates a planar 3D model of the scene that encloses the obstacles, i.e., it fits a ground plane, planar walls, and cuboids. The camera parameters are then used to account for lens characteristics (for example, perspective distortion in wide-angle cameras) and to render the scene with geometrically accurate people. In addition to rendering the appearance of a person at each "valid pedestrian location" in the scene, the rendering pipeline can precisely control variations in human appearance such as gender, height, width, orientation, and pose. The virtual human database contains 139 different models covering gender, clothing color, and ethnicity. The compiler can render orientations from 0 to 360 degrees, or it can be guided by any previously available information.
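The controlled variation of rendered people can be sketched as uniform sampling over rendering parameters. Only the database size (139 models) and the 0 to 360 degree orientation range come from the text; the parameter names, the height range, and the ground-plane locations are hypothetical, and the rendering step itself is omitted.

```python
import random

NUM_MODELS = 139  # size of the virtual human database stated in the text

def sample_render_params(valid_locations, rng):
    """Draw one set of rendering parameters uniformly at random."""
    return {
        "model_id": rng.randrange(NUM_MODELS),
        "location": rng.choice(valid_locations),
        "orientation_deg": rng.uniform(0.0, 360.0),
        "height_scale": rng.uniform(0.9, 1.1),  # hypothetical range
    }

rng = random.Random(0)
# Made-up ground-plane (x, y) positions standing in for valid pedestrian locations.
valid_locations = [(1.0, 2.0), (3.5, 0.5), (4.0, 4.0)]
samples = [sample_render_params(valid_locations, rng) for _ in range(100)]
```

If prior information about a region is available, for instance that people there mostly face one direction, the uniform orientation draw could be replaced by a distribution that reflects it.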

In learning the network from the synthetic data, the generated scene-specific data is used: the visual compiler produces a visual program in the form of a deep neural network, trained by standard procedures according to the scene description.

Conventional human pose estimation systems generally treat detection and pose estimation as independent, sequential tasks, with pose estimation following detection. These systems either expect ground-truth human detections or use an off-the-shelf detector for coarse detection. However, detection and part localization are highly complementary processes: detection strongly influences the pose estimation process, and accurate localization of parts strengthens the belief that a person is present at the corresponding location. The pose network model therefore couples these tasks, improving the efficiency of both pedestrian detection and pose estimation.

Fig. 2 shows the visual compiler of the person and pose detection method based on a visual compiler according to the present invention. The visual compiler is used to generate a scene-specific human detection and pose estimation system; its known information comprises:

(1) the intrinsic and extrinsic parameters of the camera;

(2) the rough physical geometric layout of the scene (regions where people walk, sit, or stand) and the regions of the scene that may be occluded (obstacles) or are physically nonexistent (walls);

(3) the poses and orientations of pedestrians in each region of the scene.

Fig. 3 shows the definition of the network using basic blocks in the person and pose detection method based on a visual compiler according to the present invention. The network is defined using two basic units, the residual module and the spatial belief module. The residual unit is introduced to solve the vanishing gradient problem in training deep convolutional networks. This basic unit is used as the building block of the network, and the spatial belief (SB) module is defined on top of it.

The spatial belief module maps the input features of the block to part localization beliefs (heatmaps), while processing both the input features and the part localization beliefs from the previous block. The image features and part localization beliefs generated by the block are concatenated to form the input of the next block. Given an input x to the SB module, the output y is given by:

where the operator denotes the concatenation operation, r = f_res(x) is the operation performed by the non-identity branch of the residual unit, and b = f_belief(x) denotes the mapping from the input x to the desired heatmaps (person detection, part detection, and segmentation mask) through a series of 1 × 1 convolutions. The SB unit lets the network take contextual information into account when estimating detection beliefs. The part localization belief b_i from the i-th SB unit propagates to the next, (i+1)-th, SB block, where it is processed by the non-identity path, capturing the correlations among the heatmaps of the individual parts. Applying the SB unit recursively, it can be seen that

Fig. 4 shows the joint localization with the pose network (Pose Net) in the person and pose detection method based on a visual compiler according to the present invention. Given an input image, the pose network jointly localizes pedestrians, producing the locations of body parts and pedestrians in the form of heatmaps. The network consists entirely of convolutional layers, which preserves spatial context while improving computational efficiency. To achieve accurate pedestrian localization and pose estimation, dense heatmap prediction is used throughout the network, preventing the information loss caused by subsampling (pooling).
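Once the network has produced heatmaps, a location can be read off each map. The text does not specify a decoding rule; the sketch below assumes the simplest one, taking the maximum of each map and discarding it if the score is below a threshold. The function name and the threshold value are hypothetical.

```python
import numpy as np

def decode_heatmaps(heatmaps, threshold=0.5):
    """Return (x, y, score) for each HxW map in a KxHxW stack, or None if below threshold."""
    results = []
    for hm in heatmaps:
        row, col = np.unravel_index(np.argmax(hm), hm.shape)
        score = float(hm[row, col])
        results.append((int(col), int(row), score) if score >= threshold else None)
    return results

# Two toy 6x6 heatmaps: one confident peak, one weak response.
heatmaps = np.zeros((2, 6, 6))
heatmaps[0, 4, 1] = 0.9
heatmaps[1, 2, 3] = 0.2
parts = decode_heatmaps(heatmaps)
# parts is [(1, 4, 0.9), None]
```

Since the heatmaps are predicted densely at input resolution, the decoded coordinates need no rescaling back to image pixels.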

For the pose network, the visual compiler uses high-quality synthetic images of pedestrian appearance in the scene to learn a scene- and region-specific, spatially varying fully convolutional neural network for simultaneous pedestrian detection, pose estimation, and segmentation; the network can be trained from scratch on synthetic data.

For those skilled in the art, the present invention is not limited to the details of the above embodiments, and it can be realized in other specific forms without departing from its spirit and scope. Moreover, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Claims

1. A person and pose detection method based on a visual compiler, characterized in that it mainly comprises: data synthesis from a scene description (I); learning a network from synthetic data (II); defining the network using basic blocks (III); and joint localization with a pose network (Pose Net) (IV).

2. The visual compiler according to claim 1, characterized in that it is used to generate a scene-specific human detection and pose estimation system; its known information comprises:

(1) the intrinsic and extrinsic parameters of the camera;

(2) the rough physical geometric layout of the scene (regions where people walk, sit, or stand) and the regions of the scene that may be occluded (obstacles) or are physically nonexistent (walls);

(3) the poses and orientations of pedestrians in each region of the scene;

together with a single image, the scene description serves as the input to the compiler, which synthesizes physically grounded and geometrically accurate people in the valid regions of the scene; the compiler learns a set of region-specific models for human detection, pose estimation, and segmentation; during inference, each of these region-specific models is run simultaneously on its corresponding region.

3. The data synthesis from the scene description (I) according to claim 1, characterized in that high-quality, calibrated ground-truth annotations are needed to train the pedestrian detection and pose estimation systems; without a complicated manual labeling process, the visual compiler uses the scene description to simulate pedestrian appearance suited to each region of the scene, and can therefore scale to a large number of scenes.

4. The scene description according to claim 3, characterized in that, given the scene description, the compiler first generates a planar 3D model of the scene that encloses the obstacles, i.e., it fits a ground plane, planar walls, and cuboids; the camera parameters are then used to account for lens characteristics (for example, perspective distortion in wide-angle cameras) and to render the scene with geometrically accurate people; in addition to rendering the appearance of a person at each "valid pedestrian location" in the scene, the rendering pipeline can precisely control variations in human appearance such as gender, height, width, orientation, and pose; the virtual human database contains 139 different models covering gender, clothing color, and ethnicity; the compiler can render orientations from 0 to 360 degrees, or it can be guided by any previously available information;

to generate calibrated ground-truth labels for the rendered images, the following labels are first associated with each 3D virtual model: the 3D positions of 27 body parts, the segmentation mask, and the center of the person used for detection; the 2D labels used for training are then extracted automatically from the 3D annotations and the camera projection parameters, a process that yields consistent, noise-free labels; in addition, all variations in appearance, orientation, pose, and position can be sampled uniformly.

5. The learning of a network from synthetic data (II) according to claim 1, characterized in that the generated scene-specific data is used, and the visual compiler produces a visual program in the form of a deep neural network, trained by standard procedures according to the scene description;

the visual program generated by the visual compiler jointly performs the following tasks: localizing pedestrians, locating the landmarks that define their pose, and segmenting the pixels that define them; to predict pedestrian location, pose, and segmentation mask, the network must model the holistic appearance of the pedestrian, the local appearance of the landmarks, and a prior over the valid spatial configurations of these parts; to capture appearance, both of the whole pedestrian and of the local landmarks, the network learns to map the RGB input to heatmaps for the precise localization of the pedestrian, the local landmarks, and the segmentation mask, which is a heatmap regression problem; the prior on the spatial relationships between part locations is learned by the spatial belief (SB) module, which considers the correlations among the heatmaps of the pedestrian, the local landmarks, and the segmentation mask; this specific instantiation of the visual program is called the pose network (Pose Net).

6. The human pose estimation system according to claim 5, characterized in that detection and pose estimation are generally treated as independent, sequential tasks, with pose estimation following detection; such systems either expect ground-truth human detections or use an off-the-shelf detector for coarse detection; however, detection and part localization are highly complementary processes; detection strongly influences the pose estimation process, and accurate localization of parts strengthens the belief that a person is present at the corresponding location; the pose network model therefore couples these tasks, improving the efficiency of both pedestrian detection and pose estimation.

7. The definition of the network using basic blocks (III) according to claim 1, characterized in that the network is defined using two basic units, the residual module and the spatial belief module; the residual unit is introduced to solve the vanishing gradient problem in training deep convolutional networks; this basic unit is used as the building block of the network, and the spatial belief (SB) module is defined on top of it.

8. The spatial belief module according to claim 7, characterized in that it maps the input features of the block to part localization beliefs (heatmaps), while processing both the input features and the part localization beliefs from the previous block; the image features and part localization beliefs generated by the block are concatenated to form the input of the next block; given an input x to the SB module, the output y is given by:

where the operator denotes the concatenation operation, r = f_res(x) is the operation performed by the non-identity branch of the residual unit, and b = f_belief(x) denotes the mapping from the input x to the desired heatmaps (person detection, part detection, and segmentation mask) through a series of 1 × 1 convolutions; the SB unit lets the network take contextual information into account when estimating detection beliefs; the part localization belief b_i from the i-th SB unit propagates to the next, (i+1)-th, SB block, where it is processed by the non-identity path, capturing the correlations among the heatmaps of the individual parts; applying the SB unit recursively, it can be seen that

because of the concatenation operation, the identity shortcut and f_res(·) in each SB unit process the beliefs from all previous SB units; in addition, the detection belief maps generated in each SB unit also take into account the part localization beliefs of all previous SB units, each of which is computed with a different receptive field; the network therefore makes use of detection belief maps from multiple stages and with multiple receptive field sizes.

9. The joint localization with the pose network (Pose Net) (IV) according to claim 1, characterized in that, given an input image, the pose network jointly localizes pedestrians, producing the locations of body parts and pedestrians in the form of heatmaps; the network consists entirely of convolutional layers, which preserves spatial context while improving computational efficiency; to achieve accurate pedestrian localization and pose estimation, dense heatmap prediction is used throughout the network, preventing the information loss caused by subsampling (pooling);

the input image first passes through convolutional layers with 5 × 5 filters and 3 × 3 filters, following the design of residual networks for object recognition; these are followed by 3 SB units, each with convolution filters of large receptive field, which enlarge the receptive field of the network while still performing dense prediction; the SB units are followed by two 1 × 1 convolutional layers that map the image features to heatmaps; finally, skip connections fuse information from multiple context regions, combining features with receptive fields at various scales; the bounding box for detection is inferred from the heatmaps localized around the joints, the body center, and the segmentation;

by optimizing the network, the multi-task mean squared error loss L between the network predictions and the ideal heatmaps for pedestrian detection, part localization, and segmentation mask is minimized; it is defined as follows,

10. The pose network according to claim 9, characterized in that the visual compiler uses high-quality synthetic images of pedestrian appearance in the scene to learn a scene- and region-specific, spatially varying fully convolutional neural network for simultaneous pedestrian detection, pose estimation, and segmentation; the network can be trained from scratch on synthetic data.