CN109977912A - Video human body key point detection method, apparatus, computer equipment and storage medium - Google Patents

Video human body key point detection method, apparatus, computer equipment and storage medium

Info

Publication number
CN109977912A
Authority
CN
China
Prior art keywords
image
detected
flow field
optical flow
feature
Prior art date
Legal status
Granted
Application number
CN201910276687.4A
Other languages
Chinese (zh)
Other versions
CN109977912B (en)
Inventor
张樯
张挺
李斌
李司同
崔洪
Current Assignee
Beijing Institute of Environmental Features
Original Assignee
Beijing Institute of Environmental Features
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Environmental Features filed Critical Beijing Institute of Environmental Features
Priority to CN201910276687.4A
Publication of CN109977912A
Application granted
Publication of CN109977912B
Legal status: Active


Classifications

    • G06T 7/269 Analysis of motion using gradient-based methods
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/30196 Human being; Person

Abstract

The present invention relates to a video human body key point detection method, comprising: extracting multiple frames of images to be detected from a video to be detected; obtaining the optical flow field between the multiple frames of images to be detected, and obtaining a feature map of each image to be detected; fusing the feature maps according to the optical flow field to obtain an enhanced feature map of the image to be detected; and inputting the enhanced feature map into a preset neural network to obtain the human body key points in the image to be detected. By extracting the optical flow field between frames, the image to be detected is enhanced, which improves the accuracy of video key point detection.

Description

Video human body key point detection method, apparatus, computer equipment and storage medium
Technical field
The present invention relates to the field of image processing, and in particular to a video human body key point detection method and apparatus, a computer device and a storage medium.
Background technique
Human body key point detection studies how to accurately identify and locate each key point of the human body in an image; it is the basis of many computer vision applications such as action recognition and human-computer interaction.
At present, video human body key point detection generally follows one of two approaches, "bottom-up" or "top-down". Both, however, simply decompose the video into frames and process them one by one with a single-frame algorithm, without exploiting the temporal information between frames, which leads to low key point detection accuracy.
Summary of the invention
The purpose of the present invention is to provide a video human body key point detection method and apparatus, a computer device and a readable storage medium, which can effectively improve the accuracy of human body key point detection in video.
The purpose of the present invention is achieved through the following technical solutions:
A video human body key point detection method, the method comprising:
extracting multiple frames of images to be detected from a video to be detected;
obtaining the optical flow field between the multiple frames of images to be detected, and obtaining a feature map of each image to be detected;
fusing the feature maps according to the optical flow field to obtain an enhanced feature map of the image to be detected;
inputting the enhanced feature map into a preset neural network to obtain the human body key points in the image to be detected.
In one embodiment, the images to be detected include a current frame image and at least one historical frame image, and the step of extracting multiple frames of images to be detected from the video to be detected comprises:
extracting the current frame image from the video to be detected;
extracting at least one historical frame image from the video to be detected, the capture time of the historical frame image being before, and adjacent to, the current frame image.
In one embodiment, the step of obtaining the optical flow field between the multiple frames of images to be detected and obtaining the feature map of each image to be detected comprises:
obtaining the optical flow field between the current frame image and the historical frame image;
obtaining the current feature map of the current frame image, and obtaining the historical feature map of the historical frame image.
In one embodiment, the step of obtaining the optical flow field between the current frame image and the historical frame image comprises:
inputting the current frame image and the historical frame image into a preset neural optical flow network to obtain the optical flow field between the current frame image and the historical frame image.
In one embodiment, the step of fusing the feature maps according to the optical flow field to obtain the enhanced feature map of the image to be detected comprises:
aligning the historical feature map to the current feature map according to the optical flow field to obtain an aligned feature map;
performing temporal fusion of the aligned feature map and the current feature map to obtain the enhanced feature map.
In one embodiment, the step of obtaining the optical flow field between the current frame image and the historical frame image comprises:
inputting the current frame image and the historical frame image into a preset neural optical flow network to obtain the optical flow field between the current frame image and the historical frame image, and also obtaining a scale field, the scale field having the same dimensions as the feature map.
In one embodiment, the step of fusing the feature maps according to the optical flow field to obtain the enhanced feature map of the image to be detected comprises:
aligning the historical feature map to the current feature map according to the optical flow field to obtain an aligned feature map;
multiplying the aligned feature map by the scale field to obtain a refined feature map;
performing temporal fusion of the refined feature map and the current feature map to obtain the enhanced feature map.
A video human body key point detection apparatus, the apparatus comprising:
an image extraction module for extracting multiple frames of images to be detected from a video to be detected;
an optical flow feature extraction module for obtaining the optical flow field between the multiple frames of images to be detected, and obtaining a feature map of each image to be detected;
an image enhancement module for fusing the feature maps according to the optical flow field to obtain an enhanced feature map of the image to be detected;
a key point detection module for inputting the enhanced feature map into a preset neural network to obtain the human body key points in the image to be detected.
A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the above steps when executing the computer program.
A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the above steps.
The video human body key point detection method provided by the present invention extracts multiple frames of images to be detected from a video to be detected; obtains the optical flow field between the multiple frames of images to be detected, and obtains a feature map of each image to be detected; fuses the feature maps according to the optical flow field to obtain an enhanced feature map of the image to be detected; and inputs the enhanced feature map into a preset neural network to obtain the human body key points in the image to be detected. By extracting the optical flow field between frames, the image to be detected is enhanced, which improves the accuracy of video key point detection.
Detailed description of the invention
Fig. 1 is a diagram of the application environment of the video human body key point detection method in one embodiment;
Fig. 2 is a flow diagram of the video human body key point detection method in one embodiment;
Fig. 3 is a flow diagram of the video human body key point detection method in another embodiment;
Fig. 4 is a structural block diagram of the video human body key point detection apparatus in another embodiment;
Fig. 5 is a diagram of the internal structure of the computer device in one embodiment.
Specific embodiment
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention and do not limit its scope of protection.
The video human body key point detection method provided by the present application can be applied in the environment shown in Fig. 1. The application environment includes a server 104 and a camera device 102. The server 104 obtains the video to be detected from the camera device 102 and extracts multiple frames of images to be detected from it; the server 104 obtains the optical flow field between the multiple frames of images to be detected, and obtains a feature map of each image to be detected; the server 104 fuses the feature maps according to the optical flow field to obtain an enhanced feature map of the image to be detected; the server 104 inputs the enhanced feature map into a preset neural network to obtain the human body key points in the image to be detected. The server may be implemented as an independent server or as a cluster of multiple servers; the camera device may be any device with a camera function, such as a camera, a video camera or a mobile phone.
In one embodiment, as shown in Fig. 2, a video human body key point detection method is provided. Taking the method as applied to the server in Fig. 1 as an example, it comprises the following steps:
Step S202: extract multiple frames of images to be detected from the video to be detected.
In this step, the images to be detected include a current frame image and at least one historical frame image.
In a specific implementation, the extraction of multiple frames of images to be detected from the video to be detected in step S202 comprises:
1) extracting the current frame image from the video to be detected;
2) extracting at least one historical frame image from the video to be detected, the capture time of the historical frame image being before, and adjacent to, the current frame image.
For example, three adjacent frames can be extracted, with the last frame used as the current frame image and the preceding two frames used as historical frame images.
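As a minimal sketch of this step, the following Python/OpenCV snippet reads a sliding window of three adjacent frames from a video file; the window size of three and the file path are illustrative assumptions, not values fixed by the patent.

```python
# Read a sliding window of three adjacent frames: the newest frame is the
# current frame image, the two before it are the historical frame images.
import cv2
from collections import deque

cap = cv2.VideoCapture("input.mp4")  # path is an assumption for illustration
window = deque(maxlen=3)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    window.append(frame)
    if len(window) == 3:
        *historical_frames, current_frame = window
        # ... pass current_frame and historical_frames to the detector ...
cap.release()
```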
Step S204: obtain the optical flow field between the multiple frames of images to be detected, and obtain the feature map of each image to be detected.
In this step, optical flow estimation is a method of computing object motion from the changes of object surface and shape between two observation moments. Optical flow characterizes the motion information between two images: it reflects the instantaneous velocity of each pixel moving from the previous frame to the following frame.
In this embodiment, the inter-frame optical flow is estimated with a Flownet2S network.
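Public Flownet2S implementations vary, so as a stand-in illustration the snippet below uses torchvision's RAFT model to show the interface of a "preset neural optical flow network": two frames in, a two-channel flow field out. This is an assumption-laden sketch, not the patent's network.

```python
# Illustrative only: torchvision's RAFT stands in for Flownet2S to show the
# flow-estimation interface (two RGB frames in, a (N, 2, H, W) flow field out).
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()
preprocess = weights.transforms()

def estimate_flow(frame_prev: torch.Tensor, frame_cur: torch.Tensor) -> torch.Tensor:
    """frame_*: (N, 3, H, W) float tensors in [0, 1]; returns flow (N, 2, H, W)."""
    img1, img2 = preprocess(frame_prev, frame_cur)
    with torch.no_grad():
        flow_predictions = model(img1, img2)  # list of iterative refinements
    return flow_predictions[-1]               # final, most refined flow field
```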
As shown in Fig. 3, in one embodiment, obtaining the optical flow field between the multiple frames of images to be detected and obtaining the feature map of each image to be detected in step S204 comprises:
Step S410: obtain the optical flow field between the current frame image and the historical frame image;
Step S420: obtain the current feature map of the current frame image, and obtain the historical feature map of the historical frame image.
In one embodiment, obtaining the optical flow field between the current frame image and the historical frame image in step S410 may comprise: inputting the current frame image and the historical frame image into a preset neural optical flow network to obtain the optical flow field between the current frame image and the historical frame image.
Specifically, let M_{i→k} denote the two-dimensional optical flow field from frame i to frame k computed by Flownet2S. Suppose a pixel at position p in frame i moves to position q in frame k; then q = p + δp, where δp = M_{i→k}(p). Before feature alignment, the optical flow needs to be scaled to the same size as the feature map by bilinear interpolation. Since δp is generally fractional, feature alignment is realized by formula (1):

f^c_{i→k}(p) = Σ_q G(q, p + δp) · f^c_i(q)    (1)

where c denotes a channel of the feature map f, q traverses every coordinate on the feature map, and G is the bilinear interpolation kernel. Since G is two-dimensional, it can be decomposed into the product of two one-dimensional kernels, as shown in formula (2):

G(q, p + δp) = g(q_x, p_x + δp_x) · g(q_y, p_y + δp_y)    (2)

where g(a, b) = max(0, 1 − |a − b|). Since only very few terms in the sum are non-zero, the computation is fast.
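The alignment of formulas (1) and (2) can be sketched with PyTorch's grid_sample, whose bilinear mode implements exactly the kernel g(a, b) = max(0, 1 − |a − b|); tensor shapes and the pixel-unit flow convention are assumptions for illustration.

```python
# A minimal sketch of flow-guided feature alignment per formulas (1)-(2).
import torch
import torch.nn.functional as F

def warp_features(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp feature map `feat` (N, C, H, W) by optical flow `flow` (N, 2, H, W).

    flow[:, 0] is the horizontal (x) displacement and flow[:, 1] the vertical
    (y) displacement, both in pixels at the feature-map resolution.
    """
    n, _, h, w = feat.shape
    # Base sampling grid: the coordinate p of every output location.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=feat.dtype, device=feat.device),
        torch.arange(w, dtype=feat.dtype, device=feat.device),
        indexing="ij",
    )
    # q = p + delta_p, with delta_p read from the flow field.
    x_q = xs.unsqueeze(0) + flow[:, 0]
    y_q = ys.unsqueeze(0) + flow[:, 1]
    # grid_sample expects coordinates normalized to [-1, 1].
    grid = torch.stack(
        (2.0 * x_q / (w - 1) - 1.0, 2.0 * y_q / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
```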
In another embodiment, so that the aligned features are more favorable for detection, obtaining the optical flow field between the current frame image and the historical frame image in step S410 may also comprise: inputting the current frame image and the historical frame image into the preset neural optical flow network to obtain the optical flow field between the current frame image and the historical frame image, and also obtaining a scale field, the scale field having the same dimensions as the feature map.
Specifically, besides the optical flow field, Flownet2S also outputs a scale field S_{i→k} with the same dimensions as the feature map.
Step S206: fuse the feature maps according to the optical flow field to obtain the enhanced feature map of the image to be detected.
In this step, the feature maps are fused with a GRU (Gated Recurrent Unit); specifically, temporal feature fusion is performed with the convolutional form of the GRU, i.e. ConvGRU.
In one embodiment, when only the optical flow field is obtained in step S410, fusing the feature maps according to the optical flow field to obtain the enhanced feature map of the image to be detected in step S206 comprises:
1) aligning the historical feature map to the current feature map according to the optical flow field to obtain an aligned feature map;
2) performing temporal fusion of the aligned feature map and the current feature map to obtain the enhanced feature map.
Specifically, inside a GRU unit the input information is processed according to formula (3). The new state h_t of the GRU unit is a weighted sum of the previous state h_{t−1} and the memory state h'_t. The update gate z_t determines how much of the memory state is used to compute the new state h_t, and the reset gate r_t controls the degree of influence of the previous state h_{t−1} on the memory state. Unlike the fully connected GRU, * here denotes convolution, ⊙ denotes element-wise multiplication, σ is the sigmoid function, w are weights to be learned, and b are bias terms:

z_t = σ(x_t * w_xz + h_{t−1} * w_hz + b_z),
r_t = σ(x_t * w_xr + h_{t−1} * w_hr + b_r),
h'_t = tanh(x_t * w_xh + (r_t ⊙ h_{t−1}) * w_hh + b_h),
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h'_t    (3)
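A compact ConvGRU cell following formula (3) can be sketched in PyTorch as below. Concatenating x_t and h_{t−1} before a single convolution is mathematically equivalent to the separate w_x· and w_h· terms above; the kernel size and channel counts are illustrative assumptions rather than the patent's configuration.

```python
# A minimal ConvGRU cell sketch matching formula (3).
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        pad = k // 2
        # Each conv sees [x_t, h] concatenated along channels, equivalent to
        # the sum x_t * w_x. + h * w_h. in formula (3).
        self.conv_z = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)
        self.conv_r = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)
        self.conv_h = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)

    def forward(self, x: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        xh = torch.cat([x, h_prev], dim=1)
        z = torch.sigmoid(self.conv_z(xh))   # update gate z_t
        r = torch.sigmoid(self.conv_r(xh))   # reset gate r_t
        h_cand = torch.tanh(self.conv_h(torch.cat([x, r * h_prev], dim=1)))
        return (1 - z) * h_prev + z * h_cand  # new state h_t
```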
In another embodiment, when both the optical flow field and the scale field are obtained in step S410, fusing the feature maps according to the optical flow field to obtain the enhanced feature map of the image to be detected in step S206 comprises:
1) aligning the historical feature map to the current feature map according to the optical flow field to obtain an aligned feature map;
2) multiplying the aligned feature map by the scale field to obtain a refined feature map; specifically, the scale field S_{i→k} is multiplied element-wise with the spatially aligned feature map to obtain the refined feature map;
3) performing temporal fusion of the refined feature map and the current feature map to obtain the enhanced feature map.
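Putting the pieces together, a hedged end-to-end sketch of this scale-field variant of step S206, reusing the warp_features and ConvGRUCell helpers sketched earlier; all tensor shapes are illustrative, and the assignment of input and previous state in the GRU is an assumption of this sketch.

```python
# End-to-end sketch of the scale-field variant of step S206. All tensors
# are dummies standing in for real network outputs.
import torch

n, c, h, w = 1, 256, 64, 64
hist_feat = torch.randn(n, c, h, w)    # historical feature map
cur_feat = torch.randn(n, c, h, w)     # current feature map
flow = torch.randn(n, 2, h, w)         # optical flow field M_{i->k}
scale_field = torch.randn(n, c, h, w)  # scale field S_{i->k}, same dims as feature map

aligned_feat = warp_features(hist_feat, flow)  # spatial alignment, formula (1)
refined_feat = aligned_feat * scale_field      # element-wise scale refinement
gru = ConvGRUCell(in_ch=c, hid_ch=c)
# Temporal fusion: the refined historical features act as the previous state,
# the current features as the input (an assumption of this sketch).
enhanced_feat = gru(cur_feat, refined_feat)
```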
Step S208: input the enhanced feature map into the preset neural network to obtain the human body key points in the image to be detected.
In this step, image-based human body key point detection is performed with Mask-RCNN. The network structure of Mask-RCNN is mainly composed of three parts: a feature extraction network at the bottom, a candidate box generation network in the middle layer, and task-specific sub-networks at the head.
The bottom feature extraction network extracts rich features from the image: its input is the original image and its output is a feature map. To extract better features, the VGG network used in Faster-RCNN is replaced with a residual network, whose feature representation ability is stronger. At the same time, because targets of different sizes often appear in an image, detecting from a feature map of a single scale easily causes missed detections: for a backbone such as ResNet, shallow features have high resolution but a low semantic level, while deep features have a high semantic level but low resolution. Using an FPN network as the backbone fuses the information of different scales, and the multi-scale feature maps it outputs are of great significance for the subsequent target detection, semantic segmentation and key point detection.
The intermediate candidate box generation network distinguishes targets from background and generates target candidate boxes; the feature map is then cropped according to these candidate boxes. The method used in Faster-RCNN is RoI Pooling, which maps a region of the original image to the corresponding convolutional feature region and pools it to a fixed size, normalizing the region to the input size of the convolutional network. Mask-RCNN instead uses a ROIAlign layer to keep the extracted features aligned with the input: it avoids quantizing the ROI boundaries or bins, and uses bilinear interpolation to compute the input feature values at four fixed sampling locations in each ROI bin and aggregate the results. The ROIAlign layer finally outputs a feature map of size 7 × 7 to the subsequent sub-task networks.
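For illustration, torchvision's roi_align implements exactly this quantization-free cropping; the feature shape, box coordinates and feature stride below are assumptions, and sampling_ratio=2 gives the 2 × 2 = 4 sampling locations per bin described above.

```python
# Crop fixed-size, alignment-preserving features for each candidate box.
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 50, 68)  # backbone feature map (assumed shape)
# Each box: (batch_index, x1, y1, x2, y2) in original-image coordinates.
boxes = torch.tensor([[0.0, 10.5, 20.25, 200.0, 310.0]])
roi_feats = roi_align(
    feat, boxes, output_size=(7, 7),
    spatial_scale=1 / 16,  # feature stride of this backbone level (assumed)
    sampling_ratio=2,      # 2x2 = 4 sampling locations per bin
    aligned=True,
)
print(roi_feats.shape)     # torch.Size([1, 256, 7, 7])
```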
The task-specific sub-network sits at the top; for human body key point detection it consists of 8 layers of 3 × 3 convolutions. Since the accuracy of key point detection is very sensitive to the resolution of the feature map, a deconvolution layer and a bilinear interpolation layer are cascaded afterwards, so that the finally output result has a scale of 56 × 56.
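A sketch of this keypoint sub-network in PyTorch follows; the channel counts, the number of key points, and the exact upsampling factors (one stride-2 deconvolution followed by 4× bilinear interpolation so that 7 × 7 becomes 56 × 56) are illustrative assumptions.

```python
# Keypoint head sketch: eight 3x3 convolutions, then deconvolution plus
# bilinear upsampling to produce 56x56 per-keypoint heatmaps.
import torch
import torch.nn as nn

def keypoint_head(num_keypoints: int = 17, ch: int = 512) -> nn.Sequential:
    layers = []
    in_ch = 256  # channels of the ROIAlign feature map (assumed)
    for _ in range(8):
        layers += [nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(inplace=True)]
        in_ch = ch
    return nn.Sequential(
        *layers,
        nn.ConvTranspose2d(ch, num_keypoints, 4, stride=2, padding=1),  # 7x7 -> 14x14
        nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),  # -> 56x56
    )

# heatmaps = keypoint_head()(roi_feats)  # roi_feats: (num_rois, 256, 7, 7)
```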
In the above video human body key point detection method, multiple frames of images to be detected are extracted from the video to be detected; the optical flow field between the multiple frames of images to be detected is obtained, and a feature map of each image to be detected is obtained; the feature maps are fused according to the optical flow field to obtain an enhanced feature map of the image to be detected; and the enhanced feature map is input into a preset neural network to obtain the human body key points in the image to be detected. By extracting the optical flow field between frames, the image to be detected is enhanced, which improves the accuracy of video key point detection.
As shown in Fig. 4, which is a structural diagram of the video human body key point detection apparatus in one embodiment, this embodiment provides a video human body key point detection apparatus including an image extraction module 401, an optical flow feature extraction module 402, an image enhancement module 403 and a key point detection module 404, in which:
the image extraction module 401 is used for extracting multiple frames of images to be detected from a video to be detected;
the optical flow feature extraction module 402 is used for obtaining the optical flow field between the multiple frames of images to be detected, and obtaining a feature map of each image to be detected;
the image enhancement module 403 is used for fusing the feature maps according to the optical flow field to obtain an enhanced feature map of the image to be detected;
the key point detection module 404 is used for inputting the enhanced feature map into a preset neural network to obtain the human body key points in the image to be detected.
For the specific limitations of the video human body key point detection apparatus, reference may be made to the limitations of the video human body key point detection method above, which are not repeated here. Each module in the above apparatus may be realized in whole or in part by software, hardware or a combination thereof. The modules may be embedded in hardware form in, or independent of, the processor of a computer device, or stored in software form in the memory of the computer device, so that the processor can call and execute the operations corresponding to each module.
As shown in Fig. 5, which is a diagram of the internal structure of the computer device in one embodiment, the computer device includes a processor, a non-volatile storage medium, a memory and a network interface connected through a device bus. The non-volatile storage medium of the computer device stores an operating system, a database and computer-readable instructions; the database may store sequences of control information, and the computer-readable instructions, when executed by the processor, cause the processor to realize a video human body key point detection method. The processor of the computer device provides the computing and control capability that supports the operation of the entire device. The memory of the computer device may store computer-readable instructions which, when executed by the processor, cause the processor to execute a video human body key point detection method. The network interface of the computer device is used to communicate with a terminal. Those skilled in the art will understand that the structure shown in Fig. 5 is only a block diagram of the part of the structure relevant to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is proposed, including a memory, a processor and a computer program stored on the memory and runnable on the processor, the processor realizing the following steps when executing the computer program: extracting multiple frames of images to be detected from a video to be detected; obtaining the optical flow field between the multiple frames of images to be detected, and obtaining a feature map of each image to be detected; fusing the feature maps according to the optical flow field to obtain an enhanced feature map of the image to be detected; and inputting the enhanced feature map into a preset neural network to obtain the human body key points in the image to be detected.
In one of the embodiments, when the processor executes the computer program, the images to be detected include a current frame image and at least one historical frame image, and the step of extracting multiple frames of images to be detected from the video to be detected comprises: extracting the current frame image from the video to be detected; and extracting at least one historical frame image from the video to be detected, the capture time of the historical frame image being before, and adjacent to, the current frame image.
In one of the embodiments, when the processor executes the computer program, the step of obtaining the optical flow field between the multiple frames of images to be detected and obtaining the feature map of each image to be detected comprises: obtaining the optical flow field between the current frame image and the historical frame image; and obtaining the current feature map of the current frame image and the historical feature map of the historical frame image.
In one of the embodiments, when the processor executes the computer program, the step of obtaining the optical flow field between the current frame image and the historical frame image comprises: inputting the current frame image and the historical frame image into a preset neural optical flow network to obtain the optical flow field between the current frame image and the historical frame image.
In one of the embodiments, when the processor executes the computer program, the step of fusing the feature maps according to the optical flow field to obtain the enhanced feature map of the image to be detected comprises: aligning the historical feature map to the current feature map according to the optical flow field to obtain an aligned feature map; and performing temporal fusion of the aligned feature map and the current feature map to obtain the enhanced feature map.
In one of the embodiments, when the processor executes the computer program, the step of obtaining the optical flow field between the current frame image and the historical frame image comprises: inputting the current frame image and the historical frame image into a preset neural optical flow network to obtain the optical flow field between the current frame image and the historical frame image, and also obtaining a scale field, the scale field having the same dimensions as the feature map.
In one of the embodiments, when the processor executes the computer program, the step of fusing the feature maps according to the optical flow field to obtain the enhanced feature map of the image to be detected comprises: aligning the historical feature map to the current feature map according to the optical flow field to obtain an aligned feature map; multiplying the aligned feature map by the scale field to obtain a refined feature map; and performing temporal fusion of the refined feature map and the current feature map to obtain the enhanced feature map.
In one embodiment, a storage medium storing computer-readable instructions is proposed; when executed by one or more processors, the computer-readable instructions cause the one or more processors to execute the following steps: extracting multiple frames of images to be detected from a video to be detected; obtaining the optical flow field between the multiple frames of images to be detected, and obtaining a feature map of each image to be detected; fusing the feature maps according to the optical flow field to obtain an enhanced feature map of the image to be detected; and inputting the enhanced feature map into a preset neural network to obtain the human body key points in the image to be detected.
In one of the embodiments, when the computer-readable instructions are executed by the processor, the images to be detected include a current frame image and at least one historical frame image, and the step of extracting multiple frames of images to be detected from the video to be detected comprises: extracting the current frame image from the video to be detected; and extracting at least one historical frame image from the video to be detected, the capture time of the historical frame image being before, and adjacent to, the current frame image.
In one of the embodiments, when the computer-readable instructions are executed by the processor, the step of obtaining the optical flow field between the multiple frames of images to be detected and obtaining the feature map of each image to be detected comprises: obtaining the optical flow field between the current frame image and the historical frame image; and obtaining the current feature map of the current frame image and the historical feature map of the historical frame image.
In one of the embodiments, when the computer-readable instructions are executed by the processor, the step of obtaining the optical flow field between the current frame image and the historical frame image comprises: inputting the current frame image and the historical frame image into a preset neural optical flow network to obtain the optical flow field between the current frame image and the historical frame image.
In one of the embodiments, when the computer-readable instructions are executed by the processor, the step of fusing the feature maps according to the optical flow field to obtain the enhanced feature map of the image to be detected comprises: aligning the historical feature map to the current feature map according to the optical flow field to obtain an aligned feature map; and performing temporal fusion of the aligned feature map and the current feature map to obtain the enhanced feature map.
In one of the embodiments, when the computer-readable instructions are executed by the processor, the step of obtaining the optical flow field between the current frame image and the historical frame image comprises: inputting the current frame image and the historical frame image into a preset neural optical flow network to obtain the optical flow field between the current frame image and the historical frame image, and also obtaining a scale field, the scale field having the same dimensions as the feature map.
In one of the embodiments, when the computer-readable instructions are executed by the processor, the step of fusing the feature maps according to the optical flow field to obtain the enhanced feature map of the image to be detected comprises: aligning the historical feature map to the current feature map according to the optical flow field to obtain an aligned feature map; multiplying the aligned feature map by the scale field to obtain a refined feature map; and performing temporal fusion of the refined feature map and the current feature map to obtain the enhanced feature map.
It should be understood that, although the steps in the flow charts of the accompanying drawings are shown in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless expressly stated otherwise herein, there is no strict ordering constraint on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flow charts may include multiple sub-steps or stages that are not necessarily completed at the same moment but may be executed at different times, and their execution order is not necessarily sequential; they may be executed in turn or alternately with at least part of the sub-steps or stages of other steps.
The above are only some embodiments of the present invention. It should be noted that, for those of ordinary skill in the art, various improvements and modifications may be made without departing from the principle of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.

Claims (10)

1. A video human body key point detection method, characterized in that the method comprises:
extracting multiple frames of images to be detected from a video to be detected;
obtaining the optical flow field between the multiple frames of images to be detected, and obtaining a feature map of each image to be detected;
fusing the feature maps according to the optical flow field to obtain an enhanced feature map of the image to be detected;
inputting the enhanced feature map into a preset neural network to obtain the human body key points in the image to be detected.
2. The method according to claim 1, characterized in that the images to be detected include a current frame image and at least one historical frame image, and the step of extracting multiple frames of images to be detected from the video to be detected comprises:
extracting the current frame image from the video to be detected;
extracting at least one historical frame image from the video to be detected, the capture time of the historical frame image being before, and adjacent to, the current frame image.
3. The method according to claim 2, characterized in that the step of obtaining the optical flow field between the multiple frames of images to be detected and obtaining the feature map of each image to be detected comprises:
obtaining the optical flow field between the current frame image and the historical frame image;
obtaining the current feature map of the current frame image, and obtaining the historical feature map of the historical frame image.
4. The method according to claim 3, characterized in that the step of obtaining the optical flow field between the current frame image and the historical frame image comprises:
inputting the current frame image and the historical frame image into a preset neural optical flow network to obtain the optical flow field between the current frame image and the historical frame image.
5. The method according to claim 4, characterized in that the step of fusing the feature maps according to the optical flow field to obtain the enhanced feature map of the image to be detected comprises:
aligning the historical feature map to the current feature map according to the optical flow field to obtain an aligned feature map;
performing temporal fusion of the aligned feature map and the current feature map to obtain the enhanced feature map.
6. The method according to claim 3, characterized in that the step of obtaining the optical flow field between the current frame image and the historical frame image comprises:
inputting the current frame image and the historical frame image into a preset neural optical flow network to obtain the optical flow field between the current frame image and the historical frame image, and also obtaining a scale field, the scale field having the same dimensions as the feature map.
7. The method according to claim 6, characterized in that the step of fusing the feature maps according to the optical flow field to obtain the enhanced feature map of the image to be detected comprises:
aligning the historical feature map to the current feature map according to the optical flow field to obtain an aligned feature map;
multiplying the aligned feature map by the scale field to obtain a refined feature map;
performing temporal fusion of the refined feature map and the current feature map to obtain the enhanced feature map.
8. A video human body key point detection apparatus, characterized in that the apparatus comprises:
an image extraction module for extracting multiple frames of images to be detected from a video to be detected;
an optical flow feature extraction module for obtaining the optical flow field between the multiple frames of images to be detected, and obtaining a feature map of each image to be detected;
an image enhancement module for fusing the feature maps according to the optical flow field to obtain an enhanced feature map of the image to be detected;
a key point detection module for inputting the enhanced feature map into a preset neural network to obtain the human body key points in the image to be detected.
9. A computer device, including a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, realizes the steps of the method of any one of claims 1 to 7.
CN201910276687.4A 2019-04-08 2019-04-08 Video human body key point detection method and device, computer equipment and storage medium Active CN109977912B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910276687.4A CN109977912B (en) 2019-04-08 2019-04-08 Video human body key point detection method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910276687.4A CN109977912B (en) 2019-04-08 2019-04-08 Video human body key point detection method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109977912A (en) 2019-07-05
CN109977912B CN109977912B (en) 2021-04-16

Family

ID=67083370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910276687.4A Active CN109977912B (en) 2019-04-08 2019-04-08 Video human body key point detection method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109977912B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110853074A (en) * 2019-10-09 2020-02-28 天津大学 Video target detection network system for enhancing target by utilizing optical flow
CN111160237A (en) * 2019-12-27 2020-05-15 智车优行科技(北京)有限公司 Head pose estimation method and apparatus, electronic device, and storage medium
CN111914756A (en) * 2020-08-03 2020-11-10 北京环境特性研究所 Video data processing method and device
CN112053327A (en) * 2020-08-18 2020-12-08 南京理工大学 Video target detection method and system, storage medium and server
CN112199978A (en) * 2019-07-08 2021-01-08 北京地平线机器人技术研发有限公司 Video object detection method and device, storage medium and electronic equipment
CN113901909A (en) * 2021-09-30 2022-01-07 北京百度网讯科技有限公司 Video-based target detection method and device, electronic equipment and storage medium
CN115909508A (en) * 2023-01-06 2023-04-04 浙江大学计算机创新技术研究院 Image key point enhancement detection method under single-person sports scene

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8116624B1 (en) * 2007-01-29 2012-02-14 Cirrex Systems Llc Method and system for evaluating an optical device
CN106529419A (en) * 2016-10-20 2017-03-22 北京航空航天大学 Automatic detection method for significant stack type polymerization object in video
CN108229336A (en) * 2017-12-13 2018-06-29 北京市商汤科技开发有限公司 Video identification and training method and device, electronic equipment, program and medium
CN108242062A (en) * 2017-12-27 2018-07-03 北京纵目安驰智能科技有限公司 Method for tracking target, system, terminal and medium based on depth characteristic stream
CN108776974A (en) * 2018-05-24 2018-11-09 南京行者易智能交通科技有限公司 A kind of real-time modeling method method suitable for public transport scene
CN109117701A (en) * 2018-06-05 2019-01-01 东南大学 Pedestrian's intension recognizing method based on picture scroll product
CN109508643A (en) * 2018-10-19 2019-03-22 北京陌上花科技有限公司 Image processing method and device for pornographic images

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8116624B1 (en) * 2007-01-29 2012-02-14 Cirrex Systems Llc Method and system for evaluating an optical device
CN106529419A (en) * 2016-10-20 2017-03-22 北京航空航天大学 Automatic detection method for significant stack type polymerization object in video
CN108229336A (en) * 2017-12-13 2018-06-29 北京市商汤科技开发有限公司 Video identification and training method and device, electronic equipment, program and medium
CN108242062A (en) * 2017-12-27 2018-07-03 北京纵目安驰智能科技有限公司 Method for tracking target, system, terminal and medium based on depth characteristic stream
CN108776974A (en) * 2018-05-24 2018-11-09 南京行者易智能交通科技有限公司 A kind of real-time modeling method method suitable for public transport scene
CN109117701A (en) * 2018-06-05 2019-01-01 东南大学 Pedestrian's intension recognizing method based on picture scroll product
CN109508643A (en) * 2018-10-19 2019-03-22 北京陌上花科技有限公司 Image processing method and device for pornographic images

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199978A (en) * 2019-07-08 2021-01-08 北京地平线机器人技术研发有限公司 Video object detection method and device, storage medium and electronic equipment
CN110853074A (en) * 2019-10-09 2020-02-28 天津大学 Video target detection network system for enhancing target by utilizing optical flow
CN110853074B (en) * 2019-10-09 2023-06-27 天津大学 Video target detection network system for enhancing targets by utilizing optical flow
CN111160237A (en) * 2019-12-27 2020-05-15 智车优行科技(北京)有限公司 Head pose estimation method and apparatus, electronic device, and storage medium
CN111914756A (en) * 2020-08-03 2020-11-10 北京环境特性研究所 Video data processing method and device
CN112053327A (en) * 2020-08-18 2020-12-08 南京理工大学 Video target detection method and system, storage medium and server
CN112053327B (en) * 2020-08-18 2022-08-23 南京理工大学 Video target detection method and system, storage medium and server
CN113901909A (en) * 2021-09-30 2022-01-07 北京百度网讯科技有限公司 Video-based target detection method and device, electronic equipment and storage medium
CN113901909B (en) * 2021-09-30 2023-10-27 北京百度网讯科技有限公司 Video-based target detection method and device, electronic equipment and storage medium
CN115909508A (en) * 2023-01-06 2023-04-04 浙江大学计算机创新技术研究院 Image key point enhancement detection method under single-person sports scene

Also Published As

Publication number Publication date
CN109977912B (en) 2021-04-16

Similar Documents

Publication Publication Date Title
CN109977912A (en) Video human body key point detection method, apparatus, computer equipment and storage medium
Chen et al. Learning spatial attention for face super-resolution
CN112733797B (en) Method, device and equipment for correcting sight of face image and storage medium
WO2020134818A1 (en) Image processing method and related product
Nasrollahi et al. Extracting a good quality frontal face image from a low-resolution video sequence
CN111160164A (en) Action recognition method based on human body skeleton and image fusion
CN111428664B (en) Computer vision real-time multi-person gesture estimation method based on deep learning technology
WO2020233427A1 (en) Method and apparatus for determining features of target
CN112580521B (en) Multi-feature true and false video detection method based on MAML (maximum likelihood markup language) element learning algorithm
Zhou et al. A lightweight hand gesture recognition in complex backgrounds
KR102551835B1 (en) Active interaction method, device, electronic equipment and readable storage medium
CN110853039B (en) Sketch image segmentation method, system and device for multi-data fusion and storage medium
CN111914756A (en) Video data processing method and device
Vieriu et al. On HMM static hand gesture recognition
CN112712019A (en) Three-dimensional human body posture estimation method based on graph convolution network
CN116092178A (en) Gesture recognition and tracking method and system for mobile terminal
Hua et al. Dynamic scene deblurring with continuous cross-layer attention transmission
CN110021036A (en) Infrared target detection method, apparatus, computer equipment and storage medium
CN116309983A (en) Training method and generating method and device of virtual character model and electronic equipment
CN109492755B (en) Image processing method, image processing apparatus, and computer-readable storage medium
CN111476868B (en) Animation generation model training and animation generation method and device based on deep learning
Xiong et al. Extraction of hand gestures with adaptive skin color models and its applications to meeting analysis
Qian et al. Multi-Scale tiny region gesture recognition towards 3D object manipulation in industrial design
TWI734297B (en) Multi-task object recognition system sharing multi-range features
KR102591082B1 (en) Method and apparatus for creating deep learning-based synthetic video contents

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant