CN111179281A - Human body image extraction method and human body action video extraction method - Google Patents
- Publication number
- CN111179281A (application number CN201911349143.2A)
- Authority
- CN
- China
- Prior art keywords: human body, target, human, image, extraction method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/11 — Image analysis; Segmentation; Region-based segmentation
- G06N3/045 — Neural networks; Architecture; Combinations of networks
- G06N3/08 — Neural networks; Learning methods
- G06T7/155 — Segmentation; Edge detection involving morphological operators
- G06T2207/10021 — Stereoscopic video; Stereoscopic image sequence
- G06T2207/20152 — Watershed segmentation
Abstract
The invention discloses a human body image extraction method comprising the following steps: acquiring an original input picture; extracting human body information from the original input picture based on a skeleton detection method, the human body information comprising a human body number for each human body and the corresponding human body skeleton joint point coordinate information; constructing target areas based on the skeleton joint point coordinate information of all human bodies; extracting a target picture from the original input picture based on the target areas; and extracting the human body image corresponding to each human body from the target picture based on an image segmentation algorithm. The method features reasonably ordered steps, fast execution, low hardware requirements, and good practicality. In addition, the invention provides a human body action video extraction method.
Description
Technical Field
The invention relates to the field of picture processing, in particular to a human body image extraction method and a human body action video extraction method.
Background
Human body segmentation is an important step in applications such as three-dimensional human body modeling, pose estimation, pattern recognition, and detection and tracking, and the quality of the segmentation directly determines the quality of subsequent work; research into obtaining accurate human body segmentation results therefore has real practical significance.
In real scenes, human body segmentation is affected by many factors, such as noise, occlusion, similar colors, and complex backgrounds, so an ideal result often cannot be obtained; obtaining accurate human body segmentation in complex scenes thus remains a very challenging task.
At present, research on and application of human body image segmentation are still at an exploratory stage. Commonly used human body segmentation algorithms can be roughly divided into segmentation based on traditional image-graphics methods, segmentation based on shallow machine learning, and segmentation based on deep learning.
However, existing segmentation methods share a problem: when pedestrian areas overlap, the pedestrians in the overlapping areas cannot be accurately separated, so segmentation precision is low.
Disclosure of Invention
To overcome the shortcomings of existing segmentation methods, the invention provides a human body image extraction method and a human body action video extraction method.
Correspondingly, the human body image extraction method comprises the following steps:
acquiring an original input picture;
extracting human body information in the original input picture based on a skeleton detection method, wherein the human body information comprises a human body number for each human body and human body skeleton joint point coordinate information corresponding to each human body;
constructing a target area based on coordinate information of human body skeleton joint points of all human bodies;
extracting a target picture from an original input picture based on the target area;
and extracting a human body image corresponding to each human body in the target picture based on an image segmentation algorithm.
In an optional implementation, the human body skeleton joint point coordinate information of all human bodies in the original input picture is extracted by a trained deep convolutional neural network;
the coordinate information of all human body skeleton joint points is P_k = {(x_ki, y_ki) | i = 0, 1, ..., n; k = 1, 2, ..., m}, where k is the human body number and i is the skeleton joint point number; n and m are integer values generated from the original input picture.
In an optional embodiment, the constructing the target region based on the coordinate information of the human skeleton joint points of all the human bodies includes:
let the target region of the kth human body be a rectangular region, denoted R_k(x, y, w, h), where (x, y) are the coordinates of the lower-left corner of the rectangle, w is the width of the rectangle, and h is the height of the rectangle;
where x = x_kmin − b, x_kmin being the minimum x coordinate among the skeleton joint points of the kth human body; y = y_kmin − a, y_kmin being the minimum y coordinate among the skeleton joint points of the kth human body; w = |x_kmax − x_kmin| + 2b, x_kmax being the maximum x coordinate among the skeleton joint points of the kth human body; h = |y_kmax − y_kmin| + 2a, y_kmax being the maximum y coordinate among the skeleton joint points of the kth human body; a and b are empirical margin values.
In an optional embodiment, the extracting a target picture from an original input picture based on the target region includes:
retaining the pixel points inside the target areas of all human bodies in the original input picture, and setting the remaining pixel points to a designated color;
the target picture comprises a plurality of unconnected target blocks, and each target block contains either one target area or two or more overlapping target areas.
In an optional embodiment, extracting the human body image from the target picture based on an image segmentation algorithm includes:
and sequentially extracting a corresponding number of human body images from each of the plurality of target blocks.
In an optional implementation manner, the sequentially extracting a corresponding number of human body images from each of the plurality of target blocks includes:
selecting a target block, and counting the number of target areas in the target block;
if the number of the target areas in the target block is one, extracting a human body image from the target block based on an image segmentation algorithm;
if the number of target areas in the target block is two or more, selecting any two overlapping target areas as a processing object, extracting two connected human body images from the processing object based on an image segmentation algorithm, segmenting the two connected human body images based on a watershed algorithm, associating the segmentation results with the human bodies of the corresponding human body numbers, and repeating this step until every combination of overlapping target areas has been processed;
and obtaining a corresponding human body image based on a plurality of segmentation results of each human body.
In an optional embodiment, the image segmentation algorithm is one of the graph-cut, grab-cut, and one-cut algorithms.
Correspondingly, the invention provides a human body action video extraction method, which comprises the following steps:
sequentially extracting each frame of video picture of an original input video based on a time axis;
taking each frame of video picture as an original input picture and executing the human body image extraction method to obtain a human body image corresponding to each human body;
and carrying out video recombination on the human body image corresponding to the human body with the specific human body number in each frame of video picture based on the time axis to obtain the human body action video corresponding to the human body with the specific human body number.
In an optional implementation, if a human body image in an extracted frame of the human body action video contains an occluded area, the occluded area is completed from the human body images of frames other than the extracted frame, using the corresponding human body skeleton joint point coordinate information.
In an optional embodiment, the completing the occluded region based on the body images of the remaining frames according to the corresponding body skeleton joint coordinate information includes the following steps:
determining a human body part where the shielded area is located in the extracted frame and a head joint point and a tail joint point corresponding to the human body part, and obtaining a target human body part posture vector based on the coordinate information of human body skeleton joint points corresponding to the head joint point and the tail joint point;
calculating a reference target human body part posture vector of the same human body part in the rest frames of the human body action video based on the same head joint point and tail joint point;
respectively carrying out similarity matching on the target human body component posture vector and reference target human body component posture vectors in other frames of the human body action video;
and filling the occluded area in the extraction frame by using the reference target human body part with the highest similarity degree.
The invention provides a human body image extraction method and a human body action video extraction method. The human body image extraction method uses the human body skeleton joint point coordinate information to pre-process the original input picture, which greatly reduces the complexity of the subsequent human body image extraction and helps to raise execution speed and lower hardware requirements; overlapping human body images are separated with a watershed algorithm, the steps are reasonably ordered, execution is fast, and the method has good practicality. The human body action video extraction method built on it can accurately extract the action video corresponding to each human body, and completes occluded areas from the relationship between preceding and following frames to further restore the complete appearance of the human body; it likewise has good practicality.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a human body image extraction method according to an embodiment of the invention;
FIG. 2 shows a schematic diagram of an original input picture (excluding human skeletal joint points) of an embodiment of the invention;
FIG. 3 shows a schematic diagram of an original input picture including human skeletal joint information according to an embodiment of the present invention;
FIG. 4 shows a schematic diagram of a target picture of an embodiment of the invention;
FIG. 5 shows a schematic representation of a human body image of an embodiment of the invention;
fig. 6 shows a flowchart of a human motion video extraction method according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, in different embodiments, due to the variation of the device structure, several components named by the same name may have different structures, and for components with different structures but the same name in different embodiments, different numbers may be used for distinguishing between different embodiments.
Fig. 1 shows a flow diagram of a human body image extraction method according to an embodiment of the invention.
The embodiment of the invention provides a human body image extraction method, which comprises the following steps:
s101, acquiring an original input picture;
fig. 2 shows a schematic diagram of an original input picture (human skeleton joint points not shown) according to an embodiment of the present invention. The original input picture may be a photograph or a frame captured from a video. Real pictures are more complex than the schematic in fig. 2; the drawings of the embodiments are for illustration only.
It should be noted that the human body image extraction method according to the embodiment of the present invention is to extract a plurality of human body images from an original input picture.
As can be seen from fig. 2, there are four human bodies in the original input picture. For convenience of the following description, the different human bodies are labeled separately; the goal of the subsequent steps is to extract four human body images from the original input picture.
S102, extracting coordinate information of human body skeleton joint points from the original input picture based on a skeleton detection method;
specifically, the skeleton detection method is mainly used for confirming the position information and the number information of the human body in the original input picture.
Generally, skeleton detection methods fall into top-down and bottom-up approaches. A top-down method first detects each whole human body, then estimates the individual body parts from the whole, and finally confirms the skeleton joint point coordinates from the part poses. A bottom-up method first detects the parts that make up a human body, then associates the different parts with the corresponding human body to form a whole, and finally confirms the skeleton joint point coordinates from the pose of the whole.
The two approaches differ mainly in what the detector looks for; if a bottom-up method is adopted, the detected parts must additionally be associated with the correct human body. Both approaches can be implemented with neural networks.
In the embodiment of the invention, the coordinate information of the human skeleton joint points of all pedestrians in the original input picture can be extracted based on the trained deep convolutional neural network; specifically, the original input picture is input into a trained deep convolutional neural network, and the trained deep convolutional neural network outputs a series of coordinate points and human body attribution of the coordinate points.
Specifically, the skeleton joint point coordinate information of all pedestrians can be expressed as P_k = {(x_ki, y_ki) | i = 0, 1, ..., n; k = 1, 2, ..., m}, where k is the pedestrian number and i is the joint point number; n and m are integer values generated from the original input picture. In this embodiment, m = 4 and the maximum value of n is 13.
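As a minimal illustration of this data layout (the raw detection values and the grouping function below are hypothetical stand-ins, not the patent's actual network output format), the per-person joint dictionaries P_k can be assembled from flat detections as follows:

```python
def collect_keypoints(detections):
    """Group raw (k, i, x, y) detections into P_k = {i: (x_ki, y_ki)} per person k."""
    people = {}
    for k, i, x, y in detections:
        people.setdefault(k, {})[i] = (x, y)
    return people

# Toy example with two people and two joints each; in the embodiment above
# there would be m = 4 people with joint numbers i = 0..13.
raw = [(1, 0, 10, 20), (1, 1, 12, 40), (2, 0, 50, 22), (2, 1, 53, 44)]
P = collect_keypoints(raw)
```

Each subsequent step (target-region construction, occlusion completion) then only needs to look up one person's dictionary by human body number k.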
It should be noted that the number of joint points detected for different pedestrians may differ because of occlusion. Since the skeleton joint point coordinates in this embodiment are used only for extracting images, and not for pose analysis and the like, a missing joint merely indicates that the corresponding part of the human body is occluded in the picture, which has little effect on the extraction.
fig. 3 is a schematic diagram of an original input picture including human skeleton joint point information, and it can be seen from fig. 3 that after this step, a series of human skeleton joint point coordinate information for each individual human body can be obtained in the original input picture.
S103: constructing a target area based on the coordinate information of the human body skeleton joint points;
specifically, a target area is a region containing a human body image; the purpose of this step is to remove, in a preliminary way, the background of the original input picture that is unrelated to the human body images, simplifying the subsequent processing flow and reducing the processing load.
Specifically, since this step is only a preliminary processing stage, the target region of the kth pedestrian may be set as a rectangular region R_k(x, y, w, h), where (x, y) are the coordinates of the lower-left corner of the rectangle, w is its width, and h is its height.
Specifically, x = x_kmin − b, where x_kmin is the minimum x coordinate among the kth pedestrian's skeleton joint points; y = y_kmin − a, where y_kmin is the minimum y coordinate among the kth pedestrian's skeleton joint points; w = |x_kmax − x_kmin| + 2b, where x_kmax is the maximum x coordinate among the kth pedestrian's skeleton joint points; h = |y_kmax − y_kmin| + 2a, where y_kmax is the maximum y coordinate among the kth pedestrian's skeleton joint points; a and b are empirical margins preset to ensure the completeness of the cropped region.
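The rectangle construction can be sketched directly from these formulas (a minimal sketch; the joint coordinates and the margin values a and b below are made-up examples):

```python
def target_region(joints, a, b):
    """Rectangular target region R_k(x, y, w, h) around one person's joints.

    joints: dict mapping joint number i -> (x, y); a, b: empirical margins.
    """
    xs = [p[0] for p in joints.values()]
    ys = [p[1] for p in joints.values()]
    x = min(xs) - b                       # x = x_kmin - b
    y = min(ys) - a                       # y = y_kmin - a
    w = abs(max(xs) - min(xs)) + 2 * b    # w = |x_kmax - x_kmin| + 2b
    h = abs(max(ys) - min(ys)) + 2 * a    # h = |y_kmax - y_kmin| + 2a
    return (x, y, w, h)

joints = {0: (10, 20), 1: (12, 40), 2: (30, 60)}
R = target_region(joints, a=5, b=3)   # -> (7, 15, 26, 50)
```

The margins a and b pad the joint bounding box so that flesh beyond the outermost joints (hands, head top, feet) stays inside the region.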
S104: extracting a target picture from an original input picture based on the target area;
fig. 4 shows a schematic diagram of a target picture of an embodiment of the invention.
After the processing of step S103, several target regions (rectangles) containing the human body images are obtained. To reduce the number of background pixels in the image, the pixel points inside the target regions of all pedestrians in the original input picture can optionally be retained while the remaining pixel points are set to a designated color. In this embodiment, the remaining pixels are set to black.
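A toy sketch of this masking step on a small pixel grid (pure Python for clarity; the bottom-up row convention and fill value are assumptions made to match the lower-left-corner rectangles above — a real implementation would use array slicing on the image):

```python
def mask_outside_regions(img, regions, fill=0):
    """Keep pixels inside any target rectangle; set the rest to `fill` (black).

    img: 2-D list of pixel values, img[row][col], rows stored top-down;
    regions: list of (x, y, w, h) with (x, y) the lower-left corner,
    y counted bottom-up to match the rectangle convention above.
    """
    height = len(img)
    out = [[fill] * len(row) for row in img]
    for (x, y, w, h) in regions:
        for col in range(max(0, x), min(len(img[0]), x + w)):
            for y_up in range(max(0, y), min(height, y + h)):
                r = height - 1 - y_up   # bottom-up y -> top-down row index
                out[r][col] = img[r][col]
    return out

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]          # 3x3 toy "picture"
out = mask_outside_regions(img, [(0, 0, 2, 2)])  # keep the bottom-left 2x2
```

Everything outside the union of the rectangles becomes the designated color, so later segmentation only ever sees the retained target blocks.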
In a specific implementation, the target picture extracted in this step contains several target blocks, each comprising one or more target areas. In this embodiment, the target areas of the human bodies k = 1 and k = 4 each form their own target block, while the target areas of the human bodies k = 2 and k = 3 together form a single target block.
S105: and extracting a human body image in the target picture based on an image segmentation algorithm.
A small number of background pixels remain in the target picture and must be removed; this removal is the human body image extraction proper. Specifically, extracting the human body images from the target picture essentially means extracting them from each target block: according to the number of target areas a target block contains, a corresponding number of human body images are extracted from it.
Specifically, in step S102, the coordinate information of the human skeleton joint point of each human body is obtained, and when the human body image is extracted, the extraction is generally performed for a specific region of each human body (for example, for a target region of each human body in the present embodiment).
Specifically, in the target picture, some human bodies' target regions stand alone (for example, the human bodies with k = 1 and k = 4), while other human bodies' target regions overlap (for example, the human bodies with k = 2 and k = 3).
When the human body image of the target block only comprising one target area is extracted, the human body image corresponding to the target block can be directly extracted by using an image segmentation algorithm;
when extracting the human body images of the target blocks comprising more than two target areas, firstly extracting all the human body images based on an image segmentation algorithm, segmenting the overlapped parts of all the human body images based on a watershed segmentation algorithm, and obtaining the human body images with corresponding quantity.
Specifically, the image segmentation algorithm may be graph-cut, grab-cut, one-cut, or a similar algorithm. The watershed algorithm is an image region segmentation method that uses the similarity between adjacent pixels as its main criterion: pixel points that are close in space and have similar gray values (by gradient computation) are connected into closed contours, which separates the overlapping human body images.
Specifically, in the target picture, two overlapping human bodies appear as one human body occluding part of the other: the human body in the foreground (the occluding one) is continuous in the target picture, while the human body in the background (the occluded one) is cut into pieces by the occluding body. The watershed algorithm can therefore quickly separate the human bodies in the overlapping region and attribute each piece to the right body.
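The principle can be illustrated with a minimal marker-based watershed (a priority flood) on a toy gradient grid. This is a simplified sketch of the idea only — the grid, seeds, and flooding rule are illustrative, and a real pipeline would run a library watershed implementation on actual image gradients:

```python
import heapq

def watershed_split(gradient, markers):
    """Minimal marker-based watershed via priority flooding.

    gradient: 2-D list of edge-strength values; markers: dict label -> seed (r, c).
    Pixels are claimed in order of increasing gradient from each seed, so the
    two labels end up meeting near high-gradient ridges (body boundaries).
    """
    rows, cols = len(gradient), len(gradient[0])
    label = [[0] * cols for _ in range(rows)]
    heap = []
    for lab, (r, c) in markers.items():
        label[r][c] = lab
        heapq.heappush(heap, (gradient[r][c], r, c, lab))
    while heap:
        _, r, c, lab = heapq.heappop(heap)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and label[nr][nc] == 0:
                label[nr][nc] = lab
                heapq.heappush(heap, (gradient[nr][nc], nr, nc, lab))
    return label

# Toy 3x4 gradient with a high-gradient ridge in column 2, standing in for the
# boundary between bodies k = 2 and k = 3; seeds come from each body's joints.
grad = [[0, 0, 9, 0],
        [0, 0, 9, 0],
        [0, 0, 9, 0]]
labels = watershed_split(grad, {2: (1, 0), 3: (1, 3)})
```

In the real method the seeds would be taken from each human body's skeleton joint points, which is exactly the per-body information step S102 provides.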
Figure 5 shows a schematic representation of the human body images. Referring to fig. 5, for k = 2 and k = 3, the overlapping region of the two human body images is the area framed in white; analyzing this overlapping target region with the watershed algorithm extracts and attributes it accurately.
Fig. 6 shows a flowchart of a human motion video extraction method according to an embodiment of the present invention.
Correspondingly, the embodiment of the invention also provides a human body action video extraction method, which comprises the following steps:
s201, sequentially extracting each frame of video picture of an original input video based on a time axis;
decomposing an original input video into a plurality of frames of video pictures based on a time axis;
s202, taking each frame of video picture as an original input picture and executing the human body image extraction method to obtain a human body image corresponding to each human body;
that is, the frames of video pictures are processed one by one with the human body image extraction method, yielding for each frame a picture that retains only the human body images.
S203: and carrying out video recombination on the human body image corresponding to the human body with the specific human body number in each frame of video picture based on the time axis to obtain the human body action video corresponding to the human body with the specific human body number.
And confirming the human body number of a human body object to be researched, extracting corresponding human body images from the plurality of human body images by using the human body number, sequencing by using a time axis again, and recombining into a human body action video comprising the corresponding human body number.
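A minimal sketch of this reassembly step (frame contents are stand-in strings, and the per-frame mapping from human body number to extracted image is an assumed representation, not a format the patent specifies):

```python
def extract_person_video(frames, body_id):
    """Reassemble the action video of one numbered person.

    frames: list, already in timeline order, of dicts mapping
    human body number -> that person's extracted image for the frame.
    Frames where the person was not detected are skipped.
    """
    return [f[body_id] for f in frames if body_id in f]

# Toy timeline: person 2 is missing from frame 1 (e.g. fully occluded there).
frames = [{1: "f0_p1", 2: "f0_p2"},
          {1: "f1_p1"},
          {1: "f2_p1", 2: "f2_p2"}]
video_p2 = extract_person_video(frames, 2)
```

Because every extracted image stays keyed by its human body number from step S102, reordering by the time axis is all that is needed to rebuild each person's action video.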
In a specific implementation, if a human body image in some frame of the human body action video contains an occluded area, then, because occlusion is not absolute and the occluded positions differ between frames, the occluded area can be completed from the human body image of a preceding or following frame using the corresponding skeleton joint point coordinate information, restoring the image.
Specifically, if the human body image in the extracted frame in the human body motion video has a blocked area, the blocked area is completed based on the human body images of the other frames except the extracted frame through the corresponding human body skeleton joint point coordinate information.
The step of completing the shielded area based on the human body images of the rest frames through the corresponding human body skeleton joint point coordinate information comprises the following steps:
s301: determining a human body part where the shielded area is located in the extracted frame and a head joint point and a tail joint point corresponding to the human body part, and obtaining a target human body part posture vector based on the coordinate information of human body skeleton joint points corresponding to the head joint point and the tail joint point;
specifically, the coordinate information of the human body skeleton joint point corresponding to the head joint point is represented as a point a, the coordinate information of the human body skeleton joint point corresponding to the tail joint point is represented as a point B, and the coordinates are two-dimensional coordinates in the screen image.
S302: calculating a reference target human body part posture vector of the same human body part in the rest frames of the human body action video based on the same head joint point and tail joint point;
correspondingly, the reference target human body part posture vector of the wth remaining frame can be expressed as V_w = B_w − A_w, w = 1, 2, ..., u, where u is the total number of frames of the human body action video minus 1;
s303, respectively carrying out similarity matching on the target human body component posture vector and reference target human body component posture vectors in other frames of the human body action video;
in particular implementations, similarity matching may take into account the aspects of comparison including vector angle and vector length.
Specifically, the meaning of vector length similarity matching is twofold. On the one hand, images or videos shot by a camera exhibit perspective, so comparing vector lengths tends to find human body parts at the same magnification, i.e., at a similar distance from the camera. On the other hand, because of the variety of part motions, such as an arm rotating about the shoulder, comparing vector lengths also tends to find images of the same part seen at different angles. In that case, the missing portion of the target human body part can be obtained by suitably rotating the human body image corresponding to the matched reference posture vector.
Specifically, the meaning of vector angle similarity matching is that when the target posture vector and a reference posture vector have the same or similar angles, the reference part is very likely in the same or a similar pose as the target part. In that case, the missing portion of the target human body part can be obtained by suitably enlarging or reducing the human body image corresponding to the matched reference posture vector.
In a specific implementation, different weight values are assigned to the vector angle and the vector length so as to obtain a reasonable similarity matching result.
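One way the weighted matching could look (a sketch under assumptions — the equal weights, the cosine-based angle term, and the length-ratio term are illustrative choices, not values specified by the patent):

```python
import math

def pose_similarity(v1, v2, w_angle=0.5, w_len=0.5):
    """Weighted similarity of two 2-D posture vectors, in [0, 1].

    Angle term: cosine of the angle between the vectors, mapped to [0, 1].
    Length term: ratio of the shorter length to the longer length.
    """
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return 0.0
    angle_sim = (dot / (n1 * n2) + 1) / 2   # 1 = same direction, 0 = opposite
    len_sim = min(n1, n2) / max(n1, n2)     # 1 = equal lengths
    return w_angle * angle_sim + w_len * len_sim

target = (3.0, 4.0)   # AB vector of the occluded part in the extracted frame
candidates = [(3.0, 4.0), (4.0, -3.0), (6.0, 8.0)]   # V_w from remaining frames
best = max(candidates, key=lambda v: pose_similarity(target, v))
```

Shifting weight toward the angle term favors same-pose matches that need rescaling; shifting it toward the length term favors same-scale matches that need rotating, mirroring the two completion cases described above.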
S304: and filling the occluded area in the extraction frame by using the reference target human body part with the highest similarity degree.
Specifically, the completion operations include, but are not limited to, zooming, stretching, and rotating.
In summary, the embodiments of the present invention provide a human body image extraction method and a human body motion video extraction method. The human body image extraction method uses the coordinate information of human body skeleton joint points to perform preliminary processing on the original input picture, which greatly reduces the execution complexity of the subsequent human body image extraction, improving execution speed and lowering hardware requirements. Overlapping human body images are segmented based on the watershed algorithm; with its reasonably arranged steps and fast execution, the method has good practicability. The human body motion video extraction method built on the human body image extraction method can accurately extract the human body motion video corresponding to each human body, and completes occluded regions according to the relationship between preceding and following frames so as to restore the complete appearance of the human body; it likewise has good practicability.
The human body image extraction method and the human body motion video extraction method provided by the embodiments of the invention have been described in detail above. A specific example is used herein to explain the principle and implementation of the invention, and the description of the embodiments is only intended to help in understanding the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present invention, vary the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (10)
1. A human body image extraction method is characterized by comprising the following steps:
acquiring an original input picture;
extracting human body information in the original input picture based on a skeleton detection method, wherein the human body information comprises a human body number for each human body and human body skeleton joint point coordinate information corresponding to each human body;
constructing a target area based on coordinate information of human body skeleton joint points of all human bodies;
extracting a target picture from an original input picture based on the target area;
and extracting a human body image corresponding to each human body in the target picture based on an image segmentation algorithm.
2. The human image extraction method according to claim 1, wherein human skeleton joint point coordinate information of all human bodies in the original input picture is extracted based on a trained deep convolutional neural network;
the coordinate information of all human body skeleton joint points is Pki={(xki,yki) I ═ 0,1,.. n, k ═ 1,2,.. m } where k represents the human body number and i represents the human skeleton joint point number; n and m are integer values generated based on the original input picture.
3. The human image extraction method of claim 2, wherein the constructing the target region based on the human skeleton joint point coordinate information of all human bodies comprises:
let the target region of the kth human body be a rectangular region, denoted as Rk(x, y, w, h), where (x, y) is the coordinate of the lower-left corner point of the rectangular region, w is the width of the rectangular region, and h is the height of the rectangular region;
wherein x = xkmin - b, where xkmin is the minimum x coordinate among the skeleton joint point coordinates of the kth human body; y = ykmin - a, where ykmin is the minimum y coordinate among the skeleton joint point coordinates of the kth human body; w = |xkmax - xkmin| + 2b, where xkmax is the maximum x coordinate among the skeleton joint point coordinates of the kth human body; h = |ykmax - ykmin| + 2a, where ykmax is the maximum y coordinate among the skeleton joint point coordinates of the kth human body; a and b are empirical values.
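The rectangle construction of claim 3 can be sketched as follows; the default margin values a and b are placeholders, since the claim leaves them as empirical values.

```python
def target_region(joints, a=20, b=10):
    """Compute the rectangular target region Rk(x, y, w, h) of claim 3.

    joints -- list of (x, y) skeleton joint coordinates of one human body
    a, b   -- empirical vertical / horizontal margins; the defaults here
              are illustrative placeholders only
    """
    xs = [p[0] for p in joints]
    ys = [p[1] for p in joints]
    x = min(xs) - b                      # x = xkmin - b
    y = min(ys) - a                      # y = ykmin - a
    w = abs(max(xs) - min(xs)) + 2 * b   # w = |xkmax - xkmin| + 2b
    h = abs(max(ys) - min(ys)) + 2 * a   # h = |ykmax - ykmin| + 2a
    return x, y, w, h
```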
4. The human image extraction method of claim 3, wherein the extracting the target picture from the original input picture based on the target region comprises:
reserving the pixel points corresponding to the target regions of all human bodies in the original input picture, and setting the remaining pixel points to a designated color;
the target picture comprises a plurality of unconnected target blocks, and each of the plurality of target blocks comprises one target area or two or more target areas.
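A minimal sketch of this masking step, assuming NumPy image arrays and treating (x, y) as the top-left corner in array coordinates (the claim uses the lower-left corner of an image coordinate system, so a real implementation would flip the y axis):

```python
import numpy as np

def mask_to_regions(image, regions, fill_color=(0, 0, 0)):
    """Keep only the pixels inside the target regions of all bodies.

    image      -- H x W x 3 array (the original input picture)
    regions    -- iterable of (x, y, w, h) rectangles; (x, y) is taken
                  as the top-left corner here, an assumption of this sketch
    fill_color -- designated color for all remaining pixels
    """
    keep = np.zeros(image.shape[:2], dtype=bool)
    for x, y, w, h in regions:
        keep[max(y, 0):y + h, max(x, 0):x + w] = True
    out = np.empty_like(image)
    out[:] = fill_color          # paint everything the designated color
    out[keep] = image[keep]      # then restore the reserved pixels
    return out
```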
5. The human image extraction method of claim 4, wherein extracting the human image in the target picture based on an image segmentation algorithm comprises:
and sequentially extracting a corresponding number of human body images from each of the plurality of target blocks.
6. The human body image extraction method according to claim 5, wherein said sequentially extracting a corresponding number of human body images from each of the plurality of target blocks comprises:
selecting a target block, and counting the number of target areas in the target block;
if the number of the target areas in the target block is one, extracting a human body image from the target block based on an image segmentation algorithm;
if the number of target areas in the target block is two or more, selecting any two target areas that coincide with each other as a processing object, extracting two connected human body images from the processing object based on an image segmentation algorithm, segmenting the two connected human body images based on a watershed algorithm, and associating the segmentation results with the human bodies of the corresponding human body numbers, this step being traversed until every combination of coinciding target areas has been processed;
and obtaining a corresponding human body image based on a plurality of segmentation results of each human body.
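The watershed split of two overlapping bodies can be illustrated with a simplified stand-in: a simultaneous breadth-first growth from one seed point per body. A production implementation would apply a true watershed (e.g. on the image gradient), but the marker-growing structure is the same; seeding from the two bodies' skeleton joints is an assumption of this sketch.

```python
from collections import deque

def split_connected_mask(mask, seed_a, seed_b):
    """Divide a connected foreground mask of two overlapping bodies.

    mask -- 2-D list of 0/1, the connected human-image mask
    seed_a, seed_b -- (row, col) seed points, one inside each body,
                      e.g. taken from each body's skeleton joints
    Returns a label grid: 0 background, 1 body A, 2 body B.
    """
    rows, cols = len(mask), len(mask[0])
    label = [[0] * cols for _ in range(rows)]
    queue = deque()
    for lab, (r, c) in ((1, seed_a), (2, seed_b)):
        label[r][c] = lab
        queue.append((r, c))
    while queue:  # simultaneous BFS: both labels flood out at equal speed
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols \
                    and mask[nr][nc] and not label[nr][nc]:
                label[nr][nc] = label[r][c]
                queue.append((nr, nc))
    return label
```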
7. The human image extraction method according to claim 6, wherein the image segmentation algorithm is one of a graph-cut algorithm, a grab-cut algorithm, and a one-cut algorithm.
8. A human motion video extraction method is characterized by comprising the following steps:
sequentially extracting each frame of video picture of an original input video based on a time axis;
taking each frame of video picture as an original input picture and executing the human body image extraction method of any one of claims 1 to 7 on the original input picture to obtain a human body image corresponding to each human body;
and carrying out video recombination on the human body image corresponding to the human body with the specific human body number in each frame of video picture according to the time axis sequence corresponding to each frame of video picture to obtain the human body action video corresponding to the human body with the specific human body number.
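The recombination step of claim 8 amounts to regrouping the per-frame extraction results by human body number while preserving the time-axis order, which can be sketched as:

```python
def recombine_motion_videos(frames):
    """Regroup per-frame extraction results into one sequence per body.

    frames -- list, ordered along the time axis, of dicts mapping
              human body number -> extracted human body image
    Returns dict: human body number -> time-ordered list of images.
    """
    videos = {}
    for frame in frames:
        for body_id, image in frame.items():
            videos.setdefault(body_id, []).append(image)
    return videos
```

Each resulting image list, encoded back to video, is the human body action video of one specific human body number.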
9. The human motion video extraction method according to claim 8, wherein, if a human body image in an extracted frame of the human body action video has an occluded region, the occluded region is completed, through the corresponding human skeleton joint point coordinate information, based on the human body images of the frames other than the extracted frame.
10. The human motion video extraction method according to claim 9, wherein the completing the occluded region, through the corresponding human skeleton joint point coordinate information, based on the human body images of the remaining frames comprises:
determining the human body part where the occluded region is located in the extracted frame and the head joint point and tail joint point corresponding to that human body part, and obtaining a target human body part posture vector based on the human body skeleton joint point coordinate information corresponding to the head joint point and the tail joint point;
calculating a reference target human body part posture vector of the same human body part in each of the remaining frames of the human body action video based on the same head joint point and tail joint point;
performing similarity matching between the target human body part posture vector and each of the reference target human body part posture vectors in the remaining frames of the human body action video;
and completing the occluded region in the extracted frame by using the reference target human body part with the highest similarity.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911349143.2A CN111179281A (en) | 2019-12-24 | 2019-12-24 | Human body image extraction method and human body action video extraction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111179281A true CN111179281A (en) | 2020-05-19 |
Family
ID=70650423
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911349143.2A Pending CN111179281A (en) | 2019-12-24 | 2019-12-24 | Human body image extraction method and human body action video extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111179281A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104036231A (en) * | 2014-05-13 | 2014-09-10 | 深圳市菲普莱体育发展有限公司 | Human-body trunk identification device and method, and terminal-point image detection method and device |
CN107220604A (en) * | 2017-05-18 | 2017-09-29 | 清华大学深圳研究生院 | A kind of fall detection method based on video |
CN108986137A (en) * | 2017-11-30 | 2018-12-11 | 成都通甲优博科技有限责任公司 | Human body tracing method, device and equipment |
CN109919132A (en) * | 2019-03-22 | 2019-06-21 | 广东省智能制造研究所 | A kind of pedestrian's tumble recognition methods based on skeleton detection |
CN110347877A (en) * | 2019-06-27 | 2019-10-18 | 北京奇艺世纪科技有限公司 | A kind of method for processing video frequency, device, electronic equipment and storage medium |
CN110472569A (en) * | 2019-08-14 | 2019-11-19 | 旭辉卓越健康信息科技有限公司 | A kind of method for parallel processing of personnel detection and identification based on video flowing |
Non-Patent Citations (5)
Title |
---|
RUIBING HOU et al.: "VRSTC: Occlusion-Free Video Person Re-Identification", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019, pages 7183 - 7186 *
RUIBING HOU et al.: "VRSTC: Occlusion-Free Video Person Re-Identification", HTTPS://ARXIV.ORG/PDF/1907.08427.PDF, pages 2 - 5 *
WANG Jiangtao et al.: "Shape-based human body detection in infrared image sequences", Journal of Infrared and Millimeter Waves, vol. 26, no. 6, pages 437 - 442 *
CHENG Guang et al.: "Botnet Detection Technology", Southeast University Press, 31 October 2014, pages 137 - 143 *
CAI Guiyan: "Segmentation of adhered crowds based on distance transform and watershed segmentation", Journal of Qinzhou University, vol. 26, no. 3, pages 41 - 44 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2022046326A (en) * | 2020-09-10 | 2022-03-23 | ソフトバンク株式会社 | Information processing device, information processing method and information processing program |
GB2613925A (en) * | 2021-12-16 | 2023-06-21 | Adobe Inc | Generating segmentation masks for objects in digital videos using pose tracking data |
US20230196817A1 (en) * | 2021-12-16 | 2023-06-22 | Adobe Inc. | Generating segmentation masks for objects in digital videos using pose tracking data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wen et al. | Deep color guided coarse-to-fine convolutional network cascade for depth image super-resolution | |
KR102003015B1 (en) | Creating an intermediate view using an optical flow | |
US20190037150A1 (en) | System and methods for depth regularization and semiautomatic interactive matting using rgb-d images | |
US9715761B2 (en) | Real-time 3D computer vision processing engine for object recognition, reconstruction, and analysis | |
US9626568B2 (en) | Use of spatially structured light for dynamic three dimensional reconstruction and reality augmentation | |
US8824801B2 (en) | Video processing | |
Nakajima et al. | Fast and accurate semantic mapping through geometric-based incremental segmentation | |
CN103443826B (en) | mesh animation | |
Ding et al. | Spatio-temporal recurrent networks for event-based optical flow estimation | |
Sánchez-Riera et al. | Simultaneous pose, correspondence and non-rigid shape | |
KR101969082B1 (en) | Optimal Spherical Image Acquisition Method Using Multiple Cameras | |
CN112102342B (en) | Plane contour recognition method, plane contour recognition device, computer equipment and storage medium | |
CN111179281A (en) | Human body image extraction method and human body action video extraction method | |
Scholz et al. | Texture replacement of garments in monocular video sequences | |
Li et al. | Three-dimensional motion estimation via matrix completion | |
CN111161219B (en) | Robust monocular vision SLAM method suitable for shadow environment | |
Cushen et al. | Markerless real-time garment retexturing from monocular 3d reconstruction | |
Kim et al. | Multi-view object extraction with fractional boundaries | |
Bazin et al. | An original approach for automatic plane extraction by omnidirectional vision | |
Kovačević et al. | An improved CamShift algorithm using stereo vision for object tracking | |
CN111160255B (en) | Fishing behavior identification method and system based on three-dimensional convolution network | |
Gay-Bellile et al. | Deformable surface augmentation in spite of self-occlusions | |
Rotman et al. | A depth restoration occlusionless temporal dataset | |
Hasegawa et al. | Distortion-Aware Self-Supervised 360° Depth Estimation from A Single Equirectangular Projection Image | |
Kim et al. | Accurate depth image generation via overfit training of point cloud registration using local frame sets |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||