CN111832386A - Method and device for estimating human body posture and computer readable medium

Publication number
CN111832386A
CN111832386A (application no. CN202010439265.7A)
Authority
CN
China
Prior art keywords
sequence
source image
human
pose
computer
Legal status: Pending
Application number
CN202010439265.7A
Other languages
Chinese (zh)
Inventor
曲毅
何晓光
屈莎
刘佳玉
王洪亮
杜浩宇
Current Assignee
Dalian Ruidong Technology Co ltd
Original Assignee
Dalian Ruidong Technology Co ltd
Application CN202010439265.7A filed by Dalian Ruidong Technology Co ltd
Published as CN111832386A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/23 - Recognition of whole body movements, e.g. for sport training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 - Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 - Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method and a device for estimating human body pose, and a computer readable medium, belonging to the technical field of computer image processing. To estimate the human pose of an articulated 3D object model, a computer first obtains a sequence of source images and derives from it a sequence of source image segments in which the object is separated from the image background; this sequence is then matched against reference contour sequences to determine one or more selected reference contour sequences that form a best match; subsequently, for each selected reference contour sequence, a reference body pose associated with one of its reference contours is retrieved; finally, an estimate of the body pose of the articulated object model is calculated from the retrieved reference body poses. The result of these steps is an initial body pose estimate, which can then be used in further steps. The invention can operate in an uncontrolled environment with loose camera calibration, low athlete resolution and occlusion.

Description

Method and device for estimating human body posture and computer readable medium
Technical Field
The invention belongs to the technical field of computer image processing, and particularly relates to a method and a device for estimating human body posture and a computer readable medium.
Background
Human pose estimation, or motion capture, is a fundamental problem in the fields of computer vision and graphics and has found wide application in character animation for games and movies, controller-less interfaces for games, surveillance, and so on. Given the complexity of the problem, there is currently no universal solution suitable for all fields of application, and any solution depends to a large extent on the conditions and constraints imposed on the setup. Generally, the more constraints, the more accurate the human body pose estimation result that can be obtained; but in real-world scenarios it is often difficult to add constraints, since many practical applications cannot alter the scene. For example, accurate human pose estimates would make it possible to render high-quality views of an athlete from arbitrary viewpoints during a sports game using only the video material already captured for TV broadcast. Besides such rendering applications, the estimation of human body poses during a match can be used for biomechanical analysis and synthesis, as well as for match statistics, and even for porting real matches into computer games.
Currently, commercial motion capture systems typically use optical markers distributed over the whole body to track the time-varying motion of a subject. Such systems obtain very accurate body pose estimates and can capture a wide variety of body poses and facial expressions. However, they only work in a controlled environment, so their range of application is severely limited.
The problem of markerless human pose reconstruction (or motion capture) can be broadly divided into two categories based on the type of footage used: methods using a video sequence from a single camera, and methods using footage from multiple calibrated cameras. Human pose estimation from a monocular video sequence is more convenient for some applications because it places fewer restrictions on the user, but it has inherent problems, including depth ambiguity. Although such ambiguity can be resolved with structure-from-motion methods, this is a very difficult vision problem: structure-from-motion algorithms typically rely on high-resolution scenes containing much detail, which sports scenes usually do not provide.
In addition, another problem to be solved urgently in human body pose estimation is occlusion, which is difficult to resolve if the footage intended for body pose analysis comes from only a single camera. By increasing the number of cameras, an unobstructed view of the same object becomes more likely; in general, the better the spatial coverage of the cameras, the higher the quality of the obtained footage. Moreover, sports broadcasting now commonly uses multiple cameras for image acquisition, so this information can be exploited to obtain more accurate 3D human body pose estimates.
In the prior art, most multi-view 3D human body pose estimation methods use a tracking algorithm to reconstruct the human body pose at one time instant from the pose at another. Tracking can be achieved using optical flow fitting and/or stereo matching. While these methods can provide accurate human pose estimates, they typically need to work in a controlled environment and require many high-resolution cameras (typically at least four) with good spatial coverage of the scene (typically circular coverage) to resolve the ambiguities caused by occlusion.
Of course, there are also methods that construct a "proxy geometry" using multi-view contours or multi-view stereo. After the proxy geometry is constructed, a skeleton is fitted into the geometry. Although these methods can provide very good results, they impose limitations on the setup, since they require a carefully constructed studio, many high-resolution cameras and very good spatial coverage.
In addition to this, there is another class of algorithms based on image analysis and segmentation, which can use machine learning methods to distinguish body parts. However, such analysis usually requires high-resolution footage, which is an unrealistic requirement for most application scenarios.
Disclosure of Invention
The present invention is directed to the above problems, and its object is to provide a data-driven human body pose estimation method, apparatus and computer readable medium that can operate in an uncontrolled environment with loose camera calibration, low athlete resolution and occlusion.
To achieve this object, the invention adopts the following technical solution.
A first aspect of the invention provides a method for estimating the human pose of an articulated object model, wherein the articulated object model is a computer-based 3D model of a real-world object recorded by one or more source cameras, the articulated object model represents a plurality of joints and links connecting the joints, and the human pose of the articulated object model is defined by the spatial positions of the joints, the method comprising the steps of:
acquiring a video stream of views of real world objects (14) recorded by a camera (9);
obtaining at least one sequence of source images (10) from said video stream;
processing the at least one sequence of source images to extract, for each image, a corresponding source image segment comprising a view of the real world object separated from the image background, thereby generating at least one sequence of source image segments;
maintaining in a database in computer readable form a sequence of reference contours, each reference contour associated with an articulated object model and with a particular reference body pose of the articulated object model;
for each sequence of said at least one sequence of source image segments, matching the sequence with a plurality of reference contour sequences and determining one or more selected reference contour sequences that best match said sequence of source image segments;
wherein the matching of the two sequences is done by matching each source image segment with a reference contour at the same position within its sequence, calculating a matching error indicating the degree of their matching, and calculating a sequence matching error from the matching errors of the source image segments;
for each selected sequence of reference contours, retrieving a reference body pose associated with one of the reference contours; and
calculating an estimation result of the human pose of the articulated object model from the retrieved one or more reference human poses;
one image of the source image sequence is designated as a frame of interest and the source image segment generated therefrom is designated as a source image segment of interest;
the sequence matching error is a weighted sum of the matching errors between corresponding elements of the two sequences;
the matching error of the source image segment of interest is weighted the most and decreases with the distance of the source image segment (within the sequence) to the source image segment of interest; and
a reference body pose of the reference contour matching the source image segment of interest is retrieved.
The estimation result obtained by the above steps is an initial body pose estimate, which can then be used in subsequent extension steps, e.g. for ensuring local consistency between body pose estimates from consecutive frames and for ensuring global consistency over a longer sequence of frames. The overall flow of the above steps is sketched below.
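For illustration only, the claimed steps might be summarized as in the following Python sketch. All names here (ReferenceEntry, match_fn, etc.) are hypothetical stand-ins for the components described above, not identifiers from the patent; the sequence matching function is supplied externally and is assumed to weight the frame of interest highest.

```python
from dataclasses import dataclass

@dataclass
class ReferenceEntry:
    contours: list   # sequence of reference contours (binary rasters)
    pose: object     # reference body pose associated with the contour
                     # at the sequence's frame of interest

def estimate_initial_pose(source_segments, reference_db, match_fn, n=2):
    """Sketch of the claimed pipeline (illustrative, assumes len >= 2n+1).

    source_segments: sequence of source image segments (object separated
    from the background). match_fn(src_window, ref_window) returns the
    sequence matching error of two contour sequences.
    """
    t = len(source_segments) // 2               # designate the frame of interest
    window = source_segments[t - n:t + n + 1]   # segments around it
    # Match against every reference contour sequence; keep the best match.
    best = min(reference_db, key=lambda e: match_fn(window, e.contours))
    # Retrieve the reference body pose associated with the reference
    # contour matching the source image segment of interest.
    return best.pose
```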
Preferably, one image of the source image sequence (10) is designated as a frame of interest (56) and the source image segment (13) generated therefrom is designated as a source image segment of interest;
the sequence match error is a weighted sum of the match errors of the two sequences (51, 52);
the matching error of the source image segment of interest has the greatest weight and decreases with the distance of the source image segment to the source image segment of interest; and retrieving a reference body pose of the reference contour matching the source image segment of interest.
Preferably, at least two source image sequences (10) recorded simultaneously by at least two source cameras (9, 9') are obtained and processed, and the human pose estimate of the articulated object model (4) is obtained from the retrieved reference human poses determined from the at least two source image sequences by selecting the combination of retrieved reference human poses that fits best in 3D space.
Preferably, local consistency is established between human poses determined from at least one sequence of consecutive source image segments, each human pose being associated with at least one source image segment (13), wherein elements of the articulated object model (4), i.e. joints (2) and/or links (3), in one or both human poses correspond to limbs of the real world object (14) whose labeling is ambiguous:
for each of a pair of consecutive source image segments, determining, from the associated human pose and for each of the possible ambiguous labelings of each human pose, a corresponding labeling of the image points belonging to the limbs in the source image segment;
selecting a labeling of the human pose for the first of the pair of consecutive source image segments;
calculating the optical flow between the first and the second of the pair of consecutive source image segments;
determining from the optical flow the positions in the second image segment to which the image points corresponding to limbs of the first image segment have moved, and labeling those positions in the second image segment according to the labeling of the limbs in the first image segment;
from the possible ambiguous labelings of the human pose of the second image segment, selecting the labeling that coincides with the labeling determined from the optical flow.
Preferably, the step of labeling image points belonging to the limbs in the source image segments is accomplished as follows:
for each of a pair of consecutive source image segments, a projection of a model of the real world object (14) into the source image is determined from the associated human pose and each of the possible ambiguous labelings of each human pose, and the image points of the source image segment are thereby labeled according to the projected limb visible at each image point.
Preferably, given a sequence of human poses, each associated with one of the source image segments of the sequence, and where the labeling of one or more sets of limbs is ambiguous, global consistency of the human poses matching the sequence of consecutive source images is established by the following steps:
for each source image segment (13) of the source image sequence,
retrieving the associated body pose and the labeling of the model elements determined in the previous step;
determining the human body pose in the database having the smallest distance to the retrieved human body pose, taking the labels of the database human poses into account;
calculating a consistency error term representing the difference between the two human poses;
calculating a total consistency error for the entire source image sequence from these consistency error terms;
repeating the above steps to calculate the total consistency error for all variants of possible global labelings of the ambiguous limb sets;
selecting the global labeling variant with the smallest total consistency error.
Preferably, the total consistency error is the sum of all consistency error terms over the sequence.
A second aspect of the present invention provides an apparatus for estimating a human body pose, comprising a processor and a memory connected to each other, wherein the memory is used for storing a computer program comprising program instructions, and the processor is configured to call the program instructions to execute the method provided by the first aspect of the present invention.
A third aspect of the invention provides a computer-readable storage medium having stored thereon a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method provided by the first aspect of the invention.
A fourth aspect of the invention provides a method of manufacturing a computer-readable storage medium, comprising the steps of storing computer-executable instructions on the computer-readable medium, which, when executed by a processor of a computing system, cause the computing system to perform the method steps provided by the first aspect of the invention.
Compared with the prior art, the invention has the advantage that it can operate in an uncontrolled environment with loose camera calibration, low player resolution and occlusion, and still estimate the human body pose in a data-driven manner. The method and apparatus provided by the invention can estimate the human body pose using as few as two cameras, and make no limiting assumptions about possible body poses or motion sequences. By using temporal consistency for the initial body pose estimation and the body pose refinement, user interaction is limited to a few clicks to flip the arms and legs in cases of failure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below; those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic view of a real-world scene;
FIG. 2 shows an articulated object model;
FIG. 3 shows (a) a typical contour in a segmented image and (b) the three best matching human poses in the database;
FIG. 4 shows an estimated human pose in a match, with the 3D skeleton projected into the source camera view;
FIG. 5 is an overview of the method;
FIG. 6 shows 2D human pose estimation results according to the prior art and according to the present method;
FIG. 7 is an example of human pose ambiguity;
FIG. 8 illustrates local consistency;
FIG. 9 shows an estimated body pose before and after body pose optimization;
FIG. 10 shows failure cases;
FIG. 11 is a resulting sequence showing all camera views per frame.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below. In order to make the objects, technical solutions and advantages of the present invention more apparent, the invention is further described in detail with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting it. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments merely provides a better understanding of the present invention by illustrating examples of it.
This embodiment focuses on a human body pose estimation method for unconstrained broadcast footage of track-and-field events. In this scenario, camera placement, object size and temporal consistency present particular challenges. Although human pose estimates can be computed from a multi-camera setup, the small number of available cameras and the wide baseline remain problematic. In addition, since the cameras are typically placed on only one side of the field, scene coverage is limited. Although these cameras provide high-resolution images, they are typically set to a wide angle for editorial reasons; players therefore typically cover a height of only 50 to 200 pixels. Moreover, the motion of the players can be very complex, especially in contact sports like basketball, so there is considerable occlusion.
To this end, the invention proposes a data-driven human pose estimation method that can operate in an uncontrolled environment where camera calibration is loose, athlete resolution is low, and occlusion exists. The resulting method and system can estimate human body pose using only as few as two cameras. And it has no limiting assumption about possible body gestures or motion sequences. By using time consistency for initial body pose estimation and body pose refinement, user interaction is limited to a few clicks to invert the arms and legs in the event of failure.
Many existing methods in human pose estimation rely on tracking or segmenting images in 2D and lifting the skeleton to 3D using calibration information. These methods work well on high-resolution footage, but due to lack of information they often fail on low-resolution images and are very sensitive to external lighting conditions. The method according to the invention is suitable for completely uncontrolled low-resolution outdoor settings, since it relies only on rough contours and a rough calibration.
The method uses a database of human poses and corresponding contours to extract human pose candidates in 2D, and uses camera calibration information to compute the corresponding 3D skeleton. The method first executes a novel temporally consistent, contour-based search strategy in the database to extract the closest temporally consistent database candidates. In addition, a novel temporal consistency step is applied to the initial body pose estimation. Since the exact true body pose is usually not in the database, this only yields the closest match, not the exact body pose. To this end, the invention provides a novel spatio-temporal optimization technique that automatically recovers an accurate three-dimensional human body pose using temporal information.
Compared with the prior art, the invention makes the following inventive contributions:
it provides an initial human body pose estimation method that searches database human poses based on temporally consistent contours;
it improves local and global consistency checking methods for the initial human body pose estimation;
it provides a spatio-temporal human body pose optimization method based on new constraints.
Unlike approaches that learn a skeletal statistical model, the invention uses the human body pose database directly. This has two advantages: first, the data-driven approach makes it easy to add new body pose sequences to accommodate new settings or previously unknown body poses; second, because the method only searches for the closest human pose in the database, there is no statistical bias towards more common human poses. Using a database containing anthropometrically correct data always results in a reasonable initial body pose estimate.
The present invention will be described in detail below with reference to the accompanying drawings, taking a javelin competition scenario as an example.
Fig. 1 gives a schematic view of a real world scene 8. The scene 8 comprises a real world object 14, such as a person, observed by two or more source cameras 9, 9', which generate video streams of source images 10, 10', respectively. The system and method according to the invention generate a virtual image 12 showing the scene 8 from a viewpoint of a virtual camera 1 that differs from the viewpoints of the source cameras 9, 9'. Optionally, a virtual video stream is generated from a sequence of virtual images 12. The device implemented according to the invention comprises a processing unit 15 which, given the source images 10, 10', performs the image processing computations implementing the method of the invention and generates one or more virtual images 12. The processing unit 15 interacts with a storage unit 16 for storing the source images 10, the virtual images 12 and intermediate results. The processing unit 15 is controlled by a workstation 19, which typically comprises a display device, a data input device such as a keyboard and a pointing device such as a mouse. In addition, the processing unit 15 may be configured to provide a virtual video stream to a TV broadcast transmitter 17 and/or a video display device 18.
Fig. 2 shows a 3D model 1 of the scene 8, which comprises an articulated object model 4 of the real world object 14. The 3D model 1 typically also comprises other object models, e.g. representing other persons, the ground, buildings, etc. (not shown). The articulated object model 4 comprises joints 2 connected by links 3, the links 3 essentially corresponding to bones or limbs in the case of a human model. Each joint 2 is defined as a point in 3D space, and each link 3 may be represented by a straight line connecting two joints 2 through 3D space. Based on the human pose, the 3D shape of a mesh model of the object, or another model representation of the 3D shape of the real world object, may be calculated and projected into the 2D images seen by the source cameras 9, 9'. This allows the body pose to be checked and improved. Although the body pose is described by one-dimensional links, the limbs represented by the (mesh) model placed according to the body pose may extend in three dimensions. Although the present application is explained in terms of modeled human shapes, its scope also covers application to animal shapes or to articulated artificial structures such as robots.
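For concreteness, a minimal data structure for such an articulated object model might look as follows. This is a sketch; the field names are illustrative, not taken from the patent.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ArticulatedModel:
    joints: np.ndarray   # (J, 3) array: each joint is a point in 3D space
    links: list          # pairs (i, j) of joint indices; a link is the
                         # straight 3D line segment between joints i and j

    def link_segment(self, k):
        """Return the two 3D endpoints of link k."""
        i, j = self.links[k]
        return self.joints[i], self.joints[j]
```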
First, the invention utilizes a contour database for 2D body pose estimation in each individual input view. The invention assumes that the object 14 can be roughly segmented from the background, for example using chroma keying or background subtraction. Fig. 3a shows a typical example of a segmented image 13 in the application scenario. The basic idea of computing an initial guess of the object's body pose, i.e. the 2D positions of the skeleton joints 2, is to compare the contour with a database of known contours, each associated with a skeletal body pose; see fig. 3b with different reference contours 13', two of which are reproduced at a smaller scale to save space.
Fig. 4 shows an example of a frame in a sequence of source images from an athletic scene, where the 3D body pose found by the method is projected into the image and superimposed on the athlete.
In one embodiment, the method comprises two steps, as shown in FIG. 5. In a first step 62, given the coarse calibration, the contours 61, and the body pose database 66 with reference contours, the method extracts the 2D body pose in each individual camera view using a spatio-temporal contour matching technique, resulting in a triangulated 3D body pose guess 63 (labeled "estimated 3D body pose"). However, such human pose detection can easily produce ambiguities, i.e. left-right flips of symmetric parts: although the skeleton closely matches the contour, the athlete's arms or legs may still be flipped. Due to occlusion and low resolution, these ambiguities are sometimes difficult to detect even for the human eye. An optical-flow-based technique is therefore used to detect the instances where flipping occurs and to correct them, yielding a consistent sequence. It is important to note that in the described setup the optical flow is not reliable enough to track the entire motion of all body parts of the athlete over the whole sequence, but it can be used for local comparisons. The first step uses a sequence 51 of source image segments 13, matching this sequence 51 with sequences 52 of reference contours 13', centered on an image frame 56 of interest.
However, typically no body pose in the database will exactly match the actual body pose. Thus, in the second part or step 64 of the method (labeled "spatio-temporal human pose optimization"), this initial 3D human pose 63 is refined by an optimization process based on spatio-temporal constraints. The resulting optimized 3D skeleton 65 (labeled "optimized 3D body pose") matches the contours in all views and features temporal consistency over successive frames.
An initial body pose estimate is computed by first retrieving the 2D body pose of each athlete in each camera view using a novel spatio-temporal data-driven contour search method. Once the 2D body pose of each player in each camera is found, the 2D joint positions observed in the images can be placed in 3D space using the calibration information of the cameras (this step is also referred to as lifting the 2D positions to 3D). The invention calculates the 3D position of each joint by intersecting the rays corresponding to that joint's 2D position in each camera view. Since the rays do not intersect exactly, the invention selects the point closest to these rays in the least-squares sense. From this, a triangulation error E can be derived, as well as an initial camera offset.
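A standard way to realize this lifting step is a closed-form least-squares intersection of the viewing rays. The sketch below is an illustration, not the patent's implementation; extracting a ray (origin, unit direction) per camera from the calibration is assumed to be done elsewhere.

```python
import numpy as np

def triangulate_joint(origins, directions):
    """Least-squares 3D point closest to a set of camera rays.

    origins: (C, 3) ray origins (camera centers); directions: (C, 3) unit
    ray directions through the observed 2D joint in each view.
    Returns the 3D joint position and the triangulation error E.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        M = np.eye(3) - np.outer(d, d)   # projector orthogonal to the ray
        A += M
        b += M @ o
    x = np.linalg.solve(A, b)            # normal equations of the LS problem
    # Triangulation error: mean distance of the point to the rays.
    E = np.mean([np.linalg.norm((np.eye(3) - np.outer(d, d)) @ (x - o))
                 for o, d in zip(origins, directions)])
    return x, E
```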
The method represents the 3D skeleton of the human pose S in angle space in the following way: each bone i is represented relative to its "parent" bone by two angles α_i and β_i and the bone length l_i. The "root" bone is defined by a global position p_0 and a direction specified by three angles α_0, β_0, γ_0. The joint positions j in 3D Euclidean space can easily be calculated from this angular representation, and vice versa (taking gimbal lock into account).
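For illustration, converting this angular representation back into 3D joint positions is a forward-kinematics pass over the skeleton hierarchy. The sketch below simplifies the description above by interpreting each bone's two angles directly as spherical coordinates in world space; the patent expresses them relative to the parent bone, which would add one rotation per hierarchy level.

```python
import numpy as np

def pose_to_joints(root_pos, bones, parent, alphas, betas, lengths):
    """Forward kinematics: angular pose representation -> 3D joint positions.

    bones: bone indices in parent-before-child order; parent[i] is the index
    of the joint at which bone i starts (0 is the root joint).
    """
    joints = {0: np.asarray(root_pos, dtype=float)}  # root joint position p_0
    for i in bones:
        a, b, l = alphas[i], betas[i], lengths[i]
        # Simplification: spherical angles interpreted in world space.
        direction = np.array([np.sin(b) * np.cos(a),
                              np.sin(b) * np.sin(a),
                              np.cos(b)])
        joints[i] = joints[parent[i]] + l * direction
    return joints
```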
A large database sampling the entire range of motion is important to the method of the present invention and is difficult to create manually. The invention therefore uses the CMU motion capture database. A template mesh rigged with the same skeleton is deformed using linear blend skinning (a skeletal skinning animation algorithm) to match the database body poses; virtual snapshots are then taken and the contours extracted. In this way, the invention creates a database containing approximately 50,000 contours.
However, the CMU database covers only a limited set of body pose types, as it mainly contains running and walking sequences. The invention therefore manually adds 1200 contours from several sports scenes. Although the number of manually added contours is small compared to the number of automatically generated ones, it suffices to extend the span of example human body poses enough to achieve good results. It is important to note that the added example human poses are not the same as the sequences used for fitting. The database can be continuously extended with newly generated human poses, yielding even better initial human pose estimates.
The method of the present invention accepts as input a coarse binary contour mask and a coarse camera calibration for each athlete, and compares these contours with contours from the database. It calculates the quality of a match between an input contour and a database contour on a fixed raster (a grid 40 pixels high and 32 pixels wide) suitable for the segmentations.
The contour matching method effectively extends the traditional approach by exploiting temporal information. The method does not rely on single-frame matching but considers a weighted sum, over a sequence of image frames, of the differences between the source image segments 13 and the reference contours 13'. When the binary input contour image I_t (from the image frame of interest 56) with index t is compared with the contour image I'_s with index s from the database, the resulting pixel error E_q(s) is calculated as follows:
E_q(s) = Σ_{i=-n..n} θ_s(i) · (1/|P|) · Σ_{p∈P} |I_{t+i}(p) − I'_{s+i}(p)|    (1)
where n is the filter window size, i.e., the number of frames before and after the frame of interest 56 under consideration; p is the set of all raster positions where the corresponding pixel is unlikely to be occluded in both images, i.e. is not expected to be part of the contour of the other player. | P | represents the number of raster positions in P. The raster position may correspond to the actual hardware pixels of the camera, or may correspond to pixels in the scaled image calculated from the camera image.
Here the weight θ_s(i) describes a normalized Gaussian function 53 centered around s. For contour images I'_{s+i} not contained in the database sequence, θ_s(i) is set to 0 before the normalization. The advantage of this method is that it compares sequences rather than single images, which not only increases temporal coherence (yielding smooth motion) but also effectively improves the human pose estimation. With this method, even image portions that are occluded over several frames can be fitted more robustly. In general, this approach helps to prevent matching contours that are similar but result from completely different human poses. FIG. 6 presents a direct comparison of the initial body pose estimation according to the invention with a prior art body pose estimation: (a) a 2D human pose estimated by comparing only the current frame to the database (prior art approach); (b) the database entry found by this single-frame comparison; (c) the 2D human pose estimated by comparing contour sequences; (d) the found database sequence, whose middle entry corresponds to the image segment of interest.
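A minimal sketch of this temporally weighted comparison, under the assumptions that the contours have already been resampled to the fixed 40×32 raster described above, that per-frame masks P of unoccluded raster positions are given, and that the Gaussian weights are renormalized after zeroing entries outside the database sequence:

```python
import numpy as np

def sequence_match_error(src_seq, ref_seq, mask_seq, n=2, sigma=1.0):
    """Pixel error E_q(s) of equation (1): weighted contour differences.

    src_seq / ref_seq: 2n+1 binary contour rasters centered on the frame of
    interest; ref entries may be None where the database sequence ends.
    mask_seq: per-frame boolean rasters P of positions not occluded by
    other players.
    """
    offsets = np.arange(-n, n + 1)
    theta = np.exp(-offsets**2 / (2.0 * sigma**2))        # Gaussian weights
    theta = np.where([r is not None for r in ref_seq], theta, 0.0)
    theta /= theta.sum()                                  # renormalize

    err = 0.0
    for w, src, ref, P in zip(theta, src_seq, ref_seq, mask_seq):
        if w == 0.0:
            continue
        diff = np.abs(src[P].astype(float) - ref[P].astype(float))
        err += w * diff.sum() / P.sum()                   # average over |P|
    return err
```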
Using this pixel error, the invention searches for the two best human pose hypotheses in each camera view and selects the best combination of these hypotheses, namely the one with the lowest triangulation error E. Of course, in alternative embodiments, more than two body pose hypotheses from each camera may be used to determine the best combination, or a different number per camera, or only one body pose hypothesis from at least one camera, in which case the best hypothesis may be chosen without considering triangulation errors.
FIG. 7 shows an example of human pose ambiguity: (a) possible labels in the first camera; (b) a top-view schematic illustrating two possible positions of the knee; and (c) possible labels in the second camera.
Since the 2D human pose detection step relies on contour matching, ambiguities can easily arise. Even given a contour and the matching initial 2D body pose retrieved from the database, one still cannot determine whether the "left" and "right" labels on the arms and legs are correct. To this end, in one embodiment, after the 2D contour match, information from the retrieved database body pose defines whether a leg or arm (or, in general, one member of a set of symmetric joint chains) is labeled, e.g., "left" or "right". The remaining ambiguity is resolved as follows. FIG. 7(a) shows an example contour with two possible labelings of the legs; the possible positions of the right knee are marked with diamonds. This view is from the left camera in the schematic of FIG. 7(b). The same object in the same frame shows a contour in the other camera, as in FIG. 7(c), again with two possible leg labelings. Thus, after lifting to 3D, there are four possible right knee positions in 3D, shown in FIG. 7(b). If the right knee falls on one of the positions marked with an asterisk, the left knee falls on the other asterisk-marked position; if the right knee falls on one of the circled positions, the left knee falls on the other circled position. Assuming that the circles mark the correct knee positions, two different types of failure can occur: either the knees are at the correct positions but labeled incorrectly in 3D, or the knees appear at the wrong positions (asterisks).
Without additional information it is not possible to decide which positions are correct in this case, i.e. to select the single correct one of the four possibilities, especially if only two cameras are available. One possible method of eliminating the flip ambiguity is to examine all possible combinations and keep only the anatomically plausible ones. It should be noted, however, that several flip configurations may still produce anatomically correct body poses.
To properly resolve these ambiguities, the present invention uses a two-step approach: first, local consistency is established between each pair of consecutive 2D frames, so that the entire sequence of 2D body poses is consistent in time. Second, any remaining ambiguity throughout the entire sequence is resolved globally.
The first step towards local consistency is to ensure that the 2D human poses found in camera frames k (as shown in FIGS. 8(a) and 8(b)) and k+1 (as shown in FIG. 8(c)) are consistent, i.e. that there is no left-right flip of the arms and legs between two consecutive frames. In other words, if a pixel in frame k belongs to the right leg, then in frame k+1 that pixel should also belong to the right leg. To apply this assumption, the invention assigns a corresponding bone to each pixel in the color images I_C(k) and I_C(k+1) and calculates the optical flow between the frames (as shown in FIG. 8(d)).
FIG. 8 depicts a method for establishing local consistency through model-based consistency checking. Although a mesh-based model is used in this embodiment, the method may be performed with another representation of the 3D shape of the real world object. Joints assigned to the right leg are marked with diamonds: (a) the previous frame and (b) its fitted mesh; (c) a wrong leg assignment in the current frame; (d) the optical flow; (e) the mesh fitted with the wrong leg labels in the current frame; (f) the mesh fitted with the flipped (correct) leg labels in the current frame. In (e) and (f), the irregular white shapes on the calf and foot represent pixels voting against the labeling in (e) (54) and for it in (f) (55). A few pixels vote inconsistently in each case, which cannot be completely avoided; this is why the decision is based on a vote over all pixels rather than on any single one.
The basic idea is that a pixel in frame k and the corresponding pixel in frame k+1, computed using the optical flow, should be assigned to the same bone; otherwise, a flip may have occurred (as shown in FIG. 8(c)). The invention therefore computes this association for all combinations of possible left/right labelings (see FIG. 7), calculates the consistency of the pixel flow for each combination, and selects for the second frame the label combination with the highest consistency. To make this approach robust against optical flow errors, the invention considers only pixels with good optical flow whose labels in both frames are of the same limb type, i.e. both arms or both legs. For example, if pixel p belongs to the left arm in frame k and to the torso in frame k+1, this is likely because the optical flow is inaccurate due to occlusions, and the pixel is therefore excluded. If the pixels belong to different bones of the same type (arm or leg), this is a strong indication of a flip. The invention adopts a voting strategy to select the optimal flip configuration: a limb is labeled "left" or "right" depending on whether it contains more "left" or "right" pixels.
To do this, each pixel must be assigned to its corresponding bone. A simple assignment based on the distance to the bone does not take occlusion into account and is therefore not optimal. Instead, the invention uses the information from all cameras to construct a 3D body pose (as described in the initial body pose estimation section), color-codes all bones, and warps and renders the template mesh for all possible flips in all cameras. The rendered mesh carries, for each limb, the information whether it is "left" or "right". Pixel assignment is then a simple lookup that remains accurate even in the presence of self-occlusion: for each pixel, the rendered mesh at the same location indicates whether the pixel belongs to "left" or "right".
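The voting might be sketched as follows. This is an illustration under stated assumptions: the label images are assumed to come from rendering the posed template mesh as described above, and the flow field from any standard optical flow method.

```python
import numpy as np

def flip_votes(labels_k, labels_k1, flow):
    """Count label agreements between frame k and frame k+1 under the flow.

    labels_k, labels_k1: integer label images (e.g. 0 = other/torso,
    1 = left leg, 2 = right leg) from rendering the posed template mesh.
    flow: (H, W, 2) optical flow from frame k to frame k+1.
    Returns a score; evaluate one score per flip combination and keep the
    combination with the highest score.
    """
    h, w = labels_k.shape
    votes = 0
    for y in range(h):
        for x in range(w):
            a = labels_k[y, x]
            if a == 0:
                continue
            # Follow the flow to the corresponding pixel in frame k+1.
            x1 = int(round(x + flow[y, x, 0]))
            y1 = int(round(y + flow[y, x, 1]))
            if not (0 <= x1 < w and 0 <= y1 < h):
                continue
            b = labels_k1[y1, x1]
            # Only compare pixels of the same limb type (both legs here);
            # mismatched types indicate unreliable flow and are skipped.
            if b in (1, 2):
                votes += 1 if a == b else -1
    return votes
```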
After the local consistency step is completed, there should no longer be any flip between consecutive frames, which means that the entire sequence is consistent. Note, however, that the entire sequence may still be flipped in the wrong direction. This can be addressed by a single binary disambiguation over the entire sequence. The invention therefore checks global consistency by evaluating a function of the possible global labelings of the arms and of the legs. The final labeling is chosen as the label combination that minimizes the sum of the following error terms across the sequence:
E_g = λ_DB · E_DB + λ_t · E_t    (2)

The above formula is a weighted sum of the error terms E_DB and E_t with constant parameters λ_DB and λ_t. E_DB is the "distance to the database", which ensures that the selected labeling/flip produces reasonable human poses along the sequence. It penalizes a human pose according to its distance to the closest human pose P in the database:

E_DB = min_P (1/|J|) · Σ_{j∈J} ((α_j − α'_j)² + (β_j − β'_j)²)    (3)

where α_j and β_j are the joint angles of the triangulated pose at joint j, α'_j and β'_j are the corresponding angles of the database human pose P, and |J| is the number of joints. When searching for the closest database body pose for each body pose along the sequence, the labels (right/left) of the database body poses are taken into account; that is, the limbs of the sequence's body poses are only matched against identically labeled limbs of the database body poses. Since the database only contains anthropometrically correct body poses, this penalizes implausible body poses.
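Equation (3) amounts to a nearest-neighbor search in joint-angle space over the identically labeled database poses. A minimal sketch:

```python
import numpy as np

def distance_to_database(alpha, beta, db_alphas, db_betas):
    """E_DB of equation (3): squared angular distance to the closest
    database pose, averaged over the |J| joints.

    alpha, beta: (J,) joint angles of the estimated pose.
    db_alphas, db_betas: (N, J) angles of N identically labeled database poses.
    """
    d = ((db_alphas - alpha)**2 + (db_betas - beta)**2).sum(axis=1)
    return d.min() / alpha.size
```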
However, the optimal 3D body poses computed by the initial body pose estimation are still limited to the database body poses fitting the views. The database contains only a subset of all possible body poses and therefore typically does not contain the exact solution.
To this end, the present invention applies an optimization method to retrieve more accurate human body poses, as shown in FIG. 9. To guide this optimization method, the present invention combines several spatial and temporal energy functions and uses the optimization method to minimize them.
The energy function or error function is based on the representation of the skeleton S described in the initial body pose estimation section. All parameters except the bone lengths are variable per frame; the bone lengths are also variable but remain constant throughout the sequence and are initialized to the average of the per-frame lengths. This automatically introduces anthropometric constraints, as bones should not shrink or grow over time. Another good property of the chosen skeletal representation is that it significantly reduces the number of variables. To cope with calibration errors, the invention additionally optimizes a translation vector (camera offset) per object, camera and frame.
The invention defines the energy or error function of each frame and object as a weighted sum of the following error terms:
E(S) = ω_s·E_s + ω_f·E_f + ω_DB·E_DB + ω_rot·E_rot + ω_p·E_p    (4)
it should be noted that not all error terms are necessary conditions for the optimization process to return useful results. Depending on the nature of the scene and the real world objects, one may omit one or more error terms. For example, in one embodiment, to observe a moving scene, at least the contour filling error, and optionally the distance database error, is used.
In a local optimization process, the error function may first be minimized with the human poses of the object seen in one frame, or in a group of frames recorded at the same time, as the variables. Alternatively, the error function can be minimized over a longer sequence of frames, where the human poses of all frames are varied; this makes it possible to find the best matches for the entire sequence that are consistent with one another (according to the optimization criteria linking successive frames).
Contour matching error term E_s: the joints of a correct 3D skeleton should project onto the 2D contours in all cameras. The error term E_s penalizes joint positions whose 2D projection lies outside the contour:

E_s = Σ_{c∈C} (1/|J⁺|) · Σ_{j∈J⁺} EDT_c(P_c(j))    (5)

where C is the set of all cameras covering the object contour, and J⁺ is the union of the set J of all joints with one or more additional points placed along each bone (e.g. a point in the middle of the bone). The normalized Euclidean Distance Transform (EDT) returns, for each 2D point in a camera view, the distance to the nearest point inside the contour, divided by the larger edge of the contour bounding box. This normalization is crucial to make the error independent of the size of the object in the camera image (which varies with zoom). P_c(j) is the projection transforming the 3D joint j into the space of camera c, taking the camera offset into account in order to correct small calibration errors.
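A sketch of this term, assuming a precomputed normalized EDT image per camera (zero inside the contour, growing with distance outside) and per-camera projection functions that include the offset correction; these inputs are assumptions of the sketch, not identifiers from the patent:

```python
import numpy as np

def silhouette_match_error(joints3d, cameras, edts):
    """E_s of equation (5): per camera, the mean normalized EDT value at the
    projected positions of all points in J+ (joints plus mid-bone points).

    cameras: per camera, a function project(j3d) -> (x, y) including the
    offset correction P_c; edts: per camera, a 2D normalized EDT array.
    """
    total = 0.0
    for project, edt in zip(cameras, edts):
        for j in joints3d:                                   # j ranges over J+
            x, y = project(j)
            total += edt[int(round(y)), int(round(x))]       # 0 inside contour
    return total / len(joints3d)
```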
Contour filling error term E_f: while the contour matching term E_s penalizes joints lying outside the contour, nothing so far forces the pose to fill the contour. The filling error term E_f prevents joint positions from collapsing too close to other joints in the torso; in other words, this term ensures that there is a joint in every limb:

E_f = Σ_{c∈C} (1/|R_c|) · Σ_{r∈R_c} min_{j∈J⁺} dist(ray_c(r), j)    (6)

where R_c is the set of all raster positions of the 2D body pose estimation that lie within the contour in camera c, ray_c(r) transforms the raster position r from the camera space of camera c into a ray in world space, and dist() is the distance of that ray to the joint.
The purpose of this error term is to penalize human poses in which the elements of the articulated object model (here the joints and links, or just the links) are folded up inside the contour. The error term attracts an element of the pose towards every raster or grid point within the contour; in other words, the contour fill error increases with the distance from a point to its nearest element. The nearest elements may be links or joints, or both.
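A sketch of the fill term, simplified to measure distances to joints only (the nearest element could equally be a link, as noted above); the per-camera ray functions and point sets are assumed inputs:

```python
import numpy as np

def ray_point_distance(origin, direction, point):
    """Distance from a 3D point to a ray with unit direction."""
    v = point - origin
    return np.linalg.norm(v - np.dot(v, direction) * direction)

def contour_fill_error(joints3d, cameras_rays, inside_points):
    """E_f of equation (6): for each raster point inside the contour, the
    distance from its back-projected ray to the nearest joint.

    cameras_rays: per camera, a function r -> (origin, unit direction)
    turning a raster position into a world-space ray; inside_points: per
    camera, the set R_c of raster positions inside the contour.
    """
    total = 0.0
    for to_ray, R in zip(cameras_rays, inside_points):
        cam_sum = 0.0
        for r in R:
            o, d = to_ray(r)
            cam_sum += min(ray_point_distance(o, d, j) for j in joints3d)
        total += cam_sum / len(R)        # normalize by |R_c|
    return total
```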
Distance-to-database human pose error term E_DB: this term was defined in equation (3). It ensures that the final 3D body pose is kinematically plausible (e.g. the knee joint bends the correct way) by using the database of correct body poses, implicitly adding anthropometric constraints to the optimization. The closest database body pose used here is found by a new search, since the estimated body pose may have changed during the optimization process.
Smoothness error terms E_rot and E_p: human motion is usually smooth, so the skeletons of adjacent frames should be similar. This allows temporal consistency to be introduced into the human pose optimization. E_rot penalizes large changes of the interior angles of the skeleton between successive frames, and E_p penalizes large movements:

E_rot = Σ_i ((α_i − α'_i)² + (β_i − β'_i)²)    (7)

E_p = |p_0 − p'_0|    (8)

where α'_i and β'_i are the corresponding angles of the same object in the previous frame, and p'_0 is the global position of the root joint in the previous frame. The rotation of the root bone can be treated and constrained in a similar manner.
Length error term E_l: when processing a sequence of frames, the initialization of the bone lengths (or link lengths) is already a good approximation. The invention therefore tries to keep the optimized human pose close to these lengths:

E_l = Σ_i (l_i − l̄_i)²    (9)

where l_i is the final bone length and l̄_i is the initial bone or link length, or another reference bone or link length. In a further embodiment of the method according to the invention, l̄_i can be considered the true bone length; it is then also variable and can be estimated when the optimization is performed over the entire sequence of frames.
The optimization process is as follows: to minimize the energy term in equation (4), the invention employs, for example, a local optimization strategy, iteratively optimizing the variables by performing line searches along randomly chosen directions. For each variable, the invention selects 10 random directions and performs 20 global iterations. This optimization process can be carried out independently of the initial body pose estimation process described above, i.e. using any other body pose estimation process or a constant default body pose as the initial estimate. However, an initial body pose estimation that provides a reasonably good initial estimate makes it likely that the optimization finds the globally optimal match, avoiding local minima of the error function. FIG. 9 illustrates the effect of the optimization process. The leftmost example shows the effect of the contour fill error term: the player's arm may be moved up or down to reduce the contour matching error term, but the contour filling error term is reduced only when the arm is moved up. FIG. 9 shows a significant improvement over the method of Germann et al., which includes no human pose optimization at all, so that every human pose must be corrected manually.
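The strategy can be sketched generically as follows. The energy callable stands for E(S) of equation (4) evaluated on a flattened parameter vector (angles, root position, bone lengths, camera offsets); the step sizes are illustrative assumptions, not values from the patent.

```python
import numpy as np

def optimize_pose(energy, x0, n_dirs=10, n_iters=20, step=0.1):
    """Derivative-free local optimization via random-direction line searches.

    energy: callable mapping a parameter vector to the scalar E(S).
    A crude sketch of the described strategy, not a tuned optimizer.
    """
    x = np.asarray(x0, dtype=float)
    best = energy(x)
    for _ in range(n_iters):                  # global iterations
        for _ in range(n_dirs):               # random search directions
            d = np.random.randn(x.size)
            d /= np.linalg.norm(d)
            # Coarse line search along +/- d at two scales.
            for t in (step, -step, step / 4, -step / 4):
                cand = x + t * d
                e = energy(cand)
                if e < best:
                    x, best = cand, e
    return x, best
```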
The proposed method was evaluated on four real sequences of broadcast footage of a javelin competition using two or three cameras, resulting in about 1500 human poses to be processed. FIGS. 4 and 11 show a subset of the results.
Each row in fig. 11 shows a set of consecutive body poses and each entry shows an image of a respective player in all available cameras. It can be seen that the method can recover an accurate human pose in most cases, even with only two cameras and very low resolution images.
The parameter values used by the invention in all results (calculated using the automatic parameter tuning system) are as follows:
ω_s = 9, ω_f = 15, ω_DB = 0.05, ω_rot = 0.1, ω_p = 1, λ_DB = 0.15, λ_t = 0.3
For the optimization functions in equations (4) and (2), the above parameters were used for all results. They were obtained by the following tuning process: 2D human poses were manually annotated in two scenes; the method was then run and its results automatically compared with the manual annotations. Using this comparison as an error function, with the parameters as the variables, automatic parameter optimization is achieved.
The human body pose estimation method according to the invention needs about 40 seconds per frame and per athlete in the two-camera setup, and about 60 seconds in the three-camera setup. A parallel version runs one thread per athlete; on an 8-core system it provides a speedup of about 8 times.
It should be noted that the initial human body pose estimation of the invention does not depend on the pose estimation result of the previous frame. Thus, there is no drift in the process, and it can recover from wrong body pose guesses.
Fig. 10 shows the failure case of the method described so far: (a) the human pose is too far from the database and cannot be estimated correctly; (b) the arm is too close to the body and cannot be positioned correctly.
Fig. 11 shows the resulting sequence showing all camera views per frame.
The method described so far may fail due to the limited information provided by binary contours alone, especially when an arm is too close to the body, as shown in FIG. 10(b): several body poses may have very similar binary contours, so contour information alone may not suffice to eliminate the ambiguity of the human body pose. Introducing optical flow into the optimization process can resolve this ambiguity. In more detail, this is achieved by determining the optical flow between two successive images and, from it, the expected position of one or more bones or joints. This may be performed for all joints, or only for joints that are not occluded (hidden) by other parts of the body. For example, given the position of a joint (or a bone position and orientation) in one frame, the optical flow to the adjacent frame is used to compute the expected position in that adjacent frame.
However, the optical flow is most reliable if no occlusion of the body part occurs in the camera view. Thus, in another embodiment, similar to the method described above, the mesh rendered from the camera view is also used to label each pixel with the body part it belongs to. Then, to propagate the position of a joint to the next frame, only the optical flow of those pixels is used that belong to the joint's respective body part and lie near (or on) the projected joint position.
Given these expected positions in each of the available cameras, the expected joint positions can be calculated by triangulation. Analogously to the database distance E_DB, a distance to the corresponding expected human pose is calculated (taking into account the whole body or just the limb or bone of interest); this is also referred to as the flow error term E_EX, and a corresponding weighted term ω_EX·E_EX is added to the energy function of equation (4).
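A sketch of how such a flow term could be computed: the joint's 2D position in the previous frame is advected by the optical flow in each camera, the advected positions are triangulated into an expected 3D position (reusing the least-squares lifting sketched earlier as triangulate_joint), and the squared distances of the optimized joints to these expected positions give E_EX. All names here are illustrative assumptions.

```python
import numpy as np

def expected_joint_3d(prev_2d_positions, flows, cameras):
    """Expected 3D joint position from optical flow.

    prev_2d_positions: per camera, the joint's 2D position in the previous
    frame; flows: per camera, an (H, W, 2) flow field to the current frame;
    cameras: per camera, a function mapping a 2D point to a world-space ray.
    """
    origins, directions = [], []
    for (x, y), flow, to_ray in zip(prev_2d_positions, flows, cameras):
        dx, dy = flow[int(round(y)), int(round(x))]
        o, d = to_ray((x + dx, y + dy))      # advected position -> ray
        origins.append(o)
        directions.append(d)
    return triangulate_joint(np.array(origins), np.array(directions))[0]

def flow_error(pose_joints, expected_joints):
    """E_EX: squared distance between optimized and flow-predicted joints."""
    return sum(np.linalg.norm(j - e)**2
               for j, e in zip(pose_joints, expected_joints))
```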
In addition, the results of the method depend strongly on the body pose database. A good database should therefore cover a large range of motions and viewing directions, so that the initial body pose guess is close to the correct body pose. FIG. 10(a) shows an example where no similar body pose exists in the database and the body pose estimation therefore fails. Good body poses can be selected manually or automatically and then added to the database, thereby expanding the space of possible body poses.
Another important piece of prior information that can be further exploited is the kinematic information of the human skeleton: so far, the method uses some implicit anthropometric constraints, but specific constraints on joint angles could be added to the human pose optimization, i.e. taking into account the fact that human joint angles are limited to certain ranges.
For convenience of description, the above devices are described as being divided into units by function. Of course, when implementing the present application, the functionality of the units may be implemented in one or more pieces of software and/or hardware.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in a progressive manner; identical and similar parts among the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiment is described briefly because it is substantially similar to the method embodiment; for relevant details, reference may be made to the partial description of the method embodiment.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method of estimating a body pose, applicable to an articulated object model (4), said articulated object model (4) being a computer 3D model (1) of a real world object (14) observed by a camera (9) and comprising a plurality of joints (2) and links (3), wherein the body pose of the articulated object model (4) is defined by the spatial positions of the joints (2), characterized in that the method comprises:
acquiring a video stream of views of real world objects (14) recorded by a camera (9);
obtaining at least one sequence of source images (10) from said video stream;
processing the source images of said at least one sequence (10) to extract, for each image, a corresponding source image segment (13), said source image segment (13) comprising a view of said real world object (14) separated from the image background, so as to generate at least one sequence (51) of source image segments (13);
for each sequence (51) of at least one sequence of source image segments (13), first matching the sequence (51) with a plurality of reference contour sequences (13') and determining one or more selected reference contour sequences (13') that best match the sequence (51) of source image segments (13);
wherein the sequences (52) of reference contours (13') are stored in a database in computer-readable form, each reference contour (13') being associated with an articulated object model (4) and with a specific reference human pose of the articulated object model (4);
wherein the matching of the two sequences (51, 52) is performed by matching each source image segment (13) with the reference contour (13') at the same position within its sequence and calculating a matching error indicating how well they match, and by calculating a sequence matching error from the matching errors of the source image segments;
for each of these selected sequences of reference contours (13'), retrieving a reference body pose associated with one of the reference contours (13'); and
a body pose estimation result of the articulated object model (4) is obtained from the retrieved one or more reference body poses.
2. The method of claim 1, wherein:
one image of the source image sequence (10) is designated as a frame of interest (56) and a source image segment (13) generated therefrom is designated as a source image segment of interest;
the sequence match error is a weighted sum of the match errors of the two sequences (51, 52);
the matching error of the source image segment of interest has the greatest weight, and the weights decrease with the distance of a source image segment from the source image segment of interest; and a reference body pose of the reference contour matching the source image segment of interest is retrieved.
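Purely as a non-limiting illustration of the weighted sequence matching error of claims 1 and 2 (not part of the claims themselves), the following Python sketch combines per-position contour match errors with Gaussian weights that peak at the frame of interest; the per-frame error function is a simplistic placeholder.

import numpy as np

def contour_match_error(segment, reference):
    # Placeholder per-frame error, e.g. the number of non-overlapping
    # silhouette pixels between a source segment and a reference contour.
    return float(np.count_nonzero(segment != reference))

def sequence_match_error(source_segments, reference_contours, interest_idx, sigma=2.0):
    errors = np.array([contour_match_error(s, r)
                       for s, r in zip(source_segments, reference_contours)])
    idx = np.arange(len(errors))
    # Largest weight at the frame of interest, decreasing with distance.
    weights = np.exp(-0.5 * ((idx - interest_idx) / sigma) ** 2)
    return float(np.dot(weights / weights.sum(), errors))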
3. The method of claim 1, wherein:
wherein at least two source image sequences (10) recorded simultaneously by at least two source cameras (9, 9') are obtained and processed, and a human pose estimation result of the articulated object model (4) is obtained from the retrieved reference human poses determined from the at least two source image sequences by selecting the combination of retrieved reference human poses that best fits in 3D space.
4. The method of claim 1, wherein:
establishing a local correspondence between two human poses determined from at least one sequence of consecutive source image segments, each human pose being associated with at least one source image segment (13), wherein elements of one or both human poses of the articulated object model (4), i.e. the joints (2) and/or the links (3), correspond to limbs of a real world object (14) that can be labeled ambiguously:
for each of a pair of consecutive source image segments, determining, from the associated human pose and each of the possible fuzzy labels of each human pose, a corresponding label of the image points belonging to the limbs in the source image segment;
selecting a tag of the human pose for a first one of the pair of consecutive source image segments;
calculating an optical flow between the first and second of the pair of consecutive source image segments;
determining from the optical flow the positions in the second image segment to which image points corresponding to limbs of the first image segment have moved and marking these positions in the second image segment in dependence on the marking of limbs in the first image segment;
among the possible fuzzy labels of the human pose for labeling the image points of the second image segment, selecting the label that coincides with the labels determined from the optical flow.
5. The method of claim 4, wherein:
the step of labeling image points belonging to the limbs in the source image segment is accomplished by:
for each of a pair of consecutive source image segments, determining a projection of a model of the real world object (14) into the source image from the associated human pose and each of the possible fuzzy labels of each human pose, and thereby labeling the image points of the source image segment according to the projected limb visible at the position of the image point.
6. The method of claim 4, wherein:
given a sequence of human poses, each associated with one segment of the sequence of source image segments, and where there is ambiguity regarding the labeling of one or more sets of limbs, the step of establishing a global correspondence of human poses matching the sequence of consecutive source images comprises:
for each source image segment (13) of the source image sequence,
retrieving the associated body poses and labels of the model elements determined by the previous step;
determining the database human pose having the minimum distance to the retrieved human pose, taking into account the labels of the database human poses;
calculating a consistency error term representing a difference between two human poses;
calculating the total consistency error of the whole source image sequence according to the consistency error items;
repeating the above steps to calculate the total consistency error for every variant of the possible global labels of the fuzzy limb sets;
selecting the global label variant with the smallest total consistency error.
7. The method of claim 6, wherein:
the total consistency error is the sum of all consistency error terms over the sequence.
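As a purely illustrative, non-claimed sketch of claims 6 and 7, the following Python function enumerates the global labelings of the ambiguous limb sets and keeps the one with the smallest total consistency error; the database lookup and pose distance are assumed helpers passed in by the caller.

import itertools

def best_global_labeling(pose_sequence, label_options_per_set,
                         nearest_database_pose, pose_distance):
    best_error, best_labels = float("inf"), None
    for labeling in itertools.product(*label_options_per_set):
        total = 0.0  # total consistency error: sum of per-pose terms
        for pose in pose_sequence:
            nearest = nearest_database_pose(pose, labeling)   # assumed helper
            total += pose_distance(pose, nearest, labeling)   # consistency term
        if total < best_error:
            best_error, best_labels = total, labeling
    return best_labels, best_error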
8. An apparatus for estimating a posture of a human body, characterized in that:
comprising a processor and a memory, the processor and the memory being interconnected, wherein the memory is adapted to store a computer program comprising program instructions, and the processor is configured to invoke said program instructions to perform the method according to any one of claims 1-7.
9. A computer-readable storage medium characterized by:
the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method of any one of claims 1-7.
10. A method of making a computer-readable storage medium, comprising the step of storing computer-executable instructions on the computer-readable medium which, when executed by a processor of a computing system, cause the computing system to perform the method steps of any one of claims 1-7.
CN202010439265.7A 2020-05-22 2020-05-22 Method and device for estimating human body posture and computer readable medium Pending CN111832386A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010439265.7A CN111832386A (en) 2020-05-22 2020-05-22 Method and device for estimating human body posture and computer readable medium

Publications (1)

Publication Number Publication Date
CN111832386A true CN111832386A (en) 2020-10-27

Family

ID=72913891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010439265.7A Pending CN111832386A (en) 2020-05-22 2020-05-22 Method and device for estimating human body posture and computer readable medium

Country Status (1)

Country Link
CN (1) CN111832386A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2383699A2 (en) * 2010-04-30 2011-11-02 LiberoVision AG Method for estimating a pose of an articulated object model
EP2383696A1 (en) * 2010-04-30 2011-11-02 LiberoVision AG Method for estimating a pose of an articulated object model
US20140219550A1 (en) * 2011-05-13 2014-08-07 Liberovision Ag Silhouette-based pose estimation
CN104700433A (en) * 2015-03-24 2015-06-10 中国人民解放军国防科学技术大学 Vision-based real-time general movement capturing method and system for human body
WO2019232894A1 (en) * 2018-06-05 2019-12-12 中国石油大学(华东) Complex scene-based human body key point detection system and method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255429A (en) * 2021-03-19 2021-08-13 青岛根尖智能科技有限公司 Method and system for estimating and tracking human body posture in video
CN113077486A (en) * 2021-04-30 2021-07-06 深圳世源工程技术有限公司 Method and system for monitoring vegetation coverage rate in mountainous area
CN113077486B (en) * 2021-04-30 2021-10-08 深圳世源工程技术有限公司 Method and system for monitoring vegetation coverage rate in mountainous area
CN113673318A (en) * 2021-07-12 2021-11-19 浙江大华技术股份有限公司 Action detection method and device, computer equipment and storage medium
CN113673318B (en) * 2021-07-12 2024-05-03 浙江大华技术股份有限公司 Motion detection method, motion detection device, computer equipment and storage medium
CN114155555A (en) * 2021-12-02 2022-03-08 北京中科智易科技有限公司 Human behavior artificial intelligence judgment system and method
CN114155555B (en) * 2021-12-02 2022-06-10 北京中科智易科技有限公司 Human behavior artificial intelligence judgment system and method
CN113989283A (en) * 2021-12-28 2022-01-28 中科视语(北京)科技有限公司 3D human body posture estimation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination