WO2011042230A1

WO2011042230A1 - Head pose estimation

Info

Publication number: WO2011042230A1
Application number: PCT/EP2010/060431
Authority: WO
Inventors: Andreas Launila; Josephine Sullivan; Eric Hayman; Martin Brogren
Original assignee: Svenska Tracab Ab
Priority date: 2009-10-08
Filing date: 2010-07-19
Publication date: 2011-04-14

Abstract

A method for estimating the pose of a body part of a team sport player using a machine learning technique is provided. The method comprises the steps of extracting a set of features from tracking data and determining an estimate for the pose by applying a trained classifier to the set of features. The set of features comprises at least one of the position of the player and the position of a ball. Further, a system (200) for estimating the pose of a body part of a team sport player is provided. The system comprises a video camera (201), a tracking unit (202), a body part appearance unit (203), a feature extracting unit (204), and an estimation unit (205).

Description

HEAD POSE ESTIMATION.

Field of the invention

The invention relates to 3D reconstruction and analysis of team sport games. More specifically, the invention relates to estimating the pose of a body part of a team sport player.

Background of the invention

Known methods for estimating the head pose from video footage are based on the appearance of the player's head.

Typically, in a bottom-up approach, the head of the player is located in a video frame and the sub-frame including the head is subsequently analyzed. If head pose estimation is performed on low-resolution video footage, the method must perform well with sub-frames of about 20x20 pixels or smaller. Known methods for low-resolution head pose estimation are based on skin detection in combination with support vector machines (SVMs), nearest neighbor models, neural networks, probabilistic models, tree-based models, and boosting.

Information from bottom-up head pose estimation may also be combined with top-down information, such as information about the orientation of the player's body.

Summary of the invention

It is an object of the present invention to provide a more efficient alternative to the above techniques and prior art.

More specifically, it is an object of the present invention to provide a method for estimating the pose of a body part of a team sport player using a machine learning technique. These and other objects of the present invention are achieved by means of a method for estimating the pose of a body part of a team sport player defined in independent claim 1, by means of a computer program product according to independent claim 12, and by means of a system for estimating the head pose of a football player according to independent claims 13 and 15. Embodiments of the invention are characterized by the dependent claims.

According to a first aspect of the invention, a method for estimating the pose of a body part of a team sport player is provided. The method uses a machine learning technique. The method comprises the steps of extracting a set of features from tracking data and determining an estimate for the pose. The set of features comprises at least one of a position of the player and a position of a ball. The estimate for the pose is determined by applying a trained classifier to the set of features. The classifier is associated with the machine learning technique.

According to a second aspect of the invention, a computer program product is provided. The computer program product comprises a computer usable medium that has a computer readable program code embodied therein. The computer readable program code is adapted to be executed to implement the method according to the first aspect of the invention.

According to a third aspect of the invention, a system for estimating the pose of a body part of a team sport player using a machine learning technique is provided. The system comprises a tracking unit, a feature extracting unit, and an estimation unit. The tracking unit is configured for determining at least one of the position of the player, the position of a ball, and the positions of other players. The feature extracting unit is configured for extracting a first set of features. The first set of features is extracted from the positions. The estimation unit is configured for determining an estimate for the pose. The estimate is determined by applying a trained classifier to the first set of features. The classifier is associated with the machine learning technique.

According to an embodiment of the invention, the system further comprises a video camera and a body part appearance unit. According to the embodiment, the tracking unit uses video frames received from the video camera. The body part appearance unit is configured for analyzing an appearance of the body part. The appearance is derived from the video frames. The feature extracting unit is configured for extracting a first set of features and, according to the embodiment, also a second set of features. The first set of features is extracted from the positions. The second set of features is extracted from the appearance. The feature extracting unit according to the embodiment is further configured for combining the first set of features and the second set of features. The estimation unit is further configured for determining an estimate for the pose. According to the embodiment, the estimate is determined by applying a trained classifier to the combined set of features. The classifier is associated with the machine learning technique.

According to a fourth aspect of the invention, another system for estimating the pose of a body part of a team sport player using machine learning techniques is provided. The system comprises a video camera, a tracking unit, a body part appearance unit, a feature extracting unit, and an estimation unit. The tracking unit is configured for determining at least one of the position of the player, the position of a ball, and the positions of other players. The tracking unit uses video frames received from the video camera. The body part appearance unit is configured for analyzing an appearance of the body part. The appearance is derived from the video frames. The feature extracting unit is configured for extracting a first set of features and a second set of features. The first set of features is extracted from the positions. The second set of features is extracted from the appearance. The estimation unit is configured for determining a first estimate for the pose and for determining a second estimate for the pose. The first estimate is determined by applying a trained first classifier to the first set of features. The second estimate is determined by applying a trained second classifier to the second set of features. The first classifier is associated with a first machine learning technique and the second classifier is associated with a second machine learning technique. The estimation unit is further configured for combining the first estimate and the second estimate.

An embodiment of the invention may use any machine learning technique, e.g., support vector machines (SVMs), nearest neighbor models, neural networks, probabilistic models, tree-based models, or boosting.

The body part may, e.g., be the head or the torso of the player.

The method may be applied to any team sport, e.g., football, handball, basketball, ice-hockey, or polo.

Even though the method is described with respect to a ball, it may also be applied to a team sport involving a similar item, such as a puck.

The present invention makes use of an understanding that the pose of a body part of a team sport player may be estimated using information pertaining to how the body part could be oriented. This is referred to as a top-down approach. If, e.g., the head pose of a player is to be estimated, information pertaining to where the player could be looking is utilized. In this case the head pose of the player is used as an approximation for the direction in which the player is looking. Such information may, e.g., comprise the position of the player with respect to the playing field. From this information one may estimate, using machine learning methods, in which direction the player is most likely to look. It would, for instance, be more reasonable that a player who is located close to the other team's goal is looking towards that goal than in any other direction. The information may also comprise the position of a ball, or a puck, with respect to the player. In this case the player would be more likely to look in the direction of the ball than in any other direction.

The additional information may be derived from tracking data comprising at least the position of the player or the position of a ball.

Preferably, the tracking data comprises both the position of the player and the position of a ball as a function of time. Tracking data may be obtained by analyzing video footage, by utilizing GPS receivers which the players are equipped with, or by using transponders. The method according to the first aspect of the invention is advantageous in that it is computationally light. Thus, it can be performed, if implemented on a computer, in real-time.

Estimated head poses may be utilized, e.g., for analyzing a game or for rendering 3D animations of the players. For the purpose of describing the present invention, a position is defined in an absolute manner with respect to a suitable coordinate system. Such a coordinate system may, e.g., be defined with respect to the playing field. It will be appreciated that a position can also be defined in a relative manner, e.g., with respect to a certain point of reference. For example, the ball has an absolute position with respect to the coordinate system. It is this absolute position which is extracted from tracking data. On the other hand, the position of the ball with respect to the player, i.e., the relative position of the ball, is the position used in connection with machine learning techniques, since it is this relative position that determines in which direction the player is likely to look. Such relative positions can be calculated from absolute positions if the reference position is known. Instead of using a Cartesian coordinate system one may describe a relative position by a direction, i.e., by an angle with respect to a direction of reference, and a distance. The direction of reference may, e.g., be a line of symmetry of the playing field, a camera viewing direction, or any other direction.
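By way of a non-limiting illustration, the conversion from absolute positions to a relative position expressed as a distance and a direction may be sketched as follows. The function name `relative_polar`, the coordinate values, and the choice of the x-axis as the direction of reference are assumptions made for this sketch only and are not taken from the disclosure.

```python
import math

def relative_polar(player_xy, target_xy, reference_angle_deg=0.0):
    """Return (distance, direction_deg) of a target as seen from the player.

    Positions are absolute (x, y) coordinates in the playing field frame.
    The direction is measured counter-clockwise from the direction of
    reference and normalised to the range [0, 360).
    """
    dx = target_xy[0] - player_xy[0]
    dy = target_xy[1] - player_xy[1]
    distance = math.hypot(dx, dy)
    direction = (math.degrees(math.atan2(dy, dx)) - reference_angle_deg) % 360.0
    return distance, direction

# Hypothetical absolute positions in metres.
player = (10.0, 20.0)
ball = (13.0, 24.0)
distance, direction = relative_polar(player, ball)
```

The relative position of any other entity, e.g., another player, may be obtained in the same way by passing its absolute position as the target.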

The pose of a body part, e.g., the head pose, may be described by an angle defined with respect to a direction of reference. Typically, a number of bins are used for classification, each bin covering a certain angular range such that the total of all bins covers the whole range, i.e., 360 degrees.

A velocity is assumed to be a vector quantity, i.e., specifying the speed and the direction of motion.

To this end, the pose of a body part of a team sport player, e.g., the head pose of a football player, is estimated by applying a trained classifier of a machine learning technique to a set of features extracted from tracking data, the features pertaining to positions of the player, the ball or puck, and the other players.

A classifier associated with a machine learning method may be trained in a supervised or in an unsupervised manner. If supervised training is employed, a labeled set of features is used for training.

Even though the invention is in some cases described with respect to head poses, corresponding embodiments for estimating the pose of other body parts, in particular the torso, may be constructed. Further, the invention is in some cases described with reference to football, but embodiments for other team sports may be constructed.

According to an embodiment of the invention, the set of features further comprises positions of other players. Taking into account the positions of other players with respect to the player is advantageous in that a more reliable estimate of the player's head pose can be obtained, since the player is likely to watch the actions of the other players. The number of other players which are taken into account may be limited to players which are within a certain distance from the player or within a certain region of interest. Such a region of interest may, e.g., be confined to the region between the player and one of the goals.
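As a minimal sketch of limiting the other players to those within a certain distance from the player, the following may be considered. The function name `players_within`, the distance threshold, and the coordinates are hypothetical and chosen for illustration only.

```python
import math

def players_within(player_xy, others_xy, max_distance):
    """Keep only the other players located within max_distance of the player."""
    return [p for p in others_xy if math.dist(player_xy, p) <= max_distance]

# Hypothetical absolute positions in metres.
player = (0.0, 0.0)
others = [(1.0, 1.0), (10.0, 0.0), (3.0, 4.0)]
nearby = players_within(player, others, max_distance=5.0)
```

A region of interest other than a circle, such as the region between the player and one of the goals, would replace the distance test with a corresponding geometric test.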

According to an embodiment of the invention, the set of features further comprises at least one of a velocity of the player, a velocity of the ball, and velocities of other players. Taking into account the velocity of the player, the ball, and other players, i.e., the dynamical aspect of a game, is advantageous in that a more reliable estimate of the head pose may be obtained as compared to considering only the static aspect of the game, i.e., the positions of the player, the ball, and the other players. For example, the player is more likely to look in his own direction of motion or towards the region of the playing field to which the ball is moving.

According to an embodiment of the invention, the tracking data is derived from video frames. The video frames may, e.g., be derived from video footage of a team sport game. In particular, the video frames may be extracted from low-resolution video footage. This is advantageous since video footage is easily obtained from one or several video cameras placed near the playing field. Video cameras may, e.g., be placed outside the playing field such that a side-view is obtained, or they may be placed over the playing field such that a top-view is obtained. Tracking data may also be obtained by other means, for example using GPS-based or transponder-based tracking devices with which the players are equipped. Tracking data obtained from different sources may be combined. From the positions extracted from the tracking data, the distance between the player and another entity, e.g., the ball, the puck, or another player, as well as the direction of the other entity with respect to the player, can be obtained. Further, velocities may be derived from tracking data extracted from a set of sequential video frames.
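One possible way of deriving velocities from positions in sequential video frames is a finite difference between consecutive frames, sketched below. The finite-difference scheme, the function name `velocities`, and the frame rate are assumptions of this sketch; the disclosure does not prescribe a particular scheme.

```python
def velocities(positions, frame_rate_hz):
    """Estimate velocity vectors from positions in consecutive frames.

    positions is a list of (x, y) tuples, one per frame. The result has one
    entry fewer than positions, each being (vx, vy) in units per second.
    """
    return [
        ((x1 - x0) * frame_rate_hz, (y1 - y0) * frame_rate_hz)
        for (x0, y0), (x1, y1) in zip(positions, positions[1:])
    ]

# Hypothetical track of one player over three frames at 4 frames per second.
track = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.25)]
player_velocities = velocities(track, frame_rate_hz=4.0)
```

Each resulting vector specifies both the speed and the direction of motion, in line with the definition of a velocity given above.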

According to an embodiment of the invention, the set of features further comprises a camera angle. Taking into account the viewing angle of the camera which produced the video footage from which the tracking data is derived makes it possible to compensate for different camera angles if video footage from several cameras is used.

According to an embodiment of the invention, the set of features further comprises features strongly linked to the team sport. For example, the set of features may comprise the position of a goal, a basket, or a net, the position and/or velocity of one or several referees, and the position of one or several coaches. Further examples of additional features are: the position of the ball relative to the estimated player's goal; the side of the ball the player is on relative to his goal, i.e., behind or in front of the ball; the estimated head pose of the player in possession of the ball; the distance between the player in possession of the ball and the estimated player's goal; the team in possession of the ball; whether the team of the estimated player is attacking, defending, or neither; the position of the estimated player relative to his defending goal; the strategic position of the estimated player, e.g., attacker, goal keeper, or inner midfield; the head pose estimates of all other players in the same team and/or in the opposite team as the estimated player; and the head pose estimates of the other players who are near the estimated player. Using features which are strongly linked to the team sport is advantageous since it reduces the prediction error in estimating the head pose by taking into account the attacking and defending aspect of the game, which is the core of many team sports such as football.

According to another embodiment of the invention, the set of features further comprises features pertaining to an appearance of the body part. In other words, the features pertaining to the team sport are merged with features pertaining to appearance of a body part, e.g., the head. The merged set of features may then be used with a common trained classifier. Merging the features from the top-down approach with the features from the bottom-up approach and using a common trained classifier is advantageous in that the prediction error for estimating the pose may be reduced.

According to an embodiment of the invention, the estimate for the pose is combined with an estimate for the pose determined using the appearance of the body part. Combining the estimate from a top-down approach with an estimate from a bottom-up approach is advantageous since it may decrease the prediction error for estimating the pose. The estimate for the pose determined using the appearance of the body part may be determined using the same machine learning technique as the estimate obtained from tracking data. The two estimates may also be obtained using different machine learning techniques.

According to an embodiment of the invention, two separate trained classifiers are used. One classifier is used for situations when the set of features comprises the position of a ball. Another, independent, trained classifier is used for situations when the set of features does not comprise the position of a ball. Using separate trained classifiers is advantageous since it reduces the prediction error in estimating the pose by taking into account the presence of the ball. The ball may, e.g., be absent from the tracking data if the ball could not be identified in the underlying video footage. This might, e.g., be the case if the ball is obscured by a player, or if the ball is indistinguishable from the background.
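The selection between the two separate trained classifiers may, purely as an illustration, be sketched as follows. The dictionary key `ball_xy`, the function name `estimate_pose`, and the stand-in classifiers returning fixed bins are hypothetical; real trained classifiers would take their place.

```python
def estimate_pose(features, classifier_with_ball, classifier_without_ball):
    """Apply the classifier that matches the availability of the ball position."""
    if features.get("ball_xy") is not None:
        return classifier_with_ball(features)
    return classifier_without_ball(features)

# Stand-in classifiers returning fixed bins, for demonstration only.
def with_ball(features):
    return 1

def without_ball(features):
    return 5

bin_seen = estimate_pose(
    {"player_xy": (10.0, 20.0), "ball_xy": (13.0, 24.0)}, with_ball, without_ball
)
bin_hidden = estimate_pose(
    {"player_xy": (10.0, 20.0), "ball_xy": None}, with_ball, without_ball
)
```

The same dispatching pattern could be keyed on another condition instead of the presence of the ball.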

According to an embodiment of the invention, a plurality of separate trained classifiers is used for different types of situations a game is in. Different types of situations may, e.g., be corner-kick, throw-in, counter-attack, penalty, cross, free-kick, ball out-of-play or ball in-play, goal attempt, attacking, and defending. Using several separate trained classifiers is advantageous since it reduces the prediction error in estimating the pose by taking into account the situation the game is in. The behavior of a player with respect to where his attention is directed, i.e., in which direction he is most likely to look, depends on the situation the game is in. For example, the player's attention might be focused on the ball in some situations whereas he is more likely to look at the goal keeper in other situations.

According to an embodiment of the invention, the poses of several body parts are estimated jointly. This is advantageous since the poses of body parts may be correlated due to human anatomy. This is, e.g., the case for the head and the torso of a team sport player.

Even though the invention has in some cases been described with reference to the method according to the first aspect of the invention, corresponding reasoning applies to the computer program product according to the second aspect of the invention and the system according to the third and the fourth aspect of the invention.

Further objectives of, features of, and advantages with, the present invention will become apparent when studying the following detailed disclosure, the drawings and the appended claims. Those skilled in the art realize that different features of the present invention can be combined to create embodiments other than those described in the following.

Brief description of the drawings

The above, as well as additional objects, features and advantages of the present invention, will be better understood through the following illustrative and non-limiting detailed description of embodiments of the present invention, with reference to the appended drawings, in which:

Fig. 1 shows a football field.

Fig. 2 shows a system in accordance with an embodiment of the invention.

All the figures are schematic, not necessarily to scale, and generally only show parts which are necessary in order to elucidate the invention, wherein other parts may be omitted or merely suggested.

Detailed description

Fig. 1 shows a football field 100 with players 110¹-110⁵ of a first team, players 120¹-120⁵ of a second team, and a football 130. For simplicity, only five players of each team are shown. An embodiment of the invention may be used to estimate the head pose of a football player, e.g., player 110⁵, using a machine learning technique. The head pose may, e.g., be defined as the angle of the player's head with respect to the touch line 101. However, any other direction may be used as reference.

Typically, a number of bins are used for the classification. As an example, one may use eight bins, each bin covering an angular range of 45 degrees.
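The eight-bin example may be sketched as follows. The function names `angle_to_bin` and `bin_to_range`, and the convention that bin 0 starts at the direction of reference, are assumptions of this sketch.

```python
def angle_to_bin(angle_deg, num_bins=8):
    """Map an angle in degrees to the index of the bin covering it.

    With eight bins, bin 0 covers [0, 45), bin 1 covers [45, 90), and so on.
    Angles outside [0, 360) are wrapped around first.
    """
    width = 360.0 / num_bins
    # The final modulo guards against floating-point wrap-around at 360.
    return int((angle_deg % 360.0) // width) % num_bins

def bin_to_range(bin_index, num_bins=8):
    """Return the (lower, upper) angular limits in degrees of a bin."""
    width = 360.0 / num_bins
    return bin_index * width, (bin_index + 1) * width

example_bin = angle_to_bin(100.0)
example_range = bin_to_range(example_bin)
```

A classifier trained on labels produced in this way outputs a bin index, which `bin_to_range` converts back into an angular range.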

The head pose of player 110⁵ is estimated by applying a trained classifier to a set of features extracted from tracking data. The set of features comprises at least the position of the player 110⁵ or the position of the ball 130, and preferably both. The set of features may further comprise the position of other players, e.g., the positions of the players 120¹-120⁵ of the other team and/or the positions of the players 110¹-110⁴ of the player's own team. The set of features may also comprise further features, such as the velocities of the player, the ball, and the other players, or any other feature linked to football.

The result of the estimation, i.e., applying a machine learning method to the set of features, is the bin with the largest likelihood. This bin corresponds to the angular range of the head pose of player 110⁵ which is most likely.
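As a minimal sketch of selecting the bin with the largest likelihood, a nearest neighbor model, one of the techniques named in this disclosure, may be used as shown below. The training samples, their labels, the feature layout (relative ball position in metres), and the function names are invented for this illustration and do not represent measured data.

```python
import math
from collections import Counter

def knn_bin_likelihoods(train_features, train_bins, query, k=3, num_bins=8):
    """Estimate per-bin likelihoods from the k nearest training samples."""
    order = sorted(
        range(len(train_features)),
        key=lambda i: math.dist(train_features[i], query),
    )
    neighbours = order[:k]
    votes = Counter(train_bins[i] for i in neighbours)
    return [votes.get(b, 0) / len(neighbours) for b in range(num_bins)]

def most_likely_bin(likelihoods):
    """Return the index of the bin with the largest likelihood."""
    return max(range(len(likelihoods)), key=likelihoods.__getitem__)

# Toy labelled data: feature = ball position relative to the player,
# label = head pose bin assigned by a (hypothetical) supervisor.
train_features = [(5.0, 0.0), (4.0, 1.0), (6.0, 1.0), (0.0, 5.0), (1.0, 4.0), (-5.0, 0.0)]
train_bins = [0, 0, 0, 2, 2, 4]

likelihoods = knn_bin_likelihoods(train_features, train_bins, query=(5.5, 0.5))
estimated_bin = most_likely_bin(likelihoods)
```

Any other trained classifier that yields a likelihood per bin could be substituted for the nearest neighbor model without changing the final selection step.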

For instance, trained SVM classifiers may be used for estimating the head pose. SVMs are a reliable and fast machine learning technique.

However, other machine learning techniques, such as nearest neighbor models, neural networks, probabilistic models, tree-based models, or boosting, may also be used.

The trained classifier used for estimating the head pose may be trained on tracking data using a supervised or an unsupervised approach. If a supervised approach is used, tracking data extracted from video footage may be used in connection with a supervisor identifying the player's head poses by inspection of the video frames.

With reference to Fig. 2, a system 200 in accordance with an embodiment of the invention is described. System 200 comprises a video camera 201, a tracking unit 202, a body part appearance unit 203, a feature extracting unit 204, and an estimation unit 205.

The video camera 201 may be used to generate video footage of a football match, e.g., the scenery sketched in Fig. 1. The tracking unit 202 may use the video footage for extracting tracking data, e.g., the positions and/or velocities of the players and the ball. The body part appearance unit 203 may analyze the head appearance in a bottom-up approach using a machine learning technique. The feature extracting unit 204 may extract a first set of features, pertaining to the positions and/or velocities of the player, the ball, and the other players, and a second set of features, pertaining to the appearance of the head. The feature extracting unit 204 may further merge the first and the second set of features. The estimation unit 205 estimates the head pose, i.e., the most likely bin, using a machine learning technique. This is achieved by applying a trained classifier of the machine learning technique to the merged set of features obtained from the feature extracting unit 204.

The system described with reference to Fig. 2 combines the bottom-up approach, using the appearance of the head as an input for the machine learning technique, with the top-down approach, using the positions and/or velocities of the player, the ball, and the other players as an input.

System 200 achieves this by merging the two sets of features, one pertaining to the bottom-up approach and one pertaining to the top-down approach, respectively, and by applying a trained classifier to the merged set of features. The advantage of combining the two approaches is that a more reliable estimate may be obtained, i.e., the prediction error may be reduced and ambiguities may be resolved.

As an alternative to the embodiment of system 200 described above, the bottom-up approach and the top-down approach may be combined in a different way, in accordance with another embodiment of the invention.

Instead of merging the two sets of features and applying a trained classifier to the merged set of features, the estimation unit 205 may apply two different trained classifiers separately to the two sets of features and combine the obtained head pose estimates. In other words, the estimation unit 205 may apply a first trained classifier to the first set of features obtained from the feature extracting unit 204 to obtain a first estimate for the head pose from tracking data, i.e., an estimate obtained from a top-down approach. The estimation unit 205 may apply a second trained classifier to the second set of features obtained from the feature extracting unit 204 to obtain a second estimate for the head pose from the appearance of the head, i.e., an estimate obtained from a bottom-up approach. The estimation unit 205 may combine the two estimates, e.g., by calculating a weighted average. The advantage of combining the two approaches is that a more reliable estimate may be obtained, i.e., the prediction error may be reduced and ambiguities may be resolved. The two estimates may be obtained using the same machine learning technique. The two estimates may also be obtained using two different machine learning techniques.
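One possible reading of the weighted average is sketched below, under the assumption that each classifier outputs a likelihood per bin and that the average is taken bin by bin. This assumption, the function name `combine_estimates`, the weight, and the likelihood values are illustrative only; the disclosure does not specify which quantities are averaged.

```python
def combine_estimates(top_down, bottom_up, weight_top_down=0.5):
    """Combine two per-bin likelihood vectors by a weighted average."""
    if len(top_down) != len(bottom_up):
        raise ValueError("both estimates must use the same number of bins")
    w = weight_top_down
    combined = [w * a + (1.0 - w) * b for a, b in zip(top_down, bottom_up)]
    total = sum(combined)
    # Renormalise so that the combined likelihoods sum to one.
    return [c / total for c in combined]

# Hypothetical per-bin likelihoods for eight bins.
first_estimate = [0.6, 0.3, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0]   # from tracking data
second_estimate = [0.2, 0.7, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0]  # from head appearance
combined = combine_estimates(first_estimate, second_estimate, weight_top_down=0.25)
combined_bin = max(range(len(combined)), key=combined.__getitem__)
```

Averaging per-bin likelihoods rather than angles avoids the wrap-around problem at 360 degrees, which is why it was chosen for this sketch.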

As a further alternative, a system may be provided using only the top-down approach. Such a system would not need to include a video camera 201 or a body part appearance unit 203. The system comprises a tracking unit 202, a feature extracting unit 204, and an estimation unit 205. The tracking unit 202 may use tracking data from any type of tracking device, e.g., the positions and/or velocities of the players and the ball. The feature extracting unit 204 may extract a first set of features, pertaining to the positions and/or velocities of the player, the ball, and the other players. The estimation unit 205 estimates the head pose, i.e., the most likely bin, using a machine learning technique. This is achieved by applying a trained classifier of the machine learning technique to the first set of features obtained from the feature extracting unit 204.

Finally, a more reliable estimate of the pose of a body part may be obtained by estimating the poses of several body parts jointly.

One way of jointly estimating dependent labels is sequence labeling, where one assumes that the labels have a sequence structure. This problem can be solved by a wide array of algorithms, including Conditional Random Fields (CRFs), Hidden Markov Models, Max Margin Markov Networks, and Structured SVMs. A wide range of approaches can be used, and three examples are explained below. The first, straightforward approach to joint estimation is to construct a new problem where one label is assigned to each possible combination of the old labels, transforming the joint classification problem into a multiclass problem.
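The first approach may be sketched for two body parts, the head and the torso, as follows. The encoding scheme, the function names, and the use of eight bins per body part are assumptions made for illustration.

```python
def encode_joint(head_bin, torso_bin, num_bins=8):
    """Map a (head bin, torso bin) pair to a single multiclass label."""
    return head_bin * num_bins + torso_bin

def decode_joint(label, num_bins=8):
    """Recover the (head bin, torso bin) pair from a multiclass label."""
    return divmod(label, num_bins)

joint_label = encode_joint(3, 5)
head_bin, torso_bin = decode_joint(joint_label)
```

With eight bins per body part this yields 64 joint classes, so the number of classes grows multiplicatively with each additional body part.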

The second approach is to use a CRF to estimate all labels while taking interdependencies into account. To avoid the training time incurred by training it with all features, a feature extraction step can be performed first where SVMs estimate the probability distribution independently. The probability estimates are then given to the CRF.

The third approach is to replace the CRF with one SVM per body part, leading to a multi-layer SVM. The SVMs are trained to predict the body part poses, given estimated probability distributions for the labels.

Even though embodiments have been described with reference to the head pose of a football player, corresponding embodiments may be constructed for body parts other than heads and for team sports other than football.

The person skilled in the art realizes that the present invention by no means is limited to the embodiments described above. On the contrary, many modifications and variations are possible within the scope of the appended claims. For example, different classifiers may be used for different players or different teams. Further, posterior distributions may be calculated as an output from the machine learning method, and an estimate may be obtained by analyzing the posterior distribution. The prediction error of the body part pose estimate may be reduced by adding additional features to the set of features. Even though a system comprising a video camera has been described, video footage may also be obtained from other sources, such as television video footage.
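As one hedged example of analyzing a posterior distribution over bins, a circular mean of the bin centres weighted by the posterior may be computed. The choice of the circular mean, the function name `circular_mean_deg`, and the posterior values are assumptions of this sketch and are not prescribed by the disclosure.

```python
import math

def circular_mean_deg(posterior):
    """Return the circular mean angle in degrees of a posterior over bins.

    Each bin is represented by its centre angle. The result is in [0, 360],
    where values at either end denote the direction of reference.
    """
    width = 360.0 / len(posterior)
    sin_sum = sum(
        p * math.sin(math.radians((i + 0.5) * width)) for i, p in enumerate(posterior)
    )
    cos_sum = sum(
        p * math.cos(math.radians((i + 0.5) * width)) for i, p in enumerate(posterior)
    )
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0

# Hypothetical posterior split between the first and the last of eight bins.
posterior = [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5]
mean_angle = circular_mean_deg(posterior)
```

For this posterior the circular mean lies at the direction of reference, whereas an ordinary arithmetic mean of the bin centres would point in the opposite direction.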

Claims

1. A method for estimating the pose of a body part of a team sport player using a machine learning technique, the method comprising the steps of:

extracting a set of features from tracking data, said set of features comprising at least one of a position of said player and a position of a ball, and

determining an estimate for said pose by applying a trained classifier to said set of features, said classifier being associated with the machine learning technique.

2. The method according to claim 1, wherein said set of features further comprises positions of other players.

3. The method according to claim 1, wherein said set of features further comprises at least one of a velocity of said player, a velocity of said ball, and velocities of said other players.

4. The method according to any one of the claims 1 to 3, wherein said tracking data is derived from video frames.

5. The method according to claim 1, wherein said set of features further comprises a camera angle.

6. The method according to claim 1, wherein said set of features further comprises features strongly linked to said team sport.

7. The method according to claim 1, wherein said set of features further comprises features pertaining to an appearance of said body part.

8. The method according to claim 1, wherein the estimate for said pose is combined with an estimate for said pose determined using the appearance of said body part.

9. The method according to claim 1, wherein two separate trained classifiers are used, one for situations when said set of features comprises the position of a ball, and one for situations when said set of features does not comprise the position of a ball.

10. The method according to claim 1, wherein a plurality of separate trained classifiers is used for different types of situations a game is in.

11. The method according to claim 1, wherein the poses of several body parts are estimated jointly.

12. A computer program product, comprising a computer usable medium having a computer readable program code embodied therein, said computer readable program code adapted to be executed to implement the method according to any one of the claims 1 to 11.

13. A system (200) for estimating the pose of a body part of a team sport player using a machine learning technique, the system comprising: a tracking unit (202) configured for determining at least one of the position of the player, the position of a ball, and the positions of other players, a feature extracting unit (204) configured for extracting a first set of features from said positions, and

an estimation unit (205) configured for determining an estimate for said pose by applying a trained classifier to said first set of features, said classifier being associated with the machine learning technique.

14. The system of claim 13, further comprising:

a video camera (201), and a body part appearance unit (203) configured for analyzing an appearance of the body part, said appearance being derived from said video frames,

wherein:

said tracking unit (202) is further configured to use video frames received from said video camera for determining said at least one of the position of the player, the position of a ball, and the positions of other players, said feature extracting unit (204) is further configured to extract a second set of features from said appearance, and for combining said first set of features and said second set of features, and

said estimation unit (205) is further configured for determining an estimate for said pose by applying a trained classifier to the combined set of features, said classifier being associated with the machine learning technique.

15. A system (200) for estimating the pose of a body part of a team sport player using machine learning techniques, the system comprising:

a video camera (201),

a tracking unit (202) configured for determining at least one of the position of the player, the position of a ball, and the positions of other players, using video frames received from said video camera,

a body part appearance unit (203) configured for analyzing an appearance of the body part, said appearance being derived from said video frames,

a feature extracting unit (204) configured for extracting a first set of features from said positions and a second set of features from said appearance,

an estimation unit (205) configured for determining a first estimate for said pose by applying a trained first classifier to said first set of features, for determining a second estimate for said pose by applying a trained second classifier to said second set of features, said first classifier being associated with a first machine learning technique and said second classifier being associated with a second machine learning technique, and for combining said first estimate and said second estimate.