CN109903312B - Football player running distance statistical method based on video multi-target tracking - Google Patents


Info

Publication number
CN109903312B
CN109903312B (application CN201910071272.3A; first published as CN109903312A)
Authority
CN
China
Prior art keywords
tracking
frame
detection
target
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910071272.3A
Other languages
Chinese (zh)
Other versions
CN109903312A (en)
Inventor
毋立芳
付亨
简萌
徐得中
李则昱
卢哲
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201910071272.3A priority Critical patent/CN109903312B/en
Publication of CN109903312A publication Critical patent/CN109903312A/en
Application granted granted Critical
Publication of CN109903312B publication Critical patent/CN109903312B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

A football player running distance statistical method based on video multi-target tracking belongs to the field of sports data statistics. The distance a player covers on the field is an important match statistic. With the development of computer vision technology, a football player running distance statistical scheme based on football match video is provided. First, the method obtains multi-target tracking data by analyzing the match video. All tracking trajectories are then aggregated, the running trajectory and running distance of each player are computed through trajectory smoothing and top-view mapping, and a visualized result is output. The method is a complete solution for player running trajectory and running distance statistics, aims to reduce the cost of manual annotation, has been verified as feasible through testing, and has important application value.

Description

Football player running distance statistical method based on video multi-target tracking
Technical Field
The invention is applied in the field of sports data statistics, and particularly relates to computer vision and digital image processing technologies such as target detection, multi-target tracking, and homography transformation. Frames of a football match video, shot by dedicated cameras mounted in the stadium, are input in sequence; the human-shaped targets of the players are identified through target detection and tracking to obtain tracking trajectories; and the tracking results are aggregated to produce statistics of each player's running trajectory and distance.
Background
The running trajectory and distance of a player in a football match are important statistical data that can be used to evaluate the player's performance. At present these statistics rely on professional manual annotation, are offered by only a few professional sports data companies, are expensive to obtain, and are therefore mostly limited to professional matches. The invention computes the data automatically, assists manual work, and can greatly reduce cost. At the same time, the data gain much wider application: no longer limited to professional matches, they can also support amateur matches and daily training, helping more football players improve their level effectively, which has important application value.
The football match video used as input in the present invention is provided by dedicated cameras mounted in the stadium. Two camera groups are installed on 15-meter-high supports and cover the left and right halves of the pitch respectively. The video is shot from a fixed viewpoint: once installed, the height and depression angle of each camera no longer change. Due to the limits of camera imaging and shooting angle, the two camera groups are each responsible for the match content of one half of the pitch, with the halfway line as the boundary. An actual captured picture is shown in figure 1.
In the match video, the positions of the players' human-shaped targets are obtained through target detection and a multi-target tracking algorithm, and the running trajectory and distance of each player are computed after the position information is aggregated. In conclusion, the invention greatly reduces the workload of manual annotation and makes statistics of players' running trajectories and running distances far more convenient.
Disclosure of Invention
In order to realize running distance statistics for the players, an automatic measurement scheme based on multi-target tracking technology is provided; the flow of the scheme is shown in figure 2. First, the sequentially input match video images are preprocessed and a region of interest is fixed for player target detection. The image restricted to this region is then sent to the target detection module, which outputs detection boxes for the players in the region. Next, the detection result of each frame is sent to the multi-target tracking module, which performs association matching with Kalman trackers, updates the motion state parameters of each assigned tracker according to the assignment result, and outputs the players' tracking trajectories as the video frames iterate. Then, with manual correction, all tracking trajectories of each player over the whole video are aggregated to obtain a complete position trajectory for each player. Finally, the trajectory is sent to the data post-processing module, which performs trajectory smoothing and denoising, interpolation compensation, top-view mapping transformation, and running distance calculation, and outputs the required running distance statistics.
The invention contents of each main module of the method are as follows:
1. video image preprocessing module
The sequentially input video images need to be preprocessed before target detection. Preprocessing mainly comprises two steps: setting the target detection region of interest (ROI), and annotating the half-field boundary points of the video.
Due to the shooting angle of the video and the generalization ability of the target detection model for human-shaped targets, the match video captures, besides the player targets required by the invention, unneeded human-shaped targets such as pedestrians, spectators, and staff outside the pitch, as shown in fig. 3. Therefore, regions of no interest in the image must be filtered out in preprocessing, so that the detection model does not report these unneeded human-shaped targets. The flow of region-of-interest filtering is shown in fig. 4, and an example is shown in fig. 5.
At the same time, the four boundary coordinates of the half field covered by the video must also be annotated, for filtering redundant detection boxes later and for computing the homography mapping matrix.
2. Object detection module
The player's running distance is obtained mainly from the target tracking trajectory in the match video, scaled by the corresponding ratio, and the basis for updating the tracking state in the video is the detection of the players' human-shaped targets.
The target detection network structure is based on the multi-scale framework of the SSD detection algorithm and directly classifies and regresses the receptive-field box corresponding to each feature layer; an RF (receptive field) sampling-box gray learning strategy is adopted to avoid learning redundant feature parameters. An FCN fully convolutional network is constructed, and the network layers are selected according to the pixel range covered by the effective receptive field. The detection network has 22 convolution layers and predicts detection boxes at three scales; the three scales correspond to targets of different sizes in the video image, and the network directly infers target positions and confidence scores.
The program flow of object detection is shown in fig. 6. The match video image with the region of interest (ROI) set in the previous step is fed to this module as input, and the players' human-shaped targets are detected with a detection method based on receptive-field regression. First, the trained detection network model and its configuration parameters are loaded and the detection environment is deployed. Then the size range of the long edge of the human-shaped targets in the video is counted and the image resize scale is set, so that after resizing the long edge of each target box falls within the working range of the detection model. Next, the binary map of the output feature map is computed by the detection algorithm, and a minimum confidence threshold is set to keep only candidate boxes whose confidence exceeds the threshold. Non-maximum suppression is then applied, preferentially keeping candidate boxes with high confidence. Finally, among the filtered detection boxes, the lower-middle point of each box is taken as the player's position; its relation to the half-field boundary region, given by the boundary points annotated in preprocessing, is checked, and only boxes inside the pitch boundary region are kept, thereby filtering out non-player human-shaped targets near the ROI boundary and player targets not in the half-field region. The final detection output is shown in fig. 7; the rectangular boxes in the example figure are the output.
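The confidence filtering and non-maximum suppression steps just described can be sketched as follows. This is a minimal illustration, not the patent's actual implementation: the box format `(x1, y1, x2, y2, score)`, the helper names, and the default thresholds are assumptions.

```python
# Minimal sketch of confidence filtering + greedy non-maximum suppression.
# Boxes are (x1, y1, x2, y2, score) tuples; thresholds are illustrative.

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2, ...) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def filter_detections(boxes, conf_thresh=0.55, nms_thresh=0.6):
    # 1. Keep only candidates above the minimum confidence threshold.
    boxes = [b for b in boxes if b[4] >= conf_thresh]
    # 2. Greedy NMS: keep the highest-scoring box, drop heavy overlaps.
    boxes.sort(key=lambda b: b[4], reverse=True)
    kept = []
    for b in boxes:
        if all(iou(b, k) < nms_thresh for k in kept):
            kept.append(b)
    return kept
```

The two thresholds correspond to the minimum-confidence and suppression thresholds set in the detailed implementation below.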
3. Multi-target tracking module
3.1 Multi-target tracking Process
The biggest difference between target tracking and target detection is that tracking is a continuous process: compared with detection results, tracking results use a unique id to associate targets across frames. For each frame of the match video, the preprocessing and target detection modules are run, the output detection boxes are fed in sequence to the multi-target tracking module to track each player, and tracking trajectories are output over the whole video. The flow of multi-target tracking is shown in fig. 8.
In the multi-target tracking logic, the detection boxes of each frame are taken as input. Each detection box is first matched by similarity association against the existing Kalman filter trackers, and the boxes are divided into three classes according to the matching result: 1. detection and tracking boxes that matched successfully; 2. unmatched detection boxes; 3. unmatched tracking boxes. Each class is then handled with the corresponding operation shown in fig. 8 to keep tracking running normally.
1. Successfully matched detection and tracking boxes: for each matched pair, the coordinate information of the detection box (center-point horizontal coordinate x, center-point vertical coordinate y, box area s, box aspect ratio r, all in pixels) is used to update the motion state of the Kalman filter tracker assigned to it, for predicting the tracking box of the next frame. At the same time, the tracker's successful match count (Hits) is incremented and its time since last successful match (Age) is reset to zero. Finally, Hits is checked: the tracker state is set to 'determined' when Hits ≥ 3, and to 'tentative' otherwise.
2. Unmatched detection boxes: because the detection model is not 100% accurate, an unmatched detection box has two possible causes: it is a false detection, or a new target has appeared. False detections must therefore be filtered out. Analysis of the detector shows that occasionally an individual target is not suppressed in the non-maximum suppression step, so the same target yields two adjacent detections; such a false detection box is shown in fig. 9. Therefore the intersection over union (IOU) between each unmatched detection box and the tracking boxes is computed; when the IOU is larger than 0.7, the detection box is regarded as a false detection and filtered out. Each remaining unmatched detection box is treated as a newly appeared target: a tracker based on the Kalman filter motion model is created for it, its successful match count (Hits) is initialized to 0, a new target id is assigned, and the tracker state is set to 'tentative'. In the first frame after tracking starts no tracker has been initialized, so every detection box input in the first frame is handled as an unmatched detection box.
3. Unmatched tracking boxes: an unmatched tracking box has three possible causes: the tracked target has disappeared, the target is occluded, or the tracker was created from a false detection. Erroneous trackers are filtered first according to tracker state: if the state is 'tentative', the tracker is considered to have been created from a false detection box and is removed; if the state is 'determined', one of the two normal cases holds. When the target has disappeared, the tracking box should be removed; when the target is occluded, its detection box will reappear once the target leaves the occlusion, so the tracking box keeps its original motion state and continues predicting until the detection box reappears and is matched with the original tracking box again. The time since the last successful match (Age) decides between the two: when Age reaches a preset maximum the tracker is deleted; otherwise it is kept and continues predicting.
After the three classes of matching results have been processed, all tracking boxes in the 'determined' state in the current frame are returned as the tracking result of the frame.
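The tracker bookkeeping in steps 1-3 (Hits, Age, the 'tentative'/'determined' states, deletion at a maximum Age) can be summarized in a small sketch. The class layout and the `MAX_AGE` value are illustrative assumptions; only the Hits ≥ 3 rule comes from the text above.

```python
# Sketch of the tracker lifecycle bookkeeping described in steps 1-3.
MIN_HITS = 3      # matches needed before a tracker becomes 'determined'
MAX_AGE = 30      # frames without a match before deletion (assumed value)

class TrackState:
    def __init__(self, track_id):
        self.id = track_id
        self.hits = 0             # successful match count (Hits)
        self.age = 0              # frames since last successful match (Age)
        self.status = "tentative"

    def on_match(self):
        """Step 1: matched this frame — bump Hits, reset Age."""
        self.hits += 1
        self.age = 0
        if self.hits >= MIN_HITS:
            self.status = "determined"

    def on_miss(self):
        """Step 3: unmatched this frame. Returns False when the
        tracker should be deleted, True when it keeps predicting."""
        if self.status == "tentative":
            return False          # likely created from a false detection
        self.age += 1
        return self.age < MAX_AGE # keep predicting while occluded
```

A driver loop would call `on_match`/`on_miss` per frame and return the 'determined' trackers as the frame's result.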
3.2 Association matching Process
The steps above introduced the main logic of multi-target tracking; within it, the assignment problem between the input detection boxes and the Kalman filter trackers in association matching is especially important. When occlusion occurs, some Kalman tracking boxes may be tracking incorrectly, and because of the metric used, such an erroneous tracking box is very likely to be wrongly matched with a detection box, so that the detection box cannot be matched with the tracking box it should belong to, and the tracking box drifts. Alternatively, a correct tracking box may be matched with a detection box that belongs to another tracking box, so the ids of the two tracking boxes are swapped. To solve this problem, the method designs a confidence-based priority matching scheme. The association matching logic is shown in fig. 10.
All detection boxes of the current frame and the tracking boxes of the Kalman motion model are taken as input for association matching. First, according to tracker state, the tracking boxes in the 'determined' state are matched with priority, and the tracking boxes in the 'tentative' state are matched last. Among the 'determined' tracking boxes, the time since the last successful match (Age) is used as a confidence value to rank matching priority: the smaller the Age, the higher the confidence and the earlier the box is matched; the larger the Age, the lower the confidence and the later the box is matched. The detailed matching logic is shown in fig. 10. In each matching round, the similarity between detection boxes and tracking boxes is computed with the metric, the most similar pairs are uniquely assigned and output as matched pairs, and the detection and tracking boxes left unmatched in this round are passed to the next priority level for further association matching, so the assignment order follows the confidence priority. The final output divides the boxes into: successfully matched pairs, unmatched detection boxes, and unmatched tracking boxes.
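The confidence-priority cascade can be sketched as below: 'determined' trackers are matched in rounds ordered by increasing Age, 'tentative' trackers last, with leftovers cascading to the next round. The greedy nearest-pair `match_round`, the distance gate, and the tracker-dict layout are stand-in assumptions (the patent uses the metric-based assignment of section 3.3), and running one round per tracker simplifies the per-priority-level rounds.

```python
# Sketch of confidence-priority association: 'determined' trackers are
# matched first, ordered by Age (smaller Age = higher confidence),
# 'tentative' trackers last; unmatched boxes cascade to later rounds.
import math

def dist(det, trk):
    return math.hypot(det[0] - trk["pos"][0], det[1] - trk["pos"][1])

def match_round(dets, trks, gate=50.0):
    """Greedy nearest-pair stand-in for one assignment round."""
    pairs, dets, trks = [], list(dets), list(trks)
    while dets and trks:
        d, t = min(((d, t) for d in dets for t in trks),
                   key=lambda p: dist(p[0], p[1]))
        if dist(d, t) > gate:           # gate out implausible pairs
            break
        pairs.append((d, t["id"]))
        dets.remove(d)
        trks.remove(t)
    return pairs, dets

def cascade_match(dets, trackers):
    determined = sorted([t for t in trackers if t["status"] == "determined"],
                        key=lambda t: t["age"])    # small Age first
    tentative = [t for t in trackers if t["status"] == "tentative"]
    matches = []
    for trk in determined + tentative:  # one round per priority level
        pairs, dets = match_round(dets, [trk])
        matches += pairs
    return matches, dets                # leftover dets stay unmatched
```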
3.3 metric similarity procedure
The above steps describe the priority order when detection boxes and tracking boxes are matched; how to match them is an equally important question, and the specific matching flow is shown in fig. 11.
All detection boxes and tracking boxes to be matched are taken as input. In this application scene, the similarity between a detection box and a tracking box is considered mainly at the level of spatial distance, so the two-dimensional Euclidean distance between the lower-middle points of the two boxes is taken as the similarity metric, computed as:

d(D, T) = √[(x_D − x_T)² + (y_D − y_T)²]

where (x_D, y_D) and (x_T, y_T) are the lower-middle point coordinates of detection box D and tracking box T.
The similarity between every tracking box and every detection box is computed with this metric to form a cost matrix whose rows are the tracking boxes and whose columns are the detection boxes. Pairs whose distance exceeds the distance threshold are then filtered by setting their cost to infinity, so that the Hungarian Algorithm ignores them when computing the minimum-cost assignment; this avoids obviously wrong minimum-cost pairs and filters erroneous assignments in extreme cases. The final matching result is then output.
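The cost-matrix gating and minimum-cost assignment can be illustrated with a brute-force stand-in for the Hungarian Algorithm; this is feasible only for the tiny example here (a real implementation would use the Hungarian method proper), and the function name and gate value are assumptions.

```python
# Cost-matrix assignment sketch: distances above the gate become infinity,
# then the minimum-cost one-to-one assignment is found. Brute force over
# permutations stands in for the Hungarian Algorithm (tiny inputs only;
# assumes at least as many detections (columns) as trackers (rows)).
import itertools, math

def assign(cost, gate):
    n_trk = len(cost)
    n_det = len(cost[0])
    gated = [[c if c <= gate else math.inf for c in row] for row in cost]
    best, best_cost = None, math.inf
    # Try every way of assigning each tracker row a distinct detection.
    for perm in itertools.permutations(range(n_det), n_trk):
        total = sum(gated[i][perm[i]] for i in range(n_trk))
        if total < best_cost:
            best, best_cost = perm, total
    if best is None:
        return []
    # Drop pairs whose cost was gated to infinity.
    return [(i, best[i]) for i in range(n_trk)
            if gated[i][best[i]] < math.inf]
```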
3.4 Kalman Filter model
In the invention, each tracking box uses a Kalman filter based on a constant-velocity model and a linear observation model to predict the target's motion state. The prediction result is (x, y, s, r), and 8 parameters (x, y, s, r, v_x, v_y, v_s, v_r) describe the motion state, where x and y are the coordinates of the center point of the target box, s is the box area, r is the box aspect ratio, and each v is the corresponding rate of change. The principle of Kalman filtering is not described in detail here. The result of multi-target tracking based on the above steps is shown in fig. 12.
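A minimal constant-velocity Kalman filter for one state component can be sketched as follows; the same two-state structure applies independently to each of x, y, s, r with its velocity. The noise values and initial covariance are illustrative assumptions, not the patent's tuning.

```python
# Minimal 1-D constant-velocity Kalman filter: state [position, velocity],
# linear observation of position only. The same structure covers each of
# x, y, s, r and its rate of change v. Noise values are assumptions.
class Kalman1D:
    def __init__(self, pos, q=1e-2, r=1.0):
        self.x = [pos, 0.0]                  # state: position, velocity
        self.P = [[10.0, 0.0], [0.0, 10.0]]  # state covariance
        self.q, self.r = q, r                # process / measurement noise

    def predict(self, dt=1.0):
        # x <- F x with F = [[1, dt], [0, 1]]
        self.x = [self.x[0] + dt * self.x[1], self.x[1]]
        p00, p01 = self.P[0]
        p10, p11 = self.P[1]
        # P <- F P F^T + Q
        self.P = [[p00 + dt * (p10 + p01) + dt * dt * p11 + self.q,
                   p01 + dt * p11],
                  [p10 + dt * p11, p11 + self.q]]
        return self.x[0]

    def update(self, z):
        # Observation H = [1, 0]: innovation, gain, then corrected state.
        s = self.P[0][0] + self.r
        k0, k1 = self.P[0][0] / s, self.P[1][0] / s
        y = z - self.x[0]
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        self.P = [[(1 - k0) * self.P[0][0], (1 - k0) * self.P[0][1]],
                  [self.P[1][0] - k1 * self.P[0][0],
                   self.P[1][1] - k1 * self.P[0][1]]]
```

Feeding in measurements from a constant-velocity target, the velocity estimate converges to the true rate of change, which is what lets an occluded tracker keep predicting plausibly.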
4. Data post-processing module
4.1 splicing trajectories
Due to the objective constraints of the shooting angle, there are many cases of severe occlusion, and in these extreme cases the multi-target tracking algorithm cannot reach one hundred percent accuracy. The multiple segments of a target's tracking trajectory must therefore be spliced and aggregated to obtain complete running trajectory statistics; compared with traditional manual statistics, this still saves a large amount of time.
4.2 track smoothing
The invention takes the lower-middle point of the tracking box as the player target's position coordinate. Because the target box changes with the human pose, directly using the raw lower-middle point as the tracking trajectory causes position jitter, so trajectory smoothing is added to obtain more reliable tracking data. The flow of trajectory smoothing is shown in fig. 13.
The tracking data are input in sequence; with the sampling sliding window length set to 20, for example, the tracking data of the first 10 and last 10 frames of the video are skipped. Then, for each id in turn, the tracking data of that id over the surrounding 20 frames are taken, and the medians of the horizontal and vertical image coordinates of the lower-edge midpoint over those 20 frames are used as the smoothed position. The trajectory of every id in every frame is processed this way in turn to complete the smoothing.
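The sliding-window median smoothing above can be sketched for a single id's trajectory; the function name and the per-frame `(x, y)` input format are assumptions.

```python
# Sketch of the 20-frame sliding-window median smoothing: for each frame,
# the median x and median y of the lower-edge midpoint over the window
# around it replace the raw position; the first and last half-window
# frames are skipped, as described above.
from statistics import median

def smooth_track(points, window=20):
    """points: list of (x, y) per frame for one id. Returns smoothed list."""
    half = window // 2
    out = []
    for i in range(half, len(points) - half):
        win = points[i - half:i + half]      # the surrounding 20 frames
        out.append((median(p[0] for p in win),
                    median(p[1] for p in win)))
    return out
```

The median makes the smoothed trajectory robust to single-frame jitter or outlier boxes.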
4.3 position coordinate mapping
The smoothed multi-target tracking results need to be mapped onto a top view of a standard football pitch, so that accurate running distance data can be computed according to the mapping scale. The invention uses a Homography mapping. Since the football pitch can be assumed to be a plane, for a tracking position p = (x, y, 1)^T in the video image, the top-view mapping coordinate can be expressed as p' = Hp, where the homography transformation matrix H is a 3 × 3 matrix. H relates the original coordinates (x, y) and the mapped coordinates (u, v) as follows:
(u', v', w')^T = H (x, y, 1)^T,  H = [h11 h12 h13; h21 h22 h23; h31 h32 h33]

u = u'/w' = (h11·x + h12·y + h13) / (h31·x + h32·y + h33)
v = v'/w' = (h21·x + h22·y + h23) / (h31·x + h32·y + h33)
the mapping procedure is shown in fig. 14. Firstly, the positions of the court boundary points are obtained in the preprocessing step, and the coordinates of the boundary points corresponding to the top view of a standard football court are found, wherein the top view of the standard football court is shown in figure 15 for example; then, a homography transformation matrix is calculated based on the two sets of coordinate values, the matrix representing the mapping relationship between the two images, and a mapping result of the tracking result position in the top view can be calculated based on the matrix, as shown in fig. 16.
4.4 calculating running distance
The invention outputs the running trajectories of all players in the match video and counts the running distance of every player; the output logic is shown in figure 17.
All tracking mapping data are input, and all trajectories corresponding to every id in every frame are cycled through in turn. Then the size ratio between the same boundary line in the pitch top view (pixels) and on the actual field (meters) is counted to obtain the scale. Finally the distances between successive tracking positions of each player are computed, and the running trajectory and running distance of every player are output.
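The distance computation itself reduces to summing the lengths of the segments between successive top-view positions, scaled from pixels to meters; the function name and input layout are assumptions.

```python
# Sketch of the running-distance computation: successive top-view positions
# (pixels) are converted to meters with the pitch scale and summed.
import math

def running_distance(track, meters_per_pixel):
    """track: list of (u, v) top-view positions in pixel coordinates."""
    return sum(math.hypot(u2 - u1, v2 - v1) * meters_per_pixel
               for (u1, v1), (u2, v2) in zip(track, track[1:]))
```

`meters_per_pixel` is obtained as described above, by dividing a boundary line's real length in meters by its length in top-view pixels.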
Drawings
FIG. 1 is an example of a video shot of a soccer game;
(a) a left half match video, (b) a right half match video;
FIG. 2 is a general framework of a running distance statistical scenario for a soccer player;
FIG. 3 is an example of a video image of a soccer game;
(a) human-shaped target detection within a region of interest, (b) an example of human-shaped targets within a region of non-interest;
FIG. 4 is a schematic view of a process for filtering a region of interest of a game image;
FIG. 5 is an example of a game image region of interest filtering;
FIG. 6 is a schematic view of the human-shaped object detection process;
FIG. 7 is an example of player humanoid target detection;
FIG. 8 is a schematic view of a multi-target tracking process;
FIG. 9 is an example of a target detection false positive frame;
FIG. 10 is a schematic diagram illustrating a flow of association matching between a detection frame and a tracking frame;
FIG. 11 is a schematic flow chart of similarity measurement;
FIG. 12 is an example of a multi-target tracking result;
FIG. 13 is a schematic diagram of a track smoothing process;
FIG. 14 is a schematic top view mapping flow diagram;
FIG. 15 is a top view example of a standard football pitch;
FIG. 16 is a top view coordinate mapping example;
(a) a football match video player tracking result, (b) a court top view mapping result;
FIG. 17 is a schematic flow chart of a process for calculating a running distance;
FIG. 18 is a data post-processing example;
(a) a soccer match video player tracking result, (b) a court top view mapping result, (c) a top view mapping example;
FIG. 19 is an output result example;
(a) player target running trajectory, running distance example 1, (b) player target running trajectory, running distance example 2;
Detailed Description
The invention provides a football match player running distance statistical method based on multi-target tracking.
The method comprises the following specific implementation steps:
1. video image preprocessing module
The left-half and right-half match videos to be counted are selected as required, and each frame of the videos is input in time order. The left and right half-field videos are aligned in time; the image resolution of the match video is 2704 × 1520 pixels, and the frame rate is 30 frames per second. An example image is shown in fig. 1.
A region of interest (ROI) is defined in the input match video so that the detection model only identifies the football players' human-shaped targets within the region, filtering out unneeded human-shaped targets such as pedestrians, spectators, and staff outside the field. That is, the half-field area covered by the video is selected, extended appropriately past the pitch boundary so that a player standing on the boundary line can still be fully detected. Practice has shown that extending the region of interest (ROI) 50 pixels outward from the boundary line works best. By annotation, the ROI coordinate points in the test video are [195, 320], [2165, 1190], [2480, 285], [1441, 161] in the left video, and [609, 1278], [2491, 183], [1215, 165], [219, 420] in the right video. The coordinates are connected in order into a quadrilateral region, and the pixel values of the image outside the region are set to 0 to black it out; the processed left video is shown in fig. 5.
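The quadrilateral ROI masking (pixels outside the region set to 0) can be sketched with a ray-casting point-in-polygon test. A real pipeline would use an image library's polygon fill (e.g. OpenCV's `fillPoly`); the tiny list-of-rows image here is synthetic, and all names are illustrative.

```python
# Sketch of ROI masking: every pixel outside the quadrilateral defined by
# the four annotated corner points is set to 0 (black). The ray-casting
# test and the list-of-rows image are for illustration only.

def inside(poly, x, y):
    """Ray-casting point-in-polygon test."""
    hit = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses the horizontal ray
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                hit = not hit
    return hit

def mask_roi(image, poly):
    """image: list of rows of pixel values. Zero pixels outside poly."""
    return [[px if inside(poly, x, y) else 0
             for x, px in enumerate(row)]
            for y, row in enumerate(image)]
```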
At the same time, the four half-field boundary points of the video are annotated: the two corner flag points and the two intersections of the halfway line with the side lines, and the coordinates of these points in the video image are recorded. In the test video, the half-field boundary points are [256, 345], [2114, 1119], [2435, 326], [1447, 186] in the left video, and [636, 1207], [2431, 206], [1193, 193], [251, 450] in the right video. They are used for filtering redundant detection boxes later and for computing the homography mapping matrix.
2. Object detection module
2.1 data set Generation
In order to train the target detection model, a data set of football players' human-shaped targets must be prepared. A data set containing 50000 human-shaped targets was produced through annotation, for training and testing the detection network parameters.
2.2 making an object detection model
The target detection network is implemented on the MXNet deep learning framework, runs on a Linux system, and is computed on GPU. The prepared data set is fed into the training network with the parameters set for training; after 1,000,000 iterations the training converges and produces a three-scale detection model whose working range is target long edges of 30-180 pixels.
2.3 outputting player target detection results
The program flow of object detection is shown in fig. 6. The preprocessed football game video image is used as input.
(1) Loading the trained target detection model and training parameters, and deploying a target detection environment;
(2) counting the size range of the long edge of the human-shaped targets in the target video and setting the image resize scale to 1.2, so that after resizing the long edge of each target box falls within the working range of the detection model;
(3) computing the binary map of the output feature map with the target detection algorithm and setting the minimum confidence threshold to 0.55, keeping the target candidate boxes whose confidence exceeds the threshold;
(4) applying non-maximum suppression with the suppression threshold set to 0.6, preferentially keeping candidate boxes with high confidence;
(5) finally, among the filtered detection boxes, taking the coordinate of the lower-middle point of each box as the player's position, judging its relation to the half-field boundary region according to the boundary point coordinates obtained in the preprocessing step, and keeping only the boxes inside the pitch boundary region, thereby filtering out non-player human-shaped targets near the region of interest (ROI) boundary and player targets not in the half-field region.
The final player detection output is shown in fig. 7; the rectangular frames in the example figure are the detection results. The detector achieves a Precision of 98.83%, a Recall of 92.86%, and an F1-Score of 95.75%.
3. Multi-target tracking module
3.1 Multi-target tracking
The operations of the preprocessing module and the target detection module are repeated for every frame of the football match video. The output target detection frames are sent, frame by frame, into the multi-target tracking module to track each player, and the tracking trajectories over the whole video are output. The flow of multi-target tracking is shown in fig. 8.
In the multi-target tracking logic, the target detection frames of each frame are used as input. Each detection frame is first associated with the existing Kalman filter trackers by similarity matching, and the detection and tracking frames are divided into three classes according to the matching result: 1. successfully matched detection/tracking frame pairs; 2. unmatched detection frames; 3. unmatched tracking frames. Each class is then handled by the corresponding operation shown in fig. 8:
1. Successfully matched detection and tracking frames: for each successfully matched pair, the coordinate information of the detection frame (center-point horizontal coordinate x, center-point vertical coordinate y, frame area s, and frame aspect ratio r, all in pixel units) is used to update the motion state of the assigned Kalman filter tracker, which then predicts the tracking frame for the next frame. At the same time, the tracker's successful-match count (Hits) is incremented and the time since its last successful match (Age) is reset to zero. Finally, Hits is checked: when Hits is greater than or equal to 3 the tracker's state is set to 'confirmed', otherwise it remains 'tentative'.
2. Unmatched detection frames: because the target detection model is not 100% accurate, an unmatched detection frame has two possible causes: it is a false detection, or a new target has appeared. False detections therefore need to be filtered out. Performance analysis of the detector shows that occasionally a target is not properly suppressed in the non-maximum suppression step, so that two adjacent detections fall on the same target; such a false detection frame is shown in fig. 9. The intersection-over-union (IOU) between each unmatched detection frame and the tracking frames is therefore computed, and when the IOU exceeds 0.7 the detection frame is treated as a false detection and discarded. Each remaining unmatched detection frame is taken to represent a newly appeared target: a tracker based on a Kalman filter motion model is created for it, its successful-match count (Hits) is initialized to 0, a new target id is assigned, and its state is set to 'tentative'. In the first frame after tracking starts no trackers exist yet, so every input detection frame of the first frame is handled as an unmatched detection frame.
3. Unmatched tracking frames: an unmatched tracking frame has three possible causes: the tracked target has disappeared, the tracked target is occluded, or the tracker was created by a false detection. Erroneous trackers are filtered first by the tracker's state: if the state is 'tentative', the tracker is considered to have been created by a false detection frame and is removed; if the state is 'confirmed', one of the two normal cases applies. When the target has disappeared, the tracking frame should be removed; when the target is occluded, its detection frame will reappear once the target leaves the occlusion, so the tracking frame must keep predicting with its original motion state until the detection frame reappears and is matched to the original tracker again. The time since the last successful match (Age) is therefore used to decide between retention and deletion: when Age reaches a preset maximum the tracker is deleted, otherwise it is retained and continues predicting.
The association matching above uses confidence-based priority matching; an embodiment is shown in fig. 10. All detection frames and tracking frames to be matched are used as input, and the two-dimensional Euclidean distance between the upper midpoints of the two frames is the similarity metric. The similarity between every tracking frame and every detection frame is computed, forming a similarity cost matrix whose rows correspond to tracking frames and whose columns correspond to detection frames. Matching pairs whose distance exceeds the metric distance threshold are then filtered by setting their similarity to infinity, which removes obviously wrong assignments in extreme cases. The Hungarian Algorithm then computes the minimum-cost assignment and outputs the final matching result. In the invention, each tracking frame uses a Kalman filter with a constant-velocity model and a linear observation model to predict the target's motion state. The prediction result is (x, y, s, r), and the full motion state is described by 8 parameters (x, y, s, r, vx, vy, vs, vr), where x and y are the coordinates of the midpoint of the target frame, s is its area, r is its aspect ratio, and the v terms are their rates of change.
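The distance-gated assignment can be sketched as follows. For brevity this sketch uses a greedy nearest-pair approximation in place of the Hungarian algorithm the patent specifies (greedy matching is not guaranteed to reach the minimum total cost, but it illustrates the gating and the three output classes); the 30-pixel threshold is the tuned value reported later:

```python
import math

def associate(track_pts, det_pts, dist_thresh=30.0):
    """Match tracker points to detection points by Euclidean distance.
    Pairs farther apart than dist_thresh are never matched (the
    'infinite similarity' filtering in the cost matrix)."""
    pairs = []
    for i, (tx, ty) in enumerate(track_pts):
        for j, (dx, dy) in enumerate(det_pts):
            d = math.hypot(tx - dx, ty - dy)
            if d <= dist_thresh:          # gate: drop far-apart pairs
                pairs.append((d, i, j))
    pairs.sort()                          # closest pairs matched first
    used_t, used_d, matches = set(), set(), []
    for d, i, j in pairs:
        if i not in used_t and j not in used_d:
            matches.append((i, j))
            used_t.add(i)
            used_d.add(j)
    unmatched_tracks = [i for i in range(len(track_pts)) if i not in used_t]
    unmatched_dets = [j for j in range(len(det_pts)) if j not in used_d]
    return matches, unmatched_tracks, unmatched_dets
```

The three return values correspond directly to the three classes handled by the tracking logic: matched pairs, unmatched tracking frames, and unmatched detection frames. An optimal implementation would replace the greedy loop with `scipy.optimize.linear_sum_assignment` on the cost matrix.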
After the matching results of the three classes have been processed, all tracking frames in the 'confirmed' state in the current frame are returned as the tracking result of that frame. The results of multi-target tracking based on the above steps are shown in fig. 12.
3.2 tracking parameter tuning
The multi-target tracking algorithm is evaluated on a test video annotated with ground-truth labels. The following parameters are then tuned according to the evaluation results to reach optimal performance:
1. the minimum match count (Hits) in multi-target tracking;
2. the maximum tracker age (Age) in multi-target tracking;
3. the metric distance threshold in multi-target tracking.
Setting the minimum match count (Hits) to 3, the tracker's maximum age (Age) to 7 frames, and the metric distance threshold to 30 pixels yields a multi-object tracking accuracy (MOTA) of 92.05% and an IDF1 of 79.76% on the test video.
4. Data post-processing module
4.1 splicing trajectories
Owing to the objective constraints of the shooting angle, the video contains several severe occlusions, and in these extreme cases the multi-target tracking algorithm cannot reach one-hundred-percent accuracy. The multiple tracking segments of a player therefore need to be spliced together so that a complete running trajectory can be compiled for each target.
4.2 track smoothing
The spliced tracking trajectories are then post-processed. First, median smoothing is applied to the coordinates of each trajectory with a smoothing window of 20 frames, yielding smooth trajectories in video-coordinate space. The program flow of trajectory smoothing is shown in fig. 13.
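The median smoothing step can be sketched as a sliding-window median over each coordinate sequence (applied separately to the x and y series of a trajectory). The function name is illustrative; the 20-frame window is the value stated above:

```python
import statistics

def median_smooth(values, window=20):
    """Sliding-window median filter for one coordinate sequence.
    The window is truncated at the ends of the trajectory, so short
    detection jitters are suppressed without shortening the track."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(statistics.median(values[lo:hi]))
    return out
```

A single-frame outlier, such as a momentary detection jump, is replaced by the median of its neighbors, while sustained motion passes through unchanged.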
4.3 position coordinate mapping
According to the half-field boundary point coordinates marked in the video preprocessing module, the homography matrix H between corresponding points of the video and the top-view court is computed. Applying this homography to each target yields its mapped position in the top view, as shown in fig. 16. The match videos of the left and right halves are mapped onto the same top view, fusing the two camera views; the effect is shown in fig. 18.
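Once H is known, mapping a video-frame point into the top view is a projective transform in homogeneous coordinates. A minimal sketch (H itself would be estimated from the four marked half-field boundary points, e.g. with OpenCV's `cv2.findHomography`; here H is assumed given as a 3x3 nested list):

```python
def apply_homography(H, pt):
    """Map a 2-D point through a 3x3 homography matrix H.
    The point is lifted to homogeneous coordinates (x, y, 1),
    multiplied by H, and divided by the resulting w component."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return (u, v)
```

The division by w is what distinguishes a homography from an affine transform and is what corrects the perspective foreshortening of the camera view into a metric top view.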
4.4 calculating running distance
The pixel-to-meter ratio is computed from a common reference object in the top view and the real court, taking the long sideline as an example: the sideline is 1472 pixels long in the top view, and the real sideline measured on site is 95 meters, giving a ratio of about 15.5 pixels/meter. Finally, the tracking positions of each target are aggregated and scaled by this ratio to obtain the running trajectory and running distance. The results on the test video are shown in fig. 19, where the running trajectories of the two example players are plotted as blue lines; their running distances are 57 and 33 meters, respectively.
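The final distance computation reduces to summing consecutive point-to-point distances along the smoothed top-view trajectory and converting pixels to meters with the sideline-derived ratio. A minimal sketch using the figures stated above:

```python
import math

PIXELS_PER_METER = 1472 / 95  # about 15.5, from the 95 m long sideline

def running_distance(track_pts):
    """Total path length of a top-view trajectory, in meters.
    track_pts is a list of (x, y) positions in top-view pixels."""
    pixel_dist = sum(math.dist(p, q)
                     for p, q in zip(track_pts, track_pts[1:]))
    return pixel_dist / PIXELS_PER_METER
```

Because the distance is a sum over per-frame segments, the preceding median smoothing matters: unsmoothed detection jitter would otherwise inflate the total.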

Claims (1)

1. A football player running distance statistical method based on video multi-target tracking,
firstly, the sequentially input football match video images are preprocessed and the region of interest for player target detection is locked; the image of the locked region is then sent into the target detection module to obtain the players' target detection frames in the region; the detection result of each frame is next sent to the multi-target tracking module, which associates it with the Kalman trackers, updates the motion state parameters of each assigned tracker, and outputs the players' tracking trajectories as the video frames iterate; all tracking segments of each player over the complete video are then gathered, with manual correction, into a complete player position trajectory; finally, the trajectory is sent to the data post-processing module, which performs trajectory smoothing and denoising, interpolation compensation, top-view mapping transformation, and running distance calculation, and outputs the required running distance statistics;
the method is characterized in that the multi-target tracking is as follows:
3.1 Multi-target tracking Process
In the multi-target tracking logic, the target detection frames of each frame are used as input; each detection frame is first associated with the existing Kalman filter trackers by similarity matching, and the detection and tracking frames are divided into three classes according to the matching result: 1. successfully matched detection/tracking frame pairs; 2. unmatched detection frames; 3. unmatched tracking frames; each class is then handled by the corresponding operation below to keep tracking running normally;
1) successfully matched detection and tracking frames: for each successfully matched pair, the coordinate information of the detection frame is used to update the motion state of the assigned Kalman filter tracker, which then predicts the tracking frame of the next frame; at the same time, the tracker's successful-match count Hits is incremented and the time since its last successful match, Age, is reset to zero; finally Hits is checked, and when Hits is greater than or equal to 3 the tracker's state is set to 'confirmed', otherwise 'tentative';
2) unmatched detection frames:
the intersection-over-union (IOU) between each unmatched detection frame and the tracking frames is computed, and when the IOU exceeds 0.7 the detection frame is treated as a false detection and discarded; each remaining unmatched detection frame is taken to represent a newly appeared target: a tracker based on a Kalman filter motion model is created for it, its successful-match count Hits is initialized to 0, a new target id is assigned, and its state is set to 'tentative'; in the first frame after tracking starts no trackers exist yet, so every input detection frame of the first frame is handled as an unmatched detection frame for association matching;
3) unmatched tracking frames: an unmatched tracking frame has three possible causes: the tracked target has disappeared, the tracked target is occluded, or the tracker was created by a false detection; erroneous trackers are filtered first by the tracker's state, classified as 'confirmed' or 'tentative': if the state is 'tentative' the tracker is considered to have been created by a false detection frame, and if the state is 'confirmed' one of the two normal cases applies; when the target has disappeared the tracking frame is removed, and when the target is occluded its detection frame reappears once the target leaves the occlusion, so the tracking frame keeps predicting with its original motion state until the detection frame reappears and is matched to the original tracker; the time since the last successful match, Age, is therefore used to decide between retention and deletion: when Age reaches the preset maximum the tracker is deleted, otherwise it is retained and continues predicting;
after the matching results of the three classes have been processed, all tracking frames in the 'confirmed' state in the current frame are returned as the tracking result of the current frame;
3.2 Association matching Process
firstly, all detection frames of the current frame and the tracking frames of the Kalman motion model are taken as input for association matching; according to each tracking frame's state, the 'confirmed' tracking frames are matched first and the 'tentative' ones last; among the 'confirmed' tracking frames, the time since the last successful match is used as a confidence to rank their matching priority: the shorter that time, the higher the confidence and the earlier the tracking frame is matched; the longer that time, the lower the confidence and the later it is matched; in each matching round the similarity between the detection frames and the tracking frames is computed according to the metric, the most similar detection and tracking frames are uniquely assigned and output as matched pairs, and the detection and tracking frames left unmatched in this round are passed to the next priority level for further association matching; proceeding by confidence priority in this way resolves the assignment order of detection and tracking frames; the final output is divided into: successfully matched pairs, unmatched detection frames, and unmatched tracking frames;
3.3 metric similarity procedure
firstly, all detection frames and tracking frames to be matched are taken as input; in this application scenario the similarity between a detection frame and a tracking frame is considered in terms of spatial distance, so the two-dimensional Euclidean distance between the upper midpoints of the two frames is used as the similarity metric;
the similarity between every tracking frame and every detection frame is computed with this metric, forming a similarity cost matrix whose rows correspond to tracking frames and whose columns correspond to detection frames; matching pairs whose distance exceeds the distance threshold are then filtered from the cost matrix by setting their similarity to infinity, preventing obviously wrong minimum-cost matches and filtering out erroneous assignments in extreme cases, and the final matching result is output.