CN115731266A - Cross-camera multi-target tracking method, device and equipment and readable storage medium


Info

Publication number: CN115731266A
Application number: CN202211485794.6A
Authority: CN (China)
Prior art keywords: target, detection frame, camera, detection, handed over
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 安宁, 王子依, 周斌, 胡波, 李艳红
Current Assignee: Wuhan Etah Information Technology Co., Ltd.
Application filed by Wuhan Etah Information Technology Co., Ltd.
Priority to: CN202211485794.6A
Publication of: CN115731266A

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to a cross-camera multi-target tracking method, apparatus, device and readable storage medium. The method corrects the pictures shot by the cameras and generates a view-field boundary and a target handover line for each camera; detects single-camera targets through a coordinate attention mechanism network to obtain a plurality of detection frames for each camera; tracks the detection frames with a multi-target tracking algorithm and generates an identity ID for each of them; determines, according to the view-field boundary and the target handover line, the detection frames to be handed over that enter the overlapping area of the current camera, and converts the image coordinates of the detection frames to be handed over and of the current camera's target detection frames into world coordinates; calculates the centroid position proximity and the detection frame overlap degree from the world coordinates; and performs target consistency determination and target handover between the detection frames to be handed over and the target detection frames based on the identity ID, the centroid position proximity and the detection frame overlap degree, thereby achieving accurate cross-camera multi-target tracking and identification.

Description

Cross-camera multi-target tracking method, device and equipment and readable storage medium
Technical Field
The present application relates to the field of target tracking technologies, and in particular, to a cross-camera multi-target tracking method, apparatus, device, and readable storage medium.
Background
Multi-target multi-view tracking aims to determine the position and identity of each person from video streams shot by multiple cameras, and is widely applied in fields such as autonomous driving, human-computer interaction and virtual reality. In recent years its analysis results have also come to play an important role in sports: they can assist the tactical planning of competitions, make training more scientific and efficient, enhance sports video broadcasts, and provide interactive content for audiences.
As a key technology of high-speed sports motion image analysis systems, multi-target multi-view tracking can meet the requirement of monitoring a large scene from multiple angles. However, the large inter-frame displacement of high-speed moving targets brings problems such as target adhesion, changes in body posture and changes in lighting, so existing tracking algorithms struggle to track accurately, which easily causes identity switches or identity loss. In addition, cross-view multi-target re-identification and tracking is difficult because feature-matching errors between adjacent cameras interfere with the target handover effect, viewpoint differences between cameras cause sudden changes in target appearance, and athletes' appearances are hard to distinguish. Therefore, how to achieve accurate cross-camera multi-target tracking and identification is a problem that urgently needs to be solved.
Disclosure of Invention
The application provides a cross-camera multi-target tracking method, a device, equipment and a readable storage medium, which are used for realizing cross-camera multi-target accurate tracking and identification.
In a first aspect, a cross-camera multi-target tracking method is provided, which includes the following steps:
correcting pictures shot by a plurality of cameras based on a checkerboard correction method, and generating a view boundary and a target intersection line for each corrected camera;
detecting a single-camera target through a coordinate attention mechanism network to obtain a plurality of detection frames corresponding to the single camera;
target tracking is carried out on all detection frames through a preset multi-target tracking algorithm, and identity IDs are generated for the detection frames corresponding to all targets;
determining a detection frame to be handed over which enters the overlapping area of the current camera from the detection frames corresponding to the adjacent previous camera according to the view boundary and the target handing over line;
converting image coordinates of a detection frame to be handed over and a target detection frame corresponding to a current camera into world coordinates;
respectively calculating the centroid position proximity and the detection frame overlap degree between the detection frame to be handed over and each target detection frame according to the world coordinates;
and performing target consistency judgment and target handover on the detection frame to be handed over and each target detection frame based on the identity ID, the proximity of the centroid position and the detection frame overlapping degree.
In some embodiments, the performing, based on the identity ID, the proximity of the centroid position, and the detection frame overlap, the target consistency determination and the target handover for the detection frame to be handed over and each target detection frame includes:
performing target consistency judgment on the detection frame to be handed over and each target detection frame according to the centroid position proximity and the detection frame overlapping degree;
taking the target detection frame which has the minimum centroid position proximity and the maximum detection frame overlapping degree with the detection frame to be handed over as a candidate target frame;
and giving the ID of the detection frame to be handed over to the candidate target frame so as to realize the target hand-over across the cameras.
In some embodiments, the minimum centroid position proximity is determined as follows: when the world coordinates of a target detection frame and the world coordinates of the detection frame to be handed over satisfy the following first condition, the target detection frame and the detection frame to be handed over are judged to have the minimum centroid position proximity. The first condition is:

$$\sqrt{(X_w' - \hat{X}_w)^2 + (Y_w' - \hat{Y}_w)^2} = \min_{1 \le i \le n} \sqrt{(X_w^{(i)} - \hat{X}_w)^2 + (Y_w^{(i)} - \hat{Y}_w)^2}$$

where $(X_w', Y_w')$ denotes the world coordinates of the target detection frame, $(\hat{X}_w, \hat{Y}_w)$ denotes the world coordinates of the detection frame to be handed over, and $(X_w^{(i)}, Y_w^{(i)})$, $i = 1, \dots, n$, are the world coordinates of the $n$ target detection frames in the current view.
In some embodiments there are multiple detection frames to be handed over, and after the step of calculating the centroid position proximity and the detection frame overlap degree between the detection frames to be handed over and each target detection frame according to the world coordinates, the method further includes:
calculating the Euclidean distance between one detection frame to be handed over and the other detection frames to be handed over according to their world coordinates;
if the Euclidean distance is larger than a distance threshold, executing, for that detection frame to be handed over, the step of performing target consistency determination and target handover with each target detection frame based on the identity ID, the centroid position proximity and the detection frame overlap degree;
and if the Euclidean distance is smaller than or equal to the distance threshold, performing target handover between the detection frame to be handed over and the target detection frames through the Hungarian algorithm combined with the centroid position proximity.
In some embodiments, the coordinate attention mechanism network is a YOLO model based on a coordinate attention mechanism.
In a second aspect, a cross-camera multi-target tracking apparatus is provided, including:
the processing unit is used for correcting pictures shot by the cameras based on a checkerboard correction method and generating a view boundary and a target intersection line for each corrected camera;
the detection unit is used for detecting a single-camera target through a coordinate attention mechanism network to obtain a plurality of detection frames corresponding to the single camera;
the tracking unit is used for tracking the targets of all the detection frames through a preset multi-target tracking algorithm so as to generate an identity ID for the detection frame corresponding to each target;
the judging unit is used for determining a detection frame to be handed over which enters the overlapping area of the current camera from the detection frames corresponding to the adjacent previous camera according to the view boundary and the target handing-over line;
the conversion unit is used for converting the image coordinates of the detection frame to be handed over and the target detection frame corresponding to the current camera into world coordinates;
the calculating unit is used for respectively calculating the centroid position proximity and the detection frame overlap degree between the detection frame to be handed over and each target detection frame according to the world coordinates;
a handover unit for performing target consistency determination and target handover on the detection frame to be handed over and each target detection frame based on the identity ID, the proximity of the centroid position, and the detection frame overlap degree.
In some embodiments, the handover unit is specifically configured to:
performing target consistency judgment on the detection frame to be handed over and each target detection frame according to the centroid position proximity and the detection frame overlapping degree;
taking the target detection frame which has the minimum centroid position proximity and the maximum detection frame overlapping degree with the detection frame to be handed over as a candidate target frame;
and giving the ID of the detection frame to be handed over to the candidate target frame so as to realize the target hand-over across the cameras.
In some embodiments there are multiple detection frames to be handed over, and the computing unit is further configured to:
calculate the Euclidean distance between one detection frame to be handed over and the other detection frames to be handed over according to their world coordinates;
if the Euclidean distance is larger than a distance threshold, cause the computing unit to perform, for that detection frame to be handed over, target consistency determination and target handover with each target detection frame based on the identity ID, the centroid position proximity and the detection frame overlap degree;
and if the Euclidean distance is smaller than or equal to the distance threshold, cause the handover unit to perform target handover between the detection frame to be handed over and the target detection frames through the Hungarian algorithm combined with the centroid position proximity.
In a third aspect, a cross-camera multi-target tracking device is provided, including: the system comprises a memory and a processor, wherein at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to realize the cross-camera multi-target tracking method.
In a fourth aspect, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the foregoing cross-camera multi-target tracking method.
The application provides a cross-camera multi-target tracking method, apparatus, device and readable storage medium, which include: correcting pictures shot by a plurality of cameras based on a checkerboard correction method, and generating a view-field boundary and a target handover line for each corrected camera; detecting single-camera targets through a coordinate attention mechanism network to obtain a plurality of detection frames corresponding to each single camera; tracking all detection frames through a preset multi-target tracking algorithm and generating an identity ID for the detection frame of each target; determining, according to the view-field boundary and the target handover line, the detection frames to be handed over that enter the overlapping area of the current camera from the detection frames of the adjacent previous camera; converting the image coordinates of the detection frames to be handed over and of the target detection frames of the current camera into world coordinates; respectively calculating the centroid position proximity and the detection frame overlap degree between the detection frames to be handed over and each target detection frame according to the world coordinates; and performing target consistency determination and target handover based on the identity ID, the centroid position proximity and the detection frame overlap degree. Through the coordinate attention mechanism, the application improves the detection accuracy for adhering targets. During cross-camera target switching, adjacent cameras are associated through the correspondence between image coordinates and world coordinates, so that the dimensional information of a unified spatial world coordinate system is added in the overlapping area to constrain target matching. Because, under a top-down viewing angle, a target's centroid coordinates do not change abruptly with changes of posture, performing cross-camera target consistency determination and handover in the overlapping view by the centroid position proximity and the detection frame overlap degree of the detection frames to be handed over can effectively improve the accuracy and stability of target tracking, identification and handover, so that target identities can be continuously maintained and targets accurately positioned.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic flowchart of a cross-camera multi-target tracking method according to an embodiment of the present disclosure;
FIG. 2 is a block diagram of a cross-camera multi-target handover algorithm according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of target adhesion at different angles according to an embodiment of the present disclosure;
fig. 4 is a schematic relationship diagram of coordinate systems related to camera calibration according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of converting image coordinates into actual site coordinates (i.e., world coordinates) according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a cross-camera multi-target tracking device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making creative efforts shall fall within the protection scope of the present application.
The embodiment of the application provides a cross-camera multi-target tracking method, a cross-camera multi-target tracking device and a readable storage medium, so that cross-camera multi-target accurate tracking and identification are achieved.
Referring to fig. 1 and 2, an embodiment of the present application provides a cross-camera multi-target tracking method, including the following steps:
step S10: correcting pictures shot by a plurality of cameras based on a checkerboard correction method, and generating a view boundary and a target intersection line for each corrected camera;
Exemplarily, it can be understood that in high-speed skating sports such as short-track speed skating, athletes are similar in appearance and posture, so for such non-rigid targets a target re-identification method based on appearance features has great difficulty extracting features, cannot effectively distinguish athletes with different identities, and easily causes identity ID jumps. In addition, as shown in fig. 3, the distance between athletes (each circle in fig. 3 represents one athlete) is short during high-speed skating, so adhesion occurs, and motion interactions make the skating trajectories overlap and cross; a target re-identification method based on trajectory matching is therefore prone to mismatches or missed matches during cross-camera target handover. This embodiment provides a cross-camera multi-target tracking algorithm based on spatial constraints to improve the accuracy and stability of target handover.
The following embodiments will explain the steps and principles of the cross-camera multi-target tracking method by taking short-track speed skating as an example.
Target handover is a key link in continuous target tracking, and its accuracy determines the cross-camera multi-target tracking effect. Therefore, in this embodiment four high-speed industrial cameras are installed in advance on the roof of the short-track speed skating venue, distributed directly above the ice rink track, with adjacent cameras sharing a certain overlapping field of view to ensure smooth switching of targets during tracking. It should be noted that a target in this embodiment refers to an athlete, and targets are presented in the form of detection frames.
In this embodiment, a checkerboard correction method is used to correct the four monitoring pictures. Specifically, picture distortion caused by camera shooting angles is eliminated through plane perspective conversion: checkerboard calibration images at different angles, postures and positions are collected, the feature points in the images are detected, the distortion coefficients are obtained after the intrinsic and extrinsic parameters of the cameras are computed, and the pictures shot by the four monitoring cameras are then corrected.
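For concreteness, a minimal OpenCV sketch of such a checkerboard calibration and correction is given below; the 9x6 inner-corner layout and the image folder are illustrative assumptions, not values from the disclosure:

```python
import glob
import cv2
import numpy as np

PATTERN = (9, 6)  # assumed inner-corner layout of the checkerboard

# Object points of one board view on the z = 0 plane (unit square size).
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calib/cam1/*.png"):  # illustrative image folder
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# Intrinsic parameters K, distortion coefficients, per-view extrinsics.
_, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

def correct(frame):
    """Remove the lens distortion from one monitoring picture."""
    return cv2.undistort(frame, K, dist)
```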
It should be understood that a view-field boundary is the position of the boundary of one camera's field of view inside the neighboring camera's picture, also referred to as a field line, which is key information in the target handover process. To distinguish the overlapping and non-overlapping areas of the fields of view between adjacent cameras, a view-field boundary needs to be generated for each camera; that is, when multiple cameras exist, view-field boundaries must be generated before the athlete identity association across cameras (which can also be understood as target re-identification), so that the target handover moment can be judged. This embodiment adopts a projection-invariant method to generate the view-field boundary for each camera, to distinguish when a target appears in the views of two cameras. Meanwhile, a pair of target handover lines is defined in the overlapping area, i.e. a line suitable for target handover is selected in the overlapping area, so that when a target touches this line, a matching target should be searched for in the view of the other camera.
Step S20: detecting a single-camera target through a coordinate attention mechanism network to obtain a plurality of detection frames corresponding to the single camera; wherein the coordinate attention mechanism network is a YOLO model based on a coordinate attention mechanism;
Exemplarily, it should be understood that short-track speed skaters may occlude each other due to fast interactions during skating. Conventional background-modeling target detection methods have limitations in handling occlusion, while deep-learning-based target detection achieves higher detection accuracy and speed and can more easily distinguish occluded targets. In addition, the inter-frame displacement of short-track speed skaters is large, so the target detector needs high sensitivity to position information. Therefore, this embodiment uses a CA-YOLO model based on a coordinate attention mechanism as the target detector, so that adhering targets can be identified more accurately.
Specifically, the video stream shot by a single camera is input to the CA-YOLO model, and the backbone feature extraction network of CA-YOLO extracts features from the video stream to output a feature map. A coordinate attention layer is added after the backbone outputs the feature map: position information is embedded into channel attention, and features are aggregated along the two spatial directions separately, capturing long-range dependencies along one direction while preserving precise position information along the other. The coordinate attention mechanism thus splits global pooling into two feature-encoding operations, encoding the feature map into a pair of direction-aware and position-sensitive attention maps, and then establishes associations between feature pixels and positions, which significantly improves the network's segmentation ability at boundaries and details and helps more accurate target localization. Afterwards, the FPN (Feature Pyramid Network) enhanced feature extraction network fuses the preceding feature layers, and finally the CA-YOLO detection head outputs the target category and the detection frame. It should be noted that the detection frame carries, i.e. reflects, the position information of the target.
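As an illustration, a coordinate attention block in the style the description follows (Hou et al.'s coordinate attention) can be sketched in PyTorch as below; the reduction ratio, activation choice and layer names are assumptions, not the patent's exact network:

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Coordinate attention: aggregate features along H and W separately,
    then encode a pair of direction-aware, position-sensitive attention maps."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        # Pool along the two spatial directions separately.
        x_h = x.mean(dim=3, keepdim=True)                      # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (n, c, w, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        return x * a_h * a_w  # position-weighted features
```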
Step S30: target tracking is carried out on all detection frames through a preset multi-target tracking algorithm, and an identity ID is generated for the detection frame corresponding to each target;
Exemplarily, in this embodiment, after the detection frames are obtained by the target detection algorithm, inter-frame targets are associated by a multi-target tracking algorithm and an identity ID is generated for each detection frame, realizing athlete identity association and continuous tracking, i.e. the same target is kept consistent across consecutive associated frames. Because short-track speed skating rules do not restrict athletes to lanes, rapid position changes and frequent alternations of ranking occur during skating, and when adjacent athletes are very close the confidence of target detection drops, degrading tracking performance. Therefore, this embodiment uses the ByteTrack algorithm for multi-target tracking, i.e. low-score detection frames are retained.
Specifically, matching of high-score detection frames with tracks is first realized based on motion similarity. A video clip V, the target detections Det and a Kalman filter KF are taken as the input of the association algorithm Byte in ByteTrack. A detection score threshold $\tau_{high}$, a threshold $\tau_{low}$ and a tracking score threshold $\epsilon$ are set, and all detection frames produced by the target detector are divided into two parts: frames with detection scores above $\tau_{high}$ are classified into $D_{high}$, and frames with scores above $\tau_{low}$ but below $\tau_{high}$ are classified into $D_{low}$. Then the motion similarity is calculated, namely the IoU (Intersection over Union) between a detection frame of the current frame and the bounding box predicted by Kalman filtering; the larger the IoU, the higher the motion similarity. High-score detection frames are then matched with tracks based on motion similarity: $D_{high}$ is associated with all tracks T in a first round, Kalman filtering predicts the position of each target in the next frame, and the Hungarian algorithm performs the matching; each track in the matching result contains a detection frame and the identity ID of the target.
Then the predicted tracks are matched with the detection frames of the current frame using the Hungarian algorithm to associate inter-frame IDs, where a track can be understood as the sequence of target positions $(u, v, r, h)$ at different moments: $u, v$ give the center point of the target rectangle, $r$ is the aspect ratio and $h$ the height. This embodiment matches by motion information: a detection frame is first matched to the most recent tracklet (i.e. a short tracking fragment) and then to missing tracklets. Specifically, a Kalman filter predicts the track of each target from the previous frame; the predicted track state for the next frame is compared with the detected track state $(u, v, r, h)$ of the next frame, and only when the comparison value is below a certain threshold is the detected target judged to correspond to the target of the previous frame (connecting corresponding targets across frames achieves tracking). Motion information is integrated through the Mahalanobis distance between the predicted Kalman state and the current detection; after the distance value between each detected target and each track is obtained, a threshold is applied to screen out the matched pairs.
The Mahalanobis distance is

$$d(i, j) = (d_j - y_i)^\top S_i^{-1} (d_j - y_i)$$

where $d(i, j)$ represents the motion matching degree between the $j$-th detection and the $i$-th track, $S_i$ is the covariance matrix of the observation space at the current moment predicted by the track's Kalman filter, $y_i$ is the predicted observation at the current moment, and $d_j$ is the track state of the $j$-th detected target.
Finally, a second matching is carried out for the unmatched tracks. In the first matching round, detection frames whose motion similarity is below 0.2 are not matched to a track; unmatched detection frames are stored in $D_{remain}$ and unmatched tracks in $T_{remain}$. Then the low-score detection frames in $D_{low}$ are matched a second time against the remaining tracks in $T_{remain}$; tracks that still fail to match are stored in $T_{re\text{-}remain}$, while unmatched low-score detection frames are deleted directly.
The tracks in $T_{re\text{-}remain}$ are regarded as temporarily missing targets and are put into $T_{lost}$. If a track stays in $T_{lost}$ for more than 30 frames it is deleted from $T_{lost}$; otherwise it is preserved there, and if it is matched again later it is likewise removed from $T_{lost}$. For $D_{remain}$, a detection frame whose score is above the tracking score threshold $\epsilon$ and which persists for more than two frames is initialized as a new track. For each frame, only the detection frames and corresponding IDs of the tracks of the current frame are output.
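The two-round Byte-style association described above can be sketched as follows; the thresholds, the IoU-only similarity (Kalman prediction and the lost-track buffer are omitted) and all names are simplifying assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(track_boxes, det_boxes):
    """Pairwise IoU between predicted track boxes and detection boxes,
    all given as (x1, y1, x2, y2)."""
    m = np.zeros((len(track_boxes), len(det_boxes)))
    for i, t in enumerate(track_boxes):
        for j, d in enumerate(det_boxes):
            x1, y1 = max(t[0], d[0]), max(t[1], d[1])
            x2, y2 = min(t[2], d[2]), min(t[3], d[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            union = ((t[2] - t[0]) * (t[3] - t[1])
                     + (d[2] - d[0]) * (d[3] - d[1]) - inter)
            m[i, j] = inter / (union + 1e-9)
    return m

def byte_associate(track_boxes, det_boxes, det_scores,
                   tau_high=0.6, tau_low=0.1, min_iou=0.2):
    """Two-round Byte-style association: high-score detections are matched
    against all tracks first; low-score detections then get a second chance
    on the tracks left over from round one."""
    d_high = [d for d, s in zip(det_boxes, det_scores) if s >= tau_high]
    d_low = [d for d, s in zip(det_boxes, det_scores)
             if tau_low <= s < tau_high]
    matches, t_remain = [], list(range(len(track_boxes)))
    for pool in (d_high, d_low):
        if not pool or not t_remain:
            continue
        ious = iou_matrix([track_boxes[i] for i in t_remain], pool)
        rows, cols = linear_sum_assignment(-ious)  # maximise total IoU
        matched_rows = []
        for r, c in zip(rows, cols):
            if ious[r, c] >= min_iou:              # motion-similarity gate
                matches.append((t_remain[r], pool[c]))
                matched_rows.append(r)
        t_remain = [t for k, t in enumerate(t_remain)
                    if k not in matched_rows]
    # t_remain corresponds to the still-unmatched tracks (lost candidates).
    return matches, t_remain
```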
Step S40: determining a detection frame to be handed over which enters the overlapping area of the current camera from the detection frames corresponding to the adjacent previous camera according to the visual field boundary and the target handing over line;
Exemplarily, during multi-target tracking this embodiment judges from the view-field boundary and the target handover line whether a target has reached the overlapping field of view of the next adjacent camera. If not, multi-target tracking under a single camera continues; when the overlapping field of view is reached, cross-camera target handover is performed. For example, when detection frame 1 of target A in the previous adjacent camera Cam1 touches the view-field boundary line of the current camera Cam2, target A has reached the overlapping field of view of the current camera; when detection frame 1 touches the target handover line of Cam2, target A must undergo cross-camera handover, i.e. the handover work is carried out while the target passes through the overlapping area of the adjacent cameras. Detection frame 1 is therefore regarded as a detection frame to be handed over.
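As a small illustration of the handover trigger, assuming for the sketch that the handover line is expressed in image space as a line $ax + by + c = 0$, a sign change of the centroid's signed distance between consecutive frames signals that the target has touched the line:

```python
import numpy as np

def side_of_line(point, line):
    """Signed distance of an image point to the line ax + by + c = 0."""
    a, b, c = line
    x, y = point
    return (a * x + b * y + c) / np.hypot(a, b)

def handover_triggered(prev_centroid, curr_centroid, handover_line):
    """A target becomes a detection frame to be handed over when its
    centroid crosses the handover line between consecutive frames."""
    return (side_of_line(prev_centroid, handover_line)
            * side_of_line(curr_centroid, handover_line) <= 0)
```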
Step S50: converting image coordinates of a detection frame to be handed over and a target detection frame corresponding to a current camera into world coordinates;
exemplarily, it should be understood that in a multi-camera tracking system, a plurality of cameras need to be globally calibrated to unify all the cameras, and image coordinates of each video frame are mapped into a world coordinate system to establish a correspondence between adjacent cameras. The relationship among the common coordinate systems related to camera calibration, such as the world coordinate system, the camera coordinate system, the image physical coordinate system, and the image pixel coordinate system, is shown in fig. 4. The imaging process of the camera is essentially the conversion of a coordinate system, points in space are projected to an image plane after being converted from a world coordinate system to a camera coordinate system, and finally data of an image physical coordinate system on a virtual image plane is converted to an image pixel coordinate system. It is understood that the world coordinate system in the present embodiment is used to describe the actual field coordinates of the target (i.e. the moving person) in the ice field, and the image coordinate system is used to describe the projection relation of the target from the camera coordinate system to the image coordinate system during the imaging process.
Specifically, for any point P in space with camera coordinates $(X_c, Y_c, Z_c)$ and world coordinates $(X_w, Y_w, Z_w)$, the conversion from the world coordinate system to the camera coordinate system is a rotation plus a translation, as shown in formula (1), where the rotation matrix $R$ is a 3×3 orthonormal matrix and $t$ is a three-dimensional translation vector:

$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = R \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + t \tag{1}$$
Let the image coordinates of point P be $(x, y)$. The transformation from the camera coordinate system to the image coordinate system is a three-dimensional-to-two-dimensional perspective projection: the shape is projected onto the image plane by central projection, forming a single-view projection close to the visual effect. With focal length $f$, the transformation is shown in formula (2):

$$x = f\,\frac{X_c}{Z_c}, \qquad y = f\,\frac{Y_c}{Z_c} \tag{2}$$
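A direct transcription of formulas (1) and (2), with the focal length made explicit and the principal-point offset omitted for brevity:

```python
import numpy as np

def world_to_camera(p_w, R, t):
    """Formula (1): rigid transform from world to camera coordinates
    (R is the 3x3 rotation matrix, t the translation vector)."""
    return R @ p_w + t

def camera_to_image(p_c, f):
    """Formula (2): pinhole perspective projection onto the image plane."""
    Xc, Yc, Zc = p_c
    return np.array([f * Xc / Zc, f * Yc / Zc])
```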
It should be noted that the speed-skating scene in this embodiment is viewed from above, with the cameras about 20 meters above the ice surface, so the conversion between field coordinates (i.e. world coordinates) and image coordinates can be regarded as a two-dimensional translational affine transformation. As shown in fig. 5, athlete (i.e. target) positions on the rink are described with the middle of the rink as the origin, and a field coordinate system $(X_w, Y_w)$ is established as a two-dimensional Cartesian coordinate system. The field coordinates of the four cameras' center points are obtained from geodetic survey data; the image coordinates $(x^{c_i}, y^{c_i})$ corresponding to each camera are then obtained from the field coordinates, and the image coordinates are mapped into the unified world coordinate system so as to establish a coordinate association among the four cameras and obtain the world coordinates of every detection frame. Fig. 5 illustrates how the image coordinates of target centroids in the field area and in the sub-monitoring zones of the four monitoring cameras (the Cam1-Cam4 picture parts shown in fig. 5) are mapped to field coordinates.
In this embodiment, when a target (i.e. a detection frame to be handed over) enters the handover stage, targets need to be linked across cameras: the image coordinates of the tracked targets in the picture areas of the two adjacent handover cameras are converted into unified field coordinates, and by analyzing the field coordinates, kinematic parameters of the skaters such as trajectory, speed and sliding distance can be acquired. The image coordinates $(x^{c_i}, y^{c_i})$ of a sub-monitoring area are converted to the unified field coordinates $(X_w, Y_w)$ by:

$$X_w = X_w^{c_i} + \mu\lambda\,(x^{c_i} - x_0^{c_i}) + R \tag{3}$$

$$Y_w = Y_w^{c_i} + \mu\lambda\,(y^{c_i} - y_0^{c_i}) + R \tag{4}$$

where $(X_w^{c_i}, Y_w^{c_i})$ are the coordinate values of camera $i$'s center point in the actual field coordinates obtained by the mapping, $(x_0^{c_i}, y_0^{c_i})$ are the coordinate values of the center point of the single image frame, $\mu$ is the unit pixel distance, $\lambda$ is the mapping coefficient, and $R$ is the projection deviation caused by image-point displacement under vertical shooting.
In the embodiment, when the targets are matched in the handover area, the correlation result and the error value based on the field coordinate matching are fed back to the image coordinates of the targets, and at the moment, the targets which are successfully handed over and the targets which are not successfully handed over can be determined according to the matching result; it follows that the above feedback can be obtained by converting the site coordinates into image coordinates. Wherein, the field coordinate (X) w ,Y w ) Obtaining image coordinates corresponding to the sub-monitoring area
Figure BDA0003962214620000129
The formula is as follows:
Figure BDA00039622146200001210
Figure BDA00039622146200001211
in the formula (I), the compound is shown in the specification,
Figure BDA00039622146200001212
and
Figure BDA00039622146200001213
a central point coordinate value representing the actual site coordinate of each camera,
Figure BDA00039622146200001214
and
Figure BDA00039622146200001215
corresponding to the coordinate value of the center point of the single image frame, mu is the unit pixel distance, lambda is the mapping coefficient, and c is the correction parameter.
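Under the same reconstruction of formulas (3)-(6), the two mappings can be sketched as follows; the per-camera parameter dictionary and its keys are illustrative assumptions:

```python
def image_to_field(x, y, cam):
    """Formulas (3)-(4): map sub-monitoring-area image coordinates to the
    unified field coordinates via the per-camera affine relationship."""
    Xw = cam["Xw0"] + cam["mu"] * cam["lam"] * (x - cam["x0"]) + cam["R"]
    Yw = cam["Yw0"] + cam["mu"] * cam["lam"] * (y - cam["y0"]) + cam["R"]
    return Xw, Yw

def field_to_image(Xw, Yw, cam):
    """Formulas (5)-(6): inverse mapping with correction parameter c."""
    x = cam["x0"] + (Xw - cam["Xw0"]) / (cam["mu"] * cam["lam"]) + cam["c"]
    y = cam["y0"] + (Yw - cam["Yw0"]) / (cam["mu"] * cam["lam"]) + cam["c"]
    return x, y
```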
Step S60: respectively calculating the centroid position proximity and the detection frame overlap degree between the detection frame to be handed over and each target detection frame according to the world coordinates;
Exemplarily, it can be understood that in a conventional (non-relay) short-track speed skating race the number of skaters is fixed and no new targets appear. Thus, the evaluation criteria for cross-camera target handover in this embodiment consist of the centroid position proximity and the detection frame overlap degree. The centroid position proximity is the distance between the world coordinates of a candidate target's center point in the current camera and the world coordinates of the target in the previous camera's field of view; that is, the centroid position proximity between the detection frame to be handed over and each target detection frame is obtained by calculating the distance between the world coordinates of the center point of the detection frame to be handed over from the adjacent previous camera and the world coordinates of the center point of each target detection frame in the current camera.
The detection frame overlap degree is the area overlap between a target frame to be matched in the current camera and a target frame with a determined identity ID in the previous camera; that is, the detection frame overlap degree between the detection frame to be handed over and each target detection frame in the current camera is obtained by calculating the area overlap between the detection frame to be handed over from the adjacent previous camera and the target detection frame in the current camera. Specifically, the IoU between the coordinate frames $B^{c_1}$ and $B^{c_2}$ from the two views (i.e. the detection frame overlap degree) can be calculated by formula (7):

$$\mathrm{IoU} = \frac{\left| B^{c_1} \cap B^{c_2} \right|}{\left| B^{c_1} \cup B^{c_2} \right|} \tag{7}$$

The larger the IoU value, the higher the possibility that the two detection frames are the same target.
Step S70: and performing target consistency judgment and target handover on the detection frame to be handed over and each target detection frame based on the identity ID, the proximity of the centroid position and the detection frame overlapping degree.
Exemplarily, in this embodiment the centroid position proximity is combined with the detection frame overlap degree to determine target consistency: the target detection frame with the smaller centroid position proximity and the larger detection frame overlap degree is more likely to be the target tracked in the previous view (i.e. the target to be handed over), i.e. the probability of target consistency is higher. Giving the identity ID of the target to be handed over to the target detection frame with the highest consistency completes the target handover.
Thus, this embodiment improves the detection accuracy for adhering targets through the coordinate attention mechanism. During cross-camera target switching, adjacent cameras are associated through the correspondence between image coordinates and world coordinates, so that the dimensional information of the unified spatial world coordinate system is added in the overlapping area to constrain target matching. Because, under a top-down viewing angle, the centroid coordinates of a target do not change abruptly with changes of posture, performing cross-camera target consistency determination and handover in the overlapping view by the centroid position proximity and detection frame overlap degree of the detection frames to be handed over can effectively improve the accuracy and stability of target tracking, identification and handover, so that target identities can be continuously maintained and targets accurately positioned.
Further, the performing, based on the identity ID, the proximity of the centroid position, and the overlap degree of the detection frames, target consistency determination and target handover on the detection frame to be handed over and each target detection frame includes:
performing target consistency judgment on the detection frame to be handed over and each target detection frame according to the centroid position proximity and the detection frame overlapping degree;
taking the target detection frame that has the minimum centroid position proximity and the maximum detection frame overlap degree with the detection frame to be handed over as the candidate target frame, where the minimum centroid position proximity is determined as follows:
when the world coordinates of a target detection frame and the world coordinates of the detection frame to be handed over satisfy the following first condition, the target detection frame and the detection frame to be handed over are judged to have the minimum centroid position proximity;
the first condition is:

$$\sqrt{(X_w' - \hat{X}_w)^2 + (Y_w' - \hat{Y}_w)^2} = \min_{1 \le i \le n} \sqrt{(X_w^{(i)} - \hat{X}_w)^2 + (Y_w^{(i)} - \hat{Y}_w)^2}$$

where $(X_w', Y_w')$ represents the world coordinates of the target detection frame and $(\hat{X}_w, \hat{Y}_w)$ the world coordinates of the detection frame to be handed over;
and giving the ID of the detection frame to be handed over to the candidate target frame so as to realize the target hand-over across the cameras.
For example, in this embodiment, when target consistency between the detection frame to be handed over and each target detection frame is determined from the centroid position proximity and the detection frame overlap degree, the target detection frame judged to have the minimum centroid position proximity and the maximum detection frame overlap degree with the detection frame to be handed over is considered the same target, and is therefore taken as the candidate target frame. The minimum centroid position proximity is judged as follows: the positions $(x^{(i)}, y^{(i)})$, $i = 1, \dots, n$, of the center points of the $n$ target detection frames in one frame of image are obtained in the image coordinate system from the output of target detection, and their positions $(X_w^{(i)}, Y_w^{(i)})$ in the world coordinate system are further calculated. Let the position of the tracked detection frame to be handed over in the world coordinate system be $(\hat{X}_w, \hat{Y}_w)$. When the world coordinates $(X_w', Y_w')$ of a target detection frame in the current view satisfy

$$\sqrt{(X_w' - \hat{X}_w)^2 + (Y_w' - \hat{Y}_w)^2} = \min_{1 \le i \le n} \sqrt{(X_w^{(i)} - \hat{X}_w)^2 + (Y_w^{(i)} - \hat{Y}_w)^2}$$

the detection frame to be handed over is considered closest in physical position to that target detection frame, and the minimum centroid position proximity between them is established.
Finally, the identity ID of the detection frame to be handed over is given to the candidate target frame to realize the cross-camera target handover; the same target is then continuously associated between frames with a consistent identity so that it can be continuously tracked.
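A minimal sketch of assigning the ID by the first condition alone (the full method additionally checks the detection frame overlap degree); the candidate dictionary layout is an illustrative assumption:

```python
import numpy as np

def hand_over_id(pending_xy, pending_id, candidates):
    """Give the to-be-handed-over frame's identity ID to the current-view
    target detection frame with minimum centroid position proximity."""
    dists = [np.hypot(c["world_xy"][0] - pending_xy[0],
                      c["world_xy"][1] - pending_xy[1])
             for c in candidates]
    best = candidates[int(np.argmin(dists))]
    best["id"] = pending_id  # the cross-camera handover is completed here
    return best
```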
Further, when there are multiple detection frames to be handed over, after the step of calculating the centroid position proximity and the detection frame overlap degree between the detection frames to be handed over and each target detection frame according to the world coordinates, the method further includes:
calculating the Euclidean distance between one detection frame to be handed over and the other detection frames to be handed over according to their world coordinates;
if the Euclidean distance is larger than a distance threshold, executing, for that detection frame to be handed over, the step of performing target consistency determination and target handover with each target detection frame based on the identity ID, the centroid position proximity and the detection frame overlap degree;
and if the Euclidean distance is smaller than or equal to the distance threshold, performing target handover between the detection frame to be handed over and the target detection frames through the Hungarian algorithm combined with the centroid position proximity.
Exemplarily, when several athletes reach the target handover line at the same time (i.e. multiple detection frames to be handed over exist simultaneously), the multi-target data fusion problem must be handled. A distance threshold $\tau$ on the distance between target center coordinates is preset. When the Euclidean distance between a detection frame to be handed over and the other detection frames to be handed over in a given view is greater than $\tau$, target matching is performed using the centroid position proximity and the detection frame overlap degree; when it is smaller than or equal to $\tau$, the Hungarian algorithm is combined with the centroid position proximity to perform nearest-neighbour optimal matching, realizing the target handover while reducing mismatches and missed matches and improving handover accuracy.
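The branching rule can be sketched as follows, with tau a scene-dependent assumption:

```python
import numpy as np

def handover_strategy(pending_xy, other_pending_xys, tau):
    """Pick the handover path for one detection frame to be handed over:
    'direct' consistency matching when it is farther than tau from every
    other pending frame, otherwise Hungarian nearest-neighbour matching."""
    dists = [np.hypot(pending_xy[0] - x, pending_xy[1] - y)
             for (x, y) in other_pending_xys]
    return "direct" if (not dists or min(dists) > tau) else "hungarian"
```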
The Hungarian algorithm solves the maximum matching problem of a bipartite graph, where a maximum matching is a subset of edges, no two sharing a node, of maximum cardinality. Let $U$ be the set of all target detection frames of the current view and $V$ the set of all target detection frames of the previous view; the Hungarian algorithm matches the detection frames of the two sets pairwise as completely as possible.
In a weighted bipartite graph the edges carry different weights, so besides matching the nodes one also seeks the maximum or minimum total weight, i.e. the optimal matching problem. Specifically, assume the set of all target detection frames of the current view is $U = \{u_1, u_2, u_3, u_4\}$ and that of the previous view is $V = \{v_1, v_2, v_3, v_4\}$, forming a bipartite graph $G = (U, V, S)$: edges exist only between $U$ and $V$, not inside $U$ or inside $V$; $S$ denotes the set of matched edges, no two edges in $S$ share a node, and $f(S)$ denotes the sum of the weights of the matched edges. The goal of maximum-weight matching is then: find a matching edge set $S$ that maximizes $f(S)$.
It will be appreciated that matching by the Hungarian algorithm actually defines a bijection from $U$ to $V$, finding the paired vertex for $u_1, \dots, u_4$ in turn. When a vertex in $U$ can match a vertex in $V$, the degree of match is expressed by the edge weight, i.e. the edge $u_i$-$v_j$ has weight $W_{ij}$. The essence of the Hungarian algorithm here is minimum-weight matching, i.e. finding the matching edge set $S$ that minimizes $f(S)$; to solve a maximum-weight matching with it, the edge weights are negated, so the final minimum-weight result is in fact the maximum-weight matching.
The minimum matching of the Hungarian algorithm is realized by operating on the adjacency matrix, as follows: (1) subtract the minimum value of each row from that row (the values being the negated weights), so that every row contains a 0; (2) subtract the minimum value of each column from that column so that every column contains a 0, giving a new adjacency matrix; (3) loop over the adjacency matrix: (3.1) cover all 0 elements with as few lines as possible; (3.2) decide whether to terminate: if the number of lines L equals the number of nodes n, terminate; if L is less than n (say L = 3 and n = 4), continue; (3.3) transform the 0 elements to obtain more of them: let k be the minimum of the elements not covered by any line, subtract k from all uncovered elements, add k to the elements at line crossings, and return to step (3.1). Once step (3.1) yields L = 4, so that L = n, the loop terminates and the current adjacency matrix gives the minimum matching result, achieving the nearest-neighbour optimal matching.
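In practice the row/column reduction loop above is available off the shelf; a sketch of the same minimum-weight assignment using SciPy:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_handover(W):
    """Minimum-weight bipartite matching on a cost matrix W of centroid
    position proximities; returns matched (i, j) index pairs."""
    rows, cols = linear_sum_assignment(np.asarray(W, dtype=float))
    return list(zip(rows.tolist(), cols.tolist()))
```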
It will be appreciated that the weights $W_{ij}$ of the edges between the bipartite graph's nodes are obtained by calculating the centroid position proximity: the smaller the weight, the greater the probability that the two nodes correspond correctly. Assume there are 4 detection frames to be handed over in the previous camera, whose center points have world coordinates $a_1 = (90, 82)$, $a_2 = (120, 110)$, $a_3 = (98, 122)$ and $a_4 = (136, 150)$, and 4 target detection frames to be matched in the current camera, whose center points have world coordinates $b_1 = (138, 153)$, $b_2 = (95, 119)$, $b_3 = (91, 84)$ and $b_4 = (122, 110)$. Let $W_{ij}$ be the Euclidean distance between the coordinates of the $i$-th detection frame to be handed over and the $j$-th current target detection frame, where $x_{a_i}$ and $y_{a_i}$ are the abscissa and ordinate of the frame to be handed over and $x_{b_j}$ and $y_{b_j}$ those of the current target detection frame:

$$W_{ij} = \sqrt{(x_{a_i} - x_{b_j})^2 + (y_{a_i} - y_{b_j})^2} \tag{8}$$

Calculating the distance matrix A:

$$A = \begin{bmatrix} 85.70 & 37.34 & 2.24 & 42.52 \\ 46.62 & 26.57 & 38.95 & 2.00 \\ 50.61 & 4.24 & 38.64 & 26.83 \\ 3.61 & 51.40 & 79.88 & 42.38 \end{bmatrix}$$

Defining decision variables $z_{ij} \in \{0, 1\}$, with $z_{ij} = 1$ if the $i$-th detection frame to be handed over is matched to the $j$-th target detection frame and $z_{ij} = 0$ otherwise, subject to $\sum_j z_{ij} = 1$ for every $i$ and $\sum_i z_{ij} = 1$ for every $j$. Through the adjacency-matrix operations above, the optimal matching minimizing the total weight $\sum_{i,j} W_{ij} z_{ij}$ is obtained, i.e. the optimal matching combining the Hungarian algorithm with the centroid position proximity, with the result: $a_1 \to b_3$, $a_2 \to b_4$, $a_3 \to b_2$, $a_4 \to b_1$.
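The worked example can be checked with the same routine; the matrix W below reproduces A from the given coordinates:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

a = np.array([(90, 82), (120, 110), (98, 122), (136, 150)], dtype=float)
b = np.array([(138, 153), (95, 119), (91, 84), (122, 110)], dtype=float)

# W[i, j] = Euclidean distance between a_i and b_j (the matrix A above).
W = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

rows, cols = linear_sum_assignment(W)
print([f"a{i + 1}->b{j + 1}" for i, j in zip(rows, cols)])
# ['a1->b3', 'a2->b4', 'a3->b2', 'a4->b1'] -- the matching result above
```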
In summary, the cross-camera multi-target tracking algorithm provided by this embodiment consists of three parts: target detection, multi-target tracking under a single camera, and cross-camera target handover. The cross-camera pictures are corrected and the view-field boundaries and target handover lines generated; a deep-learning target detector outputs detection frame positions, multi-target tracking is performed under each single camera, and an identity ID is assigned to and kept for every detected target. When a target passes through the overlapping area of adjacent cameras, the handover work is performed: by establishing the correspondence between camera image coordinates and actual field coordinates, target consistency is judged under the spatial-dimension constraint in the overlapping field of view from the centroid position proximity and the detection frame overlap degree of the corresponding frames, which improves the stability of target handover; once handover completes, relay tracking continues in the next camera. Whether a target has reached the overlapping field of view of the next adjacent camera is judged from the view-field boundary: if not, multi-target tracking continues under the single camera; if so, cross-camera target handover is performed. These steps repeat until all targets cross the finish line, at which point the tracking process ends. The method therefore has excellent tracking performance for targets moving at high speed, and can continuously maintain identities and accurately locate targets.
The following embodiments validate the cross-camera multi-target tracking algorithm.
In this embodiment, 34 complete short-track speed skating race videos were collected, each race comprising four time-synchronized high-speed industrial camera views, for a total of 136 monitoring video segments, each about 3 minutes long, 860x720 pixels in size, at 60 frames per second. 4000 images containing multiple athletes were extracted from the videos to form the Skater training data set; the athletes were manually annotated to obtain xml annotation files containing the target coordinate values and widths and heights, and the data were divided into training and test sets at a ratio of 9:1. Each video contains 4-10 athletes, allowing cross-camera multi-target tracking tests.
When training the CA-YOLO model on the Skater data set, the input image size is 788x530, the maximum number of epochs is 200, and the batch size and initial learning rate are set to 4 and 0.0001 respectively; Mosaic and MixUp data augmentation is applied for the first 170 epochs, and the data set is further expanded by rotation, translation and the like.
Prediction uses the trained model with the NMS (Non-Maximum Suppression) threshold set to 0.3 and the confidence threshold to 0.65; in CA-YOLO's detection results the target category and confidence score are displayed above each detection frame. This embodiment evaluates the detection performance of different target detection models by AP (Average Precision), with the results shown in table 1. As can be seen from table 1, compared with mainstream models such as the YOLO series, the CA-YOLO with coordinate attention proposed in this embodiment achieves 91.1% AP at an IoU threshold of 0.5 and an average AP of 69.2% over IoU thresholds of 0.5 to 0.95, showing stable detection performance, higher sensitivity to target position differences, and better suitability for detecting adhering targets.
TABLE 1 target detection accuracy comparison
[Table 1 is rendered as an image in the original publication; its numerical values are not recoverable from the text.]
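The detection post-processing described above (confidence threshold 0.65, then NMS at 0.3) can be sketched as follows; the function name is hypothetical, and the detector's raw outputs are assumed to be axis-aligned boxes with per-box scores.

```python
import torch
from torchvision.ops import nms

def filter_detections(boxes: torch.Tensor, scores: torch.Tensor,
                      conf_thresh: float = 0.65, nms_thresh: float = 0.3):
    """Apply the confidence threshold, then NMS, as configured in this embodiment.

    boxes:  (N, 4) tensor in (x1, y1, x2, y2) format
    scores: (N,) tensor of confidence scores
    """
    keep = scores >= conf_thresh          # confidence threshold 0.65
    boxes, scores = boxes[keep], scores[keep]
    idx = nms(boxes, scores, nms_thresh)  # NMS IoU threshold 0.3
    return boxes[idx], scores[idx]
```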
It can be understood that, in multi-target multi-camera tracking, the evaluation indexes related to the target identity ID are the most important. In this embodiment, the cross-camera target handover algorithm proposed in this application is evaluated with the IDF1, IDP, IDR and IDsw indexes. IDF1 (Identification F1) is the ratio of correctly identified detections to the average of the number of ground-truth detections and the number of computed detections, and is the primary index for evaluating tracker quality; IDP (Identification Precision) is the precision of ID identification over the detection frames; IDR (Identification Recall) is the recall rate of ID identification over the detection frames; IDsw (ID Switches) is the number of erroneous target ID jumps, i.e. the number of times a target ID is switched incorrectly during tracking, and its optimal value is 0.
The calculation formulas of IDF1, IDP and IDR are shown as formulas (9) to (11) in sequence:
$$\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}} \qquad (9)$$

$$\mathrm{IDP} = \frac{\mathrm{IDTP}}{\mathrm{IDTP} + \mathrm{IDFP}} \qquad (10)$$

$$\mathrm{IDR} = \frac{\mathrm{IDTP}}{\mathrm{IDTP} + \mathrm{IDFN}} \qquad (11)$$
In the formulas, IDTP denotes the number of IDs correctly assigned to detected targets in a video (true positive ID assignments), IDFP denotes the number of incorrectly assigned IDs (false positives), IDFN denotes the number of missed ID assignments (false negatives), and IDTN denotes the number of correctly rejected assignments (true negatives). The optimal value of the IDF1, IDP and IDR indexes is 100%.
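A small helper, sketched below, evaluates formulas (9) to (11) from the identity-level counts; the example counts are hypothetical.

```python
def id_metrics(idtp: int, idfp: int, idfn: int):
    """Compute IDF1, IDP and IDR from identity-level counts (formulas 9-11)."""
    idp = idtp / (idtp + idfp)
    idr = idtp / (idtp + idfn)
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn)
    return idf1, idp, idr

# Worked example with hypothetical counts:
print(id_metrics(idtp=830, idfp=90, idfn=75))
# -> approximately (0.910, 0.902, 0.917)
```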
In this embodiment, the experiment compares the proposed method with the CFT method based on feature matching, the MTMC_ReID and TRACTA methods based on trajectory matching, and the HMT method based on perspective transformation and motion features; the evaluation results of the target handover methods are shown in Table 2:
TABLE 2 Experimental comparison of target handover algorithms
[Table 2 is rendered as an image in the original publication; its numerical values are not recoverable from the text.]
As can be seen from Table 2, compared with the other existing handover methods, the cross-camera target handover method based on spatial constraint provided by this application shows advantages on the IDF1, IDP, IDR and IDsw indexes. It maps image coordinates accurately to unified field coordinates, so target re-identification can be performed accurately, and it shows better robustness for continuous tracking with fixed multi-view cameras whose view fields overlap.
Therefore, a cross-camera multi-target tracking algorithm based on spatial constraints is proposed for a fixed multi-camera system with overlapping view fields, addressing the difficulties of multi-target multi-view tracking in which targets adhere to one another, have similar appearance, and are hard to track continuously and accurately. A CA-YOLO detector with an added coordinate attention mechanism improves the accuracy of detecting adhesion targets; during target switching, prior knowledge based on spatial constraints is used to hand athletes over according to centroid position proximity and detection frame overlap degree, and the Hungarian algorithm is adopted to achieve maximum matching. Extensive experimental results on a target tracking benchmark and the Skater data set show that the AP of the CA-YOLO target detector reaches 91.1% and the IDF1 during target tracking reaches 83.4%, enabling continuous and stable tracking of fast-moving video sequences with fixed view fields.
The embodiment of the present application further provides a cross-camera multi-target tracking device, including:
the processing unit is used for correcting pictures shot by the cameras based on a checkerboard correction method and generating a view boundary and a target intersection line for each corrected camera;
the detection unit is used for detecting a single-camera target through a coordinate attention mechanism network to obtain a plurality of detection frames corresponding to the single camera;
the tracking unit is used for tracking the targets of all the detection frames through a preset multi-target tracking algorithm so as to generate an identity ID for the detection frame corresponding to each target;
the judging unit is used for determining a detection frame to be handed over entering the overlapping area of the current camera from the detection frames corresponding to the adjacent previous camera according to the visual field boundary and the target handing over line;
the conversion unit is used for converting the image coordinates of the detection frame to be handed over and the target detection frame corresponding to the current camera into world coordinates;
the calculating unit is used for respectively calculating the centroid position proximity and the detection frame overlapping degree between the detection frame to be jointed and each target detection frame according to the world coordinates;
a handover unit for performing target consistency determination and target handover on the detection frame to be handed over and each target detection frame based on the identity ID, the proximity of the centroid position and the detection frame overlap degree.
Further, the handover unit is specifically configured to:
performing target consistency judgment on the detection frame to be handed over and each target detection frame according to the proximity of the centroid position and the overlapping degree of the detection frames;
taking the target detection frame which has the minimum centroid position proximity and the maximum detection frame overlapping degree with the detection frame to be handed over as a candidate target frame;
and giving the ID of the detection frame to be handed over to the candidate target frame so as to realize the target hand-over across the cameras.
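A minimal sketch of this candidate selection follows, assuming the boxes are given in world coordinates in (x1, y1, x2, y2) form. The embodiment states only that the candidate has the minimum centroid distance and the maximum detection frame overlap; the lexicographic tie-breaking (distance first, then overlap) and the helper names are assumptions.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) world coordinates."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def select_candidate(handover_box, target_boxes):
    """Pick the target frame closest in centroid and largest in overlap."""
    hc = np.array([(handover_box[0] + handover_box[2]) / 2,
                   (handover_box[1] + handover_box[3]) / 2])
    best, best_key = None, None
    for i, tb in enumerate(target_boxes):
        tc = np.array([(tb[0] + tb[2]) / 2, (tb[1] + tb[3]) / 2])
        # Smaller centroid distance wins; larger IoU breaks ties.
        key = (np.linalg.norm(hc - tc), -iou(handover_box, tb))
        if best_key is None or key < best_key:
            best, best_key = i, key
    return best  # index of the candidate target frame
```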
Further, the minimum centroid position proximity is determined as follows:
when the world coordinates of a target detection frame and those of a detection frame to be handed over satisfy the following first condition, the target detection frame and the detection frame to be handed over are judged to have the minimum centroid position proximity;
the first condition is:
$$\min \sqrt{\left(X'_w - \hat{X}_w\right)^2 + \left(Y'_w - \hat{Y}_w\right)^2}$$

wherein $(X'_w, Y'_w)$ denotes the world coordinates of the target detection frame and $(\hat{X}_w, \hat{Y}_w)$ denotes the world coordinates of the detection frame to be handed over; that is, the Euclidean distance between the two centroids in world coordinates is the minimum over all target detection frames.
Further, the number of the detection frames to be handed over is multiple, and the computing unit is further configured to:
calculating the Euclidean distance between one detection frame to be handed over and other detection frames to be handed over according to the world coordinates of the detection frames to be handed over;
if the Euclidean distance is detected to be larger than a distance threshold value, enabling the computing unit to execute the steps of carrying out target consistency judgment and target handover on the detection frames to be handed over and each target detection frame based on the identity ID, the proximity of the centroid position and the detection frame overlapping degree on the basis of one detection frame to be handed over;
and if the Euclidean distance is detected to be smaller than or equal to the distance threshold value, enabling the handover unit to be further used for performing target handover on the detection frame to be handed over and the target detection frame through the Hungarian algorithm and the proximity of the centroid position.
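The Hungarian matching step for mutually close detection frames can be sketched with SciPy's linear sum assignment, using centroid distance in world coordinates as the cost; the function name and the use of raw distance (rather than a transformed proximity score) as the cost matrix are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_handover(handover_centroids: np.ndarray,
                       target_centroids: np.ndarray):
    """Match handover boxes to target boxes by minimising the total
    centroid distance (Hungarian algorithm / linear sum assignment).

    handover_centroids: (M, 2) world coordinates
    target_centroids:   (N, 2) world coordinates
    """
    cost = np.linalg.norm(
        handover_centroids[:, None, :] - target_centroids[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))  # (handover index, target index) pairs
```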
Further, the coordinate attention mechanism network is a YOLO model based on a coordinate attention mechanism.
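The patent does not spell out the layer structure of this network; as one plausible form, the following PyTorch sketch follows the published coordinate attention design (Hou et al., 2021), which factorizes global pooling into two direction-aware pooled features along height and width. The module name and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Coordinate attention block: pools along H and W separately so the
    attention map retains positional information in each direction."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        # Direction-aware pooling: (N, C, H, 1) and (N, C, W, 1).
        x_h = x.mean(dim=3, keepdim=True)                       # pool over W
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # pool over H
        # Shared 1x1 transform over the concatenated strips.
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                        # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))    # (N, C, 1, W)
        return x * a_h * a_w  # reweight features by both attention maps
```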
It should be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus and the units described above may refer to the corresponding processes in the foregoing cross-camera multi-target tracking method embodiment, and are not described herein again.
The apparatus provided by the above embodiment may be implemented in a form of a computer program, and the computer program may be run on a cross-camera multi-target tracking device as shown in fig. 6.
The embodiment of the present application further provides a cross-camera multi-target tracking device, including: the system comprises a memory, a processor and a network interface which are connected through a system bus, wherein at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor so as to realize all steps or partial steps of the cross-camera multi-target tracking method.
The network interface is used for performing network communication, such as sending assigned tasks. It will be appreciated by those skilled in the art that the configuration shown in fig. 6 is a block diagram of only a portion of the configuration associated with the present application, and is not intended to limit the computing device to which the present application may be applied, and that a particular computing device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
The processor may be a CPU or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor; the processor is the control center of the computer device, connecting the various parts of the whole computer device through various interfaces and lines.
The memory may be used to store computer programs and/or modules, and the processor implements the various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and by invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the application programs required by at least one function (such as a video playing function or an image playing function); the data storage area may store data created during use of the device (such as video data and image data). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, all steps or part of steps of the cross-camera multi-target tracking method are realized.
The embodiments of the present application may implement all or part of the foregoing processes, which may also be completed by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, it implements the steps of the foregoing methods. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, server, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that comprises the element.
The previous description is only an example of the present application, and is provided to enable any person skilled in the art to understand or implement the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A cross-camera multi-target tracking method is characterized by comprising the following steps:
correcting pictures shot by a plurality of cameras based on a checkerboard correction method, and generating a view boundary and a target intersection line for each corrected camera;
detecting a single-camera target through a coordinate attention mechanism network to obtain a plurality of detection frames corresponding to the single camera;
target tracking is carried out on all detection frames through a preset multi-target tracking algorithm, and identity IDs are generated for the detection frames corresponding to all targets;
determining a detection frame to be handed over which enters the overlapping area of the current camera from the detection frames corresponding to the adjacent previous camera according to the visual field boundary and the target handing over line;
converting image coordinates of a detection frame to be handed over and a target detection frame corresponding to a current camera into world coordinates;
respectively calculating the centroid position proximity and the detection frame overlapping degree between the detection frame to be jointed and each target detection frame according to the world coordinates;
and performing target consistency judgment and target handover on the detection frame to be handed over and each target detection frame based on the identity ID, the proximity of the centroid position and the detection frame overlapping degree.
2. The cross-camera multi-target tracking method according to claim 1, wherein the performing target consistency determination and target handover on the detection frame to be handed over and each target detection frame based on the identity ID, the proximity of the centroid position, and the detection frame overlap degree comprises:
performing target consistency judgment on the detection frame to be handed over and each target detection frame according to the centroid position proximity and the detection frame overlapping degree;
taking the target detection frame which has the minimum centroid position proximity and the maximum detection frame overlapping degree with the detection frame to be handed over as a candidate target frame;
and giving the ID of the detection frame to be handed over to the candidate target frame so as to realize the target hand-over across the cameras.
3. The cross-camera multi-target tracking method according to claim 2, wherein the determination method of the proximity of the minimum centroid position is:
when the world coordinate of a target detection frame and the world coordinate of a detection frame to be handed over meet the following first condition, judging that the target detection frame and the detection frame to be handed over have the minimum centroid position proximity;
the first condition is:
$$\min \sqrt{\left(X'_w - \hat{X}_w\right)^2 + \left(Y'_w - \hat{Y}_w\right)^2}$$

wherein $(X'_w, Y'_w)$ denotes the world coordinates of the target detection frame and $(\hat{X}_w, \hat{Y}_w)$ denotes the world coordinates of the detection frame to be handed over.
4. The cross-camera multi-target tracking method according to claim 1, wherein the number of the detection frames to be jointed is multiple, and after the step of calculating the proximity of the centroid position and the overlap degree of the detection frames between the detection frames to be jointed and each target detection frame respectively according to the world coordinates, the method further comprises:
calculating the Euclidean distance between one detection frame to be handed over and other detection frames to be handed over according to the world coordinates of the detection frames to be handed over;
if the Euclidean distance is detected to be larger than a distance threshold value, executing the step of carrying out target consistency judgment and target handover on the detection frame to be handed over and each target detection frame based on the identity ID, the proximity of the centroid position and the overlapping degree of the detection frames based on one detection frame to be handed over;
and if the Euclidean distance is detected to be smaller than or equal to the distance threshold value, performing target handover on the one detection frame to be handed over and the target detection frame through Hungarian algorithm and the proximity of the position of the centroid.
5. The cross-camera multi-target tracking method according to claim 1, characterized in that: the coordinate attention mechanism network is a YOLO model based on a coordinate attention mechanism.
6. A cross-camera multi-target tracking device is characterized by comprising:
the processing unit is used for correcting pictures shot by the cameras based on a checkerboard correction method and generating a view boundary and a target intersection line for each corrected camera;
the detection unit is used for detecting a single-camera target through a coordinate attention mechanism network to obtain a plurality of detection frames corresponding to the single camera;
the tracking unit is used for tracking the targets of all the detection frames through a preset multi-target tracking algorithm so as to generate an identity ID (identity) for the detection frame corresponding to each target;
the judging unit is used for determining a detection frame to be handed over which enters the overlapping area of the current camera from the detection frames corresponding to the adjacent previous camera according to the view boundary and the target handing-over line;
the conversion unit is used for converting the image coordinates of the detection frame to be handed over and the target detection frame corresponding to the current camera into world coordinates;
the calculating unit is used for respectively calculating the centroid position proximity and the detection frame overlapping degree between the detection frame to be jointed and each target detection frame according to the world coordinates;
a handover unit for performing target consistency determination and target handover on the detection frame to be handed over and each target detection frame based on the identity ID, the proximity of the centroid position and the detection frame overlap degree.
7. The cross-camera multi-target tracking device of claim 6, wherein the handover unit is specifically configured to:
performing target consistency judgment on the detection frame to be handed over and each target detection frame according to the centroid position proximity and the detection frame overlapping degree;
taking the target detection frame which has the minimum centroid position proximity and the maximum detection frame overlapping degree with the detection frame to be jointed as a candidate target frame;
and giving the ID of the detection frame to be handed over to the candidate target frame so as to realize the target hand-over across the camera.
8. The cross-camera multi-target tracking device of claim 6, wherein: the number of the detection frames to be handed over is multiple, and the calculation unit is further configured to:
calculating the Euclidean distance between one detection frame to be handed over and other detection frames to be handed over according to the world coordinates of the detection frames to be handed over;
if the Euclidean distance is detected to be greater than the distance threshold value, enabling the computing unit to execute the steps of performing target consistency judgment and target handover on the detection frames to be handed over and each target detection frame based on the identity ID, the proximity of the centroid position and the detection frame overlapping degree based on the one detection frame to be handed over;
and if the Euclidean distance is detected to be smaller than or equal to the distance threshold value, enabling the handover unit to be further used for performing target handover on the detection frame to be handed over and the target detection frame through the Hungarian algorithm and the proximity of the centroid position.
9. A cross-camera multi-target tracking device, comprising: a memory and a processor, the memory having stored therein at least one instruction that is loaded and executed by the processor to implement the cross-camera multi-target tracking method of any one of claims 1 to 5.
10. A computer-readable storage medium characterized by: the computer storage medium stores a computer program that, when executed by a processor, implements the cross-camera multi-target tracking method of any one of claims 1 to 5.
CN202211485794.6A 2022-11-24 2022-11-24 Cross-camera multi-target tracking method, device and equipment and readable storage medium Pending CN115731266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211485794.6A CN115731266A (en) 2022-11-24 2022-11-24 Cross-camera multi-target tracking method, device and equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211485794.6A CN115731266A (en) 2022-11-24 2022-11-24 Cross-camera multi-target tracking method, device and equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115731266A true CN115731266A (en) 2023-03-03

Family

ID=85298102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211485794.6A Pending CN115731266A (en) 2022-11-24 2022-11-24 Cross-camera multi-target tracking method, device and equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115731266A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402857A (en) * 2023-04-14 2023-07-07 北京天睿空间科技股份有限公司 Moving target cross-lens tracking method based on three-dimensional calibration
CN116402857B (en) * 2023-04-14 2023-11-07 北京天睿空间科技股份有限公司 Moving target cross-lens tracking method based on three-dimensional calibration
CN116600194A (en) * 2023-05-05 2023-08-15 深圳市门钥匙科技有限公司 Switching control method and system for multiple lenses
CN116912517A (en) * 2023-06-06 2023-10-20 阿里巴巴(中国)有限公司 Method and device for detecting camera view field boundary
CN116912517B (en) * 2023-06-06 2024-04-02 阿里巴巴(中国)有限公司 Method and device for detecting camera view field boundary
CN117576167A (en) * 2024-01-16 2024-02-20 杭州华橙软件技术有限公司 Multi-target tracking method, multi-target tracking device, and computer storage medium
CN117576167B (en) * 2024-01-16 2024-04-12 杭州华橙软件技术有限公司 Multi-target tracking method, multi-target tracking device, and computer storage medium

Similar Documents

Publication Publication Date Title
Aharon et al. BoT-SORT: Robust associations multi-pedestrian tracking
Manafifard et al. A survey on player tracking in soccer videos
CN115731266A (en) Cross-camera multi-target tracking method, device and equipment and readable storage medium
Kamble et al. Ball tracking in sports: a survey
US8791960B2 (en) Markerless augmented reality system and method using projective invariant
CN105405154B (en) Target object tracking based on color-structure feature
CN109903312A (en) A kind of football sportsman based on video multi-target tracking runs distance statistics method
Puwein et al. Robust multi-view camera calibration for wide-baseline camera networks
Ren et al. Multi-camera video surveillance for real-time analysis and reconstruction of soccer games
Wang et al. A unified framework for mutual improvement of SLAM and semantic segmentation
CN102243765A (en) Multi-camera-based multi-objective positioning tracking method and system
JP2008520016A (en) Image-based motion tracking
EP4004799A1 (en) Player trajectory generation via multiple camera player tracking
EP3267395B1 (en) Person tracking method and person tracking device
Santhosh et al. An Automated Player Detection and Tracking in Basketball Game.
Pallavi et al. Graph-based multiplayer detection and tracking in broadcast soccer videos
Naik et al. DeepPlayer-track: player and referee tracking with jersey color recognition in soccer
Martín et al. A semi-supervised system for players detection and tracking in multi-camera soccer videos
Manafifard et al. Appearance-based multiple hypothesis tracking: Application to soccer broadcast videos analysis
CN114926859A (en) Pedestrian multi-target tracking method in dense scene combined with head tracking
Morais et al. Automatic tracking of indoor soccer players using videos from multiple cameras
Şah et al. Review and evaluation of player detection methods in field sports: Comparing conventional and deep learning based methods
Bang et al. Camera pose estimation using optical flow and ORB descriptor in SLAM-based mobile AR game
Leo et al. Real-time multiview analysis of soccer matches for understanding interactions between ball and players
Kalafatić et al. Multiple object tracking for football game analysis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination