LU102028B1 - Multiple view multiple target tracking method and system based on distributed camera network - Google Patents

Multiple view multiple target tracking method and system based on distributed camera network

Info

Publication number
LU102028B1
LU102028B1 (application LU102028A)
Authority
LU
Luxembourg
Prior art keywords
detected target
information
current frame
camera
coordinate
Prior art date
Application number
LU102028A
Other languages
French (fr)
Inventor
Guoliang Liu
Original Assignee
Univ Shandong
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Shandong
Application granted
Publication of LU102028B1

Links

Classifications

    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/292: Multi-camera tracking
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/53: Recognition of crowd images, e.g. recognition of crowd congestion
    • H04N 7/181: Closed-circuit television [CCTV] systems for receiving images from a plurality of remote sources
    • H04N 7/188: Capturing isolated or intermittent images triggered by the occurrence of a predetermined event, e.g. an object reaching a predetermined position
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/30241: Trajectory
    • G06V 10/255: Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G06V 10/95: Hardware or software architectures specially adapted for image or video understanding, structured as a network, e.g. client-server architectures
    • G06V 40/171: Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a multiple view multiple target tracking method and system based on a distributed camera network. The method includes: obtaining a current frame view collected by each camera in the distributed camera network; extracting a rectangular boundary box of a detected target from the current frame view; extracting visual appearance information of the detected target from an image in the rectangular boundary box by using a pre-trained convolutional neural network; converting an image coordinate of the detected target in the current frame view into a ground coordinate; and an output step: constructing a data incidence matrix based on the visual appearance information and the ground coordinate of the detected target, processing the data incidence matrix by using the Hungarian algorithm, and outputting a result of successful matching or failed matching between the detected target in the current frame view and a known trajectory.

Description

MULTIPLE VIEW MULTIPLE TARGET TRACKING METHOD AND SYSTEM BASED ON DISTRIBUTED CAMERA NETWORK
Field of the Invention
The present disclosure relates to the technical field of multiple target tracking, in particular to a multiple view multiple target tracking method and system based on a distributed camera network.

Background of the Invention
The statements in this section merely mention the background art related to the present disclosure, and do not necessarily constitute the prior art.
Multiple object tracking technology has many applications in today's society, such as surveillance, monitoring, and crowd behavior analysis.
In the process of implementing the present disclosure, the inventor found the following technical problems in the prior art:
Multiple object tracking is still a challenging task, because it needs to solve problems such as target detection, trajectory estimation, data association and re-identification at the same time.
In order to detect targets, various sensors such as radar, laser, sonar, cameras or the like can be used according to the needs of specific tasks, and corresponding detection algorithms are also required.
Target detection is one of the difficulties in multiple target tracking.
Another challenging problem of multiple target tracking is occlusion.
The target can be occluded by other objects, or it can move outside the current field of view, and frequent occlusion can easily cause loss of the target, which affects the tracking accuracy.

Summary of the Invention
In order to overcome the shortcomings of the prior art, the present disclosure provides a multiple view multiple target tracking method and system based on a distributed camera network.
In a first aspect, the present disclosure provides a multiple view multiple target tracking method based on a distributed camera network.
The multiple view multiple target tracking method based on the distributed camera network includes:
obtaining a current frame view collected by each camera in the distributed camera network;
extracting a rectangular boundary box of a detected target from the current frame view;
extracting visual appearance information of the detected target from an image in the rectangular boundary box by using a pre-trained convolutional neural network; and converting an image coordinate of the detected target in the current frame view into a ground coordinate; and
output step: constructing a data incidence matrix based on the visual appearance information and the ground coordinate of the detected target; and processing the data incidence matrix by using the Hungarian algorithm, and outputting a result of successful matching or failed matching between the detected target in the current frame view and a known trajectory.
In a second aspect, the present disclosure further provides a multiple view multiple target tracking system based on a distributed camera network.
The multiple view multiple target tracking system based on the distributed camera network includes:
an obtaining module, configured to obtain a current frame view collected by each camera in the distributed camera network;
a preprocessing module, configured to extract a rectangular boundary box of a detected target from the current frame view;
an extraction module, configured to extract visual appearance information of the detected target from an image in the rectangular boundary box by using a pre-trained convolutional neural network, and convert an image coordinate of the detected target in the current frame view into a ground coordinate; and
an output module, configured to construct a data incidence matrix based on the visual appearance information and the ground coordinate of the detected target, process the data incidence matrix by using the Hungarian algorithm, and output a result of successful matching or failed matching between the detected target in the current frame view and a known trajectory.
In a third aspect, the present disclosure further provides an electronic device, including a memory, a processor, and computer instructions stored in the memory and running on the processor, and the computer instructions complete the steps of the method in the first aspect when executed by the processor.
In a fourth aspect, the present disclosure further provides a computer-readable storage medium for storing computer instructions, and the computer instructions complete the steps of the method in the first aspect when executed by a processor.
Compared with the prior art, the present disclosure has the following beneficial effects:
In the method, the data incidence matrix generated by combining the visual appearance information with the ground coordinate is adopted, and the matching between the detected target and the known trajectory is implemented by using the data incidence matrix, so that the accuracy of matching can be improved.
The method benefits from both deep appearance visual features and distributed trajectory estimation. Compared with the original Deep SORT method, the method is
more robust in dealing with target re-identification and occlusion problems.

Brief Description of the Drawings

The drawings constituting a part of the present application are used for providing a further understanding of the present application. The exemplary embodiments of the present application and descriptions thereof are used for explaining the present application, but do not constitute an improper limitation to the present application.
Fig. 1 shows the overall structure of the distributed multiple view multiple target tracking system in the first embodiment.

Detailed Description of the Embodiments

It should be pointed out that the following detailed descriptions are all exemplary and are intended to provide further descriptions of the present application. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by those of ordinary skill in the technical field of the present application.
It should be noted that the terms used here are only for describing specific embodiments, and are not intended to limit the exemplary embodiments according to the present application. As used herein, unless the context clearly indicates otherwise, the singular form is also intended to include the plural form. In addition, it should also be understood that when the terms "comprising" and/or "including" are used in the present specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.

First embodiment

The present embodiment provides a multiple view multiple target tracking method based on a distributed camera network.
The multiple view multiple target tracking method based on the distributed camera network includes:
S1: obtaining a current frame view collected by each camera in the distributed camera network;
S2: extracting a rectangular boundary box of a detected target from the current frame view;
S3: extracting visual appearance information of the detected target from an image in the rectangular boundary box by using a pre-trained convolutional neural network; and converting an image coordinate of the detected target in the current frame view into a ground coordinate; and
S4: output step: constructing a data incidence matrix based on the visual appearance information and the ground coordinate of the detected target; and processing the data incidence matrix by using the Hungarian algorithm, and outputting a result of successful matching or failed matching between the detected target in the current frame view and a known trajectory.
As one or more embodiments, in S4, the specific steps of the output step include:
calculating the Mahalanobis distance between the ground coordinate of the detected target and the destination coordinate of each stored trajectory in the current frame view;
calculating M cosine distances between the visual appearance information of the detected target and the visual appearance information of previous M frames adjacent to the current frame view, and storing the minimum value of the M cosine distances as the final cosine distance;
when the Mahalanobis distance and the final cosine distance are both less than a set threshold, performing weighted summation on the Mahalanobis distance and the final cosine distance to obtain the data incidence matrix; and
inputting the data incidence matrix into the Hungarian algorithm, and outputting, by the Hungarian algorithm, the result of successful matching or failed matching between the detected target in the current frame view and the known trajectory.
As one or more embodiments, the method further includes:
S5: if image coordinate information of the successfully matched detected target and a corresponding trajectory serial number ID are stored in the current camera, then performing repeated iteration on the successfully matched information in the current camera and the successfully associated information in the adjacent camera for exchange, and calculating average consensus to obtain a convergent information vector and a convergent information matrix; and
calculating posterior pose information based on the convergent information vector and the convergent information matrix, whereby multiple view multiple target tracking
is achieved; and then predicting the position information of the detected target in the next frame of view.
As one or more embodiments, the method further includes:
S6: if the ground coordinate of the detected target that fails to match and a corresponding trajectory serial number ID are stored in the current camera, then calculating the Euclidean distance between the coordinate information of the detected target that fails to match in the current camera and the destination coordinate information of each stored trajectory in the views captured by the other remaining cameras; and
if the Euclidean distance is less than a set threshold, matching the ground coordinate of the detected target in the current camera with the corresponding trajectories in the views captured by the other remaining cameras.
As one or more embodiments, the camera network is of a distributed type rather than a centralized type; the centralized type collects the data in a central processing unit for processing, while the distributed type processes the data in the respective cameras, which communicate with each other.
As one or more embodiments, the extracting a rectangular boundary box of a detected target from the current frame view includes: extracting the rectangular boundary box of the detected target from the current frame view by using a YOLOv3 network.
As one or more embodiments, in S3, the specific training steps of the pre-trained convolutional neural network include:
constructing the convolutional neural network;
constructing a training set, wherein the training set is an image of known visual appearance information;
inputting the training set into the convolutional neural network to train the convolutional neural network; and
obtaining a trained convolutional neural network.
For example, the training set is a large-scale pedestrian re-identification data set containing more than 1.1 million images of 1261 pedestrians.
As one or more embodiments, in S3, the visual appearance information specifically refers to 128-dimensional normalized features output by the convolutional neural network, for example, a contour feature.
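As an illustration only, the following sketch shows the kind of appearance network described here: a small convolutional encoder mapping a person crop to a 128-dimensional, L2-normalised feature vector, so that cosine distance reduces to 1 minus a dot product. The layer sizes and the 128x64 input resolution are illustrative assumptions, not the architecture or weights actually trained in the patent.

```python
# A minimal sketch of a 128-d appearance-embedding network (assumed architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AppearanceNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global average pooling
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):                       # x: (N, 3, 128, 64) person crops
        z = self.backbone(x).flatten(1)
        z = self.fc(z)
        return F.normalize(z, dim=1)            # unit-norm 128-d appearance feature

# crops = torch.rand(8, 3, 128, 64)
# features = AppearanceNet()(crops)             # shape (8, 128), rows have unit length
```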
As one or more embodiments, in S3, the specific step of converting the image coordinate of the detected target in the current frame view into the ground coordinate includes:
using a pixel coordinate of the midpoint of a bottom margin of the boundary box of a person in the image as the position information of the person, and converting the pixel coordinate into the ground coordinate through a homography matrix, wherein the homography matrix is obtained by camera calibration.
As one or more embodiments, in S5, the specific step of calculating the average consensus includes:
based on the ICF algorithm, performing information exchange through repeated iteration between adjacent cameras so as to obtain the convergent information vector and the convergent information matrix,
wherein ε represents a constant, v_i represents the information vector, and V_i represents the information matrix.
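A standard information-weighted consensus iteration consistent with the quantities named here (a sketch, with N_i denoting the set of cameras adjacent to camera i and k the iteration index) takes the form:

v_i^{k+1} = v_i^k + ε Σ_{j∈N_i} (v_j^k − v_i^k),    V_i^{k+1} = V_i^k + ε Σ_{j∈N_i} (V_j^k − V_i^k)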
As one or more embodiments, in S5, the specific step of calculating the posterior pose information includes:
calculating a posterior state vector x_i^+(t) and a posterior information matrix W_i^+(t)
in the current frame, wherein N represents the number of cameras.
As one or more embodiments, in S5, the specific step of predicting the position information of the detected target in the next frame of view includes:
predicting a next state variable x_i^-(t) and a next information matrix W_i^-(t) of the target,
wherein i represents the i-th node, t represents the t-th frame, Φ represents a linear state transition matrix, and Q represents a process noise covariance.
The overall structure of the distributed multiple target tracking method proposed in
the present disclosure is shown in Fig. 1. First, target detection is performed on each camera by using YOLOv3, and the algorithm can extract the rectangular boundary box of the detected target at a high frame rate. Then, the visual appearance information of the target is obtained through a pre-trained convolutional neural network. The Hungarian algorithm is used to combine the visual appearance information and the position information of the target for data association. The position information of the target is fused from multiple view information by using an information weighted consensus filter (ICF). The specific steps are as follows:
1. Target detection
Target detection refers to obtaining different targets in an image and determining their types and positions. Target detection methods based on deep learning have strong robustness to illumination changes, occlusion problems and complex environments. There are two main research directions: two-stage methods and one-stage methods. A two-stage method first predicts a number of candidate frames that may contain the target, and then adjusts the sizes of the frames and classifies them to obtain the precise position, size and category of the target; Faster R-CNN is an example. In a one-stage method, the first step is omitted, and the position and the category of the target are directly predicted; YOLOv3 is an example. Compared with the two-stage methods, the one-stage methods are usually faster and have comparable performance. Therefore, YOLOv3 is chosen as the target detector.
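As an illustration of this detection step, the sketch below loads a Darknet YOLOv3 model through OpenCV's DNN module and returns person bounding boxes; the file names "yolov3.cfg"/"yolov3.weights", the 416x416 input size, the thresholds and the COCO person class index 0 are assumptions, not values specified in the patent.

```python
# A minimal YOLOv3 detection sketch (illustrative, not the patent's code).
import cv2
import numpy as np

def detect_people(frame, net, conf_thresh=0.5, nms_thresh=0.4, person_class_id=0):
    """Return (x, y, w, h) pixel bounding boxes of detected persons in one frame."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())   # one array per YOLO scale

    boxes, scores = [], []
    for out in outputs:
        for det in out:                       # det = [cx, cy, bw, bh, objectness, class scores...]
            class_scores = det[5:]
            class_id = int(np.argmax(class_scores))
            conf = float(det[4] * class_scores[class_id])
            if class_id == person_class_id and conf >= conf_thresh:
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                scores.append(conf)

    keep = cv2.dnn.NMSBoxes(boxes, scores, conf_thresh, nms_thresh)  # non-maximum suppression
    return [boxes[i] for i in np.array(keep).flatten()]

# net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
# rectangular_boxes = detect_people(frame, net)
```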
2. Data association
A simple Hungarian algorithm is used for data association. The visual appearance information of the target is a 128-dimensional feature vector obtained by a trained convolutional neural network, and the position information of the target is obtained by converting an image coordinate onto the ground by using a calibrated homography matrix. The final correlation matrix is obtained by weighting the visual appearance information and the position information of the target:

c_{i,j} = λ d^{(1)}(i, j) + (1 − λ) d^{(2)}(i, j)   (1)
wherein i and j respectively index the i-th stored trajectory and the j-th measured value, λ represents an adjustable weight parameter, d^{(1)} represents the Mahalanobis distance between a measurement position and the last frame of each stored trajectory in the current frame, and d^{(2)} represents the minimum cosine distance between the measured appearance information and the appearance information stored in each trajectory.
It should be understood that the homography matrix refers to a homography matrix obtained by camera calibration, and the homography matrix can convert the pixel coordinate into the ground coordinate.
In addition, a threshold function is used to ignore irrelevant candidates:

b_{i,j}^{(k)} = 1, if d^{(k)}(i, j) ≤ t^{(k)}   (2)

wherein k is equal to 1 or 2 and indexes the position term and the appearance term respectively; only when both the Mahalanobis distance and the cosine distance are less than the corresponding thresholds is the association between the i-th trajectory and the j-th tracked target allowed, that is, b_{i,j} is set to 1.
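The association step can be sketched as follows, assuming each detection and each stored trajectory are plain records carrying a ground-plane coordinate, a unit-norm appearance feature, and (for trajectories) the inverse covariance used in the Mahalanobis term; the field names, the weight λ and the gating thresholds are illustrative placeholders rather than values fixed by the patent.

```python
# A minimal data-association sketch: homography projection, weighted cost matrix,
# gating, and Hungarian matching. Helper field names are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def pixel_to_ground(box, H):
    """Project the midpoint of a box's bottom edge onto the ground plane with homography H."""
    x, y, w, h = box
    p = np.array([x + w / 2.0, y + h, 1.0])     # midpoint of the bottom margin, homogeneous
    g = H @ p
    return g[:2] / g[2]

def associate(tracks, detections, lam=0.5, t_pos=5.991, t_app=0.3):
    """Build the weighted cost matrix of eq. (1), gate it as in eq. (2), solve it with the
    Hungarian algorithm, and return matched pairs plus indices of unmatched detections.
    t_pos defaults to the 95% chi-square quantile for a 2-D position (placeholder)."""
    n_t, n_d = len(tracks), len(detections)
    INF = 1e6                                   # sentinel cost for gated-out pairs
    cost = np.full((n_t, n_d), INF)
    for i, trk in enumerate(tracks):
        for j, det in enumerate(detections):
            diff = det["ground_xy"] - trk["ground_xy"]
            d1 = float(diff @ trk["inv_cov"] @ diff)             # squared Mahalanobis distance
            d2 = min(1.0 - float(det["feature"] @ f)             # minimum cosine distance over
                     for f in trk["features"])                   # the last M stored features
            if d1 < t_pos and d2 < t_app:                        # gating, eq. (2)
                cost[i, j] = lam * d1 + (1.0 - lam) * d2         # weighted sum, eq. (1)
    rows, cols = linear_sum_assignment(cost)                     # Hungarian algorithm
    matches = [(i, j) for i, j in zip(rows, cols) if cost[i, j] < INF]
    matched = {j for _, j in matches}
    unmatched_detections = [j for j in range(n_d) if j not in matched]
    return matches, unmatched_detections
```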
3. Trajectory processing using multiple view information
The trajectory processing step is used for ID management, including restoring an old trajectory or creating a new trajectory. Restoring the old trajectory means that after a person walks out of the field of view for 30 frames, his trajectory will be deleted, and when he comes back again, he will be given the original ID; creating the new trajectory means that a new person enters the field of view and is assigned a new ID. When a detection value that fails to match is found in the current view, its position information is first matched, using the Euclidean distance, with the position information of the last frame of each stored trajectory to restore the old trajectory, and if a matching candidate is found, the detection value is assigned the ID of that trajectory.
If the match fails again, the algorithm checks whether there is a detection that also fails to match in other views, for the generation of a new trajectory; if the distance meets the threshold requirement,
a new ID is initialized for them. In addition, the algorithm removes the trajectories that have disappeared for more than 30 seconds in the current view to reduce interference.
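A minimal sketch of this ID management, assuming dictionary-based track records that keep an "id", the last "ground_xy" coordinate and an "age" counter (frames since last seen), is shown below; MAX_AGE and the ground-plane gate DIST_GATE are illustrative values, not figures taken from the patent.

```python
# Restore old trajectory IDs by Euclidean distance, otherwise create new IDs, and
# prune stale trajectories (illustrative sketch).
import numpy as np

MAX_AGE = 30          # frames a lost trajectory is kept before it is deleted
DIST_GATE = 1.0       # maximum ground-plane distance for restoring an old ID (placeholder)

def recover_or_create(unmatched_detections, lost_tracks, next_id):
    """Give each unmatched detection the ID of the nearest lost trajectory if one is
    close enough (restore the old trajectory); otherwise start a new trajectory."""
    assignments = []
    for det in unmatched_detections:
        best, best_d = None, DIST_GATE
        for trk in lost_tracks:
            d = float(np.linalg.norm(det["ground_xy"] - trk["ground_xy"]))  # Euclidean distance
            if d < best_d:
                best, best_d = trk, d
        if best is not None:
            assignments.append((det, best["id"]))   # restore the old ID
        else:
            assignments.append((det, next_id))      # create a new trajectory
            next_id += 1
    return assignments, next_id

def prune(tracks):
    """Remove trajectories that have not been observed for more than MAX_AGE frames."""
    return [t for t in tracks if t["age"] <= MAX_AGE]
```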
4. Information weighted consensus filter for multiple view information fusion
The information weighted consensus filter (ICF) is an effective distributed state estimation method. Here, the ICF is used to perform multiple view information fusion to estimate the position of the target. The ICF mainly includes three steps: state prediction, measurement update and weighted consensus. In terms of state prediction, in S6, a linear constant velocity model is used to predict the next state variable x and the next information matrix W of the target:

W_i^-(t) = (Φ (W_i^+(t − 1))^{-1} Φ^T + Q)^{-1}   (3)

wherein i represents the i-th node, t represents the t-th frame, Φ represents a linear state transition matrix, and Q represents a process noise covariance. The predicted position information is sent to the data association module to match the measured value.
During the measurement update, the current measured value z_i is used to calculate the information vector v_i and the information matrix V_i, where x_i^-, W_i^-, H_i, R_i and N respectively represent the a priori state vector, the information matrix, an observation matrix, a measured noise covariance and the number of cameras.
With respect to the weighted consensus, each camera sends and receives the information vector v_i and the information matrix V_i to and from the adjacent cameras, and k iterations are performed until the filter converges, wherein ε represents a constant.
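In the standard ICF formulation, which matches the quantities named here, the measurement update at camera i takes the form (a sketch, not a verbatim reproduction of the patent's equations):

v_i = (1/N) W_i^- x_i^- + H_i^T R_i^{-1} z_i,    V_i = (1/N) W_i^- + H_i^T R_i^{-1} H_i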
The posterior state vector x_i^+(t) and the posterior information matrix W_i^+(t) in the current frame are obtained at last:

W_i^+(t) = N V_i(t)   (10)
wherein N represents the number of cameras.
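A compact numerical sketch of the three ICF steps for one target, run jointly over all N camera nodes, is given below; the measurement-update form follows the standard ICF literature, and the adjacency matrix, the model matrices Phi, Q, H, R, the rate eps and the iteration count are assumptions supplied by the caller, not values fixed by the patent.

```python
# Illustrative ICF cycle: measurement update, weighted consensus, posterior, prediction.
import numpy as np

def icf_update(x_prior, W_prior, z, H, R, A, eps=0.01, iters=50):
    """x_prior/W_prior: per-node prior state vectors and information matrices;
    z: per-node measurements; A: NxN adjacency matrix of the camera network."""
    N = len(x_prior)
    R_inv = np.linalg.inv(R)
    # measurement update: per-node information vector v_i and information matrix V_i
    v = [W_prior[i] @ x_prior[i] / N + H.T @ R_inv @ z[i] for i in range(N)]
    V = [W_prior[i] / N + H.T @ R_inv @ H for i in range(N)]
    # weighted consensus: repeatedly exchange (v_i, V_i) with neighbouring cameras
    for _ in range(iters):
        v = [v[i] + eps * sum(A[i, j] * (v[j] - v[i]) for j in range(N)) for i in range(N)]
        V = [V[i] + eps * sum(A[i, j] * (V[j] - V[i]) for j in range(N)) for i in range(N)]
    # posterior estimate in the current frame: W_i^+ = N * V_i, x_i^+ = V_i^{-1} v_i
    x_post = [np.linalg.solve(V[i], v[i]) for i in range(N)]
    W_post = [N * V[i] for i in range(N)]
    return x_post, W_post

def predict(x_post, W_post, Phi, Q):
    """Constant-velocity prediction of eq. (3): propagate each node's estimate to the next frame."""
    x_pred = [Phi @ x for x in x_post]
    W_pred = [np.linalg.inv(Phi @ np.linalg.inv(W) @ Phi.T + Q) for W in W_post]
    return x_pred, W_pred
```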
Second embodiment

The present embodiment further provides a multiple view multiple target tracking system based on a distributed camera network.
The multiple view multiple target tracking system based on the distributed camera network includes:
an obtaining module, configured to obtain a current frame view collected by each camera in the distributed camera network;
a preprocessing module, configured to extract a rectangular boundary box of a detected target from the current frame view;
an extraction module, configured to extract visual appearance information of the detected target from an image in the rectangular boundary box by using a pre-trained convolutional neural network, and convert an image coordinate of the detected target in the current frame view into a ground coordinate; and
an output module, configured to construct a data incidence matrix based on the visual appearance information and the ground coordinate of the detected target, process the data incidence matrix by using the Hungarian algorithm, and output a result of successful matching or failed matching between the detected target in the current frame view and a known trajectory.

Third embodiment
The present embodiment further provides an electronic device, including a memory, a processor, and computer instructions stored in the memory and running on the processor, and the computer instructions complete the steps of the method in the first aspect when executed by the processor.

Fourth embodiment
The present embodiment further provides a computer-readable storage medium for storing computer instructions, and the computer instructions complete the steps of the method in the first aspect when executed by a processor.
The above descriptions are only preferred embodiments of the present application, and are not used to limit the present application. For those skilled in the art, the present application can have various modifications and changes. Any modifications, equivalent replacements, improvements and the like, made within the spirit and
principle of the present application, shall all be included in the protection scope of the present application.

Claims (10)

1. A multiple view multiple target tracking method based on a distributed camera network, comprising:
obtaining a current frame view collected by each camera in the distributed camera network;
extracting a rectangular boundary box of a detected target from the current frame view;
extracting visual appearance information of the detected target from an image in the rectangular boundary box by using a pre-trained convolutional neural network; and converting an image coordinate of the detected target in the current frame view into a ground coordinate; and
output step: constructing a data incidence matrix based on the visual appearance information and the ground coordinate of the detected target; and processing the data incidence matrix by using the Hungarian algorithm, and outputting a result of successful matching or failed matching between the detected target in the current frame view and a known trajectory.
2. The method of claim 1, wherein the specific steps of the output step comprise:
calculating the Mahalanobis distance between the ground coordinate of the detected target and the destination coordinate of each stored trajectory in the current frame view;
calculating M cosine distances between the visual appearance information of the detected target and the visual appearance information of previous M frames adjacent to the current frame view; and storing the minimum value of the M cosine distances as the final cosine distance;
when the Mahalanobis distance and the final cosine distance are both less than a set threshold, performing weighted summation on the Mahalanobis distance and the final cosine distance to obtain the data incidence matrix; and
inputting the data incidence matrix into the Hungarian algorithm, and outputting, by
the Hungarian algorithm, the result of successful matching or failed matching between the detected target in the current frame view and the known trajectory.
3. The method of claim 1, further comprising:
if image coordinate information of the successfully matched detected target and a corresponding trajectory serial number ID are stored in the current camera, then performing repeated iteration on the successfully matched information in the current camera and the successfully associated information in the adjacent camera for exchange, and calculating average consensus to obtain a convergent information vector and a convergent information matrix; and
calculating posterior pose information based on the convergent information vector and the convergent information matrix, whereby multiple view multiple target tracking is achieved; and then predicting the position information of the detected target in the next frame of view.
4. The method of claim 3, further comprising: if the ground coordinate of the detected target that fails to match and a corresponding trajectory serial number ID are stored in the current camera, then calculating the Euclidean distance between the coordinate information of the detected target that fails to match in the current camera and the destination coordinate information of each stored trajectory in the views captured by the other remaining cameras; and if the Euclidean distance is less than a set threshold, matching the ground coordinate of the detected target in the current camera with the corresponding trajectories in the views captured by the other remaining cameras.
5. The method of claim 3, wherein the extracting a rectangular boundary box of a detected target from the current frame view comprises: extracting the rectangular boundary box of the detected target from the current frame view by using a YOLOv3 network.
6. The method of claim 3, wherein the specific training steps of the pre-trained convolutional neural network comprise: constructing the convolutional neural network; and constructing a training set, wherein the training set is an image of known visual appearance information;
inputting the training set into the convolutional neural network to train the convolutional neural network; and obtaining a trained convolutional neural network.
7. The method of claim 1, wherein the specific step of converting the image coordinate of the detected target in the current frame view into the ground coordinate comprises: using a pixel coordinate of the midpoint of a bottom margin of the boundary box of a person in the image as the position information of the person, and converting the pixel coordinate into the ground coordinate through a homography matrix, wherein the homography matrix is obtained by camera calibration.
8. A multiple view multiple target tracking system based on a distributed camera network, comprising: an obtaining module, configured to obtain a current frame view collected by each camera in the distributed camera network; a preprocessing module, configured to extract a rectangular boundary box of a detected target from the current frame view; an extraction module, configured to extract visual appearance information of the detected target from an image in the rectangular boundary box by using a pre-trained convolutional neural network; and convert an image coordinate of the detected target in the current frame view into a ground coordinate; and an output module, configured to construct a data incidence matrix based on the visual appearance information and the ground coordinate of the detected target; and process the data incidence matrix by using the Hungarian algorithm, and output a result of successful matching or failed matching between the detected target in the current frame view and a known trajectory.
9. An electronic device, comprising a memory, a processor, and computer instructions stored in the memory and running on the processor, wherein the computer instructions complete the steps of the method according to any one of claims 1-7 when executed by the processor.
10. A computer-readable storage medium for storing computer instructions, wherein
the computer instructions complete the steps of the method according to any one of claims 1-7 when executed by a processor.
LU102028A 2019-10-23 2020-09-03 Multiple view multiple target tracking method and system based on distributed camera network LU102028B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911012807.6A CN110782483B (en) 2019-10-23 2019-10-23 Multi-view multi-target tracking method and system based on distributed camera network

Publications (1)

Publication Number Publication Date
LU102028B1 true LU102028B1 (en) 2021-03-03

Family

ID=69386547

Family Applications (1)

Application Number Title Priority Date Filing Date
LU102028A LU102028B1 (en) 2019-10-23 2020-09-03 Multiple view multiple target tracking method and system based on distributed camera network

Country Status (2)

Country Link
CN (1) CN110782483B (en)
LU (1) LU102028B1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738075A (en) * 2020-05-18 2020-10-02 深圳奥比中光科技有限公司 Joint point tracking method and system based on pedestrian detection
CN111626194B (en) * 2020-05-26 2024-02-02 佛山市南海区广工大数控装备协同创新研究院 Pedestrian multi-target tracking method using depth correlation measurement
CN112215873A (en) * 2020-08-27 2021-01-12 国网浙江省电力有限公司电力科学研究院 Method for tracking and positioning multiple targets in transformer substation
CN112070807B (en) * 2020-11-11 2021-02-05 湖北亿咖通科技有限公司 Multi-target tracking method and electronic device
CN112633205A (en) * 2020-12-28 2021-04-09 北京眼神智能科技有限公司 Pedestrian tracking method and device based on head and shoulder detection, electronic equipment and storage medium
CN113674317B (en) * 2021-08-10 2024-04-26 深圳市捷顺科技实业股份有限公司 Vehicle tracking method and device for high-level video
CN114089675B (en) * 2021-11-23 2023-06-09 长春工业大学 Machine control method and system based on man-machine distance
CN114299128A (en) * 2021-12-30 2022-04-08 咪咕视讯科技有限公司 Multi-view positioning detection method and device
CN114596337B (en) * 2022-03-03 2022-11-25 捻果科技(深圳)有限公司 Self-recognition target tracking method and system based on linkage of multiple camera positions
CN116758119B (en) * 2023-06-27 2024-04-19 重庆比特数图科技有限公司 Multi-target circulation detection tracking method and system based on motion compensation and linkage
CN117853759B (en) * 2024-03-08 2024-05-10 山东海润数聚科技有限公司 Multi-target tracking method, system, equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11144761B2 (en) * 2016-04-04 2021-10-12 Xerox Corporation Deep data association for online multi-class multi-object tracking
CN107292911B (en) * 2017-05-23 2021-03-30 南京邮电大学 Multi-target tracking method based on multi-model fusion and data association
CN109447121B (en) * 2018-09-27 2020-11-06 清华大学 Multi-target tracking method, device and system for visual sensor network
CN109816690A (en) * 2018-12-25 2019-05-28 北京飞搜科技有限公司 Multi-target tracking method and system based on depth characteristic
CN109934844A (en) * 2019-01-28 2019-06-25 中国人民解放军战略支援部队信息工程大学 A kind of multi-object tracking method and system merging geospatial information
CN109919981B (en) * 2019-03-11 2022-08-02 南京邮电大学 Multi-feature fusion multi-target tracking method based on Kalman filtering assistance

Also Published As

Publication number Publication date
CN110782483A (en) 2020-02-11
CN110782483B (en) 2022-03-15

Similar Documents

Publication Publication Date Title
LU102028B1 (en) Multiple view multiple target tracking method and system based on distributed camera network
CN113269098A (en) Multi-target tracking positioning and motion state estimation method based on unmanned aerial vehicle
CN109325456B (en) Target identification method, target identification device, target identification equipment and storage medium
WO2022227761A1 (en) Target tracking method and apparatus, electronic device, and storage medium
Denman et al. Multi-spectral fusion for surveillance systems
CN112184757A (en) Method and device for determining motion trail, storage medium and electronic device
CN106846367B (en) A kind of Mobile object detection method of the complicated dynamic scene based on kinematic constraint optical flow method
CN116645396A (en) Track determination method, track determination device, computer-readable storage medium and electronic device
KR101438377B1 (en) Apparatus and method for detecting position of moving unit
Wang et al. Effective multiple pedestrian tracking system in video surveillance with monocular stationary camera
Choe et al. Traffic analysis with low frame rate camera networks
Bazzani et al. A comparison of multi hypothesis kalman filter and particle filter for multi-target tracking
Saisan et al. Multi-view classifier swarms for pedestrian detection and tracking
Mittal et al. Pedestrian detection and tracking using deformable part models and Kalman filtering
KR100994722B1 (en) Method for tracking moving object on multiple cameras using probabilistic camera hand-off
Bardas et al. 3D tracking and classification system using a monocular camera
CN112446355B (en) Pedestrian recognition method and people stream statistics system in public place
Vu et al. Real-time robust human tracking based on lucas-kanade optical flow and deep detection for embedded surveillance
US20230076241A1 (en) Object detection systems and methods including an object detection model using a tailored training dataset
CN110276233A (en) A kind of polyphaser collaboration tracking system based on deep learning
CN114782496A (en) Object tracking method and device, storage medium and electronic device
Zhang et al. Video Surveillance Using a Multi-Camera Tracking and Fusion System.
Kogut et al. A wide area tracking system for vision sensor networks
Klinger et al. A dynamic bayes network for visual pedestrian tracking
CN117670939B (en) Multi-camera multi-target tracking method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
FG Patent granted

Effective date: 20210303