CN109190508B

CN109190508B - Multi-camera data fusion method based on space coordinate system

Info

Publication number: CN109190508B
Application number: CN201810917557.XA
Authority: CN
Inventors: 曹杰; 张剑书; 章磊; 李秀怡; 申冬琴
Original assignee: Nanjing University of Finance and Economics
Current assignee: Nanjing University of Finance and Economics
Priority date: 2018-08-13
Filing date: 2018-08-13
Publication date: 2022-09-06
Anticipated expiration: 2038-08-13
Also published as: CN109190508A

Abstract

The invention discloses a multi-camera data fusion method based on a space coordinate system, which comprises the following steps: constructing a training data set aiming at a target to be extracted, and finishing the training of a target detection and recognition model; extracting the category of a target in video data acquired by each camera and the position information of the target under a two-dimensional image coordinate system, and establishing a coordinate mapping relation between the two-dimensional image coordinate system and a three-dimensional space coordinate system; carrying out target detection and target identification processing on video stream data acquired by cameras in a plurality of continuous scenes, and extracting the category information of targets appearing in each frame and the position information of the targets in a two-dimensional image coordinate system; mapping the position information of the target in a two-dimensional image coordinate system into the coordinate of the target in a three-dimensional space coordinate system; and obtaining the motion trail data of the target under the three-dimensional space coordinate system according to the distance information between each target in the current time node and each target in the previous time node.

Description

Multi-camera data fusion method based on space coordinate system

Technical Field

The invention relates to the field of video image processing, in particular to a monitoring data fusion method for multiple cameras in continuous multiple monitoring scenes under a space coordinate system.

Background

In recent years, with the popularization of network cameras for security, intelligent video monitoring technology has rapidly become a current research hotspot. Video data is a record of what happens in a monitored scene, with various types of information being implied. Since the background is fixed in most video surveillance data, the user is really interested in the objects appearing therein and the moving tracks of the objects for the video surveillance data.

At present, most of video monitoring data are respectively stored according to the serial numbers of cameras, and then are analyzed and processed through an intelligent video monitoring technology. Target detection, target recognition and target tracking are three important links of intelligent video monitoring analysis and processing. The target tracking is used for determining the continuous positions of targets interested by people in a video sequence, and the target tracking technology is a basic technology in the field of computer vision and has wide application value.

The traditional target tracking technology records the historical motion track of a target based on a two-dimensional image space. The method is easy to implement, and the motion trail of the target in the current scene can be recorded. However, this method has disadvantages: 1. the traditional target tracking technology is limited in a two-dimensional image space and cannot reflect the position change information of a target in the space; 2. when a target moves across scenes, the conventional target tracking technology needs to compare the target in the current scene with targets in a plurality of subsequent scenes, and then judges the moving direction of the target, so that the target cannot be well continuously tracked in the plurality of scenes.

Disclosure of Invention

The purpose of the invention is: the method comprises the steps of establishing a mapping relation between a two-dimensional image coordinate system and a real three-dimensional space coordinate system, mapping coordinate information of a target in the two-dimensional image coordinate system, which is obtained after target detection and target identification are carried out on video frames obtained by a plurality of cameras, into the space coordinate system, and providing a method for carrying out cross-scene tracking on the target on the basis of the coordinate information, so as to realize fusion of video monitoring data collected by the plurality of cameras.

The method comprises the steps of establishing a coordinate mapping equation between a two-dimensional image coordinate system and a real three-dimensional space coordinate system on the basis of analyzing a camera imaging principle and a camera calibration technology; then converting the coordinates of the target under a two-dimensional image coordinate system, which are acquired by the target detection and target identification method, into the coordinates of the target under a three-dimensional space coordinate system; and finally, tracking the target based on the coordinate of the target in the three-dimensional space coordinate system to realize the data fusion of the multiple cameras. The method can restore the position change information of the target in the real space, can better track the target across scenes, realizes the fusion of data acquired by a plurality of cameras, and provides more information for high-level target behavior analysis.

In order to achieve the above object, the present invention provides a multi-camera data fusion method based on a spatial coordinate system, which comprises the following steps:

deploying each camera in a scene to be monitored; constructing a training data set aiming at a target to be extracted, and finishing the training of a target detection and recognition model; extracting the category of a target in video data acquired by each camera and the position information of the target under a two-dimensional image coordinate system based on a target detection and identification model, and establishing a coordinate mapping relation between the two-dimensional image coordinate system and a three-dimensional space coordinate system; carrying out target detection and target identification processing on video stream data acquired by cameras in a plurality of continuous scenes, and extracting the category information of targets appearing in each frame and the position information of the targets in a two-dimensional image coordinate system; based on the coordinate mapping relation, mapping the position information of the target in the two-dimensional image coordinate system into the coordinate of the target in the three-dimensional space coordinate system; and finding the same targets in the adjacent time nodes according to the distance information between each target in the current time node and each target in the previous time node, and connecting the same targets to obtain the motion trail data of the targets in the three-dimensional space coordinate system.

Preferentially, the step of establishing a coordinate mapping relation between a two-dimensional image coordinate system and a three-dimensional space coordinate system specifically comprises the following steps:

for a scene monitored by each camera, selecting pixel coordinates in a coordinate system of 50 uniformly distributed two-dimensional images, and recording the pixel coordinates as { (u) ₁ ,v ₁ ),(u ₂ ,v ₂ ),……,(u ₅₀ ,v ₅₀ ) And simultaneously acquiring corresponding space coordinates of the 50 pixel coordinates in a real three-dimensional space coordinate system, and recording as { (X) ₁ ,Y ₁ ,Z ₁ ),(X ₂ ,Y ₂ ,Z ₂ ),……,(X ₅₀ ,Y ₅₀ ,Z ₅₀ ) Coordinates of each point in a real three-dimensional space coordinate system can be measured manually or positioned by using a GPS (global positioning system)Obtaining;

during the imaging process of the camera, the coordinate points (u, v) of the two-dimensional image and the corresponding coordinate points (X, Y, Z) of the three-dimensional space satisfy the following conditions:

considering that the targets we are interested in are all on ground level, i.e. Z ═ 0, the above equation can be converted to:

wherein, the parameter (w) ₁ ,w ₂ ,b ₁ ,w ₃ ,w ₄ ,b ₂ ) As a constant, the sum of the squares of the deviations of all the fitting results from the actual data is found by the least squares method

And

minimum parameter (w) ₁ ,w ₂ ,b ₁ ,w ₃ ,w ₄ ,b ₂ ) I.e. solving the system of equations:

preferably, the step of finding the same target in the adjacent time nodes according to the distance information between each target in the current time node and each target in the previous time node, and connecting the same targets to obtain the motion trajectory data of the target in the three-dimensional space coordinate system specifically includes:

reading coordinate data of the target in a three-dimensional space coordinate system under the monitoring scene of each camera, and sequencing according to the occurrence time of the coordinate data;

calculating the coordinates ((i +1) of each target in the (i +1) th time node under a three-dimensional space coordinate system ₁ ～(i+1) _a ) Coordinates (i) of each target in the ith time node in a three-dimensional space coordinate system ₁ ～i _b ) Distance between, noted d _xy (where x ∈ [1, a ]]，y∈[1,b])；

When x is a constant value, consider d temporarily _xy The minimum y and x are the numbers of the same target in two adjacent time nodes;

if d is _xy If the number of the target is less than the given threshold value T, determining that x and y are the numbers of the same target in two adjacent time nodes, otherwise, determining that x is the number of a new target appearing in the monitoring scene for the first time at the (i +1) th time node;

sequentially connecting the coordinates of the same target in two adjacent time nodes under a three-dimensional space coordinate system, and simultaneously recording the category and the start-stop time node of the target to which each coordinate connecting line belongs to obtain the motion trail data of the target under the scene monitored by each camera under the three-dimensional space coordinate system.

Preferably, the training step of the target detection and recognition model is completed, and specifically includes:

step 201: for each target to be extracted, collecting 500-1000 pictures containing the target, wherein the pictures should include the targets shot from different angles as much as possible;

step 202: adding a category label to the picture acquired in step 201 by a manual marking method;

step 203: adding position information of the target in a two-dimensional image coordinate system to the picture acquired in the step 201 by a manual marking method;

step 204: randomly scrambling a picture data set used for training to construct a training data set, and importing a Caffe-based deep learning framework fast-RCNN;

step 205: modifying the training parameters according to the categories and the number of the categories of the training set, and setting the iteration times;

step 206: and training a target detection and recognition model.

Preferably, the step of extracting the category of the target and the position information of the target in the two-dimensional image coordinate system in the video data collected by each camera based on the target detection and recognition model specifically includes:

for video stream data collected by cameras in a plurality of continuous scenes, performing target detection and target identification processing by using a target detection and identification model, and extracting the category information of a target in a current frame and the position information of the target in a two-dimensional image coordinate system, which are collected by each camera, every 1 second; meanwhile, the ID and the acquisition time of the camera to which the target belongs are acquired, and the information of the target is formed by the ID and the acquisition time and the position information of the target.

According to the method, a target detection and target recognition model is trained by utilizing a deep learning framework according to the actual needs of monitoring scenes, and the category information of a plurality of different targets and the position information of the targets under a two-dimensional image coordinate system can be simultaneously extracted from images acquired by cameras in a plurality of continuous monitoring scenes; establishing a coordinate mapping relation between a two-dimensional image coordinate system and a three-dimensional space coordinate system through a least square method, and converting the detected position information of each target in the two-dimensional image coordinate system into a coordinate in the target three-dimensional space coordinate system; and obtaining the motion track of the target in a three-dimensional space coordinate system by calculating the distance information between the targets under the adjacent time nodes. Compared with the traditional target tracking technology, the method can fuse the scattered monitoring data collected by each camera, and can simultaneously acquire the spatial position change information of a plurality of targets under a plurality of continuous scenes; the conversion from a two-dimensional pixel coordinate to a three-dimensional space coordinate can be completed through a coordinate mapping equation, and the tracking of the target in the three-dimensional space is realized; meanwhile, compared with the storage of continuous image data adopted by the traditional target tracking technology, the method for storing the category data and the track data of the target greatly reduces the amount of stored data and improves the calculation efficiency.

Drawings

Fig. 1 is a schematic flowchart of a multi-camera data fusion method based on a spatial coordinate system according to an embodiment of the present invention;

FIG. 2 is a flow chart of a target detection and recognition model training process.

Detailed Description

The invention relates to a multi-camera data fusion method based on a space coordinate system, which is further described in detail by combining the accompanying drawings:

fig. 1 is a schematic flow chart of a multi-camera data fusion method based on a spatial coordinate system according to an embodiment of the present invention. As shown in fig. 1, the multi-camera data fusion method based on the spatial coordinate system includes steps S1-S6:

s1, deploying multiple cameras in the scene to be monitored, wherein the edges of the scene covered by adjacent cameras should be as close as possible, and the edges of the scene covered by adjacent cameras should overlap as little as possible.

S2, training a target detection and recognition model, wherein a model training flow chart is shown in the attached figure 2, and the method comprises the following specific implementation steps:

step 201: 500 and 1000 pictures containing at least one category target are selected from each category. For the selection of the pictures, the pictures shot at different angles and the pictures containing the targets with different postures should be selected as much as possible to form a picture data set.

Step 202: adding a category label to the picture acquired in step 201 by a manual labeling method, where the category label is a category to which the object in the picture belongs.

Step 203: adding position information of the target in the two-dimensional image coordinate system to the picture acquired in step 201 by a manual marking method, wherein the position information of the target in the two-dimensional image coordinate system is coordinate information (x1, y1, x2, y2) of a rectangular enclosure frame where the target is located, wherein (x1, y1) is coordinates of an upper left corner of the rectangular enclosure frame where the target is located, and (x2, y2) is coordinates of a lower right corner of the rectangular enclosure frame where the target is located.

Step 204: randomly disordering a picture data set for training to construct a training data set, dividing the training data set, a test set and a verification set according to the proportion of 7:2:1, and importing a deep learning framework fast-RCNN.

Step 205: the training parameters are modified according to the categories of the targets in the training set and the number of the target categories, and the number of iterations is set, including rpn 1 st stage, fast rcnn 1 st stage, rpn 2 nd stage, and fast rcnn 2 nd stage.

Step 206: and training a target detection and identification model, and obtaining a target detection and identification model file with the suffix name of the coffee model after the set iteration times are finished.

And S3, carrying out target detection and target identification processing on the current frame acquired by the cameras every 1 second for the video stream data acquired by the cameras in a plurality of continuous scenes. Extracting the category information of the target in the current frame and the two-dimensional pixel coordinate position information (u) of the target under a two-dimensional image coordinate system _i ,v _i ) Meanwhile, the ID of the camera to which the target belongs and the time information of the appearance of the target need to be stored, and the information together form the information of the target.

S4, the specific implementation steps of the calculation method of the coordinate mapping equation between the two-dimensional coordinates and the three-dimensional coordinates are as follows: during the imaging process of the camera, the following relation is satisfied between the coordinate points (u, v) of the two-dimensional image and the corresponding coordinate points (X, Y, Z) of the three-dimensional space:

considering that the targets we are interested in are all on ground level, i.e. Z ═ 0, equation (1) can be converted into:

wherein the content of the first and second substances,

based onTherefore, a coordinate mapping relation between a two-dimensional image coordinate system and a three-dimensional space coordinate system can be established by a least square method, and the method comprises the following specific steps:

step 401: for the scene monitored by each camera, pixel coordinates in 50 uniformly distributed two-dimensional image coordinate systems are selected and are recorded as { (u) ₁ ,v ₁ ),(u ₂ ,v ₂ ),……,(u ₅₀ ,v ₅₀ ) And simultaneously acquiring corresponding space coordinates of the 50 pixel coordinates in a real three-dimensional space coordinate system through manual measurement or a GPS positioning method, and marking as { (X) ₁ ,Y ₁ ,Z ₁ ),(X ₂ ,Y ₂ ,Z ₂ ),……,(X ₅₀ ,Y ₅₀ ,Z ₅₀ )}。

Step 402: by least squares, selecting the appropriate parameters (w) ₁ ,w ₂ ,b ₁ ,w ₃ ,w ₄ ,b ₂ ) Ensuring the sum of squares M of the deviations of all fitting results from the actual data _X ，M _Y Minimum, M _X ，M _Y Can be expressed as:

make M _X ，M _Y The minimum time parameter (w) can be taken ₁ ,w ₂ ,b ₁ ,w ₃ ,w ₄ ,b ₂ ) I.e. solving the system of equations:

will parameter (w) ₁ ,w ₂ ,b ₁ ,w ₃ ,w ₄ ,b ₂ ) And substituting the equation into the formula (2) to obtain a mapping equation between a two-dimensional image coordinate system and a three-dimensional space coordinate system in the scene monitored by the camera.

S5, extracting the two-dimensional pixel coordinate position information (u) of each object in the two-dimensional image coordinate system extracted in the step S3 _i ,v _i ) Inputting a mapping equation between the two-dimensional image coordinate system and the three-dimensional space coordinate system obtained by calculation in the step 4:

thereby obtaining the coordinate (X) of the target detected in each monitoring scene in the three-dimensional space coordinate system _i ,Y _i ,Z _i )。

Step S6, fusing multi-camera data, extracting the track data of the target in the three-dimensional space coordinate system based on the coordinates of the target in the three-dimensional space coordinate system in each monitoring scene, and specifically comprising the following steps:

step 601: and sequentially reading the coordinates of the target under the three-dimensional space coordinate system under each monitoring scene in two adjacent time nodes according to the sequence of the occurrence time.

Step 602: calculating the coordinates (i +1) of each target in the (i +1) th time node under the three-dimensional space coordinate system ₁ ～(i+1) _a ) Coordinates (i) of each target in the ith time node in a three-dimensional space coordinate system ₁ ～i _b ) Distance between, noted d _xy (where x ∈ [1, a ]]， y∈[1,b])。

Step 603: when x is constant, d is selected such that _xy And the minimum y is temporarily considered as the number of the same target in two adjacent time nodes.

Step 604: if d is _xy And if the number is less than the given threshold value T, determining that x and y are the numbers of the same target in two adjacent time nodes, otherwise, determining that x is the number of a new target which appears in the monitoring scene for the first time in the (i +1) th time node.

Step 605: sequentially connecting the coordinates of the same target in the two adjacent time nodes judged in the step 604 under the three-dimensional space coordinate system, adding the type and the start-stop time node of the target to which the connecting line belongs to each coordinate connecting line, and forming the motion trajectory data of the target under the three-dimensional space coordinate system under the scene monitored by each camera by the data.

The embodiment of the invention can restore the position change information of the target in the real space, can better track the target across scenes, realizes the fusion of the target information acquired by a plurality of cameras, and provides more information for high-level target behavior analysis.

While particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that, based upon the teachings herein, changes and modifications may be made without departing from this exemplary embodiment and its broader aspects. Therefore, the appended claims are to encompass within their scope all such changes and modifications as are within the true spirit and scope of this exemplary embodiment of the invention.

Claims

1. A multi-camera data fusion method based on a space coordinate system is characterized by comprising the following steps:

deploying each camera in a scene to be monitored;

constructing a training data set aiming at a target to be extracted, and finishing the training of a target detection and recognition model; extracting the category of a target in video data acquired by each camera and the position information of the target under a two-dimensional image coordinate system based on a target detection and identification model, and establishing a coordinate mapping relation between the two-dimensional image coordinate system and a three-dimensional space coordinate system;

carrying out target detection and target identification processing on video stream data acquired by cameras in a plurality of continuous scenes, and extracting the category information of targets appearing in each frame and the position information of the targets in a two-dimensional image coordinate system;

based on the coordinate mapping relation, mapping the position information of the target in the two-dimensional image coordinate system into the coordinate of the target in the three-dimensional space coordinate system; and finding the same targets in the adjacent time nodes according to the distance information between each target in the current time node and each target in the previous time node, and connecting the same targets to obtain the motion trail data of the targets in the three-dimensional space coordinate system.

2. The method according to claim 1, wherein the step of establishing a coordinate mapping relationship between the two-dimensional image coordinate system and the three-dimensional space coordinate system comprises:

for a scene monitored by each camera, pixel coordinates in a two-dimensional image coordinate system with 50 uniform distributions are selected and are marked as { (u) ₁ ,v ₁ ),(u ₂ ,v ₂ ),……,(u ₅₀ ,v ₅₀ ) And simultaneously acquiring corresponding space coordinates of the 50 pixel coordinates in a real three-dimensional space coordinate system, and recording as { (X) ₁ ,Y ₁ ,Z ₁ ),(X ₂ ,Y ₂ ,Z ₂ ),……,(X ₅₀ ,Y ₅₀ ,Z ₅₀ ) Acquiring coordinates of each point in a real three-dimensional space coordinate system through manual measurement or a GPS positioning method;

wherein the parameter (w) ₁ ,w ₂ ,b ₁ ,w ₃ ,w ₄ ,b ₂ ) Is constant, found by least squares so that all the fit resultsSum of squares of deviations from actual data

And

3. the method according to claim 1, wherein the step of finding the same object in the adjacent time nodes according to the distance information between each object in the current time node and each object in the previous time node, and connecting the same objects to obtain the motion trajectory data of the objects in the three-dimensional space coordinate system specifically comprises:

calculating the coordinates ((i +1) of each target in the (i +1) th time node under a three-dimensional space coordinate system ₁ ～(i+1) _a ) Coordinate (i) of each target in the ith time node in a three-dimensional space coordinate system ₁ ～i _b ) Distance between, noted d _xy Where x ∈ [1, a ]]，y∈[1,b]；

When x is a constant value, temporarily consider d to be _xy Minimum y and x are the codings of the same object in two adjacent time nodesNumber;

4. The method according to claim 1, wherein the step of constructing a training data set for the target to be extracted to complete training of the target detection and recognition model specifically comprises:

step 201: for each target needing to be extracted, 500-1000 pictures containing the target are collected, and the pictures comprise the targets shot from different angles;

step 206: and training a target detection and recognition model.

5. The method according to claim 1, wherein the step of extracting the category of the target and the position information of the target under the two-dimensional image coordinate system in the video data collected by each camera based on the target detection and recognition model specifically comprises:

for video stream data acquired by cameras in a plurality of continuous scenes, target detection and target identification processing is carried out by using a target detection and identification model, and the category information of a target in a current frame and the position information of the target in a two-dimensional image coordinate system, which are acquired by each camera, are extracted every 1 second; meanwhile, the ID and the acquisition time of the camera to which the target belongs are acquired, and the information of the target is formed by the ID and the acquisition time and the position information of the target.