CN117095370A - Multi-camera detection target fusion and blind supplementing method - Google Patents

Multi-camera detection target fusion and blind supplementing method

Info

Publication number
CN117095370A
CN117095370A (application CN202311186896.2A)
Authority
CN
China
Prior art keywords
target
coordinates
pixel
targets
coordinate system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311186896.2A
Other languages
Chinese (zh)
Inventor
胡丽娟
张琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chelutong Technology Chengdu Co ltd
Original Assignee
Chelutong Technology Chengdu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chelutong Technology Chengdu Co ltd filed Critical Chelutong Technology Chengdu Co ltd
Priority to CN202311186896.2A priority Critical patent/CN117095370A/en
Publication of CN117095370A publication Critical patent/CN117095370A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/54Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a multi-camera detection target fusion and blind-area supplementing method, which comprises: installing a plurality of cameras at an intersection, acquiring a plurality of video source images, and establishing a pixel coordinate system; performing target detection on each video source image to obtain the target detection type; according to the target detection type and box characteristics, taking the pixel coordinates of a target pixel point in the pixel coordinate system for each detected target and converting them to obtain target coordinates; defining a specific area and a region of interest for each camera, and directly outputting all targets in the specific area; establishing a custom coordinate system and performing a bird's-eye-view perspective transformation on the target coordinates in the region of interest to obtain bird's-eye-view coordinates; and performing target fusion based on the bird's-eye-view coordinates and outputting all targets in the region of interest. The invention avoids the errors caused by fusing only surface features of the target box images, requires no joint calibration of the video source images, and saves a great deal of time and labor cost.

Description

Multi-camera detection target fusion and blind supplementing method
Technical Field
The invention relates to the technical field of intelligent traffic detection, and in particular to a multi-camera detection target fusion and blind-area supplementing method.
Background
Intelligent traffic plays a vital role in the sustainable and economic development of smart city construction. With rising living standards, the numbers of motor vehicles, non-motor vehicles and pedestrians in traffic scenes keep increasing, and counting traffic flow and traffic violations manually can no longer meet the growing requirements on roadside real-time perception, violation alarming, evidence preservation and the like. A scheme in which multiple roadside cameras cooperatively detect targets is therefore commonly used to compensate for blind areas on the road, improving the recall rate of target detection and reducing the miss rate.
Existing methods for fusing the detection results of multiple cameras include:
1. Feature-based fusion: target features such as color, texture and shape are extracted, and fusion is performed on these features.
2. Deep-learning-based target fusion: a deep learning model is obtained by jointly training on the images of multiple cameras, high-level semantic features are extracted from the images, and target fusion is realized using these features.
The above methods have at least the following disadvantages:
1. For fusion based on color, texture and shape features: in roadside projects the multiple cameras are usually installed so as to shoot toward the middle of the intersection from four directions. Because of the different shooting angles, the texture and shape of the same target differ slightly between cameras, and because each camera is affected differently by the illumination direction, the color of the target also differs slightly between cameras installed in different directions, which reduces the fusion accuracy.
2. Before the images of multiple cameras can be jointly trained, their image data must be jointly calibrated, which brings considerable time and labor cost.
Disclosure of Invention
The invention aims to provide a multi-camera detection target fusion and blind-area supplementing method which avoids the errors caused by fusing only surface features of the target box images, requires no joint calibration of the video source images, and saves a great deal of time and labor cost.
The embodiment of the invention is realized by the following technical scheme:
the method for fusing and blind supplementing of the detection targets by using the multiple cameras is characterized by comprising the following steps of:
installing a plurality of cameras at an intersection, acquiring a plurality of video source images, and establishing a pixel coordinate system;
respectively carrying out target detection on a plurality of video source images to obtain a target detection type;
according to the target detection type and box characteristics, taking the pixel coordinates of the target pixel points under the pixel coordinate system of each detected target, and converting the pixel coordinates to obtain target coordinates;
defining a specific area and an interested area of each camera, and directly outputting all targets in the specific area; establishing a custom coordinate system, and performing aerial view perspective transformation on target coordinates in the region of interest to obtain aerial view coordinates;
and performing target fusion based on the aerial view coordinates, and outputting all targets of the region of interest.
In one embodiment of the invention, the target detection types include motor vehicle targets and non-motor vehicle targets.
In one embodiment of the invention, the specific method of taking the pixel coordinates of a target pixel point in the pixel coordinate system for each detected target and converting them to obtain the target coordinates, according to the target detection type and box characteristics, is as follows:
when the target belongs to the non-motor vehicle targets, the conversion formula (formula I) is:
x = x0, y = y0 + h/2 - 2
where x and y are the target coordinates, x0 and y0 are the pixel coordinates of the box center, and h is the height of the box.
In one embodiment of the invention, the specific method of taking the pixel coordinates of a target pixel point in the pixel coordinate system for each detected target and converting them to obtain the target coordinates, according to the target detection type and box characteristics, further provides:
when the target belongs to the motor vehicle targets and its box satisfies the far-field relation (a condition on the y coordinate of the lower edge of the box, y0 + h/2, relative to the image height H), formula I is used as the conversion formula;
when the target belongs to the motor vehicle targets and does not satisfy the far-field relation, a threshold is set on the height-to-width ratio of the box;
when the aspect ratio of the target box is greater than the threshold, the conversion formula (formula II) is:
x = x0, y = y0 + h/4
when the aspect ratio of the target box is not greater than the threshold, the conversion formula (formula III) takes a pixel close to the center point of the lower edge of the box;
in the above formulas, x and y are the target coordinates, x0 and y0 are the pixel coordinates of the box center, h is the height of the box, w is the width of the box, and H is the height of a single frame image.
In one embodiment of the invention, the specific formula for performing the bird's-eye-view perspective transformation on the target coordinates in the region of interest to obtain the bird's-eye-view coordinates is:
u' = (k11·x + k12·y + k13) / (k31·x + k32·y + 1)
v' = (k21·x + k22·y + k23) / (k31·x + k32·y + 1)
where u' and v' are the bird's-eye-view coordinates, k11, k12, k13, k21, k22, k23, k31 and k32 are the transformation coefficients, and x and y are the target coordinates.
In one embodiment of the invention, the specific method of performing target fusion based on the bird's-eye-view coordinates and outputting all targets in the region of interest includes:
setting distance thresholds Dmin and Dmax, and traversing all targets in the region of interest of the camera Skmax with the largest number of targets and in the regions of interest of the other cameras Si;
for each target in the region of interest of camera Skmax, finding the target in the region of interest of camera Si with the smallest Euclidean distance to it, recorded as dmin;
comparing dmin with Dmin and Dmax: if dmin > Dmax, the two are not the same target; if dmin < Dmin, the two are the same target and are fused; if Dmin < dmin < Dmax, it is judged whether the target types of the two are consistent, and if consistent they are fused as the same target, otherwise fusion is attempted again on the next frame of the video source images.
In one embodiment of the invention, the specific method of performing target fusion based on the bird's-eye-view coordinates and outputting all targets in the region of interest further includes:
setting a fusion threshold, and stopping fusion of a target if the number of fusion attempts exceeds the fusion threshold without success.
The technical scheme of the embodiment of the invention has at least the following advantages and beneficial effects:
according to the invention, through a joint deployment scheme of a plurality of cameras, the blind areas caused by mutual shielding among targets in a road target detection scene are compensated, the omission rate of target detection is reduced, the reduction of the target detection accuracy caused by factors such as illumination, weather and the like is avoided, and the robustness of target detection is improved; meanwhile, by fusing the image information of the cameras, the position and the motion trail of the target can be tracked more accurately, the continuity of target tracking is enhanced, and the accuracy of target identification is improved.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a view of an installation plan of an intersection camera;
FIG. 3 is a schematic view of a camera shooting an intersection;
FIG. 4 is a schematic diagram of the type of target detection;
FIG. 5 is a schematic illustration of a bird's eye perspective transformation;
FIG. 6 is a schematic diagram of target fusion;
FIG. 7 is a schematic diagram of a custom coordinate system;
fig. 8 is a schematic diagram of a pixel coordinate system.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1:
Referring to FIGS. 1-8, a multi-camera detection target fusion and blind-area supplementing method includes the following steps:
100. installing a plurality of cameras at an intersection, acquiring a plurality of video source images, and establishing a pixel coordinate system;
200. performing target detection on each of the video source images to obtain the target detection types;
300. according to the target detection type and box characteristics, taking the pixel coordinates of a target pixel point in the pixel coordinate system for each detected target, and converting the pixel coordinates to obtain the target coordinates;
400. defining a specific area and a region of interest for each camera, and directly outputting all targets in the specific area; establishing a custom coordinate system, and performing a bird's-eye-view perspective transformation on the target coordinates in the region of interest to obtain bird's-eye-view coordinates;
500. performing target fusion based on the bird's-eye-view coordinates, and outputting all targets in the region of interest.
In step 100, the pixel coordinate system is established as shown in fig. 8: the upper-left corner of the picture is taken as the origin of coordinates, the horizontal direction as the X axis, and the vertical direction as the Y axis.
The specific method of step 200 is:
performing target detection on each video source image with the target detection algorithm YOLOv8, obtaining the target detection type, and judging whether the target belongs to the motor vehicle targets or the non-motor vehicle targets.
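As an illustrative sketch only (not part of the original disclosure), step 200 could be implemented along the following lines with the Ultralytics YOLOv8 package; the class-name grouping into motor and non-motor vehicle targets, the weight file and the variable names are assumptions made for illustration:

```python
# Sketch of step 200: per-source detection with YOLOv8 (assumed Ultralytics API and class grouping).
from ultralytics import YOLO

MOTOR_CLASSES = {"car", "bus", "truck"}                  # assumed mapping to motor vehicle targets
NON_MOTOR_CLASSES = {"person", "bicycle", "motorcycle"}  # assumed mapping to non-motor vehicle targets

model = YOLO("yolov8n.pt")  # any YOLOv8 detection weights

def detect_targets(frame):
    """Return (kind, x0, y0, w, h) tuples for one video-source frame."""
    detections = []
    for box in model(frame)[0].boxes:
        name = model.names[int(box.cls)]
        x0, y0, w, h = box.xywh[0].tolist()  # box center (x0, y0), width w, height h
        if name in MOTOR_CLASSES:
            detections.append(("motor", x0, y0, w, h))
        elif name in NON_MOTOR_CLASSES:
            detections.append(("non_motor", x0, y0, w, h))
    return detections
```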
The specific method of step 300 is:
when the target belongs to the non-motor vehicle targets, the conversion formula (formula I) is:
x = x0, y = y0 + h/2 - 2
where x and y are the target coordinates, x0 and y0 are the pixel coordinates of the box center, and h is the height of the box;
when the target belongs to the motor vehicle targets and its box satisfies the far-field relation (a condition on the y coordinate of the lower edge of the box, y0 + h/2, relative to the image height H), formula I is used as the conversion formula;
when the target belongs to the motor vehicle targets and does not satisfy the far-field relation, a threshold is set on the height-to-width ratio of the box:
when the aspect ratio of the target box is greater than the threshold, the conversion formula (formula II) is:
x = x0, y = y0 + h/4
when the aspect ratio of the target box is not greater than the threshold, the conversion formula (formula III) takes a pixel close to the center point of the lower edge of the box;
in the above formulas, x and y are the target coordinates (the selected target pixel position in the pixel coordinate system), x0 and y0 are the pixel coordinates of the box center, h is the height of the box, w is the width of the box, and H is the height of a single frame image.
The specific formula of step 400 is:
u' = (k11·x + k12·y + k13) / (k31·x + k32·y + 1)
v' = (k21·x + k22·y + k23) / (k31·x + k32·y + 1)
where u' and v' are the bird's-eye-view coordinates, k11, k12, k13, k21, k22, k23, k31 and k32 are the transformation coefficients, and x and y are the target coordinates.
In step 400, the custom coordinate system is established as shown in fig. 7: a reference point is taken at the center of the target detection area at the intersection, for example the center point of the middle area of the intersection, and a suitable coordinate system is set. For example, according to the size of the intersection and taking into account the width of the road surface on both sides, a suitable position is selected as the origin of coordinates, for instance the intersection point of the extension lines of the perpendiculars of two adjacent roads of the intersection, and the directions of the x and y coordinate axes are set, giving the coordinate system Oxy shown schematically in fig. 7;
It should be noted that in a real scene the origin of the coordinate system Oxy often falls on a building, which makes it inconvenient to obtain the position of a selected calibration point on the image directly in the custom coordinate system. Therefore, when obtaining the corresponding point in the custom coordinate system, the coordinates of the point can first be obtained in the auxiliary coordinate system O'x'y' shown in fig. 7, and the coordinate values of the point in the coordinate system Oxy are then calculated by translation.
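A minimal sketch of this translation, assuming the offset of the auxiliary origin O' expressed in the Oxy frame is known (the offset values below are placeholders):

```python
# Translate a calibration point measured in the auxiliary frame O'x'y' into the custom frame Oxy.
# (dx, dy) is the assumed position of the O' origin in Oxy coordinates (placeholder values).
def to_oxy(xp, yp, dx=-15.0, dy=-20.0):
    return xp + dx, yp + dy
```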
The specific method of step 500 is:
setting distance thresholds Dmin and Dmax, and traversing all targets in the region of interest of the camera Skmax with the largest number of targets and in the regions of interest of the other cameras Si;
for each target in the region of interest of camera Skmax, finding the target in the region of interest of camera Si with the smallest Euclidean distance to it, recorded as dmin;
comparing dmin with Dmin and Dmax: if dmin > Dmax, the two are not the same target; if dmin < Dmin, the two are the same target and are fused; if Dmin < dmin < Dmax, it is judged whether the target types of the two are consistent, and if consistent they are fused as the same target, otherwise fusion is attempted again on the next frame of the video source images.
In one embodiment of the invention, the specific method of performing target fusion based on the target positions after the bird's-eye-view perspective transformation and outputting all targets at the intersection together with their types further includes:
setting a fusion threshold, and stopping fusion of a target if the number of fusion attempts exceeds the fusion threshold without success.
Example 2:
this example is a specific analysis of example 1.
In step 100, the number of cameras is preferably four, installed as shown in fig. 2, and the video source images are acquired from the cameras in real time.
In step 200, target detection is performed on each of the video source images using the target detection algorithm YOLOv8, and the obtained target detection types include motor vehicle targets and non-motor vehicle targets.
In step 300, as shown in fig. 3, a target in the middle of the intersection is perceived by multiple video sources; the perception result is the center point coordinates (x0, y0) and the width (w) and height (h) of the box, as in the target detection result example of fig. 4.
In order to obtain the pixel position of the target more accurately, so that it can be mapped more accurately into the custom coordinate system, a specific pixel point is selected as the target's pixel position based on the target type, a threshold set on the height-to-width ratio of the box, and the information of the whole box combined with the target's pixel position and box height.
1. For non-motor vehicle targets, such as pedestrians, bicycles and (electric) motorcycles, the ground occupancy and space occupancy of a single target are relatively small, and the pixel position of the target is obtained through formula I, namely the point two pixels above the center of the lower edge of the target box.
The center point of the lower edge of the box is not taken directly because, when the data are annotated, the box boundary is the minimum enclosing rectangle of the detected object, and a pixel lying exactly on the boundary is not an accurate target pixel.
2. For motor vehicle targets, such as cars, buses and vans, the ground occupancy and space occupancy of a single target are larger. When selecting a pixel point on the image as the target position, thresholds are set according to the aspect ratio of the target box, and the target pixel position (x, y) is calculated differently in the following cases; in this embodiment the box aspect-ratio threshold is set to 1.
(a) When the target satisfies the far-field relation, the target is considered to lie far away in the image field of view, and the target pixel position is obtained through formula I. The relation involves the y coordinate of the lower edge of the box and the height H of a single frame image.
(b) When the target does not satisfy the far-field relation, two cases are distinguished according to the height-to-width ratio h/w of the target box:
I. When the target box satisfies h/w > 1 (aspect ratio greater than the threshold), the target is considered to be travelling longitudinally relative to the field of view, in a forward or backward driving state (the front or the rear of the motor vehicle faces the image), and the pixel position of the target is obtained through formula II, taking this characteristic of the image into account.
Here the pixel position is not the center point of the box; instead, the box height is taken into account and the point obtained by shifting the box center downwards by 1/4 of the box height is used, which effectively avoids the problem that the ground location corresponding to the box center pixel is inconsistent with the actual position of the target.
II. When the target box satisfies h/w ≤ 1 (aspect ratio not greater than the threshold), the target is considered to be turning or travelling transversely relative to the field of view (a side of the vehicle is visible in the image), and the pixel position of the target is obtained through formula III. For a vehicle travelling transversely relative to the field of view, formula III gives a pixel position close to the center point of the lower edge of the box.
In this embodiment, in formulas I, II and III, x and y are the target coordinates, x0 and y0 are the pixel coordinates of the box center, h is the height of the box, and w is the width of the box.
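The selection rules described above can be condensed into a short sketch. This is only an illustration: the exact far-field relation is not spelled out in the text, so the comparison of the box lower edge with half the frame height used below is an assumption, and formula III is approximated here by the lower-edge rule of formula I:

```python
# Sketch of step 300: choose the target pixel (x, y) from a detection (x0, y0, w, h).
# The far-field test (lower edge above half the frame height H) is an assumption;
# the aspect-ratio threshold of 1, the h/4 shift and the "two pixels up" rule follow the text.
def target_pixel(kind, x0, y0, w, h, H, aspect_threshold=1.0):
    lower_edge_y = y0 + h / 2.0
    if kind == "non_motor":
        return x0, lower_edge_y - 2.0        # formula I: two pixels above the lower-edge center
    if lower_edge_y < H / 2.0:               # assumed far-field relation -> formula I
        return x0, lower_edge_y - 2.0
    if h / w > aspect_threshold:             # longitudinal travel -> formula II
        return x0, y0 + h / 4.0              # box center shifted down by h/4
    return x0, lower_edge_y - 2.0            # transverse travel -> formula III (near the lower edge)
```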
In step 400, the formula derivation method is as follows:
1. Four groups of pixel coordinates and custom coordinates are calibrated: four suitable pixel points are selected as the calibration points before perspective transformation.
2. According to the custom coordinate origin, the coordinate axis directions and the reference point at the intersection, the positions of the selected calibration points in the custom coordinate system are determined as the calibration points after perspective transformation; specifically:
(a) A coordinate system is customized at the intersection: a reference point is taken at the center of the target detection area, for example the center point of the middle area of the intersection, and a suitable coordinate system is set. For example, according to the size of the intersection and taking into account the width of the road surface on both sides, a suitable position is selected as the origin of coordinates, for instance the intersection point of the extension lines of the perpendiculars of two adjacent roads of the intersection, and the directions of the x and y coordinate axes are set, giving the coordinate system Oxy shown schematically in fig. 7.
(b) A suitable coordinate value is defined for the reference point in the Oxy coordinate system; in a real scene the origin of the coordinate system Oxy often falls on a building, which makes it inconvenient to obtain the position of a selected calibration point on the image directly in the custom coordinate system, so when obtaining the corresponding point in the custom coordinate system, the coordinates of the point can first be obtained in the coordinate system O'x'y', and the coordinate values of the point in the coordinate system Oxy are then calculated by translation.
3. For the four groups of one-to-one corresponding points calibrated on each video source image, the perspective transformation matrix of each video source is calculated separately; specifically:
(a) Let A, B, C and D be, in order, the four pixel calibration points before perspective transformation on a given video source image.
(b) Let A', B', C' and D' be the calibration points corresponding to A, B, C and D in the custom coordinate system; A', B', C' and D' are regarded as the points obtained by the perspective transformation of A, B, C and D from the pixel coordinate system into the custom coordinate system.
(c) Let M be the perspective transformation matrix and (x, y, 1) the homogeneous coordinates of an original image pixel point; the transformed homogeneous coordinates are:
(x', y', w')^T = M · (x, y, 1)^T, with M = [m11 m12 m13; m21 m22 m23; m31 m32 m33]
from which:
x' = m11·x + m12·y + m13, y' = m21·x + m22·y + m23, w' = m31·x + m32·y + m33
and the transformed coordinates are therefore:
u' = x'/w' = (m11·x + m12·y + m13) / (m31·x + m32·y + m33)
v' = y'/w' = (m21·x + m22·y + m23) / (m31·x + m32·y + m33)
Let m33 = 1; then:
u' = (m11·x + m12·y + m13) / (m31·x + m32·y + 1)
v' = (m21·x + m22·y + m23) / (m31·x + m32·y + 1)   (formula IV)
The perspective transformation matrix can be obtained by substituting the four groups of mapping points A, B, C, D and A', B', C', D' into formula IV and solving the resulting simultaneous equations.
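As a hedged sketch of this step (not the patent's own implementation), the perspective matrix and the bird's-eye-view mapping can be obtained with OpenCV from the four calibrated point pairs; the calibration values below are placeholders:

```python
# Sketch of step 400: bird's-eye-view perspective transform for one video source.
import cv2
import numpy as np

# Four calibration points A, B, C, D in the pixel coordinate system (placeholder values)
# and their counterparts A', B', C', D' in the custom coordinate system Oxy.
src = np.float32([[412, 655], [988, 640], [1180, 850], [260, 870]])
dst = np.float32([[-8.0, 12.0], [8.0, 12.0], [8.0, -4.0], [-8.0, -4.0]])

M = cv2.getPerspectiveTransform(src, dst)   # 3x3 matrix, solved with m33 normalised to 1

def to_birds_eye(points_xy):
    """Map target pixel positions, shape (N, 2), to bird's-eye-view coordinates (u', v')."""
    pts = np.float32(points_xy).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, M).reshape(-1, 2)
```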
In step 400, the final goal of using multiple cameras for target detection at an intersection is to accurately output all targets and types within the effective detection range of the intersection. When the detection result is finally output, all targets are classified into two types.
One category is targets that do not require fusion: a specific area (ROS, region of special) is defined for each camera as a range that is effectively detected by that camera but is not covered by the field of view, or is poorly angled, for the other cameras, such as a lane area outside the pavement. Target detection in the ROS is completed by a single camera alone, and no Hungarian optimal matching fusion is needed.
The other category is targets that are output after Hungarian matching fusion: a region of interest (ROI) is defined for each camera as a range that is effectively detected by multiple video sources, such as the middle area of the intersection.
For the targets in the ROI of each camera:
1. First, the number of targets k in the ROI of each camera is counted, and the camera corresponding to the maximum value kmax is recorded;
2. The bird's-eye-view perspective transformation is applied to the ROI of each camera, converting all targets in each camera's ROI into the custom coordinate system.
In step 500, in the custom coordinate system, the detection targets in the ROI of the camera corresponding to kmax (denoted Skmax) are fused with the ROI detection targets of each of the other sources (denoted Si, i = 1, 2, 3) using the Hungarian optimal matching algorithm. The cost matrix is formed from the pairwise Euclidean distances, in the custom coordinate system, between the targets in the ROIs of different cameras. Distance thresholds Dmin and Dmax also need to be set. The targets of camera Skmax and camera Si are traversed; for each target in the ROI of Skmax, the target with the smallest Euclidean distance in the ROI of Si is found and the smallest distance value dmin is recorded: if dmin > Dmax, the two are considered not to be the same target; if dmin < Dmin, the two are considered the same target, and the type of the fused target is determined by the one with the larger box area, since the one with the larger box area is considered less likely to be occluded and its recognition accuracy is therefore higher; if Dmin < dmin < Dmax, it is checked whether the types of the two targets are consistent: if consistent, the two are considered the same target; if not, they are treated as targets to be fused again using the target detection data of the next frame, and if fusion does not succeed within 10 consecutive frames, they are no longer fused.
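A minimal sketch of this fusion rule, assuming SciPy's Hungarian solver and illustrative threshold values (Dmin, Dmax and the data layout are assumptions; the per-pair decisions mirror the description above):

```python
# Sketch of step 500: Hungarian matching of ROI targets between S_kmax and one other camera S_i.
import numpy as np
from scipy.optimize import linear_sum_assignment

D_MIN, D_MAX = 1.0, 3.0   # assumed distance thresholds, in custom-coordinate units

def fuse(targets_kmax, targets_i):
    """Each target is a dict with 'uv' (bird's-eye coords), 'type' and 'box_area'."""
    if not targets_kmax or not targets_i:
        return list(targets_kmax), []
    fused, retry = [], []
    cost = np.array([[np.hypot(a["uv"][0] - b["uv"][0], a["uv"][1] - b["uv"][1])
                      for b in targets_i] for a in targets_kmax])
    rows, cols = linear_sum_assignment(cost)          # Hungarian optimal matching on Euclidean cost
    for r, c in zip(rows, cols):
        a, b, d_min = targets_kmax[r], targets_i[c], cost[r, c]
        if d_min > D_MAX:                             # too far apart: not the same target
            continue
        if d_min < D_MIN:                             # same target: keep the larger-box detection's type
            fused.append(a if a["box_area"] >= b["box_area"] else b)
        elif a["type"] == b["type"]:                  # ambiguous distance but consistent type: fuse
            fused.append(a)
        else:                                         # ambiguous and inconsistent: retry on the next frame
            retry.append((a, b))
    return fused, retry
```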
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A multi-camera detection target fusion and blind-area supplementing method, characterized by comprising the following steps:
installing a plurality of cameras at an intersection, acquiring a plurality of video source images, and establishing a pixel coordinate system;
performing target detection on each of the video source images to obtain the target detection types;
according to the target detection type and box characteristics, taking the pixel coordinates of a target pixel point in the pixel coordinate system for each detected target, and converting the pixel coordinates to obtain the target coordinates;
defining a specific area and a region of interest for each camera, and directly outputting all targets in the specific area; establishing a custom coordinate system, and performing a bird's-eye-view perspective transformation on the target coordinates in the region of interest to obtain bird's-eye-view coordinates;
performing target fusion based on the bird's-eye-view coordinates, and outputting all targets in the region of interest.
2. The multi-camera detection target fusion and blind-area supplementing method according to claim 1, wherein the target detection types include motor vehicle targets and non-motor vehicle targets.
3. The multi-camera detection target fusion and blind-area supplementing method according to claim 2, wherein the specific method of taking the pixel coordinates of a target pixel point in the pixel coordinate system for each detected target and converting them to obtain the target coordinates, according to the target detection type and box characteristics, is as follows:
when the target belongs to the non-motor vehicle targets, the conversion formula (formula I) is:
x = x0, y = y0 + h/2 - 2
where x and y are the target coordinates, x0 and y0 are the pixel coordinates of the box center, and h is the height of the box.
4. The multi-camera detection target fusion and blind-area supplementing method according to claim 3, wherein the specific method of taking the pixel coordinates of a target pixel point in the pixel coordinate system for each detected target and converting them to obtain the target coordinates, according to the target detection type and box characteristics, further comprises:
when the target belongs to the motor vehicle targets and its box satisfies the far-field relation (a condition on the y coordinate of the lower edge of the box, y0 + h/2, relative to the image height H), formula I is used as the conversion formula;
when the target belongs to the motor vehicle targets and does not satisfy the far-field relation, a threshold is set on the height-to-width ratio of the box;
when the aspect ratio of the target box is greater than the threshold, the conversion formula (formula II) is:
x = x0, y = y0 + h/4
when the aspect ratio of the target box is not greater than the threshold, the conversion formula (formula III) takes a pixel close to the center point of the lower edge of the box;
in the above formulas, x and y are the target coordinates, x0 and y0 are the pixel coordinates of the box center, h is the height of the box, w is the width of the box, and H is the height of a single frame image.
5. The multi-camera detection target fusion and blind-area supplementing method according to claim 1, wherein the specific formula for performing the bird's-eye-view perspective transformation on the target coordinates in the region of interest to obtain the bird's-eye-view coordinates is:
u' = (k11·x + k12·y + k13) / (k31·x + k32·y + 1)
v' = (k21·x + k22·y + k23) / (k31·x + k32·y + 1)
where u' and v' are the bird's-eye-view coordinates, k11, k12, k13, k21, k22, k23, k31 and k32 are the transformation coefficients, and x and y are the target coordinates.
6. The multi-camera detection target fusion and blind-area supplementing method according to claim 1, wherein the specific method of performing target fusion based on the bird's-eye-view coordinates and outputting all targets in the region of interest comprises:
setting distance thresholds Dmin and Dmax, and traversing all targets in the region of interest of the camera Skmax with the largest number of targets and in the regions of interest of the other cameras Si;
for each target in the region of interest of camera Skmax, finding the target in the region of interest of camera Si with the smallest Euclidean distance to it, recorded as dmin;
comparing dmin with Dmin and Dmax: if dmin > Dmax, the two are not the same target; if dmin < Dmin, the two are the same target and are fused; if Dmin < dmin < Dmax, it is judged whether the target types of the two are consistent, and if consistent they are fused as the same target, otherwise fusion is attempted again on the next frame of the video source images.
7. The multi-camera detection target fusion and blind-area supplementing method according to claim 6, wherein the specific method of performing target fusion based on the bird's-eye-view coordinates and outputting all targets in the region of interest further comprises:
setting a fusion threshold, and stopping fusion of a target if the number of fusion attempts exceeds the fusion threshold without success.
CN202311186896.2A 2023-09-14 2023-09-14 Multi-camera detection target fusion and blind supplementing method Pending CN117095370A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311186896.2A CN117095370A (en) 2023-09-14 2023-09-14 Multi-camera detection target fusion and blind supplementing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311186896.2A CN117095370A (en) 2023-09-14 2023-09-14 Multi-camera detection target fusion and blind supplementing method

Publications (1)

Publication Number Publication Date
CN117095370A true CN117095370A (en) 2023-11-21

Family

ID=88773522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311186896.2A Pending CN117095370A (en) 2023-09-14 2023-09-14 Multi-camera detection target fusion and blind supplementing method

Country Status (1)

Country Link
CN (1) CN117095370A (en)

Similar Documents

Publication Publication Date Title
US8457392B2 (en) Identifying an object in an image using color profiles
US8244027B2 (en) Vehicle environment recognition system
US8036424B2 (en) Field recognition apparatus, method for field recognition and program for the same
CN109299674B (en) Tunnel illegal lane change detection method based on car lamp
CN111448478A (en) System and method for correcting high-definition maps based on obstacle detection
US20100201814A1 (en) Camera auto-calibration by horizon estimation
JPH10512694A (en) Method and apparatus for detecting movement of an object in a continuous image
Siogkas et al. Random-walker monocular road detection in adverse conditions using automated spatiotemporal seed selection
CN110717445B (en) Front vehicle distance tracking system and method for automatic driving
CN112215306A (en) Target detection method based on fusion of monocular vision and millimeter wave radar
US10984264B2 (en) Detection and validation of objects from sequential images of a camera
Adamshuk et al. On the applicability of inverse perspective mapping for the forward distance estimation based on the HSV colormap
US20190180121A1 (en) Detection of Objects from Images of a Camera
CN111723778B (en) Vehicle distance measuring system and method based on MobileNet-SSD
CN114120283A (en) Method for distinguishing unknown obstacles in road scene three-dimensional semantic segmentation
CN104156727B (en) Lamplight inverted image detection method based on monocular vision
CN113762134B (en) Method for detecting surrounding obstacles in automobile parking based on vision
CN106803073A (en) DAS (Driver Assistant System) and method based on stereoscopic vision target
CN106340031A (en) Method and device for detecting moving object
CN117095370A (en) Multi-camera detection target fusion and blind supplementing method
WO2022142827A1 (en) Road occupancy information determination method and apparatus
DE102020213799A1 (en) Obstacle detection device and obstacle detection method
JP4055785B2 (en) Moving object height detection method and apparatus, and object shape determination method and apparatus
JP7456716B2 (en) Obstacle detection device and moving body
JP2712809B2 (en) Obstacle detection device for vehicles

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination