CN111914715A - Intelligent vehicle target real-time detection and positioning method based on bionic vision - Google Patents

Intelligent vehicle target real-time detection and positioning method based on bionic vision

Info

Publication number
CN111914715A
Authority
CN
China
Prior art keywords
target
image
focal length
lens
reference image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010721224.7A
Other languages
Chinese (zh)
Other versions
CN111914715B (en)
Inventor
安成刚
张立国
李巍
李会祥
吴程飞
张志强
王增志
张旗
史明亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Langfang Heyi Life Network Technology Co ltd
Original Assignee
Langfang Heyi Life Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Langfang Heyi Life Network Technology Co ltd filed Critical Langfang Heyi Life Network Technology Co ltd
Priority to CN202010721224.7A
Publication of CN111914715A
Application granted
Publication of CN111914715B
Legal status: Active (granted)

Classifications

    • G06V 20/56: Scenes; scene-specific elements; context or environment of the image exterior to a vehicle, by using sensors mounted on the vehicle
    • G06T 7/70: Image analysis; determining position or orientation of objects or cameras
    • G06T 2207/10004: Image acquisition modality; still image; photographic image
    • G06T 2207/20081: Special algorithmic details; training; learning
    • G06T 2207/20084: Special algorithmic details; artificial neural networks [ANN]
    • G06T 2207/30248: Subject of image; vehicle exterior or interior
    • G06V 2201/07: Indexing scheme relating to image or video recognition or understanding; target detection

Abstract

The invention relates to a bionic-vision-based method for real-time detection and positioning of intelligent vehicle targets, comprising: acquiring, in real time, the images of the lenses with different focal lengths in the multi-view imaging device of the intelligent vehicle; detecting the category and position information of each target in each image; dividing the reference image into a common-view area and a non-common-view area according to the homogeneous-coordinate transformation relation of the object space of each lens; for the common-view area of each image, performing three-dimensional reconstruction and positioning of the targets in the common-view area by a binocular different-focal-length stereoscopic vision reconstruction method to obtain the three-dimensional positioning information and category-label semantic information of the targets; and, for the non-common-view area of the reference image, acquiring the angular positioning information of the targets in each non-common-view area, so as to construct a vector map with semantic information for the intelligent vehicle. The method addresses the technical problems that the prior art cannot construct a three-dimensional semantic map online in real time and cannot reduce the amount and complexity of visual point-cloud computation in unmanned driving.

Description

Intelligent vehicle target real-time detection and positioning method based on bionic vision
Technical Field
The invention relates to the technical field of visual detection and positioning, in particular to an intelligent vehicle target real-time detection and positioning method based on bionic vision.
Background
In recent years, as research into bionics has deepened, artificial intelligence technology has advanced at an unprecedented speed and achieved breakthroughs. Building bionic vision sensors and efficient computer vision solutions by imitating the human visual system and the insect compound-eye imaging mechanism, and detecting and positioning targets in real time, are current research hotspots of bionic-vision artificial intelligence. Unlike laser radar, which perceives only distance, a vision sensor can also recognize complex semantic information such as traffic signs, traffic lights and lane lines, so research on unmanned-vehicle technology based on bionic vision has great theoretical significance and application value.
Existing unmanned-driving solutions based on visual simultaneous localization and mapping (SLAM), such as ORB-SLAM, VINS-MONO and DSO, build sparse or dense three-dimensional point-cloud maps from image sequences by extracting and matching feature points over the whole image.
In addition, most current visual SLAM research is based on monocular motion plus inertial navigation or on a parallel binocular stereo vision model, i.e. three-dimensional reconstruction of the target from the parallax of two images taken by one moving camera or by two cameras with identical parameters. Such single-viewpoint imaging struggles to balance field angle, resolution and detection speed.
Therefore, how to effectively reduce the complexity of the visual positioning algorithm and the amount of three-dimensional point-cloud data while constructing a semantic SLAM online, under the premise of a large field of view, high precision and fast perception, has become the engineering difficulty of unmanned-driving technology.
Disclosure of Invention
Technical problem to be solved
In view of the defects and shortcomings of the prior art, the invention provides an intelligent vehicle target real-time detection and positioning method based on bionic vision, and solves the technical problem that, in the prior art, the amount of visual data computation in unmanned driving cannot be reduced, so that a semantic SLAM cannot be constructed online in real time.
(II) technical scheme
In order to achieve the purpose, the invention adopts the main technical scheme that:
in a first aspect, an embodiment of the present invention provides a bionic vision-based method for detecting and positioning an intelligent vehicle target in real time, including:
s1, acquiring images of different focal length lenses in the multi-view imaging device of the intelligent vehicle in real time;
step S2, detecting the type and position information of each target in the image of each lens;
step S3, dividing the reference image into a common visual area and a non-common visual area according to the homogeneous coordinate system transformation relation of object space corresponding to the lenses with different focal lengths by taking the image of the lens with the smallest focal length as the reference image;
step S4, aiming at the common visual area of each image, performing three-dimensional reconstruction and positioning on the target in the common visual area by adopting a binocular different focal length stereoscopic vision reconstruction method to obtain three-dimensional positioning information and category label semantic information of all targets in the common visual area in the reference image;
aiming at the non-common-view area of the reference image, dividing it into upper, lower, left and right areas according to the first-person viewing angle of the camera device to which the image belongs, and acquiring the angular positioning information of the target in each non-common-view area by means of perspective grid division;
and constructing a vector map with semantic information for the intelligent vehicle by using the three-dimensional positioning information, the category label semantic information and the angle positioning information corresponding to the non-common visual area.
Specifically, in practical applications, step S3 is a coordinate transformation in a two-dimensional coordinate system while step S4 is a reconstruction in a three-dimensional coordinate system. The constructed vector map contains the key information of the objects in the images of all focal lengths and is built online in real time, so it can be used in real engineering applications and has high practicability.
Optionally, the step S2 includes:
detecting the type and position information of each target in the image of each lens by adopting a YOLOv5 target real-time detection algorithm;
and/or, the target comprises: traffic lights, speed limit signs, pedestrians, small animals, vehicles or lane lines, etc.;
the category and location information includes: position information, size information, and category information of the object in each image.
Optionally, the step S3 includes:
when the number of the focal length lenses is more than or equal to 3, dividing different focal length lenses into a long focal length lens and a short focal length lens;
regarding the short-focal-length lens, taking an image of a lens with the smallest focal length in the short-focal-length lens as a first reference image, and dividing imaging sections of field angles of other lenses in the short-focal-length lens, which correspond to the first reference image, into a first common visual area and a first non-common visual area;
and regarding the long-focus lens, taking the image of the lens with the minimum focal length in the long-focus lens as a second reference image, and dividing the imaging interval of the second reference image corresponding to the field angle of other lenses in the long-focus lens into a second common visual area and a second non-common visual area.
Optionally, the step S4 includes:
acquiring three-dimensional positioning information and category label semantic information of all targets in a common visual area in a first reference image by adopting a binocular different focal length stereoscopic vision reconstruction method;
acquiring angle positioning information of each target in a non-common visual area I in a reference image I in a perspective grid division mode;
acquiring three-dimensional positioning information and category label semantic information of all targets in a common visual area II in a reference image II by adopting a binocular different-focal-length stereoscopic vision reconstruction method;
acquiring angle positioning information of each target in a non-common visual area II in a reference image II in a perspective grid division mode;
accordingly, the method further comprises step S5,
step S5 includes:
and fusing all the acquired three-dimensional positioning information, category label semantic information, all the angle positioning information and the three-dimensional point cloud data in the multi-view imaging device to construct a three-dimensional semantic map for the intelligent vehicle.
Optionally, the step S4 includes:
acquiring three-dimensional positioning information and category label semantic information of all targets in a common visual area in a first reference image by adopting a binocular different focal length stereoscopic vision reconstruction method; acquiring angle positioning information of each target in a non-common visual area I in a reference image I in a perspective grid division mode;
acquiring three-dimensional positioning information and category label semantic information of all targets in a common visual area II in a reference image II by adopting a binocular different-focal-length stereoscopic vision reconstruction method; acquiring angle positioning information of each target in a non-common visual area II in a reference image II in a perspective grid division mode;
and converting the three-dimensional positioning information and the category label semantic information of all the targets in the common visual area of the first reference image and the second reference image and the angle positioning information of all the targets in the non-common visual area to a preset global coordinate system (latitude and longitude under a geodetic coordinate system) through a series of coordinate transformation.
Optionally, the step S3 includes:
if step S2 detects that the same target has corresponding corner points a, b, c, d in the image of a lens with any other focal length and corresponding corner points a', b', c', d' in the image of the lens with the smallest focal length;
when the coordinate systems of the two lenses do not coincide and/or the focal lengths of the lenses differ, the image of the lens with the smallest focal length is taken as the reference image and, based on the homogeneous-coordinate transformation relation of the object-image space, is divided by the eight corner points a, b, c, d and a', b', c', d' into a 回-shaped (square-ring) region;
wherein the central square of the 回-shaped region is the overlapping imaging area and serves as the common-view area;
and the ring between the central square a, b, c, d and the boundary a', b', c', d' is the non-overlapping imaging area and serves as the non-common-view area.
Optionally, the step S4 includes:
S4-1, establishing a one-to-one mapping of the coordinates of each target of the multi-target image across the images according to the coordinate relation (A1) between the two image planes;
$$x_2 = \frac{f_2}{f_1}x_1 - \frac{f_2\,b}{Z},\qquad y_2 = \frac{f_2}{f_1}y_1 \tag{A1}$$
S4-2, detecting the depth value Z of the target, i.e. the space point M, in three-dimensional space by means of several cameras with different resolutions or focal lengths, according to the binocular parallel vision model;
specifically, the coordinates (X, Y, Z) of the spatial point M are:
$$X = \frac{Z\,x_1}{f_1},\qquad Y = \frac{Z\,y_1}{f_1},\qquad Z = \frac{f_1 f_2\,b}{f_2 x_1 - f_1 x_2} \tag{A2}$$
wherein the focal lengths f1 and f2 of the two camera lenses and the baseline distance b are pre-calibrated parameters, and the corresponding pixel coordinates x1, y1 and x2, y2 are obtained through formula (A1);
and S4-3, reconstructing the detected target in the common-view area of the reference image according to the three-dimensional coordinate relation (A2) of the space point M, and marking the average depth value of the detected target area.
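For illustration only, the following Python sketch implements formulas (A1) and (A2) under the assumption that f1 and f2 are normalized (pixel-unit) focal lengths, b is the baseline in metres, and image coordinates are measured from each principal point; the function names and numeric values are not part of the disclosure.

```python
# Illustrative sketch of formulas (A1) and (A2); f1, f2 are assumed to be the
# calibrated normalized focal lengths (in pixels), b the baseline (in metres),
# and image coordinates are measured relative to each principal point.

def map_coordinates(x1, y1, f1, f2, b, Z):
    """Formula (A1): predict where a target at depth Z appears in image 2."""
    x2 = (f2 / f1) * x1 - f2 * b / Z
    y2 = (f2 / f1) * y1
    return x2, y2

def reconstruct_point(x1, y1, x2, f1, f2, b):
    """Formula (A2): recover (X, Y, Z) of the space point M from the two views."""
    Z = f1 * f2 * b / (f2 * x1 - f1 * x2)   # depth from the different-focal-length disparity
    return Z * x1 / f1, Z * y1 / f1, Z

if __name__ == "__main__":
    f1, f2, b = 4000.0, 2666.0, 0.05        # illustrative calibration values
    x1, y1, Z = 320.0, 120.0, 10.0          # pixel in image 1 and an assumed depth
    x2, _ = map_coordinates(x1, y1, f1, f2, b, Z)
    print(reconstruct_point(x1, y1, x2, f1, f2, b))   # recovers Z = 10.0
```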
Optionally, the short-focus lens is an 8mm focal length lens and a 12mm focal length lens, and the long-focus lens is a 16mm focal length lens and a 25mm focal length lens;
the 8mm and 12mm short-focus lenses are used for monitoring short-distance targets of 5-35m in a large field angle of 90 degrees,
the 16mm and 25mm telephoto lenses are used to monitor a distant target of 10-60m within a small field angle of 40 °.
In a second aspect, an embodiment of the present invention provides an intelligent vehicle driving system, including: the multi-view image forming apparatus comprises a control device and a multi-view image forming device connected with the control device, wherein the multi-view image forming device comprises: at least one short focal length lens, at least one long focal length lens;
and after receiving the images collected by the short-focal-length lens and the long-focal-length lens, the control device adopts the real-time detection and positioning method of the intelligent vehicle of the first aspect to construct the three-dimensional semantic map of the intelligent vehicle on line in real time.
In a third aspect, an embodiment of the present invention further provides an intelligent vehicle, including the intelligent vehicle driving system in the second aspect.
(III) advantageous effects
The invention has the following beneficial effects. The method does not need to process all point-cloud data; it obtains the data of the area where each target is located and derives the target's three-dimensional positioning information through coordinate conversion and three-dimensional reconstruction. This effectively reduces the amount of computation of the visual point-cloud matching and positioning algorithm and improves target positioning accuracy. The method can accurately position targets in different fields of view and at different distances, and greatly improves the efficiency of visual point-cloud data processing for the unmanned vehicle. With its large field angle, high positioning accuracy and high processing speed, it can meet the engineering application requirements of current vision-based unmanned-driving technology on an embedded processor.
It should be noted that the method of the embodiments of the present invention simulates the fast-perception characteristics of the insect compound-eye imaging system. Targets are classified and identified with a deep learning method; the image-plane space of each camera is then divided into a common-view area and a non-common-view area using the spatial transformation between the camera coordinate system and the world coordinate system. In the common-view area, a multi-view different-focal-length three-dimensional reconstruction algorithm positions the target accurately; the object space corresponding to the non-common-view area is divided into perspective grids to position the target coarsely. This enlarges the observation angle, improves the processing efficiency of the intelligent vehicle's vision system, and enables online semantic map construction.
Drawings
Fig. 1 is a flowchart of a method for detecting and positioning an intelligent vehicle target in real time based on bionic vision according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a multi-view imaging device;
FIG. 3 is a diagram showing the real-time detection and segmentation effect of the YOLOv5 algorithm on a target;
FIG. 4 is a schematic diagram of the common view region and the non-common view region corresponding to the 8mm and 12mm short-focus lens groups;
FIG. 5 is a schematic diagram of the common view area and the non-common view area corresponding to the 16mm and 25mm telephoto lens groups;
FIG. 6 is a schematic view of binocular different-focal-length stereoscopic vision;
FIG. 7 is a diagram illustrating a variation curve of the short-focus camera image plane error Dx with the increase of the test distance L;
FIG. 8 is a diagram illustrating a mapping relationship between segmented target regions in an image established according to formula (4) in the second embodiment;
fig. 9 is a schematic flowchart of a method for real-time detection and positioning of an intelligent vehicle target based on bionic vision according to an embodiment of the present invention;
fig. 10 is a schematic diagram of the average depth of the detection target region in the three-dimensional space, which is obtained by reconstruction according to the formula (2) in the second embodiment;
fig. 11 is a schematic diagram of a real-time three-dimensional semantic map constructed by the method shown in fig. 1 according to an embodiment of the present invention.
Detailed Description
In order to better understand the above technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
In the following embodiments, a lens may be the lens of a video camera or of a still camera. Some embodiments are described in terms of lenses and others in terms of cameras or video cameras; both refer to devices for capturing images, and each lens has a fixed focal length.
Example one
As shown in fig. 1, the present embodiment provides a real-time detection and positioning method for an intelligent vehicle target based on bionic vision, which specifically includes the following steps:
s1, acquiring images of different focal length lenses in the multi-view imaging device of the intelligent vehicle in real time;
Step S2: detect the category and position information of each target in the image of each lens.
For example, a YOLOv5 real-time target detection algorithm may be employed to detect the category and location information of the targets in the image of each lens.
The targets of the present embodiment may include: traffic lights, speed-limit signs, pedestrians, small animals, vehicles or lane lines, etc.; this is merely an example and not a limitation, and the targets are determined by the actual image.
The category and location information includes the position information, size information and category information of each target in each image; the category information is the class of each target, and the size information is the size of the area where the target is located.
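As one possible realization of step S2 (not prescribed by this embodiment), the detection could be run per lens image with an off-the-shelf YOLOv5 model; the torch.hub entry point, the 'yolov5s' weights and the file names below are assumptions.

```python
# One possible step-S2 detector: a pretrained YOLOv5 model loaded via torch.hub.
# The 'yolov5s' weights and the file names are assumptions, not part of the patent.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s")  # downloads pretrained weights

def detect_targets(image_paths):
    """Run detection on one image per lens; returns a box table per focal length."""
    detections = {}
    for focal_mm, path in image_paths.items():
        results = model(path)
        # columns: xmin, ymin, xmax, ymax, confidence, class, name
        detections[focal_mm] = results.pandas().xyxy[0]
    return detections

# Example: the four images captured at the same instant by the 8/12/16/25 mm lenses
boxes = detect_targets({8: "cam_8mm.jpg", 12: "cam_12mm.jpg",
                        16: "cam_16mm.jpg", 25: "cam_25mm.jpg"})
```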
And step S3, dividing the reference image into a common visual area and a non-common visual area according to the homogeneous coordinate system transformation relation of object space corresponding to the lenses with different focal lengths by taking the image of the lens with the smallest focal length as the reference image.
In this step, a target area of a two-dimensional coordinate system is mapped on the image.
For example, for lenses with the same resolution but different focal lengths, the object images of the long-focal-length and short-focal-length lenses differ; the reference image is therefore determined on the basis of the homogeneous-coordinate transformation relation of the object spaces corresponding to the lenses with different focal lengths, and is divided into a common-view area and a non-common-view area according to the field angles of the other lenses.
For better understanding, with reference to fig. 4: assume there are four target points (A, B, C, D) in space, imaged at four corner points (a, b, c, d) on the image plane of camera 1 and at four corner points (a', b', c', d') on the image plane of camera 2. From the homogeneous-coordinate transformation relation of object and image, if the coordinate systems of camera 1 and camera 2 do not coincide, or their focal lengths are not the same, the areas enclosed by the image-point coordinates of the four target points (A, B, C, D) divide the reference image of the camera with the smaller focal length into the 回-shaped (square-ring) region shown in fig. 4. The central square represents the imaging area corresponding to the overlapping fields of view of camera 1 and camera 2, i.e. the common-view area; the region from the periphery of the central square to the image boundary represents the imaging area corresponding to the non-overlapping fields of view of camera 1 and camera 2, i.e. the non-common-view area.
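A minimal sketch of this region split, assuming simple pinhole models, a known baseline d and a representative scene depth Z; all numbers below are illustrative, loosely following the calibration values used in the second embodiment.

```python
# Sketch: map the four corners of the longer-focal-length image into the
# reference (shortest-focal-length) image through x_ref = (a_ref/a_long)*x - a_ref*d/Z.
# The representative depth Z and all calibration numbers are illustrative assumptions.

def common_view_rect(w_long, h_long, a_long, b_long, a_ref, b_ref, d, Z):
    """Return (xmin, xmax, ymin, ymax) of the common-view region in the reference
    image, in pixels relative to its principal point."""
    xs, ys = [], []
    for cx in (-w_long / 2, w_long / 2):
        for cy in (-h_long / 2, h_long / 2):
            xs.append((a_ref / a_long) * cx - a_ref * d / Z)
            ys.append((b_ref / b_long) * cy)
    return min(xs), max(xs), min(ys), max(ys)

# 1920x1080 sensors, 12 mm lens (a = b = 4000 px) against the 8 mm reference lens
# (a = b = 2666 px), 0.05 m baseline, representative depth 10 m; everything inside
# the returned rectangle is the common-view area, the surrounding ring is non-common.
print(common_view_rect(1920, 1080, 4000, 4000, 2666, 2666, 0.05, 10.0))
```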
Step S4, aiming at the common visual area of each image, performing three-dimensional reconstruction and positioning on the target in the common visual area by adopting a binocular different focal length stereoscopic vision reconstruction method to obtain three-dimensional positioning information and category label semantic information of all targets in the common visual area in the reference image;
aiming at the non-common-view area of the reference image, dividing it into upper, lower, left and right areas according to the first-person viewing angle of the camera device to which the image belongs, and acquiring the angular positioning information of the target in each non-common-view area by means of perspective grid division;
and constructing a vector map with semantic information for the intelligent vehicle by using the three-dimensional positioning information, the category label semantic information and the angle positioning information corresponding to the non-common visual area.
For better understanding of step S4, the three-dimensional reconstructed location of the common viewing zone is described below with reference to the formula in the second embodiment.
S4-1: before three-dimensional reconstruction and positioning, the coordinate mapping relations of the targets to be reconstructed in each image must be acquired; this is a multi-target detection and matching problem. The most straightforward robust method is a linear search that establishes, for each target, its one-to-one correspondence among the multiple targets of the two images. For example, if image 1 contains three targets A, B and C and image 2 contains three targets A', B' and C', the three one-to-one correspondences A-A', B-B' and C-C' must be established.
In the prior art, the similarity error values of the nine candidate pairs AA', AB', AC', BA', BB', BC', CA', CB', CC' have to be computed separately; as the number of targets N increases, the time complexity is polynomial, O(N²), which is very time-consuming. Although the matching process can be accelerated with a K-D-tree nearest-neighbour search, the time complexity is still linearithmic, O(N log N), so the amount of computation is not fundamentally reduced.
In the present application, the relation (3) between the two image-plane coordinates given in the second embodiment below is used to establish the one-to-one mapping of the multi-target image coordinates directly: knowing the coordinates (x1, y1) of a target in image coordinate system 1, the normalized focal lengths α and β of the cameras, and the depth value Z of the detected target, the coordinates (x2, y2) of the same target in image coordinate system 2 are uniquely determined. The corresponding image-plane coordinates of the same target in the two images are thus obtained directly from the image-plane coordinate relation, without building a mapping from each target to every other target; the running time is independent of the number of matched targets (constant time, O(1)), which effectively reduces the time complexity of the image matching process.
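A minimal sketch of this direct coordinate mapping, used here to pre-locate the region of the corresponding target in image 2; coordinates are taken relative to the principal points, and the function name, box format and numbers are illustrative assumptions.

```python
# Sketch: pre-locate, in image 2, the region where a target detected in image 1
# must appear, so that feature matching only searches this region.  Box coordinates
# are relative to the principal points; a1/b1, a2/b2 are the normalized focal lengths.

def predict_roi_in_image2(box1, Z, a1, b1, a2, b2, d):
    """box1 = (xmin, ymin, xmax, ymax) in image 1; Z is the target's coarse depth."""
    xmin, ymin, xmax, ymax = box1
    sx, sy = a2 / a1, b2 / b1
    shift = a2 * d / Z                     # the Dx term of the image-plane relation
    return (sx * xmin - shift, sy * ymin, sx * xmax - shift, sy * ymax)

# a target box seen by the 12 mm camera, assumed to be about 10 m away
print(predict_roi_in_image2((300.0, 80.0, 420.0, 200.0), 10.0,
                            4000.0, 4000.0, 2666.0, 2666.0, 0.05))
```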
S4-2: three-dimensional reconstruction and positioning, which mainly involves detecting the depth value Z of the target in three-dimensional space.
There are mainly two stereoscopic vision reconstruction methods: one generates multiple views through monocular motion, the other performs three-dimensional reconstruction with two identical cameras placed side by side, i.e. the binocular parallel vision model. The parallel binocular model has a drawback: because two cameras with identical parameters are used for acquisition and computation, their resolution and field of view are the same, so to meet real-time requirements a high-resolution camera ends up with a small field of view and a large field of view comes with low resolution. To balance resolution, field of view and real-time performance, multi-resolution cameras must be used, and there are two kinds. The first is physical: a space-variant-resolution sensor is manufactured by imitating the human retina, i.e. the resolution distribution of the image is changed by changing the arrangement of the sensor's photosensitive elements. Changing image resolution by changing the sensor manufacturing process is costly, requires conversion to a polar coordinate system for normal imaging, and is difficult to use at scale. The second combines several cameras with different resolutions or focal lengths; it is low-cost, easy to operate and widely applicable. Therefore, this embodiment uses the second method to balance resolution and field of view.
Specifically, with reference to the binocular different-focal-length three-dimensional reconstruction model shown in fig. 6, once the focal lengths of the two cameras are known, the triangle similarity principle gives the distance (depth value) of the spatial point M from the cameras as:
$$Z = \frac{f_1 f_2\,b}{f_2 x_1 - f_1 x_2} \tag{1}$$
Equation (1) shows that calculating the target depth Z requires the focal lengths f1 and f2 of the two cameras and the baseline distance b, prior parameters obtained by camera calibration, together with the corresponding pixel coordinates x1 and x2 in the two cameras. With these parameters, the three-dimensional coordinates of the target point M can be accurately reconstructed:
$$X = \frac{Z\,x_1}{f_1},\qquad Y = \frac{Z\,y_1}{f_1},\qquad Z = \frac{f_1 f_2\,b}{f_2 x_1 - f_1 x_2} \tag{2}$$
in this embodiment, the point clouds in the target area are segmented in the image and are reconstructed, and the point cloud X and Y-axis coordinates in the segmented target area are averaged, and the depth value Z is the value with the highest frequency of occurrence of the point cloud depth value in the target area. That is, the three-dimensional point cloud set corresponding to each detected target is represented by only one average three-dimensional coordinate value, as shown in fig. 10.
It should be noted that in this embodiment the vector map described above is constructed online in real time, and differs from prior-art point-cloud semantic maps: only the boundary and category of each target are labelled, and the specific colour and shape features of the target are not reconstructed from point-cloud data, as shown in fig. 11.
In this embodiment, based on the method shown in fig. 1, the following step S5 is further included:
step S5: and fusing all the acquired three-dimensional positioning information, category label semantic information, all the angle positioning information and the three-dimensional point cloud data in the multi-view imaging device to construct a three-dimensional semantic map for the intelligent vehicle.
For example, the point cloud data in the multi-view imaging device includes point cloud data acquired by a laser radar or point cloud data of visual imaging, and the like, and the point cloud data is selected and fused according to actual needs. In the embodiment, the three-dimensional semantic map is constructed in an online manner.
That is to say, the method of this embodiment constructs the semantic map online and in real time. Compared with the prior-art offline approach of point-cloud mapping plus manual labelling, it is simple to operate, requires no subsequent manual annotation, can satisfy engineering applications on an embedded processor, and is highly practical.
In addition, it should be noted that the method of this embodiment may implement real-time online semantic SLAM construction, that is, a partial map or a complete map constructed by data acquired by the image acquisition device in real time.
In a specific implementation process, the step S3 may include:
when the number of the focal length lenses is more than or equal to 3, dividing different focal length lenses into a long focal length lens and a short focal length lens;
regarding the short-focal-length lens, taking an image of a lens with the smallest focal length in the short-focal-length lens as a first reference image, and dividing imaging sections of field angles of other lenses in the short-focal-length lens, which correspond to the first reference image, into a first common visual area and a first non-common visual area;
and regarding the long-focus lens, taking the image of the lens with the minimum focal length in the long-focus lens as a second reference image, and dividing the imaging interval of the second reference image corresponding to the field angle of other lenses in the long-focus lens into a second common visual area and a second non-common visual area. In the second embodiment, four focal length lenses are used for illustration.
In addition, when the number of the focal length lenses is greater than or equal to 3, the step S4 may include:
acquiring three-dimensional positioning information and category label semantic information of all targets in a common visual area in a first reference image by adopting a binocular different focal length stereoscopic vision reconstruction method;
acquiring angle positioning information of each target in a non-common visual area I in a reference image I in a perspective grid division mode;
acquiring three-dimensional positioning information and category label semantic information of all targets in a common visual area II in a reference image II by adopting a binocular different-focal-length stereoscopic vision reconstruction method;
acquiring angle positioning information of each target in a non-common visual area II in a reference image II in a perspective grid division mode;
and converting the three-dimensional positioning information and the category label semantic information of all the targets in the common visual area of the first reference image and the second reference image and the angle positioning information of all the targets in the non-common visual area to a preset global coordinate system through a series of coordinate transformation.
The global coordinate system is the coordinate system in which all targets of the constructed semantic map are located. Unlike virtual coordinate systems such as the image coordinate system and the camera coordinate system, the global coordinate system corresponds to a real place; it can be established at a fixed site and is obtained through GPS-RTK measurement.
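As a hedged illustration of this conversion, a calibrated rotation R and translation t (e.g. surveyed once with GPS-RTK) would carry camera-frame target coordinates into the global frame; the values of R and t below are placeholders, not calibration results.

```python
# Placeholder illustration of the camera-to-global conversion: R and t would come
# from a one-off extrinsic calibration / GPS-RTK survey; the values here are dummies.
import numpy as np

def to_global(points_cam, R, t):
    """points_cam: (N, 3) target coordinates in the camera frame."""
    pts = np.asarray(points_cam, dtype=float)
    return pts @ np.asarray(R, dtype=float).T + np.asarray(t, dtype=float)

R = np.eye(3)                    # placeholder rotation, camera frame -> global frame
t = np.array([10.0, 20.0, 0.0])  # placeholder translation to the global origin
print(to_global([[1.0, 0.5, 12.0]], R, t))
```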
In addition, when step S5 is executed, all the acquired three-dimensional positioning information and category label semantic information, all the angle positioning information, and the three-dimensional point cloud data of the laser radar in the multi-view imaging device are fused to construct a three-dimensional semantic map for the smart vehicle.
Of course, if the method in fig. 1 does not execute step S5, the reference image two may be used as a reference image, and the three-dimensional positioning information and the category label semantic information of all the objects in the common view region one are three-dimensionally reconstructed and positioned in the common view region two, so as to obtain the three-dimensional positioning information and the category label semantic information of all the objects in the common view region in the reference image; and acquiring the angle positioning information of the target in each non-common visual area in a perspective grid division mode.
Therefore, the method of the embodiment can effectively reduce the calculated amount of data processing of the visual point cloud matching positioning algorithm and improve the target positioning precision, not only can realize the accurate positioning of targets with different fields of view and different distances, but also can greatly improve the working efficiency of data processing of the intelligent vehicle visual point cloud matching positioning algorithm, has the advantages of large field angle, high positioning precision and high processing speed, and can meet the engineering application requirements of the current vision-based unmanned technology on an embedded processor.
It should be noted that the semantic map constructed by the method of this embodiment contains the category and the three-dimensional position information of each detected target. The category information is not rendered with specific colour, shape or texture features but is represented by a labelled bounding box (the target category can be read as semantic information from the label on the bounding box). The three-dimensional information in the semantic map is not per-feature-point three-dimensional information but the average three-dimensional information of the detection bounding-box region; the bounding box of a detected target can be understood as being treated as a single centroid point. Real-time map construction is thus achieved by reducing the dimensionality of the massive point-cloud information of a high-precision map: the semantic map keeps only the essential meaning of each target without accurately restoring its appearance.
Example two
The embodiment of the invention provides an intelligent vehicle target real-time detection and positioning method based on bionic vision, aimed at the engineering requirement of real-time detection and positioning on an unmanned-driving embedded platform and at greatly improving the target detection and positioning efficiency of the unmanned visual information processing system, as shown in fig. 9.
The method simulates an insect compound-eye imaging system to form a multi-view imaging system that enlarges the field of view; multiple targets are classified and identified with the YOLOv5 deep-learning target detection method, and the target positions of interest are segmented and marked one by one in the whole image. The overlapping and non-overlapping areas of the sub-field image planes of the multi-view imaging system are then divided into common-view and non-common-view areas, and data coordinate conversion is carried out for each of them.
Specifically, for the overlapping region of the two image planes, i.e. the common-view region: using the positional relation formula between the pixel coordinates of the images (called sub-image planes) of the focal-length lenses of the multi-view imaging device, a one-to-one mapping of the multi-target areas segmented from the two images is established quickly (see formula 4 below); in other words, the feature-point extraction areas of the two target images to be matched are located in advance, which shrinks the global search area each detected target would otherwise have to traverse. Feature-point matching is then performed on the pre-located target feature-point extraction areas with the SURF algorithm, and finally the detected target is accurately positioned by the multi-view different-focal-length three-dimensional reconstruction algorithm (see formula 2 below).
For the non-overlapping region of the two image planes, i.e. the non-common-view region: the first-person viewing angle of the camera is divided into four intervals, namely upper, lower, left and right; each interval is divided into angular perspective grids to coarsely position the direction angle of the detected target, which facilitates accurate positioning by the subsequent laser radar.
The method effectively reduces the calculated amount of data processing of the vision positioning algorithm, improves the positioning precision of the targets at different distances, enhances the real-time performance and the reliability of the vision system, and meets the engineering requirement of the unmanned technology based on vision.
In order to better understand the technical solution of the embodiment of the present invention, the above steps are explained in detail with reference to fig. 9:
step A1: as shown in fig. 2, image data is acquired for each vision sensor in the multi-view imaging device. That is, images of different focal length lenses in the multi-view imaging device of the intelligent vehicle are acquired in real time.
In this embodiment, four cameras with different focal lengths of 8mm, 12mm, 16mm and 25mm are adopted to form a multi-view imaging system with different fields of view: short-focus lenses of 8mm and 12mm are used to monitor close-range objects of 5-35m, for example pedestrians and vehicles, within a 90 deg. field of view. 16mm and 25mm telephoto lenses are used to monitor long-range objects of 10-60m within a 40 ° field angle, such as traffic lights and speed-limiting signs.
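The lens configuration of this embodiment can be summarized as a small configuration table; the field names are illustrative, while the numbers are taken from the text above.

```python
# Configuration table of the four-lens rig described above (field names are illustrative).
MULTI_VIEW_RIG = {
    "short_focus_group": {"focal_mm": (8, 12),  "field_angle_deg": 90, "range_m": (5, 35),
                          "typical_targets": "near objects such as pedestrians and vehicles"},
    "telephoto_group":   {"focal_mm": (16, 25), "field_angle_deg": 40, "range_m": (10, 60),
                          "typical_targets": "far objects such as traffic lights and speed-limit signs"},
}
```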
Step A2: and detecting the type and position information of each target in the image of each lens.
In the embodiment, in order to reduce the calculation amount between two or more frames of image point cloud data, a YOLOv5 target real-time detection algorithm is used to obtain the position and size information and the category information of different types of targets in each image.
Different types of targets can comprise dynamic targets such as pedestrians and vehicles and static targets such as lane lines and traffic lights in the image, each recognition target is divided in the image to be marked, irrelevant areas in the image can be removed in the process, and the number of calculation of feature points of the whole image is reduced. Fig. 3 is a graph showing the effect of YOLOv5 algorithm on real-time target detection and segmentation.
Step A3: divide the image-plane coordinates of each camera that correspond to the overlapping and non-overlapping areas of the lens fields of view of the multi-view imaging device into a common-view area and a non-common-view area.
Specifically, the image of the lens with the minimum focal length is taken as a reference image, and the reference image is divided into a common visual area and a non-common visual area according to the homogeneous coordinate system transformation relation of object space corresponding to lenses with different focal lengths.
In the present embodiment, different focal length lenses are divided into a long focal length lens and a short focal length lens; regarding the short-focal-length lens, taking an image of a lens with the smallest focal length in the short-focal-length lens as a first reference image, and dividing an imaging interval of the first reference image corresponding to the field angle of other lenses in the short-focal-length lens into a first common visual area and a first non-common visual area, as shown in fig. 4;
for the long-focus lens, the image of the lens with the smallest focal length in the long-focus lens is taken as a second reference image, and the imaging interval of the second reference image corresponding to the field angle of the other lenses in the long-focus lens is divided into a second common view area and a second non-common view area, as shown in fig. 5.
The short-focus and long-focus cameras use 1/2.7" CMOS sensors with a resolution of 1920 × 1080 pixels. Fig. 4 shows common-view region I and non-common-view region I in the 90° large-field image plane corresponding to the 8 mm and 12 mm short-focus lens group. The endpoint coordinates of the common-view rectangle are (-666, 360), (-666, -360), (638, 360), (638, -360); common-view region I spans 67.6% of the image-plane coordinates in the X direction and 66.7% in the Y direction, and covers 45.28% of the whole short-focus large-field image plane. Because of the baseline distance between the two cameras, the common-view region is not strictly symmetrical in the horizontal direction but is slightly shifted along the X axis.
Fig. 5 shows common-view region II and non-common-view region II in the 40° small-field image plane corresponding to the 16 mm and 25 mm telephoto lens group. The endpoint coordinates of the common-view rectangle are (-641, 345), (-641, -345), (610, 345), (610, -345); common-view region II spans 65.2% of the image-plane coordinates in the X direction and 63.9% in the Y direction, and covers 41.63% of the whole telephoto small-field image plane.
Specifically, the spatial transformation between image coordinate system 1 (x1, y1) and the world coordinate system (X, Y, Z) can be expressed as:
$$x_1 = \frac{\alpha_1 X}{Z},\qquad y_1 = \frac{\beta_1 Y}{Z}$$
The spatial transformation between image coordinate system 2 (x2, y2) and the world coordinate system (X, Y, Z) can be expressed as:
$$x_2 = \frac{\alpha_2 (X - d)}{Z},\qquad y_2 = \frac{\beta_2 Y}{Z}$$
the image coordinate system 1 and the image coordinate system 2 take pictures of the same target M (X, Y, Z) and image, and the relation between the image surface coordinates X and Y of the two camera image coordinate systems can be deduced by knowing the normalized focal length alpha and beta values.
For example, given that the resolutions of two cameras are 1280 × 1024, the sub-image plane can be divided into a non-common view region and a common view region according to the spatial transformation relationship between the two camera coordinate systems and the same world coordinate system (i.e. the derivation result of equation 3).
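The derivation can be checked symbolically; the sketch below assumes the simplified pinhole relations quoted above and reproduces the form of formula (3) given later.

```python
# Symbolic check of the image-plane relation, assuming the simplified pinhole
# models quoted above (camera 2 shifted by the baseline d along the X axis).
import sympy as sp

X, Y, Z, d = sp.symbols("X Y Z d", positive=True)
a1, b1, a2, b2 = sp.symbols("alpha1 beta1 alpha2 beta2", positive=True)
x1, y1 = sp.symbols("x1 y1")

X_from_x1 = sp.solve(sp.Eq(x1, a1 * X / Z), X)[0]   # X expressed through x1
Y_from_y1 = sp.solve(sp.Eq(y1, b1 * Y / Z), Y)[0]   # Y expressed through y1

x2 = sp.expand(a2 * (X_from_x1 - d) / Z)
y2 = sp.simplify(b2 * Y_from_y1 / Z)
print(x2)   # alpha2*x1/alpha1 - alpha2*d/Z   (formula (3), up to term order)
print(y2)   # beta2*y1/beta1
```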
Step A4: for the common-view area, target feature-point matching within the designated area is carried out with the SURF feature-point extraction and matching algorithm, and finally the detected target is three-dimensionally reconstructed with the binocular different-focal-length stereoscopic vision algorithm (see formula 2 below).
The ideal parallel binocular imaging model with different focal lengths can be derived from the geometric relationship between the space point M and its image-plane coordinates in the two cameras.
Assume that the optical axes of the left and right cameras are parallel (the Y-axis values are the same); because the focal lengths differ, the two image planes are not coplanar. Let the effective focal lengths of the 12 mm and 8 mm lenses be f1 and f2 respectively. From the triangle similarity principle, the distance (depth value) of the spatial point M from the cameras is:
$$Z = \frac{f_1 f_2\,b}{f_2 x_1 - f_1 x_2} \tag{1}$$
Equation (1) shows that calculating the target depth Z requires the focal lengths f1 and f2 of the two cameras and the baseline distance b, prior parameters obtained by camera calibration, together with the corresponding pixel coordinates x1 and x2 in the two cameras. With these parameters, the three-dimensional coordinates of the target point M can be accurately reconstructed:
$$X = \frac{Z\,x_1}{f_1},\qquad Y = \frac{Z\,y_1}{f_1},\qquad Z = \frac{f_1 f_2\,b}{f_2 x_1 - f_1 x_2} \tag{2}$$
the position relation between the pixel coordinates corresponding to the two cameras can also be obtained according to the formula (1):
$$x_2 = \frac{\alpha_2}{\alpha_1}x_1 - \frac{\alpha_2\,d}{Z},\qquad y_2 = \frac{\beta_2}{\beta_1}y_1 \tag{3}$$
where α1 and α2, β1 and β2 are the scale factors (also called normalized focal lengths) along the horizontal axis u and the vertical axis v of the image, respectively. When α2 < α1 and β2 < β1, formula (3) shows that the pixel-coordinate offset y1 of the 12 mm camera is larger than the offset y2 of the 8 mm camera, i.e. the same object moves faster in the image plane of the telephoto camera than in that of the short-focus camera. Meanwhile, the offset between the two image-plane coordinates in the X direction also depends on the ratio of the baseline distance d to the depth Z, so the influence of the d/Z term on the positional relation between the pixel coordinates x1 and x2 in formula (3) needs further examination.
Let Dx = α2·d/Z be the image-plane position error between the corresponding pixel coordinates of the two cameras. With the working distance Z of the 8 mm lens in the range 5-25 m, normalized focal lengths α1 = β1 = 4000 pixels and α2 = β2 = 2666 pixels, and baseline distance d = 0.05 m, the variation of the pixel-coordinate image-plane error Dx between the 8 mm and 12 mm cameras with increasing test distance L is shown in fig. 7. When the monitoring distance exceeds 5 m, the Dx error stays within about 26 pixels; since only the average pixel coordinate of the detection area is of interest, the influence of the d/Z term on the pixel coordinate x2 can be neglected (recorded as a constant D), and formula (3) can be further approximated as:
$$x_2 \approx \frac{\alpha_2}{\alpha_1}x_1 - D,\qquad y_2 = \frac{\beta_2}{\beta_1}y_1 \tag{4}$$
As shown in fig. 8, formula (4) estimates, from the average pixel position x1 of a detected target in the coordinate system of the 12 mm camera, the approximate position x2 of the corresponding target in the coordinate system of the 8 mm camera. This avoids the linear search for one-to-one correspondences among multiple target areas and improves the efficiency of the feature-point matching algorithm. In other words, it effectively reduces the traversal of the global search area for each detected target, lowers the computation of feature-point extraction, matching and point-cloud reconstruction, accelerates the matching and indexing process, and reduces the computational complexity of target feature-point matching.
Step A5: and aiming at the non-common visual area, acquiring the angle positioning information of the target in each non-common visual area in a perspective grid division mode.
For example, for the non-common-view region of the multi-view imaging device shown in fig. 4, the region is divided into four parts, namely upper, lower, left and right, according to the first-person viewing angle of the camera, and each part is divided into perspective angle grids to coarsely position the direction angle of the detected target, which facilitates accurate positioning and clustering of the subsequent lidar point-cloud data.
For example, the common-view area is not strictly symmetrical in the horizontal direction but is slightly shifted along the X axis. The non-common-view area is divided into upper, lower, left and right parts whose corresponding spatial viewing angles are [7.7°, 12.4°], [-7.7°, -12.4°], [-13.5°, -19.3°] and [13.5°, 20.3°] respectively; each sub-interval is divided into perspective grids so that the direction of a detected target can be coarsely estimated, which facilitates accurate positioning and clustering of the subsequent lidar point-cloud data.
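A sketch of this coarse angular positioning, assuming the horizontal direction angle of a detection is arctan(x/α) with x measured from the principal point of the reference image; the 1° grid step is an assumption.

```python
# Sketch of the perspective-grid angle estimate in the non-common-view area.
# The direction angle of a detection centred at pixel x (relative to the principal
# point of the 8 mm reference image, alpha = 2666 px) is atan(x / alpha); it is then
# snapped to a grid cell of fixed angular width (the 1-degree step is an assumption).
import math

def direction_angle_deg(x_pixel, alpha):
    return math.degrees(math.atan(x_pixel / alpha))

def grid_cell(angle_deg, sector, step_deg=1.0):
    """sector: (start_deg, end_deg) of one of the four non-common-view intervals."""
    lo, hi = min(sector), max(sector)
    if not (lo <= angle_deg <= hi):
        return None
    return int((angle_deg - lo) // step_deg)

theta = direction_angle_deg(700, 2666)        # a target in the right-hand interval
print(theta, grid_cell(theta, (13.5, 20.3)))  # about 14.7 degrees, grid cell 1
```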
It should be noted that, in the embodiment of the present invention, the angle value is used for positioning the detection target in the non-common view area in real time instead of the three-dimensional point cloud, so as to accelerate the processing speed of sensing the image target information. For example, the space view fields corresponding to the upper and lower regions are respectively used for detecting the traffic light and the approximate direction angle of the anti-collision barrel, and prior category and direction information is provided for subsequent clustering and positioning of laser radar point cloud data.
Step A6: static targets such as vehicles and pedestrians and dynamic target three-dimensional data such as traffic lights and lane lines, which are detected and positioned by the multi-view imaging device, are fused into the laser radar point cloud data, and a real-time three-dimensional semantic map is constructed, so that the bionic vision-based unmanned technology is applied to an embedded processing platform in an engineering mode.
It should be noted that the three-dimensional semantic map to be constructed in this embodiment is different from the point cloud semantic map in the prior art, in this embodiment, only the bounding box and the category of the target need to be labeled, and the point cloud data does not need to be used to reconstruct the specific color and shape features of the target, as shown in fig. 11.
The method can construct the semantic map in real time, and the map constructed by the method is the semantic map constructed on line, so that compared with the point cloud and manual labeling-based off-line map construction method in the prior art, the method is simple and convenient to operate, does not need subsequent manual labeling, can meet the engineering application of an embedded processor, and has strong practicability.
EXAMPLE III
According to another aspect of the embodiments of the present invention, there is also provided an intelligent vehicle driving system, including: the multi-view image forming apparatus comprises a control device and a multi-view image forming device connected with the control device, wherein the multi-view image forming device comprises: at least one short focal length lens, at least one long focal length lens;
and after receiving the images collected by the short-focal-length lens and the long-focal-length lens, the control device adopts the intelligent vehicle real-time detection and positioning method of the first embodiment or the second embodiment to construct the three-dimensional semantic map of the intelligent vehicle on line in real time.
In practical application, the short-focus lenses of the multi-view imaging device of this embodiment are lenses of 8 mm and 12 mm focal length, and the telephoto lenses are lenses of 16 mm and 25 mm focal length;
the 8 mm and 12 mm short-focus lenses are used to monitor near targets at 5-35 m within a large field angle of 90°,
and the 16 mm and 25 mm telephoto lenses are used to monitor far targets at 10-60 m within a small field angle of 40°.
In practical applications, the intelligent vehicle may be an unmanned vehicle or an autonomous vehicle, and the intelligent vehicle may include the above intelligent vehicle driving system. Specifically, the 8 mm, 12 mm, 16 mm and 25 mm focal-length lenses are placed side by side in the same horizontal plane in front of the vehicle cab, with the long-focus and short-focus lenses arranged alternately; the axial spacing between adjacent short-focus lenses is 12 cm.
The intelligent vehicle driving system can build a semantic map online in real time, enabling engineering application and improving practicability.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions.
It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the terms first, second, third, and the like is for convenience only and does not denote any order; these words are to be understood as part of the name of the component.
Furthermore, it should be noted that in the description of the present specification, the description of the term "one embodiment", "some embodiments", "examples", "specific examples" or "some examples", etc., means that a specific feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the claims should be construed to include preferred embodiments and all changes and modifications that fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention should also include such modifications and variations.

Claims (10)

1. A bionic-vision intelligent vehicle target real-time detection and positioning method is characterized by comprising the following steps:
step S1, acquiring, in real time, images from lenses of different focal lengths in the multi-view imaging device of the intelligent vehicle;
step S2, detecting the type and position information of each target in the image of each lens;
step S3, taking the image of the lens with the smallest focal length as a reference image, and dividing the reference image into a common-view area and a non-common-view area according to the homogeneous coordinate transformation relation of the object space corresponding to the lenses of different focal lengths;
step S4, for the common-view area of each image, performing three-dimensional reconstruction and positioning of the targets in the common-view area by a binocular different-focal-length stereoscopic vision reconstruction method, so as to obtain the three-dimensional positioning information and category-label semantic information of all targets in the common-view area of the reference image;
for the non-common-view area of the reference image, dividing it into upper, lower, left, and right areas according to the first-person viewing angle of the camera device to which the image belongs, and acquiring the angle positioning information of the targets in each non-common-view area by means of perspective grid division;
and constructing a vector map with semantic information for the intelligent vehicle by using the three-dimensional positioning information, the category-label semantic information, and the angle positioning information corresponding to the non-common-view area.
2. The method according to claim 1, wherein the step S2 includes:
detecting the type and position information of each target in the image of each lens by using the YOLOv5 real-time target detection algorithm;
and/or, the target comprises: traffic lights, speed limit signs, pedestrians, small animals, vehicles or lane lines;
the category and location information includes: position information, size information, and category information of the object in each image.
3. The method according to claim 1, wherein the step S3 includes:
when the number of lenses with different focal lengths is greater than or equal to 3, dividing the lenses of different focal lengths into long-focal-length lenses and short-focal-length lenses;
for the short-focal-length lenses, taking the image of the lens with the smallest focal length among the short-focal-length lenses as a first reference image, and dividing the imaging sections of the first reference image corresponding to the field angles of the other short-focal-length lenses into a first common-view area and a first non-common-view area;
and for the long-focal-length lenses, taking the image of the lens with the smallest focal length among the long-focal-length lenses as a second reference image, and dividing the imaging sections of the second reference image corresponding to the field angles of the other long-focal-length lenses into a second common-view area and a second non-common-view area.
4. The method according to claim 3, wherein the step S4 includes:
acquiring the three-dimensional positioning information and category-label semantic information of all targets in the first common-view area of the first reference image by the binocular different-focal-length stereoscopic vision reconstruction method;
acquiring the angle positioning information of each target in the first non-common-view area of the first reference image by perspective grid division;
acquiring the three-dimensional positioning information and category-label semantic information of all targets in the second common-view area of the second reference image by the binocular different-focal-length stereoscopic vision reconstruction method;
acquiring the angle positioning information of each target in the second non-common-view area of the second reference image by perspective grid division;
accordingly, the method further comprises step S5,
step S5 includes:
and fusing all the acquired three-dimensional positioning information, the category-label semantic information, all the angle positioning information, and the three-dimensional point-cloud data in the multi-view imaging device to construct a three-dimensional semantic map for the intelligent vehicle.
5. The method according to claim 3, wherein the step S4 includes:
acquiring the three-dimensional positioning information and category-label semantic information of all targets in the first common-view area of the first reference image by the binocular different-focal-length stereoscopic vision reconstruction method; acquiring the angle positioning information of each target in the first non-common-view area of the first reference image by perspective grid division;
acquiring the three-dimensional positioning information and category-label semantic information of all targets in the second common-view area of the second reference image by the binocular different-focal-length stereoscopic vision reconstruction method; acquiring the angle positioning information of each target in the second non-common-view area of the second reference image by perspective grid division;
and transferring the three-dimensional positioning information and category-label semantic information of all targets in the common-view areas of the first and second reference images, and the angle positioning information of all targets in the non-common-view areas, to a preset global coordinate system through coordinate transformation.
6. The method according to claim 1, wherein the step S3 includes:
if it is detected in step S2 that the same target has corresponding corner points a, b, c, d in the image of a lens with any focal length, and corresponding corner points a', b', c', d' in the image of the lens with the smallest focal length;
when the coordinate systems of the two lenses do not coincide and/or the focal lengths of the lenses differ, taking the image of the lens with the smallest focal length as the reference image and, based on the homogeneous coordinate transformation relation of the object-image space, dividing the reference image into a "回"-shaped (square-annulus) region defined by the eight corner points a, b, c, d and a', b', c', d';
wherein the central square of the region, bounded by a, b, c, d, is the overlapping imaging region and serves as the common-view area;
and the annular band extending from the inner square a, b, c, d to the outer boundary a', b', c', d' is the non-overlapping imaging region and serves as the non-common-view area.
7. The method according to any one of claims 1 to 6, wherein the step S4 includes:
step S4-1, establishing a one-to-one mapping between the coordinates of each target of the multi-target image in each image according to the two-image-plane coordinate relation (A1);
[Formula (A1): two-image-plane coordinate relation, filed as an image in the original document]
step S4-2, detecting the depth information, namely the Z value, of the target (the spatial point M) in three-dimensional space by means of multiple cameras with different resolutions and focal lengths, according to the binocular parallel vision model;
specifically, the coordinates (X, Y, Z) of the spatial point M are:
[Formula (A2): three-dimensional coordinates of the spatial point M, filed as an image in the original document]
wherein the focal lengths f1 and f2 of the two camera lenses and the baseline distance b are pre-calibrated parameters, and the corresponding pixel coordinates x1, y1 and x2, y2 are obtained through formula (A1);
and step S4-3, reconstructing the detected target in the common-view area of the reference image according to the three-dimensional coordinate relation (A2) of the spatial point M, and labeling the average depth value of the detected target area.
8. The method of claim 4, wherein the short-focal-length lenses are an 8 mm focal-length lens and a 12 mm focal-length lens, and the long-focal-length lenses are a 16 mm focal-length lens and a 25 mm focal-length lens;
the 8 mm and 12 mm short-focal-length lenses are used to monitor short-range targets at 5-35 m within a large field angle of 90°,
and the 16 mm and 25 mm long-focal-length lenses are used to monitor long-range targets at 10-60 m within a small field angle of 40°.
9. An intelligent vehicle driving system, characterized by comprising: a control device and a multi-view imaging device connected to the control device, wherein the multi-view imaging device comprises at least one short-focal-length lens and at least one long-focal-length lens;
wherein, after the control device receives the images collected by the short-focal-length lens and the long-focal-length lens, the intelligent vehicle real-time detection and positioning method of any one of claims 1 to 8 is used to construct a three-dimensional semantic map of the intelligent vehicle online in real time.
10. An intelligent vehicle, characterized by comprising the intelligent vehicle driving system of claim 9.
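
For readers who want a concrete picture of the binocular different-focal-length reconstruction recited in claim 7, the following sketch implements a standard parallel-axis two-camera model with unequal focal lengths; because formulas (A1) and (A2) are filed as images, the exact expressions in the patent may differ, and the function, parameter names, and numeric values here are assumptions for illustration only:

import numpy as np

def reconstruct_point(x1, y1, x2, f1, f2, b):
    # x1, y1, x2: matched pixel coordinates, measured relative to each principal point
    # f1, f2: focal lengths of the two lenses expressed in pixels
    # b: baseline between the two optical centres, in metres
    # Depth from the generalised disparity of a different-focal-length parallel pair.
    Z = (f1 * f2 * b) / (f2 * x1 - f1 * x2)
    X = x1 * Z / f1
    Y = y1 * Z / f1
    return np.array([X, Y, Z])

# Example: one matched corner of a target seen by the 8 mm and 12 mm lenses.
print(reconstruct_point(x1=250.0, y1=40.0, x2=320.0, f1=1600.0, f2=2400.0, b=0.12))
# approx. [0.82, 0.13, 5.24] metres

Averaging Z over all matched points inside a detection box would then give the average depth value labeled in step S4-3.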
CN202010721224.7A 2020-07-24 2020-07-24 Intelligent vehicle target real-time detection and positioning method based on bionic vision Active CN111914715B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010721224.7A CN111914715B (en) 2020-07-24 2020-07-24 Intelligent vehicle target real-time detection and positioning method based on bionic vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010721224.7A CN111914715B (en) 2020-07-24 2020-07-24 Intelligent vehicle target real-time detection and positioning method based on bionic vision

Publications (2)

Publication Number Publication Date
CN111914715A true CN111914715A (en) 2020-11-10
CN111914715B CN111914715B (en) 2021-07-16

Family

ID=73281418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010721224.7A Active CN111914715B (en) 2020-07-24 2020-07-24 Intelligent vehicle target real-time detection and positioning method based on bionic vision

Country Status (1)

Country Link
CN (1) CN111914715B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103428275A (en) * 2013-07-30 2013-12-04 苏州两江科技有限公司 Indoor object activity routine tracking method based on WSN
US20170177947A1 (en) * 2015-12-18 2017-06-22 Canon Kabushiki Kaisha Methods, devices and computer programs for tracking targets using independent tracking modules associated with cameras
CN108156419A (en) * 2017-12-22 2018-06-12 湖南源信光电科技股份有限公司 More focal length lens linkage imaging camera machine system based on multiple features combining and Camshift algorithms
CN108830905A (en) * 2018-05-22 2018-11-16 苏州敏行医学信息技术有限公司 The binocular calibration localization method and virtual emulation of simulating medical instrument cure teaching system
CN108805910A (en) * 2018-06-01 2018-11-13 海信集团有限公司 More mesh Train-borne recorders, object detection method, intelligent driving system and automobile
CN110866863A (en) * 2018-08-27 2020-03-06 天津理工大学 Automobile A-pillar perspective algorithm
CN109461211A (en) * 2018-11-12 2019-03-12 南京人工智能高等研究院有限公司 Semantic vector map constructing method, device and the electronic equipment of view-based access control model point cloud
CN111435539A (en) * 2019-01-15 2020-07-21 苏州沃迈智能科技有限公司 Multi-camera system external parameter calibration method based on joint optimization
CN110095111A (en) * 2019-05-10 2019-08-06 广东工业大学 A kind of construction method of map scene, building system and relevant apparatus
CN110285793A (en) * 2019-07-08 2019-09-27 中原工学院 A kind of Vehicular intelligent survey track approach based on Binocular Stereo Vision System
CN110728715A (en) * 2019-09-06 2020-01-24 南京工程学院 Camera angle self-adaptive adjusting method of intelligent inspection robot

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
GAO XIANG ET AL.: "Angle precision study on dual-aircraft cooperatively detecting remote target by passive locating method", 2011 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC) *
GRONWALL, C. ET AL.: "A proposal for combining mapping, localization and target recognition", Conference on Electro-Optical Remote Sensing, Photonic Technologies, and Applications IX *
YANG LIXIN: "Research on Simultaneous Localization and Mapping Based on Multiple Cameras", China Master's Theses Full-text Database, Information Science and Technology *
CHI LONGYUN ET AL.: "Research on Image Stitching and Positioning Algorithm Based on Local Homography Matrix", Navigation Positioning and Timing *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112419392A (en) * 2020-11-30 2021-02-26 广州博进信息技术有限公司 Method, apparatus and medium for calculating actual size of moving object based on machine vision
CN112699900A (en) * 2021-01-05 2021-04-23 东北林业大学 Improved traffic sign identification method of YOLOv4
CN112863183A (en) * 2021-01-14 2021-05-28 深圳尚桥信息技术有限公司 Traffic flow data fusion method and system
CN112863183B (en) * 2021-01-14 2022-04-08 深圳尚桥信息技术有限公司 Traffic flow data fusion method and system
CN113034671B (en) * 2021-03-23 2024-01-09 成都航空职业技术学院 Traffic sign three-dimensional reconstruction method based on binocular vision
CN112926501A (en) * 2021-03-23 2021-06-08 哈尔滨理工大学 Traffic sign detection algorithm based on YOLOv5 network structure
CN113034671A (en) * 2021-03-23 2021-06-25 成都航空职业技术学院 Traffic sign three-dimensional reconstruction method based on binocular vision
CN112949595A (en) * 2021-04-01 2021-06-11 哈尔滨理工大学 Improved pedestrian and vehicle safety distance detection algorithm based on YOLOv5
CN113781539A (en) * 2021-09-06 2021-12-10 京东鲲鹏(江苏)科技有限公司 Depth information acquisition method and device, electronic equipment and computer readable medium
CN114777672A (en) * 2022-04-29 2022-07-22 河北工程大学 Three-dimensional measurement equipment and method based on different-resolution visual fusion of multi-ocular structured light
CN115271200A (en) * 2022-07-25 2022-11-01 仲恺农业工程学院 Intelligent continuous picking system for famous and high-quality tea
CN115797459A (en) * 2022-08-29 2023-03-14 南京航空航天大学 Binocular vision system distance measurement method with arbitrary focal length combination
CN115797459B (en) * 2022-08-29 2024-02-13 南京航空航天大学 Binocular vision system ranging method with arbitrary focal length combination
CN115272618A (en) * 2022-09-20 2022-11-01 深圳市其域创新科技有限公司 Three-dimensional grid optimization method, equipment and storage medium
CN115272618B (en) * 2022-09-20 2022-12-20 深圳市其域创新科技有限公司 Three-dimensional grid optimization method, equipment and storage medium
WO2024060981A1 (en) * 2022-09-20 2024-03-28 深圳市其域创新科技有限公司 Three-dimensional mesh optimization method, device, and storage medium

Also Published As

Publication number Publication date
CN111914715B (en) 2021-07-16

Similar Documents

Publication Publication Date Title
CN111914715B (en) Intelligent vehicle target real-time detection and positioning method based on bionic vision
CN107945220B (en) Binocular vision-based reconstruction method
Zhang et al. Intelligent collaborative localization among air-ground robots for industrial environment perception
CN107741234A (en) The offline map structuring and localization method of a kind of view-based access control model
Perrollaz et al. Probabilistic representation of the uncertainty of stereo-vision and application to obstacle detection
CN113819890B (en) Distance measuring method, distance measuring device, electronic equipment and storage medium
CN108519102B (en) Binocular vision mileage calculation method based on secondary projection
CN110176032B (en) Three-dimensional reconstruction method and device
CN106826833A (en) Independent navigation robot system based on 3D solid cognition technologies
CN110782524A (en) Indoor three-dimensional reconstruction method based on panoramic image
CN109579825B (en) Robot positioning system and method based on binocular vision and convolutional neural network
CN105043350A (en) Binocular vision measuring method
Alcantarilla et al. Large-scale dense 3D reconstruction from stereo imagery
CN111998862A (en) Dense binocular SLAM method based on BNN
Vemprala et al. Monocular vision based collaborative localization for micro aerial vehicle swarms
CN112150518A (en) Attention mechanism-based image stereo matching method and binocular device
Yang et al. Vision system of mobile robot combining binocular and depth cameras
CN113888639B (en) Visual odometer positioning method and system based on event camera and depth camera
Park et al. Global map generation using LiDAR and stereo camera for initial positioning of mobile robot
Yong-guo et al. The navigation of mobile robot based on stereo vision
Wen et al. Roadside hd map object reconstruction using monocular camera
CN108090930A (en) Barrier vision detection system and method based on binocular solid camera
CN106447709A (en) Rapid high-precision binocular parallax matching method
CN116295412A (en) Depth camera-based indoor mobile robot dense map building and autonomous navigation integrated method
CN114199205B (en) Binocular Ranging Method Based on Improved Quadtree ORB Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A real-time detection and localization method for intelligent vehicle targets based on biomimetic vision

Effective date of registration: 20230920

Granted publication date: 20210716

Pledgee: Bank of China Limited Langfang Branch

Pledgor: Langfang Heyi Life Network Technology Co.,Ltd.

Registration number: Y2023980057780