CN117953483A - Omnidirectional vision-oriented three-dimensional target detection method - Google Patents

Omnidirectional vision-oriented three-dimensional target detection method

Info

Publication number
CN117953483A
CN117953483A (application CN202311697395.0A)
Authority
CN
China
Prior art keywords: image, voxel, features, cameras, camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311697395.0A
Other languages
Chinese (zh)
Inventor
冯明驰
李坤
萧红
王昆
孔浩宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202311697395.0A priority Critical patent/CN117953483A/en
Publication of CN117953483A publication Critical patent/CN117953483A/en
Pending legal-status Critical Current

Abstract

The invention discloses a three-dimensional target detection method oriented to omnidirectional vision, belonging to the fields of automatic driving, intelligent vehicle environment perception and target recognition. After the multiple cameras are installed, full scene information around the vehicle is acquired, and the cameras are calibrated for internal and external parameters. A pre-trained backbone network then extracts multi-scale features from the acquired images, and a feature pyramid network fuses the multi-scale features to obtain feature maps rich in target information. The features are projected into 3D space according to the pinhole camera model and divided into 3D voxel volumes by averaging the data over cameras; the 3D voxel volumes are encoded with a Transformer encoder architecture, and a 3D convolution detection head predicts the three-dimensional information of targets, which is projected onto the BEV plane for display. The invention can be applied to automatic driving and driver assistance, can perceive the environment around the vehicle in all directions, and is characterized by accurate classification and a large perception range.

Description

Omnidirectional vision-oriented three-dimensional target detection method
Technical Field
The invention belongs to the fields of automatic driving, image processing and three-dimensional target detection, and particularly relates to a multi-camera processing technology and an omnidirectional vision three-dimensional target detection method.
Background
In the past, conventional target detection methods were mainly limited to object recognition and localization on two-dimensional images and could not obtain the three-dimensional position and posture of a target. The rise of deep learning has injected new vitality into three-dimensional target detection. Deep-learning-based detection algorithms, such as methods built on Convolutional Neural Networks (CNNs), can learn object representations and features directly from image data and realize object recognition and localization. Researchers have also focused on multi-sensor approaches to further improve the accuracy and robustness of three-dimensional object detection: by combining data acquired from binocular cameras, multi-camera rigs and the like, their respective advantages can be exploited to improve detection performance.
In recent years, image-based three-dimensional object detection, in which the position and category of an object are detected in an image, has been actively studied. RGB images provide visual cues about a scene and its objects, but from a monocular image a deep-learning-based three-dimensional detection method can only infer scale ambiguously, and because certain regions are not visible the geometry of the scene cannot be deduced explicitly. Images captured by multiple cameras therefore provide more information about the scene than a monocular image. Although a laser radar can obtain point clouds of all objects around a vehicle, and point-cloud-based neural networks can also identify the category and position of targets, point cloud data are sparse and lack semantic information, a distant target may return only a few points, and laser radar is expensive. Images carry rich semantic information but little positional information; combining multiple cameras can achieve a good target detection effect, and cameras are easy to install and inexpensive, which makes this approach well suited for research.
With the rapid development of target detection, environment perception places higher demands on obtaining the three-dimensional information, center position and other attributes of targets. As deep learning algorithms progress, three-dimensional target detection plays an increasingly important role in practical applications and provides a solid foundation for more intelligent and safer systems.
In summary, three-dimensional object detection is the trend in the object detection field and is developing rapidly, yet detection is still largely limited to the area in front of the vehicle; research on perceiving the environment and detecting objects all around the vehicle remains scarce, and the recognition and position estimation of distant targets are still inaccurate.
By retrieval, the closest prior art is application No. 202310261872.2, a three-dimensional target detection method based on multi-scale feature fusion. That method comprises the following steps: 1. inputting point cloud data of the target object to be detected into a 3D partition to obtain 3D feature maps of different scales; 2. inputting the 3D feature maps of different scales into a multi-scale feature fusion module for feature fusion to obtain a fused 3D feature map; 3. inputting the fused 3D feature map into a dimension reduction module to obtain a 2D feature map from the BEV perspective; 4. inputting the 3D and 2D feature maps of different dimensions into a feature fusion module to obtain a fused feature map; 5. inputting the fused feature map into an RPN network to obtain ROIs that may contain the target object; 6. inputting the ROIs into RoIPooling layers to extract ROI features and obtain candidate boxes of the target object. That method uses point cloud data and thus avoids the difficulty of predicting target depth from images, but point clouds lack the semantic information carried by image data, so the extracted target features are still inaccurate, whereas features extracted from images can provide rich semantic information. In addition, the present method also avoids inaccurate depth prediction by projecting image features into 3D space.
Disclosure of Invention
The present invention aims to solve the above problems of the prior art by providing a three-dimensional target detection method oriented to omnidirectional vision. The technical scheme of the invention is as follows:
A three-dimensional target detection method oriented to omnidirectional vision comprises the following steps:
step 1, installing a plurality of camera modules, so that an image acquired by a camera can contain all scene information around;
Step 2, calibrating among a plurality of cameras, calibrating internal parameters of the cameras and external parameters among the cameras, and acquiring data of an omnidirectional environment to obtain image data of the plurality of cameras;
Step 3, extracting image features from the acquired image data by using a backbone network, and fusing features with different scales by using a feature pyramid network to obtain features with rich scale information and semantic information;
Step 4, back-projecting the extracted features into a 3D space by using a projection model of a pinhole camera, and dividing the 3D space into volume blocks in the 3D space according to the same volume size mode to form a plurality of 3D voxel bodies;
Step 5, inputting the 3D voxel volume data into a neural network and encoding the 3D voxel volume with a Transformer encoder architecture to obtain an encoded 3D voxel volume;
And step 6, inputting the Transformer-encoded tensor into a 3D convolutional neural network detection head to predict the three-dimensional information of targets, and displaying it in the bird's eye view BEV.
Further, in the step 1, a plurality of camera modules are installed, so that the image collected by the camera can contain all scene information around, and the method specifically includes:
step 11, a plurality of cameras are arranged on the same horizontal line and are arranged in the front, back, left and right directions of the vehicle roof, so that all scene information around the vehicle can be obtained;
Step 12, designing program code and installing a single-chip microcomputer control device that starts and ends acquisition via a key trigger, ensuring that the multiple cameras acquire data synchronously, so that the images captured by the cameras at the same moment depict the scene around the vehicle at that moment.
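As a non-limiting illustration of synchronized acquisition in step 12, the following sketch approximates the hardware key trigger with a software loop, assuming the cameras are exposed to the host as OpenCV video devices; the device indices and frame rate are assumptions of this illustration, not requirements of the invention.

```python
# Minimal sketch of synchronized multi-camera capture (assumption: cameras are
# reachable as OpenCV video devices; the hardware key trigger of step 12 is
# approximated by a software start/stop loop).
import cv2
import time

CAMERA_IDS = [0, 1, 2, 3, 4, 5, 6]   # hypothetical device indices for 7 cameras

def capture_synchronized(num_frames=10, fps=10):
    caps = [cv2.VideoCapture(i) for i in CAMERA_IDS]
    frames_per_camera = [[] for _ in CAMERA_IDS]
    period = 1.0 / fps
    for _ in range(num_frames):
        t0 = time.time()
        # grab() on all cameras first so the exposures are as close in time as
        # possible, then retrieve() the buffered images
        for cap in caps:
            cap.grab()
        for cap, frames in zip(caps, frames_per_camera):
            ok, img = cap.retrieve()
            if ok:
                frames.append(img)
        # keep a constant acquisition frequency, as required in step 22
        time.sleep(max(0.0, period - (time.time() - t0)))
    for cap in caps:
        cap.release()
    return frames_per_camera
```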
Further, in the step 2, calibration between the plurality of cameras, calibration of internal parameters of the cameras and external parameters between the cameras, and collecting data of an omni-directional environment at the same time, and obtaining image data of the plurality of cameras, specifically includes:
Step 21, acquiring calibration images of a black-and-white checkerboard according to the installation positions of the cameras, and extracting the pixel coordinates of the corner points on the calibration board images with Zhang Zhengyou's calibration method. The world coordinate system is defined in advance, so the coordinates of the corners on the calibration board in the world coordinate system are known. The cameras are calibrated using the pixel coordinates of each corner point and its physical coordinates in the world coordinate system, yielding the intrinsic matrix, extrinsic matrix and distortion coefficients of each camera.
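As a non-limiting illustrative sketch of step 21, the calibration can be performed with OpenCV's implementation of Zhang's method; the checkerboard pattern size, square size and image paths below are assumptions of this illustration.

```python
# Minimal sketch of step 21 using OpenCV's implementation of Zhang's calibration.
# Assumptions: a 9x6 inner-corner checkerboard with 25 mm squares; paths are placeholders.
import glob
import cv2
import numpy as np

PATTERN = (9, 6)          # inner corners per row/column (assumed)
SQUARE_SIZE = 0.025       # checkerboard square size in metres (assumed)

# Known 3D corner coordinates in the pre-defined world/board coordinate system
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_SIZE

obj_points, img_points = [], []
for path in glob.glob("calib/cam0/*.png"):          # placeholder image path
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

assert obj_points, "no calibration images found"
# Intrinsic matrix K, distortion coefficients, and per-view extrinsics (rvecs, tvecs)
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS:", rms)
```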
Step 22, collecting scene information around the vehicle, wherein the image collection frequency of the camera is kept constant, and the camera is set to automatically expose in the collection process; image data of multiple cameras is acquired.
Further, in the step 3, for views of a plurality of cameras, features of the target in the view are extracted by using a neural network, and specifically include:
Step 31, extracting image features from an input image by using a pre-trained 2D convolution backbone network to obtain feature images with multiple scales, and providing input for subsequent processing;
and step 32, fusing the extracted image multi-scale features by using a feature pyramid model FPN, and outputting a whole 2D feature tensor.
Further, in step 31, the image features are extracted from the input image by using a pre-trained 2D convolution backbone network, so as to obtain a feature map with multiple scales, which specifically includes:
The 2D convolution backbone network mainly comprises an input layer, convolution layers, residual blocks, batch normalization layers, pooling layers and the like. The input layer receives an image and passes it to the next layer; the convolution layers convolve the input image with convolution kernels to extract multi-scale image features; a residual block adds its input to the output of its convolution layers, reducing information loss; batch normalization standardizes the data; the pooling layers reduce the data dimension while retaining the main features. The shape of the extracted multi-scale image feature maps is
where W and H are the input image size and α0 is a constant whose specific value depends on the dimension of the 2D convolution backbone network used.
Further, in the step 32, the extracted multi-scale features of the image are fused by using a feature pyramid model FPN, and a tensor of an integral 2D feature is output, which specifically includes:
The feature pyramid model up-samples the high-level feature maps and fuses them with the adjacent lower-level feature maps by element-wise addition, increasing the scale and semantic information of the feature maps. The output shape after multi-scale feature fusion with the FPN model is
where α1 is a constant whose specific value depends on the FPN dimension used.
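As a non-limiting illustration of steps 31 and 32, the following sketch pairs a pre-trained 2D backbone with a simple FPN-style top-down pathway fused by element-wise addition; the choice of ResNet-50 and the 256-channel width are assumptions of this illustration, not the patent's mandated configuration.

```python
# Minimal sketch of steps 31-32: pre-trained backbone -> multi-scale features,
# then top-down fusion by element-wise addition into a single 2D feature tensor.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class BackboneFPN(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.stages = nn.ModuleList([resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4])
        in_channels = [256, 512, 1024, 2048]
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, x):                    # x: (B, 3, H, W)
        feats = []
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)                  # four scales: strides 4, 8, 16, 32
        # top-down pathway: upsample the coarser map and add it to the lateral projection
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        # return the finest fused map as the single 2D feature tensor F_t
        return self.smooth(laterals[0])      # (B, out_channels, H/4, W/4)
```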
Further, the step 4 of back projecting the extracted features into a 3D space by using a projection model of a pinhole camera, and dividing the 3D space into volume blocks according to the same volume size in the 3D space, to form a plurality of 3D voxel bodies, specifically includes:
Step 41, using the pinhole imaging model to determine the relationship between the 2D coordinates (u, v) on the feature map and the 3D coordinates (x, y, z) in the voxel volume, as follows:
(u, v) = P(K·R·(x, y, z, 1)ᵀ)
wherein K and R are the internal and external parameter matrices and P is the perspective mapping; after the 2D features are projected, all voxels on a given camera ray are filled with the same features;
Step 42, defining a binary mask Mt with the same shape as the voxel volume, indicating whether each voxel lies inside the camera view frustum; for each image It, Mt is defined as:
Mt(x, y, z) = 1 if P(K·R·(x, y, z, 1)ᵀ) falls inside image It, and 0 otherwise;
Step 43, projecting the FPN-aggregated 2D features Ft into each valid voxel of the voxel volume Vt:
Vt(x, y, z) = Ft(u, v)·Mt(x, y, z)
Step 44, fusing M1, ..., Mt to obtain a binary mask M:
M = M1 ∨ M2 ∨ ... ∨ Mt
Step 45, averaging the features of V1, ..., Vt to obtain the final 3D voxel volume V:
V(x, y, z) = Σi Vi(x, y, z) / max(1, Σi Mi(x, y, z))
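As a non-limiting illustration of steps 41 to 45, the following sketch projects voxel centres into each camera with the pinhole model, samples the 2D features for valid voxels, and averages over cameras; the grid extents, voxel size and tensor layouts are assumptions of this illustration.

```python
# Minimal sketch of steps 41-45: pinhole back-projection, per-camera masks,
# feature filling and element-wise averaging over valid views.
import torch

def build_voxel_volume(feats, Ks, Rts, grid_min=(-50., -50., -3.),
                       voxel_size=0.5, n_vox=(200, 200, 12)):
    """feats: list of T tensors (C, Hf, Wf); Ks, Rts: per-camera intrinsics (3x3)
    and extrinsics (3x4), assumed already scaled to the feature-map resolution."""
    device = feats[0].device
    C = feats[0].shape[0]
    nx, ny, nz = n_vox
    # voxel centre coordinates (x, y, z) in the world frame
    xs = grid_min[0] + (torch.arange(nx, device=device) + 0.5) * voxel_size
    ys = grid_min[1] + (torch.arange(ny, device=device) + 0.5) * voxel_size
    zs = grid_min[2] + (torch.arange(nz, device=device) + 0.5) * voxel_size
    X, Y, Z = torch.meshgrid(xs, ys, zs, indexing="ij")
    pts = torch.stack([X, Y, Z, torch.ones_like(X)], dim=-1).reshape(-1, 4)  # (N, 4)

    V_sum = torch.zeros(C, pts.shape[0], device=device)
    M_sum = torch.zeros(pts.shape[0], device=device)
    for F_t, K, Rt in zip(feats, Ks, Rts):
        cam = K @ Rt @ pts.T                         # perspective mapping P(K R X)
        u = cam[0] / cam[2].clamp(min=1e-6)
        v = cam[1] / cam[2].clamp(min=1e-6)
        Hf, Wf = F_t.shape[-2:]
        # binary mask M_t: voxel projects in front of the camera and inside the image
        M_t = (cam[2] > 0) & (u >= 0) & (u < Wf) & (v >= 0) & (v < Hf)
        ui = u.clamp(0, Wf - 1).long()
        vi = v.clamp(0, Hf - 1).long()
        V_t = F_t[:, vi, ui] * M_t                   # fill valid voxels with sampled features
        V_sum += V_t
        M_sum += M_t.float()
    V = V_sum / M_sum.clamp(min=1.0)                 # element-wise average over valid views
    return V.reshape(C, nx, ny, nz), (M_sum > 0).reshape(nx, ny, nz)
```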
Further, the step 5 inputs the 3D voxel volume data into a neural network and encodes the 3D voxel volume with a Transformer encoder architecture to obtain an encoded 3D voxel volume; the method specifically comprises:
Step 51, encoding the 3D voxel volume with the Transformer encoder architecture to obtain an encoded 3D voxel volume; specifically, the projected 3D voxel volume is input into a network structure formed by several 3D convolution and downsampling operations, where the input shape is:
Nx×Ny×Nz×α1
where Nx, Ny, Nz are the voxel volume sizes along the x-axis, y-axis and z-axis, respectively.
Step 52, the output shape after the Transformer encoder architecture is:
Nx×Ny×α2
where α2 is a constant related to the dimension of the convolution layers.
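As a non-limiting illustration of step 5, the following sketch follows the concrete description in step 51 (a stack of 3D convolutions with downsampling) and turns the Nx×Ny×Nz×α1 volume into an Nx×Ny×α2 BEV feature map; the channel widths, number of layers and downsampling pattern are assumptions of this illustration.

```python
# Minimal sketch of step 5: 3D convolutions with stride-2 downsampling along z,
# then the remaining z slices are stacked into channels to obtain a 2D BEV tensor.
import torch
import torch.nn as nn

class VoxelEncoder(nn.Module):
    def __init__(self, in_channels=256, out_channels=256, nz=12):
        super().__init__()
        self.blocks = nn.Sequential(
            nn.Conv3d(in_channels, out_channels, 3, padding=1),
            nn.BatchNorm3d(out_channels), nn.ReLU(inplace=True),
            # stride 2 along z only: halves Nz while keeping the BEV resolution Nx x Ny
            nn.Conv3d(out_channels, out_channels, 3, stride=(1, 1, 2), padding=1),
            nn.BatchNorm3d(out_channels), nn.ReLU(inplace=True),
            nn.Conv3d(out_channels, out_channels, 3, stride=(1, 1, 2), padding=1),
            nn.BatchNorm3d(out_channels), nn.ReLU(inplace=True),
        )
        # after two stride-2 steps the z extent is ceil(nz / 4)
        self.collapse = nn.Conv2d(out_channels * ((nz + 3) // 4), out_channels, 1)

    def forward(self, V):                      # V: (B, a1, Nx, Ny, Nz)
        x = self.blocks(V)                     # (B, C, Nx, Ny, ceil(Nz/4))
        B, C, NX, NY, NZ = x.shape
        x = x.permute(0, 1, 4, 2, 3).reshape(B, C * NZ, NX, NY)  # stack z into channels
        return self.collapse(x)                # (B, a2, Nx, Ny) -- the BEV tensor
```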
Further, the step 6 of inputting the Transformer-encoded tensor into a 3D convolutional neural network detection head to predict the three-dimensional information of targets and displaying it in the bird's eye view BEV specifically includes:
Step 61, the detection head comprises two parallel 3D convolution layers: one estimates the class probability and the other regresses the 7 parameters of the bounding box, (x, y, z, w, h, l, θ), where (x, y, z) are the coordinates of the center point, (w, h, l) are the width, height and length, and θ is the rotation angle around the z-axis (see the sketch after step 64);
Step 62, the input shape is:
Nx×Ny×α2
For the output, the detection head returns a class probability p and a 3D box containing the bounding box parameters (x, y, z, w, h, l, θ);
Step 63, the result is corrected using the loss function; the loss comprises several terms, namely a localization loss Lloc, a classification loss Lcls and a direction loss Ldir. Then:
L = (λloc·Lloc + λcls·Lcls + λdir·Ldir) / npos
where npos is the number of positive-sample three-dimensional boxes and λloc, λcls, λdir are constants weighting each loss term;
Step 64, projecting the detected 3D boxes onto the bird's eye view BEV plane for display according to the projection relation.
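As a non-limiting illustration of steps 61 to 63, the following sketch uses two parallel convolutional branches on the BEV feature map, one for class probabilities and one regressing the 7 box parameters, and combines the weighted loss terms; the per-location anchor-free formulation, the specific loss functions and the loss weights are assumptions of this illustration.

```python
# Minimal sketch of step 6: parallel classification / box-regression branches
# and a weighted sum of localization, classification and direction losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BEVDetectionHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=3):   # num_classes includes background 0 (assumed)
        super().__init__()
        self.cls_branch = nn.Conv2d(in_channels, num_classes, 1)   # class probability p
        self.box_branch = nn.Conv2d(in_channels, 7, 1)             # (x, y, z, w, h, l, theta)

    def forward(self, bev):                       # bev: (B, a2, Nx, Ny)
        return self.cls_branch(bev), self.box_branch(bev)

def detection_loss(cls_logits, box_pred, cls_target, box_target, dir_target,
                   lambda_loc=2.0, lambda_cls=1.0, lambda_dir=0.2):
    """cls_target: (B, Nx, Ny) class index, 0 = background (assumed);
    box_target: (B, 7, Nx, Ny); dir_target: (B, Nx, Ny) binary heading bin."""
    pos = cls_target > 0
    n_pos = pos.sum().clamp(min=1).float()
    L_cls = F.cross_entropy(cls_logits, cls_target, reduction="sum")
    L_loc = F.smooth_l1_loss(box_pred.permute(0, 2, 3, 1)[pos],
                             box_target.permute(0, 2, 3, 1)[pos], reduction="sum")
    # direction loss: binary classification of the heading bin, read here from the theta channel
    dir_logit = box_pred[:, 6][pos]
    L_dir = F.binary_cross_entropy_with_logits(dir_logit, dir_target[pos].float(),
                                               reduction="sum")
    return (lambda_loc * L_loc + lambda_cls * L_cls + lambda_dir * L_dir) / n_pos
```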
An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the omnidirectional vision-oriented three-dimensional object detection method of any one of the claims when the program is executed.
The invention has the advantages and beneficial effects as follows:
1) The invention collects the complete environment around the vehicle with a combination of multiple cameras and can identify and detect three-dimensional target information in the four directions in front of, behind, to the left of and to the right of the vehicle. Meanwhile, long-focus and short-focus cameras are combined at the front of the vehicle: the short-focus image has a large field of view but few pixels on distant targets, while the long-focus image has a narrower field of view but more pixels on distant targets.
2) According to the invention, the 2D convolution backbone network and the FPN network are utilized to extract image features, the image multi-scale features are obtained first, then the multi-scale features are fused, semantic information of the image is fully utilized, and feature extraction of a large target and a small target in the image is more efficient.
3) The invention projects the coordinates on the feature maps into 3D space using the principle of the pinhole imaging model, divides the 3D space into multiple voxel volumes, and obtains the final voxels from the projected 3D voxel volumes by simple element-wise averaging, which reduces the prediction error in image depth and the amount of computation while producing accurate results.
4) On a three-dimensional target detection network, the invention converts images acquired by a plurality of cameras into a detection task of a BEV plane, firstly obtains 2D representation of a 3D voxel body through 3D convolution and downsampling, and then carries out regression and classification through a traditional detection head to obtain a three-dimensional target detection result. The method avoids the prediction error on the height of the target, and simultaneously uses a mature two-dimensional target detection network to obtain the three-dimensional result of the target, and has the characteristics of wide target coverage, high detection accuracy and less omission.
5) The multi-camera acquisition data is utilized to obtain the three-dimensional target detection result of the omnidirectional vision, so that the high cost of using laser radar and other equipment is avoided, and the practicability is high.
In conventional three-dimensional target detection from images, the depth of the target is first predicted and then its size is determined, but the captured images lack data on target depth, so the depth prediction error is large. The present method projects the features into 3D space, thereby avoiding explicit prediction of the target depth in the image, and divides the 3D space into voxels, which makes the data easier to process, yields more accurate 3D target features, and improves subsequent detection and recognition.
Drawings
FIG. 1 is a flow chart of a three-dimensional object detection method for omni-directional vision according to a preferred embodiment of the present invention;
FIG. 2 is a schematic diagram of the mounting locations of multiple cameras for omni-directional vision;
Fig. 3 is a schematic diagram of an architecture of a network model employed.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and specifically described below with reference to the drawings in the embodiments of the present invention. The described embodiments are only a few embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
The invention provides a three-dimensional object detection method oriented to omnidirectional vision for acquiring the size, position and pose of the vehicles around the ego vehicle in a traffic scene, in particular the position of the center of each surrounding vehicle relative to the ego vehicle, its deflection angle, and its length, width and height. Omnidirectional vision here means using multiple cameras, for example 7 cameras installed at the positions shown in fig. 2. The three-dimensional detection output is the category of the surrounding vehicles, the position of their centers relative to the ego vehicle, and their length, width, height and yaw angle. The method comprises the following steps:
Step 1, installing a plurality of camera modules so that the images acquired by the cameras contain all scene information around the vehicle.
And 2, calibrating among a plurality of cameras, calibrating internal parameters of the cameras and external parameters among the cameras, and acquiring data of an omnidirectional environment to obtain image data of the plurality of cameras.
Step 3, extracting image features from the acquired image data by using a backbone network, and fusing features with different scales by using a feature pyramid network to obtain features with rich scale information and semantic information;
step 4, back-projecting the extracted features into a 3D space by using a projection model of a pinhole camera, and dividing the 3D space into volume blocks in the 3D space according to the same volume size mode to form a plurality of 3D voxel bodies;
Step 5, inputting the 3D voxel volume data into a neural network and encoding the 3D voxel volume with a Transformer encoder architecture to obtain an encoded 3D voxel volume;
And step 6, inputting the Transformer-encoded tensor into a 3D convolutional neural network detection head to predict the three-dimensional information of targets and displaying it in the Bird's Eye View (BEV).
As a possible implementation manner of this embodiment, the step 1 of installing a plurality of camera modules, so that an image collected by a camera can include all scene information around, includes the following steps:
Step 11, installing the multiple cameras so that they lie on the same horizontal line and cover the four directions in front of, behind, to the left of and to the right of the vehicle roof; for example, 7 cameras are placed in these four directions on the same horizontal line, which ensures that all scene information around the vehicle is acquired.
And 12, installing control equipment, and ensuring that multiple cameras synchronously acquire data, so that images acquired by the multiple cameras at the same moment are scenes around the vehicle at the same moment.
As a possible implementation manner of this embodiment, the calibrating between the multiple cameras in step 2, calibrating the internal parameters of the cameras and the external parameters between the cameras, and collecting the data of the omni-directional environment at the same time, and obtaining the image data of the multiple cameras includes the following steps:
Step 21, according to the installation positions of the cameras, placing calibration boards in the front, rear, left and right directions of the vehicle; since there are several camera positions, each calibration board must lie in the common field of view of the cameras concerned, and the boards are placed in different postures and positions while calibration images are acquired to calibrate the internal and external parameters of the cameras.
Step 22, collecting scene information around the vehicle, wherein the image collection frequency of the camera is kept constant, and the camera is set to automatically expose in the collection process; image data of multiple cameras is acquired.
As a possible implementation manner of this embodiment, the step 3 extracts image features from the collected image data using a backbone network, and fuses features with different scales by using a feature pyramid network to obtain features with rich scale information and semantic information, and includes the following steps:
And step 31, extracting image features from the input image by using a pre-trained 2D convolution backbone network to obtain feature images with multiple scales, and providing input for subsequent processing.
And step 32, fusing the extracted image multi-scale features by using a feature pyramid model (FPN) to output a tensor of the whole 2D features.
Step 33, the extracted multi-scale image feature maps have the shape
where W and H are the input image size and α0 is a constant whose specific value depends on the dimension of the 2D convolution backbone network used.
Step 34, the feature pyramid model up-samples the high-level feature maps and fuses them with the adjacent lower-level feature maps by element-wise addition, increasing the scale and semantic information of the feature maps. The output shape after multi-scale feature fusion with the FPN model is
where α1 is a constant whose specific value depends on the FPN dimension used.
As a possible implementation manner of this embodiment, the step 4 uses the projection model of a pinhole camera to back-project the extracted features into 3D space and divides the 3D space into volume blocks of the same size, forming a plurality of 3D voxel volumes, and includes the following steps:
Step 41, using the pinhole imaging model to determine the relationship between the 2D coordinates (u, v) on the feature map and the 3D coordinates (x, y, z) in the voxel volume, as follows:
(u, v) = P(K·R·(x, y, z, 1)ᵀ)
where K and R are the internal and external parameter matrices and P is the perspective mapping. After the 2D features are projected, all voxels on a given camera ray are filled with the same features.
Step 42, defining a binary mask Mt with the same shape as the voxel volume, indicating whether each voxel lies inside the camera view frustum. For each image It, Mt is defined as:
Mt(x, y, z) = 1 if P(K·R·(x, y, z, 1)ᵀ) falls inside image It, and 0 otherwise.
Step 43, projecting the FPN-aggregated 2D features Ft into each valid voxel of the voxel volume Vt:
Vt(x, y, z) = Ft(u, v)·Mt(x, y, z)
Step 44, fusing M1, ..., Mt to obtain a binary mask M:
M = M1 ∨ M2 ∨ ... ∨ Mt
Step 45, averaging the features of V1, ..., Vt to obtain the final 3D voxel volume V:
V(x, y, z) = Σi Vi(x, y, z) / max(1, Σi Mi(x, y, z))
As a possible implementation manner of this embodiment, the step 5 inputs the 3D voxel volume data into a neural network and encodes the 3D voxel volume with a Transformer encoder architecture to obtain an encoded 3D voxel volume, and includes the following steps:
Step 51, inputting the 3D voxel volume data into the neural network and encoding the 3D voxel volume with the Transformer encoder architecture to obtain an encoded 3D voxel volume; specifically, the projected 3D voxel volume is input into an encoder network structure formed by several 3D convolution and downsampling operations, where the input shape is: Nx×Ny×Nz×α1
where Nx, Ny, Nz are the voxel volume sizes along the x-axis, y-axis and z-axis, respectively.
Step 52, the output shape after the Transformer encoder architecture is:
Nx×Ny×α2
where α2 is a constant related to the dimension of the convolution layers.
As a possible implementation manner of this embodiment, the step 6 of inputting the Transformer-encoded tensor into a 3D convolutional neural network detection head to predict the three-dimensional information of targets and displaying it in the Bird's Eye View (BEV) includes the following steps:
Step 61, the detection head comprises two parallel 2D convolution layers: one estimates the class probability and the other regresses the 7 parameters of the bounding box, (x, y, z, w, h, l, θ), where (x, y, z) are the coordinates of the center point, (w, h, l) are the width, height and length, and θ is the rotation angle around the z-axis.
Step 62, the input shape is:
Nx×Ny×α2
For the output, the detection head returns a class probability p and a 3D box containing the 7 bounding box parameters (x, y, z, w, h, l, θ).
Step 63, the result is corrected using the loss function. The loss comprises several terms, namely a localization loss Lloc, a classification loss Lcls and a direction loss Ldir. Then:
L = (λloc·Lloc + λcls·Lcls + λdir·Ldir) / npos
where npos is the number of positive-sample three-dimensional boxes and λloc, λcls, λdir are constants weighting each loss term.
Step 64, projecting the detected 3D boxes onto a Bird's Eye View (BEV) plane for display according to the projection relation.
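As a non-limiting illustration of step 64, the following sketch projects a detected 3D box onto the bird's eye view plane by dropping the z axis and rotating its footprint by the yaw angle θ; the metre-to-pixel scale, image size and the example box are assumptions of this illustration.

```python
# Minimal sketch of step 64: draw the BEV footprint of a detected 3D box.
import numpy as np
import cv2

def draw_box_on_bev(bev_img, box, metres_per_pixel=0.1, origin=(400, 400)):
    """box = (x, y, z, w, h, l, theta); the BEV footprint uses (x, y), (w, l) and theta."""
    x, y, _, w, _, l, theta = box
    # four footprint corners in the object frame, rotated by theta and translated to (x, y)
    corners = np.array([[ l / 2,  w / 2], [ l / 2, -w / 2],
                        [-l / 2, -w / 2], [-l / 2,  w / 2]])
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    corners = corners @ rot.T + np.array([x, y])
    # convert metres to BEV pixel coordinates
    pix = (corners / metres_per_pixel + np.array(origin)).astype(np.int32)
    cv2.polylines(bev_img, [pix.reshape(-1, 1, 2)], isClosed=True,
                  color=(0, 255, 0), thickness=2)
    return bev_img

bev = np.zeros((800, 800, 3), dtype=np.uint8)
bev = draw_box_on_bev(bev, (5.0, 2.0, 0.0, 1.8, 1.5, 4.5, np.deg2rad(30)))
```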
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.
The above examples should be understood as illustrative only and not limiting the scope of the invention. Various changes and modifications to the present invention may be made by one skilled in the art after reading the teachings herein, and such equivalent changes and modifications are intended to fall within the scope of the invention as defined in the appended claims.

Claims (10)

1. The three-dimensional target detection method facing to the omnidirectional vision is characterized by comprising the following steps of:
step 1, installing a plurality of camera modules, so that an image acquired by a camera can contain all scene information around;
Step 2, calibrating among a plurality of cameras, calibrating internal parameters of the cameras and external parameters among the cameras, and acquiring data of an omnidirectional environment to obtain image data of the plurality of cameras;
Step 3, extracting image features from the acquired image data by using a backbone network, and fusing features with different scales by using a feature pyramid network to obtain features with rich scale information and semantic information;
Step 4, back-projecting the extracted features into a 3D space by using a projection model of a pinhole camera, and dividing the 3D space into volume blocks in the 3D space according to the same volume size mode to form a plurality of 3D voxel bodies;
Step 5, inputting the 3D voxel volume data into a neural network and encoding the 3D voxel volume with a Transformer encoder architecture to obtain an encoded 3D voxel volume;
And step 6, inputting the Transformer-encoded tensor into a 3D convolutional neural network detection head to predict the three-dimensional information of targets and displaying it in the bird's eye view BEV.
2. The method for detecting a three-dimensional object oriented to omni-directional vision according to claim 1, wherein a plurality of camera modules are installed in the step 1, so that an image collected by a camera can contain scene information of all four sides, and the method specifically comprises:
step 11, a plurality of cameras are arranged on the same horizontal line and are arranged in the front, back, left and right directions of the vehicle roof, so that all scene information around the vehicle can be obtained;
Step 12, designing program codes, installing single-chip microcomputer control equipment, controlling the start and end of acquisition in a key triggering mode, ensuring that multiple cameras acquire data synchronously, and enabling images acquired by the multiple cameras at the same moment to be scenes around the vehicle at the same moment.
3. The method for detecting the three-dimensional object oriented to the omnidirectional vision according to claim 1, wherein the calibrating between the plurality of cameras in the step 2, calibrating the internal parameters of the cameras and the external parameters between the cameras, and simultaneously collecting the data of the omnidirectional environment, and obtaining the image data of the plurality of cameras, comprises the following steps:
Step 21, acquiring calibration images of a black-and-white checkerboard according to the installation positions of the cameras, and extracting the pixel coordinates of the corner points on the calibration board images with Zhang Zhengyou's calibration method; the world coordinate system is defined in advance, so the coordinates of the corners on the calibration board in the world coordinate system are known; the cameras are calibrated using the pixel coordinates of each corner point and its physical coordinates in the world coordinate system to obtain the intrinsic matrix, extrinsic matrix and distortion coefficients of each camera.
Step 22, collecting scene information around the vehicle, wherein the image collection frequency of the camera is kept constant, and the camera is set to automatically expose in the collection process; image data of multiple cameras is acquired.
4. The method for detecting an omni-directional vision-oriented three-dimensional object according to claim 1, wherein in the step 3, for views of a plurality of cameras, features of the object in the view are extracted by using a neural network, and the method specifically comprises:
Step 31, extracting image features from an input image by using a pre-trained 2D convolution backbone network to obtain feature images with multiple scales, and providing input for subsequent processing;
and step 32, fusing the extracted image multi-scale features by using a feature pyramid model FPN, and outputting a whole 2D feature tensor.
5. The method for detecting the three-dimensional object oriented to the omni-directional vision according to claim 4, wherein the step 31 extracts image features from the input image by using a pre-trained 2D convolution backbone network to obtain feature maps with multiple scales, specifically:
The 2D convolution backbone network mainly comprises an input layer, convolution layers, residual blocks, batch normalization layers, pooling layers and the like. The input layer receives an image and passes it to the next layer; the convolution layers convolve the input image with convolution kernels to extract multi-scale image features; a residual block adds its input to the output of its convolution layers, reducing information loss; batch normalization standardizes the data; the pooling layers reduce the data dimension while retaining the main features. The shape of the extracted multi-scale image feature maps is
where W and H are the input image size and α0 is a constant whose specific value depends on the dimension of the 2D convolution backbone network used.
6. The method for detecting an omni-directional vision-oriented three-dimensional object according to claim 4, wherein the step 32 is to aggregate the extracted multi-scale features of the image by using a feature pyramid model FPN, and output a tensor of an integral 2D feature, specifically:
The feature pyramid model up-samples the high-level feature maps and fuses them with the adjacent lower-level feature maps by element-wise addition, increasing the scale and semantic information of the feature maps. The output shape after multi-scale feature fusion with the FPN model is
where α1 is a constant whose specific value depends on the FPN dimension used.
7. The method for detecting a three-dimensional object oriented to omnidirectional vision according to claim 1, wherein the step 4 uses the projection model of a pinhole camera to back-project the extracted features into 3D space and divides the 3D space into volume blocks of the same size, forming a plurality of 3D voxel volumes, and specifically comprises:
Step 41, using the pinhole imaging model to determine the relationship between the 2D coordinates (u, v) on the feature map and the 3D coordinates (x, y, z) in the voxel volume, as follows:
(u, v) = P(K·R·(x, y, z, 1)ᵀ)
wherein K and R are the internal and external parameter matrices and P is the perspective mapping; after the 2D features are projected, all voxels on a given camera ray are filled with the same features;
Step 42, defining a binary mask Mt with the same shape as the voxel volume, indicating whether each voxel lies inside the camera view frustum; for each image It, Mt is defined as:
Mt(x, y, z) = 1 if P(K·R·(x, y, z, 1)ᵀ) falls inside image It, and 0 otherwise;
Step 43, projecting the FPN-aggregated 2D features Ft into each valid voxel of the voxel volume Vt:
Vt(x, y, z) = Ft(u, v)·Mt(x, y, z)
Step 44, fusing M1, ..., Mt to obtain a binary mask M:
M = M1 ∨ M2 ∨ ... ∨ Mt
Step 45, averaging the features of V1, ..., Vt to obtain the final 3D voxel volume V:
V(x, y, z) = Σi Vi(x, y, z) / max(1, Σi Mi(x, y, z))
8. The omnidirectional vision-oriented three-dimensional object detection method according to claim 1, wherein the step 5 inputs the 3D voxel volume data into a neural network and encodes the 3D voxel volume with a Transformer encoder architecture to obtain an encoded 3D voxel volume, and specifically comprises:
Step 51, encoding the 3D voxel volume with the Transformer encoder architecture to obtain an encoded 3D voxel volume; specifically, the projected 3D voxel volume is input into a network structure formed by several 3D convolution and downsampling operations, where the input shape is:
Nx×Ny×Nz×α1
where Nx, Ny, Nz are the voxel volume sizes along the x-axis, y-axis and z-axis, respectively.
Step 52, the output shape after the Transformer encoder architecture is:
Nx×Ny×α2
where α2 is a constant related to the dimension of the convolution layers.
9. The method for detecting the three-dimensional object oriented to omnidirectional vision according to claim 1, wherein the step 6 inputs the Transformer-encoded tensor into a 3D convolutional neural network detection head to predict the three-dimensional information of targets and displays it in the bird's eye view BEV, and specifically comprises:
Step 61, the detection head comprises two parallel 3D convolution layers: one estimates the class probability and the other regresses the 7 parameters of the bounding box, (x, y, z, w, h, l, θ), where (x, y, z) are the coordinates of the center point, (w, h, l) are the width, height and length, and θ is the rotation angle around the z-axis;
Step 62, the input shape is:
Nx×Ny×α2
For the output, the detection head returns a class probability p and a 3D box containing the bounding box parameters (x, y, z, w, h, l, θ);
Step 63, the result is corrected using the loss function; the loss comprises several terms, namely a localization loss Lloc, a classification loss Lcls and a direction loss Ldir. Then:
L = (λloc·Lloc + λcls·Lcls + λdir·Ldir) / npos
where npos is the number of positive-sample three-dimensional boxes and λloc, λcls, λdir are constants weighting each loss term;
Step 64, projecting the detected 3D boxes onto the bird's eye view BEV plane for display according to the projection relation.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the omnidirectional vision-oriented three-dimensional object detection method according to any one of claims 1 to 9 when the program is executed by the processor.
CN202311697395.0A 2023-12-12 2023-12-12 Omnidirectional vision-oriented three-dimensional target detection method Pending CN117953483A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311697395.0A CN117953483A (en) 2023-12-12 2023-12-12 Omnidirectional vision-oriented three-dimensional target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311697395.0A CN117953483A (en) 2023-12-12 2023-12-12 Omnidirectional vision-oriented three-dimensional target detection method

Publications (1)

Publication Number Publication Date
CN117953483A 2024-04-30

Family

ID=90791434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311697395.0A Pending CN117953483A (en) 2023-12-12 2023-12-12 Omnidirectional vision-oriented three-dimensional target detection method

Country Status (1)

Country Link
CN (1) CN117953483A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination