CN118172422A - Method and device for positioning and imaging interest target by combining vision, inertia and laser

Method and device for positioning and imaging interest target by combining vision, inertia and laser

Info

Publication number
CN118172422A
CN118172422A (application CN202410564402.8A)
Authority
CN
China
Prior art keywords
target
interest
data
module
point cloud
Prior art date
Legal status
Pending
Application number
CN202410564402.8A
Other languages
Chinese (zh)
Inventor
杨必胜
陈驰
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202410564402.8A priority Critical patent/CN118172422A/en
Publication of CN118172422A publication Critical patent/CN118172422A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a vision, inertia and laser cooperative method and device for positioning and imaging a target of interest. A data coupling and enhancement module, which combines a dim light enhancement module for the vision sensor with a data coupling network, adapts to different illumination conditions and improves data quality. A proposed target detection neural network and a projection method automatically identify the target and accurately determine its position relative to the system. The method combines laser radar point cloud data with multi-angle images and achieves high-precision multidimensional imaging of the target of interest through voxel map representation, covariance estimation and photometric gradient optimization. Applied to unmanned vehicles, robot navigation, high-precision mapping and related fields, the method markedly improves the efficiency and accuracy of target positioning and imaging through efficient data fusion and processing.

Description

Method and device for positioning and imaging interest target by combining vision, inertia and laser
Technical Field
The invention relates to the technical field of target positioning and imaging, and in particular to a method and a device for positioning and imaging a target of interest by combining vision, inertia and laser. The technology is mainly applied to unmanned vehicles, robot navigation, high-precision mapping and related fields.
Background
With the rapid development of automation and intelligent technologies, the need for accurate, real-time target positioning and imaging has become critical in fields such as unmanned vehicles, robot navigation and high-precision mapping. Conventional target localization methods rely primarily on a single sensor technology, such as a vision system or a laser radar system, but these methods face significant limitations in complex environments.
Camera-based vision systems degrade under low illumination or extreme illumination changes and are also severely affected by occlusion of the line of sight. Although vision systems have advantages in cost and availability, their reliability in complex environments is challenged by many factors. Laser radar systems excel in certain applications thanks to high-precision ranging and good interference resistance, but they are expensive and sensitive to severe weather such as rain and fog and to the reflective properties of certain materials, which limits their wide application. Inertial measurement units (IMUs) provide continuous motion tracking but can hardly provide absolute position information on their own, and their data accumulate errors over time, affecting accuracy during long-term use.
Faced with these limitations of single-sensor technology, existing positioning and imaging systems fall far short of the high accuracy and high reliability required in highly dynamic, diverse environments. This challenge is particularly acute in applications demanding extremely high reliability and accuracy, such as autonomous cars navigating complex urban environments or accurate mapping over varying terrain.
Disclosure of Invention
In view of the limitations of the prior art, the invention provides a vision, inertia and laser cooperative target-of-interest positioning imaging method and device. The technology fuses the data of three different types of sensors acquired by the device, namely visible light image data, laser radar (LiDAR) point cloud data and inertial measurement unit (IMU) measurements, to achieve more accurate and more stable target positioning and imaging.
The invention relates to a visual, inertial and laser cooperative target-of-interest positioning imaging method, which comprises the following steps:
Step 1, visible light image data, laser radar data, UWB distance data and system IMU direction information related to the positioning target are obtained;
Step 2, constructing a coupling-enhancing network model comprising a dim light enhancing module and a data coupling module, wherein the dim light enhancing module enhances visible light image data; transforming the laser radar data to obtain a sparse depth map, wherein the data coupling module comprises a visual branch aiming at the enhanced visible light image data and a laser branch aiming at the sparse depth map, and the depth maps predicted by the data coupling module from the two branches are adaptively fused with the learned confidence weights and output as a dense depth map;
Step 3, detecting an interest target boundary box from the enhanced visible light image data by a target detection method;
The position of the target of interest relative to the initial position of the system is then obtained by combining: the relative distance between the target of interest and the system, calculated from the target position (the center of the target bounding box) together with UWB ranging; the direction measurements acquired by the system IMU; and the direction vector of the line connecting the target and the system position, measured by combining the laser radar and the UWB distance;
Step 4, obtaining a Gaussian parameterized point cloud map from the dense depth map; optimizing the spherical harmonic coefficients and the Gaussian parameterized point cloud map by combining the enhanced visible light image data and the relative position of the target of interest to obtain a visually optimized point cloud map, and obtaining a structurally optimized point cloud map by an adaptive control method; and combining the visually and structurally optimized point cloud maps and synthesizing with the Gaussian model to obtain multidimensional imaging of the target of interest.
Further, the dim light enhancement module converts the input image under the given illumination condition, i.e. the originally acquired visible light image data, into a target output image under proper uniform light, wherein M and A denote the local components, gamma is the correction parameter, a joint color transform matrix is applied, and max() takes the maximum value.
The dim light enhancement module comprises a local component and a global component;
The local component expands the channel dimension of the standard red, green and blue (sRGB) color space data through two convolution layers and passes the expanded data to two independent branches, each stacked from pixel-by-pixel enhancement modules (PEM) with identical structure: position information is first encoded by a depthwise convolution, local details are then enhanced by a point convolution-depth convolution-point convolution structure, and finally two convolutions separately enhance the token representation; the stacked PEMs connect the output features to the input features by element-wise addition through a skip connection, the channel dimension is reduced by a convolution, and the local components M and A are generated by activation functions, using ReLU or Tanh;
The global component first stacks two convolutions as a lightweight encoder; the generated features are passed to a Transformer multi-head attention module, and after a feedforward network with two linear layers, two additional initialized learnable parameter layers output the color matrix and the correction parameter gamma.
Further, the two branch trunks of the data coupling module are both encoder-decoder networks with symmetric skip connections; the encoder comprises a convolution layer and ten residual blocks, the decoder comprises several deconvolution layers and a convolution layer, and each convolution layer is followed by a BN layer and a ReLU activation layer; the decoder features of the visual branch are associated with the corresponding encoder features in the laser branch, and the depth map predicted by the trunks is refined with CSPN++ to obtain the dense depth map.
Further, the target detection method in step3 is specifically as follows:
Constructing a target detection model, converting a target detection task into a diffusion process from a noise boundary box to a target boundary box, wherein the model comprises an encoder and a decoder;
The encoder extracts deep features from the image and runs only once; it adopts the large-kernel network LKNet structure, comprising four stages, each consisting of several large-kernel convolution blocks, small-kernel convolution blocks and large-kernel convolution blocks; the stages are connected by downsampling blocks, and each large-kernel convolution block consists of a depthwise convolution layer and a feedforward network with global response normalization (GRN) units;
The decoder in the target detection model consists of several cascaded stages; in the evaluation stage the target candidate boxes are sampled from a Gaussian distribution, the detection head is reused for evaluation in an iterative manner, parameters are shared among the different steps, and each step is embedded into the corresponding diffusion process;
and detecting the bounding box of the target of interest from the enhanced visible light image data by using the trained target detection model.
Further, training of the target detection model is as follows:
Additional random boxes are concatenated with the real (ground-truth) boxes to obtain a certain number of padded boxes; Gaussian noise is added to the padded boxes, which are used as the detector input to predict their classes and coordinates; the top k predictions with the lowest cost are selected by an optimal transport assignment method, several predicted boxes are assigned to each ground-truth box, and the aggregate prediction loss is used as the training loss function, with k set according to actual needs.
Further, the specific process of obtaining the Gaussian parameterized point cloud map is as follows:
Small voxels are used and adaptively subdivided several times to approximate the Gaussian surface; a subdivided voxel is described, from all the points p_i it contains, by their average position, a normal vector, and the covariance matrix of the point distribution within the voxel,
mu = (1/N) * sum_{i=1..N} p_i,  Sigma = (1/N) * sum_{i=1..N} (p_i - mu)(p_i - mu)^T,
wherein N represents the number of all points, T represents the transposition of the matrix, and i is the sequence number of the points in the voxel; to ensure that no data holes are created when scaling up, a scaling factor is introduced for each point to adjust the covariance matrix Sigma to a scaled covariance matrix.
Further, the specific process of optimizing the spherical harmonic coefficients and the Gaussian parameterized point cloud map is as follows:
First, the corresponding spherical harmonic coefficients are initialized from the Gaussian parameterized map, and the spherical harmonic coefficients are then computed using second-order spherical harmonic functions, each Gaussian requiring several harmonic coefficients; the photometric gradient is calculated from the enhanced visible light image and, combined with the relative position of the target of interest, the spherical harmonic coefficients are adjusted through photometric gradient optimization, refining the visual realism and detail of the point cloud.
The structure optimization process of the point cloud map comprises the following steps:
A visually optimized point cloud map is input; for under-scanned regions, the point cloud map areas that need densification are identified, neighboring Gaussians are copied, and the copied Gaussians are positioned by photometric gradient optimization to fill the gaps; for excessively dense regions, the net contribution is periodically evaluated and superfluous Gaussians are eliminated.
Based on the same inventive concept, the scheme also provides a vision, inertia and laser cooperative target-of-interest positioning imaging system, which comprises a sensor module for acquiring visible light image data, laser radar data, UWB distance data and system IMU direction information related to the target to be positioned; the sensor module comprises a multi-line laser radar, a MEMS IMU inertial sensor, a visible light camera and a UWB positioning device;
The data transmission module is used for connecting each sensor with the data processing module;
The data processing module is used for processing the data acquired by the sensor module and carrying out target-of-interest positioning imaging;
The data processing module includes:
The data coupling and enhancing module is used for enhancing the visible light image data; transforming the laser radar data to obtain a sparse depth map, wherein the data coupling module comprises a visual branch aiming at the enhanced visible light image data and a laser branch aiming at the sparse depth map, and the depth maps predicted by the data coupling module from the two branches are adaptively fused with the learned confidence weights and output as a dense depth map;
the object detection module is used for detecting the bounding box of the target of interest from the enhanced visible light image data by a target detection method, and for obtaining the position of the target of interest relative to the initial position of the system by combining: the relative distance between the target of interest and the system, calculated from the target position (the center of the target bounding box) together with UWB ranging; the direction measurements acquired by the system IMU; and the direction vector of the line connecting the target and the system position, measured by combining the laser radar and the UWB distance;
the multidimensional imaging module obtains a Gaussian parameterized point cloud map from the dense depth map; optimizes the spherical harmonic coefficients and the Gaussian parameterized point cloud map by combining the enhanced visible light image data and the relative position of the target of interest to obtain a visually optimized point cloud map; obtains a structurally optimized point cloud map by an adaptive control method; and combines the visually and structurally optimized point cloud maps and synthesizes with the Gaussian model to obtain multidimensional imaging of the target of interest.
The invention further provides a vision, inertia and laser cooperative target-of-interest positioning imaging device for data acquisition, comprising a vision sensor, an inertial measurement unit and a laser radar sensor. These sensors acquire visual information, motion state information and distance information in various environments and work cooperatively while remaining time-synchronized and spatially aligned, yielding multiple acquired indoor and outdoor scene data sets.
Based on the same inventive concept, the invention also designs an electronic device comprising:
One or more processors;
A storage means for storing one or more programs;
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement a method of object-of-interest imaging with visual, inertial and laser coordination.
Based on the same inventive concept, the invention also designs a computer readable medium, on which a computer program is stored, which when being executed by a processor, realizes a visual, inertial and laser cooperative target of interest positioning imaging method.
The invention has the advantages that:
A data coupling and enhancement module is designed so that the data collected by the vision sensor and the laser radar sensor can be effectively fused and enhanced, and the performance of the whole system is improved through the cooperative operation of the sensors. For example, the performance of a vision sensor degrades in a dim-light environment, whereas the laser radar can still provide accurate range information; under dim light, more accurate depth information can be obtained through laser radar depth completion, with a root mean square error at the meter level and a throughput of 20 frames per second. The design automatically identifies and locates targets of interest: through deep learning and machine vision, targets of interest are effectively identified in the scene and accurately positioned, with a detection accuracy of 0.85 mIoU and a positioning accuracy of 5 cm. Finally, the system performs high-definition multidimensional computational imaging of the target of interest, effectively fusing multi-sensor data and synthesizing high-definition images from new viewpoints, with a peak signal-to-noise ratio greater than 15 dB and a structural similarity greater than 0.6. By integrating the advantages of vision, laser radar and the inertial measurement unit, the invention not only overcomes the limitations of each individual technology but also provides a high-precision, high-reliability target positioning and imaging solution in a variety of complex environments through an efficient data fusion algorithm.
Drawings
FIG. 1 is a flow chart of a method for imaging a target of interest with vision, inertia and laser coordination in accordance with an embodiment of the present invention.
FIG. 2 is a flow chart of a data coupling and enhancement method according to an embodiment of the present invention.
FIG. 3 is a flowchart of a method for detecting an interest object according to an embodiment of the present invention.
FIG. 4 is a diagram of a model of an object-of-interest location imaging assembly with visual, inertial and laser coordination in accordance with an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is described below with reference to the accompanying drawings and examples.
Example 1
Referring to Fig. 1, the present embodiment provides a vision, inertia and laser cooperative method for positioning and imaging a target of interest, which comprises the following steps:
Step 1, visible light image data, laser radar data, UWB distance data and system IMU direction information related to the positioning target are obtained;
Step 2, data coupling and enhancement: a data coupling and enhancement neural network is constructed to couple and enhance the visible light image data and the laser radar point cloud data acquired by the vision sensor and the laser radar sensor, yielding enhanced visible light image data and a dense depth map, as shown in Fig. 2. The network structure mainly comprises two parts: a dim light enhancement module for the vision sensor and a data coupling network for the data collected by the vision sensor and the laser radar sensor.
Step 2.1, the vision sensor dim light enhancement module converts the input sRGB image acquired under the given illumination condition, i.e. the originally collected visible light image data, into the target output RGB image under proper uniform light, i.e. the enhanced visible light image data, as in Equation 1. Both images lie in the space of real-valued arrays of size H x W x 3, where the real number field means that image pixel values may be any real number, H and W are the spatial dimensions and 3 is the channel dimension. The dim light enhancement module is divided into two branches, a local branch and a global branch, which extract and fuse features of the original image.
In Equation 1, M and A denote the local components, gamma is the correction parameter, a joint color transform matrix is applied, and max() takes the maximum value; the input is the image under the given illumination condition (the originally acquired visible light image data) and the output is the target image under proper uniform light. In this embodiment the joint color transform matrix is a 3 x 3 matrix that accounts for the white balance and the color transformation.
The local branch first expands the channel dimension of the sRGB data through two convolution layers and passes the result to two independent branches stacked from pixel-wise enhancement modules (PEM). The structure of each PEM is identical: position information is first encoded by a 3 x 3 depthwise convolution, local detail is then enhanced using a PWConv-DWConv-PWConv structure, and finally two 1 x 1 convolutions separately enhance the token representation. Three PEMs are stacked in each branch, and the output features are then added element-wise to the input features through a skip connection. Finally, the channel dimension is reduced by a 3 x 3 convolution, and the local components M and A are generated by activation functions (ReLU or Tanh) to correct for the effects of illumination.
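For illustration, the local branch and a single PEM can be sketched as follows (a minimal PyTorch sketch; the channel width, activation placement and head design are assumptions for illustration, and only the layer types described above, i.e. the 3 x 3 depthwise convolution, the PWConv-DWConv-PWConv structure and the two 1 x 1 convolutions, follow the text):

```python
# Minimal sketch of a PEM and of the local branch that outputs M and A.
import torch
import torch.nn as nn

class PEM(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        # 3x3 depthwise conv encodes positional information
        self.pos = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        # point conv -> depthwise conv -> point conv enhances local detail
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, 1),
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, 1),
        )
        # two 1x1 convs refine the token representation
        self.token = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.GELU(),
                                   nn.Conv2d(dim, dim, 1))

    def forward(self, x):
        x = x + self.pos(x)
        x = x + self.local(x)
        return x + self.token(x)

class LocalBranch(nn.Module):
    """Produces the local components M (multiplicative) and A (additive)."""
    def __init__(self, dim: int = 32, n_pem: int = 3):
        super().__init__()
        self.expand = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1),
                                    nn.Conv2d(dim, dim, 3, padding=1))
        self.branch_m = nn.Sequential(*[PEM(dim) for _ in range(n_pem)])
        self.branch_a = nn.Sequential(*[PEM(dim) for _ in range(n_pem)])
        self.head_m = nn.Conv2d(dim, 3, 3, padding=1)
        self.head_a = nn.Conv2d(dim, 3, 3, padding=1)

    def forward(self, img):                       # img: (B, 3, H, W) sRGB
        feat = self.expand(img)
        m = torch.relu(self.head_m(self.branch_m(feat) + feat))  # ReLU head
        a = torch.tanh(self.head_a(self.branch_a(feat) + feat))  # Tanh head
        return m, a
```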
The global branch first stacks two convolutions as a lightweight encoder, encoding high-dimensional features at a lower resolution. The generated features are then passed to a Transformer multi-head attention module; after a feedforward network (FFN) with two linear layers, two additional initialized learnable parameter layers output the color matrix and the correction parameter gamma.
The enhanced visible light image data can then be computed by Equation 1 from the local components M and A output by the local branch and the color matrix and gamma output by the global branch.
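One plausible composition of the local and global outputs is sketched below; the exact form of Equation 1 appears only as an image in the source, so the order of the pixel-wise correction, the color transform and the gamma correction here is an assumption:

```python
# Hedged sketch of combining M, A, the joint colour matrix W and gamma.
import torch

def enhance(img, M, A, W, gamma):
    """img: (B,3,H,W) low-light sRGB in [0,1]; M, A: (B,3,H,W);
    W: (B,3,3) joint colour transform; gamma: (B,1) correction parameter."""
    x = img * M + A                                   # pixel-wise local correction
    x = torch.einsum('bchw,bdc->bdhw', x, W)          # joint colour transform
    x = x.clamp(min=1e-6) ** gamma.view(-1, 1, 1, 1)  # gamma correction
    return x.clamp(0.0, 1.0)
```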
Step 2.2, the data coupling network for the data acquired by the vision sensor and the laser radar sensor. The overall framework consists of two branch trunks and a depth refinement module. The sparse depth map, obtained by a simple transformation of the laser point cloud acquired by the laser radar, and the enhanced visible light image data obtained in Step 2.1 serve as the inputs of this step. The two branch trunks are a vision-sensor-dominant branch and a laser-radar-dominant branch, i.e. the visual branch and the laser branch. The depth maps predicted from the two branches are adaptively fused with learned confidence weights, and the fused features are further fed into the refinement module to improve depth quality and output a dense depth map.
The two branch trunks are identical encoder-decoder networks with symmetric skip connections. The encoder comprises one convolution layer and ten basic residual blocks in series; the decoder concatenates five deconvolution layers and one convolution layer, and each convolution layer is followed by a BN layer and a ReLU activation layer. A decoder-encoder fusion strategy is adopted to fuse the features of the visual branch into the laser branch: the decoder features of the visual branch are associated with the corresponding encoder features in the laser branch. Finally, the depth map predicted by the trunks is refined by CSPN++ to obtain the dense depth map.
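The confidence-weighted fusion of the two branch predictions can be sketched as follows (a minimal sketch; the softmax weighting and the tensor names are assumptions, as the text only states that the predicted depth maps are adaptively fused with learned confidence weights before refinement):

```python
# Fuse the visual-branch and laser-branch depth predictions with learned confidences.
import torch

def fuse_depths(depth_rgb, conf_rgb, depth_lidar, conf_lidar):
    """Each input: (B, 1, H, W). Returns the fused depth map."""
    w = torch.softmax(torch.cat([conf_rgb, conf_lidar], dim=1), dim=1)
    fused = w[:, :1] * depth_rgb + w[:, 1:] * depth_lidar
    return fused   # fed to the refinement module (e.g. CSPN++) afterwards
```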
Step 3, automatic identification and relative positioning of targets of interest, as shown in Fig. 3: a neural network for automatically identifying targets of interest in the sensing area and a relative positioning method are provided.
Step 3.1, the invention provides a target detection method, LKDiffusionDet, which detects the bounding box of the target of interest from the enhanced visible light image data. LKDiffusionDet casts the target detection task as a diffusion process from noise bounding boxes to target bounding boxes. During the training phase the target boxes diffuse from the ground-truth boxes to a random distribution, and the model learns to reverse this noising process. During inference, the proposed model progressively refines a set of randomly generated boxes into the output results, and the number of boxes and the number of evaluation iterations can be chosen dynamically. In the feature extraction stage, a large-kernel convolutional neural network, LKNet, is proposed; unlike a small-kernel convolutional network, LKNet can capture features over a wider range, improving the performance of the neural network.
LKDiffusionDet comprises an encoder and a decoder. The encoder extracts deep features from the image and runs only once; it uses the proposed LKNet to generate deep feature maps. LKNet adopts an efficient structure that enables communication between channels and spatial aggregation to increase depth. The body of the LKNet model is divided into four stages connected by several downsampling blocks. Specifically, the first downsampling module converts the original input into a feature map using two convolution layers with stride 2, and the other three downsampling modules each double the number of channels using a convolution layer with stride 2. The first stage consists of a depthwise convolution layer and a feedforward network with GRN units; a block regularization module used afterwards can be equivalently merged into the convolution layer to eliminate its inference cost, and another regularization module is used after the feedforward network. In the third stage, a basic structure consisting of several large-kernel convolution blocks, small-kernel convolution blocks and large-kernel convolution blocks is used, which improves feature extraction, expands the receptive field and increases the depth of feature extraction.
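One large-kernel block of the kind described above might look as follows (a sketch; the kernel size, the expansion ratio and the GRN formulation follow common practice and are not values taken from the patent):

```python
# Sketch of a large-kernel convolution block with a GRN-gated feed-forward network.
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global response normalization over spatial dimensions."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, dim, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, dim, 1, 1))

    def forward(self, x):                             # x: (B, C, H, W)
        gx = torch.norm(x, p=2, dim=(2, 3), keepdim=True)
        nx = gx / (gx.mean(dim=1, keepdim=True) + 1e-6)
        return self.gamma * (x * nx) + self.beta + x

class LargeKernelBlock(nn.Module):
    def __init__(self, dim, kernel_size=13, expansion=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * expansion, 1), nn.GELU(),
            GRN(dim * expansion),
            nn.Conv2d(dim * expansion, dim, 1))

    def forward(self, x):
        x = x + self.norm(self.dwconv(x))              # depthwise large-kernel conv
        return x + self.ffn(x)                         # feed-forward network with GRN
```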
In the decoder, the proposal boxes are perturbed from the real boxes during the training phase and sampled directly from a Gaussian distribution during the evaluation phase. The decoder consists of six cascaded stages and is evaluated iteratively by reusing the detection heads; parameters are shared between the different steps, and each step is embedded in the corresponding diffusion process.
During training, a diffusion process from the real boxes to noisy boxes is first constructed, and the model is then trained to reverse it. Some additional random boxes are first concatenated with the real boxes to obtain a fixed number N of boxes, and Gaussian noise is added to these padded boxes. The detector takes the N noisy boxes as input and predicts N classes and coordinates; the top-k predictions with the lowest cost are selected by an optimal transport assignment method, several predicted boxes are assigned to each ground-truth box, and the aggregate prediction loss is used as the training loss function. The process by which LKDiffusionDet predicts the target boxes is a denoising sampling process from noisy boxes to target boxes: starting from boxes sampled from a Gaussian distribution, the model progressively refines the predictions until the correct target bounding boxes are obtained. N may be set to {100, 300, 500} and was set to 300 in the experiments; k is typically set to 10 or 12, chosen according to the actual experimental data.
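The training-time box preparation can be sketched as follows (a hedged sketch; the cosine noise schedule, the signal scaling and the normalized cx-cy-w-h box encoding are assumptions in the style of diffusion-based detectors and are not stated in the patent):

```python
# Pad ground-truth boxes to N and corrupt them with Gaussian noise at step t.
import math
import torch

def prepare_noisy_boxes(gt_boxes, N=300, t=500, T=1000, scale=2.0):
    """gt_boxes: (n, 4) normalised cx, cy, w, h. Returns (N, 4) noisy boxes."""
    n = gt_boxes.shape[0]
    pad = torch.rand(max(N - n, 0), 4)            # random boxes fill up to N
    boxes = torch.cat([gt_boxes, pad], dim=0)[:N]
    boxes = (boxes * 2.0 - 1.0) * scale           # map to the diffusion range
    alpha_bar = math.cos((t / T + 0.008) / 1.008 * math.pi / 2) ** 2
    noise = torch.randn_like(boxes)
    noisy = math.sqrt(alpha_bar) * boxes + math.sqrt(1.0 - alpha_bar) * noise
    return noisy                                  # detector input at diffusion step t
```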
Step 3.2, after the target bounding box and its pixel position in the enhanced visible light image data are obtained, the direction of the target of interest relative to the system camera is derived from the camera projection model, with the center of the target bounding box taken as the target position. The relative position of the target and the system is then calculated by fusing the dense depth map obtained in Step 2.2, the UWB distance data acquired by the device, and the direction information in the IMU measurement data, as in Equation 2, where the relative distance between the target of interest and the system is calculated by combining UWB ranging, the direction measurements are acquired by the system IMU, and the direction vector of the line connecting the target and the system position is measured by combining the laser radar and the UWB distance. Given the calculated target-system distance and the direction obtained from the projection model, the initial relative position relationship between the target of interest and the system can be determined.
After the raw pose estimate is obtained, an error-state Kalman filter (ESKF) is applied to the raw relative pose estimate to improve its quality, the motion of the reference system is additionally taken into account, and the optimized relative position of the target of interest is finally obtained.
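The relative-position step can be sketched as follows (the exact Equation 2 appears only as an image in the source; the pinhole back-projection, the symbol names and the rotation convention below are assumptions for illustration):

```python
# Back-project the box centre to a direction, scale by the measured distance,
# and express the result in the IMU-defined reference frame.
import numpy as np

def relative_position(box_center_px, K, distance, R_imu):
    """box_center_px: (u, v) pixel centre of the target bounding box;
    K: 3x3 camera intrinsics; distance: UWB/lidar relative distance (m);
    R_imu: 3x3 rotation from camera frame to the system reference frame."""
    u, v = box_center_px
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    n = ray / np.linalg.norm(ray)          # unit direction in the camera frame
    p_cam = distance * n                   # target position in the camera frame
    return R_imu @ p_cam                   # expressed in the reference frame
```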
Step 4, high-definition multidimensional computational imaging of the target of interest. The imaging inputs are the dense laser radar point cloud recovered by a simple transformation of the dense depth map obtained in Step 2.2, the enhanced visible light image obtained in Step 2.1, and the relative position of the target of interest obtained in Step 3.2. First, using a scale-adaptive voxel representation of surfaces, the dense laser radar point cloud is divided into voxels to obtain a Gaussian parameterized point cloud map. Then, the spherical harmonic coefficients and the Gaussian parameterized point cloud map are optimized by combining the enhanced visible light image from Step 2.1 and the relative position of the target of interest from Step 3.2, yielding a visually optimized point cloud map. Next, the point cloud map is adaptively controlled and optimized to obtain a structurally optimized point cloud map. Finally, combining the structurally optimized point cloud map, new-view synthesis with the Gaussian model yields high-definition multidimensional imaging of the target of interest. The method is divided into four parts:
Step 4.1, Gaussian initialization based on LiDAR constraints. The dense laser radar point cloud is input and divided using voxels of adaptive scale in order to construct a fine Gaussian surface. Specifically, to obtain a fine map with Gaussian surface normal vectors, small voxels are used and the Gaussian surface is approximated by multiple adaptively subdivided voxels. Each subdivided voxel is described, from all the points p_i it contains, by their average position, a normal vector, and the covariance matrix of the point distribution within the voxel,
mu = (1/N) * sum_{i=1..N} p_i,  Sigma = (1/N) * sum_{i=1..N} (p_i - mu)(p_i - mu)^T,
wherein N represents the number of all points, T represents the transposition of the matrix, and i is the sequence number of the points in the voxel. The covariance matrix determines the approximate shape and pose of the point set, and hence the pose of the surface Gaussian. To achieve seamless integration of the dense laser radar point cloud, ensure that no data holes are created when scaling up, and maintain the integrity of the original data, a scaling factor determined by the point density is introduced for each point to adjust the covariance matrix to a scaled covariance matrix. This step outputs a Gaussian parameterized point cloud map with Gaussian surface normal vectors, containing parameters such as the average position, normal vector and covariance matrix.
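The per-voxel Gaussian parameters can be sketched as follows (a numpy sketch; taking the normal as the eigenvector of the smallest eigenvalue and applying a simple scalar scale to the covariance are assumptions consistent with, but not stated verbatim in, the text):

```python
# Compute mean, covariance, a surface normal and a scaled covariance for one voxel.
import numpy as np

def voxel_gaussian(points, scale=1.5):
    """points: (N, 3) laser radar points falling in one voxel."""
    mu = points.mean(axis=0)                     # average position of the points
    d = points - mu
    cov = d.T @ d / points.shape[0]              # covariance of the point distribution
    eigvals, eigvecs = np.linalg.eigh(cov)
    normal = eigvecs[:, 0]                       # direction of the smallest eigenvalue
    cov_scaled = scale * cov                     # enlarged to avoid data holes
    return mu, normal, cov, cov_scaled
```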
Step 4.2, optimization of the spherical harmonic coefficients and refinement of the map structure with Gaussian surface normal vectors. The inputs of this step are the Gaussian parameterized point cloud map output by Step 4.1, the enhanced visible light image obtained in Step 2.1 and the relative position of the target of interest obtained in Step 3.2. First, the corresponding spherical harmonic coefficients are initialized from the Gaussian parameterized map. The spherical harmonic coefficients are computed using second-order spherical harmonic functions, each Gaussian requiring multiple harmonic coefficients, preferably 27 in this embodiment. The photometric gradient is calculated from the enhanced visible light image and, combined with the relative position of the target of interest, the spherical harmonic coefficients are optimized and adjusted through the photometric gradient, refining the visual realism and detail of the point cloud. This step outputs a visually optimized point cloud map.
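The second-order spherical harmonic evaluation, with 9 basis functions per color channel and hence 27 coefficients per Gaussian as mentioned above, can be sketched as follows (the real spherical harmonic constants are standard; the 9 x 3 coefficient layout is an assumption):

```python
# Evaluate view-dependent RGB colour from degree-2 spherical harmonic coefficients.
import numpy as np

SH_C = [0.2820948,                                           # l = 0
        0.4886025, 0.4886025, 0.4886025,                     # l = 1
        1.0925484, 1.0925484, 0.3153916, 1.0925484, 0.5462742]  # l = 2

def sh_color(coeffs, direction):
    """coeffs: (9, 3) SH coefficients; direction: viewing direction (3,)."""
    x, y, z = direction / np.linalg.norm(direction)
    basis = np.array([SH_C[0],
                      SH_C[1] * y, SH_C[2] * z, SH_C[3] * x,
                      SH_C[4] * x * y, SH_C[5] * y * z,
                      SH_C[6] * (3 * z * z - 1.0),
                      SH_C[7] * x * z, SH_C[8] * (x * x - y * y)])
    return basis @ coeffs                      # (3,) RGB value for this direction
```

Restricting to degree two keeps the per-Gaussian storage small while still capturing view-dependent appearance.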
Step 4.3, optimization of the 3D Gaussian point cloud map by adaptive control. The visually optimized point cloud map obtained in Step 4.2 is input. For under-scanned areas, the point cloud map regions that need densification are identified, neighboring Gaussians are copied, and the copied Gaussians are precisely positioned using photometric gradient optimization to fill the gaps. The net contribution of excessively dense areas is periodically evaluated and superfluous Gaussians are eliminated, which effectively reduces redundant points in the map and improves optimization efficiency. This step outputs a structurally optimized point cloud map.
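The adaptive control step can be sketched as follows (a hedged sketch; the thresholds and the use of opacity as the net-contribution measure are illustrative assumptions):

```python
# Clone Gaussians where the photometric gradient is large, prune negligible ones.
import numpy as np

def adaptive_control(positions, opacities, photo_grad,
                     grad_thresh=0.0002, opacity_thresh=0.005):
    """positions: (N, 3); opacities, photo_grad: (N,). Returns updated arrays."""
    clone = photo_grad > grad_thresh                 # densify under-scanned areas
    positions = np.concatenate([positions, positions[clone]], axis=0)
    opacities = np.concatenate([opacities, opacities[clone]], axis=0)
    keep = opacities > opacity_thresh                # prune negligible Gaussians
    return positions[keep], opacities[keep]
```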
Step 4.4, new-view synthesis with the Gaussian model. The structurally optimized point cloud map obtained in Step 4.3 is input. From the structurally optimized point cloud map, the images generated from the Gaussians by a rasterization technique are composited through alpha blending to synthesize high-precision multidimensional imaging of the target of interest.
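The compositing used when rasterizing the Gaussians into a new view can be sketched, for a single pixel, as follows (front-to-back alpha blending; the early-termination threshold is an assumption, and the per-Gaussian colors and alphas are assumed to be already sorted by depth):

```python
# Front-to-back alpha blending of the Gaussians covering one pixel.
import numpy as np

def alpha_blend(colors, alphas):
    """colors: (K, 3), alphas: (K,) for the K depth-sorted Gaussians at a pixel."""
    out, transmittance = np.zeros(3), 1.0
    for c, a in zip(colors, alphas):
        out += transmittance * a * c
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:        # early termination once nearly opaque
            break
    return out
```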
Example 2
Based on the same inventive concept, the second embodiment also provides a vision, inertia and laser cooperative target-of-interest positioning imaging device, as shown in Fig. 4, which comprises a sensor module for acquiring visible light image data, laser radar data, UWB distance data and system IMU direction information related to the target to be positioned; the sensor module comprises a multi-line laser radar, a MEMS IMU inertial sensor, a visible light camera and a UWB positioning device;
the data transmission module is used for connecting each sensor with the data processing module;
The data processing module is used for processing the data acquired by the sensor module and carrying out target-of-interest positioning imaging;
The data processing module includes:
The data coupling and enhancing module is used for enhancing the visible light image data; transforming the laser radar data to obtain a sparse depth map, wherein the data coupling module comprises a visual branch aiming at the enhanced visible light image data and a laser branch aiming at the sparse depth map, and the depth maps predicted by the data coupling module from the two branches are adaptively fused with the learned confidence weights and output as a dense depth map;
The object detection module is used for detecting the bounding box of the target of interest from the enhanced visible light image data by a target detection method, and for obtaining the position of the target of interest relative to the initial position of the system by combining: the relative distance between the target of interest and the system, calculated from the target position (the center of the target bounding box) together with UWB ranging; the direction measurements acquired by the system IMU; and the direction vector of the line connecting the target and the system position, measured by combining the laser radar and the UWB distance;
the multidimensional imaging module obtains a Gaussian parameterized point cloud map from the dense depth map; optimizes the spherical harmonic coefficients and the Gaussian parameterized point cloud map by combining the enhanced visible light image data and the relative position of the target of interest to obtain a visually optimized point cloud map; obtains a structurally optimized point cloud map by an adaptive control method; and combines the visually and structurally optimized point cloud maps and synthesizes with the Gaussian model to obtain multidimensional imaging of the target of interest.
The multi-line laser radar is responsible for point cloud acquisition, with a detection range of 200 m, an upward field of view of 52 degrees, a downward field of view of 7 degrees, a horizontal field of view of 360 degrees and a frame rate of 10 Hz. The camera is responsible for image acquisition, with a frame rate of 30 frames/s and a maximum resolution of 1080P. The MEMS IMU inertial sensor provides auxiliary positioning, with an angular accuracy of 0.2 degrees (roll and pitch) and 1 degree (yaw); the gyroscope has a full measurement range of 450 degrees/s, an in-run bias stability of 10 degrees/h and a bandwidth of 415 Hz. The UWB positioning device also provides auxiliary positioning, with a planar accuracy of 10 cm and a three-dimensional accuracy of 30 cm. A depth camera can be used to capture three-dimensional structural information of the environment under various lighting conditions, providing accurate spatial position data and assisting the laser radar in more complex scene analysis. An infrared camera is suited to night or low-light environments and can capture targets by thermal imaging, enhancing the target detection capability of the visible light camera.
The main equipment of the data transmission module is a Mesh networking node module, with a total link bandwidth of 18-52 Mbps, an antenna band of 1.4-1.5 GHz and a center frequency of 1.42-1.45 GHz. It provides long-range, high-bandwidth data transmission, mainly for transmitting target identification results and similar data to ground equipment over high-speed, long-distance links, completing the relevant data transfer and the checking of equipment operating status.
In this embodiment, the data processing module is located at the center of the unmanned aerial vehicle device and consists of an embedded high-performance GPU processor and related components, including an 8-core NVIDIA Arm Cortex-A78AE v8.2 64-bit CPU, a processing performance of 100 TOPS, and a 16 GB 128-bit LPDDR5 memory module. The module runs the Ubuntu operating system, carries the vision, inertia and laser cooperative target-of-interest positioning imaging method, and can process the data received from the data acquisition module in real time.
Since the system described in the second embodiment of the present invention implements the vision, inertia and laser cooperative target-of-interest positioning imaging method of the first embodiment, a person skilled in the art can understand its specific structure and variations from the method described in the first embodiment, and a detailed description is therefore omitted here.
Example 3
Based on the same inventive concept, the invention also provides an electronic device comprising one or more processors; a storage means for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in embodiment one.
Since the device described in the third embodiment of the present invention is an electronic device for implementing the vision, inertia and laser cooperative target-of-interest positioning imaging method of the first embodiment, a person skilled in the art can understand its specific structure and variations from the method described in the first embodiment, and a detailed description is therefore omitted here. All electronic devices used in the method of the embodiments of the invention fall within the intended scope of protection.
Example 4
Based on the same inventive concept, the present invention also provides a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the method described in embodiment one.
Since the computer readable medium described in the fourth embodiment of the present invention is used for implementing the vision, inertia and laser cooperative target-of-interest positioning imaging method of the first embodiment, a person skilled in the art can understand its specific structure and variations from the method described in the first embodiment, and a detailed description is therefore omitted here. All electronic devices used in the method of the embodiments of the invention fall within the intended scope of protection.

Claims (10)

1. The visual, inertial and laser cooperative target-of-interest positioning imaging method is characterized by comprising the following steps of:
step 1, visible light image data, laser radar data, UWB distance data and system IMU direction information related to a positioning target are obtained;
Step 2, constructing a coupling-enhancing network model comprising a dim light enhancing module and a data coupling module, wherein the dim light enhancing module enhances visible light image data; the data coupling module comprises a visual branch aiming at the enhanced visible light image data and a laser branch aiming at the sparse depth map, and the depth map predicted by the data coupling module from the two branches is adaptively fused with the learned confidence weight and output as a dense depth map;
Step 3, detecting an interest target boundary box from the enhanced visible light image data by a target detection method;
acquiring, from the bounding box of the target of interest, the position of the target of interest relative to the initial position of the system, by combining: the relative distance between the target of interest and the system, calculated from the target position (the center of the target bounding box) together with UWB ranging; the direction measurements acquired by the system IMU; and the direction vector of the line connecting the target and the system position, measured by combining the laser radar and the UWB distance;
Step 4, obtaining a Gaussian parameterized point cloud map from the dense depth map; optimizing the spherical harmonic coefficients and the Gaussian parameterized point cloud map by combining the enhanced visible light image data and the relative position of the target of interest to obtain a visually optimized point cloud map, and obtaining a structurally optimized point cloud map by an adaptive control method; and combining the visually and structurally optimized point cloud maps and synthesizing with the Gaussian model to obtain multidimensional imaging of the target of interest.
2. The visual, inertial and laser collaborative object-of-interest locating imaging method of claim 1, wherein: the dim light enhancement module converts the input image under the given illumination condition into a target output image under proper uniform light, wherein M and A denote the local components, gamma is the correction parameter, a joint color transform matrix is applied, and max() takes the maximum value;
The dim light enhancement module comprises a local component and a global component;
The local component expands the channel dimension of the standard red, green and blue (sRGB) color space data through two convolution layers and passes the expanded data to two independent branches, each stacked from pixel-by-pixel enhancement modules with identical structure: position information is first encoded by a depthwise convolution, local details are then enhanced by a point convolution-depth convolution-point convolution structure, and finally two convolutions separately enhance the token representation; the stacked pixel-by-pixel enhancement modules connect the output features to the input features by element-wise addition through a skip connection, the channel dimension is reduced by a convolution, and the local components M and A are generated by activation functions, using ReLU or Tanh;
The global component first stacks two convolutions as a lightweight encoder; the generated features are passed to a Transformer multi-head attention module, and after a feedforward network with two linear layers, two additional initialized learnable parameter layers output the color matrix and the correction parameter gamma.
3. The visual, inertial and laser collaborative object-of-interest locating imaging method of claim 1, wherein:
The two branch trunks of the data coupling module are both encoder-decoder networks with symmetric skip connections; the encoder comprises a convolution layer and ten residual blocks, the decoder comprises several deconvolution layers and a convolution layer, and each convolution layer is followed by a BN layer and a ReLU activation layer; the decoder features of the visual branch are associated with the corresponding encoder features in the laser branch, and the depth map predicted by the trunks is refined with CSPN++ to obtain the dense depth map.
4. The visual, inertial and laser collaborative object-of-interest locating imaging method of claim 1, wherein: the target detection method in the step 3 specifically comprises the following steps:
Constructing a target detection model, converting a target detection task into a diffusion process from a noise boundary box to a target boundary box, wherein the model comprises an encoder and a decoder;
The encoder extracts deep features from the image and runs only once; it has a large-kernel network structure comprising four stages, each consisting of several large-kernel convolution blocks, small-kernel convolution blocks and large-kernel convolution blocks; the stages are connected by downsampling blocks, and each large-kernel convolution block consists of a depthwise convolution layer and a feedforward network with a global response normalization unit;
The decoder in the target detection model consists of several cascaded stages; in the evaluation stage the target candidate boxes are sampled from a Gaussian distribution, the detection head is reused for evaluation in an iterative manner, parameters are shared among the different steps, and each step is embedded into the corresponding diffusion process;
and detecting the bounding box of the target of interest from the enhanced visible light image data by using the trained target detection model.
5. A visual, inertial and laser collaborative object-of-interest location imaging method according to claim 3, wherein: training of the target detection model is as follows:
Additional random boxes are concatenated with the real (ground-truth) boxes to obtain a certain number of padded boxes; Gaussian noise is added to the padded boxes, which are used as the detector input to predict their classes and coordinates; the top k predictions with the lowest cost are selected by an optimal transport assignment method, several predicted boxes are assigned to each ground-truth box, and the aggregate prediction loss is used as the training loss function, with k set according to actual needs.
6. The visual, inertial and laser collaborative object-of-interest locating imaging method of claim 1, wherein: the specific process for obtaining the Gaussian parameterized point cloud map is as follows:
small voxels are used and adaptively subdivided several times to approximate the Gaussian surface; a subdivided voxel is described, from all the points p_i it contains, by their average position, a normal vector, and the covariance matrix of the point distribution within the voxel,
mu = (1/N) * sum_{i=1..N} p_i,  Sigma = (1/N) * sum_{i=1..N} (p_i - mu)(p_i - mu)^T,
wherein N represents the number of all points, T represents the transposition of the matrix, and i is the sequence number of the points in the voxel; to ensure that no data holes are created when scaling up, a scaling factor is introduced for each point to adjust the covariance matrix Sigma to a scaled covariance matrix.
7. The visual, inertial and laser collaborative object-of-interest locating imaging method of claim 6, wherein: the visual optimization process of the point cloud map comprises the following steps:
First, the corresponding spherical harmonic coefficients are initialized from the Gaussian parameterized map, and the spherical harmonic coefficients are then computed using second-order spherical harmonic functions, each Gaussian requiring multiple harmonic coefficients; the photometric gradient is calculated from the enhanced visible light image and, combined with the relative position of the target of interest, the spherical harmonic coefficients are adjusted through photometric gradient optimization, refining the visual realism and detail of the point cloud;
The structure optimization process of the point cloud map comprises the following steps:
a visually optimized point cloud map is input; for under-scanned regions, the point cloud map areas that need densification are identified, neighboring Gaussians are copied, and the copied Gaussians are positioned by photometric gradient optimization to fill the gaps; for excessively dense regions, the net contribution is periodically evaluated and superfluous Gaussians are eliminated.
8. A visual, inertial and laser cooperative target-of-interest positioning imaging device, characterized in that: the device is used for realizing the object-of-interest positioning imaging method of any one of claims 1-7 and comprises a sensor module for acquiring visible light image data, laser radar data, UWB distance data and system IMU direction information related to the positioning target, the sensor module comprising a multi-line laser radar, a MEMS IMU inertial sensor, a visible light camera and a UWB positioning device;
The data transmission module is used for connecting each sensor with the data processing module;
The data processing module is used for processing the data acquired by the sensor module and carrying out target-of-interest positioning imaging;
The data processing module includes:
The data coupling and enhancing module is used for enhancing the visible light image data; transforming the laser radar data to obtain a sparse depth map, wherein the data coupling module comprises a visual branch aiming at the enhanced visible light image data and a laser branch aiming at the sparse depth map, and the depth maps predicted by the data coupling module from the two branches are adaptively fused with the learned confidence weights and output as a dense depth map;
The object detection module is used for detecting the bounding box of the target of interest from the enhanced visible light image data by a target detection method, and for obtaining the position of the target of interest relative to the initial position of the system by combining: the relative distance between the target of interest and the system, calculated from the target position (the center of the target bounding box) together with UWB ranging; the direction measurements acquired by the system IMU; and the direction vector of the line connecting the target and the system position, measured by combining the laser radar and the UWB distance;
the multidimensional imaging module obtains a Gaussian parameterized point cloud map from the dense depth map; optimizes the spherical harmonic coefficients and the Gaussian parameterized point cloud map by combining the enhanced visible light image data and the relative position of the target of interest to obtain a visually optimized point cloud map; obtains a structurally optimized point cloud map by an adaptive control method; and combines the visually and structurally optimized point cloud maps and synthesizes with the Gaussian model to obtain multidimensional imaging of the target of interest.
9. An electronic device, comprising:
One or more processors;
A storage means for storing one or more programs;
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the object-of-interest localization imaging method of any one of claims 1-7.
10. A computer readable medium having a computer program stored thereon, characterized by: the program, when executed by a processor, implements the object-of-interest localization imaging method of any one of claims 1-7.
CN202410564402.8A 2024-05-09 2024-05-09 Method and device for positioning and imaging interest target by combining vision, inertia and laser Pending CN118172422A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410564402.8A CN118172422A (en) 2024-05-09 2024-05-09 Method and device for positioning and imaging interest target by combining vision, inertia and laser

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410564402.8A CN118172422A (en) 2024-05-09 2024-05-09 Method and device for positioning and imaging interest target by combining vision, inertia and laser

Publications (1)

Publication Number Publication Date
CN118172422A true CN118172422A (en) 2024-06-11

Family

ID=91358706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410564402.8A Pending CN118172422A (en) 2024-05-09 2024-05-09 Method and device for positioning and imaging interest target by combining vision, inertia and laser

Country Status (1)

Country Link
CN (1) CN118172422A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109085926A (en) * 2018-08-21 2018-12-25 华东师范大学 A kind of the augmented reality system and its application of multi-modality imaging and more perception blendings
WO2019046962A1 (en) * 2017-09-07 2019-03-14 Appropolis Inc. Method and system for target positioning and map update
US20200284883A1 (en) * 2019-03-08 2020-09-10 Osram Gmbh Component for a lidar sensor system, lidar sensor system, lidar sensor device, method for a lidar sensor system and method for a lidar sensor device
CN113837277A (en) * 2021-09-24 2021-12-24 东南大学 Multisource fusion SLAM system based on visual point-line feature optimization
US20230066441A1 (en) * 2020-01-20 2023-03-02 Shenzhen Pudu Technology Co., Ltd. Multi-sensor fusion slam system, multi-sensor fusion method, robot, and medium
US20230194306A1 (en) * 2020-05-19 2023-06-22 Beijing Greenvalley Technology Co., Ltd. Multi-sensor fusion-based slam method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019046962A1 (en) * 2017-09-07 2019-03-14 Appropolis Inc. Method and system for target positioning and map update
CN109085926A (en) * 2018-08-21 2018-12-25 华东师范大学 A kind of the augmented reality system and its application of multi-modality imaging and more perception blendings
US20200284883A1 (en) * 2019-03-08 2020-09-10 Osram Gmbh Component for a lidar sensor system, lidar sensor system, lidar sensor device, method for a lidar sensor system and method for a lidar sensor device
US20230066441A1 (en) * 2020-01-20 2023-03-02 Shenzhen Pudu Technology Co., Ltd. Multi-sensor fusion slam system, multi-sensor fusion method, robot, and medium
US20230194306A1 (en) * 2020-05-19 2023-06-22 Beijing Greenvalley Technology Co., Ltd. Multi-sensor fusion-based slam method and system
CN113837277A (en) * 2021-09-24 2021-12-24 东南大学 Multisource fusion SLAM system based on visual point-line feature optimization

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SU WANG: "Monocular Visual-inertial Localization in a Point Cloud Map Using Feature-to-Distribution Registration", IEEE, 24 March 2021 (2021-03-24), pages 1 - 7 *
杨必胜: "Real-Time UAV 3D Image Point Clouds Mapping", ISPRS ANNALS OF THE PHOTOGRAMMETRY REMOTE SENSING AND SPATIAL INFORMATION SCIENCES, 30 November 2023 (2023-11-30), pages 1 - 11 *
杨必胜: "车载MMS激光点云与序列全景影像自动配准方法", 测绘学报, vol. 47, no. 2, 28 February 2018 (2018-02-28), pages 215 - 224 *

Similar Documents

Publication Publication Date Title
CN111798475B (en) Indoor environment 3D semantic map construction method based on point cloud deep learning
CN110675418B (en) Target track optimization method based on DS evidence theory
CN110070615B (en) Multi-camera cooperation-based panoramic vision SLAM method
Heng et al. Project autovision: Localization and 3d scene perception for an autonomous vehicle with a multi-camera system
CN107862293B (en) Radar color semantic image generation system and method based on countermeasure generation network
CN113160375B (en) Three-dimensional reconstruction and camera pose estimation method based on multi-task learning algorithm
CN110853075B (en) Visual tracking positioning method based on dense point cloud and synthetic view
CN113159151B (en) Multi-sensor depth fusion 3D target detection method for automatic driving
CN107679537B (en) A kind of texture-free spatial target posture algorithm for estimating based on profile point ORB characteristic matching
Hebert Outdoor scene analysis using range data
CN110097553A (en) The semanteme for building figure and three-dimensional semantic segmentation based on instant positioning builds drawing system
CN111325794A (en) Visual simultaneous localization and map construction method based on depth convolution self-encoder
CN110689562A (en) Trajectory loop detection optimization method based on generation of countermeasure network
RU2476825C2 (en) Method of controlling moving object and apparatus for realising said method
CN111046781A (en) Robust three-dimensional target detection method based on ternary attention mechanism
CN112561996A (en) Target detection method in autonomous underwater robot recovery docking
CN114529585A (en) Mobile equipment autonomous positioning method based on depth vision and inertial measurement
CN117808689A (en) Depth complement method based on fusion of millimeter wave radar and camera
Cortés-Pérez et al. A mirror-based active vision system for underwater robots: From the design to active object tracking application
CN118172422A (en) Method and device for positioning and imaging interest target by combining vision, inertia and laser
Hewitt Intense navigation: Using active sensor intensity observations to improve localization and mapping
CN116026316B (en) Unmanned ship dead reckoning method coupling visual inertial odometer and GNSS
CN116188586B (en) Positioning system and method based on light distribution
Gröndahl et al. Self-supervised cross-connected cnns for binocular disparity estimation
CN117576199A (en) Driving scene visual reconstruction method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination