CN117830611A - Target detection method and device and electronic equipment - Google Patents

Publication number: CN117830611A
Authority: CN (China)
Prior art keywords: point cloud, image, information, point, frame
Legal status: Pending (assumed; not a legal conclusion)
Application number: CN202311696423.7A
Other languages: Chinese (zh)
Inventors: 刘凯, 乌月汗, 郑铮, 李花
Current assignee: Beihang University
Original assignee: Beihang University
Application filed by Beihang University
Priority to CN202311696423.7A
Publication of CN117830611A

Abstract

The application provides a target detection method, a target detection device and an electronic device. The method comprises: acquiring multi-frame images and multi-frame point clouds of a target scene, wherein each frame of image corresponds to one synchronously acquired frame of point cloud; projecting the pixels of each frame of image into three-dimensional space, forming a projection point cloud from the projected spatial points, and acquiring color information for the points of each frame of point cloud to generate a laser color point cloud; obtaining an aggregation of multi-timestamp feature bird's-eye views (BEVs) from the high-density point cloud composed of the projection point cloud and the laser color point cloud, wherein each feature BEV carries both planar image information and three-dimensional spatial distribution information; and determining and outputting detection information according to the aggregation of the multi-timestamp feature BEVs, wherein the detection information characterizes the object categories and object position information in the target scene within the current time period. The method addresses how to better fuse three-dimensional target detection with two-dimensional images so as to improve the accuracy of 3D target detection.

Description

Target detection method and device and electronic equipment
Technical Field
The present disclosure relates to the field of image processing and point cloud processing technologies, and in particular, to a target detection method, a target detection device, and an electronic device.
Background
In automatic driving, path planning and collision avoidance require the external environment of the vehicle to be perceived with a three-dimensional (3D) target detection method that provides the categories and positions of objects in the driving environment.
The rapid progress of computer vision has led most existing target detectors to operate on two-dimensional (2D) images. However, 2D image-based target detection has significant drawbacks. First, 2D images are captured by cameras, which are passive sensors: although the images contain rich semantic information, the quality of the acquired images is severely affected by illumination and weather conditions. Second, 2D images cannot provide depth information, which is essential for path planning and collision avoidance in an automatic driving task. Three-dimensional target detection is therefore introduced to provide more detailed object size and position information.
How to better fuse three-dimensional target detection with two-dimensional images so as to improve the accuracy of 3D target detection remains an open problem.
Disclosure of Invention
The application provides a target detection method, a target detection device and electronic equipment, which are used for solving the problem of how to better fuse three-dimensional target detection and two-dimensional images so as to improve the accuracy of 3D target detection.
In one aspect, the present application provides a target detection method, the method comprising:
acquiring multi-frame images and multi-frame point clouds of a target scene, wherein each frame of image corresponds to one synchronously acquired frame of point cloud;
projecting the pixels of each frame of image into three-dimensional space, forming a projection point cloud from the projected spatial points, and acquiring color information for the points of each frame of point cloud to generate a laser color point cloud;
obtaining an aggregation of multi-timestamp feature bird's-eye views (BEVs) from the high-density point cloud composed of the projection point cloud and the laser color point cloud, wherein each feature BEV carries both planar image information and three-dimensional spatial distribution information;
determining and outputting detection information according to the aggregation of the multi-timestamp feature BEVs, wherein the detection information characterizes the object categories and object position information in the target scene within the current time period.
In one embodiment, the obtaining the aggregation of multi-timestamp feature bird's-eye views (BEVs) from the high-density point cloud composed of the projection point cloud and the laser color point cloud includes:
acquiring an information vector for each point in the high-density point cloud, wherein the information vector comprises two-dimensional coordinates, three-channel gray values, color features and features of the high-density point cloud; the features of the high-density point cloud include features of its structural information, and the two-dimensional coordinates include the coordinates of the point in each of the view angle images of a plurality of different views contained in one frame of image;
performing cylindrical voxel segmentation on the high-density point cloud and then voxel-encoding the segmented spatial point cloud based on the information vector of each point within each segmented voxel, obtaining a voxelized point cloud;
applying asymmetric convolution to the voxelized point cloud to aggregate the information vectors of different points, obtaining an aggregated-information point cloud;
compressing the aggregated-information point cloud along the height direction to obtain the multi-timestamp feature BEVs;
and processing the multi-timestamp feature BEVs with an image feature aggregation method based on a deformable attention mechanism to obtain their aggregation (a simplified sketch of such deformable aggregation follows this list).
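The patent does not give an implementation of the deformable-attention aggregation, so the following is only a minimal sketch of the general idea: sample a stack of feature BEVs at learned offset locations and fuse the samples with learned attention weights. All module and parameter names (`SimpleDeformableBEVAggregation`, `num_points`, the 0.1 offset scale) are illustrative assumptions, not the patent's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableBEVAggregation(nn.Module):
    """Toy deformable-attention-style aggregation over a stack of feature BEVs.

    For every BEV cell, small heads predict sampling offsets and attention
    weights; features are sampled at the offset locations of every timestamp's
    BEV with grid_sample and fused by the weights. Illustrative only.
    """

    def __init__(self, channels: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Conv2d(channels, num_points * 2, kernel_size=1)
        self.weight_head = nn.Conv2d(channels, num_points, kernel_size=1)

    def forward(self, bevs: torch.Tensor) -> torch.Tensor:
        # bevs: (T, C, H, W) feature BEVs at T timestamps; the latest is the query
        T, C, H, W = bevs.shape
        query = bevs[-1:]                                   # (1, C, H, W)
        offsets = self.offset_head(query)                   # (1, 2*P, H, W)
        weights = self.weight_head(query).softmax(dim=1)    # (1, P, H, W)

        # base sampling grid in normalized [-1, 1] coordinates
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
        base = torch.stack((xs, ys), dim=-1)                # (H, W, 2)

        aggregated = torch.zeros_like(query)
        for p in range(self.num_points):
            off = offsets[:, 2 * p:2 * p + 2].permute(0, 2, 3, 1)   # (1, H, W, 2)
            grid = base.unsqueeze(0) + 0.1 * torch.tanh(off)        # keep offsets small
            # sample every timestamp's BEV at the offset locations, average over time
            sampled = torch.stack(
                [F.grid_sample(bevs[t:t + 1], grid, align_corners=True)
                 for t in range(T)]).mean(dim=0)                    # (1, C, H, W)
            aggregated = aggregated + weights[:, p:p + 1] * sampled
        return aggregated


if __name__ == "__main__":
    agg = SimpleDeformableBEVAggregation(channels=64)
    bev_stack = torch.randn(3, 64, 128, 128)   # 3 timestamps of 64-channel BEVs
    print(agg(bev_stack).shape)                # torch.Size([1, 64, 128, 128])
```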
In one embodiment, the composing the high-density point cloud from the projection point cloud and the laser color point cloud includes:
forming an initial high-density point cloud from the projection point cloud and the laser color point cloud;
acquiring depth information for the spatial points projected from the matched pixels of each frame of image in the initial high-density point cloud, so as to obtain the color point cloud within the projection point cloud, wherein one frame of image comprises view angle images of different views, a matched pixel is a pixel in one view angle image that shares the same three-dimensional projected spatial point with a pixel in another view angle image, and the depth information of both matched pixels is the depth information of that shared spatial point;
forming a dense point cloud from the color point cloud and the laser color point cloud;
estimating, based on the dense point cloud, the depth information of the spatial points projected from non-matched pixels in the initial high-density point cloud;
and correcting, based on the dense point cloud, the depth information of the spatial points projected from non-matched pixels in the initial high-density point cloud, to obtain the high-density point cloud.
In one embodiment, the obtaining the depth information of the matched pixels in each frame of image includes:
matching pixels across the view angle images of different views contained in each frame of image, to obtain the matched pixel pairs of every two view angle images, wherein one matched pixel pair contains two matched pixels;
acquiring, for each matched pixel pair in each frame of image, the shared three-dimensional projected spatial point, its coordinates in the coordinate system of each image acquisition device, and its coordinates in the world coordinate system;
and determining the depth information of the three-dimensional projected spatial point under each view from those coordinates, and taking it as the depth information of the matched pixel in the corresponding view angle image.
In one embodiment, after the acquiring the multi-frame images and multi-frame point clouds of the target scene, the method further includes:
screening the multi-frame images into key frame images and non-key frame images, obtaining the key frame point clouds whose timestamps correspond to the key frame images, and obtaining the non-key frame point clouds whose timestamps correspond to the non-key frame images;
the projecting the pixels of each frame of image into three-dimensional space and forming the projection point cloud from the projected spatial points then includes:
projecting the pixels of each key frame image and of each non-key frame image into three-dimensional space, the projection point cloud being composed of the spatial points projected from the pixels of each key frame image and of each non-key frame image;
the obtaining the color information of the points in each frame of point cloud to generate the laser color point cloud then includes:
acquiring the color information of each key frame point cloud and of each non-key frame point cloud to generate the laser color point cloud, wherein the laser color point cloud comprises a key frame color point cloud and a non-key frame color point cloud.
In one embodiment, after the screening of the multi-frame images into key frame images and non-key frame images, the method further includes:
acquiring a noise-reduced image and edge texture images at different scales for the key frame image, inputting the noise-reduced image into an image feature extraction network to obtain feature maps of the noise-reduced image at each scale, and superimposing the edge texture images of the corresponding scales onto those feature maps to obtain edge-texture-enhanced feature maps at each scale;
deriving non-key frame image features from the optical flow between the key frame image and the non-key frame image;
after the obtaining of the key frame point clouds corresponding to the key frame images and the non-key frame point clouds corresponding to the non-key frame images, the method further includes:
extracting key frame point cloud features from the key frame point cloud and performing deformable-attention feature aggregation on them to obtain aggregated key frame point cloud features;
for the non-key frame point cloud, deriving non-key frame point cloud features from the scene flow between the key frame point cloud and the non-key frame point cloud;
the aggregated features of the high-density point cloud thus include: the non-key frame image features, the non-key frame point cloud features, the edge-texture-enhanced key frame image features, and the aggregated key frame point cloud features.
In one embodiment, the acquiring of the noise-reduced image and the multi-scale edge texture images of the key frame image includes:
performing wavelet decomposition on the key frame image to obtain a plurality of different wavelet components;
setting a different threshold for each wavelet component and threshold-filtering each component with its corresponding threshold, to obtain a plurality of noise-reduced wavelet components;
and recomposing the noise-reduced image from the noise-reduced wavelet components, and recomposing the edge texture images at different scales from the noise-reduced wavelet components.
In one embodiment, the recomposing of the edge texture images at different scales from the noise-reduced wavelet components includes:
recomposing initial edge texture images at different scales from the noise-reduced wavelet components;
and applying an erosion operation and a dilation operation to the initial edge texture image at each scale to obtain the edge texture images at different scales (see the wavelet sketch below).
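A minimal sketch of the wavelet step, assuming PyWavelets and OpenCV are available. The threshold rule, wavelet family, and structuring element are illustrative choices, not the patent's parameters; a per-voxel-of-scale reconstruction stands in for whatever recomposition rule the patent intends.

```python
import numpy as np
import pywt
import cv2

def wavelet_denoise_and_edges(gray: np.ndarray, wavelet: str = "db2", levels: int = 3):
    """Threshold the detail coefficients to get a denoised image, then rebuild
    one edge map per scale from that scale's detail bands and clean it with
    erosion followed by dilation."""
    coeffs = pywt.wavedec2(gray.astype(np.float32), wavelet, level=levels)
    approx, details = coeffs[0], coeffs[1:]

    denoised_details = []
    for (ch, cv_, cd) in details:
        thr = np.median(np.abs(cd)) * 3.0          # per-scale soft threshold (illustrative)
        denoised_details.append(tuple(pywt.threshold(c, thr, mode="soft")
                                      for c in (ch, cv_, cd)))

    # recompose the noise-reduced image from the thresholded coefficients
    denoised = pywt.waverec2([approx] + denoised_details, wavelet)
    denoised = denoised[:gray.shape[0], :gray.shape[1]]

    # per-scale edge texture: reconstruct using only one scale's detail bands,
    # then apply erosion and dilation to suppress isolated responses
    kernel = np.ones((3, 3), np.uint8)
    edge_maps = []
    for i, det in enumerate(denoised_details):
        only_this_scale = [np.zeros_like(approx)] + [
            tuple(np.zeros_like(c) for c in d) if j != i else d
            for j, d in enumerate(denoised_details)]
        edge = np.abs(pywt.waverec2(only_this_scale, wavelet))
        edge = edge[:gray.shape[0], :gray.shape[1]]
        edge = (edge > edge.mean() + edge.std()).astype(np.uint8)
        edge = cv2.dilate(cv2.erode(edge, kernel), kernel)
        edge_maps.append(edge)

    return denoised, edge_maps
```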
In one embodiment, before the noise-reduced image is input into the image feature extraction network to obtain its feature maps at each scale, the method further includes:
converting the noise-reduced image into a gray image, dividing the gray image into tiles of equal size, and obtaining the information entropy value of each tile;
partitioning the gray image into a plurality of image regions according to the information entropy value of each tile;
and routing each image region to a corresponding position in the feature extraction network according to its information entropy value, wherein image regions whose information entropy exceeds a preset value pass through a first number of convolution layers, image regions whose information entropy is less than or equal to the preset value pass through a second number of convolution layers, and the first number is larger than the second number (a per-tile entropy sketch follows this list).
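A minimal sketch of the per-tile entropy computation that would drive this routing. The tile size and the 5.0 routing threshold are illustrative assumptions.

```python
import numpy as np

def tile_entropy_map(gray: np.ndarray, tile: int = 32) -> np.ndarray:
    """Compute the Shannon entropy of each equal-sized tile of an 8-bit gray
    image; incomplete border tiles are ignored for simplicity."""
    h, w = gray.shape
    rows, cols = h // tile, w // tile
    entropy = np.zeros((rows, cols), dtype=np.float32)
    for r in range(rows):
        for c in range(cols):
            patch = gray[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
            hist = np.bincount(patch.ravel().astype(np.uint8), minlength=256)
            p = hist / hist.sum()
            p = p[p > 0]
            entropy[r, c] = -(p * np.log2(p)).sum()
    return entropy

# Regions whose entropy exceeds a preset value would be routed through more
# convolution layers than low-entropy regions, e.g.:
# deep_mask = tile_entropy_map(gray_image) > 5.0   # threshold is illustrative
```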
In one embodiment, the partitioning of the gray image according to the information entropy value of each tile and the obtaining of the resulting image regions includes:
counting, for each tile and according to the entropy values of its pixels, the proportion of pixels sharing each information entropy value, so as to obtain an information entropy distribution histogram of the tile;
applying Gaussian smoothing to the information entropy distribution histogram and then identifying its peak points;
taking the number of peak points as the number of clusters and the entropy values at the peak points as the cluster centers, performing fuzzy C-means clustering on the pixels of the information entropy distribution histogram, and updating the entropy value of the pixels in each cluster to the entropy value of the cluster center to which they belong;
and, after the entropy values have been updated, partitioning the gray image according to the updated entropy values to obtain the resulting image regions (a histogram-peak sketch is given below).
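A minimal sketch of the histogram smoothing, peak detection, and peak-seeded fuzzy C-means step, assuming SciPy is available. Bin count, sigma, fuzzifier m, and iteration count are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

def entropy_peaks(entropy_values: np.ndarray, bins: int = 64, sigma: float = 2.0):
    """Build the entropy distribution histogram, smooth it with a Gaussian,
    and return the entropy values at its peaks."""
    hist, edges = np.histogram(entropy_values, bins=bins)
    smoothed = gaussian_filter1d(hist.astype(np.float64), sigma)
    peak_idx, _ = find_peaks(smoothed)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[peak_idx]

def fuzzy_c_means_1d(values: np.ndarray, centers: np.ndarray,
                     m: float = 2.0, iters: int = 30) -> np.ndarray:
    """Minimal 1-D fuzzy C-means seeded with the histogram peaks; returns the
    index of the cluster each value most strongly belongs to."""
    c = centers.astype(np.float64).copy()
    for _ in range(iters):
        d = np.abs(values[:, None] - c[None, :]) + 1e-9           # (N, K) distances
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
        c = (u ** m).T @ values / (u ** m).sum(axis=0)            # update centers
    return np.argmax(u, axis=1)
```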
In one embodiment, each frame of image includes view angle images of a plurality of different views, and the obtaining of the color information of the points in each frame of point cloud to generate the laser color point cloud includes:
projecting each point of each frame of point cloud onto each view angle image of the corresponding frame of image;
when a point projects onto an integer pixel, taking the color information of that integer pixel as the color information of the point;
and when a point projects onto a non-integer pixel location, estimating the color information at that location and taking the estimate as the color information of the point.
In one embodiment, the estimating of the color information at the non-integer pixel location onto which a point projects includes:
acquiring the four pixels nearest to the projected non-integer location, and determining a weight coefficient for each of the four pixels according to its positional distance and color gray-level distance to the projected location;
and substituting the weight coefficients and the color information of the four pixels into a Gaussian function to obtain the estimate of the color information at the projected non-integer location.
In one embodiment, after the obtaining the multi-frame point cloud of the target scene, the method further includes:
and filtering out the ground point cloud in each frame of point cloud.
In another aspect, the present application provides an object detection apparatus, including:
an acquisition module, configured to acquire multi-frame images and multi-frame point clouds of a target scene, wherein each frame of image corresponds to one synchronously acquired frame of point cloud;
a point cloud generation module, configured to project the pixels of each frame of image into three-dimensional space, form a projection point cloud from the projected spatial points, and acquire color information for the points of each frame of point cloud to generate a laser color point cloud;
the acquisition module being further configured to obtain an aggregation of multi-timestamp feature bird's-eye views (BEVs) from the high-density point cloud composed of the projection point cloud and the laser color point cloud, wherein each feature BEV carries both planar image information and three-dimensional spatial distribution information;
and a detection module, configured to determine and output detection information according to the aggregation of the multi-timestamp feature BEVs, wherein the detection information characterizes the object categories and object position information in the target scene within the current time period.
In another aspect, the present application provides an electronic device, including: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the method as described in the first aspect.
In another aspect, the present application provides a computer-readable storage medium having stored therein computer-executable instructions that, when executed, cause a computer to perform the method of the first aspect.
In another aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the method according to the first aspect.
In summary, an embodiment of the present application provides a target detection method comprising: acquiring multi-frame images and multi-frame point clouds of a target scene, wherein each frame of image corresponds to one synchronously acquired frame of point cloud; projecting the pixels of each frame of image into three-dimensional space, forming a projection point cloud from the projected spatial points, and acquiring color information for the points of each frame of point cloud to generate a laser color point cloud; obtaining an aggregation of multi-timestamp feature bird's-eye views (BEVs) from the high-density point cloud composed of the projection point cloud and the laser color point cloud, wherein each feature BEV carries both planar image information and three-dimensional spatial distribution information; and determining and outputting detection information according to the aggregation of the multi-timestamp feature BEVs, wherein the detection information characterizes the object categories and object position information in the target scene within the current time period.
That is, every pixel of the multi-frame images of the detected target scene is projected into three-dimensional space to generate a projection point cloud with color information, which carries both rich color information and rich edge texture information. For the multi-frame point clouds of the detected target scene, the color information of the points in each frame of point cloud is acquired to generate a laser color point cloud carrying rich color information. Combining the projection point cloud with the laser color point cloud then yields a high-density point cloud rich in both color information and edge texture information. At the same time, because of the added projection points, the high-density point cloud has more points that are arranged in a more ordered and more uniform manner, which compensates for the sparsity, disorder and uneven spatial arrangement of the raw point cloud. Obtaining the aggregation of multi-timestamp feature BEVs from this high-density point cloud and determining and outputting the detection information according to that aggregation therefore makes the output detection information more accurate.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic diagram of an application scenario of a target detection method provided in the present application;
FIG. 2 is a flow chart of a target detection method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a pixel projection in a three-dimensional space in an object detection method according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of projection of a matched pixel point in a three-dimensional space in the target detection method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a nonlinear solution in a target detection method according to an embodiment of the present application;
FIG. 6 is a schematic view of voxel segmentation of a point cloud in a target detection method according to one embodiment of the present application;
fig. 7 is a flowchart of a target detection method according to another embodiment of the present application;
FIG. 8 is a flow chart of a target detection method according to another embodiment of the present application;
FIG. 9 is a schematic diagram of feature vectors in a target detection method according to an embodiment of the present application;
FIG. 10 is another schematic diagram of feature vectors in the object detection method according to an embodiment of the present application;
FIG. 11 is a schematic diagram of feature vectors in the object detection method according to an embodiment of the present application;
FIG. 12 is another schematic diagram of feature vectors in the object detection method according to an embodiment of the present application;
FIG. 13 is a schematic diagram of feature vectors in the object detection method according to an embodiment of the present application;
FIG. 14 is a schematic diagram of an object detection device according to an embodiment of the present application;
fig. 15 is a schematic diagram of an electronic device according to an embodiment of the present application.
Specific embodiments of the present disclosure have been shown by way of the above drawings and will be described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the disclosed concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
In the description of the present application, it should be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or an implicit indication of the number of technical features being indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The terms referred to in this application are explained first:
and (3) point cloud: it is a set of vectors in a three-dimensional coordinate system, and the scan data is recorded in the form of points, and includes at least one point data, where each point data includes three-dimensional coordinates, and some point data may include color information or reflection intensity.
Self-attention mechanism: is a configuration of an artificial intelligence network based on an attention mechanism, self-attention is free of any learnable parameters.
Deformable attention mechanism: the system is based on the configuration of the artificial intelligent network of the attention mechanism, the parameters of the deformable attention mechanism can be obtained by learning the input data, and then the parameters of the deformable attention mechanism are adjusted along with the change of the input data.
Automatic driving can relieve driving fatigue, reduce traffic accidents, optimize travel paths, relieve traffic pressure, use energy rationally, reduce air pollution, and so on, and has therefore attracted wide attention from industry and academia at home and abroad. Accurate perception of the surrounding environment is a necessary prerequisite for the reliable operation of automatic driving. An environment perception system for automatic driving generally uses various methods to extract semantic information from the data collected by the sensors. Target detection, as a fundamental component of environmental perception, is a direction of intense current research.
The rapid progress of computer vision has led most existing target detectors to operate on two-dimensional (2D) images. However, 2D image-based target detection has significant drawbacks. First, 2D images are captured by cameras, which are passive sensors: although the images contain rich semantic information, the quality of the acquired images is severely affected by illumination and weather conditions. Second, 2D images cannot provide depth information, which is essential for path planning and collision avoidance in an automatic driving task. Introducing a three-dimensional (3D) target detection method can therefore provide more detailed object size and position information. However, the 3D point cloud acquired by the radar contains no color or texture information and suffers from sparsity, disorder and uneven spatial arrangement, which makes accurate target detection based on the 3D point cloud challenging.
On this basis, the present application provides a target detection method, a target detection device and an electronic device. The target detection method includes: acquiring multi-frame images and multi-frame point clouds of a target scene, wherein each frame of image corresponds to one synchronously acquired frame of point cloud; projecting the pixels of each frame of image into three-dimensional space, forming a projection point cloud from the projected spatial points, and acquiring color information for the points of each frame of point cloud to generate a laser color point cloud; obtaining an aggregation of multi-timestamp feature bird's-eye views (BEVs) from the high-density point cloud composed of the projection point cloud and the laser color point cloud, wherein each feature BEV carries both planar image information and three-dimensional spatial distribution information; and determining and outputting detection information according to the aggregation of the multi-timestamp feature BEVs, wherein the detection information characterizes the object categories and object position information in the target scene within the current time period.
That is, every pixel of the multi-frame images of the detected target scene is projected into three-dimensional space to generate a projection point cloud with color information, which carries both rich color information and rich edge texture information. For the multi-frame point clouds of the detected target scene, the color information of the points in each frame of point cloud is acquired to generate a laser color point cloud carrying rich color information. Combining the projection point cloud with the laser color point cloud then yields a high-density point cloud rich in both color information and edge texture information. At the same time, because of the added projection points, the high-density point cloud has more points that are arranged in a more ordered and more uniform manner, which compensates for the sparsity, disorder and uneven spatial arrangement of the raw point cloud. Obtaining the aggregation of multi-timestamp feature BEVs from this high-density point cloud and determining and outputting the detection information according to that aggregation therefore makes the output detection information more accurate.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards, and provide corresponding operation entries for the user to select authorization or rejection.
It should be noted that the target detection method, the target detection device and the electronic device provided by the application can be used in the technical fields of image processing and point cloud processing, and also can be used in any field except the technical fields of image processing and point cloud processing, and the application fields of the target detection method, the target detection device and the electronic device are not limited.
The target detection method is applied to an electronic device, which may be a processor loaded with an automatic driving system, a cloud server, or the like. Fig. 1 is a schematic diagram of an application scenario of the target detection method provided in the present application. The electronic device acquires multi-frame images and multi-frame point clouds of a target scene, wherein each frame of image corresponds to one synchronously acquired frame of point cloud; projects the pixels of each frame of image into three-dimensional space, forms a projection point cloud from the projected spatial points, and acquires color information for the points of each frame of point cloud to generate a laser color point cloud; obtains an aggregation of multi-timestamp feature bird's-eye views (BEVs) from the high-density point cloud composed of the projection point cloud and the laser color point cloud, wherein each feature BEV carries both planar image information and three-dimensional spatial distribution information; and determines and outputs detection information according to the aggregation of the multi-timestamp feature BEVs, wherein the detection information characterizes the object categories and object position information in the target scene within the current time period.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Referring to fig. 2, an embodiment of the present application provides a target detection method, including:
s210, acquiring a multi-frame image and a multi-frame point cloud of a target scene, wherein one frame of image corresponds to one frame of point cloud acquired synchronously.
In the automatic driving process, multiple image frames of the target scene (which can also be understood as the driving environment) within the current time period can be captured by an image acquisition device such as a camera, and multiple point cloud frames of the target scene within the same period can be captured by a point cloud acquisition device such as a LiDAR. The point cloud characterizes the three-dimensional geometric information of the target scene.
The electronic device obtains from the image acquisition device the image frames of the target scene to be detected within the current time period (for example, the last 1 second or the last 5 seconds), and obtains from the point cloud acquisition device the point cloud frames of the same scene within the same period. The image frames and point cloud frames are acquired in one-to-one correspondence and are spatially and temporally synchronized.
In an alternative embodiment, after the multi-frame point clouds of the target scene are acquired, the ground points in each frame of point cloud are filtered out, for example with a ground plane fitting (GPF) method based on RANSAC. Ground points are numerous but contain no targets, so filtering them out effectively improves the computational efficiency of target detection and enables real-time detection (a minimal RANSAC ground-fitting sketch is given below).
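The patent names a GPF_RANSAC-style filter without detailing it; the following is only a minimal RANSAC plane-fitting sketch under that assumption. The distance threshold and iteration count are illustrative.

```python
import numpy as np

def filter_ground_ransac(points: np.ndarray, dist_thresh: float = 0.2,
                         iters: int = 100, seed: int = 0) -> np.ndarray:
    """Remove ground points from an (N, 3) array of x, y, z coordinates by
    RANSAC plane fitting; the patent's exact GPF procedure may differ."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(points), size=3, replace=False)
        p0, p1, p2 = points[idx]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-8:                        # degenerate (collinear) sample
            continue
        normal /= norm
        dist = np.abs((points - p0) @ normal)  # point-to-plane distance
        inliers = dist < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return points[~best_inliers]               # keep only non-ground points
```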
S220, projecting the pixels of each frame of image into three-dimensional space, forming the projection point cloud from the projected spatial points, and acquiring color information for the points of each frame of point cloud to generate the laser color point cloud.
After the pixels of one frame of image are projected into three-dimensional space, the information carried by each projected spatial point includes its three-dimensional coordinates, three-channel gray values, the three primary color (RGB) values of the color image, and so on. Each frame of image provided in this embodiment contains view angle images of a plurality of different views, so the two-dimensional coordinates carried by each point of the projection point cloud are the coordinates in each of those view angle images. For example, if a frame of image contains 6 view angle images of different views, the three-dimensional spatial point corresponding to a pixel of that frame corresponds to 6 two-dimensional coordinates.
To obtain the color information of the points in one frame of point cloud, the points are first associated with the pixels of the synchronized frame of image (one frame of image containing images of several views), and the color information of each point is then determined according to the color information of the corresponding pixels. After this association, the two-dimensional coordinates of the pixel can also be taken as the two-dimensional coordinates of the point. The information carried by a point of the laser color point cloud therefore contains two-dimensional coordinate information, color information, three-dimensional coordinate information and depth information (i.e., the z coordinate in each view camera coordinate system, depth = z). Together they form the information vector (u, v, I_R(u,v), I_G(u,v), I_B(u,v), F(u,v), x, y, z) of each point of the laser color point cloud, where the aggregated point cloud feature F(u,v) is obtained through a point cloud feature extraction and aggregation method. The two-dimensional coordinates carried by each point of the laser color point cloud are the coordinates in each view angle image of the corresponding frame of image; for example, a point that corresponds to pixels of a frame containing 6 view angle images of different views carries 6 two-dimensional coordinates.
In an alternative embodiment, each frame of image contains view angle images of a plurality of different views. When generating the laser color point cloud, each point of each frame of point cloud is projected onto each view angle image of the corresponding frame of image, and the color information of the point is then determined from the pixel it lands on: when a point projects onto an integer pixel, the color information of that integer pixel is taken as the color information of the point; when a point projects onto a non-integer pixel location, the color information at that location is estimated and the estimate is taken as the color information of the point.
Specifically, the two-dimensional coordinates of the points acquired by the LiDAR on the images of all views can be obtained through the projection matrix (a minimal projection sketch is given below). The information vectors of the points acquired by the LiDAR therefore fall into two classes: the information vector of a point projected onto an integer pixel is (u, v, I_R(u,v), I_G(u,v), I_B(u,v), F(u,v), x, y, z), whereas the color information of a point projected onto a non-integer pixel location is unknown and its information vector is (u, v, x, y, z), so its color information needs to be estimated.
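A minimal sketch of projecting LiDAR points into one view angle image and reading the color at integer pixels. The factoring into intrinsics `K` and extrinsics `T_cam_lidar` is an assumption; the patent only speaks of a projection matrix.

```python
import numpy as np

def project_points_to_image(points_xyz: np.ndarray, K: np.ndarray,
                            T_cam_lidar: np.ndarray, image: np.ndarray):
    """Project (N, 3) LiDAR points with a 3x3 intrinsic matrix K and a 4x4
    LiDAR-to-camera transform, and read RGB for points on integer pixels."""
    n = points_xyz.shape[0]
    pts_h = np.hstack([points_xyz, np.ones((n, 1))])        # homogeneous (N, 4)
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]                  # camera coordinates
    in_front = cam[:, 2] > 0                                # points in front of the camera
    uvw = (K @ cam.T).T
    z = np.where(np.abs(uvw[:, 2:3]) < 1e-9, 1e-9, uvw[:, 2:3])
    uv = uvw[:, :2] / z                                     # pixel coordinates (u, v)
    depth = cam[:, 2]                                       # depth = z in the camera frame

    h, w = image.shape[:2]
    ui, vi = np.round(uv[:, 0]).astype(int), np.round(uv[:, 1]).astype(int)
    on_integer = (np.abs(uv[:, 0] - ui) < 1e-6) & (np.abs(uv[:, 1] - vi) < 1e-6)
    valid = in_front & (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h)

    colors = np.full((n, 3), np.nan)                        # NaN: color still to be estimated
    take = valid & on_integer
    colors[take] = image[vi[take], ui[take]]
    return uv, depth, colors
```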
In an alternative embodiment, the step of estimating the color information at the non-integer pixel location onto which a point projects includes:
Step one, acquire the four pixels nearest to the projected non-integer location, and determine a weight coefficient for each of the four pixels according to its positional distance and color gray-level distance to the projected location.
Step two, substitute the weight coefficients of the four pixels and the color information of the four pixels into a Gaussian function to obtain the estimate of the color information at the projected non-integer location.
How the color information at a non-integer pixel location is estimated is described in detail below.
The estimation takes Gaussian filtering as a reference: the 4 pixels closest to the non-integer location (u, v) (i.e., u and v are not integers) whose color information is to be estimated are found, and the weight coefficient of each pixel's color information is determined by a Gaussian function of the positional distance and the color gray-level distance between that pixel and the location to be estimated. The closer a pixel is to the location to be estimated, the greater the influence of its color gray level on the estimate and the larger its weight coefficient. The color estimation for the laser point cloud takes into account the two-dimensional coordinates, the corresponding three-dimensional coordinates, the colors, the color features and the corresponding spatial point features of the neighboring pixels, i.e., both two-dimensional and three-dimensional information, which makes the estimated laser point cloud color more accurate; the Gaussian function realizes the distance-based computation of the weight coefficients (the coefficients decay gradually with distance according to the Gaussian function).
Specifically, the color information at the location to be estimated is computed with formulas (1) to (4) (the formula images of the original publication are not reproduced here).
In these formulas, I_R(i, j) denotes the gray value of the red component at the integer pixel (i, j) of the image; the color gray information of integer pixels is determined by the image itself, while the color gray information of a non-integer location such as (u, v) is obtained by training. Î_R^t(u, v) denotes the red gray value at the location (u, v) to be estimated after t training iterations and is initialized with the red gray value of the image pixel nearest to (u, v). (i, j) ranges over the coordinates of the 4 pixels nearest to the location to be estimated, i.e., (i, j) ∈ KNAR(u, v). MLP stands for multi-layer perceptron.
w(i, j) denotes the coefficient obtained by combining the weight coefficients related to the 2D coordinate distance, the 2D color gray-level distance, the 2D image feature distance, the 3D coordinate distance (obtained through the camera parameter projection) and the 3D point cloud feature distance; that is, w(i, j) combines w_position2d(i, j), w_cgl(i, j), w_feature2d(i, j), w_position3d(i, j) and w_feature3d(i, j), where w_position2d denotes the weight coefficient related to the 2D coordinate distance, w_cgl(i, j) the weight coefficient related to the 2D color gray-level distance, w_feature2d the weight coefficient related to the 2D image feature distance, w_position3d the weight coefficient related to the 3D coordinate distance and w_feature3d(i, j) the weight coefficient related to the 3D point cloud feature distance.
δ is the standard deviation of the (normal-distribution) Gaussian function: the larger δ is, the smaller the difference in influence of different distances (spatial distance or color gray-level distance) on the location to be estimated; the smaller δ is, the larger that difference. δ is a parameter to be learned.
w_position3d(i, j) denotes the weight coefficient related to the 3D point cloud coordinate distance. Through the preceding processing, spatial coordinate information is available only for part of the pixels in each view angle image, so the computation of w_position3d(i, j) must take into account whether the spatial coordinate information corresponding to pixel (i, j) is known.
pos3dx(i, j), pos3dy(i, j) and pos3dz(i, j) denote the x, y and z coordinates, respectively, of the 3D point corresponding to pixel (i, j).
fea3d denotes the 3D feature; here, too, it must be determined whether the corresponding 3D coordinates exist.
Substituting the known parameters into the formulas, the value of Î_R^t(u, v) is determined on the basis of formula (1).
It should be noted that the procedure above estimates the gray level of the red component at the non-integer location (u, v) (i.e., u and v are not integers); the estimation of the green component and of the blue component is identical to that of the red component and is not repeated here.
This completes the estimation of the color information at non-integer pixel locations and yields the color estimate at each location to be estimated (a simplified numerical sketch follows).
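Because formulas (1)-(4) are not reproduced in the text, the sketch below only illustrates the underlying idea: a Gaussian weighting of the four nearest integer pixels by positional and gray-level distance. The learnable δ, the 3D terms, and the MLP are omitted, and the sigma values are illustrative assumptions.

```python
import numpy as np

def estimate_color_at(u: float, v: float, image: np.ndarray,
                      sigma_pos: float = 1.0, sigma_color: float = 25.0) -> np.ndarray:
    """Estimate the RGB color at a non-integer location (u, v) from its four
    nearest integer pixels, each weighted by a Gaussian of its positional
    distance and of its gray-level distance to an initial guess."""
    h, w = image.shape[:2]
    i0, j0 = int(np.floor(v)), int(np.floor(u))
    neighbors = [(i, j) for i in (i0, i0 + 1) for j in (j0, j0 + 1)
                 if 0 <= i < h and 0 <= j < w]

    # initialize with the color of the nearest integer pixel
    ii = min(max(int(round(v)), 0), h - 1)
    jj = min(max(int(round(u)), 0), w - 1)
    init_gray = image[ii, jj].astype(float).mean()

    weights, colors = [], []
    for (i, j) in neighbors:
        c = image[i, j].astype(float)
        d_pos = (i - v) ** 2 + (j - u) ** 2
        d_gray = (c.mean() - init_gray) ** 2
        wgt = np.exp(-d_pos / (2 * sigma_pos ** 2)) * np.exp(-d_gray / (2 * sigma_color ** 2))
        weights.append(wgt)
        colors.append(c)

    weights = np.array(weights)
    return (weights[:, None] * np.array(colors)).sum(axis=0) / weights.sum()
```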
S230, obtaining the aggregation of multi-timestamp feature bird's-eye views (BEVs) from the high-density point cloud composed of the projection point cloud and the laser color point cloud; each feature BEV carries both planar image information and three-dimensional spatial distribution information.
In an alternative embodiment, the step of composing the high-density point cloud from the projection point cloud and the laser color point cloud includes:
Step one, form an initial high-density point cloud from the projection point cloud and the laser color point cloud.
The depth information of the points in the initial high-density point cloud comprises the depth information of the points in the projection point cloud and the depth information of the points in the laser color point cloud. The depth information of the points in the projection point cloud is unknown and needs to be estimated.
The points of the projection point cloud include the spatial points projected from matched pixels and the spatial points projected from non-matched pixels, as described below; correspondingly, estimating the depth information of the points in the projection point cloud requires estimating the depth information of both kinds of spatial points.
Step two, obtain the depth information of the spatial points projected from the matched pixels of each frame of image in the initial high-density point cloud, so as to obtain the color point cloud within the projection point cloud; here one frame of image comprises view angle images of different views, a matched pixel is a pixel in one view angle image that shares the same three-dimensional projected spatial point with a pixel in another view angle image, and the depth information of both matched pixels is the depth information of that shared spatial point.
This step describes how the depth information of the spatial points projected from matched pixels is estimated, i.e., how the depth information of the matched pixels in each frame of image is acquired.
In an alternative embodiment, acquiring the depth information of the matched pixels in each frame of image includes: matching pixels across the view angle images of different views contained in each frame of image, to obtain the matched pixel pairs of every two view angle images, wherein one matched pixel pair contains two matched pixels; acquiring, for each matched pixel pair, the shared three-dimensional projected spatial point, its coordinates in the coordinate system of each image acquisition device and its coordinates in the world coordinate system; and determining, from those coordinates, the depth information of the three-dimensional projected spatial point under each view, taking it as the depth information of the matched pixel in the corresponding view angle image.
Taking one frame of image as an example and referring to fig. 3, when only the pixel coordinates (u, v) of a point p in the image are known, the position of the corresponding spatial point cannot be determined: by the camera imaging principle, the point p on the image plane may be the image of any point on the ray pP. For a single view angle image, therefore, the coordinates of the spatial points corresponding to its pixels cannot be determined. If, however, one frame of image (of the same scene) contains view angle images of several different views, the coordinates of the spatial point corresponding to a pixel can be determined from the different view angle images.
Referring to fig. 4, the spatial point P corresponding to two matched pixels p and p' on the image planes of two different views can be determined from those two pixels.
One specific determination method is a nonlinear solution. Referring to the nonlinear solution diagram shown in fig. 5, the spatial point P may not be found directly because of noise, so an optimal point P* is sought instead. With the known camera projection matrices M and M', P* is projected onto the two image planes such that MP* is closest to p and M'P* is closest to p'; P* is then the sought point closest to P and can approximately replace P. In other words, P* minimizes the reprojection distances from MP* to p and from M'P* to p' (the nonlinear solution formula of the original publication is not reproduced here).
With this method, the coordinates of the matched pixel pair p and p' of the view angle images in the coordinate system of each camera, and the coordinates of the corresponding spatial point P in the world coordinate system, are obtained, so the pixels of the image planes can be projected into three-dimensional space and the image depth can be determined. The image depth is the z coordinate of the spatial point in the coordinate system of each view camera. Once the depth information of the matched pixels in each frame of image has been obtained, the color point cloud within the projection point cloud can be obtained (a minimal triangulation sketch is given below).
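The following sketch uses the standard linear (DLT) least-squares triangulation as a stand-in for the nonlinear refinement described above; it is not the patent's exact solver. M and M' are assumed to be 3x4 camera projection matrices.

```python
import numpy as np

def triangulate(p: np.ndarray, p_prime: np.ndarray,
                M: np.ndarray, M_prime: np.ndarray) -> np.ndarray:
    """Recover the spatial point P* shared by a matched pixel pair (p in one
    view, p' in another) from the two 3x4 projection matrices M and M'."""
    u, v = p
    u2, v2 = p_prime
    A = np.stack([
        u * M[2] - M[0],
        v * M[2] - M[1],
        u2 * M_prime[2] - M_prime[0],
        v2 * M_prime[2] - M_prime[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    X = X / X[3]                     # homogeneous -> Euclidean
    return X[:3]                     # world coordinates of P*

# The depth of P* under each view is then the z coordinate of the point
# expressed in that view camera's coordinate system.
```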
Step three, form a dense point cloud from the color point cloud and the laser color point cloud.
From the above, the depth information of every point in this dense point cloud is known.
Step four, estimate, based on the dense point cloud, the depth information of the spatial points projected from non-matched pixels in the initial high-density point cloud.
That is, the depth information of the spatial points projected from non-matched pixels is estimated based on the dense point cloud, and the depth information of the initial high-density point cloud is corrected based on the dense point cloud.
As described above, the projection point cloud also contains spatial points projected from non-matched pixels, whose depth information likewise needs to be estimated; this is done based on the dense point cloud.
Estimating the depth information of the spatial points projected from non-matched pixels includes:
First step: build the information vector of each pixel whose spatial information is unknown (the non-integer pixels) in each view angle image: (ur, vr, I_R(ur,vr), I_G(ur,vr), I_B(ur,vr), F(ur,vr), ...) (the remainder of the vector appears only as a formula image in the original publication and is not reproduced here).
Second step: learn the local structural features of the initial high-density point cloud from its 3D coordinates with a method combining principal component analysis (PCA) and PointNet.
Third step: the local structural feature of the initial high-density point cloud, H(xhd, yhd, zhd) = [Fstructure1(xhd, yhd, zhd) ⊙ Fstructure2(xhd, yhd, zhd)], is spliced with the information vector of the initial high-density point cloud to form its feature vector; that is, the feature vector of each point of the initial high-density point cloud is (uhd, vhd, I_R(uhd,vhd), I_G(uhd,vhd), I_B(uhd,vhd), F(uhd,vhd), ...) (the remainder appears only as a formula image in the original publication).
Fourth step: aggregate the feature vectors of the initial high-density point cloud through a feature aggregation network based on a deformable attention mechanism to obtain the aggregated feature pcfh of each point, and input pcfh into a multi-layer perceptron to predict the depth of each point in the initial high-density point cloud, i.e., D_pre(uhd, vhd) = MLP(pcfh).
The depth information of every point in the initial high-density point cloud (including the spatial points projected from non-matched pixels) is thereby obtained.
Step five: and correcting the depth information of the space points projected by the non-matching pixels in the initial high-density point cloud based on the dense point cloud to obtain the high-density point cloud.
Specifically, the estimated depth information of the spatial points projected by non-matching pixels in the initial high-density point cloud is corrected based on the loss L(D_pre) = ‖(D_pre − D_gt) ⊙ I(D_gt > 0)‖_2, where D_pre is the estimated depth and D_gt is the true depth (i.e. the depth of the dense point cloud) used to supervise the estimate. Because only part of the points of each view image (namely the image pixels obtained by projecting points of the laser point cloud onto each view image through the camera intrinsic and extrinsic parameters, together with the matched pixels) obtain accurate depth through the preceding processing, D_gt only considers pixels with valid depth values, i.e. only the depth of points in the dense point cloud is substituted into D_gt.
I(·) is an indicator function: I = 1 where D_gt > 0 and I = 0 otherwise. ⊙ denotes element-wise multiplication. L(D_pre) is the depth prediction loss function. The corresponding parameters are learned and adjusted through this loss so that the learned features are suitable for depth estimation and target detection.
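A minimal PyTorch sketch of this masked depth loss is given below; the function name and tensor shapes are illustrative assumptions and not part of the patent.

```python
import torch

def masked_depth_loss(d_pred: torch.Tensor, d_gt: torch.Tensor) -> torch.Tensor:
    # L(D_pre) = ||(D_pre - D_gt) ⊙ I(D_gt > 0)||_2 : only pixels that have a
    # valid depth from the dense point cloud supervise the prediction.
    mask = (d_gt > 0).float()                   # indicator I(D_gt > 0)
    residual = (d_pred - d_gt) * mask           # element-wise multiplication ⊙
    return torch.linalg.vector_norm(residual)   # L2 norm of the masked residual
```

In practice the norm can also be averaged over the number of valid pixels so that the loss scale does not depend on how many dense-depth points fall inside the image.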
The following describes how to obtain a multi-time-series characteristic aerial view from the high-density point cloud.
In an alternative embodiment, acquiring the multi-time-sequence characteristic aerial view according to the high-density point cloud includes:
step one, obtaining information vectors of each point in the high-density point cloud; the information vector comprises two-dimensional coordinates, three-channel gray values, color features and features of the high-density point cloud, wherein the features of the high-density point cloud comprise features of structural information of the high-density point cloud, and the two-dimensional coordinates comprise two-dimensional coordinates of points in view angle images of a plurality of different view angles contained in one frame of image.
After the high-density point cloud is obtained through the above steps, the information vector of each of its points can be obtained. Specifically, the information vector of each point includes two-dimensional coordinates, three-channel gray values, color features, and features of the high-density point cloud. That is, the information vector is (u_hd, v_hd, I_R(u_hd, v_hd), I_G(u_hd, v_hd), I_B(u_hd, v_hd), F(u_hd, v_hd), x_hd, y_hd, z_hd, PCF(x_hd, y_hd, z_hd)), where u_hd, v_hd are the two-dimensional coordinates, I_R(u_hd, v_hd), I_G(u_hd, v_hd), I_B(u_hd, v_hd) are the three-channel gray values, x_hd, y_hd, z_hd are the three-dimensional coordinates, and F(u_hd, v_hd) and PCF(x_hd, y_hd, z_hd) are the color features and the features of the high-density point cloud.
And secondly, performing cylindrical voxel segmentation on the high-density point cloud, and performing voxel coding on the segmented spatial point cloud based on the information vector of each point in the segmented voxel spatial point cloud to obtain the voxel point cloud.
The high-density point cloud is dense near the sensor and sparse in the distance; cylindrical voxels adapt well to this distribution characteristic.
The cylindrical voxel segmentation of the high-density point cloud comprises the following steps:
first, the high-density point cloud is mapped from a rectangular coordinate system to a cylindrical coordinate system.
Specifically, the coordinate mapping is realized by ρ = √(x² + y²), θ = arctan(y/x), z = z, where (x, y, z) are the rectangular coordinates of a point and (ρ, θ, z) are its cylindrical coordinates.
and secondly, dividing the point cloud space by taking Deltaρ, deltaθ and Deltaz as intervals. Fig. 6 shows a schematic view of voxel segmentation of a point cloud.
Thirdly, voxel coding is performed on each divided spatial point cloud using PointNet, so that the information of the multiple points contained in each spatial cell is aggregated into a single piece of voxel information, yielding the voxelized point cloud.
Thereby, cylindrical voxel segmentation of the high-density point cloud is completed. Compared with cubic voxelization, the cylindrical voxelization is more suitable for the near-dense and far-sparse distribution characteristics of the point cloud, can effectively improve the non-empty rate of the point cloud voxels, and is beneficial to improving the accuracy rate of target detection.
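The sketch below (Python/NumPy; the function name, the θ shift and the z-origin handling are assumptions made for illustration) shows the Cartesian-to-cylindrical mapping and the assignment of each point to a (Δρ, Δθ, Δz) voxel; a per-voxel PointNet encoder would then aggregate the points that share a voxel index.

```python
import numpy as np

def cylindrical_voxel_index(points_xyz: np.ndarray,
                            d_rho: float, d_theta: float, d_z: float,
                            z_min: float = -5.0) -> np.ndarray:
    """Return one (rho, theta, z) voxel index triple per input point."""
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)            # radial distance from the sensor
    theta = np.arctan2(y, x) + np.pi          # azimuth, shifted to [0, 2*pi)
    idx = np.stack([rho // d_rho,
                    theta // d_theta,
                    (z - z_min) // d_z], axis=1)
    return idx.astype(np.int64)
```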
And thirdly, carrying out asymmetric convolution on the voxelized point cloud to aggregate information vectors of different points, and obtaining the aggregated information point cloud.
Correspondingly, the aggregated voxel information described above is output. Compared with conventional sparse convolution, asymmetric convolution enhances the response in the horizontal and vertical directions and improves the characterization capability of the point cloud.
And fourthly, compressing the aggregated information point cloud along the height direction, and obtaining the multi-time-sequence characteristic aerial view.
And fifthly, processing the multi-time-sequence characteristic aerial views based on the image characteristic aggregation method of the deformable attention mechanism, and acquiring aggregation of the multi-time-sequence characteristic aerial views.
The bird's eye views at different moments are aggregated based on a deformable attention mechanism. The aggregated information point cloud is compressed in the height direction to obtain the characteristic aerial views of each frame, and the image characteristic aerial views are processed by using an image characteristic aggregation method of a deformable attention mechanism, so that aggregation of the characteristic aerial views of different time sequences can be realized, and the obtained aggregation characteristics are beneficial to improving the target detection accuracy.
For a single aerial view, given an input aerial view x ∈ R^(C×R×Φ), where C is the information dimension, R is the radial length of the information aerial view and Φ is its angular extent, let q be the sequence number of the query information z_q and p_q the two-dimensional coordinates corresponding to z_q. The information aggregation based on the deformable attention mechanism is then performed as described below.
In this aggregation, ξ_mqk = MLP(p_q ⊙ p_mk ⊙ Δp_mqk ⊙ ‖Δp_mqk‖). This embodiment employs a multi-head attention mechanism, where m is the index of the attention head and M is the total number of attention heads (M = 8 in this embodiment); k indexes the key information points sampled around the query information point, and K is the total number of surrounding information points attended to by each point (K = 4 in this embodiment), that is, only the 4 key information points most relevant to the query information z_q are aggregated with z_q. Δp_mqk is the sampling offset, i.e. the offset of the coordinates of a sampled key information point relative to the coordinates p_q of the query information point z_q; it is learned from z_q by a fully connected (FC) layer. When the offset is non-integer, the feature of the corresponding information point is obtained by interpolating the features of the 4 nearest feature points. β_mqk(ξ_mqk, η_mqk) is the correlation weighting coefficient of a key feature point with respect to the query feature point. MLP stands for multilayer perceptron (Multilayer Perceptron, MLP). F_DA is the aggregated feature based on the deformable attention mechanism.
In this embodiment, the correlation weighting coefficient is a function of the query feature point, of the coordinates and features of the related feature points, and of the Euclidean distances between them. In conventional feature extraction, the learned correlation weighting coefficients are fixed once model training is complete and do not change with the input sample. In the method of this embodiment, the correlation weighting coefficient remains a function of the coordinates and features of the input sample after training, so it varies with the input sample and is therefore more flexible and adaptive. As a result, the aggregated features are more strongly correlated with the input sample, which improves the accuracy of target detection.
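A condensed PyTorch sketch of deformable attention over a BEV feature map follows. It assumes query positions are already given in normalized [-1, 1] coordinates, learns the offsets directly in that normalized space, folds the per-head value projection into the final linear layer for brevity, and uses illustrative names; it is a sketch under those assumptions, not the patent's exact network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BEVDeformableAttention(nn.Module):
    """Each query z_q samples K offset locations per head from the BEV map,
    weights them with query-dependent coefficients and sums the results."""
    def __init__(self, dim: int, heads: int = 8, k_points: int = 4):
        super().__init__()
        self.heads, self.k = heads, k_points
        self.offset = nn.Linear(dim, heads * k_points * 2)   # Δp_mqk from z_q (FC layer)
        self.weight = nn.Linear(dim, heads * k_points)       # β_mqk before softmax
        self.proj = nn.Linear(dim, dim)

    def forward(self, bev, q_feat, q_pos):
        # bev: (B, C, H, W); q_feat: (B, Nq, C); q_pos: (B, Nq, 2) in [-1, 1]
        B, C, H, W = bev.shape
        Nq = q_feat.shape[1]
        offsets = self.offset(q_feat).view(B, Nq, self.heads * self.k, 2)
        weights = self.weight(q_feat).view(B, Nq, self.heads, self.k).softmax(-1)
        grid = (q_pos.unsqueeze(2) + offsets).clamp(-1, 1)          # sampling locations
        sampled = F.grid_sample(bev, grid, align_corners=False)     # (B, C, Nq, M*K)
        sampled = sampled.view(B, C, Nq, self.heads, self.k)
        # weighted sum over heads and sampled key points -> (B, Nq, C)
        out = torch.einsum('bcqmk,bqmk->bqc', sampled, weights)
        return self.proj(out)
```

The multi-time-sequence version described next applies the same sampling to the bird's-eye views of the T time steps and additionally sums over t.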
For the aerial views at different moments, the input is a set of bird's-eye views of different time sequences, one for each moment t. The multi-scale feature aggregation based on the deformable attention mechanism uses ξ_mtqk = MLP(p_q ⊙ p_mtk ⊙ Δp_mtqk ⊙ ‖Δp_mtqk‖), where t is the time-sequence index and T is the total number of time sequences (T = 4 is generally taken); the other coefficients have the same meanings as above, with the subscript t distinguishing the bird's-eye views of different time sequences. F_MSDA is the aggregated-information bird's-eye view of the multi-time-sequence bird's-eye views based on the deformable attention mechanism, and outputting the aggregated feature F_MSDA yields the aggregation of the multi-time-sequence characteristic aerial views.
S240, determining and outputting detection information according to aggregation of the multi-time sequence characteristic aerial views; wherein the detection information characterizes object category and object position information in the target scene within the current time period.
The aggregation of the bird's-eye views of different time sequences can be fed into different task heads to complete various tasks, such as 3D target detection or map segmentation. For example, feeding the aggregation into a single-stage or two-stage 3D target detection head (a single-stage head such as the YOLO series, a two-stage head such as the RCNN series) realizes the 3D target detection task, and feeding it into an image segmentation head realizes map segmentation.
In summary, the present embodiment provides a target detection method, including: acquiring a multi-frame image and a multi-frame point cloud of a target scene, wherein one frame of image corresponds to one frame of point cloud acquired synchronously; projecting pixel points of each frame of image into a three-dimensional space, forming a projection point cloud by using the space points projected by the pixel points of each frame of image, and acquiring color information of points in each frame of point cloud to generate a laser color point cloud; acquiring aggregation of the multi-time-sequence characteristic aerial view according to the high-density point cloud formed by the projection point cloud and the laser color point cloud; wherein the characteristic aerial view has planar image information and three-dimensional space distribution information; according to the aggregation of the multi-time sequence characteristic aerial view, determining and outputting detection information; wherein the detection information characterizes object category and object position information in the target scene within the current time period.
That is, each pixel of the multi-frame images of the target scene is projected into three-dimensional space to generate a projected point cloud with color information, which carries rich color information as well as rich edge texture information. For the multi-frame point clouds of the target scene, the color information of the points in each frame of point cloud is acquired to generate a laser color point cloud, which also carries rich color information. The projected point cloud and the laser color point cloud are then combined into a high-density point cloud that has both rich color information and edge texture information. Meanwhile, owing to the added projected point cloud, the high-density point cloud contains more points that are more orderly and uniformly arranged, overcoming the drawbacks of sparsity, disorder and uneven spatial arrangement. Therefore, obtaining the aggregation of the multi-time-sequence characteristic aerial views from the high-density point cloud, and determining and outputting the detection information from this aggregation, makes the output detection information more accurate.
Referring to fig. 7, another embodiment of the present application provides a target detection method, including:
S710, acquiring a multi-frame image and a multi-frame point cloud of a target scene, wherein one frame of image corresponds to one frame of point cloud acquired synchronously.
The description of this step may refer to the related description in step S210, and will not be repeated here.
S720, screening out key frame images and non-key frame images in the multi-frame images; and acquiring a keyframe point cloud corresponding to the time sequence of the keyframe image, and acquiring a non-keyframe point cloud corresponding to the time sequence of the non-keyframe image.
In actual data acquisition, highly similar data are frequently encountered, and this high similarity introduces redundant information. Processing redundant information contributes little to the result while consuming a large amount of computation, which lowers target detection efficiency. To reduce the processing of redundant data and improve the efficiency of the network, key frame images can be screened out from the acquired multi-frame images. It should be noted that key frame screening is applied to images acquired by the same camera.
The present embodiment provides two methods for screening out key frame images. The first method is a screening method of key frame images based on Euclidean distance, and the second method is a screening method of key frame images based on image correlation coefficients.
The two methods are described in detail below, respectively.
The first method is a screening method of key frame images based on Euclidean distance.
First, the dimension of each frame of image obtained in step S710 is reduced based on PCA, and then the key frame images are screened out according to the distance between the dimension-reduced vectors of each frame of image.
Each acquired frame image is reduced in dimension based on PCA, which projects the high-dimensional data into a low-dimensional space while retaining as much of its main information as possible. The specific steps are as follows. First, the pictures are straightened: for the input picture set (i.e. the multi-frame images) I = {I_1, I_2, …, I_m}, each two-dimensional picture matrix is straightened into a one-dimensional vector, giving a matrix X of m rows and n columns, where m is the total number of pictures (the number of samples) and n is the total number of pixels in each picture (the sample dimension); the objective of PCA dimension reduction is to reduce each sample from n dimensions to a smaller dimension k, with k < n. Second, the samples are de-centered: the sample mean is calculated and subtracted from each row of X (i.e. each sample), yielding the de-centered matrix B, whose mean is 0. Third, the covariance matrix of the de-centered matrix is calculated, C = BᵀB/(m − 1). Fourth, the eigenvalues of C, λ_1 ≥ λ_2 ≥ … ≥ λ_n ≥ 0, and the eigenvector corresponding to each eigenvalue are calculated. Fifth, the eigenvectors corresponding to the first k eigenvalues are selected to form a transformation matrix T_{n×k}, from which the dimension-reduced image vector set is obtained as B·T_{n×k}. That is, an original image is represented by an n-dimensional vector, and after PCA dimension reduction each image is represented by a k-dimensional vector.
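A direct NumPy transcription of these five steps is sketched below (function name assumed); for image-sized n the covariance matrix is very large, so in practice an SVD of the de-centered matrix B would be used instead.

```python
import numpy as np

def pca_reduce(images: np.ndarray, k: int) -> np.ndarray:
    """images: (m, H, W) gray frames -> (m, k) dimension-reduced vectors."""
    X = images.reshape(images.shape[0], -1).astype(np.float64)  # straighten: (m, n)
    B = X - X.mean(axis=0, keepdims=True)                       # de-center
    C = (B.T @ B) / (B.shape[0] - 1)                            # covariance: (n, n)
    eigvals, eigvecs = np.linalg.eigh(C)                        # ascending eigenvalues
    T = eigvecs[:, ::-1][:, :k]                                 # top-k eigenvectors
    return B @ T                                                # (m, k)
```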
The key frame images are then screened out according to the distances between the dimension-reduced vectors of the frames. Images that are close in time sequence in the input image set tend to be highly similar, so they carry more redundant information and make the whole image processing pipeline slower; the processing of redundant information should therefore be minimized. Key frames are screened using similarity as the criterion (i.e. a frame whose similarity to the preceding adjacent key frame is small is taken as a key frame), as shown in the sketch after this paragraph. The specific steps are as follows. First, key frame initialization: the first frame is set as the initial key frame, with I_ki denoting the i-th key frame image. Second, starting from the second frame, the Euclidean distance D between each subsequent frame and the current key frame is calculated in turn from their dimension-reduced vectors; a threshold D_th is set, and when D > D_th (i.e. the distance is large and the similarity is small), that frame is taken as the next key frame. Third, and so on: when screening the r-th key frame I_kr, the Euclidean distance between each frame following the (r−1)-th key frame and the (r−1)-th key frame is calculated in turn, and when D > D_th the j-th frame is selected as the r-th key frame. Fourth, the third step is executed cyclically until all pictures have been screened. This yields the set of dimension-reduced key frame vectors and the key frame image set {I_k1, I_k2, …, I_ks}, where s is the total number of key frame images.
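The screening loop itself, assuming the reduced vectors from the PCA step above, is then a simple greedy pass (threshold value and function name are illustrative):

```python
import numpy as np

def select_keyframes_euclidean(reduced: np.ndarray, d_th: float) -> list:
    """The first frame is a key frame; a later frame becomes a key frame when
    its Euclidean distance to the most recent key frame exceeds D_th."""
    key_idx = [0]
    for i in range(1, reduced.shape[0]):
        if np.linalg.norm(reduced[i] - reduced[key_idx[-1]]) > d_th:
            key_idx.append(i)
    return key_idx
```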
The second method is a screening method of key frame images based on image correlation coefficients.
Each frame image is a color image; it is first converted to a gray image by gray-scale processing, and the gray-level distribution histogram of the gray image is then calculated. Finally, key frames are screened according to the similarity of the gray-level distribution histograms of the frames. The specific steps are as follows. First, gray-scale processing is applied to the input color images and the gray-level distribution histogram of each gray image is calculated, giving a histogram set for the image sequence. Second, key frame initialization: the first frame is set as the first key frame. Third, and so on: when screening the r-th key frame I_kr, starting from the frame following the (r−1)-th key frame, the correlation coefficient ρ between each subsequent frame's histogram and that of the (r−1)-th key frame is calculated in turn; a threshold ρ_th is set, and when ρ < ρ_th (i.e. the similarity is small), the j-th frame is taken as the r-th key frame. Fourth, the third step is executed cyclically until all pictures have been screened. This yields the histogram set of the key frame images and the key frame image set {I_k1, I_k2, …, I_ks}, where s is the total number of key frame images.
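A sketch of this second screening method is given below; since the patent's exact correlation formula is not reproduced here, the Pearson correlation coefficient of the two normalized gray-level histograms is assumed as the similarity measure, and the names are illustrative.

```python
import numpy as np

def gray_histogram(img_rgb: np.ndarray, bins: int = 256) -> np.ndarray:
    # Grayscale with the 0.3R + 0.59G + 0.11B weighting, then a normalized
    # gray-level distribution histogram.
    gray = 0.3 * img_rgb[..., 0] + 0.59 * img_rgb[..., 1] + 0.11 * img_rgb[..., 2]
    hist, _ = np.histogram(gray, bins=bins, range=(0.0, 255.0))
    return hist / max(hist.sum(), 1)

def select_keyframes_correlation(frames: list, rho_th: float) -> list:
    """A frame becomes a key frame when the correlation between its histogram
    and that of the most recent key frame falls below rho_th."""
    key_idx = [0]
    ref = gray_histogram(frames[0])
    for i in range(1, len(frames)):
        hist = gray_histogram(frames[i])
        rho = np.corrcoef(ref, hist)[0, 1]
        if rho < rho_th:
            key_idx.append(i)
            ref = hist
    return key_idx
```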
After the key frame images in the multi-frame images are screened out, non-key frame images in the multi-frame images can be obtained.
After screening out the key frame image and the non-key frame image in the multi-frame image, acquiring the key frame point cloud corresponding to the time sequence of the key frame image, and acquiring the non-key frame point cloud corresponding to the time sequence of the non-key frame image. That is, a point cloud synchronized with a key frame image is defined as a key frame point cloud, and a point cloud synchronized with a non-key frame image is defined as a non-key frame point cloud.
Referring to fig. 8, in an alternative embodiment, after screening out the key frame image, the non-key frame image, the key frame point cloud and the non-key frame point cloud in the multi-frame image, the target detection method further includes:
S810, obtaining a noise reduction image of the key frame image and edge texture images with different scales.
And carrying out noise reduction processing on all the key frame images to obtain the noise reduction image of each key frame image and edge texture images with different scales.
Taking a key frame image as an example, how to obtain a noise reduction image of the key frame image and edge texture maps with different scales will be described below.
Alternatively, wavelet decomposition may be performed on the key frame image, and a threshold may be set to perform noise reduction processing on the key frame image. That is, the texture and edge characteristics of each layer of image are recombined by using the high-frequency wavelet coefficients in the horizontal and vertical directions subjected to threshold filtering.
Specifically, the method comprises the following steps:
and step one, carrying out wavelet decomposition on the key frame image to obtain a plurality of different wavelet components.
First, wavelet decomposition is performed on the key frame image; in this embodiment a three-level Haar wavelet decomposition is used, yielding a plurality of different wavelet components.
Setting different thresholds for different wavelet component graphs, and carrying out threshold filtering on each wavelet component according to the threshold corresponding to each wavelet component to obtain a plurality of wavelet components after noise reduction.
Based on each wavelet component obtained from the decomposition, a threshold Wavecoeff_th is adaptively calculated to realize filtering and noise reduction of that component: Wavecoeff_th = σ·√(2·ln(M·N)), where M is the length of the key frame image, N is the width of the key frame image and σ is the noise standard deviation. σ = median(Wavcoeff)/0.6745, where median(Wavcoeff) denotes the median value of each wavelet component.
Each wavelet component is soft-threshold filtered with its calculated threshold. The soft-threshold filtering is formulated as Wavcoeff_filtered = sign(Wavcoeff)·max(|Wavcoeff| − Wavecoeff_th, 0).
When Wavcoeff_filtered is 0, the corresponding wavelet coefficient is filtered out directly. When Wavcoeff_filtered is not 0, the wavelet coefficient accounts for a relatively large share of the key frame image content and is retained to reconstruct the image.
And thirdly, recombining a noise reduction image according to the plurality of wavelet components after noise reduction, and recombining edge texture images with different scales according to the plurality of wavelet components after noise reduction.
When the noise-reduced image is recombined according to the plurality of wavelet components after noise reduction, the wavelet components in the diagonal direction contain more noise information, and contain less image information. Therefore, wavelet components in the diagonal direction are removed in the image reconstruction process, and the rest of the filtered wavelet component reconstructed images are applied to obtain the noise reduction image.
When edge texture images with different scales are recombined according to the plurality of wavelet components after noise reduction, the edge texture images of the wavelet components of different layers are recombined through the horizontal wavelet components and the vertical wavelet components of different layers respectively. In an alternative embodiment, the initial edge texture images with different scales are recombined according to the wavelet components after noise reduction, and the erosion operation and the expansion operation are performed on the initial edge texture image with each scale so as to obtain the edge texture images with different scales. The edges in the initial edge texture image may be discontinuous, the erosion operation is to further reduce noise, the dilation operation is to join the points of discontinuity into a line, and finally a closed edge is obtained.
Through the steps, the noise reduction image of the key frame image and the edge texture images with different scales are obtained.
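The sketch below (Python with PyWavelets; the per-band threshold form and the edge-map definition |cH| + |cV| are assumptions) illustrates the three-level Haar decomposition, soft-threshold filtering, reconstruction without the diagonal bands, and per-level edge-texture maps; erosion and dilation would follow to close the edges.

```python
import numpy as np
import pywt

def wavelet_denoise_and_edges(gray: np.ndarray, levels: int = 3):
    """Return the denoised image and one edge-texture map per decomposition level."""
    M, N = gray.shape
    coeffs = pywt.wavedec2(gray, 'haar', level=levels)
    kept = [coeffs[0]]                                   # approximation coefficients
    edge_maps = []
    for cH, cV, cD in coeffs[1:]:
        sigma = np.median(np.abs(cD)) / 0.6745           # noise estimate
        th = sigma * np.sqrt(2.0 * np.log(M * N))        # adaptive threshold
        cH_f = pywt.threshold(cH, th, mode='soft')
        cV_f = pywt.threshold(cV, th, mode='soft')
        kept.append((cH_f, cV_f, np.zeros_like(cD)))     # drop diagonal band
        edge_maps.append(np.abs(cH_f) + np.abs(cV_f))    # horizontal + vertical detail
    denoised = pywt.waverec2(kept, 'haar')[:M, :N]
    return denoised, edge_maps
```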
S820, inputting the noise reduction image into an image feature extraction network to obtain each scale feature image of the noise reduction image, and overlapping edge texture images with different scales on each scale feature image to obtain each scale feature image subjected to edge texture enhancement processing.
The image feature extraction network is, for example, res2Net, and Res2Net can realize fine granularity extraction of image features without increasing the calculation amount. The Res2Net module equally divides the input image into a plurality of different data blocks, and convolves the different data blocks for different times.
However, in the process of extracting features from image data, res2Net only adopts a method of equally dividing the data, and the data of each part is not distinguished, and different layer convolution processing is directly performed, so that the processing method is too random. Therefore, the embodiment further designs an image feature extraction network for guiding Res2Net based on image information entropy so as to realize detailed and proper feature extraction of image data with different complexity.
For example, res2Net module equally divides image data into 4 data blocks x 1 、x 2 、x 3 And x 4 For different data blocks, 3 x 3 convolutions are performed a different number of times, data block x 1 ~x 4 The number of convolutions over 3 x 3 is 0, 3, 2, 1, respectively. And the more the number of convolutions, the more the information contained in the characteristics of the data block. For example, for data block x 2 ,y 2 Comprises x 2 Primary semantic information of y 3 Comprises x 2 Intermediate semantic information of y 4 Wherein x is included in 2 Is described. And the output result of the module is y 1 ~y 4 Is a combination of (a) and (b). I.e. the module output contains x 2 Is described herein, and semantic information. Then, the more textured and complex part of the data in the image can be placed at x 2 Sequentially placing data blocks into x in the graph according to the complexity of the data (from high to low) 2 ,x 3 ,x 4 ,x 1 At the location of the vehicle. Therefore, the image can be subjected to feature extraction according to the complexity of the image block in a guiding way, and the quality of the feature image is improved. The image feature extraction method based on the image information entropy guides Res2Net to sort the complexity of the image data blocks based on the image information entropy.
The image information entropy is calculated as I_E = −Σ_{i=0}^{K−1} p_i · log₂ p_i, where i is the image gray level, K is the total number of gray levels, and p_i is the proportion of pixels with gray level i among all pixels. A larger information entropy I_E corresponds to a more complex image; conversely, a smaller I_E corresponds to a simpler image.
Since Res2Net applies different numbers of convolution layers to different parts of the data, guiding it with image information entropy so that data of different complexity receive different numbers of convolutions allows the feature extraction task to be completed better.
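A minimal sketch of the information entropy computation for a tile (256 gray levels assumed, names illustrative) is:

```python
import numpy as np

def image_entropy(gray_tile: np.ndarray, levels: int = 256) -> float:
    # I_E = -sum_i p_i * log2(p_i) over the tile's gray-level distribution;
    # larger values indicate more complex (more textured) content.
    hist, _ = np.histogram(gray_tile, bins=levels, range=(0, levels - 1))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```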
That is, in an alternative embodiment, the method prior to inputting the noise reduced image into the image feature extraction network comprises:
step one, converting the noise reduction image into a gray image, dividing the gray image into tiles with the same size, and obtaining the information entropy value of each tile.
That is, the key frame image is converted to gray scale. A common graying method is adopted: the gray value is the weighted average of the three color channels, with weights chosen according to the sensitivity of the human eye to the R, G and B colors.
The specific formula is as follows:
I_Gray(x, y) = 0.3 × I_R(x, y) + 0.59 × I_G(x, y) + 0.11 × I_B(x, y)
where I_Gray is the grayed image, I_R is the red component of the color image, I_G is the green component, and I_B is the blue component.
The gray image I_Gray is partitioned into tiles of size 3 × 3.
The information entropy value of each tile is calculated, and every pixel in a tile is assigned that tile's entropy value, i.e.
I_E(x, y) = I_E_3×3(x, y),
where I_E_3×3 is the information entropy of the 3×3 tile containing the pixel (x, y).
And secondly, dividing the gray image according to the information entropy value of each image block, and obtaining a plurality of image areas obtained by dividing.
Specifically, first, for each tile, according to the size of the information entropy value of the pixels in the tile, the number duty ratio of each group of pixels with the same information entropy value is counted, so as to obtain the information entropy distribution statistical histogram of the tile. And then carrying out Gaussian smoothing filtering treatment on the information entropy distribution statistical histogram, and identifying peak points of the information entropy distribution statistical histogram. And carrying out fuzzy C-means clustering on each pixel point in the information entropy distribution statistical histogram by taking the number of peak points as the clustering number and the information entropy corresponding to the peak points as the clustering center, and updating the information entropy value of the pixel points belonging to the same kind to the information entropy value of the clustering center to which the pixel points belong. After the information entropy value is updated, dividing the gray image according to the updating result of the information entropy value, and obtaining a plurality of image areas obtained by dividing.
First, an information entropy diagram of a gray image may be obtained from the information entropy value of each tile. Optionally, the entropy diagram is subjected to gaussian smoothing filtering to obtain an entropy smoothing diagram with continuous entropy values.
The information entropy distribution histogram H_im_entr of the information entropy diagram is calculated as H_im_entr(e) = N_e / N, where N_e is the total number of pixels in the information entropy diagram whose entropy value is e, and N is the total number of pixels in the key frame image. The number of peaks of the information entropy distribution histogram is denoted N_peak, and the information entropy values corresponding to the peaks of H_im_entr(e) are denoted E_hist_peak.
The number of clusters N_cluster for fuzzy C-means clustering is determined from the peak number N_peak as N_cluster = min(N_peak, N_th), where N_th is the maximum number of clusters, used to prevent the cluster count from becoming excessively large.
The cluster centers are initialized with E_hist_peak, and fuzzy C-means clustering is performed on the pixels of the smoothed information entropy diagram. According to the clustering result, the information entropy of pixels belonging to the same category is changed to the center entropy value of that category.
Through the above steps, the gradation of the information entropy values of the information entropy diagram and the segmentation of the diagram are completed.
The segmentation result of the information entropy diagram is then mapped onto the gray image to realize the segmentation of the gray image, yielding the several image areas obtained by the division.
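A compact 1-D fuzzy C-means sketch for the entropy gradation step is shown below; the update rules are the standard FCM equations with fuzzifier m = 2, the centers are initialized with the histogram-peak entropies E_hist_peak, and the names are illustrative assumptions.

```python
import numpy as np

def fcm_gradation(entropy_values: np.ndarray, init_centers: np.ndarray,
                  m: float = 2.0, n_iter: int = 50) -> np.ndarray:
    """Cluster per-pixel entropy values and snap each value to its cluster center."""
    values = entropy_values.ravel().astype(np.float64)
    centers = init_centers.astype(np.float64).copy()
    eps = 1e-9
    for _ in range(n_iter):
        d = np.abs(values[:, None] - centers[None, :]) + eps   # (N, C) distances
        u = d ** (-2.0 / (m - 1.0))                            # membership numerators
        u /= u.sum(axis=1, keepdims=True)                      # fuzzy memberships
        centers = (u ** m * values[:, None]).sum(0) / (u ** m).sum(0)
    labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
    return centers[labels].reshape(entropy_values.shape)       # graded entropy map
```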
And thirdly, arranging each image area to a corresponding position in the feature extraction network according to the information entropy value of each image area, wherein the image areas with the information entropy value larger than the preset information entropy value are arranged at positions which can pass through the first number of convolution layers, and the image areas with the information entropy value smaller than or equal to the preset information entropy value are arranged at positions which can pass through the second number of convolution layers, and the first number is larger than the second number.
According to the information entropy value of each image area in the segmentation result, each area of the gray level image is arranged at different positions of the Res2Net feature extraction network, and image feature extraction guided by the information entropy is realized. Image areas with large information entropy values are arranged at positions which can pass through a large number of convolution layers, and image areas with small information entropy values are arranged at positions which can pass through a small number of convolution layers.
In the process of extracting the features of the noise reduction image, the image information entropy value is introduced to represent the gray distribution complexity of the local image block, so as to guide the image to be optimally segmented. Meanwhile, the data segmentation and data arrangement modes in the multi-scale backbone network Res2Net are improved, namely, image areas with large information entropy values are arranged at positions which can be subjected to more-layer convolution, and image areas with small information entropy values are arranged at positions which are subjected to fewer-layer convolution. Compared with the traditional mode of equally dividing and orderly arranging the data of the multi-scale backbone network Res2Net, the method provided by the embodiment realizes detailed and proper feature extraction of the image, thereby improving the efficiency and accuracy of target detection.
After the processing procedure before the noise reduction image is input to the image feature extraction network is completed, the noise reduction image is input to the image feature extraction network to obtain each scale feature map of the noise reduction image. And overlapping the edge texture images with different scales on the feature images with different scales to obtain feature images with different scales which are subjected to edge texture enhancement processing. That is, the feature map of the different layers is added to the edge texture map of the corresponding layer (the same size) to obtain the feature map of the different layers subjected to edge texture enhancement.
S830, deducing the non-key frame image characteristics according to the optical flow between the key frame image and the non-key frame image.
Alternatively, the non-key frame image features can be deduced by using a sparse feature propagation method, a dense feature aggregation method, a high-performance video target detection method and the like. The following describes 3 methods in detail, respectively.
The first method is as follows: sparse feature propagation methods.
The sparse feature propagation method introduces the concept of key frames for the first time in the video target detection method. Sparse feature propagation methods are designed because similar appearances between adjacent frames typically result in similar features. Therefore, it is not necessary to calculate the features of all frames.
During image feature inference, only the key frame images have their features extracted by the complex feature extraction network. The features of a non-key frame image i are obtained by propagating, based on optical flow, the features of each pixel of the preceding key frame k. The motion of each pixel between the key frame and the non-key frame is recorded in a two-dimensional motion field M_{i→k}, and the feature propagation from key frame k to non-key frame i can be expressed as F_{k→i} = W(F_k, M_{i→k}),
where W is the feature warping function, F_k denotes the key frame features, and M_{i→k} denotes the motion field between the key frame and the non-key frame. The propagated non-key frame features F_{k→i} are thus obtained without computing the true non-key frame features F_i through the complex feature extraction and aggregation networks. The motion field M_{i→k} can be obtained by a lightweight optical flow estimation network from the input key frame image I_k and non-key frame image I_i.
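A sketch of the warping step is given below (PyTorch); the motion field is assumed to be a per-pixel (dx, dy) field of shape (B, 2, H, W) and bilinear sampling is used, which is one common choice rather than a formulation prescribed by the patent.

```python
import torch
import torch.nn.functional as F

def warp_features(feat_k: torch.Tensor, flow_i_to_k: torch.Tensor) -> torch.Tensor:
    """Propagate key-frame features F_k to a non-key frame by sampling them
    at the positions given by the motion field M_{i->k}."""
    B, C, H, W = feat_k.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=feat_k.device),
                            torch.arange(W, device=feat_k.device), indexing='ij')
    base = torch.stack((xs, ys), dim=-1).float()                  # (H, W, 2) pixel grid
    coords = base.unsqueeze(0) + flow_i_to_k.permute(0, 2, 3, 1)  # (B, H, W, 2)
    grid_x = 2.0 * coords[..., 0] / (W - 1) - 1.0                 # normalize to [-1, 1]
    grid_y = 2.0 * coords[..., 1] / (H - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(feat_k, grid, mode='bilinear', align_corners=True)
```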
The second method is as follows: dense feature aggregation methods.
The dense feature aggregation method introduces the concept of temporal feature aggregation for the first time in video target detection. The motivation for this approach is that deteriorated appearance (motion blur, occlusion) can compromise the deep features; this type of problem can be alleviated by aggregating nearby frames.
During image feature inference, all frame images have their features extracted by the feature extraction network. For any frame i, the feature maps of all frames in the time window [i−r, i+r] (r = 2–12) are propagated to the i-th frame by the sparse feature propagation method, forming a feature map set {F_{k→i} | k ∈ [i−r, i+r]}. The aggregated feature map of the i-th frame is obtained by a weighted average of all features in this set, i.e. at pixel p it equals Σ_{k∈[i−r,i+r]} W_{k→i}(p) · F_{k→i}(p).
The weight W_{k→i} is computed from the similarity between the propagated feature map F_{k→i} and the true feature map F_i. When computing the similarity weight, W_{k→i}(p) is not obtained from the features F directly; instead, the features are first projected into embedded features F^e = ε(F), where ε(·) is a tiny fully convolutional network, and the weight coefficient is derived from the similarity of the embedded features.
The features are aggregated point-wise: for any pixel p, the weights are normalized over the adjacent frames so that Σ_{k∈[i−r,i+r]} W_{k→i}(p) = 1.
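A condensed sketch of this weighted aggregation follows (PyTorch); cosine similarity of the embedded features followed by a softmax over the temporal window is assumed as the normalized weight, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def aggregate_window(propagated: list, embedded: list,
                     embed_ref: torch.Tensor) -> torch.Tensor:
    """propagated: per-frame warped features (B, C, H, W); embedded: their
    embeddings (B, Ce, H, W); embed_ref: embedding of the current frame i."""
    sims = [F.cosine_similarity(e, embed_ref, dim=1) for e in embedded]  # (B, H, W) each
    weights = torch.softmax(torch.stack(sims, dim=0), dim=0)             # sum to 1 over frames
    feats = torch.stack(propagated, dim=0)                               # (T, B, C, H, W)
    return (weights.unsqueeze(2) * feats).sum(dim=0)                     # (B, C, H, W)
```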
the third method is as follows: a high-performance video object detection method is provided.
The high-performance video object detection (Towards High Performance Video Object Detection, THP) method proposed by Zhu et al. performs sparse recursive feature aggregation on key frames: given two consecutive key frames k and k′, the aggregated feature of key frame k′ is the weighted, element-wise combination of the features propagated from k and the features of k′ itself,
where ⊙ denotes element-wise multiplication, W_{k→k′}(p) + W_{k′→k′}(p) = 1, and p denotes a pixel.
Deriving non-key frame features directly through sparse feature propagation is efficient; however, some adjacent frames exhibit large appearance changes, and deriving non-key frame features from key frame features purely via optical flow is then error-prone. Therefore, during optical flow estimation, a feature temporal-consistency matrix Q_{k→i} is introduced to quantify, pixel by pixel, whether the propagated non-key frame feature F_{k→i} is a good approximation of the true non-key frame feature F_i. If it is, the non-key frame features are derived by optical flow; otherwise the feature extraction and feature aggregation networks are used to compute the features of that pixel. Specifically, the optical flow estimation network predicts both the optical flow matrix M_{i→k} between the key frame image I_k and the non-key frame image I_i and the feature temporal-consistency matrix Q_{k→i}. If Q_{k→i}(p) is below a threshold Q_p_th, the propagated feature F_{k→i}(p) at pixel p is inconsistent with the true non-key frame feature F_i(p), i.e. F_{k→i}(p) is not a good approximation; the true feature F_i(p) of pixel p is then computed through the feature extraction and feature aggregation networks and used in its place.
In this embodiment, the third method (high-performance video object detection method) is preferably used to derive the non-key frame image features, because the third method improves the efficiency of feature extraction while guaranteeing the quality of the extracted features.
S840, aiming at the key frame point cloud, extracting key frame point cloud characteristics, and carrying out deformable attention characteristic aggregation on the key frame point cloud characteristics to obtain aggregated key frame point cloud characteristics.
The embodiment adopts a method based on principal component analysis (PCA for short) and PointNet combination to extract the characteristics of the key frame point cloud.
S850, for the non-key frame point cloud, deducing the characteristics of the non-key frame point cloud according to the scene flow between the key frame point cloud and the non-key frame point cloud.
A point cloud is obtained by scanning spatial objects with a lidar system to obtain the three-dimensional coordinates of the objects' reflection points; the reflection points of each object are distributed in three-dimensional space as three-dimensional coordinates and are stored in matrix form. The arrangement and structure of the points are therefore the main characteristics of a point cloud, and point clouds of different objects have different arrangement structures. PCA is an effective way to distinguish different targets: it analyses the points in each local region of the point cloud and derives the arrangement structure of the local point cloud from the distribution of eigenvalues. PointNet learns the feature of each point and the global feature of the point cloud through deep learning. Splicing the features obtained by the two methods yields features that contain per-point information, local structural information and global information. In this embodiment, both schemes are used to extract structural features of the key frame point cloud, and the features extracted by the two schemes are finally spliced to obtain the required feature of each point in the key frame point cloud, which includes the local structural information of the key frame point cloud.
Specifically, the step of extracting key frame point cloud features based on PCA and PointNet includes:
first, a principal component analysis method is adopted to learn the structural characteristics of the key frame point cloud. The method specifically comprises the following steps:
inputting key frame point cloud to be processedComputing a keyframe point cloud P pc Tensor of three-dimensional structure of (2)
Wherein,is a point p in the point cloud 0 Nearest neighbor p of (2) i ,i=1,…,N near Is defined by a geometric center of the mold. N (N) near According to the actual situation, N is near =30。
And p is as follows 0 May be slightly different. Due to the tensor of the three-dimensional structure>Is a symmetric positive definite matrix, so it has three non-negative eigenvalues μ 1 ,μ 2 Sum mu 3 And the corresponding feature vectors are mutually orthogonal. Wherein mu 1 ,μ 2 ,/>And mu 1 ≥μ 2 ≥μ 3 And the equal to or greater than 0 represents the variation range of the three-dimensional point cloud along the main characteristic axis. Thus, the local three-dimensional shape can be characterized by the eigenvalues. The point cloud distribution structure corresponding to the relative sizes of the eigenvalues is shown in table 1.
Table 1 maps the relative magnitudes of the eigenvalues to the corresponding local point cloud distribution structures, as detailed below.
The local 3D shape can be characterized by the eigenvalues through the linearity L_μ, planarity P_μ and scattering S_μ measures, which represent 1D, 2D and 3D features respectively: L_μ = (μ_1 − μ_2)/μ_1, P_μ = (μ_2 − μ_3)/μ_1, S_μ = μ_3/μ_1.
As shown in Table 1, when μ_1 ≈ μ_2 ≈ μ_3 ≈ 0, the points in the neighbourhood vary little along all 3 coordinate directions and are distributed near a single point; in this case L_μ ≈ P_μ ≈ 0 and S_μ ≈ 1.
When μ_1 >> μ_2 ≈ μ_3, the coordinates of the neighbouring points are dispersed along 1 direction and concentrated in the other two, so the points are distributed near a line segment; in this case L_μ ≈ 1 and P_μ ≈ S_μ ≈ 0.
When μ_1 ≈ μ_2 >> μ_3, the coordinates of the neighbouring points are dispersed along 2 directions and concentrated in the remaining one, so the points are distributed near a planar region; in this case L_μ ≈ 0, P_μ ≈ 1 and S_μ ≈ 0.
When μ_1 ≈ μ_2 ≈ μ_3 > 0, the neighbouring points are dispersed to a similar degree along all 3 directions and are distributed over a sphere; in this case L_μ ≈ 0, P_μ ≈ 0 and S_μ ≈ 1.
When μ_1 > μ_2 > μ_3, the coordinates of the neighbouring points are dispersed along all 3 directions to different degrees and the points are distributed over an ellipsoid; in this case L_μ, P_μ and S_μ are fractions between 0 and 1.
The eigenvalues are normalized to obtain e_i = μ_i / (μ_1 + μ_2 + μ_3), i = 1, 2, 3.
The Shannon entropy of the normalized eigenvalues, E_μ = −(e_1 ln e_1 + e_2 ln e_2 + e_3 ln e_3), measures the complexity of the local point cloud structure.
The local surface variation is estimated from the eigenvalues as the change of curvature C_μ = μ_3 / (μ_1 + μ_2 + μ_3); as the neighbourhood size k_n gradually varies, a jump in C_μ indicates a strong deviation in the direction of the neighbourhood surface normal.
The omnivariance O_μ = (e_1 · e_2 · e_3)^(1/3) estimates the structural dispersion of the local point cloud.
The anisotropy of the point cloud is measured by A_μ = (μ_1 − μ_3) / μ_1.
Splicing the above quantities gives the principal-component-based structural feature F_structure1 of the point cloud, where ⊙ indicates splicing.
Second, pcd_1 (the key frame point cloud) is input into the PointNet feature extraction network to obtain the feature point set F_structure2.
Third, the features are spliced to obtain the feature point set of pcd_1, h = [F_structure1 ⊙ F_structure2], which constitutes the extracted key frame point cloud features.
Thus, the extraction of key frame point cloud features is completed.
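A NumPy sketch of the eigenvalue-based part of this extraction is given below for a single point; the neighbourhood size, the exact feature list and the brute-force neighbour search are illustrative assumptions (a k-d tree would be used when processing all points).

```python
import numpy as np

def structure_features(points: np.ndarray, idx: int, n_near: int = 30) -> np.ndarray:
    """Eigenvalue features of the local structure tensor around points[idx]."""
    d = np.linalg.norm(points - points[idx], axis=1)
    nbrs = points[np.argsort(d)[:n_near]]                    # N_near nearest neighbours
    centred = nbrs - nbrs.mean(axis=0)                       # subtract geometric center
    tensor = centred.T @ centred / n_near                    # 3x3 structure tensor
    mu = np.sort(np.linalg.eigvalsh(tensor))[::-1] + 1e-12   # mu1 >= mu2 >= mu3
    e = mu / mu.sum()                                        # normalized eigenvalues
    linearity = (mu[0] - mu[1]) / mu[0]
    planarity = (mu[1] - mu[2]) / mu[0]
    scattering = mu[2] / mu[0]
    entropy = float(-(e * np.log(e)).sum())                  # Shannon entropy E_mu
    curvature = mu[2] / mu.sum()                             # change of curvature C_mu
    omnivariance = float(np.prod(e) ** (1.0 / 3.0))          # O_mu
    anisotropy = (mu[0] - mu[2]) / mu[0]                     # A_mu
    return np.array([linearity, planarity, scattering,
                     entropy, curvature, omnivariance, anisotropy])
```

Concatenating this vector with the per-point PointNet feature gives the spliced feature h described above.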
Deformable attention feature aggregation is then performed on the key frame point cloud features to obtain the aggregated key frame point cloud features. According to graph theory, a graph is composed of vertices and edges and is denoted G(V, E), where V represents the vertices of the graph and E the edges between them. For the input point cloud (the key frame point cloud), V = {1, 2, …, N} is the vertex set, E = |V| × |V| is the edge set, and N is the number of points in the input point cloud.
The feature point set associates a feature with each vertex i ∈ V, where F + 9 is the dimension of each vertex feature.
Specifically, in the graph deformable attention aggregation, p_i denotes the coordinates of the i-th point of the input point cloud; m indexes the attention heads and M is the total number of heads (M = 8 in this embodiment); k indexes the key feature points sampled around the query feature, and K is the total number of surrounding feature points attended to by each query feature point (K = 4 in this embodiment). The learned sampling offsets Δp_mik are not necessarily integers, so the feature of a non-integer location of the feature map is obtained by bilinear interpolation of its 4 nearest feature points. W_m, the coefficient matrix of the m-th attention head, is learned by an FC layer from the query feature h'_i(p_i). The per-head results are spliced, and F_GDA is the aggregated feature.
And obtaining the aggregated key frame point cloud characteristics through the two steps.
It should be noted that the above describes the extraction and aggregation of single-layer features of the key frame point cloud. The extraction and aggregation of multi-layer features of the key frame point cloud based on a deformable attention mechanism is similar; the difference is that the feature extraction stage adopts a method combining PCA and PointNet++.
Specifically, the steps of extracting and aggregating the multi-layer features in the keyframe point cloud based on the deformable attention mechanism include:
First, the key frame point cloud P_pc is input into the PointNet++ feature extraction network to obtain a multi-scale feature point set of the point cloud.
Then, the K feature points most strongly correlated with the query feature point i are determined based on a graph deformable attention mechanism (K = 4 here), realizing multi-scale feature aggregation of the point cloud and obtaining the multi-scale aggregated feature based on the graph deformable attention mechanism, with
ξ_mlik = MLP(p_i ⊙ p_mlk ⊙ Δp_mlik ⊙ ‖Δp_mlik‖),
where the three-dimensional coordinates are normalized in order to align the different feature layers, and the function φ_l(·) rescales the normalized coordinates to the coordinates of the l-th feature layer. The other coefficients have the same meanings as above, with the subscript l distinguishing the different feature layers. F_MSGDA is the aggregated point cloud feature.
The aggregated key frame point cloud feature is denoted PCF = F_MSGDA.
The features available for the aggregated high-density point cloud thus include: the non-key frame image features, the non-key frame point cloud features, the key frame image features subjected to edge texture (contour) enhancement, and the aggregated key frame point cloud features.
S730, the pixel point of each key frame image is projected to a three-dimensional space, and the pixel point of each non-key frame image is projected to the three-dimensional space, and the projected point cloud is formed by the space point projected by the pixel point of each key frame image and the space point projected by the pixel point of each non-key frame image.
That is, the projected point cloud includes a spatial point cloud corresponding to a key frame image and a spatial point cloud corresponding to a non-key frame image.
As described in relation to step S220, each frame of image contains view images of a plurality of different views. That is, a view image including a plurality of different views regardless of whether the key frame image or the non-key frame image.
The information carried by each point in the projection point cloud comprises two-dimensional coordinates, three-channel gray values, three-primary color information of a color image and the like. The two-dimensional coordinates in the information carried by each point in the projected point cloud are two-dimensional coordinates corresponding to the image containing each view angle. For example, a frame of image includes 6 view images with different views, and a three-dimensional space point corresponding to a pixel point of the frame of image corresponds to 6 two-dimensional coordinates.
S740, acquiring color information of each key frame point cloud, and acquiring color information of each non-key frame point cloud to generate a laser color point cloud, wherein the laser color point cloud comprises a key frame color point cloud and a non-key frame color point cloud.
Please refer to the related description in step S220. When the color information of the points in the point cloud of the key frame is obtained, the points in the point cloud of the key frame are corresponding to the pixels of a synchronous key frame image (one frame image comprises images of a plurality of visual angles), and then the color information of the points in the point cloud is determined according to the color information of the pixels. When color information of points in a non-key frame point cloud is obtained, after points in the non-key frame point cloud are corresponding to the pixel points of a synchronous non-key frame image (one frame image comprises images of a plurality of visual angles), the color information of the points in the point cloud is determined according to the color information of the pixel points.
S750, acquiring aggregation of the multi-time-sequence characteristic aerial view according to the high-density point cloud formed by the projection point cloud and the laser color point cloud; wherein the characteristic aerial view has planar image information and three-dimensional spatial distribution information.
The projected point cloud includes a spatial point cloud projected by a key frame image and a spatial point cloud projected by a non-key frame image. The laser color point cloud includes a key frame laser color point cloud and a non-key frame laser color point cloud.
The description of step S750 may refer to the description of step S240, which is not repeated here. After the key frames and the non-key frames are distinguished, the generation process of the high-density point cloud and the aggregation generation process of the multi-time-series characteristic aerial view are not affected. The process of generating the high-density point cloud and the process of generating the aggregate of the multi-temporal feature aerial view are related to each point in the projected point cloud and each point in the laser color point cloud.
It should be noted that when the matching pixels are obtained, the edge-texture-enhanced image features are used, which makes the matching pixels easier to obtain. This embodiment can learn the relations among pixels within each view image based on a self-attention mechanism, and the relations among pixels of different view images based on a cross-attention mechanism.
The specific method is as follows. First, the edge-enhanced multi-layer feature maps of the two images to be matched are input; the coarse-grained features, with side length 1/8 of the original image size, and the fine-grained features, with side length 1/2 of the original image size, are recorded for each image. Then, based on a deformable attention mechanism or a Transformer, the relevant information between different tiles is learned, and the two coarse-grained feature maps are converted into feature maps that are easy to match. The matching of these two transformed feature maps can be divided into the following steps:
Score computation: a score matrix representing the degree of correlation between the two feature maps is calculated from the inner products of their transformed features, and the matching probability is then obtained by applying softmax normalization to the score matrix.
Matching selection, first set a threshold value theta c . Probability of matchingGreater than theta c Is selected as a candidate matching pixel pair. Then, according to the mutual nearest neighbor strategy (mutual nearest neighbor, MNN), the candidate matched pixel pairs are further filtered and screened, and abnormal rough matched pixel pairs are filtered. The coarse-grained matching of images is expressed as: />
Because the Transformer-based matching operates between tiles, it is not fine enough; to achieve fine-grained matching, the result is refined here on the basis of the coarse-grained matches.
Fine-grained matching refines the coarse-grained matching results, i.e. it turns tile-level matches into pixel-level matches, based on correlation. The specific steps are as follows:
For each coarse matching point, its position is first located on the fine-grained feature maps of the two images, and two local windows of size w × w are cropped around it.
A matching feature extraction module based on the attention mechanism is applied to these smaller windows; the features cropped within each window are transformed N_f times, generating two transformed local feature maps.
The two windows are centered on the matched positions in the two images respectively. The center vector of the first local feature map is then correlated with all vectors of the second local feature map to generate a heat map, which expresses the matching probability of every pixel in the w × w neighbourhood centered on the matched pixel. By computing the expectation of this probability distribution, the final sub-pixel position on the second image is obtained; recording all such matching results generates the final fine-level matches.
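A compact PyTorch sketch of the coarse matching stage follows; the dual-softmax matching probability, the temperature, and the threshold/MNN filtering are assumed formulations of the score-matrix step described above, with illustrative names.

```python
import torch

def coarse_match(feat_a: torch.Tensor, feat_b: torch.Tensor,
                 theta_c: float = 0.2, temperature: float = 0.1) -> torch.Tensor:
    """feat_a: (Na, C), feat_b: (Nb, C) flattened 1/8-resolution tile features.
    Returns the (i, j) index pairs of the selected coarse matches."""
    score = feat_a @ feat_b.t() / temperature                         # correlation score matrix
    prob = torch.softmax(score, dim=0) * torch.softmax(score, dim=1)  # matching probability
    mask = prob > theta_c                                             # candidate pairs
    mask &= prob == prob.max(dim=1, keepdim=True).values              # mutual nearest neighbour
    mask &= prob == prob.max(dim=0, keepdim=True).values
    return mask.nonzero(as_tuple=False)
```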
S760, determining and outputting detection information according to the aggregation of the multi-time sequence characteristic aerial views; wherein the detection information characterizes object category and object position information in the target scene within the current time period.
The description of step S760 may refer to the description of step S250, which is not repeated here.
In summary, compared with the target detection method provided in the previous embodiment, the target detection method provided in this embodiment further improves the detection effect. Specifically, based on the similarity between images of adjacent frames, the frames whose content changes greatly with respect to the adjacent frame (i.e. whose similarity is small) are classified as image key frames, and the remaining images, whose similarity to the image key frames is high and whose content changes little, are classified as image non-key frames. The point cloud data synchronized with the image key frames is classified as point cloud key frames, and the rest as point cloud non-key frames. Feature extraction is mainly applied to the image key frames and the point cloud key frames, while the features of the non-key frames are deduced from the features of the key frames through the 2D optical flow or 3D scene flow between the key frames and the non-key frames, thereby improving computational efficiency. The features of a key frame can be obtained by weighting the features of the current key frame and the previous key frame, which realizes a feature memory function, effectively avoids the blurring caused by overly fast target motion, and alleviates the occlusion problem to a certain extent. In addition, wavelet transformation is introduced to reduce the noise of the image and to extract the edges and textures of the noise-reduced image; its advantage is that it can perform image noise reduction at different scales and extract edges and textures at different scales. Furthermore, the edge texture enhancement of the feature maps of each scale benefits the subsequent matching of pixels across different view angles, because the features of pixel points with sharp gray-level changes are strengthened, which is equivalent to enhancing the reference point information of the different images and facilitates the alignment of matched pixel points across view angles. In addition, the image information entropy is introduced to characterize the gray-distribution complexity of local image blocks and to guide the optimal segmentation of the image, and the data segmentation and data arrangement in the multi-scale backbone network Res2Net are improved accordingly. A point cloud feature extraction method based on the combination of PCA and PointNet and a point cloud feature aggregation method based on a deformable attention mechanism are also provided to improve the accuracy of target detection. The aggregation of point cloud features is realized with a deformable attention mechanism, that is, the K feature points most relevant to the query feature point are dynamically selected and aggregated with a multi-head attention mechanism; because the selected feature points are the ones most relevant to the query feature point, the aggregated features better represent the point cloud, which in turn improves the accuracy of target detection.
In addition, the method and the device realize the aggregation of information of different modalities and different time sequences, and perform target recognition and map segmentation based on the aggregated-information aerial view, thereby realizing the complementary advantages of different sensors, effectively overcoming the shortcomings of a single sensor, and improving the reliability of the target detection system.
Optionally, this embodiment may train the models, programs, and the like used in the target detection method provided in any one of the foregoing embodiments on a data set, so as to improve the practical application effect of the target detection method. The data set may be the nuScenes dataset, a large-scale dataset in the autonomous driving domain containing 1000 complex driving scenarios, each a manually selected target scene with a duration of 20 seconds. The entire dataset includes approximately 1.4 million camera images, 390,000 lidar scanning point clouds, 1.4 million radar scanning results, and 1.4 million target bounding boxes over 40,000 keyframes. Training the models, programs, and the like used in the target detection method on such a complex dataset helps to improve the practical performance of the target detection method in complex environments (urban areas with tens of objects per scene).
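As an illustration, the nuScenes data described above can be traversed with the publicly available nuscenes-devkit as sketched below; the dataroot path and the choice of the mini split are placeholders, and the sketch only shows how to reach the synchronized image/point cloud pairs consumed by the method.

```python
# A minimal sketch of iterating over the synchronised camera images and lidar
# sweeps of one nuScenes scene using the public nuscenes-devkit.
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version="v1.0-mini", dataroot="/data/sets/nuscenes", verbose=True)

scene = nusc.scene[0]                       # one 20 s driving scene
sample_token = scene["first_sample_token"]
while sample_token:
    sample = nusc.get("sample", sample_token)
    cam = nusc.get("sample_data", sample["data"]["CAM_FRONT"])
    lidar = nusc.get("sample_data", sample["data"]["LIDAR_TOP"])
    # cam["filename"] / lidar["filename"] give the image and point-cloud files
    # of one synchronised image/point-cloud pair used by the method above.
    sample_token = sample["next"]
```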
Referring to fig. 14, one embodiment of the present application provides an object detection device 10, comprising:
the acquiring module 11 is configured to acquire a multi-frame image and a multi-frame point cloud of the target scene, where one frame of image corresponds to one frame of point cloud acquired synchronously.
The point cloud generating module 12 is configured to project pixels of each frame of image into a three-dimensional space, form a projected point cloud with the space points projected by the pixels of each frame of image, and obtain color information of points in each frame of point cloud to generate a laser color point cloud.
The acquiring module 11 is further configured to acquire aggregation of the characteristic aerial view of multiple time sequences according to the high-density point cloud composed of the projection point cloud and the laser color point cloud; wherein the characteristic aerial view has planar image information and three-dimensional spatial distribution information.
A detection module 13, configured to determine and output detection information according to aggregation of the multi-time sequence characteristic aerial views; wherein the detection information characterizes object category and object position information in the target scene within the current time period.
Optionally, the obtaining module 11 is specifically configured to: acquire the information vector of each point in the high-density point cloud, wherein the information vector comprises two-dimensional coordinates, three-channel gray values, color features and features of the high-density point cloud, the features of the high-density point cloud comprise features of structural information of the high-density point cloud, and the two-dimensional coordinates comprise the two-dimensional coordinates of a point in the view angle images of a plurality of different view angles contained in one frame of image; after performing cylindrical voxel segmentation on the high-density point cloud, perform voxel encoding on the segmented spatial point cloud based on the information vector of each point in the segmented voxel spatial point cloud to obtain the voxelized point cloud; perform asymmetric convolution on the voxelized point cloud to aggregate the information vectors of different points and obtain the aggregated information point cloud; compress the aggregated information point cloud along the height direction to obtain a multi-time-sequence characteristic aerial view; and process the multi-time-sequence characteristic aerial views with the image feature aggregation method based on the deformable attention mechanism to obtain the aggregation of the multi-time-sequence characteristic aerial views.
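The following Python sketch illustrates, under simplifying assumptions, the cylindrical voxel partition and the height-direction compression into a bird's-eye-view feature map performed by the obtaining module 11; the grid resolutions, the scatter-mean voxel encoding, and the tensor names are illustrative, and the asymmetric convolution and deformable-attention aggregation steps are omitted.

```python
# A minimal sketch: cylindrical voxelisation of per-point information vectors
# followed by compression along the height direction into a BEV feature map.
import torch

def cylindrical_bev(xyz: torch.Tensor, feats: torch.Tensor,
                    n_rho=128, n_phi=180, n_z=16, rho_max=50.0,
                    z_min=-3.0, z_max=3.0):
    # xyz: (N, 3) float32 point coordinates, feats: (N, C) information vectors.
    rho = torch.linalg.norm(xyz[:, :2], dim=1).clamp(max=rho_max - 1e-3)
    phi = torch.atan2(xyz[:, 1], xyz[:, 0])                      # [-pi, pi]
    z = xyz[:, 2].clamp(z_min, z_max - 1e-3)

    i_rho = (rho / rho_max * n_rho).long()
    i_phi = ((phi + torch.pi) / (2 * torch.pi) * n_phi).long().clamp(max=n_phi - 1)
    i_z = ((z - z_min) / (z_max - z_min) * n_z).long()

    # Scatter-mean the point feature vectors into cylindrical voxels.
    C = feats.shape[1]
    grid = torch.zeros(n_rho, n_phi, n_z, C)
    count = torch.zeros(n_rho, n_phi, n_z, 1)
    grid.index_put_((i_rho, i_phi, i_z), feats, accumulate=True)
    count.index_put_((i_rho, i_phi, i_z), torch.ones(len(xyz), 1), accumulate=True)
    voxels = grid / count.clamp(min=1)

    # Compress along the height (z) direction to obtain a BEV feature map.
    bev = voxels.reshape(n_rho, n_phi, n_z * C).permute(2, 0, 1)  # (n_z*C, n_rho, n_phi)
    return bev
```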
Optionally, the point cloud generating module 12 is specifically configured to form an initial high-density point cloud according to the projection point cloud and the laser color point cloud; acquiring depth information of space points projected by matched pixel points of each frame of image in the initial high-density point cloud to acquire color point cloud in the projected point cloud, wherein one frame of image comprises view angle images with different view angles, the matched pixel points are pixel points in one view angle image, the matched pixel points and one pixel point in the other view angle image are provided with the same three-dimensional projected space point, and the depth information of the matched pixel points and the depth information of the one pixel point are both the depth information of the three-dimensional projected space point; forming a dense point cloud according to the color point cloud and the laser color point cloud; estimating depth information of the space points projected by the non-matching pixels in the initial high-density point cloud based on the dense point cloud, and obtaining the depth information of the space points projected by the non-matching pixels in the initial high-density point cloud; and correcting the depth information of the space points projected by the non-matching pixels in the initial high-density point cloud based on the dense point cloud to obtain the high-density point cloud.
Optionally, the obtaining module 11 is specifically configured to match pixels in view images of different views for view images of different views included in each frame of image, to obtain each pair of matched pixels of each two view images, where a pair of matched pixels includes two matched pixels; acquiring the same three-dimensional projection space point corresponding to each pair of matched pixel points in each frame of image, and acquiring coordinates of the three-dimensional projection space point in a coordinate system of each image acquisition device and coordinates of the three-dimensional projection space point in a world coordinate system; and determining depth information of the three-dimensional projection space point under each view angle according to the coordinates of the three-dimensional projection space point under the coordinate system of each image acquisition device and the coordinates of the three-dimensional projection space point under the world coordinate system, and taking the depth information of the three-dimensional projection space point under each view angle as the depth information of the matched pixel point in the corresponding view angle image.
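The depth recovery for a pair of matched pixel points can be sketched as below; cv2.triangulatePoints merely stands in for whatever solver the embodiment actually uses, and P1, P2 (the 3x4 projection matrices of the two image acquisition devices) together with the 4x4 world-to-camera transforms are assumed inputs.

```python
# A minimal sketch: triangulate the shared 3-D point of one matched pixel pair
# and read off its depth (z coordinate) in each camera coordinate system.
import cv2
import numpy as np

def depth_of_match(p1, p2, P1, P2, T1_world_to_cam, T2_world_to_cam):
    # p1, p2: matched pixel (u, v) in the two view images.
    X_h = cv2.triangulatePoints(P1, P2,
                                np.float64(p1).reshape(2, 1),
                                np.float64(p2).reshape(2, 1))
    X_world = (X_h[:3] / X_h[3]).ravel()            # 3-D point in world frame

    # Depth under each view = z coordinate of the point in that camera frame.
    X_hom = np.append(X_world, 1.0)
    depth1 = (T1_world_to_cam @ X_hom)[2]
    depth2 = (T2_world_to_cam @ X_hom)[2]
    return X_world, depth1, depth2
```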
Optionally, the object detection device 10 further includes a screening module 14, configured to screen keyframe images and non-keyframe images in the multi-frame image; and acquiring a keyframe point cloud corresponding to the time sequence of the keyframe image, and acquiring a non-keyframe point cloud corresponding to the time sequence of the non-keyframe image. The point cloud generating module 12 is specifically configured to project the pixel point of each key frame image into a three-dimensional space, and project the pixel point of each non-key frame image into the three-dimensional space, where the projected space point of the pixel point of each key frame image and the projected space point of the pixel point of each non-key frame image form a projected point cloud; the method comprises the steps of obtaining color information of each key frame point cloud and obtaining color information of each non-key frame point cloud to generate a laser color point cloud, wherein the laser color point cloud comprises the key frame color point cloud and the non-key frame color point cloud.
Optionally, the obtaining module 11 is further configured to obtain a noise-reduced image of the key frame image and edge texture images of different scales, input the noise-reduced image into an image feature extraction network to obtain feature maps of different scales of the noise-reduced image, and superimpose the edge texture images of different scales on the feature maps of the corresponding scales to obtain feature maps of each scale subjected to edge texture enhancement processing; and deduce the features of the non-key frame image according to the optical flow between the key frame image and the non-key frame image. For the key frame point cloud, key frame point cloud features are extracted, and deformable attention feature aggregation is performed on the key frame point cloud features to obtain aggregated key frame point cloud features; for the non-key frame point cloud, non-key frame point cloud features are deduced according to the scene flow between the key frame point cloud and the non-key frame point cloud. The features of the aggregate feature point cloud of the high-density point cloud include: the non-key frame image features, the non-key frame point cloud features, the key frame image features subjected to edge texture enhancement processing, and the aggregated key frame point cloud features.
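A minimal sketch of deducing non-key-frame image features from key-frame image features via 2D optical flow is given below: the key-frame feature map is warped along the flow with bilinear sampling. The flow estimator itself is assumed to be any off-the-shelf method and is not part of this sketch.

```python
# A minimal sketch: warp a key-frame feature map along the optical flow to
# obtain the feature map of a non-key frame.
import torch
import torch.nn.functional as F

def propagate_features(key_feat: torch.Tensor, flow: torch.Tensor):
    # key_feat: (1, C, H, W) key-frame feature map
    # flow:     (1, 2, H, W) displacement from non-key-frame pixels back to
    #           their positions in the key frame (in pixels).
    _, _, H, W = key_feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()            # (H, W, 2)
    src = grid + flow[0].permute(1, 2, 0)                   # sample positions

    # Normalise to [-1, 1] for grid_sample.
    src[..., 0] = 2 * src[..., 0] / (W - 1) - 1
    src[..., 1] = 2 * src[..., 1] / (H - 1) - 1
    return F.grid_sample(key_feat, src.unsqueeze(0), align_corners=True)
```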
Optionally, the acquiring module 11 is specifically configured to perform wavelet decomposition on the key frame image to acquire a plurality of different wavelet components; setting different thresholds for different wavelet component graphs, and carrying out threshold filtering on each wavelet component according to the threshold corresponding to each wavelet component so as to obtain a plurality of wavelet components after noise reduction; and recombining the noise-reduced image according to the noise-reduced plurality of wavelet components, and recombining the edge texture images with different scales according to the noise-reduced plurality of wavelet components.
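The wavelet decomposition, per-component threshold filtering and recombination of the noise-reduced image can be sketched with PyWavelets as follows; the wavelet family, decomposition level and threshold rule are illustrative assumptions rather than choices fixed by this embodiment.

```python
# A minimal sketch: wavelet decomposition, per-level threshold filtering of the
# detail components, and recombination of the noise-reduced image.
import numpy as np
import pywt

def wavelet_denoise(gray: np.ndarray, wavelet="db4", levels=3):
    coeffs = pywt.wavedec2(gray, wavelet, level=levels)
    approx, details = coeffs[0], coeffs[1:]

    denoised_details = []
    for (cH, cV, cD) in details:
        # A different (here: level-dependent, universal-style) threshold is
        # applied to each group of wavelet components.
        thr = np.median(np.abs(cD)) / 0.6745 * np.sqrt(2 * np.log(cD.size))
        denoised_details.append(tuple(pywt.threshold(c, thr, mode="soft")
                                      for c in (cH, cV, cD)))

    # Recombine the noise-reduced image from the filtered components.
    return pywt.waverec2([approx] + denoised_details, wavelet)
```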
Optionally, the acquiring module 11 is specifically configured to reconstruct initial edge texture images with different scales according to the wavelet components after noise reduction; and performing erosion operation and expansion operation on the initial edge texture image of each scale to obtain edge texture images of different scales.
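The erosion and expansion (dilation) operations on a reconstructed edge texture image can be sketched as follows; the 3×3 structuring element is an assumed choice.

```python
# A minimal sketch: clean one edge-texture map with erosion followed by
# dilation (a morphological opening).
import cv2
import numpy as np

kernel = np.ones((3, 3), np.uint8)

def clean_edge_texture(edge_map: np.ndarray) -> np.ndarray:
    eroded = cv2.erode(edge_map, kernel, iterations=1)
    return cv2.dilate(eroded, kernel, iterations=1)
```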
Optionally, the object detection device 10 further includes an image processing module 15, configured to convert the noise reduction image into a gray image, and divide the gray image into tiles with the same size, to obtain an information entropy value of each tile; dividing a gray image according to the information entropy value of each image block, and obtaining a plurality of image areas obtained by dividing; and arranging each image area to a corresponding position in the feature extraction network according to the information entropy value of each image area, wherein the image areas with the information entropy value larger than the preset information entropy value are arranged at positions which can pass through the first number of convolution layers, and the image areas with the information entropy value smaller than or equal to the preset information entropy value are arranged at positions which can pass through the second number of convolution layers, and the first number is larger than the second number.
Optionally, the image processing module 15 is specifically configured to: counting the number proportion of each group of pixels with the same information entropy value according to the information entropy value of the pixels in each block aiming at each block so as to obtain an information entropy distribution statistical histogram of the block; after Gaussian smoothing filtering processing is carried out on the information entropy distribution statistical histogram, peak points of the information entropy distribution statistical histogram are identified; the number of peak points is taken as the clustering number, the information entropy corresponding to the peak points is taken as the clustering center, fuzzy C-means clustering is carried out on each pixel point in the information entropy distribution statistical histogram, and the information entropy value of the pixel points belonging to the same kind is updated to the information entropy value of the clustering center to which the pixel points of the same kind belong; after the information entropy value is updated, dividing the gray image according to the updating result of the information entropy value, and obtaining a plurality of image areas obtained by dividing.
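The following Python sketch illustrates the entropy-guided segmentation steps described above: per-pixel local information entropy, its statistical histogram, Gaussian smoothing and peak detection. The window size, bin count and smoothing sigma are illustrative assumptions, and the subsequent fuzzy C-means clustering step is omitted.

```python
# A minimal sketch: local information entropy, entropy distribution histogram,
# Gaussian smoothing, and peak detection (peaks -> cluster count and centres).
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

def local_entropy(gray: np.ndarray, win: int = 9) -> np.ndarray:
    # gray: uint8 grayscale image.
    H, W = gray.shape
    ent = np.zeros((H, W))
    pad = win // 2
    padded = np.pad(gray, pad, mode="reflect")
    for y in range(H):
        for x in range(W):
            patch = padded[y:y + win, x:x + win]
            hist = np.bincount(patch.ravel(), minlength=256) / patch.size
            p = hist[hist > 0]
            ent[y, x] = -(p * np.log2(p)).sum()
    return ent

def entropy_peaks(ent: np.ndarray, bins: int = 64, sigma: float = 2.0):
    hist, edges = np.histogram(ent.ravel(), bins=bins, density=True)
    smooth = gaussian_filter1d(hist, sigma)
    peaks, _ = find_peaks(smooth)
    # Peak count -> number of clusters; peak entropies -> initial centres
    # for the fuzzy C-means clustering described above.
    centres = (edges[peaks] + edges[peaks + 1]) / 2
    return len(peaks), centres
```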
Optionally, each frame of image comprises a plurality of view images of different views. The point cloud generating module 12 is specifically configured to project each point in each frame of point cloud on each view angle image of the corresponding frame image; when a point is projected to an integer pixel point, acquiring color information of the projected integer pixel point as color information of the point; when the point is projected to the non-integer pixel point, estimating the color information of the non-integer pixel point projected to the point, and taking the estimated color information as the color information of the point.
Optionally, the point cloud generating module 12 is specifically configured to obtain four pixel points nearest to the projected non-integer pixel point, and determine a weight coefficient of the four pixel points according to a distance and a color gray scale distance between each of the four pixel points and the projected non-integer pixel point; substituting the weight coefficients of the four pixel points and the color information of the four pixel points into a Gaussian function to obtain an estimation result of the projected color information of the non-integer pixel points.
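The estimation of the color information at a non-integer projection point can be sketched as follows, with the weight of each of the four nearest integer pixels computed from its spatial distance and its color/gray-level distance through a Gaussian function; sigma_s and sigma_c are assumed bandwidths, since the embodiment does not fix them here.

```python
# A minimal sketch: Gaussian-weighted colour estimation at a non-integer
# projection (u, v) from its four nearest integer pixels.
import numpy as np

def estimate_color(image: np.ndarray, u: float, v: float,
                   sigma_s: float = 1.0, sigma_c: float = 25.0) -> np.ndarray:
    x0, y0 = int(np.floor(u)), int(np.floor(v))
    neighbours = [(x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)]

    # Reference grey level: colour of the nearest integer pixel.
    ref = image[y0, x0].astype(np.float64)

    weights, colors = [], []
    for (x, y) in neighbours:
        c = image[y, x].astype(np.float64)
        d_s = np.hypot(x - u, y - v)                 # spatial distance
        d_c = np.linalg.norm(c - ref)                # colour/grey-level distance
        w = np.exp(-(d_s ** 2) / (2 * sigma_s ** 2)
                   - (d_c ** 2) / (2 * sigma_c ** 2))
        weights.append(w)
        colors.append(c)

    weights = np.array(weights)
    return (weights[:, None] * np.array(colors)).sum(0) / weights.sum()
```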
Optionally, the object detection device 10 further includes a filtering module 16, configured to filter out the ground point cloud in each frame of point clouds.
Referring to fig. 15, an embodiment of the present application further provides an electronic device 20, including: a processor 21, and a memory 22 communicatively coupled to the processor 21. The memory 22 stores computer-executable instructions and the processor 21 executes the computer-executable instructions stored in the memory 22 to implement the object detection method as provided in any of the embodiments above.
The present application also provides a computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the target detection method provided in any one of the embodiments above.
The present application also provides a computer program product comprising a computer program for implementing the object detection method provided in any of the embodiments above when being executed by a processor.
The computer-readable storage medium may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); it may also be included in various electronic devices, such as mobile phones, computers, tablet devices, and personal digital assistants, that comprise one of the above memories or any combination thereof.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present application are merely for description and do not represent the superiority or inferiority of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application, or the part thereof contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) containing several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (10)

1. A method of target detection, the method comprising:
acquiring a multi-frame image and a multi-frame point cloud of a target scene, wherein one frame of image corresponds to one frame of point cloud acquired synchronously;
projecting pixel points of each frame of image into a three-dimensional space, forming a projection point cloud by using the space points projected by the pixel points of each frame of image, and acquiring color information of points in each frame of point cloud to generate a laser color point cloud;
acquiring aggregation of a multi-time-sequence characteristic aerial view according to the high-density point cloud formed by the projection point cloud and the laser color point cloud; wherein the characteristic aerial view has planar image information and three-dimensional spatial distribution information;
determining and outputting detection information according to the aggregation of the characteristic aerial views of the multiple time sequences; the detection information characterizes object types and object position information in a target scene in the current time period.
2. The method of claim 1, wherein the acquiring the aggregate of the multi-temporal feature aerial views from the high-density point cloud comprising the projected point cloud and the laser color point cloud comprises:
acquiring information vectors of each point in the high-density point cloud; wherein the information vector comprises two-dimensional coordinates, three-channel gray values, color features and features of the high-density point cloud, wherein the features of the high-density point cloud comprise features of structural information of the high-density point cloud, and the two-dimensional coordinates comprise two-dimensional coordinates of points in view angle images of a plurality of different view angles contained in one frame of image;
After cylindrical voxel segmentation is carried out on the high-density point cloud, carrying out voxel coding on the segmented spatial point cloud based on the information vector of each point in the segmented voxel spatial point cloud, and obtaining the voxel point cloud;
carrying out asymmetric convolution on the voxelized point cloud to aggregate information vectors of different points, and obtaining the aggregated information point cloud;
compressing the aggregated information point cloud along the height direction to obtain a multi-time-sequence characteristic aerial view;
the image feature aggregation method based on the deformable attention mechanism is used for processing the multi-time-sequence feature aerial views and obtaining the aggregation of the multi-time-sequence feature aerial views.
3. The method according to claim 1 or 2, wherein said composing a high density point cloud from said projected point cloud and said laser color point cloud comprises:
forming an initial high-density point cloud according to the projection point cloud and the laser color point cloud;
acquiring depth information of space points projected by matched pixel points of each frame of image in the initial high-density point cloud to acquire color point clouds in the projected point clouds, wherein one frame of image comprises view angle images with different view angles, the matched pixel points are pixel points in one view angle image, the same three-dimensional projected space point is arranged between the matched pixel points and one pixel point in the other view angle image, and the depth information of the matched pixel points and the depth information of the one pixel point are the depth information of the three-dimensional projected space point;
Forming a dense point cloud according to the color point cloud and the laser color point cloud;
estimating depth information of the space points projected by the non-matching pixels in the initial high-density point cloud based on the dense point cloud, and obtaining the depth information of the space points projected by the non-matching pixels in the initial high-density point cloud;
correcting depth information of space points projected by non-matching pixels in the initial high-density point cloud based on the dense point cloud to obtain the high-density point cloud;
the step of obtaining depth information of the matched pixel points in each frame of image comprises the following steps:
matching pixel points in view angle images of different view angles aiming at view angle images of different view angles contained in each frame of image to obtain each pair of matched pixel points of every two view angle images, wherein one pair of matched pixel points contains two matched pixel points;
acquiring the same three-dimensional projection space point corresponding to each pair of matched pixel points in each frame of image, and acquiring coordinates of the three-dimensional projection space point in a coordinate system of each image acquisition device and coordinates of the three-dimensional projection space point in a world coordinate system;
and determining depth information of the three-dimensional projection space point under each view angle according to the coordinates of the three-dimensional projection space point under the coordinate system of each image acquisition device and the coordinates of the three-dimensional projection space point under the world coordinate system, and taking the depth information of the three-dimensional projection space point under each view angle as the depth information of the matched pixel point in the corresponding view angle image.
4. The method of claim 1, wherein after the acquiring the multi-frame image and the multi-frame point cloud of the target scene, the method further comprises:
screening out key frame images and non-key frame images in the multi-frame images; the key frame point cloud corresponding to the time sequence of the key frame image is obtained, and the non-key frame point cloud corresponding to the time sequence of the non-key frame image is obtained;
the projecting the pixel points of each frame of image into the three-dimensional space, and forming a projection point cloud by the space points projected by the pixel points of each frame of image comprises:
projecting the pixel point of each key frame image into a three-dimensional space, and projecting the pixel point of each non-key frame image into the three-dimensional space, wherein the projected point cloud is formed by the space point projected by the pixel point of each key frame image and the space point projected by the pixel point of each non-key frame image;
the obtaining the color information of the points in each frame of point cloud to generate the laser color point cloud comprises:
and acquiring color information of each key frame point cloud, and acquiring color information of each non-key frame point cloud to generate a laser color point cloud, wherein the laser color point cloud comprises a key frame color point cloud and a non-key frame color point cloud.
5. The method of claim 4, wherein after the screening out the key frame images and the non-key frame images in the multi-frame image, the method further comprises:
acquiring a noise reduction image of the key frame image and edge texture images with different scales, inputting the noise reduction image into an image feature extraction network to acquire each scale feature image of the noise reduction image, and overlapping the edge texture images with different scales on each scale feature image to acquire each scale feature image subjected to edge texture enhancement processing;
deducing non-key frame image characteristics according to the optical flow between the key frame image and the non-key frame image;
after the obtaining a keyframe point cloud corresponding to the time sequence of the keyframe image and obtaining a non-keyframe point cloud corresponding to the time sequence of the non-keyframe image, the method further comprises:
extracting key frame point cloud characteristics aiming at the key frame point cloud, and carrying out deformable attention characteristic aggregation on the key frame point cloud characteristics to obtain aggregated key frame point cloud characteristics;
for the non-key frame point cloud, deducing non-key frame point cloud characteristics according to scene flows between the key frame point cloud and the non-key frame point cloud;
the features of the aggregate feature point cloud of the high-density point cloud include: the non-key frame image features, the non-key frame point cloud features, the key frame image features subjected to edge texture enhancement processing, and the aggregated key frame point cloud features.
6. The method of claim 5, wherein the acquiring the noise reduced image and the edge texture image of the key frame image at different scales comprises:
performing wavelet decomposition on the key frame image to obtain a plurality of different wavelet components;
setting different thresholds for different wavelet component graphs, and carrying out threshold filtering on each wavelet component according to the threshold corresponding to each wavelet component so as to obtain a plurality of wavelet components after noise reduction;
recombining a noise reduction image according to the plurality of noise reduction wavelet components, and recombining edge texture images with different scales according to the plurality of noise reduction wavelet components;
the step of recombining the edge texture images with different scales according to the plurality of wavelet components after noise reduction comprises the following steps:
recombining initial edge texture images with different scales according to the wavelet components after noise reduction;
and performing erosion operation and expansion operation on the initial edge texture image of each scale to obtain edge texture images of different scales.
7. The method of claim 5, wherein before inputting the noise reduced image to an image feature extraction network to obtain each scale feature map of the noise reduced image, the method further comprises:
converting the noise reduction image into a gray image, dividing the gray image into tiles with the same size, and obtaining the information entropy value of each tile;
dividing a gray image according to the information entropy value of each image block, and obtaining a plurality of image areas obtained by dividing;
according to the information entropy value of each image area, arranging each image area to a corresponding position in the feature extraction network, wherein the image areas with the information entropy value larger than the preset information entropy value are arranged at positions capable of passing through a first number of convolution layers, and the image areas with the information entropy value smaller than or equal to the preset information entropy value are arranged at positions capable of passing through a second number of convolution layers, and the first number is larger than the second number;
the dividing the gray image according to the information entropy value of each image block to obtain a plurality of divided image areas comprises:
counting the number proportion of each group of pixels with the same information entropy value according to the information entropy value of the pixels in each block aiming at each block so as to obtain an information entropy distribution statistical histogram of the block;
After Gaussian smoothing filtering processing is carried out on the information entropy distribution statistical histogram, peak points of the information entropy distribution statistical histogram are identified;
the number of peak points is taken as the clustering number, the information entropy corresponding to the peak points is taken as the clustering center, fuzzy C-means clustering is carried out on each pixel point in the information entropy distribution statistical histogram, and the information entropy value of the pixel points belonging to the same kind is updated to the information entropy value of the clustering center to which the pixel points of the same kind belong;
after the information entropy value is updated, dividing the gray image according to the updating result of the information entropy value, and obtaining a plurality of image areas obtained by dividing.
8. The method of claim 1, wherein each frame of image comprises view images of a plurality of different views, and wherein the obtaining color information for points in each frame of point cloud to generate the laser color point cloud comprises:
projecting each point in each frame of point cloud on each view angle image of the corresponding frame image;
when a point is projected to an integer pixel point, acquiring color information of the projected integer pixel point as color information of the point;
when the point is projected to the non-integer pixel point, estimating the color information of the non-integer pixel point projected to the point, and taking the estimated color information as the color information of the point;
the estimating the color information of the non-integer pixel point to which the point is projected comprises:
acquiring four pixel points nearest to the projected non-integer pixel point, and determining weight coefficients of the four pixel points according to the distance between each pixel point in the four pixel points and the projected non-integer pixel point and the color gray scale distance;
substituting the weight coefficients of the four pixel points and the color information of the four pixel points into a Gaussian function to obtain an estimation result of the projected color information of the non-integer pixel points.
9. An object detection apparatus, comprising:
the acquisition module is used for acquiring a multi-frame image and a multi-frame point cloud of a target scene, wherein one frame of image corresponds to one frame of point cloud which is synchronously acquired;
the point cloud generation module is used for projecting pixel points of each frame of image into a three-dimensional space, forming projection point clouds by using the space points projected by the pixel points of each frame of image, and acquiring color information of points in each frame of point clouds to generate laser color point clouds;
the acquisition module is further used for acquiring aggregation of the multi-time-sequence characteristic aerial view according to the high-density point cloud formed by the projection point cloud and the laser color point cloud; wherein the characteristic aerial view has planar image information and three-dimensional spatial distribution information;
The detection module is used for determining and outputting detection information according to the aggregation of the characteristic aerial views of the multiple time sequences; the detection information characterizes object types and object position information in a target scene in the current time period.
10. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the method of any one of claims 1 to 8.