CN114926485B - Image depth labeling method, device, equipment and storage medium - Google Patents

Image depth labeling method, device, equipment and storage medium

Info

Publication number
CN114926485B
Authority
CN
China
Prior art keywords
point cloud
point
dynamic
clouds
obstacle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210161561.4A
Other languages
Chinese (zh)
Other versions
CN114926485A (en)
Inventor
韩文韬
鲁赵晗
韩旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Weride Technology Co Ltd
Original Assignee
Guangzhou Weride Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Weride Technology Co Ltd filed Critical Guangzhou Weride Technology Co Ltd
Priority to CN202210161561.4A priority Critical patent/CN114926485B/en
Publication of CN114926485A publication Critical patent/CN114926485A/en
Application granted granted Critical
Publication of CN114926485B publication Critical patent/CN114926485B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/12Edge-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the field of artificial intelligence and discloses an image depth labeling method, device, equipment and storage medium. The method comprises the following steps: performing point cloud separation on a point cloud sequence corresponding to multi-frame camera images to obtain corresponding dynamic point clouds and static point clouds; transforming all the static point clouds to a world coordinate system and processing them to obtain a corresponding background point cloud; performing interpolation on the dynamic point cloud, and superimposing the interpolated dynamic point cloud with the background point cloud to obtain a dense point cloud for the corresponding frame; and projecting the dense point cloud onto the camera plane of the camera image of the corresponding frame to obtain a depth label corresponding to the camera image. To address the difficulty of acquiring image depth, the technical scheme of the invention separates and superimposes the point clouds of multiple frames so that sparse point clouds become dense point clouds, and the depth data in the converted depth image are therefore dense.

Description

Image depth labeling method, device, equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method, apparatus, device, and storage medium for image depth labeling.
Background
In order to complete visual tasks such as obstacle detection and segmentation on an image and convert the results to a world coordinate system for a downstream decision module, especially in systems without other 3D measurement sensors, depth estimation at pixel granularity is generally required on the image. Existing depth estimation schemes mainly comprise monocular depth estimation based on deep learning and depth estimation based on a binocular camera system, and monocular depth estimation based on deep learning can be realized in various ways. A ranging method suitable for autonomous driving scenarios relies on a laser radar (lidar): the 3D measurement results are converted to the image plane according to the calibration between the lidar and the camera to obtain depth. However, limited by the resolution of the lidar, the depth obtained in this way is generally sparse and can hardly reach the quality expected by a deep learning algorithm.
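For illustration, the lidar-based ranging approach described above can be sketched as follows: a single lidar sweep is projected onto the image plane through the camera intrinsics and the lidar-to-camera calibration, and the resulting depth map is sparse because only a small fraction of pixels receive a lidar return. This is a minimal sketch assuming a pinhole camera model; the matrices K and T_cam_from_lidar and all names are illustrative, not taken from the disclosure.

```python
import numpy as np

def project_lidar_to_depth(points_lidar, K, T_cam_from_lidar, image_shape):
    """Project (N, 3) lidar points into a sparse per-pixel depth map."""
    h, w = image_shape
    # Transform points into the camera frame (homogeneous coordinates).
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0]          # keep points in front of the camera
    # Pinhole projection with intrinsics K.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    depth = np.full((h, w), np.nan)               # NaN marks pixels with no measurement
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth[v[valid], u[valid]] = pts_cam[valid, 2]
    return depth  # a single sweep covers only a small fraction of the pixels
```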
Disclosure of Invention
The invention mainly aims to solve the technical problem that the depth data in depth images obtained by existing monocular depth estimation are too sparse.
The first aspect of the invention provides an image depth labeling method, which comprises the following steps: performing point cloud separation on a point cloud sequence corresponding to multi-frame camera images to obtain corresponding dynamic point clouds and static point clouds; transforming all the static point clouds to a world coordinate system and processing them to obtain a corresponding background point cloud; performing interpolation on the dynamic point cloud, and superimposing the interpolated dynamic point cloud with the background point cloud to obtain a dense point cloud of the corresponding frame; and projecting the dense point cloud onto the camera plane of the camera image of the corresponding frame to obtain a depth label corresponding to the camera image.
In this embodiment, in a first implementation manner of the first aspect of the present invention, performing point cloud separation on a point cloud sequence corresponding to a multi-frame camera image to obtain a corresponding dynamic point cloud and a static point cloud includes: inputting the point cloud sequence into a preset obstacle detection and segmentation model, detecting and segmenting an obstacle outline in a frame corresponding to the point cloud sequence through the obstacle detection and segmentation model, and carrying out semantic annotation on the obstacle outline to obtain a corresponding semantic type, wherein the semantic type comprises a dynamic obstacle; marking point clouds contained by obstacle outlines with semantic types of dynamic obstacle types in the point cloud sequence as dynamic point clouds; and marking the point clouds which are not contained by the obstacle outline with the semantic type being the dynamic obstacle type in the point cloud sequence as static point clouds.
In a second implementation manner of the first aspect of the present invention, the inputting the point cloud sequence into a preset obstacle detection and segmentation model, detecting and segmenting an obstacle contour in a frame corresponding to the point cloud sequence through the obstacle detection and segmentation model, and performing semantic annotation on the obstacle contour, where obtaining a corresponding semantic type includes: inputting the point cloud sequence into a preset obstacle detection and segmentation model, wherein the obstacle detection and segmentation model is divided into a feature extraction part and a semantic segmentation part; d-dimensional features of n point clouds in an input point cloud sequence are obtained as local features through a feature extraction part in the obstacle detection and segmentation model, the local features are classified and learned, and global features are obtained through maximum pooling processing; and splicing the local features and the global features through semantic segmentation parts in the obstacle detection and segmentation model, performing dimension reduction processing through multi-layer MLP, and finally predicting the semantic types of each point cloud of the point cloud sequence.
In a third implementation manner of the first aspect of the present invention, the transforming all the static point clouds to the world coordinate system and processing the same to obtain corresponding background point clouds includes: determining a corresponding pose conversion matrix according to the conversion relation between the laser radar coordinate system corresponding to the point cloud sequence and the world coordinate system; converting all the static point clouds to a world coordinate system according to the pose conversion matrix; and processing all the static point clouds transformed to the world coordinate system to obtain corresponding background point clouds.
In this embodiment, in a fourth implementation manner of the first aspect of the present invention, the processing all the static point clouds transformed to the world coordinate system to obtain the corresponding background point clouds includes: superposing all the static point clouds transformed to the world coordinate system to obtain superposed point clouds; calculating the neighborhood of each point cloud in the superimposed point cloud in a non-current frame through a kd-tree nearest neighbor algorithm; counting semantic types corresponding to all point clouds in the neighborhood, and determining the semantic type of a central point of the neighborhood according to the semantic type of each point cloud in the neighborhood; and filtering point clouds in the neighborhood, which are different from the semantic type of the central point, to obtain corresponding background point clouds.
In this embodiment, in a fifth implementation manner of the first aspect of the present invention, performing interpolation processing on the dynamic point cloud, and overlapping the dynamic point cloud after the interpolation processing with the background point cloud, to obtain a dense point cloud of a corresponding frame includes: constructing a three-dimensional plane of the dynamic obstacle corresponding to the dynamic point cloud; up-sampling the three-dimensional plane to obtain sampling points, and carrying out interpolation processing on the dynamic point cloud according to the sampling points; and superposing the dynamic point cloud processed by interpolation processing with the background point cloud to obtain dense point cloud of the corresponding frame.
In this embodiment, in a sixth implementation manner of the first aspect of the present invention, projecting the dense point cloud onto the camera plane of the camera image of the corresponding frame to obtain a depth label corresponding to the camera image includes: projecting the dense point cloud onto the camera plane of the camera image of the corresponding frame; calculating the occlusion relationship between the dynamic obstacles corresponding to the dynamic point cloud in the dense point cloud; assigning, according to the occlusion relationship, the depth values of the dynamic point cloud to the corresponding pixels of the projected camera plane; and assigning depth values according to the static point cloud to the pixels of the projected camera plane that have not yet been assigned depth values.
The second aspect of the present invention provides an image depth labeling apparatus, comprising: the point cloud separation module is used for performing point cloud separation on a point cloud sequence corresponding to multi-frame camera images to obtain corresponding dynamic point clouds and static point clouds; the static processing module is used for transforming all the static point clouds to a world coordinate system and processing them to obtain a corresponding background point cloud; the dynamic processing module is used for performing interpolation on the dynamic point cloud, and superimposing the interpolated dynamic point cloud with the background point cloud to obtain a dense point cloud of the corresponding frame; and the projection conversion module is used for projecting the dense point cloud onto the camera plane of the camera image of the corresponding frame to obtain a depth label corresponding to the camera image.
In this embodiment, in a first implementation manner of the second aspect of the present invention, the point cloud separation module specifically includes: the semantic segmentation unit is used for inputting the point cloud sequence into a preset obstacle detection and segmentation model, detecting and segmenting an obstacle outline in a frame corresponding to the point cloud sequence through the obstacle detection and segmentation model, and carrying out semantic annotation on the obstacle outline to obtain a corresponding semantic type, wherein the semantic type comprises a dynamic obstacle; the dynamic point cloud marking unit marks the point cloud contained by the obstacle outline with the semantic type being the dynamic obstacle type in the point cloud sequence as dynamic point cloud; and the static point cloud marking unit marks the point clouds which are not contained by the obstacle outline with the semantic type being the dynamic obstacle type in the point cloud sequence as static point clouds.
In this embodiment, in a second implementation manner of the second aspect of the present invention, the semantic segmentation unit is specifically configured to: inputting the point cloud sequence into a preset obstacle detection and segmentation model, wherein the obstacle detection and segmentation model is PointNet in a network structure and is divided into a feature extraction part and a semantic segmentation part; d-dimensional features of n point clouds in an input point cloud sequence are obtained as local features through a feature extraction part in the obstacle detection and segmentation model, the local features are classified and learned, and global features are obtained through maximum pooling processing; and splicing the local features and the global features through semantic segmentation parts in the obstacle detection and segmentation model, performing dimension reduction processing through multi-layer MLP, and finally predicting the semantic types of each point cloud of the point cloud sequence.
In this embodiment, in a third implementation manner of the second aspect of the present invention, the static processing module specifically includes: the matrix determining unit is used for determining a corresponding pose conversion matrix according to the conversion relation between the laser radar coordinate system corresponding to the point cloud sequence and the world coordinate system; the coordinate system conversion unit is used for converting all the static point clouds to a world coordinate system according to the pose conversion matrix; and the superposition smoothing unit is used for processing all the static point clouds transformed to the world coordinate system to obtain corresponding background point clouds.
In this embodiment, in a fourth implementation manner of the second aspect of the present invention, the superposition smoothing unit is specifically configured to: superposing all the static point clouds transformed to the world coordinate system to obtain superposed point clouds; calculating the neighborhood of each point cloud in the superimposed point cloud in a non-current frame through a kd-tree nearest neighbor algorithm; counting semantic types corresponding to all point clouds in the neighborhood, and determining the semantic type of a central point of the neighborhood according to the semantic type of each point cloud in the neighborhood; and filtering point clouds in the neighborhood, which are different from the semantic type of the central point, to obtain corresponding background point clouds.
In this embodiment, in a fifth implementation manner of the second aspect of the present invention, the dynamic processing module is specifically configured to: constructing a three-dimensional plane of the dynamic obstacle corresponding to the dynamic point cloud; up-sampling the three-dimensional plane to obtain sampling points, and carrying out interpolation processing on the dynamic point cloud according to the sampling points; and superposing the dynamic point cloud processed by interpolation processing with the background point cloud to obtain dense point cloud of the corresponding frame.
In this embodiment, in a sixth implementation manner of the second aspect of the present invention, the projection conversion module is specifically configured to: project the dense point cloud onto the camera plane of the camera image of the corresponding frame; calculate the occlusion relationship between the dynamic obstacles corresponding to the dynamic point cloud in the dense point cloud; assign, according to the occlusion relationship, the depth values of the dynamic point cloud to the corresponding pixels of the projected camera plane; and assign depth values according to the static point cloud to the pixels of the projected camera plane that have not yet been assigned depth values.
A third aspect of the present invention provides an image depth labeling device, comprising: a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected by a line; the at least one processor invokes the instructions in the memory to cause the image depth labeling device to perform the steps of the image depth labeling method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the steps of the image depth labeling method described above.
According to the technical scheme provided by the invention, point cloud separation is performed on the point cloud sequence corresponding to multi-frame camera images to obtain corresponding dynamic point clouds and static point clouds; all the static point clouds are transformed to a world coordinate system and processed to obtain a corresponding background point cloud; interpolation is performed on the dynamic point cloud, and the interpolated dynamic point cloud is superimposed with the background point cloud to obtain a dense point cloud of the corresponding frame; and the dense point cloud is projected onto the camera plane of the camera image of the corresponding frame to obtain a depth label corresponding to the camera image. To address the difficulty of acquiring image depth, an automatic model-based labeling algorithm is combined with temporal fusion, which greatly improves the quality of the image depth; through point cloud separation and superposition of multi-frame point cloud data, sparse point clouds become dense point clouds, so that the depth data in the converted depth image are dense.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of an image depth labeling method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a second embodiment of an image depth labeling method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a third embodiment of an image depth labeling method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an embodiment of an image depth labeling apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of another embodiment of an image depth labeling apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an embodiment of an image depth labeling device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides an image depth labeling method, device, equipment and storage medium, which are used for solving the technical problem that depth data in a depth image obtained by monocular depth estimation is too sparse.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For easy understanding, the following describes a specific flow of an embodiment of the present invention, referring to fig. 1, and a first embodiment of an image depth labeling method in an embodiment of the present invention includes:
101. performing point cloud separation on a point cloud sequence corresponding to the multi-frame camera image to obtain corresponding dynamic point clouds and static point clouds;
It can be understood that the execution subject of the present invention may be an image depth labeling apparatus, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as the execution subject as an example.
In practical application, camera technology is mature, stable, low in cost and rich in information, so the camera has become an important sensing element for unmanned perception; compared with laser point clouds, camera images can provide richer detail and texture information. The scene captured by the camera is typically an area to be represented, such as any area of an urban road environment containing vehicles, pedestrians, traffic signs or billboards. During automatic driving, the area captured by the camera is the field of view in front of the vehicle, and the capture result is a camera image. Meanwhile, the laser radar (lidar) equipment transmits and receives lidar signals (such as laser signals, ultrasonic signals and the like) and performs certain signal processing to obtain point clouds. Because the camera images and the point clouds are generated simultaneously by the camera and the lidar during driving, multiple frames of camera images and a corresponding multi-frame point cloud sequence are obtained.
In this embodiment, the point cloud separation is mainly performed by using a preset obstacle detection and segmentation model based on the point cloud, and the obstacle detection and segmentation model can detect and segment the input point cloud sequence. In practical application, the model detects obstacles in the input point cloud sequence to obtain the obstacle contours corresponding to each frame of the sequence, and predicts a semantic classification for each point in each frame. For example, given an input point cloud sequence {P_0, P_1, …, P_{n-1}}, the model predicts the obstacle contours {B_{i,0}, B_{i,1}, …, B_{i,m-1}} present in each frame i and the semantic classification of each point (including categories such as road surface, sidewalk, static obstacle and noise).
In this embodiment, point cloud separation is performed according to the semantic type of each point to obtain the dynamic point cloud and the static point cloud. The dynamic point cloud mainly consists of the points of dynamic obstacles, for example a vehicle running in front of the ego vehicle during automatic driving, while the static point cloud mainly consists of background that cannot move, such as the road surface and sidewalks. In this embodiment, the point cloud separation is mainly performed by marking the points contained in any dynamic obstacle contour B_{i,j} as dynamic points and the remaining points as static points, recording the sets D_i and S_i of dynamic and static points in each frame of the point cloud, and retaining the classification result of each point.
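As an illustration of the dynamic/static split described above, the sketch below assumes the detection and segmentation model has already produced, for frame i, a set of obstacle contours with semantic types and an (N, 3) point array; the axis-aligned box representation, the label names and the helper functions are assumptions made for the example, not the patent's interfaces.

```python
from dataclasses import dataclass
import numpy as np

DYNAMIC_TYPES = {"vehicle", "pedestrian", "cyclist"}   # assumed dynamic classes

@dataclass
class ObstacleBox:
    semantic_type: str
    min_xyz: np.ndarray   # (3,) lower corner in the lidar frame
    max_xyz: np.ndarray   # (3,) upper corner in the lidar frame

    def contains(self, pts):
        # Point-in-box test for an axis-aligned contour approximation.
        return np.all((pts >= self.min_xyz) & (pts <= self.max_xyz), axis=1)

def split_dynamic_static(points, boxes):
    """Return (D_i, S_i): points inside any dynamic-obstacle contour vs. the rest."""
    dynamic_mask = np.zeros(len(points), dtype=bool)
    for box in boxes:
        if box.semantic_type in DYNAMIC_TYPES:
            dynamic_mask |= box.contains(points)
    return points[dynamic_mask], points[~dynamic_mask]
```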
102. Transforming all the static point clouds to a world coordinate system, and processing to obtain corresponding background point clouds;
In this embodiment, the point cloud sequences of different frames are captured at different times, so their reference frames may differ. For example, during automatic driving the camera and the lidar capture images and generate point clouds in real time while the autonomous vehicle moves, so each frame has its own reference frame. The point cloud sequences of different frames share the same reference frame only when the vehicle is stationary, for example while the autonomous vehicle waits at a red light, since the camera and the lidar are then stationary as well.
In this embodiment, a pose transformation matrix between the reference frame corresponding to each frame's point cloud sequence and a preset world coordinate system is determined, and the separated static point clouds are translated and rotated through matrix transformation, so that the static point cloud of each frame is converted from the lidar reference frame into the unified world coordinate system.
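A minimal sketch of this per-frame transformation, assuming a 4x4 lidar-to-world pose matrix is available for every frame (for example from the vehicle's localization output); the accumulation into a single array corresponds to the superposition step described below. All names are illustrative.

```python
import numpy as np

def accumulate_static_points(static_per_frame, poses_world_from_lidar):
    """static_per_frame: list of (N_i, 3) arrays; poses: list of (4, 4) matrices."""
    merged = []
    for pts, T in zip(static_per_frame, poses_world_from_lidar):
        pts_h = np.hstack([pts, np.ones((len(pts), 1))])
        merged.append((T @ pts_h.T).T[:, :3])      # rotate + translate into the world frame
    return np.vstack(merged)                       # superposed static background candidate
```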
In practical application, this processing is mainly used to turn multiple frames of sparse static point clouds into a dense background point cloud. In this embodiment, the processing mainly includes superposition and smoothing: a dense static point cloud is obtained by superimposing N consecutive frames of static point clouds, and the superimposed static point cloud then needs to be smoothed, because each frame of static point cloud may contain noise, and after superimposing multiple frames the noise of each frame accumulates in the resulting dense static point cloud. The accumulated noise is filtered out uniformly through smoothing, and the point cloud finally obtained is the background point cloud.
103. Performing interpolation processing on the dynamic point cloud, and superposing the dynamic point cloud subjected to the interpolation processing with the background point cloud to obtain dense point cloud of the corresponding frame;
In this embodiment, for the dynamic point cloud and the obstacle contours obtained from the obstacle detection and segmentation model, the 3D measurements come only from the point cloud of the current frame, so they are still sparse compared with the static point cloud superimposed over multiple frames. To cope with this problem, in this embodiment, for each obstacle contour B_{i,j}, interpolation is performed on the obstacle surface in three-dimensional space according to the contour and the points contained in the obstacle, so as to fill the gaps between the sparse points on the obstacle surface. After the surfaces of all obstacles are densified, a new dynamic point cloud D'_i of the current frame is obtained.
In this embodiment, after the point cloud processing in the above steps is completed, the dynamic and static points D'_i and S'_i of each frame are superimposed to obtain the complete dense point cloud P'_i of the current frame.
104. And projecting the dense point cloud to a camera plane of a camera image of the corresponding frame to obtain a depth mark corresponding to the camera image.
In this embodiment, according to the obstacle contours detected by the obstacle detection and segmentation model, the occlusion relationship between the dynamic points contained in each obstacle is calculated. Meanwhile, according to the static calibration parameters K between the camera and the lidar, the dynamic point cloud D'_i is first projected onto the image plane from far to near, so that the resulting depth conforms to the occlusion relationship of the obstacles in the current frame; the value of each point in the camera coordinate system is assigned to the corresponding pixel after projection as the depth value of that pixel. Then only the static point cloud is projected onto the image areas not yet assigned depth by the dynamic point cloud, and depth values are assigned in the same way, completing the depth labeling of the camera image. Since the point cloud obtained is dense, the depth information in the camera image obtained after the conversion is also dense, and a depth image can be generated for each frame of the point cloud sequence, yielding a sequence of depth images.
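The far-to-near, occlusion-consistent projection described above can be sketched as follows. The sketch assumes the dynamic and static points have already been transformed into the camera coordinate system and uses a simple painter's-algorithm ordering so that nearer points overwrite farther ones, with the static background filling only the pixels left empty; K and the function names are illustrative assumptions.

```python
import numpy as np

def paint_depth(depth, pts_cam, K, only_empty=False):
    """Project points (already in the camera frame) and write their depth far-to-near."""
    pts_cam = pts_cam[pts_cam[:, 2] > 0]
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    h, w = depth.shape
    for i in np.argsort(-pts_cam[:, 2]):          # far to near: nearer points overwrite
        u, v = int(round(uv[i, 0])), int(round(uv[i, 1]))
        if 0 <= u < w and 0 <= v < h:
            if only_empty and not np.isnan(depth[v, u]):
                continue                          # keep the depth already assigned
            depth[v, u] = pts_cam[i, 2]
    return depth

def label_frame(dynamic_cam, static_cam, K, image_shape):
    depth = np.full(image_shape, np.nan)
    depth = paint_depth(depth, dynamic_cam, K)                  # dynamic obstacles first
    return paint_depth(depth, static_cam, K, only_empty=True)   # background fills the rest
```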
In this embodiment, point cloud separation is performed on the point cloud sequence corresponding to multi-frame camera images to obtain corresponding dynamic point clouds and static point clouds; all the static point clouds are transformed to a world coordinate system and processed to obtain a corresponding background point cloud; interpolation is performed on the dynamic point cloud, and the interpolated dynamic point cloud is superimposed with the background point cloud to obtain a dense point cloud of the corresponding frame; and the dense point cloud is projected onto the camera plane of the camera image of the corresponding frame to obtain a depth label corresponding to the camera image. To address the difficulty of acquiring image depth, an automatic model-based labeling algorithm is combined with temporal fusion, which greatly improves the quality of the image depth; through point cloud separation and superposition of multi-frame point cloud data, sparse point clouds become dense point clouds, so that the depth data in the converted depth image are dense.
Referring to fig. 2, a second embodiment of the image depth labeling method according to the present invention includes:
201. inputting the point cloud sequence into a preset obstacle detection and segmentation model, wherein the obstacle detection and segmentation model is divided into a feature extraction part and a semantic segmentation part;
In practical application, the obstacle detection and segmentation model can use a PointNet, PointNet++ or FocusNet network structure; in this embodiment, the PointNet network structure is mainly used to perform semantic segmentation on the point cloud sequence.
202. D-dimensional features of n point clouds in an input point cloud sequence are obtained as local features through a feature extraction part in the obstacle detection and segmentation model, the local features are classified and learned, and global features are obtained through maximum pooling processing;
203. splicing the local features and the global features through semantic segmentation parts in the obstacle detection and segmentation model, performing dimension reduction processing through multi-layer MLP, and finally predicting semantic types of each point cloud of the point cloud sequence;
In this embodiment, the PointNet network structure includes a first T-Net layer, a second T-Net layer, multi-layer perceptrons (Multilayer Perceptron, MLP) and a feature fusion layer. The first T-Net layer is used to align the positions of the points in the data to be processed; an MLP raises the dimension of the per-point local features from 3 to 64; the second T-Net layer aligns the point features; an MLP then raises the dimension of the local features from 64 to 128 and further to 1024; max pooling, as a symmetric function, is applied to obtain the global feature of the point cloud; the feature fusion layer concatenates the global feature with the per-point local features; and an MLP performs dimension reduction on the concatenated features to realize semantic segmentation. The semantic segmentation process is as follows: first, dimension reduction is performed on the fused point cloud features using a multi-layer perceptron MLP; then the points are classified through a softmax function to obtain the probability score of each point in each category; finally, label classification is carried out to realize the semantic segmentation of the point cloud.
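The segmentation path just described can be illustrated with the following minimal PyTorch sketch (per-point features 3→64, a deeper branch 64→128→1024, max pooling for the global feature, concatenation of local and global features, and an MLP head that outputs per-point class scores). The T-Net alignment blocks and training details are omitted; this is an assumed illustration of a PointNet-style network, not the exact model of the disclosure.

```python
import torch
import torch.nn as nn

class PointNetSegSketch(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Shared per-point MLPs are implemented as 1x1 Conv1d layers.
        self.local = nn.Sequential(nn.Conv1d(3, 64, 1), nn.ReLU())
        self.to_global = nn.Sequential(nn.Conv1d(64, 128, 1), nn.ReLU(),
                                       nn.Conv1d(128, 1024, 1), nn.ReLU())
        self.head = nn.Sequential(nn.Conv1d(1024 + 64, 512, 1), nn.ReLU(),
                                  nn.Conv1d(512, 128, 1), nn.ReLU(),
                                  nn.Conv1d(128, num_classes, 1))

    def forward(self, pts):                      # pts: (B, 3, N)
        local = self.local(pts)                  # (B, 64, N) per-point local features
        global_feat = self.to_global(local).max(dim=2, keepdim=True).values   # (B, 1024, 1)
        global_rep = global_feat.expand(-1, -1, pts.shape[2])                 # broadcast to N points
        fused = torch.cat([local, global_rep], dim=1)                         # local + global features
        return self.head(fused)                  # (B, num_classes, N) per-point class scores
```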
204. Marking point clouds contained by obstacle outlines with semantic types of dynamic obstacle types in the point cloud sequence as dynamic point clouds;
205. marking point clouds which are not contained by the obstacle outline with the semantic type being a dynamic obstacle type in the point cloud sequence as static point clouds;
206. transforming all the static point clouds to a world coordinate system, and processing to obtain corresponding background point clouds;
207. performing interpolation processing on the dynamic point cloud, and superposing the dynamic point cloud subjected to the interpolation processing with the background point cloud to obtain dense point cloud of the corresponding frame;
208. projecting the dense point cloud to a camera plane of a camera image of the corresponding frame;
209. Calculating the occlusion relationship between the dynamic obstacles corresponding to the dynamic point cloud in the dense point cloud;
In this embodiment, according to the detected obstacle contours, the occlusion relationship between the dynamic points contained in each obstacle is calculated. Because the positional relationship between dynamic obstacles may change, and to avoid confusing dynamic obstacles during depth labeling, the occlusion relationship between the dynamic obstacles must be determined first.
210. Assigning, according to the occlusion relationship, the depth values of the dynamic point cloud to the corresponding pixels of the projected camera plane;
211. And assigning the depth value to the pixels which are not assigned with the depth value in the projected camera plane according to the depth value of the static point cloud.
In this embodiment, in the two-dimensional depth map on the camera plane, the coordinates of each pixel correspond to the first two coordinates of the associated point in the camera coordinate system. For example, if a point in the three-dimensional point cloud has coordinates (x1, y1, z1) in the camera coordinate system, where z1 represents the depth of the point relative to the camera imaging plane, the corresponding pixel in the two-dimensional depth map has coordinates (x1, y1) and is assigned the depth value z1.
On the basis of the previous embodiment, this embodiment describes in detail the process of performing point cloud separation on the point cloud sequence corresponding to multi-frame camera images to obtain corresponding dynamic point clouds and static point clouds: the point cloud sequence is input into a preset obstacle detection and segmentation model, the obstacle contours in the frames corresponding to the point cloud sequence are detected and segmented by the obstacle detection and segmentation model, and the obstacle contours are semantically annotated to obtain corresponding semantic types, wherein the semantic types include a dynamic obstacle type; the points contained in obstacle contours whose semantic type is the dynamic obstacle type are marked as dynamic points; and the points not contained in any obstacle contour whose semantic type is the dynamic obstacle type are marked as static points. In this method, dynamic obstacles are obtained automatically through the detection and semantic segmentation of the obstacle detection and segmentation model, and fast point cloud separation can be performed on the point cloud sequence based on the dynamic obstacles, thereby realizing fast labeling of depth data.
Referring to fig. 3, a third embodiment of an image depth labeling method according to an embodiment of the present invention includes:
301. Performing point cloud separation on a point cloud sequence corresponding to the multi-frame camera image to obtain corresponding dynamic point clouds and static point clouds;
302. Determining a corresponding pose conversion matrix according to the conversion relation between the laser radar coordinate system corresponding to the point cloud sequence and the world coordinate system;
303. Converting all static point clouds to a world coordinate system according to the pose conversion matrix;
304. Superposing all the static point clouds transformed to the world coordinate system to obtain superposed point clouds;
In this embodiment, the world coordinate system is a reference coordinate system selected in the application environment and used for describing the positions of all objects in the environment. In practical application, the reference frame corresponding to a certain frame of the point cloud sequence may also be used as the reference frame into which the other point cloud sequences are converted; it is only necessary to convert all static point clouds into the same reference frame.
In the present embodiment, in linear algebra, the mapping relationship of the linear transformation is expressed in the form of a transformation matrix. Illustratively, the linear transformation includes rotation, translation, scaling, reflection, or the like. In this embodiment, the pose conversion matrix may be used to represent the conversion relationship between the world coordinate system and the lidar coordinate system.
305. Calculating the neighborhood of each point cloud in the superimposed point cloud in the non-current frame;
306. Counting the semantic types corresponding to all point clouds in the neighborhood, and determining the semantic type of the central point of the neighborhood according to the semantic type of each point cloud in the neighborhood;
307. Filtering point clouds with different semantic types from the central point in the neighborhood to obtain corresponding background point clouds;
In practical application, the present invention is not limited to this method. In this embodiment, the neighborhood of each point in the non-current frames is calculated using a kd-tree nearest neighbor algorithm, where the neighborhood refers to a spherical range of a certain radius in 3D space centered on a given point. The label of each point is updated to the highest-frequency label in its neighborhood: for the points in a neighborhood, each of which carries a category from the point cloud semantic segmentation model, the frequency of occurrence of each category in the neighborhood is counted, and the label of the central point is updated to the category with the highest frequency. This smooths the categories and eliminates noise present in the semantic segmentation results.
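A minimal sketch of this neighborhood smoothing, using scipy's cKDTree as one possible kd-tree implementation: each point's label is replaced by the most frequent label within a fixed-radius spherical neighborhood, after which points whose type differs from their neighborhood's majority can be filtered out. The radius value and the API choice are assumptions for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def smooth_labels(points, labels, radius=0.3):
    """points: (N, 3) world-frame coordinates; labels: (N,) semantic types."""
    tree = cKDTree(points)
    smoothed = labels.copy()
    for i, neighbours in enumerate(tree.query_ball_point(points, r=radius)):
        if neighbours:
            vals, counts = np.unique(labels[neighbours], return_counts=True)
            smoothed[i] = vals[np.argmax(counts)]      # highest-frequency label wins
    return smoothed
```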
In practical applications, other ways may be used instead of using the above procedure to eliminate noise present in the semantic segmentation model results, such as voxel filter based methods, which are not limited in the present invention.
308. Constructing a three-dimensional plane of a dynamic obstacle corresponding to the dynamic point cloud;
309. Up-sampling the three-dimensional plane to obtain sampling points, and carrying out interpolation processing on the dynamic point cloud according to the sampling points;
310. superposing the dynamic point cloud and the background point cloud after interpolation processing to obtain dense point cloud of the corresponding frame;
In this embodiment, the interpolation of the dynamic point cloud is performed as follows: 1. the three-dimensional surface of the obstacle is reconstructed from the existing dynamic obstacle points through a triangulation algorithm, the surface consisting of a series of adjacent triangles with different normal directions; 2. sampling is performed on the reconstructed surface, where each sampling point falls on the surface and can be expressed as a linear combination of the coordinates of its 3 nearest real points; 3. for obstacles whose point cloud is too sparse for a triangulation result to be obtained, sampling is performed directly on the surface of the three-dimensional bounding box (3D bounding box) predicted by the obstacle detection model, which is taken as an approximation.
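Step 2 above, sampling the reconstructed surface as linear combinations of real points, can be sketched as follows; the triangulation itself (step 1) is assumed to be given, for example as an (M, 3) array of vertex-index triples from a meshing library, and the sample count per triangle is an illustrative choice.

```python
import numpy as np

def upsample_surface(vertices, triangles, samples_per_triangle=20, rng=None):
    """vertices: (N, 3) obstacle points; triangles: (M, 3) index triples into vertices."""
    rng = np.random.default_rng() if rng is None else rng
    new_points = []
    for i0, i1, i2 in triangles:
        a, b = rng.random((2, samples_per_triangle, 1))
        flip = (a + b) > 1.0                  # reflect samples outside the triangle back in
        a[flip], b[flip] = 1.0 - a[flip], 1.0 - b[flip]
        w = np.hstack([1.0 - a - b, a, b])    # barycentric weights, each row sums to 1
        new_points.append(w @ vertices[[i0, i1, i2]])   # linear combination of 3 real points
    return np.vstack([vertices, *new_points])           # original + interpolated dynamic points
```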
In this embodiment, the point cloud interpolation and densification method is not unique; alternative methods include, for example, densification based on surface normal estimation and 3D object surface triangulation, and the like.
311. And projecting the dense point cloud to a camera plane of a camera image of the corresponding frame to obtain a depth mark corresponding to the camera image.
On the basis of the previous embodiment, this embodiment describes in detail the process of processing all the static point clouds transformed to the world coordinate system to obtain the corresponding background point cloud: all the static point clouds transformed to the world coordinate system are superimposed to obtain a superimposed point cloud; the neighborhood of each point in the superimposed point cloud in the non-current frames is calculated through a kd-tree nearest neighbor algorithm; the semantic types corresponding to all points in the neighborhood are counted, and the semantic type of the central point of the neighborhood is determined according to the semantic type of each point in the neighborhood; and the points in the neighborhood whose semantic type differs from that of the central point are filtered out to obtain the corresponding background point cloud. To address the difficulty of acquiring image depth, an automatic model-based labeling algorithm is combined with temporal fusion, which greatly improves the quality of the image depth; through point cloud separation and superposition of multi-frame point cloud data, sparse point clouds become dense point clouds, so that the depth data in the converted depth image are dense.
The method for labeling image depth in the embodiment of the present invention is described above, and the apparatus for labeling image depth in the embodiment of the present invention is described below, referring to fig. 4, where an embodiment of the apparatus for labeling image depth in the embodiment of the present invention includes:
the point cloud separation module 401 is configured to perform point cloud separation on a point cloud sequence corresponding to the multi-frame camera image, so as to obtain a corresponding dynamic point cloud and a static point cloud;
the static processing module 402 is configured to transform all the static point clouds to a world coordinate system, and process the static point clouds to obtain corresponding background point clouds;
The dynamic processing module 403 is configured to perform interpolation processing on the dynamic point cloud, and superimpose the dynamic point cloud after the interpolation processing with the background point cloud to obtain a dense point cloud of a corresponding frame;
The projection conversion module 404 is configured to project the dense point cloud to a camera plane of the camera image of the corresponding frame, so as to obtain a depth label corresponding to the camera image.
In the embodiment of the present invention, the image depth labeling apparatus runs the above image depth labeling method: the apparatus performs point cloud separation on the point cloud sequence corresponding to multi-frame camera images to obtain corresponding dynamic point clouds and static point clouds; transforms all the static point clouds to a world coordinate system and processes them to obtain a corresponding background point cloud; performs interpolation on the dynamic point cloud, and superimposes the interpolated dynamic point cloud with the background point cloud to obtain a dense point cloud of the corresponding frame; and projects the dense point cloud onto the camera plane of the camera image of the corresponding frame to obtain a depth label corresponding to the camera image. To address the difficulty of acquiring image depth, an automatic model-based labeling algorithm is combined with temporal fusion, which greatly improves the quality of the image depth; through point cloud separation and superposition of multi-frame point cloud data, sparse point clouds become dense point clouds, so that the depth data in the converted depth image are dense.
Referring to fig. 5, a second embodiment of an image depth marking apparatus according to an embodiment of the present invention includes:
the point cloud separation module 401 is configured to perform point cloud separation on a point cloud sequence corresponding to the multi-frame camera image, so as to obtain a corresponding dynamic point cloud and a static point cloud;
the static processing module 402 is configured to transform all the static point clouds to a world coordinate system, and process the static point clouds to obtain corresponding background point clouds;
The dynamic processing module 403 is configured to perform interpolation processing on the dynamic point cloud, and superimpose the dynamic point cloud after the interpolation processing with the background point cloud to obtain a dense point cloud of a corresponding frame;
The projection conversion module 404 is configured to project the dense point cloud to a camera plane of the camera image of the corresponding frame, so as to obtain a depth label corresponding to the camera image.
In this embodiment, the point cloud separation module 401 specifically includes: the semantic segmentation unit 4011 is configured to input the point cloud sequence into a preset obstacle detection and segmentation model, detect and segment an obstacle contour in a frame corresponding to the point cloud sequence through the obstacle detection and segmentation model, and perform semantic annotation on the obstacle contour to obtain a corresponding semantic type, where the semantic type includes a dynamic obstacle; the dynamic point cloud marking unit 4012 marks the point cloud contained in the obstacle outline of which the semantic type is the dynamic obstacle type in the point cloud sequence as dynamic point cloud; the static point cloud marking unit 4013 marks, as a static point cloud, a point cloud which is not included in the obstacle outline of which the semantic type is a dynamic obstacle type in the point cloud sequence.
In this embodiment, the semantic segmentation unit 4011 is specifically configured to: inputting the point cloud sequence into a preset obstacle detection and segmentation model, wherein the obstacle detection and segmentation model is PointNet in a network structure and is divided into a feature extraction part and a semantic segmentation part; d-dimensional features of n point clouds in an input point cloud sequence are obtained as local features through a feature extraction part in the obstacle detection and segmentation model, the local features are classified and learned, and global features are obtained through maximum pooling processing; and splicing the local features and the global features through semantic segmentation parts in the obstacle detection and segmentation model, performing dimension reduction processing through multi-layer MLP, and finally predicting the semantic types of each point cloud of the point cloud sequence.
In this embodiment, the static processing module 402 specifically includes: a matrix determining unit 4021, configured to determine a corresponding pose conversion matrix according to a conversion relationship between the lidar coordinate system corresponding to the point cloud sequence and the world coordinate system; a coordinate system conversion unit 4022 configured to convert all the static point clouds onto a world coordinate system according to the pose conversion matrix; the superposition smoothing unit 4023 is configured to process all the static point clouds transformed to the world coordinate system to obtain corresponding background point clouds.
In this embodiment, the superposition smoothing unit 4023 specifically functions to: superposing all the static point clouds transformed to the world coordinate system to obtain superposed point clouds; calculating the neighborhood of each point cloud in the superimposed point cloud in a non-current frame through a kd-tree nearest neighbor algorithm; counting semantic types corresponding to all point clouds in the neighborhood, and determining the semantic type of a central point of the neighborhood according to the semantic type of each point cloud in the neighborhood; and filtering point clouds in the neighborhood, which are different from the semantic type of the central point, to obtain corresponding background point clouds.
In this embodiment, the dynamic processing module 403 is specifically configured to: constructing a three-dimensional plane of the dynamic obstacle corresponding to the dynamic point cloud; up-sampling the three-dimensional plane to obtain sampling points, and carrying out interpolation processing on the dynamic point cloud according to the sampling points; and superposing the dynamic point cloud processed by interpolation processing with the background point cloud to obtain dense point cloud of the corresponding frame.
In this embodiment, the projection conversion module 404 is specifically configured to: project the dense point cloud onto the camera plane of the camera image of the corresponding frame; calculate the occlusion relationship between the dynamic obstacles corresponding to the dynamic point cloud in the dense point cloud; assign, according to the occlusion relationship, the depth values of the dynamic point cloud to the corresponding pixels of the projected camera plane; and assign depth values according to the static point cloud to the pixels of the projected camera plane that have not yet been assigned depth values.
On the basis of the previous embodiment, this embodiment describes in detail the specific functions of each module and the unit composition of some of the modules. Through these modules and units, point cloud separation is performed on the point cloud sequence corresponding to multi-frame camera images to obtain corresponding dynamic point clouds and static point clouds; all the static point clouds are transformed to a world coordinate system and processed to obtain a corresponding background point cloud; interpolation is performed on the dynamic point cloud, and the interpolated dynamic point cloud is superimposed with the background point cloud to obtain a dense point cloud of the corresponding frame; and the dense point cloud is projected onto the camera plane of the camera image of the corresponding frame to obtain a depth label corresponding to the camera image. To address the difficulty of acquiring image depth, an automatic model-based labeling algorithm is combined with temporal fusion, which greatly improves the quality of the image depth; through point cloud separation and superposition of multi-frame point cloud data, sparse point clouds become dense point clouds, so that the depth data in the converted depth image are dense.
The image depth labeling apparatus in the embodiment of the present invention is described in detail above from the perspective of the modularized functional entity in fig. 4 and fig. 5; the image depth labeling device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 6 is a schematic structural diagram of an image depth labeling device according to an embodiment of the present invention. The image depth labeling device 600 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 610 (e.g., one or more processors), a memory 620, and one or more storage media 630 (e.g., one or more mass storage devices) storing application programs 633 or data 632. The memory 620 and the storage medium 630 may be transitory or persistent storage. The program stored on the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations for the image depth labeling device 600. Further, the processor 610 may be configured to communicate with the storage medium 630 and execute the series of instruction operations in the storage medium 630 on the image depth labeling device 600 to implement the steps of the image depth labeling method described above.
The image depth labeling device 600 may also include one or more power supplies 640, one or more wired or wireless network interfaces 650, one or more input/output interfaces 660, and/or one or more operating systems 631, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the device structure illustrated in FIG. 6 does not limit the image depth labeling device provided by the present invention, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
The present invention also provides a computer-readable storage medium, which may be a non-volatile or a volatile computer-readable storage medium, in which instructions are stored that, when executed on a computer, cause the computer to perform the steps of the image depth labeling method.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system or apparatus and unit described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. An image depth labeling method, characterized by comprising the following steps:
performing point cloud separation on a point cloud sequence corresponding to multi-frame camera images to obtain corresponding dynamic point clouds and static point clouds;
determining a corresponding pose transformation matrix according to the transformation relationship between the LiDAR coordinate system corresponding to the point cloud sequence and the world coordinate system, and transforming all the static point clouds into the world coordinate system according to the pose transformation matrix;
superimposing all the static point clouds transformed into the world coordinate system to obtain superimposed point clouds; calculating, for each point in the superimposed point clouds, its neighborhood in non-current frames; counting the semantic types of all points in the neighborhood, and determining the semantic type of the neighborhood's center point according to the semantic type of each point in the neighborhood; filtering out points in the neighborhood whose semantic type differs from that of the center point, to obtain corresponding background point clouds;
constructing a three-dimensional plane of the dynamic obstacle corresponding to the dynamic point clouds; up-sampling the three-dimensional plane to obtain sampling points, and interpolating the dynamic point clouds according to the sampling points; superimposing the interpolated dynamic point clouds with the background point clouds to obtain dense point clouds of the corresponding frame;
projecting the dense point clouds onto the camera plane of the camera image of the corresponding frame; calculating the occlusion relationship between the dynamic obstacles corresponding to the dynamic point clouds within the dense point clouds; assigning the depth values of the dynamic point clouds to the corresponding pixels of the projected camera plane according to the occlusion relationship; and assigning depth values according to the static point clouds to the pixels of the projected camera plane that have not been assigned depth values.
2. The image depth labeling method of claim 1, wherein performing point cloud separation on the point cloud sequence corresponding to the multi-frame camera images to obtain the corresponding dynamic point clouds and static point clouds comprises:
inputting the point cloud sequence into a preset obstacle detection and segmentation model, detecting and segmenting obstacle contours in the frames corresponding to the point cloud sequence through the obstacle detection and segmentation model, and semantically labeling the obstacle contours to obtain corresponding semantic types, wherein the semantic types include a dynamic obstacle type;
labeling, as dynamic point clouds, the points in the point cloud sequence that are contained within obstacle contours whose semantic type is the dynamic obstacle type; and
labeling, as static point clouds, the points in the point cloud sequence that are not contained within obstacle contours whose semantic type is the dynamic obstacle type.
3. The image depth labeling method of claim 2, wherein inputting the point cloud sequence into the preset obstacle detection and segmentation model, detecting and segmenting the obstacle contours in the frames corresponding to the point cloud sequence through the obstacle detection and segmentation model, and semantically labeling the obstacle contours to obtain the corresponding semantic types comprises:
inputting the point cloud sequence into the preset obstacle detection and segmentation model, wherein the obstacle detection and segmentation model is divided into a feature extraction part and a semantic segmentation part;
obtaining, through the feature extraction part of the obstacle detection and segmentation model, the D-dimensional features of the n points in the input point cloud sequence as local features, performing classification learning on the local features, and obtaining global features through max pooling; and
splicing the local features and the global features through the semantic segmentation part of the obstacle detection and segmentation model, performing dimension reduction through a multi-layer perceptron, and finally predicting the semantic type of each point in the point cloud sequence.
4. An image depth labeling device, characterized in that the image depth labeling device comprises:
a point cloud separation module, configured to perform point cloud separation on a point cloud sequence corresponding to multi-frame camera images to obtain corresponding dynamic point clouds and static point clouds;
a static processing module, configured to determine a corresponding pose transformation matrix according to the transformation relationship between the LiDAR coordinate system corresponding to the point cloud sequence and the world coordinate system; transform all the static point clouds into the world coordinate system according to the pose transformation matrix; superimpose all the static point clouds transformed into the world coordinate system to obtain superimposed point clouds; calculate, for each point in the superimposed point clouds, its neighborhood in non-current frames; count the semantic types of all points in the neighborhood, and determine the semantic type of the neighborhood's center point according to the semantic type of each point in the neighborhood; and filter out points in the neighborhood whose semantic type differs from that of the center point, to obtain corresponding background point clouds;
a dynamic processing module, configured to construct a three-dimensional plane of the dynamic obstacle corresponding to the dynamic point clouds; up-sample the three-dimensional plane to obtain sampling points, and interpolate the dynamic point clouds according to the sampling points; and superimpose the interpolated dynamic point clouds with the background point clouds to obtain dense point clouds of the corresponding frame; and
a projection conversion module, configured to project the dense point clouds onto the camera plane of the camera image of the corresponding frame; calculate the occlusion relationship between the dynamic obstacles corresponding to the dynamic point clouds within the dense point clouds; assign the depth values of the dynamic point clouds to the corresponding pixels of the projected camera plane according to the occlusion relationship; and assign depth values according to the static point clouds to the pixels of the projected camera plane that have not been assigned depth values.
5. An image depth labeling apparatus, characterized in that the image depth labeling apparatus comprises: a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected via a line;
the at least one processor invokes the instructions in the memory to cause the image depth labeling apparatus to perform the steps of the image depth labeling method of any one of claims 1-3.
6. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the image depth labeling method of any one of claims 1-3.
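
Illustrative sketch for claim 1 (not part of the patent text): the method chains several geometric steps, namely transforming static point clouds into the world frame with a pose matrix, superimposing them, filtering by neighborhood semantic voting, and projecting dense point clouds onto the camera plane while respecting occlusion between obstacles. The following Python sketch is a minimal, non-authoritative illustration under simplifying assumptions: 4x4 homogeneous pose matrices, a pinhole intrinsic matrix K, non-negative integer semantic labels, a brute-force neighbor search, the non-current-frame restriction omitted, and z-buffering as one plausible way to realize the occlusion handling. All function and variable names (transform_to_world, neighborhood_semantic_filter, project_to_camera) are hypothetical and do not come from the patent.

import numpy as np

def transform_to_world(points_lidar, pose):
    """Transform an (N, 3) array of LiDAR-frame points into the world frame.

    `pose` is assumed to be a 4x4 homogeneous LiDAR-to-world matrix.
    """
    homo = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    return (pose @ homo.T).T[:, :3]

def neighborhood_semantic_filter(points, labels, radius=0.3):
    """Keep each point only if it matches the majority semantic label of its neighborhood.

    `labels` are non-negative integer class IDs; neighbor search is brute force.
    """
    keep = np.zeros(len(points), dtype=bool)
    for i, p in enumerate(points):
        dists = np.linalg.norm(points - p, axis=1)
        neighbor_labels = labels[dists < radius]
        majority = np.bincount(neighbor_labels).argmax()  # semantic type of the neighborhood center
        keep[i] = labels[i] == majority
    return keep

def project_to_camera(points_world, world_to_cam, K, image_shape):
    """Z-buffer projection: the nearest point wins each pixel, resolving occlusion."""
    h, w = image_shape
    depth_map = np.zeros((h, w))                      # 0 marks pixels with no depth assigned yet
    zbuf = np.full((h, w), np.inf)
    homo = np.hstack([points_world, np.ones((len(points_world), 1))])
    cam = (world_to_cam @ homo.T).T[:, :3]
    cam = cam[cam[:, 2] > 0]                          # keep points in front of the camera
    uv = (K @ cam.T).T
    u = (uv[:, 0] / uv[:, 2]).astype(int)
    v = (uv[:, 1] / uv[:, 2]).astype(int)
    z = cam[:, 2]
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for ui, vi, zi in zip(u[valid], v[valid], z[valid]):
        if zi < zbuf[vi, ui]:
            zbuf[vi, ui] = zi
            depth_map[vi, ui] = zi                    # assign the non-occluded depth value
    return depth_map

In the claimed pipeline, pixels that remain unassigned after projecting the dynamic points would subsequently receive depth values from the static (background) point clouds; the sketch leaves such pixels at zero.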
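
Claim 1 also recites constructing a three-dimensional plane of the dynamic obstacle, up-sampling it, and interpolating the dynamic point cloud from the sampling points. One plausible way to realize such a densification step, shown purely for illustration and not taken from the patent, is to fit a plane to the obstacle's points by SVD and sample a regular grid within the obstacle's footprint on that plane; the helper densify_dynamic_obstacle and its parameters are hypothetical.

import numpy as np

def densify_dynamic_obstacle(obstacle_pts, samples_per_axis=20):
    """Fit a plane to a dynamic obstacle's (N, 3) points, sample it, and merge with the originals."""
    centroid = obstacle_pts.mean(axis=0)
    centred = obstacle_pts - centroid
    # The two leading right-singular vectors span the best-fit plane through the points.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    u_axis, v_axis = vt[0], vt[1]
    # Extent of the obstacle expressed in plane coordinates.
    u_coords, v_coords = centred @ u_axis, centred @ v_axis
    us = np.linspace(u_coords.min(), u_coords.max(), samples_per_axis)
    vs = np.linspace(v_coords.min(), v_coords.max(), samples_per_axis)
    uu, vv = np.meshgrid(us, vs)
    sampled = centroid + uu.reshape(-1, 1) * u_axis + vv.reshape(-1, 1) * v_axis
    # The sampled (interpolated) points are appended to the sparse dynamic point cloud.
    return np.vstack([obstacle_pts, sampled])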
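
Claim 3 describes an obstacle detection and segmentation model with a feature extraction part (per-point D-dimensional local features, max pooling into a global feature) and a semantic segmentation part (splicing local and global features, dimension reduction through a multi-layer perceptron, per-point semantic prediction). This structure resembles a PointNet-style segmentation network; the PyTorch sketch below illustrates that structure only, with hypothetical layer sizes, input dimension and class count, and is not the model architecture disclosed in the patent.

import torch
import torch.nn as nn

class PointSegNet(nn.Module):
    """PointNet-style per-point semantic segmentation sketch.

    Input:  (B, N, D) point clouds with D-dimensional per-point features.
    Output: (B, N, num_classes) per-point semantic logits.
    """
    def __init__(self, in_dim=4, num_classes=8):
        super().__init__()
        # Feature extraction part: shared MLP producing 64-dimensional local features.
        self.local_mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        # Further lifting before max pooling yields the global feature.
        self.global_mlp = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 1024), nn.ReLU(),
        )
        # Semantic segmentation part: spliced local+global features are reduced in
        # dimension by an MLP and mapped to per-point class logits.
        self.seg_mlp = nn.Sequential(
            nn.Linear(64 + 1024, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, points):
        local = self.local_mlp(points)                         # (B, N, 64) local features
        lifted = self.global_mlp(local)                        # (B, N, 1024)
        global_feat = lifted.max(dim=1, keepdim=True).values   # (B, 1, 1024) max pooling
        global_feat = global_feat.expand(-1, points.shape[1], -1)
        fused = torch.cat([local, global_feat], dim=-1)        # splice local and global features
        return self.seg_mlp(fused)                             # per-point semantic logits

# Usage sketch: n points with D=4 hypothetical features (x, y, z, intensity) per point.
logits = PointSegNet()(torch.randn(2, 2048, 4))
predicted_types = logits.argmax(dim=-1)                        # predicted semantic type per point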
CN202210161561.4A 2022-02-22 2022-02-22 Image depth labeling method, device, equipment and storage medium Active CN114926485B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210161561.4A CN114926485B (en) 2022-02-22 2022-02-22 Image depth labeling method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210161561.4A CN114926485B (en) 2022-02-22 2022-02-22 Image depth labeling method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114926485A (en) 2022-08-19
CN114926485B (en) 2024-06-18

Family

ID=82804776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210161561.4A Active CN114926485B (en) 2022-02-22 2022-02-22 Image depth labeling method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114926485B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749594B (en) * 2019-10-31 2022-04-22 浙江商汤科技开发有限公司 Information completion method, lane line identification method, intelligent driving method and related products
CN113052846B (en) * 2019-12-27 2024-05-28 小米汽车科技有限公司 Multi-line radar point cloud densification method and device
CN113362247B (en) * 2021-06-11 2023-08-15 山东大学 Semantic real scene three-dimensional reconstruction method and system for laser fusion multi-view camera

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Li Ding et al., "Fusing Structure from Motion and LiDAR for Dense Accurate Depth Map Estimation," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017-06-19, pp. 1283-1287. *
王刘正, "Research on Dense Point Cloud Generation Algorithms" (稠密点云生成算法的研究), China Master's Theses Full-text Database, Information Science and Technology, 2017-02-15, pp. 7-48. *

Also Published As

Publication number Publication date
CN114926485A (en) 2022-08-19

Similar Documents

Publication Publication Date Title
JP7430277B2 (en) Obstacle detection method and apparatus, computer device, and computer program
CN114708585B (en) Attention mechanism-based millimeter wave radar and vision fusion three-dimensional target detection method
CN112991413A (en) Self-supervision depth estimation method and system
US20190387209A1 (en) Deep Virtual Stereo Odometry
GB2554481A (en) Autonomous route determination
KR20160123668A (en) Device and method for recognition of obstacles and parking slots for unmanned autonomous parking
CN115861632B (en) Three-dimensional target detection method based on visual laser fusion of graph convolution
CN111563415A (en) Binocular vision-based three-dimensional target detection system and method
Perrollaz et al. A visibility-based approach for occupancy grid computation in disparity space
CN113936139A (en) Scene aerial view reconstruction method and system combining visual depth information and semantic segmentation
US11966234B2 (en) System and method for monocular depth estimation from semantic information
CN110119679B (en) Object three-dimensional information estimation method and device, computer equipment and storage medium
KR20210111052A (en) Apparatus and method for classficating point cloud using semantic image
CN114821507A (en) Multi-sensor fusion vehicle-road cooperative sensing method for automatic driving
CN112446227A (en) Object detection method, device and equipment
CN115861601A (en) Multi-sensor fusion sensing method and device
Suleymanov et al. Inferring road boundaries through and despite traffic
CN114639115A (en) 3D pedestrian detection method based on fusion of human body key points and laser radar
CN116543361A (en) Multi-mode fusion sensing method and device for vehicle, vehicle and storage medium
Dimitrievski et al. Semantically aware multilateral filter for depth upsampling in automotive lidar point clouds
CN117058474B (en) Depth estimation method and system based on multi-sensor fusion
JP7423500B2 (en) Information processing devices, information processing methods, programs, and vehicle control systems
CN116188550A (en) Self-supervision depth vision odometer based on geometric constraint
CN114549542A (en) Visual semantic segmentation method, device and equipment
Sodhi et al. CRF based method for curb detection using semantic cues and stereo depth

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant