CN111833358A - Semantic segmentation method and system based on 3D-YOLO - Google Patents

Semantic segmentation method and system based on 3D-YOLO

Info

Publication number
CN111833358A
Authority
CN
China
Prior art keywords
dimensional
point cloud
frame
frame sequence
dimensional point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010593311.9A
Other languages
Chinese (zh)
Inventor
赵健
温志津
刘阳
李晋徽
鲍雁飞
雍婷
晋晓曦
张清毅
温可涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
32802 Troops Of People's Liberation Army Of China
Original Assignee
32802 Troops Of People's Liberation Army Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 32802 Troops Of People's Liberation Army Of China filed Critical 32802 Troops Of People's Liberation Army Of China
Priority to CN202010593311.9A priority Critical patent/CN111833358A/en
Publication of CN111833358A publication Critical patent/CN111833358A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic segmentation method and a semantic segmentation system based on 3D-YOLO, which respectively sort collected RGB images and depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence; convert the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture; take the three-dimensional point cloud picture as input, convert it into a three-dimensional feature tensor through a feature learning network, and input it into a 3D-Net network; and obtain a target three-dimensional position prediction frame through the 3D-Net network. The invention mainly aims to solve the problems that target detection algorithms for unmanned aerial vehicles run slowly and cannot generate three-dimensional labeling frames, and is mainly applied to the technical field of computer vision.

Description

Semantic segmentation method and system based on 3D-YOLO
Technical Field
The embodiment of the invention relates to the technical field of computer vision, in particular to a semantic segmentation method and a semantic segmentation system based on 3D-YOLO.
Background
When an unmanned aerial vehicle travels at high speed, obstacles usually appear in its field of view only for a short time, so the target detection algorithm must identify the specific type of obstacle quickly and accurately in order to respond in real time.
Currently, mainstream object detection algorithms such as R-CNN use a candidate region method to first generate possible candidate region boxes of an object on an image, and then obtain a label of the object by running a classifier in the candidate boxes. After classification is finished, the target enclosure frame is refined through back-end processing, repeated detection is eliminated, and the target is subdivided according to other objects in the scene. Because these processes all need to be trained separately, the whole target detection algorithm is slow and difficult to optimize, and cannot be applied to a system with high real-time requirements such as an unmanned aerial vehicle running at a high speed.
The YOLO network defines target detection as a single regression problem, and directly utilizes the characteristics of the whole image to perform target positioning and type judgment, but because a labeling frame obtained by the traditional YOLO algorithm is two-dimensional, the two-dimensional information of obstacles is not enough to construct a three-dimensional constraint condition in the flight of the unmanned aerial vehicle.
Disclosure of Invention
In view of this, the embodiment of the invention provides a semantic segmentation method and a semantic segmentation system based on 3D-YOLO, and mainly aims to solve the problems that a target detection algorithm for an unmanned aerial vehicle is slow in operation speed and cannot generate a three-dimensional labeling frame.
In order to solve the above problems, embodiments of the present invention mainly provide the following technical solutions:
in a first aspect, an embodiment of the present invention provides a semantic segmentation method based on 3D-YOLO, where the method includes:
respectively sequencing the acquired RGB images and the depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence;
converting the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture;
taking the three-dimensional point cloud image as input, converting the three-dimensional point cloud image into a three-dimensional feature tensor through a feature learning network, and inputting the three-dimensional point cloud image into a 3D-Net network;
and obtaining a target three-dimensional position prediction frame through the 3D-Net network.
Optionally, before converting the RGB image frame sequence and the depth image frame sequence into the three-dimensional point cloud map, the method further includes:
selecting a frame of image as a key frame in each frame with the preset number from the RGB image frame sequence and the depth image frame sequence;
the determined key frame sequence is retained and the other frame sequences are discarded.
Optionally, converting the RGB image frame sequence and the depth image frame sequence into the three-dimensional point cloud image includes:
and sequentially converting each frame in the RGB image coordinate system and the depth image coordinate system into each frame in the point cloud coordinate system according to the mapping relation between the point cloud coordinate system and the RGB image coordinate system and the depth image coordinate system for storing the key frames respectively.
Optionally, the step of converting the three-dimensional point cloud image into a three-dimensional feature tensor through a feature learning network by using the three-dimensional point cloud image as an input, and inputting the three-dimensional point cloud image into a 3D-Net network includes:
dividing the three-dimensional point cloud picture into a plurality of three-dimensional sub-grids;
and selecting a preset number of three-dimensional point cloud pictures as input of a feature learning network through random sampling, and converting the three-dimensional point cloud pictures into three-dimensional feature tensors.
Optionally, the converting the three-dimensional point cloud image into a three-dimensional feature tensor includes:
sequentially inputting the three-dimensional point cloud picture into a full connection layer, a ReLU activation function and a BN layer of a feature learning network to obtain point cloud features;
inputting the point cloud features into an Element-wise Maxpool layer to obtain locally aggregated (Locally Aggregated) features;
processing the local aggregation features and the point cloud features through a point cloud splicing layer to obtain four-dimensional cloud splicing features;
and reshaping the four-dimensional cloud splicing characteristics to obtain a three-dimensional characteristic tensor.
Optionally, obtaining the target three-dimensional position prediction frame through the 3D-Net network includes:
inputting the three-dimensional feature tensor into the 3D-Net network, and obtaining a feature tensor to be processed according to the three-dimensional feature tensor according to preset parameters and the number of preset prediction frames; wherein, the preset parameters include: the center of the prediction box, the length, width, height and the proportion of the prediction box, the confidence coefficient and the rotation angle of the box;
calculating the coordinates of the three-dimensional position prediction frame according to the preset parameters, the number of the preset prediction frames and the to-be-processed feature tensor;
normalizing the three-dimensional position prediction frame according to the length of the diagonal line of the three-dimensional position prediction frame;
and removing the overlapped three-dimensional position prediction frames by a non-maximum value inhibition method, wherein each sub grid generates a preset number of three-dimensional position prediction frames.
And performing loss calculation on the three-dimensional position prediction frame by minimizing a preset loss function, and minimizing the value of the loss function by utilizing a stochastic gradient descent algorithm to obtain the target three-dimensional position prediction frame.
In a second aspect, an embodiment of the present invention further provides a semantic segmentation system based on 3D-YOLO, where the system includes:
the sequencing unit is used for respectively sequencing the collected RGB images and the depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence;
the first conversion unit is used for converting the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture;
the second conversion unit is used for taking the three-dimensional point cloud picture as input and converting the three-dimensional point cloud picture into a three-dimensional feature tensor through a feature learning network;
the input unit is used for inputting the three-dimensional feature tensor into a 3D-Net network;
and the acquisition unit is used for acquiring the target three-dimensional position prediction frame through the 3D-Net network.
Optionally, the system further includes:
the selection unit is used for selecting one frame of image as a key frame from the RGB image frame sequence and the depth image frame sequence in each preset number of frames before the RGB image frame sequence and the depth image frame sequence are converted into the three-dimensional point cloud images by the first conversion unit;
a reservation unit for reserving the determined key frame sequence;
a discarding unit for discarding the other frame sequences.
Optionally, the first conversion unit is further configured to sequentially convert each frame in the RGB image coordinate system and the depth image coordinate system into each frame in the point cloud coordinate system according to mapping relationships between the point cloud coordinate system and the RGB image coordinate system and the depth image coordinate system storing the key frames, respectively.
Optionally, the second conversion unit includes:
the dividing module is used for dividing the three-dimensional point cloud picture into a plurality of three-dimensional sub-grids;
the selecting module is used for selecting a preset number of three-dimensional point cloud pictures as the input of the feature learning network through random sampling;
and the conversion module is used for converting the three-dimensional point cloud picture into a three-dimensional characteristic tensor.
Optionally, the conversion module includes:
the first input submodule is used for sequentially inputting the three-dimensional point cloud picture into a full connection layer, a ReLU activation function and a BN layer of a feature learning network to obtain point cloud features;
the second input sub-module is used for inputting the point cloud features into an Element-wise Maxpool layer to obtain locally aggregated (Locally Aggregated) features;
the processing submodule is used for processing the local aggregation characteristics and the point cloud characteristics through a point cloud splicing layer to obtain four-dimensional cloud splicing characteristics;
and the reshaping submodule is used for reshaping the four-dimensional cloud splicing characteristics to obtain three-dimensional characteristic tensor.
Optionally, the obtaining unit includes:
an input module for inputting the three-dimensional feature tensor into the 3D-Net network;
the processing module is used for obtaining a to-be-processed feature tensor according to the three-dimensional feature tensor according to preset parameters and the number of preset prediction frames; wherein, the preset parameters include: the center of the prediction box, the length, width, height and the proportion of the prediction box, the confidence coefficient and the rotation angle of the box;
the first calculation module is used for calculating the coordinates of the three-dimensional position prediction frame according to the preset parameters, the number of the preset prediction frames and the tensor of the features to be processed;
the normalization module is used for normalizing the three-dimensional position prediction frame according to the length of the diagonal line of the three-dimensional position prediction frame;
a removing module, configured to remove the overlapped three-dimensional position prediction frames by a non-maximum suppression method, where each sub-grid generates a preset number of three-dimensional position prediction frames;
the second calculation module is used for performing loss calculation on the three-dimensional position prediction frame by minimizing a preset loss function;
and the acquisition module is used for minimizing the value of the loss function by utilizing a stochastic gradient descent algorithm to obtain the target three-dimensional position prediction frame.
By the technical scheme, the technical scheme provided by the embodiment of the invention at least has the following advantages:
the semantic segmentation method and the semantic segmentation system based on 3D-YOLO provided by the embodiment of the invention respectively sort the acquired RGB images and the depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence; converting the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture; taking the three-dimensional point cloud picture as input, and converting the three-dimensional point cloud picture into a three-dimensional feature tensor through a feature learning network; and obtaining a target three-dimensional position prediction frame through the 3D-Net network. Compared with the prior art, on the basis of the traditional YOLO algorithm, the image depth information obtained by the binocular camera of the unmanned aerial vehicle is combined, semantic segmentation is realized, the three-dimensional marking frame is generated, the background error suppression capability is strong, the high generalization capability is realized, the 3D position of the obstacle in the space can be obtained, and the accuracy of target identification is improved.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention, and the embodiments of the present invention can be implemented according to the content of the description in order to make the technical means of the embodiments of the present invention more clearly understood, and the detailed description of the embodiments of the present invention is provided below in order to make the foregoing and other objects, features, and advantages of the embodiments of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the embodiments of the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart illustrating a semantic segmentation method based on 3D-YOLO according to an embodiment of the present invention;
FIG. 2 is a flow chart of another 3D-YOLO-based semantic segmentation method provided by the embodiment of the invention;
FIG. 3 is a diagram illustrating key frame selection according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a feature learning network provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a 3D-Net network according to an embodiment of the present invention;
FIG. 6 is a block diagram illustrating a 3D-YOLO-based semantic segmentation system provided by an embodiment of the present invention;
FIG. 7 is a block diagram illustrating another 3D-YOLO-based semantic segmentation system provided by an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention provides a 3D-YOLO-based semantic segmentation method, and mainly aims to solve the problems that existing target detection algorithms are slow and difficult to optimize, cannot be applied to systems with high real-time requirements such as an unmanned aerial vehicle operating at high speed, and can only generate two-dimensional labeling frames. In order to solve the above problems, an embodiment of the present invention provides a semantic segmentation method based on 3D-YOLO, as shown in fig. 1, the method includes:
101. and respectively sequencing the collected RGB images and the depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence.
The embodiment of the invention is mainly applied to an unmanned aerial vehicle flying at high speed, and accurately identifies the specific type of obstacle so as to respond in real time. In practical application, data is first acquired: the binocular camera of the unmanned aerial vehicle can be used to shoot the surrounding environment continuously, obtaining N frames of color (RGB) images and N frames of depth images. The N color images and the N depth images are each sorted by shooting time from earliest to latest, yielding an RGB image sequence C_1, C_2, C_3, ..., C_i, ..., C_N and a depth image sequence Cd_1, Cd_2, Cd_3, ..., Cd_i, ..., Cd_N.
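By way of illustration, the frame ordering of step 101 could be sketched in Python as follows; the Frame structure and its timestamp field are assumptions made for the example, not part of the original disclosure.

```python
# Minimal sketch: pair each RGB frame with its depth frame and sort both streams
# by capture timestamp, as in step 101. Field names are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Frame:
    timestamp: float   # shooting time reported by the binocular camera
    rgb_path: str      # path to the color image C_k
    depth_path: str    # path to the depth image Cd_k

def build_frame_sequences(frames: List[Frame]) -> Tuple[List[str], List[str]]:
    """Return the RGB and depth frame sequences ordered by shooting time."""
    ordered = sorted(frames, key=lambda f: f.timestamp)
    rgb_sequence = [f.rgb_path for f in ordered]      # C_1, C_2, ..., C_N
    depth_sequence = [f.depth_path for f in ordered]  # Cd_1, Cd_2, ..., Cd_N
    return rgb_sequence, depth_sequence
```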
102. And converting the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture.
In the embodiment of the invention, in order to identify the target detection object, the obtained RGB image frame sequence and depth image frame sequence need to be converted into the three-dimensional point cloud pictures CY_1, CY_2, CY_3, ..., CY_i, ..., CY_N. The specific transformation method may be implemented according to the mapping relation between the point cloud coordinate system and the RGB image coordinate system and the depth image coordinate system, and is not specifically limited.
103. And taking the three-dimensional point cloud image as input, converting the three-dimensional point cloud image into a three-dimensional feature tensor through a feature learning network, and inputting the three-dimensional feature tensor into a 3D-Net network.
Taking the three-dimensional point cloud picture as input, a large point cloud picture is divided into many small three-dimensional sub-grids, where each sub-grid has size (v_d × v_h × v_w), with v_d, v_h, v_w respectively the length, height and width of each sub-grid. If the size of a three-dimensional point cloud picture is D × H × W (D, H and W being respectively the length, height and width of the point cloud in the standard coordinate system), then the number of three-dimensional sub-grids is (D/v_d × H/v_h × W/v_w), and D' = D/v_d, H' = H/v_h, W' = W/v_w. When the sub-grids are processed, random sampling is used: T points are randomly selected as input each time. Through the feature learning network, the three-dimensional point cloud picture is converted into a three-dimensional feature tensor, which is input into the subsequent 3D-Net network.
104. And obtaining a target three-dimensional position prediction frame through the 3D-Net network.
Taking the three-dimensional point cloud feature tensor as input, a three-dimensional position prediction frame of the target is obtained through processing by the 3D-Net network, with 7 corresponding parameters: the three-dimensional center coordinates, length, width and height of the prediction frame in the world coordinate system, and the confidence p_i.
The semantic segmentation method based on 3D-YOLO provided by the embodiment of the invention respectively sorts the collected RGB images and depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence; converts the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture; takes the three-dimensional point cloud picture as input and converts it into a three-dimensional feature tensor through a feature learning network; and obtains a target three-dimensional position prediction frame through the 3D-Net network. Compared with the prior art, the method builds on the traditional YOLO algorithm and combines it with the image depth information obtained by the binocular camera of the unmanned aerial vehicle, thereby realizing semantic segmentation and generating three-dimensional labeling frames; it has strong background-error suppression and high generalization ability, can obtain the 3D position of obstacles in space, and improves the accuracy of target recognition.
As a refinement and extension of the above embodiments, in the embodiment of the present invention, one frame of image is selected as a key frame in each preset number of frames, the determined key frame sequence is retained, and other frame sequences are discarded. Loss calculation is performed on the three-dimensional position prediction frame by minimizing a preset loss function, and the value of the loss function is minimized using a stochastic gradient descent algorithm to obtain the target three-dimensional position prediction frame. This improves the calculation speed of the algorithm and achieves real-time performance. In order to implement the above functions, an embodiment of the present invention further provides a semantic segmentation method based on 3D-YOLO, and as shown in fig. 2, the method includes:
201. and respectively sequencing the collected RGB images and the depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence.
For the description of step 201, please refer to the detailed description of step 101, and the embodiments of the present invention are not described herein again.
202. In the RGB image frame sequence and the depth image frame sequence, one frame of image is selected as a key frame for each preset number of frames, the determined key frame sequence is reserved, and other frame sequences are discarded.
In the embodiment of the invention, because consecutive images of the same obstacle are highly similar, a large amount of information redundancy exists. When the images are marked on the navigation map, key frames can be selected according to a certain rule; all information in the key frames is retained and the information in other frames is discarded, which greatly reduces the redundancy of the information, reduces the data processing amount, and improves the calculation efficiency.
In practical application, the selection of key frames is related to the speed of the unmanned aerial vehicle: the faster the speed, the shorter the time an obstacle of a given size remains in the unmanned aerial vehicle's field of view and the fewer images of the obstacle exist, so the key frame selection interval can be reduced when the unmanned aerial vehicle moves at high speed in order to effectively obtain the spatial information of the obstacle. In the embodiment of the present invention, the preset number of frames may be set so that one image is selected as a key frame every 3 frames; the key frame image is kept and the other image frames are discarded. The specific preset number of frames is not limited. As shown in fig. 3, the unmanned aerial vehicle shoots an obstacle to obtain a sequence of images and selects one image every 3 frames as a key frame to reduce the data processing amount, obtaining an i-frame RGB image sequence Cx_1, Cx_2, Cx_3, ..., Cx_i and an i-frame depth image sequence Cxd_1, Cxd_2, Cxd_3, ..., Cxd_i.
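A minimal sketch of this key frame selection, assuming the fixed interval of 3 frames used in the example above (in practice the interval would be tuned to the flight speed):

```python
# Keep one key frame out of every `interval` frames and discard the rest.
def select_key_frames(rgb_sequence, depth_sequence, interval=3):
    rgb_keys = rgb_sequence[::interval]      # Cx_1, Cx_2, ..., Cx_i
    depth_keys = depth_sequence[::interval]  # Cxd_1, Cxd_2, ..., Cxd_i
    return rgb_keys, depth_keys
```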
203. And converting the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture.
According to the mapping relations between the point cloud coordinate system and the RGB image coordinate system and depth image coordinate system storing the key frames, each frame in the RGB image coordinate system and the depth image coordinate system is sequentially converted into a frame in the point cloud coordinate system. The specific conversion method is as follows:

x = D · x' / f_x,  y = D · y' / f_y,  z = D

where (x, y, z) are coordinates in the point cloud coordinate system, (x', y', 1) are coordinates in the image coordinate system, D is the value of the depth image, and f_x and f_y are the equivalent focal lengths of the camera. Each pixel of the RGB image sequence Cx_1, Cx_2, Cx_3, ..., Cx_i and the depth image sequence Cxd_1, Cxd_2, Cxd_3, ..., Cxd_i is processed by the above method to obtain the three-dimensional point cloud pictures CY_1, CY_2, CY_3, ..., CY_i.
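The per-pixel conversion can be sketched as follows, assuming the simple pinhole model written above with no principal-point offset (a full camera model would also subtract c_x and c_y before dividing by the focal lengths):

```python
# Minimal sketch of the pixel-to-point-cloud mapping of step 203.
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float) -> np.ndarray:
    """depth: (H, W) array of depth values D; returns an (H*W, 3) array of (x, y, z)."""
    h, w = depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))  # image coordinates x', y'
    z = depth
    x = xs * z / fx
    y = ys * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```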
204. And sequentially inputting the three-dimensional point cloud picture into a full connection layer, a ReLU activation function and a BN layer of a feature learning network to obtain point cloud features.
205. And inputting the point cloud features into an Element-wise Maxpool layer to obtain locally aggregated (Locally Aggregated) features.
206. And processing the local aggregation characteristics and the point cloud characteristics through a point cloud splicing layer to obtain four-dimensional cloud splicing characteristics.
207. And reshaping the four-dimensional cloud splicing characteristics to obtain a three-dimensional characteristic tensor, and inputting the three-dimensional characteristic tensor into a 3D-Net network.
Taking the three-dimensional point cloud picture as input, a large point cloud picture is divided into many small three-dimensional sub-grids, each of size (v_d × v_h × v_w); if the size of a three-dimensional point cloud picture is D × H × W, the number of three-dimensional sub-grids is (D/v_d × H/v_h × W/v_w). When the sub-grids are processed, random sampling is used and T points are randomly selected as input; through the feature learning network, the three-dimensional point cloud picture is converted into a three-dimensional feature tensor, which is input into the subsequent 3D-Net network.
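The sub-grid partition and the random sampling of T points per sub-grid could be sketched as below; the voxel size and the value of T are illustrative assumptions, not values from the patent.

```python
# Minimal sketch: bucket points into sub-grids of size (v_d, v_h, v_w) and keep at
# most T randomly sampled points per non-empty sub-grid.
from collections import defaultdict
import numpy as np

def voxelize(points: np.ndarray, voxel_size=(0.4, 0.2, 0.4), max_points=35):
    vd, vh, vw = voxel_size                 # length, height, width of each sub-grid
    buckets = defaultdict(list)
    for p in points:                        # p = (x, y, z) in the point cloud coordinate system
        key = (int(p[0] // vd), int(p[1] // vh), int(p[2] // vw))
        buckets[key].append(p)
    rng = np.random.default_rng()
    sampled = {}
    for key, pts in buckets.items():
        pts = np.asarray(pts)
        if len(pts) > max_points:           # random sampling down to T points
            pts = pts[rng.choice(len(pts), max_points, replace=False)]
        sampled[key] = pts
    return sampled
```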
Suppose a given voxel is the i-th non-empty voxel of the j-th three-dimensional point cloud picture, with each point in it represented by its three-dimensional point cloud coordinates. Converting this voxel yields the input information of the feature learning network, where the conversion uses the center coordinate of the i-th sub-grid.

In the embodiment of the present invention, the feature learning network framework shown in fig. 4 takes a non-empty voxel as input and outputs a C-dimensional feature tensor (C is an adaptive value); an empty voxel is converted into the zero tensor. After a voxel is input into the network, the point cloud features are obtained through a full connection layer, followed by a ReLU activation function and a BN layer; a locally aggregated feature is then generated through Element-wise Maxpool; the locally aggregated feature and the point cloud features are then combined and processed through the point cloud splicing layer to obtain a 4-dimensional point cloud splicing feature, which is reshaped into a three-dimensional point cloud feature tensor of size (H' × W' × C·D') for subsequent processing.
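A minimal PyTorch sketch of this feature learning block, under the assumption of illustrative layer sizes (only the FC → ReLU → BN → Element-wise Maxpool → splicing structure follows the description above):

```python
import torch
import torch.nn as nn

class FeatureLearningBlock(nn.Module):
    """Per-voxel point feature encoder: FC -> ReLU -> BN, element-wise max pooling,
    then splicing (concatenation) of the locally aggregated feature with each point."""
    def __init__(self, in_dim=7, out_dim=64):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, voxel_points):                              # (num_voxels, T, in_dim)
        n, t, _ = voxel_points.shape
        pointwise = torch.relu(self.fc(voxel_points))             # point cloud features
        pointwise = self.bn(pointwise.view(n * t, -1)).view(n, t, -1)
        aggregated = pointwise.max(dim=1, keepdim=True).values    # element-wise max pool
        aggregated = aggregated.expand(-1, t, -1)                 # locally aggregated feature
        return torch.cat([pointwise, aggregated], dim=-1)         # point cloud splicing
```

The resulting 4-dimensional splicing feature (voxels × T × channels per point cloud picture) would then be reshaped into the (H' × W' × C·D') tensor described above.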
208. And inputting the three-dimensional feature tensor into the 3D-Net network, and obtaining a feature tensor to be processed according to the three-dimensional feature tensor according to preset parameters and the number of preset prediction frames.
In the embodiment of the present invention, a schematic diagram of the 3D-Net network is shown in fig. 5; it consists of 13 convolutional layers and 3 max pooling layers. The three-dimensional feature tensor is input into the 3D-Net network, and according to the three-dimensional feature tensor, the preset parameters and the preset number of prediction frames, a to-be-processed feature tensor of size (H'/8 × W'/8 × B(8+K)) is obtained, where B denotes the number of prediction frames and 8 denotes the 8 obtained parameters: t_x, t_y, t_z respectively denote the offsets from the center position of the initial frame of the image; t_l, t_w, t_h respectively denote the offsets from the length, width and height of the initial frame of the image; t_θ denotes the offset of the frame rotation angle; K denotes the category of the frame, meaning that K classes of targets are detected in the task; and the confidences p_1, ..., p_K represent the prediction likelihood for each class of target.
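For orientation only, a 3D-Net-style backbone with 13 convolutional layers, 3 max pooling layers (hence the /8 spatial reduction) and a head producing B·(8+K) channels could be sketched as follows; the layer widths are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n):
    layers = []
    for _ in range(n):
        layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
        in_ch = out_ch
    return layers

class Net3D(nn.Module):
    def __init__(self, in_ch, num_boxes=2, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            *conv_block(in_ch, 64, 2), nn.MaxPool2d(2),    # H' -> H'/2
            *conv_block(64, 128, 3), nn.MaxPool2d(2),      # -> H'/4
            *conv_block(128, 256, 4), nn.MaxPool2d(2),     # -> H'/8
            *conv_block(256, 256, 3),                      # 2+3+4+3 = 12 conv layers so far
        )
        # final 1x1 conv (the 13th) maps to B*(8+K) channels: offsets, rotation, confidence, classes
        self.head = nn.Conv2d(256, num_boxes * (8 + num_classes), 1)

    def forward(self, x):                    # x: (N, C*D', H', W') reshaped feature tensor
        return self.head(self.features(x))   # (N, B*(8+K), H'/8, W'/8)
```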
209. And calculating the coordinates of the three-dimensional position prediction frame according to the preset parameters, the number of the preset prediction frames and the to-be-processed feature tensor.
210. And normalizing the three-dimensional position prediction frame according to the length of the diagonal line of the three-dimensional position prediction frame.
Based on the description of step 208, the coordinates of the three-dimensional position prediction frame are calculated as follows:

x = t_x · d_a + x_a
y = t_y · d_a + y_a
z = t_z · d_a + z_a
w = e^(t_w) + w_a
h = e^(t_h) + h_a
d = e^(t_l) + l_a
b_θ = t_θ + θ_a
p_i = sigmoid(p_1, ..., p_K)_j

where x_a, y_a, z_a respectively represent the coordinates of the center point of the initial frame of the image to be detected, and d_a = √(l_a² + w_a² + h_a²) represents the length of the diagonal of the labeling frame, which is used as a constraint condition to normalize the obtained three-dimensional position prediction frame. The center coordinates x, y, z of the prediction frame are obtained from the center coordinates of the initial frame and the offsets t_x, t_y, t_z; w_a, h_a, l_a represent the size of the initial frame of the image to be detected, and the prediction frame is corrected by the offsets t_l, t_w, t_h to obtain its length, width and height d, w, h; θ_a is the rotation angle of the initial frame and t_θ is the offset of the rotation angle; p_i is the confidence with which the prediction frame predicts each class of target (p_1, ..., p_K).
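A minimal sketch of decoding one prediction frame from the regressed offsets, following the relations listed above (the anchor quantities x_a, y_a, z_a, l_a, w_a, h_a, θ_a and the diagonal d_a are as defined there; the dictionary layout is an assumption made for the example, and the exact formulas in the patent may differ):

```python
import math

def decode_box(t: dict, anchor: dict):
    """t: offsets tx, ty, tz, tl, tw, th, ttheta; anchor: xa, ya, za, la, wa, ha, theta_a."""
    da = math.sqrt(anchor["la"] ** 2 + anchor["wa"] ** 2 + anchor["ha"] ** 2)  # diagonal length
    x = t["tx"] * da + anchor["xa"]
    y = t["ty"] * da + anchor["ya"]
    z = t["tz"] * da + anchor["za"]
    w = math.exp(t["tw"]) + anchor["wa"]          # size relations as written above
    h = math.exp(t["th"]) + anchor["ha"]
    d = math.exp(t["tl"]) + anchor["la"]
    theta = t["ttheta"] + anchor["theta_a"]       # rotation of the prediction frame
    return x, y, z, d, w, h, theta
```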
211. And removing the overlapped three-dimensional position prediction frames by a non-maximum value inhibition method, wherein each sub-grid predicts the three-dimensional position prediction frames with the preset number of prediction frames.
B three-dimensional frames are predicted in each sub-grid, and overlapped frames can be removed by the non-maximum suppression method to obtain the desired result. Finally, 7 parameters are generated: the three-dimensional center coordinates (x, y, z) in the world coordinate system, the length d, width w and height h of the prediction frame, and the confidence p_i.
Wherein B is a constant.
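A minimal sketch of the non-maximum suppression step, using an axis-aligned 3D IoU approximation (the rotation angle is ignored here for simplicity, which is an assumption and not necessarily the overlap criterion used in the patent):

```python
import numpy as np

def iou_3d(a, b):
    """a, b: (x, y, z, d, w, h) frames centered at (x, y, z), treated as axis-aligned."""
    ca, sa = np.asarray(a[:3], float), np.asarray(a[3:], float)
    cb, sb = np.asarray(b[:3], float), np.asarray(b[3:], float)
    overlap = np.clip(np.minimum(ca + sa / 2, cb + sb / 2) -
                      np.maximum(ca - sa / 2, cb - sb / 2), 0, None)
    inter = np.prod(overlap)
    union = np.prod(sa) + np.prod(sb) - inter
    return inter / union if union > 0 else 0.0

def nms_3d(boxes, scores, iou_threshold=0.5):
    """Keep the highest-confidence frames and drop overlapped ones."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou_3d(boxes[i], boxes[j]) < iou_threshold]
    return keep
```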
212. And performing loss calculation on the three-dimensional position prediction frame by minimizing a preset loss function, and minimizing the value of the loss function by utilizing a stochastic gradient descent algorithm to obtain the target three-dimensional position prediction frame.
In the training process, the loss calculation is performed on the three-dimensional position prediction frame by minimizing the preset loss function, so that the three-dimensional position prediction frame label meeting the condition can be obtained, and the specific calculation process is shown as follows.
L = Σ_{i=1}^{G} Σ_j ( λ_1 · L_ij^obj + λ_noobj · L_ij^noobj )

where G denotes the number of sub-grids, L_ij^obj denotes the loss of the j-th label frame in the i-th sub-grid, and L_ij^noobj denotes the loss when the grid contains no object; λ_1 and λ_noobj are the weight coefficients of the sub-grids with and without targets, respectively. The value of the loss function is minimized by the stochastic gradient descent algorithm to obtain the three-dimensional position prediction frame labels that satisfy the condition.
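A minimal PyTorch sketch of the training objective and the optimization step; the individual loss terms and the default weights are placeholders, since the text here only specifies the weighting by λ_1 and λ_noobj and the use of stochastic gradient descent.

```python
import torch

def weighted_detection_loss(obj_loss: torch.Tensor, noobj_loss: torch.Tensor,
                            lambda_1: float = 1.0, lambda_noobj: float = 0.5) -> torch.Tensor:
    """Combine the per-sub-grid losses for grids with and without targets."""
    return lambda_1 * obj_loss + lambda_noobj * noobj_loss

# Typical training step (model, obj_loss and noobj_loss are assumed to exist elsewhere):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# loss = weighted_detection_loss(obj_loss, noobj_loss)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```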
In summary, according to the semantic segmentation method based on 3D-YOLO, one frame of image is selected as a key frame in each preset number of frames, the determined key frame sequence is retained and the other frame sequences are discarded; the RGB image frame sequence and the depth image frame sequence are converted into a three-dimensional point cloud picture, which is converted into a three-dimensional feature tensor through a feature learning network and input into the 3D-Net network. The three-dimensional position prediction frames are normalized; the overlapped three-dimensional position prediction frames are removed by the non-maximum suppression method; loss calculation is performed on the three-dimensional position prediction frame by minimizing a preset loss function; and the value of the loss function is minimized using a stochastic gradient descent algorithm to obtain the target three-dimensional position prediction frame. The calculation speed of the algorithm is thus improved and real-time performance is achieved.
Further, as an implementation of the method shown in the above embodiment, another embodiment of the present invention further provides a semantic segmentation system based on 3D-YOLO. The system embodiment corresponds to the foregoing method embodiment, and details in the foregoing method embodiment are not repeated in this system embodiment for convenience of reading, but it should be clear that the system in this embodiment can correspondingly implement all the contents in the foregoing method embodiment.
An embodiment of the present invention provides a semantic segmentation system based on 3D-YOLO, as shown in fig. 6, including:
the sorting unit 31 is configured to sort the acquired RGB images and the depth images according to shooting time, respectively, to obtain an RGB image frame sequence and a depth image frame sequence;
a first conversion unit 32, configured to convert the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud image;
a second conversion unit 33, configured to use the three-dimensional point cloud image as an input, and convert the three-dimensional point cloud image into a three-dimensional feature tensor through a feature learning network;
an input unit 34, configured to input the three-dimensional feature tensor into a 3D-Net network;
and the obtaining unit 35 is configured to obtain the target three-dimensional position prediction frame through the 3D-Net network.
The semantic segmentation system based on 3D-YOLO provided by the embodiment of the invention respectively sorts the acquired RGB images and the depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence; converts the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture; takes the three-dimensional point cloud picture as input and converts it into a three-dimensional feature tensor through a feature learning network; and obtains a target three-dimensional position prediction frame through the 3D-Net network. Compared with the prior art, the system builds on the traditional YOLO algorithm and combines it with the image depth information obtained by the binocular camera of the unmanned aerial vehicle, thereby realizing semantic segmentation and generating three-dimensional labeling frames; it has strong background-error suppression and high generalization ability, can obtain the 3D position of obstacles in space, and improves the accuracy of target recognition.
Further, as shown in fig. 7, the system further includes:
a selecting unit 36, configured to select, before the first converting unit 32 converts the RGB image frame sequence and the depth image frame sequence into the three-dimensional point cloud image, one frame of image as a key frame for each preset number of frames in the RGB image frame sequence and the depth image frame sequence;
a reservation unit 37 for reserving the determined key frame sequence;
a dropping unit 38 for dropping the other frame sequences.
Further, as shown in fig. 7, the first converting unit 32 is further configured to sequentially convert each frame in the RGB image coordinate system and the depth image coordinate system into each frame in the point cloud coordinate system according to mapping relationships between the point cloud coordinate system and the RGB image coordinate system and the depth image coordinate system of the stored key frame, respectively.
Further, as shown in fig. 7, the second conversion unit 33 includes:
a dividing module 331, configured to divide the three-dimensional point cloud graph into a plurality of three-dimensional sub-grids;
a selecting module 332, configured to select a preset number of three-dimensional point cloud images as input of the feature learning network through random sampling;
a converting module 333, configured to convert the three-dimensional point cloud graph into a three-dimensional feature tensor.
Further, as shown in fig. 7, the conversion module 333 includes:
the first input submodule 3331 is configured to sequentially input the three-dimensional point cloud image into a full connection layer, a ReLU activation function and a BN layer of a feature learning network to obtain point cloud features;
the second input sub-module 3332 is used for inputting the point cloud features into an Element-wise Maxpool layer to obtain Locally polymerized Locally Aggregated features;
the processing sub-module 3333 is configured to process the local aggregation features and the point cloud features through a point cloud splicing layer to obtain four-dimensional cloud splicing features;
and the reshaping submodule 3334 is used for reshaping the four-dimensional cloud splicing characteristics to obtain a three-dimensional characteristic tensor.
Further, as shown in fig. 7, the acquiring unit 35 includes:
an input module 351, configured to input the three-dimensional feature tensor into the 3D-Net network;
the processing module 352 is configured to obtain a to-be-processed feature tensor according to the three-dimensional feature tensor according to a preset parameter and a preset number of prediction frames; wherein, the preset parameters include: the center of the prediction box, the length, width, height and the proportion of the prediction box, the confidence coefficient and the rotation angle of the box;
the first calculating module 353 is configured to calculate coordinates of a three-dimensional position prediction frame according to the preset parameters, the preset number of prediction frames, and the to-be-processed feature tensor;
the normalizing module 354 is used for normalizing the three-dimensional position prediction frame according to the length of the diagonal line of the three-dimensional position prediction frame;
a removal module 355 for removing the overlapped three-dimensional position prediction frames by a non-maximum suppression method, wherein each sub-grid generates a predetermined number of three-dimensional position prediction frames.
A second calculation module 356 for performing a loss calculation on the three-dimensional position prediction box by minimizing a preset loss function,
an obtaining module 357, configured to obtain the target three-dimensional position prediction box by using a random gradient descent algorithm to minimize a value of the loss function.
In summary, the semantic segmentation system based on 3D-YOLO selects one frame of image as a key frame in each preset number of frames, retains the determined key frame sequence and discards the other frame sequences; converts the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture, converts the three-dimensional point cloud picture into a three-dimensional feature tensor through a feature learning network, and inputs the three-dimensional feature tensor into the 3D-Net network. The three-dimensional position prediction frames are normalized; the overlapped three-dimensional position prediction frames are removed by the non-maximum suppression method; loss calculation is performed on the three-dimensional position prediction frame by minimizing a preset loss function; and the value of the loss function is minimized using a stochastic gradient descent algorithm to obtain the target three-dimensional position prediction frame. The calculation speed of the algorithm is thus improved and real-time performance is achieved.
Since the 3D-YOLO-based semantic segmentation system described in this embodiment is a system that can execute the 3D-YOLO-based semantic segmentation method in the embodiment of the present invention, based on the 3D-YOLO-based semantic segmentation method described in the embodiment of the present invention, those skilled in the art can understand the specific implementation manner and various variations of the 3D-YOLO-based semantic segmentation system in this embodiment, and therefore, how the 3D-YOLO-based semantic segmentation system implements the various 3D-YOLO-based semantic segmentation methods in the embodiment of the present invention is not described in detail herein. As long as those skilled in the art implement the system adopted by the 3D-YOLO-based semantic segmentation method in the embodiment of the present invention, the system is within the scope of the present application.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a system for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction system which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal or a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included within the scope of the claims of the present application.

Claims (10)

1. A semantic segmentation method based on 3D-YOLO is characterized by comprising the following steps:
respectively sequencing the collected RGB images and the collected depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence;
converting the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture;
taking the three-dimensional point cloud image as input, converting the three-dimensional point cloud image into a three-dimensional feature tensor through a feature learning network, and inputting the three-dimensional point cloud image into a 3D-Net network;
and obtaining a target three-dimensional position prediction frame through the 3D-Net network.
2. The method of claim 1, wherein prior to converting the sequence of RGB image frames and the sequence of depth image frames into a three-dimensional point cloud map, the method further comprises:
selecting one frame of image as a key frame from each frame with preset number in the RGB image frame sequence and the depth image frame sequence;
the determined key frame sequence is retained and the other frame sequences are discarded.
3. The method of claim 2, wherein converting the sequence of RGB image frames and the sequence of depth image frames into a three-dimensional point cloud map comprises:
and sequentially converting each frame in the RGB image coordinate system and the depth image coordinate system into each frame in the point cloud coordinate system according to the mapping relation between the point cloud coordinate system and the RGB image coordinate system and the depth image coordinate system for storing the key frames respectively.
4. The method of claim 1, wherein taking the three-dimensional point cloud graph as input, converting the three-dimensional point cloud graph into a three-dimensional feature tensor through a feature learning network, and inputting into a 3D-Net network comprises:
dividing the three-dimensional point cloud picture into a plurality of three-dimensional sub-grids;
and selecting a preset number of three-dimensional point cloud pictures as input of a feature learning network through random sampling, and converting the three-dimensional point cloud pictures into three-dimensional feature tensors.
5. The method of claim 4, wherein the converting the three-dimensional point cloud graph into a three-dimensional feature tensor comprises:
sequentially inputting the three-dimensional point cloud picture into a full connection layer, a ReLU activation function and a BN layer of a feature learning network to obtain point cloud features;
inputting the point cloud features into an Element-wise Maxpool layer to obtain locally aggregated (Locally Aggregated) features;
processing the local aggregation features and the point cloud features through a point cloud splicing layer to obtain four-dimensional cloud splicing features;
and reshaping the four-dimensional cloud splicing characteristics to obtain a three-dimensional characteristic tensor.
6. The method of claim 4, wherein obtaining a target three-dimensional position prediction box by the 3D-Net network comprises:
inputting the three-dimensional feature tensor into the 3D-Net network, and obtaining a feature tensor to be processed according to the three-dimensional feature tensor according to preset parameters and the number of preset prediction frames; wherein, the preset parameters include: the center of the prediction box, the length, width, height and the proportion of the prediction box, the confidence coefficient and the rotation angle of the box;
calculating the coordinates of the three-dimensional position prediction frame according to the preset parameters, the number of the preset prediction frames and the to-be-processed feature tensor;
normalizing the three-dimensional position prediction frame according to the length of the diagonal line of the three-dimensional position prediction frame;
removing the overlapped three-dimensional position prediction frames by a non-maximum value inhibition method, wherein each sub-grid generates a preset number of three-dimensional position prediction frames;
and performing loss calculation on the three-dimensional position prediction frame by minimizing a preset loss function, and minimizing the value of the loss function by utilizing a stochastic gradient descent algorithm to obtain the target three-dimensional position prediction frame.
7. A semantic segmentation system based on 3D-YOLO, comprising:
the sequencing unit is used for respectively sequencing the collected RGB images and the depth images according to the shooting time to obtain an RGB image frame sequence and a depth image frame sequence;
the first conversion unit is used for converting the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture;
the second conversion unit is used for taking the three-dimensional point cloud picture as input and converting the three-dimensional point cloud picture into a three-dimensional feature tensor through a feature learning network;
the input unit is used for inputting the three-dimensional feature tensor into a 3D-Net network;
and the acquisition unit is used for acquiring the target three-dimensional position prediction frame through the 3D-Net network.
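Read as a data flow, the units of claim 7 chain together roughly as sketched below; the class, method and argument names are illustrative only, and the actual conversion and detection steps are delegated to components such as those sketched earlier.

```python
class Semantic3DYoloPipeline:
    """Illustrative wiring of the claimed units into one pipeline; the heavy lifting is
    delegated to injected callables (names and interfaces are assumptions)."""

    def __init__(self, to_point_cloud, feature_net, detector_3d):
        self.to_point_cloud = to_point_cloud   # first conversion unit
        self.feature_net = feature_net         # second conversion unit (feature learning network)
        self.detector_3d = detector_3d         # input unit + acquisition unit (3D-Net)

    def run(self, rgb_frames, depth_frames):
        # sorting unit: order both streams by shooting time
        rgb_seq = sorted(rgb_frames, key=lambda f: f["timestamp"])
        depth_seq = sorted(depth_frames, key=lambda f: f["timestamp"])
        cloud = self.to_point_cloud(rgb_seq, depth_seq)    # three-dimensional point cloud picture
        tensor = self.feature_net(cloud)                   # three-dimensional feature tensor
        return self.detector_3d(tensor)                    # target three-dimensional position prediction frames
```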
8. The system of claim 7, further comprising:
the selection unit is used for selecting, from every preset number of frames of the RGB image frame sequence and the depth image frame sequence, one frame as a key frame before the first conversion unit converts the sequences into the three-dimensional point cloud picture;
the retaining unit is used for retaining the determined key frame sequence;
and the discarding unit is used for discarding the remaining frames.
9. The system of claim 7, wherein the first conversion unit is further configured to sequentially convert each key frame stored in the RGB image coordinate system and the depth image coordinate system into a frame in the point cloud coordinate system according to the mapping relations between the point cloud coordinate system and, respectively, the RGB image coordinate system and the depth image coordinate system.
10. The system of claim 7, wherein the second conversion unit comprises:
the dividing module is used for dividing the three-dimensional point cloud picture into a plurality of three-dimensional sub-grids;
the selection module is used for selecting, by random sampling, a preset number of three-dimensional point cloud pictures as the input of the feature learning network;
and the conversion module is used for converting the three-dimensional point cloud picture into a three-dimensional characteristic tensor.
CN202010593311.9A 2020-06-26 2020-06-26 Semantic segmentation method and system based on 3D-YOLO Pending CN111833358A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010593311.9A CN111833358A (en) 2020-06-26 2020-06-26 Semantic segmentation method and system based on 3D-YOLO

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010593311.9A CN111833358A (en) 2020-06-26 2020-06-26 Semantic segmentation method and system based on 3D-YOLO

Publications (1)

Publication Number Publication Date
CN111833358A (en) 2020-10-27

Family

ID=72899439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010593311.9A Pending CN111833358A (en) 2020-06-26 2020-06-26 Semantic segmentation method and system based on 3D-YOLO

Country Status (1)

Country Link
CN (1) CN111833358A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109658373A (en) * 2017-10-10 2019-04-19 中兴通讯股份有限公司 A kind of method for inspecting, equipment and computer readable storage medium
CN109410307A (en) * 2018-10-16 2019-03-01 大连理工大学 A kind of scene point cloud semantic segmentation method
US10297070B1 (en) * 2018-10-16 2019-05-21 Inception Institute of Artificial Intelligence, Ltd 3D scene synthesis techniques using neural network architectures
CN109948661A (en) * 2019-02-27 2019-06-28 江苏大学 A kind of 3D vehicle checking method based on Multi-sensor Fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WALEED ALI et al.: "YOLO3D: End-to-End Real-Time 3D Oriented Object Bounding Box Detection From LiDAR Point Cloud", Proceedings of the European Conference on Computer Vision (ECCV) *
YIN ZHOU et al.: "VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418288A (en) * 2020-11-17 2021-02-26 武汉大学 GMS and motion detection-based dynamic vision SLAM method
CN112990293A (en) * 2021-03-10 2021-06-18 深圳一清创新科技有限公司 Point cloud marking method and device and electronic equipment
CN112990293B (en) * 2021-03-10 2024-03-29 深圳一清创新科技有限公司 Point cloud labeling method and device and electronic equipment
CN113449744A (en) * 2021-07-15 2021-09-28 东南大学 Three-dimensional point cloud semantic segmentation method based on depth feature expression
CN113848884A (en) * 2021-09-07 2021-12-28 华侨大学 Unmanned engineering machinery decision method based on feature fusion and space-time constraint
CN113848884B (en) * 2021-09-07 2023-05-05 华侨大学 Unmanned engineering machinery decision method based on feature fusion and space-time constraint

Similar Documents

Publication Publication Date Title
CN111833358A (en) Semantic segmentation method and system based on 3D-YOLO
CN114708585B (en) Attention mechanism-based millimeter wave radar and vision fusion three-dimensional target detection method
CN110222626B (en) Unmanned scene point cloud target labeling method based on deep learning algorithm
CN113159151A (en) Multi-sensor depth fusion 3D target detection method for automatic driving
JP2018022360A (en) Image analysis device, image analysis method and program
CN110119679B (en) Object three-dimensional information estimation method and device, computer equipment and storage medium
CN111814753A (en) Target detection method and device under foggy weather condition
CN111986472B (en) Vehicle speed determining method and vehicle
CN113761999A (en) Target detection method and device, electronic equipment and storage medium
CN112734931B (en) Method and system for assisting point cloud target detection
CN114764778A (en) Target detection method, target detection model training method and related equipment
CN114463736A (en) Multi-target detection method and device based on multi-mode information fusion
CN112802197A (en) Visual SLAM method and system based on full convolution neural network in dynamic scene
CN116246119A (en) 3D target detection method, electronic device and storage medium
CN114638996A (en) Model training method, device, equipment and storage medium based on counterstudy
CN117173399A (en) Traffic target detection method and system of cross-modal cross-attention mechanism
CN117037141A (en) 3D target detection method and device and electronic equipment
Pinard et al. End-to-end depth from motion with stabilized monocular videos
CN114913519B (en) 3D target detection method and device, electronic equipment and storage medium
KR101919879B1 (en) Apparatus and method for correcting depth information image based on user's interaction information
Hernandez et al. 3D-DEEP: 3-Dimensional Deep-learning based on elevation patterns for road scene interpretation
Zhang et al. Boosting the speed of real-time multi-object trackers
Kim et al. LiDAR Based 3D object detection using CCD information
CN114170267A (en) Target tracking method, device, equipment and computer readable storage medium
Akın et al. Challenges in determining the depth in 2-d images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201027