CN111833358A - Semantic segmentation method and system based on 3D-YOLO - Google Patents

Semantic segmentation method and system based on 3D-YOLO

Info

Publication number
CN111833358A
Authority
CN
China
Prior art keywords
dimensional
point cloud
frame
frame sequence
dimensional point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010593311.9A
Other languages
Chinese (zh)
Inventor
赵健
温志津
刘阳
李晋徽
鲍雁飞
雍婷
晋晓曦
张清毅
温可涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
32802 Troops Of People's Liberation Army Of China
Original Assignee
32802 Troops Of People's Liberation Army Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 32802 Troops Of People's Liberation Army Of China filed Critical 32802 Troops Of People's Liberation Army Of China
Priority to CN202010593311.9A priority Critical patent/CN111833358A/en
Publication of CN111833358A publication Critical patent/CN111833358A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic segmentation method and a semantic segmentation system based on 3D-YOLO, which respectively sort collected RGB images and depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence; convert the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture; take the three-dimensional point cloud picture as input, convert it into a three-dimensional feature tensor through a feature learning network, and input it into a 3D-Net network; and obtain a target three-dimensional position prediction frame through the 3D-Net network. The invention mainly aims to solve the problems that target detection algorithms for unmanned aerial vehicles run slowly and cannot generate three-dimensional labeling frames, and is mainly applied to the technical field of computer vision.

Description

Semantic segmentation method and system based on 3D-YOLO
Technical Field
The embodiment of the invention relates to the technical field of computer vision, in particular to a semantic segmentation method and a semantic segmentation system based on 3D-YOLO.
Background
When an unmanned aerial vehicle travels at high speed, obstacles usually appear in its field of view only for a short time, so the target detection algorithm must identify the specific type of obstacle quickly and accurately in order to respond in real time.
Currently, mainstream object detection algorithms such as R-CNN use a candidate region method to first generate possible candidate region boxes of an object on an image, and then obtain a label of the object by running a classifier in the candidate boxes. After classification is finished, the target enclosure frame is refined through back-end processing, repeated detection is eliminated, and the target is subdivided according to other objects in the scene. Because these processes all need to be trained separately, the whole target detection algorithm is slow and difficult to optimize, and cannot be applied to a system with high real-time requirements such as an unmanned aerial vehicle running at a high speed.
The YOLO network defines target detection as a single regression problem, and directly utilizes the characteristics of the whole image to perform target positioning and type judgment, but because a labeling frame obtained by the traditional YOLO algorithm is two-dimensional, the two-dimensional information of obstacles is not enough to construct a three-dimensional constraint condition in the flight of the unmanned aerial vehicle.
Disclosure of Invention
In view of this, the embodiment of the invention provides a semantic segmentation method and a semantic segmentation system based on 3D-YOLO, and mainly aims to solve the problems that a target detection algorithm for an unmanned aerial vehicle is slow in operation speed and cannot generate a three-dimensional labeling frame.
In order to solve the above problems, embodiments of the present invention mainly provide the following technical solutions:
in a first aspect, an embodiment of the present invention provides a semantic segmentation method based on 3D-YOLO, where the method includes:
respectively sequencing the acquired RGB images and the depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence;
converting the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture;
taking the three-dimensional point cloud image as input, converting the three-dimensional point cloud image into a three-dimensional feature tensor through a feature learning network, and inputting the three-dimensional point cloud image into a 3D-Net network;
and obtaining a target three-dimensional position prediction frame through the 3D-Net network.
Optionally, before converting the RGB image frame sequence and the depth image frame sequence into the three-dimensional point cloud map, the method further includes:
selecting a frame of image as a key frame in each frame with the preset number from the RGB image frame sequence and the depth image frame sequence;
the determined key frame sequence is retained and the other frame sequences are discarded.
Optionally, converting the RGB image frame sequence and the depth image frame sequence into the three-dimensional point cloud image includes:
and sequentially converting each frame in the RGB image coordinate system and the depth image coordinate system into each frame in the point cloud coordinate system according to the mapping relation between the point cloud coordinate system and the RGB image coordinate system and the depth image coordinate system for storing the key frames respectively.
Optionally, the step of converting the three-dimensional point cloud image into a three-dimensional feature tensor through a feature learning network by using the three-dimensional point cloud image as an input, and inputting the three-dimensional point cloud image into a 3D-Net network includes:
dividing the three-dimensional point cloud picture into a plurality of three-dimensional sub-grids;
and selecting a preset number of three-dimensional point cloud pictures as input of a feature learning network through random sampling, and converting the three-dimensional point cloud pictures into three-dimensional feature tensors.
Optionally, the converting the three-dimensional point cloud image into a three-dimensional feature tensor includes:
sequentially inputting the three-dimensional point cloud picture into a full connection layer, a ReLU activation function and a BN layer of a feature learning network to obtain point cloud features;
inputting the point cloud features into an Element-wise Maxpool layer to obtain locally aggregated (Locally Aggregated) features;
processing the local aggregation features and the point cloud features through a point cloud splicing layer to obtain four-dimensional cloud splicing features;
and reshaping the four-dimensional cloud splicing characteristics to obtain a three-dimensional characteristic tensor.
Optionally, obtaining the target three-dimensional position prediction frame through the 3D-Net network includes:
inputting the three-dimensional feature tensor into the 3D-Net network, and obtaining a feature tensor to be processed according to the three-dimensional feature tensor according to preset parameters and the number of preset prediction frames; wherein, the preset parameters include: the center of the prediction box, the length, width, height and the proportion of the prediction box, the confidence coefficient and the rotation angle of the box;
calculating the coordinates of the three-dimensional position prediction frame according to the preset parameters, the number of the preset prediction frames and the to-be-processed feature tensor;
normalizing the three-dimensional position prediction frame according to the length of the diagonal line of the three-dimensional position prediction frame;
and removing the overlapped three-dimensional position prediction frames by a non-maximum value inhibition method, wherein each sub grid generates a preset number of three-dimensional position prediction frames.
And performing loss calculation on the three-dimensional position prediction frame by minimizing a preset loss function, and minimizing the value of the loss function by utilizing a stochastic gradient descent algorithm to obtain the target three-dimensional position prediction frame.
In a second aspect, an embodiment of the present invention further provides a semantic segmentation system based on 3D-YOLO, where the system includes:
the sequencing unit is used for respectively sequencing the collected RGB images and the depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence;
the first conversion unit is used for converting the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture;
the second conversion unit is used for taking the three-dimensional point cloud picture as input and converting the three-dimensional point cloud picture into a three-dimensional feature tensor through a feature learning network;
the input unit is used for inputting the three-dimensional feature tensor into a 3D-Net network;
and the acquisition unit is used for acquiring the target three-dimensional position prediction frame through the 3D-Net network.
Optionally, the system further includes:
the selection unit is used for selecting one frame of image as a key frame from the RGB image frame sequence and the depth image frame sequence in each preset number of frames before the RGB image frame sequence and the depth image frame sequence are converted into the three-dimensional point cloud images by the first conversion unit;
a reservation unit for reserving the determined key frame sequence;
a discarding unit for discarding the other frame sequences.
Optionally, the first conversion unit is further configured to sequentially convert each frame in the RGB image coordinate system and the depth image coordinate system into each frame in the point cloud coordinate system according to mapping relationships between the point cloud coordinate system and the RGB image coordinate system and the depth image coordinate system storing the key frames, respectively.
Optionally, the second conversion unit includes:
the dividing module is used for dividing the three-dimensional point cloud picture into a plurality of three-dimensional sub-grids;
the selecting module is used for selecting a preset number of three-dimensional point cloud pictures as the input of the feature learning network through random sampling;
and the conversion module is used for converting the three-dimensional point cloud picture into a three-dimensional characteristic tensor.
Optionally, the conversion module includes:
the first input submodule is used for sequentially inputting the three-dimensional point cloud picture into a full connection layer, a ReLU activation function and a BN layer of a feature learning network to obtain point cloud features;
the second input sub-module is used for inputting the point cloud features into an Element-wise Maxpool layer to obtain locally aggregated (Locally Aggregated) features;
the processing submodule is used for processing the local aggregation characteristics and the point cloud characteristics through a point cloud splicing layer to obtain four-dimensional cloud splicing characteristics;
and the reshaping submodule is used for reshaping the four-dimensional cloud splicing characteristics to obtain three-dimensional characteristic tensor.
Optionally, the obtaining unit includes:
an input module for inputting the three-dimensional feature tensor into the 3D-Net network;
the processing module is used for obtaining a to-be-processed feature tensor according to the three-dimensional feature tensor according to preset parameters and the number of preset prediction frames; wherein, the preset parameters include: the center of the prediction box, the length, width, height and the proportion of the prediction box, the confidence coefficient and the rotation angle of the box;
the first calculation module is used for calculating the coordinates of the three-dimensional position prediction frame according to the preset parameters, the number of the preset prediction frames and the tensor of the features to be processed;
the normalization module is used for normalizing the three-dimensional position prediction frame according to the length of the diagonal line of the three-dimensional position prediction frame;
a removing module, configured to remove the overlapped three-dimensional position prediction frames by a non-maximum suppression method, where each sub-grid generates a preset number of three-dimensional position prediction frames;
the second calculation module is used for performing loss calculation on the three-dimensional position prediction frame by minimizing a preset loss function;
and the acquisition module is used for minimizing the value of the loss function by utilizing a stochastic gradient descent algorithm to obtain the target three-dimensional position prediction frame.
By the technical scheme, the technical scheme provided by the embodiment of the invention at least has the following advantages:
the semantic segmentation method and the semantic segmentation system based on 3D-YOLO provided by the embodiment of the invention respectively sort the acquired RGB images and the depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence; converting the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture; taking the three-dimensional point cloud picture as input, and converting the three-dimensional point cloud picture into a three-dimensional feature tensor through a feature learning network; and obtaining a target three-dimensional position prediction frame through the 3D-Net network. Compared with the prior art, on the basis of the traditional YOLO algorithm, the image depth information obtained by the binocular camera of the unmanned aerial vehicle is combined, semantic segmentation is realized, the three-dimensional marking frame is generated, the background error suppression capability is strong, the high generalization capability is realized, the 3D position of the obstacle in the space can be obtained, and the accuracy of target identification is improved.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention, and the embodiments of the present invention can be implemented according to the content of the description in order to make the technical means of the embodiments of the present invention more clearly understood, and the detailed description of the embodiments of the present invention is provided below in order to make the foregoing and other objects, features, and advantages of the embodiments of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the embodiments of the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart illustrating a semantic segmentation method based on 3D-YOLO according to an embodiment of the present invention;
FIG. 2 is a flow chart of another 3D-YOLO-based semantic segmentation method provided by the embodiment of the invention;
FIG. 3 is a diagram illustrating key frame selection according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a feature learning network provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a 3D-Net network according to an embodiment of the present invention;
FIG. 6 is a block diagram illustrating a 3D-YOLO-based semantic segmentation system provided by an embodiment of the present invention;
FIG. 7 is a block diagram illustrating another 3D-YOLO-based semantic segmentation system provided by an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention provides a 3D-YOLO-based semantic segmentation method, and mainly aims to solve the problems that existing target detection algorithms are slow and difficult to optimize, cannot be applied to systems with high real-time requirements such as an unmanned aerial vehicle operating at high speed, and can only generate two-dimensional labeling frames. In order to solve the above problems, an embodiment of the present invention provides a semantic segmentation method based on 3D-YOLO, as shown in fig. 1, the method includes:
101. and respectively sequencing the collected RGB images and the depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence.
The embodiment of the invention is mainly applied to an unmanned aerial vehicle flying at high speed, and accurately identifies the specific type of obstacle so as to respond in real time. In practical application, data is first acquired: the binocular camera of the unmanned aerial vehicle can be used to shoot the surrounding environment continuously, obtaining N frames of color (RGB) images and N frames of depth images. The N color images and the N depth images are each sorted by shooting time from earliest to latest, yielding an RGB image sequence C_1, C_2, C_3, ..., C_i, ..., C_N and a depth image sequence Cd_1, Cd_2, Cd_3, ..., Cd_i, ..., Cd_N.
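By way of illustration, the frame ordering of step 101 could be sketched in Python as follows; the Frame structure and its timestamp field are assumptions made for the example, not part of the original disclosure.

```python
# Minimal sketch: pair each RGB frame with its depth frame and sort both streams
# by capture timestamp, as in step 101. Field names are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Frame:
    timestamp: float   # shooting time reported by the binocular camera
    rgb_path: str      # path to the color image C_k
    depth_path: str    # path to the depth image Cd_k

def build_frame_sequences(frames: List[Frame]) -> Tuple[List[str], List[str]]:
    """Return the RGB and depth frame sequences ordered by shooting time."""
    ordered = sorted(frames, key=lambda f: f.timestamp)
    rgb_sequence = [f.rgb_path for f in ordered]      # C_1, C_2, ..., C_N
    depth_sequence = [f.depth_path for f in ordered]  # Cd_1, Cd_2, ..., Cd_N
    return rgb_sequence, depth_sequence
```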
102. And converting the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture.
In the embodiment of the invention, in order to identify the target detection object, the obtained RGB image frame sequence and depth image frame sequence need to be converted into the three-dimensional point cloud pictures CY_1, CY_2, CY_3, ..., CY_i, ..., CY_N. The specific transformation method may be implemented according to the mapping relation between the point cloud coordinate system and the RGB image coordinate system and the depth image coordinate system, and is not specifically limited.
103. And taking the three-dimensional point cloud image as input, converting the three-dimensional point cloud image into a three-dimensional feature tensor through a feature learning network, and inputting the three-dimensional feature tensor into a 3D-Net network.
Taking the three-dimensional point cloud picture as input, a large point cloud picture is divided into many small three-dimensional sub-grids, where each sub-grid has size (v_d × v_h × v_w), with v_d, v_h, v_w respectively the length, height and width of each sub-grid. If the size of a three-dimensional point cloud picture is D × H × W (D, H and W being respectively the length, height and width of the point cloud in the standard coordinate system), then the number of three-dimensional sub-grids is (D/v_d × H/v_h × W/v_w), and D' = D/v_d, H' = H/v_h, W' = W/v_w. When the sub-grids are processed, random sampling is used: T points are randomly selected as input each time. Through the feature learning network, the three-dimensional point cloud picture is converted into a three-dimensional feature tensor, which is input into the subsequent 3D-Net network.
104. And obtaining a target three-dimensional position prediction frame through the 3D-Net network.
Taking the three-dimensional point cloud feature tensor as input, a three-dimensional position prediction frame of the target is obtained through processing by the 3D-Net network, with 7 corresponding parameters: the three-dimensional center coordinates, length, width and height of the prediction frame in the world coordinate system, and the confidence p_i.
The semantic segmentation method based on 3D-YOLO provided by the embodiment of the invention respectively sorts the collected RGB images and depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence; converts the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture; takes the three-dimensional point cloud picture as input and converts it into a three-dimensional feature tensor through a feature learning network; and obtains a target three-dimensional position prediction frame through the 3D-Net network. Compared with the prior art, the method builds on the traditional YOLO algorithm and combines it with the image depth information obtained by the binocular camera of the unmanned aerial vehicle, thereby realizing semantic segmentation and generating three-dimensional labeling frames; it has strong background-error suppression and high generalization ability, can obtain the 3D position of obstacles in space, and improves the accuracy of target recognition.
As a refinement and extension of the above embodiments, in the embodiment of the present invention, one frame of image is selected as a key frame in each preset number of frames, the determined key frame sequence is retained, and other frame sequences are discarded. Loss calculation is performed on the three-dimensional position prediction frame by minimizing a preset loss function, and the value of the loss function is minimized using a stochastic gradient descent algorithm to obtain the target three-dimensional position prediction frame. This improves the calculation speed of the algorithm and achieves real-time performance. In order to implement the above functions, an embodiment of the present invention further provides a semantic segmentation method based on 3D-YOLO, and as shown in fig. 2, the method includes:
201. and respectively sequencing the collected RGB images and the depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence.
For the description of step 201, please refer to the detailed description of step 101, and the embodiments of the present invention are not described herein again.
202. In the RGB image frame sequence and the depth image frame sequence, one frame of image is selected as a key frame for each preset number of frames, the determined key frame sequence is reserved, and other frame sequences are discarded.
In the embodiment of the invention, because consecutive images of the same obstacle are highly similar, a large amount of information redundancy exists. When the images are marked on the navigation map, key frames can be selected according to a certain rule; all information in the key frames is retained and the information in other frames is discarded, which greatly reduces the redundancy of the information, reduces the data processing amount, and improves the calculation efficiency.
In practical application, the selection of key frames is related to the speed of the unmanned aerial vehicle: the faster the speed, the shorter the time an obstacle of a given size remains in the unmanned aerial vehicle's field of view and the fewer images of the obstacle exist, so the key frame selection interval can be reduced when the unmanned aerial vehicle moves at high speed in order to effectively obtain the spatial information of the obstacle. In the embodiment of the present invention, the preset number of frames may be set so that one image is selected as a key frame every 3 frames; the key frame image is kept and the other image frames are discarded. The specific preset number of frames is not limited. As shown in fig. 3, the unmanned aerial vehicle shoots an obstacle to obtain a sequence of images and selects one image every 3 frames as a key frame to reduce the data processing amount, obtaining an i-frame RGB image sequence Cx_1, Cx_2, Cx_3, ..., Cx_i and an i-frame depth image sequence Cxd_1, Cxd_2, Cxd_3, ..., Cxd_i.
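A minimal sketch of this key frame selection, assuming the fixed interval of 3 frames used in the example above (in practice the interval would be tuned to the flight speed):

```python
# Keep one key frame out of every `interval` frames and discard the rest.
def select_key_frames(rgb_sequence, depth_sequence, interval=3):
    rgb_keys = rgb_sequence[::interval]      # Cx_1, Cx_2, ..., Cx_i
    depth_keys = depth_sequence[::interval]  # Cxd_1, Cxd_2, ..., Cxd_i
    return rgb_keys, depth_keys
```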
203. And converting the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture.
According to the mapping relations between the point cloud coordinate system and the RGB image coordinate system and depth image coordinate system storing the key frames, each frame in the RGB image coordinate system and the depth image coordinate system is sequentially converted into a frame in the point cloud coordinate system. The specific conversion method is as follows:

x = D · x' / f_x,  y = D · y' / f_y,  z = D

where (x, y, z) are coordinates in the point cloud coordinate system, (x', y', 1) are coordinates in the image coordinate system, D is the value of the depth image, and f_x and f_y are the equivalent focal lengths of the camera. Each pixel of the RGB image sequence Cx_1, Cx_2, Cx_3, ..., Cx_i and the depth image sequence Cxd_1, Cxd_2, Cxd_3, ..., Cxd_i is processed by the above method to obtain the three-dimensional point cloud pictures CY_1, CY_2, CY_3, ..., CY_i.
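The per-pixel conversion can be sketched as follows, assuming the simple pinhole model written above with no principal-point offset (a full camera model would also subtract c_x and c_y before dividing by the focal lengths):

```python
# Minimal sketch of the pixel-to-point-cloud mapping of step 203.
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float) -> np.ndarray:
    """depth: (H, W) array of depth values D; returns an (H*W, 3) array of (x, y, z)."""
    h, w = depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))  # image coordinates x', y'
    z = depth
    x = xs * z / fx
    y = ys * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```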
204. And sequentially inputting the three-dimensional point cloud picture into a full connection layer, a ReLU activation function and a BN layer of a feature learning network to obtain point cloud features.
205. And inputting the point cloud features into an Element-wise Maxpool layer to obtain locally aggregated (Locally Aggregated) features.
206. And processing the local aggregation characteristics and the point cloud characteristics through a point cloud splicing layer to obtain four-dimensional cloud splicing characteristics.
207. And reshaping the four-dimensional cloud splicing characteristics to obtain a three-dimensional characteristic tensor, and inputting the three-dimensional characteristic tensor into a 3D-Net network.
Taking the three-dimensional point cloud picture as input, a large point cloud picture is divided into many small three-dimensional sub-grids, each of size (v_d × v_h × v_w); if the size of a three-dimensional point cloud picture is D × H × W, the number of three-dimensional sub-grids is (D/v_d × H/v_h × W/v_w). When the sub-grids are processed, random sampling is used and T points are randomly selected as input; through the feature learning network, the three-dimensional point cloud picture is converted into a three-dimensional feature tensor, which is input into the subsequent 3D-Net network.
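The sub-grid partition and the random sampling of T points per sub-grid could be sketched as below; the voxel size and the value of T are illustrative assumptions, not values from the patent.

```python
# Minimal sketch: bucket points into sub-grids of size (v_d, v_h, v_w) and keep at
# most T randomly sampled points per non-empty sub-grid.
from collections import defaultdict
import numpy as np

def voxelize(points: np.ndarray, voxel_size=(0.4, 0.2, 0.4), max_points=35):
    vd, vh, vw = voxel_size                 # length, height, width of each sub-grid
    buckets = defaultdict(list)
    for p in points:                        # p = (x, y, z) in the point cloud coordinate system
        key = (int(p[0] // vd), int(p[1] // vh), int(p[2] // vw))
        buckets[key].append(p)
    rng = np.random.default_rng()
    sampled = {}
    for key, pts in buckets.items():
        pts = np.asarray(pts)
        if len(pts) > max_points:           # random sampling down to T points
            pts = pts[rng.choice(len(pts), max_points, replace=False)]
        sampled[key] = pts
    return sampled
```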
Suppose a given voxel is the i-th non-empty voxel of the j-th three-dimensional point cloud picture, with each point in it represented by its three-dimensional point cloud coordinates. Converting this voxel yields the input information of the feature learning network, where the conversion uses the center coordinate of the i-th sub-grid.

In the embodiment of the present invention, the feature learning network framework shown in fig. 4 takes a non-empty voxel as input and outputs a C-dimensional feature tensor (C is an adaptive value); an empty voxel is converted into the zero tensor. After a voxel is input into the network, the point cloud features are obtained through a full connection layer, followed by a ReLU activation function and a BN layer; a locally aggregated feature is then generated through Element-wise Maxpool; the locally aggregated feature and the point cloud features are then combined and processed through the point cloud splicing layer to obtain a 4-dimensional point cloud splicing feature, which is reshaped into a three-dimensional point cloud feature tensor of size (H' × W' × C·D') for subsequent processing.
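A minimal PyTorch sketch of this feature learning block, under the assumption of illustrative layer sizes (only the FC → ReLU → BN → Element-wise Maxpool → splicing structure follows the description above):

```python
import torch
import torch.nn as nn

class FeatureLearningBlock(nn.Module):
    """Per-voxel point feature encoder: FC -> ReLU -> BN, element-wise max pooling,
    then splicing (concatenation) of the locally aggregated feature with each point."""
    def __init__(self, in_dim=7, out_dim=64):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, voxel_points):                              # (num_voxels, T, in_dim)
        n, t, _ = voxel_points.shape
        pointwise = torch.relu(self.fc(voxel_points))             # point cloud features
        pointwise = self.bn(pointwise.view(n * t, -1)).view(n, t, -1)
        aggregated = pointwise.max(dim=1, keepdim=True).values    # element-wise max pool
        aggregated = aggregated.expand(-1, t, -1)                 # locally aggregated feature
        return torch.cat([pointwise, aggregated], dim=-1)         # point cloud splicing
```

The resulting 4-dimensional splicing feature (voxels × T × channels per point cloud picture) would then be reshaped into the (H' × W' × C·D') tensor described above.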
208. And inputting the three-dimensional feature tensor into the 3D-Net network, and obtaining a feature tensor to be processed according to the three-dimensional feature tensor according to preset parameters and the number of preset prediction frames.
In the embodiment of the present invention, a schematic diagram of the 3D-Net network is shown in fig. 5; it consists of 13 convolutional layers and 3 max pooling layers. The three-dimensional feature tensor is input into the 3D-Net network, and according to the three-dimensional feature tensor, the preset parameters and the preset number of prediction frames, a to-be-processed feature tensor of size (H'/8 × W'/8 × B(8+K)) is obtained, where B denotes the number of prediction frames and 8 denotes the 8 obtained parameters: t_x, t_y, t_z respectively denote the offsets from the center position of the initial frame of the image; t_l, t_w, t_h respectively denote the offsets from the length, width and height of the initial frame of the image; t_θ denotes the offset of the frame rotation angle; K denotes the category of the frame, meaning that K classes of targets are detected in the task; and the confidences p_1, ..., p_K represent the prediction likelihood for each class of target.
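For orientation only, a 3D-Net-style backbone with 13 convolutional layers, 3 max pooling layers (hence the /8 spatial reduction) and a head producing B·(8+K) channels could be sketched as follows; the layer widths are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n):
    layers = []
    for _ in range(n):
        layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
        in_ch = out_ch
    return layers

class Net3D(nn.Module):
    def __init__(self, in_ch, num_boxes=2, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            *conv_block(in_ch, 64, 2), nn.MaxPool2d(2),    # H' -> H'/2
            *conv_block(64, 128, 3), nn.MaxPool2d(2),      # -> H'/4
            *conv_block(128, 256, 4), nn.MaxPool2d(2),     # -> H'/8
            *conv_block(256, 256, 3),                      # 2+3+4+3 = 12 conv layers so far
        )
        # final 1x1 conv (the 13th) maps to B*(8+K) channels: offsets, rotation, confidence, classes
        self.head = nn.Conv2d(256, num_boxes * (8 + num_classes), 1)

    def forward(self, x):                    # x: (N, C*D', H', W') reshaped feature tensor
        return self.head(self.features(x))   # (N, B*(8+K), H'/8, W'/8)
```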
209. And calculating the coordinates of the three-dimensional position prediction frame according to the preset parameters, the number of the preset prediction frames and the to-be-processed feature tensor.
210. And normalizing the three-dimensional position prediction frame according to the length of the diagonal line of the three-dimensional position prediction frame.
Based on the description of step 208, the coordinates of the three-dimensional position prediction frame are calculated as follows:

x = t_x · d_a + x_a
y = t_y · d_a + y_a
z = t_z · d_a + z_a
w = e^(t_w) + w_a
h = e^(t_h) + h_a
d = e^(t_l) + l_a
b_θ = t_θ + θ_a
p_i = sigmoid(p_1, ..., p_K)_j

where x_a, y_a, z_a respectively represent the coordinates of the center point of the initial frame of the image to be detected, and d_a = √(l_a² + w_a² + h_a²) represents the length of the diagonal of the labeling frame, which is used as a constraint condition to normalize the obtained three-dimensional position prediction frame. The center coordinates x, y, z of the prediction frame are obtained from the center coordinates of the initial frame and the offsets t_x, t_y, t_z; w_a, h_a, l_a represent the size of the initial frame of the image to be detected, and the prediction frame is corrected by the offsets t_l, t_w, t_h to obtain its length, width and height d, w, h; θ_a is the rotation angle of the initial frame and t_θ is the offset of the rotation angle; p_i is the confidence with which the prediction frame predicts each class of target (p_1, ..., p_K).
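A minimal sketch of decoding one prediction frame from the regressed offsets, following the relations listed above (the anchor quantities x_a, y_a, z_a, l_a, w_a, h_a, θ_a and the diagonal d_a are as defined there; the dictionary layout is an assumption made for the example, and the exact formulas in the patent may differ):

```python
import math

def decode_box(t: dict, anchor: dict):
    """t: offsets tx, ty, tz, tl, tw, th, ttheta; anchor: xa, ya, za, la, wa, ha, theta_a."""
    da = math.sqrt(anchor["la"] ** 2 + anchor["wa"] ** 2 + anchor["ha"] ** 2)  # diagonal length
    x = t["tx"] * da + anchor["xa"]
    y = t["ty"] * da + anchor["ya"]
    z = t["tz"] * da + anchor["za"]
    w = math.exp(t["tw"]) + anchor["wa"]          # size relations as written above
    h = math.exp(t["th"]) + anchor["ha"]
    d = math.exp(t["tl"]) + anchor["la"]
    theta = t["ttheta"] + anchor["theta_a"]       # rotation of the prediction frame
    return x, y, z, d, w, h, theta
```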
211. And removing the overlapped three-dimensional position prediction frames by a non-maximum value inhibition method, wherein each sub-grid predicts the three-dimensional position prediction frames with the preset number of prediction frames.
B three-dimensional frames are predicted in each sub-grid, and overlapped frames can be removed by the non-maximum suppression method to obtain the desired result. Finally, 7 parameters are generated: the three-dimensional center coordinates (x, y, z) in the world coordinate system, the length d, width w and height h of the prediction frame, and the confidence p_i.
Wherein B is a constant.
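A minimal sketch of the non-maximum suppression step, using an axis-aligned 3D IoU approximation (the rotation angle is ignored here for simplicity, which is an assumption and not necessarily the overlap criterion used in the patent):

```python
import numpy as np

def iou_3d(a, b):
    """a, b: (x, y, z, d, w, h) frames centered at (x, y, z), treated as axis-aligned."""
    ca, sa = np.asarray(a[:3], float), np.asarray(a[3:], float)
    cb, sb = np.asarray(b[:3], float), np.asarray(b[3:], float)
    overlap = np.clip(np.minimum(ca + sa / 2, cb + sb / 2) -
                      np.maximum(ca - sa / 2, cb - sb / 2), 0, None)
    inter = np.prod(overlap)
    union = np.prod(sa) + np.prod(sb) - inter
    return inter / union if union > 0 else 0.0

def nms_3d(boxes, scores, iou_threshold=0.5):
    """Keep the highest-confidence frames and drop overlapped ones."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou_3d(boxes[i], boxes[j]) < iou_threshold]
    return keep
```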
212. And performing loss calculation on the three-dimensional position prediction frame by minimizing a preset loss function, and minimizing the value of the loss function by utilizing a stochastic gradient descent algorithm to obtain the target three-dimensional position prediction frame.
In the training process, the loss calculation is performed on the three-dimensional position prediction frame by minimizing the preset loss function, so that the three-dimensional position prediction frame label meeting the condition can be obtained, and the specific calculation process is shown as follows.
L = Σ_{i=1}^{G} Σ_j ( λ_1 · L_ij^obj + λ_noobj · L_ij^noobj )

where G denotes the number of sub-grids, L_ij^obj denotes the loss of the j-th label frame in the i-th sub-grid, and L_ij^noobj denotes the loss when the grid contains no object; λ_1 and λ_noobj are the weight coefficients of the sub-grids with and without targets, respectively. The value of the loss function is minimized by the stochastic gradient descent algorithm to obtain the three-dimensional position prediction frame labels that satisfy the condition.
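A minimal PyTorch sketch of the training objective and the optimization step; the individual loss terms and the default weights are placeholders, since the text here only specifies the weighting by λ_1 and λ_noobj and the use of stochastic gradient descent.

```python
import torch

def weighted_detection_loss(obj_loss: torch.Tensor, noobj_loss: torch.Tensor,
                            lambda_1: float = 1.0, lambda_noobj: float = 0.5) -> torch.Tensor:
    """Combine the per-sub-grid losses for grids with and without targets."""
    return lambda_1 * obj_loss + lambda_noobj * noobj_loss

# Typical training step (model, obj_loss and noobj_loss are assumed to exist elsewhere):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# loss = weighted_detection_loss(obj_loss, noobj_loss)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```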
In summary, according to the semantic segmentation method based on 3D-YOLO, one frame of image is selected as a key frame in each preset number of frames, the determined key frame sequence is retained and the other frame sequences are discarded; the RGB image frame sequence and the depth image frame sequence are converted into a three-dimensional point cloud picture, which is converted into a three-dimensional feature tensor through a feature learning network and input into the 3D-Net network. The three-dimensional position prediction frames are normalized; the overlapped three-dimensional position prediction frames are removed by the non-maximum suppression method; loss calculation is performed on the three-dimensional position prediction frame by minimizing a preset loss function; and the value of the loss function is minimized using a stochastic gradient descent algorithm to obtain the target three-dimensional position prediction frame. The calculation speed of the algorithm is thus improved and real-time performance is achieved.
Further, as an implementation of the method shown in the above embodiment, another embodiment of the present invention further provides a semantic segmentation system based on 3D-YOLO. The system embodiment corresponds to the foregoing method embodiment, and details in the foregoing method embodiment are not repeated in this system embodiment for convenience of reading, but it should be clear that the system in this embodiment can correspondingly implement all the contents in the foregoing method embodiment.
An embodiment of the present invention provides a semantic segmentation system based on 3D-YOLO, as shown in fig. 6, including:
the sorting unit 31 is configured to sort the acquired RGB images and the depth images according to shooting time, respectively, to obtain an RGB image frame sequence and a depth image frame sequence;
a first conversion unit 32, configured to convert the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud image;
a second conversion unit 33, configured to use the three-dimensional point cloud image as an input, and convert the three-dimensional point cloud image into a three-dimensional feature tensor through a feature learning network;
an input unit 34, configured to input the three-dimensional feature tensor into a 3D-Net network;
and the obtaining unit 35 is configured to obtain the target three-dimensional position prediction frame through the 3D-Net network.
The semantic segmentation system based on 3D-YOLO provided by the embodiment of the invention respectively sorts the acquired RGB images and the depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence; converts the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture; takes the three-dimensional point cloud picture as input and converts it into a three-dimensional feature tensor through a feature learning network; and obtains a target three-dimensional position prediction frame through the 3D-Net network. Compared with the prior art, the system builds on the traditional YOLO algorithm and combines it with the image depth information obtained by the binocular camera of the unmanned aerial vehicle, thereby realizing semantic segmentation and generating three-dimensional labeling frames; it has strong background-error suppression and high generalization ability, can obtain the 3D position of obstacles in space, and improves the accuracy of target recognition.
Further, as shown in fig. 7, the system further includes:
a selecting unit 36, configured to select, before the first converting unit 32 converts the RGB image frame sequence and the depth image frame sequence into the three-dimensional point cloud image, one frame of image as a key frame for each preset number of frames in the RGB image frame sequence and the depth image frame sequence;
a reservation unit 37 for reserving the determined key frame sequence;
a dropping unit 38 for dropping the other frame sequences.
Further, as shown in fig. 7, the first converting unit 32 is further configured to sequentially convert each frame in the RGB image coordinate system and the depth image coordinate system into each frame in the point cloud coordinate system according to mapping relationships between the point cloud coordinate system and the RGB image coordinate system and the depth image coordinate system of the stored key frame, respectively.
Further, as shown in fig. 7, the second conversion unit 33 includes:
a dividing module 331, configured to divide the three-dimensional point cloud graph into a plurality of three-dimensional sub-grids;
a selecting module 332, configured to select a preset number of three-dimensional point cloud images as input of the feature learning network through random sampling;
a converting module 333, configured to convert the three-dimensional point cloud graph into a three-dimensional feature tensor.
Further, as shown in fig. 7, the conversion module 333 includes:
the first input submodule 3331 is configured to sequentially input the three-dimensional point cloud image into a full connection layer, a ReLU activation function and a BN layer of a feature learning network to obtain point cloud features;
the second input sub-module 3332 is used for inputting the point cloud features into an Element-wise Maxpool layer to obtain Locally polymerized Locally Aggregated features;
the processing sub-module 3333 is configured to process the local aggregation features and the point cloud features through a point cloud splicing layer to obtain four-dimensional cloud splicing features;
and the reshaping submodule 3334 is used for reshaping the four-dimensional cloud splicing characteristics to obtain a three-dimensional characteristic tensor.
Further, as shown in fig. 7, the acquiring unit 35 includes:
an input module 351, configured to input the three-dimensional feature tensor into the 3D-Net network;
the processing module 352 is configured to obtain a to-be-processed feature tensor according to the three-dimensional feature tensor according to a preset parameter and a preset number of prediction frames; wherein, the preset parameters include: the center of the prediction box, the length, width, height and the proportion of the prediction box, the confidence coefficient and the rotation angle of the box;
the first calculating module 353 is configured to calculate coordinates of a three-dimensional position prediction frame according to the preset parameters, the preset number of prediction frames, and the to-be-processed feature tensor;
the normalizing module 354 is used for normalizing the three-dimensional position prediction frame according to the length of the diagonal line of the three-dimensional position prediction frame;
a removal module 355 for removing the overlapped three-dimensional position prediction frames by a non-maximum suppression method, wherein each sub-grid generates a predetermined number of three-dimensional position prediction frames.
A second calculation module 356 for performing a loss calculation on the three-dimensional position prediction box by minimizing a preset loss function,
an obtaining module 357, configured to obtain the target three-dimensional position prediction box by using a random gradient descent algorithm to minimize a value of the loss function.
In summary, the semantic segmentation system based on 3D-YOLO selects one frame of image as a key frame in each preset number of frames, retains the determined key frame sequence and discards the other frame sequences; converts the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture, converts the three-dimensional point cloud picture into a three-dimensional feature tensor through a feature learning network, and inputs the three-dimensional feature tensor into the 3D-Net network. The three-dimensional position prediction frames are normalized; the overlapped three-dimensional position prediction frames are removed by the non-maximum suppression method; loss calculation is performed on the three-dimensional position prediction frame by minimizing a preset loss function; and the value of the loss function is minimized using a stochastic gradient descent algorithm to obtain the target three-dimensional position prediction frame. The calculation speed of the algorithm is thus improved and real-time performance is achieved.
Since the 3D-YOLO-based semantic segmentation system described in this embodiment is a system that can execute the 3D-YOLO-based semantic segmentation method in the embodiment of the present invention, based on the 3D-YOLO-based semantic segmentation method described in the embodiment of the present invention, those skilled in the art can understand the specific implementation manner and various variations of the 3D-YOLO-based semantic segmentation system in this embodiment, and therefore, how the 3D-YOLO-based semantic segmentation system implements the various 3D-YOLO-based semantic segmentation methods in the embodiment of the present invention is not described in detail herein. As long as those skilled in the art implement the system adopted by the 3D-YOLO-based semantic segmentation method in the embodiment of the present invention, the system is within the scope of the present application.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a system for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction system which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal or a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included within the scope of the claims of the present application.

Claims (10)

1. A semantic segmentation method based on 3D-YOLO is characterized by comprising the following steps:
respectively sequencing the collected RGB images and the collected depth images according to shooting time to obtain an RGB image frame sequence and a depth image frame sequence;
converting the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture;
taking the three-dimensional point cloud image as input, converting the three-dimensional point cloud image into a three-dimensional feature tensor through a feature learning network, and inputting the three-dimensional point cloud image into a 3D-Net network;
and obtaining a target three-dimensional position prediction frame through the 3D-Net network.
2. The method of claim 1, wherein prior to converting the sequence of RGB image frames and the sequence of depth image frames into a three-dimensional point cloud map, the method further comprises:
selecting one frame of image as a key frame from each frame with preset number in the RGB image frame sequence and the depth image frame sequence;
the determined key frame sequence is retained and the other frame sequences are discarded.
3. The method of claim 2, wherein converting the sequence of RGB image frames and the sequence of depth image frames into a three-dimensional point cloud map comprises:
and sequentially converting each frame in the RGB image coordinate system and the depth image coordinate system into each frame in the point cloud coordinate system according to the mapping relation between the point cloud coordinate system and the RGB image coordinate system and the depth image coordinate system for storing the key frames respectively.
4. The method of claim 1, wherein taking the three-dimensional point cloud graph as input, converting the three-dimensional point cloud graph into a three-dimensional feature tensor through a feature learning network, and inputting into a 3D-Net network comprises:
dividing the three-dimensional point cloud picture into a plurality of three-dimensional sub-grids;
and selecting a preset number of three-dimensional point cloud pictures as input of a feature learning network through random sampling, and converting the three-dimensional point cloud pictures into three-dimensional feature tensors.
5. The method of claim 4, wherein the converting the three-dimensional point cloud graph into a three-dimensional feature tensor comprises:
sequentially inputting the three-dimensional point cloud picture into a full connection layer, a ReLU activation function and a BN layer of a feature learning network to obtain point cloud features;
inputting the point cloud features into an Element-wise Maxpool layer to obtain locally aggregated (Locally Aggregated) features;
processing the local aggregation features and the point cloud features through a point cloud splicing layer to obtain four-dimensional cloud splicing features;
and reshaping the four-dimensional cloud splicing characteristics to obtain a three-dimensional characteristic tensor.
6. The method of claim 4, wherein obtaining a target three-dimensional position prediction box by the 3D-Net network comprises:
inputting the three-dimensional feature tensor into the 3D-Net network, and obtaining a feature tensor to be processed according to the three-dimensional feature tensor according to preset parameters and the number of preset prediction frames; wherein, the preset parameters include: the center of the prediction box, the length, width, height and the proportion of the prediction box, the confidence coefficient and the rotation angle of the box;
calculating the coordinates of the three-dimensional position prediction frame according to the preset parameters, the number of the preset prediction frames and the to-be-processed feature tensor;
normalizing the three-dimensional position prediction frame according to the length of the diagonal line of the three-dimensional position prediction frame;
removing the overlapped three-dimensional position prediction frames by a non-maximum value inhibition method, wherein each sub-grid generates a preset number of three-dimensional position prediction frames;
and performing loss calculation on the three-dimensional position prediction frame by minimizing a preset loss function, and minimizing the value of the loss function by utilizing a stochastic gradient descent algorithm to obtain the target three-dimensional position prediction frame.
7. A semantic segmentation system based on 3D-YOLO, comprising:
the sequencing unit is used for respectively sequencing the collected RGB images and the depth images according to the shooting time to obtain an RGB image frame sequence and a depth image frame sequence;
the first conversion unit is used for converting the RGB image frame sequence and the depth image frame sequence into a three-dimensional point cloud picture;
the second conversion unit is used for taking the three-dimensional point cloud picture as input and converting the three-dimensional point cloud picture into a three-dimensional feature tensor through a feature learning network;
the input unit is used for inputting the three-dimensional feature tensor into a 3D-Net network;
and the acquisition unit is used for acquiring the target three-dimensional position prediction frame through the 3D-Net network.
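Read as a data flow, the units of claim 7 chain together roughly as sketched below; the class, method and argument names are illustrative only, and the actual conversion and detection steps are delegated to components such as those sketched earlier.

```python
class Semantic3DYoloPipeline:
    """Illustrative wiring of the claimed units into one pipeline; the heavy lifting is
    delegated to injected callables (names and interfaces are assumptions)."""

    def __init__(self, to_point_cloud, feature_net, detector_3d):
        self.to_point_cloud = to_point_cloud   # first conversion unit
        self.feature_net = feature_net         # second conversion unit (feature learning network)
        self.detector_3d = detector_3d         # input unit + acquisition unit (3D-Net)

    def run(self, rgb_frames, depth_frames):
        # sorting unit: order both streams by shooting time
        rgb_seq = sorted(rgb_frames, key=lambda f: f["timestamp"])
        depth_seq = sorted(depth_frames, key=lambda f: f["timestamp"])
        cloud = self.to_point_cloud(rgb_seq, depth_seq)    # three-dimensional point cloud picture
        tensor = self.feature_net(cloud)                   # three-dimensional feature tensor
        return self.detector_3d(tensor)                    # target three-dimensional position prediction frames
```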
8. The system of claim 7, further comprising:
the selection unit is used for selecting, from every preset number of frames of the RGB image frame sequence and the depth image frame sequence, one frame as a key frame before the first conversion unit converts the sequences into the three-dimensional point cloud picture;
the retaining unit is used for retaining the determined key frame sequence;
and the discarding unit is used for discarding the remaining frames.
9. The system of claim 7, wherein the first conversion unit is further configured to sequentially convert each key frame stored in the RGB image coordinate system and the depth image coordinate system into a frame in the point cloud coordinate system according to the mapping relations between the point cloud coordinate system and, respectively, the RGB image coordinate system and the depth image coordinate system.
10. The system of claim 7, wherein the second conversion unit comprises:
the dividing module is used for dividing the three-dimensional point cloud picture into a plurality of three-dimensional sub-grids;
the selection module is used for selecting, by random sampling, a preset number of three-dimensional point cloud pictures as the input of the feature learning network;
and the conversion module is used for converting the three-dimensional point cloud picture into a three-dimensional characteristic tensor.
CN202010593311.9A 2020-06-26 2020-06-26 Semantic segmentation method and system based on 3D-YOLO Pending CN111833358A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010593311.9A CN111833358A (en) 2020-06-26 2020-06-26 Semantic segmentation method and system based on 3D-YOLO

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010593311.9A CN111833358A (en) 2020-06-26 2020-06-26 Semantic segmentation method and system based on 3D-YOLO

Publications (1)

Publication Number Publication Date
CN111833358A (en) 2020-10-27

Family

ID=72899439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010593311.9A Pending CN111833358A (en) 2020-06-26 2020-06-26 Semantic segmentation method and system based on 3D-YOLO

Country Status (1)

Country Link
CN (1) CN111833358A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109658373A (en) * 2017-10-10 2019-04-19 中兴通讯股份有限公司 A kind of method for inspecting, equipment and computer readable storage medium
CN109410307A (en) * 2018-10-16 2019-03-01 大连理工大学 A kind of scene point cloud semantic segmentation method
US10297070B1 (en) * 2018-10-16 2019-05-21 Inception Institute of Artificial Intelligence, Ltd 3D scene synthesis techniques using neural network architectures
CN109948661A (en) * 2019-02-27 2019-06-28 江苏大学 A kind of 3D vehicle checking method based on Multi-sensor Fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WALEED ALI et al.: "YOLO3D: End-to-End Real-Time 3D Oriented Object Bounding Box Detection From LiDAR Point Cloud", Proceedings of the European Conference on Computer Vision (ECCV) *
YIN ZHOU et al.: "VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418288A (en) * 2020-11-17 2021-02-26 武汉大学 GMS and motion detection-based dynamic vision SLAM method
CN112990293A (en) * 2021-03-10 2021-06-18 深圳一清创新科技有限公司 Point cloud marking method and device and electronic equipment
CN112990293B (en) * 2021-03-10 2024-03-29 深圳一清创新科技有限公司 Point cloud labeling method and device and electronic equipment
CN113449744A (en) * 2021-07-15 2021-09-28 东南大学 Three-dimensional point cloud semantic segmentation method based on depth feature expression
CN113848884A (en) * 2021-09-07 2021-12-28 华侨大学 Unmanned engineering machinery decision method based on feature fusion and space-time constraint
CN113848884B (en) * 2021-09-07 2023-05-05 华侨大学 Unmanned engineering machinery decision method based on feature fusion and space-time constraint

Similar Documents

Publication Publication Date Title
CN111833358A (en) Semantic segmentation method and system based on 3D-YOLO
CN114708585B (en) Attention mechanism-based millimeter wave radar and vision fusion three-dimensional target detection method
CN110222626B (en) Unmanned scene point cloud target labeling method based on deep learning algorithm
CN113159151A (en) Multi-sensor depth fusion 3D target detection method for automatic driving
JP2018022360A (en) Image analysis device, image analysis method and program
CN110119679B (en) Object three-dimensional information estimation method and device, computer equipment and storage medium
CN111814753A (en) Target detection method and device under foggy weather condition
CN111986472B (en) Vehicle speed determining method and vehicle
CN113761999A (en) Target detection method and device, electronic equipment and storage medium
CN112734931B (en) Method and system for assisting point cloud target detection
CN114764778A (en) Target detection method, target detection model training method and related equipment
CN114463736A (en) Multi-target detection method and device based on multi-mode information fusion
CN112802197A (en) Visual SLAM method and system based on full convolution neural network in dynamic scene
CN116246119A (en) 3D target detection method, electronic device and storage medium
CN114638996A (en) Model training method, device, equipment and storage medium based on counterstudy
CN117173399A (en) Traffic target detection method and system of cross-modal cross-attention mechanism
CN117037141A (en) 3D target detection method and device and electronic equipment
Pinard et al. End-to-end depth from motion with stabilized monocular videos
CN114913519B (en) 3D target detection method and device, electronic equipment and storage medium
KR101919879B1 (en) Apparatus and method for correcting depth information image based on user's interaction information
Hernandez et al. 3D-DEEP: 3-Dimensional Deep-learning based on elevation patterns for road scene interpretation
Zhang et al. Boosting the speed of real-time multi-object trackers
Kim et al. LiDAR Based 3D object detection using CCD information
CN114170267A (en) Target tracking method, device, equipment and computer readable storage medium
Akın et al. Challenges in determining the depth in 2-d images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201027