CN117036836A - Target detection method and device based on point cloud sequence, storage medium and terminal - Google Patents

Target detection method and device based on point cloud sequence, storage medium and terminal

Info

Publication number
CN117036836A
Authority
CN
China
Prior art keywords
target
point cloud
detected
frame
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310803335.6A
Other languages
Chinese (zh)
Inventor
Huang Chao (黄超)
Hu Tao (胡韬)
Yao Weilong (姚为龙)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xiantu Intelligent Technology Co Ltd
Original Assignee
Shanghai Xiantu Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xiantu Intelligent Technology Co Ltd filed Critical Shanghai Xiantu Intelligent Technology Co Ltd
Priority to CN202310803335.6A priority Critical patent/CN117036836A/en
Publication of CN117036836A publication Critical patent/CN117036836A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/60Analysis of geometric attributes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

A target detection method and device based on a point cloud sequence, a storage medium and a terminal, wherein the method comprises the following steps: performing preliminary target detection on each frame of point cloud in the point cloud sequence to obtain target detection frames in each frame of point cloud, wherein a target detection frame is used for indicating a target to be detected in the point cloud to which it belongs; tracking the trajectories of the target detection frames in each frame of point cloud to determine the target detection frames indicating the same target to be detected in each frame of point cloud; for each target to be detected, performing coordinate space transformation on each target detection frame indicating the target to be detected, so as to obtain a transformation detection frame corresponding to each target detection frame in a specified coordinate system; and extracting features based on the obtained multiple transformation detection frames, and inputting the extracted features into a pre-trained detection model to obtain the size and the pose of the target to be detected at a target moment. By adopting this scheme, the accuracy of target detection based on point cloud data is improved.

Description

Target detection method and device based on point cloud sequence, storage medium and terminal
Technical Field
The present invention relates to the field of point cloud target detection technologies, and in particular, to a target detection method and apparatus based on a point cloud sequence, a storage medium, and a terminal.
Background
With the development of three-dimensional sensing technology, point cloud data is widely applied to various fields, such as automatic driving, augmented reality, geographic information systems and the like. In a large number of practical applications, it is necessary to accurately identify or detect a target object in a point cloud in order to effectively use such data.
Currently, researchers have adopted various approaches to detect target objects in point clouds, for example, deep learning-based methods (in particular, convolutional neural network-based target detection methods), transfer learning-based methods, unsupervised clustering-based methods, and the like. Existing point cloud-based target detection methods generally take a single-frame point cloud as input. However, because point cloud data suffer from sparsity, susceptibility to noise interference, and occlusion and blurring caused by object motion, the detection accuracy of existing single-frame target detection methods still needs to be improved, and false detections and missed detections readily occur.
Disclosure of Invention
The technical problem solved by the embodiment of the invention is how to improve the accuracy of target detection based on point cloud data.
In order to solve the above technical problems, an embodiment of the present invention provides a target detection method based on a point cloud sequence, including the following steps: performing preliminary target detection on each frame of point cloud in the point cloud sequence to obtain target detection frames in each frame of point cloud, wherein a target detection frame is used for indicating a target to be detected in the point cloud to which it belongs; tracking the trajectories of the target detection frames in each frame of point cloud to determine the target detection frames indicating the same target to be detected in each frame of point cloud; for each target to be detected, performing coordinate space transformation on each target detection frame indicating the target to be detected, so as to obtain a transformation detection frame corresponding to each target detection frame in a specified coordinate system; and extracting features based on the obtained multiple transformation detection frames, and inputting the extracted features into a pre-trained detection model to obtain the size and the pose of the target to be detected at a target moment.
Optionally, the type of the target to be detected is selected from a static target and a dynamic target; the feature extraction based on the obtained plurality of transformation detection frames comprises: determining the type of the target to be detected; if the type of the target to be detected is a dynamic target, obtaining a transformation point cloud sequence according to the time sequence of the transformation detection frames; and performing first feature extraction on the transformation point cloud sequence to obtain first point cloud features, wherein the first point cloud features comprise size features and pose features of the transformation point cloud sequence.
Optionally, the pre-trained detection model is a pre-trained dynamic target detection model; before the extracted features are input into the pre-trained detection model, the method further comprises: constructing a first training data set using the sizes and poses of the target detection frames of the same dynamic target to be detected at a plurality of moments within a first preset period; and training a preset initial dynamic target detection model with the first training data set to obtain the pre-trained dynamic target detection model.
Optionally, the target detection frames of the same dynamic target to be detected at the plurality of moments within the first preset period are selected from any one of the following: the actual target detection frames of the same dynamic target to be detected at a plurality of non-occluded moments within the first preset period, together with at least one annotated target detection frame at an occluded moment; or the actual target detection frames of the same dynamic target to be detected at a plurality of historical moments within the first preset period, together with an annotated target detection frame at a future moment.
Optionally, the method further comprises: if the type of the target to be detected is a static target, performing point cloud accumulation on the point clouds of the transformation detection frames to obtain a point cloud accumulation result; and performing second feature extraction on the point cloud accumulation result to obtain a second point cloud feature, wherein the second point cloud feature comprises size features and pose features of the point cloud accumulation result.
Optionally, the pre-trained detection model is a pre-trained static target detection model; before the extracted features are input into the pre-trained detection model, the method further comprises: constructing a second training data set using the sizes and poses of the target detection frames of the same static target to be detected at a plurality of moments within a second preset period; and training a preset initial static target detection model with the second training data set to obtain the pre-trained static target detection model.
Optionally, determining the type of the target to be detected includes: determining the distance between the positions of each pair of adjacent transformation detection frames indicating the target to be detected, and recording the distance as a target distance; if the sum of the determined target distances is smaller than a first preset threshold value, confirming that the type of the target to be detected is a static target; and if the sum of the determined target distances is greater than or equal to the first preset threshold value, confirming that the type of the target to be detected is a dynamic target.
Optionally, determining the type of the target to be detected includes: determining the intersection-over-union (IoU) of each pair of adjacent transformation detection frames indicating the target to be detected, and recording it as a target IoU; if the average value of the determined target IoUs is greater than or equal to a second preset threshold, confirming that the type of the target to be detected is a static target; and if the average value of the determined target IoUs is smaller than the second preset threshold, confirming that the type of the target to be detected is a dynamic target.
Optionally, the algorithm for tracking the trajectories of the target detection frames in each frame of point cloud is selected from: the SimpleTrack algorithm, the DeepSORT algorithm, the FairMOT algorithm, and the GraphNN-MOT algorithm.
The embodiment of the invention also provides a target detection device based on a point cloud sequence, comprising: a first detection module, configured to perform preliminary target detection on each frame of point cloud in the point cloud sequence to obtain target detection frames in each frame of point cloud, wherein a target detection frame is used for indicating a target to be detected in the point cloud to which it belongs; a track tracking module, configured to track the trajectories of the target detection frames in each frame of point cloud to determine the target detection frames of the same target to be detected in each frame of point cloud; a coordinate transformation module, configured to, for each target to be detected, perform coordinate space transformation on each target detection frame indicating the target to be detected to obtain a transformation detection frame corresponding to each target detection frame in a specified coordinate system; and a second detection module, configured to perform feature extraction based on the obtained multiple transformation detection frames, and input the extracted features into a pre-trained detection model to obtain the size and the pose of the target to be detected at the target moment.
The embodiment of the invention also provides a storage medium, on which a computer program is stored, which when being executed by a processor, performs the steps of the target detection method based on the point cloud sequence.
The embodiment of the invention also provides a terminal, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the steps of the target detection method based on the point cloud sequence when running the computer program.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
in the embodiment of the invention, a point cloud sequence is taken as input and trajectory tracking is performed to determine the target detection frames indicating the same target to be detected in each frame of point cloud of the point cloud sequence; then, for each target to be detected, coordinate space transformation is performed on each target detection frame indicating the target to be detected, so as to obtain a transformation detection frame corresponding to each target detection frame in a specified coordinate system; and features are extracted based on the obtained multiple transformation detection frames and input into a pre-trained detection model to obtain the size and the pose of the target to be detected at the target moment.
Compared with the prior art, in which target detection based on a single-frame point cloud has low detection accuracy, this embodiment takes a point cloud sequence containing multiple frames of point clouds as input data, and performs feature extraction and target detection on the plurality of transformation detection frames obtained through trajectory tracking and coordinate transformation. On the one hand, multi-frame point clouds have a higher point density than a single-frame point cloud, which improves the accuracy of the size and pose of the target to be detected obtained by feature extraction and subsequent detection. On the other hand, a plurality of time-ordered transformation detection frames indicating the same target to be detected can be obtained through trajectory tracking and coordinate transformation, and feature extraction can then yield feature information containing the time-ordered state changes of the target to be detected, so that an accurate pose and size of the target to be detected at the designated target moment can be obtained through the pre-trained detection model. The target moment can be a moment when the target to be detected is occluded, so as to handle targets that are occluded in part of the point cloud frames; alternatively, the target moment may be a future moment, so as to accurately predict the future state of the target to be detected.
Further, the feature extraction based on the obtained plurality of transformation detection frames includes: determining the type of the target to be detected; if the type of the target to be detected is a dynamic target, obtaining a transformation point cloud sequence according to the time sequence of the transformation detection frames; and performing first feature extraction on the transformation point cloud sequence to obtain first point cloud features, wherein the first point cloud features comprise size features and pose features of the transformation point cloud sequence.
In the embodiment of the invention, for a dynamic target, feature information reflecting how the pose and size of the dynamic target change over time is obtained by performing feature extraction on a plurality of time-ordered transformation detection frames. Further, a pre-trained dynamic target detection model is subsequently used to obtain the size and pose of the dynamic target at the target moment (e.g., an occluded moment, a future moment, or another designated moment). This helps to solve the problem of target detection when the target is occluded, avoid missed detections, improve the accuracy of target detection, and achieve accurate prediction of the future state of the dynamic target.
Further, if the type of the target to be detected is a static target, point cloud accumulation is performed on the point clouds of the transformation detection frames to obtain a point cloud accumulation result; and second feature extraction is performed on the point cloud accumulation result to obtain a second point cloud feature, wherein the second point cloud feature comprises size features and pose features of the point cloud accumulation result.
In the embodiment of the invention, for a static target, features are extracted from the accumulated point clouds of the plurality of transformation detection frames; since the accumulated point cloud has a higher density than a single-frame point cloud, more accurate features reflecting the pose and size of the static target can be extracted. This further helps to accurately obtain the size and pose of the static target using a pre-trained static target detection model.
Further, in the embodiment of the present invention, by introducing the annotated target detection frames at occluded moments into the first training data set as label data, the initial dynamic target detection model can learn, through training, the mapping relationship between the poses and sizes of the target detection frames of the same dynamic target at a plurality of non-occluded moments within a period and the pose and size of the target detection frame at the occluded moment. Further, by adopting the pre-trained dynamic target detection model obtained through training, the pose and size of a dynamic target at a moment when it is occluded within a period can be obtained based on a plurality of input time-ordered poses and sizes of the dynamic target within that period. Therefore, the problem that target detection or recognition cannot be achieved in point cloud-based target detection because the target to be detected is occluded can be solved, missed detections are avoided, and the accuracy of target detection is improved.
Further, in the embodiment of the present invention, the annotated target detection frames at future moments are introduced into the first training data set as label data, so that the initial dynamic target detection model can learn, through training, the mapping relationship between the poses and sizes of the target detection frames of the same dynamic target at a plurality of historical moments within a period and the pose and size of the target detection frame at a future moment. Further, by adopting the pre-trained dynamic target detection model obtained through training, the pose and size of a dynamic target at a future moment after the historical moments can be predicted based on a plurality of input time-ordered poses and sizes of the dynamic target at historical moments within a period. Therefore, the future pose and size of the target to be detected can be predicted based on its historical poses and sizes, achieving accurate prediction of the motion state of the target.
Drawings
FIG. 1 is a flow chart of a target detection method based on a point cloud sequence in an embodiment of the invention;
FIG. 2 is a flow chart of one embodiment of step S14 of FIG. 1;
FIG. 3 is a flow chart of another embodiment of step S14 of FIG. 1;
Fig. 4 is a schematic structural diagram of a target detection device based on a point cloud sequence in an embodiment of the present invention.
Detailed Description
In order to make the above objects, features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
Referring to fig. 1, fig. 1 is a flowchart of a target detection method based on a point cloud sequence in an embodiment of the present invention. The method may include steps S11 to S14:
step S11: performing target preliminary detection on each frame of point cloud in the point cloud sequence to obtain a target detection frame in each frame of point cloud, wherein the target detection frame is used for indicating a target to be detected in the point cloud to which the target detection frame belongs;
step S12: tracking the trajectories of the target detection frames in each frame of point cloud to determine the target detection frames indicating the same target to be detected in each frame of point cloud;
step S13: for each target to be detected, carrying out coordinate space transformation on each target detection frame indicating the target to be detected so as to obtain a transformation detection frame corresponding to each target detection frame in a designated coordinate system;
step S14: extracting features based on the obtained multiple transformation detection frames, and inputting the extracted features into a pre-trained detection model to obtain the size and the pose of the target to be detected at a target moment.
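As an illustrative, non-limiting sketch, the overall flow of steps S11 to S14 may be organized as follows in Python. The callables passed into the function (detect_single_frame, track_boxes, transform_to_frame, extract_features, detection_model) are hypothetical placeholders rather than part of the disclosed implementation; per-step examples are given in the corresponding sections below.

```python
def detect_from_sequence(frames, frame_poses, detect_single_frame, track_boxes,
                         transform_to_frame, extract_features, detection_model,
                         target_time):
    """frames: list of per-frame point clouds; frame_poses: per-frame sensor poses
    used to map each frame into the specified coordinate system."""
    # Step S11: preliminary per-frame target detection.
    per_frame_boxes = [detect_single_frame(points) for points in frames]

    # Step S12: trajectory tracking links the boxes that indicate the same target.
    tracks = track_boxes(per_frame_boxes)  # {track_id: [(frame_idx, box), ...]}

    results = {}
    for track_id, track in tracks.items():
        # Step S13: coordinate space transformation into the specified system.
        transformed = [transform_to_frame(box, frame_poses[i]) for i, box in track]

        # Step S14: feature extraction and regression of the size and pose
        # at the target time with a pre-trained detection model.
        features = extract_features(transformed)
        results[track_id] = detection_model(features, target_time)
    return results
```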
In a specific implementation of step S11, the point cloud sequence may be a time-sequential point cloud sequence formed by a plurality of frames of point clouds obtained by performing point cloud acquisition on a target scene by using a laser sensor (e.g., a laser radar). Each frame of point clouds in the point cloud sequence has a respective acquisition time (or sampling time).
Wherein the target scene may be selected from, but is not limited to: highways, parking lots, communities, schools, parks, etc. The target scene typically includes one or more target objects (referred to as targets to be detected in this embodiment), such as pedestrians, vehicles, traffic signs, animals, and the like.
Further, the type of the object to be detected may be selected from a static object and a dynamic object. The static target may refer to a target with a constant position, orientation and size in each frame of point cloud of the point cloud sequence; the dynamic target may refer to a target in which one or more of position, orientation and size are changed in each frame point cloud of the point cloud sequence. In practical applications, targets with small variations in one or more of position, orientation, and size may also be considered static targets.
It should be noted that, whether the positions and orientations of the static target and the dynamic target are changed are relative to the same coordinate system, for example, a world coordinate system, a local coordinate system where the laser radar for collecting the point cloud is located, or other preset suitable coordinate systems.
Specifically, if, at the times corresponding to the frames of the point cloud sequence, the position, orientation and size of a target A remain unchanged (i.e., the target is in a static state), the type of target A can be determined to be a static target. For another example, if the position and/or orientation and/or size of a target B change in at least two frames of the point cloud sequence, the type of target B can be determined to be a dynamic target.
For a detailed scheme for determining that the type of the object to be detected is a static object or a dynamic object, refer to the relevant content of the object type determining scheme in step S21 in the embodiment shown in fig. 2, and details are not repeated here.
Specifically, the preliminary target detection on each frame of point cloud in the point cloud sequence can adopt various conventional target detection methods based on point cloud data. As a non-limiting example, the PV-RCNN algorithm (Point-Voxel Feature Set Abstraction for 3D Object Detection) may be used for the preliminary target detection.
Specifically, the PV-RCNN algorithm mainly comprises the following steps: firstly converting point cloud data into voxel representations by using a voxelization mode, and then extracting local features of each voxel; performing secondary aggregation on the local features of each voxel to obtain global features to describe each target to be detected; classifying each voxel into a target class or background by using a target classification network; and finally, carrying out regression prediction on the state information such as the position (for example, the geometric center point or the mass center representation), the orientation, the size (for example, the length, width and height representation) and the like of each object to be detected by using a regression network. Wherein the predicted position, orientation (the position and orientation may be collectively referred to as pose), and size of each target to be detected may be used to uniquely determine a target detection frame indicative of the target to be detected.
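As a rough, non-limiting illustration of the voxelization stage described above, the following Python sketch assigns points to a voxel grid and computes a simple hand-crafted per-voxel feature. The actual PV-RCNN pipeline learns voxel features with sparse convolutions; the grid parameters used here are assumed values.

```python
import numpy as np

def voxelize(points, voxel_size=(0.1, 0.1, 0.2),
             pc_range=(-50, -50, -3, 50, 50, 3)):
    """Assign each point to a voxel and compute a simple per-voxel feature
    (centroid + point count). Illustrative only; not the learned features
    used by PV-RCNN."""
    pts = points[
        (points[:, 0] >= pc_range[0]) & (points[:, 0] < pc_range[3]) &
        (points[:, 1] >= pc_range[1]) & (points[:, 1] < pc_range[4]) &
        (points[:, 2] >= pc_range[2]) & (points[:, 2] < pc_range[5])
    ]
    idx = ((pts[:, :3] - np.array(pc_range[:3])) / np.array(voxel_size)).astype(np.int32)
    voxels = {}
    for p, i in zip(pts, map(tuple, idx)):
        voxels.setdefault(i, []).append(p)
    # per-voxel feature: [centroid_x, centroid_y, centroid_z, n_points]
    return {i: np.append(np.mean(v, axis=0)[:3], len(v)) for i, v in voxels.items()}
```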
In the implementation of step S12, it can be understood that, as the stationary or moving states of different targets change, targets may occlude one another, new targets may enter the target scene area, and existing targets may leave it. For the same target to be detected, its target detection frame may be detected in all the point cloud frames contained in the point cloud sequence (i.e., a target detection frame of that target is obtained in every point cloud frame), or only in a part of the point cloud frames contained in the point cloud sequence (i.e., a target detection frame of that target is obtained only in a part of the point cloud frames).
As an example, suppose that vehicle A, on which the lidar collecting the point cloud is mounted, remains stationary during a period t1-t2; that vehicle B, located on the right side of vehicle A, is also stationary during t1-t2; and that vehicle C drives into the target scene area during t1-t2 and passes between vehicle A and vehicle B. Thus, among the point cloud frames collected during t1-t2, there may be some frames in which vehicle B can be detected but vehicle C cannot (because vehicle C has not yet entered the target scene area at the corresponding times), and other frames in which vehicle C can be detected but vehicle B cannot (because vehicle C passes between vehicle A and vehicle B at the corresponding times, so that vehicle B is occluded by vehicle C).
In a specific implementation, the algorithm for tracking the trajectories of the target detection frames in each frame of point cloud may be any existing algorithm capable of multi-target tracking, selected from but not limited to: the SimpleTrack algorithm, the DeepSORT algorithm, the FairMOT algorithm, and the GraphNN-MOT algorithm.
Taking the SimpleTrack algorithm as an example, its main implementation process includes the following steps: (1) feature extraction: extracting features of the target detection frames in each frame of point cloud of the point cloud sequence (specifically, extracting features of the regional point cloud indicated or enclosed by each target detection frame), where the extracted features may be selected from appearance features (e.g., color, texture) and geometric features (e.g., position, size); (2) feature matching: starting from the second frame of the point cloud sequence, performing feature matching between the target detection frames in the current frame and those in the previous frame based on the extracted features, using a feature matching algorithm (for example, nearest neighbor search or the Hungarian algorithm based on a cost matrix), so as to establish correspondences (i.e., matching relationships) among the target detection frames of the frames, where target detection frames matched with each other across frames indicate the same target.
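As a non-limiting sketch of the feature matching step, the following Python code associates the target detection frames of two consecutive frames with the Hungarian algorithm over a cost matrix built from box-center distances; the box layout [x, y, z, l, w, h, yaw] and the gating distance are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_boxes(prev_boxes, curr_boxes, max_dist=2.0):
    """Associate current-frame detection boxes with previous-frame boxes using
    a center-distance cost matrix and the Hungarian algorithm."""
    if len(prev_boxes) == 0 or len(curr_boxes) == 0:
        return []
    prev_c = np.asarray(prev_boxes, dtype=float)[:, :3]
    curr_c = np.asarray(curr_boxes, dtype=float)[:, :3]
    cost = np.linalg.norm(prev_c[:, None, :] - curr_c[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    # keep only assignments whose cost is below the gating distance
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_dist]
```

Current-frame boxes left unmatched by this step would, as described below, be treated as new targets and assigned new unique IDs.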
It should be noted that, compared with the conventional SimpleTrack algorithm, which extracts features from the whole point cloud, target detection frames are available in the embodiment of the invention; therefore, when the SimpleTrack algorithm extracts features of the regional point clouds indicated or enclosed by the target detection frames, feature extraction can be restricted to the frame regions of those detection frames, which effectively improves the focus and accuracy of feature extraction.
Further, the motion states (such as position, velocity and acceleration) of the targets indicated by the mutually matched target detection frames can be predicted by a Kalman filter or another suitable state estimation method. After the predicted motion state of each target in the current frame is obtained, the potential position of each target in the next frame can be calculated through a motion model; for targets that are not matched in the current frame, a new unique ID can be allocated and a corresponding state estimator initialized. The motion state estimation of each target is then updated and maintained over a certain number of iterations or until another suitable termination condition is met, until the required tracking precision and stability are achieved. Finally, the tracking results are smoothed and optimized (for example, optical flow or conditional random field techniques may be used to further improve the quality of target tracking).
It should be noted that, compared with a conventional point cloud-based state estimation method, target detection frames are available in the embodiment of the invention; therefore, when the state estimation method predicts the motion state of the target indicated by a target detection frame, the prediction can be constrained to the region of the detection frame, which effectively improves prediction efficiency and prediction accuracy.
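As a non-limiting sketch of the state estimation described above, a minimal constant-velocity Kalman filter over the box center may look as follows; the noise magnitudes and time step are illustrative assumptions.

```python
import numpy as np

class ConstantVelocityKalman:
    """Minimal constant-velocity Kalman filter over a box center [x, y, z].
    State is [x, y, z, vx, vy, vz]."""
    def __init__(self, center, dt=0.1):
        self.x = np.hstack([center, np.zeros(3)])        # state
        self.P = np.eye(6)                               # state covariance
        self.F = np.eye(6)
        self.F[:3, 3:] = np.eye(3) * dt                  # constant-velocity motion model
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])  # only centers are measured
        self.Q = np.eye(6) * 0.01                        # process noise (illustrative)
        self.R = np.eye(3) * 0.1                         # measurement noise (illustrative)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]                                # predicted center for the next frame

    def update(self, measured_center):
        y = measured_center - self.H @ self.x            # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)         # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
```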
In the implementation of step S13, for each target to be detected, the coordinate space transformation is performed on each target detection frame indicating the target to be detected, so as to obtain a transformation detection frame corresponding to each target detection frame in the specified coordinate system. The coordinate space transformation is performed on each target detection frame, which may specifically refer to performing coordinate space transformation on points in the regional point cloud indicated or surrounded by each target detection frame.
The specified coordinate system may be a world coordinate system; or it may be the local coordinate system of a certain frame of point cloud in the point cloud sequence. For example, if the target A to be detected is occluded in the i-th frame of point cloud and no corresponding target detection frame is detected there, the local coordinate system of the i-th frame of point cloud may be used as the specified coordinate system. Alternatively, the specified coordinate system may be another suitable coordinate system designated in advance.
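As a non-limiting sketch of the coordinate space transformation of step S13, the points and the box pose of a target detection frame can be mapped into the specified coordinate system with a 4x4 homogeneous transform; how that transform is obtained (for example, from ego pose and sensor extrinsics) depends on the deployment and is assumed here.

```python
import numpy as np

def transform_box_points(points, T_src_to_dst):
    """Transform the points enclosed by a detection frame from the source frame's
    local coordinate system into the specified coordinate system.
    T_src_to_dst: 4x4 homogeneous transform (rotation + translation)."""
    homo = np.hstack([points[:, :3], np.ones((points.shape[0], 1))])
    return (homo @ T_src_to_dst.T)[:, :3]

def transform_box(box, T_src_to_dst):
    """box = [x, y, z, l, w, h, yaw]; the yaw update is only valid when the
    rotation of T_src_to_dst is about the z axis (a common simplification)."""
    box = np.asarray(box, dtype=float)
    center = np.append(box[:3], 1.0)
    new_center = (T_src_to_dst @ center)[:3]
    yaw_offset = np.arctan2(T_src_to_dst[1, 0], T_src_to_dst[0, 0])
    return np.concatenate([new_center, box[3:6], [box[6] + yaw_offset]])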
It can be understood that the transformation detection frames corresponding to different types of targets to be detected can take different forms. For a static target, the plurality of transformation detection frames obtained through the coordinate space transformation of step S13 have a high degree of overlap; specifically, the point cloud sparsity of each transformation detection frame may differ, but the pose (including position and orientation) and the size are almost completely consistent. For a dynamic target, the plurality of transformation detection frames obtained through the coordinate space transformation of step S13 have a low degree of overlap and may, for example, reveal the motion trajectory of the dynamic target (for example, a straight or curved trajectory); specifically, not only may the point cloud sparsity of each transformation detection frame differ, but the position and/or orientation and/or size may differ as well.
In the implementation of step S14, feature extraction is performed based on the obtained multiple transformation detection frames, and the extracted features are input into a pre-trained detection model to obtain the size and the pose of the target to be detected at the target moment.
Further, as described above, the forms of the multiple transformation detection frames corresponding to a static target and a dynamic target differ from each other. Therefore, in the embodiment of the invention, a corresponding feature extraction method can be adaptively selected according to the type of the target to be detected, so as to extract more accurate features that represent the state of a static target or of a dynamic target. Furthermore, the adaptively selected pre-trained detection model, operating on the extracted features, helps to obtain more accurate target detection results.
Referring to fig. 2, fig. 2 is a flowchart of one embodiment of step S14 in fig. 1. The step S14 may specifically include steps S21 to S23, where feature extraction is performed based on the obtained plurality of transformation detection frames.
In step S21, the type of the object to be detected is determined.
In a first specific embodiment, the step S21 may include: determining the distance between the positions of each pair of adjacent transformation detection frames indicating the target to be detected, and recording the distance as a target distance; judging whether the sum of the determined target distances is smaller than a first preset threshold value or not; if the sum of the determined target distances is smaller than a first preset threshold value, confirming that the type of the target to be detected is a static target; and if the sum of the determined target distances is greater than or equal to the first preset threshold value, confirming that the type of the target to be detected is a dynamic target.
The distance between the positions of each pair of adjacent transformation detection frames can be specifically the distance between the center points of the pair of adjacent transformation detection frames.
In practical applications, for a static target, the positions of the corresponding transformation detection frames are ideally identical, in other words, if factors such as errors, noise and the like are not considered, the sum of the determined target pitches approaches 0. Based on the above, in the embodiment of the present invention, by adopting the above-mentioned target type determination scheme based on the position interval of the transformation detection frame, it is helpful to accurately distinguish between a static target and a dynamic target.
As a non-limiting example, for a coordinate system whose minimum unit is centimeters, the first preset threshold may be selected as a suitable value in the interval [0 cm, 5 cm].
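A minimal sketch of the distance-based type determination, assuming the transformation detection frames are represented by their center coordinates and the threshold is expressed in meters:

```python
import numpy as np

def is_static_by_distance(centers, threshold=0.05):
    """centers: (T, 3) array of a target's transformation detection frame centers
    in time order. Sums the distances between each pair of temporally adjacent
    frames and compares the sum against the first preset threshold (here 5 cm,
    expressed in meters as an illustrative value)."""
    gaps = np.linalg.norm(np.diff(centers, axis=0), axis=1)
    return gaps.sum() < threshold
```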
In a second specific embodiment, the step S21 may include: determining the intersection-over-union (IoU) of each pair of adjacent transformation detection frames indicating the target to be detected, and recording it as a target IoU; if the average value of the determined target IoUs is greater than or equal to a second preset threshold, confirming that the type of the target to be detected is a static target; and if the average value of the determined target IoUs is smaller than the second preset threshold, confirming that the type of the target to be detected is a dynamic target.
The IoU of each pair of adjacent transformation detection frames refers to the ratio between the area of the intersection region and the area of the union region of the pair of adjacent transformation detection frames.
In practical applications, for a static target, the positions, orientations and sizes of the corresponding transformation detection frames are ideally identical; in other words, if factors such as error and noise are not considered, the average value of the determined target IoUs approaches 1. Accordingly, in the embodiment of the present invention, adopting the above IoU-based target type determination scheme helps to accurately distinguish static targets from dynamic targets.
As a non-limiting example, the second preset threshold may be selected as a suitable value in the interval [0.8, 1].
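A minimal sketch of the IoU-based type determination; for simplicity the boxes are treated as axis-aligned, whereas an oriented-box IoU would normally be used for the transformation detection frames:

```python
import numpy as np

def axis_aligned_iou(box_a, box_b):
    """IoU between two axis-aligned 3D boxes given as [x, y, z, l, w, h]."""
    box_a = np.asarray(box_a, dtype=float)
    box_b = np.asarray(box_b, dtype=float)
    min_a, max_a = box_a[:3] - box_a[3:6] / 2, box_a[:3] + box_a[3:6] / 2
    min_b, max_b = box_b[:3] - box_b[3:6] / 2, box_b[:3] + box_b[3:6] / 2
    inter = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0, None)
    inter_vol = inter.prod()
    union = box_a[3:6].prod() + box_b[3:6].prod() - inter_vol
    return inter_vol / union if union > 0 else 0.0

def is_static_by_iou(boxes, threshold=0.85):
    """boxes: time-ordered transformation detection frames of one target; the
    mean IoU of adjacent pairs is compared against the second preset threshold
    (the 0.85 value is illustrative)."""
    ious = [axis_aligned_iou(boxes[i], boxes[i + 1]) for i in range(len(boxes) - 1)]
    return float(np.mean(ious)) >= threshold
```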
With continued reference to fig. 2, in step S22, if the type of the object to be detected is a dynamic object, a transformed point cloud sequence is obtained according to the timing sequence of each transformed detection frame.
Specifically, the time sequence of each transformation detection frame is kept unchanged, and the transformation point cloud sequence is formed by the regional point clouds indicated or surrounded by each transformation detection frame, that is, the transformation point cloud sequence contains the time sequence characteristics of each transformation detection frame.
In step S23, first feature extraction is performed on the transformation point cloud sequence to obtain a first point cloud feature, where the first point cloud feature includes size features and pose features of the transformation point cloud sequence.
In a specific implementation, the first feature extraction may be performed using an existing model capable of extracting features from a point cloud sequence containing temporal characteristics (e.g., a PointNet-LSTM model), so as to obtain the first point cloud feature. The first point cloud feature includes size features and pose features of the transformation point cloud sequence, and specifically includes feature information about how the size, position and orientation of the target to be detected, as indicated by each transformation detection frame in the transformation point cloud sequence, change over time.
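As a non-limiting sketch of the first feature extraction, a PointNet-LSTM style extractor may be organized as follows; PyTorch is assumed as the framework and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class PointNetLSTMFeature(nn.Module):
    """A shared point-wise MLP with max pooling encodes the points of each
    transformation detection frame, and an LSTM aggregates the per-frame codes
    into a sequence feature (the first point cloud feature)."""
    def __init__(self, point_dim=3, frame_feat=128, seq_feat=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(point_dim, 64, 1), nn.ReLU(),
            nn.Conv1d(64, frame_feat, 1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(frame_feat, seq_feat, batch_first=True)

    def forward(self, point_seq):
        # point_seq: list of T tensors, each of shape (N_t, point_dim)
        frame_codes = []
        for pts in point_seq:
            x = self.mlp(pts.t().unsqueeze(0))          # (1, frame_feat, N_t)
            frame_codes.append(x.max(dim=2).values)     # symmetric max pooling
        seq = torch.stack(frame_codes, dim=1)           # (1, T, frame_feat)
        out, _ = self.lstm(seq)
        return out[:, -1]                               # (1, seq_feat)
```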
In the embodiment of the invention, by performing feature extraction on a plurality of time-ordered transformation detection frames, feature information reflecting how the pose and the size of the dynamic target change over time is obtained. Further, the pre-trained dynamic target detection model is subsequently used to obtain the size and the pose of the dynamic target at the target moment (e.g., an occluded moment, a future moment, or another designated moment). This helps to solve the problem of target detection when the target is occluded, avoid missed detections, improve the accuracy of target detection, and facilitate prediction of the future state of the dynamic target.
Further, in this embodiment, the pre-trained detection model is a pre-trained dynamic target detection model. Before the extracted features are input into the pre-trained detection model, the method further includes: constructing a first training data set using the sizes and poses of the target detection frames of the same dynamic target to be detected at a plurality of moments within a first preset period; and training a preset initial dynamic target detection model with the first training data set to obtain the pre-trained dynamic target detection model.
The target detection frames of the same dynamic target to be detected at the plurality of moments within the first preset period are temporally consecutive.
The initial dynamic target detection model may be selected from, but is not limited to: a recurrent neural network (RNN), a long short-term memory network (LSTM), and a gated recurrent unit (GRU).
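As a non-limiting sketch, an initial dynamic target detection model built on a GRU, together with a training step over the first training data set, may look as follows; PyTorch, the box parameterization [x, y, z, l, w, h, yaw] and the loss function are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DynamicTargetGRU(nn.Module):
    """A GRU consumes the time-ordered box states (size + pose) of one target
    and regresses the box state at the target moment (occluded or future)."""
    def __init__(self, box_dim=7, hidden=128):
        super().__init__()
        self.gru = nn.GRU(box_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, box_dim)

    def forward(self, box_seq):                  # box_seq: (B, T, box_dim)
        out, _ = self.gru(box_seq)
        return self.head(out[:, -1])             # predicted box at the target moment

def train_step(model, optimizer, box_seq, target_box):
    """box_seq: observed boxes within the first preset period;
    target_box: the annotated box at the occluded/future moment (label data)."""
    optimizer.zero_grad()
    loss = nn.functional.smooth_l1_loss(model(box_seq), target_box)
    loss.backward()
    optimizer.step()
    return loss.item()
```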
In some embodiments, the target detection frames of the same dynamic target to be detected at the plurality of moments within the first preset period may be: the actual target detection frames of the same dynamic target to be detected at a plurality of non-occluded moments within the first preset period, together with at least one annotated target detection frame at an occluded moment.
The actual target detection frame at a moment when the dynamic target to be detected is not occluded specifically refers to a target detection frame, detected from the point cloud, that indicates the dynamic target to be detected when it is not occluded by other targets; the annotated target detection frame at an occluded moment specifically refers to a target detection frame annotated manually or by an algorithm, that is, a target detection frame serving as label data, for a moment when the dynamic target to be detected is occluded by other targets and cannot be detected from the point cloud.
In the embodiment of the invention, by introducing the annotated target detection frames at occluded moments into the first training data set as label data, the initial dynamic target detection model can learn, through training, the mapping relationship between the poses and sizes of the target detection frames of the same dynamic target at a plurality of non-occluded moments within a period and the pose and size of the target detection frame at the occluded moment.
Further, by adopting the pre-trained dynamic target detection model obtained through this training, the pose and size of a dynamic target at a moment when it is occluded within a period can be obtained based on a plurality of input time-ordered poses and sizes of the dynamic target within that period. Therefore, the problem that target detection or recognition cannot be achieved in point cloud-based target detection when the target to be detected is occluded can be solved.
In other embodiments, the target detection frames of the same dynamic target to be detected at the plurality of moments within the first preset period may be: the actual target detection frames of the same dynamic target to be detected at a plurality of historical moments within the first preset period, together with an annotated target detection frame at a future moment.
The target detection frames of the same dynamic target to be detected at the plurality of moments within the first preset period are temporally consecutive.
The actual target detection frame at a historical moment specifically refers to a target detection frame, detected from the point cloud, that indicates the dynamic target to be detected when it is not occluded by other targets; the annotated target detection frame at a future moment specifically refers to a target detection frame (also called a predicted target detection frame) annotated manually or by an algorithm for one or more future moments after the historical moments, that is, a target detection frame serving as label data.
In the embodiment of the invention, by introducing the annotated target detection frames at future moments into the first training data set as label data, the initial dynamic target detection model is trained to learn the mapping relationship between the poses and sizes of the target detection frames of the same dynamic target at a plurality of historical moments within a period and the pose and size of the target detection frame at a future moment.
Further, by adopting the pre-trained dynamic target detection model obtained through this training, the pose and size of a dynamic target at a future moment after the historical moments can be predicted based on a plurality of input time-ordered poses and sizes of the dynamic target at historical moments within a period. Therefore, the future pose and size of a target to be detected can be predicted based on its historical poses and sizes, achieving accurate prediction of the motion state of the target.
Referring to fig. 3, fig. 3 is a flowchart of another embodiment of step S14 in fig. 1. The step S14 may specifically include steps S31 to S33, where feature extraction is performed based on the obtained plurality of transformation detection frames.
In step S31, the type of the object to be detected is determined.
In step S32, if the type of the target to be detected is a static target, performing point cloud accumulation on the point clouds of each transformation detection frame to obtain a point cloud accumulation result.
In step S33, second feature extraction is performed on the point cloud accumulation result to obtain a second point cloud feature, where the second point cloud feature includes size features and pose features of the point cloud accumulation result.
In a specific implementation, an existing model capable of extracting features from point cloud data (for example, a PointNet model) may be used to perform the second feature extraction, so as to obtain the second point cloud feature. The second point cloud feature includes accumulated feature information about the size, position and orientation of the target to be detected indicated by each transformation detection frame in the point cloud accumulation result. In the embodiment of the invention, feature extraction is performed on the accumulated point clouds of the plurality of transformation detection frames; since the accumulated point cloud has a higher density than a single-frame point cloud, more accurate features reflecting the pose and size of the static target can be extracted. This further helps to obtain an accurate size and pose of the static target using the pre-trained static target detection model.
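As a non-limiting sketch of the processing for static targets, point cloud accumulation followed by a PointNet-style second feature extraction may be organized as follows; PyTorch is assumed and layer sizes are illustrative.

```python
import numpy as np
import torch
import torch.nn as nn

def accumulate_points(transformed_frames):
    """Concatenate the per-frame point clouds of a static target's transformation
    detection frames (all already in the specified coordinate system) into one
    denser accumulated cloud."""
    return np.concatenate(transformed_frames, axis=0)

class PointNetFeature(nn.Module):
    """Minimal PointNet-style encoder for the accumulated cloud; the max-pooled
    code serves as the second point cloud feature."""
    def __init__(self, point_dim=3, feat=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(point_dim, 64, 1), nn.ReLU(),
            nn.Conv1d(64, feat, 1), nn.ReLU(),
        )

    def forward(self, points):
        # points: torch.Tensor of shape (N, point_dim), the accumulated cloud
        x = self.mlp(points.t().unsqueeze(0))    # (1, feat, N)
        return x.max(dim=2).values               # (1, feat)
```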
Further, in this embodiment, the pre-trained detection model is a pre-trained static target detection model. Before the extracted features are input into the pre-trained detection model, the method further includes: constructing a second training data set using the sizes and poses of the target detection frames of the same static target to be detected at a plurality of moments within a second preset period; and training a preset initial static target detection model with the second training data set to obtain the pre-trained static target detection model.
The initial static target detection model may be an existing model capable of realizing size and gesture detection of a target, and may be the same as or different from the initial dynamic target detection model.
For the method for constructing the second training data set and the technical effects thereof, reference may be made to the foregoing description of the method for constructing the first training data set and the technical effects thereof, which are not repeated herein.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a target detection device based on a point cloud sequence in an embodiment of the present invention. The target detection device based on the point cloud sequence may include:
The first detection module 41 is configured to perform preliminary target detection on each frame of point cloud in the point cloud sequence to obtain target detection frames in each frame of point cloud, where a target detection frame is used to indicate a target to be detected in the point cloud to which it belongs;
the track tracking module 42 is configured to track the target detection frames in each frame point cloud to determine target detection frames that indicate the same target to be detected in each frame point cloud;
the coordinate transformation module 43 is configured to, for each target to be detected, transform, in coordinate space, each target detection frame indicating the target to be detected, so as to obtain a transformation detection frame corresponding to each target detection frame in a specified coordinate system;
the second detection module 44 is configured to perform feature extraction based on the obtained multiple transformation detection frames, and input the extracted features into a pre-trained detection model to obtain the size and the pose of the target to be detected at the target moment.
Regarding the principle, implementation and beneficial effects of the device, reference is made to the foregoing and the related description of the target detection method based on the point cloud sequence shown in fig. 1 to 3, which are not repeated herein.
The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, performs the steps of the target detection method based on the point cloud sequence shown in the above-mentioned fig. 1 to 3. The computer readable storage medium may include non-volatile memory (non-volatile) or non-transitory memory, and may also include optical disks, mechanical hard disks, solid state disks, and the like.
Specifically, in the embodiment of the present application, the processor may be a central processing unit (CPU), and the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It should also be appreciated that the memory in embodiments of the present application may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous dynamic RAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The embodiment of the application also provides a terminal, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor, when running the computer program, performs the steps of the target detection method based on the point cloud sequence shown in fig. 1 to 3. The terminal may include, but is not limited to, terminal equipment such as a mobile phone, a computer, a tablet computer, a server, and a cloud platform.
It should be understood that the term "and/or" merely describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In this context, the character "/" indicates that the associated objects before and after it are in an "or" relationship.
The term "plurality" as used in the embodiments of the present application means two or more.
The terms "first", "second", and the like in the embodiments of the present application are used only to illustrate and distinguish the objects they describe; they do not imply any order, do not limit the number of devices in the embodiments of the present application, and should not be construed as limiting the embodiments of the present application.
It should be noted that the serial numbers of the steps in the present embodiment do not represent a limitation on the execution sequence of the steps.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention, and the protection scope of the invention should therefore be determined by the appended claims.

Claims (12)

1. A target detection method based on a point cloud sequence, characterized by comprising the following steps:
performing target preliminary detection on each frame of point cloud in the point cloud sequence to obtain a target detection frame in each frame of point cloud, wherein the target detection frame is used for indicating a target to be detected in the point cloud to which the target detection frame belongs;
tracking the track of the target detection frame in each frame point cloud to determine the target detection frame indicating the same target to be detected in each frame point cloud;
for each target to be detected, carrying out coordinate space transformation on each target detection frame indicating the target to be detected so as to obtain a transformation detection frame corresponding to each target detection frame in a designated coordinate system;
and extracting features based on the obtained multiple transformation detection frames, and inputting the extracted features into a pre-trained detection model to obtain the size and the pose of the target to be detected at the target moment.
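The coordinate space transformation in the method above can be illustrated with the following hedged sketch, which maps a detection frame from the ego coordinate system of its own frame into a specified coordinate system (assumed here to be the ego coordinate system at the target moment); the 4x4 homogeneous ego-pose convention and the planar heading update are assumptions for illustration only.

```python
import numpy as np

def transform_box(center_ego, yaw_ego, pose_src, pose_dst):
    """Map a box centre/heading from a source frame's ego coordinates into a
    specified (destination) ego coordinate system.

    pose_src, pose_dst: assumed 4x4 homogeneous matrices mapping each frame's
    ego coordinates into a common world coordinate system.
    """
    rel = np.linalg.inv(pose_dst) @ pose_src           # source ego -> destination ego
    center = rel @ np.append(center_ego, 1.0)          # homogeneous point
    # Heading changes only by the relative rotation about the z axis (planar case).
    dyaw = np.arctan2(rel[1, 0], rel[0, 0])
    return center[:3], yaw_ego + dyaw

# Example with identity poses: the box is unchanged.
c, y = transform_box(np.array([1.0, 2.0, 0.5]), 0.3, np.eye(4), np.eye(4))
```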
2. The method according to claim 1, wherein the type of the target to be detected is selected from the group consisting of a static target and a dynamic target;
the feature extraction based on the obtained plurality of transformation detection frames comprises:
determining the type of the target to be detected;
if the type of the target to be detected is a dynamic target, obtaining a transformation point cloud sequence according to the time sequence of each transformation detection frame;
and performing first feature extraction on the transformation point cloud sequence to obtain a first point cloud feature, wherein the first point cloud feature comprises the size feature and the pose feature of the transformation point cloud sequence.
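As one possible reading of the first feature extraction, the sketch below orders the transformation detection frames of a dynamic target by timestamp and stacks a per-frame size/pose vector; the dictionary-based box representation and the hand-built [l, w, h, x, y, z, yaw] feature are assumptions, and a learned point cloud encoder could serve the same role.

```python
import numpy as np

def extract_first_features(track_boxes):
    """Order the transformation detection frames of one dynamic target by time
    and stack a per-frame [l, w, h, x, y, z, yaw] vector into a (T, 7) feature.

    track_boxes: list of dicts with keys 'timestamp', 'size' (3,), 'center' (3,)
    and 'yaw' -- an assumed, simplified representation of a transformation frame.
    """
    ordered = sorted(track_boxes, key=lambda b: b["timestamp"])
    feats = [np.concatenate([b["size"], b["center"], [b["yaw"]]]) for b in ordered]
    return np.stack(feats)    # size features + pose features per frame, in time order
```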
3. The method of claim 2, wherein the pre-trained detection model is a pre-trained dynamic target detection model;
before inputting the extracted features into the pre-trained detection model, the method further comprises:
constructing a first training data set by using the sizes and poses of target detection frames of a same dynamic target to be detected at a plurality of moments within a first preset period;
and training a preset initial dynamic target detection model by adopting the first training data set to obtain the pre-trained dynamic target detection model.
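A minimal training sketch under assumed choices (PyTorch, an LSTM regressor over per-frame size/pose vectors, MSE loss, Adam) is given below; the architecture, loss and hyper-parameters are illustrative and not prescribed by the claim.

```python
import torch
import torch.nn as nn

class DynamicTargetModel(nn.Module):
    """Illustrative model: an LSTM over per-frame size/pose vectors followed by
    a linear head that regresses the size and pose at the target moment."""
    def __init__(self, in_dim: int = 7, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, in_dim)

    def forward(self, seq):               # seq: (B, T, 7)
        out, _ = self.lstm(seq)
        return self.head(out[:, -1])      # (B, 7) size + pose at the target moment


def train(model, dataset, epochs: int = 10, lr: float = 1e-3):
    """dataset yields (sequence, target) pairs built from the sizes and poses of
    one dynamic target's detection frames within the first preset period."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for seq, target in dataset:
            opt.zero_grad()
            loss = loss_fn(model(seq), target)
            loss.backward()
            opt.step()
    return model
```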
4. A method according to claim 3, wherein the target detection frames of the same dynamic target to be detected at a plurality of moments within a first preset period are selected from any one of the following:
actual target detection frames of the same dynamic target to be detected at a plurality of non-occluded moments within the first preset period, together with at least one annotated target detection frame at an occluded moment;
actual target detection frames of the same dynamic target to be detected at a plurality of historical moments within the first preset period, together with annotated target detection frames at future moments.
5. The method according to claim 2, wherein the method further comprises:
if the type of the target to be detected is a static target, carrying out point cloud accumulation on the point clouds of each transformation detection frame to obtain a point cloud accumulation result;
and carrying out second feature extraction on the point cloud accumulation result to obtain a second point cloud feature, wherein the second point cloud feature comprises the size feature and the pose feature of the point cloud accumulation result.
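The point cloud accumulation and second feature extraction can be sketched as follows, assuming the points falling inside each transformation detection frame are available as arrays; the centroid and axis-aligned extent used here are a deliberately simple stand-in for the size and pose features a trained encoder would produce.

```python
import numpy as np

def accumulate_and_extract(point_sets):
    """point_sets: list of (N_i, 3) arrays, the points falling inside each
    transformation detection frame of one static target (assumed input format)."""
    accumulated = np.concatenate(point_sets, axis=0)             # point cloud accumulation
    centroid = accumulated.mean(axis=0)                          # crude pose feature
    extent = accumulated.max(axis=0) - accumulated.min(axis=0)   # crude size feature
    return accumulated, np.concatenate([extent, centroid])       # second point cloud feature
```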
6. The method of claim 5, wherein the pre-trained detection model is a pre-trained static target detection model;
before inputting the extracted features into the pre-trained detection model, the method further comprises:
constructing a second training data set by using the sizes and poses of target detection frames of a same static target to be detected at a plurality of moments within a second preset period;
and training a preset initial static target detection model by adopting the second training data set to obtain the pre-trained static target detection model.
7. The method according to claim 2 or 3, wherein determining the type of the target to be detected comprises:
determining the distance between the positions of each pair of adjacent transformation detection frames indicating the target to be detected, and recording the distance as a target distance;
if the sum of the determined target distances is smaller than a first preset threshold value, confirming that the type of the target to be detected is a static target;
and if the sum of the determined target distances is greater than or equal to the first preset threshold value, confirming that the type of the target to be detected is a dynamic target.
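A minimal sketch of this distance criterion, assuming each transformation detection frame is reduced to its centre position:

```python
import numpy as np

def classify_by_distance(centers, first_threshold):
    """centers: (T, 3) centres of one target's transformation detection frames in
    time order. Returns 'static' or 'dynamic' per the summed-distance criterion."""
    target_distances = np.linalg.norm(np.diff(centers, axis=0), axis=1)
    return "static" if target_distances.sum() < first_threshold else "dynamic"
```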
8. The method according to claim 2 or 3, wherein determining the type of the target to be detected comprises:
determining the intersection-over-union ratio of each pair of adjacent transformation detection frames indicating the target to be detected, and recording the ratio as a target intersection-over-union ratio;
if the average value of the determined multiple target intersection-over-union ratios is larger than or equal to a second preset threshold value, confirming that the type of the target to be detected is a static target;
and if the average value of the determined multiple target intersection-over-union ratios is smaller than the second preset threshold value, confirming that the type of the target to be detected is a dynamic target.
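A minimal sketch of this intersection-over-union criterion, simplified to axis-aligned bird's-eye-view boxes; oriented detection frames would normally require a rotated-IoU routine in place of aabb_iou:

```python
import numpy as np

def aabb_iou(a, b):
    """IoU of two axis-aligned BEV boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def classify_by_iou(boxes, second_threshold):
    """boxes: list of axis-aligned BEV boxes of one target in time order."""
    ious = [aabb_iou(boxes[i], boxes[i + 1]) for i in range(len(boxes) - 1)]
    return "static" if np.mean(ious) >= second_threshold else "dynamic"
```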
9. The method of claim 1, wherein the algorithm for tracking the trajectory of the target detection box in each frame point cloud is selected from the group consisting of:
SimpleTrack algorithm, DeepSORT algorithm, FairMOT algorithm, and GraphNN-MOT algorithm.
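The algorithms listed above are off-the-shelf trackers; purely to illustrate the frame-to-frame association they perform, the sketch below does a greedy nearest-centre matching between consecutive frames and is not an implementation of any of the named algorithms:

```python
import numpy as np

def associate(prev_centers, curr_centers, max_dist=2.0):
    """Greedy nearest-centre matching between consecutive frames.
    Returns (prev_index, curr_index) pairs; unmatched current boxes would start
    new tracks in a full tracker."""
    pairs, used = [], set()
    for j, c in enumerate(curr_centers):
        dists = [np.linalg.norm(c - p) for p in prev_centers]
        if not dists:
            continue
        i = int(np.argmin(dists))
        if dists[i] <= max_dist and i not in used:
            pairs.append((i, j))
            used.add(i)
    return pairs
```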
10. A point cloud sequence-based object detection device, comprising:
the first detection module is used for carrying out target preliminary detection on each frame of point cloud in the point cloud sequence to obtain a target detection frame in each frame of point cloud, wherein the target detection frame is used for indicating a target to be detected in the point cloud to which the target detection frame belongs;
the track tracking module is used for performing trajectory tracking on the target detection frames in each frame of point cloud, so as to determine the target detection frames indicating the same target to be detected in each frame of point cloud;
the coordinate transformation module is used for carrying out coordinate space transformation on each target detection frame indicating the target to be detected for each target to be detected so as to obtain a transformation detection frame corresponding to each target detection frame in a specified coordinate system;
and the second detection module is used for performing feature extraction based on the obtained multiple transformation detection frames, and inputting the extracted features into a pre-trained detection model to obtain the size and the pose of the target to be detected at the target moment.
11. A storage medium having stored thereon a computer program, which when run by a processor performs the steps of the point cloud sequence based object detection method according to any of claims 1 to 9.
12. A terminal comprising a memory and a processor, the memory having stored thereon a computer program executable on the processor, characterized in that the processor, when executing the computer program, performs the steps of the point cloud sequence based object detection method according to any of claims 1 to 9.
CN202310803335.6A 2023-06-30 2023-06-30 Target detection method and device based on point cloud sequence, storage medium and terminal Pending CN117036836A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310803335.6A CN117036836A (en) 2023-06-30 2023-06-30 Target detection method and device based on point cloud sequence, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310803335.6A CN117036836A (en) 2023-06-30 2023-06-30 Target detection method and device based on point cloud sequence, storage medium and terminal

Publications (1)

Publication Number Publication Date
CN117036836A true CN117036836A (en) 2023-11-10

Family

ID=88621525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310803335.6A Pending CN117036836A (en) 2023-06-30 2023-06-30 Target detection method and device based on point cloud sequence, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN117036836A (en)

Similar Documents

Publication Publication Date Title
Kim et al. Extracting vehicle trajectories using unmanned aerial vehicles in congested traffic conditions
CN110415277B (en) Multi-target tracking method, system and device based on optical flow and Kalman filtering
CN110991272A (en) Multi-target vehicle track identification method based on video tracking
CN109658442B (en) Multi-target tracking method, device, equipment and computer readable storage medium
CN116229408A (en) Target identification method for fusing image information and laser radar point cloud information
CN112634368A (en) Method and device for generating space and OR graph model of scene target and electronic equipment
CN111898491A (en) Method and device for identifying reverse driving of vehicle and electronic equipment
CN116681730A (en) Target tracking method, device, computer equipment and storage medium
CN113129336A (en) End-to-end multi-vehicle tracking method, system and computer readable medium
Jia et al. Front-view vehicle detection by Markov chain Monte Carlo method
Qing et al. A novel particle filter implementation for a multiple-vehicle detection and tracking system using tail light segmentation
CN114820765A (en) Image recognition method and device, electronic equipment and computer readable storage medium
CN112699711A (en) Lane line detection method, lane line detection device, storage medium, and electronic apparatus
CN117630860A (en) Gesture recognition method of millimeter wave radar
CN117789160A (en) Multi-mode fusion target detection method and system based on cluster optimization
Cai et al. 3D vehicle detection based on LiDAR and camera fusion
CN114662600B (en) Lane line detection method, device and storage medium
CN115760898A (en) World coordinate positioning method for road sprinklers in mixed Gaussian domain
CN117036836A (en) Target detection method and device based on point cloud sequence, storage medium and terminal
Maharshi et al. A System for Detecting Automated Parking Slots Using Deep Learning
CN116433715A (en) Time sequence tracking method, device and medium based on multi-sensor front fusion result
CN111338336B (en) Automatic driving method and device
Zhang et al. Urban vehicle extraction from aerial laser scanning point cloud data
CN115236672A (en) Obstacle information generation method, device, equipment and computer readable storage medium
CN116776158B (en) Target classification method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination