WO2020108311A1 - 3D detection method and apparatus for target object, and medium and device


Info

Publication number
WO2020108311A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
information
neural network
initial frame
cloud data
Prior art date
Application number
PCT/CN2019/118126
Other languages
French (fr)
Chinese (zh)
Inventor
Shaoshuai Shi (史少帅)
Hongsheng Li (李鸿升)
Xiaogang Wang (王晓刚)
Original Assignee
Beijing SenseTime Technology Development Co., Ltd. (北京市商汤科技开发有限公司)
Priority date
Filing date
Publication date
Application filed by Beijing SenseTime Technology Development Co., Ltd. (北京市商汤科技开发有限公司)
Priority to JP2021526222A (published as JP2022515591A)
Priority to KR1020217015013A (published as KR20210078529A)
Publication of WO2020108311A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/50 - Context or environment of the image
    • G06V 20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Definitions

  • the present disclosure relates to computer vision technology, and in particular, to a target object 3D detection method and device, vehicle intelligent control method and device, obstacle avoidance navigation method and device, electronic equipment, computer readable storage medium, and computer program.
  • 3D detection can be applied to various technologies such as intelligent driving and obstacle avoidance navigation.
  • in intelligent driving technology, 3D detection can obtain the specific location, shape, size, and direction of movement of target objects around an intelligent driving vehicle, such as surrounding vehicles and pedestrians, which can help the intelligent driving vehicle make intelligent driving decisions.
  • Embodiments of the present disclosure provide technical solutions for target object 3D detection, intelligent vehicle control, and obstacle avoidance navigation.
  • a 3D detection method for a target object includes: extracting feature information of point cloud data of an acquired scene; performing semantic segmentation on the point cloud data according to the feature information of the point cloud data to obtain first semantic information of multiple points in the point cloud data; predicting, based on the first semantic information, at least one foreground point corresponding to the target object among the multiple points; generating, based on the first semantic information, a 3D initial frame corresponding to each of the at least one foreground point; and determining a 3D detection frame of the target object in the scene according to the 3D initial frame.
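As a rough illustration of the claimed flow (feature extraction, semantic segmentation, foreground prediction, per-foreground-point proposal), the following Python sketch wires the steps together with random placeholder "networks"; all function names, array shapes, and the 0.5 threshold are assumptions for illustration only, not the patented implementation.

```python
# Illustrative sketch of the claimed pipeline; the two "network" functions are
# random placeholders standing in for the trained neural networks.
import numpy as np

def extract_features(points):
    """Step 1 (assumed stub): one global feature vector per point."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((points.shape[0], 256)).astype(np.float32)

def segment_semantics(features):
    """Steps 2-4 (assumed stub): per-point foreground confidence and a
    7-parameter 3D initial frame (cx, cy, cz, h, w, l, theta) per point."""
    confidence = 1.0 / (1.0 + np.exp(-features[:, 0]))        # sigmoid of one channel
    frames = np.random.default_rng(1).standard_normal((features.shape[0], 7))
    return confidence, frames.astype(np.float32)

def detect_3d(points, threshold=0.5):
    features = extract_features(points)                # extract feature information
    confidence, frames = segment_semantics(features)   # first semantic information
    foreground = confidence > threshold                # predict foreground points
    proposals = frames[foreground]                     # 3D initial frames of foreground points
    return proposals                                   # redundancy removal (NMS) would follow

if __name__ == "__main__":
    cloud = np.random.rand(1024, 4).astype(np.float32)  # x, y, z, intensity
    print(detect_3d(cloud).shape)
```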
  • a vehicle intelligent control method includes: obtaining the 3D detection frame of the target object using the above 3D detection method for a target object; and generating, according to the 3D detection frame, an instruction or warning information for controlling the vehicle.
  • an obstacle avoidance navigation method includes: obtaining the 3D detection frame of the target object using the above 3D detection method for a target object; and generating, according to the 3D detection frame, an instruction or warning information for obstacle avoidance navigation control of the robot.
  • a target object 3D detection device includes: a feature extraction module for extracting feature information of point cloud data of an acquired scene; a first semantic segmentation module for performing semantic segmentation on the point cloud data according to the feature information of the point cloud data to obtain first semantic information of multiple points in the point cloud data; a foreground point prediction module for predicting, based on the first semantic information, at least one foreground point corresponding to the target object among the multiple points; an initial frame generation module for generating a 3D initial frame corresponding to each of the at least one foreground point according to the first semantic information; and a detection frame determination module for determining the 3D detection frame of the target object in the scene according to the 3D initial frame.
  • a vehicle intelligent control device includes: the above target object 3D detection device for obtaining the 3D detection frame of the target object; and a first control module configured to generate, according to the 3D detection frame, instructions or early-warning information for controlling the vehicle.
  • an obstacle avoidance navigation device includes: the above target object 3D detection device for obtaining the 3D detection frame of the target object; and a second control module configured to generate, according to the 3D detection frame, instructions or early-warning information for obstacle avoidance navigation control of the robot.
  • an electronic device includes: a memory for storing a computer program; and a processor for executing the computer program stored in the memory, wherein, when the computer program is executed, any method embodiment of the present disclosure is implemented.
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, any method embodiment of the present disclosure is implemented.
  • a computer program including computer instructions which, when run in a processor of a device, implement any method embodiment of the present disclosure.
  • in the present disclosure, feature extraction is performed on the point cloud data and semantic segmentation is performed on the point cloud data based on the extracted feature information; this part is equivalent to lower-layer data analysis. The 3D detection frame generated and determined based on the semantic segmentation result is equivalent to upper-layer data analysis. Therefore, in the 3D detection process of the target object, the present disclosure forms a bottom-up way to generate the 3D detection frame.
  • the technical solution provided by the present disclosure is beneficial to improve the detection performance of the 3D detection frame.
  • FIG. 1 is a flowchart of an embodiment of a 3D detection method for a target object of the present disclosure
  • FIG. 2 is a flowchart of another embodiment of the target object 3D detection method of the present disclosure.
  • FIG. 3 is a schematic structural diagram of a first-stage neural network of the present disclosure
  • FIG. 4 is another schematic structural diagram of the first-stage neural network of the present disclosure.
  • FIG. 5 is a schematic structural diagram of a second-stage neural network of the present disclosure.
  • FIG. 6 is a flowchart of an embodiment of a vehicle intelligent control method of the present disclosure
  • FIG. 7 is a flowchart of an embodiment of an obstacle avoidance navigation method of the present disclosure.
  • FIG. 8 is a schematic structural diagram of an embodiment of a target object 3D detection device of the present disclosure.
  • FIG. 9 is a schematic structural diagram of an embodiment of a vehicle intelligent control device of the present disclosure.
  • FIG. 10 is a schematic structural diagram of an embodiment of an obstacle avoidance navigation device of the present disclosure.
  • FIG. 11 is a block diagram of an exemplary device that implements an embodiment of the present disclosure.
  • Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, small computer systems, mainframe computer systems, and distributed cloud computing technology environments including any of the above, etc.
  • Electronic devices such as terminal devices, computer systems, and servers may be described in the general context of computer system executable instructions (such as program modules) executed by the computer system.
  • program modules may include routines, programs, target programs, components, logic, and data structures, etc., which perform specific tasks or implement specific abstract data types.
  • the computer system/server can be implemented in a distributed cloud computing environment, where tasks are performed by remote processing devices linked through a communication network.
  • program modules may be located on local or remote computing system storage media including storage devices.
  • the scene in the present disclosure may refer to a visually presented picture.
  • the image captured by the camera and the point cloud data (Point Cloud Data) obtained by the lidar scan can be regarded as a scene.
  • the point cloud data in the present disclosure generally refers to scanning information recorded in the form of points.
  • point cloud data obtained through lidar scanning.
  • Each point in the point cloud data can be described by a variety of information, and it can also be considered that each point in the point cloud data usually includes a variety of information, for example, it may include but is not limited to one or more of the following: Three-dimensional coordinates of points, color information (such as RGB information, etc.), and reflection intensity (Intensity) information, etc.
  • a point in the point cloud data can be described by one or more types of information such as three-dimensional coordinates, color information, and reflection intensity information.
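For concreteness, a lidar point cloud of this kind is often held as an N x C array with one row per point; the layout below (x, y, z, reflection intensity) is only an assumed example of the per-point information listed above.

```python
import numpy as np

# Hypothetical point cloud: 5 points, each described by 3D coordinates and intensity.
points = np.array([
    [12.3, -0.8, 1.1, 0.42],   # x, y, z, reflection intensity
    [12.5, -0.7, 1.0, 0.40],
    [ 3.1,  4.2, 0.2, 0.75],
    [ 3.0,  4.3, 0.3, 0.71],
    [50.6,  9.9, 2.4, 0.05],
], dtype=np.float32)

coords = points[:, :3]       # three-dimensional coordinates of the points
intensity = points[:, 3]     # reflection intensity information
print(coords.shape, intensity.mean())
```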
  • the present disclosure may utilize at least one convolutional layer in the neural network to process the point cloud data to form a feature map of the point cloud data, for example, forming a piece of feature information for each point in the point cloud data. Since the feature information formed here is formed separately for each point while considering all points in the entire spatial range of the point cloud data, the feature information formed here can be called global feature information.
  • S110 Perform semantic segmentation on the point cloud data according to the feature information of the point cloud data to obtain first semantic information of multiple points in the point cloud data.
  • the present disclosure can use a neural network to perform semantic segmentation on point cloud data.
  • the neural network can form first semantic information for multiple points in the point cloud data, or even for each point in the point cloud data. For example, after the point cloud data is provided to the neural network and the feature information of the point cloud data is extracted by the neural network, the neural network continues to process the feature information of the point cloud data to obtain the first semantic information of multiple points in the point cloud data.
  • the first semantic information of a point in the present disclosure generally refers to a semantic feature generated for the point in consideration of the entire point cloud data; therefore, the first semantic information can also be called the first semantic feature or the global semantic feature.
  • the global semantic features of points in the present disclosure can generally be expressed in the form of a one-dimensional vector array including multiple (e.g., 256) elements.
  • the global semantic features in this disclosure may also be referred to as global semantic feature vectors.
  • the foreground points and background points in the present disclosure are defined with respect to the target object.
  • a point belonging to a target object is a foreground point of that target object, while a point not belonging to the target object is a background point of that target object. For example, a point belonging to one target object is a foreground point of that target object, but since the point does not belong to other target objects, it is a background point of those other target objects.
  • the first semantic information of the multiple points obtained by the present disclosure generally includes: the global semantic features of the foreground points of the target object and the global semantic features of the background points of the target object.
  • the scene in the present disclosure may include one or more target objects.
  • Target objects in this disclosure include, but are not limited to: vehicles, non-motor vehicles, pedestrians, and/or obstacles, and the like.
  • the present disclosure may use a neural network to predict at least one foreground point corresponding to the target object among the multiple points; the neural network may make predictions separately for some points in the point cloud data, or even for each point in the point cloud data, so as to generate the confidence of each point being a foreground point.
  • the confidence of a point can be expressed as the probability of the point being a foreground point.
  • the neural network continues to process the global semantic features to predict, for multiple points in the point cloud data, the confidence of each point being a foreground point of the target object; the neural network can generate the confidence for each point separately.
  • each confidence generated by the neural network can be judged separately, and a point whose confidence exceeds a predetermined value can be taken as a foreground point of the target object.
  • the operation of determining the confidence in the present disclosure may be performed in S120 or S130.
  • if the confidence judgment operation is performed in S120 and the judgment result is that there is no point whose confidence exceeds the predetermined value, that is, there is no foreground point, it can be considered that there is no target object in the scene.
  • the present disclosure may obtain a global semantic feature of each point in S110, and generate a 3D initial frame for each point.
  • all the confidences obtained in S120 can be judged to select the foreground points of the target object, and the selected foreground points can be used to pick out, from the 3D initial frames generated in S130, the 3D initial frame corresponding to each foreground point. That is, the 3D initial frames generated in S130 usually include both the 3D initial frames corresponding to foreground points and the 3D initial frames corresponding to background points, so the 3D initial frames corresponding to the foreground points need to be filtered out from all the generated 3D initial frames.
  • alternatively, the present disclosure may generate a 3D initial frame respectively according to the global semantic feature of each foreground point predicted above, so that every obtained 3D initial frame is a 3D initial frame corresponding to a foreground point. That is, each 3D initial frame generated in S130 corresponds to a foreground point; in other words, S130 may generate 3D initial frames only for the foreground points.
  • the 3D initial frame in the present disclosure may be described by the position information of the center point of the 3D initial frame, the length, width, and height information of the 3D initial frame, and the direction information of the 3D initial frame; that is, in the present disclosure, the 3D initial frame may include the position information of its center point, its length, width, and height information, and its direction information.
  • the 3D initial frame may also be referred to as 3D initial frame information.
  • the present disclosure may utilize a neural network to generate the 3D initial frames. For example, after the point cloud data is provided to the neural network, the feature information of the point cloud data is extracted by the neural network, and the semantic segmentation process is performed by the neural network, the neural network continues to process the global semantic features to generate a 3D initial frame for each of the multiple points.
  • alternatively, after the point cloud data is provided to the neural network, the feature information is extracted, the semantic segmentation processing is performed, and the prediction processing on the global semantic features yields the confidences of multiple points in the point cloud data being foreground points of the target object, the neural network can continue to process the global semantic features of the points whose confidence exceeds the predetermined value, so as to generate a 3D initial frame for each foreground point.
  • since semantic segmentation is based on the feature information of all points in the point cloud data, the semantic features formed by semantic segmentation include not only the semantic feature of the point itself but also the semantic features of the surrounding points, so that multiple foreground points in this disclosure can semantically point to the same target object in the scene.
  • the 3D initial frames corresponding to different foreground points that point to the same target object are somewhat different, but usually the difference is not large.
  • if no 3D initial frame corresponding to a foreground point exists among the 3D initial frames generated in S130 according to the first semantic information, it may be considered that there is no target object in the scene.
  • S140 Determine the 3D detection frame of the target object in the scene according to the 3D initial frame.
  • the present disclosure finally determines a 3D detection frame for each target object.
  • the present disclosure may perform redundancy removal on the aforementioned 3D initial frames corresponding to all the foreground points, thereby obtaining the 3D detection frame of the target object, that is, the 3D detection frame finally obtained by performing target object detection on the point cloud data.
  • the present disclosure may use the degree of overlap between the 3D initial frames to remove redundant 3D initial frames, thereby obtaining the 3D detection frame of the target object.
  • for example, the present disclosure may determine the degree of overlap between the 3D initial frames corresponding to multiple foreground points, filter the 3D initial frames according to whether the overlap is greater than a set threshold, and then determine the 3D detection frame of the target object from the filtered 3D initial frames.
  • for example, the present disclosure may use the NMS (Non-Maximum Suppression) algorithm to perform redundancy removal on the 3D initial frames corresponding to all the foreground points, thereby removing redundant 3D frames that overlap each other and obtaining the final 3D detection frame.
  • in this way, the present disclosure can obtain a final 3D detection frame for each target object in the scene.
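A minimal sketch of overlap-based redundancy removal is shown below; for simplicity it scores overlap with an axis-aligned bird's-eye-view IoU rather than an oriented-box IoU, and the 0.7 threshold is an assumed value, so it only approximates the NMS step described above.

```python
import numpy as np

def bev_iou_axis_aligned(a, b):
    """Approximate bird's-eye-view IoU of two frames (cx, cy, cz, h, w, l, theta),
    ignoring the heading angle for simplicity."""
    ax1, ax2 = a[0] - a[5] / 2, a[0] + a[5] / 2
    ay1, ay2 = a[1] - a[4] / 2, a[1] + a[4] / 2
    bx1, bx2 = b[0] - b[5] / 2, b[0] + b[5] / 2
    by1, by2 = b[1] - b[4] / 2, b[1] + b[4] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[5] * a[4] + b[5] * b[4] - inter
    return inter / union if union > 0 else 0.0

def nms_3d(frames, scores, iou_threshold=0.7):
    """Keep the highest-confidence frame, drop frames that overlap it too much, repeat."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = np.array([i for i in rest
                          if bev_iou_axis_aligned(frames[best], frames[i]) <= iou_threshold])
    return keep
```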
  • the present disclosure may also perform correction (or optimization) on the 3D initial frames corresponding to the currently obtained foreground points, and then perform redundancy removal on all the corrected 3D initial frames to obtain the 3D detection frame of the target object, that is, the 3D detection frame finally obtained by performing target object detection on the point cloud data.
  • the process of respectively correcting the 3D initial frame corresponding to each foreground point in the present disclosure may include the following steps A1, B1, and C1:
  • Step A1 Acquire feature information of points in a partial area in the point cloud data, where the partial area includes at least a 3D initial frame.
  • the present disclosure may set a 3D expansion frame containing a 3D initial frame, and obtain feature information of each point in the 3D expansion frame in the point cloud data.
  • the 3D expansion frame in the present disclosure is one implementation of the partial area in the point cloud data.
  • the 3D initial frame corresponding to each foreground point in the present disclosure corresponds to a respective 3D expansion frame, and the space range occupied by the 3D expansion frame generally completely covers, and is slightly larger than, the space range occupied by the 3D initial frame.
  • optionally, no surface of the 3D initial frame lies in the same plane as any surface of its corresponding 3D expansion frame, the center point of the 3D initial frame coincides with the center point of the 3D expansion frame, and every surface of the 3D initial frame is parallel to the corresponding surface of its 3D expansion frame. Since the positional relationship between such a 3D expansion frame and the 3D initial frame is relatively standardized, it is beneficial to reducing the difficulty of forming the 3D expansion frame, thereby helping to reduce the implementation difficulty of the present disclosure. Of course, the present disclosure does not exclude the case in which the two center points do not coincide while every surface of the 3D initial frame is still parallel to the corresponding surface of its 3D expansion frame.
  • the present disclosure may, based on at least one of a preset X-axis direction increment (such as 20 cm), a Y-axis direction increment (such as 20 cm), and a Z-axis direction increment (such as 20 cm), expand the 3D initial frame corresponding to the foreground point in 3D space, so as to form a 3D expansion frame that contains the 3D initial frame, whose center point coincides with that of the 3D initial frame and whose surfaces are parallel to the corresponding surfaces of the 3D initial frame.
  • the increments in the present disclosure can be set according to actual needs; for example, the increment in a given direction does not exceed one N-th (N greater than 4, for example) of the corresponding side length of the 3D initial frame. Optionally, the increment in the X-axis direction does not exceed one tenth of the length of the 3D initial frame, the increment in the Y-axis direction does not exceed one tenth of the width of the 3D initial frame, and the increment in the Z-axis direction does not exceed one tenth of the height of the 3D initial frame. The increments in the X-axis, Y-axis, and Z-axis directions may be the same or different.
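The following helper illustrates the expansion just described: the initial frame's extents grow by per-axis increments while the center and direction stay fixed. The 0.2 m defaults mirror the 20 cm example; the axis convention and the per-side growth are assumptions of this sketch.

```python
import numpy as np

def expand_frame(frame, dx=0.2, dy=0.2, dz=0.2):
    """frame: (cx, cy, cz, h, w, l, theta). Returns a 3D expansion frame that keeps
    the center point and direction of the initial frame and enlarges its extents."""
    cx, cy, cz, h, w, l, theta = frame
    # Assumed axis convention for this sketch: l along X, w along Y, h along Z.
    # Each extent grows by twice the increment so both opposite faces move outward;
    # whether the increment is meant per side or in total is not specified in the text.
    return np.array([cx, cy, cz, h + 2 * dz, w + 2 * dy, l + 2 * dx, theta],
                    dtype=np.float32)

initial = np.array([5.0, 1.0, -0.5, 1.6, 1.8, 4.2, 0.3], dtype=np.float32)
print(expand_frame(initial))
```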
  • the present disclosure may use a neural network to obtain the feature information of the points in the partial area of the point cloud data; for example, all points in the partial area of the point cloud data are used as the input of the neural network, and at least one convolutional layer in the neural network processes the point cloud data in the partial area, so that feature information can be formed for each point in the partial area.
  • the feature information formed this time may be referred to as local feature information.
  • the feature information of the point cloud data formed this time is the feature information separately formed for each point in the partial area while considering all the points in the partial area of the point cloud data; therefore, the feature information formed this time can be called local feature information.
  • Step B1 Perform semantic segmentation on the points in the partial area according to the feature information of the points in the partial area to obtain second semantic information on the points in the partial area.
  • the second semantic information of a point in the present disclosure refers to: a semantic feature vector formed for the point in consideration of all points in the spatial range formed by the 3D extension box.
  • the second semantic information in this disclosure may be referred to as a second semantic feature or a local spatial semantic feature.
  • a local spatial semantic feature can also be expressed in the form of a one-dimensional vector array including multiple (e.g., 256) elements.
  • a neural network may be used to obtain local spatial semantic features of all points in the 3D expansion box, and a method of using neural networks to obtain local spatial semantic features of points may include the following steps a and b:
  • the preset target position of the 3D extension frame may include: the center point of the 3D extension frame (that is, the center point of the 3D initial frame) is located at the origin of coordinates, and the length of the 3D extension frame is parallel to the X axis.
  • the above coordinate origin and X axis may be the coordinate origin and X axis of the coordinate system of the point cloud data, and of course, may also be the coordinate origin and X axis of other coordinate systems.
  • for example, the i-th 3D initial frame may be expressed as b_i = (x_i, y_i, z_i, h_i, w_i, l_i, θ_i), where x_i, y_i and z_i represent the coordinates of the center point of the i-th 3D initial frame, h_i, w_i and l_i respectively represent the height, width and length of the i-th 3D initial frame, and θ_i represents the direction of the i-th 3D initial frame, e.g., the angle between the length of the i-th 3D initial frame and the X coordinate axis is θ_i. Then, after performing coordinate transformation on the 3D expansion frame that contains the i-th 3D initial frame, the present disclosure obtains a new 3D initial frame, which under this transformation can be expressed as b̃_i = (0, 0, 0, h_i, w_i, l_i, 0); that is, the center point of the new 3D initial frame is located at the origin of coordinates and, in a bird's-eye view, the angle between the length of the new 3D initial frame and the X coordinate axis is 0.
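A sketch of the regularized coordinate transformation described above: points inside the expansion frame are translated so the initial frame's center sits at the origin and rotated by -θ_i so the frame's length aligns with the X axis. Treating Z as the vertical rotation axis is an assumption of this sketch; the text only states the resulting canonical pose.

```python
import numpy as np

def canonical_transform(points_xyz, frame):
    """points_xyz: (N, 3) coordinates of points inside the 3D expansion frame.
    frame: (cx, cy, cz, h, w, l, theta) of the i-th 3D initial frame.
    Returns the coordinates after the regularized coordinate transformation."""
    cx, cy, cz, _, _, _, theta = frame
    shifted = points_xyz - np.array([cx, cy, cz], dtype=np.float32)  # center -> origin
    c, s = np.cos(-theta), np.sin(-theta)
    # Rotate by -theta around the (assumed) vertical Z axis so the frame's
    # length direction coincides with the X axis in the bird's-eye view.
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]], dtype=np.float32)
    return shifted @ rot.T

pts = np.array([[5.5, 1.2, -0.4], [4.8, 0.9, -0.6]], dtype=np.float32)
frame = np.array([5.0, 1.0, -0.5, 1.6, 1.8, 4.2, 0.3], dtype=np.float32)
print(canonical_transform(pts, frame))
```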
  • the above coordinate transformation manner of the present disclosure may be referred to as regularized coordinate transformation.
  • the present disclosure performs coordinate conversion on a point, and usually only changes the coordinate information of the point, but does not change other information of a point.
  • in this way, the coordinates of the points in different 3D initial frames can be concentrated within a roughly common range, which is beneficial to the training of the neural network, that is, to improving the accuracy with which the neural network forms local spatial semantic features, and thus helps to improve the accuracy of the 3D initial frame correction.
  • the coordinate transformation described above is only an optional example; those skilled in the art may also adopt other transformation methods that transform the coordinates into a certain range.
  • the coordinate-converted point cloud data (that is, the coordinate-converted points located in the 3D expansion frame) is provided to the neural network, and the neural network performs semantic segmentation processing on the received points, so as to generate a local spatial semantic feature for each point located in the 3D expansion frame.
  • in addition, the present disclosure may form a foreground point mask according to the confidences generated for the foreground points in the above steps (e.g., a point whose confidence exceeds a predetermined value (such as 0.5) is set to 1, and a point whose confidence does not exceed the predetermined value is set to 0, thereby forming the foreground point mask).
  • the present disclosure can provide the foreground point mask and the coordinate-transformed point cloud data together to the neural network, so that the neural network can refer to the foreground point mask when performing semantic processing, thereby helping to improve the description accuracy of the local spatial semantic features.
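The foreground point mask described above can be built with a simple threshold; the 0.5 cutoff follows the example in the text, and stacking the mask as an extra per-point input channel is one assumed way of "providing it together with" the transformed points.

```python
import numpy as np

confidence = np.array([0.91, 0.12, 0.67, 0.08], dtype=np.float32)  # from the first stage
mask = (confidence > 0.5).astype(np.float32)   # 1 for foreground points, 0 otherwise

canonical_pts = np.zeros((4, 3), dtype=np.float32)   # coordinate-transformed points (stub)
# One assumed packaging: append the mask as a fourth channel of the network input.
local_net_input = np.concatenate([canonical_pts, mask[:, None]], axis=1)
print(local_net_input.shape)   # (4, 4)
```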
  • Step C1 Form the corrected 3D initial frame according to the first semantic information and the second semantic information of the points in the partial area.
  • the method for obtaining the global semantic features of multiple points in the 3D expansion frame in the present disclosure may be: first, according to the coordinate information of each point in the point cloud data, determine whether each point belongs to the spatial range of the 3D expansion frame (i.e., whether it is located in the 3D expansion frame, which may include being located on any surface of the 3D expansion frame); for a point, if its position belongs to the spatial range of the 3D expansion frame, the point is regarded as a point belonging to the 3D expansion frame, and if not, the point is not regarded as a point belonging to the 3D expansion frame.
  • then, the global semantic features of these points are determined: for each point belonging to the 3D expansion frame, its global semantic feature can be found from the global semantic features of the points obtained above, and so on, so that the present disclosure can obtain the global semantic features of all points in the 3D expansion frame.
  • the neural network can process the global semantic features and local semantic features of each point, and obtain the corrected 3D initial frame according to the processing result of the neural network.
  • for example, the neural network encodes the global semantic features and local spatial semantic features of the points in the 3D expansion frame to obtain features used to describe the 3D initial frame in the 3D expansion frame; the neural network predicts, from the features used to describe the 3D initial frame, the confidence of the 3D initial frame being the target object, and adjusts the 3D initial frame according to the features used to describe the 3D initial frame, thereby obtaining the corrected 3D initial frame.
  • this is beneficial to improving the accuracy of the 3D initial frame, thereby helping to improve the accuracy of the 3D detection frame.
  • optionally, the global semantic feature and the local spatial semantic feature of each point in the 3D expansion frame can be stitched together; that is, for each point, its global semantic feature and its local spatial semantic feature are concatenated to form a stitched semantic feature. The stitched semantic features are used as the input of the neural network so that the neural network can encode them; after the encoding process, the neural network generates the features used to describe the 3D initial frame in the 3D expansion frame (hereinafter referred to as the encoded features).
  • the neural network can predict, for each input encoded feature, the confidence of the corresponding 3D initial frame being the target object, thereby forming a confidence for each 3D initial frame.
  • the confidence level can represent the probability that the corrected 3D initial frame is the target object.
  • the neural network can form a new 3D initial frame (that is, the corrected 3D initial frame) for each input encoded feature.
  • for example, the neural network respectively forms, according to each input encoded feature, the position information of the center point of the new 3D initial frame, the length, width, and height information of the new 3D initial frame, and the direction information of the new 3D initial frame.
  • for the process in which the present disclosure performs redundancy removal on all the corrected 3D initial frames to obtain the 3D detection frame of the target object, please refer to the corresponding descriptions above; it is not described in detail here.
  • one embodiment of the target object 3D detection method of the present disclosure includes steps: S200 and S210. Each step in FIG. 2 is described in detail below.
  • S200: Provide the point cloud data to a neural network, perform feature extraction processing on the points in the point cloud data via the neural network, perform semantic segmentation processing on the point cloud data according to the extracted feature information to obtain the semantic features of multiple points, predict the foreground points among the multiple points according to the semantic features, and generate a 3D initial frame corresponding to at least some of the multiple points.
  • the neural network in the present disclosure is mainly used to generate a 3D initial frame for multiple points in the input point cloud data (such as all points or multiple points in the point cloud data), so that each of the multiple points in the point cloud data corresponds to a 3D initial frame. Since the multiple points (such as each point) in the point cloud data usually contain both foreground points and background points, the 3D initial frames generated by the neural network of the present disclosure usually include the 3D initial frames corresponding to foreground points and the 3D initial frames corresponding to background points.
  • since the input of the neural network of the present disclosure is point cloud data, the neural network performs feature extraction on the point cloud data and performs semantic segmentation on the point cloud data based on the extracted feature information, which belongs to lower-layer data analysis; and since the neural network of the present disclosure generates the 3D initial frames based on the result of the semantic segmentation, which is equivalent to upper-layer data analysis, the present disclosure forms, in the process of 3D detection of a target object, a bottom-up way to generate the 3D detection frame.
  • the neural network of the present disclosure generates the 3D initial frames in a bottom-up manner. This avoids projecting the point cloud data and performing 3D detection frame detection on the images obtained after projection, which causes a loss of the original information of the point cloud data, a loss that is not conducive to improving the performance of 3D detection frame detection. The present disclosure can also avoid using 2D images taken by a camera for 3D detection frame detection, where a target object (such as a vehicle or an obstacle) may be blocked, which affects the detection of the 3D detection frame and is likewise not conducive to improving the performance of 3D detection frame detection. It can be seen that generating the 3D initial frames in a bottom-up manner is beneficial to improving the detection performance of the 3D detection frame.
  • the neural network in the present disclosure may be divided into multiple parts, and each part may be implemented by a small neural network (also called a neural network unit or a neural network module, etc.); that is, the neural network consists of multiple small neural networks. Since part of the structure of the neural network of the present disclosure can adopt the structure of RCNN (Regions with Convolutional Neural Network), the neural network of the present disclosure can be called PointRCNN (Point Regions with Convolutional Neural Network, i.e., a point-based regional convolutional neural network).
  • the 3D initial frame generated by the neural network of the present disclosure may include: position information of the center point of the 3D initial frame (such as the coordinates of the center point), length, width, and height information of the 3D initial frame, and direction information of the 3D initial frame (such as the angle between the length of the 3D initial frame and the X coordinate axis), etc.
  • the 3D initial frame formed by the present disclosure may also include: position information of the center point of the bottom or top surface of the 3D initial frame, length, width, and height information of the 3D initial frame, and direction information of the 3D initial frame.
  • the present disclosure does not limit the specific expression form of the 3D initial frame.
  • the neural network of the present disclosure may include: a first neural network, a second neural network, and a third neural network.
  • the point cloud data is provided to the first neural network.
  • the first neural network is used to perform feature extraction processing on multiple points (such as all points) in the received point cloud data, so as to form a piece of global feature information separately for each point in the point cloud data, and to perform semantic segmentation processing according to the global feature information of the multiple points (such as all points), thereby forming a global semantic feature for each point; the first neural network outputs the global semantic feature of each point.
  • the global semantic features of points can usually be expressed in the form of a one-dimensional vector array including multiple (e.g., 256) elements.
  • the global semantic features in this disclosure may also be referred to as global semantic feature vectors.
  • the points in the point cloud data include: foreground points and background points.
  • the information output by the first neural network usually includes: the global semantic features of the foreground points and the global semantic features of the background points.
  • the first neural network in the present disclosure may be implemented using Point Cloud Encoder (Point Cloud Data Encoder) and Point Cloud Decoder (Point Cloud Data Decoder).
  • for example, the first neural network may adopt a network structure such as the PointNet++ or PointSIFT network model.
  • the second neural network in the present disclosure may be implemented using MLP (Multi-Layer Perceptron), and the output dimension of the MLP used to implement the second neural network may be 1.
  • the third neural network in the present disclosure may also be implemented using MLP, and the output dimensions of the MLP used to implement the third neural network are multi-dimensional, and the number of dimensions is related to the information included in the 3D detection frame information.
  • the present disclosure needs to use the global semantic features to realize the prediction of the foreground points and the generation of the 3D initial frames.
  • the present disclosure can adopt the following two manners to realize the prediction of the foreground points and the generation of the 3D initial frames.
  • Manner 1: the global semantic features of each point output by the first neural network are provided to the second neural network and the third neural network simultaneously (as shown in FIG. 3).
  • the second neural network is used to predict, for each input global semantic feature, the confidence of the corresponding point being a foreground point, and to output a confidence for each point.
  • the confidence predicted by the second neural network may indicate the probability that the point is a foreground point.
  • the third neural network is used to generate and output a 3D initial frame for the global semantic feature of each input point. For example, the third neural network outputs, for each point and according to its global semantic feature, the position information of the center point of the 3D initial frame, the length, width, and height information of the 3D initial frame, and the direction information of the 3D initial frame.
  • in Manner 1, the 3D initial frames output by the third neural network usually include both the 3D initial frames corresponding to foreground points and the 3D initial frames corresponding to background points; however, the third neural network itself cannot distinguish whether each output 3D initial frame corresponds to a foreground point or to a background point.
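As an illustration of Manner 1, the sketch below feeds each point's 256-dimensional global semantic feature into two small MLP heads in parallel: a 1-dimensional confidence head (standing in for the second neural network) and a frame head (standing in for the third neural network). The layer sizes, the 7-parameter frame output, and the PyTorch framing are assumptions; the text only fixes the roles and output meanings of the heads.

```python
import torch
import torch.nn as nn

class Stage1Heads(nn.Module):
    """Parallel heads over per-point global semantic features (Manner 1 sketch)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # Confidence head (second neural network stand-in): output dimension 1.
        self.confidence_head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))
        # Frame head (third neural network stand-in): (cx, cy, cz, h, w, l, theta).
        self.frame_head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 7))

    def forward(self, global_semantic_features):      # (N, feat_dim)
        confidence = torch.sigmoid(self.confidence_head(global_semantic_features))
        frames = self.frame_head(global_semantic_features)
        return confidence.squeeze(-1), frames

features = torch.randn(1024, 256)                      # global semantic features of 1024 points
conf, frames = Stage1Heads()(features)
print(conf.shape, frames.shape)                        # torch.Size([1024]) torch.Size([1024, 7])
```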
  • Manner 2: the global semantic features of each point output by the first neural network are first provided to the second neural network, and the second neural network predicts, for each input global semantic feature, the confidence of the corresponding point being a foreground point; for a point determined to be a foreground point, the global semantic feature of the point is then provided to the third neural network (as shown in FIG. 4).
  • the third neural network generates a 3D initial frame for each global semantic feature it receives (each belonging to a foreground point), and outputs the corresponding 3D initial frame for each foreground point.
  • in Manner 2, the present disclosure does not provide the global semantic feature of a point to the third neural network when the confidence, output by the second neural network, of that point being a foreground point does not exceed the predetermined value; therefore, all the 3D initial frames output by the third neural network are 3D initial frames corresponding to foreground points.
  • in Manner 1, the present disclosure may determine, according to the confidences output by the second neural network, whether each 3D initial frame output by the third neural network corresponds to a foreground point or to a background point. For example, if the confidence of the first point exceeds the predetermined value, the first point is determined to be a foreground point, so that the 3D initial frame output by the third neural network for the first point is determined to be a 3D initial frame corresponding to a foreground point; and so on, according to the confidences output by the second neural network, the present disclosure can select, from all the 3D initial frames output by the third neural network, the 3D initial frames corresponding to all the foreground points. Afterwards, the present disclosure may perform redundancy removal on the 3D initial frames corresponding to all the selected foreground points, thereby obtaining the final 3D detection frame, that is, the 3D detection frame detected from the point cloud data.
  • for example, the present disclosure may use the NMS (Non-Maximum Suppression) algorithm to perform redundancy removal on the 3D initial frames corresponding to all the currently selected foreground points, thereby removing redundant 3D detection frames that overlap each other and obtaining the final 3D detection frame.
  • in Manner 2, since the 3D initial frames output by the third neural network are already the 3D initial frames corresponding to foreground points, the present disclosure can directly perform redundancy removal on all the 3D initial frames output by the third neural network to obtain the final 3D detection frame, that is, the 3D detection frame detected from the point cloud data (refer to the related description in the above embodiment).
  • for example, the present disclosure may use the NMS algorithm to perform redundancy removal on all the 3D initial frames output by the third neural network, thereby removing redundant 3D initial frames that overlap each other and obtaining the final 3D detection frame.
  • in addition, the present disclosure can correct the 3D initial frame corresponding to each foreground point separately, and perform redundancy removal on the corrected 3D initial frames corresponding to the foreground points to obtain the final 3D detection frame. That is to say, the process of generating the 3D detection frame by the neural network of the present disclosure can be divided into two stages: the 3D initial frames generated by the first-stage neural network are provided to the second-stage neural network, the second-stage neural network corrects the 3D initial frames generated by the first-stage neural network (such as position optimization), and then the present disclosure determines the final 3D detection frame according to the 3D initial frames corrected by the second-stage neural network.
  • the final 3D detection frame is the 3D detection frame detected by the present disclosure based on point cloud data.
  • the process of generating the 3D detection frame by the neural network of the present disclosure may also involve only the first-stage neural network and not the second-stage neural network. In the case where only the first-stage neural network is involved, it is also completely feasible for the present disclosure to determine the final 3D detection frame according to the 3D initial frames generated by the first-stage neural network.
  • Both the first-stage neural network and the second-stage neural network in this disclosure can be implemented by neural networks that can exist independently, or can be composed of part of the network structural units of a complete neural network. In addition, for ease of description, the networks involved are called the first neural network, the second neural network, the third neural network, the fourth neural network, the fifth neural network, the sixth neural network, and the seventh neural network, but it should be understood that each of the first to seventh neural networks may be an independent neural network or may be composed of some network structural units in a larger neural network, which is not limited in this disclosure.
  • the process of using the neural network to correct the 3D initial frame corresponding to each foreground point in the present disclosure may include the following steps A2, B2, and C2:
  • Step A2: Set a 3D expansion frame containing the 3D initial frame, and obtain the global semantic features of the points in the 3D expansion frame.
  • each 3D initial frame in the present disclosure corresponds to a 3D extension frame, and the space range occupied by the 3D extension frame generally completely covers the space range occupied by the 3D initial frame.
  • optionally, no surface of the 3D initial frame lies in the same plane as any surface of its corresponding 3D expansion frame, the center point of the 3D initial frame coincides with the center point of the 3D expansion frame, and every surface of the 3D initial frame is parallel to the corresponding surface of its 3D expansion frame.
  • the present disclosure does not exclude the case that although the two center points do not coincide, any face of the 3D initial frame is parallel to the corresponding face of the corresponding 3D extension frame.
  • the present disclosure may, based on at least one of a preset X-axis direction increment (such as 20 cm), a Y-axis direction increment (such as 20 cm), and a Z-axis direction increment (such as 20 cm), expand the 3D initial frame of the foreground point in 3D space, so as to form a 3D expansion frame that contains the 3D initial frame, whose center point coincides with that of the 3D initial frame and whose surfaces are parallel to the corresponding surfaces of the 3D initial frame.
  • the local space in the present disclosure generally refers to: the spatial range formed by the 3D expansion frame.
  • the local spatial semantic feature of a point generally refers to a semantic feature vector formed for that point when considering all the points in the spatial range formed by the 3D extension box.
  • a local spatial semantic feature can also be expressed in the form of a one-dimensional vector array including multiple (e.g., 256) elements.
  • the method for obtaining the global semantic features of multiple points in the 3D expansion frame in the present disclosure may be: first, according to the coordinate information of each point in the point cloud data, determine whether each point belongs to the spatial range of the 3D expansion frame (i.e., whether it is located in the 3D expansion frame, which may include being located on any surface of the 3D expansion frame); for a point, if its position belongs to the spatial range of the 3D expansion frame, the point is regarded as a point belonging to the 3D expansion frame, and if not, the point is not regarded as a point belonging to the 3D expansion frame.
  • then, the global semantic features of these points are determined: for each point belonging to the 3D expansion frame, its global semantic feature can be found from the global semantic features of the points obtained above, and so on, so that the present disclosure can obtain the global semantic features of all points in the 3D expansion frame.
  • Step B2 The point cloud data located in the 3D extension box is provided to the fourth neural network in the neural network, and the local spatial semantic features of the points in the 3D extension box are generated via the fourth neural network.
  • the method for obtaining the local spatial semantic features of all points in the 3D extension frame in the present disclosure may include the following steps a and b:
  • the preset target position of the 3D extension frame may include: the center point of the 3D extension frame (that is, the center point of the 3D initial frame) is located at the origin of coordinates, and the length of the 3D extension frame is parallel to the X axis.
  • the above coordinate origin and X axis may be the coordinate origin and X axis of the coordinate system of the point cloud data, and of course, may also be the coordinate origin and X axis of other coordinate systems.
  • for example, the i-th 3D initial frame may be expressed as b_i = (x_i, y_i, z_i, h_i, w_i, l_i, θ_i), where x_i, y_i and z_i represent the coordinates of the center point of the i-th 3D initial frame, h_i, w_i and l_i respectively represent the height, width and length of the i-th 3D initial frame, and θ_i represents the direction of the i-th 3D initial frame, e.g., the angle between the length of the i-th 3D initial frame and the X coordinate axis is θ_i. Then, after performing coordinate transformation on the 3D expansion frame that contains the i-th 3D initial frame, the present disclosure obtains a new 3D initial frame, which under this transformation can be expressed as b̃_i = (0, 0, 0, h_i, w_i, l_i, 0); that is, the center point of the new 3D initial frame is located at the origin of coordinates and, in a bird's-eye view, the angle between the length of the new 3D initial frame and the X coordinate axis is 0.
  • the coordinate-converted point cloud data (that is, the coordinate-converted points located in the 3D expansion frame) is provided to the fourth neural network in the neural network; the fourth neural network performs feature extraction processing on the received points and semantic segmentation processing based on the extracted local feature information, so as to generate a local spatial semantic feature for each point located in the 3D expansion frame.
  • in addition, the present disclosure can also form a foreground point mask (e.g., a point whose confidence exceeds a predetermined value (such as 0.5) is set to 1, while a point whose confidence does not exceed the predetermined value is set to 0).
  • the present disclosure can provide the foreground point mask together with the coordinate-converted point cloud data to the fourth neural network, so that the fourth neural network can refer to the foreground point mask when performing feature extraction and semantic processing, thereby helping to improve the description accuracy of the local spatial semantic features.
  • the fourth neural network in the present disclosure may be implemented using MLP, and the output dimensions of the MLP used to implement the fourth neural network are generally multi-dimensional, and the number of dimensions is related to the information included in the local spatial semantic features.
  • Step C2 Through the fifth neural network in the neural network, encode the global semantic features and local spatial semantic features of the points in the 3D extension box to obtain the features describing the 3D initial box in the 3D extension box, And through the sixth neural network in the neural network to predict the confidence of the 3D initial frame according to the characteristics of the 3D initial frame, the seventh neural network in the neural network according to the characteristics of the 3D initial frame, Correcting the 3D initial frame is beneficial to improve the accuracy of the 3D initial frame, and thus to improve the accuracy of the 3D detection frame.
  • the fifth neural network in the present disclosure may be implemented using Point Cloud Encoder (point cloud data encoder).
  • for example, the fifth neural network may adopt a partial network structure of a network model such as PointNet++ or PointSIFT.
  • the sixth neural network in the present disclosure may be implemented using MLP, and the output dimension of the MLP used to implement the sixth neural network may be 1, and the number of dimensions may be related to the number of types of target objects.
  • the seventh neural network in the present disclosure may also be implemented using MLP, and the output dimensions of the MLP used to implement the seventh neural network are multi-dimensional, and the number of dimensions is related to the information included in the 3D detection frame information.
  • the first neural network to the seventh neural network in the present disclosure may all be implemented by a neural network that can exist independently, or by a part of a neural network that cannot exist independently.
  • optionally, the global semantic feature and the local spatial semantic feature of each point in the 3D expansion frame can be stitched together; that is, for each point, its global semantic feature and its local spatial semantic feature are concatenated to form a stitched semantic feature. The stitched semantic features are provided as input to the fifth neural network, so that the fifth neural network can encode the stitched semantic features; the fifth neural network outputs the encoded features, which describe the 3D initial frame in the 3D expansion frame (hereinafter referred to as the encoded features).
  • the encoded features output by the fifth neural network are simultaneously provided to the sixth neural network and the seventh neural network (as shown in FIG. 5).
  • the sixth neural network is used to predict, for each input encoded feature, the confidence of the corresponding 3D initial frame being the target object, and to output the confidence for each 3D initial frame.
  • the confidence predicted by the sixth neural network may represent the probability that the corrected 3D initial frame is the target object.
  • the target object here may be a vehicle or a pedestrian.
  • the seventh neural network is used to form, for each input encoded feature, a new 3D initial frame (that is, the corrected 3D initial frame), and to output it.
  • for example, the seventh neural network respectively outputs, according to each input encoded feature, the position information of the center point of the new 3D initial frame, the length, width, and height information of the new 3D initial frame, and the direction information of the new 3D initial frame, etc.
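A compact sketch of the Step C2 wiring: stitched (global + local) per-point features are pooled by a stand-in encoder into one descriptor per 3D initial frame, which a confidence head (sixth-network role) and a correction head (seventh-network role) then consume. The max-pooling encoder and all layer sizes are assumptions, not the patented architecture.

```python
import torch
import torch.nn as nn

class Stage2Refinement(nn.Module):
    def __init__(self, global_dim=256, local_dim=256):
        super().__init__()
        stitched = global_dim + local_dim
        # Encoder (fifth-network stand-in): encode stitched per-point features
        # into one feature describing the 3D initial frame.
        self.encoder = nn.Sequential(nn.Linear(stitched, 256), nn.ReLU())
        # Confidence head (sixth-network role): probability that the corrected frame
        # is the target object.
        self.confidence_head = nn.Linear(256, 1)
        # Correction head (seventh-network role): corrected frame parameters.
        self.correction_head = nn.Linear(256, 7)

    def forward(self, global_feats, local_feats):      # each (num_points, dim)
        stitched = torch.cat([global_feats, local_feats], dim=-1)
        encoded = self.encoder(stitched).max(dim=0).values   # pool points -> frame feature
        confidence = torch.sigmoid(self.confidence_head(encoded))
        corrected = self.correction_head(encoded)
        return confidence, corrected

conf, corrected = Stage2Refinement()(torch.randn(128, 256), torch.randn(128, 256))
print(conf.shape, corrected.shape)    # torch.Size([1]) torch.Size([7])
```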
  • the neural network of the present disclosure is obtained by training using multiple point cloud data samples with 3D annotation frames.
  • for example, the present disclosure can obtain the loss corresponding to the confidence generated by the neural network to be trained, and the loss of the 3D initial frame generated by the neural network to be trained for the point cloud data sample relative to the 3D annotation frame of the point cloud data sample, and train the neural network using these losses.
  • the network parameters in this disclosure may include, but are not limited to, convolution kernel parameters and weight values.
  • for example, the present disclosure can obtain the loss corresponding to the confidence generated by the first-stage neural network and the loss corresponding to the 3D initial frame, and use these two losses of the first-stage neural network to adjust the network parameters of the first-stage neural network (such as the first neural network, the second neural network, and the third neural network); after the first-stage neural network is successfully trained, the entire neural network is successfully trained.
  • the present disclosure can separately train the first stage neural network and the second stage neural network. For example, first obtain the loss corresponding to the confidence generated by the first stage neural network and the loss corresponding to the 3D initial frame, and use these two losses to adjust the network parameters of the first stage neural network.
  • Then, the 3D initial frames corresponding to the foreground points output by the first-stage neural network are provided as input to the second-stage neural network; the loss corresponding to the confidence generated by the second-stage neural network and the loss corresponding to the corrected 3D initial frame are obtained, and these two losses of the second-stage neural network are used to adjust the network parameters of the second-stage neural network (such as the fourth neural network, the fifth neural network, the sixth neural network, and the seventh neural network).
  • After both stages have been trained in this way, the entire neural network is successfully trained.
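  • A minimal sketch of this separate, stage-by-stage training procedure is shown below, assuming PyTorch; the loss functions, data loaders, and network classes are illustrative placeholders rather than the disclosure's exact components.

```python
# A minimal sketch of the separate two-stage training procedure described
# above, assuming PyTorch; the loss functions, data loader, and network
# classes are illustrative placeholders, not the disclosure's exact API.
import torch

def train_stage(network, data_loader, loss_fn, epochs=10, lr=1e-3):
    optimizer = torch.optim.Adam(network.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in data_loader:
            # Each stage produces a confidence output and a box output,
            # and is supervised by a confidence loss plus a box regression loss.
            confidence, boxes = network(batch["points"])
            loss = loss_fn(confidence, boxes, batch["gt_boxes"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Stage 1 is trained first on its confidence + 3D-initial-frame losses;
# its output proposals then supervise stage 2's confidence + corrected-frame losses.
# train_stage(stage1_net, point_cloud_loader, stage1_loss)
# train_stage(stage2_net, proposal_loader, stage2_loss)
```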
  • The loss corresponding to the confidence generated by the first-stage neural network in the present disclosure can be expressed by formula (1), in which L_reg represents the regression loss function of the 3D detection frame and N_pos represents the number of foreground points.
  • A bucket in the present disclosure may refer to one of the value ranges obtained by dividing the spatial range around a point; each bucket may have a corresponding number, and the range covered by a bucket is usually fixed. When the range of a bucket is a length, the bucket has a fixed length; for example, the length of a bucket may be 0.5 m, and the value ranges of different buckets may then be 0-0.5 m, 0.5 m-1 m, and so on. When the range of a bucket is an angle range, the bucket has a fixed angle interval; for example, the present disclosure can divide 2π into multiple angle intervals, where one angle interval corresponds to one value range, and the size of the bucket is that angle interval.
  • S represents the search distance of the foreground point p along the x-axis or z-axis; that is, in the case where the parameter u is x, S represents the search distance along the x-axis used when generating the 3D initial frame for the foreground point p.
  • C is a constant value, and C may be related to the length of the bucket, for example, C is equal to the length of the bucket or half the length of the bucket.
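  • The following is a minimal sketch of the bucket-based encoding implied above, in which an offset is split into a bucket number plus a residual within that bucket; the 0.5 m bucket length, the 3 m search distance S, and the 12 angle buckets are illustrative assumptions.

```python
# A minimal sketch of the bucket (bin) based localization encoding implied
# above: a coordinate offset is split into a bucket index plus a residual
# inside that bucket. Values such as the 0.5 m bucket length and the 3 m
# search distance S are illustrative assumptions.
import math

def encode_with_buckets(value, search_distance=3.0, bucket_length=0.5):
    """Return (bucket_index, residual) for a signed offset `value` in
    [-search_distance, search_distance)."""
    shifted = value + search_distance          # map to [0, 2 * S)
    bucket_index = int(shifted // bucket_length)
    residual = shifted - bucket_index * bucket_length
    return bucket_index, residual

def encode_angle_with_buckets(theta, num_buckets=12):
    """Split an orientation angle in [0, 2*pi) into a bucket index and residual."""
    bucket_size = 2.0 * math.pi / num_buckets  # fixed angle interval per bucket
    theta = theta % (2.0 * math.pi)
    bucket_index = int(theta // bucket_size)
    residual = theta - bucket_index * bucket_size
    return bucket_index, residual

# Example: an x-offset of 0.7 m falls in bucket 7 with a 0.2 m residual
# when S = 3.0 m and the bucket length is 0.5 m.
print(encode_with_buckets(0.7))
```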
  • When a predetermined iteration condition is met, the training process ends.
  • The predetermined iteration conditions in the present disclosure may include: the difference between the 3D initial frame output by the third neural network and the 3D annotation frame of the point cloud data sample meets a predetermined difference requirement, and the confidence output by the second neural network meets a predetermined requirement. In the case that both requirements are met, the first to third neural networks are successfully trained this time.
  • The predetermined iteration conditions in the present disclosure may also include: the number of point cloud data samples used to train the first to third neural networks reaches a predetermined number requirement, and so on. If the number of point cloud data samples used reaches the predetermined number requirement but the two requirements above are not both met, the first to third neural networks are not successfully trained this time.
  • the first to third neural networks that have been successfully trained can be used for 3D detection of the target object.
  • The successfully trained first to third neural networks can also be used to generate the 3D initial frames corresponding to the foreground points of the point cloud data samples. That is, the present disclosure can again provide point cloud data samples to the successfully trained first neural network and store the information output by the second neural network and the third neural network, so as to provide input (that is, the 3D initial frames corresponding to the foreground points) to the second-stage neural network. After that, the loss corresponding to the confidence generated in the second stage and the loss corresponding to the corrected 3D initial frame are obtained, and the obtained losses are used to adjust the parameters of the fourth to seventh neural networks; after the fourth to seventh neural networks are successfully trained, the entire neural network is successfully trained.
  • The loss function used for adjusting the network parameters of the fourth to seventh neural networks in the second-stage neural network in the present disclosure includes the loss corresponding to the confidence and the loss corresponding to the corrected 3D initial frame, and can be expressed by formula (9).
  • In formula (9), B represents the set of 3D initial frames, and the number of 3D initial frames in the set is used; B_pos is a subset of B in which the overlap between each 3D initial frame and the corresponding 3D annotation frame exceeds the set threshold, and the number of 3D initial frames in that subset is likewise used; the formula further involves the information of the i-th 3D annotation frame and the information of the i-th 3D annotation frame after coordinate conversion, the i-th corrected 3D initial frame (xi, yi, zi, hi, wi, li, θi) and the i-th corrected 3D initial frame after coordinate conversion, and the size of the bucket, that is, the angle interval of the bucket.
  • When a predetermined iteration condition is met, the training process ends.
  • The predetermined iteration conditions in the present disclosure may include: the difference between the 3D initial frame output by the seventh neural network and the 3D annotation frame of the point cloud data sample meets a predetermined difference requirement, and the confidence output by the sixth neural network meets a predetermined requirement. In the case that both requirements are met, the fourth to seventh neural networks are successfully trained this time.
  • The predetermined iteration conditions in the present disclosure may also include: the number of point cloud data samples used to train the fourth to seventh neural networks reaches a predetermined number requirement, and so on. If the number of point cloud data samples used reaches the predetermined number requirement but the two requirements above are not both met, the fourth to seventh neural networks are not successfully trained this time.
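  • The following is a minimal sketch of the structure of this second-stage loss: a confidence loss computed over all 3D initial frames plus a regression loss computed only over the positive subset B_pos whose overlap with the matched 3D annotation frame exceeds a threshold. PyTorch, the specific loss choices (binary cross-entropy, smooth L1), and the 0.55 threshold are assumptions for illustration.

```python
# Sketch of the second-stage loss structure around formula (9): a confidence
# loss over all proposals plus a regression loss over the positive subset.
import torch
import torch.nn.functional as F

def stage2_loss(pred_conf, pred_boxes, gt_boxes, ious, iou_threshold=0.55):
    # pred_conf: probabilities in [0, 1], shape (B,)
    # pred_boxes / gt_boxes: (B, 7); ious: overlap with matched annotation, (B,)
    labels = (ious > iou_threshold).float()
    conf_loss = F.binary_cross_entropy(pred_conf, labels)      # over all of B
    pos = labels.bool()
    if pos.any():                                               # over B_pos only
        reg_loss = F.smooth_l1_loss(pred_boxes[pos], gt_boxes[pos])
    else:
        reg_loss = torch.zeros((), device=pred_conf.device)
    return conf_loss + reg_loss
```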
  • FIG. 6 is a flowchart of an embodiment of a vehicle intelligent control method of the present disclosure.
  • the method of this embodiment includes steps: S600, S610, S620, S630, S640, and S650. Each step in FIG. 6 will be described in detail below.
  • S600. Extract feature information of the point cloud data of the acquired scene.
  • S610. Perform semantic segmentation on the point cloud data according to the feature information of the point cloud data to obtain first semantic information of multiple points in the point cloud data.
  • S620. Predict at least one foreground point corresponding to the target object among the multiple points according to the first semantic information.
  • S630. Generate a 3D initial frame corresponding to each of the at least one foreground point according to the first semantic information.
  • S640. Determine the 3D detection frame of the target object in the scene according to the 3D initial frame.
  • S650. Generate, according to the 3D detection frame, an instruction or early warning prompt information for controlling the vehicle.
  • The above S600-S640 can be implemented by providing the point cloud data to a neural network: the neural network extracts feature information from the points in the point cloud data, performs semantic segmentation based on the extracted feature information to obtain the semantic features of multiple points, predicts the foreground points among the multiple points according to the semantic features, and generates 3D initial frames corresponding to at least some of the multiple points.
  • the present disclosure may first determine at least one of the following information of the target object according to the 3D detection frame: the spatial position, size, distance to the vehicle, and relative orientation information of the target object in the scene. Then, according to the determined at least one piece of information, an instruction or early warning message for controlling the vehicle is generated.
  • the instructions generated by the present disclosure are, for example, an instruction to increase the speed, an instruction to decrease the speed, or an emergency braking instruction.
  • The generated early warning prompt information may be, for example, a prompt to pay attention to a target object, such as a vehicle or a pedestrian, at a certain position.
  • the present disclosure does not limit the specific implementation of generating instructions or warning prompt information according to the 3D detection frame.
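  • As an illustration only, the following sketch maps the 3D detection frame of a target object to a control instruction or warning based on its distance to the vehicle; the thresholds and instruction names are assumptions, not values given by the disclosure.

```python
# A minimal sketch of turning a 3D detection frame into a control instruction
# or warning, as described above. The distance thresholds and instruction
# names are illustrative assumptions.
import math

def decide_vehicle_action(box_center_xyz, ego_xyz=(0.0, 0.0, 0.0)):
    # Distance from the ego vehicle to the detected target object's box center.
    distance = math.dist(box_center_xyz, ego_xyz)
    if distance < 5.0:
        return "emergency_braking"
    if distance < 15.0:
        return "decrease_speed"
    if distance < 30.0:
        return "warning: target object ahead, pay attention"
    return "maintain_or_increase_speed"

print(decide_vehicle_action((0.0, 0.0, 12.0)))  # -> "decrease_speed"
```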
  • FIG. 7 is a flowchart of an embodiment of an obstacle avoidance navigation method of the present disclosure.
  • the method in this embodiment includes steps: S700, S710, S720, S730, S740, and S750. Next, each step in FIG. 7 will be described in detail.
  • S700. Extract feature information of the point cloud data of the acquired scene.
  • S710. Perform semantic segmentation on the point cloud data according to the feature information of the point cloud data to obtain first semantic information of multiple points in the point cloud data.
  • S720. Predict at least one foreground point corresponding to the target object among the multiple points according to the first semantic information.
  • S730. Generate a 3D initial frame corresponding to each of the at least one foreground point according to the first semantic information.
  • S740. Determine the 3D detection frame of the target object in the scene according to the 3D initial frame.
  • The above S700-S740 may be implemented by providing the point cloud data to a neural network: the neural network extracts feature information from the points in the point cloud data, performs semantic segmentation based on the extracted feature information to obtain the semantic features of multiple points, predicts the foreground points among the multiple points according to the semantic features, and generates 3D initial frames corresponding to at least some of the multiple points.
  • S750 According to the above 3D detection frame, generate an instruction or warning prompt information for performing obstacle avoidance navigation control on the robot where the lidar is located.
  • the present disclosure may first determine at least one of the following information of the target object according to the 3D detection frame: the spatial position, size, distance to the robot, and relative orientation information of the target object in the scene. Then, according to the determined at least one piece of information, an instruction or early warning prompt information for performing obstacle avoidance navigation control on the robot is generated.
  • the instructions generated by the present disclosure are, for example, an instruction to reduce the speed of an action, an instruction to suspend an action, or a turn instruction.
  • The generated early warning prompt information may be, for example, a prompt to pay attention to an obstacle (that is, a target object) in a certain direction.
  • the present disclosure does not limit the specific implementation of generating instructions or warning prompt information according to the 3D detection frame.
  • FIG. 8 is a schematic structural diagram of an embodiment of a target object 3D detection device of the present disclosure.
  • The device shown in FIG. 8 includes: a feature extraction module 800, a first semantic segmentation module 810, a foreground point prediction module 820, an initial frame generation module 830, and a detection frame determination module 840.
  • the feature extraction module 800 is mainly used to extract feature information of point cloud data of the acquired scene.
  • the first semantic segmentation module 810 is mainly used to perform semantic segmentation processing on the point cloud data according to the feature information of the point cloud data to obtain first semantic information of multiple points in the point cloud data.
  • The foreground point prediction module 820 is mainly used to predict at least one foreground point corresponding to the target object among the multiple points according to the first semantic information.
  • The initial frame generation module 830 is mainly used to generate a 3D initial frame corresponding to each of the at least one foreground point according to the first semantic information.
  • the detection frame determination module 840 is mainly used to determine the 3D detection frame of the target object in the scene according to the 3D initial frame.
  • The detection frame determination module 840 may include: a first submodule, a second submodule, and a third submodule.
  • the first sub-module is mainly used to obtain characteristic information of points in a partial area in the point cloud data, where the partial area includes at least one of the 3D initial frames.
  • the second sub-module is mainly used for semantically segmenting the points in the partial area according to the feature information of the points in the partial area to obtain second semantic information of the points in the partial area.
  • the third sub-module is mainly used to determine the 3D detection frame of the target object in the scene according to the first semantic information and the second semantic information of the points in the partial area.
  • the third submodule in the present disclosure may include: a fourth submodule and a fifth submodule.
  • the fourth sub-module is mainly used to correct the 3D initial frame according to the first semantic information and the second semantic information of the points in the partial area to obtain the corrected 3D initial frame.
  • the fifth sub-module is mainly used to determine the 3D detection frame of the target object in the scene according to the corrected 3D initial frame.
  • the third submodule in the present disclosure may be further used to determine the confidence of the target object corresponding to the 3D initial frame according to the first semantic information and the second semantic information of the points in the partial area, and according to the 3D The initial frame and its confidence determine the 3D detection frame of the target object in the scene.
  • the third submodule in the present disclosure may include: a fourth submodule, a sixth submodule, and a seventh submodule.
  • the fourth sub-module is mainly used to correct the 3D initial frame according to the first semantic information and the second semantic information of the points in the partial area to obtain the corrected 3D initial frame.
  • the sixth sub-module is mainly used to determine the confidence of the target object corresponding to the corrected 3D initial frame according to the first semantic information and the second semantic information of the points in the partial area.
  • the seventh sub-module is mainly used to determine the 3D detection frame of the target object in the scene according to the corrected 3D initial frame and its confidence.
  • The partial area in the present disclosure includes: the 3D expansion frame obtained by expanding the edges of the 3D initial frame according to a predetermined strategy.
  • The 3D expansion frame may be formed by expanding the 3D initial frame in 3D space according to a preset X-axis direction increment, Y-axis direction increment, and/or Z-axis direction increment.
  • the second submodule in the present disclosure may include: an eighth submodule and a ninth submodule.
  • the eighth sub-module is mainly used to perform coordinate transformation on the coordinate information of the points located in the 3D extension box in the point cloud data according to the preset target position of the 3D extension box to obtain the feature information of the point after the coordinate transformation.
  • the ninth sub-module is mainly used to perform semantic segmentation based on the 3D extension box according to the feature information of the coordinate-transformed point, to obtain the second semantic feature of the point in the 3D extension box.
  • The ninth sub-module may perform semantic segmentation based on the 3D expansion frame according to the mask of the foreground points and the feature information of the coordinate-transformed points, to obtain the second semantic features of the points.
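  • A minimal sketch of the box expansion and of transforming points into the expanded frame's canonical coordinate system is given below, assuming NumPy, a (cx, cy, cz, l, w, h, theta) box format with yaw about the vertical y-axis, and illustrative increment values.

```python
# Sketch of expanding a 3D initial frame by per-axis increments and moving
# points into the expanded frame's canonical coordinate system. Box format
# (cx, cy, cz, l, w, h, theta) with yaw about the vertical y-axis and the
# 0.5 increments are illustrative assumptions.
import numpy as np

def expand_box(box, dx=0.5, dy=0.5, dz=0.5):
    cx, cy, cz, l, w, h, theta = box
    # Enlarge the box dimensions by the preset increments along each axis.
    return np.array([cx, cy, cz, l + 2 * dx, w + 2 * dy, h + 2 * dz, theta])

def to_canonical(points, box):
    cx, cy, cz, _, _, _, theta = box
    shifted = points - np.array([cx, cy, cz])      # move box center to origin
    ct, st = np.cos(theta), np.sin(theta)
    rot = np.array([[ct, 0.0, -st],                # rotate by -theta about y
                    [0.0, 1.0, 0.0],
                    [st, 0.0, ct]])
    return shifted @ rot.T
```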
  • The detection frame determination module 840 in the present disclosure may first determine the degree of overlap between the 3D initial frames corresponding to the multiple foreground points, filter the 3D initial frames according to whether the overlap is greater than the set threshold, and then determine the 3D detection frame of the target object in the scene according to the filtered 3D initial frames.
  • The feature extraction module 800, the first semantic segmentation module 810, the foreground point prediction module 820, and the initial frame generation module 830 in the present disclosure may be implemented by a first-stage neural network.
  • the device of the present disclosure may further include a first training module.
  • the first training module is used to train the first-stage neural network to be trained using point cloud data samples with 3D annotation frames.
  • the process of the first training module training the first stage neural network includes:
  • The first training module provides the point cloud data samples to the first-stage neural network; the first-stage neural network extracts the feature information of the point cloud data samples, performs semantic segmentation on the point cloud data samples according to the extracted feature information, predicts at least one foreground point corresponding to the target object among the multiple points according to the first semantic features of the multiple points obtained by the semantic segmentation, and generates the 3D initial frame corresponding to each of the at least one foreground point according to the first semantic information.
  • The first training module obtains the loss corresponding to the foreground points and the loss formed by the 3D initial frames relative to the corresponding 3D annotation frames, and adjusts the network parameters in the first-stage neural network according to these losses.
  • The first training module may determine the first loss, corresponding to the foreground point prediction result, according to the confidence of the foreground points predicted by the first-stage neural network.
  • The first training module generates a second loss according to the numbers of the buckets in which the parameters of the 3D initial frame generated for a foreground point are located and the numbers of the buckets given by the 3D annotation frame information in the point cloud data sample.
  • The first training module generates a third loss according to the offsets, within the corresponding buckets, of the parameters of the 3D initial frame generated for the foreground point and the offsets, within the corresponding buckets, of the parameters in the 3D annotation frame information of the point cloud data sample. The first training module generates a fourth loss according to the offsets of the parameters of the 3D initial frame generated for the foreground point from predetermined parameters.
  • The first training module generates a fifth loss according to the offset of the coordinate parameters of the foreground point relative to the coordinate parameters of the 3D initial frame generated for that foreground point.
  • the first training module adjusts the network parameters of the first-stage neural network according to the first loss, the second loss, the third loss, the fourth loss, and the fifth loss.
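  • The following is a minimal sketch of combining the five stage-one losses named above into one training objective; the equal weighting is an assumption, and the individual loss terms are placeholders for whatever concrete losses are used.

```python
# Sketch of assembling the five stage-1 losses into a single objective.
# The loss tensors and the equal weights are illustrative assumptions.
def stage1_total_loss(first, second, third, fourth, fifth, weights=(1.0,) * 5):
    # first:  foreground-point classification (confidence) loss
    # second: bucket-number classification loss for box parameters
    # third:  intra-bucket offset regression loss
    # fourth: regression loss against predetermined parameters
    # fifth:  offset of the foreground point relative to the box coordinates
    losses = (first, second, third, fourth, fifth)
    return sum(w * l for w, l in zip(weights, losses))
```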
  • the first submodule, the second submodule, and the third submodule in the present disclosure are implemented by a second-stage neural network.
  • the device of the present disclosure further includes a second training module, and the second training module is used to train the second-stage neural network to be trained using point cloud data samples with 3D annotation frames.
  • the process of the second training module training the second-stage neural network includes:
  • The second training module provides the 3D initial frames obtained by the first-stage neural network to the second-stage neural network. The second-stage neural network obtains the feature information of the points in the partial region of the point cloud data sample, performs semantic segmentation on the points in the partial area according to that feature information to obtain the second semantic features of the points in the partial area, determines, according to the first semantic features and the second semantic features of the points in the partial area, the confidence that the 3D initial frame is a target object, and generates a position-corrected 3D initial frame based on the first and second semantic features of the points in the partial area.
  • The second training module obtains the loss corresponding to the confidence that the 3D initial frame is a target object and the loss formed by the position-corrected 3D initial frame relative to the corresponding 3D annotation frame, and adjusts the network parameters in the second-stage neural network according to the obtained losses.
  • The second training module may determine the sixth loss, corresponding to the prediction result, according to the confidence, predicted by the second-stage neural network, that the 3D initial frame is a target object.
  • For the position-corrected 3D initial frames generated by the second-stage neural network whose overlap with the corresponding 3D annotation frame exceeds the set threshold, the second training module generates a seventh loss according to the numbers of the buckets in which the parameters of those frames are located and the numbers of the buckets given by the 3D annotation frame information in the point cloud data sample.
  • For the same position-corrected 3D initial frames, the second training module generates an eighth loss according to the offsets of their parameters within the corresponding buckets and the offsets of the parameters in the 3D annotation frame information of the point cloud data sample within the corresponding buckets.
  • The second training module further generates a loss according to the offsets of the parameters of the position-corrected 3D initial frames relative to the parameters of the corresponding 3D annotation frames.
  • FIG. 9 is a schematic structural diagram of an embodiment of a vehicle intelligent control device of the present disclosure.
  • The device of this embodiment includes: a target object 3D detection device 900 and a first control module 910.
  • the target object 3D detection device 900 is used to obtain a 3D detection frame of the target object based on the point cloud data.
  • the specific structure and specific operations of the target object 3D detection device 900 are as described in the above device and method embodiments, and will not be described in detail here.
  • the first control module 910 is mainly used to generate an instruction or early warning information for controlling the vehicle according to the 3D detection frame. For details, reference may be made to the relevant description in the above method embodiment, and no more detailed description is provided here.
  • FIG. 10 is a schematic structural diagram of an embodiment of an obstacle avoidance navigation device of the present disclosure.
  • the device of this embodiment includes: a target object 3D detection device 1000 and a second control module 1010.
  • the target object 3D detection device 1000 is used to obtain a 3D detection frame of the target object based on the point cloud data.
  • the specific structure and specific operations of the target object 3D detection device 1000 are as described in the above-mentioned device and method embodiments, and will not be described in detail here.
  • the second control module 1010 is mainly used to generate instructions or warning prompt information for performing obstacle avoidance navigation control on the robot according to the 3D detection frame. For details, reference may be made to the relevant description in the above method embodiment, and no more detailed description is provided here.
  • FIG. 11 shows an exemplary device 1100 suitable for implementing the present disclosure.
  • The device 1100 may be a control system/electronic system configured in a car, a mobile terminal (e.g., a smart mobile phone), a personal computer (PC, e.g., a desktop or notebook computer), a tablet computer, a server, or the like.
  • The device 1100 includes one or more processors, a communication part, and the like. The one or more processors may be one or more central processing units (CPUs) 1101 and/or one or more graphics processing units (GPUs) 1113, and the processors can perform appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 1102 or loaded from the storage section 1108 into a random access memory (RAM) 1103.
  • the communication part 1112 may include but is not limited to a network card, and the network card may include but not limited to an IB (Infiniband) network card.
  • the processor can communicate with the read-only memory 1102 and/or the random access memory 1103 to execute executable instructions, connect to the communication section 1112 through the bus 1104, and communicate with other target devices via the communication section 1112, thereby completing the corresponding steps in the present disclosure .
  • ROM1102 is an optional module.
  • the RAM 1103 stores executable instructions, or writes executable instructions to the ROM 1102 at runtime.
  • the executable instructions cause the central processing unit 1101 to perform the steps included in the target object 3D detection method.
  • An input/output (I/O) interface 1105 is also connected to the bus 1104.
  • The communication part 1112 may be provided as an integrated unit, or may be provided as multiple sub-modules (for example, multiple IB network cards) that are respectively connected to the bus.
  • The following components are connected to the I/O interface 1105: an input section 1106 including a keyboard, a mouse, and the like; an output section 1107 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 1108 including a hard disk and the like; and a communication section 1109 including a network interface card such as a LAN card or a modem.
  • the communication section 1109 performs communication processing via a network such as the Internet.
  • the driver 1110 is also connected to the I/O interface 1105 as necessary.
  • a removable medium 1111 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed on the drive 1110 as necessary, so that the computer program read out therefrom is installed in the storage portion 1108 as needed.
  • FIG. 11 shows only an optional implementation.
  • The number and types of the components in FIG. 11 may be selected, deleted, added, or replaced according to actual needs.
  • The GPU 1113 and the CPU 1101 may be provided separately, or the GPU 1113 may be integrated on the CPU 1101; similarly, the communication part 1112 may be provided separately, or may be integrated on the CPU 1101 or the GPU 1113, and so on.
  • The embodiments of the present disclosure include a computer program product that includes a computer program tangibly contained on a machine-readable medium. The computer program includes program code for performing the steps shown in the flowcharts, and the program code may include instructions corresponding to the steps in the methods provided by the present disclosure.
  • the computer program may be downloaded and installed from the network through the communication section 1109, and/or installed from the removable medium 1111.
  • The embodiments of the present disclosure also provide a computer program product for storing computer-readable instructions that, when executed, cause a computer to perform the target object 3D detection method described in any of the above embodiments.
  • the computer program product may be implemented in hardware, software, or a combination thereof.
  • the computer program product is embodied as a computer storage medium.
  • In another optional example, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK).
  • The embodiments of the present disclosure also provide another target object 3D detection method and its corresponding device, electronic device, computer storage medium, computer program, and computer program product. The method includes: a first device sends a target object 3D detection instruction to a second device, the instruction causing the second device to perform the target object 3D detection method in any of the above possible embodiments; and the first device receives the target object 3D detection result returned by the second device.
  • The target object 3D detection instruction may specifically be a call instruction. The first device may instruct, by means of the call, the second device to perform the target object 3D detection operation; accordingly, in response to receiving the call instruction, the second device may perform the steps and/or processes in any embodiment of the above target object 3D detection method.
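  • Purely as an illustration of this first-device/second-device interaction, the sketch below uses a plain HTTP request as a stand-in for the call instruction; the endpoint, payload format, and field names are assumptions, not an interface defined by the disclosure.

```python
# Illustrative stand-in for the call instruction between the first device
# (caller) and the second device (which runs the 3D detection).
import json
import urllib.request

def request_3d_detection(second_device_url, point_cloud):
    payload = json.dumps({"command": "detect_3d", "points": point_cloud}).encode()
    req = urllib.request.Request(
        second_device_url, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:      # second device runs detection
        return json.loads(resp.read())             # first device receives 3D boxes
```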
  • The terms "first" and "second" in the embodiments of the present disclosure are only for distinction and should not be construed as limiting the embodiments of the present disclosure.
  • "Plurality" may refer to two or more, and "at least one" may refer to one, two, or more than two.
  • any component, data, or structure mentioned in the present disclosure can be generally understood as one or more, unless it is explicitly defined or given the opposite enlightenment in the context.
  • description of the embodiments of the present disclosure emphasizes the differences between the embodiments, and the same or similarities can be referred to each other, and for the sake of brevity, they will not be described one by one.
  • the method and apparatus, electronic device, and computer-readable storage medium of the present disclosure may be implemented in many ways.
  • the method and apparatus, electronic device, and computer-readable storage medium of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware.
  • the above order of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated.
  • the present disclosure may also be implemented as programs recorded in a recording medium, and these programs include machine-readable instructions for implementing the method according to the present disclosure.
  • the present disclosure also covers the recording medium storing the program for executing the method according to the present disclosure.


Abstract

Disclosed are a 3D detection method and apparatus for a target object, and an electronic device, a computer-readable storage medium and a computer program. The 3D detection method for a target object comprises: extracting feature information of point cloud data of an acquired scene; carrying out semantic segmentation on the point cloud data according to the feature information of the point cloud data to obtain first semantic information of multiple points in the point cloud data; predicting at least one foreground point, corresponding to a target object, in the multiple points according to the first semantic information; generating a 3D initial frame respectively corresponding to the at least one foreground point according to the first semantic information; and determining a 3D detection frame for the target object in the scene according to the 3D initial frame.

Description

Target object 3D detection method, device, medium and equipment
The present disclosure claims priority to the Chinese patent application filed with the Chinese Patent Office on November 29, 2018, with application number 201811446588.8 and entitled "Target object 3D detection method, device, medium and equipment", the entire contents of which are incorporated into the present disclosure by reference.
Technical field
The present disclosure relates to computer vision technology, and in particular, to a target object 3D detection method and device, a vehicle intelligent control method and device, an obstacle avoidance navigation method and device, an electronic device, a computer-readable storage medium, and a computer program.
Background technique
3D detection can be applied in various technologies such as intelligent driving and obstacle avoidance navigation. In intelligent driving technology, through 3D detection, information such as the specific position, shape, size, and moving direction of target objects such as the surrounding vehicles and pedestrians of an intelligent driving vehicle can be obtained, which can help the intelligent driving vehicle make intelligent driving decisions.
Summary of the invention
Embodiments of the present disclosure provide technical solutions for target object 3D detection, vehicle intelligent control, and obstacle avoidance navigation.
According to one aspect of the embodiments of the present disclosure, a target object 3D detection method is provided, which includes: extracting feature information of point cloud data of an acquired scene; performing semantic segmentation on the point cloud data according to the feature information of the point cloud data to obtain first semantic information of multiple points in the point cloud data; predicting, according to the first semantic information, at least one foreground point corresponding to a target object among the multiple points; generating, according to the first semantic information, a 3D initial frame corresponding to each of the at least one foreground point; and determining a 3D detection frame of the target object in the scene according to the 3D initial frame.
According to still another aspect of the embodiments of the present disclosure, a vehicle intelligent control method is provided, including: obtaining a 3D detection frame of a target object by using the above target object 3D detection method; and generating, according to the 3D detection frame, an instruction or early warning prompt information for controlling a vehicle.
According to still another aspect of the embodiments of the present disclosure, an obstacle avoidance navigation method is provided, including: obtaining a 3D detection frame of a target object by using the above target object 3D detection method; and generating, according to the 3D detection frame, an instruction or early warning prompt information for performing obstacle avoidance navigation control on a robot.
According to still another aspect of the embodiments of the present disclosure, a target object 3D detection device is provided, including: a feature extraction module, configured to extract feature information of point cloud data of an acquired scene; a first semantic segmentation module, configured to perform semantic segmentation on the point cloud data according to the feature information of the point cloud data to obtain first semantic information of multiple points in the point cloud data; a foreground point prediction module, configured to predict, according to the first semantic information, at least one foreground point corresponding to a target object among the multiple points; an initial frame generation module, configured to generate, according to the first semantic information, a 3D initial frame corresponding to each of the at least one foreground point; and a detection frame determination module, configured to determine a 3D detection frame of the target object in the scene according to the 3D initial frame.
According to still another aspect of the embodiments of the present disclosure, a vehicle intelligent control device is provided, including: the above target object 3D detection device, configured to obtain a 3D detection frame of a target object; and a first control module, configured to generate, according to the 3D detection frame, an instruction or early warning prompt information for controlling a vehicle.
According to still another aspect of the embodiments of the present disclosure, an obstacle avoidance navigation device is provided, including: the above target object 3D detection device, configured to obtain a 3D detection frame of a target object; and a second control module, configured to generate, according to the 3D detection frame, an instruction or early warning prompt information for performing obstacle avoidance navigation control on a robot.
According to still another aspect of the embodiments of the present disclosure, an electronic device is provided, including: a memory for storing a computer program; and a processor for executing the computer program stored in the memory, where any method embodiment of the present disclosure is implemented when the computer program is executed.
According to still another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, where any method embodiment of the present disclosure is implemented when the computer program is executed by a processor.
According to still another aspect of the embodiments of the present disclosure, a computer program is provided, including computer instructions, where any method embodiment of the present disclosure is implemented when the computer instructions are run in a processor of a device.
Based on the target object 3D detection method and device, the vehicle intelligent control method and device, the obstacle avoidance navigation method and device, the electronic device, the computer-readable storage medium, and the computer program provided in the present disclosure, the present disclosure performs feature extraction on point cloud data and performs semantic segmentation on the point cloud data based on the extracted feature information, which is equivalent to bottom-level data analysis; the present disclosure then generates and determines the 3D detection frame of the target object based on the semantic segmentation results, which is equivalent to upper-level data analysis. Therefore, in the 3D detection process of the target object, the present disclosure forms a bottom-up way of generating the 3D detection frame. In this way, the loss of original information in the point cloud data caused by first projecting the point cloud data and then performing 3D detection frame detection on the projected image can be avoided; the phenomenon that 3D detection frame detection is affected because a target object (such as a vehicle or an obstacle) in a 2D image captured by a camera device is occluded can also be avoided. As can be seen from the above description, the technical solution provided by the present disclosure is beneficial to improving the detection performance of the 3D detection frame.
The technical solutions of the present disclosure will be further described in detail below with reference to the accompanying drawings and embodiments.
Brief description of the drawings
The drawings, which form a part of the specification, describe the embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
With reference to the drawings, the present disclosure can be understood more clearly from the following detailed description, in which:
FIG. 1 is a flowchart of an embodiment of a target object 3D detection method of the present disclosure;
FIG. 2 is a flowchart of another embodiment of the target object 3D detection method of the present disclosure;
FIG. 3 is a schematic structural diagram of a first-stage neural network of the present disclosure;
FIG. 4 is another schematic structural diagram of the first-stage neural network of the present disclosure;
FIG. 5 is a schematic structural diagram of a second-stage neural network of the present disclosure;
FIG. 6 is a flowchart of an embodiment of a vehicle intelligent control method of the present disclosure;
FIG. 7 is a flowchart of an embodiment of an obstacle avoidance navigation method of the present disclosure;
FIG. 8 is a schematic structural diagram of an embodiment of a target object 3D detection device of the present disclosure;
FIG. 9 is a schematic structural diagram of an embodiment of a vehicle intelligent control device of the present disclosure;
FIG. 10 is a schematic structural diagram of an embodiment of an obstacle avoidance navigation device of the present disclosure;
FIG. 11 is a block diagram of an exemplary device for implementing an embodiment of the present disclosure.
Specific examples
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
At the same time, it should be understood that, for ease of description, the dimensions of the various parts shown in the drawings are not drawn according to actual proportional relationships. The following description of at least one exemplary embodiment is merely illustrative and in no way serves as any limitation on the present disclosure or its application or use. Techniques, methods, and devices known to those of ordinary skill in the related art may not be discussed in detail, but, where appropriate, such techniques, methods, and devices should be considered part of the specification.
It should be noted that similar reference numerals and letters indicate similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further discussed in subsequent drawings. The embodiments of the present disclosure can be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate together with many other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, small computer systems, mainframe computer systems, and distributed cloud computing technology environments including any of the above.
Electronic devices such as terminal devices, computer systems, and servers may be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, target programs, components, logic, data structures, and so on, which perform specific tasks or implement specific abstract data types. The computer system/server may be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.
Exemplary embodiment
FIG. 1 is a flowchart of an embodiment of the target object 3D detection method of the present disclosure.
S100. Extract feature information of the point cloud data of the acquired scene.
In an optional example, the scene in the present disclosure may refer to a vision-based presentation. For example, the visual picture presented by an image captured by a camera device, or by point cloud data obtained through lidar scanning, can be regarded as a scene.
In an optional example, the point cloud data in the present disclosure generally refers to scanning information recorded in the form of points, for example, point cloud data obtained through lidar scanning. Each point in the point cloud data can be described by a variety of information; that is, each point in the point cloud data usually includes a variety of information, which may include but is not limited to one or more of the following: the three-dimensional coordinates of the point, color information (such as RGB information), and reflection intensity information. In other words, a point in the point cloud data can be described by one or more types of information such as three-dimensional coordinates, color information, and reflection intensity information.
In an optional example, the present disclosure may use at least one convolutional layer in a neural network to process the point cloud data, thereby forming feature information (a feature map) of the point cloud data, for example, forming one piece of feature information for each point in the point cloud data. Since the feature information formed here is formed for each point while considering all points in the entire spatial range of the point cloud data, the feature information formed here can be called global feature information.
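As a minimal sketch of producing such per-point global feature information with shared point-wise layers (assuming PyTorch; the PointNet-style architecture, channel sizes, and input format are illustrative assumptions rather than the network actually used by the present disclosure):

```python
# Sketch of extracting one feature vector per point from raw point cloud data
# with shared point-wise 1x1 convolutions. Channel sizes are assumptions.
import torch
import torch.nn as nn

class PointFeatureExtractor(nn.Module):
    def __init__(self, in_channels=4, out_channels=256):
        super().__init__()
        # 1x1 convolutions act as shared per-point layers over (B, C, N) input.
        self.layers = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=1), nn.ReLU(),
            nn.Conv1d(128, out_channels, kernel_size=1),
        )

    def forward(self, points):
        # points: (batch, num_points, in_channels), e.g. x, y, z, intensity
        features = self.layers(points.transpose(1, 2))  # (batch, C, num_points)
        return features.transpose(1, 2)                 # one feature per point

extractor = PointFeatureExtractor()
cloud = torch.randn(2, 1024, 4)          # two clouds of 1024 points each
print(extractor(cloud).shape)            # torch.Size([2, 1024, 256])
```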
S110. Perform semantic segmentation on the point cloud data according to the feature information of the point cloud data to obtain first semantic information of multiple points in the point cloud data.
In an optional example, the present disclosure may use a neural network to perform semantic segmentation on the point cloud data; the neural network may form one piece of first semantic information for some points in the point cloud data, or even for each point in the point cloud data. For example, after the point cloud data is provided to the neural network and the feature information of the point cloud data is extracted by the neural network, the neural network continues to process the feature information of the point cloud data to obtain the first semantic information of multiple points in the point cloud data.
In an optional example, the first semantic information of a point in the present disclosure generally refers to the semantic feature generated for the point while considering the entire point cloud data; therefore, the first semantic information can also be called the first semantic feature or the global semantic feature. The global semantic feature of a point in the present disclosure can generally be expressed in the form of a one-dimensional vector including multiple (e.g., 256) elements. The global semantic feature in the present disclosure may also be referred to as a global semantic feature vector.
In an optional example, the foreground points and background points in the present disclosure are defined with respect to a target object. Optionally, a point belonging to a target object is a foreground point of that target object, and a point not belonging to that target object is a background point of that target object. In the case where multiple target objects are included in the scene, for one of the target objects, a point belonging to that target object is a foreground point of that target object; however, since the point does not belong to the other target objects, it is a background point of the other target objects.
In an optional example, in the case where the points in the point cloud data include foreground points of a target object and background points of the target object, the first semantic information of the multiple points obtained by the present disclosure usually includes: the global semantic features of the foreground points of the target object and the global semantic features of the background points of the target object. The scene in the present disclosure may include one or more target objects. The target objects in the present disclosure include but are not limited to: vehicles, non-motor vehicles, pedestrians, and/or obstacles.
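As an illustration of this foreground/background definition, the following sketch labels points with respect to one object's 3D annotation frame; it assumes NumPy and, purely for brevity, an axis-aligned box with no yaw:

```python
# Sketch of labeling points as foreground or background for one target
# object's box. Axis-aligned box (no yaw) is used only for brevity.
import numpy as np

def foreground_mask(points_xyz, box_center, box_size):
    """points_xyz: (N, 3); box_center/box_size: (3,). True = foreground."""
    half = np.asarray(box_size) / 2.0
    offsets = np.abs(points_xyz - np.asarray(box_center))
    return np.all(offsets <= half, axis=1)

points = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
print(foreground_mask(points, box_center=(0, 0, 0), box_size=(4, 2, 2)))
# -> [ True False ]: the first point is a foreground point of this object.
```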
S120. Predict at least one foreground point corresponding to the target object among the multiple points according to the first semantic information.
In an optional example, the present disclosure may use a neural network to predict at least one foreground point corresponding to the target object among the multiple points; the neural network may make a prediction for some points in the point cloud data, or even for each point in the point cloud data, to generate the confidence that the point is a foreground point. The confidence of a point can express the probability that the point is a foreground point. For example, after the point cloud data is provided to the neural network, the feature information of the point cloud data is extracted by the neural network, and the semantic segmentation processing is performed by the neural network, the neural network continues to process the global semantic features to predict the confidence that each of the multiple points in the point cloud data is a foreground point of the target object; the neural network can generate a confidence for each point. The present disclosure can judge each confidence generated by the neural network and take the points whose confidence exceeds a predetermined value as the foreground points of the target object.
It should be particularly noted that the operation of judging the confidence in the present disclosure may be performed in S120 or in S130. In addition, if the confidence judgment operation is performed in S120 and the judgment result is that there is no point whose confidence exceeds the predetermined value, that is, there is no foreground point, it can be considered that there is no target object in the scene.
S130、根据第一语义信息生成至少一个前景点各自对应的3D初始框。S130. Generate a 3D initial frame corresponding to each of the at least one front sight according to the first semantic information.
在一个可选示例中,在S120未包括有对置信度进行判断的操作的情况下,本公开可以根据S110中获得每一个点的全局语义特征,为每一个点分别生成一个3D初始框。本公开可以通过对S120中获得的所有置信度进行判断,挑选出目标对象的前景点,并利用挑选出的前景点,从S130生成的3D初始框中进行挑选,从而可以获得各前景点各自对应的3D初始框。即S130生成的各3D初始框通常包括:前景点对应的3D初始框和背景点对应的3D初始框,从而S130需要从生成的所有3D初始框中,筛选出各前景点对应的3D初始框。In an optional example, in the case where S120 does not include an operation to determine the confidence, the present disclosure may obtain a global semantic feature of each point in S110, and generate a 3D initial frame for each point. In the present disclosure, all the confidences obtained in S120 can be judged to select the front attractions of the target object, and the selected front attractions can be used to select from the 3D initial frame generated by S130, so that each front attraction can be corresponding to each other 3D initial box. That is, each 3D initial frame generated by S130 usually includes: a 3D initial frame corresponding to the front sight and a 3D initial frame corresponding to the background point, so S130 needs to filter out the 3D initial frames corresponding to each front sight from all the generated 3D initial frames.
在一个可选示例中,在S120包括有对置信度进行判断的操作的情况下,本公开可以根据上述预测出的每一个前景点的全局语义特征,分别生成一个3D初始框,从而获得的各3D初始框均为前景点对应的3D初始框。即S130生成的各3D初始框均为前景点对应的3D初始框,也就是说,S130可以仅针对前景点生成3D初始框。In an optional example, in the case where S120 includes an operation to judge the confidence, the present disclosure may generate a 3D initial frame respectively according to the global semantic features of each of the predicted spots predicted above, thereby obtaining each The 3D initial frames are the 3D initial frames corresponding to the front sight. That is, each 3D initial frame generated by S130 is a 3D initial frame corresponding to the front sight, that is to say, S130 may generate a 3D initial frame only for the front sight.
In an optional example, the 3D initial frame in the present disclosure may be described by the center point position information of the 3D initial frame, the length, width, and height information of the 3D initial frame, and the direction information of the 3D initial frame. That is, the 3D initial frame in the present disclosure may include: the center point position information of the 3D initial frame, the length, width, and height information of the 3D initial frame, the direction information of the 3D initial frame, and the like. The 3D initial frame may also be referred to as 3D initial frame information.
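For illustration only, such a description of a 3D initial frame might be held in a simple structure like the one below; the field names and their ordering are assumptions of this sketch.

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    # Center point position information.
    x: float
    y: float
    z: float
    # Height, width, and length information.
    h: float
    w: float
    l: float
    # Direction information: angle between the frame length and the X axis
    # in the bird's-eye view.
    theta: float
```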
In an optional example, the present disclosure may use a neural network to generate the 3D initial frames. For example, after the point cloud data is provided to the neural network, the neural network extracts the feature information of the point cloud data and performs semantic segmentation processing, and then continues to process the global semantic features so as to generate one 3D initial frame for each of the multiple points. As another example, after the point cloud data is provided to the neural network, the neural network extracts the feature information of the point cloud data, performs semantic segmentation processing, and performs prediction processing on the global semantic features so as to obtain the confidences that multiple points in the point cloud data are foreground points of the target object; the neural network may then continue to process the global semantic features of the points whose confidences exceed the predetermined value, so as to generate one 3D initial frame for each foreground point.

Since the point cloud data has a certain receptive field and the semantic segmentation is performed based on the feature information of all points in the point cloud data, the semantic features formed by the semantic segmentation include not only the semantic features of the point itself but also the semantic features of the surrounding points. Therefore, multiple foreground points in the present disclosure may semantically point to the same target object in the scene. The 3D initial frames corresponding to different foreground points that point to the same target object differ to some extent, but the difference is usually not large.

In addition, if no 3D initial frame corresponding to a foreground point exists among the 3D initial frames generated in S130 according to the first semantic information, it may be considered that no target object exists in the scene.

S140. Determine the 3D detection frame of the target object in the scene according to the 3D initial frames.

The present disclosure finally determines one 3D detection frame for each target object.
In an optional example, the present disclosure may perform redundancy removal on the 3D initial frames corresponding to all of the foreground points obtained above, thereby obtaining the 3D detection frame of the target object, that is, the 3D detection frame finally obtained by performing target object detection on the point cloud data. Optionally, the present disclosure may use the degree of overlap between 3D initial frames to remove redundant 3D initial frames, thereby obtaining the 3D detection frame of the target object. For example, the present disclosure may determine the degree of overlap between the 3D initial frames corresponding to multiple foreground points, filter the 3D initial frames whose degree of overlap is greater than a set threshold, and then determine the 3D detection frame of the target object from the filtered 3D initial frames. Optionally, the present disclosure may use an NMS (Non-Maximum Suppression) algorithm to perform redundancy removal on the 3D initial frames corresponding to all the foreground points, thereby removing redundant 3D frames that cover one another and obtaining the final 3D detection frame. In the case where the scene includes multiple target objects (such as one or more pedestrians, one or more non-motorized vehicles, one or more vehicles, etc.), the present disclosure may obtain one final 3D detection frame for each target object in the scene.
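For illustration only, the redundancy-removal step might be sketched as the greedy NMS below; the bird's-eye-view overlap is approximated with axis-aligned footprints, and the 0.7 overlap threshold is an assumption of this sketch, since the disclosure does not prescribe a particular overlap measure or threshold.

```python
import numpy as np

def bev_iou_axis_aligned(box_a, box_b):
    """Approximate bird's-eye-view IoU of two frames (x, y, z, h, w, l, theta),
    ignoring the rotation angle for simplicity of this sketch."""
    ax1, ax2 = box_a[0] - box_a[5] / 2, box_a[0] + box_a[5] / 2
    ay1, ay2 = box_a[1] - box_a[4] / 2, box_a[1] + box_a[4] / 2
    bx1, bx2 = box_b[0] - box_b[5] / 2, box_b[0] + box_b[5] / 2
    by1, by2 = box_b[1] - box_b[4] / 2, box_b[1] + box_b[4] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[5] * box_a[4] + box_b[5] * box_b[4] - inter
    return inter / union if union > 0 else 0.0

def nms_3d(boxes, scores, overlap_threshold=0.7):
    """Greedy non-maximum suppression over 3D initial frames.

    boxes:  (N, 7) array of (x, y, z, h, w, l, theta).
    scores: (N,) per-frame confidences.
    Returns the indices of the frames kept as final 3D detection frames.
    """
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        remaining = [j for j in order[1:]
                     if bev_iou_axis_aligned(boxes[i], boxes[j]) <= overlap_threshold]
        order = np.array(remaining, dtype=int)
    return keep
```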
In an optional example, the present disclosure may perform correction (also referred to as optimization) processing on the 3D initial frames corresponding to the currently obtained foreground points, and then perform redundancy removal on all the corrected 3D initial frames, thereby obtaining the 3D detection frame of the target object, that is, the 3D detection frame finally obtained by performing target object detection on the point cloud data.

In an optional example, the process of respectively correcting the 3D initial frame corresponding to each foreground point in the present disclosure may include the following step A1, step B1, and step C1:

Step A1: Acquire feature information of points in a partial region of the point cloud data, where the partial region includes at least one 3D initial frame.

Optionally, the present disclosure may set a 3D expansion frame containing the 3D initial frame, and acquire the feature information of each point of the point cloud data located in the 3D expansion frame. The 3D expansion frame in the present disclosure is one implementation of a partial region of the point cloud data. The 3D initial frame corresponding to each foreground point corresponds to one 3D expansion frame, and the spatial range occupied by the 3D expansion frame usually completely covers, and is slightly larger than, the spatial range occupied by the 3D initial frame. Usually, no face of the 3D initial frame lies in the same plane as any face of its corresponding 3D expansion frame, the center point of the 3D initial frame coincides with the center point of the 3D expansion frame, and every face of the 3D initial frame is parallel to the corresponding face of its 3D expansion frame. Since the positional relationship between such a 3D expansion frame and the 3D initial frame is relatively regular, it helps to reduce the difficulty of forming the 3D expansion frame, and thus helps to reduce the implementation difficulty of the present disclosure. Of course, the present disclosure does not exclude the case where the two center points do not coincide but every face of the 3D initial frame is still parallel to the corresponding face of its 3D expansion frame.

Optionally, the present disclosure may expand the 3D initial frame corresponding to a foreground point in 3D space according to at least one of a preset X-axis direction increment (e.g., 20 cm), a Y-axis direction increment (e.g., 20 cm), and a Z-axis direction increment (e.g., 20 cm), thereby forming a 3D expansion frame that contains the 3D initial frame, whose center point coincides with that of the 3D initial frame, and whose faces are parallel to the corresponding faces of the 3D initial frame.

Optionally, the increments in the present disclosure may be set according to actual requirements; for example, the increment in a given direction does not exceed one N-th (e.g., N greater than 4) of the corresponding side length of the 3D initial frame. Optionally, the X-axis direction increment does not exceed one tenth of the length of the 3D initial frame, the Y-axis direction increment does not exceed one tenth of the width of the 3D initial frame, and the Z-axis direction increment does not exceed one tenth of the height of the 3D initial frame. In addition, the X-axis direction increment, the Y-axis direction increment, and the Z-axis direction increment may be the same or different.
Optionally, assume that the i-th 3D initial frame b_i can be expressed as: b_i = (x_i, y_i, z_i, h_i, w_i, l_i, θ_i), where x_i, y_i, and z_i respectively represent the coordinates of the center point of the i-th 3D initial frame, h_i, w_i, and l_i respectively represent the height, width, and length of the i-th 3D initial frame, and θ_i represents the direction of the i-th 3D initial frame; for example, in the bird's-eye view, the angle between the length of the i-th 3D initial frame and the X coordinate axis is θ_i. Then the 3D expansion frame b_i^e corresponding to the i-th 3D initial frame can be expressed as:

b_i^e = (x_i, y_i, z_i, h_i + η, w_i + η, l_i + η, θ_i)

where η represents the increment.
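As a sketch of the expansion just described, assuming the increment η is applied to each of the three sizes (with 0.2 m standing in for the 20 cm example above):

```python
def expand_box(box, eta=0.2):
    """Form the 3D expansion frame b_i^e from a 3D initial frame
    b_i = (x, y, z, h, w, l, theta) by enlarging each size by the
    increment eta, keeping the center point and direction unchanged."""
    x, y, z, h, w, l, theta = box
    return (x, y, z, h + eta, w + eta, l + eta, theta)
```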
Optionally, the present disclosure may use a neural network to acquire the feature information of the points in the partial region of the point cloud data. For example, all points of the partial region of the point cloud data are provided as input to the neural network, and at least one convolutional layer in the neural network processes the point cloud data in the partial region, so that feature information is formed for each point in the partial region. Since the feature information formed here is formed for each point in the partial region while considering all the points in that partial region of the point cloud data, the feature information formed here may be referred to as local feature information.

Step B1: Perform semantic segmentation on the points in the partial region according to the feature information of the points in the partial region, to obtain second semantic information of the points in the partial region.

Optionally, the second semantic information of a point in the present disclosure refers to the semantic feature vector formed for the point while considering all the points in the spatial range formed by the 3D expansion frame. The second semantic information in the present disclosure may be referred to as a second semantic feature or a local spatial semantic feature. A local spatial semantic feature may likewise take the form of a one-dimensional vector array including multiple (e.g., 256) elements.

The present disclosure may use a neural network to acquire the local spatial semantic features of all points in the 3D expansion frame, and the manner of using the neural network to acquire the local spatial semantic features of the points may include the following step a and step b:

a. First, according to a preset target position of the 3D expansion frame, coordinate transformation is performed on the coordinate information of the point cloud data located in the 3D expansion frame, so that the coordinates of the points located in the 3D expansion frame are displaced, and the 3D expansion frame is thereby translated and rotated (the direction of the 3D expansion frame is adjusted) and transformed to the preset target position of the 3D expansion frame. Optionally, the preset target position of the 3D expansion frame may include: the center point of the 3D expansion frame (that is, the center point of the 3D initial frame) is located at the coordinate origin, and the length of the 3D expansion frame is parallel to the X axis. Optionally, the above coordinate origin and X axis may be the coordinate origin and X axis of the coordinate system of the point cloud data, and of course may also be the coordinate origin and X axis of another coordinate system.
Continuing the previous example, assume that the i-th 3D initial frame b_i can be expressed as: b_i = (x_i, y_i, z_i, h_i, w_i, l_i, θ_i), where x_i, y_i, and z_i respectively represent the coordinates of the center point of the i-th 3D initial frame, h_i, w_i, and l_i respectively represent the height, width, and length of the i-th 3D initial frame, and θ_i represents the direction of the i-th 3D initial frame; for example, in the bird's-eye view, the angle between the length of the i-th 3D initial frame and the X coordinate axis is θ_i. Then, after the coordinate transformation is performed on the 3D expansion frame containing the i-th 3D initial frame, the present disclosure obtains a new 3D initial frame b̃_i, which can be expressed as:

b̃_i = (0, 0, 0, h_i, w_i, l_i, 0)

That is, the center point of the new 3D initial frame b̃_i is located at the coordinate origin, and in the bird's-eye view the angle between the length of the new 3D initial frame b̃_i and the X coordinate axis is 0.
The above coordinate transformation manner of the present disclosure may be referred to as a canonical (regularized) coordinate transformation. Performing the coordinate transformation on a point usually only changes the coordinate information of the point and does not change the other information of the point. By performing the canonical coordinate transformation, the present disclosure makes the coordinates of the points in different 3D initial frames concentrate within a roughly common range, which is beneficial to the training of the neural network, that is, beneficial to improving the accuracy of the local spatial semantic features formed by the neural network, and thus beneficial to improving the accuracy of correcting the 3D initial frame. It can be understood that the above coordinate transformation is only an optional example, and those skilled in the art may also adopt other transformations that map the coordinates into a certain range.
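A minimal sketch of the canonical coordinate transformation, assuming the vertical axis of the bird's-eye view is the Z axis (the axis convention and the function name are assumptions of this sketch):

```python
import numpy as np

def canonical_transform(points, box):
    """Transform point coordinates into the canonical frame of a 3D initial
    frame b_i = (x, y, z, h, w, l, theta): translate so that the frame center
    is at the origin, then rotate about the vertical axis by -theta so that
    the frame length is parallel to the X axis.

    points: (M, 3) coordinates of the points located in the 3D expansion frame.
    Only the coordinates are changed; other per-point information is left
    untouched by this step.
    """
    x, y, z, h, w, l, theta = box
    shifted = points - np.array([x, y, z])
    cos_t, sin_t = np.cos(-theta), np.sin(-theta)
    rot = np.array([[cos_t, -sin_t, 0.0],
                    [sin_t,  cos_t, 0.0],
                    [0.0,    0.0,   1.0]])
    return shifted @ rot.T
```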
b. The coordinate-transformed point cloud data (that is, the coordinate-transformed points located in the 3D expansion frame) is provided to the neural network, and the neural network performs semantic segmentation processing on the received points, so as to generate a local spatial semantic feature for each point located in the 3D expansion frame.

Optionally, the present disclosure may form a foreground point mask according to the confidences of being a foreground point generated in the above steps (for example, a point whose confidence exceeds a predetermined value such as 0.5 is set to 1, and a point whose confidence does not exceed the predetermined value is set to 0, thereby forming the foreground point mask). The present disclosure may provide the foreground point mask together with the coordinate-transformed point cloud data to the neural network, so that the neural network can refer to the foreground point mask when performing semantic processing, which helps to improve the description accuracy of the local spatial semantic features.

Step C1: Form the corrected 3D initial frame according to the first semantic information and the second semantic information of the points in the partial region.

Optionally, the manner in which the present disclosure acquires the global semantic features of the points in the 3D expansion frame may be as follows. First, according to the coordinate information of each point in the point cloud data, it is judged whether the point belongs to the spatial range of the 3D expansion frame (that is, whether it is located in the 3D expansion frame, which may include being located on any surface of the 3D expansion frame). For a given point, if its position belongs to the spatial range of the 3D expansion frame, the point is taken as a point belonging to the 3D expansion frame; if its position does not belong to the spatial range of the 3D expansion frame, the point is not taken as a point belonging to the 3D expansion frame. Then, according to the global semantic features of multiple points (e.g., all points) in the point cloud data, the global semantic features of all points belonging to the 3D expansion frame are determined. Optionally, when the present disclosure determines that a point belongs to the 3D expansion frame, the global semantic feature of that point can be looked up from the previously obtained global semantic features of the points, and so on, so that the present disclosure can obtain the global semantic features of all points belonging to the 3D expansion frame.
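For illustration only, the membership test described above might look like the following, assuming the frame length lies along the local X axis, the width along Y, and the height along Z:

```python
import numpy as np

def points_in_expansion_box(points, box):
    """Return a boolean mask over the points that fall inside (or on a
    surface of) a 3D expansion frame b^e = (x, y, z, h, w, l, theta).

    points: (M, 3) coordinates. The test rotates the points into the
    frame-aligned coordinate system and compares against the half-sizes.
    """
    x, y, z, h, w, l, theta = box
    shifted = points - np.array([x, y, z])
    cos_t, sin_t = np.cos(-theta), np.sin(-theta)
    rot = np.array([[cos_t, -sin_t, 0.0],
                    [sin_t,  cos_t, 0.0],
                    [0.0,    0.0,   1.0]])
    local = shifted @ rot.T
    return (np.abs(local[:, 0]) <= l / 2) & \
           (np.abs(local[:, 1]) <= w / 2) & \
           (np.abs(local[:, 2]) <= h / 2)
```

The global semantic features of the points judged to belong to the 3D expansion frame can then be gathered by indexing the previously obtained per-point global semantic features with this mask.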
Optionally, the present disclosure may have the neural network process the global semantic features and the local spatial semantic features of the points, and obtain the corrected 3D initial frame according to the processing result of the neural network. For example, the neural network encodes the global semantic features and the local spatial semantic features of the points in the 3D expansion frame to obtain a feature used to describe the 3D initial frame in that 3D expansion frame; the neural network then predicts, according to the feature used to describe the 3D initial frame, the confidence that the 3D initial frame is the target object, and adjusts the 3D initial frame according to the feature used to describe the 3D initial frame, thereby obtaining the corrected 3D initial frame. Correcting the 3D initial frame is beneficial to the accuracy of the 3D initial frame, and thus beneficial to improving the accuracy of the 3D detection frame.

Optionally, the present disclosure may concatenate the global semantic feature and the local spatial semantic feature of each point in the 3D expansion frame. For example, for any point in the 3D expansion frame, the global semantic feature and the local spatial semantic feature of the point are concatenated to form a concatenated semantic feature, and the concatenated semantic features of the points are provided as input to the neural network, so that the neural network encodes the concatenated semantic features and generates the encoded feature used to describe the 3D initial frame in the 3D expansion frame (hereinafter referred to simply as the encoded feature).
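A minimal sketch of this per-point concatenation; the array shapes and names are assumptions of this sketch:

```python
import numpy as np

def concat_point_features(global_features, local_features):
    """Concatenate, per point, the global semantic feature (from the whole
    point cloud) with the local spatial semantic feature (from the 3D
    expansion frame), forming the input to the encoding network.

    global_features: (M, Cg) features of the points in one expansion frame.
    local_features:  (M, Cl) features of the same points, in the same order.
    Returns an (M, Cg + Cl) array of concatenated semantic features.
    """
    assert global_features.shape[0] == local_features.shape[0]
    return np.concatenate([global_features, local_features], axis=1)
```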
Optionally, after forming the encoded features, the neural network may, for each input encoded feature, predict the confidence that the corresponding 3D initial frame is the target object, forming one confidence for each 3D initial frame. This confidence may represent the probability that the corrected 3D initial frame is the target object. At the same time, the neural network may form one new 3D initial frame (that is, a corrected 3D initial frame) for each input encoded feature. For example, according to each input encoded feature, the neural network respectively forms the center point position information of the new 3D initial frame, the length, width, and height information of the new 3D initial frame, the direction information of the new 3D initial frame, and the like.

The process in which the present disclosure performs redundancy removal on all the corrected 3D initial frames to obtain the 3D detection frame of the target object can be found in the corresponding description above, and is not described in detail here.

As shown in FIG. 2, one embodiment of the target object 3D detection method of the present disclosure includes steps S200 and S210. Each step in FIG. 2 is described in detail below.

S200. Provide the point cloud data to a neural network; perform feature extraction processing on the points in the point cloud data via the neural network; perform semantic segmentation processing on the point cloud data according to the extracted feature information to obtain the semantic features of multiple points; and, according to the semantic features, predict the foreground points among the multiple points and generate a 3D initial frame corresponding to each of at least some of the multiple points.

In an optional example, the neural network in the present disclosure is mainly used to generate one 3D initial frame for each of multiple points (e.g., all points or most points) in the input point cloud data, so that each of the multiple points in the point cloud data corresponds to one 3D initial frame. Since the multiple points in the point cloud data usually include foreground points and background points, the 3D initial frames generated by the neural network of the present disclosure usually include: 3D initial frames corresponding to foreground points and 3D initial frames corresponding to background points.
Since the input of the neural network of the present disclosure is the point cloud data, and the neural network performs feature extraction on the point cloud data and performs semantic segmentation based on the extracted feature information, this belongs to bottom-level data analysis; and since the neural network of the present disclosure generates the 3D initial frames based on the semantic segmentation result, which amounts to upper-level data analysis, the present disclosure forms a bottom-up way of generating the 3D detection frame in the process of 3D detection of the target object. By generating the 3D initial frames in a bottom-up manner, the neural network of the present disclosure avoids the loss of original information of the point cloud data that would be caused by projecting the point cloud data and then performing 3D detection frame detection on the image obtained after projection, a loss that is not conducive to improving the performance of 3D detection frame detection. Moreover, the present disclosure also avoids the situation in which, when a 2D image captured by a camera device is used for 3D detection frame detection, a target object (such as a vehicle or an obstacle) in the 2D image is occluded and the detection of the 3D detection frame is affected, which is likewise not conducive to improving the performance of 3D detection frame detection. It can thus be seen that generating the 3D initial frames in a bottom-up manner by the neural network of the present disclosure is beneficial to improving the detection performance of the 3D detection frame.

In an optional example, the neural network in the present disclosure may be divided into multiple parts, and each part may be implemented by a small neural network (which may also be referred to as a neural network unit or a neural network module); that is, the neural network of the present disclosure is composed of multiple small neural networks. Since part of the structure of the neural network of the present disclosure may adopt the structure of an RCNN (Regions with Convolutional Neural Network), the neural network of the present disclosure may be referred to as PointRCNN (Point Regions with Convolutional Neural Network, a point-based region convolutional neural network).

In an optional example, the 3D initial frame generated by the neural network of the present disclosure may include: the center point position information of the 3D initial frame (such as the coordinates of the center point), the length, width, and height information of the 3D initial frame, the direction information of the 3D initial frame (such as the angle between the length of the 3D initial frame and the X coordinate axis), and the like. Of course, the 3D initial frame formed by the present disclosure may also include: the position information of the center point of the bottom or top face of the 3D initial frame, the length, width, and height information of the 3D initial frame, the direction information of the 3D initial frame, and the like. The present disclosure does not limit the specific representation of the 3D initial frame.

In an optional example, the neural network of the present disclosure may include: a first neural network, a second neural network, and a third neural network. The point cloud data is provided to the first neural network, which is used to: perform feature extraction processing on multiple points (e.g., all points) in the received point cloud data, thereby forming a piece of global feature information for each point in the point cloud data, and perform semantic segmentation processing according to the global feature information of the multiple points (e.g., all points), thereby forming a global semantic feature for each point; the first neural network outputs the global semantic feature of each point. Optionally, the global semantic feature of a point can usually take the form of a one-dimensional vector array including multiple (e.g., 256) elements. The global semantic feature in the present disclosure may also be referred to as a global semantic feature vector. In the case where the points in the point cloud data include foreground points and background points, the information output by the first neural network usually includes: the global semantic features of the foreground points and the global semantic features of the background points.

Optionally, the first neural network in the present disclosure may be implemented using a Point Cloud Encoder and a Point Cloud Decoder; optionally, the first neural network may adopt a network structure such as the PointNet++ or PointSIFT network model. The second neural network in the present disclosure may be implemented using an MLP (Multi-Layer Perceptron), and the output dimension of the MLP used to implement the second neural network may be 1. The third neural network in the present disclosure may also be implemented using an MLP, the output of the MLP used to implement the third neural network is multi-dimensional, and the number of dimensions is related to the information included in the 3D detection frame information.
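For illustration only, the second and third neural networks might be sketched as the two MLP heads below, operating on the per-point global semantic features produced by the first network; the hidden layer sizes and the seven-dimensional frame output are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ProposalHeads(nn.Module):
    """Sketch of the second and third neural networks: two small MLPs that
    consume the per-point global semantic feature (e.g. a 256-element
    vector) output by the first (segmentation) network."""

    def __init__(self, feature_dim: int = 256, box_dim: int = 7):
        super().__init__()
        # Second neural network: per-point foreground confidence (output dim 1).
        self.confidence_head = nn.Sequential(
            nn.Linear(feature_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )
        # Third neural network: per-point 3D initial frame
        # (center point, sizes, direction).
        self.box_head = nn.Sequential(
            nn.Linear(feature_dim, 128), nn.ReLU(),
            nn.Linear(128, box_dim),
        )

    def forward(self, point_features: torch.Tensor):
        # point_features: (N, feature_dim) global semantic features.
        confidence = self.confidence_head(point_features).squeeze(-1)  # (N,)
        boxes = self.box_head(point_features)                          # (N, box_dim)
        return confidence, boxes
```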
When the global semantic features of the points have been obtained, the present disclosure needs to use the global semantic features to realize foreground point prediction and 3D initial frame generation. The present disclosure may adopt the following two manners to realize foreground point prediction and 3D initial frame generation.

Manner 1: The global semantic features of the points output by the first neural network are provided to the second neural network and the third neural network simultaneously (as shown in FIG. 3). The second neural network is used to predict, for the global semantic feature of each input point, the confidence that the point is a foreground point, and to output a confidence for each point. The confidence predicted by the second neural network may indicate the probability that the point is a foreground point. The third neural network is used to generate and output one 3D initial frame for the global semantic feature of each input point. For example, according to the global semantic feature of each point, the third neural network outputs, for each point, the center point position information of the 3D initial frame, the length, width, and height information of the 3D initial frame, the direction information of the 3D initial frame, and the like.

Since the information output by the first neural network usually includes the global semantic features of the foreground points and the global semantic features of the background points, the 3D initial frames output by the third neural network usually include the 3D initial frames corresponding to the foreground points and the 3D initial frames corresponding to the background points; however, the third neural network itself cannot distinguish whether each 3D initial frame it outputs corresponds to a foreground point or to a background point.

Manner 2: The global semantic features of the points output by the first neural network are first provided to the second neural network, which predicts, for the global semantic feature of each input point, the confidence that the point is a foreground point. When the present disclosure determines that the confidence output by the second neural network for a point being a foreground point exceeds the predetermined value, the global semantic feature of that point is provided to the third neural network (as shown in FIG. 4). The third neural network generates one 3D initial frame for each received global semantic feature of a point judged to be a foreground point, and outputs the 3D initial frame corresponding to each foreground point. When the present disclosure determines that the confidence output by the second neural network for a point being a foreground point does not exceed the predetermined value, the global semantic feature of that point is not provided to the third neural network; therefore, all the 3D initial frames output by the third neural network are 3D initial frames corresponding to foreground points.
S210. Determine the final 3D detection frame according to the 3D detection frame information corresponding to the foreground points among the multiple points.

In an optional example, in the case where S200 adopts Manner 1, the present disclosure may determine, according to the confidences output by the second neural network, whether the 3D initial frame output by the third neural network for each point corresponds to a foreground point or to a background point. For example, when the present disclosure judges that the confidence output by the second neural network for the first point being a foreground point exceeds the predetermined value, that point is judged to be a foreground point, so the 3D initial frame output by the third neural network for the first point is judged to be a 3D initial frame corresponding to a foreground point; and so on, so that the present disclosure can, according to the confidences output by the second neural network, pick out from all the 3D initial frames output by the third neural network the 3D initial frames corresponding to all the foreground points. Afterwards, the present disclosure may perform redundancy removal on the 3D initial frames corresponding to all the picked-out foreground points, thereby obtaining the final 3D detection frame, that is, the 3D detection frame detected from the point cloud data. For example, the present disclosure may use the NMS (Non-Maximum Suppression) algorithm to perform redundancy removal on the 3D detection frame information corresponding to all the currently picked-out foreground points, thereby removing redundant 3D detection frames that cover one another and obtaining the final 3D detection frame.

In an optional example, in the case where S200 adopts Manner 2, the present disclosure may directly obtain the 3D initial frames corresponding to the foreground points according to the 3D initial frames output by the third neural network; therefore, the present disclosure may directly perform redundancy removal on all the 3D initial frames output by the third neural network, thereby obtaining the final 3D detection frame, that is, the 3D detection frame detected from the point cloud data (see the relevant description in the above embodiments). For example, the present disclosure may use the NMS algorithm to perform redundancy removal on all the 3D initial frames output by the third neural network, thereby removing redundant 3D initial frames that cover one another and obtaining the final 3D detection frame.

In an optional example, regardless of whether S200 adopts Manner 1 or Manner 2, after obtaining the 3D initial frames corresponding to the foreground points, the present disclosure may respectively correct the 3D initial frame corresponding to each foreground point, and perform redundancy removal on the corrected 3D initial frames corresponding to the foreground points, thereby obtaining the final 3D detection frame. That is, the process in which the neural network of the present disclosure generates the 3D detection frame may be divided into two stages: the 3D initial frames generated by the first-stage neural network are provided to the second-stage neural network, and the second-stage neural network corrects (e.g., optimizes the position of) the 3D initial frames generated by the first-stage neural network; afterwards, the present disclosure determines the final 3D detection frame according to the 3D initial frames corrected by the second-stage neural network. The final 3D detection frame is the 3D detection frame detected by the present disclosure based on the point cloud data. However, the process in which the neural network of the present disclosure generates the 3D initial frames may include only the first-stage neural network and not the second-stage neural network; in that case, it is also entirely feasible for the present disclosure to determine the final 3D detection frame according to the 3D initial frames generated by the first-stage neural network. Since the corrected 3D initial frames are often more accurate, determining the final 3D detection frame based on the corrected 3D initial frames is beneficial to improving the accuracy of 3D detection frame detection. Both the first-stage neural network and the second-stage neural network in the present disclosure may be implemented by neural networks that can exist independently, or may be composed of some network structural units of one complete neural network. In addition, for ease of description, the neural networks involved are referred to as the first neural network, the second neural network, the third neural network, the fourth neural network, the fifth neural network, the sixth neural network, and the seventh neural network, but it should be understood that each of the first to seventh neural networks may be an independent neural network, or may be composed of certain network structural units of one large neural network, which is not limited in the present disclosure.
In an optional example, the process in which the present disclosure uses the neural network to respectively correct the 3D initial frame corresponding to each foreground point may include the following step A2, step B2, and step C2:

Step A2: Set a 3D expansion frame containing the 3D initial frame, and acquire the global semantic features of the points in the 3D expansion frame.

Optionally, each 3D initial frame in the present disclosure corresponds to one 3D expansion frame, and the spatial range occupied by the 3D expansion frame usually completely covers the spatial range occupied by the 3D initial frame. Usually, no face of the 3D initial frame lies in the same plane as any face of its corresponding 3D expansion frame, the center point of the 3D initial frame coincides with the center point of the 3D expansion frame, and every face of the 3D initial frame is parallel to the corresponding face of its 3D expansion frame. Of course, the present disclosure does not exclude the case where the two center points do not coincide but every face of the 3D initial frame is still parallel to the corresponding face of its 3D expansion frame.

Optionally, the present disclosure may expand the 3D initial frame of a foreground point in 3D space according to at least one of a preset X-axis direction increment (e.g., 20 cm), a Y-axis direction increment (e.g., 20 cm), and a Z-axis direction increment (e.g., 20 cm), thereby forming a 3D expansion frame that contains the 3D initial frame, whose center point coincides with that of the 3D initial frame, and whose faces are parallel to the corresponding faces of the 3D initial frame.
Optionally, assume that the i-th 3D initial frame b_i can be expressed as: b_i = (x_i, y_i, z_i, h_i, w_i, l_i, θ_i), where x_i, y_i, and z_i respectively represent the coordinates of the center point of the i-th 3D initial frame, h_i, w_i, and l_i respectively represent the height, width, and length of the i-th 3D initial frame, and θ_i represents the direction of the i-th 3D initial frame; for example, in the bird's-eye view, the angle between the length of the i-th 3D initial frame and the X coordinate axis is θ_i. Then the 3D expansion frame b_i^e corresponding to the i-th 3D initial frame can be expressed as:

b_i^e = (x_i, y_i, z_i, h_i + η, w_i + η, l_i + η, θ_i)

where η represents the increment.
Optionally, the local space in the present disclosure generally refers to the spatial range formed by the 3D expansion frame. The local spatial semantic feature of a point generally refers to the semantic feature vector formed for the point while considering all the points in the spatial range formed by the 3D expansion frame. A local spatial semantic feature may likewise take the form of a one-dimensional vector array including multiple (e.g., 256) elements.

Optionally, the manner in which the present disclosure acquires the global semantic features of the points in the 3D expansion frame may be as follows. First, according to the coordinate information of each point in the point cloud data, it is judged whether the point belongs to the spatial range of the 3D expansion frame (that is, whether it is located in the 3D expansion frame, which may include being located on any surface of the 3D expansion frame). For a given point, if its position belongs to the spatial range of the 3D expansion frame, the point is taken as a point belonging to the 3D expansion frame; if its position does not belong to the spatial range of the 3D expansion frame, the point is not taken as a point belonging to the 3D expansion frame. Then, according to the global semantic features of multiple points (e.g., all points) in the point cloud data, the global semantic features of all points belonging to the 3D expansion frame are determined. Optionally, when the present disclosure determines that a point belongs to the 3D expansion frame, the global semantic feature of that point can be looked up from the previously obtained global semantic features of the points, and so on, so that the present disclosure can obtain the global semantic features of all points belonging to the 3D expansion frame.

Step B2: Provide the point cloud data located in the 3D expansion frame to a fourth neural network in the neural network, and generate the local spatial semantic features of the points in the 3D expansion frame via the fourth neural network.

Optionally, the manner in which the present disclosure acquires the local spatial semantic features of all points in the 3D expansion frame may include the following steps a and b:

a. First, according to a preset target position of the 3D expansion frame, coordinate transformation is performed on the coordinate information of the point cloud data located in the 3D expansion frame, so that the coordinates of the points located in the 3D expansion frame are displaced, and the 3D expansion frame is thereby translated and rotated (the direction of the 3D expansion frame is adjusted) and transformed to the preset target position of the 3D expansion frame. Optionally, the preset target position of the 3D expansion frame may include: the center point of the 3D expansion frame (that is, the center point of the 3D initial frame) is located at the coordinate origin, and the length of the 3D expansion frame is parallel to the X axis. Optionally, the above coordinate origin and X axis may be the coordinate origin and X axis of the coordinate system of the point cloud data, and of course may also be the coordinate origin and X axis of another coordinate system.
Continuing the previous example, assume that the i-th 3D initial frame b_i can be expressed as: b_i = (x_i, y_i, z_i, h_i, w_i, l_i, θ_i), where x_i, y_i, and z_i respectively represent the coordinates of the center point of the i-th 3D initial frame, h_i, w_i, and l_i respectively represent the height, width, and length of the i-th 3D initial frame, and θ_i represents the direction of the i-th 3D initial frame; for example, in the bird's-eye view, the angle between the length of the i-th 3D initial frame and the X coordinate axis is θ_i. Then, after the coordinate transformation is performed on the 3D expansion frame containing the i-th 3D initial frame, the present disclosure obtains a new 3D initial frame b̃_i, which can be expressed as:

b̃_i = (0, 0, 0, h_i, w_i, l_i, 0)

That is, the center point of the new 3D initial frame b̃_i is located at the coordinate origin, and in the bird's-eye view the angle between the length of the new 3D initial frame b̃_i and the X coordinate axis is 0.
b. The coordinate-transformed point cloud data (that is, the coordinate-transformed points located in the 3D expansion frame) is provided to the fourth neural network in the neural network; the fourth neural network performs feature extraction processing on the received points and performs semantic segmentation processing based on the extracted local feature information, so as to generate a local spatial semantic feature for each point located in the 3D expansion frame.

Optionally, the present disclosure may also form a foreground point mask according to the confidences output by the second neural network (for example, a point whose confidence exceeds a predetermined value such as 0.5 is set to 1, and a point whose confidence does not exceed the predetermined value is set to 0). The present disclosure may provide the foreground point mask together with the coordinate-transformed point cloud data to the fourth neural network, so that the fourth neural network can refer to the foreground point mask when performing feature extraction and semantic processing, which helps to improve the description accuracy of the local spatial semantic features.

Optionally, the fourth neural network in the present disclosure may be implemented using an MLP, the output of the MLP used to implement the fourth neural network is usually multi-dimensional, and the number of dimensions is related to the information included in the local spatial semantic features.

Step C2: Via a fifth neural network in the neural network, encode the global semantic features and the local spatial semantic features of the points in the 3D expansion frame to obtain a feature used to describe the 3D initial frame in the 3D expansion frame; predict, via a sixth neural network in the neural network and according to the feature used to describe the 3D initial frame, the confidence that the 3D initial frame is the target object; and correct the 3D initial frame via a seventh neural network in the neural network according to the feature used to describe the 3D initial frame, which is beneficial to improving the accuracy of the 3D initial frame and thus to improving the accuracy of the 3D detection frame.

Optionally, the fifth neural network in the present disclosure may be implemented using a Point Cloud Encoder; optionally, the fifth neural network may adopt part of a network structure such as the PointNet++ or PointSIFT network model. The sixth neural network in the present disclosure may be implemented using an MLP, the output dimension of the MLP used to implement the sixth neural network may be 1, and the number of dimensions may be related to the number of categories of the target object. The seventh neural network in the present disclosure may also be implemented using an MLP, the output of the MLP used to implement the seventh neural network is multi-dimensional, and the number of dimensions is related to the information included in the 3D detection frame information. Each of the first to seventh neural networks in the present disclosure may be implemented by a neural network that can exist independently, or by part of a neural network that cannot exist independently.
Optionally, the present disclosure may concatenate the global semantic feature and the local spatial semantic feature of each point in the 3D expansion frame. For example, for any point in the 3D expansion frame, the global semantic feature and the local spatial semantic feature of the point are concatenated to form a concatenated semantic feature, and the concatenated semantic features of the points are provided as input to the fifth neural network, so that the fifth neural network encodes the concatenated semantic features and outputs the encoded feature used to describe the 3D initial frame in the 3D expansion frame (hereinafter referred to simply as the encoded feature).

Optionally, the encoded feature output by the fifth neural network is provided to the sixth neural network and the seventh neural network simultaneously (as shown in FIG. 5). The sixth neural network is used to predict, for each input encoded feature, the confidence that the corresponding 3D initial frame is the target object, and to output a confidence for each 3D initial frame. The confidence predicted by the sixth neural network may represent the probability that the corrected 3D initial frame is the target object. The target object here may be a vehicle, a pedestrian, or the like. The seventh neural network is used to form and output, for each input encoded feature, a new 3D initial frame (that is, a corrected 3D initial frame). For example, according to each input encoded feature, the seventh neural network respectively outputs the center point position information of the new 3D initial frame, the length, width, and height information of the new 3D initial frame, the direction information of the new 3D initial frame, and the like.
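For illustration only, the sixth and seventh neural networks might be sketched as follows; the input dimension, the hidden layer sizes, and the direct prediction of the corrected frame (rather than, say, a residual) are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class RefinementHeads(nn.Module):
    """Sketch of the sixth and seventh neural networks of the second stage:
    given the encoded feature describing one 3D initial frame (output by the
    fifth, encoding network), predict the confidence that the frame is the
    target object and a corrected 3D initial frame."""

    def __init__(self, encoded_dim: int = 512, box_dim: int = 7):
        super().__init__()
        # Sixth neural network: per-frame confidence (output dim 1).
        self.confidence_head = nn.Sequential(
            nn.Linear(encoded_dim, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )
        # Seventh neural network: corrected 3D initial frame
        # (center point, sizes, direction).
        self.box_head = nn.Sequential(
            nn.Linear(encoded_dim, 256), nn.ReLU(),
            nn.Linear(256, box_dim),
        )

    def forward(self, encoded_features: torch.Tensor):
        # encoded_features: (B, encoded_dim), one row per 3D initial frame.
        confidence = self.confidence_head(encoded_features).squeeze(-1)
        corrected_boxes = self.box_head(encoded_features)
        return confidence, corrected_boxes
```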
It should be specifically noted that the neural network of the present disclosure can be implemented in multiple ways: one implementation is shown in FIG. 3; another implementation is shown in FIG. 4; yet another implementation is the combination of FIG. 3 and FIG. 5; and still another implementation is the combination of FIG. 4 and FIG. 5. The individual implementations are not described one by one in detail here.

In an optional example, the neural network of the present disclosure is obtained by training with multiple point cloud data samples carrying 3D annotation frames. For example, the present disclosure may obtain the loss corresponding to the confidence generated by the neural network to be trained, and the loss formed by the 3D initial frame generated by the neural network to be trained for a point cloud data sample relative to the 3D annotation frame of that sample, and use these two losses to adjust the network parameters of the neural network to be trained, thereby training the neural network. The network parameters in the present disclosure may include, but are not limited to, convolution kernel parameters, weight values, and the like.

In the case where the process of forming the 3D detection frame by the neural network of the present disclosure includes only one stage (that is, only the process in which the first-stage neural network forms the 3D detection frame), the present disclosure may obtain the loss corresponding to the confidence generated by the first-stage neural network and the loss corresponding to the 3D initial frame, and use these two losses to adjust the network parameters of the first-stage neural network (such as the first neural network, the second neural network, and the third neural network). After the first-stage neural network is successfully trained, the training of the entire neural network is completed.

In the case where the process of forming the 3D detection frame by the neural network of the present disclosure is divided into two stages, the present disclosure may train the first-stage neural network and the second-stage neural network separately. For example, the loss corresponding to the confidence generated by the first-stage neural network and the loss corresponding to the 3D initial frame are obtained first, and these two losses are used to adjust the network parameters of the first-stage neural network. After the first-stage neural network is successfully trained, the 3D initial frames corresponding to the foreground points output by the first-stage neural network are provided as input to the second-stage neural network; the loss corresponding to the confidence generated by the second-stage neural network and the loss corresponding to the corrected 3D initial frame are then obtained, and these two losses are used to adjust the network parameters of the second-stage neural network (such as the fourth neural network, the fifth neural network, the sixth neural network, and the seventh neural network). After the second-stage neural network is successfully trained, the training of the entire neural network is completed.
The loss corresponding to the confidence generated by the first-stage neural network in the present disclosure can be expressed by the following formula (1):

L_{focal}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)        Formula (1)

In the above formula (1), when the point p is a foreground point, p_t is the confidence of the foreground point p; when the point p is not a foreground point, p_t is the difference between 1 and the confidence of the point p; α_t and γ are both constants, and in an optional example, α_t = 0.25 and γ = 2.
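For illustration only, formula (1) can be written down directly as code. The sketch below assumes a PyTorch setting in which pred holds the predicted foreground confidences in (0, 1) and target marks foreground points with 1; these names, the framework, and the mean reduction are assumptions made for the example rather than part of the disclosure.

# Hedged sketch of formula (1): a per-point focal loss with alpha_t = 0.25, gamma = 2.
import torch

def focal_loss(pred, target, alpha_t=0.25, gamma=2.0, eps=1e-6):
    """pred: predicted foreground confidence per point, in (0, 1).
    target: 1.0 for foreground points, 0.0 otherwise."""
    # p_t is the confidence for foreground points and (1 - confidence) otherwise
    p_t = torch.where(target > 0.5, pred, 1.0 - pred)
    loss = -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=eps))
    return loss.mean()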
The loss corresponding to the 3D initial frames generated by the first-stage neural network in the present disclosure can be expressed by the following formula (2):

L_{reg} = \frac{1}{N_{pos}} \sum_{p \in pos} \left( L_{bin}^{(p)} + L_{res}^{(p)} \right)        Formula (2)

In the above formula (2), L_reg represents the regression loss function of the 3D detection frame, and N_pos represents the number of foreground points; L_bin^(p) represents the bucket (bin) loss function of the 3D initial frame generated for the foreground point p, and can be expressed in the form of the following formula (3); L_res^(p) represents the residual loss function of the 3D initial frame generated for the foreground point p, and can be expressed in the form of the following formula (4).
L_{bin}^{(p)} = \sum_{u \in \{x, z, \theta\}} \left( F_{cls}\big(\widehat{bin}_u^{(p)}, bin_u^{(p)}\big) + F_{reg}\big(\widehat{res}_u^{(p)}, res_u^{(p)}\big) \right)        Formula (3)

In the above formula (3), L_bin^(p) represents the bucket loss function of the 3D initial frame generated for the foreground point p; x, z and θ respectively represent the x coordinate of the center point, the z coordinate of the center point, and the direction of the target object, where the target object may be a 3D initial frame generated by the neural network or a 3D annotation frame in a point cloud data sample; F_cls(*) represents a cross-entropy classification loss; \widehat{bin}_u^{(p)} represents the number of the bucket in which the parameter u of the center point of the 3D initial frame generated for the foreground point p is located; bin_u^{(p)} represents the number of the bucket in which the parameter u of the 3D annotation frame information in the point cloud data sample is located; when the parameter u is x, these bucket numbers can be expressed in the form of the following formula (5), and when the parameter u is z, they can be expressed in the form of the following formula (6); F_reg(*) represents a smooth L1 loss function (Smooth L1 Loss); \widehat{res}_u^{(p)} represents the offset, within the corresponding bucket, of the parameter u of the 3D initial frame generated for the foreground point p; res_u^{(p)} represents the offset, within the corresponding bucket, of the parameter u of the 3D annotation frame information in the point cloud data sample; when the parameter u is x or z, these offsets can be expressed in the form of the following formula (7).
For a point, a bucket in the present disclosure may refer to a value range obtained by partitioning the spatial range around that point; each such value range is called a bucket, and each bucket may have a corresponding number. Usually, the value range of a bucket is fixed. In one optional example, the value range of a bucket is a length range, in which case the bucket has a fixed length; in another optional example, the value range of a bucket is an angle range, in which case the bucket has a fixed angle interval. Optionally, for the x direction or the z direction, the bucket length may be 0.5 m, in which case the value ranges of different buckets may be 0-0.5 m, 0.5 m-1 m, and so on. Optionally, the present disclosure may evenly divide 2π into multiple angle intervals, each angle interval corresponding to one value range; in this case, the bucket size (that is, the angle interval) may be 45 degrees, 30 degrees, or the like.
L_{res}^{(p)} = \sum_{v \in \{y, h, w, l\}} F_{reg}\big( \widehat{res}_v^{(p)}, res_v^{(p)} \big)        Formula (4)

In the above formula (4), L_res^(p) represents the residual loss function of the 3D initial frame generated for the foreground point p; y, h, w and l respectively represent the y coordinate of the center point of the 3D initial frame generated for the foreground point p, and the height, width and length of that 3D initial frame; F_reg(*) represents a smooth L1 loss function. When the parameter v is y, \widehat{res}_v^{(p)} represents the offset of the y coordinate of the foreground point p relative to the y coordinate of the center point of the 3D initial frame generated for the foreground point p, as shown in formula (8); when the parameter v is h, w or l, \widehat{res}_v^{(p)} represents the offset of the height, width or length of the 3D initial frame generated for the foreground point p relative to the corresponding preset parameter. When the parameter v is y, res_v^{(p)} represents the offset of the y coordinate of the foreground point p relative to the y coordinate of the center point of the 3D annotation frame, as shown in formula (8); when the parameter v is h, w or l, res_v^{(p)} represents the offset of the height, width or length of the 3D annotation frame relative to the corresponding preset parameter. The preset parameters in the present disclosure may be the mean length, mean width and mean height obtained by statistically computing the lengths, widths and heights of the 3D annotation frames in the point cloud data samples of the training data.
bin_x^{(p)} = \left\lfloor \frac{x^p - x^{(p)} + S}{\delta} \right\rfloor        Formula (5)

bin_z^{(p)} = \left\lfloor \frac{z^p - z^{(p)} + S}{\delta} \right\rfloor        Formula (6)

In the above formula (5) and formula (6), bin_x^{(p)} represents the number of the bucket, along the X coordinate axis, in which the center point of the 3D annotation frame in the point cloud data sample is located; bin_z^{(p)} represents the number of the bucket, along the Z coordinate axis, in which the center point of the 3D annotation frame in the point cloud data sample is located; (x^{(p)}, z^{(p)}) represents the x coordinate and z coordinate of the foreground point p; (x^p, z^p) represents the x coordinate and z coordinate of the center point of the 3D initial frame generated for the foreground point p; δ represents the bucket length; and S represents the search distance around the foreground point p on the x axis or the z axis.
res_u^{(p)} = \frac{1}{C} \left( u^p - u^{(p)} + S - \left( bin_u^{(p)} \cdot \delta + \frac{\delta}{2} \right) \right)        Formula (7)

In the above formula (7), S represents the search distance around the foreground point p on the x axis or the z axis; that is, when the parameter u is x, S represents the distance, along the x-axis direction, between the center point of the 3D initial frame generated for the foreground point p and the x coordinate of the foreground point p, and when the parameter u is z, S represents the distance, along the z-axis direction, between the center point of that 3D initial frame and the z coordinate of the foreground point p; δ represents the bucket length, which is a constant value, for example δ = 0.5 m; bin_u^{(p)} is as shown in the above formula (5) and formula (6); and C is a constant value that may be related to the bucket length, for example equal to the bucket length or to half of the bucket length.
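As a simplified illustration of the bucket encoding described above, the sketch below computes a bucket index and an in-bucket residual for a scalar offset along the x or z direction; the search distance S, the bucket length δ, the normalization constant C and the example values are assumptions chosen for the example, not values fixed by the disclosure.

# Hedged sketch of the bucket (bin) encoding along x or z, in the spirit of formulas (5)-(7).
import math

def encode_bucket(u_center, u_point, S=3.0, delta=0.5, C=0.25):
    """u_center: x (or z) coordinate of the frame center;
    u_point: x (or z) coordinate of the foreground point;
    S: search distance, delta: bucket length, C: normalization constant
    (here assumed to be half the bucket length)."""
    offset = u_center - u_point + S                          # shift into [0, 2S]
    bucket_id = int(math.floor(offset / delta))              # bucket number
    residual = (offset - (bucket_id * delta + delta / 2.0)) / C  # normalized in-bucket offset
    return bucket_id, residual

# Example: a center 0.8 m ahead of the point along x falls into bucket 7
# (with S = 3.0 m and delta = 0.5 m), with a small normalized residual.
print(encode_bucket(u_center=10.8, u_point=10.0))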
\widehat{res}_y^{(p)} = y^{(p)} - y^p        Formula (8)

In the above formula (8), \widehat{res}_y^{(p)} represents the offset of the y coordinate of the foreground point within the corresponding bucket; y^p represents the y coordinate of the center point of the 3D initial frame generated for the foreground point p; and y^{(p)} represents the y coordinate of the foreground point.
In an optional example, when the training of the first to third neural networks reaches a predetermined iteration condition, the current training process ends. The predetermined iteration condition in the present disclosure may include: the difference between the 3D initial frames output by the third neural network and the 3D annotation frames of the point cloud data samples meets a predetermined difference requirement, and the confidence output by the second neural network meets a predetermined requirement. When both requirements are met, the first to third neural networks are successfully trained this time. The predetermined iteration condition in the present disclosure may also include: the number of point cloud data samples used to train the first to third neural networks reaches a predetermined number, and so on. If the number of point cloud data samples used reaches the predetermined number but the two requirements are not both met, the first to third neural networks are not successfully trained this time.

Optionally, in the case where the process of forming the 3D detection frame by the neural network of the present disclosure includes one stage, the successfully trained first to third neural networks can be used for 3D detection of the target object.

Optionally, in the case where the process of forming the 3D detection frame by the neural network of the present disclosure includes two stages, the successfully trained first to third neural networks can also be used to generate, for the point cloud data samples, the 3D initial frames corresponding to the foreground points. That is, the present disclosure may again provide the point cloud data samples to the successfully trained first neural network and store the information output by the second neural network and the third neural network separately, so as to provide the input (that is, the 3D initial frames corresponding to the foreground points) for the second-stage neural network. Afterwards, the loss corresponding to the confidence generated in the second stage and the loss corresponding to the corrected 3D initial frames are obtained, and the obtained losses are used to adjust the network parameters of the fourth to seventh neural networks. After the fourth to seventh neural networks are successfully trained, the training of the entire neural network is completed.
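The two-stage training schedule described above can be summarized in a short Python sketch. The module and loss callables below are placeholders standing in for the first-stage and second-stage networks and their combined losses; none of the names are identifiers from the disclosure, and the optimizer choice and epoch counts are assumptions.

# Hedged sketch of the two-stage schedule: train stage 1, then keep it fixed,
# generate 3D initial frames (proposals) with it, and train stage 2 on those proposals.
import torch

def train_two_stage(stage1, stage2, loader, stage1_loss, stage2_loss,
                    epochs1=10, epochs2=10, lr=1e-3):
    """stage1/stage2: the first- and second-stage networks;
    stage1_loss/stage2_loss: callables combining the confidence loss and the box loss."""
    opt1 = torch.optim.Adam(stage1.parameters(), lr=lr)
    for _ in range(epochs1):
        for points, gt_boxes, fg_mask in loader:
            conf, init_boxes = stage1(points)
            loss = stage1_loss(conf, init_boxes, gt_boxes, fg_mask)
            opt1.zero_grad(); loss.backward(); opt1.step()

    stage1.eval()                                   # stage 1 stays fixed afterwards
    opt2 = torch.optim.Adam(stage2.parameters(), lr=lr)
    for _ in range(epochs2):
        for points, gt_boxes, fg_mask in loader:
            with torch.no_grad():
                _, init_boxes = stage1(points)      # proposals provided as input
            refined_conf, refined_boxes = stage2(points, init_boxes)
            loss = stage2_loss(refined_conf, refined_boxes, init_boxes, gt_boxes)
            opt2.zero_grad(); loss.backward(); opt2.step()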
The loss function used in the present disclosure for adjusting the network parameters of the fourth to seventh neural networks in the second-stage neural network, which includes the loss corresponding to the confidence and the loss corresponding to the corrected 3D initial frames, can be expressed by the following formula (9):

\frac{1}{\|B\|} \sum_{i \in B} F_{cls}(prob_i, label_i) + \frac{1}{\|B_{pos}\|} \sum_{i \in B_{pos}} \left( \widetilde{L}_{bin}^{(i)} + \widetilde{L}_{res}^{(i)} \right)        Formula (9)

In the above formula (9), B represents the set of 3D initial frames, and ||B|| represents the number of 3D initial frames in that set; F_cls(*) represents the cross-entropy loss function used to supervise the predicted confidence, that is, F_cls(*) is a classification-based cross-entropy loss function; prob_i represents the confidence, predicted by the sixth neural network, that the corrected i-th 3D initial frame is a target object; label_i is the label indicating whether the i-th 3D initial frame is a target object, and this label can be obtained by computation, for example, the label takes the value 1 when the overlap between the i-th 3D initial frame and the corresponding 3D annotation frame exceeds a set threshold, and 0 otherwise; B_pos is a subset of B in which the overlap between each 3D initial frame and the corresponding 3D annotation frame exceeds the set threshold, and ||B_pos|| represents the number of 3D initial frames in that subset; \widetilde{L}_{bin}^{(i)} is similar to the above L_bin^(p), and \widetilde{L}_{res}^{(i)} is similar to the above L_res^(p), except that they use the coordinate-converted i-th 3D initial frame \widetilde{b}_i (replacing the i-th 3D initial frame b_i in those formulas) and the coordinate-converted i-th 3D annotation frame information (replacing the i-th 3D annotation frame information in those formulas); the two coordinate-converted frames can be expressed in the form of the following formula (10).

Formula (10), which is given as an image in the original publication, defines this coordinate conversion: it maps the i-th 3D annotation frame information to the coordinate-converted i-th 3D annotation frame information, and maps the corrected i-th 3D initial frame (x_i, y_i, z_i, h_i, w_i, l_i, θ_i) to the coordinate-converted i-th 3D initial frame \widetilde{b}_i.
When computing formula (9), the above formula (3) is used, and the bucket-number terms for the direction parameter θ in formula (3) can be replaced by the form of formula (11), which is given as an image in the original publication. In formula (11), ω represents the bucket size, that is, the angle interval of a bucket.

Likewise, when computing formula (9), the residual terms for the direction parameter θ in formula (3) can be replaced by the form of formula (12), which is also given as an image in the original publication. In formula (12), ω again represents the bucket size, that is, the angle interval of a bucket.
In an optional example, when the training of the fourth to seventh neural networks reaches a predetermined iteration condition, the current training process ends. The predetermined iteration condition in the present disclosure may include: the difference between the 3D initial frames output by the seventh neural network and the 3D annotation frames of the point cloud data samples meets a predetermined difference requirement, and the confidence output by the sixth neural network meets a predetermined requirement. When both requirements are met, the fourth to seventh neural networks are successfully trained this time. The predetermined iteration condition in the present disclosure may also include: the number of point cloud data samples used to train the fourth to seventh neural networks reaches a predetermined number, and so on. If the number of point cloud data samples used reaches the predetermined number but the two requirements are not both met, the fourth to seventh neural networks are not successfully trained this time.
FIG. 6 is a flowchart of an embodiment of the vehicle intelligent control method of the present disclosure.

As shown in FIG. 6, the method of this embodiment includes steps S600, S610, S620, S630, S640 and S650. Each step in FIG. 6 is described in detail below.

S600. Extract feature information of the acquired point cloud data of a scene.

S610. Perform semantic segmentation on the point cloud data according to the feature information of the point cloud data to obtain first semantic information of multiple points in the point cloud data.

S620. Predict, according to the first semantic information, at least one foreground point corresponding to a target object among the multiple points.

S630. Generate, according to the first semantic information, a 3D initial frame corresponding to each of the at least one foreground point.

S640. Determine the 3D detection frame of the target object in the scene according to the 3D initial frames.

For the specific implementation of S600-S640, reference may be made to the relevant description in the above embodiments, which is not repeated here. Moreover, S600-S640 may be implemented as follows: the point cloud data is provided to a neural network, which extracts feature information from the points in the point cloud data, performs semantic segmentation according to the extracted feature information to obtain the semantic features of multiple points, predicts the foreground points among the multiple points according to the semantic features, and generates a 3D initial frame corresponding to each of at least some of the multiple points.

S650. Generate, according to the above 3D detection frame, an instruction for controlling the vehicle or early-warning prompt information.
Optionally, the present disclosure may first determine, according to the 3D detection frame, at least one of the following pieces of information about the target object: the spatial position of the target object in the scene, its size, its distance from the vehicle, and its orientation relative to the vehicle. An instruction for controlling the vehicle or early-warning prompt information is then generated according to the determined information. Instructions generated by the present disclosure include, for example, an instruction to increase speed, an instruction to decrease speed, or an emergency braking instruction. The generated early-warning prompt information includes, for example, prompt information calling attention to a target object, such as a vehicle or a pedestrian, in a certain direction. The present disclosure does not limit the specific implementation of generating instructions or early-warning prompt information according to the 3D detection frame.
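As a rough illustration only, a rule-based mapping from detected 3D frames to a control command or a warning might look like the sketch below; the distance thresholds, bearing threshold and command strings are invented for the example and are not prescribed by the disclosure.

# Hedged sketch: turning detected 3D frames into a control command or a warning.
from dataclasses import dataclass

@dataclass
class Detection3D:
    distance_m: float      # distance from the detected object to the ego vehicle
    bearing_deg: float     # relative bearing of the object (0 = straight ahead)
    label: str             # e.g. "vehicle" or "pedestrian"

def decide(detections):
    for det in detections:
        ahead = abs(det.bearing_deg) < 20.0
        if ahead and det.distance_m < 5.0:
            return "command: emergency_brake"
        if ahead and det.distance_m < 15.0:
            return "command: reduce_speed"
        if det.distance_m < 15.0:
            return f"warning: {det.label} at bearing {det.bearing_deg:.0f} deg"
    return "command: keep_speed"

print(decide([Detection3D(distance_m=12.0, bearing_deg=5.0, label="pedestrian")]))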
FIG. 7 is a flowchart of an embodiment of the obstacle avoidance navigation method of the present disclosure.

As shown in FIG. 7, the method of this embodiment includes steps S700, S710, S720, S730, S740 and S750. Each step in FIG. 7 is described in detail below.

S700. Extract feature information of the acquired point cloud data of a scene.

S710. Perform semantic segmentation on the point cloud data according to the feature information of the point cloud data to obtain first semantic information of multiple points in the point cloud data.

S720. Predict, according to the first semantic information, at least one foreground point corresponding to a target object among the multiple points.

S730. Generate, according to the first semantic information, a 3D initial frame corresponding to each of the at least one foreground point.

S740. Determine the 3D detection frame of the target object in the scene according to the 3D initial frames.

For the specific implementation of S700-S740, reference may be made to the relevant description in the above embodiments, which is not repeated here. Moreover, S700-S740 may be implemented as follows: the point cloud data is provided to a neural network, which extracts feature information from the points in the point cloud data, performs semantic segmentation according to the extracted feature information to obtain the semantic features of multiple points, predicts the foreground points among the multiple points according to the semantic features, and generates a 3D initial frame corresponding to each of at least some of the multiple points.

S750. Generate, according to the above 3D detection frame, an instruction or early-warning prompt information for performing obstacle avoidance navigation control on the robot on which the lidar is located.
Optionally, the present disclosure may first determine, according to the 3D detection frame, at least one of the following pieces of information about the target object: the spatial position of the target object in the scene, its size, its distance from the robot, and its orientation relative to the robot. An instruction or early-warning prompt information for performing obstacle avoidance navigation control on the robot is then generated according to the determined information. Instructions generated by the present disclosure include, for example, an instruction to reduce the movement speed, an instruction to pause movement, or a turning instruction. The generated early-warning prompt information includes, for example, prompt information calling attention to an obstacle (that is, a target object) in a certain direction. The present disclosure does not limit the specific implementation of generating instructions or early-warning prompt information according to the 3D detection frame.
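Similarly, and purely as an illustration, an obstacle-avoidance decision for the robot could be sketched as follows; the clearance and caution thresholds, the field-of-view limit, and the returned action names are assumptions for the example.

# Hedged sketch: choosing an avoidance action from detected 3D obstacle frames.
def avoidance_action(obstacles, clearance_m=1.0, caution_m=3.0):
    """obstacles: iterable of (distance_m, bearing_deg) pairs relative to the robot."""
    front = [(d, b) for d, b in obstacles if abs(b) < 30.0]
    if any(d < clearance_m for d, _ in front):
        return "pause"                       # an obstacle is too close ahead
    if any(d < caution_m for d, _ in front):
        # turn away from the nearest obstacle in the caution zone
        d, b = min(front, key=lambda ob: ob[0])
        return "turn_left" if b > 0 else "turn_right"
    return "continue"

print(avoidance_action([(2.2, 10.0), (6.0, -40.0)]))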
FIG. 8 is a schematic structural diagram of an embodiment of the target object 3D detection apparatus of the present disclosure. The apparatus shown in FIG. 8 includes: a feature extraction module 800, a first semantic segmentation module 810, a foreground point prediction module 820, an initial frame generation module 830, and a detection frame determination module 840.

The feature extraction module 800 is mainly configured to extract feature information of the acquired point cloud data of a scene. The first semantic segmentation module 810 is mainly configured to perform semantic segmentation on the point cloud data according to the feature information of the point cloud data to obtain first semantic information of multiple points in the point cloud data. The foreground point prediction module 820 is mainly configured to predict, according to the first semantic information, at least one foreground point corresponding to a target object among the multiple points. The initial frame generation module 830 is mainly configured to generate, according to the first semantic information, a 3D initial frame corresponding to each of the at least one foreground point. The detection frame determination module 840 is mainly configured to determine the 3D detection frame of the target object in the scene according to the 3D initial frames.

In an optional example, the detection frame determination module 840 may include a first submodule, a second submodule and a third submodule. The first submodule is mainly configured to obtain feature information of the points in a partial region of the point cloud data, where the partial region includes at least one 3D initial frame. The second submodule is mainly configured to perform semantic segmentation on the points in the partial region according to the feature information of those points, to obtain second semantic information of the points in the partial region. The third submodule is mainly configured to determine the 3D detection frame of the target object in the scene according to the first semantic information and the second semantic information of the points in the partial region.

In an optional example, the third submodule in the present disclosure may include a fourth submodule and a fifth submodule. The fourth submodule is mainly configured to correct the 3D initial frame according to the first semantic information and the second semantic information of the points in the partial region to obtain a corrected 3D initial frame. The fifth submodule is mainly configured to determine the 3D detection frame of the target object in the scene according to the corrected 3D initial frame.

In an optional example, the third submodule in the present disclosure may further be configured to determine, according to the first semantic information and the second semantic information of the points in the partial region, the confidence that the 3D initial frame corresponds to a target object, and to determine the 3D detection frame of the target object in the scene according to the 3D initial frame and its confidence.

In an optional example, the third submodule in the present disclosure may include a fourth submodule, a sixth submodule and a seventh submodule. The fourth submodule is mainly configured to correct the 3D initial frame according to the first semantic information and the second semantic information of the points in the partial region to obtain a corrected 3D initial frame. The sixth submodule is mainly configured to determine, according to the first semantic information and the second semantic information of the points in the partial region, the confidence that the corrected 3D initial frame corresponds to a target object. The seventh submodule is mainly configured to determine the 3D detection frame of the target object in the scene according to the corrected 3D initial frame and its confidence.
In an optional example, the partial region in the present disclosure includes a 3D expansion frame obtained by expanding the edges of the 3D initial frame according to a predetermined strategy. For example, the 3D expansion frame may be formed by expanding the 3D initial frame in 3D space according to a preset X-axis direction increment, Y-axis direction increment and/or Z-axis direction increment, so that the resulting 3D expansion frame contains the 3D initial frame.
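For illustration, expanding a 3D initial frame by preset increments along the coordinate axes can be written as below; the increment values and the assumed mapping of length, width and height to the X, Z and Y axes are choices made for the example, not values fixed by the disclosure.

# Hedged sketch: enlarging a 3D initial frame (center x, y, z; size l, w, h; direction theta)
# by preset per-axis increments to obtain a 3D expansion frame containing it.
def expand_box(box, dx=0.5, dy=0.5, dz=0.5):
    """box: dict with center coordinates x, y, z, sizes l, w, h and direction theta."""
    expanded = dict(box)
    expanded["l"] = box["l"] + 2 * dx   # grow both sides along the X axis (length, assumed)
    expanded["w"] = box["w"] + 2 * dz   # grow both sides along the Z axis (width, assumed)
    expanded["h"] = box["h"] + 2 * dy   # grow both sides along the Y axis (height, assumed)
    return expanded

print(expand_box({"x": 0.0, "y": 0.0, "z": 0.0, "l": 3.9, "w": 1.6, "h": 1.5, "theta": 0.0}))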
在一个可选示例中,本公开中的第二子模块可以包括:第八子模块和第九子模块。其中的第八子模块主要用于根据3D扩展框的预设目标位置,对点云数据中位于3D扩展框内的点的坐标信息进行坐标变换,获取坐标变换后的点的特征信息。其中的第九子模块主要用于根据坐标变换后的点的特征信息,进行基于3D扩展框的语义分割,获得3D扩展框中的点的第二语义特征。可选的,第九子模块可以根据前景点的掩膜以及坐标变换后的点的特征信息,进行基于3D扩展框的语义分割,获得点的第二语义特征。In an optional example, the second submodule in the present disclosure may include: an eighth submodule and a ninth submodule. The eighth sub-module is mainly used to perform coordinate transformation on the coordinate information of the points located in the 3D extension box in the point cloud data according to the preset target position of the 3D extension box to obtain the feature information of the point after the coordinate transformation. The ninth sub-module is mainly used to perform semantic segmentation based on the 3D extension box according to the feature information of the coordinate-transformed point, to obtain the second semantic feature of the point in the 3D extension box. Optionally, the ninth sub-module may perform semantic segmentation based on the 3D extension frame according to the mask of the front sight and the feature information of the point after coordinate transformation to obtain the second semantic feature of the point.
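One way to picture the coordinate transformation in this step is to translate the points inside the 3D expansion frame so that the frame center sits at the preset target position (assumed here to be the origin) and to rotate them about the vertical axis so that the frame direction is aligned with a fixed axis. The NumPy sketch below makes these assumptions, including the choice of y as the vertical axis and the rotation sign convention.

# Hedged sketch: transforming points inside a 3D expansion frame into the frame's own
# (canonical) coordinate system; axis and sign conventions are assumptions.
import numpy as np

def canonical_transform(points, center, theta):
    """points: (N, 3) array of x, y, z; center: (3,) frame center; theta: frame direction."""
    shifted = points - center                           # move the frame center to the origin
    c, s = np.cos(-theta), np.sin(-theta)               # rotate by -theta about the y axis
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return shifted @ rot.T

pts = np.array([[1.0, 0.2, 2.0], [1.5, 0.0, 2.5]])
print(canonical_transform(pts, center=np.array([1.0, 0.0, 2.0]), theta=np.pi / 6))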
在一个可选示例中,在前景点为多个的情况下,本公开中的确定检测框模块840可以先确定多个前景点对应的3D初始框之间的重叠度,然后,确定检测框模块840对重叠度大于设定阈值的3D初始框进行筛选;再后,确定检测框模块840根据筛选后的3D初始框确定场景中的目标对象的3D检测框。In an optional example, in the case where there are multiple front sights, the determination detection frame module 840 in the present disclosure may first determine the degree of overlap between the 3D initial frames corresponding to the multiple front sights, and then determine the detection frame module 840 screens the 3D initial frame whose overlapping degree is greater than the set threshold; then, the detection frame determination module 840 determines the 3D detection frame of the target object in the scene according to the filtered 3D initial frame.
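One common way to realize this overlap-based filtering is greedy non-maximum suppression. The sketch below uses an axis-aligned bird's-eye-view IoU as a simplified stand-in for the true oriented-frame overlap, which is an approximation chosen for brevity; the threshold and tuple layout are example choices.

# Hedged sketch: greedy overlap filtering of 3D initial frames.
# Each frame is (x, z, l, w, score); rotation is ignored in this simplified IoU.
def bev_iou(a, b):
    ax1, ax2 = a[0] - a[2] / 2, a[0] + a[2] / 2
    az1, az2 = a[1] - a[3] / 2, a[1] + a[3] / 2
    bx1, bx2 = b[0] - b[2] / 2, b[0] + b[2] / 2
    bz1, bz2 = b[1] - b[3] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    iz = max(0.0, min(az2, bz2) - max(az1, bz1))
    inter = iw * iz
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def filter_boxes(boxes, iou_threshold=0.7):
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)   # highest score first
    kept = []
    for box in boxes:
        if all(bev_iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept

print(filter_boxes([(0.0, 0.0, 4.0, 2.0, 0.9), (0.2, 0.1, 4.0, 2.0, 0.6)]))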
In an optional example, the feature extraction module 800, the first semantic segmentation module 810, the foreground point prediction module 820 and the initial frame generation module 830 in the present disclosure may be implemented by the first-stage neural network. In this case, the apparatus of the present disclosure may further include a first training module, configured to train the first-stage neural network to be trained with point cloud data samples carrying 3D annotation frames.

In an optional example, the process in which the first training module trains the first-stage neural network includes the following.

First, the first training module provides the point cloud data samples to the first-stage neural network; feature information of the point cloud data samples is extracted based on the first-stage neural network; the first-stage neural network performs semantic segmentation on the point cloud data samples according to the extracted feature information, predicts, according to the first semantic features of the multiple points obtained by the semantic segmentation, at least one foreground point corresponding to a target object among the multiple points, and generates, according to the first semantic information, a 3D initial frame corresponding to each of the at least one foreground point.

Second, the first training module obtains the loss corresponding to the foreground points and the loss formed by the 3D initial frames relative to the corresponding 3D annotation frames, and adjusts the network parameters of the first-stage neural network according to these losses.

Optionally, the first training module may determine, according to the confidence of the foreground points predicted by the first-stage neural network, a first loss corresponding to the foreground point prediction result. The first training module generates a second loss according to the numbers of the buckets in which the parameters of the 3D initial frames generated for the foreground points are located and the numbers of the buckets in which the parameters of the 3D annotation frame information in the point cloud data samples are located. The first training module generates a third loss according to the offsets, within the corresponding buckets, of the parameters of the 3D initial frames generated for the foreground points and the offsets, within the corresponding buckets, of the parameters of the 3D annotation frame information in the point cloud data samples. The first training module generates a fourth loss according to the offsets of the parameters of the 3D initial frames generated for the foreground points relative to predetermined parameters. The first training module generates a fifth loss according to the offsets of the coordinate parameters of the foreground points relative to the coordinate parameters of the 3D initial frames generated for those foreground points. The first training module adjusts the network parameters of the first-stage neural network according to the obtained first loss, second loss, third loss, fourth loss and fifth loss.
In an optional example, the first submodule, the second submodule and the third submodule in the present disclosure are implemented by the second-stage neural network. In this case, the apparatus of the present disclosure further includes a second training module, configured to train the second-stage neural network to be trained with point cloud data samples carrying 3D annotation frames.

In an optional example, the process in which the second training module trains the second-stage neural network includes the following.

First, the second training module provides the 3D initial frames obtained with the first-stage neural network to the second-stage neural network; feature information of the points in a partial region of the point cloud data samples is obtained based on the second-stage neural network, and semantic segmentation is performed on the points in the partial region according to their feature information to obtain the second semantic features of the points in the partial region; the second-stage neural network determines, according to the first semantic features and the second semantic features of the points in the partial region, the confidence that a 3D initial frame is a target object, and generates a position-corrected 3D initial frame according to the first semantic features and the second semantic features of the points in the partial region.

Second, the second training module obtains the loss corresponding to the confidence that the 3D initial frame is a target object and the loss formed by the position-corrected 3D initial frame relative to the corresponding 3D annotation frame, and adjusts the network parameters of the second-stage neural network according to the obtained losses.

Optionally, the second training module may determine, according to the confidence, predicted by the second-stage neural network, that a 3D initial frame is a target object, a sixth loss corresponding to the prediction result. The second training module generates a seventh loss according to the numbers of the buckets in which the parameters of the position-corrected 3D initial frames, generated by the second-stage neural network and whose overlap with the corresponding 3D annotation frames exceeds the set threshold, are located, and the numbers of the buckets in which the parameters of the 3D annotation frame information in the point cloud data samples are located. The second training module generates an eighth loss according to the offsets, within the corresponding buckets, of the parameters of those position-corrected 3D initial frames and the offsets, within the corresponding buckets, of the parameters of the 3D annotation frame information in the point cloud data samples. The second training module generates a ninth loss according to the offsets of the parameters of those position-corrected 3D initial frames relative to predetermined parameters. The second training module generates a tenth loss according to the offsets of the coordinate parameters of those position-corrected 3D initial frames relative to the coordinate parameters of the center points of the 3D annotation frames. The second training module adjusts the network parameters of the second-stage neural network according to the sixth loss, the seventh loss, the eighth loss, the ninth loss and the tenth loss.
FIG. 9 is a schematic structural diagram of an embodiment of the vehicle intelligent control apparatus of the present disclosure. As shown in FIG. 9, the apparatus of this embodiment includes a target object 3D detection apparatus 900 and a first control module 910. The target object 3D detection apparatus 900 is configured to obtain the 3D detection frame of a target object based on point cloud data; its specific structure and operations are as described in the above apparatus and method embodiments and are not detailed again here. The first control module 910 is mainly configured to generate, according to the 3D detection frame, an instruction for controlling the vehicle or early-warning prompt information. For details, reference may be made to the relevant description in the above method embodiments, which is not repeated here.

FIG. 10 shows the obstacle avoidance navigation apparatus of the present disclosure. As shown in FIG. 10, the apparatus of this embodiment includes a target object 3D detection apparatus 1000 and a second control module 1010. The target object 3D detection apparatus 1000 is configured to obtain the 3D detection frame of a target object based on point cloud data; its specific structure and operations are as described in the above apparatus and method embodiments and are not detailed again here. The second control module 1010 is mainly configured to generate, according to the 3D detection frame, an instruction or early-warning prompt information for performing obstacle avoidance navigation control on the robot. For details, reference may be made to the relevant description in the above method embodiments, which is not repeated here.
Exemplary Device
FIG. 11 shows an exemplary device 1100 suitable for implementing the present disclosure. The device 1100 may be a control system/electronic system configured in a vehicle, a mobile terminal (for example, a smart mobile phone), a personal computer (PC, for example, a desktop computer or a notebook computer), a tablet computer, a server, or the like. In FIG. 11, the device 1100 includes one or more processors, a communication part, and the like. The one or more processors may be one or more central processing units (CPUs) 1101 and/or one or more graphics processors (GPUs) 1113 that perform visual tracking with a neural network. The processors may perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 1102 or executable instructions loaded from a storage section 1108 into a random access memory (RAM) 1103. The communication part 1112 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (Infiniband) network card. The processor may communicate with the ROM 1102 and/or the RAM 1103 to execute the executable instructions, is connected to the communication part 1112 through a bus 1104, and communicates with other target devices via the communication part 1112, thereby completing the corresponding steps of the present disclosure. For the operations performed by the above instructions, reference may be made to the relevant description in the above method embodiments, which is not detailed here. The RAM 1103 may also store various programs and data required for the operation of the device. The CPU 1101, the ROM 1102 and the RAM 1103 are connected to one another through the bus 1104.

When the RAM 1103 is present, the ROM 1102 is an optional module. The RAM 1103 stores executable instructions, or executable instructions are written into the ROM 1102 at runtime, and the executable instructions cause the central processing unit 1101 to perform the steps of the above target object 3D detection method. An input/output (I/O) interface 1105 is also connected to the bus 1104. The communication part 1112 may be integrated, or may be provided with multiple submodules (for example, multiple IB network cards) that are respectively connected to the bus. The following components are connected to the I/O interface 1105: an input section 1106 including a keyboard, a mouse and the like; an output section 1107 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker and the like; a storage section 1108 including a hard disk and the like; and a communication section 1109 including a network interface card such as a LAN card or a modem. The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the I/O interface 1105 as needed. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 1110 as needed, so that a computer program read from it is installed into the storage section 1108 as needed.

It should be specially noted that the architecture shown in FIG. 11 is only one optional implementation. In practice, the number and types of the components in FIG. 11 may be selected, removed, added or replaced according to actual needs. Different functional components may also be arranged separately or integrally; for example, the GPU 1113 and the CPU 1101 may be arranged separately, or the GPU 1113 may be integrated on the CPU 1101, and the communication part 1112 may be arranged separately or integrated on the CPU 1101 or the GPU 1113. These alternative implementations all fall within the protection scope of the present disclosure. In particular, according to the embodiments of the present disclosure, the process described below with reference to the flowchart may be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product, which includes a computer program tangibly embodied on a machine-readable medium; the computer program includes program code for performing the steps shown in the flowchart, and the program code may include instructions corresponding to the steps of the method provided by the present disclosure. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1109 and/or installed from the removable medium 1111. When the computer program is executed by the central processing unit (CPU) 1101, the instructions described in the present disclosure for implementing the above corresponding steps are executed.
In one or more optional implementations, the embodiments of the present disclosure further provide a computer program product for storing computer-readable instructions which, when executed, cause a computer to perform the target object 3D detection method described in any of the above embodiments.

The computer program product may be implemented by hardware, software, or a combination thereof. In an optional example, the computer program product is embodied as a computer storage medium; in another optional example, the computer program product is embodied as a software product, such as a software development kit (SDK).

In one or more optional implementations, the embodiments of the present disclosure further provide another target object 3D detection method and its corresponding apparatus, electronic device, computer storage medium, computer program and computer program product, where the method includes: a first apparatus sends a target object 3D detection indication to a second apparatus, the indication causing the second apparatus to perform the target object 3D detection method in any of the above possible embodiments; and the first apparatus receives the target object 3D detection result sent by the second apparatus.

In some embodiments, the target object 3D detection indication may specifically be an invocation instruction, and the first apparatus may instruct the second apparatus, by means of invocation, to perform the target object 3D detection operation; accordingly, in response to receiving the invocation instruction, the second apparatus may perform the steps and/or processes in any embodiment of the above target object 3D detection method.
应理解,本公开实施例中的“第一”、“第二”等术语仅仅是为了区分,而不应理解成对本公开实施例的限定。还应理解,在本公开中,“多个”可以指两个或两个以上,“至少一个”可以指一个、两个或两个以上。还应理解,对于本公开中提及的任一部件、数据或结构,在没有明确限定或者在前后文给出相反启示的情况下,一般可以理解为一个或多个。还应理解,本公开对各个实施例的描述着重强调各个实施例之间的不同之处,其相同或相似之处可以相互参考,为了简洁,不再一一赘述。It should be understood that the terms “first” and “second” in the embodiments of the present disclosure are only for distinction, and should not be construed as limiting the embodiments of the present disclosure. It should also be understood that in the present disclosure, “plurality” may refer to two or more, and “at least one” may refer to one, two, or more than two. It should also be understood that any component, data, or structure mentioned in the present disclosure can be generally understood as one or more, unless it is explicitly defined or given the opposite enlightenment in the context. It should also be understood that the description of the embodiments of the present disclosure emphasizes the differences between the embodiments, and the same or similarities can be referred to each other, and for the sake of brevity, they will not be described one by one.
The methods and apparatuses, electronic devices, and computer-readable storage media of the present disclosure may be implemented in many ways, for example, by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the methods is for illustration only, and the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated. In addition, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, and these programs include machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the methods according to the present disclosure.
The description of the present disclosure is given for the sake of example and description and is not exhaustive, nor does it limit the present disclosure to the disclosed forms. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were selected and described in order to better explain the principles and practical applications of the present disclosure, and to enable those of ordinary skill in the art to understand the embodiments of the present disclosure and thereby design various embodiments, with various modifications, suited to particular uses.

Claims (43)

  1. A 3D detection method for a target object, characterized in that it comprises:
    extracting feature information of acquired point cloud data of a scene;
    performing semantic segmentation on the point cloud data according to the feature information of the point cloud data to obtain first semantic information of a plurality of points in the point cloud data;
    predicting, according to the first semantic information, at least one foreground point corresponding to the target object among the plurality of points;
    generating, according to the first semantic information, a 3D initial frame corresponding to each of the at least one foreground point;
    determining a 3D detection frame of the target object in the scene according to the 3D initial frame.
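For illustration only, the following minimal Python sketch shows how the first-stage prediction described in claim 1 might look in code. The per-point MLP backbone, the 0.5 foreground threshold, and the (x, y, z, h, w, l, ry) box encoding are assumptions of the example and are not prescribed by the claims.

```python
# Toy "first stage": score every point as foreground/background (first semantic
# information) and regress one 3D initial frame per foreground point.
import torch
import torch.nn as nn

class ToyStageOne(nn.Module):
    def __init__(self, in_dim=3, feat_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                      nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.seg_head = nn.Linear(feat_dim, 1)   # foreground score per point
        self.box_head = nn.Linear(feat_dim, 7)   # (x, y, z, h, w, l, ry) per point

    def forward(self, points):                   # points: (N, 3)
        feats = self.backbone(points)            # point-wise feature information
        fg_score = torch.sigmoid(self.seg_head(feats)).squeeze(-1)
        boxes = self.box_head(feats)             # one 3D initial frame per point
        return fg_score, boxes

points = torch.rand(1024, 3)                     # stand-in for one scene's point cloud
fg_score, boxes = ToyStageOne()(points)
fg_mask = fg_score > 0.5                         # predicted foreground points
initial_frames = boxes[fg_mask]                  # 3D initial frames to be refined later
print(initial_frames.shape)
```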
  2. The method according to claim 1, wherein determining the 3D detection frame of the target object in the scene according to the 3D initial frame comprises:
    acquiring feature information of points within a partial area of the point cloud data, wherein the partial area includes at least one said 3D initial frame;
    performing semantic segmentation on the points within the partial area according to the feature information of the points within the partial area to obtain second semantic information of the points within the partial area;
    determining the 3D detection frame of the target object in the scene according to the first semantic information and the second semantic information of the points within the partial area.
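For illustration only, a minimal sketch of the pooling step in claim 2, i.e. gathering the points that fall inside the partial area around one 3D initial frame. For brevity the membership test ignores the frame's heading, which is an assumption of the example; an oriented test would first rotate the points into the frame's coordinate system.

```python
# Gather the points of a cloud that lie inside one 3D frame (axis-aligned test).
import numpy as np

def points_in_frame(points, frame):
    """points: (N, 3); frame = (cx, cy, cz, h, w, l, ry). Returns the points inside it."""
    cx, cy, cz, h, w, l, _ = frame
    half = np.array([l / 2, w / 2, h / 2])       # assumed mapping of extents to axes
    inside = np.all(np.abs(points - np.array([cx, cy, cz])) <= half, axis=1)
    return points[inside]

cloud = np.random.uniform(-10, 10, size=(2048, 3))
pooled = points_in_frame(cloud, np.array([1.0, 0.5, 0.0, 1.5, 1.6, 3.9, 0.2]))
print(pooled.shape)                              # points fed to the second segmentation
```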
  3. The method according to claim 2, wherein determining the 3D detection frame of the target object in the scene according to the first semantic information and the second semantic information of the points within the partial area comprises:
    correcting the 3D initial frame according to the first semantic information and the second semantic information of the points within the partial area to obtain a corrected 3D initial frame;
    determining the 3D detection frame of the target object in the scene according to the corrected 3D initial frame.
  4. The method according to claim 2, wherein determining the 3D detection frame of the target object in the scene according to the first semantic information and the second semantic information of the points within the partial area comprises:
    determining, according to the first semantic information and the second semantic information of the points within the partial area, a confidence that the 3D initial frame corresponds to the target object;
    determining the 3D detection frame of the target object in the scene according to the 3D initial frame and its confidence.
  5. The method according to claim 2, wherein determining the 3D detection frame of the target object in the scene according to the first semantic information and the second semantic information of the points within the partial area comprises:
    correcting the 3D initial frame according to the first semantic information and the second semantic information of the points within the partial area to obtain a corrected 3D initial frame;
    determining, according to the first semantic information and the second semantic information of the points within the partial area, a confidence that the corrected 3D initial frame corresponds to the target object;
    determining the 3D detection frame of the target object in the scene according to the corrected 3D initial frame and its confidence.
  6. The method according to any one of claims 2 to 5, characterized in that the partial area comprises: a 3D extension frame obtained by performing edge extension on the 3D initial frame according to a predetermined strategy.
  7. The method according to claim 6, characterized in that the 3D extension frame comprises:
    a 3D extension frame containing the 3D initial frame, formed by performing 3D spatial extension on the 3D initial frame according to a preset X-axis direction increment, a preset Y-axis direction increment and/or a preset Z-axis direction increment.
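For illustration only, a minimal sketch of the edge extension in claim 7: the 3D initial frame is enlarged by fixed per-axis increments while keeping its center. The 0.5 m increments, the (x, y, z, h, w, l, ry) encoding, and the mapping of increments to the l/w/h extents are assumptions of the example.

```python
# Enlarge a 3D initial frame into a 3D extension frame by preset per-axis increments.
import numpy as np

def extend_frame(frame, dx=0.5, dy=0.5, dz=0.5):
    """frame = (cx, cy, cz, h, w, l, ry); returns an enlarged frame with the same center."""
    cx, cy, cz, h, w, l, ry = frame
    return np.array([cx, cy, cz, h + 2 * dz, w + 2 * dy, l + 2 * dx, ry])

initial_frame = np.array([12.0, -1.6, 0.8, 1.5, 1.6, 3.9, 0.3])
extension_frame = extend_frame(initial_frame)
print(extension_frame)                            # same center, larger extents
```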
  8. The method according to claim 6 or 7, characterized in that performing semantic segmentation on the points within the partial area according to the feature information of the points within the partial area to obtain the second semantic information of the points within the partial area comprises:
    performing, according to a preset target position of the 3D extension frame, coordinate transformation on the coordinate information of the points of the point cloud data located within the 3D extension frame, and acquiring feature information of the coordinate-transformed points;
    performing, according to the feature information of the coordinate-transformed points, semantic segmentation based on the 3D extension frame to obtain second semantic features of the points within the 3D extension frame.
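For illustration only, a minimal sketch of one possible coordinate transformation for claim 8: the points inside the 3D extension frame are moved into a frame-centered coordinate system by translating the frame center to the origin and undoing its yaw. Using this particular "canonical" pose as the preset target position is an assumption of the example; the claims do not mandate a specific transform.

```python
# Transform points inside a 3D extension frame into frame-local coordinates.
import numpy as np

def canonical_transform(points, frame):
    """points: (N, 3); frame = (cx, cy, cz, h, w, l, ry). Returns frame-local coordinates."""
    cx, cy, cz, h, w, l, ry = frame
    shifted = points - np.array([cx, cy, cz])     # translate frame center to the origin
    c, s = np.cos(ry), np.sin(ry)
    rot = np.array([[c, 0.0, -s],                 # rotation by -ry about the vertical axis
                    [0.0, 1.0, 0.0],
                    [s, 0.0, c]])
    return shifted @ rot.T

pts = np.random.rand(128, 3) + np.array([12.0, -1.6, 0.8])
local_pts = canonical_transform(pts, np.array([12.0, -1.6, 0.8, 1.5, 1.6, 3.9, 0.3]))
print(local_pts.mean(axis=0))                     # roughly centered on the origin
```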
  9. The method according to claim 8, characterized in that performing the semantic segmentation based on the 3D extension frame according to the feature information of the coordinate-transformed points comprises:
    performing the semantic segmentation based on the 3D extension frame according to a mask of the foreground points and the feature information of the coordinate-transformed points.
  10. The method according to claim 1, wherein there are a plurality of foreground points, and determining the 3D detection frame of the target object in the scene according to the 3D initial frame comprises:
    determining a degree of overlap between the 3D initial frames corresponding to the plurality of foreground points;
    filtering the 3D initial frames whose degree of overlap is greater than a set threshold;
    determining the 3D detection frame of the target object in the scene according to the filtered 3D initial frames.
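For illustration only, a minimal sketch of the overlap-based filtering in claim 10 as a simple non-maximum suppression. For brevity the degree of overlap is computed here as an axis-aligned bird's-eye-view IoU that ignores heading, and the 0.7 threshold is arbitrary; both are assumptions of the example, and an oriented-box IoU is more typical in practice.

```python
# Non-maximum suppression over 3D initial frames using a bird's-eye-view IoU.
import numpy as np

def bev_iou(a, b):
    """a, b = (cx, cz, l, w): axis-aligned footprints on the ground plane."""
    ax1, ax2 = a[0] - a[2] / 2, a[0] + a[2] / 2
    az1, az2 = a[1] - a[3] / 2, a[1] + a[3] / 2
    bx1, bx2 = b[0] - b[2] / 2, b[0] + b[2] / 2
    bz1, bz2 = b[1] - b[3] / 2, b[1] + b[3] / 2
    inter = max(0.0, min(ax2, bx2) - max(ax1, bx1)) * max(0.0, min(az2, bz2) - max(az1, bz1))
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(footprints, scores, thresh=0.7):
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = [j for j in order[1:] if bev_iou(footprints[i], footprints[j]) <= thresh]
        order = np.array(rest, dtype=int)
    return keep

frames = np.array([[10.0, 5.0, 3.9, 1.6], [10.2, 5.1, 3.9, 1.6], [20.0, 8.0, 3.9, 1.6]])
print(nms(frames, np.array([0.9, 0.8, 0.7])))     # the heavily overlapping frame is removed
```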
  11. The method according to any one of claims 1 to 10, characterized in that the extracting of the feature information of the acquired point cloud data of the scene, the performing of semantic segmentation on the point cloud data according to the feature information of the point cloud data to obtain the first semantic information of the plurality of points in the point cloud data, the predicting, according to the first semantic information, of the at least one foreground point corresponding to the target object among the plurality of points, and the generating, according to the first semantic information, of the 3D initial frame corresponding to each of the at least one foreground point are implemented by a first-stage neural network;
    the first-stage neural network is obtained by training with point cloud data samples carrying 3D annotation frames.
  12. The method according to claim 11, characterized in that the training process of the first-stage neural network comprises:
    providing point cloud data samples to the first-stage neural network, extracting feature information of the point cloud data samples based on the first-stage neural network, performing semantic segmentation on the point cloud data samples according to the feature information of the point cloud data samples, predicting, according to first semantic features of a plurality of points obtained by the semantic segmentation, at least one foreground point corresponding to the target object among the plurality of points, and generating, according to the first semantic information, a 3D initial frame corresponding to each of the at least one foreground point;
    acquiring a loss corresponding to the foreground points and a loss formed by the 3D initial frame relative to the corresponding 3D annotation frame, and adjusting network parameters in the first-stage neural network according to the losses.
  13. The method according to claim 12, characterized in that acquiring the loss corresponding to the foreground points and the loss formed by the 3D initial frame relative to the corresponding 3D annotation frame, and adjusting the network parameters in the first-stage neural network according to the losses, comprises:
    determining a first loss corresponding to the foreground point prediction result according to the confidence of the foreground points predicted by the first-stage neural network;
    generating a second loss according to the number of the bin in which a parameter of the 3D initial frame generated for the foreground point falls and the number of the bin in which the corresponding parameter of the 3D annotation frame information in the point cloud data sample falls;
    generating a third loss according to the offset, within the corresponding bin, of the parameter of the 3D initial frame generated for the foreground point and the offset, within the corresponding bin, of the parameter of the 3D annotation frame information in the point cloud data sample;
    generating a fourth loss according to the offset of a parameter of the 3D initial frame generated for the foreground point relative to a predetermined parameter;
    generating a fifth loss according to the offset of the coordinate parameters of the foreground point relative to the coordinate parameters of the 3D initial frame generated for that foreground point;
    adjusting the network parameters of the first-stage neural network according to the first loss, the second loss, the third loss, the fourth loss, and the fifth loss.
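For illustration only, a minimal sketch of the bin-based regression that the second and third losses describe: a ground-truth box parameter (here a single center offset) is classified into a bin and regressed within that bin. The bin width, search range, loss functions, and loss weights are assumptions of the example.

```python
# Bin classification loss + in-bin residual regression loss for one box parameter.
import torch
import torch.nn.functional as F

def bin_targets(delta, search_range=3.0, bin_size=0.5):
    """delta: (N,) ground-truth offsets. Returns (bin index, normalised in-bin residual)."""
    shifted = torch.clamp(delta + search_range, min=0.0, max=2 * search_range - 1e-4)
    bin_idx = (shifted / bin_size).long()                 # bin in which the parameter falls
    residual = shifted - (bin_idx.float() + 0.5) * bin_size
    return bin_idx, residual / bin_size

n, num_bins = 4, 12
bin_logits = torch.randn(n, num_bins, requires_grad=True)  # predicted bin scores
res_pred = torch.randn(n, num_bins, requires_grad=True)    # predicted per-bin residuals
gt_delta = torch.tensor([0.2, -1.3, 2.4, 0.0])

gt_bin, gt_res = bin_targets(gt_delta)
second_loss = F.cross_entropy(bin_logits, gt_bin)           # bin-number classification
third_loss = F.smooth_l1_loss(res_pred.gather(1, gt_bin[:, None]).squeeze(1), gt_res)
(second_loss + third_loss).backward()
print(second_loss.item(), third_loss.item())
```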
  14. The method according to any one of claims 2 to 9, characterized in that the acquiring of the feature information of the points within the partial area of the point cloud data, the performing of semantic segmentation on the points within the partial area according to the feature information of the points within the partial area to obtain the second semantic information of the points within the partial area, and the determining of the 3D detection frame of the target object in the scene according to the first semantic information and the second semantic information of the points within the partial area are implemented by a second-stage neural network;
    the second-stage neural network is obtained by training with point cloud data samples carrying 3D annotation frames.
  15. The method according to claim 14, characterized in that the training process of the second-stage neural network comprises:
    providing the 3D initial frame to the second-stage neural network, acquiring, based on the second-stage neural network, feature information of points within a partial area of the point cloud data sample, and performing semantic segmentation on the points within the partial area of the point cloud data sample according to the feature information of the points within the partial area of the point cloud data sample to obtain second semantic features of the points within the partial area of the point cloud data sample; determining, according to the first semantic features and the second semantic features of the points within the partial area of the point cloud data sample, a confidence that the 3D initial frame is the target object, and generating a position-corrected 3D initial frame according to the first semantic features and the second semantic features of the points within the partial area of the point cloud data sample;
    acquiring a loss corresponding to the confidence that the 3D initial frame is the target object and a loss formed by the position-corrected 3D initial frame relative to the corresponding 3D annotation frame, and adjusting network parameters in the second-stage neural network according to the losses.
  16. The method according to claim 15, characterized in that acquiring the loss corresponding to the confidence that the 3D initial frame is the target object and the loss formed by the position-corrected 3D initial frame relative to the corresponding 3D annotation frame, and adjusting the network parameters in the second-stage neural network according to the losses, comprises:
    determining a sixth loss corresponding to the prediction result according to the confidence, predicted by the second-stage neural network, that the 3D initial frame is the target object;
    generating a seventh loss according to the number of the bin in which a parameter of the position-corrected 3D initial frame, generated by the second-stage neural network and whose degree of overlap with the corresponding 3D annotation frame exceeds a set threshold, falls and the number of the bin in which the corresponding parameter of the 3D annotation frame information in the point cloud data sample falls;
    generating an eighth loss according to the offset, within the corresponding bin, of the parameter of the position-corrected 3D initial frame, generated by the second-stage neural network and whose degree of overlap with the corresponding 3D annotation frame exceeds the set threshold, and the offset, within the corresponding bin, of the parameter of the 3D annotation frame information in the point cloud data sample;
    generating a ninth loss according to the offset of a parameter of the position-corrected 3D initial frame, generated by the second-stage neural network and whose degree of overlap with the corresponding 3D annotation frame exceeds the set threshold, relative to a predetermined parameter;
    generating a tenth loss according to the offset of the coordinate parameters of the position-corrected 3D initial frame, generated by the second-stage neural network and whose degree of overlap with the corresponding 3D annotation frame exceeds the set threshold, relative to the coordinate parameters of the center point of the 3D annotation frame;
    adjusting the network parameters of the second-stage neural network according to the sixth loss, the seventh loss, the eighth loss, the ninth loss, and the tenth loss.
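For illustration only, a minimal sketch of how the sixth loss could be formed: each 3D initial frame is labelled by whether its overlap with the corresponding 3D annotation frame exceeds a threshold, and the predicted confidence is trained against that label. The 0.6 threshold and the use of binary cross-entropy are assumptions of the example.

```python
# Confidence loss for the second-stage network from overlap-derived labels.
import torch
import torch.nn.functional as F

ious = torch.tensor([0.82, 0.45, 0.71, 0.10])         # overlap of each frame with its annotation frame
conf_logits = torch.randn(4, requires_grad=True)      # confidences predicted by stage two
labels = (ious > 0.6).float()                          # frames overlapping enough count as the target object
sixth_loss = F.binary_cross_entropy_with_logits(conf_logits, labels)
sixth_loss.backward()
print(sixth_loss.item())
```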
  17. A vehicle intelligent control method, characterized in that the method comprises:
    obtaining a 3D detection frame of a target object by using the target object 3D detection method according to any one of claims 1 to 16;
    generating, according to the 3D detection frame, an instruction for controlling a vehicle or early-warning prompt information.
  18. The method according to claim 17, wherein generating, according to the 3D detection frame, the instruction for controlling the vehicle or the early-warning prompt information comprises:
    determining, according to the 3D detection frame, at least one of the following information of the target object: the spatial position of the target object in the scene, its size, its distance to the vehicle, and its orientation relative to the vehicle;
    generating, according to the determined at least one piece of information, the instruction for controlling the vehicle or the early-warning prompt information.
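For illustration only, a minimal sketch of claim 18: the distance and relative bearing of the target object are read off its 3D detection frame (assumed here to be expressed in a vehicle-fixed coordinate system with x forward and y left) and turned into a toy warning. The coordinate convention and the 5 m threshold are assumptions of the example.

```python
# Derive distance and relative orientation from a 3D detection frame, then warn if close.
import numpy as np

def describe_detection(frame):
    """frame = (cx, cy, cz, h, w, l, ry) in vehicle coordinates."""
    cx, cy, _ = frame[:3]
    distance = float(np.hypot(cx, cy))                 # distance to the vehicle
    bearing = float(np.degrees(np.arctan2(cy, cx)))    # relative orientation to the vehicle
    return distance, bearing

distance, bearing = describe_detection(np.array([4.2, -1.0, 0.0, 1.5, 1.6, 3.9, 0.1]))
if distance < 5.0:
    print(f"warning: object {distance:.1f} m away at {bearing:.0f} deg")
```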
  19. An obstacle avoidance navigation method, characterized in that the method comprises:
    obtaining a 3D detection frame of a target object by using the target object 3D detection method according to any one of claims 1 to 16;
    generating, according to the 3D detection frame, an instruction for performing obstacle avoidance navigation control on a robot or early-warning prompt information.
  20. The method according to claim 19, wherein generating, according to the 3D detection frame, the instruction for performing obstacle avoidance navigation control on the robot or the early-warning prompt information comprises:
    determining, according to the 3D detection frame, at least one of the following information of the target object: the spatial position of the target object in the scene, its size, its distance to the robot, and its orientation relative to the robot;
    generating, according to the determined at least one piece of information, the instruction for performing obstacle avoidance navigation control on the robot or the early-warning prompt information.
  21. A target object 3D detection apparatus, characterized in that it comprises:
    a feature extraction module, configured to extract feature information of acquired point cloud data of a scene;
    a first semantic segmentation module, configured to perform semantic segmentation on the point cloud data according to the feature information of the point cloud data to obtain first semantic information of a plurality of points in the point cloud data;
    a foreground point prediction module, configured to predict, according to the first semantic information, at least one foreground point corresponding to a target object among the plurality of points;
    an initial frame generation module, configured to generate, according to the first semantic information, a 3D initial frame corresponding to each of the at least one foreground point;
    a detection frame determination module, configured to determine a 3D detection frame of the target object in the scene according to the 3D initial frame.
  22. The apparatus according to claim 21, wherein the detection frame determination module further comprises:
    a first sub-module, configured to acquire feature information of points within a partial area of the point cloud data, wherein the partial area includes at least one said 3D initial frame;
    a second sub-module, configured to perform semantic segmentation on the points within the partial area according to the feature information of the points within the partial area to obtain second semantic information of the points within the partial area;
    a third sub-module, configured to determine the 3D detection frame of the target object in the scene according to the first semantic information and the second semantic information of the points within the partial area.
  23. The apparatus according to claim 22, wherein the third sub-module comprises:
    a fourth sub-module, configured to correct the 3D initial frame according to the first semantic information and the second semantic information of the points within the partial area to obtain a corrected 3D initial frame;
    a fifth sub-module, configured to determine the 3D detection frame of the target object in the scene according to the corrected 3D initial frame.
  24. The apparatus according to claim 22, wherein the third sub-module is further configured to:
    determine, according to the first semantic information and the second semantic information of the points within the partial area, a confidence that the 3D initial frame corresponds to the target object;
    determine the 3D detection frame of the target object in the scene according to the 3D initial frame and its confidence.
  25. The apparatus according to claim 22, wherein the third sub-module comprises:
    a fourth sub-module, configured to correct the 3D initial frame according to the first semantic information and the second semantic information of the points within the partial area to obtain a corrected 3D initial frame;
    a sixth sub-module, configured to determine, according to the first semantic information and the second semantic information of the points within the partial area, a confidence that the corrected 3D initial frame corresponds to the target object;
    a seventh sub-module, configured to determine the 3D detection frame of the target object in the scene according to the corrected 3D initial frame and its confidence.
  26. The apparatus according to any one of claims 22 to 25, characterized in that the partial area comprises: a 3D extension frame obtained by performing edge extension on the 3D initial frame according to a predetermined strategy.
  27. The apparatus according to claim 26, characterized in that the 3D extension frame comprises:
    a 3D extension frame containing the 3D initial frame, formed by performing 3D spatial extension on the 3D initial frame according to a preset X-axis direction increment, a preset Y-axis direction increment and/or a preset Z-axis direction increment.
  28. The apparatus according to claim 26 or 27, characterized in that the second sub-module comprises:
    an eighth sub-module, configured to perform, according to a preset target position of the 3D extension frame, coordinate transformation on the coordinate information of the points of the point cloud data located within the 3D extension frame, and acquire feature information of the coordinate-transformed points;
    a ninth sub-module, configured to perform, according to the feature information of the coordinate-transformed points, semantic segmentation based on the 3D extension frame to obtain second semantic features of the points within the 3D extension frame.
  29. The apparatus according to claim 28, characterized in that the ninth sub-module is further configured to:
    perform the semantic segmentation based on the 3D extension frame according to a mask of the foreground points and the feature information of the coordinate-transformed points.
  30. The apparatus according to claim 21, wherein there are a plurality of foreground points, and the detection frame determination module is further configured to:
    determine a degree of overlap between the 3D initial frames corresponding to the plurality of foreground points;
    filter the 3D initial frames whose degree of overlap is greater than a set threshold;
    determine the 3D detection frame of the target object in the scene according to the filtered 3D initial frames.
  31. The apparatus according to any one of claims 21 to 30, characterized in that the feature extraction module, the first semantic segmentation module, the foreground point prediction module, and the initial frame generation module are implemented by a first-stage neural network, and the first-stage neural network is obtained by a first training module through training with point cloud data samples carrying 3D annotation frames.
  32. The apparatus according to claim 31, characterized in that the first training module is configured to:
    provide point cloud data samples to the first-stage neural network, extract feature information of the point cloud data samples based on the first-stage neural network, perform semantic segmentation on the point cloud data samples according to the feature information of the point cloud data samples, predict, according to first semantic features of a plurality of points obtained by the semantic segmentation, at least one foreground point corresponding to the target object among the plurality of points, and generate, according to the first semantic information, a 3D initial frame corresponding to each of the at least one foreground point;
    acquire a loss corresponding to the foreground points and a loss formed by the 3D initial frame relative to the corresponding 3D annotation frame, and adjust network parameters in the first-stage neural network according to the losses.
  33. The apparatus according to claim 32, characterized in that the first training module is further configured to:
    determine a first loss corresponding to the foreground point prediction result according to the confidence of the foreground points predicted by the first-stage neural network;
    generate a second loss according to the number of the bin in which a parameter of the 3D initial frame generated for the foreground point falls and the number of the bin in which the corresponding parameter of the 3D annotation frame information in the point cloud data sample falls;
    generate a third loss according to the offset, within the corresponding bin, of the parameter of the 3D initial frame generated for the foreground point and the offset, within the corresponding bin, of the parameter of the 3D annotation frame information in the point cloud data sample;
    generate a fourth loss according to the offset of a parameter of the 3D initial frame generated for the foreground point relative to a predetermined parameter;
    generate a fifth loss according to the offset of the coordinate parameters of the foreground point relative to the coordinate parameters of the 3D initial frame generated for that foreground point;
    adjust the network parameters of the first-stage neural network according to the first loss, the second loss, the third loss, the fourth loss, and the fifth loss.
  34. The apparatus according to any one of claims 22 to 29, characterized in that the first sub-module, the second sub-module, and the third sub-module are implemented by a second-stage neural network, and the second-stage neural network is obtained by a second training module through training with point cloud data samples carrying 3D annotation frames.
  35. The apparatus according to claim 34, characterized in that the second training module is configured to:
    provide the 3D initial frame to the second-stage neural network, acquire, based on the second-stage neural network, feature information of points within a partial area of the point cloud data sample, and perform semantic segmentation on the points within the partial area of the point cloud data sample according to the feature information of the points within the partial area of the point cloud data sample to obtain second semantic features of the points within the partial area of the point cloud data sample; determine, according to the first semantic features and the second semantic features of the points within the partial area of the point cloud data sample, a confidence that the 3D initial frame is the target object, and generate a position-corrected 3D initial frame according to the first semantic features and the second semantic features of the points within the partial area of the point cloud data sample;
    acquire a loss corresponding to the confidence that the 3D initial frame is the target object and a loss formed by the position-corrected 3D initial frame relative to the corresponding 3D annotation frame, and adjust network parameters in the second-stage neural network according to the losses.
  36. The apparatus according to claim 35, characterized in that the second training module is further configured to:
    determine a sixth loss corresponding to the prediction result according to the confidence, predicted by the second-stage neural network, that the 3D initial frame is the target object;
    generate a seventh loss according to the number of the bin in which a parameter of the position-corrected 3D initial frame, generated by the second-stage neural network and whose degree of overlap with the corresponding 3D annotation frame exceeds a set threshold, falls and the number of the bin in which the corresponding parameter of the 3D annotation frame information in the point cloud data sample falls;
    generate an eighth loss according to the offset, within the corresponding bin, of the parameter of the position-corrected 3D initial frame, generated by the second-stage neural network and whose degree of overlap with the corresponding 3D annotation frame exceeds the set threshold, and the offset, within the corresponding bin, of the parameter of the 3D annotation frame information in the point cloud data sample;
    generate a ninth loss according to the offset of a parameter of the position-corrected 3D initial frame, generated by the second-stage neural network and whose degree of overlap with the corresponding 3D annotation frame exceeds the set threshold, relative to a predetermined parameter;
    generate a tenth loss according to the offset of the coordinate parameters of the position-corrected 3D initial frame, generated by the second-stage neural network and whose degree of overlap with the corresponding 3D annotation frame exceeds the set threshold, relative to the coordinate parameters of the center point of the 3D annotation frame;
    adjust the network parameters of the second-stage neural network according to the sixth loss, the seventh loss, the eighth loss, the ninth loss, and the tenth loss.
  37. A vehicle intelligent control apparatus, characterized in that the apparatus comprises:
    the target object 3D detection apparatus according to any one of claims 21 to 36, configured to obtain a 3D detection frame of a target object;
    a first control module, configured to generate, according to the 3D detection frame, an instruction for controlling a vehicle or early-warning prompt information.
  38. The apparatus according to claim 37, wherein the first control module is further configured to:
    determine, according to the 3D detection frame, at least one of the following information of the target object: the spatial position of the target object in the scene, its size, its distance to the vehicle, and its orientation relative to the vehicle;
    generate, according to the determined at least one piece of information, the instruction for controlling the vehicle or the early-warning prompt information.
  39. An obstacle avoidance navigation apparatus, characterized in that the apparatus comprises:
    the target object 3D detection apparatus according to any one of claims 21 to 36, configured to obtain a 3D detection frame of a target object;
    a second control module, configured to generate, according to the 3D detection frame, an instruction for performing obstacle avoidance navigation control on a robot or early-warning prompt information.
  40. The apparatus according to claim 39, wherein the second control module is further configured to:
    determine, according to the 3D detection frame, at least one of the following information of the target object: the spatial position of the target object in the scene, its size, its distance to the robot, and its orientation relative to the robot;
    generate, according to the determined at least one piece of information, the instruction for performing obstacle avoidance navigation control on the robot or the early-warning prompt information.
  41. An electronic device, comprising:
    a memory, configured to store a computer program;
    a processor, configured to execute the computer program stored in the memory, wherein when the computer program is executed, the method according to any one of claims 1-20 is implemented.
  42. A computer-readable storage medium on which a computer program is stored, wherein when the computer program is executed by a processor, the method according to any one of claims 1-20 is implemented.
  43. A computer program, comprising computer instructions, wherein when the computer instructions run in a processor of a device, the method according to any one of claims 1-20 is implemented.
PCT/CN2019/118126 2018-11-29 2019-11-13 3d detection method and apparatus for target object, and medium and device WO2020108311A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2021526222A JP2022515591A (en) 2018-11-29 2019-11-13 3D detection method, device, medium and device of target object
KR1020217015013A KR20210078529A (en) 2018-11-29 2019-11-13 Target object 3D detection method, apparatus, medium and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811446588.8A CN109635685B (en) 2018-11-29 2018-11-29 Target object 3D detection method, device, medium and equipment
CN201811446588.8 2018-11-29

Publications (1)

Publication Number Publication Date
WO2020108311A1 true WO2020108311A1 (en) 2020-06-04

Family

ID=66070171

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118126 WO2020108311A1 (en) 2018-11-29 2019-11-13 3d detection method and apparatus for target object, and medium and device

Country Status (4)

Country Link
JP (1) JP2022515591A (en)
KR (1) KR20210078529A (en)
CN (1) CN109635685B (en)
WO (1) WO2020108311A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635685B (en) * 2018-11-29 2021-02-12 北京市商汤科技开发有限公司 Target object 3D detection method, device, medium and equipment
CN112101066B (en) * 2019-06-17 2024-03-08 商汤集团有限公司 Target detection method and device, intelligent driving method and device and storage medium
WO2020258218A1 (en) * 2019-06-28 2020-12-30 深圳市大疆创新科技有限公司 Obstacle detection method and device for mobile platform, and mobile platform
CN110458112B (en) * 2019-08-14 2020-11-20 上海眼控科技股份有限公司 Vehicle detection method and device, computer equipment and readable storage medium
CN112444784B (en) * 2019-08-29 2023-11-28 北京市商汤科技开发有限公司 Three-dimensional target detection and neural network training method, device and equipment
CN110751090B (en) * 2019-10-18 2022-09-20 宁波博登智能科技有限公司 Three-dimensional point cloud labeling method and device and electronic equipment
CN110991468B (en) * 2019-12-13 2023-12-19 深圳市商汤科技有限公司 Three-dimensional target detection and intelligent driving method, device and equipment
CN111179247A (en) * 2019-12-27 2020-05-19 上海商汤智能科技有限公司 Three-dimensional target detection method, training method of model thereof, and related device and equipment
CN111507973B (en) * 2020-04-20 2024-04-12 上海商汤临港智能科技有限公司 Target detection method and device, electronic equipment and storage medium
CN111539347B (en) * 2020-04-27 2023-08-08 北京百度网讯科技有限公司 Method and device for detecting target
CN111931727A (en) * 2020-09-23 2020-11-13 深圳市商汤科技有限公司 Point cloud data labeling method and device, electronic equipment and storage medium
CN112183330B (en) * 2020-09-28 2022-06-28 北京航空航天大学 Target detection method based on point cloud
CN112287939B (en) * 2020-10-29 2024-05-31 平安科技(深圳)有限公司 Three-dimensional point cloud semantic segmentation method, device, equipment and medium
CN115035359A (en) * 2021-02-24 2022-09-09 华为技术有限公司 Point cloud data processing method, training data processing method and device
CN113822146A (en) * 2021-08-02 2021-12-21 浙江大华技术股份有限公司 Target detection method, terminal device and computer storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008012635A (en) * 2006-07-07 2008-01-24 Toyota Motor Corp Personal identification system
US10733651B2 (en) * 2014-01-01 2020-08-04 Andrew S Hansen Methods and systems for identifying physical objects
CN108509820B (en) * 2017-02-23 2021-12-24 百度在线网络技术(北京)有限公司 Obstacle segmentation method and device, computer equipment and readable medium
CN108470174B (en) * 2017-02-23 2021-12-24 百度在线网络技术(北京)有限公司 Obstacle segmentation method and device, computer equipment and readable medium
US10885398B2 (en) * 2017-03-17 2021-01-05 Honda Motor Co., Ltd. Joint 3D object detection and orientation estimation via multimodal fusion
CN107622244B (en) * 2017-09-25 2020-08-28 华中科技大学 Indoor scene fine analysis method based on depth map
CN108895981B (en) * 2018-05-29 2020-10-09 南京怀萃智能科技有限公司 Three-dimensional measurement method, device, server and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150227775A1 (en) * 2012-09-11 2015-08-13 Southwest Research Institute 3-D Imaging Sensor Based Location Estimation
CN105976400A (en) * 2016-05-10 2016-09-28 北京旷视科技有限公司 Object tracking method and device based on neural network model
CN108122245A (en) * 2016-11-30 2018-06-05 华为技术有限公司 A kind of goal behavior describes method, apparatus and monitoring device
CN108171217A (en) * 2018-01-29 2018-06-15 深圳市唯特视科技有限公司 A kind of three-dimension object detection method based on converged network
CN109635685A (en) * 2018-11-29 2019-04-16 北京市商汤科技开发有限公司 Target object 3D detection method, device, medium and equipment

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860373A (en) * 2020-07-24 2020-10-30 浙江商汤科技开发有限公司 Target detection method and device, electronic equipment and storage medium
CN111860373B (en) * 2020-07-24 2022-05-20 浙江商汤科技开发有限公司 Target detection method and device, electronic equipment and storage medium
WO2022017140A1 (en) * 2020-07-24 2022-01-27 浙江商汤科技开发有限公司 Target detection method and apparatus, electronic device, and storage medium
CN111968133A (en) * 2020-07-31 2020-11-20 上海交通大学 Three-dimensional point cloud data example segmentation method and system in automatic driving scene
CN112200768A (en) * 2020-09-07 2021-01-08 华北水利水电大学 Point cloud information extraction system based on geographic position
CN116420096A (en) * 2020-09-24 2023-07-11 埃尔构人工智能有限责任公司 Method and system for marking LIDAR point cloud data
CN112598635B (en) * 2020-12-18 2024-03-12 武汉大学 Point cloud 3D target detection method based on symmetric point generation
CN112598635A (en) * 2020-12-18 2021-04-02 武汉大学 Point cloud 3D target detection method based on symmetric point generation
CN112766206B (en) * 2021-01-28 2024-05-28 深圳市捷顺科技实业股份有限公司 High-order video vehicle detection method and device, electronic equipment and storage medium
CN112766206A (en) * 2021-01-28 2021-05-07 深圳市捷顺科技实业股份有限公司 High-order video vehicle detection method and device, electronic equipment and storage medium
CN112862953A (en) * 2021-01-29 2021-05-28 上海商汤临港智能科技有限公司 Point cloud data processing method and device, electronic equipment and storage medium
CN112800971A (en) * 2021-01-29 2021-05-14 深圳市商汤科技有限公司 Neural network training and point cloud data processing method, device, equipment and medium
CN112862953B (en) * 2021-01-29 2023-11-28 上海商汤临港智能科技有限公司 Point cloud data processing method and device, electronic equipment and storage medium
CN112907760A (en) * 2021-02-09 2021-06-04 浙江商汤科技开发有限公司 Three-dimensional object labeling method and device, tool, electronic equipment and storage medium
CN112990200A (en) * 2021-03-31 2021-06-18 上海商汤临港智能科技有限公司 Data labeling method and device, computer equipment and storage medium
CN113516013A (en) * 2021-04-09 2021-10-19 阿波罗智联(北京)科技有限公司 Target detection method and device, electronic equipment, road side equipment and cloud control platform
CN113516013B (en) * 2021-04-09 2024-05-14 阿波罗智联(北京)科技有限公司 Target detection method, target detection device, electronic equipment, road side equipment and cloud control platform
CN113298163A (en) * 2021-05-31 2021-08-24 国网湖北省电力有限公司黄石供电公司 Target identification monitoring method based on LiDAR point cloud data
CN113537316B (en) * 2021-06-30 2024-04-09 南京理工大学 Vehicle detection method based on 4D millimeter wave radar point cloud
CN113537316A (en) * 2021-06-30 2021-10-22 南京理工大学 Vehicle detection method based on 4D millimeter wave radar point cloud
CN113570535A (en) * 2021-07-30 2021-10-29 深圳市慧鲤科技有限公司 Visual positioning method and related device and equipment
CN113984037A (en) * 2021-09-30 2022-01-28 电子科技大学长三角研究院(湖州) Semantic map construction method based on target candidate box in any direction
CN113984037B (en) * 2021-09-30 2023-09-12 电子科技大学长三角研究院(湖州) Semantic map construction method based on target candidate frame in any direction
CN113822277A (en) * 2021-11-19 2021-12-21 万商云集(成都)科技股份有限公司 Illegal advertisement picture detection method and system based on deep learning target detection
CN114298581A (en) * 2021-12-30 2022-04-08 广州极飞科技股份有限公司 Quality evaluation model generation method, quality evaluation device, electronic device, and readable storage medium
CN114241110B (en) * 2022-02-23 2022-06-03 北京邮电大学 Point cloud semantic uncertainty sensing method based on neighborhood aggregation Monte Carlo inactivation
CN114241110A (en) * 2022-02-23 2022-03-25 北京邮电大学 Point cloud semantic uncertainty sensing method based on neighborhood aggregation Monte Carlo inactivation
CN114743001A (en) * 2022-04-06 2022-07-12 合众新能源汽车有限公司 Semantic segmentation method and device, electronic equipment and storage medium
CN115880470A (en) * 2023-03-08 2023-03-31 深圳佑驾创新科技有限公司 Method, device and equipment for generating 3D image data and storage medium

Also Published As

Publication number Publication date
CN109635685A (en) 2019-04-16
CN109635685B (en) 2021-02-12
JP2022515591A (en) 2022-02-21
KR20210078529A (en) 2021-06-28

Similar Documents

Publication Publication Date Title
WO2020108311A1 (en) 3d detection method and apparatus for target object, and medium and device
CN113486796B (en) Unmanned vehicle position detection method, unmanned vehicle position detection device, unmanned vehicle position detection equipment, storage medium and vehicle
WO2019179464A1 (en) Method for predicting direction of movement of target object, vehicle control method, and device
WO2020253121A1 (en) Target detection method and apparatus, intelligent driving method and device, and storage medium
Zhou et al. Efficient road detection and tracking for unmanned aerial vehicle
US20190156144A1 (en) Method and apparatus for detecting object, method and apparatus for training neural network, and electronic device
US9147255B1 (en) Rapid object detection by combining structural information from image segmentation with bio-inspired attentional mechanisms
US20150253864A1 (en) Image Processor Comprising Gesture Recognition System with Finger Detection and Tracking Functionality
US20210117704A1 (en) Obstacle detection method, intelligent driving control method, electronic device, and non-transitory computer-readable storage medium
EP4088134A1 (en) Object size estimation using camera map and/or radar information
WO2020184207A1 (en) Object tracking device and object tracking method
WO2023116631A1 (en) Training method and training apparatus for rotating-ship target detection model, and storage medium
US20150278589A1 (en) Image Processor with Static Hand Pose Recognition Utilizing Contour Triangulation and Flattening
WO2020238008A1 (en) Moving object detection method and device, intelligent driving control method and device, medium, and apparatus
KR20210012012A (en) Object tracking methods and apparatuses, electronic devices and storage media
CN112651274A (en) Road obstacle detection device, road obstacle detection method, and recording medium
WO2020238073A1 (en) Method for determining orientation of target object, intelligent driving control method and apparatus, and device
US20220335572A1 (en) Semantically accurate super-resolution generative adversarial networks
CN116310993A (en) Target detection method, device, equipment and storage medium
CN115100741A (en) Point cloud pedestrian distance risk detection method, system, equipment and medium
US20230087261A1 (en) Three-dimensional target estimation using keypoints
CN117115414B (en) GPS-free unmanned aerial vehicle positioning method and device based on deep learning
Petković et al. An overview on horizon detection methods in maritime video surveillance
Hashmani et al. A survey on edge detection based recent marine horizon line detection methods and their applications
CN117372928A (en) Video target detection method and device and related equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19889305

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021526222

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20217015013

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 24.09.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19889305

Country of ref document: EP

Kind code of ref document: A1