WO2023035822A1 - Target detection method and apparatus, and device and storage medium - Google Patents

Target detection method and apparatus, and device and storage medium

Info

Publication number
WO2023035822A1
Authority
WO
WIPO (PCT)
Prior art keywords
point cloud
feature
target detection
target
candidate
Prior art date
Application number
PCT/CN2022/110147
Other languages
French (fr)
Chinese (zh)
Inventor
徐辉
叶汇贤
Original Assignee
上海芯物科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海芯物科技有限公司
Publication of WO2023035822A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed in the present application are a target detection method and apparatus, a device, and a storage medium. The method comprises: inputting point cloud data corresponding to a target training sample into an initial target detection model, where the initial target detection model comprises a feature extraction network and a feature fusion network; performing feature extraction on the point cloud data through the feature extraction network to obtain a plurality of candidate detection boxes; inputting the candidate detection boxes into the feature fusion network to obtain a predicted detection box, which the feature fusion network produces by performing feature fusion on the candidate detection boxes based on their distance features; and adjusting the parameters of the initial target detection model according to a loss function value determined from the predicted detection box and the labeled detection box corresponding to the target training sample. With this technical solution, feature fusion can be performed on the candidate detection boxes based on the inter-box correlation reflected by their distance features, improving the prediction accuracy of the target detection model.

Description

Target detection method, apparatus, device, and storage medium
This application claims priority to Chinese patent application No. 202111066892.1, filed with the China Patent Office on September 13, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to the technical field of computer vision, and in particular to a target detection method, apparatus, device, and storage medium.
Background
With the development of machine learning, point-cloud-based three-dimensional (3D) target detection is widely used in autonomous driving systems, object recognition, and 3D reconstruction.
In 3D target detection, determining the trajectory of the detection box is a very important step. The candidate detection boxes output by the commonly used 3DSSD detection model are independent of one another and lack global information, so the prediction accuracy needs to be improved.
Summary
Embodiments of the present application provide a target detection method, apparatus, device, and storage medium, so that feature fusion can be performed on candidate detection boxes based on the inter-box correlation reflected by the distance features of the candidate detection boxes, thereby improving the prediction accuracy of the target detection model.
In a first aspect, an embodiment of the present application provides a method for training a target detection model, including:
inputting point cloud data corresponding to a target training sample into an initial target detection model, where the initial target detection model includes a feature extraction network and a feature fusion network;
performing feature extraction on the point cloud data through the feature extraction network to obtain a plurality of candidate detection boxes;
inputting each candidate detection box into the feature fusion network to obtain a predicted detection box, which the feature fusion network produces by performing feature fusion on the candidate detection boxes based on their distance features; and
adjusting the parameters of the initial target detection model according to a loss function value determined from the predicted detection box and a labeled detection box corresponding to the target training sample.
In a second aspect, an embodiment of the present application provides a target detection method, including:
acquiring point cloud data to be detected;
inputting the point cloud data to be detected into a target detection model trained by the above method for training a target detection model; and
acquiring a target detection box output by the target detection model, and determining a target detection result for the point cloud data to be detected based on the target detection box.
In a third aspect, an embodiment of the present application further provides an apparatus for training a target detection model, including:
an input module, configured to input point cloud data corresponding to a target training sample into an initial target detection model, where the initial target detection model includes a feature extraction network and a feature fusion network;
a feature extraction module, configured to perform feature extraction on the point cloud data through the feature extraction network to obtain a plurality of candidate detection boxes;
a feature fusion module, configured to input each candidate detection box into the feature fusion network and obtain a predicted detection box produced by the feature fusion network performing feature fusion on the candidate detection boxes based on their distance features; and
a parameter adjustment module, configured to adjust the parameters of the initial target detection model according to a loss function value determined from the predicted detection box and a labeled detection box corresponding to the target training sample.
In a fourth aspect, an embodiment of the present application further provides a target detection apparatus, including:
an acquisition module, configured to acquire point cloud data to be detected;
an input module, configured to input the point cloud data to be detected into a target detection model trained by the above method for training a target detection model; and
a determination module, configured to acquire a target detection box output by the target detection model and determine a target detection result for the point cloud data to be detected based on the target detection box.
In a fifth aspect, an embodiment of the present application further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the method for training a target detection model or the target detection method described in any embodiment of the present application.
In a sixth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for training a target detection model or the target detection method described in any embodiment of the present application.
In the embodiments of the present application, point cloud data corresponding to a target training sample is input into an initial target detection model that includes a feature extraction network and a feature fusion network; feature extraction is performed on the point cloud data through the feature extraction network to obtain a plurality of candidate detection boxes; each candidate detection box is input into the feature fusion network to obtain a predicted detection box produced by fusing the candidate detection boxes based on their distance features; and the parameters of the initial target detection model are adjusted according to a loss function value determined from the predicted detection box and the labeled detection box corresponding to the target training sample. This solves the problem that the candidate detection boxes output by the 3DSSD detection model are mutually independent and lack global information: feature fusion is performed on the candidate detection boxes based on the inter-box correlation reflected by their distance features, improving the prediction accuracy of the target detection model.
Brief Description of the Drawings
FIG. 1 is a flowchart of a method for training a target detection model in Embodiment 1 of the present application;
FIG. 2A is a flowchart of a method for training a target detection model in Embodiment 2 of the present application;
FIG. 2B is a schematic structural diagram of a feature fusion network in Embodiment 2 of the present application;
FIG. 2C is a schematic diagram of the weights between the center point and adjacent points in a voxel in the related art;
FIG. 2D is a schematic diagram of the weights between the center point and adjacent points in a voxel in Embodiment 2 of the present application;
FIG. 3 is a flowchart of a target detection method in Embodiment 3 of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for training a target detection model in Embodiment 4 of the present application;
FIG. 5 is a schematic structural diagram of a target detection apparatus in Embodiment 5 of the present application;
FIG. 6 is a schematic structural diagram of a computer device in Embodiment 6 of the present application.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely explain the present application and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present application rather than all structures.
It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. In the description of the present application, the terms "first", "second", and the like are used only to distinguish descriptions and should not be understood as indicating or implying relative importance.
Embodiment 1
FIG. 1 is a flowchart of a method for training a target detection model provided in Embodiment 1 of the present application. This embodiment is applicable to training a 3D target detection model. The method may be performed by the apparatus for training a target detection model described in the embodiments of the present application, and the apparatus may be implemented in software and/or hardware. As shown in FIG. 1, the method includes the following steps:
S110: Input point cloud data corresponding to a target training sample into an initial target detection model, where the initial target detection model includes a feature extraction network and a feature fusion network.
Here, a target training sample is a sample in the training set used to train the initial target detection model. The point cloud data corresponding to the target training sample may be acquired by a lidar; point cloud data is a set of vectors in a three-dimensional coordinate system and may include information such as geometric position, color, or reflection intensity. A target training sample may be labeled with a detection box, which frames the position of the target object and may be used to further identify the type of the target object.
The initial target detection model is an untrained or incompletely trained target detection model used to determine the detection box of a target object and detect its type. It may include a feature extraction network, which extracts features from the point cloud data of the target training sample and determines candidate detection boxes, and a feature fusion network, which fuses the features of the candidate detection boxes to obtain a predicted detection box.
S120: Perform feature extraction on the point cloud data through the feature extraction network to obtain a plurality of candidate detection boxes.
A candidate detection box is a detection box determined by the feature extraction network. There may be one or more candidate detection boxes; their number is determined by the number of target objects in the target detection sample, with one detection box per target object, and the boxes are mutually independent.
Specifically, performing feature extraction on the point cloud data through the feature extraction network may include down-sampling the point cloud data, grouping it, and aggregating the sampled points within each group.
For example, the point cloud data may be sampled at a preset sampling interval or by the farthest point sampling (FPS) method. Grouping the sampled point cloud may involve defining a unit cuboid, called a voxel; the space is partitioned by voxels, and the point cloud data falling inside one voxel forms one group. Within each group, the points may be aggregated by a multi-layer perceptron to obtain the feature vector of the center point, which is then passed through a regressor to obtain a candidate detection box.
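The farthest point sampling mentioned above can be sketched in plain Python. This is a minimal illustration rather than the implementation in the application; the point format (x, y, z tuples) and the choice of the first point as the seed are assumptions.

```python
import math

def farthest_point_sampling(points, num_samples):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.

    points: list of (x, y, z) tuples; num_samples: how many points to keep.
    Returns the indices of the sampled points.
    """
    if not points or num_samples <= 0:
        return []
    chosen = [0]  # seed with an arbitrary point (here: the first one)
    # distance from every point to the nearest already-chosen point
    dists = [math.dist(p, points[0]) for p in points]
    while len(chosen) < min(num_samples, len(points)):
        idx = max(range(len(points)), key=lambda i: dists[i])
        chosen.append(idx)
        for i, p in enumerate(points):
            dists[i] = min(dists[i], math.dist(p, points[idx]))
    return chosen
```

Compared with fixed-interval sampling, FPS keeps the retained points spread evenly over the scene, which is why it is the sampling method named above.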
In one specific example, the point cloud data within each voxel may be aggregated as follows: every point in the voxel is input into a multi-layer perceptron, the per-point features are aggregated by pooling to determine the feature vector of the center point, and the feature vector of the center point is passed through a regressor to obtain a candidate detection box.
In another specific example, the point cloud data within each voxel may be aggregated as follows: the center point of the points in the voxel is computed; the offset of each point from that center is computed; and the center point and the offsets are input into a multi-layer perceptron to obtain the center-point feature, which can then be regressed to obtain a candidate detection box.
S130: Input each candidate detection box into the feature fusion network, and obtain a predicted detection box produced by the feature fusion network performing feature fusion on the candidate detection boxes based on their distance features.
Specifically, the candidate detection boxes corresponding to the voxels are input into the feature fusion network, the distance features between the candidate detection boxes are determined, and feature fusion is performed on the candidate detection boxes according to those distance features to obtain the predicted detection box.
For example, the distance features between candidate detection boxes may be determined by computing the Euclidean distance between every two distinct candidate detection boxes to obtain a distance matrix. The distance matrix can capture the global characteristics of the target detection sample.
S140: Adjust the parameters of the initial target detection model according to a loss function value determined from the predicted detection box and the labeled detection box corresponding to the target training sample.
The loss function measures the difference between the detection box obtained by the initial target detection model performing target detection on the point cloud data of the target training sample and the labeled detection box of that sample.
Specifically, the loss function value is computed from the predicted detection box and the labeled detection box corresponding to the target training sample, and the parameters of the initial target detection model are adjusted based on that value. The parameters of the initial target detection model may be untrained initial parameters or pre-trained parameters.
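The application does not fix a particular loss form. As an illustration only, a smooth-L1 regression loss between predicted and labeled box parameters, a common choice for box regression and an assumption here, could be computed as:

```python
def smooth_l1(pred_box, gt_box, beta=1.0):
    """Smooth-L1 loss between a predicted box and a labeled box.

    Boxes are flat parameter lists (e.g. cx, cy, cz, l, w, h, yaw); this
    parameterization, and the loss itself, are illustrative assumptions.
    """
    total = 0.0
    for p, g in zip(pred_box, gt_box):
        d = abs(p - g)
        # quadratic near zero, linear for large errors (robust to outliers)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred_box)
```

The value returned by such a loss would then drive the parameter update, e.g. by gradient descent in the training framework used.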
In the technical solution of this embodiment, point cloud data corresponding to a target training sample is input into an initial target detection model that includes a feature extraction network and a feature fusion network; feature extraction is performed on the point cloud data through the feature extraction network to obtain a plurality of candidate detection boxes; each candidate detection box is input into the feature fusion network to obtain a predicted detection box produced by fusing the candidate detection boxes based on their distance features; and the parameters of the initial target detection model are adjusted according to a loss function value determined from the predicted detection box and the labeled detection box corresponding to the target training sample. Feature fusion can thus be performed on the candidate detection boxes based on the inter-box correlation reflected by their distance features, improving the prediction accuracy of the target detection model.
Embodiment 2
FIG. 2A is a flowchart of a method for training a target detection model in Embodiment 2 of the present application. This embodiment builds on the above embodiment and further refines the feature fusion network.
As shown in FIG. 2A, the method of this embodiment includes the following steps:
S210: Input point cloud data corresponding to a target training sample into an initial target detection model, where the initial target detection model includes a feature extraction network and a feature fusion network, and the feature fusion network includes a channel fusion layer, a distance determination layer, and a feature association layer.
Specifically, as shown in FIG. 2B, the channel fusion layer fuses the candidate detection boxes output by the feature extraction network along the channel direction; the distance determination layer determines the distance matrix between the candidate detection boxes; and the feature association layer associates the distance matrix with the fused candidate detection boxes to obtain the predicted detection box.
S220: Perform feature extraction on the point cloud data through the feature extraction network to obtain a plurality of candidate detection boxes.
Optionally, step S220 includes steps S221, S222, and S223.
Step S221: Sample the point cloud data at a preset sampling interval to obtain a sampled point cloud.
For example, the point cloud data may be down-sampled at a preset interval; sampling one point every N points reduces the amount of point cloud data by a factor of N.
Step S222: Partition the space corresponding to the sampled point cloud to obtain a voxel sampled point cloud for each voxel after partitioning.
A voxel, short for volume element, is the smallest unit into which digital data divides three-dimensional space.
For example, a voxel may be a cuboid with user-defined length, width, and height L0, W0, and H0. These cuboids fill the space corresponding to the sampled point cloud; partitioning that space by cuboids yields multiple voxels, and the sampled points falling inside each voxel form its voxel sampled point cloud.
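The voxel partition described above can be sketched as a grid-hashing step; the concrete voxel edge lengths and the dictionary-based grouping are illustrative assumptions.

```python
import math

def voxelize(points, voxel_size=(0.2, 0.2, 0.4)):
    """Group sampled points by the voxel (grid cell) they fall into.

    points: list of (x, y, z) tuples; voxel_size: (L0, W0, H0) edge
    lengths of the unit cuboid (the concrete sizes are assumptions).
    Returns a dict mapping an integer voxel index -> points in that voxel.
    """
    groups = {}
    for p in points:
        # integer grid coordinates of the cuboid containing this point
        key = tuple(math.floor(c / s) for c, s in zip(p, voxel_size))
        groups.setdefault(key, []).append(p)
    return groups
```

Each value of the returned dictionary is one voxel sampled point cloud, ready for the per-voxel aggregation in step S223.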
Step S223: For each voxel, perform feature aggregation on the corresponding voxel sampled point cloud based on a multi-layer perceptron to obtain a candidate detection box corresponding to that voxel sampled point cloud.
Optionally, step S223 specifically includes:
for the voxel sampled point cloud corresponding to each voxel, determining the center point of the voxel sampled point cloud; determining the offset of each point in the voxel sampled point cloud from the center point; inputting the voxel sampled point cloud and the offsets into the multi-layer perceptron to determine the center-point feature of each voxel sampled point cloud; and regressing the center-point feature to obtain a candidate detection box.
Specifically, for the voxel sampled point cloud in each voxel, the center point is determined; the center point may be the centroid of the voxel sampled point cloud. The offset of each point in the voxel sampled point cloud from the center point is determined, the voxel sampled point cloud and the offsets are input into the multi-layer perceptron to determine the center-point feature of each voxel sampled point cloud, and the center-point feature is regressed to obtain a candidate detection box.
The benefit of this is as follows. In the related art, every point in a voxel is input into the same multi-layer perceptron; as shown in FIG. 2C, for five points X1, X2, X3, Xn, and Xc in a voxel, every adjacent point has the same weight of influence, namely 1, on the center point Xc. As shown in FIG. 2D, in this embodiment of the present application the voxel sampled point cloud and the offsets are input into the multi-layer perceptron, so that the weight of an adjacent point on the center point depends on its offset: the weight between Xc and X1 is X1 - Xc, between Xc and X2 is X2 - Xc, between Xc and X3 is X3 - Xc, and between Xc and Xn is Xn - Xc.
For example, inputting the voxel sampled point cloud and the offsets into the multi-layer perceptron yields the center-point feature of each voxel sampled point cloud as:
Xm = max(ReLU(MLP([xj - xi; xi])))
where Xm is the center-point feature of the voxel sampled point cloud, ReLU is the activation function, MLP is the multi-layer perceptron, xi is the i-th point in the voxel sampled point cloud, and xj is the j-th point.
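A minimal sketch of the aggregation formula above. The MLP is reduced to a single linear layer (a deliberate simplification of the multi-layer perceptron), and the center point is taken as the centroid as described earlier; both choices are assumptions for illustration.

```python
def aggregate_voxel(points, weights, bias):
    """Center-point feature Xm = max_j ReLU(MLP([xj - xc ; xc])).

    points: list of (x, y, z) in one voxel.
    weights: out_dim x 6 list of lists; bias: out_dim list. Together they
    stand in for the MLP as a single linear layer (an assumption).
    """
    n = len(points)
    center = tuple(sum(c) / n for c in zip(*points))  # centroid as center
    feats = []
    for p in points:
        offset = [a - b for a, b in zip(p, center)]
        inp = offset + list(center)  # concatenate [xj - xc ; xc]
        out = [max(0.0, sum(w * v for w, v in zip(row, inp)) + b)
               for row, b in zip(weights, bias)]  # ReLU(linear(inp))
        feats.append(out)
    # element-wise max pooling over the points in the voxel
    return [max(f[k] for f in feats) for k in range(len(bias))]
```

Because the input concatenates the offset with the center, each point's contribution to the pooled feature varies with its distance from the center, which is exactly the weighting behavior contrasted with FIG. 2C above.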
S230: Input each candidate detection box into the channel fusion layer, and obtain a fused detection box produced by the channel fusion layer fusing the candidate detection boxes along the channel direction.
Optionally, step S230 includes steps S231 and S232.
S231: Fuse the features of the candidate detection boxes along the channel direction through a first convolutional layer to obtain fused features.
S232: Perform a convolution operation on the fused features through a second convolutional layer to obtain a fused detection box.
For example, if a candidate detection box has num_points pixel points and channel count C0, its dimensions are (num_points, C0), where the C0 channels are the coordinates of the pixel points. The channel fusion layer comprises a first convolutional layer and a second convolutional layer: the first convolutional layer has a 1×C0 kernel, and the second convolutional layer has a 1×C1 kernel, where C1, the number of channels of the second convolutional layer, determines the channel count of the fused detection box.
The features of the candidate detection boxes are fused along the channel direction by the first convolutional layer to obtain fused features of dimensions (num_points, C0); the second convolutional layer then performs a convolution on the fused features, fusing the candidate detection boxes into the fused detection box.
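One plausible reading of the two-stage channel fusion above can be sketched as follows. Treating the 1×C0 kernel as a per-point weighted sum over the C0 channels, and the 1×C1 kernel as an expansion into C1 output channels, is an assumption about the operators, not the application's exact layers.

```python
def channel_fuse(box_feats, kernel1, kernel2):
    """Two-stage channel fusion over per-point features.

    box_feats: num_points x C0 matrix (list of lists);
    kernel1: length-C0 weights (stands in for the 1 x C0 conv);
    kernel2: length-C1 weights (stands in for the 1 x C1 conv).
    All shapes are illustrative assumptions.
    """
    fused = []
    for feat in box_feats:
        # first conv: weighted sum across the C0 channels of one point
        s = sum(w * v for w, v in zip(kernel1, feat))
        # second conv: map the fused scalar into C1 output channels
        fused.append([w * s for w in kernel2])
    return fused  # shape: num_points x C1
```

The output shape (num_points, C1) matches the fused detection box dimensions stated above.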
S240,将各候选检测框输入距离确定层,获得各候选检测框之间的距离特征。S240. Input each candidate detection frame into the distance determination layer, and obtain distance features between each candidate detection frame.
具体的,将各候选检测框输入距离确定层,确定各候选检测框之间的欧氏距离,基于各候选检测框之间的欧氏距离确定距离矩阵;基于所述距离矩阵,对各候选检测框在距离方向上的特征进行卷积操作,得到距离特征。Specifically, each candidate detection frame is input into the distance determination layer, the Euclidean distance between the candidate detection frames is determined, and a distance matrix is determined based on these Euclidean distances; based on the distance matrix, a convolution operation is performed on the features of each candidate detection frame in the distance direction to obtain the distance feature.
示例性的,计算两两候选检测框之间的欧式距离,若一个候选检测框的维度为(num_points,C0),则距离矩阵中元素的维度为(C0,num_points,num_points);采用二维卷积核对距离矩阵进行卷积得到维度为(num_points,C1)的距离特征。Exemplarily, the Euclidean distance between each pair of candidate detection frames is calculated; if the dimension of a candidate detection frame is (num_points, C0), the dimension of the elements in the distance matrix is (C0, num_points, num_points); a two-dimensional convolution kernel is used to convolve the distance matrix to obtain a distance feature of dimension (num_points, C1).
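The distance branch can be sketched as below. Two assumptions are flagged: per-channel pairwise distances produce the (C0, num_points, num_points) tensor mentioned in the example, and the two-dimensional convolution is approximated by pooling one point axis and then mixing channels with a random stand-in matrix.

```python
import numpy as np

def distance_feature(box_feats, C1=8, rng=None):
    """box_feats: (num_points, C0). Returns a (num_points, C1) distance feature."""
    if rng is None:
        rng = np.random.default_rng(0)
    P, C0 = box_feats.shape
    # Per-channel pairwise distances: shape (C0, P, P), matching the example dimensions.
    diff = box_feats.T[:, :, None] - box_feats.T[:, None, :]
    dist = np.abs(diff)
    # Stand-in for the 2-D convolution: pool one point axis, then mix channels to C1.
    pooled = dist.mean(axis=2).T             # (P, C0)
    w = rng.standard_normal((C0, C1))
    return pooled @ w                         # (P, C1)
```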
S250,将所述融合检测框和所述距离特征输入所述特征关联层,获得特征关联层基于所述距离特征对所述融合检测框进行特征关联后得到的预测检测框。S250. Input the fused detection frame and the distance feature into the feature association layer, and obtain a predicted detection frame obtained after the feature association layer performs feature association on the fused detection frame based on the distance feature.
具体的,将距离特征和融合检测框进行卷积操作,实现基于距离特征对融合检测框进行特征关联,使融合检测框能够综合候选检测框的局部特征以及各候选检测框之间的距离所体现的全局特征。Specifically, a convolution operation is performed on the distance feature and the fused detection frame to associate features with the fused detection frame based on the distance feature, so that the fused detection frame can integrate the local features of the candidate detection frames with the global features reflected by the distances between the candidate detection frames.
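The feature association step can be sketched as a simple joint mixing of the fused features and the distance features. The concatenate-then-mix form below is an assumption; the application only specifies that the two are combined by a convolution operation.

```python
import numpy as np

def associate_features(fused, dist_feat, rng=None):
    """fused, dist_feat: both (P, C1). Returns associated features of shape (P, C1)."""
    if rng is None:
        rng = np.random.default_rng(0)
    P, C1 = fused.shape
    joint = np.concatenate([fused, dist_feat], axis=1)  # local + distance-based global cues
    w = rng.standard_normal((2 * C1, C1))               # stand-in 1x1 conv weights
    return np.maximum(joint @ w, 0.0)                   # ReLU-activated output, (P, C1)
```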
S260,根据所述预测检测框和所述目标训练样本对应的标记检测框确定的损失函数值调整所述初始目标检测模型的参数。S260. Adjust the parameters of the initial target detection model according to the loss function value determined by the predicted detection frame and the marked detection frame corresponding to the target training sample.
本实施例的技术方案,通过将目标训练样本对应的点云数据输入初始目标检测模型;其中,所述初始目标检测模型包括:特征提取网络和特征融合网络;通过所述特征提取网络对所述点云数据进行特征提取,得到多个候选检测框;将各所述候选检测框输入所述特征融合网络,获得所述特征融合网络基于各所述候选检测框的距离特征对各所述候选检测框进行特征融合后得到的预测检测框;根据所述预测检测框和所述目标训练样本对应的标记检测框确定的损失函数值调整所述初始目标检测模型的参数,能够基于各候选检测框的距离特征所反映的检测框之间的关联特性,对候选检测框进行特征融合,提高目标检测模型的预测精度。In the technical solution of this embodiment, the point cloud data corresponding to the target training sample is input into the initial target detection model, where the initial target detection model includes a feature extraction network and a feature fusion network; feature extraction is performed on the point cloud data through the feature extraction network to obtain a plurality of candidate detection frames; each candidate detection frame is input into the feature fusion network to obtain the predicted detection frame produced after the feature fusion network performs feature fusion on the candidate detection frames based on their distance features; and the parameters of the initial target detection model are adjusted according to the loss function value determined by the predicted detection frame and the marked detection frame corresponding to the target training sample. In this way, feature fusion can be performed on the candidate detection frames based on the correlation between detection frames reflected by their distance features, improving the prediction accuracy of the target detection model.
实施例三Embodiment three
图3为本申请实施例三提供的一种目标检测方法的流程图,本实施例可适用于对待检测点云数据进行目标检测的情况,该方法可以由本申请实施例中的目标检测装置来执行,该装置可采用软件和/或硬件的方式实现,如图3所示,该方法具体包括如下步骤:Fig. 3 is a flowchart of a target detection method provided in Embodiment 3 of the present application. This embodiment is applicable to performing target detection on point cloud data to be detected. The method can be executed by the target detection device in the embodiments of the present application, and the device can be implemented in software and/or hardware. As shown in Fig. 3, the method specifically includes the following steps:
S310,获取待检测点云数据。S310, acquiring point cloud data to be detected.
具体的,所述待检测点云数据可以是通过三维激光扫描仪获取的包含待检测物体的点云数据。Specifically, the point cloud data to be detected may be point cloud data including the object to be detected obtained by a three-dimensional laser scanner.
S320,将所述待检测点云数据输入本申请实施例一或实施例二所述的目标检测模型的训练方法训练得到的目标检测模型。S320. Input the point cloud data to be detected into the target detection model obtained by training with the target detection model training method described in Embodiment 1 or Embodiment 2 of the present application.
其中,目标检测模型是采用本申请实施例一或实施例二所述的目标检测模型的训练方法训练完备的模型。Here, the target detection model is a fully trained model obtained by using the target detection model training method described in Embodiment 1 or Embodiment 2 of the present application.
具体的,将所述待检测点云数据输入目标检测模型,得到待检测物体对应的候选检测框。所述目标检测模型包括:特征提取网络和特征融合网络;所述特征提取网络用于提取目标训练样本对应的点云数据的特征,确定候选检测框;特征融合网络用于对候选检测框的特征进行融合得到预测检测框。Specifically, the point cloud data to be detected is input into the target detection model to obtain candidate detection frames corresponding to the object to be detected. The target detection model includes a feature extraction network and a feature fusion network; the feature extraction network is used to extract features of the point cloud data corresponding to the target training sample and determine candidate detection frames, and the feature fusion network is used to fuse the features of the candidate detection frames to obtain the predicted detection frame.
S330,获取所述目标检测模型输出的目标检测框,基于所述目标检测框确定待检测点云数据的目标检测结果。S330. Acquire a target detection frame output by the target detection model, and determine a target detection result of the point cloud data to be detected based on the target detection frame.
其中,目标检测框用于表示待检测物体的位置,对目标检测框中的目标物体进行类别识别确定目标检测结果。对目标检测框中的目标物体进行类别识别的方式本申请实施例不作限制,例如可以采用语义分类模型识别目标检测框中的目标物体的类型。Wherein, the target detection frame is used to indicate the position of the object to be detected, and the category recognition is performed on the target object in the target detection frame to determine the target detection result. The method of classifying the target object in the target detection frame is not limited in this embodiment of the present application. For example, a semantic classification model may be used to identify the type of the target object in the target detection frame.
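The inference flow of S310-S330 can be summarized in a few lines. `detect`, `model`, and `classifier` are illustrative stand-ins introduced here for the sketch, not APIs defined by this application.

```python
def detect(point_cloud, model, classifier):
    """Run the trained detection model, then classify each output detection frame."""
    boxes = model(point_cloud)          # target detection frames (object positions)
    results = []
    for box in boxes:
        label = classifier(box)         # e.g. a semantic classification model
        results.append((box, label))    # position + category = detection result
    return results
```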
本实施例的技术方案,通过获取待检测点云数据;将所述待检测点云数据输入本申请实施例一或实施例二所述的目标检测模型的训练方法训练得到的目标检测模型;获取所述目标检测模型输出的目标检测框,基于所述目标检测框确定待检测点云数据的目标检测结果,能够基于各候选检测框的距离特征所反映的检测框之间的关联特性,对候选检测框进行特征融合,提高目标检测模型的预测精度。In the technical solution of this embodiment, the point cloud data to be detected is acquired; the point cloud data to be detected is input into the target detection model trained with the training method described in Embodiment 1 or Embodiment 2 of the present application; and the target detection frame output by the target detection model is acquired, and the target detection result of the point cloud data to be detected is determined based on the target detection frame. In this way, feature fusion can be performed on the candidate detection frames based on the correlation between detection frames reflected by their distance features, improving the prediction accuracy of the target detection model.
实施例四Embodiment four
图4为本申请实施例四提供的一种目标检测模型的训练装置的结构示意图。本实施例可适用于训练三维目标检测模型的情况,该装置可采用软件和/或硬件的方式实现,该装置可集成在任何提供目标检测模型的训练功能的设备中,如图4所示,所述目标检测模型的训练装置具体包括:输入模块410、特征提取模块420、特征融合模块430和参数调整模块440。FIG. 4 is a schematic structural diagram of a training device for a target detection model provided in Embodiment 4 of the present application. This embodiment can be applied to the situation of training a three-dimensional target detection model, and the device can be implemented in the form of software and/or hardware, and the device can be integrated in any device that provides the training function of the target detection model, as shown in Figure 4, The training device of the target detection model specifically includes: an input module 410 , a feature extraction module 420 , a feature fusion module 430 and a parameter adjustment module 440 .
其中,输入模块410,用于将目标训练样本对应的点云数据输入初始目标检测模型;其中,所述初始目标检测模型包括:特征提取网络和特征融合网络;特征提取模块420,用于通过所述特征提取网络对所述点云数据进行特征提取,得到多个候选检测框;特征融合模块430,用于将各所述候选检测框输入所述特征融合网络,获得所述特征融合网络基于各所述候选检测框的距离特征对各所述候选检测框进行特征融合后得到的预测检测框;参数调整模块440,用于根据所述预测检测框和所述目标训练样本对应的标记检测框确定的损失函数值调整所述初始目标检测模型的参数。Among them, the input module 410 is used to input the point cloud data corresponding to the target training sample into the initial target detection model, where the initial target detection model includes a feature extraction network and a feature fusion network; the feature extraction module 420 is used to perform feature extraction on the point cloud data through the feature extraction network to obtain a plurality of candidate detection frames; the feature fusion module 430 is used to input each candidate detection frame into the feature fusion network to obtain the predicted detection frame produced after the feature fusion network performs feature fusion on the candidate detection frames based on their distance features; and the parameter adjustment module 440 is used to adjust the parameters of the initial target detection model according to the loss function value determined by the predicted detection frame and the marked detection frame corresponding to the target training sample.
可选的,所述特征提取模块420,包括:Optionally, the feature extraction module 420 includes:
采样单元,用于基于预设采样间隔对所述点云数据进行采样得到采样点云;划分单元,用于对所述采样点云对应的空间进行划分得到划分后每个体素对应的体素采样点云;聚合单元,用于对于各所述体素,基于多层感知机对所对应的体素采样点云进行特征聚合,确定各所述体素采样点云对应的候选检测框。A sampling unit, configured to sample the point cloud data based on a preset sampling interval to obtain a sampled point cloud; a division unit, configured to divide the space corresponding to the sampled point cloud to obtain the voxel sampling point cloud corresponding to each voxel after division; and an aggregation unit, configured to, for each voxel, perform feature aggregation on the corresponding voxel sampling point cloud based on a multi-layer perceptron, and determine the candidate detection frame corresponding to each voxel sampling point cloud.
可选的,所述聚合单元,具体用于:Optionally, the polymerization unit is specifically used for:
针对每个体素所对应的体素采样点云,确定所述体素采样点云的中心点;确定所述体素采样点云中每个点与所述中心点的偏移量;将所述体素采样点云和所述偏移量输入所述多层感知机,确定所述体素采样点云对应的中心点特征,将所述中心点特征进行回归得到候选检测框。For the voxel sampling point cloud corresponding to each voxel, determine the center point of the voxel sampling point cloud; determine the offset between each point in the voxel sampling point cloud and the center point; input the voxel sampling point cloud and the offsets into the multi-layer perceptron, determine the center point feature corresponding to the voxel sampling point cloud, and regress the center point feature to obtain a candidate detection frame.
可选的,所述特征融合模块430包括:Optionally, the feature fusion module 430 includes:
融合单元,用于将各所述候选检测框输入通道融合层,获得所述通道融合层基于通道方向对所述候选检测框进行融合后得到的融合检测框;确定单元,用于将各所述候选检测框输入距离确定层,获得各所述候选检测框之间的距离特征;关联单元,用于将所述融合检测框和所述距离特征输入所述特征关联层,获得特征关联层基于所述距离特征对所述融合检测框进行特征关联后得到的预测检测框。A fusion unit, configured to input each candidate detection frame into the channel fusion layer and obtain the fused detection frame obtained after the channel fusion layer fuses the candidate detection frames based on the channel direction; a determination unit, configured to input each candidate detection frame into the distance determination layer and obtain the distance features between the candidate detection frames; and an association unit, configured to input the fused detection frame and the distance features into the feature association layer and obtain the predicted detection frame obtained after the feature association layer performs feature association on the fused detection frame based on the distance features.
可选的,所述融合单元,具体用于:Optionally, the fusion unit is specifically used for:
通过第一卷积层对各所述候选检测框在通道方向上的特征进行融合后得到融合后的特征;通过第二卷积层对所述融合后的特征进行卷积操作得到融合检测框。The features of each candidate detection frame in the channel direction are fused through the first convolutional layer to obtain fused features; a convolution operation is performed on the fused features through the second convolutional layer to obtain a fused detection frame.
上述产品可执行本申请任意实施例所提供的目标检测模型的训练方法,具备执行方法相应的功能模块和效果。The above-mentioned products can execute the training method of the target detection model provided by any embodiment of the present application, and have the corresponding functional modules and effects of the execution method.
实施例五Embodiment five
图5为本申请实施例五提供的一种目标检测装置的结构示意图。本实施例可适用于对待检测点云数据进行目标检测的情况,该装置可采用软件和/或硬件的方式实现,该装置可集成在任何提供目标检测功能的设备中,如图5所示,所述目标检测装置具体包括:获取模块510、输入模块520和确定模块530。Fig. 5 is a schematic structural diagram of a target detection device provided in Embodiment 5 of the present application. This embodiment is applicable to performing target detection on point cloud data to be detected. The device can be implemented in software and/or hardware, and can be integrated in any device that provides a target detection function. As shown in Fig. 5, the target detection device specifically includes: an acquisition module 510, an input module 520, and a determination module 530.
获取模块510,用于获取待检测点云数据;输入模块520,用于将所述待检测点云数据输入采用实施例一或实施例二所述的目标检测模型的训练方法训练得到的目标检测模型;确定模块530,用于获取所述目标检测模型输出的目标检测框,基于所述目标检测框确定待检测点云数据的目标检测结果。The acquisition module 510 is used to acquire the point cloud data to be detected; the input module 520 is used to input the point cloud data to be detected into the target detection model obtained by training with the target detection model training method described in Embodiment 1 or Embodiment 2; and the determination module 530 is used to acquire the target detection frame output by the target detection model and determine the target detection result of the point cloud data to be detected based on the target detection frame.
上述产品可执行本申请任意实施例所提供的目标检测方法,具备执行方法相应的功能模块和效果。The above-mentioned products can execute the target detection method provided by any embodiment of the present application, and have corresponding functional modules and effects for executing the method.
实施例六Embodiment six
图6为本申请实施例六提供的一种计算机设备的结构框图,如图6所示,该计算机设备包括处理器610、存储器620、输入装置630和输出装置640;计算机设备中处理器610的数量可以是一个或多个,图6中以一个处理器610为例;计算机设备中的处理器610、存储器620、输入装置630和输出装置640可以通过总线或其他方式连接,图6中以通过总线连接为例。Fig. 6 is a structural block diagram of a computer device provided in Embodiment 6 of the present application. As shown in Fig. 6, the computer device includes a processor 610, a memory 620, an input device 630, and an output device 640; the number of processors 610 in the computer device may be one or more, and one processor 610 is taken as an example in Fig. 6; the processor 610, memory 620, input device 630, and output device 640 in the computer device may be connected by a bus or in other ways, and connection by a bus is taken as an example in Fig. 6.
存储器620作为一种计算机可读存储介质,可用于存储软件程序、计算机可执行程序以及模块,如本申请实施例中的目标检测模型的训练方法对应的程序指令/模块(例如,目标检测模型的训练装置中的输入模块410、特征提取模块420、特征融合模块430和参数调整模块440),或者如本申请实施例中的目标检测方法对应的程序指令/模块(例如目标检测装置中的获取模块510、输入模块520和确定模块530)。处理器610通过运行存储在存储器620中的软件程序、指令以及模块,从而执行计算机设备的各种功能应用以及数据处理,即实现上述的目标检测模型的训练方法或者目标检测方法。The memory 620, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the target detection model training method in the embodiments of the present application (for example, the input module 410, feature extraction module 420, feature fusion module 430, and parameter adjustment module 440 in the training device of the target detection model), or the program instructions/modules corresponding to the target detection method in the embodiments of the present application (for example, the acquisition module 510, input module 520, and determination module 530 in the target detection device). The processor 610 executes various functional applications and data processing of the computer device by running the software programs, instructions, and modules stored in the memory 620, that is, implements the above-mentioned target detection model training method or target detection method.
存储器620可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序;存储数据区可存储根据终端的使用所创建的数据等。此外,存储器620可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。在一些实例中,存储器620可进一步包括相对于处理器610远程设置的存储器,这些远程存储器可以通过网络连接至计算机设备。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。The memory 620 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the terminal, and the like. In addition, the memory 620 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage devices. In some examples, the memory 620 may further include memory located remotely from the processor 610, and these remote memories may be connected to the computer device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
输入装置630可用于接收输入的数字或字符信息,以及产生与计算机设备的用户设置以及功能控制有关的键信号输入。输出装置640可包括显示屏等显示设备。The input device 630 can be used to receive input numbers or character information, and generate key signal input related to user settings and function control of the computer device. The output device 640 may include a display device such as a display screen.
实施例七Embodiment seven
本申请实施例七提供了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现如本申请所有申请实施例提供的目标检测模型的训练方法:将目标训练样本对应的点云数据输入初始目标检测模型;其中,所述初始目标检测模型包括:特征提取网络和特征融合网络;通过所述特征提取网络对所述点云数据进行特征提取,得到多个候选检测框;将各所述候选检测框输入所述特征融合网络,获得所述特征融合网络基于各所述候选检测框的距离特征对各所述候选检测框进行特征融合后得到的预测检测框;根据所述预测检测框和所述目标训练样本对应的标记检测框确定的损失函数值调整所述初始目标检测模型的参数。Embodiment 7 of the present application provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the target detection model training method provided by all the embodiments of the present application is implemented: input the point cloud data corresponding to the target training sample into the initial target detection model, where the initial target detection model includes a feature extraction network and a feature fusion network; perform feature extraction on the point cloud data through the feature extraction network to obtain a plurality of candidate detection frames; input each candidate detection frame into the feature fusion network to obtain the predicted detection frame produced after the feature fusion network performs feature fusion on the candidate detection frames based on their distance features; and adjust the parameters of the initial target detection model according to the loss function value determined by the predicted detection frame and the marked detection frame corresponding to the target training sample.
或者,实现如本申请所有申请实施例提供的目标检测方法:获取待检测点云数据;将所述待检测点云数据输入采用本申请实施例一或实施例二所述的目标检测模型的训练方法训练得到的目标检测模型;获取所述目标检测模型输出的目标检测框,基于所述目标检测框确定待检测点云数据的目标检测结果。Alternatively, the target detection method provided by all the embodiments of the present application is implemented: acquire the point cloud data to be detected; input the point cloud data to be detected into the target detection model obtained by training with the target detection model training method described in Embodiment 1 or Embodiment 2 of the present application; acquire the target detection frame output by the target detection model, and determine the target detection result of the point cloud data to be detected based on the target detection frame.
可以采用一个或多个计算机可读的介质的任意组合。计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质。计算机可读存储介质例如可以是但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本文件中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or any combination thereof. More specific examples (non-exhaustive list) of computer readable storage media include: electrical connections with one or more leads, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), Erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。A computer readable signal medium may include a data signal carrying computer readable program code in baseband or as part of a carrier wave. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium other than a computer readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、电线、光缆、RF等等,或者上述的任意合适的组合。Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
可以以一种或多种程序设计语言或其组合来编写用于执行本申请操作的计算机程序代码,所述程序设计语言包括面向对象的程序设计语言,诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言,诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络包括局域网(LAN)或广域网(WAN)连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。Computer program code for carrying out the operations of the present application may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g. via the Internet using an Internet service provider).

Claims (10)

  1. 一种目标检测模型的训练方法,包括:A training method for a target detection model, comprising:
    将目标训练样本对应的点云数据输入初始目标检测模型;其中,所述初始目标检测模型包括:特征提取网络和特征融合网络;Input the point cloud data corresponding to the target training sample into the initial target detection model; wherein, the initial target detection model includes: a feature extraction network and a feature fusion network;
    通过所述特征提取网络对所述点云数据进行特征提取,得到多个候选检测框;performing feature extraction on the point cloud data through the feature extraction network to obtain a plurality of candidate detection frames;
    将各所述候选检测框输入所述特征融合网络,获得所述特征融合网络基于各所述候选检测框的距离特征对各所述候选检测框进行特征融合后得到的预测检测框;Input each of the candidate detection frames into the feature fusion network, and obtain a predicted detection frame obtained by performing feature fusion on each of the candidate detection frames by the feature fusion network based on the distance features of each of the candidate detection frames;
    根据所述预测检测框和所述目标训练样本对应的标记检测框确定的损失函数值调整所述初始目标检测模型的参数。Adjusting the parameters of the initial target detection model according to the loss function value determined by the predicted detection frame and the marked detection frame corresponding to the target training sample.
  2. 根据权利要求1所述的方法,其中,所述通过所述特征提取网络对所述点云数据进行特征提取,得到多个候选检测框,包括:The method according to claim 1, wherein said feature extraction is performed on said point cloud data through said feature extraction network to obtain a plurality of candidate detection frames, comprising:
    基于预设采样间隔对所述点云数据进行采样得到采样点云;Sampling the point cloud data based on a preset sampling interval to obtain a sampled point cloud;
    对所述采样点云对应的空间进行划分得到划分后每个体素对应的体素采样点云;Dividing the space corresponding to the sampling point cloud to obtain a voxel sampling point cloud corresponding to each voxel after division;
    对于各所述体素,基于多层感知机对所对应的体素采样点云进行特征聚合,确定各所述体素采样点云对应的候选检测框。For each of the voxels, the feature aggregation of the corresponding voxel sampling point cloud is performed based on the multi-layer perceptron, and the candidate detection frame corresponding to each of the voxel sampling point clouds is determined.
  3. 根据权利要求2所述的方法,其中,对于各所述体素,基于多层感知机对所对应的体素采样点云进行特征聚合,确定各所述体素采样点云对应的候选检测框,包括:The method according to claim 2, wherein, for each of the voxels, performing feature aggregation on the corresponding voxel sampling point cloud based on a multi-layer perceptron and determining the candidate detection frame corresponding to each voxel sampling point cloud comprises:
    针对每个体素所对应的体素采样点云,确定所述体素采样点云的中心点;For the voxel sampling point cloud corresponding to each voxel, determine the center point of the voxel sampling point cloud;
    确定所述体素采样点云中每个点与所述中心点的偏移量;Determine the offset between each point in the voxel sampling point cloud and the center point;
    将所述体素采样点云和所述偏移量输入所述多层感知机,确定所述体素采样点云对应的中心点特征,将所述中心点特征进行回归得到候选检测框。Inputting the voxel sampling point cloud and the offset into the multi-layer perceptron, determining the center point feature corresponding to the voxel sampling point cloud, and regressing the center point feature to obtain a candidate detection frame.
  4. 根据权利要求1所述的方法,其中,特征融合网络包括:通道融合层、距离确定层和特征关联层,相应的,将各所述候选检测框输入所述特征融合网络,获得所述特征融合网络基于各所述候选检测框的距离特征对各所述候选检测框进行特征融合后得到的预测检测框,包括:The method according to claim 1, wherein the feature fusion network comprises: a channel fusion layer, a distance determination layer, and a feature association layer; correspondingly, inputting each candidate detection frame into the feature fusion network and obtaining the predicted detection frame obtained after the feature fusion network performs feature fusion on each candidate detection frame based on the distance features of each candidate detection frame comprises:
    将各所述候选检测框输入通道融合层,获得所述通道融合层基于通道方向对所述候选检测框进行融合后得到的融合检测框;Input each of the candidate detection frames into the channel fusion layer, and obtain the fusion detection frame obtained after the channel fusion layer fuses the candidate detection frames based on the channel direction;
    将各所述候选检测框输入距离确定层,获得各所述候选检测框之间的距离特征;Inputting each of the candidate detection frames into the distance determination layer to obtain the distance features between the candidate detection frames;
    将所述融合检测框和所述距离特征输入所述特征关联层,获得特征关联层基于所述距离特征对所述融合检测框进行特征关联后得到的预测检测框。Inputting the fused detection frame and the distance feature into the feature association layer, and obtaining a predicted detection frame obtained by performing feature association on the fused detection frame by the feature association layer based on the distance feature.
  5. 根据权利要求4所述的方法,其中,将各所述候选检测框输入通道融合层,获得所述通道融合层基于通道方向对所述候选检测框进行融合后得到的融合检测框,包括:The method according to claim 4, wherein each of the candidate detection frames is input into a channel fusion layer to obtain a fused detection frame obtained after the channel fusion layer fuses the candidate detection frames based on the channel direction, including:
    通过第一卷积层对各所述候选检测框在通道方向上的特征进行融合后得到融合后的特征;Fusing the features of each candidate detection frame in the channel direction through a first convolutional layer to obtain fused features;
    通过第二卷积层对所述融合后的特征进行卷积操作得到融合检测框。Performing a convolution operation on the fused features through a second convolutional layer to obtain a fused detection frame.
  6. 一种目标检测方法,包括:A target detection method, comprising:
    获取待检测点云数据;Obtain point cloud data to be detected;
    将所述待检测点云数据输入采用权利要求1-5任一所述的目标检测模型的训练方法训练得到的目标检测模型;The point cloud data to be detected is input into the target detection model obtained by the training method of the target detection model described in any one of claims 1-5;
    获取所述目标检测模型输出的目标检测框,基于所述目标检测框确定待检测点云数据的目标检测结果。Obtain the target detection frame output by the target detection model, and determine the target detection result of the point cloud data to be detected based on the target detection frame.
  7. 一种目标检测模型的训练装置,包括:A training device for a target detection model, comprising:
    输入模块,用于将目标训练样本对应的点云数据输入初始目标检测模型;其中,所述初始目标检测模型包括:特征提取网络和特征融合网络;The input module is used to input the point cloud data corresponding to the target training sample into the initial target detection model; wherein, the initial target detection model includes: a feature extraction network and a feature fusion network;
    特征提取模块,用于通过所述特征提取网络对所述点云数据进行特征提取,得到多个候选检测框;A feature extraction module, configured to perform feature extraction on the point cloud data through the feature extraction network to obtain a plurality of candidate detection frames;
    特征融合模块,用于将各所述候选检测框输入所述特征融合网络,获得所述特征融合网络基于各所述候选检测框的距离特征对各所述候选检测框进行特征融合后得到的预测检测框;A feature fusion module, configured to input each candidate detection frame into the feature fusion network and obtain the predicted detection frame obtained after the feature fusion network performs feature fusion on each candidate detection frame based on the distance features of each candidate detection frame;
    参数调整模块,用于根据所述预测检测框和所述目标训练样本对应的标记检测框确定的损失函数值调整所述初始目标检测模型的参数。A parameter adjustment module, configured to adjust the parameters of the initial target detection model according to the loss function value determined by the predicted detection frame and the marked detection frame corresponding to the target training sample.
  8. 一种目标检测装置,包括:A target detection device, comprising:
    获取模块,用于获取待检测点云数据;An acquisition module, configured to acquire point cloud data to be detected;
    输入模块,用于将所述待检测点云数据输入采用权利要求1-5任一所述的目标检测模型的训练方法训练得到的目标检测模型;The input module is used to input the point cloud data to be detected into the target detection model obtained by training the target detection model training method according to any one of claims 1-5;
    确定模块,用于获取所述目标检测模型输出的目标检测框,基于所述目标检测框确定待检测点云数据的目标检测结果。A determination module, configured to acquire the target detection frame output by the target detection model, and determine the target detection result of the point cloud data to be detected based on the target detection frame.
  9. 一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其中,所述处理器执行所述程序时实现如权利要求1-5中任一所述的目标检测模型的训练方法,或者实现如权利要求6所述的目标检测方法。A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein, when the processor executes the program, the target detection model training method according to any one of claims 1-5 or the target detection method according to claim 6 is implemented.
  10. A computer-readable storage medium storing a computer program, wherein, when the program is executed by a processor, the target detection model training method according to any one of claims 1-5, or the target detection method according to claim 6, is implemented.
PCT/CN2022/110147 2021-09-13 2022-08-04 Target detection method and apparatus, and device and storage medium WO2023035822A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111066892.1 2021-09-13
CN202111066892.1A CN113807350A (en) 2021-09-13 2021-09-13 Target detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023035822A1 true WO2023035822A1 (en) 2023-03-16

Family

ID=78895165

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/110147 WO2023035822A1 (en) 2021-09-13 2022-08-04 Target detection method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN113807350A (en)
WO (1) WO2023035822A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807350A (en) * 2021-09-13 2021-12-17 上海芯物科技有限公司 Target detection method, device, equipment and storage medium
CN114005110B (en) * 2021-12-30 2022-05-17 智道网联科技(北京)有限公司 3D detection model training method and device, and 3D detection method and device
CN114565916A (en) * 2022-02-07 2022-05-31 苏州浪潮智能科技有限公司 Target detection model training method, target detection method and electronic equipment
CN115457496B (en) * 2022-09-09 2023-12-08 北京百度网讯科技有限公司 Automatic driving retaining wall detection method and device and vehicle

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111199206A (en) * 2019-12-30 2020-05-26 上海眼控科技股份有限公司 Three-dimensional target detection method and device, computer equipment and storage medium
CN111583337A (en) * 2020-04-25 2020-08-25 华南理工大学 Omnibearing obstacle detection method based on multi-sensor fusion
WO2020199834A1 (en) * 2019-04-03 2020-10-08 腾讯科技(深圳)有限公司 Object detection method and apparatus, and network device and storage medium
CN113807350A (en) * 2021-09-13 2021-12-17 上海芯物科技有限公司 Target detection method, device, equipment and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116256720A (en) * 2023-05-09 2023-06-13 武汉大学 Underground target detection method and device based on three-dimensional ground penetrating radar and electronic equipment
CN116256720B (en) * 2023-05-09 2023-10-13 武汉大学 Underground target detection method and device based on three-dimensional ground penetrating radar and electronic equipment
CN116721399A (en) * 2023-07-26 2023-09-08 之江实验室 Point cloud target detection method and device for quantitative perception training
CN116721399B (en) * 2023-07-26 2023-11-14 之江实验室 Point cloud target detection method and device for quantitative perception training
CN116882031A (en) * 2023-09-01 2023-10-13 临沂大学 Building model construction method and system based on point cloud
CN116882031B (en) * 2023-09-01 2023-11-17 临沂大学 Building model construction method and system based on point cloud
CN117173692A (en) * 2023-11-02 2023-12-05 安徽蔚来智驾科技有限公司 3D target detection method, electronic device, medium and driving device
CN117173692B (en) * 2023-11-02 2024-02-02 安徽蔚来智驾科技有限公司 3D target detection method, electronic device, medium and driving device

Also Published As

Publication number Publication date
CN113807350A (en) 2021-12-17

Similar Documents

Publication Publication Date Title
WO2023035822A1 (en) Target detection method and apparatus, and device and storage medium
WO2022083402A1 (en) Obstacle detection method and apparatus, computer device, and storage medium
US11042762B2 (en) Sensor calibration method and device, computer device, medium, and vehicle
US10885352B2 (en) Method, apparatus, and device for determining lane line on road
CN107576960B (en) Target detection method and system for visual radar space-time information fusion
EP3506161A1 (en) Method and apparatus for recovering point cloud data
WO2019056845A1 (en) Road map generating method and apparatus, electronic device, and computer storage medium
CN109858552B (en) Target detection method and device for fine-grained classification
US20220156483A1 (en) Efficient three-dimensional object detection from point clouds
US20210012089A1 (en) Object detection in point clouds
WO2022040562A1 (en) Object-centric three-dimensional auto labeling of point cloud data
WO2022206414A1 (en) Three-dimensional target detection method and apparatus
EP3703008A1 (en) Object detection and 3d box fitting
CN114445310B (en) 3D target detection method and device, electronic equipment and medium
CN115797736B (en) Training method, device, equipment and medium for target detection model and target detection method, device, equipment and medium
US20230213646A1 (en) Machine learning based object detection using radar information
CN114782785A (en) Multi-sensor information fusion method and device
CN115147333A (en) Target detection method and device
CN113052039A (en) Method, system and server for detecting pedestrian density of traffic network
CN110909656B (en) Pedestrian detection method and system integrating radar and camera
WO2021218346A1 (en) Clustering method and device
CN116170779A (en) Collaborative awareness data transmission method, device and system
CN115880659A (en) 3D target detection method and device for road side system and electronic equipment
CN116030330A (en) Target detection method and device
US11804042B1 (en) Prelabeling of bounding boxes in video frames