CN111241964A - Training method and device of target detection model, electronic equipment and storage medium - Google Patents

Training method and device of target detection model, electronic equipment and storage medium

Info

Publication number
CN111241964A
CN111241964A (application CN202010010252.8A)
Authority
CN
China
Prior art keywords
network
training
point cloud
frame
cloud data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010010252.8A
Other languages
Chinese (zh)
Inventor
尹轩宇 (Yin Xuanyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202010010252.8A priority Critical patent/CN111241964A/en
Publication of CN111241964A publication Critical patent/CN111241964A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/64 - Three-dimensional objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 - Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/584 - Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08 - Detecting or categorising vehicles

Abstract

The application discloses a method for training a target detection model, a corresponding apparatus, an electronic device, and a storage medium. The target detection model comprises a one-stage network and a two-stage network, and the training method comprises the following steps: generating preliminary detection boxes in three-dimensional point cloud data according to the one-stage network; classifying the three-dimensional point cloud data according to the preliminary detection boxes, and generating training data according to the classification result; and iteratively training the two-stage network with the training data to extract local features and generate refined prediction boxes from the local features. With the two-stage network design, the output of the one-stage network is classified a second time, and the local features inside each box are extracted to generate a refined prediction box on the basis of the preliminary detection result, which alleviates the sample-imbalance problem, allows the information of the target to be learned and extracted more effectively, and refines the prediction boxes.

Description

Training method and device of target detection model, electronic equipment and storage medium
Technical Field
The present application relates to the field of target detection, and in particular, to a method and an apparatus for training a target detection model, an electronic device, and a storage medium.
Background
Target detection is of great significance to unmanned driving, which can be applied to fields such as logistics and takeaway delivery and carries great business value. At present, the spatial information of a target can be acquired by fusing data from multiple sensors such as lidar, cameras and millimeter-wave radar; the data are then analyzed with deep learning, for example by extracting features with a convolutional neural network, and finally the target is labeled in the data to accomplish detection. However, taking the three-dimensional point cloud data collected by a lidar as an example, point clouds are irregular, so target detection and recognition in the prior art performs poorly and recognition accuracy cannot be well guaranteed.
Disclosure of Invention
In view of the above, the present application provides a training method and apparatus for a target detection model, an electronic device, and a storage medium that overcome, or at least partially solve, the above problems.
According to an aspect of the present application, there is provided a method for training a target detection model, wherein the target is an object of interest in an autonomous driving scene, the target detection model comprises a one-stage network and a two-stage network, and the method comprises:
generating preliminary detection boxes in three-dimensional point cloud data according to the one-stage network;
classifying the three-dimensional point cloud data according to the preliminary detection boxes, and generating training data according to the classification result;
and iteratively training the two-stage network with the training data to extract local features and generate refined prediction boxes from the local features.
Optionally, each training pass of iteratively training the two-stage network to extract local features and generate a refined prediction box from the local features includes:
extracting voxelized features from the point cloud data corresponding to each preliminary detection box through a voxelized feature extraction sub-network;
and generating a refined prediction box using a deep convolutional neural network and the voxelized features.
Optionally, the extracting of voxelized features from the point cloud data corresponding to each preliminary detection box through the voxelized feature extraction sub-network includes:
dividing the point cloud data corresponding to each preliminary detection box into a plurality of voxels;
and for each voxel, extracting point-wise concatenated features through a plurality of voxel feature extraction layers in the voxelized feature extraction sub-network, and passing the point-wise concatenated features through a fully connected neural network layer and a region max-pooling layer to obtain the voxelized feature.
Optionally, the generating of a refined prediction box using the deep convolutional neural network and the voxelized features comprises:
sequentially performing convolution, max pooling, activation and normalization on the voxelized features using the deep convolutional neural network, and finally generating the refined prediction box through two fully connected layers.
Optionally, the deep convolutional neural network comprises a classification network and a prediction box regression network, and the generating of a refined prediction box using the deep convolutional neural network and the voxelized features comprises:
using softmax as the activation function of the classification network, the classification network outputting the target class corresponding to the refined prediction box; and the prediction box regression network outputting the center coordinates, length, width, height and deflection angle of the refined prediction box.
Optionally, the classification network uses Focal loss as its loss function, and the prediction box regression network uses smooth L1 loss as its loss function.
Optionally, during training, the deep convolutional neural network is controlled with a preset number of training epochs and a stop threshold;
training is stopped when the number of training epochs exceeds the preset number, or when the change of the loss function over one training epoch is smaller than the preset threshold.
According to another aspect of the present application, there is provided a training apparatus for a target detection model, wherein the target is an object of interest in an autonomous driving scene, the target detection model comprises a one-stage network and a two-stage network, and the apparatus comprises:
a preliminary detection unit, configured to generate preliminary detection boxes in three-dimensional point cloud data according to the one-stage network;
a classification unit, configured to classify the three-dimensional point cloud data according to the preliminary detection boxes and to generate training data according to the classification result;
and a refining unit, configured to iteratively train the two-stage network with the training data to extract local features and generate refined prediction boxes from the local features.
Optionally, the refining unit is configured to extract voxelized features from the point cloud data corresponding to each preliminary detection box through a voxelized feature extraction sub-network,
and to generate a refined prediction box using a deep convolutional neural network and the voxelized features.
Optionally, the refining unit is configured to divide the point cloud data corresponding to each preliminary detection box into a plurality of voxels; and, for each voxel, to extract point-wise concatenated features through a plurality of voxel feature extraction layers in the voxelized feature extraction sub-network, and to pass the point-wise concatenated features through a fully connected neural network layer and a region max-pooling layer to obtain the voxelized feature.
Optionally, the refining unit is configured to sequentially perform convolution, max pooling, activation and normalization on the voxelized features using a deep convolutional neural network, and finally to generate the refined prediction box through two fully connected layers.
Optionally, the deep convolutional neural network comprises a classification network and a prediction box regression network, and the refining unit is configured to use softmax as the activation function of the classification network, the classification network outputting the target class corresponding to the refined prediction box, and the prediction box regression network outputting the center coordinates, length, width, height and deflection angle of the refined prediction box.
Optionally, the classification network uses Focal loss as its loss function, and the prediction box regression network uses smooth L1 loss as its loss function.
Optionally, the refining unit is configured to control training with a preset number of training epochs and a stop threshold; training is stopped when the number of training epochs exceeds the preset number, or when the change of the loss function over one training epoch is smaller than the preset threshold.
In accordance with yet another aspect of the present application, there is provided an electronic device including: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform a method as any one of the above.
According to a further aspect of the application, there is provided a computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement a method as in any above.
In summary, in the technical solution of the present application, the target detection model comprises a one-stage network and a two-stage network, and the method comprises: generating preliminary detection boxes in three-dimensional point cloud data according to the one-stage network; classifying the three-dimensional point cloud data according to the preliminary detection boxes, and generating training data according to the classification result; and iteratively training the two-stage network with the training data to extract local features and generate refined prediction boxes from the local features. With the two-stage network design, the output of the one-stage network is classified a second time, and the local features inside each box are extracted to generate a refined prediction box on the basis of the preliminary detection result, which alleviates the sample-imbalance problem, allows the information of the target to be learned and extracted more effectively, and refines the prediction boxes.
The foregoing is only an overview of the technical solutions of the present application. To make the technical means of the present application clearer, so that it can be implemented according to the content of the description, and to make the above and other objects, features and advantages of the present application easier to understand, a detailed description of the present application follows.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a schematic flow diagram of a method of training a target detection model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a training apparatus for an object detection model according to an embodiment of the present application;
FIG. 3 shows a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 4 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
For deep learning, convolutional neural networks are the dominant solution. For point cloud processing, convolution kernels can automatically extract locally correlated features of points, but convolution is sensitive to the order of the points; moreover, unlike a 2D image, the data of a 3D point cloud are not regularly arranged in space, so convolving them directly loses the shape information of the point cloud. The main challenges of deep-learning-based 3D object recognition are therefore the disorder of the point cloud and its irregularity.
In the embodiments of the present application, the target of the target detection model is a 3D object and the data involved are three-dimensional point cloud data, so the above problems also have to be faced. Specifically, the target may cover multiple categories of road objects that require driving attention, such as pedestrians, bicycles and automobiles, and the target detection model may be applied to unmanned driving and further to business fields such as logistics and takeaway delivery.
Reference will now be made to specific embodiments.
Fig. 1 shows a flowchart of a training method of a target detection model according to an embodiment of the present application, where the target is an object of interest in an autonomous driving scene. As shown in fig. 1, the method includes:
Step S110: generating preliminary detection boxes in the three-dimensional point cloud data according to the one-stage network.
The target is usually a 3D object, and the 3D object is recognized mainly by deep learning, using convolutional neural networks and target detection model techniques. The target detection model includes a one-stage network and a two-stage network.
The two-stage design is chosen because, when extracting the features of the target, the sampling mechanism of the single-stage networks in the prior art generates a large number of boxes, which causes an imbalance between positive and negative samples and therefore low localization accuracy. In addition, a detection box generated by a single-stage network usually also contains objects other than the target, i.e., the local features of non-target objects inside the box affect the regression accuracy of the detection box.
The underlying network of the one-stage network can be implemented with the prior art; for example, PointCNN or PointNet can be used to extract, rank and weight local information in the point cloud data that is independent of the order of the points, so as to obtain preliminary detection boxes. Preferably, the voxel feature extraction network VoxelNet can be used as the underlying network implementing the one-stage network.
Preliminary detection boxes are generated in the three-dimensional point cloud data according to the one-stage network; they realize a preliminary detection of the targets, with each preliminary detection box corresponding to the approximate position of one target, and provide the basis for the subsequent refinement.
Step S120: classifying the three-dimensional point cloud data according to the preliminary detection boxes, and generating training data according to the classification result.
After the preliminary detection boxes generated by the one-stage network in the three-dimensional point cloud data have been obtained, the three-dimensional point cloud data can be classified according to the preliminary detection boxes, and training data are then generated according to the classification result. A two-stage classification approach is thus adopted: the results of the first stage are classified a second time, which alleviates the sample-imbalance problem and makes the network prediction better. Concretely, the classification may assign each preliminary detection box the category of the detected object, such as pedestrian, bicycle, car or background.
Step S130: iteratively training the two-stage network with the training data to extract local features and generate refined prediction boxes from the local features.
On the basis of the preliminary prediction box results, the two-stage network extracts the local features of the three-dimensional point cloud data, so that object information can be learned better; it is trained iteratively on the training data and generates refined prediction boxes from the local features, thereby refining the prediction boxes for the three-dimensional point cloud data.
Therefore, as shown in fig. 1, the method adopts a two-stage network design: the output of the one-stage network is classified a second time, and the local features inside each box are extracted to generate a refined prediction box on the basis of the preliminary detection result, which alleviates the sample-imbalance problem, allows the information of the target to be learned and extracted more effectively, and refines the prediction boxes.
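For illustration only (this helper is not part of the claimed method), the sketch below shows how the points that fall inside one preliminary detection box could be gathered before local feature extraction; it assumes numpy and a box parameterized as center, size and yaw:

    import numpy as np

    def points_in_box(points, box):
        # Select the lidar points that fall inside one preliminary detection box.
        # points: (N, 3+) array of x, y, z (plus extra features such as intensity).
        # box:    (x, y, z, length, width, height, yaw) -- center, size, heading.
        cx, cy, cz, l, w, h, yaw = box
        # Translate points into the box frame and undo the yaw rotation about z.
        shifted = points[:, :3] - np.array([cx, cy, cz])
        cos_y, sin_y = np.cos(-yaw), np.sin(-yaw)
        local_x = shifted[:, 0] * cos_y - shifted[:, 1] * sin_y
        local_y = shifted[:, 0] * sin_y + shifted[:, 1] * cos_y
        local_z = shifted[:, 2]
        mask = (np.abs(local_x) <= l / 2) & (np.abs(local_y) <= w / 2) & (np.abs(local_z) <= h / 2)
        return points[mask]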
In an embodiment of the present application, each training pass of iteratively training the two-stage network to extract local features and generate a refined prediction box from the local features includes: extracting voxelized features from the point cloud data corresponding to each preliminary detection box through a voxelized feature extraction sub-network; and generating a refined prediction box using a deep convolutional neural network and the voxelized features.
Voxelization simplifies the model and yields a uniform grid, so the voxelized feature extraction sub-network can be used to extract voxelized features from the point cloud data corresponding to each preliminary detection box.
A convolutional neural network has feature-learning capability and, thanks to its hierarchical structure, can classify input information in a translation-invariant way; it learns grid-like features with a comparatively small amount of computation, performs stably, and imposes no additional feature-engineering requirements on the data. A refined prediction box can therefore be generated using the deep convolutional neural network and the voxelized features. In this way, within each training pass, the two-stage network is iteratively trained to extract local features and generate a refined prediction box from them.
Of course, besides the approach shown in this embodiment, local features can be extracted in other ways. For example, the preliminary prediction box can be gridded and a matrix formed from whether each grid cell contains points of the point cloud: if a cell contains points, the corresponding matrix element is 1; if it contains none, the element is 0; and so on. Using the voxelized features as the local features, as in the embodiment above, is the variant with the better verified effect.
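A minimal sketch of that occupancy-grid alternative follows; the grid shape and helper name are illustrative assumptions, not details fixed by the application:

    import numpy as np

    def occupancy_grid(local_points, box_size, grid_shape=(16, 16, 8)):
        # Binary occupancy matrix over a gridded preliminary box:
        # 1 where a cell contains at least one point, 0 otherwise.
        # local_points: (N, 3) points already expressed in the box's local frame.
        # box_size:     (length, width, height) of the preliminary box.
        grid = np.zeros(grid_shape, dtype=np.float32)
        if len(local_points) == 0:
            return grid
        # Map each point from [-size/2, size/2] to a cell index in [0, dim).
        size = np.asarray(box_size, dtype=np.float32)
        idx = ((local_points[:, :3] / size + 0.5) * np.asarray(grid_shape)).astype(int)
        idx = np.clip(idx, 0, np.asarray(grid_shape) - 1)
        grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
        return grid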
In an embodiment of the present application, extracting the voxelized features from the point cloud data corresponding to each preliminary detection box through the voxelized feature extraction sub-network includes: dividing the point cloud data corresponding to each preliminary detection box into a plurality of voxels; and, for each voxel, extracting point-wise concatenated features through a plurality of voxel feature extraction layers in the voxelized feature extraction sub-network, and passing the point-wise concatenated features through a fully connected neural network layer and a region max-pooling layer to obtain the voxelized feature.
When voxelized features are extracted through the voxelized feature extraction sub-network, the three-dimensional point cloud is divided into a certain number of voxels, i.e., the point cloud data corresponding to each preliminary detection box are divided into a plurality of voxels, and local feature extraction is performed on each non-empty voxel. A voxel, short for volume element (also called a volume pixel), can be displayed either by volume rendering of a volume containing voxels or by extracting a polygonal iso-surface for a given threshold contour. For each voxel, point-wise concatenated features are extracted through a plurality of voxel feature extraction layers in the voxelized feature extraction sub-network, and the voxel feature is obtained by passing the point-wise concatenated features through a fully connected neural network layer and a region max-pooling layer. In this way a finer voxelized feature is achieved.
For example, the three-dimensional point cloud may be divided into four voxels, and for each voxel a point-wise feature may be extracted using several voxel feature extraction layers of the voxelized feature extraction sub-network.
Specifically, each voxel feature extraction layer may perform a VFE (Voxel Feature Encoding) operation as follows: first, a point-wise feature is abstracted with an FCN (fully connected neural network); a locally aggregated feature is then obtained by element-wise max pooling; and the locally aggregated feature is finally concatenated onto each point-wise feature to give the point-wise concatenated feature.
Then, for the voxelized feature extraction sub-network as a whole, the point-wise concatenated features are obtained through several VFE layers and are passed through a fully connected neural network layer and a region max-pooling layer to finally yield the voxelized feature used for the subsequent convolutions. Preferably, the finally extracted voxelized feature is a 128-dimensional vector.
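The following sketch illustrates a VFE layer and the stacked voxelized feature extraction sub-network in the VoxelNet style just described. It is written with PyTorch; the layer widths and the 7-dimensional per-point input are assumptions for illustration, not the application's reference implementation:

    import torch
    import torch.nn as nn

    class VFELayer(nn.Module):
        # One voxel feature encoding (VFE) layer: a per-point fully connected
        # transform, element-wise max pooling over the voxel, and concatenation
        # of the aggregated feature back onto every point-wise feature.
        def __init__(self, in_channels, out_channels):
            super().__init__()
            self.units = out_channels // 2
            self.fcn = nn.Sequential(nn.Linear(in_channels, self.units),
                                     nn.BatchNorm1d(self.units), nn.ReLU())

        def forward(self, points):            # points: (num_points, in_channels)
            pointwise = self.fcn(points)      # (num_points, units)
            aggregated = pointwise.max(dim=0, keepdim=True).values   # (1, units)
            repeated = aggregated.expand(pointwise.shape[0], -1)
            return torch.cat([pointwise, repeated], dim=1)           # (num_points, out_channels)

    class VoxelFeatureExtractor(nn.Module):
        # Stacked VFE layers followed by a fully connected layer and max pooling,
        # producing one 128-dimensional feature per voxel.
        def __init__(self, in_channels=7):    # assumed per-point input width
            super().__init__()
            self.vfe1 = VFELayer(in_channels, 32)
            self.vfe2 = VFELayer(32, 128)
            self.fc = nn.Sequential(nn.Linear(128, 128), nn.ReLU())

        def forward(self, voxel_points):      # points of one non-empty voxel
            x = self.vfe2(self.vfe1(voxel_points))
            x = self.fc(x)
            return x.max(dim=0).values        # (128,) voxel-wise feature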
In an embodiment of the present application, generating a refined prediction box using the deep convolutional neural network and the voxelized features comprises: sequentially performing convolution, max pooling, activation and normalization on the voxelized features using the deep convolutional neural network, and finally generating the refined prediction box through two fully connected layers.
For example, the deep convolutional neural network may have six layers: the first four perform convolution, max pooling, activation (for instance with the ReLU activation function) and batch normalization on the extracted 128-dimensional features, and the last two are fully connected layers that output the prediction result.
The convolution and max-pooling operations take the point with the maximum value in each local receptive field; the ReLU activation function introduces non-linearity into the neural network model, and once non-linearity is introduced the network can approximate arbitrary non-linear functions, which improves its applicability. Batch normalization can greatly shorten the training time of the model.
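A sketch of such a six-layer network in PyTorch is given below. The application does not specify the tensor layout, so the sketch assumes the 128-dimensional per-voxel features are arranged on a small 3D grid inside the box (for example 16 x 16 x 16), and all channel widths are placeholders:

    import torch
    import torch.nn as nn

    def conv_block(in_ch, out_ch):
        # Convolution, max pooling, ReLU activation and batch normalization,
        # in the order described above.
        return nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.MaxPool3d(kernel_size=2),
            nn.ReLU(),
            nn.BatchNorm3d(out_ch),
        )

    class RefinementCNN(nn.Module):
        # Four convolutional blocks followed by two fully connected layers.
        def __init__(self, num_outputs):
            super().__init__()
            self.features = nn.Sequential(
                conv_block(128, 128), conv_block(128, 128),
                conv_block(128, 64), conv_block(64, 64),
            )
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.LazyLinear(256), nn.ReLU(),   # first fully connected layer
                nn.Linear(256, num_outputs),     # second fully connected layer
            )

        def forward(self, voxel_grid):           # (batch, 128, D, H, W), e.g. D=H=W=16
            return self.head(self.features(voxel_grid))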
In an embodiment of the present application, the deep convolutional neural network comprises a classification network and a prediction box regression network, and generating the refined prediction box using the deep convolutional neural network and the voxelized features comprises: using softmax as the activation function of the classification network, the classification network outputting the target class corresponding to the refined prediction box; and the prediction box regression network outputting the center coordinates, length, width, height and deflection angle of the refined prediction box.
That is, the deep convolutional neural network handles two tasks, a classification task and a prediction box regression task, which may also be regarded as two networks. The classification network classifies data according to their features. The prediction box regression network maps the prediction box so as to obtain a regressed box that is closer to the ground-truth box.
In the process of generating the refined prediction box with the deep convolutional neural network and the voxelized features, softmax can be used as the activation function of the classification network; softmax is used for multi-class classification and maps the outputs of several neurons into the interval (0,1), which can be interpreted as probabilities. In this way the classification network outputs the target class corresponding to the refined prediction box. The prediction box regression network outputs the center coordinates, length, width, height and deflection angle of the refined prediction box, so that its exact position can be determined.
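As a minimal sketch (not the application's implementation), the two branches could be realized as sibling linear heads on a shared feature vector, for example the 256-dimensional output of the first fully connected layer in the previous sketch; the feature dimension and class count are placeholders, with the four example classes following the pedestrian/bicycle/car/background classification mentioned above:

    import torch
    import torch.nn as nn

    class DetectionHeads(nn.Module):
        # A softmax classification head and a box regression head that outputs
        # center coordinates, length, width, height and yaw of the refined box.
        def __init__(self, feature_dim=256, num_classes=4):
            super().__init__()
            self.cls_head = nn.Linear(feature_dim, num_classes)
            self.reg_head = nn.Linear(feature_dim, 7)          # x, y, z, l, w, h, yaw

        def forward(self, features):                           # (batch, feature_dim)
            class_probs = torch.softmax(self.cls_head(features), dim=-1)
            box_params = self.reg_head(features)
            return class_probs, box_params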
In one embodiment of the present application, the classification network uses Focal loss as a loss function, and the prediction box regression network uses smooth L1 loss as a loss function.
The classification network can use Focal loss as its loss function. Focal loss is mainly designed for the scenario, common in the training phase of one-stage target detection, in which the ratio of positive (foreground) to negative (background) samples is severely unbalanced; it reduces the weight that the large number of easy negative samples carries in training. The prediction box regression network uses smooth L1 loss as its loss function; smooth L1 loss limits the gradient in two respects: when the difference between the prediction box and the ground truth is too large, the gradient value does not become excessively large, and when the prediction box is very close to the ground truth, the gradient value is small enough. In this way the loss functions are determined.
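For illustration, the two loss functions could be realized as follows: a simplified Focal loss on already-softmaxed class probabilities with a uniform alpha (an assumption; the class-balanced variant weights alpha per class), and PyTorch's built-in smooth L1 loss:

    import torch
    import torch.nn.functional as F

    def focal_loss(class_probs, target, gamma=2.0, alpha=0.25, eps=1e-6):
        # Focal loss: down-weights the many easy, well-classified samples so
        # they do not dominate training. target is a (N,) tensor of class indices.
        pt = class_probs.gather(1, target.unsqueeze(1)).squeeze(1).clamp(min=eps)
        return (-alpha * (1.0 - pt) ** gamma * pt.log()).mean()

    def box_regression_loss(pred_residuals, target_residuals):
        # Smooth L1 loss: quadratic near zero (small gradients when the prediction
        # is already close to the ground truth) and linear for large errors
        # (so the gradient does not blow up when the prediction is far off).
        return F.smooth_l1_loss(pred_residuals, target_residuals)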
The prediction box regression network may determine the center coordinates, length, width, height and deflection angle of the refined prediction box from the residual between the predicted preliminary prediction box and the ground truth, together with the center coordinates, length, width, height and deflection angle of the preliminary prediction box. In use, the input three-dimensional point cloud is first passed through the one-stage network to obtain preliminary prediction boxes; the two-stage network then predicts the deviation of each preliminary prediction box from the real scene, and the refined prediction box is obtained after this correction.
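The application does not give the exact residual parameterization; the sketch below uses the VoxelNet/SECOND-style encoding commonly found in the literature (center offsets normalized by the box diagonal, log-scaled sizes, yaw difference) purely as an assumption:

    import torch

    def encode_residual(gt_box, prelim_box):
        # Regression target: residual between the ground-truth box and the
        # preliminary box, each given as a 7-element tensor (x, y, z, l, w, h, yaw).
        d = torch.sqrt(prelim_box[3] ** 2 + prelim_box[4] ** 2)   # diagonal used to normalize offsets
        return torch.stack([
            (gt_box[0] - prelim_box[0]) / d,
            (gt_box[1] - prelim_box[1]) / d,
            (gt_box[2] - prelim_box[2]) / prelim_box[5],
            torch.log(gt_box[3] / prelim_box[3]),
            torch.log(gt_box[4] / prelim_box[4]),
            torch.log(gt_box[5] / prelim_box[5]),
            gt_box[6] - prelim_box[6],
        ])

    def decode_residual(residual, prelim_box):
        # Apply a predicted residual to the preliminary box to obtain the refined box.
        d = torch.sqrt(prelim_box[3] ** 2 + prelim_box[4] ** 2)
        return torch.stack([
            residual[0] * d + prelim_box[0],
            residual[1] * d + prelim_box[1],
            residual[2] * prelim_box[5] + prelim_box[2],
            torch.exp(residual[3]) * prelim_box[3],
            torch.exp(residual[4]) * prelim_box[4],
            torch.exp(residual[5]) * prelim_box[5],
            residual[6] + prelim_box[6],
        ])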
In an embodiment of the present application, the deep convolutional neural network is controlled during training with a preset number of training epochs and a stop threshold; training is stopped when the number of training epochs exceeds the preset number, or when the change of the loss function over one training epoch is smaller than the preset threshold.
When the number of training epochs exceeds the preset number, or the change of the loss function over one training epoch is smaller than the preset threshold, the predefined training condition of the model has been met and training can be stopped. This controls the stopping condition for training the model and avoids over-training.
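A minimal sketch of that stopping rule wrapped around a generic training loop; the epoch count, threshold and the loss_fn signature are placeholders, not values given by the application:

    def train_two_stage_network(model, optimizer, data_loader, loss_fn,
                                max_epochs=80, stop_threshold=1e-4):
        # Training control with a preset number of epochs and a stop threshold:
        # stop when the epoch count exceeds max_epochs or the change of the
        # epoch loss falls below stop_threshold.
        previous_loss = None
        for epoch in range(1, max_epochs + 1):
            epoch_loss = 0.0
            for voxel_grids, labels, reg_targets in data_loader:
                optimizer.zero_grad()
                class_probs, box_residuals = model(voxel_grids)
                loss = loss_fn(class_probs, labels, box_residuals, reg_targets)
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item()
            if previous_loss is not None and abs(previous_loss - epoch_loss) < stop_threshold:
                break                  # loss change over one epoch below the threshold
            previous_loss = epoch_loss
        return model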
Fig. 2 is a schematic structural diagram of a training apparatus for a target detection model according to an embodiment of the present application, where the target is an object of interest in an autonomous driving scene. As shown in fig. 2, the training apparatus 200 for the target detection model includes:
a preliminary detection unit 210, configured to generate preliminary detection boxes in the three-dimensional point cloud data according to the one-stage network.
The target is usually a 3D object, and the 3D object is recognized mainly by deep learning, using convolutional neural networks and target detection model techniques. The target detection model includes a one-stage network and a two-stage network.
The two-stage design is chosen because, when extracting the features of the target, the sampling mechanism of the single-stage networks in the prior art generates a large number of boxes, which causes an imbalance between positive and negative samples and therefore low localization accuracy. In addition, a detection box generated by a single-stage network usually also contains objects other than the target, i.e., the local features of non-target objects inside the box affect the regression accuracy of the detection box.
The underlying network of the one-stage network can be implemented with the prior art; for example, PointCNN or PointNet can be used to extract, rank and weight local information in the point cloud data that is independent of the order of the points, so as to obtain preliminary detection boxes. Preferably, the voxel feature extraction network VoxelNet can be used as the underlying network implementing the one-stage network.
Preliminary detection boxes are generated in the three-dimensional point cloud data according to the one-stage network; they realize a preliminary detection of the targets, with each preliminary detection box corresponding to the approximate position of one target, and provide the basis for the subsequent refinement.
A classification unit 220, configured to classify the three-dimensional point cloud data according to the preliminary detection boxes and to generate training data according to the classification result.
After the preliminary detection boxes generated by the one-stage network in the three-dimensional point cloud data have been obtained, the three-dimensional point cloud data can be classified according to the preliminary detection boxes, and training data are then generated according to the classification result. A two-stage classification approach is thus adopted: the results of the first stage are classified a second time, which alleviates the sample-imbalance problem and makes the network prediction better. Concretely, the classification may assign each preliminary detection box the category of the detected object, such as pedestrian, bicycle, car or background.
A refining unit 230, configured to iteratively train the two-stage network with the training data to extract local features and generate refined prediction boxes from the local features.
On the basis of the preliminary prediction box results, the two-stage network extracts the local features of the three-dimensional point cloud data, so that object information can be learned better; it is trained iteratively on the training data and generates refined prediction boxes from the local features, thereby refining the prediction boxes for the three-dimensional point cloud data.
Therefore, the apparatus shown in fig. 2 adopts a two-stage network design: the output of the one-stage network is classified a second time, and the local features inside each box are extracted to generate a refined prediction box on the basis of the preliminary detection result, which alleviates the sample-imbalance problem, allows the information of the target to be learned and extracted more effectively, and refines the prediction boxes.
In an embodiment of the present application, in the above apparatus, the refining unit 230 is configured to extract voxelized features from the point cloud data corresponding to each preliminary detection box through a voxelized feature extraction sub-network, and to generate a refined prediction box using a deep convolutional neural network and the voxelized features.
In an embodiment of the present application, in the above apparatus, the refining unit 230 is configured to divide the point cloud data corresponding to each preliminary detection box into a plurality of voxels; and, for each voxel, to extract point-wise concatenated features through a plurality of voxel feature extraction layers in the voxelized feature extraction sub-network, and to pass the point-wise concatenated features through a fully connected neural network layer and a region max-pooling layer to obtain the voxelized feature.
In an embodiment of the present application, in the above apparatus, the refining unit 230 is configured to sequentially perform convolution, max pooling, activation and normalization on the voxelized features using a deep convolutional neural network, and finally to generate the refined prediction box through two fully connected layers.
In an embodiment of the present application, in the above apparatus, the deep convolutional neural network comprises a classification network and a prediction box regression network, and the refining unit 230 is configured to use softmax as the activation function of the classification network, the classification network outputting the target class corresponding to the refined prediction box, and the prediction box regression network outputting the center coordinates, length, width, height and deflection angle of the refined prediction box.
In one embodiment of the present application, in the above apparatus, the classification network uses Focal loss as a loss function, and the prediction box regression network uses smooth L1 loss as a loss function.
In an embodiment of the present application, in the above apparatus, the refining unit 230 is configured to control training with a preset number of training epochs and a stop threshold; training is stopped when the number of training epochs exceeds the preset number, or when the change of the loss function over one training epoch is smaller than the preset threshold.
It should be noted that, for the specific implementation of each apparatus embodiment, reference may be made to the specific implementation of the corresponding method embodiment, which is not described herein again.
To sum up, in the technical solution of the present application, the target detection model comprises a one-stage network and a two-stage network, and the method comprises: generating preliminary detection boxes in three-dimensional point cloud data according to the one-stage network; classifying the three-dimensional point cloud data according to the preliminary detection boxes, and generating training data according to the classification result; and iteratively training the two-stage network with the training data to extract local features and generate refined prediction boxes from the local features. With the two-stage network design, the output of the one-stage network is classified a second time, and the local features inside each box are extracted to generate a refined prediction box on the basis of the preliminary detection result, which alleviates the sample-imbalance problem, allows the information of the target to be learned and extracted more effectively, and refines the prediction boxes.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the training apparatus of the object detection model according to embodiments of the present application. The present application may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
For example, fig. 3 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 300 comprises a processor 310 and a memory 320 arranged to store computer-executable instructions (computer-readable program code). The memory 320 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. The memory 320 has a storage space 330 storing computer-readable program code 331 for performing any of the method steps described above. For example, the storage space 330 may comprise individual pieces of computer-readable program code 331 for implementing the respective steps of the above method. The computer-readable program code 331 may be read from or written to one or more computer program products. These computer program products comprise a program code carrier such as a hard disk, a compact disc (CD), a memory card or a floppy disk. Such a computer program product is typically a computer-readable storage medium such as described with reference to fig. 4. Fig. 4 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application. The computer-readable storage medium 400 stores computer-readable program code 331 for performing the steps of the method according to the application, readable by the processor 310 of the electronic device 300; when executed by the electronic device 300, this code causes the electronic device 300 to perform the steps of the method described above. In particular, the computer-readable program code 331 stored on the computer-readable storage medium may perform the method shown in any of the embodiments described above. The computer-readable program code 331 may be compressed in a suitable form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any ordering; these words may be interpreted as names.

Claims (10)

1. A training method for a target detection model, wherein the target detection model comprises a one-stage network and a two-stage network, the target is an object of interest in an autonomous driving scene, and the method comprises the following steps:
generating preliminary detection boxes in three-dimensional point cloud data according to the one-stage network;
classifying the three-dimensional point cloud data according to the preliminary detection boxes, and generating training data according to the classification result;
and iteratively training the two-stage network with the training data to extract local features and generate refined prediction boxes from the local features.
2. The method of claim 1, wherein each training pass of iteratively training the two-stage network to extract local features and generate a refined prediction box from the local features comprises:
extracting voxelized features from the point cloud data corresponding to each preliminary detection box through a voxelized feature extraction sub-network;
and generating a refined prediction box using a deep convolutional neural network and the voxelized features.
3. The method of claim 2, wherein extracting the voxelized features from the point cloud data corresponding to each preliminary detection box through the voxelized feature extraction sub-network comprises:
dividing the point cloud data corresponding to each preliminary detection box into a plurality of voxels;
and for each voxel, extracting point-wise concatenated features through a plurality of voxel feature extraction layers in the voxelized feature extraction sub-network, and passing the point-wise concatenated features through a fully connected neural network layer and a region max-pooling layer to obtain the voxelized feature.
4. The method of claim 2, wherein generating a refined prediction box using the deep convolutional neural network and the voxelized features comprises:
sequentially performing convolution, max pooling, activation and normalization on the voxelized features using the deep convolutional neural network, and finally generating the refined prediction box through two fully connected layers.
5. The method of claim 4, wherein the deep convolutional neural network comprises a classification network and a prediction box regression network, and generating a refined prediction box using the deep convolutional neural network and the voxelized features comprises:
using softmax as the activation function of the classification network, the classification network outputting the target class corresponding to the refined prediction box; and the prediction box regression network outputting the center coordinates, length, width, height and deflection angle of the refined prediction box.
6. The method of claim 5, wherein the classification network uses Focal loss as a loss function and the prediction box regression network uses smooth L1 loss as a loss function.
7. The method of claim 6, wherein during training the deep convolutional neural network is controlled with a preset number of training epochs and a stop threshold;
and training is stopped when the number of training epochs exceeds the preset number, or when the change of the loss function over one training epoch is smaller than the preset threshold.
8. A training apparatus for a target detection model, wherein the target is an object of interest in an autonomous driving scene, the target detection model comprises a one-stage network and a two-stage network, and the apparatus comprises:
a preliminary detection unit, configured to generate preliminary detection boxes in three-dimensional point cloud data according to the one-stage network;
a classification unit, configured to classify the three-dimensional point cloud data according to the preliminary detection boxes and to generate training data according to the classification result;
and a refining unit, configured to iteratively train the two-stage network with the training data to extract local features and generate refined prediction boxes from the local features.
9. An electronic device, wherein the electronic device comprises: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the method of any one of claims 1-7.
10. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-7.
CN202010010252.8A 2020-01-06 2020-01-06 Training method and device of target detection model, electronic equipment and storage medium Pending CN111241964A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010010252.8A CN111241964A (en) 2020-01-06 2020-01-06 Training method and device of target detection model, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010010252.8A CN111241964A (en) 2020-01-06 2020-01-06 Training method and device of target detection model, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111241964A true CN111241964A (en) 2020-06-05

Family

ID=70865861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010010252.8A Pending CN111241964A (en) 2020-01-06 2020-01-06 Training method and device of target detection model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111241964A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512725A (en) * 2015-12-14 2016-04-20 杭州朗和科技有限公司 Neural network training method and equipment
US10311311B1 (en) * 2017-08-31 2019-06-04 Ambarella, Inc. Efficient two-stage object detection scheme for embedded device
CN107871119A (en) * 2017-11-01 2018-04-03 西安电子科技大学 A kind of object detection method learnt based on object space knowledge and two-stage forecasting
CN108564097A (en) * 2017-12-05 2018-09-21 华南理工大学 A kind of multiscale target detection method based on depth convolutional neural networks
CN109241865A (en) * 2018-08-14 2019-01-18 长安大学 A kind of vehicle detection partitioning algorithm under weak contrast's traffic scene
CN109598290A (en) * 2018-11-22 2019-04-09 上海交通大学 A kind of image small target detecting method combined based on hierarchical detection
CN109829476A (en) * 2018-12-27 2019-05-31 青岛中科慧畅信息科技有限公司 End-to-end three-dimension object detection method based on YOLO
CN109740676A (en) * 2019-01-07 2019-05-10 电子科技大学 Object detection moving method based on similar purpose
CN110032935A (en) * 2019-03-08 2019-07-19 北京联合大学 A kind of traffic signals label detection recognition methods based on deep learning cascade network

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860493A (en) * 2020-06-12 2020-10-30 北京图森智途科技有限公司 Target detection method and device based on point cloud data
CN111860493B (en) * 2020-06-12 2024-02-09 北京图森智途科技有限公司 Target detection method and device based on point cloud data
CN111797711A (en) * 2020-06-15 2020-10-20 北京三快在线科技有限公司 Model training method and device
CN112419269A (en) * 2020-11-23 2021-02-26 Chengdu Guimu Robot Co Ltd Construction method and application of improved Focal loss function for improving pavement disease segmentation effect
CN112419269B (en) * 2020-11-23 2023-05-26 成都圭目机器人有限公司 Pavement disease segmentation method
WO2022179164A1 (en) * 2021-02-24 2022-09-01 华为技术有限公司 Point cloud data processing method, training data processing method, and apparatus

Similar Documents

Publication Publication Date Title
US10691952B2 (en) Adapting to appearance variations when tracking a target object in video sequence
US10510146B2 (en) Neural network for image processing
US10318848B2 (en) Methods for object localization and image classification
CN111241964A (en) Training method and device of target detection model, electronic equipment and storage medium
CN113420729B (en) Multi-scale target detection method, model, electronic equipment and application thereof
CN110263786B (en) Road multi-target identification system and method based on feature dimension fusion
CN111368972B (en) Convolutional layer quantization method and device
CN111310604A (en) Object detection method and device and storage medium
CN113822209B (en) Hyperspectral image recognition method and device, electronic equipment and readable storage medium
Khalid et al. Automatic measurement of the traffic sign with digital segmentation and recognition
Natarajan et al. Traffic sign recognition using weighted multi‐convolutional neural network
US11695898B2 (en) Video processing using a spectral decomposition layer
CN109034086A (en) Vehicle recognition methods, apparatus and system again
Pei et al. Localized traffic sign detection with multi-scale deconvolution networks
CN112464930A (en) Target detection network construction method, target detection method, device and storage medium
Sharma et al. Vehicle identification using modified region based convolution network for intelligent transportation system
CN111783754A (en) Human body attribute image classification method, system and device based on part context
Padmanabula et al. Object Detection Using Stacked YOLOv3.
Wang et al. Occluded vehicle detection with local connected deep model
Dang et al. Improved YOLOv5 for real-time traffic signs recognition in bad weather conditions
Liang et al. Car detection and classification using cascade model
CN111738069A (en) Face detection method and device, electronic equipment and storage medium
CN111931680A (en) Vehicle weight recognition method and system based on multiple scales
Bouzi et al. Multi-Column Convolutional Neural Network for Vehicle-Type Classification
Hamzeh et al. Improving the performance of automotive vision‐based applications under rainy conditions

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200605

WD01 Invention patent application deemed withdrawn after publication