CN114332201A - Model training and target detection method and device


Info

Publication number: CN114332201A
Application number: CN202111452181.8A
Authority: CN (China)
Prior art keywords: prediction, target, target object, image, heat map
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 王子扩, 李亚蓓, 冯阳
Assignee: Beijing Sankuai Online Technology Co Ltd
Application filed by Beijing Sankuai Online Technology Co Ltd


Abstract

The specification discloses a method and an apparatus for model training and target detection. The method can be applied to an automatic driving system. During training, the prediction heat map is supervised with an annotation heat map mapped from an elliptical Gaussian kernel function, so that the heat values of the points on a target object in the prediction heat map output by the trained target detection model are distributed in an elliptical shape: the closer a point is to the center of the ellipse, the higher its heat value, and the farther it is from the center, the lower its heat value. Since the shape of a vehicle in the real world is closer to a rectangle than to a square or a circle, such a heat map can more accurately represent the probability that the center point of the target object lies at each position.

Description

Model training and target detection method and device
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to a method and an apparatus for model training and target detection.
Background
One task of the target detection algorithm is to locate the position of a target object in data to be detected by using a detection frame with a simpler geometric shape.
In some target detection algorithms, predicting the detection frame of the target object is replaced by predicting a key point of the target object, for example the position of its center point, together with the size of the detection frame, so as to obtain the predicted detection frame of the target object.
In this case, the target detection model may output a prediction heat map in which the heat value of each point located on the target object represents the probability that the point is the center point of the target object. In general, for each point located on the target object in the prediction heat map, the higher the heat value of the point, the higher the probability that it is the center point of the target object, and the point with the highest heat value in the prediction heat map may then be used as the key point of the target object on which it lies.
In the prior art, an annotation heat map (as shown in fig. 1) is obtained by projecting the annotated center point of the target object with a circular Gaussian kernel function, and this annotation heat map is used to supervise the prediction heat map output by the target detection model. In the area of the annotation heat map onto which the points of the target object are projected, the heat value of each point decreases from the annotated center point outward along the radius; that is, the farther a point on the target object is from the annotated center point, the lower the probability that it is the center point.
However, when the target object to be detected is a vehicle, especially a truck that is narrow in width and long in length, the heat map output by a target detection model trained under the supervision of an annotation heat map projected with a circular Gaussian kernel cannot accurately represent the probability that the center point of the target object lies at each position.
Disclosure of Invention
The present specification provides a method and an apparatus for model training and target detection, which partially solve the above problems in the prior art.
The technical solution adopted by this specification is as follows:
the present specification provides a model training method, comprising:
determining a sample image to be detected;
determining, through the feature extraction sub-network in the target detection model to be trained, the image features corresponding to the sample image to be detected;
taking the image features output by the feature extraction sub-network as the input of a first prediction sub-network of a target detection model, and outputting a prediction heat map of the sample image through the first prediction sub-network;
acquiring an annotation heat map determined for the sample image in advance, and adjusting parameters in the target detection model by taking the minimum difference between the prediction heat map and the annotation heat map of the sample image as a target;
wherein determining the annotation heat map for the sample image in advance specifically includes:
acquiring an annotation center point determined for the target object contained in the sample image, and mapping each point contained in the sample image to an annotation heat map to be annotated based on the annotation center point and a specified elliptical Gaussian kernel function, so as to obtain the annotation heat map of the sample image.
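As a concrete illustration of this mapping step, the following is a minimal sketch of an elliptical Gaussian heat map. The specification does not fix the kernel parameters, so deriving the two standard deviations from the annotated box width and height, and the `scale` factor, are assumptions made here for illustration.

```python
import numpy as np

def elliptical_gaussian_heatmap(height, width, center, box_w, box_h, scale=6.0):
    """Map every pixel of the annotation heat map to a heat value using an
    elliptical Gaussian kernel centred on the annotated center point."""
    cx, cy = center
    sigma_x = max(box_w / scale, 1e-3)  # spread along the annotated box width
    sigma_y = max(box_h / scale, 1e-3)  # spread along the annotated box height
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-(((xs - cx) ** 2) / (2 * sigma_x ** 2)
                    + ((ys - cy) ** 2) / (2 * sigma_y ** 2)))

# A long, narrow (truck-like) box yields an elongated kernel: heat values decay
# slowly along the box length and quickly across its width.
label_heat = elliptical_gaussian_heatmap(128, 128, center=(64, 64), box_w=80, box_h=20)
```

Because the two standard deviations differ, the heat decays faster along the short side of the box than along the long side, which matches the behavior described for the elliptical kernel.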
Optionally, the method further comprises:
determining a plurality of prediction key points of the target object in the prediction heat map according to the prediction heat map of the sample image;
inputting the image features output by the feature extraction sub-network into a second prediction sub-network of the target detection model, and outputting, through the second prediction sub-network and for each prediction key point, the prediction distance from the prediction key point to a prediction detection frame of the target object in each specified direction, as the prediction distances corresponding to that prediction key point;
for each prediction key point, determining, according to the prediction key point and the prediction distances corresponding to it, a prediction detection frame predicted for the target object based on that prediction key point;
acquiring the annotation detection frame of the target object predetermined in the annotation heat map, and adjusting parameters in the target detection model with the goal of minimizing the difference between each prediction detection frame and the annotation detection frame;
the method for predetermining the annotation detection frame of the target object in the annotation heat map specifically includes:
and mapping the annotation detection frame contained in the sample image to an annotation heat map to be annotated based on the annotation central point and a specified elliptical Gaussian kernel function to obtain the annotation detection frame of the target object in the annotation heat map.
Optionally, determining a plurality of prediction key points of the target object in the prediction heat map according to the prediction heat map of the sample image, specifically including:
determining a heat value of each point on the target object in the predicted heat map, wherein for each point on the target object in the predicted heat map, the greater the heat value of the point, the greater the probability that the point is the center point of the target object;
and selecting points with the heat value larger than a preset heat threshold value from all points on the target object as prediction key points on the target object.
Optionally, with a minimum difference between each prediction detection frame and the labeling detection frame as a target, adjusting parameters in the target detection model specifically includes:
determining a detection frame set formed by each prediction detection frame and the label detection frame;
and for each detection frame in the detection frame set, determining each specified corner point of the detection frame, and adjusting parameters in the target detection model with the goal of minimizing the difference between the specified corner points of each prediction detection frame and the corresponding corner points of the annotation detection frame, wherein the number of specified corner points is at least two.
Optionally, the method further comprises:
using the image features output by the feature extraction sub-network as the input of a third prediction sub-network of a target detection model, and predicting the deflection angle of a detection frame of the target object relative to the specified direction through the third prediction sub-network to be used as the predicted deflection angle of the target object;
and acquiring the determined labeled deflection angle of the target object, and adjusting parameters in the target detection model by taking the minimum difference between the predicted deflection angle and the labeled deflection angle of the target object as a target.
Optionally, before determining the sample image to be detected, the method further comprises:
acquiring point cloud data to be detected;
determining a sample image to be detected specifically comprises:
projecting the point cloud data to a designated plane in a space where the point cloud data is located to obtain a projection diagram on the designated plane as a sample image to be detected;
determining, through the feature extraction sub-network in the target detection model to be trained, the image features corresponding to the sample image to be detected specifically includes:
dividing the space where the point cloud data is located into a plurality of voxels, and performing feature extraction on the point cloud data in each voxel to obtain the extracted point cloud features in the voxel;
and determining, according to the extracted point cloud features of each voxel and through the feature extraction sub-network in the target detection model to be trained, a point cloud feature map on the designated plane as the image features corresponding to the sample image to be detected.
Optionally, the method further comprises:
using the image features output by the feature extraction sub-network as the input of a fourth prediction sub-network of a target detection model, and predicting the height of a detection frame of the target object through the fourth prediction sub-network to be used as the predicted height of the target object;
and acquiring the determined labeling height of the target object, and adjusting parameters in the target detection model by taking the minimum difference between the predicted height and the labeling height of the target object as a target.
The present specification provides a target detection method, including:
determining a target image to be detected;
determining, through the feature extraction sub-network in a target detection model trained by any one of the above methods, the image features corresponding to the target image;
using the image features output by the feature extraction sub-network as the input of a first prediction sub-network of a target detection model, and outputting a prediction heat map of the target image through the first prediction sub-network, wherein for each point on a target object in the prediction heat map, the greater the heat value of the point, the greater the probability that the point is the central point of the target object;
and outputting a prediction detection frame of the target object according to the prediction heat map of the target image.
Optionally, outputting a prediction detection frame of the target object according to the prediction heat map of the target image, specifically including:
determining the heat value of each point in the prediction heat map, and taking the point with the extreme heat value on the target object as a prediction key point of the target object;
the image features output by the feature extraction sub-network are used as the input of a second prediction sub-network of a target detection model, and the prediction distance between the prediction key point of the target object and a prediction detection frame of the target object in each specified direction is output through the second prediction sub-network;
and determining the prediction detection frame of the target object according to the prediction key point of the target object and the prediction distance between the prediction key point of the target object and the prediction detection frame of the target object in each specified direction.
Optionally, outputting a prediction detection frame of the target object according to the prediction heat map of the target image, specifically including:
determining the shape of a prediction detection frame of the target object according to the prediction heat map of the target image;
the image features output by the feature extraction sub-network are used as the input of a third prediction sub-network of a target detection model, and the deflection angle of a detection frame of the target object relative to the specified direction is output through the third prediction sub-network and is used as the predicted deflection angle of the target object;
and outputting the predicted detection frame of the target object according to the shape of the predicted detection frame of the target object and the predicted deflection angle of the target object.
Optionally, the extracting of the image feature corresponding to the target image specifically includes:
acquiring point cloud data to be detected;
dividing the space where the point cloud data is located into a plurality of voxels, and performing feature extraction on the point cloud data in each voxel to obtain the extracted point cloud features in the voxel;
determining, according to the extracted point cloud features of each voxel and through the feature extraction sub-network in the target detection model, a point cloud feature map on the designated plane as the image features corresponding to the target image;
outputting a prediction detection frame of the target object according to the prediction heat map of the target image, specifically comprising:
and outputting a three-dimensional detection frame of the target object contained in the target image in a three-dimensional space according to the prediction heat map of the target image.
Optionally, outputting a three-dimensional detection frame of the target object included in the target image in the three-dimensional space according to the predicted heat map of the target image, specifically including:
determining a two-dimensional detection frame of the target object in the target image according to the prediction heat map of the target image, and using the two-dimensional detection frame as the prediction detection frame of the target object;
the image features output by the feature extraction sub-network are used as the input of a fourth prediction sub-network of a target detection model, and the height of the target object from the specified plane is output through the fourth prediction sub-network and is used as the prediction height of the target object;
and determining a three-dimensional detection frame of the target object in a three-dimensional space according to the two-dimensional detection frame of the target object in the target image and the predicted height of the target object, and updating the determined three-dimensional detection frame into the predicted detection frame of the target object.
This specification provides a model training device, comprising:
the image determining module is used for determining a sample image to be detected;
the feature extraction module is used for determining, through the feature extraction sub-network in the target detection model to be trained, the image features corresponding to the sample image to be detected;
the heat map prediction module is used for taking the image features output by the feature extraction sub-network as the input of a first prediction sub-network of a target detection model and outputting a prediction heat map of the sample image through the first prediction sub-network;
the parameter adjusting module is used for acquiring an annotation heat map determined for the sample image in advance, and adjusting parameters in the target detection model by taking the minimum difference between the prediction heat map and the annotation heat map of the sample image as a target;
the annotation module is used for determining an annotation heat map for the sample image in advance, and specifically comprises: and acquiring an annotation central point determined for the target object contained in the sample image, and mapping each point contained in the sample image to an annotation heat map to be annotated based on the annotation central point and a specified elliptic Gaussian kernel function to obtain the annotated annotation heat map of the sample image.
The present specification provides an object detection apparatus comprising:
the image determining module is used for determining a target image to be detected;
the feature extraction module is used for determining, through the feature extraction sub-network in the target detection model trained by the above method, the image features corresponding to the target image;
the heat map prediction module is used for taking the image features output by the feature extraction sub-network as the input of a first prediction sub-network of a target detection model, and outputting a prediction heat map of the target image through the first prediction sub-network, wherein for each point on a target object in the prediction heat map, the greater the heat value of the point, the greater the probability that the point is the central point of the target object;
and the target detection module is used for outputting a prediction detection frame of the target object according to the prediction heat map of the target image.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described model training and target detection method.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above model training method when executing the program.
The present specification provides an autopilot device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above object detection method when executing the program.
The technical scheme adopted by the specification can achieve the following beneficial effects:
In the model training and target detection methods provided by this specification, during training the prediction heat map is supervised with an annotation heat map mapped from an elliptical Gaussian kernel function, so that the heat values of the points on a target object in the prediction heat map output by the trained target detection model are distributed in an elliptical shape: the closer a point is to the center of the ellipse, the higher its heat value, and the farther it is from the center, the lower its heat value. Since the shape of a vehicle in the real world is closer to a rectangle than to a square or a circle, the heat map output by a target detection model trained based on this method can more accurately represent the probability that the center point of the target object lies at each position.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification and are incorporated in and constitute a part of this specification, illustrate embodiments of the specification and together with the description serve to explain the specification without limiting it. In the drawings:
FIG. 1 is a schematic diagram of a heat map mapped based on a circular Gaussian kernel function in the present specification;
FIG. 2 is a schematic structural diagram of an object detection model provided in the present specification;
FIG. 3 is a schematic flow chart of a model training method provided herein;
FIG. 4 is a schematic diagram of a model training apparatus provided herein;
FIG. 5 is a schematic view of an object detection device provided herein;
fig. 6 is a schematic structural diagram of an autopilot apparatus provided in this specification.
Detailed Description
The embodiment of the description provides a target detection model, a model training method for training the target detection model, and a target detection method adopting the target detection model.
The target detection in the embodiments of the present specification aims to predict a detection frame for a target object, that is, a detection frame with a simpler geometric shape is used to locate the position of the target object in data to be detected.
In order to make the objects, technical solutions and advantages of the present disclosure more clear, the technical solutions of the present disclosure will be clearly and completely described below with reference to the specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without any creative effort belong to the protection scope of the present specification.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
First, a model structure diagram of the target detection model is shown in fig. 2, where the target detection model includes a feature extraction layer, a first prediction subnet, a second prediction subnet, a third prediction subnet, and a fourth prediction subnet.
In actual application, the image features corresponding to the target image are first extracted from the input data to be detected by the feature extraction layer. The image features are then fed into the first, second, third and fourth prediction sub-networks respectively to obtain the output of each prediction sub-network. As shown in fig. 2, in an embodiment of this specification the outputs of the prediction sub-networks are a prediction heat map, prediction distances, a predicted deflection angle and a predicted height; according to these outputs, the detection frame predicted for the target object, that is, the prediction detection frame, can be determined.
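As a rough sketch of the structure in fig. 2, the following PyTorch module wires one shared feature extractor to four prediction sub-networks. The layer types, channel counts and the sigmoid on the heat map head are illustrative assumptions; the patent does not specify them.

```python
import torch
import torch.nn as nn

class TargetDetectionModel(nn.Module):
    """Sketch of the structure in fig. 2: a shared feature extraction layer
    followed by four prediction sub-networks (assumed layer sizes)."""

    def __init__(self, in_channels=64, feat_channels=128):
        super().__init__()
        # Feature extraction layer (a real model might use a ResNet backbone).
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(),
        )

        def head(out_channels):
            return nn.Sequential(
                nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(feat_channels, out_channels, 1),
            )

        self.heatmap_head = head(1)    # first sub-network: prediction heat map
        self.distance_head = head(4)   # second sub-network: distances to the four sides
        self.angle_head = head(1)      # third sub-network: predicted deflection angle
        self.height_head = head(1)     # fourth sub-network: predicted height

    def forward(self, x):
        feat = self.feature_extractor(x)
        return {
            "heatmap": torch.sigmoid(self.heatmap_head(feat)),
            "distances": self.distance_head(feat),
            "angle": self.angle_head(feat),
            "height": self.height_head(feat),
        }
```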
Specifically, the target object depends on the application scenario of the target detection model. When the model is used for face recognition, the target object may be a person's face in the data to be detected collected by a sensor; when the model is used in the field of automatic driving to detect obstacles around the automatic driving device, the target object may be each obstacle in the data to be detected collected by a sensor, where the obstacles may be pedestrians, motor vehicles, non-motor vehicles, and the like. In the following, the target detection model, the target detection method and the model training method provided in the embodiments of this specification are described taking an obstacle as the target object.
The execution subject of the target detection method provided by this specification and the execution subject of the model training method may be the same or different. Taking the case where they are different as an example, each of them may be any existing server or electronic device; further, when the execution subject of the target detection method is not the automatic driving device itself, there is a communication connection between that execution subject and the automatic driving device. On this basis, when an execution subject is an electronic device, it may be any existing electronic device such as a mobile phone, a notebook computer or a tablet computer, and when it is a server, it may be a cluster server, a distributed server, and the like.
In an embodiment of the present specification, an execution subject of the model training method is taken as a server, and an execution subject of the target detection method is taken as an example of an automatic driving device.
Further, the automatic driving apparatus described in this specification may include an automatic driving vehicle and a vehicle having a driving assistance function, among others. The autopilot device may be a delivery vehicle for use in the delivery field.
The data to be detected contains the target object. The data to be detected may be data collected by a sensor, and its data type may differ according to the type of the sensor: for example, when the sensor is a detection device such as a radar, the data to be detected may be point cloud data, and when the sensor is an image acquisition device, the data to be detected may be image data. In an embodiment of this specification, the data to be detected may be a target image acquired by an image acquisition device.
In an embodiment of this specification, a two-dimensional detection frame of the target object on a certain specified plane may be predicted according to the prediction heat map output by the first prediction subnet, the prediction distances output by the second prediction subnet and the predicted deflection angle output by the third prediction subnet, and a three-dimensional detection frame of the target object may then be output according to the two-dimensional detection frame and the predicted height output by the fourth prediction subnet. In an embodiment of this specification, the plane where the image coordinate system of the target image lies may be the specified plane.
Generally speaking, when the target detection model is used in the field of automatic driving, the specified plane may be the road surface on which the automatic driving device is located, and the target image may further be a top view of that road surface. Of course, any plane in the space where the automatic driving device is located may also be used; in that case the fourth prediction subnet needs to output a first height from the near end of the target object to the specified plane and a second height from the far end of the target object to the specified plane, and the three-dimensional detection frame of the target object is then determined according to the difference between the second height and the first height and the prediction detection frame of the target object. For brevity, the description below takes the specified plane to be the road surface on which the automatic driving device is located.
In this embodiment, the two-dimensional detection frame may be referred to as the prediction detection frame of the target object.
Specifically, the first prediction subnet outputs a prediction heat map in which the heat value of each point on the target object indicates the probability that the point is the center point of the target object. For each point on the target object in the prediction heat map, the higher the heat value of the point, the higher the probability that it is the center point, so the point with the maximum heat value may be used as the key point of the target object. Of course, if the convention is reversed so that a smaller heat value indicates a higher probability of being the center point, the point with the minimum heat value may be used as the key point of the target object.
In an embodiment of this specification, the key point of the target object may be its center point. In this case, the prediction detection frame of the target object may be obtained from a preset standard detection frame having a standard width and a standard height, by taking the center point output by the first prediction subnet as the center point of that standard detection frame.
In another embodiment of this specification, the prediction detection frame of the target object may instead be determined according to the outputs of the first prediction subnet and the second prediction subnet. Specifically, the second prediction subnet outputs the prediction distance from the key point to the prediction detection frame of the target object in each specified direction.
In practical applications, a rectangular frame is generally used as a two-dimensional detection frame of a target object, in this case, the specified direction may be a direction perpendicular to four sides of the rectangular frame, and the predicted distances may be distances from key points of the target object to the four sides of the rectangular frame, respectively.
In an embodiment of this specification, when the key point of the target object is the center point of the target object, the predicted distances to the two opposite sides of the prediction detection frame output by the second prediction subnet are equal, so in this case the second prediction subnet may output only the width and the height of the prediction detection frame. Of course, the embodiments of this specification do not limit whether the key point of the target object is the center point of the target object.
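A minimal sketch of how a detection frame can be recovered from a key point and the predicted distances (function and variable names are illustrative, not taken from the patent):

```python
def frame_from_keypoint(key_x, key_y, d_left, d_right, d_top, d_bottom):
    """Recover an axis-aligned detection frame from one key point and the
    predicted distances to its four sides."""
    return key_x - d_left, key_y - d_top, key_x + d_right, key_y + d_bottom

def frame_from_center(cx, cy, w, h):
    """If the key point is the center point, distances to opposite sides are
    equal, so width and height are enough."""
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
```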
In addition, the target detection model further includes a third prediction sub-network, which outputs the deflection angle of the detection frame of the target object relative to a specified direction as the predicted deflection angle of the target object. The prediction detection frame of the target object can then be determined according to the shape of the prediction detection frame determined in any of the above manners together with the predicted deflection angle.
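A small sketch of applying the predicted deflection angle: the corners of the axis-aligned frame are rotated around the key point. The exact parameterization of the angle is not prescribed by the patent, so this is only one plausible reading.

```python
import numpy as np

def rotate_frame(corners, angle_rad, pivot):
    """Rotate the four corners of an axis-aligned frame around `pivot`
    (for example the key point) by the predicted deflection angle."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s], [s, c]])
    corners = np.asarray(corners, dtype=float)  # shape (4, 2)
    pivot = np.asarray(pivot, dtype=float)
    return (corners - pivot) @ rot.T + pivot
```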
Furthermore, the prediction detection frame of the target object can also be a three-dimensional detection frame.
Specifically, the image features output by the feature extraction subnet may be used as the input of the fourth prediction subnet of the target detection model, and the height of the target object from the specified plane is output through the fourth prediction subnet as the predicted height of the target object. Then, according to the two-dimensional detection frame of the target object determined in any of the above manners and the predicted height of the target object, the three-dimensional detection frame of the target object in three-dimensional space is determined, and the determined three-dimensional detection frame is updated to be the prediction detection frame of the target object.
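A minimal sketch of this lifting step, assuming the two-dimensional frame lies on the specified plane at height zero and the fourth prediction subnet outputs the height of the box above that plane:

```python
def lift_to_3d(frame_corners_2d, predicted_height):
    """Extrude a two-dimensional detection frame lying on the specified plane
    upward by the predicted height to obtain the eight corners of the
    three-dimensional detection frame."""
    bottom = [(x, y, 0.0) for x, y in frame_corners_2d]
    top = [(x, y, predicted_height) for x, y in frame_corners_2d]
    return bottom + top
```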
Generally, when data to be detected only includes two-dimensional features, it is difficult to predict a three-dimensional detection frame of a target object, and therefore, in an embodiment of this specification, the data to be detected may be three-dimensional data, for example, point cloud data, or image data including depth data.
Specifically, point cloud data to be detected can be obtained, and the space where the point cloud data is located is divided into a plurality of voxels. For each voxel, feature extraction is performed on the point cloud data in the voxel to obtain the extracted point cloud features of that voxel. Then, according to the extracted point cloud features of each voxel, a point cloud feature map on the designated plane is determined through the feature extraction sub-network in the target detection model and used as the image features corresponding to the target image. A voxel is a volume element, that is, a unit obtained by dividing three-dimensional space.
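A simplified sketch of the voxelization step. The per-voxel feature used here (the mean point height) is only a stand-in for a learned encoder such as an MLP, and the grid and voxel sizes are assumptions:

```python
import numpy as np

def pointcloud_to_plane_features(points, voxel_size=(0.2, 0.2, 0.2), grid=(256, 256, 32)):
    """Divide the space into voxels, compute a simple per-voxel feature
    (mean point height), and collapse the vertical axis to obtain a feature
    map on the designated plane."""
    nx, ny, nz = grid
    idx = np.floor(points[:, :3] / np.asarray(voxel_size)).astype(int)
    keep = ((idx >= 0) & (idx < np.array([nx, ny, nz]))).all(axis=1)
    idx, pts = idx[keep], points[keep]

    feat_sum = np.zeros((nx, ny, nz))
    count = np.zeros((nx, ny, nz))
    for (ix, iy, iz), p in zip(idx, pts):
        feat_sum[ix, iy, iz] += p[2]  # use the point height as a crude feature
        count[ix, iy, iz] += 1
    voxel_feat = np.divide(feat_sum, count,
                           out=np.zeros_like(feat_sum), where=count > 0)
    return voxel_feat.max(axis=2)     # (nx, ny) map on the designated plane
```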
The feature extraction subnet may be any existing machine learning model, for example, may be Deep residual network (ResNet), Multilayer Perceptron (MLP), and the like, which is not limited in this specification.
In this case, the image features are extracted from point-cloud-type data to be detected. Therefore, in an embodiment of this specification, the two-dimensional detection frame may be determined directly on a specified plane of the space where the point cloud data is located in any of the above manners, and the three-dimensional detection frame of the target object in that space may then be determined according to the predicted height of the target object output by the fourth prediction subnet; that is, the point cloud data does not need to be projected to obtain the target image.
Although the target image is an image obtained by projecting the data to be detected onto a certain plane, the annotations are made directly in the data to be detected to obtain the annotated center point, the annotated detection frame and the like of the target object; each annotation is then projected onto the specified plane in any of the above manners to obtain the annotation on the specified plane. In this case, the target object contained in the target image can refer to the area inside the annotated detection frame in the target image.
Before the target detection model is put into actual use, it needs to be trained to ensure the accuracy of the detection results it outputs. Of course, after the target detection model has been applied for a period of time, the accuracy of its detection results may degrade because newly added data differ from the samples used during training; at that time the target detection model may also be retrained.
It can be seen that any target detection model with training requirements can be the target detection model in the embodiments of the present specification. The following embodiments of the present specification provide a model training method for training the target detection model to be trained.
First, the sample image may be labeled.
Similar to the application phase, the sample image may also be an image containing the target object collected in advance by a sensor. In the training stage, the annotated center point of the target object may be labeled in the sample image in advance. Then, the annotation center point determined for the target object contained in the sample image can be obtained, and based on the annotation center point and a specified elliptical Gaussian kernel function, each point contained in the sample image is mapped to an annotation heat map to be annotated, so as to obtain the annotation heat map of the sample image.
The embodiments of this specification do not limit the specific parameters of the elliptical Gaussian kernel function. It should be noted that, in the annotation heat map projected by the elliptical Gaussian kernel function, the heat value of the point onto which the annotated center point of the target object is projected is greater than the heat values of the points onto which the other points on the target object are projected. Furthermore, the heat values of the points around the projection of the annotated center point decrease as their distance to that projected point increases, and along the circumferential direction of the ellipse, the smaller the angle between a point and the minor axis of the ellipse, the larger the variation amplitude of its heat value.
then, an embodiment of the present specification provides a flowchart of a method for training a model as shown in fig. 3, which specifically includes the following steps:
s300: and determining a sample image to be detected.
S302: determining, through the feature extraction sub-network in the target detection model to be trained, the image features corresponding to the sample image to be detected.
Generally, to improve the accuracy of the model output, the samples used in the model training stage should be similar to the data processed in the actual application stage. Therefore, in the embodiments of this specification the sample image is similar to the target image of the application stage: specifically, the plane where the image coordinate system of the sample image lies may also be a certain specified plane; further, the specified plane may be the road surface, the sample image may be a top view of the road surface, and so on.
A sample image with a corresponding annotation can be used as the sample image to be detected, and the image features corresponding to it are determined through the feature extraction sub-network in the target detection model to be trained. Similar to the application stage, the image features may also be extracted from the data to be detected; here the data to be detected is taken to be the sample image for illustration.
Further, in the training phase, when the data to be detected is point cloud data, the specification exemplarily provides a labeling mode:
for each target object, an annotated center point and a three-dimensional annotation frame are labeled in the three-dimensional space where the point cloud data is located, and the labeled center point and three-dimensional annotation frame are then projected onto a specified plane of that three-dimensional space to obtain the annotated center point and a two-dimensional annotation frame of the target object on the specified plane. That is, the image on the specified plane contains only the annotated center point and the two-dimensional annotation frame rather than the target object itself, and the points inside the two-dimensional annotation frame can be regarded as points located on the target object.
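A minimal sketch of projecting the three-dimensional annotations onto the specified plane, assuming the plane is perpendicular to the z axis so the projection simply drops the z coordinate:

```python
def project_annotations_to_plane(center_3d, frame_corners_3d):
    """Project the annotated 3-D center point and 3-D annotation frame onto
    the specified plane by dropping the z coordinate."""
    center_2d = (center_3d[0], center_3d[1])
    frame_2d = [(x, y) for x, y, _z in frame_corners_3d]
    return center_2d, frame_2d
```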
Similar to the application stage, the feature extraction sub-network may be any existing machine learning model, for example, Deep residual network (ResNet), multi-layer Perceptron (MLP), and the like, which is not limited in this embodiment of the specification.
S304: and taking the image features output by the feature extraction sub-network as the input of a first prediction sub-network of a target detection model, and outputting the prediction heat map of the sample image through the first prediction sub-network.
The image features output by the feature extraction sub-network are used as the input of the first prediction sub-network of the target detection model, and the prediction heat map of the sample image is output through the first prediction sub-network. In the prediction heat map, the heat value of each point on a target object represents the probability that the point is the center point of the target object: the higher the heat value of a point, the higher the probability that it is the center point, so the point with the maximum heat value may be regarded as a key point of the target object. Of course, if the convention is reversed so that a smaller heat value indicates a higher probability of being the center point, the point with the minimum heat value may be used as the key point of the target object.
S306: and acquiring an annotation heat map determined for the sample image in advance, and adjusting parameters in the target detection model by taking the minimum difference between the prediction heat map and the annotation heat map of the sample image as a target.
Parameters in the target detection model are then adjusted with the goal of minimizing the difference between the prediction heat map and the annotation heat map of the sample image.
In an embodiment of the present specification, the image sizes and the number of pixels of the sample image, the annotation heat map, and the prediction heat map may be set to be the same.
Specifically, the difference between each pair of points at corresponding positions in the prediction heat map and the annotation heat map can be determined, and the parameters in the target detection model are adjusted with the goal of minimizing the differences between the point pairs. Furthermore, a weight may be set for the difference of each point pair, and the closer a point pair is to the point onto which the annotated center point is projected in the annotation heat map, the higher its weight.
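A minimal sketch of such a weighted per-point objective. Using the annotated heat value itself as the weight is only one illustrative way of giving point pairs near the annotated center point a larger weight:

```python
import numpy as np

def weighted_heatmap_loss(pred_heat, label_heat):
    """Weighted squared difference between the point pairs of the prediction
    and annotation heat maps; points with a high annotated heat value (close
    to the annotated center point) get a larger weight."""
    weight = 1.0 + label_heat
    return float(np.mean(weight * (pred_heat - label_heat) ** 2))
```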
It can be understood that, in the training process, since the predicted heat map is supervised by using the labeled heat map mapped based on the elliptic gaussian kernel function, the heat value of each point on the target object in the output predicted heat map is distributed in an elliptic shape, and the heat value of the point closer to the center of the elliptic shape is higher, and the heat value of the point farther from the center of the elliptic shape is lower.
Based on the method shown in fig. 3, after the target detection model is trained, the heat values of the points on the target object in the prediction heat map it outputs are distributed in an elliptical shape, with higher heat values closer to the center of the ellipse and lower heat values farther from it. Since the shape of a vehicle in the real world is closer to a rectangle than to a square or a circle, the heat map output by a target detection model trained based on this method can more accurately represent the probability that the center point of the target object lies at each position.
Then, based on the predicted heat map of the sample image, a number of predicted keypoints of the target object in the predicted heat map may be determined. Inputting the image features output by the feature extraction sub-network into a second prediction sub-network of the target detection model, outputting, by the second prediction sub-network, for each prediction key point, a prediction distance from the prediction key point to a prediction detection frame of the target object in each specified direction as each prediction distance corresponding to the prediction key point, and then, for each prediction key point, determining a prediction detection frame based on the prediction key point predicted for the target object according to the prediction key point and each prediction distance corresponding to the prediction key point.
In an embodiment of the present specification, when labeling is performed in advance, a label detection frame is labeled in the sample image, and then, after the above projection is performed based on the elliptic gaussian kernel function, the obtained label heat map includes the label detection frame.
It will be appreciated that parameters in the object detection model are adjusted with the goal of minimizing the difference between each of the predictive detection boxes and the annotation detection box.
For example, a detection frame set composed of prediction detection frames and the labeling detection frame may be determined, then, for each detection frame in the detection frame set, each designated corner of the detection frame is determined, and a parameter in the target detection model is adjusted by taking a minimum difference between the designated corners of the prediction detection frames and the corresponding labeling detection frames as a target, where the number of the designated corners is at least two. In the embodiments of the present specification, the corner point is a connection point between line segments that form the labeling detection frame.
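A sketch of the corner-based objective. The Euclidean distance between corresponding corners is an illustrative choice, and the correspondence between predicted and annotated corners is assumed to be given; the method only requires that at least two specified corners are used.

```python
import numpy as np

def corner_loss(pred_frames, label_frame):
    """Mean distance between the specified corners of each prediction detection
    frame and the corresponding corners of the annotation detection frame."""
    label = np.asarray(label_frame, dtype=float)             # (num_corners, 2)
    per_frame = [np.linalg.norm(np.asarray(f, dtype=float) - label, axis=1).mean()
                 for f in pred_frames]                        # one frame per key point
    return float(np.mean(per_frame))
```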
In an embodiment of the present specification, the heat value of each point on the target object in the prediction heat map may be determined, and a point having a heat value greater than a preset heat threshold value may be selected from each point on the target object as a prediction key point on the target object.
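A sketch of this selection step, where `object_mask` is assumed to mark the points inside the annotated detection frame and the heat threshold is a preset hyper-parameter:

```python
import numpy as np

def select_prediction_keypoints(pred_heat, object_mask, heat_threshold=0.5):
    """Select the points on the target object whose heat value exceeds the
    preset threshold as prediction key points."""
    ys, xs = np.where((pred_heat > heat_threshold) & object_mask)
    return list(zip(xs.tolist(), ys.tolist()))
```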
Similarly, the sample image may be further marked with an annotated deflection angle of the target object, and the annotated deflection angle may indicate a deflection angle of a detection frame of the target object relative to a specified direction. In an embodiment of the present specification, the specified direction may be a road center line direction of a road on which the automatic driving apparatus is located.
And similarly to the application process, the image features output by the feature extraction sub-network may be used as input of a third prediction sub-network of the target detection model, a deflection angle of the detection frame of the target object relative to a specified direction may be predicted through the third prediction sub-network, and the deflection angle may be used as a predicted deflection angle of the target object, and then, a parameter in the target detection model may be adjusted with a minimum difference between the predicted deflection angle and the labeled deflection angle of the target object as a target.
It should be noted that the above description takes the data to be detected to be the sample image itself; in another embodiment of this specification, the data to be detected may be other data collected by a sensor, for example point cloud data. In this case, the sample image may be an image obtained by projecting the data to be detected onto a certain plane, while the annotations are made directly in the data to be detected to obtain the annotated center point, the annotated detection frame and the like of the target object; each annotation is then projected onto the specified plane in any of the above manners to obtain the annotation on the specified plane. In this case, the target object contained in the sample image can refer to the area inside the annotated detection frame in the sample image.
In this case, similar to the application process, the image features corresponding to the sample image may be extracted from the point cloud data to be detected in any of the above manners, which is not repeated in the embodiments of this specification.
Of course, when the data to be detected is point cloud data, the height of the target object, that is, its distance from the designated plane, may also be annotated and used as the annotated height of the target object. Then, in any of the above manners, the image features output by the feature extraction subnet are used as the input of the fourth prediction subnet of the target detection model, the height of the detection frame of the target object is predicted through the fourth prediction subnet as the predicted height of the target object, and the parameters in the target detection model are adjusted with the goal of minimizing the difference between the predicted height and the annotated height of the target object.
Based on the same idea, the present specification further provides a corresponding model training device and a corresponding target detection device.
Fig. 4 is a schematic diagram of a model training apparatus provided in the present specification, the apparatus including:
an image determination module 400 for determining a sample image to be detected;
a feature extraction module 402, configured to determine, through the feature extraction sub-network in the target detection model to be trained, the image features corresponding to the sample image to be detected;
a heat map prediction module 404, configured to use the image features output by the feature extraction sub-network as input of a first prediction sub-network of the target detection model, and output a prediction heat map of the sample image through the first prediction sub-network;
a parameter adjusting module 406, configured to acquire an annotation heat map determined for the sample image in advance, and adjust a parameter in the target detection model by using a minimum difference between a predicted heat map of the sample image and the annotation heat map as a target;
the annotation module 408 is configured to determine an annotation heat map for the sample image in advance, and specifically includes: and acquiring an annotation central point determined for the target object contained in the sample image, and mapping each point contained in the sample image to an annotation heat map to be annotated based on the annotation central point and a specified elliptic Gaussian kernel function to obtain the annotated annotation heat map of the sample image.
Optionally, the parameter adjusting module 406 is specifically configured to determine, according to a prediction heat map of the sample image, a plurality of prediction key points of the target object in the prediction heat map; inputting the image features output by the feature extraction sub-network into a second prediction sub-network of the target detection model, and outputting the prediction distance between each prediction key point and a prediction detection frame of the target object in each specified direction as each prediction distance corresponding to the prediction key point by aiming at each prediction key point through the second prediction sub-network; for each prediction key point, determining a prediction detection frame which is predicted for the target object and is based on the prediction key point according to the prediction key point and each prediction interval corresponding to the prediction key point; acquiring predetermined label detection frames of the target object in the label heat map, and adjusting parameters in the target detection model by taking the minimum difference between each prediction detection frame and the label detection frame as a target; the annotation module 408 is specifically configured to obtain an annotation detection frame determined for a target object included in the sample image, obtain an annotation center point determined for the target object included in the sample image, and map the annotation detection frame included in the sample image to an annotation heat map to be annotated based on the annotation center point and a specified elliptical gaussian kernel function, so as to obtain the annotation detection frame of the target object in the annotation heat map.
Optionally, the parameter adjusting module 406 is specifically configured to determine a heat value of each point on the target object in the prediction heat map, where for each point on the target object in the prediction heat map, the greater the heat value of the point, the greater the probability that the point is the center point of the target object; and selecting points with the heat value larger than a preset heat threshold value from all points on the target object as prediction key points on the target object.
Optionally, the parameter adjusting module 406 is specifically configured to determine a detection frame set formed by each prediction detection frame and the label detection frame; and aiming at each detection frame in the detection frame set, determining each appointed corner point of the detection frame, and adjusting parameters in the target detection model by taking the minimum difference between the appointed corner points corresponding to each prediction detection frame and each marking detection frame as a target, wherein the number of the appointed corner points is at least two.
Optionally, the parameter adjusting module 406 is specifically configured to use the image feature output by the feature extraction subnet as an input of a third prediction subnet of the target detection model, and predict, through the third prediction subnet, a deflection angle of the detection frame of the target object relative to a specified direction as a predicted deflection angle of the target object; and acquiring the determined labeled deflection angle of the target object, and adjusting parameters in the target detection model by taking the minimum difference between the predicted deflection angle and the labeled deflection angle of the target object as a target.
Optionally, before determining the sample image to be detected, the image determining module 400 is specifically configured to obtain point cloud data to be detected; projecting the point cloud data to a designated plane in a space where the point cloud data is located to obtain a projection diagram on the designated plane as a sample image to be detected; the feature extraction module 402 is specifically configured to divide a space where the point cloud data is located into a plurality of voxels, perform feature extraction on the point cloud data in each voxel, and obtain extracted point cloud features in the voxel; and according to the extracted point cloud characteristics of each voxel, extracting a sub-network through characteristics in a target detection model to be trained, and determining a point cloud characteristic diagram on the designated plane as the determined image characteristics corresponding to the sample image to be detected.
Optionally, the parameter adjusting module 406 is specifically configured to use the image feature output by the feature extraction subnet as an input of a fourth prediction subnet of the target detection model, and predict, through the fourth prediction subnet, a detection frame height of the target object as a predicted height of the target object; and acquiring the determined labeling height of the target object, and adjusting parameters in the target detection model by taking the minimum difference between the predicted height and the labeling height of the target object as a target.
Fig. 5 is a schematic diagram of an object detection apparatus provided in the present specification, the apparatus including:
an image determining module 500, configured to determine a target image to be detected;
a feature extraction module 502, configured to determine, through the feature extraction sub-network in a target detection model trained by any one of the foregoing methods, the image features corresponding to the target image;
a heat map prediction module 504, configured to output a predicted heat map of the target image through a first prediction subnet of a target detection model by using the image features output by the feature extraction subnet as input of the first prediction subnet, where, for each point on the target object in the predicted heat map, the greater the heat value of the point, the greater the probability that the point is the center point of the target object;
and the target detection module 506 is configured to output a prediction detection frame of the target object according to the prediction heat map of the target image.
Optionally, the target detection module 506 is specifically configured to determine heat values of points in the predicted heat map, and use a point, which is located on the target object and has an extreme heat value, as a predicted key point of the target object; the image features output by the feature extraction sub-network are used as the input of a second prediction sub-network of a target detection model, and the prediction distance between the prediction key point of the target object and a prediction detection frame of the target object in each specified direction is output through the second prediction sub-network; and determining the prediction detection frame of the target object according to the prediction key point of the target object and the prediction distance between the prediction key point of the target object and the prediction detection frame of the target object in each specified direction.
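A minimal sketch of this decoding step, assuming the heat map point with the maximum heat value is taken as the extreme-value prediction key point and that the four specified directions are left, top, right, and bottom:

```python
import numpy as np

def decode_box(heatmap: np.ndarray, distances: np.ndarray):
    """heatmap: (H, W); distances: (4, H, W) ordered left, top, right, bottom."""
    # Point with the extreme (here: maximum) heat value is the prediction key point.
    cy, cx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    left, top, right, bottom = distances[:, cy, cx]
    return (cx - left, cy - top, cx + right, cy + bottom)

heat = np.zeros((64, 64)); heat[30, 40] = 1.0
dist = np.full((4, 64, 64), 5.0)
print(decode_box(heat, dist))  # (35.0, 25.0, 45.0, 35.0)
```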
Optionally, the target detection module 506 is specifically configured to determine a shape of a predicted detection frame of the target object according to a predicted heat map of the target image; the image features output by the feature extraction sub-network are used as the input of a third prediction sub-network of a target detection model, and the deflection angle of a detection frame of the target object relative to the specified direction is output through the third prediction sub-network and is used as the predicted deflection angle of the target object; and outputting the predicted detection frame of the target object according to the shape of the predicted detection frame of the target object and the predicted deflection angle of the target object.
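For illustration only, the following turns a frame shape (center, width, height) and a predicted deflection angle into the four corners of the rotated prediction detection frame; the corner ordering and radian convention are assumptions:

```python
import numpy as np

def rotated_box_corners(cx, cy, w, h, angle_rad):
    """Return the four corners of a frame of size (w, h) rotated by angle_rad about (cx, cy)."""
    half = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                     [w / 2,  h / 2], [-w / 2,  h / 2]])
    rot = np.array([[np.cos(angle_rad), -np.sin(angle_rad)],
                    [np.sin(angle_rad),  np.cos(angle_rad)]])
    return half @ rot.T + np.array([cx, cy])

print(rotated_box_corners(10.0, 5.0, 4.0, 2.0, np.pi / 6))
```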
Optionally, the feature extraction module 502 is specifically configured to obtain point cloud data to be detected, divide the space where the point cloud data is located into a plurality of voxels, perform feature extraction on the point cloud data in each voxel to obtain the extracted point cloud features in the voxel, and determine, according to the extracted point cloud features of each voxel and through the feature extraction sub-network in the target detection model, a point cloud feature map on the designated plane as the extracted image features corresponding to the target image; the target detection module 506 is specifically configured to output a three-dimensional detection frame, in a three-dimensional space, of the target object included in the target image according to the predicted heat map of the target image.
Optionally, the target detection module 506 is specifically configured to determine, according to a predicted heat map of a target image, a two-dimensional detection frame of the target object in the target image, and use the two-dimensional detection frame as the predicted detection frame of the target object; the image features output by the feature extraction sub-network are used as the input of a fourth prediction sub-network of a target detection model, and the height of the target object from the specified plane is output through the fourth prediction sub-network and is used as the prediction height of the target object; and determining a three-dimensional detection frame of the target object in a three-dimensional space according to the two-dimensional detection frame of the target object in the target image and the predicted height of the target object, and updating the determined three-dimensional detection frame into the predicted detection frame of the target object.
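A hedged sketch of combining the two-dimensional detection frame on the designated plane with the predicted height into a three-dimensional detection frame; the ground plane at z = 0 and the (x1, y1, z_min, x2, y2, z_max) layout are illustrative assumptions:

```python
def lift_to_3d(box_2d, predicted_height, ground_z=0.0):
    """Lift a 2D frame on the designated plane into a 3D frame using a predicted height."""
    x1, y1, x2, y2 = box_2d
    return (x1, y1, ground_z, x2, y2, ground_z + predicted_height)

print(lift_to_3d((35.0, 25.0, 45.0, 35.0), 1.6))
```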
The present specification also provides a computer-readable storage medium storing a computer program, which can be used to execute the above-mentioned model training and target detection method.
The present specification also provides a schematic structural diagram of the electronic device shown in fig. 6. As shown in fig. 6, at the hardware level, the electronic device includes a processor, an internal bus, a memory, and a non-volatile memory, and may of course also include hardware required for other services. The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it to implement the above model training and target detection methods.
Of course, besides a software implementation, this specification does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the above processing flows is not limited to the individual logic units and may also be hardware or logic devices.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, as technology develops, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. Designers program a digital system onto a single PLD by themselves, without requiring a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, nowadays, instead of manually fabricating integrated circuit chips, this programming is mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development, and the source code to be compiled is written in a particular programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present. It should also be clear to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained simply by slightly logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, the same functions can be implemented entirely by logically programming the method steps so that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included within it for performing various functions may also be regarded as structures within the hardware component. Or even, the means for performing various functions may be regarded both as software modules for implementing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the various elements may be implemented in the same one or more software and/or hardware implementations of the present description.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims (17)

1. A method of model training, comprising:
determining a sample image to be detected;
determining, through the feature extraction sub-network in the target detection model to be trained, the image features corresponding to the sample image to be detected;
taking the image features output by the feature extraction sub-network as the input of a first prediction sub-network of a target detection model, and outputting a prediction heat map of the sample image through the first prediction sub-network;
acquiring an annotation heat map determined for the sample image in advance, and adjusting parameters in the target detection model by taking the minimum difference between the prediction heat map and the annotation heat map of the sample image as a target;
wherein determining the annotation heat map for the sample image in advance specifically comprises:
and acquiring an annotation central point determined for the target object contained in the sample image, and mapping each point contained in the sample image to an annotation heat map to be annotated based on the annotation central point and a specified elliptic Gaussian kernel function to obtain the annotation heat map of the sample image.
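As a minimal sketch of mapping points to an annotation heat map with an elliptical Gaussian kernel around the annotation center point; tying the two standard deviations to the annotated frame's width and height is an assumption made for illustration, not a requirement of the claim:

```python
import numpy as np

def elliptical_gaussian_heatmap(shape, center, box_w, box_h, alpha=6.0):
    """Render an elliptical Gaussian peaked at `center` over a map of size `shape`."""
    h, w = shape
    cx, cy = center
    ys, xs = np.mgrid[0:h, 0:w]
    sigma_x, sigma_y = box_w / alpha, box_h / alpha   # assumed link to frame size
    heat = np.exp(-((xs - cx) ** 2 / (2 * sigma_x ** 2)
                    + (ys - cy) ** 2 / (2 * sigma_y ** 2)))
    return heat  # heat values fall off faster along the shorter side of the frame

hm = elliptical_gaussian_heatmap((96, 96), center=(48, 48), box_w=40, box_h=16)
print(hm[48, 48], hm[48, 68], hm[58, 48])  # peak at the center, elliptical falloff
```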
2. The training method of claim 1, wherein the method further comprises:
determining a plurality of prediction key points of the target object in the prediction heat map according to the prediction heat map of the sample image;
inputting the image features output by the feature extraction sub-network into a second prediction sub-network of the target detection model, and outputting the prediction distance between each prediction key point and a prediction detection frame of the target object in each specified direction as each prediction distance corresponding to the prediction key point by aiming at each prediction key point through the second prediction sub-network;
for each prediction key point, determining, according to the prediction key point and each prediction distance corresponding to the prediction key point, a prediction detection frame which is predicted for the target object based on the prediction key point;
acquiring a label detection frame predetermined for the target object in the annotation heat map, and adjusting parameters in the target detection model by taking the minimum difference between each prediction detection frame and the label detection frame as a target;
the method for predetermining the annotation detection frame of the target object in the annotation heat map specifically includes:
and mapping the annotation detection frame contained in the sample image to an annotation heat map to be annotated based on the annotation central point and a specified elliptical Gaussian kernel function to obtain the annotation detection frame of the target object in the annotation heat map.
3. The method of claim 2, wherein determining a number of predicted keypoints of the target object in the predicted heat map based on the predicted heat map of the sample image comprises:
determining a heat value of each point on the target object in the predicted heat map, wherein for each point on the target object in the predicted heat map, the greater the heat value of the point, the greater the probability that the point is the center point of the target object;
and selecting points with the heat value larger than a preset heat threshold value from all points on the target object as prediction key points on the target object.
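An illustrative sketch of this selection step; the threshold value used here is assumed:

```python
import numpy as np

def select_keypoints(pred_heatmap: np.ndarray, threshold: float = 0.3):
    """Keep every point whose heat value exceeds the preset threshold."""
    ys, xs = np.where(pred_heatmap > threshold)
    return list(zip(xs.tolist(), ys.tolist()))  # (x, y) prediction key points

hm = np.zeros((8, 8)); hm[2, 3] = 0.9; hm[5, 6] = 0.4; hm[7, 7] = 0.1
print(select_keypoints(hm))  # [(3, 2), (6, 5)]
```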
4. The method of claim 2, wherein adjusting parameters in the target detection model by taking the minimum difference between each prediction detection frame and the label detection frame as a target specifically comprises:
determining a detection frame set formed by each prediction detection frame and the label detection frame;
and for each detection frame in the detection frame set, determining the specified corner points of the detection frame, and adjusting parameters in the target detection model by taking the minimum difference between the specified corner points corresponding to each prediction detection frame and those of the label detection frame as a target, wherein the number of the specified corner points is at least two.
5. The method of claim 1, wherein the method further comprises:
using the image features output by the feature extraction sub-network as the input of a third prediction sub-network of a target detection model, and predicting the deflection angle of a detection frame of the target object relative to the specified direction through the third prediction sub-network to be used as the predicted deflection angle of the target object;
and acquiring the determined labeled deflection angle of the target object, and adjusting parameters in the target detection model by taking the minimum difference between the predicted deflection angle and the labeled deflection angle of the target object as a target.
6. The method of claim 1, wherein prior to determining the sample image to be detected, the method further comprises:
acquiring point cloud data to be detected;
determining a sample image to be detected specifically comprises:
projecting the point cloud data to a designated plane in a space where the point cloud data is located to obtain a projection diagram on the designated plane as a sample image to be detected;
determining, through the feature extraction sub-network in the target detection model to be trained, the image features corresponding to the sample image to be detected specifically comprises:
dividing the space where the point cloud data is located into a plurality of voxels, and performing feature extraction on the point cloud data in each voxel to obtain the extracted point cloud features in the voxel;
and determining, according to the extracted point cloud features of each voxel and through the feature extraction sub-network in the target detection model to be trained, a point cloud feature map on the designated plane as the image features corresponding to the sample image to be detected.
7. The method of claim 6, wherein the method further comprises:
using the image features output by the feature extraction sub-network as the input of a fourth prediction sub-network of a target detection model, and predicting the height of a detection frame of the target object through the fourth prediction sub-network to be used as the predicted height of the target object;
and acquiring the determined labeling height of the target object, and adjusting parameters in the target detection model by taking the minimum difference between the predicted height and the labeling height of the target object as a target.
8. A method of object detection, comprising:
determining a target image to be detected;
determining image characteristics corresponding to the target image through a characteristic extraction sub-network in a target detection model trained by the method according to any one of claims 1 to 7;
using the image features output by the feature extraction sub-network as the input of a first prediction sub-network of a target detection model, and outputting a prediction heat map of the target image through the first prediction sub-network, wherein for each point on a target object in the prediction heat map, the greater the heat value of the point, the greater the probability that the point is the central point of the target object;
and outputting a prediction detection frame of the target object according to the prediction heat map of the target image.
9. The method of claim 8, wherein outputting the predicted detection frame of the target object based on the predicted heat map of the target image comprises:
determining the heat value of each point in the prediction heat map, and taking the point with the extreme heat value on the target object as a prediction key point of the target object;
the image features output by the feature extraction sub-network are used as the input of a second prediction sub-network of a target detection model, and the prediction distance between the prediction key point of the target object and a prediction detection frame of the target object in each specified direction is output through the second prediction sub-network;
and determining the prediction detection frame of the target object according to the prediction key point of the target object and the prediction distance between the prediction key point of the target object and the prediction detection frame of the target object in each specified direction.
10. The method of claim 8, wherein outputting the predicted detection frame of the target object based on the predicted heat map of the target image comprises:
determining the shape of a prediction detection frame of the target object according to the prediction heat map of the target image;
the image features output by the feature extraction sub-network are used as the input of a third prediction sub-network of a target detection model, and the deflection angle of a detection frame of the target object relative to the specified direction is output through the third prediction sub-network and is used as the predicted deflection angle of the target object;
and outputting the predicted detection frame of the target object according to the shape of the predicted detection frame of the target object and the predicted deflection angle of the target object.
11. The method of claim 8, wherein determining the image features corresponding to the target image specifically comprises:
acquiring point cloud data to be detected;
dividing the space where the point cloud data is located into a plurality of voxels, and performing feature extraction on the point cloud data in each voxel to obtain the extracted point cloud features in the voxel;
determining, according to the extracted point cloud features of each voxel and through the feature extraction sub-network in the target detection model, a point cloud feature map on the designated plane as the extracted image features corresponding to the target image;
outputting a prediction detection frame of the target object according to the prediction heat map of the target image, specifically comprising:
and outputting a three-dimensional detection frame of the target object contained in the target image in a three-dimensional space according to the prediction heat map of the target image.
12. The method according to claim 11, wherein outputting a three-dimensional detection frame of the target object contained in the target image in a three-dimensional space according to the predicted heat map of the target image comprises:
determining a two-dimensional detection frame of the target object in the target image according to the prediction heat map of the target image, and using the two-dimensional detection frame as the prediction detection frame of the target object;
the image features output by the feature extraction sub-network are used as the input of a fourth prediction sub-network of a target detection model, and the height of the target object from the specified plane is output through the fourth prediction sub-network and is used as the prediction height of the target object;
and determining a three-dimensional detection frame of the target object in a three-dimensional space according to the two-dimensional detection frame of the target object in the target image and the predicted height of the target object, and updating the determined three-dimensional detection frame into the predicted detection frame of the target object.
13. A model training device, characterized in that the device specifically includes:
the image determining module is used for determining a sample image to be detected;
the feature extraction module is used for determining, through a feature extraction sub-network in a target detection model to be trained, the image features corresponding to the sample image to be detected;
the heat map prediction module is used for taking the image features output by the feature extraction sub-network as the input of a first prediction sub-network of a target detection model and outputting a prediction heat map of the sample image through the first prediction sub-network;
the parameter adjusting module is used for acquiring an annotation heat map determined for the sample image in advance, and adjusting parameters in the target detection model by taking the minimum difference between the prediction heat map and the annotation heat map of the sample image as a target;
the annotation module is used for determining an annotation heat map for the sample image in advance, and specifically comprises: and acquiring an annotation central point determined for the target object contained in the sample image, and mapping each point contained in the sample image to an annotation heat map to be annotated based on the annotation central point and a specified elliptic Gaussian kernel function to obtain the annotated annotation heat map of the sample image.
14. An object detection apparatus, characterized in that the apparatus specifically comprises:
the image determining module is used for determining a target image to be detected;
the feature extraction module is used for determining image features corresponding to the target image through a feature extraction sub-network in a target detection model trained by the method according to any one of claims 1 to 7;
the heat map prediction module is used for taking the image features output by the feature extraction sub-network as the input of a first prediction sub-network of a target detection model, and outputting a prediction heat map of the target image through the first prediction sub-network, wherein for each point on a target object in the prediction heat map, the greater the heat value of the point, the greater the probability that the point is the central point of the target object;
and the target detection module is used for outputting a prediction detection frame of the target object according to the prediction heat map of the target image.
15. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1 to 12.
16. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 7 when executing the program.
17. An autopilot device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program carries out the method of any one of claims 8 to 12.
CN202111452181.8A 2021-12-01 2021-12-01 Model training and target detection method and device Pending CN114332201A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111452181.8A CN114332201A (en) 2021-12-01 2021-12-01 Model training and target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111452181.8A CN114332201A (en) 2021-12-01 2021-12-01 Model training and target detection method and device

Publications (1)

Publication Number Publication Date
CN114332201A true CN114332201A (en) 2022-04-12

Family

ID=81047735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111452181.8A Pending CN114332201A (en) 2021-12-01 2021-12-01 Model training and target detection method and device

Country Status (1)

Country Link
CN (1) CN114332201A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116721399A (en) * 2023-07-26 2023-09-08 之江实验室 Point cloud target detection method and device for quantitative perception training
CN116721399B (en) * 2023-07-26 2023-11-14 之江实验室 Point cloud target detection method and device for quantitative perception training
CN118115728A (en) * 2024-04-02 2024-05-31 河北工程大学 Three-dimensional target detection method based on improved KM3D network

Similar Documents

Publication Publication Date Title
CN111311709B (en) Method and device for generating high-precision map
CN111508258B (en) Positioning method and device
CN111639682A (en) Ground segmentation method and device based on point cloud data
CN114332201A (en) Model training and target detection method and device
CN111062372B (en) Method and device for predicting obstacle track
CN112327864A (en) Control method and control device of unmanned equipment
CN112734810B (en) Obstacle tracking method and device
CN111797711A (en) Model training method and device
US20200081452A1 (en) Map processing device, map processing method, and computer readable medium
CN112036462A (en) Method and device for model training and target detection
CN114440903A (en) High-precision map construction method and device, storage medium and electronic equipment
CN111127551A (en) Target detection method and device
CN112861831A (en) Target object identification method and device, storage medium and electronic equipment
CN112990099B (en) Method and device for detecting lane line
CN112902987B (en) Pose correction method and device
CN112883871B (en) Model training and unmanned vehicle motion strategy determining method and device
CN111426299B (en) Method and device for ranging based on depth of field of target object
CN114494381A (en) Model training and depth estimation method and device, storage medium and electronic equipment
CN112818968A (en) Target object classification method and device
CN114187355A (en) Image calibration method and device
CN112734851B (en) Pose determination method and device
CN114549579A (en) Target tracking method and device
CN114623824A (en) Method and device for determining barrier speed
CN117095371A (en) Target detection method and detection device
CN114997264A (en) Training data generation method, model training method, model detection method, device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination