WO2024037552A1 - Target detection model training method and apparatus, map generation method and apparatus, and device - Google Patents

Target detection model training method and apparatus, map generation method and apparatus, and device Download PDF

Info

Publication number
WO2024037552A1
WO2024037552A1 PCT/CN2023/113197 CN2023113197W WO2024037552A1 WO 2024037552 A1 WO2024037552 A1 WO 2024037552A1 CN 2023113197 W CN2023113197 W CN 2023113197W WO 2024037552 A1 WO2024037552 A1 WO 2024037552A1
Authority
WO
WIPO (PCT)
Prior art keywords
point
training
loss
target detection
instance
Prior art date
Application number
PCT/CN2023/113197
Other languages
French (fr)
Chinese (zh)
Inventor
廖本成
陈少宇
程天恒
张骞
Original Assignee
北京地平线信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京地平线信息技术有限公司 filed Critical 北京地平线信息技术有限公司
Publication of WO2024037552A1 publication Critical patent/WO2024037552A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/588Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present disclosure relates to autonomous driving technology, and in particular, to a training method of a target detection model, a map generation method, device and equipment.
  • Embodiments of the present disclosure provide a training method for a target detection model, a map generation method, an apparatus and a device.
  • a training method for a target detection model including: obtaining training input data and corresponding first label data, where the training input data includes training image data and/or training point cloud data.
  • the first label data includes an ordered point set corresponding to a first number of instances in the training input data, and the ordered point set includes a target number of coordinate points in the first coordinate system; based on the training Input data, the first label data, a point-to-point loss function and a direction loss function to train the pre-established target detection network to obtain a target detection model.
  • the point-to-point loss function is used to determine the training instance output by the target detection network.
  • the point-to-point loss of the point set relative to the ordered point set of the instance in the first label data is used to determine the direction between points in the training instance point set relative to the first label data The loss of the direction between points in the ordered point set of the instance.
  • a method for generating a map including: acquiring first image data and/or first point cloud data of at least one perspective; based on the first image data and/or the The first point cloud data uses a target detection model obtained by pre-training to obtain an ordered point set of target instances.
  • the target detection model is obtained by the training method of the target detection model as described in any of the above embodiments.
  • the target instance The ordered point set includes ordered point sets corresponding to the first number of instances, and the ordered point set includes a target number of coordinate points in the first coordinate system; based on the target instance ordered point set, a map is generated.
  • a training device for a target detection model including: a first acquisition module for acquiring training input data and corresponding first label data, where the training input data includes training images data and/or training point cloud data.
  • the first label data includes an ordered point set corresponding to a first number of instances in the training input data.
  • the ordered point set includes a target number of points in the first coordinate system.
  • a first processing module configured to train a pre-established target detection network based on the training input data, the first label data, a point-to-point loss function and a direction loss function to obtain a target detection model, the The point-to-point loss function is used to determine the point-to-point loss of the training instance point set output by the target detection network relative to the ordered point set of instances in the first label data, and the direction loss function is used to determine the training instance point set. The loss of the direction between points relative to the direction between points of the ordered point set of instances in the first label data.
  • a map generation device including: a second acquisition module for acquiring first image data and/or first point cloud data of at least one perspective; a second processing module, Used to obtain an ordered point set of target instances based on the first image data and/or the first point cloud data using a target detection model obtained through pre-training.
  • the target detection model is as described in any of the above embodiments.
  • the training method of the target detection model is obtained, so
  • the ordered point set of the target instance includes an ordered point set corresponding to a first number of instances, and the ordered point set includes a target number of coordinate points in the first coordinate system.
  • a computer-readable storage medium stores a computer program, the computer program is used to perform training of the target detection model described in any of the above embodiments of the present disclosure. Method; or, the computer program is used to execute the map generation method described in any of the above embodiments of the present disclosure.
  • an electronic device includes: a processor; a memory for storing instructions executable by the processor; and the processor is configured to retrieve instructions from the memory.
  • the executable instructions are read and executed to implement the training method of the target detection model described in any of the above embodiments of the present disclosure.
  • an electronic device includes: a processor; a memory for storing instructions executable by the processor; and the processor is configured to retrieve instructions from the memory.
  • the executable instructions are read and executed to implement the map generation method described in any of the above embodiments of the present disclosure.
  • a computer program product is provided.
  • the instruction processor in the computer program product is executed, the map generation method or the map generation method described in any of the above embodiments of the present disclosure is executed. How to generate a map.
  • the pre-established target detection network is trained by using the ordered point set corresponding to the instance as a label and combining point-to-point loss and direction loss.
  • the obtained target detection model can predict the ordered point set of instances for image data and/or point cloud data, that is, to achieve prediction at the map element coordinate point level, relative to the prediction at the map element instance frame level, embodiments of the present disclosure It can help improve the prediction accuracy of the model.
  • Figure 1 is an exemplary application scenario of the training method of the target detection model provided by the present disclosure
  • Figure 2 is a schematic flowchart of a training method for a target detection model provided by an exemplary embodiment of the present disclosure
  • Figure 3 is a schematic flowchart of step 202 provided by an exemplary embodiment of the present disclosure.
  • Figure 4 is a schematic structural diagram of a target detection network provided by an exemplary embodiment of the present disclosure.
  • Figure 5 is a schematic flowchart of step 202 provided by another exemplary embodiment of the present disclosure.
  • Figure 6 is a schematic flowchart of step 2021 provided by an exemplary embodiment of the present disclosure.
  • Figure 7 is a schematic structural diagram of a decoder network provided by an exemplary embodiment of the present disclosure.
  • Figure 8 is a schematic diagram of the principle of Deformable DETR provided by an exemplary embodiment of the present disclosure.
  • Figure 9 is a schematic diagram of the determination principle of a training instance point set provided by another exemplary embodiment of the present disclosure.
  • Figure 10 is a schematic flowchart of step 202 provided by yet another exemplary embodiment of the present disclosure.
  • Figure 11 is a schematic diagram of a prediction network of a prediction type provided by an exemplary embodiment of the present disclosure.
  • Figure 12 is a schematic flowchart of step 2024 provided by an exemplary embodiment of the present disclosure.
  • Figure 13 is a schematic flowchart of a map generation method provided by an exemplary embodiment of the present disclosure.
  • Figure 14 is a schematic structural diagram of a training device for a target detection model provided by an exemplary embodiment of the present disclosure
  • Figure 15 is a schematic structural diagram of the first processing module 502 provided by an exemplary embodiment of the present disclosure.
  • Figure 16 is a schematic structural diagram of the second processing unit 5022 provided by an exemplary embodiment of the present disclosure.
  • Figure 17 is a schematic structural diagram of the first processing unit 5021 provided by an exemplary embodiment of the present disclosure.
  • Figure 18 is a schematic structural diagram of the first processing module 502 provided by another exemplary embodiment of the present disclosure.
  • Figure 19 is a schematic structural diagram of a map generation device provided by an exemplary embodiment of the present disclosure.
  • Figure 20 is a schematic structural diagram of an application embodiment of the electronic device of the present disclosure.
  • the inventor found that in autonomous driving scenarios, it is usually necessary to use vehicle-mounted surround-view cameras and/or radars to perceive road elements (such as lane lines, zebra crossings, curbs, drivable areas, etc.), using for the generation of online maps. If the detection model in related technologies is used to predict the map instances corresponding to various road elements to obtain the map instance frame position, and the map is generated based on the map instance frame position, the accuracy of the generated map will be lower.
  • road elements such as lane lines, zebra crossings, curbs, drivable areas, etc.
  • Figure 1 is an exemplary application scenario of the training method of the target detection model provided by the present disclosure.
  • the pre-collected image data can be used as training image data
  • the pre-collected point cloud data can be used as training point cloud data to form training input data
  • the training input data can be formed
  • the labels corresponding to the image data and the labels corresponding to the training point cloud data are used as the first label data for training the target detection model.
  • the network output of the target detection model may include ordered point sets corresponding to the first number of instances, that is, each Each instance can correspond to an ordered point set.
  • Each ordered point set can include a target number of coordinate points in a first coordinate system.
  • the first coordinate system can be a coordinate system corresponding to a bird's-eye view.
  • the instances can be various types of points on the road.
  • the representation of elements in an image or point cloud can be instances, that is, each element can have one or more corresponding elements in the image or point cloud. Multiple instances. Each instance can predict its corresponding ordered point set. The ordered point set can fit the elements corresponding to the instance, such as the lane line instance. The ordered point set includes 3 coordinate points. Through this 3 coordinate points can fit a lane line.
  • the network parameters of the target detection model are adjusted based on the point-to-point loss function and the direction loss function, so that the target detection model obtained through training can effectively detect the ordered point sets corresponding to each instance.
  • the prediction at the instance coordinate point level of the present disclosure helps to improve the accuracy of prediction results.
  • the target detection model obtained by training can then be deployed to the map generation device on the on-board computing platform of the autonomous vehicle for online mapping of the autonomous vehicle, which helps to improve the accuracy of the generated map.
  • the training method of the target detection model of the present disclosure is not limited to autonomous driving scenarios, but can also be applied to any other implementable scenarios according to actual needs, and can be set according to actual needs.
  • FIG. 2 is a schematic flowchart of a training method for a target detection model provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to electronic devices, such as servers, terminals and other electronic devices. As shown in Figure 2, it includes the following steps:
  • Step 201 Obtain training input data and corresponding first label data.
  • the training input data includes training image data and/or training point cloud data.
  • the first label data includes ordered points corresponding to the first number of instances in the training input data. Set, the ordered point set includes the target number of coordinate points in the first coordinate system.
  • the training image data and training point cloud data can be obtained based on vehicle-mounted surround-view cameras and radar collection. For example, by driving a collection vehicle equipped with a surround-view camera and radar on the road, the road environment images and road point cloud data around the vehicle are collected as training image data and training point cloud data respectively.
  • the first label data may be obtained by annotating ordered point sets for instances in the training image data and/or training point cloud data.
  • the first label data includes coordinate points in a first coordinate system.
  • the first coordinate system can be a coordinate system corresponding to a bird's-eye view.
  • the training image data is data in an image coordinate system.
  • the training point cloud data can be data in a radar coordinate system.
  • the labeling result can be converted to the first coordinate system based on the camera parameters and radar parameters to obtain the corresponding first label data.
  • Instances can be the representation of various elements on the road in images or point clouds.
  • lane lines, zebra crossings, arrows, curbs, drivable areas and other elements in the image can be instances, that is, each element is represented in the image or point cloud.
  • Each instance can correspond to an ordered point set.
  • the ordered point set can be fitted to the elements corresponding to the instance, such as lane line instances.
  • the ordered point set includes 3 coordinates. point, a segment of lane line can be fitted through these three coordinate points.
  • first quantity and target quantity The amount can be set according to actual needs.
  • step 201 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the first acquisition module run by the processor.
  • Step 202 based on the training input data, the first label data, the point-to-point loss function and the direction loss function, train the pre-established target detection network to obtain the target detection model.
  • the point-to-point loss function is used to determine the training instance point output by the target detection network.
  • the point-to-point loss of the set relative to the ordered point set of the instance in the first label data, the direction loss function is used to determine the direction between the points in the training instance point set relative to the point of the ordered point set of the instance in the first label data and direction loss between points.
  • the target detection network can be set according to actual needs.
  • the target detection network can be a detection network based on a deformable detection transformer (Deformable DEtection TRansformer, referred to as: Deformable DETR) or other implementable detection networks.
  • Deformable DETR Deformable DEtection TRansformer
  • the point-to-point loss function and the direction loss function can use any implementable loss function.
  • the point-to-point loss function can use the L1 loss function.
  • the L1 loss function refers to the L1 norm loss function, also known as the least absolute deviation (LAD) or minimum Absolute error (LAE), which is to minimize the sum of the absolute differences between the target value (in this embodiment, it refers to the label value of the annotation) and the estimated value (in this embodiment, it refers to the output value of the target detection network) ization
  • the direction loss function can adopt the cosine similarity loss function of the direction vectors of two adjacent points.
  • the network parameters of the target detection model are adjusted based on the point-to-point loss function and the direction loss function, so that the target detection model obtained through training can effectively detect the ordered point sets corresponding to each instance.
  • the point-to-point loss determined by the point-to-point loss function is used to supervise the point-level prediction results of the target detection network, so that the target detection network can accurately predict the points of the instance, and the direction loss determined by the direction loss function is used to supervise the order of points.
  • This enables the target detection network to predict a more accurate ordered point set. Since the target detection model obtained by training in the embodiments of the present disclosure predicts the ordered coordinate points of the instance, compared to the prediction of the instance frame, the embodiments of the present disclosure are helpful. Improve the accuracy of prediction results.
  • step 202 may be performed by the processor calling corresponding instructions stored in the memory, or may be performed by the first processing module run by the processor.
  • the training method of the target detection model provided by this embodiment is based on the point-to-point loss determined by the point-to-point loss function and the direction loss determined based on the direction loss function, and supervises the points and the order of the points in the instance point set output by the target detection network, so that the training The obtained target detection model can accurately and effectively predict the ordered point sets corresponding to each instance, realizing the prediction at the coordinate point level of the instance.
  • the target detection model obtained by training in the embodiment of the present disclosure It helps to improve the accuracy of prediction results, which in turn helps improve map accuracy when used for map generation.
  • Figure 3 is a schematic flowchart of step 202 provided by an exemplary embodiment of the present disclosure.
  • step 202 may specifically include the following steps:
  • Step 2021 Obtain the training instance point set based on the training input data and the target detection network.
  • the target detection network can be set according to actual needs to any one of three situations: only training image data can be input, only training point cloud data can be input, or both training image data and training point cloud data can be input.
  • FIG. 4 is a schematic structural diagram of a target detection network provided by an exemplary embodiment of the present disclosure.
  • the target detection network may include a feature extraction network, an encoder network, a decoder network and a prediction head network.
  • the feature extraction network may include a first feature extraction network and/or a second feature extraction network according to actual needs.
  • the first feature extraction network is used for feature extraction of training image data to obtain training image features
  • the second feature extraction network is used for Extract training point cloud data to obtain training point cloud features
  • the encoder network is used to encode training image features and/or training point cloud features to obtain the training feature map in the first coordinate system
  • the decoder network is used to train The feature map is decoded to obtain the training decoding results
  • the prediction head network is used to predict the training instance point set based on the training decoding results.
  • the prediction head network can be a linear neural network such as MLP (Multilayer Perceptron, or feedforward neural network), which can be set according to actual needs.
  • MLP Multilayer Perceptron, or feedforward neural network
  • different ordered point sets can be used for different instances.
  • an open-loop ordered point set can be used, that is, the starting point and the end point of the ordered point set are not the same point.
  • the ordered point set can be a polygon point set, forming a closed-loop ordered point set. After fitting, it becomes a closed-loop polygon.
  • the details can be set according to actual needs.
  • the target number of coordinate points of the ordered point sets corresponding to different instances can be the same or different.
  • the lane line instance sets an ordered point set of 3 coordinate points
  • the zebra crossing sets an ordered point set of 5 coordinate points. There is no specific limit. .
  • step 2021 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the first processing unit run by the processor.
  • Step 2022 Determine the first loss based on the training instance point set, the first label data and the point-to-point loss function.
  • the point set of each instance in the training instance point set can be compared point-to-point with the ordered point set of the instance in the first label data to determine the first loss.
  • the point-to-point point set can be compared The absolute value of the difference is used as the loss of the point, thereby obtaining the loss of each point in each instance, and then based on the loss of each point in each instance, the point-to-point loss of the entire network can be determined as the first loss.
  • the losses at each point of each instance can be summed to obtain the first loss.
  • the details can be set according to actual needs.
  • step 2022 may be performed by the processor calling corresponding instructions stored in the memory, or may be performed by a second processing unit run by the processor.
  • Step 2023 Determine the second loss based on the training instance point set, the first label data and the direction loss function.
  • the direction loss function is used to determine the loss of the direction between the points in the training instance point set relative to the direction between the points of the ordered point set of the instance in the first label data.
  • the loss can be based on the difference between the two points. Determined by the cosine similarity of the direction vectors.
  • the first direction vector of the two adjacent points can be determined based on the coordinate values of the two adjacent points, based on the first label data and the The coordinate value labels of the two points corresponding to the two adjacent points are used to determine the second direction vector of the two adjacent points.
  • the cosine similarity of the two direction vectors is determined.
  • the direction loss of the entire network is determined as the second loss, which can be set according to actual needs.
  • Step 2022 and step 2023 are in no particular order.
  • step 2023 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a third processing unit run by the processor.
  • Step 2024 Based on the first loss and the second loss, adjust the network parameters of the target detection network until the first loss and the second loss meet the preset conditions, and obtain the target detection model.
  • the first loss and the second loss can be weighted and summed by preset weights as a comprehensive loss for adjusting network parameters.
  • Preset conditions can be set according to actual needs.
  • the adjustment of network parameters can be implemented using any implementable optimizer, such as the Adam training optimizer, which can be set according to actual needs.
  • the Adam training optimizer absorbs the advantages of the adaptive learning rate gradient descent algorithm (Adagrad) and the momentum gradient descent algorithm. It can not only adapt to sparse gradients (that is, natural language and computer vision problems), but also alleviate the problem of gradient oscillation.
  • step 2024 may be performed by the processor calling corresponding instructions stored in the memory, or may be performed by a fourth processing unit run by the processor.
  • Figure 5 is a schematic flowchart of step 202 provided by another exemplary embodiment of the present disclosure.
  • step 2022 determines the first loss based on the training instance point set, the first label data and the point-to-point loss function, including:
  • Step 20221 For each instance, based on the ordered point set corresponding to the instance in the first label data, determine the points in the ordered point set and the training instance point set of the instance in different orders of the ordered point set. The corresponding relationship between points is to obtain the point-to-point relationship corresponding to each sequence.
  • the different orders of ordered point sets refer to the order in which different endpoints of the ordered point set are used as the starting points.
  • the ordered point set of line segments such as lane lines
  • it includes three ordered coordinate points A1, A2, and A3. , which has two endpoints A1 and A3.
  • the different orders of this ordered point set include two orders, one is A1-A2-A3, and the other is A3-A2-A1.
  • the coordinates of any two coordinate points in different orders The adjacent relationship remains unchanged.
  • the ordered coordinate points of B1-B5 are included, where B5 can be equal to B1 to represent the polygon, or it can be represented by other symbols.
  • the point set corresponds to a polygon. When fitting, it needs to be connected end to end to form a closed loop. Specifically, It can be set according to actual needs, as long as it can be distinguished from the ordered point set of line segments. It also provides a basis for determining point-to-point relationships in different sequences. For the ordered point set B1-B5, take B5 not equal to B1 as an example. Since each coordinate point can be a vertex of a polygon and can be used as a starting point, the ordered point set has 5 starting points.
  • the ordered point set can correspond to 10 sequences, including B1-B5 and the reverse sequence B5-B1, B2-B3-B4-B5-B1 and its reverse sequence, B3-B4-B5-B1-B2 and Its reverse order, B4-B5-B1-B2-B3 and its reverse order, B5-B1-B2-B3-B4 and its reverse order.
  • the ordered point set of an instance in the training instance point set is C1-C5
  • the valid point set labels of the instance in the first label data are D1-D5.
  • D1-D5 are arranged in the 10 different orders of B1-B5 mentioned above. , corresponding to C1-C5 respectively, that is, forming a point-to-point relationship corresponding to different orders.
  • C1-C5 correspond to D5-D1 in order, and the specific principles will not be repeated one by one.
  • step 20221 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the first determination subunit executed by the processor.
  • Step 20222 Based on the point-to-point relationships corresponding to each sequence, determine the point-to-point losses corresponding to each sequence.
  • the point-to-point loss can be obtained based on the point-to-point loss function.
  • step 20222 may be performed by the processor calling corresponding instructions stored in the memory, or may be performed by the second determination subunit run by the processor.
  • Step 20223 use the order with the smallest point-to-point loss as the target order of this instance.
  • the minimum point-to-point loss and the order of the minimum point-to-point loss can be determined, and this order is used as the target order of the ordered point set of the instance for subsequent point-to-point points in the entire network. Determination of loss.
  • step 20223 may be executed by the processor calling a corresponding instruction stored in the memory, or may be executed by a third determination subunit run by the processor.
  • Step 20224 Use the point-to-point loss corresponding to the target sequence as the target point-to-point loss of this instance.
  • the training instance point set includes an ordered point set corresponding to the first number of instances, when the first number is multiple, the corresponding target sequence and the corresponding target point-to-point loss can be determined for each instance.
  • step 20224 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the fourth determination subunit executed by the processor.
  • Step 20225 Determine the first loss based on the target point-to-point loss of each instance.
  • the target point-to-point losses of each instance can be combined to determine the first loss of the entire network.
  • step 20225 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the fifth determination subunit executed by the processor.
  • the embodiments of the present disclosure can determine the minimum point-to-point loss for determining the overall point-to-point loss of the network through various possible orders of the ordered point set during the training process, so that the target detection network can simulate the optimal starting point of the instance and The order of corresponding instances helps to further improve model performance and accuracy of prediction results.
  • step 2023 determines the second loss based on the training instance point set, the first label data and the direction loss function, including:
  • Step 20231 Determine the second loss based on the training instance point set, the first label data, the target order and direction loss function corresponding to each instance.
  • the second loss is a direction loss, it involves the directionality of two adjacent points. Therefore, when the point-to-point loss adopts the point-to-point relationship of the target sequence, the direction loss can also be based on the two adjacent coordinate points determined by the point-to-point relationship of the target sequence.
  • the direction vector is determined to ensure the consistency between the predicted point and the direction of the label.
  • step 20231 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a third processing unit run by the processor.
  • Figure 6 is a schematic flowchart of step 2021 provided by an exemplary embodiment of the present disclosure.
  • the training input data may also include initial query features and initial reference points.
  • the initial query features include initial features of the target number corresponding to the first number of instances.
  • the initial reference points may include initial features.
  • the reference coordinate points corresponding to each feature; the target detection network is a detection network based on the deformable detection transformer; accordingly, in step 2021, based on the training input data and the target detection network, a training instance point set is obtained, including:
  • Step 20211 Extract features from the training image data based on the first feature extraction network in the target detection network to obtain the first training image features.
  • Deformable Detection Transformer is a detection network obtained by improving DETR. It uses a multi-scale variable attention module instead of the attention module in DETR to process features, which helps to solve DETR Problems such as high computational complexity and too slow convergence.
  • DETR is an end-to-end target detector that fully integrates a convolutional neural network (CNN) and a transformer (Transformer). It can achieve target detection based on the powerful modeling capabilities of the Transformer.
  • the initial query features (queries) may be randomly initialized query features, and the initial query features include initial features corresponding to the target number of the first number of instances, that is, each instance may correspond to the initial features of the target number.
  • the target number of initial features corresponding to different instances can be the same or different.
  • the initial query features are used in the decoder attention operation in the object detection network.
  • the initial reference point can be a set of initial reference coordinate points corresponding to each randomly initialized instance.
  • the first feature extraction network can use any implementable feature extraction network, such as using a convolutional neural network as the feature extraction network. The specifics can be based on actual needs. set up.
  • step 20211 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the first feature extraction subunit run by the processor.
  • Step 20212 Perform feature extraction on the training point cloud data based on the second feature extraction network in the target detection network to obtain the first training point cloud features.
  • the second feature extraction network can use any implementable feature extraction network, such as using a convolutional neural network as the feature extraction network, which can be set according to actual needs.
  • step 20211 and step 20212 are not in any order.
  • step 20212 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the second feature extraction subunit run by the processor.
  • Step 20213 Encode the first training image feature and/or the first training point cloud feature based on the encoder network in the target detection network to obtain the target training feature map in the first coordinate system.
  • the encoder network may include at least one encoder, and the encoder network may convert the first training image features and the first training point cloud features into the first coordinate system through coding to obtain the corresponding target training feature map.
  • step 20213 may be performed by the processor calling corresponding instructions stored in the memory, or may be performed by a coding subunit executed by the processor.
  • Step 20214 Obtain the training decoding result based on the target training feature map, the initial query feature, the initial reference point, and the decoder network in the target detection network.
  • the decoder network includes at least one decoder.
  • the training decoding result may be a decoding result obtained by decoding by at least one decoder.
  • the decoder network can continuously update the initial query features based on the initial reference points and target training feature maps to obtain training decoding results.
  • step 20214 may be performed by the processor calling corresponding instructions stored in the memory, or may be performed by a decoding subunit run by the processor.
  • Step 20215 Based on the training decoding results, determine the training instance point set.
  • the training instance point set can be obtained by continuously updating the initial reference point with the decoding result of each decoder. That is, after each decoder obtains the decoding result, the offset corresponding to each reference point can be predicted based on the decoding result, so as to Take the first decoder as an example.
  • the decoding result predicts the offset corresponding to each initial reference point.
  • Each offset is added to the corresponding initial reference point to obtain the updated reference point as the corresponding first decoder. output reference point.
  • the offset corresponding to each output reference point of the first decoder is predicted based on its decoding result, and added to each output reference point corresponding to the first decoder to obtain the second decoding
  • the output reference point corresponding to the decoder, and so on, the output reference point corresponding to the last decoder can be used as the training instance point set.
  • FIG. 7 is a schematic structural diagram of a decoder network provided by an exemplary embodiment of the present disclosure.
  • the decoder network can include N decoders
  • the initial query features can include 3 initial features corresponding to the two instances of instance 1 and instance 2, that is, each instance corresponds to 3 initial features, where, The three gray blocks corresponding to instance 1 respectively represent the three initial features of instance 1, and the three black blocks corresponding to instance 2 represent the three initial features of instance 2 respectively.
  • the initial reference points may include reference coordinate points corresponding to the two instances. Each initial feature corresponds to a reference coordinate point. Taking Example 1 as an example, 3 initial features correspond to 3 reference coordinate points (see Figure 7 The 3 gray dots corresponding to Example 1).
  • the training decoding results are obtained through decoding by N decoders, and then the training instance point set is obtained based on the training decoding results.
  • FIG. 8 is a schematic diagram of the principle of Deformable DETR provided by an exemplary embodiment of the present disclosure.
  • Query Feature represents the initial query feature
  • Reference Point represents the initial reference point
  • Input Feature Map represents the target training feature map.
  • the variable attention module of the decoder can only Focus on a part of the range near the reference point, regardless of the resolution of the entire feature map.
  • the offset is represented by three arrows respectively.
  • the feature offset refers to the position offset of the key points collected in the value vector relative to the initial reference point.
  • the Input Feature Map obtains the value vector Values through the linear layer. Each note The force head obtains the corresponding value vector, and the feature offset can be used to extract sparse values (that is, the key points in the above Values) from near the initial reference point in the value vector Values.
  • Attention Weights are aggregated with sparse values, such as through the three weights of Head1 in Attention Weights (A mqk ) , 0.5 and 0.3. , 0.2 perform a weighted sum of the values of the three key points (three stacked gray blocks) extracted from the value vector of Head1 to obtain the attention result corresponding to the attention head Head1.
  • each attention head can be obtained Corresponding attention results (Aggregated Sampled Values) respectively, Aggregated Sampled Values obtain the decoding result (Output) through the linear layer, or the Output can also be added to the Query Feature through the residual connection, and the addition result is used as the decoding result.
  • Aggregated Sampled Values obtain the decoding result (Output) through the linear layer, or the Output can also be added to the Query Feature through the residual connection, and the addition result is used as the decoding result.
  • Set according to actual needs.
  • the specific principles of Deformable DETR will not be repeated one by one.
  • step 20215 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the first processing subunit run by the processor.
  • the embodiments of the present disclosure realize coordinate point level prediction of instances through a target detection model based on Deformable DETR. Compared with prediction using segmentation combined with post-processing or autoregressive prediction, it helps to improve prediction accuracy, and due to the target detection based on Deformable DETR
  • the model, using deformable convolution, can collect only the main feature points near the reference point when performing attention operations, which helps to reduce the amount of calculation and thus helps to improve the prediction speed.
  • step 20214 obtains training decoding results based on the target training feature map, initial query features, initial reference points, and the decoder network in the target detection network, including: for each decoding in the decoder network
  • the decoding result of the decoder is obtained based on the target training feature map and the input query features and input reference points corresponding to the decoder.
  • the input query features and input reference points corresponding to the first decoder are the initial query features respectively. and the initial reference point.
  • the input query feature corresponding to any other decoder except the first decoder is the decoding result of the previous decoder of the other decoder.
  • the input reference point of the other decoder is based on the previous decoder.
  • the output reference point determined by the decoding result of the decoder; the decoding result of the last decoder is used as the training decoding result.
  • the decoding result of each decoder in the decoder network after obtaining the decoding result of each decoder in the decoder network based on the target training feature map and the input query feature and input reference point corresponding to the decoder, it also includes: Based on the decoding result of the decoder and the offset prediction network corresponding to the decoder, the first offset corresponding to the decoder is determined; based on the first offset and the input reference point corresponding to the decoder, the first offset is determined The output reference point corresponding to the decoder; accordingly, the training instance point set is determined based on the training decoding result, including: using the output reference point corresponding to the last decoder determined based on the training decoding result as the training instance point set.
  • FIG. 9 is a schematic diagram of the determination principle of the training instance point set provided by another exemplary embodiment of the present disclosure.
  • each decoder can correspond to an offset prediction network, which is used to predict the first offset of the reference point based on the decoding result of the decoder, and add it to the input reference point corresponding to the decoder.
  • the input reference point of decoder 1 is the initial reference point
  • the output reference point of decoder i-1 Continuously fine-tune the reference points through training to obtain an accurate set of instance points.
  • Figure 10 is a schematic flowchart of step 202 provided by yet another exemplary embodiment of the present disclosure.
  • the first label data also includes the type label corresponding to each instance in the training input data; in step 20214, the target training feature map, the initial query feature, the initial reference point, and the decoding in the target detection network are After obtaining the training decoding results of the machine network, it also includes:
  • Step 20216 Based on the training decoding results, determine the training type results.
  • the training type results include the prediction types corresponding to each instance.
  • the type label of the instance can be the real type of each instance obtained by pre-annotation, such as lane lines, curbs, zebra crossings, arrows, drivable areas, etc.
  • the prediction type corresponding to an instance refers to the element type to which the instance is predicted by the target detection network.
  • the element type can include lane lines, curbs, zebra crossings, arrows, drivable areas, etc. For example, predict that an instance belongs to a lane line.
  • FIG. 11 is a schematic diagram of a prediction network provided by an exemplary embodiment of the present disclosure.
  • decoder N decodes to obtain the training decoding result, and predicts the training type result through the type prediction network.
  • the type prediction network can be a prediction network based on a feedforward neural network, which can be set according to actual needs.
  • the type prediction network, each offset prediction network, and each reference point update network can be collectively referred to as a prediction head network.
  • step 20216 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the fifth processing unit run by the processor.
  • Step 20217 Determine the type loss based on the training type result and the type label in the first label data.
  • the type loss can be determined based on the preset type loss function, and the type loss function can use any implementable loss function.
  • the type loss function can use the focal loss function, which can be set according to actual needs.
  • step 20217 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the sixth processing unit run by the processor.
  • step 2024 adjusts the network parameters of the target detection network based on the first loss and the second loss until the first loss and the second loss meet the preset conditions, and obtains the target detection model, including:
  • Step 20241 Determine the comprehensive loss based on the first loss, the second loss, the type loss and the preset weight.
  • the preset weights can be set according to actual needs.
  • the weights of the first loss l1, the second loss l2, and the type loss l3 can be set to ⁇ 1, ⁇ 2, and ⁇ 3 respectively.
  • ⁇ 1, ⁇ 2, and ⁇ 3 can be set to 5, 0.1, and 2 respectively, and there are no specific limitations.
  • step 20241 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a third processing subunit run by the processor.
  • Step 20242 Based on the comprehensive loss, adjust the network parameters of the target detection network until the comprehensive loss meets the preset conditions and obtain the target detection model.
  • step 20242 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the fourth processing subunit run by the processor.
  • the embodiments of the present disclosure help to further improve the performance of the target detection model and the accuracy of the prediction results by further combining type loss, point-to-point loss and direction loss to comprehensively adjust network parameters.
  • Figure 12 is a schematic flowchart of step 2024 provided by an exemplary embodiment of the present disclosure.
  • step 2024 adjusts the network parameters of the target detection network based on the first loss and the second loss. , until the first loss and the second loss meet the preset conditions, and the target detection model is obtained, including:
  • Step 20241a Determine the comprehensive loss based on the first loss and the second loss.
  • the first loss and the second loss can be weighted and summed according to a certain proportional weight to obtain the comprehensive loss.
  • the specific principle please refer to the above content and will not be repeated here.
  • step 20241a may be executed by the processor calling corresponding instructions stored in the memory, or Can be executed by a fourth processing unit 5024 executed by the processor.
  • Step 20242a Based on the comprehensive loss, adjust the network parameters of the target detection network until the comprehensive loss meets the preset conditions, and obtain the target detection model.
  • step 20242a may be performed by the processor calling corresponding instructions stored in the memory, or may be performed by the fourth processing unit 5024 run by the processor.
  • the training method of the target detection model in the embodiment of the present disclosure uses hierarchical prediction methods of examples and corresponding ordered point sets, and combines point-to-point loss, direction loss, and type loss to conduct model training, so that the obtained target detection model can be more accurate. Predicting the ordered point set of the instance helps to further improve the prediction accuracy, and combined with the deformable DETR network, the attention operation of the target detection model during the inference process can only focus on the feature interaction of neighboring points around the reference point, which helps Reduce the computational complexity, thereby helping to reduce the amount of calculation and improve prediction efficiency.
  • the target detection model query vector in the embodiment of the present disclosure is at the point level, which is more flexible than the instance box level.
  • FIG. 13 is a schematic flowchart of a map generation method provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to electronic devices, specifically such as vehicle-mounted computing platforms. As shown in Figure 13, it includes the following steps:
  • Step 301 Obtain first image data and/or first point cloud data of at least one viewing angle.
  • the first image data may be the image data of the current frame collected in real time by at least one camera installed on the vehicle while the vehicle is driving, and the first point cloud data may be collected in real time by a radar installed on the vehicle while the vehicle is driving.
  • the point cloud data of the current frame may be the image data of the current frame collected in real time by at least one camera installed on the vehicle while the vehicle is driving.
  • step 301 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the second acquisition module run by the processor.
  • Step 302 Based on the first image data and/or the first point cloud data, use the target detection model obtained by pre-training to obtain an ordered point set of target instances.
  • the target detection model is obtained through the training method of the target detection model provided in any of the above embodiments or optional examples.
  • the target instance ordered point set includes ordered point sets corresponding to the first number of instances, and the ordered point set includes The target number of coordinate points in the first coordinate system.
  • the specific input data required by the target detection model can be set and trained according to actual needs, and can support image data or point cloud data, or can support both image data and point cloud data.
  • the specific reasoning principle of the target detection model can be found in the foregoing embodiments and will not be described again here.
  • step 302 may be performed by the processor calling corresponding instructions stored in the memory, or may be performed by a second processing module run by the processor.
  • Step 303 Generate a map based on the ordered point set of the target instance.
  • the target instance ordered point set is the coordinate point set under the first coordinate system (such as the coordinate system corresponding to the bird's-eye view).
  • the corresponding Map elements such as lane lines, zebra crossings, curbs, etc.
  • the fitting results of each instance can be used as a local road map around the current location of the vehicle.
  • the ordered point set of the target instance can also be converted into the global coordinate system through coordinate transformation, so that a global road map can be generated according to the regional growth method, which can be set according to actual needs.
  • the global coordinate system may be, for example, a world coordinate system or a relatively stable coordinate system rigidly connected to the world coordinate system.
  • the global coordinate system may be a preset coordinate system with the starting position of the vehicle as the origin.
  • step 303 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a third processing module run by the processor.
  • the map generation method in the embodiment of the present disclosure realizes prediction at the coordinate point level of the map instance based on the target detection model. Compared with the prediction at the frame level of the map instance, the method in the embodiment of the present disclosure can help to improve the accuracy of the map.
  • any method provided by the embodiments of the present disclosure can be executed by any appropriate device with data processing capabilities, including but not limited to: terminal devices and servers.
  • any method provided by the embodiments of the present disclosure can be executed by a processor.
  • the processor executes any method mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in the memory. No further details will be given below.
  • the aforementioned program can be stored in a computer-readable storage medium.
  • the program When the program is executed, It includes the steps of the above method embodiment; and the aforementioned storage medium includes: ROM, RAM, magnetic disk or optical disk and other various media that can store program codes.
  • Figure 14 is a schematic structural diagram of a training device for a target detection model provided by an exemplary embodiment of the present disclosure.
  • the device of this embodiment can be used to implement the training method embodiment of the corresponding target detection model of the present disclosure.
  • the device shown in Figure 14 includes: a first acquisition module 501 and a first processing module 502.
  • the first acquisition module 501 is used to acquire training input data and corresponding first label data.
  • the training input data includes training image data and/or training point cloud data.
  • the first label data includes ordered point sets respectively corresponding to a first number of instances in the training input data, and each ordered point set includes a target number of coordinate points in the first coordinate system; the first processing module 502 is used to train the pre-established target detection network based on the training input data and first label data obtained by the first acquisition module 501, together with the point-to-point loss function and the direction loss function, to obtain the target detection model.
  • the point-to-point loss function is used to determine the point-to-point loss of the training instance point set output by the target detection network relative to the ordered point sets of the instances in the first label data; the direction loss function is used to determine the loss of the direction between points in the training instance point set relative to the direction between points in the ordered point sets of the instances in the first label data.
  • FIG. 15 is a schematic structural diagram of the first processing module 502 provided by an exemplary embodiment of the present disclosure.
  • the first processing module 502 includes: a first processing unit 5021, a second processing unit 5022, a third processing unit 5023 and a fourth processing unit 5024.
  • the first processing unit 5021 is used to obtain the training instance point set based on the training input data and the target detection network; the second processing unit 5022 is used to determine the first loss based on the training instance point set obtained by the first processing unit 5021, the first label data, and the point-to-point loss function;
  • the third processing unit 5023 is used to determine the second loss based on the training instance point set obtained by the first processing unit 5021, the first label data, and the direction loss function;
  • the fourth processing unit 5024 is used to adjust the network parameters of the target detection network based on the first loss and the second loss until the first loss and the second loss meet the preset conditions, obtaining the target detection model.
  • FIG. 16 is a schematic structural diagram of the second processing unit 5022 provided by an exemplary embodiment of the present disclosure.
  • the second processing unit 5022 includes: a first determination sub-unit 50221, a second determination sub-unit 50222, a third determination sub-unit 50223, a fourth determination sub-unit 50224 and a fifth determination sub-unit 50225.
  • the first determination subunit 50221 is used, for each instance, to determine the correspondence between each point in the ordered point set of that instance in the first label data and the points of that instance in the training instance point set, under the different orders of the ordered point set, obtaining the point-to-point relationship corresponding to each order;
  • the second determination subunit 50222 is used to determine the point-to-point loss corresponding to each order based on the point-to-point relationship corresponding to that order;
  • the third determination subunit 50223 is used to take the order with the smallest point-to-point loss as the target order of the instance;
  • the fourth determination subunit 50224 is used to take the point-to-point loss corresponding to the target order as the target point-to-point loss of the instance;
  • the fifth determination subunit 50225 is used to determine the first loss based on the target point-to-point losses of the instances.
  • the third processing unit 5023 is specifically configured to determine the second loss based on the training instance point set, the first label data, the target order corresponding to each instance, and the direction loss function.
  • Figure 17 is a schematic structural diagram of the first processing unit 5021 provided by an exemplary embodiment of the present disclosure.
  • the training input data also includes initial query features and initial reference points.
  • the initial query features include, for each of the first number of instances, a target number of initial features, and the initial reference points include a reference coordinate point corresponding to each initial feature;
  • the target detection network is a detection network based on a deformable detection transformer;
  • the first processing unit 5021 includes: a first feature extraction subunit 50211, a second feature extraction subunit 50212, an encoding subunit 50213, a decoding subunit 50214 and a first processing subunit 50215.
  • the first feature extraction subunit 50211 is used to perform feature extraction on the training image data based on the first feature extraction network in the target detection network to obtain the first training image features; the second feature extraction subunit 50212 is used to perform feature extraction on the training point cloud data based on the second feature extraction network in the target detection network to obtain the first training point cloud features; the encoding subunit 50213 is used to encode the first training image features and/or the first training point cloud features based on the encoder network in the target detection network to obtain the target training feature map in the first coordinate system; the decoding subunit 50214 is used to obtain training decoding results based on the target training feature map, the initial query features, the initial reference points, and the decoder network in the target detection network.
  • the decoder network includes at least one decoder; the first processing subunit 50215 is used to determine the training instance point set based on the training decoding results.
  • the decoding subunit 50214 is specifically configured to: for each decoder in the decoder network, obtain the decoding result of that decoder based on the target training feature map and the input query features and input reference points corresponding to that decoder, where the input query features and input reference points corresponding to the first decoder are the initial query features and the initial reference points respectively; for any decoder other than the first decoder, the input query features are the decoding result of the preceding decoder, and the input reference points are the output reference points determined based on the decoding result of the preceding decoder; and the decoding result of the last decoder is used as the training decoding result.
  • the first processing unit 5021 also includes: an offset prediction sub-unit 50216 and a second processing sub-unit 50217.
  • the offset prediction subunit 50216 is used to determine the first offset corresponding to a decoder based on the decoding result of that decoder and the offset prediction network corresponding to that decoder; the second processing subunit 50217 is used to determine the output reference points corresponding to that decoder based on the first offset and the input reference points corresponding to that decoder; accordingly, the first processing subunit 50215 is specifically used to take the output reference points corresponding to the last decoder, determined based on the training decoding results, as the training instance point set.
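  • as a hedged sketch of the iterative refinement just described, the following shows how each decoder's reference points could be updated by a per-decoder offset head, with the last decoder's output reference points taken as the training instance point set; the module interfaces are assumptions for illustration, not the disclosed implementation:

```python
import torch.nn as nn

class RefinementDecoderStack(nn.Module):
    """Chain of decoders, each followed by an offset prediction network."""

    def __init__(self, decoders, offset_heads):
        super().__init__()
        self.decoders = nn.ModuleList(decoders)          # e.g. deformable attention layers
        self.offset_heads = nn.ModuleList(offset_heads)  # one small MLP per decoder

    def forward(self, feature_map, init_queries, init_ref_points):
        queries, ref_points = init_queries, init_ref_points
        for decoder, offset_head in zip(self.decoders, self.offset_heads):
            # Decoding result from the target training feature map and the
            # decoder's input query features / input reference points.
            queries = decoder(queries, feature_map, ref_points)
            # First offset predicted from the decoding result, added to the
            # input reference points to obtain the output reference points.
            ref_points = ref_points + offset_head(queries)
        # Output reference points of the last decoder = training instance point set.
        return queries, ref_points
```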
  • Figure 18 is a schematic structural diagram of the first processing module 502 provided by another exemplary embodiment of the present disclosure.
  • the first label data also includes type labels corresponding to each instance in the training input data; the first processing module 502 also includes:
  • the fifth processing unit 5025 is used to determine the training type result based on the training decoding result.
  • the training type result includes the prediction type corresponding to each instance;
  • the sixth processing unit 5026 is used to determine the type loss based on the training type result and the type labels in the first label data;
  • the fourth processing unit 5024 includes: a third processing subunit 50241, used to determine the comprehensive loss based on the first loss, the second loss, the type loss and the preset weights; and a fourth processing subunit 50242, used to adjust the network parameters of the target detection network based on the comprehensive loss until the comprehensive loss meets the preset conditions, obtaining the target detection model.
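  • a minimal sketch of one possible comprehensive loss, assuming cross-entropy as the type loss and placeholder values for the preset weights (neither is fixed by the text):

```python
import torch.nn.functional as F

def comprehensive_loss(first_loss, second_loss, type_logits, type_labels,
                       w_point=1.0, w_dir=0.1, w_type=0.5):
    """Weighted sum of point-to-point, direction, and type losses."""
    type_loss = F.cross_entropy(type_logits, type_labels)
    return w_point * first_loss + w_dir * second_loss + w_type * type_loss
```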
  • the fourth processing unit 5024 is specifically configured to: determine the comprehensive loss based on the first loss and the second loss; and adjust the network parameters of the target detection network based on the comprehensive loss until the comprehensive loss meets the preset conditions, obtaining the target detection model.
  • Figure 19 is a schematic structural diagram of a map generation device provided by an exemplary embodiment of the present disclosure.
  • the device of this embodiment can be used to implement the corresponding map generation method embodiment of the present disclosure.
  • the device shown in Figure 19 includes: a second acquisition module 601, a second processing module 602, and a third processing module 603.
  • the second acquisition module 601 is used to acquire the first image data and/or the first point cloud data of at least one perspective; the second processing module 602 is used to obtain the target instance ordered point set based on the first image data and/or the first point cloud data acquired by the second acquisition module 601, using a target detection model obtained through pre-training.
  • the target detection model is obtained through the training method of the target detection model in any of the above embodiments or optional examples.
  • the target instance ordered point set includes ordered point sets respectively corresponding to the first number of instances, and each ordered point set includes a target number of coordinate points in the first coordinate system; the third processing module 603 is configured to generate a map based on the target instance ordered point set obtained by the second processing module 602.
  • Figure 20 is a schematic structural diagram of an application embodiment of the electronic device of the present disclosure.
  • the electronic device 10 includes one or more processors 11 and memories 12.
  • the processor 11 may be a central processing unit (CPU) or other form of processing unit with data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
  • Memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • Volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache), etc.
  • Non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc.
  • One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 11 may execute the program instructions to implement the methods of various embodiments of the present disclosure described above and/or other desired functions.
  • the electronic device 10 may further include an input device 13 and an output device 14, and these components are interconnected through a bus system and/or other forms of connection mechanisms (not shown).
  • the input device 13 may also include, for example, a keyboard, a mouse, and the like.
  • the output device 14 can output various information to the outside, including determined distance information, direction information, etc.
  • the output device 14 may include, for example, a display, a speaker, a printer, a communication network and remote output devices connected thereto, and the like.
  • the electronic device 10 may also include any other appropriate components depending on the specific application.
  • embodiments of the present disclosure may also be a computer program product, which includes computer program instructions that, when executed by a processor, cause the processor to perform the steps of the methods according to various embodiments of the present disclosure described in the "Exemplary Methods" section of this specification.
  • the computer program product may have program code for performing operations of embodiments of the present disclosure written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • embodiments of the present disclosure may also be a computer-readable storage medium having computer program instructions stored thereon.
  • the computer program instructions, when executed by a processor, cause the processor to perform the steps of the methods according to various embodiments of the present disclosure described in the "Exemplary Methods" section of this specification.
  • Computer-readable storage media can take the form of any combination of one or more computer-readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may include, for example, but is not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more conductors, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

Abstract

Disclosed in embodiments of the present disclosure are a target detection model training method and apparatus, a map generation method and apparatus, and a device. The target detection model training method comprises: acquiring training input data and corresponding first label data, wherein the training input data comprises training image data and/or training point cloud data, the first label data comprises ordered point sets respectively corresponding to a first number of instances in the training input data, and each ordered point set comprises a target number of coordinate points under a first coordinate system; and on the basis of the training input data, the first label data, a point-to-point loss function and a direction loss function, training a pre-established target detection network to obtain a target detection model. The target detection model obtained in the embodiments of the present disclosure can accurately and effectively predict ordered point sets respectively corresponding to instances, realizing the coordinate point-level prediction of the instances, and compared with instance box-level prediction, the embodiments of the present disclosure are beneficial to improving the precision of a prediction result.

Description

Target detection model training method, map generation method, apparatus and device
This disclosure claims priority to the Chinese patent application filed with the State Intellectual Property Office on August 16, 2022, with application number CN202210977934.5 and invention title "Target detection model training method, map generation method, apparatus and device", the entire contents of which are incorporated into this disclosure by reference.
Technical Field
The present disclosure relates to autonomous driving technology, and in particular, to a training method for a target detection model, a map generation method, an apparatus, and a device.
Background Art
In autonomous driving scenarios, it is usually necessary to use vehicle-mounted surround-view cameras and/or radars to perceive road elements (such as lane lines, zebra crossings, curbs, and drivable areas) for online map generation.
Summary of the Invention
Embodiments of the present disclosure provide a training method for a target detection model, a map generation method, an apparatus, and a device.
According to one aspect of the embodiments of the present disclosure, a training method for a target detection model is provided, including: obtaining training input data and corresponding first label data, where the training input data includes training image data and/or training point cloud data, the first label data includes ordered point sets respectively corresponding to a first number of instances in the training input data, and each ordered point set includes a target number of coordinate points in a first coordinate system; and training a pre-established target detection network based on the training input data, the first label data, a point-to-point loss function, and a direction loss function to obtain a target detection model, where the point-to-point loss function is used to determine the point-to-point loss of the training instance point set output by the target detection network relative to the ordered point sets of the instances in the first label data, and the direction loss function is used to determine the loss of the direction between points in the training instance point set relative to the direction between points in the ordered point sets of the instances in the first label data.
According to another aspect of the embodiments of the present disclosure, a map generation method is provided, including: acquiring first image data and/or first point cloud data of at least one perspective; obtaining a target instance ordered point set based on the first image data and/or the first point cloud data using a target detection model obtained through pre-training, where the target detection model is obtained through the training method of the target detection model described in any of the above embodiments, the target instance ordered point set includes ordered point sets respectively corresponding to a first number of instances, and each ordered point set includes a target number of coordinate points in the first coordinate system; and generating a map based on the target instance ordered point set.
According to yet another aspect of the embodiments of the present disclosure, a training apparatus for a target detection model is provided, including: a first acquisition module for acquiring training input data and corresponding first label data, where the training input data includes training image data and/or training point cloud data, the first label data includes ordered point sets respectively corresponding to a first number of instances in the training input data, and each ordered point set includes a target number of coordinate points in the first coordinate system; and a first processing module for training a pre-established target detection network based on the training input data, the first label data, a point-to-point loss function, and a direction loss function to obtain a target detection model, where the point-to-point loss function is used to determine the point-to-point loss of the training instance point set output by the target detection network relative to the ordered point sets of the instances in the first label data, and the direction loss function is used to determine the loss of the direction between points in the training instance point set relative to the direction between points in the ordered point sets of the instances in the first label data.
According to a further aspect of the embodiments of the present disclosure, a map generation apparatus is provided, including: a second acquisition module for acquiring first image data and/or first point cloud data of at least one perspective; and a second processing module for obtaining a target instance ordered point set based on the first image data and/or the first point cloud data using a target detection model obtained through pre-training, where the target detection model is obtained through the training method of the target detection model described in any of the above embodiments, the target instance ordered point set includes ordered point sets respectively corresponding to a first number of instances, and each ordered point set includes a target number of coordinate points in the first coordinate system.
According to a further aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided; the storage medium stores a computer program, and the computer program is used to perform the training method of the target detection model described in any of the above embodiments of the present disclosure, or the computer program is used to perform the map generation method described in any of the above embodiments of the present disclosure.
According to a further aspect of the embodiments of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing instructions executable by the processor, where the processor is configured to read the executable instructions from the memory and execute the instructions to implement the training method of the target detection model described in any of the above embodiments of the present disclosure.
According to a further aspect of the embodiments of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing instructions executable by the processor, where the processor is configured to read the executable instructions from the memory and execute the instructions to implement the map generation method described in any of the above embodiments of the present disclosure.
According to a further aspect of the embodiments of the present disclosure, a computer program product is provided; when the instructions in the computer program product are executed by a processor, the training method of the target detection model or the map generation method described in any of the above embodiments of the present disclosure is performed.
Based on the training method of the target detection model, the map generation method, the apparatus, and the device provided by the above embodiments of the present disclosure, the pre-established target detection network is trained by using the ordered point sets corresponding to the instances as labels and combining the point-to-point loss and the direction loss. The obtained target detection model can predict the ordered point sets of instances from image data and/or point cloud data, that is, achieve prediction at the coordinate-point level of map elements. Compared with prediction at the instance-frame level of map elements, the embodiments of the present disclosure help improve the prediction accuracy of the model.
Description of the Drawings
Figure 1 is an exemplary application scenario of the training method for the target detection model provided by the present disclosure;
Figure 2 is a schematic flowchart of a training method for a target detection model provided by an exemplary embodiment of the present disclosure;
Figure 3 is a schematic flowchart of step 202 provided by an exemplary embodiment of the present disclosure;
Figure 4 is a schematic structural diagram of a target detection network provided by an exemplary embodiment of the present disclosure;
Figure 5 is a schematic flowchart of step 202 provided by another exemplary embodiment of the present disclosure;
Figure 6 is a schematic flowchart of step 2021 provided by an exemplary embodiment of the present disclosure;
Figure 7 is a schematic structural diagram of a decoder network provided by an exemplary embodiment of the present disclosure;
Figure 8 is a schematic diagram of the principle of Deformable DETR provided by an exemplary embodiment of the present disclosure;
Figure 9 is a schematic diagram of the principle of determining a training instance point set provided by another exemplary embodiment of the present disclosure;
Figure 10 is a schematic flowchart of step 202 provided by yet another exemplary embodiment of the present disclosure;
Figure 11 is a schematic diagram of a prediction network for type prediction provided by an exemplary embodiment of the present disclosure;
Figure 12 is a schematic flowchart of step 2024 provided by an exemplary embodiment of the present disclosure;
Figure 13 is a schematic flowchart of a map generation method provided by an exemplary embodiment of the present disclosure;
Figure 14 is a schematic structural diagram of a training device for a target detection model provided by an exemplary embodiment of the present disclosure;
Figure 15 is a schematic structural diagram of the first processing module 502 provided by an exemplary embodiment of the present disclosure;
Figure 16 is a schematic structural diagram of the second processing unit 5022 provided by an exemplary embodiment of the present disclosure;
Figure 17 is a schematic structural diagram of the first processing unit 5021 provided by an exemplary embodiment of the present disclosure;
Figure 18 is a schematic structural diagram of the first processing module 502 provided by another exemplary embodiment of the present disclosure;
Figure 19 is a schematic structural diagram of a map generation device provided by an exemplary embodiment of the present disclosure;
Figure 20 is a schematic structural diagram of an application embodiment of the electronic device of the present disclosure.
Detailed Description
In order to explain the present disclosure, example embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present disclosure, rather than all of them. It should be understood that the present disclosure is not limited by the example embodiments.
It should be noted that, unless otherwise specifically stated, the relative arrangement of components and steps, numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure.
Overview of the Disclosure
In the process of realizing the present disclosure, the inventors found that in autonomous driving scenarios it is usually necessary to use vehicle-mounted surround-view cameras and/or radars to perceive road elements (such as lane lines, zebra crossings, curbs, and drivable areas) for online map generation. If a detection model of the related art is used to predict the map instances corresponding to various road elements to obtain map instance frame positions, and a map is generated based on the map instance frame positions, the accuracy of the generated map will be low.
Exemplary Overview
Figure 1 is an exemplary application scenario of the training method for the target detection model provided by the present disclosure.
In an autonomous driving scenario, using the training method of the target detection model of the present disclosure, pre-collected image data can be used as training image data and pre-collected point cloud data as training point cloud data to form the training input data, and the labels corresponding to the training image data and the training point cloud data serve as the first label data for training the target detection model. The network output of the target detection model may include ordered point sets respectively corresponding to a first number of instances, that is, each instance may correspond to one ordered point set, and each ordered point set may include a target number of coordinate points in a first coordinate system, where the first coordinate system may be the coordinate system corresponding to a bird's-eye view. An instance may be the representation of a road element in an image or point cloud; for example, lane lines, zebra crossings, arrows, curbs, and drivable areas in an image may all be instances, that is, each kind of element may correspond to one or more instances in the image or point cloud. For each instance, its corresponding ordered point set can be predicted, and the ordered point set can be fitted to obtain the element corresponding to the instance; for example, for a lane line instance whose ordered point set includes 3 coordinate points, a segment of lane line can be fitted through these 3 coordinate points. During training, the network parameters of the target detection model are adjusted based on the point-to-point loss function and the direction loss function, so that the trained target detection model can effectively detect the ordered point sets corresponding to the instances. Since the trained target detection model predicts the coordinate points of instances, compared with the above-mentioned way of predicting map instances to obtain map instance frame positions, the instance coordinate-point-level prediction of the present disclosure helps improve the accuracy of prediction results. The trained target detection model can then be deployed to a map generation apparatus on the on-board computing platform of an autonomous vehicle for online mapping, which helps improve the accuracy of the generated map.
In practical applications, the training method of the target detection model of the present disclosure is not limited to autonomous driving scenarios, and can also be applied to any other implementable scenario according to actual needs; the specifics can be set according to actual needs.
Exemplary Methods
Figure 2 is a schematic flowchart of a training method for a target detection model provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to electronic devices, such as servers and terminals. As shown in Figure 2, the method includes the following steps:
Step 201: Obtain training input data and corresponding first label data, where the training input data includes training image data and/or training point cloud data, the first label data includes ordered point sets respectively corresponding to a first number of instances in the training input data, and each ordered point set includes a target number of coordinate points in the first coordinate system.
The training image data and training point cloud data can be collected by vehicle-mounted surround-view cameras and radars. For example, a collection vehicle equipped with a surround-view camera and radar drives on the road, and the road environment images and road point cloud data around the vehicle are collected as training image data and training point cloud data respectively. The first label data can be obtained by annotating ordered point sets for the instances in the training image data and/or training point cloud data. The first label data includes coordinate points in the first coordinate system, which can be the coordinate system corresponding to a bird's-eye view; the training image data is data in the image coordinate system, and the training point cloud data can be data in the radar coordinate system, so the annotation results can be converted to the first coordinate system based on the camera parameters and radar parameters to obtain the corresponding first label data. The specific conversion principle will not be described again. An instance can be the representation of a road element in an image or point cloud; for example, lane lines, zebra crossings, arrows, curbs, and drivable areas in an image can all be instances, that is, each kind of element can correspond to one or more instances in the image or point cloud, and each instance can correspond to an ordered point set that can be fitted to obtain the element corresponding to the instance. For example, for a lane line instance whose ordered point set includes 3 coordinate points, a segment of lane line can be fitted through these 3 coordinate points. The first number and the target number can both be set according to actual needs.
In an optional example, step 201 may be executed by the processor calling corresponding instructions stored in the memory, or by the first acquisition module run by the processor.
Step 202: Train the pre-established target detection network based on the training input data, the first label data, the point-to-point loss function, and the direction loss function to obtain the target detection model, where the point-to-point loss function is used to determine the point-to-point loss of the training instance point set output by the target detection network relative to the ordered point sets of the instances in the first label data, and the direction loss function is used to determine the loss of the direction between points in the training instance point set relative to the direction between points in the ordered point sets of the instances in the first label data.
The target detection network can be set according to actual needs; for example, it can be a detection network based on a Deformable DEtection TRansformer (Deformable DETR) or another implementable detection network. The point-to-point loss function and the direction loss function can be any implementable loss functions. For example, the point-to-point loss function can be the L1 loss function, i.e. the L1-norm loss function, also known as least absolute deviation (LAD) or least absolute error (LAE), which minimizes the sum of the absolute differences between the target values (in the embodiments of the present disclosure, the annotated label values) and the estimated values (in the embodiments of the present disclosure, the output values of the target detection network); the direction loss function can be a cosine-similarity loss function over the direction vectors of adjacent points. During training, the network parameters of the target detection model are adjusted based on the point-to-point loss function and the direction loss function, so that the trained target detection model can effectively detect the ordered point sets corresponding to the instances. The point-to-point loss determined by the point-to-point loss function supervises the point-level prediction results of the target detection network, so that the network can accurately predict the points of an instance, and the direction loss determined by the direction loss function supervises the order of the points, so that the network can predict a more accurate ordered point set. Since the target detection model trained in the embodiments of the present disclosure predicts the ordered coordinate points of instances, compared with instance-frame prediction, the embodiments of the present disclosure help improve the accuracy of prediction results.
In an optional example, step 202 may be executed by the processor calling corresponding instructions stored in the memory, or by the first processing module run by the processor.
In the training method of the target detection model provided by this embodiment, the point-to-point loss determined by the point-to-point loss function and the direction loss determined by the direction loss function supervise the points and the point order of the instance point sets output by the target detection network, so that the trained target detection model can accurately and effectively predict the ordered point sets corresponding to the instances, realizing prediction at the coordinate-point level of instances. Compared with prediction at the map-instance-frame level, the target detection model trained in the embodiments of the present disclosure helps improve the accuracy of prediction results, and in turn, when used for map generation, helps improve map accuracy.
In an optional example, Figure 3 is a schematic flowchart of step 202 provided by an exemplary embodiment of the present disclosure. In this example, step 202 may specifically include the following steps:
Step 2021: Obtain the training instance point set based on the training input data and the target detection network.
The target detection network can be configured, according to actual needs, for any one of three cases: only training image data can be input, only training point cloud data can be input, or both training image data and training point cloud data can be input.
Exemplarily, Figure 4 is a schematic structural diagram of a target detection network provided by an exemplary embodiment of the present disclosure. As shown in Figure 4, the target detection network may include a feature extraction network, an encoder network, a decoder network, and a prediction head network. The feature extraction network may include, according to actual needs, a first feature extraction network and/or a second feature extraction network: the first feature extraction network is used for feature extraction from the training image data to obtain training image features, and the second feature extraction network is used for feature extraction from the training point cloud data to obtain training point cloud features. The encoder network is used to encode the training image features and/or training point cloud features to obtain a training feature map in the first coordinate system; the decoder network is used to decode the training feature map to obtain training decoding results; and the prediction head network is used to predict the training instance point set based on the training decoding results. For example, the prediction head network can be a linear neural network such as an MLP (Multilayer Perceptron, also called a feedforward neural network), which can be set according to actual needs. The training input data can be used as the input of the target detection network, and the output training instance point set is obtained through inference of the target detection network.
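The following skeleton illustrates, under stated assumptions, how the Figure 4 pipeline could be composed; the backbone, encoder, and decoder modules are placeholders, and only the MLP prediction head that maps decoded features to ordered coordinate points follows the text explicitly:

```python
import torch.nn as nn

class MapElementDetector(nn.Module):
    """Sketch of: feature extraction -> encoder -> decoder -> prediction head."""

    def __init__(self, img_backbone, pc_backbone, encoder, decoder,
                 embed_dim=256, num_points=20):
        super().__init__()
        self.img_backbone = img_backbone  # first feature extraction network (images)
        self.pc_backbone = pc_backbone    # second feature extraction network (point clouds)
        self.encoder = encoder            # produces the feature map in the first coordinate system
        self.decoder = decoder            # decodes instance queries against that map
        self.point_head = nn.Sequential(  # MLP prediction head -> (x, y) per point
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, num_points * 2),
        )
        self.num_points = num_points

    def forward(self, images, point_cloud, queries):
        img_feats = self.img_backbone(images)
        pc_feats = self.pc_backbone(point_cloud)
        bev_map = self.encoder(img_feats, pc_feats)  # training feature map
        decoded = self.decoder(queries, bev_map)     # training decoding results
        points = self.point_head(decoded)            # per-instance ordered points
        return points.view(*points.shape[:-1], self.num_points, 2)
```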
In an optional example, different kinds of ordered point sets can be used for different instances. For example, for a lane line instance, an open-loop ordered point set can be used, i.e. the starting point and the end point of the ordered point set are not the same point, and fitting yields a chain of line segments; for a drivable-area instance, the ordered point set can be a polygon point set, i.e. a closed-loop ordered point set, and fitting yields a closed polygon. The specifics can be set according to actual needs. The target numbers of coordinate points of the ordered point sets corresponding to different instances can be the same or different; for example, a lane line instance may use an ordered point set of 3 coordinate points while a zebra crossing uses an ordered point set of 5 coordinate points, which is not specifically limited. A toy illustration of the two representations follows.
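The closed/open flag below is only one possible encoding; the text notes that a repeated endpoint or another symbol can also mark a polygon:

```python
# Open-loop instance (e.g. a lane line): start and end points differ,
# and fitting yields a chain of line segments.
lane_line = {"points": [(0.0, 5.0), (0.3, 10.0), (0.5, 15.0)], "closed": False}

# Closed-loop instance (e.g. a drivable area): the points are polygon
# vertices, connected head-to-tail when fitting.
drivable_area = {"points": [(0, 0), (4, 0), (4, 8), (2, 10), (0, 8)], "closed": True}
```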
In an optional example, step 2021 may be executed by the processor calling corresponding instructions stored in the memory, or by the first processing unit run by the processor.
Step 2022: Determine the first loss based on the training instance point set, the first label data, and the point-to-point loss function.
After the training instance point set is obtained, the point set of each instance in the training instance point set can be compared point-to-point with the ordered point set of that instance in the first label data to determine the first loss. For example, the absolute value of each point-to-point difference can be taken as the loss of that point, thereby obtaining the loss of each point in each instance, and the network-level point-to-point loss can then be determined from the losses of the points in each instance as the first loss. As another example, the losses of the points of each instance can be summed to obtain the first loss. The specifics can be set according to actual needs.
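A minimal sketch of this point-to-point L1 loss for a single instance, assuming a fixed point correspondence (how per-point losses are aggregated across points and instances, e.g. sum or mean, is left open by the text):

```python
import torch

def point_to_point_l1(pred_points: torch.Tensor, label_points: torch.Tensor) -> torch.Tensor:
    """pred_points, label_points: (num_points, 2) tensors with matched ordering."""
    # Absolute coordinate difference of each matched pair is the per-point loss.
    per_point = (pred_points - label_points).abs().sum(dim=-1)
    # Sum the per-point losses into the instance-level point-to-point loss.
    return per_point.sum()
```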
In an optional example, step 2022 may be executed by the processor calling corresponding instructions stored in the memory, or by the second processing unit run by the processor.
Step 2023: Determine the second loss based on the training instance point set, the first label data, and the direction loss function.
The direction loss function is used to determine the loss of the direction between points in the training instance point set relative to the direction between points in the ordered point sets of the instances in the first label data; this loss can be determined based on the cosine similarity of the direction vectors between two points.
Exemplarily, for two adjacent points in any instance in the training instance point set, a first direction vector of the two adjacent points can be determined based on their coordinate values, a second direction vector can be determined based on the coordinate-value labels of the two corresponding points in the first label data, and the cosine similarity of the two direction vectors can be determined from the first direction vector and the second direction vector. The network-level direction loss is determined as the second loss based on the cosine similarities of pairs of adjacent points in each instance; the specifics can be set according to actual needs.
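A sketch of this direction loss for one instance; turning cosine similarity into a loss as (1 - cosine) is an assumption here, since the text only specifies that the loss is based on the cosine similarity of adjacent-point direction vectors:

```python
import torch
import torch.nn.functional as F

def direction_loss(pred_points: torch.Tensor, label_points: torch.Tensor) -> torch.Tensor:
    """pred_points, label_points: (num_points, 2) tensors with matched ordering."""
    pred_vecs = pred_points[1:] - pred_points[:-1]     # first direction vectors
    label_vecs = label_points[1:] - label_points[:-1]  # second direction vectors
    cos_sim = F.cosine_similarity(pred_vecs, label_vecs, dim=-1, eps=1e-6)
    return (1.0 - cos_sim).mean()  # high similarity -> low loss
```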
Steps 2022 and 2023 may be performed in either order.
In an optional example, step 2023 may be executed by the processor calling corresponding instructions stored in the memory, or by the third processing unit run by the processor.
Step 2024: Adjust the network parameters of the target detection network based on the first loss and the second loss until the first loss and the second loss meet the preset conditions, obtaining the target detection model.
The first loss and the second loss can be combined by a weighted sum with preset weights into a comprehensive loss used for adjusting the network parameters. The preset conditions can be set according to actual needs. The adjustment of network parameters can be implemented with any implementable optimizer, such as the Adam optimizer, which can be set according to actual needs. The Adam optimizer absorbs the advantages of the adaptive-learning-rate gradient descent algorithm (Adagrad) and the momentum gradient descent algorithm; it can both adapt to sparse gradients (as in natural language and computer vision problems) and alleviate the problem of gradient oscillation.
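A hedged sketch of one parameter-update step with the weighted comprehensive loss and the Adam optimizer; the weight values and learning rate are placeholders, not values from the disclosure:

```python
import torch

model = torch.nn.Linear(256, 40)  # stand-in for the target detection network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(first_loss: torch.Tensor, second_loss: torch.Tensor,
                  w1: float = 1.0, w2: float = 0.1) -> float:
    # Weighted sum of the point-to-point loss and the direction loss.
    comprehensive = w1 * first_loss + w2 * second_loss
    optimizer.zero_grad()
    comprehensive.backward()  # losses must be computed from the model's outputs
    optimizer.step()
    return comprehensive.item()
```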
In an optional example, step 2024 may be executed by the processor calling corresponding instructions stored in the memory, or by the fourth processing unit run by the processor.
Figure 5 is a schematic flowchart of step 202 provided by another exemplary embodiment of the present disclosure.
In an optional example, determining the first loss in step 2022 based on the training instance point set, the first label data, and the point-to-point loss function includes:
Step 20221: For each instance, based on the ordered point set corresponding to the instance in the first label data, determine, under each of the different orders of the ordered point set, the correspondence between each point in the ordered point set and the points of that instance in the training instance point set, obtaining the point-to-point relationship corresponding to each order.
The different orders of an ordered point set refer to the orders obtained by taking different endpoints of the ordered point set as the starting point. For example, for the ordered point set of a line segment such as a lane line, including three ordered coordinate points A1, A2, and A3 with two endpoints A1 and A3, there are two different orders: A1-A2-A3 and A3-A2-A1; in different orders, the adjacency relationship between any two coordinate points is unchanged. As another example, for the ordered point set of a polygon such as a drivable area or a zebra crossing, including ordered coordinate points B1-B5, B5 can be equal to B1 to indicate a polygon, or the fact that the ordered point set corresponds to a polygon can be indicated by another symbol; when fitting, the points need to be connected head-to-tail to form a closed loop. The specifics can be set according to actual needs, as long as polygons can be distinguished from the ordered point sets of line segments; this also provides a basis for determining the point-to-point relationships of different orders. For the ordered point set B1-B5, taking B5 not equal to B1 as an example, since every coordinate point can be a vertex of the polygon and can serve as the starting point, the ordered point set has 5 starting points; combined with direction, the ordered point set can correspond to 10 orders, including B1-B5 and its reverse B5-B1, B2-B3-B4-B5-B1 and its reverse, B3-B4-B5-B1-B2 and its reverse, B4-B5-B1-B2-B3 and its reverse, and B5-B1-B2-B3-B4 and its reverse. By adjusting the order of the coordinate points in the training instance point set or in the ordered point set of the first label data, the point-to-point relationships corresponding to the different orders between the corresponding instances of the two are determined. For example, if the ordered point set of an instance in the training instance point set is C1-C5 and the point-set labels of that instance in the first label data are D1-D5, then after D1-D5 are arranged in the 10 different orders of B1-B5 above and matched to C1-C5 respectively, point-to-point relationships corresponding to the different orders are formed; for example, C1-C5 may correspond in order to D5-D1. The specific principles will not be repeated one by one.
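A sketch of enumerating these candidate orders; the polyline/polygon distinction is passed in explicitly here for illustration:

```python
def candidate_orders(points, closed: bool):
    """Open polylines have 2 candidate orders; closed polygons with n vertices
    have 2 * n (any vertex as starting point, in either direction)."""
    n = len(points)
    if not closed:
        return [list(points), list(reversed(points))]
    orders = []
    for start in range(n):
        forward = [points[(start + i) % n] for i in range(n)]
        orders.append(forward)
        orders.append(list(reversed(forward)))
    return orders

# A 5-vertex polygon such as B1-B5 yields 10 candidate orders; the order with
# the smallest point-to-point loss against the prediction is the target order.
assert len(candidate_orders([(0, 0), (1, 0), (1, 1), (0, 2), (-1, 1)], closed=True)) == 10
```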
In an optional example, step 20221 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a first determination subunit run by the processor.
Step 20222: determine the point-to-point loss corresponding to each order based on the point-to-point relationship corresponding to that order.
Here, after the point-to-point relationship corresponding to each order has been determined, the point-to-point loss of each order can be obtained based on the point-to-point loss function.
In an optional example, step 20222 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a second determination subunit run by the processor.
Step 20223: take the order with the smallest point-to-point loss as the target order of the instance.
Here, after the point-to-point loss corresponding to each order has been determined, the smallest point-to-point loss and the order that yields it can be identified; that order is taken as the target order of the instance's ordered point set and is used in the subsequent determination of the overall point-to-point loss of the network.
In an optional example, step 20223 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a third determination subunit run by the processor.
Step 20224: take the point-to-point loss corresponding to the target order as the target point-to-point loss of the instance.
Since the training instance point set includes the ordered point sets of the first number of instances, when the first number is greater than one, a corresponding target order and target point-to-point loss can be determined for each instance.
In an optional example, step 20224 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a fourth determination subunit run by the processor.
Step 20225: determine the first loss based on the target point-to-point loss of each instance.
Specifically, the target point-to-point losses of the instances can be combined to determine the first loss of the network as a whole, as sketched below.
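As a rough sketch of steps 20221 to 20225, the snippet below assumes an L1 point-to-point loss and averages the per-instance target losses; the disclosure does not fix the loss function or the way the per-instance losses are combined, so both choices here are assumptions.

```python
import numpy as np

def target_point_to_point_loss(pred, label_orders):
    # pred: (M, 2) predicted ordered point set of one instance.
    # label_orders: list of (M, 2) label arrays, one per candidate order.
    # Returns the smallest point-to-point loss over all candidate orders
    # (the target point-to-point loss) and the index of the target order.
    losses = [np.abs(pred - order).sum() for order in label_orders]
    best = int(np.argmin(losses))
    return losses[best], best

def first_loss(preds, labels_orders):
    # One possible way of combining the per-instance target losses into
    # the network-level first loss: a simple average.
    per_instance = [target_point_to_point_loss(p, o)[0]
                    for p, o in zip(preds, labels_orders)]
    return sum(per_instance) / len(per_instance)
```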
In an optional example, step 20225 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a fifth determination subunit run by the processor.
In the embodiments of the present disclosure, by considering all possible orders of an ordered point set during training, the smallest point-to-point loss can be identified and used in determining the overall point-to-point loss of the network, so that the target detection network can model the optimal starting point and the corresponding order of each instance, which helps to further improve model performance and the accuracy of the prediction results.
In an optional example, determining the second loss in step 2023 based on the training instance point set, the first label data and the direction loss function includes:
Step 20231: determine the second loss based on the training instance point set, the first label data, the target order corresponding to each instance, and the direction loss function.
Here, since the second loss is a direction loss, it involves the directionality between adjacent points. Therefore, when the point-to-point loss uses the point-to-point relationship of the target order, the direction loss can likewise be determined from the direction vectors of adjacent coordinate points under that point-to-point relationship, so as to ensure consistency between the predicted point-to-point directions and the label directions.
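One way to realise such a direction loss is sketched below, using the direction vectors between adjacent points under the target order and penalising their cosine dissimilarity; the cosine formulation is an assumption, since the disclosure does not fix the concrete direction loss function.

```python
import numpy as np

def direction_loss(pred, label):
    # pred, label: (M, 2) point sets already aligned under the target order.
    pred_dir = np.diff(pred, axis=0)    # direction vectors between adjacent points
    label_dir = np.diff(label, axis=0)
    # Assumed formulation: one minus the cosine similarity of predicted and
    # labelled direction vectors, averaged over all adjacent-point segments.
    cos = (pred_dir * label_dir).sum(axis=1) / (
        np.linalg.norm(pred_dir, axis=1) * np.linalg.norm(label_dir, axis=1) + 1e-8)
    return float((1.0 - cos).mean())
```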
In an optional example, step 20231 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a third processing unit run by the processor.
Figure 6 is a schematic flowchart of step 2021 provided by an exemplary embodiment of the present disclosure.
In an optional example, the training input data may further include initial query features and initial reference points; the initial query features include the target number of initial features for each of the first number of instances, and the initial reference points may include a reference coordinate point for each initial feature. The target detection network is a detection network based on a deformable detection transformer. Accordingly, obtaining the training instance point set in step 2021 based on the training input data and the target detection network includes:
Step 20211: perform feature extraction on the training image data based on a first feature extraction network in the target detection network to obtain first training image features.
Here, the deformable detection transformer (Deformable DETR) is a detection network obtained by improving DETR: it processes features with a multi-scale deformable attention module in place of the attention module in DETR, which helps to address problems such as DETR's high computational complexity and slow convergence. DETR is an end-to-end object detector that combines a convolutional neural network (CNN) with a Transformer and can perform object detection based on the Transformer's powerful modelling capability. The initial query features (queries) may be randomly initialised; they include the target number of initial features for each of the first number of instances, that is, each instance may correspond to the target number of initial features. The target number of initial features may be the same or different for different instances. The initial query features are used in the attention operations of the decoders in the target detection network. The initial reference points may be randomly initialised sets of initial reference coordinate points for the respective instances. The first feature extraction network may be any implementable feature extraction network, for example a convolutional neural network, and may be set according to actual needs.
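The shapes involved can be illustrated as follows; the concrete sizes (50 instances, 20 points per instance, 256-dimensional features) are assumed values for illustration only.

```python
import numpy as np

num_instances = 50   # first number of instances (assumed value)
pts_per_inst = 20    # target number of points per instance (assumed value)
feat_dim = 256       # query feature dimension (assumed value)

# Randomly initialised query features: one feature per point of each instance.
init_queries = np.random.randn(num_instances, pts_per_inst, feat_dim)
# Randomly initialised reference points: one 2-D coordinate per query feature.
init_ref_points = np.random.rand(num_instances, pts_per_inst, 2)
```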
In an optional example, step 20211 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a first feature extraction subunit run by the processor.
Step 20212: perform feature extraction on the training point cloud data based on a second feature extraction network in the target detection network to obtain first training point cloud features.
Here, the second feature extraction network may be any implementable feature extraction network, for example a convolutional neural network, and may be set according to actual needs.
Steps 20211 and 20212 may be performed in any order.
In an optional example, step 20212 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a second feature extraction subunit run by the processor.
Step 20213: encode the first training image features and/or the first training point cloud features based on an encoder network in the target detection network to obtain a target training feature map in the first coordinate system.
Here, the encoder network may include at least one encoder; it can convert the first training image features and the first training point cloud features into the first coordinate system through encoding to obtain the corresponding target training feature map.
In an optional example, step 20213 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by an encoding subunit run by the processor.
Step 20214: obtain a training decoding result based on the target training feature map, the initial query features, the initial reference points and a decoder network in the target detection network, the decoder network including at least one decoder.
Here, the training decoding result may be the result obtained after decoding by the at least one decoder. The decoder network can continually update the initial query features based on the initial reference points and the target training feature map to obtain the training decoding result.
In an optional example, step 20214 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a decoding subunit run by the processor.
Step 20215: determine the training instance point set based on the training decoding result.
Here, the training instance point set can be obtained by continually updating the initial reference points with the decoding result of each decoder: after a decoder produces its decoding result, the offset of each reference point can be predicted based on that result. Taking the first decoder as an example, its decoding result is used to predict the offset of each initial reference point; each offset is added to the corresponding initial reference point, and the updated reference points are taken as the output reference points of the first decoder. After the second decoder decodes, the offsets of the output reference points of the first decoder are predicted based on the second decoder's decoding result and added to those output reference points to obtain the output reference points of the second decoder, and so on; the output reference points of the last decoder can be taken as the training instance point set.
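The per-decoder refinement described above can be sketched as follows; decoder and offset_head stand in for a decoder layer and its offset prediction network, and the dummy callables at the end exist only to make the sketch executable.

```python
import numpy as np

def refine_reference_points(decoders, offset_heads, feature_map, queries, ref_points):
    # Each decoder updates the queries; its offset head predicts per-point
    # offsets that are added to that decoder's input reference points.
    for decoder, offset_head in zip(decoders, offset_heads):
        queries = decoder(queries, ref_points, feature_map)   # decoding result
        ref_points = ref_points + offset_head(queries)        # output reference points
    return ref_points  # last decoder's outputs = training instance point set

# Dummy stand-ins, only to make the sketch runnable end to end.
dec = lambda q, r, f: q * 0.5
off = lambda q: q[..., :2] * 0.01
pts = refine_reference_points([dec, dec], [off, off], feature_map=None,
                              queries=np.random.randn(50, 20, 256),
                              ref_points=np.random.rand(50, 20, 2))
print(pts.shape)  # (50, 20, 2)
```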
As an example, Figure 7 is a schematic structural diagram of the decoder network provided by an exemplary embodiment of the present disclosure. As shown in Figure 7, the decoder network may include N decoders, and the initial query features may include 3 initial features for each of two instances, instance 1 and instance 2, that is, each instance corresponds to 3 initial features; the 3 grey blocks of instance 1 represent its 3 initial features, and the 3 black blocks of instance 2 represent its 3 initial features. The initial reference points may include the reference coordinate points of the two instances, one reference coordinate point per initial feature; taking instance 1 as an example, its 3 initial features correspond to 3 reference coordinate points (see the 3 grey dots of instance 1 in Figure 7). The training decoding result is obtained after decoding by the N decoders, and the training instance point set is then obtained based on the training decoding result.
As an example, Figure 8 is a schematic diagram of the principle of Deformable DETR provided by an exemplary embodiment of the present disclosure. As shown in Figure 8, taking the first decoder as an example, Query Feature denotes the initial query features, Reference Point denotes the initial reference points, and Input Feature Map denotes the target training feature map; at each step, the deformable attention module of the decoder may attend only to a small region around the reference point, regardless of the resolution of the whole feature map. Head denotes an attention head, m (m = 1, 2, 3) denotes the m-th attention head, Attention Weights (Amqk) denotes the attention weights of the m-th attention head, W′mx denotes the encoding of the key vector (Key) in the attention operation of the m-th attention head, Linear denotes a linear layer, Aggregate denotes the aggregation of the attention weights with the key points in the value vectors (Values), Softmax denotes the activation function, and Output denotes the decoding result. Compared with DETR, Deformable DETR may collect only the main feature points near the reference point, so each query vector (Query) may have only a very small number of key vectors (Key). The initial query features are passed through a linear layer to predict the sampling offsets Δpmqk; in this example, 3 offsets, represented by 3 arrows, are predicted for each attention head (for example Head1, i.e. m = 1 in Δpmqk). A sampling offset is the position offset, relative to the initial reference point, of a key point sampled from the value vector. The Input Feature Map is passed through a linear layer to obtain the value vectors (Values), one per attention head, and the sampling offsets are used to extract sparse values (the key points in Values) near the initial reference point from the value vectors. The Query Feature is passed through a linear layer and Softmax to obtain the attention weights Attention Weights (Amqk), which are then aggregated with the sparse values: for example, the 3 weights of Head1 in Attention Weights (Amqk), 0.5, 0.3 and 0.2, are used to compute a weighted sum of the values of the 3 key points extracted from Head1's value vector (the 3 stacked grey blocks), yielding the attention result of Head1; the attention results of the other heads (Aggregated Sampled Values) are obtained in the same way. The Aggregated Sampled Values are passed through a linear layer to obtain the decoding result (Output); alternatively, the Output may be added to the Query Feature through a residual connection and the sum taken as the decoding result, which may be set according to actual needs. The detailed principles of Deformable DETR are not repeated here.
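A drastically simplified single-head version of this sampling-and-aggregation step is sketched below; it uses nearest-neighbour sampling instead of the bilinear interpolation used in practice, and all shapes and weight matrices are illustrative assumptions rather than the actual Deformable DETR implementation.

```python
import numpy as np

def deformable_attention_1head(query, ref_point, value_map, w_off, w_attn):
    # query: (C,) query feature; ref_point: (2,) reference point in [0, 1].
    # value_map: (H, W, C) value features; w_off: (C, K*2) offset weights;
    # w_attn: (C, K) attention-logit weights, for K sampled key points.
    H, W, _ = value_map.shape
    K = w_attn.shape[1]
    offsets = (query @ w_off).reshape(K, 2)          # K sampling offsets
    logits = query @ w_attn
    weights = np.exp(logits) / np.exp(logits).sum()  # softmax attention weights
    out = 0.0
    for k in range(K):
        # Nearest-neighbour sampling near the reference point (for brevity).
        x = int(np.clip((ref_point[0] + offsets[k, 0]) * (W - 1), 0, W - 1))
        y = int(np.clip((ref_point[1] + offsets[k, 1]) * (H - 1), 0, H - 1))
        out = out + weights[k] * value_map[y, x]
    return out  # aggregated sampled values of this head

C, H, W, K = 8, 16, 16, 3
rng = np.random.default_rng(0)
out = deformable_attention_1head(rng.normal(size=C), np.array([0.5, 0.5]),
                                 rng.normal(size=(H, W, C)),
                                 rng.normal(size=(C, K * 2)) * 0.1,
                                 rng.normal(size=(C, K)))
print(out.shape)  # (8,)
```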
In an optional example, step 20215 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a first processing subunit run by the processor.
In the embodiments of the present disclosure, instance prediction at the coordinate-point level is achieved with a target detection model based on Deformable DETR. Compared with prediction using segmentation combined with post-processing, or with autoregressive prediction, this helps to improve prediction accuracy; moreover, because the Deformable DETR-based target detection model uses deformable attention, the attention operation may collect only the main feature points near the reference points, which helps to reduce the amount of computation and thus to improve prediction speed.
In an optional example, obtaining the training decoding result in step 20214 based on the target training feature map, the initial query features, the initial reference points and the decoder network in the target detection network includes: for each decoder in the decoder network, obtaining the decoding result of that decoder based on the target training feature map and the input query features and input reference points of that decoder, where the input query features and input reference points of the first decoder are the initial query features and the initial reference points respectively, the input query features of any decoder other than the first are the decoding result of the preceding decoder, and the input reference points of that decoder are the output reference points determined based on the decoding result of the preceding decoder; and taking the decoding result of the last decoder as the training decoding result.
In an optional example, after obtaining, for each decoder in the decoder network, the decoding result of that decoder based on the target training feature map and the input query features and input reference points of that decoder, the method further includes: determining a first offset for that decoder based on its decoding result and an offset prediction network corresponding to that decoder; and determining the output reference points of that decoder based on the first offset and its input reference points. Accordingly, determining the training instance point set based on the training decoding result includes: taking the output reference points of the last decoder, determined based on the training decoding result, as the training instance point set.
As an example, Figure 9 is a schematic diagram of the principle of determining the training instance point set provided by another exemplary embodiment of the present disclosure. As shown in Figure 9, each decoder may have a corresponding offset prediction network, which predicts the first offsets of the reference points based on that decoder's decoding result; the offsets are added to the decoder's input reference points. The input reference points of decoder 1 are the initial reference points, and the input reference points of decoder i (i = 2, 3, ..., N) are the output reference points of decoder i-1. The reference points are continually refined through training to obtain an accurate instance point set.
Figure 10 is a schematic flowchart of step 202 provided by yet another exemplary embodiment of the present disclosure.
In an optional example, the first label data further includes a type label for each instance in the training input data. After obtaining the training decoding result in step 20214 based on the target training feature map, the initial query features, the initial reference points and the decoder network in the target detection network, the method further includes:
Step 20216: determine a training type result based on the training decoding result, the training type result including the predicted type of each instance.
Here, the type label of an instance may be the true type of the instance obtained by prior annotation, for example lane line, curb, zebra crossing, arrow or drivable area. The predicted type of an instance is the element type to which the target detection network predicts the instance belongs; the element types may include lane lines, curbs, zebra crossings, arrows, drivable areas and so on. For example, an instance may be predicted to be a lane line.
As an example, Figure 11 is a schematic diagram of a prediction network for the predicted type provided by an exemplary embodiment of the present disclosure. Decoder N decodes to obtain the training decoding result, and the training type result is predicted by a type prediction network. The type prediction network may be a prediction network based on a feedforward neural network and may be set according to actual needs. The type prediction network, the offset prediction networks and the reference point update networks may be collectively referred to as the prediction head network.
In an optional example, step 20216 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a fifth processing unit run by the processor.
Step 20217: determine a type loss based on the training type result and the type labels in the first label data.
Here, the type loss may be determined based on a preset type loss function; the type loss function may be any implementable loss function, for example the focal loss function, and may be chosen according to actual needs.
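For reference, a standard binary focal loss can be sketched as follows; this is the commonly used formulation of focal loss, and the disclosure only names focal loss as one possible choice.

```python
import numpy as np

def focal_loss(probs, targets, alpha=0.25, gamma=2.0):
    # probs: predicted probabilities in (0, 1); targets: 0/1 labels of the
    # same shape. Standard focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    return float((-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-8)).mean())
```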
In an optional example, step 20217 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a sixth processing unit run by the processor.
Accordingly, adjusting the network parameters of the target detection network in step 2024 based on the first loss and the second loss until the first loss and the second loss meet the preset conditions, to obtain the target detection model, includes:
Step 20241: determine a comprehensive loss based on the first loss, the second loss, the type loss and preset weights.
Here, the preset weights may be set according to actual needs. For example, if the weights of the first loss l1, the second loss l2 and the type loss l3 are set to λ1, λ2 and λ3 respectively, the comprehensive loss can be expressed as:
L = λ1·l1 + λ2·l2 + λ3·l3
For example, λ1, λ2 and λ3 may be set to 5, 0.1 and 2 respectively, without limitation.
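The weighted combination can be written directly from the formula above; the default weights below are the example values 5, 0.1 and 2.

```python
def comprehensive_loss(l1, l2, l3, lambdas=(5.0, 0.1, 2.0)):
    # L = λ1·l1 + λ2·l2 + λ3·l3, with the example weights given above.
    return lambdas[0] * l1 + lambdas[1] * l2 + lambdas[2] * l3
```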
In an optional example, step 20241 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a third processing subunit run by the processor.
Step 20242: adjust the network parameters of the target detection network based on the comprehensive loss until the comprehensive loss meets the preset condition, to obtain the target detection model.
In an optional example, step 20242 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a fourth processing subunit run by the processor.
In the embodiments of the present disclosure, adjusting the network parameters by further combining the type loss with the point-to-point loss and the direction loss helps to further improve the performance of the target detection model and the accuracy of its prediction results.
In an optional example, Figure 12 is a schematic flowchart of step 2024 provided by an exemplary embodiment of the present disclosure. In this example, adjusting the network parameters of the target detection network in step 2024 based on the first loss and the second loss until the first loss and the second loss meet the preset conditions, to obtain the target detection model, includes:
Step 20241a: determine a comprehensive loss based on the first loss and the second loss.
Here, the first loss and the second loss may be summed with certain proportional weights to obtain the comprehensive loss; the principle is as described above and is not repeated here.
In an optional example, step 20241a may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a fourth processing unit 5024 run by the processor.
Step 20242a: adjust the network parameters of the target detection network based on the comprehensive loss until the comprehensive loss meets the preset condition, to obtain the target detection model.
The specific principle of network parameter adjustment is as described above and is not repeated here.
In an optional example, step 20242a may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the fourth processing unit 5024 run by the processor.
In the training method of the target detection model of the embodiments of the present disclosure, the model is trained with a hierarchical prediction scheme over instances and their corresponding ordered point sets, combining the point-to-point loss, the direction loss and the type loss, so that the obtained target detection model can predict the ordered point sets of instances more accurately, which helps to further improve prediction accuracy. Moreover, combined with the deformable DETR network, the attention operations of the target detection model during inference may focus only on feature interactions among the points neighbouring the reference points, which helps to reduce computational complexity, thereby reducing the amount of computation and improving prediction efficiency. In addition, the query vectors of the target detection model of the embodiments of the present disclosure are at the point level, which is more flexible than the instance-box level.
The above embodiments and optional examples of the present disclosure may be implemented individually or combined in any manner where no conflict arises, as may be set according to actual needs; the embodiments of the present disclosure are not limited in this respect.
Figure 13 is a schematic flowchart of a map generation method provided by an exemplary embodiment of the present disclosure. This embodiment may be applied to an electronic device, for example an on-board computing platform. As shown in Figure 13, the method includes the following steps:
Step 301: acquire first image data and/or first point cloud data of at least one viewing angle.
Here, the first image data may be the image data of the current frame collected in real time by at least one camera mounted on the vehicle while the vehicle is driving, and the first point cloud data may be the point cloud data of the current frame collected in real time by a radar mounted on the vehicle while the vehicle is driving.
In an optional example, step 301 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a second acquisition module run by the processor.
Step 302: based on the first image data and/or the first point cloud data, obtain a target instance ordered point set using a target detection model obtained by pre-training.
Here, the target detection model is obtained by the training method of the target detection model provided in any of the above embodiments or optional examples; the target instance ordered point set includes the ordered point sets of the first number of instances, each ordered point set including the target number of coordinate points in the first coordinate system.
Here, the specific input data required by the target detection model may be set, and the model trained, according to actual needs: the model may support image data, point cloud data, or both; see the foregoing embodiments for details. The specific inference principle of the target detection model is as described in the foregoing embodiments and is not repeated here.
In an optional example, step 302 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a second processing module run by the processor.
Step 303: generate a map based on the target instance ordered point set.
Here, the target instance ordered point set is a set of coordinate points in the first coordinate system (for example, the coordinate system of the bird's-eye view). By fitting the ordered point set of each instance in the target instance point set, the corresponding map elements, such as lane lines, zebra crossings and curbs, can be obtained; the fitting results of the instances can serve as a local road map around the vehicle's current position.
In practical applications, the target instance ordered point set may also be converted into a global coordinate system through coordinate transformation, so that a global road map can be generated by region growing; this may be set according to actual needs. The global coordinate system may be, for example, the world coordinate system or a relatively stable coordinate system rigidly connected to the world coordinate system, for example a preset coordinate system whose origin is the vehicle's starting position.
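The coordinate transformation mentioned here can be sketched as follows, assuming the ego-to-global pose is available as a 3x3 homogeneous transform; the helper name to_global and the pose matrix are hypothetical.

```python
import numpy as np

def to_global(points_ego, ego_to_global):
    # points_ego: (M, 2) instance points in the first (e.g. bird's-eye-view)
    # coordinate system; ego_to_global: 3x3 homogeneous ego-to-global pose.
    homo = np.hstack([points_ego, np.ones((len(points_ego), 1))])
    return (homo @ ego_to_global.T)[:, :2]

T = np.array([[1.0, 0.0, 100.0],
              [0.0, 1.0, 200.0],
              [0.0, 0.0, 1.0]])  # hypothetical pose: pure translation
print(to_global(np.array([[1.0, 2.0], [3.0, 4.0]]), T))  # [[101. 202.] [103. 204.]]
```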
In an optional example, step 303 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by a third processing module run by the processor.
The map generation method of the embodiments of the present disclosure achieves prediction of map instances at the coordinate-point level based on the target detection model; compared with prediction at the map-instance-box level, the method of the embodiments of the present disclosure can help to improve the accuracy of the map.
Any method provided by the embodiments of the present disclosure (including the training method of the target detection model and the map generation method) may be executed by any appropriate device with data processing capability, including but not limited to a terminal device or a server. Alternatively, any method provided by the embodiments of the present disclosure may be executed by a processor; for example, the processor executes any method mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in a memory. This is not repeated below.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be carried out by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk or an optical disc.
Exemplary apparatus
Figure 14 is a schematic structural diagram of a training apparatus for a target detection model provided by an exemplary embodiment of the present disclosure. The apparatus of this embodiment may be used to implement the corresponding training method embodiments of the target detection model of the present disclosure. As shown in Figure 14, the apparatus includes a first acquisition module 501 and a first processing module 502.
The first acquisition module 501 is configured to acquire training input data and corresponding first label data, the training input data including training image data and/or training point cloud data, the first label data including the ordered point sets of a first number of instances in the training input data, each ordered point set including a target number of coordinate points in a first coordinate system. The first processing module 502 is configured to train a pre-established target detection network based on the training input data acquired by the first acquisition module 501, the first label data, a point-to-point loss function and a direction loss function, to obtain a target detection model; the point-to-point loss function is used to determine the point-to-point loss of the training instance point set output by the target detection network relative to the ordered point sets of the instances in the first label data, and the direction loss function is used to determine the loss of the point-to-point directions in the training instance point set relative to the point-to-point directions of the ordered point sets of the instances in the first label data.
In an optional example, Figure 15 is a schematic structural diagram of the first processing module 502 provided by an exemplary embodiment of the present disclosure. In this example, the first processing module 502 includes a first processing unit 5021, a second processing unit 5022, a third processing unit 5023 and a fourth processing unit 5024.
The first processing unit 5021 is configured to obtain the training instance point set based on the training input data and the target detection network; the second processing unit 5022 is configured to determine the first loss based on the training instance point set obtained by the first processing unit 5021, the first label data and the point-to-point loss function; the third processing unit 5023 is configured to determine the second loss based on the training instance point set obtained by the first processing unit 5021, the first label data and the direction loss function; and the fourth processing unit 5024 is configured to adjust the network parameters of the target detection network based on the first loss and the second loss until the first loss and the second loss meet the preset conditions, to obtain the target detection model.
In an optional example, Figure 16 is a schematic structural diagram of the second processing unit 5022 provided by an exemplary embodiment of the present disclosure. In this example, the second processing unit 5022 includes a first determination subunit 50221, a second determination subunit 50222, a third determination subunit 50223, a fourth determination subunit 50224 and a fifth determination subunit 50225.
The first determination subunit 50221 is configured to, for each instance, based on the ordered point set of the instance in the first label data and for each different order of that ordered point set, determine the correspondence between the points of the ordered point set and the points of the instance in the training instance point set, to obtain the point-to-point relationship of each order; the second determination subunit 50222 is configured to determine the point-to-point loss of each order based on the point-to-point relationship of that order; the third determination subunit 50223 is configured to take the order with the smallest point-to-point loss as the target order of the instance; the fourth determination subunit 50224 is configured to take the point-to-point loss of the target order as the target point-to-point loss of the instance; and the fifth determination subunit 50225 is configured to determine the first loss based on the target point-to-point loss of each instance.
In an optional example, the third processing unit 5023 is specifically configured to determine the second loss based on the training instance point set, the first label data, the target order of each instance and the direction loss function.
Figure 17 is a schematic structural diagram of the first processing unit 5021 provided by an exemplary embodiment of the present disclosure.
In an optional example, the training input data further includes initial query features and initial reference points; the initial query features include the target number of initial features for each of the first number of instances, and the initial reference points include a reference coordinate point for each initial feature; the target detection network is a detection network based on a deformable detection transformer. The first processing unit 5021 includes a first feature extraction subunit 50211, a second feature extraction subunit 50212, an encoding subunit 50213, a decoding subunit 50214 and a first processing subunit 50215.
The first feature extraction subunit 50211 is configured to perform feature extraction on the training image data based on the first feature extraction network in the target detection network to obtain the first training image features; the second feature extraction subunit 50212 is configured to perform feature extraction on the training point cloud data based on the second feature extraction network in the target detection network to obtain the first training point cloud features; the encoding subunit 50213 is configured to encode the first training image features and/or the first training point cloud features based on the encoder network in the target detection network to obtain the target training feature map in the first coordinate system; the decoding subunit 50214 is configured to obtain the training decoding result based on the target training feature map, the initial query features, the initial reference points and the decoder network in the target detection network, the decoder network including at least one decoder; and the first processing subunit 50215 is configured to determine the training instance point set based on the training decoding result.
In an optional example, the decoding subunit 50214 is specifically configured to: for each decoder in the decoder network, obtain the decoding result of that decoder based on the target training feature map and the input query features and input reference points of that decoder, where the input query features and input reference points of the first decoder are the initial query features and the initial reference points respectively, the input query features of any decoder other than the first are the decoding result of the preceding decoder, and the input reference points of that decoder are the output reference points determined based on the decoding result of the preceding decoder; and take the decoding result of the last decoder as the training decoding result.
In an optional example, the first processing unit 5021 further includes an offset prediction subunit 50216 and a second processing subunit 50217.
The offset prediction subunit 50216 is configured to determine the first offset of a decoder based on that decoder's decoding result and the offset prediction network corresponding to that decoder; the second processing subunit 50217 is configured to determine the output reference points of that decoder based on the first offset and its input reference points. Accordingly, the first processing subunit 50215 is specifically configured to take the output reference points of the last decoder, determined based on the training decoding result, as the training instance point set.
Figure 18 is a schematic structural diagram of the first processing module 502 provided by another exemplary embodiment of the present disclosure.
In an optional example, the first label data further includes a type label for each instance in the training input data, and the first processing module 502 further includes:
a fifth processing unit 5025 configured to determine the training type result based on the training decoding result, the training type result including the predicted type of each instance; and a sixth processing unit 5026 configured to determine the type loss based on the training type result and the type labels in the first label data. Accordingly, the fourth processing unit 5024 includes: a third processing subunit 50241 configured to determine the comprehensive loss based on the first loss, the second loss, the type loss and the preset weights; and a fourth processing subunit 50242 configured to adjust the network parameters of the target detection network based on the comprehensive loss until the comprehensive loss meets the preset condition, to obtain the target detection model.
In an optional example, the fourth processing unit 5024 is specifically configured to: determine the comprehensive loss based on the first loss and the second loss; and adjust the network parameters of the target detection network based on the comprehensive loss until the comprehensive loss meets the preset condition, to obtain the target detection model.
For the beneficial technical effects of the exemplary embodiments of this apparatus, refer to the corresponding beneficial technical effects in the exemplary method section above; they are not repeated here.
Figure 19 is a schematic structural diagram of a map generation apparatus provided by an exemplary embodiment of the present disclosure. The apparatus of this embodiment may be used to implement the corresponding map generation method embodiments of the present disclosure. As shown in Figure 19, the apparatus includes a second acquisition module 601, a second processing module 602 and a third processing module 603.
The second acquisition module 601 is configured to acquire first image data and/or first point cloud data of at least one viewing angle; the second processing module 602 is configured to obtain a target instance ordered point set based on the first image data and/or the first point cloud data acquired by the second acquisition module 601, using a target detection model obtained by pre-training, the target detection model being obtained by the training method of the target detection model of any of the above embodiments or optional examples, the target instance ordered point set including the ordered point sets of a first number of instances, each ordered point set including a target number of coordinate points in the first coordinate system; and the third processing module 603 is configured to generate a map based on the target instance ordered point set obtained by the second processing module 602.
For the beneficial technical effects of the exemplary embodiments of this apparatus, refer to the corresponding beneficial technical effects in the exemplary method section above; they are not repeated here.
Exemplary electronic device
Figure 20 is a schematic structural diagram of an application embodiment of the electronic device of the present disclosure. In this embodiment, the electronic device 10 includes one or more processors 11 and a memory 12.
The processor 11 may be a central processing unit (CPU) or another form of processing unit with data processing capability and/or instruction execution capability, and may control other components in the electronic device 10 to perform desired functions.
The memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk or flash memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 11 may run the program instructions to implement the methods of the various embodiments of the present disclosure described above and/or other desired functions.
In one example, the electronic device 10 may further include an input apparatus 13 and an output apparatus 14, these components being interconnected through a bus system and/or other forms of connection mechanisms (not shown).
In addition, the input apparatus 13 may include, for example, a keyboard, a mouse and the like.
The output apparatus 14 can output various information to the outside, including determined distance information, direction information and the like. The output apparatus 14 may include, for example, a display, a speaker, a printer, and a communication network with the remote output devices connected to it, among others.
Of course, for simplicity, Figure 20 shows only some of the components of the electronic device 10 that are relevant to the present disclosure, omitting components such as buses and input/output interfaces. In addition, the electronic device 10 may include any other appropriate components depending on the specific application.
Exemplary computer program product and computer-readable storage medium
In addition to the above methods and devices, embodiments of the present disclosure may also be a computer program product including computer program instructions that, when run by a processor, cause the processor to perform the steps of the methods according to the various embodiments of the present disclosure described in the "Exemplary methods" section of this specification.
The computer program product may have program code for performing the operations of the embodiments of the present disclosure written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
In addition, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when run by a processor, cause the processor to perform the steps of the methods according to the various embodiments of the present disclosure described in the "Exemplary methods" section of this specification.
The computer-readable storage medium may be any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more conductors, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fibre, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
The basic principles of the present disclosure have been described above in conjunction with specific embodiments. However, the merits, advantages, effects and the like mentioned in the present disclosure are merely examples and not limitations, and must not be regarded as essential to every embodiment of the present disclosure. In addition, the specific details disclosed above are provided only for the purposes of illustration and ease of understanding, and are not limiting; they do not restrict the present disclosure to being implemented with those specific details.
Those skilled in the art may make various changes and modifications to the present disclosure without departing from the spirit and scope of the present application. Accordingly, if such modifications and variations fall within the scope of the claims of the present disclosure and their equivalents, the present disclosure is intended to encompass them as well.

Claims (15)

  1. A method for training a target detection model, comprising:
    acquiring training input data and corresponding first label data, the training input data comprising training image data and/or training point cloud data, the first label data comprising ordered point sets respectively corresponding to a first number of instances in the training input data, each ordered point set comprising a target number of coordinate points in a first coordinate system; and
    training a pre-established target detection network based on the training input data, the first label data, a point-to-point loss function and a direction loss function to obtain a target detection model, wherein the point-to-point loss function is used to determine a point-to-point loss of a training instance point set output by the target detection network relative to the ordered point sets of the instances in the first label data, and the direction loss function is used to determine a loss of point-to-point directions in the training instance point set relative to point-to-point directions of the ordered point sets of the instances in the first label data.
  2. 根据权利要求1所述的方法,其中,所述基于所述训练输入数据、所述第一标签数据、点对点损失函数及方向损失函数,对预先建立的目标检测网络进行训练,获得目标检测模型,包括:The method according to claim 1, wherein the pre-established target detection network is trained based on the training input data, the first label data, a point-to-point loss function and a direction loss function to obtain a target detection model, include:
    基于所述训练输入数据及所述目标检测网络,获得所述训练实例点集;Based on the training input data and the target detection network, obtain the training instance point set;
    基于所述训练实例点集、所述第一标签数据及所述点对点损失函数,确定第一损失;Determine a first loss based on the training instance point set, the first label data and the point-to-point loss function;
    基于所述训练实例点集、所述第一标签数据及所述方向损失函数,确定第二损失;Determine a second loss based on the training instance point set, the first label data and the direction loss function;
    基于所述第一损失和所述第二损失,对所述目标检测网络的网络参数进行调整,直至所述第一损失和所述第二损失满足预设条件,获得所述目标检测模型。Based on the first loss and the second loss, the network parameters of the target detection network are adjusted until the first loss and the second loss meet preset conditions, and the target detection model is obtained.
  3. 根据权利要求2所述的方法,其中,所述基于所述训练实例点集、所述第一标签数据及所述点对点损失函数,确定第一损失,包括:The method of claim 2, wherein determining the first loss based on the training instance point set, the first label data and the point-to-point loss function includes:
    对于每个所述实例,基于所述第一标签数据中该实例对应的有序点集,分别以该有序点集的不同顺序,确定该有序点集中各点与所述训练实例点集中该实例的点的对应关系,获得各顺序分别对应的点对点关系;For each instance, based on the ordered point set corresponding to the instance in the first label data, determine each point in the ordered point set and the training instance point set in different orders of the ordered point set. The corresponding relationship between the points of this instance is to obtain the point-to-point relationship corresponding to each sequence;
    基于各顺序分别对应的所述点对点关系,确定各顺序分别对应的点对点损失;Based on the point-to-point relationship corresponding to each sequence, determine the point-to-point loss corresponding to each sequence;
    将点对点损失最小的顺序作为该实例的目标顺序;The order with the smallest point-to-point loss is used as the target order of this instance;
    将所述目标顺序对应的点对点损失作为该实例的目标点对点损失;Use the point-to-point loss corresponding to the target sequence as the target point-to-point loss of this instance;
    基于各所述实例的所述目标点对点损失,确定所述第一损失。The first loss is determined based on the target point-to-point loss for each of the instances.
  4. 根据权利要求3所述的方法,其中,所述基于所述训练实例点集、所述第一标签数据及所述方向损失函数,确定第二损失,包括:The method of claim 3, wherein determining the second loss based on the training instance point set, the first label data and the direction loss function includes:
    基于所述训练实例点集、所述第一标签数据、各实例分别对应的目标顺序及所述方向损失函数,确定所述第二损失。The second loss is determined based on the training instance point set, the first label data, the target order corresponding to each instance, and the direction loss function.
  5. 根据权利要求2所述的方法,其中,所述训练输入数据还包括初始查询特征及初始参考点,所述初始查询特征包括所述第一数量的实例分别对应的目标数量的初始特征,所述初始参考点包括各所述初始特征分别对应的参考坐标点;所述目标检测网络为基于可变形检测变换器的检测网络;The method according to claim 2, wherein the training input data further includes initial query features and initial reference points, the initial query features include initial features of a target number corresponding to the first number of instances, and the The initial reference points include reference coordinate points corresponding to each of the initial features; the target detection network is a detection network based on a deformable detection transformer;
    所述基于所述训练输入数据及所述目标检测网络,获得所述训练实例点集,包括:Obtaining the training instance point set based on the training input data and the target detection network includes:
    基于所述目标检测网络中的第一特征提取网络对所述训练图像数据进行特征提取,获得第一训练图像特征;Perform feature extraction on the training image data based on the first feature extraction network in the target detection network to obtain first training image features;
    基于所述目标检测网络中的第二特征提取网络对所述训练点云数据进行特征提取,获得第一训练点云特征;Perform feature extraction on the training point cloud data based on the second feature extraction network in the target detection network to obtain first training point cloud features;
    基于所述目标检测网络中的编码器网络对所述第一训练图像特征和/或所述第一训练点云特征进行编码,获得第一坐标系下的目标训练特征图;Encode the first training image features and/or the first training point cloud features based on the encoder network in the target detection network to obtain a target training feature map in the first coordinate system;
    基于所述目标训练特征图、所述初始查询特征、所述初始参考点、及所述目标检测网络中的解码器网络,获得训练解码结果,所述解码器网络包括至少一个解码器;Obtain a training decoding result based on the target training feature map, the initial query feature, the initial reference point, and a decoder network in the target detection network, the decoder network including at least one decoder;
    基于所述训练解码结果,确定所述训练实例点集。 Based on the training decoding results, the training instance point set is determined.
  6. 根据权利要求5所述的方法,其中,所述基于所述目标训练特征图、所述初始查询特征、所述初始参考点、及所述目标检测网络中的解码器网络,获得训练解码结果,包括:The method of claim 5, wherein the training decoding result is obtained based on the target training feature map, the initial query feature, the initial reference point, and the decoder network in the target detection network, include:
    对于所述解码器网络中的每个所述解码器,基于所述目标训练特征图及该解码器对应的输入查询特征和输入参考点,获得该解码器的解码结果,其中,第一个解码器对应的输入查询特征和输入参考点分别为所述初始查询特征和所述初始参考点,除所述第一个解码器之外的任一其他解码器对应的输入查询特征为该其他解码器的前一解码器的解码结果,该其他解码器的输入参考点为基于前一解码器的解码结果确定的输出参考点;For each decoder in the decoder network, the decoding result of the decoder is obtained based on the target training feature map and the input query feature and input reference point corresponding to the decoder, where the first decoder The input query features and input reference points corresponding to the decoder are the initial query features and the initial reference point respectively, and the input query features corresponding to any other decoder except the first decoder are the other decoders. The decoding result of the previous decoder, the input reference point of the other decoder is the output reference point determined based on the decoding result of the previous decoder;
    将最后一个所述解码器的解码结果作为所述训练解码结果。The decoding result of the last decoder is used as the training decoding result.
  7. 根据权利要求6所述的方法,其中,在所述对于所述解码器网络中的每个所述解码器,基于所述目标训练特征图及该解码器对应的输入查询特征和输入参考点,获得该解码器的解码结果之后,还包括:The method of claim 6, wherein for each decoder in the decoder network, based on the target training feature map and the input query feature and input reference point corresponding to the decoder, After obtaining the decoding result of the decoder, it also includes:
    基于该解码器的解码结果、及该解码器对应的偏移量预测网络,确定该解码器对应的第一偏移量;Based on the decoding result of the decoder and the offset prediction network corresponding to the decoder, determine the first offset corresponding to the decoder;
    基于所述第一偏移量及该解码器对应的所述输入参考点,确定该解码器对应的输出参考点;Based on the first offset and the input reference point corresponding to the decoder, determine the output reference point corresponding to the decoder;
    所述基于所述训练解码结果,确定所述训练实例点集,包括:Determining the training instance point set based on the training decoding result includes:
    将基于所述训练解码结果确定的所述最后一个解码器对应的输出参考点作为所述训练实例点集。The output reference point corresponding to the last decoder determined based on the training decoding result is used as the training instance point set.
  8. 根据权利要求5所述的方法,其中,所述第一标签数据还包括所述训练输入数据中各所述实例分别对应的类型标签;The method according to claim 5, wherein the first label data further includes type labels corresponding to each of the instances in the training input data;
    在所述基于所述目标训练特征图、所述初始查询特征、所述初始参考点、及所述目标检测网络中的解码器网络,获得训练解码结果之后,还包括:After obtaining the training decoding result based on the target training feature map, the initial query feature, the initial reference point, and the decoder network in the target detection network, it also includes:
    基于所述训练解码结果,确定训练类型结果,所述训练类型结果包括各所述实例分别对应的预测类型;Based on the training decoding results, determine a training type result, where the training type result includes a prediction type corresponding to each of the instances;
    基于所述训练类型结果及所述第一标签数据中的类型标签,确定类型损失;Determine a type loss based on the training type result and the type label in the first label data;
    所述基于所述第一损失和所述第二损失,对所述目标检测网络的网络参数进行调整,直至所述第一损失和所述第二损失满足预设条件,获得所述目标检测模型,包括:Based on the first loss and the second loss, the network parameters of the target detection network are adjusted until the first loss and the second loss meet preset conditions, and the target detection model is obtained. ,include:
    基于所述第一损失、所述第二损失、所述类型损失及预设权重,确定综合损失;Determine comprehensive loss based on the first loss, the second loss, the type of loss and the preset weight;
    基于所述综合损失,对所述目标检测网络的网络参数进行调整,直至所述综合损失满足所述预设条件,获得所述目标检测模型。Based on the comprehensive loss, the network parameters of the target detection network are adjusted until the comprehensive loss meets the preset condition, and the target detection model is obtained.
  9. 根据权利要求2所述的方法,其中,所述基于所述第一损失和所述第二损失,对所述目标检测网络的网络参数进行调整,直至所述第一损失和所述第二损失满足预设条件,获得所述目标检测模型,包括:The method of claim 2, wherein based on the first loss and the second loss, network parameters of the target detection network are adjusted until the first loss and the second loss Satisfy the preset conditions and obtain the target detection model, including:
    基于所述第一损失和所述第二损失,确定综合损失;determining a comprehensive loss based on the first loss and the second loss;
    基于所述综合损失,对所述目标检测网络的网络参数进行调整,直至所述综合损失满足所述预设条件,获得所述目标检测模型。Based on the comprehensive loss, the network parameters of the target detection network are adjusted until the comprehensive loss meets the preset condition, and the target detection model is obtained.
  10. 一种地图的生成方法,包括:A map generation method, including:
    获取至少一个视角的第一图像数据和/或第一点云数据;Obtaining first image data and/or first point cloud data of at least one perspective;
    基于所述第一图像数据和/或所述第一点云数据,采用预先训练获得的目标检测模型,获得目标实例有序点集,所述目标检测模型通过如权利要求1-9任一所述的目标检测模型的训练方法获得,所述目标实例有序点集包括第一数量的实例分别对应的有序点集,所述有序点集包括目标数量的第一坐标系下的坐标点;Based on the first image data and/or the first point cloud data, a target detection model obtained by pre-training is used to obtain an ordered point set of target instances, and the target detection model is passed according to any one of claims 1-9. Obtained by the training method of the target detection model described above, the ordered point set of the target instance includes an ordered point set corresponding to a first number of instances, and the ordered point set includes a target number of coordinate points in the first coordinate system ;
    基于所述目标实例有序点集,生成地图。Based on the ordered point set of the target instance, a map is generated.
  11. 一种目标检测模型的训练装置,包括:A training device for a target detection model, including:
    第一获取模块,用于获取训练输入数据及对应的第一标签数据,所述训练输入数据包括训练图像数据和/或训练点云数据,所述第一标签数据包括所述训练输入数据中第一数量的实例分别对应的有序点集,所述有序点集包括目标数量的第一坐标系下的坐标点; The first acquisition module is used to acquire training input data and corresponding first label data. The training input data includes training image data and/or training point cloud data. The first label data includes the training input data. An ordered point set corresponding to a number of instances, the ordered point set including a target number of coordinate points in the first coordinate system;
    第一处理模块,用于基于所述训练输入数据、所述第一标签数据、点对点损失函数及方向损失函数,对预先建立的目标检测网络进行训练,获得目标检测模型,所述点对点损失函数用于确定所述目标检测网络输出的训练实例点集相对于所述第一标签数据中实例的有序点集的点对点损失,所述方向损失函数用于确定所述训练实例点集中点与点之间的方向相对于所述第一标签数据中实例的有序点集的点与点之间的方向的损失。The first processing module is used to train a pre-established target detection network based on the training input data, the first label data, a point-to-point loss function and a direction loss function to obtain a target detection model. The point-to-point loss function is In order to determine the point-to-point loss of the training instance point set output by the target detection network relative to the ordered point set of the instance in the first label data, the direction loss function is used to determine the point between the training instance point concentration point and the point The loss of the direction between points relative to the direction between points of the ordered point set of instances in the first label data.
  12. 一种地图的生成装置,包括:A map generating device including:
    第二获取模块,用于获取至少一个视角的第一图像数据和/或第一点云数据;a second acquisition module, configured to acquire first image data and/or first point cloud data of at least one perspective;
    第二处理模块,用于基于所述第一图像数据和/或所述第一点云数据,采用预先训练获得的目标检测模型,获得目标实例有序点集,所述目标检测模型通过如权利要求1-9任一所述的目标检测模型的训练方法获得,所述目标实例有序点集包括第一数量的实例分别对应的有序点集,所述有序点集包括目标数量的第一坐标系下的坐标点;The second processing module is used to obtain an ordered point set of target instances based on the first image data and/or the first point cloud data using a target detection model obtained through pre-training. The target detection model is configured as follows: The training method of the target detection model according to any one of claims 1 to 9 is obtained, the ordered point set of the target instance includes an ordered point set corresponding to a first number of instances, and the ordered point set includes a first number of target instances. A coordinate point in a coordinate system;
    第三处理模块,用于基于所述目标实例有序点集,生成地图。The third processing module is used to generate a map based on the ordered point set of the target instance.
  13. 一种计算机可读存储介质,所述存储介质存储有计算机程序,所述计算机程序用于执行上述权利要求1-9任一所述的目标检测模型的训练方法;或者,所述计算机程序用于执行上述权利要求10所述的地图的生成方法。A computer-readable storage medium, the storage medium stores a computer program, the computer program is used to execute the training method of the target detection model according to any one of the above claims 1-9; or, the computer program is used to Implement the map generation method described in claim 10 above.
  14. 一种电子设备,所述电子设备包括:An electronic device, the electronic device includes:
    处理器;processor;
    用于存储所述处理器可执行指令的存储器;memory for storing instructions executable by the processor;
    所述处理器,用于从所述存储器中读取所述可执行指令,并执行所述指令以实现上述权利要求1-9任一所述的目标检测模型的训练方法。The processor is configured to read the executable instructions from the memory and execute the instructions to implement the training method of the target detection model described in any one of claims 1-9.
  15. 一种电子设备,所述电子设备包括:An electronic device, the electronic device includes:
    处理器;processor;
    用于存储所述处理器可执行指令的存储器;memory for storing instructions executable by the processor;
    所述处理器,用于从所述存储器中读取所述可执行指令,并执行所述指令以实现上述权利要求10所述的地图的生成方法。 The processor is configured to read the executable instructions from the memory and execute the instructions to implement the map generation method described in claim 10.
PCT/CN2023/113197 2022-08-16 2023-08-15 Target detection model training method and apparatus, map generation method and apparatus, and device WO2024037552A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210977934.5 2022-08-16
CN202210977934.5A CN115331188A (en) 2022-08-16 2022-08-16 Training method of target detection model, map generation method, map generation device and map generation equipment

Publications (1)

Publication Number Publication Date
WO2024037552A1 true WO2024037552A1 (en) 2024-02-22

Family

ID=83923850

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/113197 WO2024037552A1 (en) 2022-08-16 2023-08-15 Target detection model training method and apparatus, map generation method and apparatus, and device

Country Status (2)

Country Link
CN (1) CN115331188A (en)
WO (1) WO2024037552A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331188A (en) * 2022-08-16 2022-11-11 北京地平线信息技术有限公司 Training method of target detection model, map generation method, map generation device and map generation equipment
CN117555979B (en) * 2024-01-11 2024-04-19 人民中科(北京)智能技术有限公司 Efficient bottom-up map position missing identification method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796230A (en) * 2018-08-02 2020-02-14 株式会社理光 Method, equipment and storage medium for training and using convolutional neural network
CN111460984A (en) * 2020-03-30 2020-07-28 华南理工大学 Global lane line detection method based on key point and gradient balance loss
US10803328B1 (en) * 2017-11-15 2020-10-13 Uatc, Llc Semantic and instance segmentation
CN114241313A (en) * 2021-12-21 2022-03-25 贝壳找房网(北京)信息技术有限公司 Method, apparatus, medium, and program product for extracting road boundary
CN114626437A (en) * 2022-02-17 2022-06-14 北京三快在线科技有限公司 Model training method and device, storage medium and electronic equipment
CN115331188A (en) * 2022-08-16 2022-11-11 北京地平线信息技术有限公司 Training method of target detection model, map generation method, map generation device and map generation equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10803328B1 (en) * 2017-11-15 2020-10-13 Uatc, Llc Semantic and instance segmentation
CN110796230A (en) * 2018-08-02 2020-02-14 株式会社理光 Method, equipment and storage medium for training and using convolutional neural network
CN111460984A (en) * 2020-03-30 2020-07-28 华南理工大学 Global lane line detection method based on key point and gradient balance loss
CN114241313A (en) * 2021-12-21 2022-03-25 贝壳找房网(北京)信息技术有限公司 Method, apparatus, medium, and program product for extracting road boundary
CN114626437A (en) * 2022-02-17 2022-06-14 北京三快在线科技有限公司 Model training method and device, storage medium and electronic equipment
CN115331188A (en) * 2022-08-16 2022-11-11 北京地平线信息技术有限公司 Training method of target detection model, map generation method, map generation device and map generation equipment

Also Published As

Publication number Publication date
CN115331188A (en) 2022-11-11

Similar Documents

Publication Publication Date Title
WO2024037552A1 (en) Target detection model training method and apparatus, map generation method and apparatus, and device
US11074481B2 (en) Environment navigation using reinforcement learning
EP3568810B1 (en) Action selection for reinforcement learning using neural networks
WO2020102733A1 (en) Learning to generate synthetic datasets for training neural networks
US10887607B2 (en) Making object-level predictions of the future state of a physical system
CN110062934A (en) The structure and movement in image are determined using neural network
KR20180065498A (en) Method for deep learning and method for generating next prediction image using the same
CN112119409A (en) Neural network with relational memory
US20170213150A1 (en) Reinforcement learning using a partitioned input state space
Han et al. Streaming object detection for 3-d point clouds
CN115630651B (en) Text generation method and training method and device of text generation model
CN113902007A (en) Model training method and device, image recognition method and device, equipment and medium
Choi et al. Hierarchical latent structure for multi-modal vehicle trajectory forecasting
WO2020225247A1 (en) Unsupervised learning of object keypoint locations in images through temporal transport or spatio-temporal transport
CN116188893A (en) Image detection model training and target detection method and device based on BEV
US20230260271A1 (en) Aligning entities using neural networks
CN114528387A (en) Deep learning conversation strategy model construction method and system based on conversation flow bootstrap
EP4200746A1 (en) Neural networks implementing attention over object embeddings for object-centric visual reasoning
CN114067371B (en) Cross-modal pedestrian trajectory generation type prediction framework, method and device
CN116012677A (en) Training data enhancement method, device, medium and equipment for track prediction
CN113119996A (en) Trajectory prediction method and apparatus, electronic device and storage medium
CN115240171B (en) Road structure sensing method and device
JP7462206B2 (en) Learning device, learning method, and learning program
CN114708471B (en) Cross-modal image generation method and device, electronic equipment and storage medium
Ocsa Sánchez et al. Attention U-Net Oriented Towards 3D Depth Estimation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23854452

Country of ref document: EP

Kind code of ref document: A1