CN115331188A - Training method of target detection model, map generation method, map generation device and map generation equipment


Info

Publication number
CN115331188A
Authority
CN
China
Prior art keywords
point
training
loss
target detection
target
Prior art date
Legal status
Pending
Application number
CN202210977934.5A
Other languages
Chinese (zh)
Inventor
廖本成
陈少宇
程天恒
张骞
Current Assignee
Beijing Horizon Information Technology Co Ltd
Original Assignee
Beijing Horizon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Horizon Information Technology Co Ltd filed Critical Beijing Horizon Information Technology Co Ltd
Priority to CN202210977934.5A priority Critical patent/CN115331188A/en
Publication of CN115331188A publication Critical patent/CN115331188A/en
Priority to PCT/CN2023/113197 priority patent/WO2024037552A1/en
Pending legal-status Critical Current

Classifications

    • G06V20/588 — Recognition of the road, e.g. of lane markings; recognition of the vehicle driving pattern in relation to the road (scene context exterior to a vehicle, using sensors mounted on the vehicle)
    • G06N3/08 — Learning methods for computing arrangements based on biological models (neural networks)
    • G06V10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V10/764 — Image or video recognition or understanding using pattern recognition or machine learning: classification, e.g. of video objects
    • G06V10/82 — Image or video recognition or understanding using pattern recognition or machine learning: neural networks


Abstract

The embodiments of the disclosure disclose a training method of a target detection model, a map generation method, a map generation device and a target detection system. The training method includes the following steps: acquiring training input data and corresponding first label data, where the training input data include training image data and/or training point cloud data, the first label data include ordered point sets respectively corresponding to a first number of instances in the training input data, and each ordered point set includes a target number of coordinate points in a first coordinate system; and training a pre-established target detection network based on the training input data, the first label data, a point-to-point loss function and a direction loss function to obtain a target detection model. The target detection model obtained in the embodiments of the disclosure can accurately and effectively predict the ordered point set corresponding to each instance, realizing coordinate-point-level prediction of instances; compared with existing instance-box-level prediction, this greatly improves the accuracy of prediction results.

Description

Training method of target detection model, map generation method, map generation device and map generation equipment
Technical Field
The present disclosure relates to automatic driving technology, and in particular to a training method of a target detection model, and a map generation method, device and equipment.
Background
In an autonomous driving scenario, it is often necessary to use a vehicle-mounted surround-view camera and/or radar to perceive road elements (such as lane lines, zebra crossings, road edges and drivable areas) for the generation of an online map. In the related art, a target detection model is usually used to predict the instance boxes corresponding to the various map elements and obtain their positions, and a map is then generated based on the positions of these instance boxes; however, the generated map has low precision.
Disclosure of Invention
The present disclosure aims to solve technical problems such as the low precision of maps generated based on the positions of map instance boxes. The embodiments of the disclosure provide a training method of a target detection model, and a map generation method, device and equipment.
According to an aspect of the embodiments of the present disclosure, there is provided a training method of a target detection model, including: acquiring training input data and corresponding first label data, where the training input data include training image data and/or training point cloud data, the first label data include ordered point sets respectively corresponding to a first number of instances in the training input data, and each ordered point set includes a target number of coordinate points in a first coordinate system; and training a pre-established target detection network based on the training input data, the first label data, a point-to-point loss function and a direction loss function to obtain a target detection model, where the point-to-point loss function is used to determine the point-to-point loss of the training instance point set output by the target detection network relative to the ordered point sets of the instances in the first label data, and the direction loss function is used to determine the loss of the directions between points in the training instance point set relative to the directions between the corresponding points of the ordered point sets of the instances in the first label data.
According to another aspect of the embodiments of the present disclosure, there is provided a map generation method, including: acquiring first image data and/or first point cloud data of at least one view angle; obtaining a target instance ordered point set based on the first image data and/or the first point cloud data by using a target detection model obtained through pre-training, where the target detection model is obtained by the training method of the target detection model according to any one of the above embodiments, the target instance ordered point set includes ordered point sets respectively corresponding to a first number of instances, and each ordered point set includes a target number of coordinate points in a first coordinate system; and generating a map based on the target instance ordered point set.
According to still another aspect of the embodiments of the present disclosure, there is provided a training apparatus for a target detection model, including: a first acquisition module for acquiring training input data and corresponding first label data, where the training input data include training image data and/or training point cloud data, the first label data include ordered point sets respectively corresponding to a first number of instances in the training input data, and each ordered point set includes a target number of coordinate points in a first coordinate system; and a first processing module for training a pre-established target detection network based on the training input data, the first label data, a point-to-point loss function and a direction loss function to obtain a target detection model, where the point-to-point loss function is used to determine the point-to-point loss of the training instance point set output by the target detection network relative to the ordered point sets of the instances in the first label data, and the direction loss function is used to determine the loss of the directions between points in the training instance point set relative to the directions between the corresponding points of the ordered point sets of the instances in the first label data.
According to still another aspect of the embodiments of the present disclosure, there is provided a map generating apparatus, including: a second acquisition module for acquiring first image data and/or first point cloud data of at least one view angle; and a second processing module for obtaining a target instance ordered point set based on the first image data and/or the first point cloud data by using a target detection model obtained through pre-training, where the target detection model is obtained by the training method of the target detection model according to any one of the above embodiments, the target instance ordered point set includes ordered point sets respectively corresponding to a first number of instances, and each ordered point set includes a target number of coordinate points in the first coordinate system.
According to a further aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein the storage medium stores a computer program, and the computer program is configured to execute the training method of the target detection model according to any one of the above embodiments of the present disclosure; alternatively, the computer program is configured to execute a map generation method according to any one of the above embodiments of the present disclosure.
According to still another aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method for training the target detection model according to any of the above embodiments of the present disclosure.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; the processor is configured to read the executable instructions from the memory and execute the instructions to implement the map generating method according to any of the above embodiments of the present disclosure.
Based on the training method of the target detection model and the map generation method, device and equipment provided by the embodiments of the disclosure, the ordered point set corresponding to each instance is used as the label, and the pre-established target detection network is trained with a combination of point-to-point loss and direction loss. The resulting target detection model can predict the ordered point sets of instances from image data and/or point cloud data, that is, coordinate-point-level prediction of map elements is realized.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 is an exemplary application scenario of a training method of an object detection model provided by the present disclosure;
FIG. 2 is a flowchart illustrating a method for training an object detection model according to an exemplary embodiment of the disclosure;
FIG. 3 is a flowchart of step 202 provided by an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a target detection network according to an exemplary embodiment of the present disclosure;
FIG. 5 is a flowchart of step 202 provided by another exemplary embodiment of the present disclosure;
FIG. 6 is a schematic flow chart of step 2021 provided by an exemplary embodiment of the present disclosure;
fig. 7 is a schematic diagram of a decoder network according to an exemplary embodiment of the present disclosure;
fig. 8 is a schematic diagram of a Deformable DETR according to an exemplary embodiment of the present disclosure;
FIG. 9 is a schematic diagram illustrating a determination principle of a training example point set provided by another exemplary embodiment of the present disclosure;
FIG. 10 is a flowchart of step 202 provided by yet another exemplary embodiment of the present disclosure;
FIG. 11 is a schematic illustration of a prediction network for prediction types provided by an exemplary embodiment of the present disclosure;
FIG. 12 is a flowchart of step 2024 provided by an exemplary embodiment of the present disclosure;
FIG. 13 is a flowchart illustrating a method for generating a map according to an exemplary embodiment of the present disclosure;
FIG. 14 is a schematic structural diagram of a training apparatus for an object detection model according to an exemplary embodiment of the present disclosure;
fig. 15 is a schematic structural diagram of a first processing module 502 according to an exemplary embodiment of the disclosure;
fig. 16 is a schematic structural diagram of the second processing unit 5022 according to an exemplary embodiment of the disclosure;
fig. 17 is a schematic structural diagram of the first processing unit 5021 provided in an exemplary embodiment of the present disclosure;
fig. 18 is a schematic structural diagram of a first processing module 502 according to another exemplary embodiment of the present disclosure;
FIG. 19 is a block diagram of a map generation apparatus provided in an exemplary embodiment of the present disclosure;
fig. 20 is a schematic structural diagram of an application embodiment of the electronic device of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and do not imply any particular technical meaning or necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the present disclosure may be generally understood as one or more, unless explicitly defined otherwise or indicated to the contrary hereinafter.
In addition, the term "and/or" in the present disclosure is only one kind of association relationship describing an associated object, and means that three kinds of relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the disclosure may be implemented in electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the disclosure
In implementing the present disclosure, the inventors discovered that, in an automatic driving scenario, it is generally necessary to use a vehicle-mounted surround-view camera and/or radar to perceive road elements (such as lane lines, zebra crossings, road edges and drivable areas) for the generation of an online map. In the related art, a target detection model is usually used to predict the instance boxes corresponding to the various map elements and obtain their positions, and a map is then generated based on the positions of these instance boxes; however, the generated map has low precision.
Exemplary overview
Fig. 1 is an exemplary application scenario of the training method of the object detection model provided in the present disclosure.
In an automatic driving scene, with the training method of the target detection model of the present disclosure, image data acquired in advance is used as training image data and point cloud data acquired in advance is used as training point cloud data to form the training input data, and the labels corresponding to the training image data and to the training point cloud data are used as the first label data for training the target detection model. The network output of the target detection model is a set of ordered point sets respectively corresponding to a first number of instances; that is, each instance corresponds to one ordered point set, and each ordered point set may include a target number of coordinate points in a first coordinate system, where the first coordinate system may be the coordinate system corresponding to a bird's-eye view. An instance is the representation of a road element in the image or point cloud; for example, each lane line, zebra crossing, arrow, road edge, drivable area and the like in the image or point cloud is an instance, and each element may correspond to one or more instances. For each instance, the model predicts the ordered point set corresponding to that instance, and the ordered point set can be used to fit the element; for example, an ordered point set of 3 coordinate points can fit a section of lane line. In the training process, the network parameters of the target detection model are adjusted based on the point-to-point loss function and the direction loss function, so that the trained target detection model can effectively detect the ordered point sets corresponding to the instances. Since the trained target detection model predicts the coordinate points of instances, the accuracy of the prediction results can be greatly improved compared with existing instance-box prediction. The trained target detection model is deployed to a map generation device on the vehicle-mounted computing platform of an autonomous vehicle for online map building, which can effectively improve the accuracy of the generated map.
In practical applications, the training method of the target detection model of the present disclosure is not limited to automatic driving scenarios and can be applied to any other practical scenario according to actual requirements.
Exemplary method
Fig. 2 is a flowchart illustrating a training method of an object detection model according to an exemplary embodiment of the disclosure. The embodiment can be applied to electronic devices, such as a server, a terminal, and other electronic devices, and as shown in fig. 2, includes the following steps:
Step 201, training input data and corresponding first label data are obtained; the training input data include training image data and/or training point cloud data, the first label data include ordered point sets respectively corresponding to a first number of instances in the training input data, and each ordered point set includes a target number of coordinate points in a first coordinate system.
The training image data and the training point cloud data can be acquired by a vehicle-mounted surround-view camera and radar. For example, a dedicated collection vehicle equipped with a surround-view camera and a radar drives on a road and collects road environment images and road point cloud data around the vehicle as the training image data and the training point cloud data, respectively. The first label data is obtained by labeling the instances in the training image data and/or the training point cloud data with ordered point sets. Because the first label data include coordinate points in the first coordinate system, which may be the coordinate system corresponding to a bird's-eye view, while the training image data are in the image coordinate system and the training point cloud data are in the radar coordinate system, the labeling results can be converted into the first coordinate system based on the camera parameters and radar parameters to obtain the corresponding first label data; the specific conversion principle is not repeated here. Instances are representations of road elements in the image or point cloud; for example, each lane line, zebra crossing, arrow, road edge, drivable area and the like in the image is an instance, that is, each element may correspond to one or more instances in the image or point cloud. Each instance corresponds to an ordered point set, and the ordered point set can fit the element corresponding to the instance; for a lane line instance, for example, an ordered point set of 3 coordinate points can fit a section of lane line. The first number and the target number can be set according to actual requirements.
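For illustration only, the first label data for a single training sample might be organized as in the following sketch; the field names, element types and coordinate values are assumptions made for this example and are not specified by the present disclosure.

```python
# Hypothetical layout of the first label data for one training sample.
# Each instance carries an ordered set of points in the first (bird's-eye-view)
# coordinate system.
first_label_data = {
    "instances": [
        {   # a lane line instance: open polyline, 3 ordered points (x, y)
            "type": "lane_line",
            "closed": False,
            "points": [(0.0, 5.0), (0.0, 15.0), (0.0, 25.0)],
        },
        {   # a zebra-crossing instance: closed polygon, 5 ordered points,
            # with the last point repeating the first to mark the closed loop
            "type": "crosswalk",
            "closed": True,
            "points": [(-2.0, 30.0), (2.0, 30.0), (2.0, 33.0),
                       (-2.0, 33.0), (-2.0, 30.0)],
        },
    ],
}
```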
Step 202, a pre-established target detection network is trained based on the training input data, the first label data, a point-to-point loss function and a direction loss function to obtain a target detection model; the point-to-point loss function is used to determine the point-to-point loss of the training instance point set output by the target detection network relative to the ordered point sets of the instances in the first label data, and the direction loss function is used to determine the loss of the directions between points in the training instance point set relative to the directions between the corresponding points of the ordered point sets of the instances in the first label data.
The target detection network may be set according to actual requirements, for example a detection network based on a Deformable DEtection TRansformer (abbreviated as "Deformable DETR") or another practicable detection network. The point-to-point loss function and the direction loss function may adopt any applicable loss functions. For example, the point-to-point loss function may adopt an L1 loss function, i.e. the L1-norm loss, also called least absolute deviations (LAD) or least absolute errors (LAE), which minimizes the sum of the absolute differences between the target values (in this disclosure, the labeled tag values) and the estimated values (in this disclosure, the output values of the target detection network); the direction loss function may adopt a cosine-similarity loss on the direction vectors of adjacent points. In the training process, the network parameters of the target detection model are adjusted based on the point-to-point loss function and the direction loss function, so that the trained target detection model can effectively detect the ordered point sets corresponding to the instances. The point-to-point loss determined by the point-to-point loss function supervises the point-level prediction results of the target detection network, so that the network can accurately predict the points of each instance, while the direction loss determined by the direction loss function supervises the order of the points, so that the network can predict more accurate ordered point sets. Since the trained target detection model predicts the ordered coordinate points of instances, the accuracy of the prediction results can be greatly improved compared with existing instance-box prediction.
In the training method of the target detection model provided by this embodiment, the point-to-point loss determined by the point-to-point loss function and the direction loss determined by the direction loss function supervise the points and the point order of the instance point sets output by the target detection network, so that the trained target detection model can accurately and effectively predict the ordered point set corresponding to each instance. Coordinate-point-level prediction of instances is thus realized; compared with existing instance-box-level prediction, this greatly improves the precision of the prediction results and can in turn effectively improve map precision when the model is used for map generation.
In an optional example, fig. 3 is a schematic flowchart of step 202 provided in an exemplary embodiment of the present disclosure, and in this example, step 202 may specifically include the following steps:
Step 2021, obtain a training instance point set based on the training input data and the target detection network.
The target detection network can be configured, according to actual requirements, to take only training image data as input, only training point cloud data, or both training image data and training point cloud data.
Illustratively, fig. 4 is a schematic structural diagram of a target detection network according to an exemplary embodiment of the present disclosure. In this example, the target detection network includes a feature extraction network, an encoder network, a decoder network and a prediction head network. The feature extraction network may include a first feature extraction network and/or a second feature extraction network according to actual requirements; the first feature extraction network extracts features from the training image data to obtain training image features, and the second feature extraction network extracts features from the training point cloud data to obtain training point cloud features. The encoder network encodes the training image features and/or the training point cloud features to obtain a training feature map in the first coordinate system; the decoder network decodes the training feature map to obtain a training decoding result; and the prediction head network predicts the training instance point set based on the training decoding result. For example, the prediction head network may be a linear neural network such as an MLP (Multi-Layer Perceptron, a feedforward neural network), which may be set according to actual requirements. The training input data is used as the input of the target detection network, and the output training instance point set can be obtained through inference of the target detection network.
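A minimal PyTorch-style skeleton of this four-part structure (feature extraction network, encoder network, decoder network, prediction head network) is sketched below for the image branch only. The use of plain Transformer layers, the module names and all dimensions are assumptions made for the sketch; the network described here actually uses Deformable-DETR-style modules.

```python
import torch
import torch.nn as nn

class TargetDetectionNetworkSketch(nn.Module):
    """Illustrative skeleton: backbone -> encoder -> decoder -> prediction head."""

    def __init__(self, feat_dim=256, num_instances=50, num_points=3):
        super().__init__()
        # First feature extraction network for image data (assumed small CNN).
        self.image_backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Stand-ins for the encoder/decoder that map features into the first
        # (bird's-eye-view) coordinate system and decode the instance queries.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True), 2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True), 2)
        # Point-level queries: one query per coordinate point of each instance.
        self.query_embed = nn.Embedding(num_instances * num_points, feat_dim)
        # Prediction head (MLP) regressing one 2-D BEV coordinate per query.
        self.point_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 2))
        self.num_instances, self.num_points = num_instances, num_points

    def forward(self, image):
        b = image.shape[0]
        feats = self.image_backbone(image)                  # (B, C, H, W)
        tokens = feats.flatten(2).transpose(1, 2)           # (B, H*W, C)
        memory = self.encoder(tokens)
        queries = self.query_embed.weight.unsqueeze(0).expand(b, -1, -1)
        decoded = self.decoder(queries, memory)
        points = self.point_head(decoded)                   # (B, N*P, 2)
        return points.view(b, self.num_instances, self.num_points, 2)
```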
Different types of ordered point sets may be used for different instances. For example, a lane line instance may use an open-loop ordered point set, in which the starting point and the end point are not the same point, so that a line segment is obtained after fitting; a drivable area instance may use a polygon point set, i.e. a closed-loop ordered point set, so that a closed polygon is obtained after fitting. This can be set according to actual requirements. The target number of coordinate points of the ordered point sets corresponding to different instances may be the same or different; for example, a lane line instance may use an ordered point set of 3 coordinate points while a zebra crossing uses an ordered point set of 5 coordinate points, which is not specifically limited.
Step 2022, determine a first loss based on the training instance point set, the first label data, and the point-to-point loss function.
After the training instance point set is obtained, the point set of each instance in the training instance point set can be compared point-to-point with the ordered point set of that instance in the first label data, and the absolute value of each point-to-point difference is taken as the loss of that point. The loss of each point of each instance is thus obtained, and the point-to-point loss of the whole network is determined as the first loss based on these per-point losses, for example by summing the losses of all points of all instances. This can be set according to actual requirements.
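A minimal sketch of such a point-to-point loss is shown below, assuming PyTorch tensors of shape (..., num_points, 2) whose instances and points are already matched; the sum reduction is one possible choice rather than a requirement of the method.

```python
import torch

def point_to_point_loss(pred_points, label_points):
    """L1 (point-to-point) loss between predicted and labelled ordered points.

    Both tensors have shape (..., num_points, 2) and are already matched
    instance-to-instance and point-to-point.
    """
    # Absolute coordinate difference per point, summed over all points
    # (a mean reduction would be an equally reasonable choice).
    return torch.abs(pred_points - label_points).sum()
```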
Step 2023, determine a second loss based on the training instance point set, the first label data, and the directional loss function.
The direction loss function is used to determine the loss of the directions between points in the training instance point set relative to the directions between the points of the ordered point sets of the instances in the first label data; this loss may be determined based on the cosine similarity of the direction vectors between two adjacent points.
Illustratively, for two adjacent points of an instance in the training instance point set, a first direction vector is determined based on their coordinate values, a second direction vector is determined based on the coordinate-value labels of the two corresponding points in the first label data, and the cosine similarity of the two direction vectors is computed. The overall direction loss of the network is determined as the second loss based on the cosine similarities of all pairs of adjacent points in each instance, which can be set according to actual requirements.
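A corresponding sketch of the direction loss is shown below, again assuming matched PyTorch tensors of shape (..., num_points, 2); converting cosine similarity into a loss via 1 − cos is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def direction_loss(pred_points, label_points):
    """Cosine-similarity loss on direction vectors between adjacent points.

    Both tensors have shape (..., num_points, 2), with the point order of the
    labels already aligned to the predictions (the target order).
    """
    # First direction vectors: between adjacent predicted points.
    pred_dirs = pred_points[..., 1:, :] - pred_points[..., :-1, :]
    # Second direction vectors: between the corresponding labelled points.
    label_dirs = label_points[..., 1:, :] - label_points[..., :-1, :]
    # Cosine similarity per pair of adjacent points; 1 - cos turns it into a loss.
    cos = F.cosine_similarity(pred_dirs, label_dirs, dim=-1)
    return (1.0 - cos).mean()
```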
Step 2022 and step 2023 may be performed in any order.
Step 2024, adjusting the network parameters of the target detection network based on the first loss and the second loss until the first loss and the second loss satisfy a preset condition, and obtaining a target detection model.
The first loss and the second loss can be weighted and summed with preset weights to form a composite loss, which is used to adjust the network parameters. The preset condition can be set according to actual requirements. The adjustment of the network parameters can be implemented with any practicable optimizer, such as the Adam optimizer, which can be set according to actual requirements. The Adam optimizer combines the advantages of the adaptive-learning-rate gradient descent algorithm (AdaGrad) and momentum gradient descent, can handle sparse gradients (as in natural language and computer vision problems), and can alleviate gradient oscillation. The specific principles are not described in detail here.
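For illustration, one optimisation step combining the two losses with preset weights and the Adam optimizer might look as follows; target_detection_network, point_to_point_loss and direction_loss refer to the sketches above, and the learning rate and loss weights are placeholder values rather than the values of the disclosure.

```python
import torch

# Hypothetical model and loss functions from the sketches above.
optimizer = torch.optim.Adam(target_detection_network.parameters(), lr=1e-4)

def training_step(batch):
    """One optimisation step on the composite loss (weights are placeholders)."""
    pred_points = target_detection_network(batch["image"])       # (B, N, P, 2)
    l1 = point_to_point_loss(pred_points, batch["label_points"])
    l2 = direction_loss(pred_points, batch["label_points"])
    loss = 1.0 * l1 + 0.1 * l2        # weighted sum of first and second losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```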
Fig. 5 is a flowchart of step 202 provided by another exemplary embodiment of the present disclosure.
In an alternative example, the determining the first loss based on the training example point set, the first label data, and the point-to-point loss function of step 2022 includes:
step 20221, for each instance, based on the ordered point set corresponding to the instance in the first label data, determining the correspondence between each point in the ordered point set and the point of the instance in the training instance point set in different orders of the ordered point set, and obtaining the point-to-point relationships corresponding to each order.
For example, for the ordered point set of a line segment such as a lane line, the ordered point set contains three ordered coordinate points A1, A2 and A3, with two endpoints A1 and A3; this point set has two orders, A1-A2-A3 and A3-A2-A1, and the adjacency between any two coordinate points is unchanged in the two orders. For the ordered point set of a polygon such as a drivable area or a zebra crossing, the ordered point set contains ordered coordinate points B1 to B5, where B5 may be set equal to B1 to denote the polygon, or the polygon may be indicated with another marker; when fitting, the points need to be connected end to end to form a closed loop. This can be set according to actual requirements, as long as polygon point sets can be distinguished from line-segment point sets, which also provides the basis for determining the point-to-point relationships in different orders. For the ordered point set B1 to B5 (taking B5 ≠ B1 as an example), each coordinate point is a vertex of the polygon and can serve as the starting point, so the point set has 5 possible starting points; combined with direction, it has 10 orders: B1-B2-B3-B4-B5 and its reverse B5-B4-B3-B2-B1, B2-B3-B4-B5-B1 and its reverse, B3-B4-B5-B1-B2 and its reverse, B4-B5-B1-B2-B3 and its reverse, and B5-B1-B2-B3-B4 and its reverse. The point-to-point relationships corresponding to the different orders are determined by adjusting the order of the coordinate points of the ordered point set in the corresponding instance of the training instance point set or of the first label data. For example, if the ordered point set of one instance in the training instance point set is C1 to C5 and the ordered point set label of this instance in the first label data is D1 to D5, then arranging D1 to D5 according to the 10 different orders of B1 to B5 above and pairing them with C1 to C5 yields the point-to-point relationships corresponding to the different orders, e.g. C1 to C5 corresponding in sequence to D5 to D1. Details are not repeated here.
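On this reading, the candidate orders of an ordered point set could be enumerated as in the following sketch: two orders for an open polyline, and 2 × num_points orders (all cyclic shifts plus their reversals) for a closed polygon; the function name and interface are assumptions for illustration.

```python
def candidate_orders(num_points, closed):
    """Enumerate candidate index orders of an ordered point set.

    Open polylines (e.g. lane lines): 2 orders, forward and reverse.
    Closed polygons (e.g. drivable areas): 2 * num_points orders, i.e. every
    cyclic shift of the vertices plus the reverse of each shift.
    """
    base = list(range(num_points))
    if not closed:
        return [base, base[::-1]]
    orders = []
    for start in range(num_points):
        shifted = base[start:] + base[:start]
        orders.append(shifted)
        orders.append(shifted[::-1])
    return orders

# candidate_orders(3, closed=False) -> [[0, 1, 2], [2, 1, 0]]
# candidate_orders(5, closed=True)  -> 10 orders, matching the B1..B5 example above.
```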
Step 20222, determine point-to-point losses corresponding to the respective sequences based on the point-to-point relationships corresponding to the respective sequences.
After the point-to-point relationships corresponding to the respective orders are determined, the point-to-point loss of each order is obtained based on the point-to-point loss function.
Step 20223, the order of least point-to-point loss is taken as the target order of the example.
After the point-to-point losses corresponding to the respective orders are determined, the minimum point-to-point loss and its corresponding order are identified, and that order is taken as the target order of the instance's ordered point set, to be used in determining the overall point-to-point loss of the network.
In step 20224, the point-to-point loss corresponding to the target sequence is used as the target point-to-point loss of the example.
Because the training instance point set includes the ordered point sets corresponding to a first number of instances, when the first number is greater than one, a corresponding target order and target point-to-point loss are determined for each instance.
At step 20225, a first penalty is determined based on the target point-to-point penalty for each instance.
Specifically, the target point-to-point losses of all instances are aggregated to determine the first loss of the whole network.
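Putting steps 20221 to 20225 together, a sketch of the first-loss computation might look as follows; it reuses candidate_orders() and point_to_point_loss() from the sketches above, and the per-instance tensors of shape (num_points, 2) are an assumption.

```python
def min_order_point_loss(pred_points, label_points, closed):
    """For one instance, pick the label point order with the smallest
    point-to-point loss (the target order) and return that loss and order."""
    best_loss, best_order = None, None
    for order in candidate_orders(label_points.shape[0], closed):
        reordered = label_points[order]                 # permute the label points
        loss = point_to_point_loss(pred_points, reordered)
        if best_loss is None or loss < best_loss:
            best_loss, best_order = loss, order
    return best_loss, best_order

def first_loss(pred_instances, label_instances, closed_flags):
    """Aggregate the per-instance target point-to-point losses into the first loss."""
    total = 0.0
    for pred, label, closed in zip(pred_instances, label_instances, closed_flags):
        loss, _ = min_order_point_loss(pred, label, closed)
        total = total + loss
    return total
```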
In this way, by considering the possible orders of each ordered point set during training and taking the minimum point-to-point loss when determining the overall point-to-point loss of the network, the target detection network can fit the optimal starting point and point order of each instance, further improving model performance and the accuracy of prediction results.
In an alternative example, the determining the second loss based on the training example point set, the first label data, and the directional loss function of step 2023 includes:
step 20231, determine a second loss based on the training instance point set, the first label data, and the target sequence and direction loss function respectively corresponding to each instance.
Since the second loss is a direction loss and involves the directionality of two adjacent points, when the point-to-point loss uses the point-to-point relationship of the target order, the direction loss also needs to be determined from the direction vectors of adjacent coordinate points under that same point-to-point relationship, so as to ensure consistency between the predicted directions and the label directions.
Fig. 6 is a flowchart of step 2021 provided by an exemplary embodiment of the present disclosure.
In an optional example, the training input data further includes initial query features and initial reference points, the initial query features include initial features of a first number of instances corresponding to the target number, respectively, and the initial reference points include reference coordinate points corresponding to the initial features, respectively; the target detection network is a detection network based on a deformable detection converter; correspondingly, the step 2021 of obtaining a training instance point set based on the training input data and the target detection network includes:
step 20211, extracting features of the training image data based on the first feature extraction network in the target detection network to obtain a first training image feature.
The Deformable DEtection TRansformer (Deformable DETR) is an improved DETR in which a multi-scale deformable attention module replaces the attention module of DETR for feature processing, alleviating problems such as DETR's high computational complexity and slow convergence. DETR is the first purely end-to-end target detector; it fully fuses a convolutional neural network (CNN) with a Transformer and realizes target detection by means of the strong modeling capability of the Transformer. The initial query features (queries) may be randomly initialized query features used in the decoder attention operations of the target detection network. The initial reference points may be randomly initialized reference coordinate points corresponding to each instance. The first feature extraction network may adopt any practicable feature extraction network, for example a convolutional neural network, which can be set according to actual requirements.
Step 20212, extracting features of the training point cloud data based on the second feature extraction network in the target detection network to obtain a first training point cloud feature.
The second feature extraction network may adopt any implementable feature extraction network, for example, a convolutional neural network is adopted as the feature extraction network, and may be specifically set according to actual requirements.
Steps 20211 and 20212 may be performed in any order.
Step 20213, encoding the first training image feature and/or the first training point cloud feature based on an encoder network in the target detection network to obtain a target training feature map in the first coordinate system.
The encoder network includes at least one encoder; through encoding, it converts the first training image features and the first training point cloud features into the first coordinate system to obtain the corresponding target training feature map.
Step 20214, obtain a training decoding result based on the target training feature map, the initial query feature, the initial reference point, and a decoder network in the target detection network, where the decoder network includes at least one decoder.
Wherein the training decoding result is a decoding result obtained by decoding by at least one decoder. The decoder network continuously updates the initial query features based on the initial reference points and the target training feature graph to obtain a training decoding result.
Step 20215, based on the training decoding result, determines a training instance point set.
The training instance point set is obtained by continuously updating the initial reference points with the decoding result of each decoder; that is, after each decoder produces its decoding result, the offsets corresponding to the reference points are predicted from that result. Taking the first decoder as an example, offsets corresponding to the initial reference points are predicted from its decoding result and added to the corresponding initial reference points, giving the updated reference points, which serve as the output reference points of the first decoder. After the second decoder decodes, offsets corresponding to the output reference points of the first decoder are predicted from its decoding result and added to those reference points, giving the output reference points of the second decoder, and so on; the output reference points of the last decoder are taken as the training instance point set.
Illustratively, fig. 7 is a schematic structural diagram of a decoder network provided in an exemplary embodiment of the present disclosure. In this example, the decoder network includes N decoders, the initial query feature includes 3 initial features for each of two instances (instance 1 and instance 2), the initial reference points include the reference coordinate points corresponding to the two instances, a training decoding result is obtained by decoding through the N decoders, and the training instance point set is obtained based on the training decoding result.
Exemplarily, fig. 8 is a schematic diagram of the principle of Deformable DETR according to an exemplary embodiment of the present disclosure. In this example, taking the first decoder as an example, Query Feature denotes the initial query feature, Reference Point denotes the initial reference point, and Input Feature Map denotes the target training feature map. The deformable attention module of the decoder attends only to a small range around the reference point in the target training feature map each time, rather than to the whole feature map. Head denotes an attention head and m denotes the m-th attention head; Attention Weights (Amqk) denote the attention weights, W'm·x denotes the encoding of the key vectors (Key) in the attention operation, Linear denotes a linear layer, Aggregate denotes the aggregation of the attention weights with the key points of the value vectors (Values), Softmax denotes the activation function, and Output denotes the decoding result. Unlike DETR, Deformable DETR collects only the dominant feature points near the reference point, so each query vector (Query) interacts with only a very small number of key vectors (Key). The initial query feature is passed through a linear layer to predict the sampling offsets Δp_mqk; in this example, 3 sampling offsets (represented by 3 arrows) are predicted per attention head (such as Head 1). The sampling offsets are the position offsets, relative to the initial reference point, of the key points sampled from the value vectors; the Input Feature Map is passed through a linear layer to produce the value vectors, with each attention head having its own value vector, and the sampling offsets are used to extract sparse Values (i.e. the key points in Values) near the initial reference point from the value vectors. The Attention Weights (Amqk) are obtained by passing the Query Feature through a linear layer and Softmax, and are then aggregated with the sparse Values. For example, the attention result of attention Head 1 is the weighted sum of the values of the 3 key points extracted from the value vector of Head 1, using the 3 weights 0.5, 0.3 and 0.2 of Head 1 in the Attention Weights (Amqk); in the same way, the attention result (Aggregated Sampled Values) of each attention head can be obtained. The decoding result (Output) is obtained by passing the Aggregated Sampled Values through a linear layer; alternatively, the Output may be added to the Query Feature through a residual connection and the sum used as the decoding result, which can be set according to actual requirements. The specific principles of Deformable DETR are not described in detail here.
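A much-simplified, single-scale sketch of this deformable attention idea is given below: sampling offsets and attention weights are predicted from the query, and values are sampled only around the reference point. The grid_sample-based bilinear sampling, the treatment of offsets in normalized coordinates, and all shapes are assumptions of the sketch, not details of Deformable DETR itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-scale, simplified sketch of a deformable attention module."""

    def __init__(self, dim=256, num_heads=8, num_points=4):
        super().__init__()
        self.num_heads, self.num_points = num_heads, num_points
        self.head_dim = dim // num_heads
        self.value_proj = nn.Linear(dim, dim)
        # Per head, predict num_points 2-D sampling offsets around the reference point.
        self.offset_proj = nn.Linear(dim, num_heads * num_points * 2)
        # Per head, predict num_points attention weights (normalised by softmax).
        self.weight_proj = nn.Linear(dim, num_heads * num_points)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, reference_points, feature_map):
        # query: (B, Q, C); reference_points: (B, Q, 2) in [0, 1]; feature_map: (B, C, H, W)
        b, q, c = query.shape
        h, w = feature_map.shape[-2:]
        value = self.value_proj(feature_map.flatten(2).transpose(1, 2))     # (B, H*W, C)
        value = value.view(b, h * w, self.num_heads, self.head_dim)
        value = value.permute(0, 2, 3, 1).reshape(b * self.num_heads, self.head_dim, h, w)

        # Sampling offsets and attention weights are predicted from the query alone.
        offsets = self.offset_proj(query).view(b, q, self.num_heads, self.num_points, 2)
        weights = self.weight_proj(query).view(b, q, self.num_heads, self.num_points)
        weights = weights.softmax(-1)

        # Sampling locations = reference point + offsets (normalized coordinates).
        locs = reference_points[:, :, None, None, :] + offsets              # (B, Q, heads, points, 2)
        grid = 2.0 * locs - 1.0                                             # grid_sample expects [-1, 1]
        grid = grid.permute(0, 2, 1, 3, 4).reshape(b * self.num_heads, q, self.num_points, 2)

        # Bilinearly sample only a few key points near each reference point.
        sampled = F.grid_sample(value, grid, align_corners=False)           # (B*heads, head_dim, Q, points)
        sampled = sampled.view(b, self.num_heads, self.head_dim, q, self.num_points)

        # Aggregate: weighted sum of the sampled key points per head, then mix heads.
        out = (sampled * weights.permute(0, 2, 1, 3)[:, :, None]).sum(-1)   # (B, heads, head_dim, Q)
        out = out.permute(0, 3, 1, 2).reshape(b, q, c)
        return self.out_proj(out)
```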
In this way, coordinate-point-level prediction of instances is realized through a Deformable-DETR-based target detection model, which can greatly improve prediction accuracy compared with existing segmentation-plus-post-processing or autoregressive prediction. Moreover, because the Deformable-DETR-based target detection model uses deformable attention, the attention operation only needs to collect the dominant feature points near the reference points, which greatly reduces the amount of computation and thus effectively improves prediction speed.
In an alternative example, the obtaining of the training decoding result based on the target training feature map, the initial query feature, the initial reference point, and the decoder network in the target detection network in step 20214 includes: for each decoder in the decoder network, obtaining a decoding result of the decoder based on a target training feature map and an input query feature and an input reference point corresponding to the decoder, wherein the input query feature and the input reference point corresponding to a first decoder are an initial query feature and an initial reference point respectively, the input query feature corresponding to any other decoder except the first decoder is a decoding result of a previous decoder of the other decoders, and the input reference point of the other decoders is an output reference point determined based on the decoding result of the previous decoder; and taking the decoding result of the last decoder as a training decoding result.
In an alternative example, after obtaining, for each decoder in the decoder network, a decoding result of the decoder based on the target training feature map and the input query feature and the input reference point corresponding to the decoder, the method further includes: determining a first offset corresponding to the decoder based on a decoding result of the decoder and an offset prediction network corresponding to the decoder; determining an output reference point corresponding to the decoder based on the first offset and the input reference point corresponding to the decoder; accordingly, determining a training instance point set based on the training decoding result includes: and taking the output reference point corresponding to the last decoder determined based on the training decoding result as a training instance point set.
Illustratively, fig. 9 is a schematic diagram of the determination principle of a training instance point set provided by another exemplary embodiment of the present disclosure. Each decoder corresponds to an offset prediction network, which predicts the first offset of the reference points from the decoding result of that decoder; the first offset is added to the input reference points of the decoder. The input reference points of decoder 1 are the initial reference points, and the input reference points of decoder i (i = 2, 3, …, N) are the output reference points of decoder i-1. By continuously refining the reference points during training, an accurate instance point set is obtained.
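A sketch of this iterative refinement is shown below; each decoder layer's decoding result feeds a small offset prediction network whose output is added to that layer's input reference points. The decoder-layer interface (query, reference points, feature map) and the two-layer MLP offset head are assumptions of the sketch.

```python
import torch.nn as nn

class IterativeRefinementDecoderSketch(nn.Module):
    """N decoder layers, each refining the reference points by a predicted offset."""

    def __init__(self, decoder_layers, dim=256):
        super().__init__()
        self.layers = nn.ModuleList(decoder_layers)
        # One offset prediction network per decoder layer (assumed small MLP).
        self.offset_heads = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))
            for _ in decoder_layers])

    def forward(self, query, reference_points, feature_map):
        for layer, offset_head in zip(self.layers, self.offset_heads):
            # The input query of layer i is the decoding result of layer i-1.
            query = layer(query, reference_points, feature_map)
            # The first offset predicted from the decoding result is added to the
            # layer's input reference points to give its output reference points.
            reference_points = reference_points + offset_head(query)
        # The output reference points of the last decoder form the instance point set.
        return query, reference_points
```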
Fig. 10 is a flowchart of step 202 provided by yet another exemplary embodiment of the present disclosure.
In an optional example, the first label data further includes type labels respectively corresponding to the instances in the training input data; after obtaining the training decoding result based on the target training feature map, the initial query feature, the initial reference point, and the decoder network in the target detection network in step 20214, the method further includes:
step 20216, determining a training type result based on the training decoding result, where the training type result includes prediction types corresponding to the respective instances.
Wherein the type label of the instance is the real type of each instance obtained by pre-labeling, such as lane line, road edge, zebra crossing, arrow, drivable area, and the like. The prediction type corresponding to the instance refers to an element type to which the instance belongs, and the element type may include types of a lane line, a road edge, a zebra crossing, an arrow, a travelable area, and the like. Such as predicting that an instance belongs to a lane line.
Illustratively, fig. 11 is a schematic diagram of a prediction network for the prediction types provided by an exemplary embodiment of the present disclosure. Decoder N decodes to obtain the training decoding result, and the training type result is obtained by prediction with a type prediction network. The type prediction network can be a prediction network based on a feedforward neural network, which can be set according to actual requirements. The type prediction network, the offset prediction networks and the reference point update networks may be collectively referred to as the prediction head network.
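For illustration, such a type prediction network could be as simple as the feedforward head sketched below; the element-type list, hidden width and feature dimension are placeholders rather than values of the disclosure.

```python
import torch.nn as nn

# Hypothetical element types; the disclosure mentions lane lines, road edges,
# zebra crossings, arrows and drivable areas as examples.
ELEMENT_TYPES = ["lane_line", "road_edge", "crosswalk", "arrow", "drivable_area"]

# Feedforward type prediction head: decoding result -> per-instance type logits.
type_head = nn.Sequential(
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, len(ELEMENT_TYPES)),
)

# Usage sketch: decoding_result of shape (B, num_instances, 256)
# type_logits = type_head(decoding_result)   # (B, num_instances, len(ELEMENT_TYPES))
```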
At step 20217, a type penalty is determined based on the training type result and the type label in the first label data.
The type loss may be determined based on a preset type loss function, and the type loss function may adopt any implementable loss function, such as a focal loss function, and may be specifically set according to an actual requirement.
Correspondingly, the step 2024 of adjusting the network parameter of the target detection network based on the first loss and the second loss until the first loss and the second loss satisfy the preset condition, to obtain the target detection model, includes:
step 20241, determining a composite loss based on the first loss, the second loss, the type loss, and the predetermined weight.
The preset weights may be set according to actual requirements. For example, if the weights of the first loss l1, the second loss l2 and the type loss l3 are set to λ1, λ2 and λ3 respectively, the total loss may be expressed as:
L = λ1·l1 + λ2·l2 + λ3·l3
For example, λ1, λ2 and λ3 may be set to 5, 0.1 and 2, respectively, without limitation.
Step 20242, based on the comprehensive loss, adjusting the network parameters of the target detection network until the comprehensive loss meets a preset condition, and obtaining a target detection model.
In this embodiment, the network parameters are adjusted by combining the type loss with the point-to-point loss and the direction loss, which further improves the performance of the target detection model and the accuracy of its prediction results.
In an alternative example, fig. 12 is a flowchart of step 2024 provided by an exemplary embodiment of the present disclosure. In this example, step 2024 of adjusting the network parameters of the target detection network based on the first loss and the second loss until the first loss and the second loss satisfy a preset condition, to obtain the target detection model, includes:
Step 20241a, determining a composite loss based on the first loss and the second loss.
The first loss and the second loss may be weighted and summed according to preset weights to obtain the composite loss; the specific principle is described above and is not repeated here.
Step 20242a, adjusting the network parameters of the target detection network based on the composite loss until the composite loss meets a preset condition, to obtain the target detection model.
For a specific network parameter adjustment principle, reference is made to the foregoing contents, and details are not described herein again.
According to the training method of the target detection model of the present disclosure, model training is performed by combining the point-to-point loss, the direction loss and the type loss with a hierarchical prediction of instances and their corresponding ordered point sets, so that the obtained target detection model can predict the ordered point sets of instances more accurately, greatly improving prediction accuracy. In addition, because the model is built on a deformable DETR network, the attention operation during inference only involves feature interaction with the neighboring points around the reference points, which greatly reduces the computational complexity, effectively reduces the amount of computation, and improves prediction efficiency. Furthermore, the query vectors of the target detection model of the present disclosure are at the point level, which is more flexible than the existing instance-box-level approaches.
The above embodiments or optional examples of the present disclosure may be implemented alone or in any combination, provided there is no conflict, and may be set according to actual requirements; the present disclosure is not limited in this respect.
Fig. 13 is a flowchart illustrating a method for generating a map according to an exemplary embodiment of the disclosure. The embodiment can be applied to electronic equipment, such as a vehicle-mounted computing platform. As shown in fig. 13, the method comprises the following steps:
step 301, acquiring first image data and/or first point cloud data of at least one view angle.
The first image data may be image data of a current frame acquired by at least one camera arranged on the vehicle in real time during the driving of the vehicle, and the first point cloud data may be point cloud data of the current frame acquired by a radar arranged on the vehicle in real time during the driving of the vehicle.
Step 302, obtaining a target instance ordered point set based on the first image data and/or the first point cloud data by using a target detection model obtained through pre-training.
The target detection model is obtained by the training method of the target detection model provided in any one of the above embodiments or optional examples. The target instance ordered point set includes ordered point sets respectively corresponding to a first number of instances, and each ordered point set includes a target number of coordinate points in a first coordinate system.
The specific input data required by the target detection model may be configured and trained according to actual requirements; the model may support image data, point cloud data, or both, as described in the foregoing embodiments. For the specific inference principle of the target detection model, reference is also made to the foregoing embodiments, and details are not repeated here.
Step 303, generating a map based on the target instance ordered point set.
The target instance ordered point set is a set of coordinate points in the first coordinate system (such as the coordinate system corresponding to the bird's-eye view), and corresponding map elements such as lane lines, zebra crossings and road edges can be obtained by fitting the ordered point set of each instance in the target instance ordered point set. The fitting result of each instance can be used as a local road map around the current position of the vehicle.
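As an illustrative sketch only, the fitting step could be as simple as assembling each instance's ordered points into a polyline map element; the MapElement structure and the use of plain polylines instead of curve fitting are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MapElement:
    element_type: str                     # e.g. "lane_line", "zebra_crossing", "road_edge"
    polyline: List[Tuple[float, float]]   # ordered coordinate points in the first coordinate system


def fit_map_elements(instance_point_sets, instance_types) -> List[MapElement]:
    """Build one local map element per predicted instance (sketch)."""
    return [MapElement(element_type, [tuple(p) for p in points])
            for points, element_type in zip(instance_point_sets, instance_types)]
```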
In practical applications, the target instance ordered point set can also be converted into a global coordinate system (such as a world coordinate system) through coordinate conversion, so that a global road map can be generated in a region-growing manner; this can be set according to actual requirements.
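As a hedged sketch of such a coordinate conversion for a single instance, assuming the ego pose is available as a 2D position and yaw angle (these assumptions are made for illustration and are not details of the present disclosure):

```python
import numpy as np


def bev_points_to_world(points_bev: np.ndarray, ego_xy: np.ndarray, ego_yaw: float) -> np.ndarray:
    """Transform one instance's ordered points from the BEV (ego) frame to a world frame.
    points_bev: (num_points, 2); ego_xy: (2,) ego position in world; ego_yaw: heading in radians."""
    cos_y, sin_y = np.cos(ego_yaw), np.sin(ego_yaw)
    rotation = np.array([[cos_y, -sin_y],
                         [sin_y,  cos_y]])
    # Rotate into the world orientation, then translate by the ego position.
    return points_bev @ rotation.T + ego_xy
```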
The map generation method of this embodiment achieves coordinate-point-level prediction of map instances based on the target detection model; compared with the existing box-level prediction of map instances, it can greatly improve map accuracy.
Any of the methods provided by the embodiments of the present disclosure (including the training method of the target detection model and the map generation method) may be performed by any suitable device with data processing capabilities, including but not limited to: terminal equipment, a server and the like. Alternatively, any of the methods provided by the embodiments of the present disclosure may be performed by a processor, for example, the processor executes any of the methods mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in a memory. This will not be described in detail below.
Exemplary devices
Fig. 14 is a schematic structural diagram of a training apparatus for an object detection model according to an exemplary embodiment of the present disclosure. The apparatus of this embodiment may be used to implement the embodiment of the training method for the target detection model corresponding to the present disclosure, and the apparatus shown in fig. 14 includes: a first obtaining module 501 and a first processing module 502.
A first obtaining module 501, configured to obtain training input data and corresponding first label data, where the training input data includes training image data and/or training point cloud data, the first label data includes ordered point sets respectively corresponding to a first number of instances in the training input data, and each ordered point set includes a target number of coordinate points in a first coordinate system; a first processing module 502, configured to train a pre-established target detection network based on the training input data and the first label data obtained by the first obtaining module 501, a point-to-point loss function, and a direction loss function, to obtain a target detection model, where the point-to-point loss function is used to determine the point-to-point loss of a training instance point set output by the target detection network relative to the ordered point sets of the instances in the first label data, and the direction loss function is used to determine the loss of the directions between points in the training instance point set relative to the directions between points in the ordered point sets of the instances in the first label data.
In an alternative example, fig. 15 is a schematic structural diagram of the first processing module 502 according to an exemplary embodiment of the disclosure. In this example, the first processing module 502 includes: a first processing unit 5021, a second processing unit 5022, a third processing unit 5023, and a fourth processing unit 5024.
The first processing unit 5021 is configured to obtain a training instance point set based on the training input data and the target detection network; the second processing unit 5022 is configured to determine a first loss based on the training instance point set obtained by the first processing unit 5021, the first label data, and the point-to-point loss function; the third processing unit 5023 is configured to determine a second loss based on the training instance point set obtained by the first processing unit 5021, the first label data, and the direction loss function; the fourth processing unit 5024 is configured to adjust the network parameters of the target detection network based on the first loss and the second loss until the first loss and the second loss meet a preset condition, so as to obtain the target detection model.
In an alternative example, fig. 16 is a schematic structural diagram of the second processing unit 5022 provided in an exemplary embodiment of the present disclosure. In this example, the second processing unit 5022 includes: a first determination subunit 50221, a second determination subunit 50222, a third determination subunit 50223, a fourth determination subunit 50224, and a fifth determination subunit 50225.
A first determining subunit 50221, configured to determine, for each instance, based on the ordered point set corresponding to the instance in the first label data, the correspondence between each point in the ordered point set and the points of the instance in the training instance point set under different orders of the ordered point set, respectively, to obtain point-to-point relationships respectively corresponding to the orders; a second determining subunit 50222, configured to determine point-to-point losses respectively corresponding to the orders based on the point-to-point relationships respectively corresponding to the orders; a third determining subunit 50223, configured to take the order in which the point-to-point loss is the smallest as the target order of the instance; a fourth determining subunit 50224, configured to take the point-to-point loss corresponding to the target order as the target point-to-point loss of the instance; a fifth determining subunit 50225, configured to determine the first loss based on the target point-to-point loss of each instance.
In an optional example, the third processing unit 5023 is specifically configured to: determine the second loss based on the training instance point set, the first label data, the target orders respectively corresponding to the instances, and the direction loss function.
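To make the order matching and the direction loss concrete, the following minimal sketch evaluates the point-to-point L1 loss under two candidate orders of the labeled ordered point set (the forward order and its reverse, an assumption made for this example), keeps the order with the smallest loss as the target order, and then compares directions between consecutive points under that order. The use of the L1 distance and a cosine-style direction comparison is an assumption for illustration only.

```python
import torch
import torch.nn.functional as F


def first_and_second_loss(pred_points: torch.Tensor, gt_points: torch.Tensor):
    """pred_points, gt_points: (num_points, 2) for one instance (illustrative sketch)."""
    # Candidate orders of the labeled ordered point set: forward and reversed.
    candidates = [gt_points, gt_points.flip(0)]
    p2p_losses = [F.l1_loss(pred_points, candidate) for candidate in candidates]
    best = min(range(len(candidates)), key=lambda i: p2p_losses[i].item())
    target_order_gt = candidates[best]   # target order of this instance
    first_loss = p2p_losses[best]        # target point-to-point loss

    # Direction loss: compare directions between consecutive points under the target order.
    pred_dirs = F.normalize(pred_points[1:] - pred_points[:-1], dim=-1)
    gt_dirs = F.normalize(target_order_gt[1:] - target_order_gt[:-1], dim=-1)
    second_loss = (1.0 - (pred_dirs * gt_dirs).sum(dim=-1)).mean()
    return first_loss, second_loss
```

Averaging the per-instance results over all instances would then yield the first loss and the second loss used above.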
Fig. 17 is a schematic structural diagram of the first processing unit 5021 according to an exemplary embodiment of the disclosure.
In an optional example, the training input data further includes initial query features and initial reference points, the initial query features include a target number of initial features respectively corresponding to the first number of instances, and the initial reference points include reference coordinate points respectively corresponding to the initial features; the target detection network is a detection network based on a deformable detection transformer (deformable DETR); the first processing unit 5021 includes: a first feature extraction subunit 50211, a second feature extraction subunit 50212, an encoding subunit 50213, a decoding subunit 50214, and a first processing subunit 50215.
A first feature extraction subunit 50211, configured to perform feature extraction on training image data based on a first feature extraction network in the target detection network to obtain a first training image feature; a second feature extraction subunit 50212, configured to perform feature extraction on the training point cloud data based on a second feature extraction network in the target detection network to obtain a first training point cloud feature; the encoding subunit 50213 is configured to encode the first training image feature and/or the first training point cloud feature based on an encoder network in the target detection network to obtain a target training feature map in a first coordinate system; a decoding subunit 50214, configured to obtain a training decoding result based on the target training feature map, the initial query feature, the initial reference point, and a decoder network in the target detection network, where the decoder network includes at least one decoder; a first processing subunit 50215 is configured to determine a training instance point set based on a training decoding result.
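As a hedged end-to-end sketch of how these feature extraction, encoding and decoding steps could be chained together, the following reuses the refine_reference_points sketch given earlier; all module interfaces here are assumptions made for illustration, not the disclosed implementation.

```python
def forward_training_pipeline(image_backbone, pointcloud_backbone, encoder,
                              decoders, offset_heads,
                              images=None, point_cloud=None,
                              init_query=None, init_ref=None):
    """Sketch: extract image and/or point cloud features, encode them into a feature map
    in the first (BEV) coordinate system, then run the decoder cascade."""
    image_feat = image_backbone(images) if images is not None else None
    cloud_feat = pointcloud_backbone(point_cloud) if point_cloud is not None else None
    bev_feature_map = encoder(image_feat, cloud_feat)  # target training feature map (assumed interface)
    return refine_reference_points(decoders, offset_heads, bev_feature_map,
                                   init_query, init_ref)
```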
In an alternative example, the decoding subunit 50214 is specifically configured to: for each decoder in the decoder network, obtaining a decoding result of the decoder based on a target training feature map and an input query feature and an input reference point corresponding to the decoder, wherein the input query feature and the input reference point corresponding to a first decoder are an initial query feature and an initial reference point respectively, the input query feature corresponding to any other decoder except the first decoder is a decoding result of a previous decoder of the other decoder, and the input reference point of the other decoder is an output reference point determined based on the decoding result of the previous decoder; and taking the decoding result of the last decoder as a training decoding result.
In an optional example, the first processing unit 5021 further comprises: an offset predictor subunit 50216 and a second processing subunit 50217.
An offset prediction subunit 50216, configured to determine a first offset corresponding to the decoder based on a decoding result of the decoder and an offset prediction network corresponding to the decoder; a second processing subunit 50217, configured to determine an output reference point corresponding to the decoder based on the first offset and the input reference point corresponding to the decoder; correspondingly, the first processing subunit 50215 is specifically configured to: and taking the output reference point corresponding to the last decoder determined based on the training decoding result as a training instance point set.
Fig. 18 is a schematic structural diagram of a first processing module 502 according to another exemplary embodiment of the present disclosure.
In an optional example, the first label data further includes type labels respectively corresponding to the instances in the training input data; the first processing module 502 further includes:
a fifth processing unit 5025, configured to determine a training type result based on the training decoding result, where the training type result includes prediction types corresponding to the respective instances; a sixth processing unit 5026, configured to determine a type loss based on the training type result and the type labels in the first label data; accordingly, the fourth processing unit 5024 includes: a third processing subunit 50241, configured to determine a composite loss based on the first loss, the second loss, the type loss, and preset weights; a fourth processing subunit 50242, configured to adjust the network parameters of the target detection network based on the composite loss until the composite loss meets a preset condition, so as to obtain the target detection model.
In an optional example, the fourth processing unit 5024 is specifically configured to: determine a composite loss based on the first loss and the second loss; and adjust the network parameters of the target detection network based on the composite loss until the composite loss meets the preset condition, to obtain the target detection model.
Fig. 19 is a schematic structural diagram of a map generation apparatus according to an exemplary embodiment of the present disclosure. The apparatus of this embodiment may be used to implement the embodiment of the method for generating a map corresponding to the present disclosure, and the apparatus shown in fig. 19 includes: a second acquisition module 601, a second processing module 602, and a third processing module 603.
A second obtaining module 601, configured to obtain first image data and/or first point cloud data of at least one view angle; a second processing module 602, configured to obtain a target instance ordered point set based on the first image data and/or the first point cloud data obtained by the second obtaining module 601 by using a target detection model obtained through pre-training, where the target detection model is obtained by the training method of the target detection model according to any one of the above embodiments or optional examples, the target instance ordered point set includes ordered point sets respectively corresponding to a first number of instances, and each ordered point set includes a target number of coordinate points in a first coordinate system; and a third processing module 603, configured to generate a map based on the target instance ordered point set obtained by the second processing module 602.
Exemplary electronic device
An embodiment of the present disclosure further provides an electronic device, including: a memory for storing a computer program;
a processor, configured to execute the computer program stored in the memory, and when the computer program is executed, implement the method for training the object detection model according to any of the above embodiments of the present disclosure.
An embodiment of the present disclosure further provides an electronic device, including: a memory for storing a computer program;
a processor, configured to execute the computer program stored in the memory, and when the computer program is executed, the map generation method according to any of the above embodiments of the present disclosure is implemented.
Fig. 20 is a schematic structural diagram of an application embodiment of the electronic device of the present disclosure. In this embodiment, the electronic device 10 includes one or more processors 11 and a memory 12.
The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 11 to implement the methods of the various embodiments of the disclosure described above and/or other desired functionality. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device 13 may be, for example, a microphone or a microphone array as described above for capturing an input signal of a sound source.
The input device 13 may also include, for example, a keyboard, a mouse, and the like.
The output device 14 may output various information including the determined distance information, direction information, and the like to the outside. The output devices 14 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 10 relevant to the present disclosure are shown in fig. 20, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform steps in methods according to various embodiments of the present disclosure as described in the "exemplary methods" section of this specification above.
The computer program product may write program code for carrying out operations for embodiments of the present disclosure in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in methods according to various embodiments of the present disclosure as described in the "exemplary methods" section above of this specification.
The computer readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for relevant points.
The block diagrams of devices, apparatuses, systems referred to in this disclosure are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, configurations, etc. must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, configured in any manner, as will be appreciated by one skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the apparatus, devices, and methods of the present disclosure, various components or steps may be broken down and/or re-combined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (15)

1. A method of training an object detection model, comprising:
acquiring training input data and corresponding first label data, wherein the training input data comprises training image data and/or training point cloud data, the first label data comprises ordered point sets respectively corresponding to a first number of instances in the training input data, and each ordered point set comprises a target number of coordinate points in a first coordinate system;
training a pre-established target detection network based on the training input data, the first label data, a point-to-point loss function and a direction loss function to obtain a target detection model, wherein the point-to-point loss function is used for determining the point-to-point loss of a training instance point set output by the target detection network relative to the ordered point sets of the instances in the first label data, and the direction loss function is used for determining the loss of the directions between points in the training instance point set relative to the directions between points in the ordered point sets of the instances in the first label data.
2. The method of claim 1, wherein the training a pre-established target detection network based on the training input data, the first label data, a point-to-point loss function, and a directional loss function to obtain a target detection model comprises:
obtaining the training instance point set based on the training input data and the target detection network;
determining a first loss based on the training instance point set, the first label data, and the point-to-point loss function;
determining a second loss based on the training instance point set, the first label data, and the direction loss function;
and adjusting network parameters of the target detection network based on the first loss and the second loss until the first loss and the second loss meet preset conditions, and obtaining the target detection model.
3. The method of claim 2, wherein the determining a first loss based on the training instance point set, the first label data, and the point-to-point loss function comprises:
for each instance, based on the ordered point set corresponding to the instance in the first label data, determining the corresponding relation between each point in the ordered point set and the point of the instance in the training instance point set according to different orders of the ordered point set, and obtaining the point-to-point relation corresponding to each order;
determining point-to-point losses respectively corresponding to the orders based on the point-to-point relations respectively corresponding to the orders;
taking the order with the minimum point-to-point loss as the target order of the instance;
taking the point-to-point loss corresponding to the target order as the target point-to-point loss of the instance;
determining the first loss based on the target point-to-point losses for each of the instances.
4. The method of claim 3, wherein the determining a second loss based on the training instance point set, the first label data, and the direction loss function comprises:
determining the second loss based on the training instance point set, the first label data, the target orders respectively corresponding to the instances, and the direction loss function.
5. The method of claim 2, wherein the training input data further comprises initial query features and initial reference points, the initial query features comprising a target number of initial features to which the first number of instances respectively correspond, the initial reference points comprising reference coordinate points to which each of the initial features respectively corresponds; the target detection network is a detection network based on a deformable detection transformer;
the obtaining the training instance point set based on the training input data and the target detection network includes:
performing feature extraction on the training image data based on a first feature extraction network in the target detection network to obtain a first training image feature;
extracting the features of the training point cloud data based on a second feature extraction network in the target detection network to obtain first training point cloud features;
coding the first training image feature and/or the first training point cloud feature based on a coder network in the target detection network to obtain a target training feature map in a first coordinate system;
obtaining a training decoding result based on the target training feature map, the initial query feature, the initial reference point, and a decoder network in the target detection network, wherein the decoder network comprises at least one decoder;
determining the set of training instance points based on the training decoding result.
6. The method of claim 5, wherein the obtaining training decoding results based on the target training feature map, the initial query features, the initial reference points, and a decoder network in the target detection network comprises:
for each decoder in the decoder network, obtaining a decoding result of the decoder based on the target training feature map and an input query feature and an input reference point corresponding to the decoder, wherein the input query feature and the input reference point corresponding to a first decoder are the initial query feature and the initial reference point respectively, the input query feature corresponding to any other decoder except the first decoder is a decoding result of a previous decoder of the other decoder, and the input reference point of the other decoder is an output reference point determined based on the decoding result of the previous decoder;
and taking the decoding result of the last decoder as the training decoding result.
7. The method of claim 6, wherein after obtaining, for each of the decoders in the decoder network, a decoding result of the decoder based on the target training feature map and the input query feature and the input reference point corresponding to the decoder, the method further comprises:
determining a first offset corresponding to the decoder based on a decoding result of the decoder and an offset prediction network corresponding to the decoder;
determining an output reference point corresponding to the decoder based on the first offset and the input reference point corresponding to the decoder;
the determining the set of training instance points based on the training decoding result comprises:
and taking the output reference point corresponding to the last decoder determined based on the training decoding result as the training instance point set.
8. The method of claim 5, wherein the first label data further comprises type labels respectively corresponding to the instances in the training input data;
after obtaining a training decoding result based on the target training feature map, the initial query feature, the initial reference point, and a decoder network in the target detection network, the method further includes:
determining a training type result based on the training decoding result, wherein the training type result comprises a prediction type corresponding to each instance;
determining a type loss based on the training type result and a type label in the first label data;
the adjusting the network parameters of the target detection network based on the first loss and the second loss until the first loss and the second loss meet a preset condition to obtain the target detection model includes:
determining a composite loss based on the first loss, the second loss, the type loss and a preset weight;
and adjusting network parameters of the target detection network based on the composite loss until the composite loss meets the preset condition, and obtaining the target detection model.
9. The method of claim 2, wherein the adjusting network parameters of the target detection network based on the first loss and the second loss until the first loss and the second loss satisfy a preset condition to obtain the target detection model comprises:
determining a composite loss based on the first loss and the second loss;
and adjusting network parameters of the target detection network based on the composite loss until the composite loss meets the preset condition, and obtaining the target detection model.
10. A map generation method comprises the following steps:
acquiring first image data and/or first point cloud data of at least one visual angle;
obtaining a target instance ordered point set by adopting a target detection model obtained by pre-training based on the first image data and/or the first point cloud data, wherein the target detection model is obtained by the training method of the target detection model according to any one of claims 1 to 9, the target instance ordered point set comprises ordered point sets respectively corresponding to a first number of instances, and each ordered point set comprises a target number of coordinate points in a first coordinate system;
and generating a map based on the target instance ordered point set.
11. A training apparatus for an object detection model, comprising:
the first acquisition module is used for acquiring training input data and corresponding first label data, wherein the training input data comprises training image data and/or training point cloud data, the first label data comprises ordered point sets respectively corresponding to a first number of instances in the training input data, and each ordered point set comprises a target number of coordinate points in a first coordinate system;
a first processing module, configured to train a pre-established target detection network based on the training input data, the first label data, a point-to-point loss function, and a direction loss function, to obtain a target detection model, where the point-to-point loss function is used to determine the point-to-point loss of a training instance point set output by the target detection network relative to the ordered point sets of the instances in the first label data, and the direction loss function is used to determine the loss of the directions between points in the training instance point set relative to the directions between points in the ordered point sets of the instances in the first label data.
12. A map generation apparatus comprising:
the second acquisition module is used for acquiring first image data and/or first point cloud data of at least one visual angle;
a second processing module, configured to obtain, based on the first image data and/or the first point cloud data, a target instance ordered point set by using a target detection model obtained through pre-training, where the target detection model is obtained by using the training method of the target detection model according to any one of claims 1 to 9, the target instance ordered point set includes ordered point sets respectively corresponding to a first number of instances, and each ordered point set includes a target number of coordinate points in a first coordinate system;
and the third processing module is used for generating a map based on the target instance ordered point set.
13. A computer-readable storage medium storing a computer program for executing the method for training an object detection model according to any one of claims 1 to 9; alternatively, the computer program is for executing the map generation method of claim 10.
14. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method for training the object detection model according to any one of claims 1 to 9.
15. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the map generating method of claim 10.
CN202210977934.5A 2022-08-16 2022-08-16 Training method of target detection model, map generation method, map generation device and map generation equipment Pending CN115331188A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210977934.5A CN115331188A (en) 2022-08-16 2022-08-16 Training method of target detection model, map generation method, map generation device and map generation equipment
PCT/CN2023/113197 WO2024037552A1 (en) 2022-08-16 2023-08-15 Target detection model training method and apparatus, map generation method and apparatus, and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210977934.5A CN115331188A (en) 2022-08-16 2022-08-16 Training method of target detection model, map generation method, map generation device and map generation equipment

Publications (1)

Publication Number Publication Date
CN115331188A true CN115331188A (en) 2022-11-11

Family

ID=83923850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210977934.5A Pending CN115331188A (en) 2022-08-16 2022-08-16 Training method of target detection model, map generation method, map generation device and map generation equipment

Country Status (2)

Country Link
CN (1) CN115331188A (en)
WO (1) WO2024037552A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117555979A (en) * 2024-01-11 2024-02-13 人民中科(北京)智能技术有限公司 Efficient bottom-up map position missing identification method
WO2024037552A1 (en) * 2022-08-16 2024-02-22 北京地平线信息技术有限公司 Target detection model training method and apparatus, map generation method and apparatus, and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10803328B1 (en) * 2017-11-15 2020-10-13 Uatc, Llc Semantic and instance segmentation
CN110796230A (en) * 2018-08-02 2020-02-14 株式会社理光 Method, equipment and storage medium for training and using convolutional neural network
CN111460984B (en) * 2020-03-30 2023-05-23 华南理工大学 Global lane line detection method based on key points and gradient equalization loss
CN114241313A (en) * 2021-12-21 2022-03-25 贝壳找房网(北京)信息技术有限公司 Method, apparatus, medium, and program product for extracting road boundary
CN114626437A (en) * 2022-02-17 2022-06-14 北京三快在线科技有限公司 Model training method and device, storage medium and electronic equipment
CN115331188A (en) * 2022-08-16 2022-11-11 北京地平线信息技术有限公司 Training method of target detection model, map generation method, map generation device and map generation equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024037552A1 (en) * 2022-08-16 2024-02-22 北京地平线信息技术有限公司 Target detection model training method and apparatus, map generation method and apparatus, and device
CN117555979A (en) * 2024-01-11 2024-02-13 人民中科(北京)智能技术有限公司 Efficient bottom-up map position missing identification method
CN117555979B (en) * 2024-01-11 2024-04-19 人民中科(北京)智能技术有限公司 Efficient bottom-up map position missing identification method

Also Published As

Publication number Publication date
WO2024037552A1 (en) 2024-02-22


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination