CN115294539A

CN115294539A - Multitask detection method and device, storage medium and terminal

Info

Publication number: CN115294539A
Application number: CN202210582263.2A
Authority: CN
Inventors: 黄超; 姚为龙; 郑伟伟
Original assignee: Shanghai Xiantu Intelligent Technology Co Ltd
Current assignee: Shanghai Xiantu Intelligent Technology Co Ltd
Priority date: 2022-05-26
Filing date: 2022-05-26
Publication date: 2022-11-04

Abstract

A multitask detection method and device, a storage medium and a terminal are provided, and the method comprises the following steps: acquiring input data, and inputting the input data into a multi-task detection model obtained by training, wherein the multi-task detection model comprises: the system comprises a single feature extraction network, a plurality of adjustment networks and a plurality of prediction networks, wherein the prediction networks correspond to tasks one to one, and the adjustment networks correspond to the prediction networks one to one; extracting the characteristics of input data by adopting a characteristic extraction network to obtain a universal characteristic vector; performing feature extraction and/or dimension transformation on the general feature vector by adopting a kth adjusting network to obtain a feature vector corresponding to a kth task, and marking the feature vector as the kth feature vector, wherein k is more than or equal to 1 and is less than or equal to M, k and M are positive integers, and M is the number of the tasks; and calculating the kth characteristic vector by adopting the kth prediction network to obtain a detection result of the kth task. Through the scheme, the calculated amount of the multi-task detection process can be reduced, and the detection efficiency is improved.

Description

Multitask detection method and device, storage medium and terminal

Technical Field

The invention relates to the technical field of deep learning, in particular to a multitask detection method and device, a storage medium and a terminal.

Background

In recent years, the automatic driving technology has been rapidly developed. At present, an automatic driving system mainly comprises an environment sensing system, a positioning navigation system, a path planning system, a speed control system, a motion control system and the like. The environment perception system is an important component of an automatic driving system, and as an upstream link of the whole automatic driving, the detection performance of the environment perception system directly influences subsequent links of automatic driving such as planning and decision making.

The radar is used as a perception device and has good stability and adaptability, and the radar is generally deployed on an automatic driving vehicle to perform 3D target detection in the prior art so as to realize environment perception. The environment perception of automatic driving relates to a plurality of detection tasks such as obstacle detection, lane line detection and drivable area division, and in the existing scheme, the detection tasks are usually processed independently, and considering the limited computing resources, the processing of the tasks independently greatly increases the consumption of vehicle-mounted computing resources, and the detection efficiency is low.

Therefore, a multitask detection method is needed to reduce the amount of computation in the multitask detection process and improve the detection efficiency.

Disclosure of Invention

The invention solves the technical problem of how to reduce the calculated amount in the multi-task detection process and improve the detection efficiency.

In order to solve the foregoing technical problem, an embodiment of the present invention provides a multitask detection method, where the method includes: acquiring input data, and inputting the input data into a multi-task detection model obtained by training, wherein the multi-task detection model comprises: the task scheduling system comprises a single feature extraction network, a plurality of adjustment networks and a plurality of prediction networks, wherein the prediction networks correspond to the tasks one to one, and the adjustment networks correspond to the prediction networks one to one; extracting the features of the input data by adopting the feature extraction network to obtain a universal feature vector; performing feature extraction and/or dimension transformation on the general feature vector by adopting a kth adjusting network to obtain a feature vector corresponding to a kth task, and recording the feature vector as the kth feature vector, wherein k is more than or equal to 1 and less than or equal to M, k and M are positive integers, and M is the number of the tasks; and calculating the kth characteristic vector by adopting a kth prediction network to obtain a detection result of the kth task.

Optionally, the input data is point cloud data acquired by a radar, the feature extraction network includes convolution layers, and the convolution layers perform sparse convolution calculation.

Optionally, the multi-task detection model is obtained by training a preset model in advance by using training data, where the preset model includes: a single initial feature extraction network, a plurality of initial adjustment networks, and a plurality of initial prediction networks, the training data comprising: before the sample data corresponding to the kth task is obtained, the method further includes: the method comprises the following steps: carrying out supervised training on the preset model by adopting the training data, and obtaining an intermediate detection model when a first preset condition is met, wherein the intermediate detection model comprises the feature extraction network, a plurality of intermediate adjustment networks and a plurality of intermediate prediction networks; step two: carrying out supervised training on a kth intermediate adjusting network and a kth intermediate predicting network by adopting sample data corresponding to the kth task, and obtaining the multi-task detecting network when a second preset condition is met; wherein the first preset condition comprises: the total loss is less than or equal to a first preset loss, and the second preset condition comprises: and the loss of each task is less than or equal to a second preset loss, and the total loss is calculated according to the losses of the M tasks.

Optionally, the total loss is calculated using the following formula:

l is the total loss, σ _k Adjusting the weight of the network for the kth initial _k As a learnable parameter, L _k Is the loss of the kth task.

Optionally, the learning rate used in the step one is a first learning rate, the learning rate used in the step two is a second learning rate, and the first learning rate is greater than the second learning rate.

Optionally, the sample data set further includes: no labeled sample data, before the first step, the method further comprises: and performing unsupervised training on the preset model by adopting the unlabeled sample data, and obtaining the preset model for performing the supervised training when a third preset condition is met.

Optionally, the unlabelled sample data includes multiple frames of first point cloud data, and the constraint condition of the unsupervised training includes: minimizing the feature distance between the matching points and/or maximizing the feature distance between the non-matching points, wherein before performing unsupervised training on the preset model by using the unlabeled sample data, the method further comprises: performing data enhancement processing on each frame of first point cloud data to obtain second point cloud data corresponding to the frame of first point cloud data, wherein points in the first point cloud data correspond to points in the second point cloud data one to one; determining a matching point and a non-matching point in each frame of first point cloud data and second point cloud data corresponding to the first point cloud data; the matching points refer to points with corresponding relations, and the non-matching points refer to points without corresponding relations.

In order to solve the foregoing technical problem, an embodiment of the present invention further provides a method and an apparatus for multitask detection, where the apparatus includes: an obtaining module, configured to obtain input data and input the input data into a trained multi-task detection model, where the multi-task detection model includes: the system comprises a single feature extraction network, a plurality of adjustment networks and a plurality of prediction networks, wherein the prediction networks correspond to the tasks one to one, and the adjustment networks correspond to the prediction networks one to one; the characteristic extraction module is used for extracting the characteristics of the input data by adopting a characteristic extraction network so as to obtain a universal characteristic vector; the adjusting module is used for performing feature extraction and/or dimension transformation on the general feature vector by adopting a kth adjusting network to obtain a feature vector corresponding to a kth task, and the feature vector is marked as the kth feature vector, wherein k is more than or equal to 1 and less than or equal to M, k and M are positive integers, and M is the number of the tasks; and the detection module is used for calculating the kth characteristic vector by adopting the kth prediction network so as to obtain a detection result of the kth task.

Embodiments of the present invention further provide a storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the above-mentioned multitask detection method.

The embodiment of the present invention further provides a terminal, which includes a memory and a processor, where the memory stores a computer program that can be run on the processor, and the processor executes the steps of the above-mentioned multitask detection method when running the computer program.

Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:

in the scheme of the embodiment of the invention, input data is obtained and input into the trained multi-task detection model. Wherein, the multitask detection model comprises: the system comprises a single feature extraction network, a plurality of adjustment networks and a plurality of prediction networks, wherein the prediction networks correspond to the tasks one to one, and the adjustment networks correspond to the prediction networks one to one. Further, the feature extraction network is adopted to extract features of the input data to obtain a universal feature vector; performing feature extraction and/or dimension transformation on the general feature vector by adopting a kth adjusting network to obtain a feature vector corresponding to the kth task, and recording the feature vector as the kth feature vector; and calculating the kth characteristic vector by adopting a kth prediction network to obtain a detection result of the kth task. Therefore, in the scheme of the embodiment of the invention, only single feature extraction can be carried out on input data of each frame, and then the detection results of a plurality of detection tasks are obtained based on the extracted universal feature vector, so that the plurality of detection tasks are completed. That is, multiple detection tasks share a feature extraction network. On one hand, compared with the existing multi-task detection model, the feature extraction network is shared by a plurality of detection tasks, and the model capacity and the calculation complexity are reduced. On the other hand, a corresponding adjusting network is adopted for each detection task to perform domain adaptive conversion on the universal feature vector, so that the detection effect of each prediction network is ensured. Therefore, the scheme of the embodiment of the invention can reduce the calculation complexity and improve the detection efficiency under the condition of ensuring the detection effect.

Further, in the scheme of the embodiment of the invention, the training data is adopted to carry out supervised training on the preset model, and when a first preset condition is met, an intermediate detection model is obtained; then, carrying out supervised training on the kth intermediate adjusting network and the kth intermediate predicting network by adopting sample data corresponding to the kth task, and obtaining a multi-task detection network when a second preset condition is met; wherein, the first preset condition comprises: the total loss is less than or equal to a first preset loss, and the second preset condition comprises: the loss of each task is less than or equal to a second preset loss, and the total loss is calculated according to the losses of the M tasks. Compared with the prior art that the task detection models are trained independently, the method for extracting the features of the multi-task detection network can be used for obtaining the feature extraction network by training through a joint training method, the association information among the detection tasks can be mined, the feature extraction network feature extraction capability can be improved, and the accuracy and the generalization capability of the multi-task detection are improved.

Further, in the scheme of the embodiment of the invention, before the supervised training is adopted, the unsupervised training is firstly carried out to obtain the preset model for the supervised training. By adopting the scheme, unsupervised training and supervised training are combined, so that the requirement on labeled sample data can be reduced, and the accuracy and generalization capability of the network can be improved.

Further, in the scheme of the embodiment of the present invention, the following formula is adopted to calculate the total loss:

wherein σ _k The weight of the net is adjusted for the kth initial, and σ _k Are parameters that can be learned. By adopting the scheme, the learnable self-adaptive weight is adopted for each adjusting network, so that the artificial over-parameter setting can be reduced, the automatic optimization of the weight of the network is facilitated, and the training effect of the feature extraction network is improved.

Drawings

FIG. 1 is a schematic diagram of a multi-tasking detection model of the prior art;

FIG. 2 is a flowchart illustrating a multitask detection method according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a multi-tasking detection model according to an embodiment of the invention;

FIG. 4 is a flowchart illustrating a method for training a multi-tasking detection model according to an embodiment of the invention;

fig. 5 is a schematic structural diagram of a multitask detecting device according to an embodiment of the present invention.

Detailed Description

As described in the background art, there is a need for a multitask detection method, which can reduce the amount of computation in the multitask detection process and improve the detection efficiency.

In the application scenario of the automatic driving perception system, due to the requirement of the automatic driving perception system for real-time performance, multiple detection tasks are generally required to be executed simultaneously on the same frame of point cloud data. Existing multi-tasking detection models typically include multiple independent detection models. As shown in fig. 1, the first detection model 11, the second detection model 12 and the third detection model 13 may be respectively used to perform different detection tasks. For example, the same frame of point cloud data may be input into the first detection model 11, the second detection model 12, and the third detection model 13, respectively, to obtain detection results of different detection tasks, thereby completing multi-task detection.

In addition, when multi-task detection is performed, a plurality of detection models need to be read into the memory to perform task detection at the same time. Therefore, the existing scheme needs to occupy a large amount of memory, the calculation amount is huge, the blockage is easily caused, and the detection efficiency is low.

In order to solve the above technical problem, an embodiment of the present invention provides a multi-task detection method, and in a scheme of the embodiment of the present invention, only a single feature extraction may be performed on input data of each frame, and then detection results of a plurality of detection tasks are obtained based on an extracted general feature vector, so that the plurality of detection tasks are completed. That is, a plurality of detection tasks share the feature extraction network. On one hand, compared with a multi-task detection model in the prior art, in the scheme of the embodiment of the invention, a plurality of detection tasks can share a feature extraction network, so that the model capacity and the calculation complexity are reduced. On the other hand, a corresponding adjusting network is adopted for each detection task to perform domain adaptive conversion on the universal feature vector, so that the detection effect of each prediction network is ensured. Therefore, the scheme of the embodiment of the invention can reduce the calculation complexity and improve the detection efficiency under the condition of ensuring the detection effect.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

Referring to fig. 2, fig. 2 is a flowchart illustrating a multitask detection method according to an embodiment of the present invention. The method illustrated in fig. 2 may be performed by a terminal, which may be any of various existing devices having data receiving and data processing functions.

In an application scenario, the terminal may be an in-vehicle terminal, for example, an Electronic Control Unit (ECU) of a vehicle, where the vehicle may be configured with a radar, and the embodiment of the present invention is not limited to the type of the vehicle, and for example, the terminal may be an Automatic Guided Vehicle (AGV), an automatic driven vehicle (automous Vehicles), and the like, but is not limited thereto.

In another application scenario, the terminal may also be a server, for example, the server is in communication connection with a vehicle, the vehicle may be configured with a radar, and the server may receive point cloud data acquired by the radar from the vehicle and perform the multi-task detection method provided by the embodiment of the present invention to obtain the detection result.

It should be noted that, in the solution of the embodiment of the present invention, the radar may be a laser radar or a millimeter wave radar.

It should be further noted that in the embodiment of the present invention, the multitask detection refers to processing multiple detection tasks at the same time. Specifically, the same input data (for example, the same frame of point cloud data) is processed to obtain the detection results of a plurality of detection tasks, instead of performing the next detection task after obtaining the detection result of one detection task, different detection tasks are not performed on different point cloud data, respectively.

It should be further noted that the method provided by the embodiment of the present invention can be applied to multiple technical fields, for example, the technical fields of automatic driving, augmented reality, virtual reality, intelligent robots, and the like, and the embodiment of the present invention is described only by taking the automatic driving field as an example, and does not limit the application scenario of the method provided by the embodiment of the present invention.

The multitask detection method shown in fig. 2 may include the steps of:

step S11: acquiring input data, and inputting the input data into a multi-task detection model obtained through training;

step S12: extracting the features of the input data by adopting the feature extraction network to obtain a universal feature vector;

step S13: performing feature extraction and/or dimension transformation on the general feature vector by adopting a kth adjusting network to obtain a feature vector corresponding to a kth task, and recording the feature vector as a kth feature vector;

step S14: and calculating the kth characteristic vector by adopting a kth prediction network to obtain a detection result of the kth task.

It is understood that, in a specific implementation, the method may be implemented by a software program running in a processor integrated inside a chip or a chip module; alternatively, the method can be implemented in hardware or a combination of hardware and software.

Referring to fig. 3, fig. 3 is a schematic structural diagram of a multitask detection model according to an embodiment of the present invention. As shown in fig. 3, the multitasking detection model may include: a single feature extraction network 31, multiple adjustment networks 32 and multiple prediction networks 33.

The following non-limiting description will be made on the multitask detection method provided by the embodiment of the invention with reference to fig. 2 and 3.

In the specific implementation of step S21, input data to be processed at the current time may be acquired, where the input data may be point cloud data acquired by a radar. In the driving process of the vehicle, the radar can sense the driving environment to obtain point cloud data. That is, the input data may be point cloud data.

In other embodiments, the input data may also be an image acquired by a camera, which is not limited in the embodiments of the present invention.

Under the condition that the input data is point cloud data, the point cloud data needs to be preprocessed before being input to the multitask detection model to obtain processed point cloud data, and the processed point cloud data is input to the multitask detection model.

Specifically, the point cloud data is N × C dimensional data, where N is the number of points in the point cloud data and C is the dimension of the point cloud data. Preprocessing the point cloud data may include: carrying out voxel blocking on the three-dimensional space to obtain a plurality of voxel grids, and then projecting points in the point cloud data into the voxel grids to obtain a voxel characteristic map; projecting the point cloud data along a Bird's-Eye View direction to obtain a Bird's-Eye View (BEV); and projecting the point cloud data along the depth direction to obtain a depth map (Range Image). Thus, the processed point cloud data may include: a voxel characteristic map, a bird's eye view map and a depth map.

Correspondingly, the projection relationship between the point cloud data and the processed point cloud data can be stored together, so that the subsequent processing can perform inverse transformation according to the projection relationship. The projection relationship may include a projection relationship between the point cloud data and the voxel characteristic map, a projection relationship between the point cloud data and the aerial view, and a projection relationship between the point cloud data and the depth map. More specifically, in the adjustment network corresponding to the semantic segmentation task, inverse transformation may be performed according to the projection relationship so as to calculate the prediction result of each point.

Further, the input data may be input into a multi-tasking detection model trained in advance. That is, the input data may be input into the multitasking detection model shown in fig. 3.

The multi-tasking detection model may include: a single feature extraction network 31, the feature extraction network 31 being for performing feature extraction on the input data.

The multi-tasking detection model may further include: a plurality of adjustment networks 32 and a plurality of prediction networks 33. The number of the adjusting networks 32 and the number of the predicting networks 33 are the same, and the adjusting networks 32 correspond to the predicting networks 33 one to one. More specifically, each adjustment network 32 has an input connected to the output of the feature extraction network 31 and an output connected to the input of its corresponding prediction network 33.

The number M of the adjusting networks 32 and the predicting networks 33 may be determined according to the actual application requirements. More specifically, the number of the adjusting networks 32 and the number of the prediction networks 33 are determined according to the number of the detection tasks, and the prediction networks 33 correspond to the detection tasks one to one, that is, the adjusting networks 32 also correspond to the detection tasks one to one.

The configuration of the adjustment network 32 and the configuration of the prediction network 33 are determined based on the corresponding detection tasks.

In a specific implementation of step S22, a feature extraction network in the multitask detection model may be used to perform feature extraction on the input data to obtain a generic feature vector.

In a specific implementation, the structure of the feature extraction network 31 may be determined by the type of input data. In a specific example, in the case where the input data is point cloud data, the feature extraction network includes a Convolution layer, and the Convolution layer performs Sparse Convolution (Sparse Convolution) calculation. Further, the universal feature vector output by the feature extraction network may be four-dimensional feature data, wherein a channel direction of the universal feature vector may be used to indicate attribute information of a point in the point cloud data.

In another specific example, in the case where the input data is an image, the Convolution layer of the feature extraction network 31 may perform a full Convolution (full Convolution) calculation.

More specifically, the feature extraction network shown in fig. 3 may be a structurally identical part of the plurality of feature extraction networks in fig. 1, compared to the plurality of feature extraction networks shown in fig. 1.

In a specific implementation of step S23, the general feature vector output by the feature extraction network may be transmitted to multiple adjustment networks, and each adjustment network is used to perform corresponding calculation, so as to obtain a feature vector required by the detection task corresponding to the adjustment network.

More specifically, feature extraction and/or dimension change can be performed on the general feature vector by using the kth adjusting network to obtain a feature vector corresponding to the kth task, and the feature vector is recorded as the kth feature vector. Wherein k is more than or equal to 1 and less than or equal to M, k and M are positive integers, and M is the number of detection tasks.

In one aspect, feature extraction may be performed on the generic feature vector using a kth adjusting network to obtain a kth feature vector. It should be noted that, compared with the feature extraction performed by the feature extraction network, the kth adjustment network performs further feature extraction on the generic feature vector to obtain a deeper feature vector.

In a specific implementation, the plurality of detection tasks may include: and (4) target detection task. Correspondingly, the adjusting network corresponding to the target detection task may include: feature Pyramid Networks (FPN) and Cross Stage Partial Networks (CSPNet). The generic feature vector may be input to an adjustment network corresponding to the target detection task, and the FPN and CSPNet may further perform feature extraction on the generic feature vector to obtain a feature vector output by the adjustment network, where the feature vector may be a two-dimensional feature map. Further, the prediction network corresponding to the target detection task may include a classification network and a regression network, the classification network may obtain the position of the target center point according to the feature vector output by the adjustment network, and the regression network may obtain the length, width, height, and orientation angle of the target according to the feature vector output by the adjustment network, so as to obtain the detection result (x, y, z, dx, dy, dz, heading) of each target. Wherein x, y, z are three-dimensional positions of the center point of the target, dx, dy and dz are the length, width and height of the three-dimensional rectangular frame circumscribed by the target, and heading is the rotation angle (i.e. the above orientation angle) of the target.

On the other hand, the dimension change of the general feature vector can be carried out by adopting the kth adjusting network so as to obtain the kth feature vector.

In a specific example, the plurality of detection tasks may also include: and (5) performing semantic segmentation task. Correspondingly, the adjusting network corresponding to the semantic segmentation task may include a feature projection network, and the feature projection network may be configured to project the general feature vector output by the feature extraction network to each point in the point cloud data. Performing the dimension transformation on the generic feature vector using the adjustment network may refer to projecting the generic feature vector to each point in the point cloud data using a feature projection network.

In specific implementation, the feature projection network may perform inverse transformation on the general feature vector according to a projection relation adopted when the point cloud data is preprocessed, so as to project the general feature vector to each point in the point cloud data, thereby obtaining a kth feature vector. Wherein the k-th feature vector comprises feature vectors of N points. Further, a prediction network corresponding to the semantic segmentation task can be adopted to calculate the prediction category of each point according to the feature vector of each point. Furthermore, the prediction network corresponding to the semantic segmentation task may be a 2-layer fully-connected network, and the prediction network may obtain the category of each point according to a feature vector output by the adjustment network. The category may be any of: water mist, dust, tree branches, and others.

In a specific implementation of step S24, the kth prediction network may be used to calculate the kth feature vector to obtain a detection result of the kth task. The specific structure of the kth prediction network may be determined according to the corresponding detection task, which is not limited in this embodiment.

In the solution provided by the embodiment of the present invention, only a single feature extraction may be performed on input data (for example, point cloud data) of each frame, and then detection results of multiple detection tasks are obtained based on the extracted general feature vector, thereby completing multiple detection tasks. On one hand, compared with the existing multi-task detection model shown in fig. 1, in the scheme of the embodiment of the invention, a plurality of detection tasks can share a feature extraction network, so that the model capacity and the computational complexity are reduced. On the other hand, in the scheme of the embodiment of the invention, the corresponding adjusting network is adopted for each detection task to perform domain adaptive conversion on the universal feature vector, so that the training effect and the detection effect of each prediction network are improved. Therefore, the scheme of the embodiment of the invention can reduce the calculation complexity and improve the detection efficiency under the condition of ensuring the detection effect.

In practical application, in the unmanned driving process, the vehicle-mounted laser radar can acquire point cloud data in real time, and the point cloud data is input into the multi-task detection model shown in fig. 3 to obtain the detection result of each task. Further, the detection result may be subjected to format conversion and then input to downstream modules such as a prediction module and a planning module.

The following non-limiting description of the training process of the multi-tasking detection model shown in FIG. 3 is provided.

Specifically, the prediction model may be trained by using training data, where the preset model is a model before training, and the prediction model may include: a single initial feature extraction network, a plurality of initial adjustment networks, and a plurality of initial prediction networks. The training data may include sample data corresponding to a plurality of tasks. The initial feature extraction network is a feature extraction network before training, the initial adjustment network corresponds to the adjustment network in a one-to-one mode, the initial adjustment network is an adjustment network before training, the initial prediction network corresponds to the prediction network in a one-to-one mode, and the initial prediction network is a prediction network before training.

Referring to fig. 4, fig. 4 is a flowchart illustrating a method for training a multi-task detection model according to an embodiment of the present invention. The training process of the multi-tasking detection model shown in FIG. 3 is described below in conjunction with FIG. 4 without limitation. The training method shown in fig. 4 may include the steps of:

step S41: performing unsupervised training on the preset model by using the unlabeled sample data, and obtaining the preset model for performing supervised training when a third preset condition is met;

step S42: carrying out supervised training on a preset model by adopting training data, and obtaining an intermediate detection model when a first preset condition is met;

step S43: and carrying out supervised training on the kth intermediate adjusting network and the kth intermediate predicting network by adopting sample data corresponding to the kth task, and obtaining the multi-task detecting network when a second preset condition is met.

In the training method shown in fig. 4, the training data for training may include: the sample data corresponding to a plurality of tasks, more specifically, the sample data corresponding to each task may include: and the sample data is not marked and marked. More specifically, the labeled sample data corresponding to each task may be obtained by labeling the unlabeled sample data corresponding to the task, or may be obtained by labeling other sample data except the unlabeled sample data corresponding to the task, which is not limited in this embodiment.

In a specific implementation of step S41, the matching point and the non-matching point may be determined first.

Specifically, the unlabeled sample data may include multiple frames of first point cloud data, and data enhancement processing may be performed on each frame of first point cloud data to obtain second point cloud data corresponding to the frame of point cloud. Wherein the data enhancement processing may include one or more of: random noise, random flipping, random noise, and random scaling (e.g., random scaling) are added.

Since the second point cloud data is generated from the first point cloud data, and the points in the first point cloud data and the points in the second point cloud data correspond one to one, the matching points and the non-matching points can be determined in each frame of the first point cloud data and the corresponding second point cloud data. The matching points may refer to points having a corresponding relationship, and the non-matching points refer to points having no corresponding relationship.

Further, the preset model may be trained unsupervised. The loss function of the unsupervised training may be a triplet loss function, and the constraint condition of the unsupervised training may include: feature distances between matching points are minimized and/or feature distances between non-matching points are maximized. The characteristic distance may be a cosine distance, a chebyshev distance, a manhattan distance, or the like, which is not limited in this embodiment.

Further, when a third preset condition is met, the unsupervised training may be stopped, and a preset model for the supervised training is obtained. The third preset condition may include that a training loss of the unsupervised training is less than or equal to a preset value, which is not limited by this embodiment.

In an implementation of step S42, the training data may be used to perform supervised training on the preset model. The training data may include labeled sample data corresponding to a plurality of tasks. It should be noted that the preset model in step S42 refers to the preset model obtained after step S41 is executed. In other words, supervised training may use a preset model obtained by unsupervised training as an initial model for training.

Further, in the training process, each batch of training data may include labeled sample data corresponding to a plurality of tasks, the labeled sample data of all the tasks are input to the initial feature extraction network, and then the sample general feature vector corresponding to each task output by the initial feature extraction network is respectively input to the corresponding initial adjustment network and the initial prediction network, so as to obtain a sample prediction result of each task.

Further, for each detection task, the loss of the task can be calculated according to the label of the sample data corresponding to the task and the sample prediction result.

Further, the total loss may be calculated according to the losses of the plurality of tasks, and the parameters of the preset network model may be updated according to the total loss.

In one non-limiting example, the total loss can be calculated using the following formula:

wherein L is the total loss, σ _k Adjusting the weight of the network for the kth initial _k As a learnable parameter, L _k And M is the total number of tasks, wherein M is the loss of the kth task.

Note that σ represents _k Are learnable parameters. Specifically, in the process of performing the first supervised training (that is, step S42), each time the parameters of each network are updated, the weight σ of each adjusted network is also updated together _k . Wherein σ _k The initial value of (b) may be preset. Note that σ represents _k For calculating total loss by optimizing σ _k The total loss is optimized, thereby optimizing the training effect of the feature extraction network.

Further, in the training process of step S42, the parameters may be updated by using a random gradient descent method (i.e., SGD optimizer).

Further, when a first preset condition is met, an intermediate detection model is obtained, wherein the intermediate detection model comprises the feature extraction network, a plurality of intermediate adjustment networks and a plurality of intermediate prediction networks. Wherein, the first preset condition comprises: the total loss is less than or equal to a first predetermined loss.

In step S42, learnable adaptive weights are applied to each adjustment network, so as to reduce artificial over-parameter settings, and facilitate automatic optimization of parameter gradient weights by the network, thereby improving the training effect of the feature extraction network.

In the specific implementation of step S43, sample data corresponding to the kth task is used to perform supervised training on the kth intermediate adjusting network and the kth intermediate predicting network, and when a second preset condition is met, the multi-task detecting network is obtained.

It should be noted that, when step S42 is completed (that is, when the first preset condition is satisfied), the feature extraction network may be obtained, that is, the training of the feature extraction network is completed. In step S43, only the adjustment network and the prediction network are trained, and the feature extraction network is not trained. That is, the parameters of the feature extraction network are already fixed when the first preset condition is satisfied, and only the parameters of the adjustment network and the prediction network are updated in step S43.

Further, since the adjustment network and the prediction network corresponding to each task are trained by using the sample data corresponding to the task in step S43, the loss of each task is calculated in step S43 without calculating the total loss, and the parameters of the intermediate adjustment network and the intermediate prediction network corresponding to the task are updated according to the loss of each task.

In a specific implementation, the learning rate used in step S42 is a first learning rate, the learning rate used in step S43 is a second learning rate, and the first learning rate is greater than the second learning rate.

Further, when a second preset condition is met, the multi-task detection model can be obtained. Wherein the second preset condition comprises: the penalty for each task is less than or equal to a second predetermined penalty.

In a specific example, after each training period, the verification set data can be used for verification, the verification effect and the loss change are observed, and if the verification effect is not improved any more and the loss is increased, the training is stopped immediately.

In the scheme of the embodiment of the invention, the method combining unsupervised training and supervised training is adopted, so that the requirement on sample data can be reduced, and the accuracy and generalization capability of the network can be improved.

Under the condition that the training data is point cloud data, the problem of difficult labeling is easy to occur. Specifically, the existing labels are usually artificial labels, in an automatic driving scene, points in the point cloud data usually include points corresponding to water mist and dust, and it is usually difficult to judge whether the points are points corresponding to water mist or dust during artificial labeling. Therefore, the method combining unsupervised training and supervised training can effectively solve the problem of labeled sample shortage.

It should be noted that in other embodiments, only a supervised training approach may be used. That is, the multitask detection model may be obtained by executing step S42 and step S43.

As mentioned above, the existing task detection models are independent of each other, and the training processes are also independent of each other. However, in the perception scenario of autonomous driving, there is a correlation between different detection tasks. For example, a lane is often the boundary of a travelable region, which often closely surrounds a traffic object. In the embodiment of the invention, a multi-task detection model shown in FIG. 3 is constructed and combined training is carried out. In the training process, better representation can be learned through shared information among a plurality of tasks, and the trained feature extraction network can extract more information, so that the performance of each detection task is improved.

In practical application, the pyrrch can be adopted for training, after the training is finished, the trained multi-task detection model can be quantitatively accelerated through the tensorrt library, and the model implementation and the on-line deployment are carried out by using the C + + language.

Referring to fig. 5, fig. 5 is a schematic structural diagram of a multitask detecting device according to an embodiment of the present invention. The apparatus shown in fig. 5 may include:

an obtaining module 51, configured to obtain input data and input the input data into a trained multi-task detection model, where the multi-task detection model includes: the task scheduling system comprises a single feature extraction network, a plurality of adjustment networks and a plurality of prediction networks, wherein the prediction networks correspond to the tasks one to one, and the adjustment networks correspond to the prediction networks one to one;

a feature extraction module 52, configured to perform feature extraction on the input data by using the feature extraction network to obtain a universal feature vector;

an adjusting module 53, configured to perform feature extraction and/or dimension transformation on the generic feature vector by using a kth adjusting network to obtain a feature vector corresponding to a kth task, and record the feature vector as the kth feature vector, where k is greater than or equal to 1 and less than or equal to M, k and M are positive integers, and M is the number of the tasks;

and the detecting module 54 is configured to calculate the kth feature vector by using the kth prediction network to obtain a detection result of the kth task.

In a specific implementation, the above-mentioned multitask detection device may correspond to a chip having a data processing function in a terminal; or to a chip module having a data processing function in the terminal, or to the terminal.

For more details of the working principle, the working mode, the beneficial effects, and the like of the multitask detecting device shown in fig. 5, reference may be made to the above description related to fig. 1 to 4, and details are not repeated here.

Embodiments of the present invention further provide a storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the above-mentioned multitask detection method. The storage medium may include ROM, RAM, magnetic or optical disks, etc. The storage medium may further include a non-volatile memory (non-volatile) or a non-transitory memory (non-transient), and the like.

The embodiment of the present invention further provides a terminal, which includes a memory and a processor, where the memory stores a computer program that can be run on the processor, and the processor executes the steps of the above-mentioned multitask detection method when running the computer program. The terminal may be a vehicle-mounted terminal.

The embodiment of the invention also provides a vehicle which can comprise the terminal, and the terminal can execute the multitask detection method.

It should be understood that, in the embodiment of the present application, the processor may be a Central Processing Unit (CPU), and the processor may also be other general-purpose processors, digital Signal Processors (DSPs), application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

It will also be appreciated that the memory in the embodiments of the subject application can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of example and not limitation, many forms of Random Access Memory (RAM) are available, such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (enhanced SDRAM), synchronous DRAM (SLDRAM), synchronous Link DRAM (SLDRAM), and direct bus RAM (DR RAM).

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. The procedures or functions according to the embodiments of the present application are wholly or partially generated when the computer instructions or the computer program are loaded or executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer program may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer program may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire or wirelessly.

In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus and system may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative; for example, the division of the cell is only a logic function division, and there may be another division manner in actual implementation; for example, various elements or components may be combined or may be integrated in another system or some features may be omitted, or not implemented. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be physically included alone, or two or more units may be integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit. For example, for each device or product applied to or integrated into a chip, each module/unit included in the device or product may be implemented by hardware such as a circuit, or at least a part of the module/unit may be implemented by a software program running on a processor integrated within the chip, and the rest (if any) part of the module/unit may be implemented by hardware such as a circuit; for each device and product applied to or integrated with the chip module, each module/unit included in the device and product may be implemented by hardware such as a circuit, and different modules/units may be located in the same component (e.g., a chip, a circuit module, etc.) or different components of the chip module, or at least part of the modules/units may be implemented by a software program running on a processor integrated inside the chip module, and the rest (if any) part of the modules/units may be implemented by hardware such as a circuit; for each device and product applied to or integrated in the terminal, each module/unit included in the device and product may be implemented by using hardware such as a circuit, and different modules/units may be located in the same component (e.g., a chip, a circuit module, etc.) or different components in the terminal, or at least part of the modules/units may be implemented by using a software program running on a processor integrated in the terminal, and the rest (if any) part of the modules/units may be implemented by using hardware such as a circuit.

It should be understood that the term "and/or" herein is merely one type of association relationship that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein indicates that the former and latter associated objects are in an "or" relationship.

The "plurality" appearing in the embodiments of the present application means two or more.

The descriptions of the first, second, etc. appearing in the embodiments of the present application are only for the purpose of illustrating and differentiating the description objects, and do not represent any particular limitation to the number of devices in the embodiments of the present application, and cannot constitute any limitation to the embodiments of the present application.

Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected by one skilled in the art without departing from the spirit and scope of the invention, as defined in the appended claims.

Claims

1. A method of multi-tasking detection, the method comprising:

acquiring input data, and inputting the input data into a multi-task detection model obtained by training, wherein the multi-task detection model comprises: the task scheduling system comprises a single feature extraction network, a plurality of adjustment networks and a plurality of prediction networks, wherein the prediction networks correspond to the tasks one to one, and the adjustment networks correspond to the prediction networks one to one;

extracting the features of the input data by adopting the feature extraction network to obtain a universal feature vector;

performing feature extraction and/or dimension transformation on the general feature vector by adopting a kth adjusting network to obtain a feature vector corresponding to a kth task, and marking the feature vector as the kth feature vector, wherein k is more than or equal to 1 and is less than or equal to M, k and M are positive integers, and M is the number of the tasks;

and calculating the kth characteristic vector by adopting a kth prediction network to obtain a detection result of the kth task.

2. The multi-tasking detection method of claim 1, wherein the input data is radar-acquired point cloud data, the feature extraction network comprises convolutional layers, and the convolutional layers perform sparse convolution calculations.

3. The multitask detection method according to claim 2, wherein the multitask detection model is obtained by training a preset model in advance by using training data, and the preset model comprises: a single initial feature extraction network, a plurality of initial adjustment networks, and a plurality of initial prediction networks, the training data comprising: before the sample data corresponding to the kth task is obtained, the method further includes:

the method comprises the following steps: carrying out supervised training on the preset model by adopting the training data, and obtaining an intermediate detection model when a first preset condition is met, wherein the intermediate detection model comprises the feature extraction network, a plurality of intermediate adjustment networks and a plurality of intermediate prediction networks;

step two: carrying out supervised training on a kth intermediate adjusting network and a kth intermediate predicting network by adopting sample data corresponding to the kth task, and obtaining the multi-task detecting network when a second preset condition is met;

wherein the first preset condition comprises: the total loss is less than or equal to a first preset loss, and the second preset condition comprises that: and the loss of each task is less than or equal to a second preset loss, and the total loss is calculated according to the losses of the M tasks.

4. A method of multitasking detection according to claim 3, characterized in that the total loss is calculated using the following formula:

wherein L is the total loss, σ _k For the kth initial adjustment network weight, σ _k As a learnable parameter, L _k Is the loss of the k-th task.

5. The multitask detection method according to claim 3, wherein the learning rate used in the first step is a first learning rate, the learning rate used in the second step is a second learning rate, and the first learning rate is larger than the second learning rate.

6. The multitask detection method according to claim 3, characterized in that said sample data set further comprises: no labeled sample data, before the first step, the method further comprises:

and performing unsupervised training on the preset model by adopting the unlabeled sample data, and obtaining the preset model for performing the supervised training when a third preset condition is met.

7. The multitask detection method according to claim 6, wherein the unlabeled sample data comprises a plurality of frames of first point cloud data, and the unsupervised training constraints comprise: minimizing the feature distance between the matching points and/or maximizing the feature distance between the non-matching points, wherein before performing unsupervised training on the preset model by using the unlabeled sample data, the method further comprises:

performing data enhancement processing on each frame of first point cloud data to obtain second point cloud data corresponding to the frame of first point cloud data, wherein points in the first point cloud data correspond to points in the second point cloud data one to one;

determining matching points and non-matching points in each frame of first point cloud data and corresponding second point cloud data; the matching points refer to points with corresponding relations, and the non-matching points refer to points without corresponding relations.

8. A method and apparatus for multi-task detection, the apparatus comprising:

an obtaining module, configured to obtain input data and input the input data into a trained multi-task detection model, where the multi-task detection model includes: the system comprises a single feature extraction network, a plurality of adjustment networks and a plurality of prediction networks, wherein the prediction networks correspond to the tasks one to one, and the adjustment networks correspond to the prediction networks one to one;

the characteristic extraction module is used for extracting the characteristics of the input data by adopting the characteristic extraction network so as to obtain a universal characteristic vector;

the adjusting module is used for performing feature extraction and/or dimension transformation on the general feature vector by adopting a kth adjusting network to obtain a feature vector corresponding to a kth task, and the feature vector is marked as the kth feature vector, wherein k is more than or equal to 1 and less than or equal to M, k and M are positive integers, and M is the number of the tasks;

and the detection module is used for calculating the kth characteristic vector by adopting the kth prediction network so as to obtain a detection result of the kth task.

9. A storage medium having a computer program stored thereon, the computer program, when executed by a processor, performing the steps of the multitask detection method according to any one of claims 1 to 7.

10. A terminal comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor, wherein the processor, when executing the computer program, performs the steps of the multitask detection method according to any one of claims 1-7.