CN114519381A - Sensing method and device based on multitask learning network, storage medium and terminal - Google Patents


Info

Publication number
CN114519381A
CN114519381A
Authority
CN
China
Prior art keywords
network
task
learning network
parameters
network layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111677833.8A
Other languages
Chinese (zh)
Inventor
黄超 (Huang Chao)
姚为龙 (Yao Weilong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xiantu Intelligent Technology Co Ltd
Original Assignee
Shanghai Xiantu Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xiantu Intelligent Technology Co Ltd filed Critical Shanghai Xiantu Intelligent Technology Co Ltd
Priority to CN202111677833.8A
Publication of CN114519381A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

A perception method and device based on a multi-task learning network, a storage medium and a terminal are disclosed. The method comprises the following steps: constructing a training sample set; inputting the samples with determined recognition results into the whole multi-task learning network and, through training, determining the preferred feature network layer parameters and the initially trained multi-task learning network; fixing the parameters of the feature network layer, inputting the samples with determined recognition results into the initially trained multi-task learning network, and obtaining through training the preferred adjustment network layer parameters and the retrained multi-task learning network; fixing the parameters of the feature network layer and of the adjustment network layer, inputting the samples with determined recognition results together with the screened samples into the retrained multi-task learning network, and obtaining the adjusted multi-task learning network through training; and inputting the image to be perceived into the adjusted multi-task learning network to obtain a perception result. The invention can effectively improve the accuracy of the perception results of the multi-task learning network.

Description

Sensing method and device based on multitask learning network, storage medium and terminal
Technical Field
The invention relates to the field of machine learning, in particular to a perception method and device based on a multi-task learning network, a storage medium and a terminal.
Background
Perception algorithms play an important role in how an autonomous vehicle understands its surroundings. Perceived targets typically include obstacles such as pedestrians, vehicles and bicycles, and perception algorithms typically cover target detection, tracking, semantic segmentation, instance segmentation, clustering, depth estimation and the like. Because all perception algorithms are deployed on the vehicle's on-board computer, they place high demands on computing performance and increase equipment cost. To effectively reduce the computational cost of the algorithms, and considering the correlation among different perception tasks in autonomous driving, multi-task learning methods have been studied: multiple related tasks are learned together, so that during learning the tasks can share the domain knowledge each of them has learned.
In the prior art, the mainstream multi-task learning method adopts a structure of a shared feature network and task head networks. Moreover, when training a multi-task learning network, a fixed weight value is generally set manually for each task head network. A manually set weight is not necessarily optimal, and a fixed weight value is not conducive to training the whole multi-task learning network, so the accuracy of the perception results output by the trained network in practical applications is insufficient. In addition, the mainstream approach trains the whole network by supervised learning, but data labeling requires substantial labor cost, and a multi-task learning network trained only on labeled sample data has insufficient generalization capability.
Therefore, a perception method based on a multi-task learning network is needed that can reduce the labor cost required for training, effectively improve the generalization capability of the trained multi-task learning network, and improve the accuracy of the perception results it outputs.
Disclosure of Invention
The invention addresses the technical problems that, in the prior art, the generalization capability of a multi-task learning network is insufficient and the accuracy of its perception results is low.
In order to solve the above technical problems, an embodiment of the present invention provides a perception method based on a multi-task learning network, where the multi-task learning network includes a feature network layer, an adjustment network layer and a task head network layer, the adjustment network layer includes a plurality of adjustment networks, the task head network layer includes a plurality of task head networks, and the adjustment networks are in one-to-one correspondence with the task head networks. The method comprises the following steps: constructing a training sample set, where the training sample set includes samples with determined recognition results and samples with undetermined recognition results; inputting the samples with determined recognition results into the whole multi-task learning network and, through training, determining the preferred feature network layer parameters and the initially trained multi-task learning network; fixing the parameters of the feature network layer as the preferred feature network layer parameters, inputting the samples with determined recognition results into the initially trained multi-task learning network, and obtaining through training the preferred adjustment network layer parameters and the retrained multi-task learning network; fixing the parameters of the feature network layer as the preferred feature network layer parameters, fixing the parameters of the adjustment network layer as the preferred adjustment network layer parameters, inputting the samples with determined recognition results and the screened samples obtained by screening the samples with undetermined recognition results into the retrained multi-task learning network, and obtaining the adjusted multi-task learning network through training; and inputting the image to be perceived into the adjusted multi-task learning network to obtain a perception result.
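The three training stages described above can be sketched as follows. This is a minimal illustrative sketch in plain Python (the classes, names and step counts are hypothetical, not prescribed by the patent) showing only which layers are updated in which stage:

```python
# Minimal sketch of the three-stage schedule: which layers are updated when.

class Layer:
    """A network layer with trainable parameters and a freeze flag."""
    def __init__(self, name):
        self.name = name
        self.frozen = False
        self.updates = 0  # number of parameter updates applied

    def maybe_update(self):
        if not self.frozen:
            self.updates += 1

def train_stage(layers, steps):
    """One training stage: each step updates every non-frozen layer."""
    for _ in range(steps):
        for layer in layers:
            layer.maybe_update()

feature = Layer("feature")
adjust = [Layer("adjust_%d" % k) for k in range(3)]
heads = [Layer("head_%d" % k) for k in range(3)]
net = [feature] + adjust + heads

# Stage 1: initial training of the whole network on labeled samples.
train_stage(net, steps=5)

# Stage 2: fix the feature layer at its preferred parameters; retrain
# on labeled samples to obtain preferred adjustment-layer parameters.
feature.frozen = True
train_stage(net, steps=5)

# Stage 3: also fix the adjustment layers; fine-tune the task heads on
# labeled samples plus the high-confidence screened samples.
for a in adjust:
    a.frozen = True
train_stage(net, steps=5)
```

After the three stages, only the task heads have been updated in every stage, the adjustment networks in the first two, and the feature layer in the first one, which mirrors the progressive freezing described in the claim.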
Optionally, inputting the samples with determined recognition results into the whole multi-task learning network and, through training, determining the preferred feature network layer parameters and the initially trained multi-task learning network includes: setting a loss function of the multi-task learning network; and inputting the samples with determined recognition results into the whole multi-task learning network to perform initial training on it until the loss function of the multi-task learning network converges, then stopping training to obtain the preferred feature network layer parameters and the initially trained multi-task learning network.
Optionally, inputting the samples with determined recognition results into the whole multi-task learning network and, through training, determining the preferred feature network layer parameters and the initially trained multi-task learning network includes: setting a loss function of the multi-task learning network; and inputting the samples with determined recognition results into the whole multi-task learning network to perform initial training on it, where during the initial training the parameters of the feature network layer are updated once each time the weight value of each task head network is updated once, until the loss function of the multi-task learning network converges, then stopping training to obtain the preferred feature network layer parameters and the initially trained multi-task learning network.
Optionally, the loss function of the multi-task learning network is as follows:

$$\mathrm{Loss} = \sum_{k=1}^{K} \left( \frac{1}{2\sigma_k^2}\, L_k + \log \sigma_k \right)$$

where $\mathrm{Loss}$ represents the loss function value of the whole multi-task learning network, $\frac{1}{2\sigma_k^2}$ represents the weight value of the $k$-th task head network, $K$ represents the number of task head networks, and $L_k$ represents the loss function value of the $k$-th task head network.
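Assuming the weight of the $k$-th task head takes the homoscedastic-uncertainty form $1/(2\sigma_k^2)$ with a $\log \sigma_k$ regularizer (a standard choice consistent with the description above; the function name and values below are illustrative), the combined loss can be computed as:

```python
import math

def multitask_loss(head_losses, sigmas):
    """Combined loss: sum_k L_k / (2*sigma_k**2) + log(sigma_k).

    1/(2*sigma_k**2) plays the role of the k-th head's weight; the
    log(sigma_k) term keeps sigma_k from growing without bound, so the
    weights can adapt during training instead of being set by hand.
    """
    return sum(l / (2.0 * s * s) + math.log(s)
               for l, s in zip(head_losses, sigmas))

# With sigma = 1 the weight is 0.5; raising sigma to 2 lowers the
# weight of that head's loss term to 1/(2*4) = 0.125.
combined = multitask_loss([1.0, 1.0], [1.0, 2.0])
```

A head whose loss is noisy (large $\sigma_k$) thus contributes less to the total, which is the behaviour the adaptive weighting below relies on.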
Optionally, fixing the parameters of the feature network layer as the preferred feature network layer parameters, inputting the samples with determined recognition results into the initially trained multi-task learning network, and obtaining through training the preferred adjustment network layer parameters and the retrained multi-task learning network includes: setting a loss function for each adjustment network, and fixing the parameters of the feature network layer as the preferred feature network layer parameters; and inputting the samples with determined recognition results into the initially trained multi-task learning network to retrain it until the loss functions of all the adjustment networks converge, then stopping training to obtain the preferred adjustment network layer parameters and the retrained multi-task learning network.
Optionally, the screened samples are obtained by screening the samples with undetermined recognition results as follows: inputting the samples with undetermined recognition results into the retrained multi-task learning network, and outputting the perception result of each such sample together with the confidence of the perception result; and selecting the samples whose confidence is greater than a preset confidence threshold as the screened samples.
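The screening step above can be sketched as follows; the function and the toy network are hypothetical stand-ins (the patent only requires some preset confidence threshold, not the specific value used here):

```python
def screen_samples(unlabeled, network, threshold=0.9):
    """Keep only unlabeled samples whose prediction the retrained
    network is confident about; the prediction becomes the sample's
    pseudo recognition result.  `network` maps a sample to a
    (result, confidence) pair; 0.9 is an illustrative threshold."""
    screened = []
    for sample in unlabeled:
        result, confidence = network(sample)
        if confidence > threshold:
            screened.append((sample, result))
    return screened

# Toy stand-in for the retrained multi-task learning network.
def fake_net(sample):
    return ("pedestrian", 0.95) if sample % 2 == 0 else ("car", 0.40)

kept = screen_samples(range(6), fake_net)  # samples 0, 2, 4 survive
```

The kept (sample, result) pairs then join the labeled samples as training data for the final adjustment stage.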
Optionally, fixing the parameters of the feature network layer as the preferred feature network layer parameters, fixing the parameters of the adjustment network layer as the preferred adjustment network layer parameters, inputting the samples with determined recognition results and the screened samples obtained by screening the samples with undetermined recognition results into the retrained multi-task learning network, and obtaining the adjusted multi-task learning network through training includes: setting a loss function for each task head network, fixing the parameters of the feature network layer as the preferred feature network layer parameters, and fixing the parameters of the adjustment network layer as the preferred adjustment network layer parameters; and inputting the samples with determined recognition results and the screened samples into the retrained multi-task learning network to train it until the loss functions of all the task head networks converge, then stopping training to obtain the adjusted multi-task learning network.
Optionally, the multi-task learning network satisfies one or more of the following conditions: the feature network structure in the feature network layer adopts the darknet structure of the one-stage detection algorithm YOLO; the structure of each adjustment network in the adjustment network layer adopts a bottleneck network structure.
Optionally, the task head network layer includes one or more of the following: a semantic segmentation head, a low obstacle segmentation head, a target detection head, a garbage detection head and a traffic light detection head.
Optionally, the task head networks satisfy one or more of the following conditions: the semantic segmentation head adopts a segmentation mask SegMaskPSP structure; the low obstacle segmentation head adopts a segmentation mask SegMaskPSP structure; the target detection head adopts a fully convolutional network (FCN) structure; the garbage detection head adopts an FCN structure; the traffic light detection head adopts an FCN structure.
An embodiment of the present invention further provides a perception device based on a multi-task learning network, including: a training sample set construction module, configured to construct a training sample set that includes samples with determined recognition results and samples with undetermined recognition results; an initial training module, configured to input the samples with determined recognition results into the whole multi-task learning network and, through training, determine the preferred feature network layer parameters and the initially trained multi-task learning network; a retraining module, configured to fix the parameters of the feature network layer as the preferred feature network layer parameters, input the samples with determined recognition results into the initially trained multi-task learning network, and obtain through training the preferred adjustment network layer parameters and the retrained multi-task learning network; a network adjustment module, configured to fix the parameters of the feature network layer as the preferred feature network layer parameters, fix the parameters of the adjustment network layer as the preferred adjustment network layer parameters, input the samples with determined recognition results and the screened samples obtained by screening the samples with undetermined recognition results into the retrained multi-task learning network, and obtain the adjusted multi-task learning network through training; and a perception result determination module, configured to input the image to be perceived into the adjusted multi-task learning network to obtain a perception result.
An embodiment of the present invention further provides a storage medium, which is a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, performs the steps of the above perception method based on a multi-task learning network.
An embodiment of the present invention further provides a terminal, including a memory and a processor, where the memory stores a computer program capable of running on the processor, and the processor, when running the computer program, performs the steps of the above perception method based on a multi-task learning network.
Compared with the prior art, the technical solution of the embodiments of the present invention has the following beneficial effects:
In the embodiments of the present invention, a training sample set is first constructed, including samples with determined recognition results and samples with undetermined recognition results; the samples with determined recognition results are input into the whole multi-task learning network and, through training, the preferred feature network layer parameters and the initially trained multi-task learning network are determined. Then the parameters of the feature network layer are fixed, the samples with determined recognition results are input into the initially trained multi-task learning network, and the preferred adjustment network layer parameters and the retrained multi-task learning network are obtained through training. Next, the parameters of the feature network layer and of the adjustment network layer are fixed, the samples with determined recognition results and the samples screened from the samples with undetermined recognition results are input into the retrained multi-task learning network, and the adjusted multi-task learning network is obtained through training. Finally, the image to be perceived is input into the adjusted multi-task learning network to obtain a perception result. By contrast, the prior art does not adopt adjustment networks in one-to-one correspondence with the task head networks, and generally sets the weight value of each task head network manually when training the feature network, so the accuracy of the perception results output by the trained multi-task learning network in practical applications is insufficient; in addition, the supervised approach (with only labeled samples among the training samples) requires substantial labor cost for data labeling, and the generalization capability of the network so trained is insufficient.
Further, during the initial training, the parameters of the feature network layer are updated once each time the weight value of each task head network is updated once, until the loss function of the multi-task learning network converges; training is then stopped, yielding the preferred feature network layer parameters and the initially trained multi-task learning network. In this way, the weight value of each task head network is dynamically adjusted according to its training state during training, which overcomes the problem in the prior art that manually set weight values are not accurate enough, effectively improves the training effect of each task head network, and further improves the accuracy of the perception results output by the whole trained multi-task learning network in practical applications.
Further, the screened samples are obtained by screening the samples with undetermined recognition results as follows: the samples with undetermined recognition results are input into the retrained multi-task learning network, which outputs the perception result of each such sample together with its confidence; the samples whose confidence is greater than a preset confidence threshold are selected as the screened samples. Because the samples with undetermined recognition results need no manual labeling, and their selection carries a degree of randomness, using the screened samples as part of the training data to further adjust the whole network can effectively reduce labor cost and improve the generalization capability of the multi-task learning network.
Drawings
FIG. 1 is a flowchart of a perception method based on a multi-task learning network according to an embodiment of the present invention;
FIG. 2 is an overall architecture diagram of a multi-task learning network in an embodiment of the present invention;
FIG. 3 is a flowchart of another perception method based on a multi-task learning network according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a perception device based on a multi-task learning network according to an embodiment of the present invention.
Detailed Description
In the field of autonomous driving, a perception method based on a multi-task learning network is urgently needed to accurately perceive the surroundings of a vehicle.
In the prior art, the mainstream multi-task learning method adopts a structure of a shared feature network and a plurality of task head networks, without an adjustment network for each task head. Moreover, a fixed weight value is generally set manually for each task head network during training, and a supervised method is often adopted to train the whole multi-task learning network; that is, during training, the multi-task learning network is trained, adjusted and optimized entirely on the basis of labeled sample data.
The inventors found that the features extracted by the shared feature network are not necessarily suitable for every task, so the shared features need domain-adaptive adjustment to suit each individual task. In addition, manually set weights are not necessarily optimal, and fixed, unchanging parameters are not conducive to network training, so the accuracy of the perception results output by the trained multi-task learning network in practical applications is insufficient. Furthermore, the supervised learning method in the prior art uses only labeled samples as training data; labeling samples consumes substantial labor cost, and a multi-task learning network that relies entirely on labeled samples for training has insufficient generalization capability.
In the embodiments of the present invention, a training sample set is first constructed, including samples with determined recognition results and samples with undetermined recognition results; the samples with determined recognition results are input into the whole multi-task learning network and, through training, the preferred feature network layer parameters and the initially trained multi-task learning network are determined. Then the parameters of the feature network layer are fixed, the samples with determined recognition results are input into the initially trained multi-task learning network, and the preferred adjustment network layer parameters and the retrained multi-task learning network are obtained through training. Next, the parameters of the feature network layer and of the adjustment network layer are fixed, the samples with determined recognition results and the samples screened from the samples with undetermined recognition results are input into the retrained multi-task learning network, and the adjusted multi-task learning network is obtained through training. Finally, the image to be perceived is input into the adjusted multi-task learning network to obtain a perception result. By contrast, the prior art does not adopt adjustment networks in one-to-one correspondence with the task head networks, and generally sets the weight value of each task head network manually when training the feature network, so the accuracy of the perception results output by the trained multi-task learning network in practical applications is insufficient; in addition, the supervised approach (with only labeled samples among the training samples) requires substantial labor cost for data labeling, and the generalization capability of the network so trained is insufficient.
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, specific embodiments are described in detail below with reference to the accompanying drawings.
Referring to FIG. 1, FIG. 1 is a flowchart of a perception method based on a multi-task learning network according to an embodiment of the present invention. The method may include steps S11 to S15:
step S11: constructing a training sample set, wherein the training sample set comprises samples with determined recognition results and samples with undetermined recognition results;
step S12: inputting the samples with determined recognition results into the whole multi-task learning network and, through training, determining the preferred feature network layer parameters and the initially trained multi-task learning network;
step S13: fixing the parameters of the feature network layer as the preferred feature network layer parameters, inputting the samples with determined recognition results into the initially trained multi-task learning network, and obtaining through training the preferred adjustment network layer parameters and the retrained multi-task learning network;
step S14: fixing the parameters of the feature network layer as the preferred feature network layer parameters, fixing the parameters of the adjustment network layer as the preferred adjustment network layer parameters, inputting the samples with determined recognition results and the screened samples obtained by screening the samples with undetermined recognition results into the retrained multi-task learning network, and obtaining the adjusted multi-task learning network through training;
step S15: and inputting the image to be perceived into the adjusted multi-task learning network to obtain a perception result.
The multi-task learning network includes a feature network layer, an adjustment network layer and a task head network layer; the adjustment network layer includes a plurality of adjustment networks, the task head network layer includes a plurality of task head networks, and the adjustment networks correspond one-to-one to the task head networks.
In a specific implementation of step S11, the training sample set may consist of multiple frames of image data captured by a vehicle-mounted camera. Specifically, the training sample set includes a plurality of task data sets to be trained, each in one-to-one correspondence with a task head network to be trained. A sample with a determined recognition result is a labeled sample, where the label indicates a manually annotated recognition result, for example an object in an image manually labeled as a certain class of garbage; a sample with an undetermined recognition result is an unlabeled sample.
The determined recognition result may refer to a clearly labeled recognition result, and the undetermined recognition result may refer to a not clearly labeled recognition result, for example, the recognition result may be known but not labeled, or the recognition result may be unknown.
It should be noted that data enhancement may be applied to the training sample set, for example by adding random noise, image stitching or random scaling. Data enhancement increases the diversity of the training data, which yields a better-optimized multi-task learning network through training and thus more accurate perception results in application.
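Two of the enhancement methods mentioned above, random noise and random scaling, can be sketched for a grayscale image as follows (the noise amplitude, value range and nearest-neighbour scheme are illustrative choices, not prescribed by the patent):

```python
import random

def add_random_noise(image, amplitude=10, seed=None):
    """Return a copy of a grayscale image (list of rows of 0-255 ints)
    with uniform random noise added, clipped to the valid range."""
    rng = random.Random(seed)
    return [[min(255, max(0, p + rng.randint(-amplitude, amplitude)))
             for p in row] for row in image]

def random_scale(image, factors=(0.5, 1.0, 2.0), seed=None):
    """Nearest-neighbour rescale by a factor drawn from `factors`."""
    rng = random.Random(seed)
    f = rng.choice(factors)
    rows, cols = len(image), len(image[0])
    new_rows, new_cols = max(1, int(rows * f)), max(1, int(cols * f))
    return [[image[int(r / f)][int(c / f)] for c in range(new_cols)]
            for r in range(new_rows)]

image = [[100, 120], [140, 160]]
noisy = add_random_noise(image, seed=0)
bigger = random_scale(image, factors=(2.0,), seed=0)
```

Each enhanced image keeps its original label, so the labeled sample set grows in diversity at no extra annotation cost.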
In a specific implementation of step S12, inputting the samples with determined recognition results into the whole multi-task learning network and, through training, determining the preferred feature network layer parameters and the initially trained multi-task learning network includes: setting a loss function of the multi-task learning network; and inputting the samples with determined recognition results into the whole multi-task learning network to perform initial training on it until the loss function of the multi-task learning network converges, then stopping training to obtain the preferred feature network layer parameters and the initially trained multi-task learning network.
Further, inputting the samples with determined recognition results into the whole multi-task learning network and, through training, determining the preferred feature network layer parameters and the initially trained multi-task learning network includes: setting a loss function of the multi-task learning network; and inputting the samples with determined recognition results into the whole multi-task learning network to perform initial training on it, where during the initial training the parameters of the feature network layer are updated once each time the weight value of each task head network is updated once, until the loss function of the multi-task learning network converges, then stopping training to obtain the preferred feature network layer parameters and the initially trained multi-task learning network.
Further, the loss function of the multi-task learning network is as follows:

$$\mathrm{Loss} = \sum_{k=1}^{K} \left( \frac{1}{2\sigma_k^2}\, L_k + \log \sigma_k \right)$$

where $\mathrm{Loss}$ represents the loss function value of the whole multi-task learning network, $\frac{1}{2\sigma_k^2}$ represents the weight value of the $k$-th task head network, $K$ represents the number of task head networks, and $L_k$ represents the loss function value of the $k$-th task head network.
Here, $\sigma$ denotes the standard deviation from probability theory and statistics: it expresses the degree of variation of a variable, or more specifically the degree to which data deviate from their mean, a measure of dispersion of a random variable or a set of data. In the implementation of the embodiments of the present invention, $\frac{1}{2\sigma_k^2}$ expresses the stability of the loss function of the $k$-th task head network, and $\sigma_k$ is adaptively adjusted during training according to the training state of each task head network, so as to reasonably distribute the weight of each task.
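One plausible reading of this adaptive adjustment is to estimate $\sigma_k$ from the dispersion of each task head's recent loss values and derive the weight from it; the estimator and the window of recent losses below are assumptions, since the patent does not fix them:

```python
import math

def adaptive_weights(loss_histories, eps=1e-6):
    """Estimate sigma_k as the standard deviation of each task head's
    recent loss values and derive the weight 1/(2*sigma_k**2).  A head
    whose loss is still fluctuating (large sigma) gets a smaller
    weight; a head with a stable loss gets a larger one."""
    weights = []
    for losses in loss_histories:
        mean = sum(losses) / len(losses)
        var = sum((l - mean) ** 2 for l in losses) / len(losses)
        sigma = math.sqrt(var) + eps  # eps guards against sigma == 0
        weights.append(1.0 / (2.0 * sigma ** 2))
    return weights

stable = [1.00, 1.01, 0.99, 1.00]   # a task head with a settled loss
unstable = [1.0, 2.0, 0.2, 1.8]     # a task head still fluctuating
w_stable, w_unstable = adaptive_weights([stable, unstable])
```

Recomputing these weights each time the head losses are updated gives the per-update dynamic adjustment described above.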
It should be noted that, in the step of initially training the entire multi-task learning network, a loss function must be set not only for the whole network but also for each task head network. During training, the weight value of each task head network is dynamically adjusted according to that head's loss function value, and the parameters of the feature network layer are continuously optimized until the loss function of the multi-task learning network converges; training then stops, yielding the preferred feature network layer parameters and the initially trained multi-task learning network. By adjusting the weight values dynamically, the embodiment of the invention avoids the inaccuracy of manually set weight values in the prior art, effectively improves the training effect of each task head network, and thus improves the accuracy of the perception results output by the trained multi-task learning network in practical applications.
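The adaptive weighting above can be sketched in plain Python. This is a minimal illustration: the function name and signature are our own, and practical implementations of this kind of uncertainty weighting (e.g. Kendall et al.) usually add a log σ_k regularizer so that σ_k cannot grow without bound; the patent's description only specifies the weighted sum.

```python
def multitask_loss(task_losses, sigmas):
    """Weighted sum over task heads: each loss L_k is scaled by
    1 / (2 * sigma_k**2), so heads whose losses are unstable
    (large sigma_k) contribute less to the overall loss."""
    return sum(L_k / (2.0 * sigma_k ** 2)
               for L_k, sigma_k in zip(task_losses, sigmas))
```

For example, with two heads whose losses are 1.0 and 2.0 and σ = 1 for both, the overall loss is 1.5; doubling σ for a head quarters its weight.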
In a specific implementation, before the samples of the determined recognition result are input into a multi-task learning network to train a feature network, the multi-task learning network needs to be constructed.
Referring to fig. 2, fig. 2 is an overall architecture diagram of a multitask learning network according to an embodiment of the present invention. The multitasking learning network may include: a feature network layer 21, a regulation network layer 22 and a task header network layer 23.
The feature network layer 21 may be used to construct the first layer of the multi-task learning network and includes a common feature extraction network. The adjustment network layer 22 includes a plurality of adjustment networks, namely adjustment network 1, adjustment network 2, ..., adjustment network n, and may be used to construct the second layer of the multi-task learning network. The task head network layer 23 includes a plurality of task head networks, namely task head network 1, task head network 2, ..., task head network n, where n is a positive integer, and may be used to construct the third layer of the multi-task learning network.
The plurality of adjustment networks and the plurality of task head networks are in one-to-one correspondence; the feature network layer 21, the adjustment network layer 22, and the task head network layer 23 are connected in sequence.
The feature network layer 21 may correspond to a common feature extraction network used in a conventional multitask learning method, and may be configured to extract feature points in an image and output a feature data set. The embodiment of the present invention does not limit the specific implementation details of the feature network layer 21.
The adjustment network layer 22 may perform domain adaptation on the feature data set (or extracted common features) output by the feature network layer 21, so as to adapt to each individual task.
The task head network layer 23 may correspond to the plurality of tasks in the multi-task learning network (each task may also be referred to as a task head). Taking the field of automatic driving as an example, it may include the various perception tasks to be completed in a driving environment, such as semantic segmentation, low obstacle recognition, garbage detection, and traffic light detection.
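The three-layer data flow described above can be sketched as a forward pass (function and parameter names are illustrative, not from the patent; the networks are stand-ins for arbitrary callables):

```python
def forward(image, feature_net, adjust_nets, head_nets):
    """Forward pass: shared features -> per-task domain adaptation ->
    per-task prediction. Adjustment network i feeds exactly one task
    head i (one-to-one correspondence)."""
    assert len(adjust_nets) == len(head_nets)
    shared = feature_net(image)          # feature network layer 21
    outputs = []
    for adjust, head in zip(adjust_nets, head_nets):
        adapted = adjust(shared)         # adjustment network layer 22
        outputs.append(head(adapted))    # task head network layer 23
    return outputs
```

Each task head sees only its own adapted copy of the shared features, which is what allows the adjustment layer to specialize the common features per task.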
As a non-limiting example, the multi-task learning network satisfies one or more of the following: the feature network structure in the feature network layer adopts the darknet structure from the single-step strategy algorithm YOLO; the structure of each adjustment network in the adjustment network layer adopts a bottleneck network structure. The single-step strategy algorithm may be YOLO version 5 (YOLOv5), which uses a single neural network to directly predict object boundaries and class probabilities, achieving end-to-end object detection quickly in one step. Darknet is a lightweight learning framework that can analyze and process image data; its main advantages are easy installation, flexible use, and good portability. The bottleneck structure is also a common network layer in machine learning: it uses 1×1 convolutions and is named for its narrowed shape. Used in deeper networks, it reduces the number of parameters and the amount of computation, making data training and feature extraction effective and intuitive.
As a non-limiting example, the task head network layer may include one or more of: a semantic segmentation head, a low obstacle segmentation head, a target detection head, a garbage detection head, and a traffic light detection head. The task head network layer may also include other types of task heads in the perception field; the embodiment of the present invention does not limit the types of task heads in the task head network layer.
In some non-limiting embodiments, the semantic segmentation head may adopt a segmentation mask (SegMaskPSP) structure with 65 categories, covering foreground objects commonly seen on a road, such as road signs and buildings. The low obstacle segmentation head may adopt a SegMaskPSP structure with 3 categories: low obstacle, passable area, and non-passable area. The target detection head may adopt a fully convolutional network (FCN) structure comprising classification and regression networks, with 4 categories: car, pedestrian, bicycle, and motorcycle. The garbage detection head may adopt an FCN structure comprising classification and regression networks, with 15 categories covering common garbage types. The traffic light detection head may adopt an FCN structure comprising classification and regression networks, with 1 category. The various task head networks may also adopt other network structures, and the present invention is not limited thereto.
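The example head configurations above can be summarized in a table-like structure (an assumed summary of the structures and class counts stated in this paragraph; the dictionary keys are our own and the list is not exhaustive):

```python
# Structures and class counts of the example task heads, as stated in
# the description (keys are illustrative names, not from the patent).
TASK_HEADS = {
    "semantic_segmentation": {"structure": "SegMaskPSP", "num_classes": 65},
    "low_obstacle":          {"structure": "SegMaskPSP", "num_classes": 3},
    "target_detection":      {"structure": "FCN",        "num_classes": 4},
    "garbage_detection":     {"structure": "FCN",        "num_classes": 15},
    "traffic_light":         {"structure": "FCN",        "num_classes": 1},
}
```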
In specific implementation, the basic principle of the training process of the multitask learning network is as follows:
First, initial training is performed: samples with determined recognition results (labeled samples), comprising the data sets of the several tasks to be trained, are input into the multi-task learning network. During this training, it is mainly the parameters of the feature network layer (the common feature network) that are continuously optimized, yielding the preferred feature network layer parameters;
Then, retraining is performed: the parameters of the feature network layer are fixed at the preferred feature network layer parameters, and the labeled samples are again input into the whole multi-task learning network. During this training, it is mainly the parameters of each adjustment network in the adjustment network layer that are continuously optimized, yielding the preferred adjustment network layer parameters;
Finally, network fine-tuning is performed on the retrained multi-task learning network: the parameters of the feature network layer are fixed at the preferred feature network layer parameters, the parameters of the adjustment network layer are fixed at the preferred adjustment network layer parameters, and both the labeled samples and the screened samples (unlabeled samples selected by screening the samples with undetermined recognition results) are input into the retrained multi-task learning network. During this training, it is mainly the parameters of each task head network in the task head network layer that are continuously optimized, yielding the preferred task head network layer parameters. This completes the training process of the multi-task learning network.
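The three-stage procedure amounts to a freeze schedule over the three parameter groups. A minimal sketch (stage names and the dictionary layout are ours; each stage "mainly" optimizes one group, per the description):

```python
# Which parameter group each training stage mainly optimizes, and which
# groups are held fixed (frozen), following the three-stage procedure.
STAGES = {
    "initial training": {"optimizes": "feature", "frozen": []},
    "retraining":       {"optimizes": "adjust",  "frozen": ["feature"]},
    "fine-tuning":      {"optimizes": "heads",   "frozen": ["feature", "adjust"]},
}

def is_trainable(stage, group):
    """True if parameter group `group` is not frozen during `stage`."""
    return group not in STAGES[stage]["frozen"]
```

In a framework such as PyTorch, freezing a group would typically be done by setting `requires_grad = False` on its parameters before the stage begins.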
With reference to fig. 1, in a specific implementation of step S13, fixing the parameters of the feature network layer at the preferred feature network layer parameters, inputting the samples with determined recognition results into the initially trained multi-task learning network, and obtaining through training the preferred adjustment network layer parameters and the retrained multi-task learning network may include: setting a loss function for each adjustment network and fixing the parameters of the feature network layer at the preferred feature network layer parameters; and inputting the samples with determined recognition results into the initially trained multi-task learning network to retrain it until the loss functions of all the adjustment networks have converged, then stopping training and obtaining the preferred adjustment network layer parameters and the retrained multi-task learning network.
Specifically, when retraining the initially trained multi-task learning network, the parameters of the feature network layer are fixed, so retraining mainly trains the parameters of each adjustment network in the adjustment network layer. The loss functions of the adjustment networks are independent of one another: whenever the loss function of a given adjustment network converges, training of that network stops and its preferred parameters are obtained. Once the loss functions of all adjustment networks in the adjustment network layer have converged, the whole retraining process stops, yielding the preferred adjustment network layer parameters and the retrained multi-task learning network.
Specifically, an appropriate loss function may be selected for each adjustment network, and the loss functions of the adjustment networks may be the same or different. It should be noted that during retraining the adjustment networks may be trained independently of one another, in no particular order.
In the embodiment of the invention, an adjusting network layer comprising a plurality of adjusting networks is added between the characteristic network layer and the task head network layer, wherein the plurality of adjusting networks correspond to the plurality of task head networks one by one, and the common characteristics extracted by the characteristic network layer can be subjected to domain adaptive adjustment, so that the common characteristics are adapted to each individual task, and the accuracy of a perception result is improved.
In a specific implementation of step S14, fixing the parameters of the feature network layer at the preferred feature network layer parameters, fixing the parameters of the adjustment network layer at the preferred adjustment network layer parameters, inputting the samples with determined recognition results together with the screened samples (obtained by screening the samples with undetermined recognition results) into the retrained multi-task learning network, and obtaining the adjusted multi-task learning network through training may include: setting a loss function for each task head network, fixing the parameters of the feature network layer at the preferred feature network layer parameters, and fixing the parameters of the adjustment network layer at the preferred adjustment network layer parameters; and inputting the samples with determined recognition results and the screened samples into the retrained multi-task learning network to train it until the loss functions of all the task head networks have converged, then stopping training and obtaining the adjusted multi-task learning network.
Specifically, when adjusting the retrained multi-task learning network, the parameters of the feature network layer and of the adjustment network layer are held fixed, so training mainly updates the parameters of each task head network in the task head network layer. The loss functions of the task head networks are independent of one another: whenever the loss function of a given task head network converges, training of that head stops and its preferred parameters are obtained. Once the loss functions of all task head networks in the task head network layer have converged, the whole adjustment (training) process stops, yielding the preferred task head network layer parameters and the adjusted multi-task learning network.
In a specific implementation, an appropriate loss function may be selected for each task head network, and the loss functions of the task head networks may be the same or different. It should be noted that during this training the task head networks may be trained independently of one another, in no particular order.
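The per-head convergence criterion described above can be sketched as follows (the convergence test itself is left abstract, since the patent does not fix one; function names are illustrative):

```python
def train_until_converged(heads, converged, train_one_epoch):
    """Train each head until its own loss converges; a head whose loss
    has converged is removed from the active set and no longer updated,
    while the remaining heads keep training."""
    active = set(heads)
    while active:
        for head in list(active):
            train_one_epoch(head)
            if converged(head):
                active.remove(head)  # stop training this head only
```

The same loop applies to the adjustment networks in the retraining stage, since their losses are likewise independent.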
In the embodiment of the invention, the whole multi-task learning network is fine-tuned by a semi-supervised learning method (using both samples with determined recognition results and samples with undetermined recognition results as training samples), which can effectively improve the generalization capability of the model and thus the accuracy of the perception results it outputs in specific applications.
Further, screening the samples with undetermined recognition results to obtain the screened samples includes: inputting the samples with undetermined recognition results into the retrained multi-task learning network, and outputting the perception results of these samples together with the confidence of each perception result; and selecting the samples whose confidence is greater than a preset confidence threshold as the screened samples.
The confidence refers to the reliability of the perception result output by the multi-task learning network: the larger the confidence value, the more reliable the perception result output when a sample with an undetermined recognition result is input into the network; the smaller the confidence value, the less reliable that perception result.
In the embodiment of the invention, the retrained multi-task learning network is used to screen the samples with undetermined recognition results, and the screened samples then serve as part of the training data for further adjusting the whole multi-task learning network, which can effectively improve its generalization capability. Moreover, since the samples with undetermined recognition results do not need to be manually labeled, the labor cost of constructing the training sample set can be greatly reduced.
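The screening step can be sketched as follows (the threshold value 0.9 and all names are assumptions; the patent only requires a confidence above some preset threshold):

```python
CONF_THRESHOLD = 0.9  # assumed value; the patent leaves the threshold unspecified

def screen_samples(model, unlabeled_samples):
    """Keep only unlabeled samples whose predicted perception result
    carries a confidence above the threshold; the prediction becomes
    the sample's pseudo-label for the fine-tuning stage."""
    screened = []
    for sample in unlabeled_samples:
        perception, confidence = model(sample)
        if confidence > CONF_THRESHOLD:
            screened.append((sample, perception))
    return screened
```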
It should be noted that, in the implementation of the above steps S12, S13, and S14, the parameters of the feature network layer, of the adjustment network layer, and of the task head network layer may be the conventional parameters of the specific network structures used. For example, when the feature network adopts the darknet structure from the single-step strategy algorithm YOLO, the parameters of the feature network layer may be the parameters of the darknet structure; when each adjustment network in the adjustment network layer adopts a bottleneck network structure, the parameters of the adjustment network layer may be the parameters of the bottleneck network structure itself.
Specifically, parameters of a feature network layer, parameters of an adjustment network layer and parameters of a task head network layer can be continuously optimized in training; before training, parameters of each network layer can be set with different initial values according to different application scenarios, and the setting of the initial values of the parameters in the adopted specific network structure is not limited in the embodiment of the invention.
In the specific implementation of step S15, the image to be perceived is input into the adjusted multitask learning network, and a perception result is obtained.
In a specific embodiment of the present invention, the image to be perceived may be an image captured by a camera in an automatic driving process, and the image to be perceived is input to the adjusted multitask learning network, and the obtained perception result may be a recognition result of an object in a surrounding environment. The camera may be located in front of the cab, or may be located at another suitable position on the vehicle.
Referring to fig. 3, fig. 3 is a flowchart of another sensing method based on a multitask learning network according to an embodiment of the present invention. The other sensing method based on the multitask learning network may include steps S31 to S35, which are described below.
In step S31, a training sample set is constructed, which contains samples with determined recognition results and samples with undetermined recognition results.
In step S32, a loss function of the multi-task learning network is set, then the samples with the determined recognition results are input into the whole multi-task learning network for initial training until the loss function of the multi-task learning network converges, and training is stopped to obtain the preferred feature network layer parameters and the initially trained multi-task learning network.
In step S33, a loss function of each adjustment network is set, and the parameters of the feature network layer are fixed as the preferred parameters of the feature network layer, then the samples with the determined recognition results are input to the initially trained multi-task learning network for retraining until the loss functions of each adjustment network are all converged, and training is stopped to obtain the preferred parameters of the adjustment network layer and the retrained multi-task learning network.
In step S34, a loss function of each task head network is set, and the parameters of the feature network layer are respectively fixed as the parameters of the preferred feature network layer, and the parameters of the adjustment network layer are fixed as the parameters of the preferred adjustment network layer, and then the samples with the determined recognition results and the filtered samples are input into the retrained multi-task learning network for training until the loss functions of each task head network are all converged, and the training is stopped to obtain the adjusted multi-task learning network.
In step S35, the image to be perceived is input into the adjusted multitask learning network, and a perception result is obtained.
In a specific implementation, for more details regarding steps S31 to S35, please refer to the foregoing description, which is not repeated here.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a sensing device based on a multitask learning network according to an embodiment of the present invention. The sensing device based on the multitask learning network can comprise:
a training sample set constructing module 41, configured to construct the training sample set by using samples with determined recognition results and samples with undetermined recognition results;
the initial training module 42 is used for inputting the samples with determined recognition results into the whole multi-task learning network and determining, through training, the preferred feature network layer parameters and the initially trained multi-task learning network;
a retraining module 43, configured to fix the parameters of the feature network layer as the preferred feature network layer parameters, input the sample with the determined recognition result into the initially trained multi-task learning network, and obtain preferred adjusted network layer parameters and a retrained multi-task learning network through training;
a network adjusting module 44, configured to fix the parameter of the feature network layer as the preferred feature network layer parameter, fix the parameter of the adjusting network layer as the preferred adjusting network layer parameter, input the sample with the determined recognition result and the filtered sample obtained by filtering the sample without the determined recognition result into the retrained multitask learning network, and obtain the adjusted multitask learning network after training;
and a perception result determining module 45, configured to input the image to be perceived into the adjusted multitask learning network, so as to obtain a perception result.
For the principle, specific implementation and beneficial effects of the sensing apparatus based on the multitask learning network, please refer to the foregoing and the related description about the sensing method based on the multitask learning network shown in fig. 1 to 3, which will not be described herein again.
The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program executes the steps of the sensing method based on the multitask learning network. The computer-readable storage medium may include a non-volatile memory (non-volatile) or a non-transitory memory, and may further include an optical disc, a mechanical hard disk, a solid state hard disk, and the like.
Specifically, in the embodiment of the present invention, the processor may be a Central Processing Unit (CPU), and the processor may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It will also be appreciated that the memory in the embodiments of the present application can be either volatile memory or nonvolatile memory, or can include both. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. Volatile memory can be random access memory (RAM), which acts as external cache memory. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
The embodiment of the invention also provides a terminal, which comprises a memory and a processor, wherein the memory is stored with a computer program capable of running on the processor, and the processor executes the steps of the sensing method based on the multitask learning network when running the computer program. The terminal can include but is not limited to a mobile phone, a computer, a tablet computer and other terminal devices, and can also be a server, a cloud platform and the like.
The "plurality" appearing in the embodiments of the present application means two or more.
The descriptions of the first, second, etc. appearing in the embodiments of the present application are only for illustrating and differentiating the objects, and do not represent the order or the particular limitation of the number of the devices in the embodiments of the present application, and do not constitute any limitation to the embodiments of the present application.
It should be noted that the sequence numbers of the steps in this embodiment do not represent a limitation on the execution sequence of the steps.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected by one skilled in the art without departing from the spirit and scope of the invention, as defined in the appended claims.

Claims (13)

1. A perception method based on a multitask learning network is characterized in that the multitask learning network comprises a feature network layer, an adjusting network layer and a task head network layer, wherein the adjusting network layer comprises a plurality of adjusting networks, the task head network layer comprises a plurality of task head networks, and the adjusting networks correspond to the task head networks one to one;
the method comprises the following steps:
constructing a training sample set, wherein the training sample set comprises samples with determined recognition results and samples with undetermined recognition results;
inputting the samples with determined recognition results into the whole multi-task learning network, and determining, through training, the preferred feature network layer parameters and the initially trained multi-task learning network;
fixing the parameters of the characteristic network layer as the optimized parameters of the characteristic network layer, inputting the samples with the determined recognition results into the initially trained multi-task learning network, and obtaining optimized parameters of the adjusting network layer and the retrained multi-task learning network after training;
fixing the parameters of the characteristic network layer as the preferred parameters of the characteristic network layer, fixing the parameters of the adjusting network layer as the preferred parameters of the adjusting network layer, inputting the samples with the determined recognition results and the screening samples obtained by screening the samples without the determined recognition results into the retrained multi-task learning network, and obtaining the adjusted multi-task learning network after training; and inputting the image to be perceived into the adjusted multi-task learning network to obtain a perception result.
2. The method of claim 1, wherein the step of inputting the samples of the determined recognition results into the whole multi-task learning network, and the step of determining the preferred feature network layer parameters and the multi-task learning network after the initial training comprises the following steps:
setting a loss function of the multitask learning network;
and inputting the samples with the determined recognition results into the whole multi-task learning network to perform initial training on the multi-task learning network until the loss function of the multi-task learning network is converged, stopping training and obtaining the optimal feature network layer parameters and the multi-task learning network after initial training.
3. The method of claim 1, wherein the step of inputting the samples of the determined recognition results into the whole multi-task learning network, and the step of determining the preferred feature network layer parameters and the multi-task learning network after the initial training comprises the following steps:
setting a loss function of the multitask learning network;
and inputting the samples with the determined recognition results into the whole multi-task learning network to perform initial training on the multi-task learning network, wherein in the initial training process, the parameters of the feature network layer are updated once when the weight value of each task head network is updated once until the loss function of the multi-task learning network is converged, and the training is stopped to obtain the optimal feature network layer parameters and the multi-task learning network after the initial training.
4. The method of claim 3, wherein the loss function of the multi-task learning network is:

Loss = Σ_{k=1}^{K} 1/(2σ_k²) · L_k

wherein Loss represents the loss function value of the whole multi-task learning network, 1/(2σ_k²) represents the weight value of the k-th task head network, K represents the number of task head networks, and L_k represents the loss function value of the k-th task head network.
5. The method of claim 1, wherein fixing the parameters of the feature network layer as the preferred feature network layer parameters, inputting the samples with the determined recognition results into the initially trained multi-task learning network, and obtaining the preferred adjusted network layer parameters and the retrained multi-task learning network after training comprises:
setting a loss function of each adjusting network, and fixing the parameters of the characteristic network layer as the optimal parameters of the characteristic network layer;
and inputting the samples with the determined recognition results into the initially trained multi-task learning network to retrain the initially trained multi-task learning network until loss functions of all the adjusting networks are converged, stopping training and obtaining the optimal adjusting network layer parameters and the retrained multi-task learning network.
6. The method of claim 1, wherein screening the sample with undetermined recognition results to obtain the screened sample comprises:
inputting the samples of the undetermined recognition results into the retrained multitask learning network, and outputting the perception results of the samples of the undetermined recognition results and the confidence degrees of the perception results;
and selecting a sample with the confidence coefficient larger than a preset confidence coefficient threshold value as the screening sample.
7. The method according to claim 1, wherein the fixing the parameters of the feature network layer as the preferred feature network layer parameters and the fixing the parameters of the adjustment network layer as the preferred adjustment network layer parameters, and the inputting the samples with the determined recognition results and the filtered samples obtained by filtering the samples without the determined recognition results into the retrained multi-task learning network, and the training to obtain the adjusted multi-task learning network comprises:
setting a loss function of each task head network, and respectively fixing the parameters of the characteristic network layer as the preferred parameters of the characteristic network layer, and fixing the parameters of the adjusting network layer as the preferred parameters of the adjusting network layer;
and inputting the samples with the determined recognition results and the screening samples into the retrained multi-task learning network to train the retrained multi-task learning network until the loss functions of the task head networks are all converged, and stopping training to obtain the adjusted multi-task learning network.
8. The method of claim 1, wherein the multi-task learning network satisfies one or more of the following:
the feature network structure in the feature network layer adopts the darknet structure of the single-stage detection algorithm YOLO;
the structure of each adjusting network in the adjusting network layer adopts a bottleneck network structure.
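Claim 8 names the bottleneck structure without detailing it. The following pure-Python sketch illustrates the standard 1x1 → 3x3 → 1x1 bottleneck design and why it keeps each adjusting network lightweight; the channel widths are illustrative assumptions, not values from the patent.

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k x k convolution (bias terms ignored)."""
    return c_in * c_out * k * k

def bottleneck_weights(c, reduce_to):
    """Standard bottleneck: a 1x1 conv reduces the channel count, a 3x3
    conv operates at the reduced width, and a 1x1 conv expands it back."""
    return (conv_weights(c, reduce_to, 1)
            + conv_weights(reduce_to, reduce_to, 3)
            + conv_weights(reduce_to, c, 1))

c = 256                               # illustrative channel width
plain = conv_weights(c, c, 3)         # one plain 3x3 block at full width
slim = bottleneck_weights(c, c // 4)  # bottleneck reduced to 64 channels

print(plain, slim)  # 589824 69632
```

At these (assumed) widths the bottleneck uses roughly 8x fewer weights than a plain 3x3 block, which matters when one adjusting network is kept per task.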
9. The method of claim 1, wherein the task head network layer comprises one or more of the following:
a semantic segmentation head, a low obstacle segmentation head, a target detection head, a garbage detection head, and a traffic light detection head.
10. The method of claim 9, wherein the task head networks satisfy one or more of the following:
the semantic segmentation head adopts a segmentation mask SegMaskPSP structure;
the low obstacle segmentation head adopts a segmentation mask SegMaskPSP structure;
the target detection head adopts a fully convolutional network FCN structure;
the garbage detection head adopts a fully convolutional network FCN structure;
the traffic light detection head adopts a fully convolutional network FCN structure.
11. A perception device based on a multi-task learning network, characterized by comprising:
a training sample set construction module, configured to construct a training sample set comprising samples with determined recognition results and samples with undetermined recognition results;
an initial training module, configured to input the samples with determined recognition results into the whole multi-task learning network and, after training, determine the preferred feature network layer parameters and the initially trained multi-task learning network;
a retraining module, configured to fix the parameters of the feature network layer as the preferred feature network layer parameters, input the samples with determined recognition results into the initially trained multi-task learning network, and obtain, after training, the preferred adjusting network layer parameters and the retrained multi-task learning network;
a network adjusting module, configured to fix the parameters of the feature network layer as the preferred feature network layer parameters, fix the parameters of the adjusting network layer as the preferred adjusting network layer parameters, input the samples with determined recognition results and the screened samples obtained by screening the samples with undetermined recognition results into the retrained multi-task learning network, and obtain the adjusted multi-task learning network after training;
and a perception result determining module, configured to input an image to be perceived into the adjusted multi-task learning network to obtain a perception result.
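The module sequence recited in claim 11 can be summarized as a staged training pipeline. In the sketch below, `StubNetwork` and all its method names are hypothetical stand-ins, not structures defined by the patent; it only records which layers are frozen at each stage.

```python
class StubNetwork:
    """Minimal stand-in for the multi-task learning network: it logs each
    training stage together with the set of layers frozen at that point."""
    def __init__(self):
        self.frozen = set()
        self.stages = []
    def freeze(self, layer):
        self.frozen.add(layer)
    def train(self, samples, stage):
        self.stages.append((stage, sorted(self.frozen)))
    def predict(self, sample):
        return ("label", 0.9)  # dummy perception result and confidence

def run_pipeline(labeled, unlabeled, net, threshold=0.8):
    # Initial training module: whole network on determined-result samples.
    net.train(labeled, "initial")
    # Retraining module: feature layer fixed, adjusting layer trained.
    net.freeze("feature")
    net.train(labeled, "retrain")
    # Screening: keep unlabeled samples whose confidence exceeds the threshold.
    screened = [x for x in unlabeled if net.predict(x)[1] > threshold]
    # Network adjusting module: feature and adjusting layers both fixed,
    # task heads trained on labeled plus screened (pseudo-labeled) samples.
    net.freeze("adjust")
    net.train(labeled + screened, "adjust")
    return net

net = run_pipeline(["a", "b"], ["u1"], StubNetwork())
print(net.stages)
# [('initial', []), ('retrain', ['feature']), ('adjust', ['adjust', 'feature'])]
```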
12. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the perception method based on a multi-task learning network according to any one of claims 1 to 10.
13. A terminal comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor, when executing the computer program, performs the steps of the perception method based on a multi-task learning network according to any one of claims 1 to 10.
CN202111677833.8A 2021-12-31 2021-12-31 Sensing method and device based on multitask learning network, storage medium and terminal Pending CN114519381A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111677833.8A CN114519381A (en) 2021-12-31 2021-12-31 Sensing method and device based on multitask learning network, storage medium and terminal

Publications (1)

Publication Number Publication Date
CN114519381A true CN114519381A (en) 2022-05-20

Family

ID=81597358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111677833.8A Pending CN114519381A (en) 2021-12-31 2021-12-31 Sensing method and device based on multitask learning network, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN114519381A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018015080A1 (en) * 2016-07-19 2018-01-25 Siemens Healthcare Gmbh Medical image segmentation with a multi-task neural network system
US9928448B1 (en) * 2016-09-23 2018-03-27 International Business Machines Corporation Image classification utilizing semantic relationships in a classification hierarchy
CN108665065A (en) * 2018-04-25 2018-10-16 清华大学 Processing method, device, equipment and the storage medium of task data
CN111160268A (en) * 2019-12-30 2020-05-15 北京化工大学 Multi-angle SAR target recognition method based on multi-task learning
US20200286383A1 (en) * 2019-03-07 2020-09-10 Nec Laboratories America, Inc. Multi-task perception network with applications to scene understanding and advanced driver-assistance system
CN111753862A (en) * 2019-03-29 2020-10-09 北京地平线机器人技术研发有限公司 Method and device for training neural network model and image recognition method
CN111862067A (en) * 2020-07-28 2020-10-30 中山佳维电子有限公司 Welding defect detection method and device, electronic equipment and storage medium
CN111931929A (en) * 2020-07-29 2020-11-13 深圳地平线机器人科技有限公司 Training method and device of multi-task model and storage medium
WO2021007812A1 (en) * 2019-07-17 2021-01-21 深圳大学 Deep neural network hyperparameter optimization method, electronic device and storage medium
CN112418236A (en) * 2020-11-24 2021-02-26 重庆邮电大学 Automobile drivable area planning method based on multitask neural network
WO2021151296A1 (en) * 2020-07-22 2021-08-05 平安科技(深圳)有限公司 Multi-task classification method and apparatus, computer device, and storage medium
US20210387646A1 (en) * 2020-06-11 2021-12-16 Beihang University Visual perception method and apparatus, perception network training method and apparatus, device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Xiaochun; Chen Lian: "Research on handwritten image classification and recognition based on deep learning", Jiangxi Communication Technology, no. 04, 15 December 2016 (2016-12-15) *
Liu Chen; Qu Changwen; Zhou Qiang; Li Zhi; Li Jianwei: "SAR image target classification based on convolutional neural network transfer learning", Modern Radar, no. 03, 15 March 2018 (2018-03-15) *

Similar Documents

Publication Publication Date Title
CN107292386B (en) Vision-based rain detection using deep learning
CN113221905B (en) Semantic segmentation unsupervised domain adaptation method, device and system based on uniform clustering and storage medium
CN111738098B (en) Vehicle identification method, device, equipment and storage medium
US20180114071A1 (en) Method for analysing media content
CN111814621A (en) Multi-scale vehicle and pedestrian detection method and device based on attention mechanism
KR20170140214A (en) Filter specificity as training criterion for neural networks
KR20210013216A (en) Multi-level target classification and traffic sign detection method and apparatus, equipment, and media
US11087142B2 (en) Recognizing fine-grained objects in surveillance camera images
CN110598511A (en) Method, device, electronic equipment and system for detecting green light running event
CN112149476B (en) Target detection method, device, equipment and storage medium
CN111666800A (en) Pedestrian re-recognition model training method and pedestrian re-recognition method
CN113343985B (en) License plate recognition method and device
CN109063790B (en) Object recognition model optimization method and device and electronic equipment
CN112163545A (en) Head feature extraction method and device, electronic equipment and storage medium
CN112686252A (en) License plate detection method and device
CN112101114A (en) Video target detection method, device, equipment and storage medium
Sayeed et al. Bangladeshi traffic sign recognition and classification using cnn with different kinds of transfer learning through a new (btsrb) dataset
WO2022219402A1 (en) Semantically accurate super-resolution generative adversarial networks
KR102260246B1 (en) Mehod and Device of configuring deep learning algorithm for autonomous driving
CN111242176B (en) Method and device for processing computer vision task and electronic system
CN114519381A (en) Sensing method and device based on multitask learning network, storage medium and terminal
CN111179610A (en) Control method and device of traffic signal equipment
US20220114717A1 (en) Distortion-based filtering for image classification
CN114972725A (en) Model training method, readable medium and electronic device
CN114092818A (en) Semantic segmentation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination