CN113591573A - Training and target detection method and device for multi-task learning deep network model - Google Patents

Training and target detection method and device for multi-task learning deep network model

Info

Publication number
CN113591573A
Authority
CN
China
Prior art keywords
network, task, feature, head, output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110723220.7A
Other languages
Chinese (zh)
Inventor
杨喜鹏 (Yang Xipeng)
谭啸 (Tan Xiao)
孙昊 (Sun Hao)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110723220.7A
Publication of CN113591573A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/29 Graphical models, e.g. Bayesian networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a method and a device for training a multi-task learning deep network model and detecting a target, and relates to the technical fields of computer vision and deep learning. The specific implementation scheme is as follows: acquiring training data for multi-task learning in a target detection scene; inputting an image sample into a backbone network of an Anchor-Free-based multi-task learning deep network model to obtain a feature map output by the backbone network; inputting the feature map into a feature pyramid network of the multi-task learning deep network model to obtain multi-scale feature maps output by the feature pyramid network; inputting the multi-scale feature maps into a Head network of the multi-task learning deep network model to learn each task, and obtaining the prediction result output by the Head network for each task; and training the multi-task learning deep network model according to the prediction results output by the Head network and the label of each task on the image sample. The method and the device can improve the representation capability of the network.

Description

Training and target detection method and device for multi-task learning deep network model
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to the technical fields of computer vision and deep learning, is applicable in intelligent traffic scenarios, and in particular relates to a method and a device for training a multi-task learning deep network model and detecting a target.
Background
Definition of multi-task learning: a machine learning method that learns multiple related tasks together based on a shared representation. Multi-task learning is also an inductive transfer learning method, in which the main task uses the domain-related information carried by the training signals of related tasks as an inductive bias to improve its generalization.
In current multi-task learning methods, most detection tasks adopt an anchor-based scheme: the network learns features at multiple levels through supervision information and losses, enhancing the representation capability of the features so that the final prediction achieves higher accuracy. However, this approach requires preset anchor boxes: it detects whether an object is present in each preset box and, if so, adjusts the predicted box relative to the preset box to obtain the final object detection box.
Disclosure of Invention
The disclosure provides a method and a device for training a multi-task learning deep network model and detecting a target.
According to a first aspect of the present disclosure, there is provided a training method for a multitask learning deep network model, including:
acquiring training data of multi-task learning in a target detection scene; wherein the training data comprises an image sample and a label corresponding to each task on the image sample;
inputting the image sample into a backbone network in the multitask learning deep network model to obtain a feature map output by the backbone network; the multitask learning deep network model is based on Anchor-Free;
inputting the feature map into a feature pyramid network in the multitask learning deep network model to obtain a multi-scale feature map output by the feature pyramid network;
inputting the multi-scale feature map into a Head network in the multi-task learning deep network model to learn each task, and obtaining a prediction result output by the Head network and corresponding to each task;
and training the multi-task learning deep network model according to the prediction result output by the Head network and the label of each task on the image sample.
Wherein, in some embodiments of the present disclosure, the backbone network comprises a plurality of first feature extraction units; the inputting the image sample to a backbone network in the multitask learning deep network model to obtain a feature map output by the backbone network comprises:
inputting the image samples to the backbone network;
and obtaining a first feature map output by each first feature extraction unit in the backbone network.
In some embodiments of the present disclosure, the feature pyramid network comprises a plurality of second feature extraction units; the inputting the feature map into a feature pyramid network in the multitask learning deep network model to obtain a multi-scale feature map output by the feature pyramid network comprises the following steps:
inputting the first feature map output by each first feature extraction unit in the backbone network into the feature pyramid network;
and performing feature fusion in a top-down dense-connection mode on the first feature maps output by the corresponding first feature extraction units through the plurality of second feature extraction units in the feature pyramid network to obtain the feature map corresponding to each scale.
In some embodiments of the present disclosure, the Head network comprises a plurality of Head subnetworks; the step of inputting the multi-scale feature map into a Head network in the multi-task learning deep network model to learn each task and obtain a prediction result output by the Head network and corresponding to each task includes:
inputting the multi-scale feature map into a corresponding Head sub-network to learn each task;
obtaining a prediction result corresponding to each task output by each Head sub-network;
and fusing the prediction results corresponding to each task output by each Head sub-network to obtain the prediction results corresponding to each task output by the Head network.
Wherein, in some embodiments of the present disclosure, the multitask learning comprises:
a target center point positioning task, a target corner point positioning task, a target boundary box prediction task and a target characteristic point positioning task.
According to a second aspect of the present disclosure, there is provided an object detection method, comprising:
acquiring an image of the surroundings of the vehicle;
inputting the image to a trained multi-task learning deep network model; the multitask learning deep network model is based on Anchor-Free and comprises a backbone network, a characteristic pyramid network and a Head network; the Head network comprises a plurality of Head sub-networks, and each Head sub-network predicts each task respectively;
obtaining a prediction result corresponding to each task output by the multi-task learning deep network model;
and determining a detection frame, a boundary frame and feature points of the target in the image according to the prediction result corresponding to each task.
According to a third aspect of the present disclosure, there is provided a training apparatus for a multitask learning deep network model, including:
the first acquisition module is used for acquiring training data of multi-task learning in a target detection scene; wherein the training data comprises an image sample and a label corresponding to each task on the image sample;
the second acquisition module is used for inputting the image sample to a backbone network in the multitask learning deep network model to acquire a feature map output by the backbone network; the multitask learning deep network model is based on Anchor-Free;
the third obtaining module is used for inputting the feature map into a feature pyramid network in the multitask learning deep network model and obtaining a multi-scale feature map output by the feature pyramid network;
the fourth obtaining module is used for inputting the multi-scale feature map into a Head network in the multi-task learning deep network model to learn each task, and obtaining a prediction result output by the Head network and corresponding to each task;
and the training module is used for training the multi-task learning deep network model according to the prediction result output by the Head network for each task and the label of each task on the image sample.
According to a fourth aspect of the present disclosure, there is provided an object detection apparatus comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring an image of the surrounding environment of the vehicle;
the input module is used for inputting the image to the trained multi-task learning deep network model; the multitask learning deep network model is based on Anchor-Free and comprises a backbone network, a characteristic pyramid network and a Head network; the Head network comprises a plurality of Head sub-networks, and each Head sub-network predicts each task respectively;
the second acquisition module is used for acquiring a prediction result corresponding to each task output by the multi-task learning deep network model;
and the determining module is used for determining a detection frame, a boundary frame and feature points of the target in the image according to the prediction result corresponding to each task.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first and/or second aspects.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first and/or second aspect.
According to the technical solution of the disclosure, an Anchor-Free-based multi-task learning deep network model is constructed, and the prediction result for each task is obtained through the backbone network, the feature pyramid network and the Head network. The advantages of Anchor-Free can thus be fully utilized: feature expression is enriched, the effect of the model is improved, detection efficiency in a target detection scene is increased, and the consumption of computational power is reduced.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of single task learning and multitask learning;
FIG. 2 is a flowchart of a training method for a multi-task learning deep network model according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a multitask learning deep network model proposed according to an embodiment of the present disclosure;
fig. 4 is a flow chart of a Head network according to an embodiment of the disclosure;
fig. 5 is a flow chart of a method for object detection according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a training apparatus for a multi-task learning deep network model according to an embodiment of the present disclosure;
fig. 7 is a block diagram of a target detection apparatus according to an embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device for implementing an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that multi-task learning is defined as a machine learning method that learns multiple related tasks together based on a shared representation. Multi-task learning is also an inductive transfer learning method, in which the main task uses the domain-related information carried by the training signals of related tasks as an inductive bias to improve its generalization. Fig. 1 is a schematic diagram of single-task learning and multi-task learning. As shown in fig. 1, in multi-task learning multiple related tasks are learned simultaneously and in parallel, with their gradients back-propagated at the same time; the tasks help each other learn through a shared bottom-layer representation, thereby improving generalization. In brief, multi-task learning puts multiple related tasks together for learning: through shallow sharing during the learning process, the learned domain-related information is shared and mutually complemented, so that the tasks promote each other and the generalization effect improves.
The advantages of multi-task learning include the following. (1) Multiple related tasks are learned together; they have related parts but also unrelated parts. When one task is learned, the parts unrelated to that task act as noise in its learning process, and introducing such noise can improve the generalization of learning. (2) In single-task learning, back-propagation of the gradient tends to fall into local minima. In multi-task learning, the local minima of different tasks lie in different positions, and through their interaction the hidden layers can be helped to escape from local minima. (3) The added tasks change the dynamics of weight updating; for example, parallel multi-task learning effectively increases the learning rate of the shallow shared layers, and a larger learning rate may improve the learning effect. (4) Because multiple tasks share a shallow representation, the capacity available to each task is reduced, which lowers network overfitting and improves generalization.
In current multi-task learning methods, most detection tasks adopt an Anchor-Based scheme, so that the network learns features at multiple levels through supervision information and loss values, enhancing the representation capability of the features so that the final prediction achieves higher accuracy. However, the hyper-parameter setting of models in the Anchor-Based scheme is difficult, the model complexity is high, and the required computational power is large.
In view of the above problems, the present disclosure provides a method and an apparatus for training a multi-task learning deep network model and detecting a target in intelligent traffic scenarios.
Fig. 2 is a training method of a multitask learning deep network model according to an embodiment of the present disclosure. It should be noted that the training method for the deep network model for multitask learning proposed in the embodiments of the present disclosure may be applied to a training apparatus for the deep network model for multitask learning in the embodiments of the present disclosure, and the apparatus may be configured in an electronic device. As shown in fig. 2, the training method of the multitask learning deep network model includes the following steps:
step 201, training data of multi-task learning in a target detection scene is obtained. The training data includes image samples and labels corresponding to the image samples for each task.
It can be understood that when the multi-task learning deep network model is trained, training data of multi-task learning under a corresponding target detection scene needs to be acquired first. Because of the multi-task learning, the training data needs to include image samples and labels corresponding to the image samples for each task, so that when the model is trained, the model parameters can be adjusted by calculating loss values according to the prediction results and the corresponding actual labels.
For example, in a scenario of detecting whether an obstacle exists in a vehicle driving path, assuming that the multi-task learning includes positioning of an obstacle center point, positioning of an obstacle corner point, prediction of an obstacle bounding box, and the like, the training data includes image samples within a vehicle driving view range, and an obstacle center point positioning tag, an obstacle corner point positioning tag, an obstacle bounding box prediction tag, and the like in the image samples.
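By way of a non-limiting illustration, the pairing of an image sample with one label per task described above might be organized as in the following sketch (the function and field names are assumptions made for illustration, not part of the disclosure):

```python
def make_training_sample(image, center_labels, corner_labels, bbox_labels):
    """Bundle one image sample with the per-task annotation labels that
    the loss computation will later compare against predictions."""
    return {
        "image": image,
        "labels": {
            "center_point": center_labels,   # obstacle center-point locations
            "corner_point": corner_labels,   # obstacle corner locations
            "bounding_box": bbox_labels,     # obstacle boxes as (x, y, w, h)
        },
    }

sample = make_training_sample(
    image=[[0.0] * 4 for _ in range(4)],     # placeholder 4x4 image
    center_labels=[(2, 2)],
    corner_labels=[(1, 1), (3, 3)],
    bbox_labels=[(2, 2, 2, 2)],
)
```

During training, each key under `"labels"` would pair with the matching task's prediction when computing that task's loss.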
Step 202, inputting the image sample into a backbone network in the multitask learning deep network model, and obtaining a feature map output by the backbone network. The multitask learning deep network model is based on Anchor-Free. Fig. 3 is a schematic structural diagram of a deep network model for multitask learning in an embodiment of the present disclosure.
It should be noted that current mainstream target detection algorithms, including various multi-stage deep convolutional neural networks as well as single-stage detectors such as SSD (Single Shot MultiBox Detector) and RetinaNet, are all built on the Anchor-Based scheme. The nature of an Anchor is a set of candidate boxes: after candidate boxes of different scales and aspect ratios are designed, the neural network learns how to classify them, namely whether they contain objects and of what kind, and how to regress them to the correct position. Their role is similar to mechanisms such as the sliding window in traditional detection algorithms. However, the hyper-parameter setting of the Anchor-Based method is difficult, the model complexity is high, and the required computational power is large.
In the embodiment of the disclosure, an Anchor-Free-based multi-task learning deep network model is adopted. The Anchor-Free method does not need to preset anchors: it only needs to regress the target center point and the target width and height on feature maps of different scales, which greatly reduces time consumption and the required computational power while maintaining high detection speed and precision.
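As an illustrative sketch of the Anchor-Free idea described above (the function name and the corner-format box convention are assumptions made for illustration), regressing a center point plus a width and height yields a detection box directly, with no preset anchor boxes involved:

```python
def decode_anchor_free(cx, cy, w, h):
    """Turn a regressed center point (cx, cy) plus width/height (w, h)
    into a corner-format box (x1, y1, x2, y2). No candidate boxes are
    designed in advance; the box comes straight from the regression."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# a center at (10, 10) with width 4 and height 6
box = decode_anchor_free(10, 10, 4, 6)
```

In an Anchor-Based scheme the same box would instead be expressed as offsets from one of many preset candidate boxes, which is what makes the hyper-parameter setting there difficult.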
As shown in fig. 3, images of different sizes contain different feature information, so in order to improve the expression of feature information, an image sample can be passed through multi-layer convolution operations to obtain feature maps of different sizes, where each feature map size corresponds to the extraction of different feature information. In the disclosed embodiment, the backbone network may come from the ResNet family (ResNet34, ResNet50, ResNet101, etc.) or the DarkNet family (DarkNet19, DarkNet53). In addition, a backbone of suitable size may be selected according to the application scenario: a lightweight structure such as ResNet18, ResNet34 or DarkNet19; a medium structure such as ResNet50, ResNeXt50 or DarkNet53; or a heavy structure such as ResNet101 or ResNeXt152.
As an example, the backbone network may include a plurality of first feature extraction units, where the image sample is input to the backbone network in the multitask learning deep network model, and a specific implementation manner of obtaining the feature map output by the backbone network may be: inputting the image samples to a backbone network; and obtaining a first feature map output by each first feature extraction unit in the backbone network.
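The behavior of the first feature extraction units can be sketched as follows, using 2x2 average pooling as a stand-in for a convolutional stage (a simplification: the actual backbone described above is a convolutional network such as ResNet or DarkNet, and the function names here are illustrative):

```python
def avg_pool_2x2(x):
    """2x2 average pooling over a 2-D list of floats, standing in for one
    feature-extraction stage that halves the spatial resolution."""
    h, w = len(x), len(x[0])
    return [
        [(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]

def backbone_features(image, num_stages=3):
    """Run the image through a chain of stages and collect every stage's
    output map, mirroring how each first feature extraction unit
    contributes one first feature map (C3..C5 in fig. 3)."""
    feats = []
    x = image
    for _ in range(num_stages):
        x = avg_pool_2x2(x)
        feats.append(x)
    return feats

maps = backbone_features([[1.0] * 32 for _ in range(32)])
sizes = [(len(m), len(m[0])) for m in maps]  # halves each stage: 16, 8, 4
```

Each entry of `maps` is one scale's feature map, exactly the set of outputs the feature pyramid network consumes next.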
And 203, inputting the feature map into a feature pyramid network in the multitask learning deep network model to obtain a multi-scale feature map output by the feature pyramid network.
It should be noted that the feature pyramid network is a method for efficiently extracting features of each dimension in an image using a conventional convolutional neural network model. In computer vision, multi-scale target detection has traditionally generated feature combinations reflecting different dimensional information by taking reduced or enlarged pictures of different sizes as input; this approach can effectively express the features at each scale of the picture, but places high demands on hardware computing capacity and memory. The feature pyramid network is a feature extractor designed according to the feature pyramid concept; it combines shallow and deep semantic information at the same time, improving both precision and speed. The network uses the bottom-up, layer-by-layer feature expression of the conventional convolutional neural network for different dimensions of a same-size picture, which enhances the expression and output of image information by the traditional convolutional neural network, thereby improving its feature extraction and enabling the final output features to better represent the information of each dimension of the input image.
In the embodiment of the present disclosure, the feature pyramid network includes a plurality of second feature extraction units, and one implementation of obtaining the multi-scale feature maps output by the feature pyramid network may be: inputting the first feature map output by each first feature extraction unit in the backbone network into the feature pyramid network; and performing feature fusion in a top-down dense-connection mode on the first feature maps output by the corresponding first feature extraction units through the plurality of second feature extraction units, to obtain the feature map corresponding to each scale.
Specifically, as shown in fig. 3, C3 to C5 are the first feature maps output by the first feature extraction units, and P3 to P7 are the multi-scale feature maps output by the second feature extraction units. As an example, a specific way for a second feature extraction unit to perform top-down dense-connection feature fusion on the corresponding first feature map may be as follows: the C4 feature map undergoes a dimensionality-reduction operation (i.e., a 1x1 convolutional layer is added), the P5 feature map undergoes an upsampling operation so that the two have matching dimensions, an addition operation (element-wise addition) is then performed on the processed C4 and P5 feature maps, and the result is taken as the P4 feature map. Accumulating the processed low-layer and high-layer features in this way provides more accurate position information from the shallow layers: because of the repeated down-sampling and up-sampling operations, the positioning information of the deep network carries errors, so combining the two and building a deeper feature pyramid fuses multi-layer feature information.
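The fusion step just described (1x1 projection, upsampling, element-wise addition) can be sketched in plain Python as follows; the 1x1 convolution is assumed to have already been applied to the lower-level map, so only the upsample-and-add arithmetic is shown, with illustrative names:

```python
def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a 2-D list, a stand-in for the
    upsampling operation applied to the higher-level feature map."""
    out = []
    for row in x:
        wide = [v for v in row for _ in (0, 1)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                   # duplicate each row
    return out

def fuse_top_down(c_lower, p_upper):
    """One top-down fusion step: upsample the higher-level map to the
    lower level's size, then add the two maps element-wise."""
    up = upsample2x(p_upper)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(c_lower, up)]

c4 = [[1.0] * 8 for _ in range(8)]   # processed lower-level map (8x8)
p5 = [[2.0] * 4 for _ in range(4)]   # higher-level map (4x4)
p4 = fuse_top_down(c4, p5)           # fused map at the lower level's size
```

Repeating this step level by level yields the whole top-down chain from P5 down to P3.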
And 204, inputting the multi-scale feature map into a Head network in the multi-task learning deep network model to learn each task, and obtaining a prediction result output by the Head network and corresponding to each task.
As an example, in an embodiment of the present disclosure, the multi-task learning may include a target center point positioning task, a target corner point positioning task, a target bounding box prediction task and a target feature point positioning task; correspondingly, the Head network can comprise a target center point positioning sub-network, a target corner positioning sub-network, a target bounding box prediction sub-network and a target feature point positioning sub-network. The target corner positioning sub-network builds on the target center point positioning sub-network, the target bounding box prediction sub-network builds on the target corner positioning sub-network, and the target feature point positioning sub-network builds on the target bounding box prediction sub-network. Furthermore, in embodiments of the present disclosure, the target center point positioning sub-network may use a heatmap technique for its learning.
In the embodiment of the disclosure, each multi-scale feature map is input to the Head network in the multi-task learning deep network model, so that a prediction result is output for each task. As an example, the prediction results output by the individual Head sub-networks may be fused to obtain the final prediction result for each task.
Step 205, training a multi-task learning deep network model according to the prediction result output by the Head network and the label of each task on the image sample.
It can be understood that a loss value for each task can be calculated from the prediction result output by the Head network for that task and the corresponding label on the image sample. The parameters of the multi-task learning deep network model are then continuously adjusted according to these loss values until the calculated loss values meet expectations, at which point training of the model is complete.
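The per-task loss computation described above might be sketched as follows. The use of mean squared error and of simple per-task weights is an assumption made for illustration, since the disclosure only states that a loss value is calculated per task from predictions and labels:

```python
def multi_task_loss(predictions, labels, weights=None):
    """Combine per-task losses into one training scalar. Each task's loss
    is a mean squared error between its prediction vector and its label
    vector; tasks are weighted equally unless a weight dict is given."""
    total = 0.0
    for task, pred in predictions.items():
        target = labels[task]
        mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
        total += (weights or {}).get(task, 1.0) * mse
    return total

# hypothetical predictions and labels for two of the tasks
loss = multi_task_loss(
    predictions={"center": [0.9, 0.1], "bbox": [4.0, 4.0]},
    labels={"center": [1.0, 0.0], "bbox": [4.0, 6.0]},
)
```

Each gradient step would then adjust the shared backbone, feature pyramid and Head parameters to reduce this combined value.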
According to the training method for a multi-task learning deep network model provided by the embodiment of the disclosure, an Anchor-Free-based multi-task learning deep network model is constructed, and the prediction result for each task is obtained through the backbone network, the feature pyramid network and the Head network, so that the advantages of Anchor-Free can be fully utilized: feature expression is enriched, the effect of the model is improved, the efficiency of model training is improved, and the consumption of computational power is reduced.
In order to further describe the output process of the prediction result in the above embodiments, the present disclosure proposes another embodiment.
Fig. 4 is a flowchart of a Head network according to an embodiment of the disclosure. As shown in fig. 4, the step of inputting the multi-scale feature map into the Head network includes:
step 401, inputting the multi-scale feature map into a corresponding Head sub-network to perform learning of each task.
In embodiments of the present disclosure, multitask learning may include: the method comprises a target center point positioning task, a target corner point positioning task, a target boundary box prediction task and a target characteristic point positioning task, so that the Head network can comprise a plurality of Head sub-networks. Wherein each Head sub-network corresponds to the learning of each task, as an example, the plurality of Head sub-networks may include a target center point positioning sub-network, a target corner positioning sub-network, a target bounding box prediction sub-network, and a target feature point positioning sub-network.
And 402, obtaining a prediction result corresponding to each task output by each Head subnetwork.
And step 403, performing fusion processing on the prediction result corresponding to each task output by each Head subnetwork to obtain the prediction result corresponding to each task output by the Head network.
It can be understood that fusing the prediction results for each task output by the individual Head sub-networks is equivalent to fusing the Head outputs corresponding to feature maps of different scales, that is, fusing information between different levels, which can improve the accuracy of the prediction results and also the efficiency of model training. As an example, the fusion may be an averaging of the prediction results for each task output by each Head sub-network.
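The averaging fusion mentioned as an example can be sketched as follows, where each inner list holds one scale's prediction vector for the same task (the list layout and function name are illustrative assumptions):

```python
def fuse_predictions(per_scale_preds):
    """Average element-wise the same task's prediction vectors coming
    from the Head sub-network applied at each feature-map scale."""
    n = len(per_scale_preds)
    return [sum(vals) / n for vals in zip(*per_scale_preds)]

# predictions for one task from three scales of the feature pyramid
fused = fuse_predictions([[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]])
```

Averaging is only one possible fusion operator; the text presents it as an example, and a weighted or learned combination would slot into the same place.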
According to the training method for the multi-task learning deep network model provided by the embodiment of the disclosure, the Head network is divided into the plurality of Head sub-networks, and the prediction result corresponding to each task output by each Head sub-network is subjected to fusion processing to obtain the prediction result corresponding to each task output by the Head network, so that information among multiple levels is fused, the accuracy of the prediction result can be improved, and the efficiency of model training can also be improved.
On the basis of the above embodiment, the present disclosure also provides a target detection method.
Fig. 5 is a flowchart of a target detection method according to an embodiment of the present disclosure. It should be noted that the target detection method provided by the embodiments of the present disclosure can be applied to the target detection apparatus of the embodiments of the present disclosure, and the apparatus can be configured in an electronic device. As shown in fig. 5, the target detection method includes the following steps:
Step 501, acquiring an image of the environment surrounding the vehicle.
As an example, obstacle detection during vehicle driving is described below. Assuming that the method is configured in the on-board system of a vehicle, an image of the vehicle surroundings can be acquired by a vehicle-mounted camera.
Step 502, inputting the image into the trained multi-task learning deep network model.
It should be noted that the multitask learning deep network model is an Anchor-Free-based multitask learning deep network model, and the multitask learning deep network model includes a backbone network, a feature pyramid network and a Head network. The Head network comprises a plurality of Head sub-networks, and each Head sub-network predicts each task respectively.
The backbone network in the multi-task learning deep network model performs feature extraction on the image to obtain a first feature map; the feature pyramid network performs feature fusion on the first feature map in a top-down densely connected manner to obtain a feature map corresponding to each scale; and the Head network then produces the prediction result of each task.
Following the above example, in the scenario of obstacle detection during vehicle driving, the multiple tasks in the multi-task learning deep network model may be an obstacle center point positioning task, an obstacle corner point positioning task, an obstacle bounding box prediction task, and an obstacle feature point positioning task, and each task may correspond to one Head sub-network for prediction.
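The "top-down dense connection" fusion in the feature pyramid can be sketched as below. This is a minimal numpy sketch under assumptions: lateral 1x1 convolutions are taken as already applied (so all levels share a channel count), fusion is elementwise addition, and resizing is nearest-neighbour; "dense" is read as each level receiving connections from all coarser levels rather than only the adjacent one.

```python
import numpy as np

def upsample_to(x, h, w):
    """Nearest-neighbour resize of (C, H, W) to (C, h, w); assumes integer factors."""
    fh, fw = h // x.shape[1], w // x.shape[2]
    return x.repeat(fh, axis=1).repeat(fw, axis=2)

def top_down_dense_fpn(backbone_feats):
    """backbone_feats: list of (C, H, W) maps ordered coarse (top) -> fine (bottom).
    Each output level sums its own map with ALL coarser maps (dense top-down
    links), rather than only the adjacent level as in a standard FPN."""
    outputs = []
    for i, feat in enumerate(backbone_feats):
        fused = feat.copy()
        for higher in backbone_feats[:i]:  # every coarser level above this one
            fused += upsample_to(higher, *feat.shape[1:])
        outputs.append(fused)
    return outputs

c5 = np.ones((8, 4, 4))    # coarsest backbone feature map
c4 = np.ones((8, 8, 8))
c3 = np.ones((8, 16, 16))  # finest
p5, p4, p3 = top_down_dense_fpn([c5, c4, c3])
```

Each output map then goes to the Head network, which is why every task ends up with one prediction per scale.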
Step 503, obtaining a prediction result corresponding to each task output by the multi-task learning deep network model.
In the embodiments of the present disclosure, each Head sub-network outputs the prediction result of its corresponding task, and since the feature map of each scale is input into the Head sub-networks, each task corresponds to a plurality of prediction results. To improve accuracy, the plurality of prediction results corresponding to each task may be fused, for example by averaging, and the fused result may be used as the prediction result of the corresponding task.
For the above example, in a scene of detecting an obstacle during vehicle driving, the obtained prediction result corresponding to each task may be a positioning prediction result of a center point of the obstacle, a positioning prediction result of a corner point of the obstacle, a prediction result of a boundary frame of the obstacle, and a positioning prediction result of a feature point of the obstacle.
And step 504, determining a detection frame, a boundary frame and feature points of the target in the image according to the prediction result corresponding to each task.
Following the above example, the coordinates of the obstacle center point, the coordinates of the obstacle corner points, the obstacle bounding box, the obstacle detection frame (the circumscribed rectangle of the bounding box), and the obstacle feature points (from which attributes such as the obstacle type and size can be derived) may be determined from the positioning prediction result of the obstacle center point, the positioning prediction result of the obstacle corner points, the prediction result of the obstacle bounding box, and the positioning prediction result of the obstacle feature points, so that the vehicle can be controlled to react in time and avoid accidents.
According to the target detection method provided by the embodiments of the present disclosure, an image of the vehicle's surroundings is input into the multi-task learning deep network model to determine the detection frame, bounding box, and feature points of the target in the image. Because the model is Anchor-Free and obtains the prediction result of each task through the backbone network, the feature pyramid network, and the Head network, the advantages of Anchor-Free detection can be fully utilized, improving detection efficiency in the target detection scenario and reducing the consumption of computing power.
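The detection frame described above — the circumscribed rectangle of the predicted boundary frame — reduces to a simple min/max over the corner coordinates. A pure-Python sketch; the corner values and function name are illustrative assumptions.

```python
def circumscribed_rectangle(corners):
    """corners: list of (x, y) points of the predicted boundary frame.
    Returns (x_min, y_min, x_max, y_max): the axis-aligned detection frame
    that circumscribes the (possibly rotated) boundary frame."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))

# A rotated boundary frame of an obstacle, as four predicted corner points:
boundary = [(10, 20), (50, 12), (58, 40), (18, 48)]
det_box = circumscribed_rectangle(boundary)  # (10, 12, 58, 48)
```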
In order to implement the above embodiments, the present disclosure provides a training apparatus for a multitask learning deep network model.
Fig. 6 is a block diagram of a training apparatus for a multi-task learning deep network model according to an embodiment of the present disclosure. As shown in fig. 6, the apparatus includes:
a first obtaining module 601, configured to obtain training data for multi-task learning in a target detection scenario; the training data comprises image samples and labels corresponding to each task on the image samples;
a second obtaining module 602, configured to input the image sample into a backbone network in the multitask learning deep network model, and obtain a feature map output by the backbone network; the multitask learning deep network model is based on Anchor-Free;
a third obtaining module 603, configured to input the feature map into a feature pyramid network in the multitask learning deep network model, and obtain a multi-scale feature map output by the feature pyramid network;
a fourth obtaining module 604, configured to input the multi-scale feature map into a Head network in the multi-task learning deep network model to perform learning of each task, and obtain a prediction result corresponding to each task output by the Head network;
and the training module 605 is configured to train the multi-task learning deep network model according to the prediction result output by the Head network and the label of each task on the image sample.
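The training step performed by module 605 — comparing the Head network's per-task predictions against the per-task labels — amounts to minimizing a combined multi-task loss. A hedged sketch: plain mean-squared-error terms with equal weights are assumptions for illustration, as the patent does not specify the loss functions or weighting.

```python
import numpy as np

def multitask_loss(preds, labels, weights=None):
    """preds/labels: dicts mapping task name -> prediction/label array.
    Returns the weighted sum of per-task MSE terms (illustrative choice)."""
    weights = weights or {k: 1.0 for k in preds}
    return sum(weights[k] * np.mean((preds[k] - labels[k]) ** 2) for k in preds)

# Four tasks, with dummy predictions and labels (names are assumptions):
preds  = {'center': np.zeros(4), 'corner': np.zeros(4),
          'bbox':   np.zeros(4), 'keypoint': np.zeros(4)}
labels = {'center': np.ones(4),  'corner': np.ones(4),
          'bbox':   np.ones(4),  'keypoint': np.ones(4)}
loss = multitask_loss(preds, labels)  # 4 tasks x MSE of 1.0 each = 4.0
```

In practice each term would use a task-appropriate loss (e.g. a focal-style loss for heatmaps), and the gradient of the total would update the backbone, pyramid, and Head networks jointly.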
In some embodiments of the present disclosure, the backbone network comprises a plurality of first feature extraction units; the second obtaining module 602 is specifically configured to:
inputting the image samples to a backbone network;
and obtaining a first feature map output by each first feature extraction unit in the backbone network.
In some embodiments of the present disclosure, the feature pyramid network comprises a plurality of second feature extraction units; the third obtaining module 603 is specifically configured to:
inputting the first feature graph output by each first feature extraction unit in the backbone network into a feature pyramid network;
and performing feature fusion in a top-down dense connection mode on the first feature graph output by the corresponding first feature extraction unit through a plurality of second feature extraction units in the feature pyramid network to obtain the feature graph corresponding to each scale.
Further, the Head network includes a plurality of Head subnetworks; the fourth obtaining module 604 is specifically configured to:
inputting the multi-scale feature map into a corresponding Head sub-network to learn each task;
obtaining a prediction result corresponding to each task output by each Head sub-network;
and performing fusion processing on the prediction result corresponding to each task output by each Head sub-network to obtain the prediction result corresponding to each task output by the Head network.
Wherein the multi-task learning includes:
a target center point positioning task, a target corner point positioning task, a target boundary box prediction task and a target characteristic point positioning task.
According to the training apparatus for the multi-task learning deep network model provided by the embodiments of the present disclosure, an Anchor-Free multi-task learning deep network model is constructed, and the prediction result of each task is obtained through the backbone network, the feature pyramid network, and the Head network. The advantages of Anchor-Free detection can thus be fully utilized to enrich the feature representation and improve the effect of the model, while also improving the efficiency of model training and reducing the consumption of computing power.
In order to realize the above embodiment, the present disclosure further provides a target detection apparatus.
Fig. 7 is a block diagram of a target detection apparatus according to an embodiment of the disclosure. As shown in fig. 7, the apparatus includes:
a first acquisition module 701 for acquiring an image of the surroundings of the vehicle;
an input module 702, configured to input an image to the trained multi-task learning deep network model; the multitask learning deep network model is based on Anchor-Free and comprises a backbone network, a characteristic pyramid network and a Head network; the Head network comprises a plurality of Head sub-networks, and each Head sub-network predicts each task respectively;
a second obtaining module 703, configured to obtain a prediction result corresponding to each task output by the multi-task learning deep network model;
and the determining module 704 is configured to determine a detection frame, a boundary frame, and feature points of the target in the image according to the prediction result corresponding to each task.
According to the target detection apparatus provided by the embodiments of the present disclosure, an image of the vehicle's surroundings is input into the multi-task learning deep network model to determine the detection frame, bounding box, and feature points of the target in the image. Because the model is Anchor-Free and obtains the prediction result of each task through the backbone network, the feature pyramid network, and the Head network, the advantages of Anchor-Free detection can be fully utilized, improving detection efficiency in the target detection scenario and reducing the consumption of computing power.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present disclosure also provides an electronic device and a readable storage medium according to an embodiment of the present disclosure.
Fig. 8 is a block diagram of an electronic device for the training method and/or the target detection method of the multi-task learning deep network model according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device includes: one or more processors 801, a memory 802, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, as desired. Likewise, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 8, one processor 801 is taken as an example.
The memory 802 is a non-transitory computer readable storage medium provided by the present disclosure. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform a training method and/or an object detection method of a multi-task learning deep network model provided by the present disclosure. The non-transitory computer-readable storage medium of the present disclosure stores computer instructions for causing a computer to perform the training method of the multitask learning deep network model and/or the target detection method provided by the present disclosure.
The memory 802, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the training method and/or the target detection method of the multitask learning deep network model in the embodiments of the disclosure (e.g., the first obtaining module 601, the second obtaining module 602, the third obtaining module 603, the fourth obtaining module 604, and the training module 605 shown in fig. 6). The processor 801 executes various functional applications of the server and data processing, i.e., implementing the training method and/or the target detection method of the multitask learning deep network model in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 802.
The memory 802 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the electronic device according to a training method of the multitask learning deep network model and/or a target detection method, and the like. Further, the memory 802 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 802 optionally includes memory located remotely from the processor 801, which may be connected over a network to an electronic device for a training method and/or an object detection method for a multitasking learning deep network model. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the training method and/or the target detection method of the multitask learning deep network model may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or other means, and are exemplified by a bus in fig. 8.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device of the training method and/or the object detection method of the multitasking learning deep network model, such as an input device of a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, etc. The output devices 804 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system that overcomes the drawbacks of high management difficulty and weak service scalability of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A training method of a multi-task learning deep network model comprises the following steps:
acquiring training data of multi-task learning in a target detection scene; wherein the training data comprises an image sample and a label corresponding to each task on the image sample;
inputting the image sample into a backbone network in the multitask learning deep network model to obtain a feature map output by the backbone network; the multitask learning deep network model is based on Anchor-Free;
inputting the feature map into a feature pyramid network in the multitask learning deep network model to obtain a multi-scale feature map output by the feature pyramid network;
inputting the multi-scale feature map into a Head network in the multi-task learning deep network model to learn each task, and obtaining a prediction result output by the Head network and corresponding to each task;
and training the multi-task learning deep network model according to the prediction result output by the Head network and the label of each task on the image sample.
2. The method of claim 1, wherein the backbone network comprises a plurality of first feature extraction units; the inputting the image sample to a backbone network in the multitask learning deep network model to obtain a feature map output by the backbone network comprises:
inputting the image samples to the backbone network;
and obtaining a first feature map output by each first feature extraction unit in the backbone network.
3. The method of claim 2, wherein the feature pyramid network comprises a plurality of second feature extraction units; the inputting the feature map into a feature pyramid network in the multitask learning deep network model to obtain a multi-scale feature map output by the feature pyramid network comprises the following steps:
inputting the first feature map output by each first feature extraction unit in the backbone network into the feature pyramid network;
and performing feature fusion in a top-down dense connection mode on the first feature graph output by the corresponding first feature extraction unit through the plurality of second feature extraction units in the feature pyramid network to obtain the feature graph corresponding to each scale.
4. The method of claim 1, wherein the Head network comprises a plurality of Head subnetworks; the step of inputting the multi-scale feature map into a Head network in the multi-task learning deep network model to learn each task and obtain a prediction result output by the Head network and corresponding to each task includes:
inputting the multi-scale feature map into a corresponding Head sub-network to learn each task;
obtaining a prediction result corresponding to each task output by each Head sub-network;
and fusing the prediction results corresponding to each task output by each Head sub-network to obtain the prediction results corresponding to each task output by the Head network.
5. The method of any of claims 1-4, wherein the multitask learning comprises:
a target center point positioning task, a target corner point positioning task, a target bounding box prediction task, and a target feature point positioning task.
6. A method of target detection, comprising:
acquiring an image of the surroundings of the vehicle;
inputting the image to a trained multi-task learning deep network model; the multitask learning deep network model is based on Anchor-Free and comprises a backbone network, a characteristic pyramid network and a Head network; the Head network comprises a plurality of Head sub-networks, and each Head sub-network predicts each task respectively;
obtaining a prediction result corresponding to each task output by the multi-task learning deep network model;
and determining a detection frame, a boundary frame and feature points of the target in the image according to the prediction result corresponding to each task.
7. A training device for a multitask learning deep network model comprises:
the first acquisition module is used for acquiring training data of multi-task learning in a target detection scene; wherein the training data comprises an image sample and a label corresponding to each task on the image sample;
the second acquisition module is used for inputting the image sample to a backbone network in the multitask learning deep network model to acquire a feature map output by the backbone network; the multitask learning deep network model is based on Anchor-Free;
the third obtaining module is used for inputting the feature map into a feature pyramid network in the multitask learning deep network model and obtaining a multi-scale feature map output by the feature pyramid network;
the fourth obtaining module is used for inputting the multi-scale feature map into a Head network in the multi-task learning deep network model to learn each task, and obtaining a prediction result output by the Head network and corresponding to each task;
and the training module is used for training the multi-task learning depth network model according to the prediction result output by the Head network and corresponding to each task and the label of each task on the image sample.
8. The apparatus of claim 7, wherein the backbone network comprises a plurality of first feature extraction units; the second obtaining module is specifically configured to:
inputting the image samples to the backbone network;
and obtaining a first feature map output by each first feature extraction unit in the backbone network.
9. The apparatus of claim 8, wherein the feature pyramid network comprises a plurality of second feature extraction units; the third obtaining module is specifically configured to:
inputting the first feature map output by each first feature extraction unit in the backbone network into the feature pyramid network;
and performing feature fusion in a top-down dense connection mode on the first feature graph output by the corresponding first feature extraction unit through the plurality of second feature extraction units in the feature pyramid network to obtain the feature graph corresponding to each scale.
10. The apparatus of claim 7, wherein the Head network comprises a plurality of Head subnetworks; the fourth obtaining module is specifically configured to:
inputting the multi-scale feature map into a corresponding Head sub-network to learn each task;
obtaining a prediction result corresponding to each task output by each Head sub-network;
and fusing the prediction results corresponding to each task output by each Head sub-network to obtain the prediction results corresponding to each task output by the Head network.
11. The apparatus of any of claims 7 to 10, wherein the multitask learning comprises:
a target center point positioning task, a target corner point positioning task, a target bounding box prediction task, and a target feature point positioning task.
12. An object detection device comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring an image of the surrounding environment of the vehicle;
the input module is used for inputting the image to the trained multi-task learning deep network model; the multitask learning deep network model is based on Anchor-Free and comprises a backbone network, a characteristic pyramid network and a Head network; the Head network comprises a plurality of Head sub-networks, and each Head sub-network predicts each task respectively;
the second acquisition module is used for acquiring a prediction result corresponding to each task output by the multi-task learning deep network model;
and the determining module is used for determining a detection frame, a boundary frame and feature points of the target in the image according to the prediction result corresponding to each task.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 6.
14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN202110723220.7A 2021-06-28 2021-06-28 Training and target detection method and device for multi-task learning deep network model Pending CN113591573A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110723220.7A CN113591573A (en) 2021-06-28 2021-06-28 Training and target detection method and device for multi-task learning deep network model

Publications (1)

Publication Number Publication Date
CN113591573A true CN113591573A (en) 2021-11-02

Family

ID=78245069


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114257817A (en) * 2022-03-01 2022-03-29 浙江智慧视频安防创新中心有限公司 Encoding method and decoding method of multitask digital retina characteristic stream
CN114565105A (en) * 2022-03-02 2022-05-31 北京百度网讯科技有限公司 Data processing method and deep learning model training method and device
CN114612767A (en) * 2022-03-11 2022-06-10 电子科技大学 Scene graph-based image understanding and expressing method, system and storage medium
CN114863379A (en) * 2022-05-17 2022-08-05 安徽蔚来智驾科技有限公司 Multitask target detection method, electronic device, medium, and vehicle
CN115170819A (en) * 2022-07-21 2022-10-11 北京百度网讯科技有限公司 Target identification method and device, electronic equipment and medium
CN115546778A (en) * 2022-10-22 2022-12-30 清华大学 Scene text detection method and system based on multi-task learning
CN115690544A (en) * 2022-11-11 2023-02-03 北京百度网讯科技有限公司 Multitask learning method and device, electronic equipment and medium
CN117994594A (en) * 2024-04-03 2024-05-07 武汉纺织大学 Power operation risk identification method based on deep learning

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086660A (en) * 2018-06-14 2018-12-25 深圳市博威创盛科技有限公司 Training method, equipment and the storage medium of multi-task learning depth network
WO2019020075A1 (en) * 2017-07-28 2019-01-31 北京市商汤科技开发有限公司 Image processing method, device, storage medium, computer program, and electronic device
US20190205643A1 (en) * 2017-12-29 2019-07-04 RetailNext, Inc. Simultaneous Object Localization And Attribute Classification Using Multitask Deep Neural Networks
CN111813532A (en) * 2020-09-04 2020-10-23 腾讯科技(深圳)有限公司 Image management method and device based on multitask machine learning model
CN112163676A (en) * 2020-10-13 2021-01-01 北京百度网讯科技有限公司 Multitask service prediction model training method, device, equipment and storage medium
CN112183395A (en) * 2020-09-30 2021-01-05 深兰人工智能(深圳)有限公司 Road scene recognition method and system based on multitask learning neural network
WO2021068323A1 (en) * 2019-10-12 2021-04-15 平安科技(深圳)有限公司 Multitask facial action recognition model training method, multitask facial action recognition method and apparatus, computer device, and storage medium
US20210166385A1 (en) * 2018-11-13 2021-06-03 Tencent Technology (Shenzhen) Company Limited Image processing method and apparatus, computer-readable medium, and electronic device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHAO, QINGBEI; YUAN, CHANG'AN: "MSSD Object Detection Method Based on Deep Learning", Enterprise Science and Technology & Development, no. 05 *
GUO, QIFAN; LIU, LEI; ZHANG, ?; XU, WENJUAN; JING, WENFENG: "Multi-scale Feature Fusion Network Based on Feature Pyramids", Chinese Journal of Engineering Mathematics, no. 05 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114257817A (en) * 2022-03-01 2022-03-29 浙江智慧视频安防创新中心有限公司 Encoding method and decoding method of multitask digital retina characteristic stream
CN114257817B (en) * 2022-03-01 2022-09-02 浙江智慧视频安防创新中心有限公司 Encoding method and decoding method of multi-task digital retina characteristic stream
CN114565105A (en) * 2022-03-02 2022-05-31 北京百度网讯科技有限公司 Data processing method and deep learning model training method and device
CN114565105B (en) * 2022-03-02 2023-05-16 北京百度网讯科技有限公司 Data processing method and training method and device of deep learning model
CN114612767A (en) * 2022-03-11 2022-06-10 电子科技大学 Scene graph-based image understanding and expressing method, system and storage medium
CN114863379A (en) * 2022-05-17 2022-08-05 安徽蔚来智驾科技有限公司 Multitask target detection method, electronic device, medium, and vehicle
CN115170819A (en) * 2022-07-21 2022-10-11 北京百度网讯科技有限公司 Target identification method and device, electronic equipment and medium
CN115546778A (en) * 2022-10-22 2022-12-30 清华大学 Scene text detection method and system based on multi-task learning
CN115690544A (en) * 2022-11-11 2023-02-03 北京百度网讯科技有限公司 Multitask learning method and device, electronic equipment and medium
CN115690544B (en) * 2022-11-11 2024-03-01 北京百度网讯科技有限公司 Multi-task learning method and device, electronic equipment and medium
CN117994594A (en) * 2024-04-03 2024-05-07 武汉纺织大学 Power operation risk identification method based on deep learning

Similar Documents

Publication Publication Date Title
CN113591573A (en) Training and target detection method and device for multi-task learning deep network model
EP3926526A2 (en) Optical character recognition method and apparatus, electronic device and storage medium
CN113033537B (en) Method, apparatus, device, medium and program product for training a model
US11841921B2 (en) Model training method and apparatus, and prediction method and apparatus
CN112528976B (en) Text detection model generation method and text detection method
CN111968229A (en) High-precision map making method and device
CN112529073A (en) Model training method, attitude estimation method and apparatus, and electronic device
CN110659600B (en) Object detection method, device and equipment
WO2021088365A1 (en) Method and apparatus for determining neural network
CN111539347B (en) Method and device for detecting target
CN110852321B (en) Candidate frame filtering method and device and electronic equipment
CN113642431A (en) Training method and device of target detection model, electronic equipment and storage medium
CN111739005A (en) Image detection method, image detection device, electronic equipment and storage medium
CN115860102B (en) Pre-training method, device, equipment and medium for automatic driving perception model
CN111626027A (en) Table structure restoration method, device, equipment, system and readable storage medium
CN115511779B (en) Image detection method, device, electronic equipment and storage medium
CN114202074A (en) Pre-training model generation method, device and equipment for target detection task
CN114091515A (en) Obstacle detection method, obstacle detection device, electronic apparatus, and storage medium
CN116168132A (en) Street view reconstruction model acquisition method, device, equipment and medium
CN115147809A (en) Obstacle detection method, device, equipment and storage medium
CN114998592A (en) Method, apparatus, device and storage medium for instance partitioning
CN112749701B (en) License plate offset classification model generation method and license plate offset classification method
CN112528932B (en) Method and device for optimizing position information, road side equipment and cloud control platform
Sanket et al. Ajna: Generalized deep uncertainty for minimal perception on parsimonious robots
CN111932530A (en) Three-dimensional object detection method, device and equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination