WO2022217434A1 - Cognitive network, method for training cognitive network, and object recognition method and apparatus

Info

Publication number: WO2022217434A1
Application number: PCT/CN2021/086643
Authority: WO
Inventors: 周凯强; 江立辉; 黄梓钊; 秘谧; 王鑫
Original assignee: 华为技术有限公司
Priority date: 2021-04-12
Filing date: 2021-04-12
Publication date: 2022-10-20
Also published as: CN117157679A

Abstract

A cognitive network (800), a training method (1200) for a cognitive network, and an object recognition method (1500) and apparatus. The cognitive network (800) comprises a backbone network (810), a candidate region proposal network (RPN) (820), a region of interest extraction module (830), and a classification regression network (840). A plurality of cognitive tasks share the RPN (820), and one RPN (820) predicts the region in which an object to be detected in the plurality of tasks is located, and the classification regression network obtains a target 2D frame and a classification result. The cognitive network (800) can reduce the number of parameters and calculations in a multi-task cognitive network, reduce the power consumption of hardware, and improve the running speed of a model.

Description

Perception network, training method for perception network, and object recognition method and apparatus

Technical Field

The present application relates to the field of computer vision, and more particularly, to a perception network, a training method for a perception network, and an object recognition method and apparatus.

Background

Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, and medical diagnostics. Figuratively speaking, computer vision installs eyes (cameras/video cameras) and a brain (algorithms) on a computer so that the computer can perceive its environment. Computer vision uses various imaging systems in place of the visual organs to obtain input information, and the computer then takes the place of the brain to process and interpret that information.

With the development of visual perception technology and the increasing demand for artificial intelligence (AI) perception of real scenes, more and more perception networks are deployed in various fields. For example, perception networks deployed in advanced driving assistance systems (ADAS) and autonomous driving systems (ADS) can be used to identify obstacles on the road. Most current perception networks can complete only one detection task. To perform multiple detection tasks, different networks usually have to be deployed, one for each task. However, running multiple perception networks simultaneously increases hardware power consumption and reduces the running speed of the models. Moreover, the chips used in many fields have low computing power, which makes it difficult to deploy a large perception network and even more difficult to deploy several perception networks.

Therefore, how to reduce the hardware power consumption of running a multi-task perception network has become an urgent problem to be solved.

Summary of the Invention

The present application provides a perception network, a training method for a perception network, and an object recognition method and apparatus, which can reduce the number of parameters and the amount of computation in a multi-task perception network, reduce hardware power consumption, and improve the running speed of the model.

In a first aspect, a perception network is provided, including: a backbone network, a region proposal network (RPN), a region of interest extraction module, and a classification and regression network. The backbone network is used to perform convolution processing on an input image and output a first feature map of the input image. The RPN is used to output, based on a second feature map, position information of a candidate two-dimensional (2D) frame of a target object, where the target object includes objects to be detected in multiple tasks, each of the multiple tasks includes at least one category, and the second feature map is determined according to the first feature map. The region of interest extraction module is used to extract first feature information from a third feature map based on the position information of the candidate 2D frame, where the first feature information is the feature of the region in which the candidate 2D frame is located, and the third feature map is determined according to the first feature map. The classification and regression network is used to process the first feature information and output a target 2D frame of the target object and first indication information, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate the category to which the target object belongs.

According to the solution of the embodiments of the present application, one perception network completes multiple perception tasks, and the multiple tasks share one RPN; that is, one RPN predicts the regions in which the objects to be detected in the multiple tasks are located. While the performance of the perception network is maintained, this reduces the number of parameters and the amount of computation of the perception network, improves processing efficiency, facilitates deployment in scenarios with high real-time requirements, reduces the load on hardware, and saves costs.
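The data flow of the first aspect can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the application: the four stage functions are hypothetical placeholders supplied by the caller, the `(x1, y1, x2, y2)` frame format is assumed, and the second and third feature maps are simply taken to be the first feature map.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed frame format: (x1, y1, x2, y2)


def run_perception_network(
    image: Sequence,
    backbone: Callable,
    rpn: Callable,
    roi_extract: Callable,
    cls_reg: Callable,
) -> Tuple[List[Box], List[str]]:
    """Chain the stages: backbone -> shared RPN -> RoI extraction -> classification/regression."""
    first_feature_map = backbone(image)
    # Assumption: the second and third feature maps are the first feature map itself.
    candidate_boxes = rpn(first_feature_map)  # one RPN serves all tasks
    first_feature_info = [roi_extract(first_feature_map, b) for b in candidate_boxes]
    target_boxes, categories = cls_reg(first_feature_info, candidate_boxes)
    # The number of target 2D frames never exceeds the number of candidate 2D frames.
    assert len(target_boxes) <= len(candidate_boxes)
    return target_boxes, categories


# Toy stand-ins (all hypothetical) so that the sketch can be executed end to end.
def toy_backbone(image):
    return image


def toy_rpn(feature_map):
    return [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]


def toy_roi_extract(feature_map, box):
    x1, y1, x2, y2 = (int(v) for v in box)
    return [row[x1:x2] for row in feature_map[y1:y2]]


def toy_cls_reg(features, boxes):
    # Keep only the first candidate and label it from the sum of its features.
    total = sum(sum(row) for row in features[0])
    return boxes[:1], ["car" if total > 0 else "background"]


image = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
boxes, labels = run_perception_network(
    image, toy_backbone, toy_rpn, toy_roi_extract, toy_cls_reg
)
```

The point of the sketch is that the RPN is called once per image regardless of how many tasks the network serves.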

The "first feature map" refers to the feature map output by the backbone network. The feature maps output by the backbone network can all be referred to as first feature maps.

There may be one or more first feature maps of the input image.

Multiple tasks can also be understood as multiple broad categories. A broad category includes at least one category; in other words, a broad category is a collection of at least one category. The criteria for dividing tasks can be set as needed. For example, the objects to be detected are divided into multiple tasks according to their similarity.
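As a small illustration of such a division, the mapping below groups similar categories into tasks. The task names and category names are hypothetical examples chosen for a driving scene; the application does not prescribe any particular division.

```python
# Hypothetical task division: each task (broad category) groups similar categories.
TASKS = {
    "vehicles": ["car", "truck", "bus"],
    "vulnerable_road_users": ["pedestrian", "cyclist"],
    "traffic_signs": ["traffic_sign"],
}


def task_of(category: str) -> str:
    """Return the task (broad category) that a category belongs to."""
    for task, categories in TASKS.items():
        if category in categories:
            return task
    raise KeyError(f"unknown category: {category}")
```

Every task in the mapping contains at least one category, which matches the constraint stated above.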

Multiple tasks in this embodiment of the present application share the same RPN, and the RPN may also be referred to as a single-head multi-task RPN.

There may be one or more second feature maps.

Exemplarily, the second feature map may include one or more of the first feature maps.

Exemplarily, the third feature map may be one of the first feature maps.

With reference to the first aspect, in some implementations of the first aspect, the perception network further includes a feature pyramid network (FPN). The FPN is connected to the backbone network and is used to perform feature fusion on the first feature maps and output fused feature maps.

In this case, the second feature map may include one or more of the fused feature maps.

Exemplarily, the third feature map may be one of the first feature maps or one of the fused feature maps output by the FPN.

According to the solution of the embodiments of the present application, using the FPN to perform feature fusion on the first feature maps generates more expressive feature maps for the subsequent modules, thereby improving the performance of the model.

With reference to the first aspect, in some implementations of the first aspect, the classification and regression network is specifically used to: process the first feature information to obtain the confidence that the candidate 2D frame belongs to each category in the multiple tasks; adjust the position information of the candidate 2D frame to obtain an adjusted candidate 2D frame; determine the target 2D frame according to the adjusted candidate 2D frame; and determine the first indication information according to the confidence that the target 2D frame belongs to each category.

Exemplarily, the position information of the candidate 2D frame is adjusted so that the adjusted candidate 2D frame matches the shape of the actual object more closely than the candidate 2D frame, that is, the adjusted candidate 2D frame is a more compact candidate 2D frame.
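One common way to express such an adjustment is a set of offsets applied to the center and size of the candidate frame. The sketch below assumes the `(dx, dy, dw, dh)` parameterization widely used in R-CNN style detectors; the application itself does not specify the form of the adjustment, so the function name and the parameterization are assumptions.

```python
import math


def apply_deltas(box, deltas):
    """Adjust a candidate frame (x1, y1, x2, y2) with assumed offsets (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = box
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Shift the center proportionally to the frame size, then rescale width and height.
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (
        new_cx - 0.5 * new_w,
        new_cy - 0.5 * new_h,
        new_cx + 0.5 * new_w,
        new_cy + 0.5 * new_h,
    )


unchanged = apply_deltas((0, 0, 10, 10), (0.0, 0.0, 0.0, 0.0))
tighter = apply_deltas((0, 0, 10, 10), (0.0, 0.0, math.log(0.5), math.log(0.5)))
```

With zero offsets the frame is returned unchanged, and negative `dw`/`dh` values shrink the frame around its center, which corresponds to the "more compact" frame described above.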

Further, a frame merging operation is performed on the adjusted candidate 2D frames to obtain the target 2D frames. For example, the adjusted candidate 2D frames are merged by non-maximum suppression (NMS) to obtain the target 2D frames.
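A minimal sketch of greedy NMS is shown below. The `(x1, y1, x2, y2)` frame format and the IoU threshold of 0.5 are assumptions made for illustration; the application only states that NMS may be used for frame merging.

```python
def iou(a, b):
    """Intersection over union of two frames given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit frames by descending score, drop those overlapping a kept frame."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


adjusted = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
confidences = [0.9, 0.8, 0.7]
kept = nms(adjusted, confidences)
```

Here the second frame overlaps the first with an IoU of about 0.68 and is suppressed, so the number of target 2D frames (two) is smaller than the number of candidate 2D frames (three).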

With reference to the first aspect, in some implementations of the first aspect, the classification and regression network includes a first region convolutional neural network (RCNN). The first RCNN includes a hidden layer, multiple sub-classification fully connected layers, and multiple sub-regression fully connected layers. The hidden layer is connected to the multiple sub-classification fully connected layers and to the multiple sub-regression fully connected layers; the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks. The hidden layer is used to process the first feature information to obtain second feature information; each sub-classification fully connected layer is used to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to an object category in the task corresponding to that sub-classification fully connected layer; and each sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the second feature information to obtain the adjusted candidate 2D frame.

Exemplarily, the hidden layer may include at least one of the following: a convolutional layer or a fully connected layer. Since the multiple tasks share the hidden layer, a convolutional layer in the hidden layer may also be called a shared convolutional layer (shared conv), and a fully connected layer in the hidden layer may also be called a shared fully connected layer (shared fc).

The first RCNN includes a hidden layer as well as multiple sub-classification fully connected layers and multiple sub-regression fully connected layers corresponding to the multiple tasks. Each task can have an independent sub-classification fully connected layer and an independent sub-regression fully connected layer, which together complete the detection of the objects to be detected in that task. Specifically, the sub-classification fully connected layer outputs the confidence that the candidate 2D frame belongs to an object category in the task, and the sub-regression fully connected layer outputs the adjusted candidate 2D frame.

A first RCNN includes multiple sub-classification fully-connected layers and sub-regression fully-connected layers. Therefore, a first RCNN can complete the detection of objects to be detected in multiple tasks. The first RCNN can also be called a single-head multi-task RCNN.

According to the solution of the embodiments of the present application, the multiple tasks share the hidden layer of the first RCNN, which further reduces the number of parameters and the amount of computation of the perception network and improves processing efficiency. Moreover, each task corresponds to an independent sub-classification fully connected layer (fc) and an independent sub-regression fc, which improves the extensibility of the perception network: the functions of the perception network can be configured flexibly by adding or removing sub-classification fc and sub-regression fc layers.
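The structure of the first RCNN can be sketched as a shared hidden layer followed by per-task heads. The sketch below is illustrative only: the class name `FirstRCNN`, the single fully connected hidden layer with ReLU, the omission of biases, and the toy weights are all assumptions, and the classification outputs are raw scores rather than normalized confidences.

```python
def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class FirstRCNN:
    """Shared hidden layer plus one sub-classification fc and one sub-regression fc per task."""

    def __init__(self, hidden_weights):
        self.hidden_weights = hidden_weights
        self.cls_fc = {}  # task name -> sub-classification fc weights
        self.reg_fc = {}  # task name -> sub-regression fc weights

    def add_task(self, task, cls_weights, reg_weights):
        self.cls_fc[task] = cls_weights
        self.reg_fc[task] = reg_weights

    def remove_task(self, task):
        del self.cls_fc[task]
        del self.reg_fc[task]

    def forward(self, first_feature_info):
        # The shared hidden layer (assumed: fc + ReLU) yields the second feature information.
        second = [max(0.0, v) for v in matvec(self.hidden_weights, first_feature_info)]
        # Every task head reads the same second feature information.
        return {
            task: {
                "scores": matvec(self.cls_fc[task], second),
                "box_deltas": matvec(self.reg_fc[task], second),
            }
            for task in self.cls_fc
        }


rcnn = FirstRCNN(hidden_weights=[[1.0, 0.0], [0.0, 1.0]])
rcnn.add_task("vehicles", cls_weights=[[1.0, 0.0], [0.0, 1.0]], reg_weights=[[0.0, 0.0]] * 4)
rcnn.add_task("traffic_signs", cls_weights=[[0.5, 0.5]], reg_weights=[[0.0, 0.0]] * 4)
outputs = rcnn.forward([1.0, 2.0])
```

Adding or removing a task touches only that task's two heads, while the hidden layer is computed once and reused, which is the extensibility and sharing described above.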

With reference to the first aspect, in some implementations of the first aspect, the classification and regression network includes a second RCNN. The second RCNN includes a hidden layer, a classification fully connected layer, and a regression fully connected layer; the hidden layer is connected to the classification fully connected layer and to the regression fully connected layer. The hidden layer is used to process the first feature information to obtain third feature information; the classification fully connected layer is used to obtain, according to the third feature information, the confidence that the candidate 2D frame belongs to each category; and the regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.

A second RCNN can complete the detection of objects to be detected in multiple tasks. The second RCNN can also be called a single-head multi-task RCNN.

In the solution of the embodiments of the present application, the second RCNN is used as the classification and regression network, and the multiple tasks share the hidden layer of the second RCNN, which further reduces the number of parameters and the amount of computation of the perception network and improves processing efficiency. In addition, the output of the hidden layer in the first RCNN needs to be input to all the sub-classification fully connected layers and sub-regression fully connected layers, which requires multiple matrix operations, whereas the output of the hidden layer in the second RCNN only needs to be input to the classification fully connected layer and the regression fully connected layer for matrix operations. This further reduces the number of matrix operations, is friendlier to hardware, further reduces computation time, and improves processing efficiency.

With reference to the first aspect, in some implementations of the first aspect, the classification fully connected layer is obtained by merging the multiple sub-classification fully connected layers in the first RCNN, and the regression fully connected layer is obtained by merging the multiple sub-regression fully connected layers in the first RCNN. The first RCNN includes a hidden layer, multiple sub-classification fully connected layers, and multiple sub-regression fully connected layers; the hidden layer is connected to the multiple sub-classification fully connected layers and to the multiple sub-regression fully connected layers; the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks. Each sub-classification fully connected layer is used to obtain, according to the third feature information, the confidence that the candidate 2D frame belongs to an object category in the task corresponding to that sub-classification fully connected layer; and each sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.

In the solution of the embodiments of the present application, the multiple sub-classification fc layers and the multiple sub-regression fc layers in the first RCNN are merged, and the resulting second RCNN is used as the classification and regression network. This further reduces the number of matrix operations, is friendlier to hardware, further reduces computation time, and improves processing efficiency.
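One plausible reading of this merging, assumed here for illustration, is that the weight matrices of the sub-classification fc layers are stacked row-wise into a single matrix, so that one matrix operation produces the same values as the separate per-task operations. The weights and task names below are toy values, and biases are omitted.

```python
def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


# Toy sub-classification fc weights of a first RCNN, one matrix per task (hypothetical values).
sub_cls_weights = {
    "vehicles": [[1.0, 0.0], [0.0, 1.0]],  # two categories
    "traffic_signs": [[1.0, 1.0]],         # one category
}
hidden_output = [2.0, 3.0]

# First RCNN: one matrix operation per task.
separate = [
    score
    for weights in sub_cls_weights.values()
    for score in matvec(weights, hidden_output)
]

# Second RCNN (assumed merge): stack all rows into one matrix, then one matrix operation.
merged_weights = [row for weights in sub_cls_weights.values() for row in weights]
merged = matvec(merged_weights, hidden_output)
```

The merged layer yields the same scores in the same order, so the merge changes how the computation is scheduled on hardware rather than what is computed. The sub-regression fc layers could be merged in the same way.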

In a second aspect, a training method for a perception network is provided. The perception network includes a region proposal network (RPN), where the RPN is used to predict position information of a candidate two-dimensional (2D) frame of a target object in a sample image, the target object includes objects to be detected in multiple tasks, each of the multiple tasks includes at least one category, and the target object includes a first task object and a second task object. The method includes: acquiring training data, where the training data includes a sample image, annotation data of the first task object on the sample image, and a pseudo frame of the second task object on the sample image, the annotation data includes a class label of the first task object and an annotated 2D frame of the first task object, and the pseudo frame of the second task object is a target 2D frame of the second task object obtained by another perception network performing inference on the sample image; and training the perception network based on the training data.

When the annotation data is partial annotation data, that is, when it includes only the annotation data of the first task object, training the perception network on the annotation data alone can cause the training data of different tasks to suppress each other in the RPN, because the multiple tasks share one RPN. Specifically, suppose only the objects to be detected in one task are annotated on a sample image. Training with this annotation data adjusts the parameters of the RPN so that the RPN predicts the candidate 2D frames of the objects to be detected in that task more accurately, but the RPN cannot accurately predict the candidate 2D frames of the objects to be detected in other tasks on the same sample image. When the annotation data of the objects to be detected in another task is then used for training, the parameters of the RPN are adjusted again, and the adjusted RPN may no longer accurately predict the candidate 2D frames of the objects to be detected in the other tasks. In this way, the training data of different tasks may suppress each other, and the RPN may fail to predict all the target objects in the image.

According to the solution of the embodiments of the present application, the perception network is trained jointly on the pseudo frames and the annotation data. When the annotation data includes only the annotation data of the first task object, that is, in the case of partial annotation data, the pseudo frame of the second task object is provided so that a more complete set of frames of the objects to be detected on the same sample image serves as the target output of the RPN. The parameters of the RPN are adjusted so that the output of the RPN continually approaches this target data, which avoids mutual suppression between different tasks, helps the RPN obtain more complete and accurate candidate 2D frames, and improves the recall rate. Because the annotation data of the sample images may be partial annotation data, data can be collected in a targeted manner: the required sample images are collected for a specific task, and the objects to be detected for all tasks need not be annotated in every sample image. This reduces the cost of data collection and annotation and helps balance the training data of different tasks. In addition, a scheme based on partial annotation data is flexibly extensible: when a task is added, only the annotation data of the new task needs to be provided, and the new objects to be detected need not be annotated on the original training data.

The first task objects may include objects to be detected in one or more tasks. The one or more tasks are the tasks where the first task object is located. The first task objects in different sample images in the training set may be the same or different.

The second task objects may include objects to be detected in one or more tasks. The one or more tasks are the tasks where the second task object is located. The same object to be detected may exist in the second task object and the first task object. That is to say, the first task object and the second task object may have overlapping objects to be detected, and the first task object and the second task object may also be completely different. The second task objects in different sample images in the training set can be the same or different.

The other perception network refers to a perception network other than the one to be trained. Exemplarily, the other perception network may be a multi-head multi-task perception network, multiple single-task perception networks, or the like.
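The assembly of one training sample from partial annotation data and pseudo frames can be sketched as follows. All function names and the dictionary layout are hypothetical, and filtering the other network's detections by a score threshold is an assumption of this sketch that the application does not state.

```python
def make_pseudo_frames(other_network_outputs, score_threshold=0.5):
    """Keep confident detections of another perception network as pseudo frames.

    The score threshold is an assumption made for this sketch.
    """
    return [box for box, score in other_network_outputs if score >= score_threshold]


def build_training_sample(image_id, annotations, other_network_outputs):
    """Combine partial annotation data (first task object) with pseudo frames (second task object)."""
    return {
        "image": image_id,
        "class_labels": [label for _, label in annotations],
        "annotated_boxes": [box for box, _ in annotations],
        "pseudo_boxes": make_pseudo_frames(other_network_outputs),
    }


def rpn_targets(sample):
    """Target output of the shared RPN: annotated frames plus pseudo frames."""
    return sample["annotated_boxes"] + sample["pseudo_boxes"]


sample = build_training_sample(
    image_id="sample_0001",
    annotations=[((0, 0, 10, 10), "car")],      # only the vehicle task is annotated
    other_network_outputs=[
        ((20, 20, 30, 30), 0.9),                # confident detection of another task
        ((40, 40, 45, 45), 0.2),                # low-confidence detection, dropped
    ],
)
targets = rpn_targets(sample)
```

The class labels cover only the annotated first task object, while the RPN target covers objects from both tasks.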

With reference to the second aspect, in some implementations of the second aspect, the perception network further includes a backbone network, a region of interest extraction module, and a classification and regression network, and training the perception network based on the training data includes: calculating a first loss function value according to the difference between, on the one hand, the annotated 2D frame of the first task object and the target 2D frame of the second task object and, on the other hand, the candidate 2D frame of the target object in the sample image predicted by the RPN; calculating a second loss function value of the perception network according to the annotation data; and performing backpropagation based on the first loss function value and the second loss function value to adjust the parameters of the part of the perception network that needs to be trained, where the part of the perception network that needs to be trained includes the part to be trained in the classification and regression network, the region of interest extraction module, the RPN, and the backbone network, and the part to be trained in the classification and regression network is determined according to the first task object.

The annotated 2D frame of the first task object and the target 2D frame of the second task object (the pseudo frame) are compared with the candidate 2D frame of the target object predicted by the RPN to obtain the loss function value of the RPN stage, that is, the first loss function value.
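The effect of including pseudo frames in this comparison can be shown with a toy loss. The function below is not the loss of the application, whose form is not given here; it is an assumed stand-in that averages `1 - best IoU` over the target frames, chosen only because it makes the missing-object penalty easy to see.

```python
def iou(a, b):
    """Intersection over union of two frames given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def toy_first_loss(target_boxes, candidate_boxes):
    """Toy RPN-stage loss (assumption): mean of (1 - best IoU) over the target frames."""
    if not target_boxes:
        return 0.0
    total = 0.0
    for target in target_boxes:
        best = max((iou(target, c) for c in candidate_boxes), default=0.0)
        total += 1.0 - best
    return total / len(target_boxes)


annotated = [(0, 0, 10, 10)]    # first task object (annotated 2D frame)
pseudo = [(20, 20, 30, 30)]     # second task object (pseudo frame)
candidates = [(0, 0, 10, 10)]   # the RPN found only the first task object

loss_annotated_only = toy_first_loss(annotated, candidates)
loss_with_pseudo = toy_first_loss(annotated + pseudo, candidates)
```

With annotated frames alone the toy loss is zero even though the RPN missed the second task object; adding the pseudo frame makes the miss visible to training.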

The annotation data of the sample image is compared with the output of the classification and regression network to obtain the loss function value, at the classification and regression network stage, of the task to which the first task object belongs, that is, the second loss function value.

Backpropagation is performed based on the first loss function value: the gradients of the parameters related to the first loss function value are calculated, and those parameters are adjusted based on the gradients. This adjusts the perception network so that the RPN can predict candidate frames more comprehensively.

The parameters related to the first loss function value are the parameters of the perception network used in the process of obtaining the first loss function value, for example, the parameters of the backbone network and the parameters of the RPN. Further, in the case where the perception network includes an FPN, the parameters related to the first loss function value also include the parameters of the FPN.

Backpropagation is likewise performed based on the second loss function value: the gradients of the parameters related to the second loss function value are calculated, and those parameters are adjusted based on the gradients. This adjusts the perception network so that the classification and regression network can better correct the output 2D frames and predict categories more accurately.

The parameters related to the second loss function value are the parameters of the perception network used in the process of calculating the second loss function value, for example, the parameters of the backbone network, the parameters of the RPN, the parameters of the region of interest extraction module, and the parameters of the part of the classification and regression network that needs to be trained. Further, in the case where the perception network includes an FPN, the parameters related to the second loss function value also include the parameters of the FPN. The parameters related to the second loss function value are the parameters of the part of the perception network that needs to be trained.

According to the solution of the embodiments of the present application, the parts of the perception network shared by different tasks, such as the backbone network, the RPN, and the region of interest extraction module, all participate in training based on the annotation data of every task, so that the shared parts can learn the features common to the tasks. The parts of the perception network specific to different tasks, for example, the parts of the classification and regression network corresponding to each task, participate only in training based on the annotation data of their own tasks, so that they can learn task-specific features, which improves the accuracy of the model. At the same time, because the part of the classification and regression network that needs to be trained is determined according to the task, the parts of the classification and regression network corresponding to different tasks do not affect each other during training. This ensures the independence of each task and makes the model highly flexible.

With reference to the second aspect, in some implementations of the second aspect, the backbone network is used to perform convolution processing on the sample image and output a first feature map of the sample image; the RPN is used to output the position information of the candidate 2D frame of the target object based on a second feature map, where the second feature map is determined according to the first feature map; the region of interest extraction module is used to extract first feature information from a third feature map based on the position information of the candidate 2D frame, where the first feature information is the feature of the region in which the candidate 2D frame is located, and the third feature map is determined according to the first feature map; and the classification and regression network is used to process the first feature information and output a target 2D frame of the target object and first indication information, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate the category to which the target object belongs.

With reference to the second aspect, in some implementations of the second aspect, the classification and regression network includes a first region convolutional neural network (RCNN). The first RCNN includes a hidden layer, multiple sub-classification fully connected layers, and multiple sub-regression fully connected layers; the hidden layer is connected to the multiple sub-classification fully connected layers and to the multiple sub-regression fully connected layers; the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks. The hidden layer is used to process the first feature information to obtain second feature information; each sub-classification fully connected layer is used to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to an object category in the task corresponding to that sub-classification fully connected layer; and each sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the second feature information to obtain the adjusted candidate 2D frame. The part to be trained in the classification and regression network includes the hidden layer and the sub-classification fully connected layer and sub-regression fully connected layer corresponding to the task to which the first task object belongs.

According to the method of the embodiments of the present application, the parts of the perception network shared by different tasks, that is, the backbone network, the RPN, the region of interest extraction module, and the hidden layer of the classification and regression network, all participate in training based on the annotation data of every task, so that the shared parts can learn the features common to the tasks. The parts of the perception network specific to different tasks, that is, the sub-classification fully connected layer and the sub-regression fully connected layer corresponding to each task in the classification and regression network, participate only in training based on the annotation data of their own tasks, so that they can learn task-specific features, which improves the accuracy of the model.
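The selection of the parts to be trained for one sample can be sketched as follows. The part names are hypothetical identifiers invented for this illustration; the rule itself (shared parts always, task heads only for the tasks annotated in the sample) follows the description above.

```python
# Hypothetical identifiers for the parts shared by all tasks.
SHARED_PARTS = ["backbone", "rpn", "roi_extract", "rcnn_hidden"]


def parts_to_train(annotated_tasks, all_tasks):
    """Shared parts are always trained; task heads only for the tasks annotated in this sample."""
    parts = list(SHARED_PARTS)
    for task in all_tasks:
        if task in annotated_tasks:
            parts += [f"cls_fc/{task}", f"reg_fc/{task}"]
    return parts


all_tasks = ["vehicles", "traffic_signs"]
# A sample whose annotation data covers only the vehicle task (the first task object).
trained = parts_to_train({"vehicles"}, all_tasks)
```

For this sample the heads of the `traffic_signs` task are left untouched, so training on vehicle annotations cannot disturb them.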

In a third aspect, an object recognition method is provided, which uses a perception network including: a backbone network, a region proposal network (RPN), a region of interest extraction module, and a classification and regression network. The method includes: using the backbone network to perform convolution processing on an input image to obtain a first feature map of the input image; using the RPN to output, based on a second feature map, position information of a candidate two-dimensional (2D) frame of a target object, where the target object includes objects to be detected in multiple tasks, each of the multiple tasks includes at least one category, and the second feature map is determined according to the first feature map; using the region of interest extraction module to extract first feature information from a third feature map based on the position information of the candidate 2D frame, where the first feature information is the feature of the region in which the candidate 2D frame is located, and the third feature map is determined according to the first feature map; and using the classification and regression network to process the first feature information to obtain a target 2D frame of the target object and first indication information, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate the category to which the target object belongs.

With reference to the third aspect, in some implementations of the third aspect, using the classification and regression network to process the first feature information to obtain the target 2D frame of the target object and the first indication information includes: using the classification and regression network to process the first feature information to obtain the confidence that the candidate 2D frame belongs to each category in the multiple tasks; using the classification and regression network to adjust the position information of the candidate 2D frame to obtain an adjusted candidate 2D frame; determining the target 2D frame according to the adjusted candidate 2D frame; and determining the first indication information according to the confidence that the target 2D frame belongs to each category.
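As an illustration of the last step, the first indication information can be derived from the per-category confidences of a target 2D frame. Choosing the single category with the highest confidence across all tasks is an assumption of this sketch; the task names, category names, and confidence values are hypothetical.

```python
def first_indication(confidences_per_task):
    """Return (task, category, confidence) with the highest confidence for one target 2D frame."""
    best_task, best_category, best_conf = None, None, float("-inf")
    for task, confidences in confidences_per_task.items():
        for category, conf in confidences.items():
            if conf > best_conf:
                best_task, best_category, best_conf = task, category, conf
    return best_task, best_category, best_conf


result = first_indication(
    {
        "vehicles": {"car": 0.7, "truck": 0.2},
        "traffic_signs": {"traffic_sign": 0.1},
    }
)
```

The returned category plays the role of the first indication information, and the task name shows which group of categories it came from.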

With reference to the third aspect, in some implementations of the third aspect, the classification and regression network includes a first region convolutional neural network (RCNN). The first RCNN includes a hidden layer, multiple sub-classification fully connected layers, and multiple sub-regression fully connected layers; the hidden layer is connected to the multiple sub-classification fully connected layers and to the multiple sub-regression fully connected layers; the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks. Using the classification and regression network to process the first feature information and output the target 2D frame of the target object and the first indication information includes: using the hidden layer to process the first feature information to obtain second feature information; using each sub-classification fully connected layer to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to an object category in the task corresponding to that sub-classification fully connected layer; and using each sub-regression fully connected layer to adjust the position information of the candidate 2D frame according to the second feature information to obtain the adjusted candidate 2D frame.

With reference to the third aspect, in some implementations of the third aspect, the classification and regression network includes a second RCNN, and the second RCNN includes a hidden layer, a classification fully connected layer, and a regression fully connected layer. The hidden layer is connected to the classification fully connected layer and to the regression fully connected layer. Using the classification and regression network to process the first feature information and output the target 2D frame of the target object and the first indication information includes: using the hidden layer to process the first feature information to obtain third feature information; using the classification fully connected layer to obtain, according to the third feature information, the confidence that the candidate 2D frame belongs to each category; and using the regression fully connected layer to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.

With reference to the third aspect, in some implementations of the third aspect, the classification fully connected layer is obtained by merging the multiple sub-classification fully connected layers in the first RCNN, and the regression fully connected layer is obtained by merging the multiple sub-regression fully connected layers in the first RCNN. The first RCNN includes a hidden layer, multiple sub-classification fully connected layers, and multiple sub-regression fully connected layers. The hidden layer is connected to the multiple sub-classification fully connected layers and to the multiple sub-regression fully connected layers; the sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks. Each sub-classification fully connected layer is used to obtain the confidence that the candidate 2D frame belongs to an object category in the task corresponding to that sub-classification fully connected layer; each sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.
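The merging of per-task layers into a single layer can be illustrated with a small sketch. It assumes that each sub-classification fully connected layer is a plain affine map (weight rows plus a bias) applied to the same hidden-layer output, so that stacking the rows of all sub-layers gives one merged layer whose output is the concatenation of the per-task outputs. The helper names `fc` and `merge_fc` and all numeric values are hypothetical and are not taken from the application.

```python
def fc(weights, bias, x):
    """Apply a fully connected layer: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def merge_fc(layers):
    """Merge FC layers that share one input by stacking their rows and biases."""
    merged_w, merged_b = [], []
    for weights, bias in layers:
        merged_w.extend(weights)
        merged_b.extend(bias)
    return merged_w, merged_b


# Hypothetical hidden-layer output for one candidate 2D frame.
hidden = [0.5, -1.0, 2.0]

# Two hypothetical sub-classification layers, one per task.
task_a = ([[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]], [0.1, 0.0])  # task with 2 categories
task_b = ([[0.5, 0.5, 0.5]], [-0.2])                        # task with 1 category

# Per-task heads evaluated separately, then concatenated.
per_task = fc(*task_a, hidden) + fc(*task_b, hidden)

# One merged head evaluated once.
merged_w, merged_b = merge_fc([task_a, task_b])
merged = fc(merged_w, merged_b, hidden)
```

Under these assumptions `merged` equals `per_task`, which is why a merged layer can stand in for the separate per-task layers while requiring only a single matrix operation. The same stacking applies to the sub-regression layers.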

The target perception network can be obtained by using the training method for a perception network in the second aspect. The target perception network can be a trained image recognition model, and the trained image recognition model can be used to process the image to be processed.

In a fourth aspect, an apparatus for training a perceptual network is provided. The apparatus includes a module or unit for performing the method in the second aspect and any one of the implementation manners of the second aspect.

In a fifth aspect, an object recognition device is provided, the device comprising a module or unit for executing the method in the third aspect and any one of the implementation manners of the third aspect.

It should be understood that the extensions, definitions, explanations and descriptions of related matters in the above-mentioned first and second aspects also apply to the same matters in the third, fourth and fifth aspects.

In a sixth aspect, an apparatus for training a perception network is provided. The apparatus includes a processor and a transmission interface, the processor receives or sends data through the transmission interface, and the processor is configured to call program instructions stored in a memory to execute the method in the second aspect or any one of the implementations of the second aspect.

The processor in the sixth aspect may be a central processing unit (CPU), or a combination of a CPU and a neural network computing processor, where the neural network computing processor may include a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and so on. The TPU is an artificial intelligence accelerator application-specific integrated circuit fully customized by Google for machine learning.

In a seventh aspect, an object recognition device is provided. The device includes a processor and a transmission interface, the processor receives or sends data through the transmission interface, and the processor is configured to call program instructions stored in a memory to execute the method in the third aspect or any one of the implementations of the third aspect.

The processor in the seventh aspect may be a central processing unit (CPU), or a combination of a CPU and a neural network computing processor, where the neural network computing processor may include a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and so on. The TPU is an artificial intelligence accelerator application-specific integrated circuit fully customized by Google for machine learning.

In an eighth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores program code for execution by a device, and when the program code is run on a computer or a processor, the computer or the processor is caused to execute the method in any one of the implementations of the second aspect or the third aspect.

In a ninth aspect, a computer program product comprising instructions is provided. When the computer program product runs on a computer, the computer is caused to execute the method in any one of the implementations of the second aspect or the third aspect.

In a tenth aspect, a chip is provided. The chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory and executes the method in any one of the implementations of the second aspect or the third aspect.

Optionally, as an implementation, the chip may further include a memory in which instructions are stored. The processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the method in any one of the implementations of the first aspect or the second aspect.

The above chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).

In an eleventh aspect, an electronic device is provided, and the electronic device includes the apparatus in any one of the above-mentioned fourth to seventh aspects.

Description of drawings

FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of another application scenario provided by an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a system architecture provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of the hardware structure of a chip according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a system architecture provided by an embodiment of the present application;

FIG. 7 is a schematic block diagram of a multi-head multi-task perception network;

FIG. 8 is a schematic structural diagram of a perception network according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of another perception network provided by an embodiment of the present application;

FIG. 10 is a schematic structural diagram of another perception network provided by an embodiment of the present application;

FIG. 11 is a schematic structural diagram of another perception network provided by an embodiment of the present application;

FIG. 12 is a schematic flowchart of a training method for a perception network provided by an embodiment of the present application;

FIG. 13 is a schematic diagram of a training process of a perception network provided by an embodiment of the present application;

FIG. 14 is a schematic block diagram of a perception network in a training process provided by an embodiment of the present application;

FIG. 15 is a schematic flowchart of an object recognition method provided by an embodiment of the present application;

FIG. 16 is a schematic diagram of an object recognition process provided by an embodiment of the present application;

FIG. 17 is a schematic block diagram of a perception network in an inference process provided by an embodiment of the present application;

FIG. 18 is a schematic diagram of a conversion process of a perception network provided by an embodiment of the present application;

FIG. 19 is a schematic block diagram of an apparatus provided by an embodiment of the present application;

FIG. 20 is a schematic block diagram of another apparatus provided by an embodiment of the present application.

Detailed description

The technical solutions in the present application will be described below with reference to the accompanying drawings.

The embodiments of the present application can be applied to fields that need to complete various sensing tasks, such as driving assistance, automatic driving, mobile phone terminals, monitoring, and security. The image is input into the perception network of the present application, and the detection result of the object of interest in the image is obtained. The detection results can be input to the post-processing module for processing, for example, sent to the planning control unit for decision-making in the autonomous driving system, or sent to the security system for abnormal situation detection.

The following is a brief introduction to the three application scenarios of advanced driving assistant system (ADAS)/autonomous driving system (ADS) visual perception system, album picture classification and monitoring.

ADAS/ADS visual perception system:

As shown in FIG. 1, ADAS and ADS need to detect multiple types of targets in real time, including dynamic obstacles, static obstacles, and traffic signs. Examples include pedestrians, cyclists, tricycles, cars, trucks, buses, wheels, car lights, traffic cones, traffic sticks, fire hydrants, motorcycles, bicycles, traffic signs, guide signs, billboards, road signs (roadsign), road poles (pole), traffic lights, and pavement signs. Traffic lights include red traffic lights (trafficlight_red), yellow traffic lights (trafficlight_yellow), green traffic lights (trafficlight_green), and black traffic lights (trafficlight_black). Pavement signs include turn around, straight, left turn, right turn, straight and left, straight and right, straight and turn around, left and turn around, left and right, left bend, right bend, and other pavement signs.

Using the solution of the embodiment of the present application, the detection tasks for the above-mentioned targets can be realized in one perception network, that is, the objects to be detected for multiple tasks can be detected in one perception network. After processing, the detection results can be sent to the planning control unit for decision-making, such as obstacle avoidance, traffic light decisions, or traffic sign decisions.

Album picture classification:

When a user stores a large number of pictures on a terminal device (e.g., a mobile phone) or a cloud disk, identifying the images in the album makes it easier for the user or the system to classify and manage the album and improves user experience.

Using the solutions of the embodiments of the present application, a perception network suitable for album picture classification can be obtained or optimized. The perception network can then be used to classify pictures, for example, into categories such as photos containing animals and photos containing people, so that pictures of different categories are labeled and are easy for users to view and find. In addition, the classification tags of these pictures can also be provided to the album management system for classification management, which saves the user's management time, improves the efficiency of album management, and enhances the user experience.

Monitoring:

Monitoring scenarios include: smart city, field monitoring, indoor monitoring, outdoor monitoring, and in-vehicle monitoring.

As shown in FIG. 2, a variety of detection tasks need to be completed in the smart city perception system. For example, vehicles, license plates, people, and faces need to be detected. After processing, the detection results can be used to judge traffic violations, predict traffic congestion, and so on.

By adopting the solution of the embodiment of the present application, the input road picture can be processed in one perception network to complete the detection tasks for the above-mentioned targets. In addition, the detection tasks of the perception network can be increased or decreased according to the actual situation. For example, suppose the current detection tasks of the perception network include a vehicle detection task and a human detection task. If a traffic sign detection task needs to be added, the structure of the perception network can be adjusted to add the detection task. A specific description is given later, for example, with reference to FIG. 14.

Since the embodiments of the present application involve a large number of neural network applications, for ease of understanding, related terms and concepts of the neural networks that may be involved in the embodiments of the present application are first introduced below.

(1) Neural network

A neural network can be composed of neural units. A neural unit can be an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit can be:

$$f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where s = 1, 2, ..., n, n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces nonlinear characteristics into the neural network and converts the input signal of the neural unit into an output signal. The output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function. A neural network is a network formed by connecting a plurality of such single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
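As a minimal sketch of the neural unit described above, the following code computes f applied to the weighted sum of the inputs plus the bias, with a sigmoid as the activation function. The function names and the example numbers are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Sigmoid activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(x, w, b, f=sigmoid):
    """Return f(sum_s W_s * x_s + b) for one neural unit."""
    return f(sum(ws * xs for ws, xs in zip(w, x)) + b)


# Example inputs (made up): the weighted sum is 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0,
# and the sigmoid of 0 is 0.5.
out = neural_unit(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
```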

(2) Deep neural network

A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers. According to the positions of the layers, the layers inside a DNN can be divided into three categories: input layer, hidden layer, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer.

Although a DNN looks complicated, the work of each layer is not. In short, each layer computes the following linear relationship expression:

$$\vec{y} = \alpha(W\vec{x} + \vec{b})$$

where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called coefficients), and $\alpha()$ is the activation function. Each layer simply applies this operation to the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, the number of coefficients $W$ and offset vectors $\vec{b}$ is also large. These parameters are defined in the DNN as follows, taking the coefficient $W$ as an example: suppose that in a three-layer DNN, the linear coefficient from the fourth neuron in the second layer to the second neuron in the third layer is defined as $W^{3}_{24}$. The superscript 3 represents the layer in which the coefficient $W$ is located, and the subscript corresponds to the output third-layer index 2 and the input second-layer index 4.

To sum up, the coefficient from the kth neuron in layer L-1 to the jth neuron in layer L is defined as $W^{L}_{jk}$.

It should be noted that the input layer does not have a W parameter. In a deep neural network, more hidden layers allow the network to better capture the complexities of the real world. In theory, a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks. Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vectors W of many layers).
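The layer-by-layer computation can be sketched as follows, assuming each layer is stored as a pair of a weight matrix and an offset vector. In the code, `W[j][k]` is the coefficient from input neuron k to output neuron j, using 0-based indices where the text uses 1-based indices. The function names, the choice of `tanh` as the default activation, and the toy weights are assumptions made for illustration.

```python
import math


def layer_forward(W, b, x, act=math.tanh):
    """Compute y = act(W x + b) for one layer.

    W[j][k] is the coefficient from input neuron k to output neuron j.
    """
    return [act(sum(W[j][k] * x[k] for k in range(len(x))) + b[j])
            for j in range(len(W))]


def dnn_forward(layers, x, act=math.tanh):
    """Feed x through each (W, b) pair in turn; the input layer has no W."""
    for W, b in layers:
        x = layer_forward(W, b, x, act)
    return x


# Toy network: 2 inputs -> 2 hidden neurons -> 1 output (weights are made up).
layers = [
    ([[1.0, 2.0], [3.0, 4.0]], [0.0, 1.0]),
    ([[1.0, -1.0]], [0.5]),
]


def identity(z):
    """Linear activation, used only to make the arithmetic easy to check."""
    return z


hidden = layer_forward(layers[0][0], layers[0][1], [1.0, 1.0], identity)
output = dnn_forward(layers, [1.0, 1.0], identity)
```

With the identity activation, the hidden vector is `[3.0, 8.0]` and the output is `[-4.5]`. Training, as described above, amounts to learning the entries of every `W` and `b`.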

(3) Convolutional Neural Network

Convolutional neural network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network consists of a feature extractor consisting of convolutional and subsampling layers. The feature extractor can be viewed as a filter, and the convolution process can be viewed as convolution with an input image or a convolutional feature map using a trainable filter. The convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal. In a convolutional layer of a convolutional neural network, a neuron can only be connected to some of its neighbors. A convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as the way to extract image information is independent of location. The underlying principle is that the statistics of one part of the image are the same as the other parts. This means that image information learned in one part can also be used in another part. So for all positions on the image, the same learned image information can be used. In the same convolution layer, multiple convolution kernels can be used to extract different image information. Generally, the more convolution kernels, the richer the image information reflected by the convolution operation.

The convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network. In addition, the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.

(4) Loss function

In the process of training a deep neural network, the output of the network should be as close as possible to the value that is actually desired. Therefore, the predicted value of the current network can be compared with the desired target value, and the weight vector of each layer of the neural network can then be updated based on the difference between the two (there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vector is adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the desired target value or a value very close to it. It is therefore necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the loss function or objective function, which is an important equation for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a greater difference, so training the deep neural network becomes a process of reducing the loss as much as possible. Generally, the smaller the loss, the higher the training quality of the deep neural network, and the larger the loss, the lower the training quality. Similarly, the smaller the loss fluctuation, the more stable the training; the larger the loss fluctuation, the more unstable the training.
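As an illustration of how a loss function measures the difference between a predicted value and a target value, the sketch below uses mean squared error. The application does not specify a particular loss function here, so this choice and the example values are assumptions.

```python
def mse_loss(predicted, target):
    """Mean squared error between a prediction and its target."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


target = [1.0, 0.0]
close_loss = mse_loss([0.9, 0.1], target)  # prediction near the target
far_loss = mse_loss([0.2, 0.8], target)    # prediction far from the target
```

The prediction that is closer to the target yields the smaller loss, which is the quantity that training tries to reduce.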

(5) Back propagation algorithm

The neural network can use the error back propagation (BP) algorithm to correct the values of the parameters in the initial model during training, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, propagating the input signal forward to the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a backward pass driven by the error loss, and it aims to obtain the parameters of the optimal super-resolution model, such as the weight matrix.
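The forward pass, error loss, and backward update can be sketched on a deliberately tiny model: two scalar linear layers trained with a squared error loss and plain gradient descent. The model, the learning rate, and the data point are assumptions chosen so that the chain rule is easy to follow; they are not the training procedure of the application.

```python
def train_step(w1, w2, x, target, lr):
    """One forward pass, one backward pass, and one parameter update."""
    # Forward propagation.
    hidden = w1 * x
    pred = w2 * hidden
    error = pred - target
    loss = error ** 2

    # Back propagation: apply the chain rule from the loss back to each weight.
    grad_pred = 2.0 * error
    grad_w2 = grad_pred * hidden    # d(loss)/d(w2)
    grad_hidden = grad_pred * w2    # d(loss)/d(hidden)
    grad_w1 = grad_hidden * x       # d(loss)/d(w1)

    return w1 - lr * grad_w1, w2 - lr * grad_w2, loss


w1, w2 = 0.5, 0.5  # pre-configured initial parameters (made up)
losses = []
for _ in range(30):
    w1, w2, loss = train_step(w1, w2, x=1.0, target=1.0, lr=0.1)
    losses.append(loss)
```

In this toy setting the recorded loss shrinks at every step, which corresponds to the error loss converging as the parameters are updated.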

The method provided by the present application will be described below from the model training side and the model application side.

The training method for a perception network provided by the embodiments of the present application involves computer vision processing. Specifically, data processing methods such as data training, machine learning, and deep learning can be applied to the training data to perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and the like, to finally obtain a trained perception network. The object recognition method provided by the embodiments of the present application can use the trained perception network: input data (such as the image to be processed in the present application) is input into the trained perception network to obtain output data (such as the first indication information and the target 2D frame of the target object in the present application). It should be noted that the perception network training method and the object recognition method provided by the embodiments of the present application are based on the same concept, and can also be understood as two parts of a system, or two stages of an overall process, such as a model training stage and a model application stage.

As shown in FIG. 3 , an embodiment of the present application provides a system architecture 100 . In Figure 3, a data collection device 160 is used to collect training data. For the training method of the perceptual network according to the embodiment of the present application, the training data may include sample images, labeled data of the sample images, and pseudo frames on the sample images.

After collecting the training data, the data collection device 160 stores the training data in the database 130 , and the training device 120 obtains the target model/rule 101 by training based on the training data maintained in the database 130 .

The specific manner in which the training device 120 obtains the target model/rule 101 based on the training data will be described in detail later. The target model/rule 101 can be used to realize the object recognition method of the embodiment of the present application, that is, the image to be processed is input into the target model/rule 101 to obtain the detection result of the object of interest in the image to be processed. It should be noted that, in practical applications, the training data maintained in the database 130 does not necessarily all come from the data collection device 160, and may also be received from other devices. In addition, the training device 120 does not necessarily train the target model/rule 101 entirely on the training data maintained in the database 130, and may also obtain training data from the cloud or elsewhere for model training. The above description should not be construed as a limitation on the embodiments of the present application.

The target model/rule 101 obtained through training by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 3. The execution device 110 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle terminal, or it can be a server or a cloud. In FIG. 3, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices. The user can input data to the I/O interface 112 through the client device 140. In this embodiment of the present application, the input data may include the image to be processed.

When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculations and other related processing, the execution device 110 can call the data, code, and the like in the data storage system 150 for corresponding processing, and the data and instructions obtained by the corresponding processing may also be stored in the data storage system 150.

Finally, the I/O interface 112 returns the processing result, such as the detection result obtained above, to the client device 140, thereby providing it to the user.

For example, client device 140 may be a planning control unit in an automated driving system.

It is worth noting that the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above-mentioned goals or complete the above-mentioned tasks, thereby providing the user with the desired result.

In the case shown in FIG. 3, the user can manually specify the input data, and this manual specification can be operated through the interface provided by the I/O interface 112. In another case, the client device 140 can automatically send the input data to the I/O interface 112. If the client device 140 needs the user's authorization to send the input data automatically, the user can set the corresponding permission in the client device 140. The user can view the result output by the execution device 110 on the client device 140, and the specific presentation form can be display, sound, action, or the like. The client device 140 can also be used as a data collection terminal that collects the input data fed to the I/O interface 112 and the output results returned by the I/O interface 112, as shown in the figure, as new sample data and stores them in the database 130. Alternatively, the collection may bypass the client device 140: the I/O interface 112 directly stores the input data fed to it and the output results it returns, as shown in the figure, in the database 130 as new sample data.

It is worth noting that FIG. 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, components, modules, and the like shown in the figure does not constitute any limitation. For example, in FIG. 3, the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.

It is worth noting that FIG. 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, devices, modules, etc. shown in the figure does not constitute any limitation. For example, in FIG. 3 , the data The storage system 150 is an external memory relative to the execution device 110 , and in other cases, the data storage system 150 may also be placed in the execution device 110 .

As shown in FIG. 3 , a target model/rule 101 is obtained by training according to the training device 120 , and the target model/rule 101 may be a perceptual network in this embodiment of the present application.

Since the CNN is a very common neural network, the structure of the CNN is introduced in detail below with reference to FIG. 4. As mentioned in the introduction to the basic concepts above, a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture performs multiple levels of learning at different levels of abstraction through algorithms updated by a neural network model. As a deep learning architecture, the CNN is a feed-forward artificial neural network in which individual neurons can respond to the images fed into it.

The structure of the neural network specifically adopted by the image recognition method of the embodiment of the present application may be as shown in FIG. 4. In FIG. 4, a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a fully connected layer 230. The input layer 210 can obtain the image to be processed and pass it to the convolutional layer/pooling layer 220 and the fully connected layer 230 for processing, to obtain the processing result of the image. The internal layer structure of the CNN 200 in FIG. 4 is described in detail below.

Convolutional layer/pooling layer 220:

Convolutional layer:

As shown in FIG. 4, the convolutional layer/pooling layer 220 may include, for example, layers 221-226. In one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer. In another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or it can be used as the input of another convolutional layer to continue the convolution operation.

The following will take the convolutional layer 221 as an example to introduce the inner working principle of a convolutional layer.

The convolutional layer 221 may include many convolution operators. A convolution operator, also called a kernel, acts in image processing as a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride), so as to extract specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends over the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension. In most cases, however, instead of a single weight matrix, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the dimension can be understood as being determined by the "multiple" described above. Different weight matrices can be used to extract different features from the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur unwanted noise in the image. Because the multiple weight matrices have the same size (rows × columns), the convolution feature maps they extract also have the same size, and these feature maps are then combined to form the output of the convolution operation.
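
As a non-limiting illustration, the sliding of several same-sized weight matrices over an image with a given stride can be sketched as follows. The function name conv2d, the two example kernels, and the absence of padding are assumptions made only for this sketch; as is common in neural network practice, the kernel is applied without flipping.

```python
def conv2d(image, kernels, stride=1):
    """Valid (no-padding) convolution of an H x W x C image with K kernels.

    image:   nested list indexed [row][col][channel]
    kernels: list of K weight matrices, each indexed [row][col][channel]
    returns: nested list indexed [row][col][k]; the output depth equals K
    """
    h, w, c = len(image), len(image[0]), len(image[0][0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            pixel = []
            for kernel in kernels:
                acc = 0.0
                # Each weight matrix spans the full depth of the input.
                for di in range(kh):
                    for dj in range(kw):
                        for ch in range(c):
                            acc += image[i * stride + di][j * stride + dj][ch] * kernel[di][dj][ch]
                pixel.append(acc)
            row.append(pixel)
        output.append(row)
    return output


# 4 x 4 single-channel image with a vertical edge between columns 1 and 2.
image = [[[0.0], [0.0], [1.0], [1.0]] for _ in range(4)]
edge_kernel = [[[-1.0], [1.0]], [[-1.0], [1.0]]]    # responds to left-to-right increases
mean_kernel = [[[0.25], [0.25]], [[0.25], [0.25]]]  # local average (blur)

features = conv2d(image, [edge_kernel, mean_kernel], stride=1)
```

In this sketch, two weight matrices yield an output of depth two, and the edge kernel responds only at the column where the pixel values change.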

The weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions.

When the convolutional neural network 200 has multiple convolutional layers, the shallow convolutional layers (such as 221) often extract more general features, which can also be called low-level features. As the depth of the convolutional neural network 200 increases, the features extracted by the later convolutional layers (e.g., 226) become more and more complex, such as high-level semantic features. Features with higher-level semantics are more suitable for the problem to be solved.

Pooling layer:

Since the number of training parameters often needs to be reduced, a pooling layer often needs to be introduced periodically after a convolutional layer. In the layers 221-226 exemplified by 220 in FIG. 4, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator can average the pixel values of the image within a specific range to produce the result of average pooling. The max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image. The size of the image output by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
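
For illustration only, the average pooling operator and the max pooling operator can be sketched as follows. The function name pool2d and the use of non-overlapping 2 × 2 windows are assumptions of this sketch.

```python
def pool2d(image, size=2, mode="max"):
    """Non-overlapping pooling over size x size windows of a 2D image."""
    output = []
    for i in range(0, len(image) - size + 1, size):
        row = []
        for j in range(0, len(image[0]) - size + 1, size):
            window = [image[i + di][j + dj] for di in range(size) for dj in range(size)]
            # Each output pixel summarizes one sub-region of the input.
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        output.append(row)
    return output


image = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 10, 13, 14],
    [11, 12, 15, 16],
]
max_pooled = pool2d(image, size=2, mode="max")  # [[4, 8], [12, 16]]
avg_pooled = pool2d(image, size=2, mode="avg")  # [[2.5, 6.5], [10.5, 14.5]]
```

The 4 × 4 input is reduced to a 2 × 2 output in both modes, which corresponds to the reduction in spatial size described above.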

Fully connected layer 230:

After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate an output of one class or of a set of the required number of classes. Therefore, the fully connected layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 4) and the output layer 240. The parameters contained in the multiple hidden layers may be pre-trained on training data relevant to a specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.

After the multiple hidden layers in the fully connected layer 230, the last layer of the entire convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to categorical cross-entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 200 (propagation in the direction from 210 to 240 in FIG. 4) is completed, back propagation (propagation in the direction from 240 to 210 in FIG. 4) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200, that is, the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
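
As a simplified sketch of this forward and backward pass, the following example uses a single linear output layer with a softmax cross-entropy loss and performs one gradient-descent update. The layer shape, the learning rate, and all function names are assumptions made for illustration; a real network would propagate the gradient through all layers.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Prediction error for the true class index `target`."""
    return -math.log(probs[target])


def forward(weights, biases, x):
    """Forward propagation through one linear output layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def sgd_step(weights, biases, x, target, lr=0.1):
    """Back propagation for this layer followed by one weight and bias update."""
    probs = softmax(forward(weights, biases, x))
    # Gradient of the cross-entropy loss with respect to the logits.
    grad = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    new_weights = [[w - lr * g * xi for w, xi in zip(row, x)] for row, g in zip(weights, grad)]
    new_biases = [b - lr * g for b, g in zip(biases, grad)]
    return new_weights, new_biases


weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # 3 classes, 2 input features
biases = [0.0, 0.0, 0.0]
x, target = [1.0, 2.0], 2

loss_before = cross_entropy(softmax(forward(weights, biases, x)), target)
weights, biases = sgd_step(weights, biases, x, target)
loss_after = cross_entropy(softmax(forward(weights, biases, x)), target)
```

After the update, loss_after is smaller than loss_before, which is the reduction in loss that back propagation aims for.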

It should be noted that the convolutional neural network shown in FIG. 4 is only used as an example of a possible convolutional neural network, and in specific applications, the convolutional neural network may also exist in the form of other network models.

FIG. 5 shows a hardware structure of a chip provided by an embodiment of the present application, and the chip includes a neural network processor 50. The chip can be set in the execution device 110 as shown in FIG. 3 to complete the calculation work of the calculation module 111. The chip can also be set in the training device 120 as shown in FIG. 3 to complete the training work of the training device 120 and output the target model/rule 101. The methods in the embodiments of the present application may be implemented in the chip as shown in FIG. 5.

The neural network processor NPU 50 is mounted on the main central processing unit (CPU) (host CPU) as a coprocessor, and tasks are allocated by the main CPU. The core part of the NPU is the operation circuit 503, and the controller 504 controls the operation circuit 503 to extract the data in the memory (weight memory or input memory) and perform operations.

In some implementations, the arithmetic circuit 503 includes multiple processing units (process engines, PEs). In some implementations, arithmetic circuit 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 503 is a general-purpose matrix processor.

For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data corresponding to the matrix B from the weight memory 502 and buffers it on each PE in the operation circuit. The operation circuit then fetches the data of the matrix A from the input memory 501, performs the matrix operation with the matrix B, and stores the partial results or the final result of the matrix in an accumulator 508.
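
A software analogy of this accumulation, given only as a sketch, is shown below: the product C = A × B is built up from partial products over slices of the shared dimension, with an accumulator holding the partial results. The function name and the tile size are assumptions, and the sketch does not model the systolic array or the memories of the NPU.

```python
def matmul_accumulate(a, b, tile=2):
    """Compute C = A x B by accumulating partial products over tiles of the shared dimension."""
    rows, inner, cols = len(a), len(b), len(b[0])
    accumulator = [[0.0] * cols for _ in range(rows)]  # holds partial results
    for k0 in range(0, inner, tile):
        # After each tile the accumulator holds a partial result;
        # after the last tile it holds the final result.
        for i in range(rows):
            for j in range(cols):
                for k in range(k0, min(k0 + tile, inner)):
                    accumulator[i][j] += a[i][k] * b[k][j]
    return accumulator


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # input matrix
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # weight matrix
C = matmul_accumulate(A, B)               # [[4.0, 5.0], [10.0, 11.0]]
```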

The vector calculation unit 507 can further process the output of the operation circuit, for example by vector multiplication, vector addition, exponential operation, logarithmic operation, or size comparison. For example, the vector calculation unit 507 can be used for network computation of non-convolutional/non-FC layers in the neural network, such as pooling, batch normalization (BN), and local response normalization.

In some implementations, the vector calculation unit 507 can store the processed output vectors to the unified memory 506. For example, the vector calculation unit 507 may apply a nonlinear function to the output of the operation circuit 503, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 507 generates normalized values, merged values, or both. In some implementations, the vector of processed outputs can be used as activation input to the operation circuit 503, e.g., for use in subsequent layers of the neural network.

The operation of the perception network provided by the embodiment of the present application may be performed by the operation circuit 503 or the vector calculation unit 507.

Unified memory 506 is used to store input data and output data.

The storage unit access controller (direct memory access controller, DMAC) 505 is used to transfer the input data in the external memory to the input memory 501 and/or the unified memory 506, to store the weight data in the external memory into the weight memory 502, and to store the data in the unified memory 506 into the external memory.

A bus interface unit (BIU) 510 is used to realize the interaction between the main CPU, the DMAC and the instruction fetch memory 509 through the bus.

The instruction fetch memory (instruction fetch buffer) 509 connected to the controller 504 is used to store the instructions used by the controller 504.

The controller 504 is used to invoke the instructions cached in the instruction fetch memory 509 to control the working process of the operation accelerator.

Generally, the unified memory 506, the input memory 501, the weight memory 502 and the instruction fetch memory 509 are all on-chip memories, and the external memory is a memory outside the NPU. The external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.

The execution device 110 in FIG. 3 or the chip in FIG. 5 described above can execute each step of the object recognition method of the embodiment of the present application. The training device 120 in FIG. 3 or the chip in FIG. 5 described above can perform various steps of the training method for the perceptual network according to the embodiment of the present application.

As shown in FIG. 6, an embodiment of the present application provides a system architecture 300. The system architecture includes a local device 301, a local device 302, an execution device 310 and a data storage system 350, wherein the local device 301 and the local device 302 are connected with the execution device 310 through a communication network.

In one implementation, execution device 310 may be implemented by one or more servers. Optionally, the execution device 310 may be used in conjunction with other computing devices, such as data storage, routers, load balancers and other devices. The execution device 310 may be arranged on one physical site, or distributed across multiple physical sites. The execution device 310 may use the data in the data storage system 350 or call the program code in the data storage system 350 to implement the training method of the perception network in this embodiment of the present application.

Specifically, in an implementation manner, the perception network includes a candidate region generation network (RPN), where the RPN is used to predict the position information of the candidate two-dimensional (2D) frame of the target object in the sample image. The target object includes objects to be detected in multiple tasks, and each of the multiple tasks includes at least one category; the target objects include a first task object and a second task object.

The execution device 310 may perform the following process:

Acquire training data, where the training data includes the sample image, the label data of the first task object on the sample image, and the pseudo frame of the second task object on the sample image. The label data includes the class label of the first task object and the annotated 2D frame of the first task object, and the pseudo frame of the second task object is the target 2D frame of the second task object obtained by inferring the sample image through another perception network. The perception network is then trained based on the training data.

Through the above process, the execution device 310 can obtain a perception network, and the perception network can be used for detection in multiple tasks.
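
The composition of such training data can be sketched as follows, purely as an illustration. The TrainingSample structure, the add_pseudo_boxes helper, the score threshold min_score, and the stand-in other_network are all hypothetical and are not specified by the embodiments.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class TrainingSample:
    image_id: str
    # Label data of the first task object: class label plus annotated 2D frame.
    labeled: List[Tuple[str, Box]]
    # Pseudo frames of the second task object, inferred by another network.
    pseudo_boxes: List[Box] = field(default_factory=list)


def add_pseudo_boxes(sample, other_network, min_score=0.5):
    """Attach 2D frames predicted by another perception network as pseudo frames.

    `min_score` is an assumed filter; the source does not specify one.
    """
    detections = other_network(sample.image_id)  # list of (box, score)
    sample.pseudo_boxes = [box for box, score in detections if score >= min_score]
    return sample


def other_network(image_id):
    """Stand-in for inference by another perception network."""
    return [((5.0, 5.0, 20.0, 20.0), 0.9), ((0.0, 0.0, 3.0, 3.0), 0.2)]


sample = TrainingSample("img_0001", [("car", (10.0, 10.0, 50.0, 40.0))])
sample = add_pseudo_boxes(sample, other_network)
```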

A user may operate respective user devices (e.g., local device 301 and local device 302) to interact with the execution device 310. Each local device can represent any computing device, such as a surveillance camera, personal computer, computer workstation, smartphone or other type of cellular phone, tablet, smart camera, smart car, media consumption device, wearable device, set-top box, or gaming console.

Each user's local device can interact with the execution device 310 through any communication mechanism/standard communication network, which can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.

In an implementation manner, the local device 301 and the local device 302 obtain the relevant parameters of the perception network from the execution device 310, deploy the perception network on the local device 301 and the local device 302, and use the perception network to detect objects.

In another implementation, a perceptual network may be directly deployed on the execution device 310, and the execution device 310 obtains the image to be processed from the local device 301 and the local device 302, and uses the perceptual network to process the image to be processed.

The above execution device 310 may also be a cloud device, in which case the execution device 310 may be deployed in the cloud; alternatively, the above execution device 310 may be a terminal device, in which case the execution device 310 may be deployed on the user terminal side. This is not limited in the embodiments of the present application.

Exemplarily, the perception network can be deployed on a computing node of a vehicle-mounted visual perception device, a safe city perception device, or a security perception device to process the image to be processed and obtain the detection result of the object of interest in the image to be processed. For example, the computing node may be the execution device 110 in FIG. 3, the execution device 310 in FIG. 6, a local device, or the like.

Most current perception networks can complete only one detection task. To achieve multiple detection tasks, it is usually necessary to deploy different networks for the different detection tasks. However, running multiple perception networks simultaneously increases the power consumption of the hardware and reduces the running speed of the model. Existing perception networks that can complete multiple detection tasks also have problems such as long running time. For example, as shown in FIG. 7, a multi-header multi-task perception network includes a backbone network (backbone) and multiple headers, and each header includes a region proposal network (RPN), a region of interest align (ROI-Align) module, and a region convolutional neural network (RCNN). The RPN takes a long time to generate proposals, which makes the network difficult to apply to scenarios with high real-time requirements; moreover, the number of headers increases with the number of detection tasks, so that the memory consumption, the amount of computation, and the computation time increase rapidly. Chips used in many fields have low computing power, which makes it difficult to deploy a large perception network, and even more difficult to deploy multiple perception networks.

The embodiments of the present application provide a perception network, which can reduce the amount of parameters and computation in the perception network, reduce the power consumption of hardware, and improve the running speed of the model.

FIG. 8 shows a schematic diagram of a perception network in an embodiment of the present application. The perception network 800 in FIG. 8 includes a backbone network (backbone) 810 and a header.

The perception network in the embodiments of the present application may be implemented by hardware, software, or a combination of software and hardware.

The backbone network 810 is configured to perform convolution processing on the input image to obtain the first feature map of the input image.

The backbone network 810 can extract basic features through a series of convolution processing to provide corresponding features for subsequent detection.

In this embodiment of the present application, the "first feature map" refers to a feature map (feature map) output by the backbone network. The feature maps output by the backbone network can all be referred to as first feature maps.

There may be one or more first feature maps of the input image.

Exemplarily, the backbone network 810 can output feature maps of the input image at different scales. Feature maps at different scales can be understood as the first feature maps of the input image, and these feature maps can provide basic features for subsequent detection.

Feature maps at different scales can be understood as feature maps of different resolutions, or in other words, feature maps of different sizes.

Illustratively, the backbone network 810 may adopt various forms of networks. For example, the backbone network 810 may adopt a visual geometry group (VGG) network, a residual neural network (ResNet), or an inception network (inception-net), where inception-net is the core structure of GoogLeNet.

The header is used to detect the target object according to the second feature map, and to output the target two-dimensional (2D) frame of the target object and the first indication information. The target objects include objects to be detected in multiple tasks. The second feature map is determined from the first feature map. The first indication information is used to indicate the category to which the target object belongs.

That is to say, the header is used to realize target detection according to the second feature map, and output the target 2D frame of the target object and the first indication information.

Exemplarily, the first indication information may include the confidence that the target object belongs to each category. That is, the category to which the target object belongs can be indicated by the confidence that the target object belongs to each category. The higher the confidence, the greater the probability that the target object belongs to the category corresponding to that confidence. For example, the category corresponding to the highest confidence is the category to which the target object belongs. Alternatively, the first indication information may be the category to which the target object belongs. Alternatively, the first indication information may include the confidence of the category to which the target object belongs. Here, the categories include the object categories in the multiple tasks. This embodiment of the present application does not limit the specific form of the first indication information.
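
The alternative forms of the first indication information can be illustrated with the following sketch. The function name first_indication, the form names, and the example confidence values are hypothetical.

```python
def first_indication(confidences, form="all"):
    """Return the first indication information in one of three illustrative forms.

    confidences: mapping from category name to the confidence for that category.
    """
    best = max(confidences, key=confidences.get)  # category with the highest confidence
    if form == "all":
        return dict(confidences)                  # confidence for every category
    if form == "category":
        return best                               # only the category itself
    if form == "category_with_confidence":
        return best, confidences[best]            # the category and its confidence
    raise ValueError(f"unknown form: {form}")


confidences = {"car": 0.82, "truck": 0.11, "bus": 0.05, "wheel": 0.02}
category = first_indication(confidences, form="category")              # "car"
pair = first_indication(confidences, form="category_with_confidence")  # ("car", 0.82)
```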

A header can complete the detection of objects to be detected in various tasks, that is, it is used to detect whether there are objects to be detected in the various tasks in the input image.

Dividing the objects to be detected into multiple tasks can also be understood as dividing the objects to be detected into multiple groups of categories. The object categories in different tasks may be the same or different.

Exemplarily, according to the similarity of the objects to be detected and the richness and scarcity of training samples, the 31 types of objects to be detected are divided into 8 categories, namely 8 tasks, as shown in Table 1.

Table 1

It should be noted that the division manner in Table 1 is only an example, and in other embodiments, a task division manner different from that in Table 1 may be adopted, which is not limited in this embodiment of the present application.

A header can be used to complete a variety of object detection tasks. For example, a header can complete the 8 tasks in Table 1 above, and output the target 2D frame of the target object and the confidence that the target object belongs to the 31 types of objects.

Optionally, the perception network 800 may further include other processing modules connected to the header. Other processing modules are used to obtain other detection information of the target object according to the target 2D frame of the target object output by the header.

For example, according to the target 2D frame output by the header, the other processing modules can extract, from the feature map output by the backbone network, the features of the area where the target 2D frame is located, and complete 3D detection, keypoint detection, or the like of the target object in the target 2D frame according to the extracted features.

It should be understood that the above is only for illustration, and other processing modules are optional modules, which may be set according to actual needs, which are not limited in this embodiment of the present application.

The header is described in detail below.

Specifically, the header includes an RPN 820, a region of interest extraction module 830 and a classification and regression network 840.

The RPN 820 is used to predict the area where the target object is located on the second feature map, and to output the position information of the candidate 2D frame matching the area where the target object is located, that is, the position information of the candidate 2D frame of the target object. The target object includes objects to be detected in multiple tasks, each of the multiple tasks includes at least one category, and the second feature map is determined according to the first feature map.

The region of interest extraction module 830 is used to extract the first feature information on the third feature map based on the position information of the candidate 2D frame. The first feature information is the feature of the region where the candidate 2D frame is located, and the third feature map is determined according to the first feature map.

The classification and regression network 840 is used to process the first feature information, and to output the target 2D frame of the target object and the first indication information. The number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate the category to which the target object belongs.
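
The data flow between the backbone, the RPN, the region of interest extraction module, and the classification and regression network can be sketched as follows. This is only an illustration under stated assumptions: no FPN is used, the second and third feature maps are both taken to be the last backbone output, and the toy_* functions are hypothetical stand-ins rather than trained modules.

```python
def perceive(image, backbone, rpn, roi_extract, cls_reg):
    """Data flow of a single-header perception network (no FPN in this sketch)."""
    first_maps = backbone(image)              # first feature map(s)
    second_map = first_maps[-1]               # assumption: the RPN reads the last backbone map
    third_map = first_maps[-1]                # assumption: ROI features come from the same map
    candidates = rpn(second_map)              # one RPN call covers every task
    roi_features = [roi_extract(third_map, box) for box in candidates]
    return cls_reg(roi_features, candidates)  # target 2D frames and first indication information


# Hypothetical stand-ins so that the flow can be executed end to end.
rpn_calls = []


def toy_backbone(image):
    return [image]


def toy_rpn(feature_map):
    rpn_calls.append(1)
    return [(0, 0, 2, 2), (1, 1, 3, 3)]  # candidate 2D frames as (x1, y1, x2, y2)


def toy_roi_extract(feature_map, box):
    x1, y1, x2, y2 = box
    return [feature_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]


def toy_cls_reg(roi_features, candidates):
    # Label a frame "bright" when its mean feature exceeds 0.5, otherwise "dark".
    results = []
    for feats, box in zip(roi_features, candidates):
        label = "bright" if sum(feats) / len(feats) > 0.5 else "dark"
        results.append((box, label))
    return results


image = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
detections = perceive(image, toy_backbone, toy_rpn, toy_roi_extract, toy_cls_reg)
```

In this sketch the RPN is invoked once per image regardless of how many categories the classification and regression stage distinguishes.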

For example, as shown in Figure 8, the classification and regression network can output the target 2D box (box) and class label (label) of the object to be detected in multiple tasks. The class label of the target object can be used as the first indication information. It should be understood that the use of the class label as the first indication information in FIG. 8 is only an example, and does not constitute a limitation to the solutions of the embodiments of the present application.

The RPN 820 can predict the areas where the target object may exist on the second feature map, and give boxes that match these areas. These areas can be called candidate areas (proposals), and the boxes that match the candidate areas are the candidate 2D frames. The box that matches a proposal can also be called the 2D box of the proposal.

The target object includes objects to be detected in multiple tasks, for example, the objects to be detected in the 8 tasks in Table 1, in which case the RPN 820 is used to predict the regions where the objects to be detected in the 8 tasks may exist.

Exemplarily, the target objects may include objects to be detected in all tasks of the perception network. That is, RPN can be used to predict the region of the object to be detected in all tasks that may exist on the second feature map. In other words, all tasks of the perception network share the same RPN.

There may be one or more second feature maps.

Optionally, the perception network 800 further includes a feature pyramid network (FPN).

The FPN is connected to the backbone 810, and is used to perform feature fusion on the feature map output by the backbone 810, that is, perform feature fusion on the first feature map of the input image, and output the fused feature map. The fused feature map is input into the RPN. In this case, the second feature map may include one or more of the fused feature maps.

Specifically, the FPN takes the feature maps of different scales output by the backbone 810 as input and, through vertical feature fusion inside the FPN and horizontal feature fusion with the same layer of the backbone 810, generates feature maps with stronger expressive ability and provides them to the subsequent modules, thereby improving the performance of the model.

That is, FPN can be used to achieve multi-scale feature fusion.
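
As a much simplified sketch of such multi-scale fusion, the following example upsamples the coarser fused map and adds it element-wise to the backbone map of the same scale. The names upsample2x, fpn_fuse, c2 to c4 and p2 to p4 are assumptions of this sketch, as are the single-channel maps and the factor of two between scales; a practical FPN would additionally apply convolutions to each map.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling of a 2D map by a factor of two."""
    output = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        output.append(wide)
        output.append(list(wide))
    return output


def fpn_fuse(backbone_maps):
    """Fuse backbone maps ordered from fine to coarse; each is half the size of the previous."""
    fused = [backbone_maps[-1]]  # the coarsest map is used as is
    for lateral in reversed(backbone_maps[:-1]):
        top_down = upsample2x(fused[0])  # vertical (top-down) path
        # Horizontal connection: add the backbone map of the same scale.
        merged = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lateral, top_down)]
        fused.insert(0, merged)
    return fused


c2 = [[1.0] * 4 for _ in range(4)]   # fine backbone map, 4 x 4
c3 = [[10.0] * 2 for _ in range(2)]  # 2 x 2
c4 = [[100.0]]                       # coarse backbone map, 1 x 1
p2, p3, p4 = fpn_fuse([c2, c3, c4])
```

Each fused map keeps the resolution of its backbone counterpart while carrying information from all coarser scales; here every value in p2 equals 111.0.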

In the case where the perception network does not include the FPN, as shown in Figure 8, the backbone 810 is connected to the RPN 820.

The region of interest extraction module 830 is configured to extract, from the third feature map, the features of the region where the candidate 2D frame is located, according to the candidate 2D frame output by the RPN 820.

Exemplarily, the third feature map is determined according to the first feature map, including:

In the case where the perceptual network includes FPN, the third feature map may be one of the feature maps output by the backbone (ie, the first feature map) or one of the fused feature maps output by the FPN;

In the case where the perceptual network does not include the FPN, the third feature map may be one of the feature maps (ie, the first feature map) output by the backbone.

For example, according to the proposals provided by the RPN 820, the region of interest extraction module 830 extracts the features of the region where each proposal is located from a feature map output by the backbone or the FPN, and resizes them to a fixed size to obtain the features of each proposal.

Exemplarily, the region of interest extraction module 830 may adopt region of interest pooling (ROI-pooling), region of interest align (ROI-Align), position sensitive region of interest pooling (position sensitive ROI pooling, PS-ROIPOOLING), position sensitive region of interest align (position sensitive ROI align, PS-ROIALIGN), or other feature extraction methods.

For example, the region of interest extraction module 830 uses interpolation and sampling in the region where the proposal is located to extract features of a fixed resolution, and inputs the extracted features into the subsequent modules.
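
One possible way to extract fixed-resolution features by interpolation and sampling is sketched below. It samples a single point at the centre of each output bin using bilinear interpolation; this choice, the function names, and the 2 × 2 output size are assumptions of the sketch, and practical ROI-Align implementations typically average several sample points per bin.

```python
def bilinear(fmap, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional position (y, x)."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_extract(fmap, box, out_size=2):
    """Extract an out_size x out_size feature for box = (x1, y1, x2, y2).

    Coordinates are in feature-map units; one sample is taken at each bin centre.
    """
    x1, y1, x2, y2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    return [
        [bilinear(fmap, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w) for j in range(out_size)]
        for i in range(out_size)
    ]


# Feature map whose value at (row, col) is 4 * row + col.
fmap = [[float(4 * r + c) for c in range(4)] for r in range(4)]
features = roi_extract(fmap, (0.0, 0.0, 3.0, 3.0), out_size=2)
```

Whatever the size of the proposal, the extracted feature has the same fixed resolution, so that it can be fed to the subsequent fully connected layers.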

Optionally, the classification and regression network 840 is specifically configured to: process the first feature information to obtain the confidence that the candidate 2D frame belongs to each category in the multiple tasks; adjust the position information of the candidate 2D frame to obtain the adjusted candidate 2D frame; determine the target 2D frame according to the adjusted candidate 2D frame; and determine the first indication information according to the confidence that the target 2D frame belongs to each category.

For example, for the 8 tasks in Table 1, the classification and regression network 840 refines each proposal provided by the region of interest extraction module 830 to obtain the confidence that each proposal belongs to each of the 31 categories in the 8 tasks, and at the same time adjusts the coordinates of the 2D box of each proposal to obtain the adjusted candidate 2D boxes. Further, after the adjusted candidate 2D boxes are merged by non-maximum suppression (NMS), the target 2D frame and the first indication information are obtained. The number of candidate 2D boxes is greater than or equal to the number of target 2D boxes.
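
For illustration, a greedy form of NMS can be sketched as follows. The intersection-over-union threshold of 0.5 and the function names are assumptions of this sketch and are not specified by the embodiments.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes and drop boxes that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]  # adjusted candidate 2D boxes
scores = [0.9, 0.8, 0.7]                                    # confidence of each box
kept = nms(boxes, scores)                                   # [0, 2]
```

The second box overlaps the first with an intersection over union of about 0.68 and is therefore suppressed, so three candidate boxes yield two target boxes.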

In a possible implementation manner, the classification and regression network 840 includes multiple third RCNNs, wherein the multiple third RCNNs correspond to multiple tasks one-to-one. That is, each third RCNN separately completes the detection of objects to be detected in different tasks.

FIG. 9 shows a schematic block diagram of a perception network provided by an embodiment of the present application. For example, as shown in FIG. 9, the perception network includes a backbone, an FPN, an RPN, an ROI-Align module, and n third RCNNs.

Specifically, the third RCNN is used to: process the features of the area where the candidate 2D frame is located to obtain the confidence that the candidate 2D frame belongs to the object categories in the task corresponding to the third RCNN; and adjust the position information of the candidate 2D frame to obtain the adjusted candidate 2D frame.

That is, any third RCNN among the plurality of third RCNNs can predict the confidence that the candidate 2D frame belongs to the object category in the task corresponding to the third RCNN, and obtain the adjusted candidate 2D frame. The multiple third RCNNs can obtain the confidence of the candidate 2D frame belonging to each category, and the adjusted candidate 2D frame obtained by each third RCNN.

Further, after the adjusted candidate 2D frame is merged by NMS, the target 2D frame and the first indication information are obtained.

For example, if the task corresponding to the third RCNN1# is the vehicle detection task in Table 1, the third RCNN1# outputs the confidence that each proposal belongs to the three categories of cars, trucks and buses, and the adjusted candidate 2D frame. If the task corresponding to the third RCNN2# is the detection task of wheels and lights in Table 1, the third RCNN2# outputs the confidence that each proposal belongs to the two categories of wheels and lights, and the adjusted candidate 2D frame. In this way, for any proposal, confidences for a total of five categories and the adjusted candidate 2D frames can be obtained after processing by the third RCNN1# and the third RCNN2#.

The perception network in FIG. 9 is used to implement n tasks, for example, the n tasks include task 0, task 1, ..., task n-1 in FIG. 9, where n is an integer greater than 1. The n third RCNNs correspond one-to-one to the n tasks. Taking task 0 as an example, the third RCNN corresponding to task 0 outputs the confidence that each proposal belongs to each object category in task 0, and the adjusted candidate 2D frame. The n third RCNNs corresponding to the n tasks obtain the confidence of each object category in each corresponding task, so that the classification and regression network can obtain the confidence that each proposal belongs to each category.
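
The way the outputs of several per-task RCNNs are combined for one proposal can be sketched as follows. The two head functions are hypothetical stand-ins that return fixed values, not trained networks, and the category names follow the vehicle and wheel/light example above.

```python
def run_task_heads(roi_feature, candidate_box, task_heads):
    """Run every per-task head on the same proposal and merge the outputs."""
    confidences = {}
    adjusted_boxes = {}
    for task_name, head in task_heads.items():
        task_confidences, adjusted_box = head(roi_feature, candidate_box)
        confidences.update(task_confidences)  # categories of different tasks are disjoint here
        adjusted_boxes[task_name] = adjusted_box
    return confidences, adjusted_boxes


def vehicle_head(roi_feature, box):
    """Stand-in for the head of the vehicle detection task."""
    return {"car": 0.7, "truck": 0.2, "bus": 0.1}, box


def wheel_light_head(roi_feature, box):
    """Stand-in for the head of the wheel and light detection task."""
    x1, y1, x2, y2 = box
    return {"wheel": 0.6, "light": 0.1}, (x1 + 1, y1 + 1, x2 - 1, y2 - 1)


task_heads = {"vehicle": vehicle_head, "wheel_light": wheel_light_head}
confidences, adjusted_boxes = run_task_heads([0.0], (10, 10, 50, 50), task_heads)
```

For the single proposal in this sketch, confidences for five categories and one adjusted candidate 2D frame per task are obtained.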

It should be noted that the FPN in FIG. 9 is an optional module. In FIG. 9 , the ROI-Align module is used as the region of interest extraction module only as an example, and other methods may also be used to extract corresponding features. For a detailed description, please refer to the foregoing description, which will not be repeated here.

In another possible implementation manner, the classification and regression network includes a first RCNN. The first RCNN includes a hidden layer, multiple sub-classification fully connected layers (classification fully connected layers, cls fc), and multiple sub-regression fully connected layers (regression fully connected layers, reg fc). The hidden layer is connected to the multiple sub-classification fully connected layers and to the multiple sub-regression fully connected layers. The multiple sub-classification fully connected layers correspond one-to-one to the multiple tasks, and the multiple sub-regression fully connected layers correspond one-to-one to the multiple tasks.

In other words, the first RCNN includes a hidden layer and multiple sub-cls fc and multiple sub-reg fc corresponding to multiple tasks. Each task can have an independent sub-classification fc and sub-regression fc.

FIG. 10 shows a schematic block diagram of another perception network provided by an embodiment of the present application. For example, as shown in FIG. 10, the perception network includes a backbone, an FPN, an RPN, an ROI-Align module and the first RCNN.

The hidden layer is used to process the first feature information to obtain the second feature information.

That is to say, the hidden layer is used to process the features of the region where the candidate 2D box is located, and the processed results are respectively input to multiple sub-classification fully connected layers and multiple sub-regression fully connected layers.

The sub-classification fully connected layer is used to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to the object categories in the task corresponding to that sub-classification fully connected layer.

The sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the second feature information to obtain the adjusted candidate 2D frame. Further, the sub-regression fully connected layer can use box merging operations, such as NMS operations, to remove duplicate boxes and output more compact candidate 2D boxes.
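Exemplarily, the box merging operation mentioned above can be sketched as a greedy NMS. This is only a minimal illustration: the box format (x1, y1, x2, y2), the function names `iou` and `nms`, and the IoU threshold of 0.5 are assumptions and are not specified in this application.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, drop boxes overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (illustrative values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the boxes that survive
```

In this sketch the second box overlaps the first with an IoU above the threshold and is removed, while the distant third box is kept.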

The sub-classification fully connected layer and the sub-regression fully connected layer corresponding to each task can complete the detection of the objects to be detected in that task. Specifically, the sub-classification fully connected layer can output the confidence that the candidate 2D frame belongs to the object categories in the task, and the sub-regression fully connected layer can output the adjusted candidate 2D frame. That is to say, one first RCNN can complete the detection of the objects to be detected in multiple tasks. The first RCNN can also be called a single-head multi-task RCNN.

The first RCNN can predict the confidence that the candidate 2D frame belongs to the object category in the multiple tasks corresponding to the first RCNN, and obtain the adjusted candidate frame.

For example, if the multiple tasks corresponding to the first RCNN include the 8 tasks in Table 1, the first RCNN includes 8 sub-cls fc and 8 sub-reg fc, corresponding respectively to the 8 tasks. Each sub-cls fc outputs the confidence that each proposal belongs to the object categories in the task corresponding to that sub-cls fc, and each sub-reg fc outputs the adjusted candidate 2D frame. The first RCNN can thus obtain the confidence that each proposal belongs to each of the 31 object categories in the 8 tasks, together with the adjusted candidate 2D frames.

The perceptual network in FIG. 10 is used to implement n tasks, for example, the n tasks include task 0, task 1 . . . task n-1 in FIG. 10 . n is an integer greater than 1. The first RCNN includes a hidden layer and n sub-cls fc and n sub-reg fc corresponding to n tasks, respectively. Hidden layers can include Shared fc and/or Shared conv.

Taking task 0 as an example, the sub-cls fc corresponding to task 0 in the first RCNN outputs the confidence that each proposal belongs to each object category in task 0, and the sub-reg fc corresponding to task 0 outputs the adjusted candidate 2D frame. In this way, the n sub-cls fc corresponding to the n tasks obtain the confidence that each proposal belongs to each object category in the corresponding task, and the first RCNN can obtain the confidence that each proposal belongs to each category.
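Exemplarily, the structure of the first RCNN described above (a shared hidden layer followed by one sub-cls fc and one sub-reg fc per task) can be sketched as follows. This is a minimal sketch under stated assumptions: the hidden layer is taken to be a single fully connected layer with ReLU, each sub-reg fc outputs four adjustment values per proposal, biases are omitted, and the names `matvec`, `first_rcnn_forward`, and all weight values are hypothetical.

```python
import math


def matvec(weight, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def first_rcnn_forward(roi_feature, hidden_w, task_heads):
    """Shared hidden layer, then one (sub-cls fc, sub-reg fc) pair per task."""
    # Shared hidden layer (assumed here: one fully connected layer with ReLU).
    hidden = [max(0.0, h) for h in matvec(hidden_w, roi_feature)]
    outputs = {}
    for task, (cls_w, reg_w) in task_heads.items():
        scores = [sigmoid(z) for z in matvec(cls_w, hidden)]  # per-category confidence
        deltas = matvec(reg_w, hidden)                        # box adjustment values
        outputs[task] = (scores, deltas)
    return outputs


roi_feature = [1.0, 2.0]                             # feature of one proposal
hidden_w = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]     # 2 inputs -> 3 hidden units
task_heads = {
    "task0": ([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, -1.0, 0.0]],      # 3 categories
              [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1], [0.1, 0.1, 0.0]]),
    "task1": ([[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]],                        # 2 categories
              [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]),
}
outputs = first_rcnn_forward(roi_feature, hidden_w, task_heads)
```

Each task reads the same hidden-layer output, so adding a task in this sketch only adds one entry to `task_heads`.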

It should be noted that the FPN in Figure 10 is an optional module. In Fig. 10, the ROI-Align module is used as the region of interest extraction module only as an example, and other methods can also be used to extract corresponding features. For the specific description, refer to the foregoing, which will not be repeated here.

In another possible implementation manner, the classification and regression network includes a second RCNN. The second RCNN includes a hidden layer, a classification fully connected layer, and a regression fully connected layer. The hidden layer is connected to the classification fully connected layer, and the hidden layer is connected to the regression fully connected layer.

FIG. 11 shows a schematic block diagram of yet another cognitive network provided by an embodiment of the present application. For example, as shown in Figure 11, the perceptual network includes backbone, FPN, RPN, ROI-Align module, and a second RCNN.

The hidden layer is used to process the first feature information to obtain the third feature information.

That is to say, the hidden layer is used to process the features of the region where the candidate 2D box is located, and the processed results are input to the classification fully connected layer and the regression fully connected layer respectively.

Exemplarily, the hidden layer may include at least one of the following: a convolutional layer or a fully connected layer. For a detailed description, please refer to the first RCNN, which will not be repeated here.

The classification fully connected layer is used to obtain the confidence that the candidate 2D frame belongs to each category according to the third feature information.

The regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame. Further, a frame merging operation is performed on the adjusted candidate 2D frame to obtain the target 2D frame.

That is, a second RCNN completes the detection of objects to be detected in multiple tasks. The second RCNN can also be called a single-head multi-task RCNN.

Specifically, the classification fully connected layer is obtained by merging the multiple sub-classification fully connected layers in the first RCNN. The regression fully connected layer is obtained by merging the multiple sub-regression fully connected layers in the first RCNN. In this case, the first feature information and the third feature information are the same.

Merging multiple sub-classification fully connected layers can be understood as splicing the weight matrices of the multiple sub-classification fully connected layers. Merging multiple sub-regression fully connected layers can be understood as splicing the weight matrices of the multiple sub-regression fully connected layers.
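Exemplarily, the splicing of weight matrices can be written as follows. The symbols are assumptions introduced only for illustration: $h$ denotes the output of the hidden layer, and $W_i$ and $b_i$ denote the weight matrix and bias of the sub-classification fully connected layer of task $i$ (the same form applies to the sub-regression fully connected layers).

```latex
W_{\mathrm{cls}} =
\begin{bmatrix} W_0 \\ W_1 \\ \vdots \\ W_{n-1} \end{bmatrix},
\qquad
b_{\mathrm{cls}} =
\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{n-1} \end{bmatrix},
\qquad
W_{\mathrm{cls}}\, h + b_{\mathrm{cls}} =
\begin{bmatrix} W_0 h + b_0 \\ W_1 h + b_1 \\ \vdots \\ W_{n-1} h + b_{n-1} \end{bmatrix}
```

Under these assumptions, one matrix multiplication with the spliced matrix produces the stacked outputs of all $n$ sub-layers.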

The first RCNN can use the sigmoid function to normalize the label logits obtained by each sub-classification fc, which is equivalent to performing binary classification for each category. Merging the sub-classification fc of the multiple tasks in the model into one classification fc therefore does not affect the inference results of the model; that is, the outputs of the multiple sub-classification fc are the same as the output of the classification fc obtained by merging them.

That is to say, the tasks performed by the second RCNN and the first RCNN can be the same, and the output results are also the same. However, an accelerator such as an NPU completes only one matrix operation at a time. In the first RCNN, the output of the hidden layer must be input to the sub-classification fc and sub-regression fc of each task, which requires multiple matrix operations. The number of matrix multiplications in the first RCNN increases with the number of tasks, whereas the number of matrix multiplications performed in the second RCNN is not affected by the number of tasks. That is to say, when the parameters of the first RCNN and the second RCNN are the same, executing the second RCNN takes less time than executing the first RCNN.
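Exemplarily, the equivalence of the outputs and the difference in the number of matrix multiplications can be checked numerically. The following is a minimal sketch for the classification branch only; the helper names, the weight values, and the omission of biases are assumptions.

```python
import math


def matvec(weight, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def merge_heads(sub_weights):
    """Splice per-task weight matrices row-wise into one matrix."""
    merged = []
    for weight in sub_weights:
        merged.extend(weight)
    return merged


hidden = [0.5, -1.0, 2.0]                       # output of the shared hidden layer
sub_cls = [
    [[0.1, 0.2, 0.3], [0.0, -0.5, 0.4]],        # task 0: 2 categories
    [[1.0, 0.0, -1.0]],                         # task 1: 1 category
]

# First RCNN: one matrix multiplication per task.
per_task = [[sigmoid(z) for z in matvec(w, hidden)] for w in sub_cls]
flat_per_task = [s for scores in per_task for s in scores]
matmuls_first = len(sub_cls)

# Second RCNN: a single matrix multiplication regardless of the number of tasks.
merged_scores = [sigmoid(z) for z in matvec(merge_heads(sub_cls), hidden)]
matmuls_second = 1
```

Because sigmoid is applied to each logit independently, the merged scores equal the per-task scores placed one after another.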

Therefore, the classification fully connected layer of the second RCNN is obtained by merging the sub-classification fully connected layers corresponding to multiple tasks in the first RCNN, and the regression fully connected layer of the second RCNN is obtained by merging the sub-regression fully connected layers corresponding to multiple tasks in the first RCNN. This reduces the number of matrix multiplication operations in the neural network accelerator, is friendlier to hardware, and further reduces time consumption.

The perceptual network of FIG. 11 is used to implement n tasks, the n tasks include task 0, task 1 . . . task n-1 in FIG. 10 . n is an integer greater than 1. The second RCNN includes hidden layers and cls fc and reg fc. Hidden layers can include Shared fc and/or Shared conv. cls fc may be obtained by merging n sub-cls fcs in FIG. 10 , and reg fc may be obtained by merging n sub-reg fcs in FIG. 10 .

In this way, cls fc can output the confidence that each proposal belongs to each category, and reg fc can output the adjusted candidate 2D box.

It should be noted that the FPN in Figure 11 is an optional module. In FIG. 11 , the ROI-Align module is used as the region of interest extraction module only as an example, and other methods may also be used to extract corresponding features. For details, refer to the foregoing description, which will not be repeated here.

Exemplarily, during training of the perceptual network, the classification and regression network adopts the first RCNN. After training is completed, the second RCNN is obtained based on the first RCNN; that is, in the perceptual network used for inference, the classification and regression network can adopt the second RCNN.

Exemplarily, the perceptual network in FIG. 10 can be applied to the training side, and the first RCNN in the trained perceptual network is merged to obtain the perceptual network shown in FIG. 11; that is, the model parameters of the merged network are obtained from the model parameters of the trained network. The perceptual network in FIG. 11 can be applied to the inference side to reduce time consumption.
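Exemplarily, obtaining the inference-side heads from the trained per-task heads can be sketched as below. The function `export_second_rcnn`, the dictionary layout, and the recorded row ranges are hypothetical; the row ranges are added here only so that the merged classification output can still be attributed to its task.

```python
def export_second_rcnn(task_heads):
    """Merge trained per-task heads into one cls fc and one reg fc.

    task_heads maps a task name to (sub_cls_weight, sub_reg_weight),
    each given as a list of rows. Returns the merged matrices and, per
    task, the row range of that task's categories in the merged cls fc.
    """
    cls_w, reg_w, cls_rows = [], [], {}
    for task, (sub_cls, sub_reg) in task_heads.items():
        cls_rows[task] = (len(cls_w), len(cls_w) + len(sub_cls))
        cls_w.extend(sub_cls)
        reg_w.extend(sub_reg)
    return cls_w, reg_w, cls_rows


# Illustrative trained heads: 3 vehicle categories, 2 wheel/light categories.
trained = {
    "vehicle": ([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [[0.0, 0.1]] * 4),
    "wheel_light": ([[0.7, 0.8], [0.9, 1.0]], [[0.2, 0.3]] * 4),
}
cls_w, reg_w, cls_rows = export_second_rcnn(trained)
```

The merged matrices keep every trained row unchanged, so no retraining is involved in this sketch.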

According to the solution of the embodiment of the present application, one perception network completes multiple perception tasks, the multiple tasks share one RPN, and the one RPN predicts the regions where the objects to be detected in the multiple tasks are located. While ensuring the performance of the perception network, this reduces the parameter count and computation of the perception network, improves processing efficiency, facilitates deployment in scenarios with high real-time requirements, reduces the pressure on hardware, and saves costs.

In addition, in the solution of the embodiment of the present application, the first RCNN or the second RCNN is used as the classification and regression network, and multiple tasks share the hidden layer of the RCNN, which further reduces the parameter count and computation of the perception network and improves the processing efficiency. Moreover, when the first RCNN is used for training, each task corresponds to an independent sub-classification fc and sub-regression fc, which improves the scalability of the perception network: detection tasks can be flexibly added or removed by adding or removing sub-classification fc and sub-regression fc.

In addition, in the solution of the embodiment of the present application, the multiple sub-classification fc and sub-regression fc in the first RCNN are merged, and the second RCNN is used as the classification and regression network. This further reduces the number of matrix operations, is friendlier to hardware, further reduces time consumption, and improves processing efficiency.

The perception network in the embodiment of the present application may be trained by using an existing training method.

However, when an existing training method is used with fully labeled sample data, the objects to be detected for all tasks present on the sample images in the data set must be labeled, and the labeling cost is relatively high. Moreover, if the perception network needs to be expanded, that is, new tasks need to be added, the sample images in the entire data set must be re-labeled to supplement the objects to be detected in the new task, which further increases the labeling cost and reduces the scalability of the perception network.

If partially labeled sample images are used for training, it is not necessary to label the objects to be detected for all tasks on a sample image, which reduces the labeling cost. However, since all tasks share one RPN, the training data of different tasks may suppress each other when the RPN is trained, so that the RPN cannot predict the candidate regions of the objects to be detected in all tasks, which affects the accuracy of the perception network. Specifically, the labeled data is partially labeled data; for example, only the objects to be detected for one task are labeled on a sample image. When the labeled data of the objects to be detected for that task is used for training, the parameters of the RPN are adjusted so that the RPN can more accurately predict the candidate 2D frames of the objects to be detected for that task, but it cannot accurately predict the candidate 2D frames of the objects to be detected for other tasks on the sample image. When the labeled data of the objects to be detected for another task is used for training, the parameters of the RPN are adjusted again, so the adjusted RPN may no longer accurately predict the candidate 2D frames of the objects to be detected for the other tasks. In this way, the training data of different tasks may suppress each other, and the RPN fails to predict all the target objects in the image.

The embodiment of the present application provides a training method for a perceptual network. The method uses other perceptual networks to infer the sample images in the training set, thereby providing pseudo frames (pseudo bounding boxes, Pseudo Bboxes) for the objects to be detected that are not labeled in the sample images, and then jointly trains the RPN based on the pseudo frames and the labeled data, which helps the RPN obtain candidate 2D frames of the objects to be detected in multiple tasks.

FIG. 12 shows a method 1200 for training a perceptual network provided by an embodiment of the present application. The method 1200 may be performed by a training device for a neural network model. The training device may be a cloud service device or a terminal device, for example, a computer, a server, or another device with sufficient computing power to execute the neural network model training method; it may also be a system composed of a cloud service device and a terminal device. Illustratively, the method 1200 may be performed by the training device 120 in FIG. 3, the neural network processor 50 in FIG. 5, or the execution device 310 in FIG. 6. The perception network includes an RPN, where the RPN is used to predict the position information of the candidate 2D frame of the target object in the sample image, the target object includes objects to be detected for multiple tasks, and each task in the multiple tasks includes at least one category.

Optionally, the perceptual network may be the perceptual network shown in FIG. 8. To avoid unnecessary repetition, relevant descriptions are appropriately omitted when describing the training method. During training, the input image is simply replaced with a sample image.

The method 1200 includes steps S1210 to S1220, and steps S1210 to S1220 are described below.

S1210, acquiring training data.

The target objects include a first task object and a second task object. The training data includes the sample image, the labeling data of the first task object on the sample image, and the pseudo frame of the second task object on the sample image. The labeling data includes the class label of the first task object and the labeled 2D frame of the first task object. The pseudo frame of the second task object is the target 2D frame of the second task object obtained by inferring the sample image through other perceptual networks.

Labeled data can also be understood as ground truth. Annotated class labels indicate the true class to which the task object belongs. The labeled data of the first task object can also be understood as the labeled data of the sample image. The fully annotated data of a sample image includes the class labels and annotated 2D frames of the objects to be detected in all tasks on the sample image; that is, it includes the annotation information of all objects of interest. Partially annotated data includes the class labels and annotated 2D frames of the objects to be detected in only some tasks on the sample image; that is, it includes the annotation information of only some objects of interest.

The first task objects may include objects to be detected in one or more tasks. The one or more tasks are the tasks where the first task object is located. The first task objects in different sample images in the training set may be the same or different. The "first" in the "first task object" in the embodiment of the present application is only used to define the object to be detected that has a true value in the sample image, and has no other limiting role.

For example, the annotation data of sample image 1# is the annotation data of vehicles; that is, the first task object in sample image 1# includes the objects to be detected in the vehicle detection task, such as trucks, cars, and buses. The annotation data of sample image 2# is the annotation data of wheels and lights; that is, the first task object in sample image 2# includes the objects to be detected in the wheel and light detection task, such as wheels and lights. The annotation data of sample image 3# includes the annotation data of vehicles and the annotation data of wheels and lights; that is, the first task object in sample image 3# includes the objects to be detected in the vehicle detection task and the objects to be detected in the wheel and light detection task.

That is to say, the labeling data of the sample images in the embodiments of the present application may be partially labeled data. Targeted collection can thus be carried out, that is, the required sample images are collected for specific tasks, and it is not necessary to label the objects to be detected for all tasks in each sample image, which reduces the cost of data collection and labeling. In addition, the scheme using partially labeled data is flexibly scalable: when a task is added, it is only necessary to provide the labeled data of the new task, and there is no need to label new objects to be detected on the basis of the original training data.
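Exemplarily, partially labeled sample images such as sample images 1# to 3# could be represented as below. This is only an illustrative data layout: the `Sample` class, the task names `vehicle` and `wheel_light`, and the box coordinates are assumptions and do not come from this application.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """A sample image annotated for only some tasks (partial labeling)."""
    image_id: str
    # task name -> list of (category, (x1, y1, x2, y2)) hand-labeled objects
    annotations: dict = field(default_factory=dict)

    def labeled_tasks(self):
        """Tasks where the first task object of this image is located."""
        return set(self.annotations)

    def unlabeled_tasks(self, all_tasks):
        """Tasks whose objects are not labeled on this image."""
        return set(all_tasks) - self.labeled_tasks()


ALL_TASKS = ["vehicle", "wheel_light"]  # two illustrative tasks
sample_1 = Sample("1#", {"vehicle": [("car", (10, 20, 110, 90))]})
sample_2 = Sample("2#", {"wheel_light": [("wheel", (30, 70, 50, 90))]})
sample_3 = Sample("3#", {"vehicle": [("bus", (0, 0, 200, 120))],
                         "wheel_light": [("light", (5, 40, 15, 50))]})
```

Adding a task in this layout only requires new samples that carry the new task name; existing samples are left unchanged.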

The Pseudo Bboxes on the sample image are the target 2D boxes of the second task object obtained by inferring the sample image through other perceptual networks. The Pseudo Bboxes on the sample image can also be understood as the Pseudo Bboxes of the second task object.

Other perceptual networks refer to perceptual networks other than the one to be trained. Exemplarily, the other perceptual network may be a multi-head multi-task perceptual network.

For example, the perceptual network as shown in FIG. 7 is used to infer the sample images in the training set, and the inference result of the sample image is obtained, and the inference result includes the target 2D frame of the target object on the sample image.

Exemplarily, other perceptual networks may also include multiple single-task perceptual networks.

For example, multiple single-task perceptual networks are used to infer the sample images in the training set respectively, and the inference results of the sample images are obtained respectively. The inference result of each single-task perceptual network includes the target 2D frames of the objects to be detected in that task on the sample image. According to the inference results of the multiple single-task perceptual networks, the target 2D frames of the objects to be detected in multiple tasks on the sample image can be obtained.

The second task objects may include objects to be detected in one or more tasks. The one or more tasks are the tasks where the second task object is located. The same object to be detected may exist in the second task object and the first task object. The second task objects in different sample images in the training set can be the same or different. The "second" in the "second task object" in the embodiment of the present application is only used to define the object to be detected with a pseudo frame in the sample image, and has no other limiting role.

Exemplarily, in the case that the same object to be detected exists in the first task object and the second task object, the annotation frame in the annotation data is used as the target output of the RPN. Labeled data is usually human-labeled data, and the accuracy of labeling data is usually higher than that of pseudo-frames obtained by inference from other perception networks. Using labelled frames as the target output can improve the accuracy of the training model.
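Exemplarily, giving priority to the annotation frame when the same object appears in both sets can be sketched as follows. How two frames are judged to be the same object is not specified here; this sketch assumes an IoU criterion with a hypothetical threshold of 0.7, and the function names are likewise hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_rpn_targets(annotated_boxes, pseudo_boxes, same_object_iou=0.7):
    """Keep all annotated boxes; add a pseudo box only if it matches no annotated box."""
    targets = list(annotated_boxes)
    for p in pseudo_boxes:
        if all(iou(p, a) < same_object_iou for a in annotated_boxes):
            targets.append(p)
    return targets


annotated = [(10, 20, 110, 90)]                  # hand-labeled car
pseudo = [(12, 22, 108, 92), (30, 70, 50, 90)]   # same car again, plus a wheel
targets = build_rpn_targets(annotated, pseudo)
```

In this example the first pseudo box nearly coincides with the annotated car and is discarded, while the wheel, which has no annotation, is kept as a target.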

For example, the multiple tasks that the perception network needs to complete include the 8 tasks in Table 1. The labeled data of sample image 1# is the labeled data of vehicles, and the first task object in sample image 1# includes the objects to be detected in the vehicle detection task, such as trucks, cars, and buses; that is, the labeled data of sample image 1# is partially labeled data. Sample image 1# is inferred through other perceptual networks to obtain the target 2D frame of the second task object, that is, the pseudo frame. For example, sample image 1# is inferred by 7 single-task perception networks that complete the 7 tasks in Table 1 other than the vehicle detection task, and the target 2D frame of the second task object is obtained; in this case, the second task object may include the objects in the 7 tasks in Table 1 other than the vehicle detection task. For another example, the multi-head multi-task perceptual network shown in Figure 7 can be used to complete the 8 tasks in Table 1; inferring sample image 1# with this network yields the target 2D frame of the second task object, and in this case the second task object may include the objects to be detected in the 8 tasks in Table 1. In this way, after the pseudo frames and the annotation frames are combined, the regions where the objects to be detected in the 8 tasks are located in sample image 1# can be obtained.

Pseudo frames supplement the unlabeled objects to be detected in the sample image. This prevents the partially labeled data of different tasks from suppressing each other and affecting the training of the RPN when the RPN is trained on partially labeled data, and it improves the recall rate of the RPN, which helps the RPN predict the regions where the objects to be detected are located in all tasks that need to be detected.

Further, other perceptual networks perform reasoning on the sample image, and can obtain the target 2D frame of the second task object on the sample image and the confidence level of the category to which the second task object belongs. When the confidence is greater than or equal to the first threshold, the target 2D frame of the second task object on the sample image obtained by other perceptual network inferences is used as the pseudo frame on the sample image. That is, when the confidence level is greater than or equal to the first threshold, the inference results of other perceptual networks are used for training.

Illustratively, a low threshold may be used for filtering. For example, the first threshold is 0.05, that is, the target 2D frame with a confidence level greater than or equal to 0.05 can be used as a pseudo frame on the sample image to participate in the training of the perceptual network together with the labeled data. It should be understood that the first threshold may be set as required, which is not limited in this embodiment of the present application.
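Exemplarily, collecting pseudo frames from several other perceptual networks and filtering them by the first threshold can be sketched as below. The networks are replaced by stand-in functions that return fixed detections; `collect_pseudo_boxes`, the stand-in functions, the task and category names, and the detection values are all hypothetical. Only the threshold value 0.05 is taken from the example above.

```python
FIRST_THRESHOLD = 0.05  # example value given in the text


def collect_pseudo_boxes(image, networks, threshold=FIRST_THRESHOLD):
    """Run each other network on the image; keep confident detections as pseudo boxes."""
    pseudo = []
    for task, network in networks.items():
        for category, box, confidence in network(image):
            if confidence >= threshold:
                pseudo.append({"task": task, "category": category,
                               "box": box, "confidence": confidence})
    return pseudo


# Stand-ins for already trained single-task networks (fixed outputs for illustration).
def wheel_light_net(image):
    return [("wheel", (30, 70, 50, 90), 0.92), ("light", (5, 40, 15, 50), 0.03)]


def sign_net(image):
    return [("traffic_sign", (200, 10, 220, 30), 0.05)]


networks = {"wheel_light": wheel_light_net, "sign": sign_net}
pseudo_boxes = collect_pseudo_boxes("sample_1.png", networks)
```

A detection whose confidence equals the threshold is kept, matching the "greater than or equal to" condition in the text.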

S1220, train the perception network based on the training data.

Specifically, step S1220 may include steps S1221 to S1223.

S1221: Calculate a first loss function value according to the difference between the marked 2D frame of the first task object and the target 2D frame of the second task object and the candidate 2D frame of the target object in the sample image predicted by the RPN.

That is to say, the labeled 2D frame of the first task object and the target 2D frame of the second task object are compared with the candidate 2D frame of the target object predicted by RPN, and the loss function value of the RPN stage is obtained, that is, the first loss function value.
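Exemplarily, the effect of adding pseudo frames to the RPN targets can be illustrated with a simplified loss. The form of the first loss function is not specified in this section; the sketch below assumes only an objectness term (binary cross-entropy over candidate boxes, positive when a candidate overlaps any target with IoU of at least 0.5) and omits any box regression term. All names and values are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def rpn_objectness_loss(candidates, objectness, targets, pos_iou=0.5):
    """Mean binary cross-entropy; a candidate is positive if it overlaps any target."""
    total = 0.0
    for box, p in zip(candidates, objectness):
        positive = any(iou(box, t) >= pos_iou for t in targets)
        total += -math.log(p) if positive else -math.log(1.0 - p)
    return total / len(candidates)


candidates = [(10, 20, 110, 90), (30, 70, 50, 90), (300, 300, 320, 320)]
objectness = [0.9, 0.8, 0.1]        # RPN scores for the three candidate boxes
annotated = [(10, 20, 110, 90)]     # labeled car
pseudo = [(30, 70, 50, 90)]         # wheel supplied by another network

loss_partial = rpn_objectness_loss(candidates, objectness, annotated)
loss_joint = rpn_objectness_loss(candidates, objectness, annotated + pseudo)
```

With only the annotated frame, the correctly proposed wheel is treated as background and penalized; with the pseudo frame added, it counts as a positive, so the joint loss is lower for the same prediction.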

The forward propagation of the perceptual network is performed based on the sample image, and the candidate 2D frame of the target object on the sample image is predicted by the RPN. The specific forward propagation process is shown in FIG. 8 and will not be repeated here.

S1222: Calculate a second loss function value of the perceptual network according to the labeled data of the sample image.

The second loss function value of the perceptual network is the second loss function value of the part of the perceptual network that needs to be trained. The part to be trained in the perception network includes the part to be trained in the classification and regression network, the region of interest extraction module, the RPN and the backbone network, and the part to be trained in the classification and regression network is determined according to the first task object.

The part of the perceptual network that needs to be trained refers to the part of the perceptual network that needs to be trained as determined by the sample images.

The classification and regression network can predict the confidence that the candidate 2D box belongs to each category and the target 2D box of the target object.

Specifically, after the RPN predicts the candidate 2D frame of the target object, the region of interest extraction module extracts the features of the candidate 2D frame from the feature map. The features of the candidate 2D frame are input into the part to be trained in the classification and regression network, which yields the confidence that the candidate 2D frame belongs to the object categories in the task corresponding to the first task object. The part of the classification and regression network that needs to be trained is determined according to the first task object; in other words, it is determined according to the task where the first task object is located.

Optionally, the classification and regression network includes a plurality of third RCNNs, and the part to be trained in the classification and regression network includes the third RCNN corresponding to the task where the first task object is located.

Exemplarily, the perception network may be as shown in FIG. 9 . The task where the first task object in the sample image 1# (an example of the sample image) is located includes a vehicle detection task, and the first task object includes an object to be detected in the vehicle detection task. The features of the candidate 2D box are input into the third RCNN corresponding to the vehicle detection task, and then the confidence level of the candidate 2D box belonging to the three categories of cars, trucks and buses, and the target 2D box are obtained. For sample image 1#, the part to be trained in the classification and regression network is the third RCNN corresponding to the vehicle detection task.

Optionally, the classification and regression network includes a first RCNN, and the part to be trained in the classification and regression network includes the hidden layer in the first RCNN and the sub-classification fc and sub-regression fc corresponding to the task where the first task object is located.

Exemplarily, the perception network can be as shown in FIG. 10. The task where the first task object in sample image 1# is located includes the vehicle detection task, and the first task object includes the objects to be detected in the vehicle detection task. After passing through the hidden layer in the first RCNN, the features of the candidate 2D frame are input into the sub-classification fc and sub-regression fc corresponding to the vehicle detection task, which yields the confidence that the candidate 2D frame belongs to the three categories of cars, trucks, and buses, as well as the target 2D frame. For sample image 1#, the part that needs to be trained in the classification and regression network is the hidden layer in the first RCNN and the sub-classification fc and sub-regression fc corresponding to the vehicle detection task.

The labeled data of the sample image is compared with the output result of the classification and regression network to obtain the loss function value, in the classification and regression network stage, of the task where the first task object is located, that is, the second loss function value. The losses of other tasks not involved in the annotation data of the sample image are not calculated.
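Exemplarily, skipping the loss of tasks that are not covered by the annotation data can be sketched as follows. Only a classification term is shown, using binary cross-entropy on sigmoid outputs; the regression term, the function names, and the numeric values are assumptions for illustration.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for one category (prob is a sigmoid output)."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))


def second_loss(predictions, labels):
    """Sum the classification loss over annotated tasks; other tasks are skipped."""
    total = 0.0
    for task, probs in predictions.items():
        if task not in labels:   # task not covered by this image's annotations
            continue
        total += sum(bce(p, y) for p, y in zip(probs, labels[task]))
    return total


# One proposal on sample image 1#: only the vehicle task is annotated.
predictions = {"vehicle": [0.9, 0.2, 0.1],   # car, truck, bus
               "wheel_light": [0.6, 0.3]}    # wheel, light
labels = {"vehicle": [1, 0, 0]}              # the proposal is a car
loss = second_loss(predictions, labels)
```

Because the wheel and light task has no entry in `labels`, its predictions do not contribute to the loss and its sub-layers receive no gradient from this image.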

S1223: Perform backpropagation based on the first loss function value and the second loss function value, and adjust the parameters of the part of the perception network that needs to be trained.

Back-propagation is performed based on the first loss function value to calculate the gradients of the parameters related to the first loss function value, and those parameters are then adjusted based on the gradients. The perceptual network is thereby adjusted so that the RPN can predict candidate frames more comprehensively.
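Exemplarily, one possible form of the parameter adjustment is a gradient descent step on a weighted sum of the two loss function values. This form is an assumption: how the two loss function values are combined is not specified here, and the symbols $\theta$ (parameters of the part to be trained), $\eta$ (learning rate), and $\lambda$ (loss weight) are introduced only for illustration.

```latex
L = L_1 + \lambda L_2,
\qquad
\theta \leftarrow \theta - \eta \, \frac{\partial L}{\partial \theta}
```

Here $L_1$ denotes the first loss function value and $L_2$ the second loss function value.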

When the training termination condition is met, the training is terminated, and a trained perceptual network is obtained.

For example, when the perceptual network converges, the training is terminated and the weights of the trained perceptual network are output.

It should be understood that steps S1221 to S1223 are only an implementation manner of step S1220, and step S1220 may also be implemented in other manners.

Exemplarily, step S1220 includes the following steps S1 to S3.

S1: Calculate a first loss function value according to the difference between the labeled 2D frame of the first task object and the target 2D frame of the second task object and the candidate 2D frame of the target object on the sample image predicted by the RPN.

S2: Calculate, according to the labeled data of the sample image, the pseudo frame on the sample image, and the pseudo-label of the second task object on the sample image, the second loss function value of the part to be trained in the perceptual network. The part to be trained in the perceptual network includes the part that needs to be trained in the classification and regression network, the region of interest extraction module, the RPN, and the backbone network. The part that needs to be trained in the classification and regression network is determined according to the first task object and the second task object. The pseudo-label on the sample image is the class label of the second task object on the sample image obtained by inferring the sample image through other perceptual networks.

Specifically, after the RPN predicts the candidate 2D frame of the target object, the region of interest extraction module extracts the features of the candidate 2D frame from the feature map. The features of the candidate 2D frame are input into the part to be trained in the classification and regression network, which yields the confidence that the candidate 2D frame belongs to the object categories in the task where the first task object is located and the confidence that the candidate 2D frame belongs to the object categories in the task where the second task object is located. The part of the classification and regression network that needs to be trained is determined according to the first task object and the second task object; in other words, it is determined according to the task where the first task object is located and the task where the second task object is located.

Exemplarily, the classification and regression network includes a plurality of third RCNNs, for example, the perceptual network may be as shown in FIG. 9 . The task where the first task object in the sample image 1# is located includes the detection task of the car, and the first task object includes the object to be detected in the detection task of the car. The features of the candidate 2D box are input into the third RCNN corresponding to the vehicle detection task, and then the confidence level of the candidate 2D box belonging to the three categories of cars, trucks and buses, and the target 2D box are obtained. The task where the second task object in the sample image 1# is located includes the detection task of wheels and lights, and the second task object includes objects in the detection task of wheels and lights. The features of the candidate 2D frame are input into the third RCNN corresponding to the detection task of wheels and lights, and then the confidence of the candidate 2D frame belonging to the two categories of wheels and lights, and the target 2D frame are obtained.

For sample image 1#, the parts to be trained in the classification and regression network are the third RCNN corresponding to the detection task of the car and the third RCNN corresponding to the detection task of the wheels and lights.

Exemplarily, the classification and regression network includes the first RCNN, for example, the perceptual network may be as shown in FIG. 10. The task where the first task object in the sample image 1# is located includes the detection task of the car, and the first task object includes the object to be detected in the detection task of the car. After passing through the hidden layer in the first RCNN, the features of the candidate 2D box are input into the sub-classification fc and sub-regression fc corresponding to the detection task of the car, and then the confidence of the candidate 2D box belonging to the three categories of cars, trucks and buses, and the target 2D box, are obtained. The task where the second task object in the sample image 1# is located includes the detection task of wheels and lights, and the second task object includes objects in the detection task of wheels and lights. After passing through the hidden layer in the first RCNN, the features of the candidate 2D box are input into the sub-classification fc and sub-regression fc corresponding to the detection task of wheels and lights, and then the confidence of the candidate 2D box belonging to the two categories of wheels and lights, and the target 2D box, are obtained.

For sample image 1#, the parts that need to be trained in the classification and regression network include the hidden layer in the first RCNN, the sub-classification fc and sub-regression fc corresponding to the detection task of the car, and the sub-classification fc and sub-regression fc corresponding to the detection task of wheels and lights.

The labeled data of the sample image is compared with the output of the classification and regression network to obtain, for the classification and regression network stage, the loss function value of the task corresponding to the first task object and the loss function value of the task corresponding to the second task object, that is, the second loss function value. In other words, the loss of other tasks that are not involved in the labeled data or the pseudo frames of the sample image is not calculated.

S3: Back-propagation is performed based on the first loss function value and the second loss function value, and the parameters of the part to be trained in the perceptual network are adjusted.

Moreover, according to the solution in the embodiment of the present application, the parts of the perception network that are shared by different tasks, such as the backbone network, the RPN, and the region of interest extraction module, all participate in the training process based on the labeled data of the different tasks, which enables the shared parts to learn the common features of each task. The parts of the perception network that correspond to different tasks, for example, the parts corresponding to each task in the classification and regression network, participate in the training process only on the basis of the labeled data of their respective tasks, so that each of them can learn its task-specific features, improving the accuracy of the model. At the same time, during training, the part of the classification and regression network that needs to be trained is determined according to the task, and the parts corresponding to different tasks do not affect each other, which ensures the independence of each task and makes the model highly flexible.

FIG. 13 shows a training method of a perceptual network provided by an embodiment of the present application. The method shown in FIG. 13 may be regarded as a specific implementation of the method shown in FIG. 12 . For related descriptions, refer to the description in method 1200 . In order to avoid unnecessary repetition, appropriate omissions are made when describing the method 1300 .

The solution of the embodiment of the present application is described in detail below by taking the visual perception system of ADAS/ADS as an example. The visual perception system of ADAS/ADS needs to perform target detection for various tasks, such as: dynamic obstacles, static obstacles, traffic signs, traffic lights, road signs (such as left turn signs or straight signs) and zebra crossings.

By adopting the solutions in the embodiments of the present application, the target detection of the above-mentioned various tasks can be completed in one sensing network. The solutions in the embodiments of the present application are described in detail below.

The training method of the perceptual network in the embodiment of the present application is described in detail below by taking the task division in Table 1 as an example.

Training data is prepared before training starts. The target objects include the first task object and the second task object. The training data includes the sample image, the annotation data of the first task object on the sample image, and the pseudo frame of the second task object on the sample image. The annotation data includes the class label of the first task object and the labeled 2D frame of the first task object.

According to the task division in Table 1, labeled data is provided for each task. For example, annotation data of vehicles is provided for the training of task0, with the 2D box and class label of Car/Truck/Bus marked on one or more sample images in the dataset; annotation data of people is provided for the training of task1, with the 2D box and class label of Pedestrian/Cyclist/Tricycle marked on one or more sample images in the dataset; annotation data of wheels and lights is provided for task2, with the 2D box and class label of Wheel/Car_light marked on one or more sample images in the dataset; annotation data of traffic lights is provided for task3, with the 2D box and class label of TrafficLight_Red/Yellow/Green/Black marked on one or more sample images in the dataset; and so on. In this way, each sample image has annotated data for at least one task.

In a possible implementation manner, the sample image includes annotation information of all objects of interest. That is, all objects of interest are annotated in each sample image. Exemplarily, the object of interest is the object to be detected in the eight categories in Table 1.

In another possible implementation, each type of annotation data only needs to annotate a specific type of object. That is, the labeled data of each sample image may be partial labeled data.

Exemplarily, only the class label and 2D frame of the object to be detected in one task are marked on each sample image.

Alternatively, class labels and 2D boxes of objects to be detected in multiple tasks can also be labeled on each sample image, that is, to provide mixed labeled data. For example, label the 2D box and class label of Car/Truck/Bus/Pedestrian/Cyclist/Tricycle at the same time on the sample image. In this way, the training data can be used to train the required training part of the perceptual network corresponding to the two tasks at the same time.

Illustratively, a task label may be assigned to each sample image, and the task label may be used to indicate which part of the perceptual network the sample image is used to train.
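As an illustration of how a task label could be derived from partial annotation data, the following minimal sketch maps class labels to tasks. The task0 to task3 assignment follows the examples given above; the expanded names `TrafficLight_Yellow`, `TrafficLight_Green` and `TrafficLight_Black`, the dictionary layout of an annotation, and the function name `task_labels` are assumptions made only for this example.

```python
# Class-to-task assignment taken from the task0..task3 examples in the text.
# The remaining tasks of Table 1 are omitted; the expanded traffic light
# class names are an assumed spelling of "TrafficLight_Red/Yellow/Green/Black".
TASK_CLASSES = {
    "task0": ["Car", "Truck", "Bus"],
    "task1": ["Pedestrian", "Cyclist", "Tricycle"],
    "task2": ["Wheel", "Car_light"],
    "task3": ["TrafficLight_Red", "TrafficLight_Yellow",
              "TrafficLight_Green", "TrafficLight_Black"],
}

# Inverse lookup: class label -> task.
CLASS_TO_TASK = {cls: task
                 for task, classes in TASK_CLASSES.items()
                 for cls in classes}


def task_labels(annotations):
    """Return the sorted tasks covered by the annotated class labels.

    `annotations` is a hypothetical list of dicts with keys "cls" and "box".
    """
    return sorted({CLASS_TO_TASK[a["cls"]] for a in annotations})


# A sample image annotated with both vehicles and people belongs to two tasks.
mixed = [{"cls": "Car", "box": (0, 0, 10, 10)},
         {"cls": "Pedestrian", "box": (5, 5, 8, 9)}]
print(task_labels(mixed))
```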

The labeled data of the sample image can be obtained in the above manner. Illustratively, the annotation data may be stored in an annotation file. The annotation file is the ground truth file.

The sample images are inferred through other perceptual networks, and the inference results are obtained. Inference results include Pseudo Bboxes on sample images. Pseudo Bboxes can be used to complement objects to be detected belonging to other tasks that are not labeled in the labeled data of the sample image. Illustratively, the inference result may be stored in an inference result file. The inference result file is the Pseudo Bboxes file.

Each sample image can correspond to an annotation file and an inference result file. In a possible implementation manner, the labeled 2D boxes in the labeled data of the sample image and the Pseudo Bboxes can be combined to obtain the 2D boxes of the objects to be detected in all tasks on the sample image.

For example, using a multi-head multi-task perceptual network to infer sample images to obtain inference results.

For another example, use multiple single-task perceptual networks to infer the sample images respectively, obtain the inference results of multiple tasks, and fuse the inference results of multiple tasks together.

Further, the inference result also includes the confidence level of the category to which the second task object on the sample image belongs. A low threshold is used to filter the inference results. That is, inference results whose confidence is less than the first threshold are filtered out. The confidence levels corresponding to the Pseudo Bboxes used for training are all greater than or equal to the first threshold. For example, the first threshold is 0.05.
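The low-threshold filtering described above can be sketched as follows. The record layout (a dict with a `score` key) and the function name are assumptions for illustration; only the comparison against the first threshold, with 0.05 as the example value, comes from the text.

```python
FIRST_THRESHOLD = 0.05  # example value of the first threshold given in the text


def filter_pseudo_boxes(inference_results, threshold=FIRST_THRESHOLD):
    """Keep only inference results whose confidence is >= threshold.

    Each result is a hypothetical dict: {"box": ..., "cls": ..., "score": ...}.
    """
    return [r for r in inference_results if r["score"] >= threshold]


results = [
    {"box": (0, 0, 4, 4), "cls": "Wheel", "score": 0.04},   # filtered out
    {"box": (2, 2, 6, 6), "cls": "Wheel", "score": 0.05},   # kept (equal to threshold)
    {"box": (8, 8, 20, 20), "cls": "Car", "score": 0.90},   # kept
]
pseudo_bboxes = filter_pseudo_boxes(results)
```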

A perceptual network is trained based on partially labeled data and Pseudo Bboxes. Specifically, the method 1300 includes steps S1310 to S1350.

S1310, acquiring training data.

The training data is input into the perception network, and the training data includes a sample image, annotated data of the first task object on the sample image, and a pseudo frame of the second task object on the sample image.

For example, input the sample image, the annotation file corresponding to the sample image, and the Pseudo Bboxes file into the perceptual network.

Step S1310 corresponds to step S1210 in the method 1200. For details, please refer to step S1210.

Perform forward propagation of the perceptual network based on the training data.

Exemplarily, according to the task division in Table 1, the structure of the perceptual network used in the training process is shown in Figure 14. As shown in Figure 14, the perceptual network includes a backbone, an RPN, a region of interest extraction module, and a first RCNN. The perceptual network shown in FIG. 14 can be regarded as a specific implementation of the perceptual network shown in FIG. 10. The perceptual network in Figure 14 can simultaneously complete the object detection of the 8 categories in Table 1. In other words, the perceptual network in Figure 14 can simultaneously complete the target detection of the eight tasks in Table 1. Specifically, the 8 sub-classification fc and sub-regression fc in the first RCNN in Figure 14 simultaneously complete the 2D object detection of the 8 categories in Table 1. It can be seen from Figure 14 that the perceptual network of the present application can flexibly add or delete sub-classification fc and sub-regression fc in the first RCNN according to business needs, so that a perceptual network capable of target detection for different numbers of tasks can be obtained through training.
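The structure of the first RCNN, a shared hidden layer followed by one sub-classification fc and one sub-regression fc per task, can be sketched as a container to which heads are added or removed. This is only a structural sketch: the class name `FirstRCNNSketch`, its methods, and the use of plain callables in place of real layers are assumptions for illustration.

```python
class FirstRCNNSketch:
    """Shared hidden layer plus per-task sub-classification/sub-regression heads."""

    def __init__(self, hidden):
        self.hidden = hidden   # callable standing in for the shared fc/conv
        self.cls_fc = {}       # task name -> sub-classification callable
        self.reg_fc = {}       # task name -> sub-regression callable

    def add_task(self, task, cls_fc, reg_fc):
        """Attach the heads of a new detection task."""
        self.cls_fc[task] = cls_fc
        self.reg_fc[task] = reg_fc

    def remove_task(self, task):
        """Detach the heads of a task; shared parts are untouched."""
        del self.cls_fc[task]
        del self.reg_fc[task]

    def forward(self, roi_feature, tasks=None):
        """Run the shared layer once, then only the heads of the given tasks."""
        h = self.hidden(roi_feature)
        tasks = list(self.cls_fc) if tasks is None else tasks
        return {t: (self.cls_fc[t](h), self.reg_fc[t](h)) for t in tasks}


# Toy callables stand in for real layers.
net = FirstRCNNSketch(hidden=lambda f: [2 * v for v in f])
net.add_task("task0", cls_fc=lambda h: sum(h), reg_fc=lambda h: h[:1])
net.add_task("task3", cls_fc=lambda h: max(h), reg_fc=lambda h: h[-1:])
outputs = net.forward([1, 2, 3], tasks=["task3"])  # only the task3 heads run
```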

S1320, use the labeled data and Pseudo Bboxes to calculate the loss in the RPN stage.

The labeled data of the sample image includes the labeled 2D box and the class label of the first task object. The Pseudo Bboxes on the sample image include the Pseudo Bboxes of the second task object.

Step S1320 includes: using the labeled 2D box of the first task object and the Pseudo Bboxes of the second task object to calculate the loss in the RPN stage, that is, the first loss function value.

For example, merge the Pseudo Bboxes with confidence greater than or equal to 0.05 in the Pseudo Bboxes file with the annotated 2D boxes in the annotation data to obtain the 2D boxes of all target objects on the sample image. The 2D boxes of all target objects are compared with the candidate 2D boxes predicted by RPN, and the loss function value of the RPN stage is obtained, that is, the first loss function value.
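The following sketch shows why merging the pseudo boxes into the RPN targets matters. The text only states that the merged 2D boxes are compared with the candidate 2D boxes; the IoU-based positive/negative assignment, the threshold `pos_iou=0.5`, and the function names used here are assumptions chosen to make the comparison concrete.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rpn_targets(labeled_boxes, pseudo_boxes, proposals, pos_iou=0.5):
    """Label each proposal 1 (object) or 0 (background).

    The targets are the labeled 2D boxes merged with the pseudo boxes, so
    unlabeled objects of other tasks are not treated as background.
    """
    targets = list(labeled_boxes) + list(pseudo_boxes)
    return [1 if any(iou(p, t) >= pos_iou for t in targets) else 0
            for p in proposals]


labeled = [(0, 0, 10, 10)]        # annotated object of the first task
pseudo = [(20, 20, 30, 30)]       # pseudo box of an unlabeled second-task object
proposals = [(1, 1, 10, 10), (21, 21, 30, 30), (50, 50, 60, 60)]

with_pseudo = rpn_targets(labeled, pseudo, proposals)
without_pseudo = rpn_targets(labeled, [], proposals)
```

Without the pseudo boxes, the second proposal would be labeled as background even though it covers a real object, which is the mutual suppression that the pseudo boxes are meant to avoid.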

Step S1320 corresponds to step S1221 in the method 1200. For details, please refer to step S1221.

S1330, using the labeled data to calculate the loss in the classification and regression network stage.

The sample image may belong to one or more tasks according to the data type it is annotated. In other words, the sample image may belong to one or more tasks according to the task corresponding to the first task object. For example, if a sample image is only marked with traffic signs, the sample image only belongs to the task of traffic signs. If a sample image is marked with people and cars at the same time, then the sample image belongs to the two tasks of people and cars. When calculating the loss of the classification and regression network stage, only the loss of the part corresponding to the task to which the current sample image belongs is calculated, and the loss of other tasks is not calculated. For example, if the currently input sample image belongs to the task of people and cars, only the loss of the part corresponding to the person and the car is calculated, and the loss of the part corresponding to the other tasks (such as traffic lights and traffic signs) is not calculated.

For example, as shown in Figure 14, the region of interest extraction module extracts features from a feature map according to the candidate 2D frame predicted by the RPN. After passing through the shared fc and shared conv, the features enter the sub-classification fc and sub-regression fc corresponding to the task of the sample image to obtain the prediction result, that is, the confidence that the candidate 2D frame belongs to each object category in the task, and the target 2D frame. The labeled data is then compared with the prediction result to obtain the loss in the classification and regression network stage corresponding to the task.

If the labeled data of the current sample image only includes the labeled data of one task, then when the sample image is input to the network for training, among the multiple sub-classification fc and sub-regression fc in the first RCNN, only the sub-classification fc and sub-regression fc corresponding to that task are trained, and the sub-classification fc and sub-regression fc corresponding to other tasks in the first RCNN are not affected.

For example, as shown in Figure 14, if the current sample image is only marked with the 2D frame of the traffic light, and, as shown in Table 1, the task of the traffic light is task 3, then during training the prediction result of the traffic light in the sample image is obtained only through the sub-classification fc and sub-regression fc corresponding to task 3, and is compared with the true value to obtain the loss value. That is to say, the sample image of the traffic light only passes through the backbone, the RPN, the region of interest extraction module, and the sub-classification fc and sub-regression fc corresponding to the traffic light in the first RCNN; the sub-classification fc and sub-regression fc corresponding to other tasks are not involved in the calculation of the loss value.

If the labeled data of the current sample image includes the labeled data of multiple tasks, then when the sample image is input to the network for training, among the multiple sub-classification fc and sub-regression fc in the first RCNN, only the sub-classification fc and sub-regression fc corresponding to those tasks are trained, and the sub-classification fc and sub-regression fc corresponding to other tasks in the first RCNN are not affected.

For example, as shown in Figure 14, suppose the current sample image is marked with a 2D frame of a traffic light and a 2D frame of a person, where, as shown in Table 1, the task of the traffic light is task 3 and the task of people is task 1. During training, the prediction results of the traffic lights in the sample image are obtained through the sub-classification fc and sub-regression fc corresponding to task 3, the prediction results of the people in the sample image are obtained through the sub-classification fc and sub-regression fc corresponding to task 1, and these are compared with the true values to obtain the loss values corresponding to the two tasks. That is to say, the sample image only passes through the backbone, the RPN, the region of interest extraction module, the sub-classification fc and sub-regression fc corresponding to task 3 in the first RCNN, and the sub-classification fc and sub-regression fc corresponding to task 1 in the first RCNN. The sub-classification fc and sub-regression fc corresponding to other tasks do not participate in the calculation of the loss value. In this way, the losses of the classification and regression stage corresponding to the two tasks are obtained, and the overall loss value of the classification and regression stage can be the average of these losses.
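The task-selective loss described above can be sketched as follows. The representation of each head's loss as a zero-argument callable and the function name are assumptions for illustration; the averaging over the tasks to which the sample image belongs follows the text, which says the overall loss can be the average of the per-task losses.

```python
def classification_regression_loss(head_losses, image_tasks):
    """Average the stage-two loss over the tasks the sample image belongs to.

    head_losses: dict mapping task name -> zero-argument callable that
                 computes that task's classification + regression loss
                 (hypothetical placeholder for the real computation).
    image_tasks: tasks for which the sample image carries annotation data.
    Heads of all other tasks are never evaluated.
    """
    per_task = {t: head_losses[t]() for t in image_tasks}
    total = sum(per_task.values()) / len(per_task)
    return total, per_task


# A sample image annotated with people (task1) and traffic lights (task3).
head_losses = {
    "task0": lambda: 9.0,   # never called for this image
    "task1": lambda: 0.5,
    "task3": lambda: 1.5,
}
total, per_task = classification_regression_loss(head_losses, ["task1", "task3"])
```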

S1340, back-propagate the gradient.

After the loss is calculated, gradient back-propagation needs to be performed.

Back-propagation is performed based on the loss of the RPN stage (the first loss function value) and the loss of the classification and regression network stage (the second loss function value): the gradients of the relevant parameters are calculated and propagated back.

Gradient back-propagation is performed on the part of the perceptual network that needs to be trained. This part is determined according to the task to which the sample image belongs; parts that do not correspond to the task to which the sample image belongs do not participate in gradient back-propagation.

For example, as shown in Figure 14, the gradient is propagated back along the sub-classification fc and sub-regression fc corresponding to the task to which the sample image belongs, without affecting the sub-classification fc and sub-regression fc corresponding to other tasks. The shared fc or conv of the first RCNN, the RPN, and the backbone all participate in gradient back-propagation.

S1350, adjust the parameters of the perceptual network.

The weight parameters of the part of the perceptual network that needs to be trained are updated using the back-propagated gradients.
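A minimal sketch of this selective update is given below. The parameter group names, the flat lists of weights, and the plain gradient-descent rule are assumptions for illustration; the text only specifies that the shared parts and the heads of the tasks to which the sample image belongs are updated, while the heads of other tasks are left unchanged.

```python
# Shared parts that take part in back-propagation for every sample image
# (hypothetical group names).
SHARED_GROUPS = ("backbone", "rpn", "rcnn_shared")


def trainable_groups(image_tasks):
    """Parameter groups to update for a sample image of the given tasks."""
    heads = []
    for task in image_tasks:
        heads += [f"cls_fc/{task}", f"reg_fc/{task}"]
    return list(SHARED_GROUPS) + heads


def sgd_step(params, grads, image_tasks, lr=0.01):
    """Update only the groups that need training; other heads stay untouched."""
    for name in trainable_groups(image_tasks):
        params[name] = [w - lr * g for w, g in zip(params[name], grads[name])]
    return params


params = {
    "backbone": [1.0], "rpn": [1.0], "rcnn_shared": [1.0],
    "cls_fc/task1": [1.0], "reg_fc/task1": [1.0],
    "cls_fc/task3": [1.0], "reg_fc/task3": [1.0],
}
grads = {name: [1.0] for name in params}

# A traffic light image (task3): the task1 heads keep their weights.
sgd_step(params, grads, ["task3"], lr=0.5)
```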

In this way, the part corresponding to the task to which the sample image belongs in the perceptual network can be adjusted in a targeted manner, so that the part corresponding to the task to which the sample image belongs can better learn the task to which the sample image belongs.

S1360, determine whether the perceptual network converges.

If the perceptual network converges, output the weight parameters of the perceptual network.

If the perception network does not converge, go to step S1310 to continue the training process.

The annotation data of the sample images in the embodiments of the present application may be partial annotation data, so that targeted collection can be performed; that is, the required sample images are collected for a specific task without annotating all objects of interest in each image, which reduces the cost of data collection and annotation. In addition, preparing training data from partial annotation data is highly scalable. When a detection task is added, only the part corresponding to that detection task needs to be added to the classification and regression network, for example, the sub-classification fc and sub-regression fc corresponding to the new detection task, and sample images with annotation data of the newly added objects need to be provided; the newly added objects to be detected do not need to be annotated on the original training data.

Moreover, pseudo frames are used to supplement the unlabeled objects to be detected in the sample images. This prevents the partial annotation data of different tasks from suppressing each other when the RPN is trained on partial annotation data, which would impair the training of the RPN, and helps the RPN predict the regions where the objects to be detected in all tasks are located.

In addition, the part corresponding to each task in the perception network only detects the objects to be detected in that task, which avoids mistakenly penalizing unlabeled objects of other tasks during training. Furthermore, the shared parts of the perception network, such as the backbone, the RPN, and the region of interest extraction module, learn the common features of each task, while the parts corresponding to each task in the classification and regression network learn task-specific features; for example, the sub-classification fc and sub-regression fc corresponding to each task in the first RCNN learn the features specific to that task.

This embodiment of the present application further provides an object recognition method 1500, and the method 1500 can be executed by an object recognition apparatus. The object recognition apparatus may be a cloud service device or a terminal device, for example, a vehicle, drone, robot, computer, server, mobile phone, or other device with sufficient computing power to execute the object recognition method, or may be a system consisting of a cloud service device and a terminal device. For example, the method 1500 may be executed by the execution device 110 in FIG. 3, the neural network processor 50 in FIG. 5, the execution device 310 in FIG. 6, or a local device.

For example, the object recognition method may be specifically executed by the execution device 110 shown in FIG. 3 .

Optionally, the object recognition method may be processed by the GPU, or may be jointly processed by the CPU and the GPU, or other processors suitable for neural network computing may be used without using the GPU, which is not limited in this application.

In the method 1500, the perceptual network in the embodiment of the present application is used to process the image. In order to avoid unnecessary repetition, the repeated description is appropriately omitted when introducing the method 1500 below.

The method 1500 includes steps S1510 to S1540, which are described below.

The perception network includes backbone network, RPN, region of interest extraction module and classification regression network.

S1510, using the backbone network to perform convolution processing on the input image, and output the first feature map of the input image.

Exemplarily, the input image may be an image captured by a terminal device (or another apparatus or device such as a computer or server) through a camera, or the input image may be an image obtained from within the terminal device (or another apparatus or device such as a computer or server), for example, an image stored in an album of the terminal device, or an image obtained by the terminal device from the cloud, which is not limited in this embodiment of the present application.

S1520, using the RPN to output the position information of the candidate two-dimensional (2D) frame of the target object based on the second feature map, where the target object includes objects to be detected in multiple tasks, each of the multiple tasks includes at least one category, and the second feature map is determined according to the first feature map.

S1530, using the region of interest extraction module to extract first feature information on the third feature map based on the position information of the candidate 2D frame, where the first feature information is a feature of the region where the candidate 2D frame is located, and the third feature map is determined according to the first feature map.

S1540, using the classification and regression network to process the first feature information to obtain a target 2D frame and first indication information of the target object, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate the category to which the target object belongs.

Optionally, using the classification and regression network to process the first feature information to obtain the target 2D frame of the target object and the first indication information includes: using the classification and regression network to process the first feature information to obtain the confidence that the candidate 2D frame belongs to each category in the multiple tasks; using the classification and regression network to adjust the position information of the candidate 2D frame to obtain the adjusted candidate 2D frame; determining the target 2D frame according to the adjusted candidate 2D frame; and determining the first indication information according to the confidence that the target 2D frame belongs to each category.

Optionally, the classification and regression network includes a first regional convolutional neural network (RCNN). The first RCNN includes a hidden layer, multiple sub-classification fully connected layers and multiple sub-regression fully connected layers; the hidden layer is connected to the multiple sub-classification fully connected layers and to the multiple sub-regression fully connected layers; the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks. Using the classification and regression network to process the first feature information and output the target 2D frame of the target object and the first indication information includes: using the hidden layer to process the first feature information to obtain second feature information; using each sub-classification fully connected layer to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to the object categories in the task corresponding to that sub-classification fully connected layer; and using each sub-regression fully connected layer to adjust the position information of the candidate 2D frame according to the second feature information to obtain the adjusted candidate 2D frame.

Optionally, the classification and regression network includes a second RCNN. The second RCNN includes a hidden layer, a classification fully connected layer and a regression fully connected layer; the hidden layer is connected to the classification fully connected layer and to the regression fully connected layer. Using the classification and regression network to process the first feature information and output the target 2D frame of the target object and the first indication information includes: using the hidden layer to process the first feature information to obtain third feature information; using the classification fully connected layer to obtain, according to the third feature information, the confidence that the candidate 2D frame belongs to each category; and using the regression fully connected layer to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.

Optionally, the classification fully connected layer is obtained by merging multiple sub-classification fully connected layers in the first RCNN, and the regression fully connected layer is obtained by merging multiple sub-regression fully connected layers in the first RCNN. The first RCNN includes a hidden layer, multiple sub-classification fully connected layers, and multiple sub-regression fully connected layers; the hidden layer is connected to the multiple sub-classification fully connected layers and to the multiple sub-regression fully connected layers; the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks. Each sub-classification fully connected layer is used to obtain, according to the third feature information, the confidence that the candidate 2D frame belongs to the object categories in the task corresponding to that sub-classification fully connected layer; each sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.

FIG. 16 shows the processing flow of the object recognition method provided by the embodiment of the present application. The processing flow in FIG. 16 can be regarded as a specific implementation of the method shown in FIG. 15, and the method may be executed by the perceptual network in the embodiments of the present application. For related descriptions, refer to the description of the perceptual network 800. In order to avoid unnecessary repetition, repeated descriptions are appropriately omitted when describing the method 1600.

The solution of the embodiment of the present application is described in detail below by taking the visual perception system of ADAS/ADS as an example.

According to the task division in Table 1, the structure of the perceptual network adopted in this embodiment of the present application is shown in FIG. 17. As shown in Figure 17, the perceptual network includes a backbone, an RPN, a region of interest extraction module, and a second RCNN. The perceptual network shown in FIG. 17 can be regarded as a specific implementation of the perceptual network shown in FIG. 11. The perceptual network in Figure 17 can simultaneously complete the object detection of the 8 categories in Table 1. In other words, the perceptual network in Figure 17 can simultaneously complete the target detection of the 8 tasks in Table 1. The perceptual network shown in FIG. 17 may be determined according to the perceptual network shown in FIG. 14. For example, as shown in FIG. 18, the classification fc in the second RCNN is obtained by merging the multiple sub-classification fc in the first RCNN, and the regression fc in the second RCNN is obtained by merging the multiple sub-regression fc in the first RCNN. As can be seen from Figure 18, the perceptual network of the present application can flexibly add or delete sub-classification fc and sub-regression fc in the first RCNN according to business needs, so as to achieve target detection for different numbers of tasks.

Specifically, the method 1600 includes steps S1610 to S1650.

S1610, input the image to be processed.

S1620, generate basic features.

Exemplarily, step S1620 may be performed by the backbone in FIG. 17 .

Specifically, the backbone performs convolution processing on the input image to generate several feature maps of different scales, that is, the first feature map.

Exemplarily, the backbone can adopt various forms of convolutional networks, such as VGG16, Resnet50, or Inception-Net, etc.

Further, when the perceptual network further includes FPN, step S1620 may further include: performing feature fusion based on the first feature map, and outputting the fused feature map.

The feature maps output by the backbone network or FPN can be provided as basic features to subsequent modules.

S1630, predict the candidate 2D frame.

Exemplarily, step S1630 may be performed by the RPN in FIG. 17 .

The RPN predicts the region where the target object is located on the second feature map, and outputs a candidate 2D frame matching the region where the target object is located. The target object includes objects to be detected in multiple tasks. The second feature map may include the feature map output by the backbone network or FPN.

Specifically, the RPN predicts regions where target objects may exist based on the feature map provided by the backbone or FPN, and outputs candidate frames of these regions, or the coordinates of the candidate regions (proposals). In the embodiment of the present application, the RPN can predict candidate frames that may contain objects to be detected in the eight categories in Table 1.

S1640, extract the features of the candidate 2D frame.

Exemplarily, step S1640 may be performed by the region of interest extraction module in FIG. 17 .

The region of interest extraction module extracts the features of the region where the candidate 2D frame is located on the third feature map. The third feature map can be a feature map provided by backbone or FPN.

Exemplarily, according to the coordinates of the proposals provided by the RPN, the region of interest extraction module extracts the features of the region where each proposal is located on a feature map provided by the backbone or FPN, and resizes them to a fixed size to obtain the features of each proposal.
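As a rough illustration of the extract-and-resize step, the following sketch crops the region covered by one proposal from a single-channel feature map and max-pools it to a fixed size. The function name `roi_max_pool`, the (x1, y1, x2, y2) box convention, and the choice of max pooling are assumptions made for illustration only; the embodiment does not specify which resizing operator is used.

```python
import math

def roi_max_pool(feature_map, box, out_h=2, out_w=2):
    """Max-pool the region box=(x1, y1, x2, y2) of a 2D list to out_h x out_w.

    Coordinates are feature-map cells, x2 and y2 exclusive. The box is
    assumed to be non-empty and to lie inside the feature map.
    """
    x1, y1, x2, y2 = box
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Row range covered by output bin i (at least one cell).
        r0 = y1 + (i * roi_h) // out_h
        r1 = max(r0 + 1, y1 + math.ceil((i + 1) * roi_h / out_h))
        row = []
        for j in range(out_w):
            # Column range covered by output bin j (at least one cell).
            c0 = x1 + (j * roi_w) // out_w
            c1 = max(c0 + 1, x1 + math.ceil((j + 1) * roi_w / out_w))
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# Two proposals of different sizes on a toy 4x4 feature map.
toy_map = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(roi_max_pool(toy_map, (0, 0, 4, 4)))
print(roi_max_pool(toy_map, (1, 1, 4, 3)))
```

Both proposals yield a 2x2 feature regardless of their original size, which is what allows the subsequent fully connected layers to take a fixed-length input.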

S1650, correct and classify the candidate 2D frame.

Exemplarily, step S1650 may be performed by the second RCNN in FIG. 17 .

Specifically, the hidden layer in the second RCNN, for example, shared fc/conv, further performs feature extraction on the features of each proposal extracted by the region of interest extraction module, and sends the result to the cls fc and the reg fc. The cls fc classifies each proposal to obtain the confidence that the proposal belongs to each category, and the reg fc adjusts the coordinates of the 2D frame of the proposal to obtain more compact 2D frame coordinates. A frame merging operation, for example, non-maximum suppression (NMS), is then performed on the adjusted 2D frames, and the target 2D frames and the classification results are output. The classification result can be used as the first indication information.
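The frame merging step can be illustrated with a minimal greedy NMS sketch. The embodiment names NMS only as an example of a merging operation; the class-agnostic formulation, the (x1, y1, x2, y2) box convention, and the IoU threshold of 0.5 below are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

# The first two boxes overlap heavily; the third is far away.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

The number of boxes kept is never larger than the number of input boxes, which matches the statement that the number of target 2D frames is less than or equal to the number of candidate 2D frames.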

The weights of the classification fc in the second RCNN in FIG. 17 are obtained by combining the weights of the multiple sub-classification fcs of the first RCNN in FIG. 16. The weights of the regression fc in the second RCNN in FIG. 17 are obtained by combining the weights of the multiple sub-regression fcs of the first RCNN in FIG. 16. During training, the first RCNN in FIG. 16 uses the sigmoid function to normalize the label logits obtained by each sub-classification fc to obtain the confidence of each category, which is equivalent to performing binary classification for each category. Because the confidence of one category does not depend on the other categories, the sub-classification fcs of all tasks of the model can be combined into one classification fc during inference. The sub-regression fcs can likewise be combined into one regression fc.
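The reason the merge is lossless can be checked with a small sketch: because the sigmoid is applied element by element, concatenating the per-task weights along the output dimension gives the same confidences as running the per-task layers separately. The helper names, the toy sizes, and the presence of a bias term are assumptions for illustration; the text above speaks only of weights.

```python
import math

def linear(x, weight, bias):
    """Fully connected layer; weight has shape in_features x out_features."""
    return [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]

def sigmoid(logits):
    """Element-wise sigmoid, so each category is scored independently."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def merge_heads(heads):
    """Concatenate per-task (weight, bias) pairs along the output dimension."""
    in_features = len(heads[0][0])
    weight = [[w for head_w, _ in heads for w in head_w[i]]
              for i in range(in_features)]
    bias = [b for _, head_b in heads for b in head_b]
    return weight, bias

# Toy sizes: a 3-dimensional hidden feature, two tasks with 2 and 1 categories.
x = [1.0, -2.0, 0.5]
task_a = ([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [0.0, 0.1])
task_b = ([[1.0], [0.0], [-1.0]], [0.5])

per_task = sigmoid(linear(x, *task_a)) + sigmoid(linear(x, *task_b))
merged = sigmoid(linear(x, *merge_heads([task_a, task_b])))
print(all(math.isclose(p, m) for p, m in zip(per_task, merged)))
```

With a softmax over each task's categories the same concatenation would not be equivalent, since softmax couples the categories within a head.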

For example, the candidate 2D frame is a rectangular frame, the position information of the candidate 2D frame is represented by 4 values, the length of the feature output by the hidden layer is 1024, and the number of categories in each task is n. Then the weight of the sub-regression fc in each task is a tensor of 1024*4n, and the weight of the sub-classification fc in each task is a tensor of 1024*n. Table 1 contains 8 tasks and 31 categories of objects in total. The weight of the regression fc formed after merging is a tensor of 1024*124, and the weight of the classification fc is a tensor of 1024*31. That is to say, the second RCNN obtained after merging includes only one classification fc and one regression fc, whose input and output sizes are consistent with the shapes of the merged weights: the input of the classification fc and of the regression fc is 1024, the output of the classification fc is 31, and the output of the regression fc is 124.
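The shape arithmetic in this example can be reproduced as follows. The per-task split of the 31 categories used below is hypothetical, since only the totals (8 tasks, 31 categories) are stated here; any split with the same total gives the same merged shapes.

```python
HIDDEN = 1024    # length of the feature output by the hidden layer
BOX_VALUES = 4   # values that describe one rectangular 2D frame

def sub_head_shapes(n):
    """Weight shapes of one task's sub-classification and sub-regression fc."""
    return (HIDDEN, n), (HIDDEN, BOX_VALUES * n)

def merged_head_shapes(categories_per_task):
    """Weight shapes of the merged classification fc and regression fc."""
    total = sum(categories_per_task)
    return (HIDDEN, total), (HIDDEN, BOX_VALUES * total)

# Hypothetical split of 31 categories over 8 tasks (not taken from Table 1).
split = [3, 4, 2, 5, 3, 6, 4, 4]
cls_shape, reg_shape = merged_head_shapes(split)
print(cls_shape, reg_shape)
```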

Table 2 compares the parameter amount and calculation amount of the single-head multi-task network in the embodiment of the present application with those of the existing multi-head multi-task network when both implement the 8 tasks and the input image size is 720*1280 (@720p).

Table 2

8 Task-Model@720p	GFlops	Parameters (M)
Multi-head multi-task network (8 tasks)	413.96	142.76
Single-head multi-task network (8 tasks)	139.61	41.29

As shown in Table 2, if a multi-head multi-task network is used to implement the target detection of the 8 tasks in the embodiment of the present application, the total amount of computation required is 413.96 GFlops and the amount of network parameters (Parameters) is 142.76M. Such a large amount of computation and network parameters puts great pressure on the hardware. In contrast, the single-head multi-task network provided by the embodiment of the present application requires 139.61 GFlops and 41.29M parameters, which reduces the amount of computation by more than 60% and the amount of parameters by 71%, thereby reducing computing consumption and hardware pressure.

Table 3 shows the comparison of inference time consumption between the single-head multi-task network in the embodiment of the present application and the existing multi-head multi-task network.

Table 3

8 Task-Model@720p	720p latency (ms)	1080p latency (ms)
Multi-head multi-task network (8 tasks)	28	40
Single-head multi-task network (8 tasks)	23	31

As shown in Table 3, compared with the multi-head multi-task network, the single-head multi-task network of the embodiment of the present application reduces the latency by 17% and 22% on images with resolutions of 720p and 1080p, respectively. This significantly improves processing efficiency and is conducive to deployment in scenarios with high real-time requirements.
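The relative savings quoted for Table 2 and Table 3 can be checked directly from the tabulated values. The sketch below assumes that each reduction is computed as (multi-head value - single-head value) / multi-head value; the function name is chosen for illustration.

```python
def relative_reduction(baseline, value):
    """Fraction by which `value` is smaller than `baseline`."""
    return (baseline - value) / baseline

# (multi-head, single-head) pairs copied from Table 2 and Table 3.
measurements = {
    "GFlops": (413.96, 139.61),
    "Parameters (M)": (142.76, 41.29),
    "720p latency (ms)": (28, 23),
    "1080p latency (ms)": (40, 31),
}

for name, (multi_head, single_head) in measurements.items():
    print(f"{name}: {relative_reduction(multi_head, single_head):.1%}")
```

Computed this way, the parameter and latency reductions come out at roughly 71%, 18%, and 22.5%, and the computation reduction at roughly 66%, which is consistent with a reduction of more than 60%.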

In addition, the single-head multi-task network in the embodiment of the present application can achieve detection performance comparable to that of the multi-head multi-task network. Table 4 compares the performance of the single-head multi-task network and the multi-head multi-task network on some categories.

Table 4

Category	Multi-head multi-task network (AP)	Single-head multi-task network (AP)
Pedestrian	75.66	72.76
Cyclist	84.77	81.92
Car	96.09	97.56
Truck	88.18	90.21
Tram	88.48	94.62
TrafficCone	83.11	87.65
TrafficStick	73.31	86.68
FireHydrant	63.3	77.86
TrafficLight_Red	96.51	95.71
TrafficLight_Yellow	98.66	96.49
TrafficLight_Green	95.85	94.21
TrafficSign	83.87	86.14
GuideSign	56.57	59.96

As shown in Table 4, the average precision (AP) of the single-head multi-task network in the embodiment of the present application differs little from that of the existing multi-head multi-task network, that is, the performance of the two is comparable. It can be seen that the single-head multi-task network in the embodiment of the present application can save computation and memory while maintaining the performance of the model.

The apparatus of the embodiment of the present application will be described below with reference to FIG. 19 to FIG. 20 . It should be understood that the apparatuses described below can execute the methods of the foregoing embodiments of the present application. In order to avoid unnecessary repetition, the repetitive descriptions are appropriately omitted when introducing the apparatuses of the embodiments of the present application below.

FIG. 19 is a schematic block diagram of an apparatus according to an embodiment of the present application. The apparatus 4000 shown in FIG. 19 includes an acquisition unit 4010 and a processing unit 4020 .

In an implementation manner, the apparatus 4000 may be used as a training apparatus for a perception network, and the acquisition unit 4010 and the processing unit 4020 may be used to perform the training method of the perception network of the embodiments of the present application, for example, the method 1200 or the method 1300.

Specifically, the perception network includes a candidate region generation network (RPN), and the RPN is used to predict the position information of the candidate two-dimensional (2D) frame of the target object in the sample image. The target object includes objects to be detected for multiple tasks. Each task includes at least one category, and the target objects include a first task object and a second task object.

The acquisition unit 4010 is used to obtain training data. The training data includes the sample image, the labeling data of the first task object on the sample image, and the pseudo frame of the second task object on the sample image. The labeling data includes the class label of the first task object and the labeled 2D frame of the first task object, and the pseudo frame of the second task object is the target 2D frame of the second task object obtained by performing inference on the sample image through another perception network.

The processing unit 4020 is configured to train the perception network based on the training data.

Optionally, as an embodiment, the perception network further includes a backbone network, a region of interest extraction module, and a classification and regression network, and the processing unit 4020 is specifically configured to: calculate a first loss function value according to the difference between, on the one hand, the labeled 2D frame of the first task object and the target 2D frame of the second task object and, on the other hand, the candidate 2D frame of the target object in the sample image predicted by the RPN; calculate a second loss function value of the perception network according to the labeling data; and perform back-propagation based on the first loss function value and the second loss function value to adjust the parameters of the part to be trained in the perception network. The part to be trained in the perception network includes the part to be trained in the classification and regression network, the region of interest extraction module, the RPN, and the backbone network, and the part to be trained in the classification and regression network is determined according to the first task object.

Optionally, as an embodiment, the backbone network is used to perform convolution processing on the sample image and output the first feature map of the sample image; the RPN is used to output the position information of the candidate 2D frame of the target object based on the second feature map, where the second feature map is determined according to the first feature map; the region of interest extraction module is used to extract the first feature information on the third feature map based on the position information of the candidate 2D frame, where the first feature information is the feature of the region where the candidate 2D frame is located and the third feature map is determined according to the first feature map; and the classification and regression network is used to process the first feature information and output the target 2D frame of the target object and the first indication information, where the number of target 2D frames of the target object is less than or equal to the number of candidate 2D frames of the target object, and the first indication information is used to indicate the category to which the target object belongs.

Optionally, as an embodiment, the classification and regression network includes a first region-based convolutional neural network (RCNN). The first RCNN includes a hidden layer, multiple sub-classification fully connected layers, and multiple sub-regression fully connected layers. The hidden layer is connected to the multiple sub-classification fully connected layers and to the multiple sub-regression fully connected layers; the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks. The hidden layer is used to process the first feature information to obtain second feature information; each sub-classification fully connected layer is used to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to the object categories in the task corresponding to that sub-classification fully connected layer; and each sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the second feature information to obtain the adjusted candidate 2D frame. The part to be trained in the classification and regression network includes the hidden layer, and the sub-classification fully connected layer and the sub-regression fully connected layer corresponding to the task of the first task object.
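The way supervision is split in the training arrangement described above can be sketched as follows. This is only a schematic under stated assumptions: the dictionary layout of a sample, the function name `build_training_targets`, and the layer names in the returned list are hypothetical, and no actual loss computation is shown.

```python
def build_training_targets(sample, first_task):
    """Split the supervision of one sample image.

    sample["labeled"]: list of (box, class_label) for the first task object.
    sample["pseudo"]:  list of boxes of second task objects inferred by
                       another perception network.
    Returns the frames used for the first (RPN) loss, the labeling data used
    for the second loss, and the part of the classification and regression
    network that is trained for this sample.
    """
    labeled_boxes = [box for box, _ in sample["labeled"]]
    # First loss: labeled frames and pseudo frames both supervise the RPN.
    rpn_targets = labeled_boxes + list(sample["pseudo"])
    # Second loss: only the labeling data of the first task object is used.
    head_targets = list(sample["labeled"])
    # Only the shared hidden layer and the first task's own sub-layers train.
    trainable = ["hidden", f"cls_fc[{first_task}]", f"reg_fc[{first_task}]"]
    return rpn_targets, head_targets, trainable

# Hypothetical sample: two labeled vehicles and one pseudo frame of another task.
sample = {
    "labeled": [((0, 0, 10, 10), "Car"), ((5, 5, 20, 20), "Truck")],
    "pseudo": [(30, 30, 40, 40)],
}
rpn_targets, head_targets, trainable = build_training_targets(sample, "vehicle")
print(len(rpn_targets), len(head_targets), trainable)
```

Under this reading, the pseudo frames keep the shared RPN from treating unlabeled objects of other tasks as background, while the sub-layers of those other tasks are left untouched.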

In another implementation, the device 4000 may function as an object recognition device. The object recognition apparatus includes an acquisition unit 4010 and a processing unit 4020 . The perception network includes: backbone network, candidate region generation network, region of interest extraction module and classification regression network.

The acquisition unit 4010 and the processing unit 4020 may be used to execute the object recognition method of the embodiments of the present application, for example, the method 1500 or the method 1600.

The acquisition unit 4010 is used to acquire an input image.

The processing unit 4020 is configured to: use the backbone network to perform convolution processing on the input image to obtain the first feature map of the input image; use the RPN to output the position information of the candidate two-dimensional (2D) frame of the target object based on the second feature map, where the target object includes objects to be detected in multiple tasks, each of the multiple tasks includes at least one category, and the second feature map is determined according to the first feature map; use the region of interest extraction module to extract the first feature information on the third feature map based on the position information of the candidate 2D frame, where the first feature information is the feature of the region where the candidate 2D frame is located, and the third feature map is determined according to the first feature map; and use the classification and regression network to process the first feature information to obtain the target 2D frame of the target object and the first indication information, where the number of target 2D frames of the target object is less than or equal to the number of candidate 2D frames of the target object, and the first indication information is used to indicate the category to which the target object belongs.

Optionally, as an embodiment, the processing unit 4020 is specifically configured to: use the classification and regression network to process the first feature information to obtain the confidence that the candidate 2D frame belongs to each category in the multiple tasks; use the classification and regression network to adjust the position information of the candidate 2D frame to obtain an adjusted candidate 2D frame; determine the target 2D frame according to the adjusted candidate 2D frame; and determine the first indication information according to the confidence that the target 2D frame belongs to each category.
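One simple way to turn per-category confidences into the first indication information is to report the category with the highest confidence. The sketch below assumes this rule together with a minimum confidence of 0.5; the embodiment does not fix either choice, and the function name is hypothetical.

```python
def indicate_category(confidences, category_names, min_confidence=0.5):
    """Return (category, confidence) for the best category, or None.

    `confidences` holds one independent sigmoid score per category, so the
    scores need not sum to 1.
    """
    best = max(range(len(confidences)), key=lambda i: confidences[i])
    if confidences[best] < min_confidence:
        return None  # no category is confident enough for this frame
    return category_names[best], confidences[best]

names = ["Car", "Truck", "Tram"]
print(indicate_category([0.1, 0.9, 0.3], names))
print(indicate_category([0.1, 0.2, 0.3], names))
```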

Optionally, as an embodiment, the classification and regression network includes a first region-based convolutional neural network (RCNN). The first RCNN includes a hidden layer, multiple sub-classification fully connected layers, and multiple sub-regression fully connected layers. The hidden layer is connected to the multiple sub-classification fully connected layers and to the multiple sub-regression fully connected layers; the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks. The processing unit 4020 is specifically configured to: use the hidden layer to process the first feature information to obtain the second feature information; use each sub-classification fully connected layer to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to the object categories in the task corresponding to that sub-classification fully connected layer; and use each sub-regression fully connected layer to adjust the position information of the candidate 2D frame according to the second feature information to obtain the adjusted candidate 2D frame.

Optionally, as an embodiment, the classification and regression network includes a second RCNN. The second RCNN includes a hidden layer, a classification fully connected layer, and a regression fully connected layer; the hidden layer is connected to the classification fully connected layer and to the regression fully connected layer. The processing unit 4020 is specifically configured to: use the hidden layer to process the first feature information to obtain the third feature information; use the classification fully connected layer to obtain, according to the third feature information, the confidence that the candidate 2D frame belongs to each category; and use the regression fully connected layer to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.

Optionally, as an embodiment, the classification fully connected layer is obtained by merging multiple sub-classification fully connected layers in the first RCNN, and the regression fully connected layer is obtained by merging multiple sub-regression fully connected layers in the first RCNN. The first RCNN includes a hidden layer, multiple sub-classification fully connected layers, and multiple sub-regression fully connected layers. The hidden layer is connected to the multiple sub-classification fully connected layers and to the multiple sub-regression fully connected layers; the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks. Each sub-classification fully connected layer is used to obtain, according to the third feature information, the confidence that the candidate 2D frame belongs to the object categories in the task corresponding to that sub-classification fully connected layer; each sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.

It should be noted that the above-mentioned apparatus 4000 is embodied in the form of functional units. The term "unit" here can be implemented in the form of software and/or hardware, which is not specifically limited.

For example, a "unit" may be a software program, a hardware circuit, or a combination of the two that realizes the above-mentioned functions. The hardware circuit may include an application specific integrated circuit (ASIC), an electronic circuit, a processor (for example, a shared processor, a dedicated processor, or a group processor) and memory for executing one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that support the described functions.

Therefore, the units of each example described in the embodiments of the present application can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

FIG. 20 is a schematic diagram of a hardware structure of an apparatus provided by an embodiment of the present application. The apparatus 6000 shown in FIG. 20 (the apparatus 6000 may specifically be a computer device) includes a memory 6001 , a processor 6002 , a communication interface 6003 and a bus 6004 . The memory 6001 , the processor 6002 , and the communication interface 6003 are connected to each other through the bus 6004 for communication.

In one implementation, the apparatus 6000 may serve as a training apparatus for a perceptual network.

The memory 6001 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 6001 may store a program, and when the program stored in the memory 6001 is executed by the processor 6002, the processor 6002 is configured to execute each step of the method for training a perceptual network according to the embodiment of the present application. Specifically, the processor 6002 may perform step S1220 in the method shown in FIG. 12 above.

The processor 6002 may be a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is used to execute related programs to implement the training method of the perception network according to the method embodiments of the present application.

The processor 6002 may also be an integrated circuit chip with signal processing capability, for example, the chip shown in FIG. 5 . In the implementation process, each step of the training method of the perceptual network of the present application can be completed by the hardware integrated logic circuit in the processor 6002 or the instructions in the form of software.

The above-mentioned processor 6002 may also be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 6001, and the processor 6002 reads the information in the memory 6001 and, in combination with its hardware, completes the functions required to be performed by the units included in the training apparatus in the embodiments of the present application, or executes the training method of the perception network shown in FIG. 12 in the method embodiments of the present application.

The communication interface 6003 uses a transceiver device such as but not limited to a transceiver to implement communication between the device 6000 and other devices or a communication network. For example, training data can be obtained through the communication interface 6003 .

The bus 6004 may include a pathway for communicating information between the various components of the device 6000 (eg, the memory 6001, the processor 6002, the communication interface 6003).

In another implementation, the device 6000 may function as an object recognition device.

The memory 6001 may be a ROM, a static storage device, or a RAM. The memory 6001 may store a program. When the program stored in the memory 6001 is executed by the processor 6002, the processor 6002 and the communication interface 6003 are used to execute each step of the object recognition method of the embodiment of the present application. Specifically, the processor 6002 may perform steps S1520 to S1540 in the method shown in FIG. 15 above.

The processor 6002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is used to execute related programs, so as to realize the functions required to be performed by the units in the object recognition apparatus of the embodiment of the present application, or to execute the object recognition method of the method embodiment of the present application.

The processor 6002 may also be an integrated circuit chip with signal processing capability, for example, the chip shown in FIG. 6 . In the implementation process, each step of the object recognition method of the embodiment of the present application may be completed by an integrated logic circuit of hardware in the processor 6002 or an instruction in the form of software.

The above-mentioned processor 6002 may also be a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 6001, and the processor 6002 reads the information in the memory 6001 and, in combination with its hardware, completes the functions required to be performed by the units included in the object recognition apparatus of the embodiment of the present application, or executes the object recognition method of the method embodiment of the present application.

The communication interface 6003 uses a transceiver device such as but not limited to a transceiver to implement communication between the device 6000 and other devices or a communication network. For example, the data to be processed can be acquired through the communication interface 6003 .

It should be noted that although the above-mentioned apparatus 6000 only shows a memory, a processor, and a communication interface, in the specific implementation process, those skilled in the art should understand that the apparatus 6000 may also include other devices necessary for normal operation. Meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 6000 may further include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the apparatus 6000 may only include the necessary devices for implementing the embodiments of the present application, and does not necessarily include all the devices shown in FIG. 20 .

An embodiment of the present application provides a computer-readable medium, where the computer-readable medium stores program code executed by a device, where the program code includes relevant content for executing the object recognition method shown in FIG. 15 or FIG. 16 .

An embodiment of the present application provides a computer-readable medium, where the computer-readable medium stores program code executed by a device, where the program code includes relevant content for executing the training method shown in FIG. 12 or FIG. 13 .

An embodiment of the present application provides a computer program product, which, when the computer program product runs on a computer, enables the computer to execute the relevant content of the object recognition method shown in FIG. 15 or FIG. 16 .

An embodiment of the present application provides a computer program product, which, when the computer program product runs on a computer, enables the computer to execute the relevant content of the training method shown in FIG. 12 or FIG. 13 .

An embodiment of the present application provides a chip, where the chip includes a processor and a data interface, the processor reads an instruction on a memory through the data interface, and executes the object recognition method as shown in FIG. 15 or FIG. 16 .

An embodiment of the present application provides a chip, where the chip includes a processor and a data interface, the processor reads an instruction on a memory through the data interface, and executes the training method as shown in FIG. 12 or FIG. 13 .

Optionally, as an implementation manner, the chip may further include a memory in which instructions are stored. The processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the object recognition method of FIG. 15 or FIG. 16 or the training method of FIG. 12 or FIG. 13.

It should be understood that the processor in the embodiment of the present application may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

It should also be understood that the memory in the embodiments of the present application may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM).

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless (for example, infrared, radio, or microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that contains one or more sets of available media. The available media may be magnetic media (for example, floppy disks, hard disks, or magnetic tapes), optical media (for example, DVDs), or semiconductor media. The semiconductor medium may be a solid state drive.

It should be understood that the term "and/or" in this document only describes an association relationship between associated objects and indicates that three kinds of relationships are possible. For example, A and/or B can mean three cases: A exists alone, both A and B exist, or B exists alone, where A and B can be singular or plural. In addition, the character "/" in this document generally indicates that the objects before and after it are in an "or" relationship, but may also indicate an "and/or" relationship, which can be understood with reference to the context.

In this application, "at least one" means one or more, and "plurality" means two or more. "At least one of the following items" or a similar expression refers to any combination of these items, including a single item or any combination of plural items. For example, at least one of a, b, or c can represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.
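The enumeration in the preceding paragraph can be reproduced mechanically. The following is a minimal illustrative sketch, not part of the application; the function name `at_least_one` is hypothetical.

```python
from itertools import combinations


def at_least_one(items):
    """Return every non-empty combination of the given items, joined with '-'."""
    return [
        "-".join(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


print(at_least_one(["a", "b", "c"]))
# ['a', 'b', 'c', 'a-b', 'a-c', 'b-c', 'a-b-c']
```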

It should be understood that, in various embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply an order of execution. The execution order of each process should be determined by its functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.

Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the above-described systems, devices and units may refer to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. For example, the division of the units is only a logical function division, and there may be other division methods in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above are only specific embodiments of the present application, but the protection scope of the present application is not limited to this. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present application should be covered within the scope of protection of this application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

A perception network, comprising: a backbone network, a candidate region generation network RPN, a region of interest extraction module and a classification and regression network;

The backbone network is used to perform convolution processing on the input image and output the first feature map of the input image;

The RPN is used to output the position information of a candidate two-dimensional (2D) frame of the target object based on the second feature map, where the target object includes objects to be detected in multiple tasks, each task in the multiple tasks includes at least one category, and the second feature map is determined according to the first feature map;

The region of interest extraction module is configured to extract first feature information on the third feature map based on the position information of the candidate 2D frame, where the first feature information is a feature of the region where the candidate 2D frame is located, and the third feature map is determined according to the first feature map;

The classification and regression network is used to process the first feature information, and output the target 2D frame of the target object and the first indication information, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate the category to which the target object belongs.
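The data flow recited in this claim can be pictured with a small sketch. It is illustrative only: the four stages are passed in as plain callables, the toy stand-ins below (`toy_backbone`, `toy_rpn`, `toy_roi_extract`, `toy_cls_reg`) are hypothetical placeholders rather than the networks of the application, and the second and third feature maps are assumed to be the first feature map itself.

```python
def perceive(image, backbone, rpn, roi_extract, cls_reg):
    """Run the four stages in the order recited in the claim."""
    first_feature_map = backbone(image)
    # Assumption: the second and third feature maps are the first one unchanged.
    second_feature_map = third_feature_map = first_feature_map
    # One shared RPN proposes candidate 2D frames for the objects of all tasks.
    candidate_boxes = rpn(second_feature_map)
    first_feature_info = [roi_extract(third_feature_map, box) for box in candidate_boxes]
    return cls_reg(candidate_boxes, first_feature_info)


def toy_backbone(image):
    return image  # identity stand-in for convolution processing


def toy_rpn(feature_map):
    return [(0, 0, 2, 2), (2, 2, 4, 4)]  # fixed candidate boxes (x1, y1, x2, y2)


def toy_roi_extract(feature_map, box):
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in feature_map[y1:y2]]  # crop the candidate region


def toy_cls_reg(candidate_boxes, features):
    results = []
    for box, patch in zip(candidate_boxes, features):
        values = [v for row in patch for v in row]
        if sum(values) / len(values) > 0.5:  # keep only confident candidates
            results.append((box, "bright"))  # (target 2D frame, category)
    return results


image = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
targets = perceive(image, toy_backbone, toy_rpn, toy_roi_extract, toy_cls_reg)
print(targets)  # [((0, 0, 2, 2), 'bright')]
```

In this toy run the RPN stand-in proposes two candidate frames and one target frame is output, consistent with the number of target 2D frames being less than or equal to the number of candidate 2D frames.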
The perception network according to claim 1, wherein the classification and regression network is specifically used for:

processing the first feature information to obtain the confidence that the candidate 2D frame belongs to each category in the multiple tasks;

Adjust the position information of the candidate 2D frame to obtain the adjusted candidate 2D frame;

Determine the target 2D frame according to the adjusted candidate 2D frame;

The first indication information is determined according to the confidence that the target 2D frame belongs to each category.
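One possible way to determine target 2D frames from adjusted candidate frames is sketched below. The claim does not prescribe a selection rule; score thresholding followed by per-category non-maximum suppression is an assumption made here for illustration, and the names `iou`, `select_target_boxes`, `score_thresh` and `iou_thresh` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_target_boxes(adjusted_boxes, confidences, score_thresh=0.5, iou_thresh=0.5):
    """Pick target boxes and their categories from adjusted candidate boxes.

    confidences holds one dict per box, mapping category -> confidence.
    Assumed rule: keep the best category per box, drop low scores, then
    suppress overlapping boxes of the same category (greedy NMS).
    """
    scored = []
    for box, conf in zip(adjusted_boxes, confidences):
        category = max(conf, key=conf.get)
        if conf[category] >= score_thresh:
            scored.append((box, category, conf[category]))
    scored.sort(key=lambda item: item[2], reverse=True)
    kept = []
    for cand in scored:
        if all(k[1] != cand[1] or iou(k[0], cand[0]) <= iou_thresh for k in kept):
            kept.append(cand)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
confs = [
    {"car": 0.9, "person": 0.1},
    {"car": 0.8, "person": 0.2},
    {"car": 0.1, "person": 0.7},
]
print(select_target_boxes(boxes, confs))
# [((0, 0, 10, 10), 'car', 0.9), ((50, 50, 60, 60), 'person', 0.7)]
```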
The perceptual network according to claim 2, wherein the classification and regression network comprises a first regional convolutional neural network (RCNN), and the first RCNN comprises a hidden layer, a plurality of sub-classification fully connected layers and a plurality of sub-regression fully connected layers, the hidden layer is connected with the multiple sub-classification fully connected layers, the hidden layer is connected with the multiple sub-regression fully connected layers, the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks;

The hidden layer is used to process the first feature information to obtain second feature information;

The sub-category fully-connected layer is used to obtain, according to the second feature information, the confidence of the candidate 2D frame belonging to the object category in the task corresponding to the sub-category fully-connected layer;

The sub-regression fully connected layer is configured to adjust the position information of the candidate 2D frame according to the second feature information to obtain the adjusted candidate 2D frame.
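The structure of the first RCNN can be sketched as one shared hidden layer feeding one classification head and one regression head per task. This is a minimal pure-Python illustration under several assumptions that the claim does not state: random untrained weights, a ReLU activation, per-category sigmoid confidences, and box adjustment by directly adding four predicted offsets. The class name `FirstRCNN` and all parameter names are hypothetical.

```python
import math
import random


def make_layer(rng, n_in, n_out):
    """Create a fully connected layer as (weight rows, bias) with small random weights."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def linear(x, layer):
    weight, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class FirstRCNN:
    def __init__(self, feat_dim, hidden_dim, task_categories, seed=0):
        rng = random.Random(seed)
        self.task_categories = task_categories
        self.hidden = make_layer(rng, feat_dim, hidden_dim)  # shared by all tasks
        # One sub-classification and one sub-regression layer per task.
        self.cls_fc = {t: make_layer(rng, hidden_dim, len(c)) for t, c in task_categories.items()}
        self.reg_fc = {t: make_layer(rng, hidden_dim, 4) for t in task_categories}

    def forward(self, first_feature_info, candidate_box):
        # Hidden layer: first feature information -> second feature information.
        second = [max(0.0, v) for v in linear(first_feature_info, self.hidden)]
        outputs = {}
        for task, categories in self.task_categories.items():
            scores = [sigmoid(v) for v in linear(second, self.cls_fc[task])]
            offsets = linear(second, self.reg_fc[task])
            adjusted = tuple(c + d for c, d in zip(candidate_box, offsets))
            outputs[task] = (dict(zip(categories, scores)), adjusted)
        return outputs


tasks = {"vehicle": ["car", "truck"], "pedestrian": ["person"]}
net = FirstRCNN(feat_dim=8, hidden_dim=4, task_categories=tasks)
out = net.forward([0.5] * 8, (10.0, 10.0, 20.0, 20.0))
for task, (confidence, box) in out.items():
    print(task, confidence, box)
```

Because the hidden layer is computed once and reused by every head, adding a task under this sketch adds only two small fully connected layers.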
The perception network according to claim 2, wherein the classification and regression network includes a second RCNN, the second RCNN includes a hidden layer, a classification fully connected layer and a regression fully connected layer, the hidden layer is connected with the classification fully connected layer, and the hidden layer is connected with the regression fully connected layer;

The hidden layer is used to process the first feature information to obtain third feature information;

The classification fully connected layer is used to obtain the confidence that the candidate 2D frame belongs to the respective categories according to the third feature information;

The regression fully connected layer is configured to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.
The perceptual network according to claim 4, wherein the classification fully connected layer is obtained by merging a plurality of sub-classification fully connected layers in the first RCNN, and the regression fully connected layer is obtained by merging a plurality of sub-regression fully connected layers in the first RCNN,

The first RCNN includes the hidden layer, the multiple sub-category fully connected layers and the multiple sub-regression fully connected layers, the hidden layer is connected to the multiple sub-category fully connected layers, the hidden layer is connected to the multiple sub-regression fully connected layers, the multiple sub-category fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks;

The sub-category fully-connected layer is used to obtain, according to the third feature information, the confidence level that the candidate 2D frame belongs to the object category in the task corresponding to the sub-category fully-connected layer;

The sub-regression fully connected layer is configured to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.
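One plausible reading of the merging is that the per-task fully connected layers, which all take the same hidden-layer output as input, are stacked along the output dimension into a single layer. The sketch below illustrates that reading with made-up weights; the claim itself does not specify the merge operation, and `merge_layers` is a hypothetical helper.

```python
def linear(x, weight, bias):
    """Apply a fully connected layer given as weight rows and a bias vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def merge_layers(layers):
    """Stack the weight rows and biases of layers that share the same input."""
    merged_weight, merged_bias = [], []
    for weight, bias in layers:
        merged_weight.extend(weight)
        merged_bias.extend(bias)
    return merged_weight, merged_bias


# Two sub-classification layers with illustrative (made-up) parameters.
task_a_cls = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.5])
task_b_cls = ([[2.0, -1.0]], [1.0])
third_feature_info = [3.0, 4.0]

separate = linear(third_feature_info, *task_a_cls) + linear(third_feature_info, *task_b_cls)
merged = linear(third_feature_info, *merge_layers([task_a_cls, task_b_cls]))
print(separate)  # [3.0, 4.5, 3.0]
print(merged)    # [3.0, 4.5, 3.0]
```

Under this reading the merged layer produces the same values as the separate heads in one matrix-vector product, which is how a single classification layer could output confidences for the categories of all tasks.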
A training method for a perceptual network, wherein the perceptual network comprises a candidate region generation network (RPN), the RPN is used to predict the position information of a candidate two-dimensional (2D) frame of a target object in a sample image, the target object includes objects to be detected in multiple tasks, each of the multiple tasks includes at least one category, and the target object includes a first task object and a second task object;

The method includes:

Acquire training data, where the training data includes the sample image, the labeling data of the first task object on the sample image, and the pseudo frame of the second task object on the sample image, the labeling data includes the class label of the first task object and the labeled 2D frame of the first task object, and the pseudo frame of the second task object is the target 2D frame of the second task object obtained by inferring the sample image through other perceptual networks;

The perceptual network is trained based on the training data.
The training method according to claim 6, wherein the perception network further comprises a backbone network, a region of interest extraction module and a classification and regression network,

The training of the perception network based on the training data includes:

Calculate a first loss function value according to the difference between, on the one hand, the labeled 2D frame of the first task object and the target 2D frame of the second task object and, on the other hand, the candidate 2D frame of the target object in the sample image predicted by the RPN;

Calculate a second loss function value of the perceptual network according to the labeled data;

The first loss function value and the second loss function value are back-propagated to adjust the parameters of the part to be trained in the perceptual network, where the part to be trained in the perceptual network includes the part to be trained in the classification and regression network, the region of interest extraction module, the RPN and the backbone network, and the part to be trained in the classification and regression network is determined according to the first task object.
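The role of the pseudo frames in the first loss function value can be illustrated with a deliberately simplified stand-in. The sketch below is not the loss of the application: it assumes the RPN targets are the union of the labeled 2D frames and the pseudo frames, and it scores each target by the mean absolute coordinate difference to its closest candidate frame. The names `box_distance` and `first_loss_value` are hypothetical.

```python
def box_distance(a, b):
    """Mean absolute coordinate difference between two (x1, y1, x2, y2) boxes."""
    return sum(abs(p - q) for p, q in zip(a, b)) / 4.0


def first_loss_value(rpn_candidates, labeled_boxes, pseudo_boxes):
    """Toy RPN loss: labeled frames (first task) and pseudo frames (second task)
    are both treated as targets that some candidate frame should match."""
    targets = list(labeled_boxes) + list(pseudo_boxes)
    if not targets or not rpn_candidates:
        return 0.0
    return sum(
        min(box_distance(c, t) for c in rpn_candidates) for t in targets
    ) / len(targets)


labeled = [(0, 0, 10, 10)]     # labeled 2D frame of a first task object
pseudo = [(20, 20, 30, 30)]    # pseudo frame of a second task object
candidates = [(0, 0, 10, 12), (20, 20, 30, 30)]

print(first_loss_value(candidates, labeled, pseudo))  # 0.25
```

Without the pseudo frames, a candidate frame on a second task object would have no matching target in this sketch, so the shared RPN would receive no supervision for objects of tasks that are unlabeled in the sample image.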
The training method according to claim 6 or 7, wherein,

the backbone network, configured to perform convolution processing on the sample image, and output the first feature map of the sample image;

The RPN is used to output the position information of the candidate 2D frame of the target object based on a second feature map, where the second feature map is determined according to the first feature map;

The region of interest extraction module is configured to extract first feature information on the third feature map based on the position information of the candidate 2D frame, where the first feature information is a feature of the region where the candidate 2D frame is located, and the third feature map is determined according to the first feature map;

The classification and regression network is used to process the first feature information, and output the target 2D frame of the target object and the first indication information, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate the category to which the target object belongs.
The training method according to claim 8, wherein the classification and regression network comprises a first regional convolutional neural network (RCNN), and the first RCNN comprises a hidden layer, multiple sub-classification fully connected layers and multiple sub-regression fully connected layers, the hidden layer is connected with the multiple sub-classification fully connected layers, the hidden layer is connected with the multiple sub-regression fully connected layers, the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks;

The hidden layer is used to process the first feature information to obtain second feature information;

The sub-category fully-connected layer is used to obtain, according to the second feature information, the confidence of the candidate 2D frame belonging to the object category in the task corresponding to the sub-category fully-connected layer;

The sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the second feature information to obtain the adjusted candidate 2D frame;

and

The part to be trained in the classification and regression network includes the hidden layer and the sub-classification fully connected layer and the sub-regression fully connected layer corresponding to the task where the first task object is located.
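The selection of the part to be trained can be sketched as a table of flags keyed by component name. This is an illustrative reading in which the heads of tasks other than the one containing the first task object are left untouched; the component names and the helper `parts_to_train` are hypothetical.

```python
def parts_to_train(first_task, all_tasks):
    """Map each component of the perception network to whether it is updated
    when the sample image carries labels for first_task only."""
    if first_task not in all_tasks:
        raise ValueError(f"unknown task: {first_task}")
    # Shared components are always among the parts to be trained.
    trainable = {"backbone": True, "rpn": True, "roi_extract": True, "hidden": True}
    for task in all_tasks:
        # Only the heads of the task that contains the first task object are updated.
        trainable[f"cls_fc/{task}"] = task == first_task
        trainable[f"reg_fc/{task}"] = task == first_task
    return trainable


flags = parts_to_train("vehicle", ["vehicle", "pedestrian"])
for name, flag in flags.items():
    print(name, flag)
```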
An object recognition method, characterized in that a perception network comprises: a backbone network, a candidate region generation network (RPN), a region of interest extraction module and a classification and regression network, the method comprising:

Using the backbone network to perform convolution processing on the input image to obtain the first feature map of the input image;

Using the RPN to output the position information of a candidate two-dimensional (2D) frame of the target object based on the second feature map, where the target object includes objects to be detected in multiple tasks, each task in the multiple tasks includes at least one category, and the second feature map is determined according to the first feature map;

Using the region of interest extraction module to extract first feature information on the third feature map based on the position information of the candidate 2D frame, where the first feature information is the feature of the region where the candidate 2D frame is located, and the third feature map is determined according to the first feature map;

The first feature information is processed by the classification and regression network to obtain the target 2D frame of the target object and the first indication information, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate the category to which the target object belongs.
The method according to claim 10, wherein the using the classification and regression network to process the first feature information to obtain the target 2D frame and the first indication information of the target object, comprising:

Use the classification and regression network to process the first feature information to obtain the confidence that the candidate 2D frame belongs to each category in the multiple tasks;

Use the classification and regression network to adjust the position information of the candidate 2D frame to obtain the adjusted candidate 2D frame;

Determine the target 2D frame according to the adjusted candidate 2D frame;

The first indication information is determined according to the confidence that the target 2D frame belongs to each category.
The method according to claim 11, wherein the classification and regression network comprises a first regional convolutional neural network (RCNN), and the first RCNN comprises a hidden layer, a plurality of sub-classification fully connected layers and a plurality of sub-regression fully connected layers, the hidden layer is connected to the multiple sub-classification fully connected layers, the hidden layer is connected to the multiple sub-regression fully connected layers, the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the plurality of sub-regression fully connected layers are in one-to-one correspondence with the plurality of tasks; and

Using the classification and regression network to process the first feature information, and output the target 2D frame of the target object and the first indication information, include:

Use the hidden layer to process the first feature information to obtain second feature information;

Using the sub-category fully-connected layer to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to the object category in the task corresponding to the sub-category fully-connected layer;

The position information of the candidate 2D frame is adjusted by using the sub-regression fully connected layer according to the second feature information to obtain the adjusted candidate 2D frame.
The method according to claim 11, wherein the classification and regression network comprises a second RCNN, the second RCNN comprises a hidden layer, a classification fully connected layer and a regression fully connected layer, the hidden layer is connected to the classification fully connected layer, and the hidden layer is connected to the regression fully connected layer; and

The processing of the first feature information by using the classification and regression network, and outputting the target 2D frame of the target object and the first indication information, including:

Use the hidden layer to process the first feature information to obtain third feature information;

Obtain the confidence that the candidate 2D frame belongs to the respective categories according to the third feature information by using the classification fully-connected layer;

The position information of the candidate 2D frame is adjusted by using the regression fully connected layer according to the third feature information to obtain the adjusted candidate 2D frame.
The method according to claim 13, wherein the classification fully connected layer is obtained by merging multiple sub-classification fully connected layers in the first RCNN, and the regression fully connected layer is obtained by merging multiple sub-regression fully connected layers in the first RCNN,

The first RCNN includes the hidden layer, the multiple sub-category fully connected layers and the multiple sub-regression fully connected layers, the hidden layer is connected to the multiple sub-category fully connected layers, the hidden layer is connected to the multiple sub-regression fully connected layers, the multiple sub-category fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks;

The sub-category fully-connected layer is used to obtain, according to the third feature information, the confidence level that the candidate 2D frame belongs to the object category in the task corresponding to the sub-category fully-connected layer;

The sub-regression fully connected layer is configured to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.
A training device for a perceptual network, wherein the perceptual network includes a candidate region generation network (RPN), the RPN is used to predict the position information of a candidate two-dimensional (2D) frame of a target object in a sample image, the target object includes objects to be detected in multiple tasks, each task in the multiple tasks includes at least one category, and the target object includes a first task object and a second task object; the training device includes:

an acquisition unit, configured to acquire training data, where the training data includes the sample image, the labeling data of the first task object on the sample image, and the pseudo frame of the second task object on the sample image, the labeling data includes the class label of the first task object and the labeled 2D frame of the first task object, and the pseudo frame of the second task object is the target 2D frame of the second task object obtained by inferring the sample image through other perceptual networks;

A processing unit, configured to train the perception network based on the training data.
The training device according to claim 15, wherein the perception network further comprises a backbone network, a region of interest extraction module and a classification and regression network, and the processing unit is specifically configured to:

Calculate a first loss function value according to the difference between, on the one hand, the labeled 2D frame of the first task object and the target 2D frame of the second task object and, on the other hand, the candidate 2D frame of the target object in the sample image predicted by the RPN;

Calculate a second loss function value of the perceptual network according to the labeled data;

The first loss function value and the second loss function value are back-propagated to adjust the parameters of the part to be trained in the perceptual network, where the part to be trained in the perceptual network includes the part to be trained in the classification and regression network, the region of interest extraction module, the RPN and the backbone network, and the part to be trained in the classification and regression network is determined according to the first task object.
The training device according to claim 15 or 16, characterized in that,

the backbone network, configured to perform convolution processing on the sample image, and output the first feature map of the sample image;

The RPN is used to output the position information of the candidate 2D frame of the target object based on the second feature map, and the second feature map is determined according to the first feature map;

The region of interest extraction module is configured to extract first feature information on the third feature map based on the position information of the candidate 2D frame, where the first feature information is a feature of the region where the candidate 2D frame is located, and the third feature map is determined according to the first feature map;

The classification and regression network is used to process the first feature information, and output the target 2D frame of the target object and the first indication information, where the number of target 2D frames of the target object is less than or equal to the number of candidate 2D frames of the target object, and the first indication information is used to indicate the category to which the target object belongs.
The training device according to claim 17, wherein the classification and regression network comprises a first regional convolutional neural network (RCNN), and the first RCNN comprises a hidden layer, a plurality of sub-classification fully connected layers and a plurality of sub-regression fully connected layers, the hidden layer is connected to the multiple sub-classification fully connected layers, the hidden layer is connected to the multiple sub-regression fully connected layers, the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks;

The hidden layer is used to process the first feature information to obtain second feature information;

The sub-category fully-connected layer is used to obtain, according to the second feature information, the confidence of the candidate 2D frame belonging to the object category in the task corresponding to the sub-category fully-connected layer;

The sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the result of the hidden layer processing, so as to obtain the adjusted candidate 2D frame;

and

The part to be trained in the classification and regression network includes the hidden layer and the sub-classification fully connected layer and the sub-regression fully connected layer corresponding to the task where the first task object is located.
An object recognition device, characterized in that a perception network is deployed on the device, the perception network includes: a backbone network, a candidate region generation network (RPN), a region of interest extraction module and a classification and regression network, and the device includes:

an acquisition unit for acquiring an input image;

Processing unit for:

Using the backbone network to perform convolution processing on the input image to obtain the first feature map of the input image;

Using the RPN to output the position information of the candidate two-dimensional 2D frame of the target object based on the second feature map, the target object includes objects to be detected in multiple tasks, and each task in the multiple tasks includes at least one category, the second feature map is determined according to the first feature map;

Using the region of interest extraction module to extract first feature information on the third feature map based on the position information of the candidate 2D frame, where the first feature information is the feature of the region where the candidate 2D frame is located, and the third feature map is determined according to the first feature map;

The first feature information is processed by the classification and regression network to obtain the target 2D frame of the target object and the first indication information, where the number of target 2D frames of the target object is less than or equal to the number of candidate 2D frames of the target object, and the first indication information is used to indicate the category to which the target object belongs.
The device according to claim 19, wherein the processing unit is specifically configured to:

Using the classification and regression network to process the first feature information to obtain the confidence that the candidate 2D frame belongs to each category in the multiple tasks;

Use the classification and regression network to adjust the position information of the candidate 2D frame to obtain the adjusted candidate 2D frame;

Determine the target 2D frame according to the adjusted candidate 2D frame;

The first indication information is determined according to the confidence that the target 2D frame belongs to each category.
The apparatus according to claim 20, wherein the classification and regression network comprises a first regional convolutional neural network (RCNN), and the first RCNN comprises a hidden layer, multiple sub-classification fully connected layers and multiple sub-regression fully connected layers, the hidden layer is connected with the multiple sub-classification fully connected layers, the hidden layer is connected with the multiple sub-regression fully connected layers, the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks; and the processing unit is specifically used for:

Use the hidden layer to process the first feature information to obtain second feature information; use the sub-class fully connected layer to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to the object category in the task corresponding to the sub-class fully connected layer;

The position information of the candidate 2D frame is adjusted by using the sub-regression fully connected layer according to the second feature information to obtain the adjusted candidate 2D frame.
The apparatus according to claim 20, wherein the classification and regression network comprises a second RCNN, the second RCNN comprises a hidden layer, a classification fully connected layer and a regression fully connected layer, the hidden layer is connected with the classification fully connected layer, and the hidden layer is connected with the regression fully connected layer; and the processing unit is specifically used for:

Use the hidden layer to process the first feature information to obtain third feature information;

Using the classification fully connected layer to obtain the confidence that the candidate 2D frame belongs to the respective categories according to the obtained third feature information;

Using the regression fully connected layer to adjust the position information of the candidate 2D frame according to the obtained third feature information, to obtain the adjusted candidate 2D frame.
The apparatus according to claim 22, wherein the classification fully connected layer is obtained by merging a plurality of sub-classification fully connected layers in the first RCNN, and the regression fully connected layer is obtained by merging a plurality of sub-regression fully connected layers in the first RCNN,

The first RCNN includes the hidden layer, the multiple sub-category fully connected layers and the multiple sub-regression fully connected layers, the hidden layer is connected to the multiple sub-category fully connected layers, the hidden layer is connected to the multiple sub-regression fully connected layers, the multiple sub-category fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks;

The sub-category fully-connected layer is used to obtain, according to the obtained third feature information, the confidence that the candidate 2D frame belongs to the object category in the task corresponding to the sub-category fully-connected layer;

The sub-regression fully connected layer is configured to adjust the position information of the candidate 2D frame according to the obtained third feature information to obtain the adjusted candidate 2D frame.
A training device for a perception network, characterized in that it includes a processor and a transmission interface, the processor receives or sends data through the transmission interface, and the processor is configured to call program instructions stored in a memory to perform the method of any one of claims 6 to 9.
An object recognition device, characterized in that it includes a processor and a transmission interface, the processor receives or sends data through the transmission interface, and the processor is configured to call program instructions stored in a memory to perform the method of any one of claims 10 to 14.
A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code for execution by a device, and when the program code is executed on a computer or a processor, the computer or the processor is caused to perform the method of any one of claims 6 to 9 or 10 to 14.
A computer program product comprising instructions, characterized in that, when the computer program product is run on a computer or a processor, the computer or the processor is caused to perform the method of any one of claims 6 to 9 or 10 to 14.