CN117157679A - Perception network, training method of perception network, object recognition method and device


Info

Publication number
CN117157679A
Authority
CN
China
Prior art keywords
network
sub
classification
candidate
regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180096605.3A
Other languages
Chinese (zh)
Inventor
周凯强
江立辉
黄梓钊
秘谧
王鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN117157679A publication Critical patent/CN117157679A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Abstract

A perception network (800), a training method (1200) of the perception network, an object recognition method (1500), and an apparatus. The perception network (800) comprises a backbone network (810), a candidate region generation network (RPN) (820), a region of interest extraction module (830), and a classification regression network (840). A plurality of perception tasks share the single RPN (820): one RPN (820) predicts the regions where the objects to be detected in the plurality of tasks are located, and the classification regression network obtains target 2D frames and classification results. The perception network (800) can reduce the number of parameters and the amount of computation in a multi-task perception network, reduce hardware power consumption, and increase the running speed of the model.

Description

Perception network, training method of perception network, object recognition method and device
Technical Field
The present application relates to the field of computer vision, and more particularly, to a perception network, a training method of the perception network, an object recognition method, and an object recognition apparatus.
Background
Computer vision is an integral part of various intelligent/autonomous systems in many application fields, such as manufacturing, inspection, document analysis, and medical diagnosis. Figuratively speaking, computer vision gives a computer eyes (cameras/video cameras) and a brain (algorithms) so that the computer can perceive the environment. Computer vision uses various imaging systems in place of the visual organs to obtain input information, and then uses the computer in place of the brain to process and interpret that information.
With the development of visual perception technology and the growing demand for artificial intelligence (artificial intelligence, AI) perception of real scenes, more and more perception networks are being deployed in various fields. For example, perception networks deployed in advanced driving assistance systems (advanced driving assistant system, ADAS) and autonomous driving systems (autonomous driving system, ADS) may be used to identify obstacles on a road. However, most current perception networks can complete only one detection task; to implement multiple detection tasks, different networks usually need to be deployed for the different tasks. Running multiple perception networks at the same time increases hardware power consumption and reduces the running speed of the models. Moreover, the computing power of the chips used in many fields is low, so it is difficult to deploy even one large-scale perception network, let alone several.
Therefore, how to reduce hardware power consumption during the operation of a multi-task perception network is an urgent problem to be solved.
Disclosure of Invention
The application provides a perception network, a training method of the perception network, an object recognition method, and an apparatus, which can reduce the number of parameters and the amount of computation in a multi-task perception network, reduce hardware power consumption, and increase the running speed of the model.
In a first aspect, a perception network is provided, comprising: a backbone network, a candidate region generation network (region proposal network, RPN), a region of interest extraction module, and a classification regression network. The RPN is configured to output position information of a candidate two-dimensional (2D) frame of a target object based on a second feature map, where the target object includes objects to be detected in a plurality of tasks, each task of the plurality of tasks includes at least one category, and the second feature map is determined from a first feature map. The region of interest extraction module is configured to extract first characteristic information on a third characteristic map based on the position information of the candidate 2D frame, where the first characteristic information is the feature of the region in which the candidate 2D frame is located, and the third characteristic map is determined from the first feature map. The classification regression network is configured to process the first characteristic information and output target 2D frames of the target object and first indication information, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information indicates the category to which the target object belongs.
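As an illustration only, the data flow described above can be sketched as follows in PyTorch-style Python. The module interfaces, tensor shapes, and the use of roi_align are assumptions made for the sketch and are not the claimed implementation; a single-level feature map is assumed for brevity.

```python
import torch
from torch import nn
import torchvision

class PerceptionNet(nn.Module):
    """Minimal sketch of a single-RPN multi-task detector (interfaces are assumptions)."""
    def __init__(self, backbone, rpn, cls_reg_head, roi_size=7):
        super().__init__()
        self.backbone = backbone      # outputs the "first feature map"
        self.rpn = rpn                # one RPN shared by all tasks
        self.head = cls_reg_head      # classification regression network
        self.roi_size = roi_size

    def forward(self, image):
        feat = self.backbone(image)                   # first feature map, NCHW
        proposals = self.rpn(feat)                    # candidate 2D frames for all tasks, [K, 4]
        # region of interest extraction: crop the features of each candidate frame
        rois = torchvision.ops.roi_align(
            feat, [proposals], output_size=self.roi_size, spatial_scale=1.0)
        scores, boxes = self.head(rois, proposals)    # confidences + refined target 2D frames
        return scores, boxes
```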
According to the solution of the embodiment of the application, a plurality of perception tasks are completed by a single perception network, and the plurality of tasks share one RPN; that is, one RPN predicts the regions where the objects to be detected in the plurality of tasks are located. This reduces the number of parameters and the amount of computation of the perception network while maintaining its performance, improves processing efficiency, facilitates deployment in scenarios with high real-time requirements, reduces the load on the hardware, and saves cost.
The "first feature map" refers to a feature map (feature map) output by the backbone network. The feature maps output by the backbone network may all be referred to as first feature maps.
There may be one or more first feature maps of the input image.
Multiple tasks can also be understood as multiple broad categories. One broad class includes at least one class. Alternatively, a major class is a collection of at least one class. The division criteria of the tasks may be set as desired. For example, objects to be detected are divided into a plurality of tasks according to the similarity of the objects to be detected.
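For illustration only, a task division can be written as a mapping from each broad class (task) to the categories it contains; the grouping below is a hypothetical example assembled from the ADAS categories mentioned later in this description, not a division prescribed by the application.

```python
# Hypothetical task division: each "task" (broad class) groups similar categories.
TASKS = {
    "vehicle":       ["car", "truck", "bus", "motorbike", "bicycle", "tricycle"],
    "person":        ["pedestrian", "cyclist"],
    "traffic_sign":  ["traffic_sign", "guide_sign", "billboard"],
    "traffic_light": ["traffic_light_red", "traffic_light_yellow",
                      "traffic_light_green", "traffic_light_black"],
}
```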
The multiple tasks in the embodiment of the application share the same RPN, and the RPN can also be called a single-head multi-task RPN.
The second feature map may be one or more.
Illustratively, the second feature map may include one or more of the first feature maps.
The third feature map may be one of the first feature maps, for example.
With reference to the first aspect, in some implementations of the first aspect, the sensing network further includes a feature pyramid network (feature pyramid networks, FPN), where the FPN is connected to the backbone network and configured to perform feature fusion on the first feature map, and output the fused feature map.
In this case, the second feature map may include one or more of the fused feature maps.
The third profile may be, for example, one of the first profiles or one of the fused profiles of the FPN output.
According to the solution provided by the embodiment of the application, the FPN performs feature fusion on the first feature map, so that feature maps with stronger representational capability can be generated and provided to subsequent modules, further improving the performance of the model.
With reference to the first aspect, in certain implementations of the first aspect, the classification regression network is specifically configured to: processing the first characteristic information to obtain the confidence that the candidate 2D frames belong to each category in the plurality of tasks; adjusting the position information of the candidate 2D frames to obtain adjusted candidate 2D frames; determining a target 2D frame according to the adjusted candidate 2D frame; and determining first indication information according to the confidence that the target 2D frame belongs to each category.
Illustratively, the position information of the candidate 2D frame is adjusted so that the adjusted candidate 2D frame matches the shape of the real object better than the original candidate 2D frame does; in other words, the adjusted candidate 2D frame is a tighter candidate 2D frame.
Further, a frame merging operation is performed on the adjusted candidate 2D frames to obtain the target 2D frames. For example, non-maximum suppression (non maximum suppression, NMS) is applied to merge the adjusted candidate 2D frames to obtain the target 2D frames.
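A minimal sketch of this selection step is shown below, assuming the confidences are given as a per-candidate score matrix and that class-wise NMS is used; the thresholds and function name are illustrative assumptions.

```python
import torch
import torchvision

def select_target_frames(adjusted_boxes, class_scores, score_thr=0.05, iou_thr=0.5):
    """Merge adjusted candidate 2D frames into target 2D frames (sketch, thresholds assumed)."""
    confidences, labels = class_scores.max(dim=1)       # best category per candidate frame
    keep = confidences > score_thr                      # drop low-confidence candidates
    boxes = adjusted_boxes[keep]
    confidences, labels = confidences[keep], labels[keep]
    kept = torchvision.ops.batched_nms(boxes, confidences, labels, iou_thr)  # class-wise NMS
    # target 2D frames plus first indication information (category of each target object)
    return boxes[kept], labels[kept], confidences[kept]
```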
With reference to the first aspect, in certain implementations of the first aspect, the classification regression network includes a first regional convolutional neural network (region convolutional neural networks, RCNN), the first RCNN including a hidden layer, a plurality of sub-classification full-connection layers, and a plurality of sub-regression full-connection layers, the hidden layer being connected to the plurality of sub-classification full-connection layers, the hidden layer being connected to the plurality of sub-regression full-connection layers, the plurality of sub-classification full-connection layers being in one-to-one correspondence with the plurality of tasks, the plurality of sub-regression full-connection layers being in one-to-one correspondence with the plurality of tasks; the hidden layer is used for processing the first characteristic information to obtain second characteristic information; the sub-classification full-connection layer is used for obtaining the confidence coefficient of the object category of the candidate 2D frame belonging to the task corresponding to the sub-classification full-connection layer according to the second characteristic information; and the sub-regression full-connection layer is used for adjusting the position information of the candidate 2D frames according to the second characteristic information to obtain the adjusted candidate 2D frames.
Illustratively, the hidden layer may include at least one of a convolutional layer or a full-connection layer. Since multiple tasks share the hidden layer, the convolutional layers in the hidden layer may also be referred to as shared convolutional layers, and the full-connection layers in the hidden layer may also be referred to as shared full-connection layers (shared fully connected layers).
The first RCNN comprises a hidden layer, a plurality of sub-classification full-connection layers, and a plurality of sub-regression full-connection layers, where the sub-classification full-connection layers and the sub-regression full-connection layers correspond to the plurality of tasks. Each task may have its own independent sub-classification full-connection layer and sub-regression full-connection layer. The sub-classification full-connection layer and the sub-regression full-connection layer corresponding to each task can complete the detection of the objects to be detected in that task; specifically, the sub-classification full-connection layer can output the confidence that the candidate 2D frame belongs to an object category in the task, and the sub-regression full-connection layer can output the adjusted candidate 2D frame.
One first RCNN includes a plurality of sub-classification full-connection layers and sub-regression full-connection layers, and thus, one first RCNN can complete detection of an object to be detected in a plurality of tasks. The first RCNN may also be referred to as a single-head multi-tasking RCNN.
According to the solution provided by the embodiment of the application, the hidden layer of the first RCNN is shared by the plurality of tasks, which further reduces the number of parameters and the amount of computation of the perception network and improves processing efficiency. Moreover, each task corresponds to an independent sub-classification full-connection layer (fully connected layer, fc) and sub-regression fc, which improves the scalability of the perception network: its functions can be flexibly configured by adding or removing sub-classification fc and sub-regression fc layers.
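A sketch of such a first RCNN is given below, assuming fully connected hidden layers, one background class per task, and class-specific regression outputs; the layer sizes and interfaces are assumptions, not the claimed configuration.

```python
import torch
from torch import nn

class FirstRCNN(nn.Module):
    """Single-head multi-task RCNN sketch: one shared hidden layer plus
    one sub-classification fc and one sub-regression fc per task."""
    def __init__(self, in_dim, hidden_dim, classes_per_task):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())    # shared fc
        self.cls_fcs = nn.ModuleList(
            [nn.Linear(hidden_dim, n + 1) for n in classes_per_task])   # +1 background per task
        self.reg_fcs = nn.ModuleList(
            [nn.Linear(hidden_dim, n * 4) for n in classes_per_task])   # 4 box offsets per class

    def forward(self, roi_features):
        h = self.hidden(roi_features.flatten(1))     # "second characteristic information"
        scores = [fc(h) for fc in self.cls_fcs]      # per-task confidences
        deltas = [fc(h) for fc in self.reg_fcs]      # per-task candidate-frame adjustments
        return scores, deltas
```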
With reference to the first aspect, in certain implementations of the first aspect, the classification regression network includes a second RCNN including a hidden layer, a classification full-connection layer, and a regression full-connection layer, the hidden layer being connected to the classification full-connection layer, the hidden layer being connected to the regression full-connection layer; the hidden layer is used for processing the first characteristic information to obtain third characteristic information; the classification full-connection layer is used for obtaining the confidence coefficient of the candidate 2D frame belonging to each category according to the third characteristic information; and the regression full-connection layer is used for adjusting the position information of the candidate 2D frames according to the third characteristic information to obtain the adjusted candidate 2D frames.
One second RCNN can complete the detection of the objects to be detected in a plurality of tasks. The second RCNN may also be referred to as a single-head multi-task RCNN.
In the solution of the embodiment of the application, the second RCNN is used as the classification regression network, and the plurality of tasks share the hidden layer of the second RCNN, which further reduces the number of parameters and the amount of computation of the perception network and improves processing efficiency. In addition, the output of the hidden layer in the first RCNN needs to be fed into all of the sub-classification full-connection layers and sub-regression full-connection layers for multiple matrix operations, whereas the output of the hidden layer in the second RCNN only needs to be fed into the classification full-connection layer and the regression full-connection layer. This further reduces the number of matrix operations, is friendlier to hardware, further reduces computation time, and improves processing efficiency.
With reference to the first aspect, in some implementations of the first aspect, the classification full-connection layer is obtained by combining a plurality of sub-classification full-connection layers in the first RCNN, the regression full-connection layer is obtained by combining a plurality of sub-regression full-connection layers in the first RCNN, the first RCNN includes a hidden layer, a plurality of sub-classification full-connection layers and a plurality of sub-regression full-connection layers, the hidden layer is connected with the plurality of sub-classification full-connection layers, the hidden layer is connected with the plurality of sub-regression full-connection layers, the plurality of sub-classification full-connection layers are in one-to-one correspondence with the plurality of tasks, and the plurality of sub-regression full-connection layers are in one-to-one correspondence with the plurality of tasks; the sub-classification full-connection layer is used for obtaining the confidence coefficient of the object category of the candidate 2D frame belonging to the task corresponding to the sub-classification full-connection layer according to the third characteristic information; and the sub-regression full-connection layer is used for adjusting the position information of the candidate 2D frames according to the third characteristic information to obtain the adjusted candidate 2D frames.
In the solution of the embodiment of the application, the sub-classification fc layers and the sub-regression fc layers in the first RCNN are merged, and the resulting second RCNN is used as the classification regression network, which further reduces the number of matrix operations, is friendlier to hardware, further reduces computation time, and improves processing efficiency.
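The merging step itself can be sketched as a concatenation of weights: several per-task fc layers that take the same hidden feature as input are equivalent to one larger fc layer whose weight matrix and bias are their concatenation. The helper below is a sketch under that assumption.

```python
import torch
from torch import nn

def merge_sub_fcs(sub_fcs):
    """Combine per-task fc layers (same input size) into a single fc layer by
    concatenating their weights and biases along the output dimension (sketch)."""
    merged = nn.Linear(sub_fcs[0].in_features,
                       sum(fc.out_features for fc in sub_fcs))
    with torch.no_grad():
        merged.weight.copy_(torch.cat([fc.weight for fc in sub_fcs], dim=0))
        merged.bias.copy_(torch.cat([fc.bias for fc in sub_fcs], dim=0))
    return merged   # one larger matrix multiplication replaces several smaller ones
```

Applying this to the sub-classification fc layers yields the classification full-connection layer of the second RCNN, and applying it to the sub-regression fc layers yields the regression full-connection layer; the outputs are unchanged, only the number of separate matrix operations is reduced.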
In a second aspect, a training method of a perception network is provided, where the perception network includes a candidate region generation network (RPN) used to predict position information of candidate two-dimensional (2D) frames of a target object in a sample image, the target object includes objects to be detected in a plurality of tasks, each task of the plurality of tasks includes at least one category, and the target object includes a first task object and a second task object. The method includes: obtaining training data, where the training data includes the sample image, annotation data of the first task object on the sample image, and pseudo frames of the second task object on the sample image, the annotation data includes class labels of the first task object and labeled 2D frames of the first task object, and the pseudo frames of the second task object are target 2D frames of the second task object obtained by performing inference on the sample image using other perception networks; and training the perception network based on the training data.
Here the annotation data is partial annotation data; that is, the annotation data only includes the annotation data of the first task object. If the perception network were trained based on this annotation data alone, then, because the plurality of tasks share one RPN, the training data of different tasks might suppress each other when the RPN is trained. Specifically, since the annotation data is partial annotation data, for example, only the objects to be detected of one task are annotated on a given sample image, when training is performed with the annotation data of the objects to be detected of that task, the parameters of the RPN are adjusted so that the RPN can more accurately predict the candidate 2D frames of the objects to be detected of that task, but it cannot accurately predict the candidate 2D frames of the objects to be detected of the other tasks on the sample image. When training is then performed with the annotation data of the objects to be detected of another task, the parameters of the RPN are adjusted again, so that the adjusted RPN may no longer accurately predict the candidate 2D frames of the objects to be detected of the earlier task. In this way, the training data of different tasks may suppress each other, with the result that the RPN cannot predict all target objects in an image.
According to the solution in the embodiment of the application, the perception network is trained based on the pseudo frames together with the annotation data. Where the annotation data only includes the annotation data of the first task object, that is, in the case of partial annotation data, the pseudo frames of the second task object provide a more complete set of frames of the objects to be detected on the same sample image as the target output of the RPN. The parameters of the RPN are adjusted so that the output of the RPN keeps approaching this target data, mutual suppression between different tasks is avoided, the RPN can obtain more complete and accurate candidate 2D frames, and the recall rate is improved. Because the annotation data of a sample image in the embodiment of the application can be partial annotation data, sample images can be collected for the specific tasks that need them, and the objects to be detected of every task do not need to be annotated in every sample image, which reduces the cost of data collection and annotation and balances the training data of different tasks. In addition, the scheme using partial annotation data is flexible and extensible: when a task is added, only the annotation data of the newly added task needs to be provided, and the new objects to be detected do not need to be annotated on top of the original training data.
The first task object may comprise an object to be detected in one or more tasks. The one or more tasks are tasks where the first task object is located. The first task objects in the different sample images in the training set may be the same or different.
The second task object may comprise an object to be detected in one or more tasks. The one or more tasks are tasks where the second task object is located. The second task object and the first task object may have the same object to be detected. That is, the first task object and the second task object may have coincident objects to be detected, and the first task object and the second task object may also be completely different. The second task objects in the different sample images in the training set may be the same or different.
The other perception networks are perception networks other than the perception network to be trained. By way of example, the other perception networks may be a multi-head multi-task perception network, a plurality of single-task perception networks, or the like.
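A minimal sketch of how the RPN supervision could be assembled from the partial annotation data plus the pseudo frames is shown below; the data layout, the confidence threshold, and the helper name are assumptions introduced for illustration.

```python
import torch

def build_rpn_target_boxes(labeled_boxes, teacher_detections, score_thr=0.5):
    """Sketch: combine labeled 2D frames of the first task object with pseudo frames
    of the second task object (inferred by other, already trained perception networks)."""
    pseudo = [box for box, score in teacher_detections if score > score_thr]  # threshold assumed
    if pseudo:
        return torch.cat([labeled_boxes, torch.stack(pseudo)], dim=0)
    return labeled_boxes   # the RPN is then supervised with frames covering all tasks
```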
With reference to the second aspect, in some implementations of the second aspect, the perception network further includes a backbone network, a region of interest extraction module, and a classification regression network, and training the perception network based on the training data includes: calculating a first loss function value according to the difference between, on one hand, the labeled 2D frames of the first task object and the target 2D frames of the second task object and, on the other hand, the candidate 2D frames of the target object in the sample image predicted by the RPN; calculating a second loss function value of the perception network according to the annotation data; and back-propagating the first loss function value and the second loss function value to adjust parameters of the parts of the perception network that need to be trained, where the parts of the perception network that need to be trained include the part of the classification regression network that needs to be trained, the region of interest extraction module, the RPN, and the backbone network, and the part of the classification regression network that needs to be trained is determined according to the first task object.
The labeled 2D frames of the first task object and the target 2D frames of the second task object are compared with the candidate 2D frames of the target object predicted by the RPN to obtain the loss function value of the RPN stage, namely the first loss function value.
The annotation data of the sample image is compared with the output result of the classification regression network to obtain the loss function value, at the classification regression network stage, of the task to which the first task object belongs, namely the second loss function value.
Based on the back propagation of the first loss function value, the gradients of the parameters related to the first loss function value are calculated, and those parameters are adjusted according to their gradients, thereby adjusting the perception network so that the RPN can predict the candidate frames more comprehensively.
The parameters related to the first loss function value are the parameters in the perception network used in the process of calculating the first loss function value, for example, the parameters of the backbone network and the parameters of the RPN. Further, in the case where the perception network includes an FPN, the parameters related to the first loss function value also include the parameters of the FPN.
Based on the back propagation of the second loss function value, the gradients of the parameters related to the second loss function value are calculated, and those parameters are adjusted according to their gradients, thereby adjusting the perception network so that the classification regression network can better refine the output 2D frames and the accuracy of classification prediction is improved.
The parameters related to the second loss function value are the parameters in the perception network used in the process of calculating the second loss function value, for example, the parameters of the backbone network, the parameters of the RPN, the parameters of the region of interest extraction module, and the parameters of the part of the classification regression network that needs to be trained. Further, in the case where the perception network includes an FPN, the parameters related to the second loss function value also include the parameters of the FPN. In other words, the parameters related to the second loss function value are the parameters of the parts of the perception network that need to be trained.
According to the solution in the embodiment of the application, the parts shared by different tasks in the perception network, such as the backbone network, the RPN, and the region of interest extraction module, participate in training based on the annotation data of all the different tasks, so that the shared parts can learn the common characteristics of the tasks. The parts corresponding to individual tasks in the perception network, for example, the per-task parts of the classification regression network, participate in training only when the annotation data of their own task is used, so that these parts can learn the specific characteristics of their tasks, which improves the accuracy of the model. Meanwhile, during training, the part of the classification regression network to be trained is determined according to the task, and the parts of the classification regression network corresponding to different tasks do not affect one another during training, which guarantees the independence of each task and gives the model strong flexibility.
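One training iteration with the two losses described above could look like the following sketch. The attribute names (backbone, rpn, roi_extract, head), the sample layout, and the loss interfaces are assumptions; the key point is that only the sub-fc layers of the annotated task contribute to the second loss and therefore receive gradients.

```python
import torch

def train_step(net, optimizer, sample, rpn_loss_fn, head_loss_fn):
    """Sketch of one training iteration (interfaces and names are assumptions)."""
    feats = net.backbone(sample["image"])
    proposals, rpn_out = net.rpn(feats)

    # First loss: RPN output vs. labeled 2D frames plus pseudo frames of all tasks.
    loss_rpn = rpn_loss_fn(rpn_out, sample["labeled_boxes"], sample["pseudo_boxes"])

    # Second loss: classification regression output vs. annotation data of the
    # annotated (first task) object only.
    task_id = sample["annotated_task"]
    scores, deltas = net.head(net.roi_extract(feats, proposals))
    loss_head = head_loss_fn(scores[task_id], deltas[task_id],
                             sample["labeled_boxes"], sample["labels"])

    optimizer.zero_grad()
    (loss_rpn + loss_head).backward()   # unused sub-fc layers of other tasks get no gradient
    optimizer.step()
    return loss_rpn.item(), loss_head.item()
```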
With reference to the second aspect, in some implementations of the second aspect, the backbone network is configured to perform convolution processing on the sample image and output a first feature map of the sample image; the RPN is used for outputting the position information of the candidate 2D frame of the target object based on a second characteristic diagram, and the second characteristic diagram is determined according to the first characteristic diagram; the region of interest extraction module is used for extracting first characteristic information on a third characteristic map based on the position information of the candidate 2D frame, wherein the first characteristic information is the characteristic of the region where the candidate 2D frame is positioned, and the third characteristic map is determined according to the first characteristic map; and the classification regression network is used for processing the first characteristic information, outputting target 2D frames of the target object and first indication information, wherein the number of the target 2D frames is smaller than or equal to that of the candidate 2D frames, and the first indication information is used for indicating the class to which the target object belongs.
With reference to the second aspect, in some implementations of the second aspect, the classification regression network includes a first regional convolutional neural network RCNN, the first RCNN includes a hidden layer, a plurality of sub-classification full-connection layers, and a plurality of sub-regression full-connection layers, the hidden layer is connected to the plurality of sub-classification full-connection layers, the hidden layer is connected to the plurality of sub-regression full-connection layers, the plurality of sub-classification full-connection layers are in one-to-one correspondence with the plurality of tasks, and the plurality of sub-regression full-connection layers are in one-to-one correspondence with the plurality of tasks; the hidden layer is used for processing the first characteristic information to obtain second characteristic information; the sub-classification full-connection layer is used for obtaining the confidence coefficient of the object category of the candidate 2D frame belonging to the task corresponding to the sub-classification full-connection layer according to the second characteristic information; the sub-regression full-connection layer is used for adjusting the position information of the candidate 2D frames according to the second characteristic information to obtain adjusted candidate 2D frames; and the part of the classification regression network, which is required to be trained, comprises a hidden layer, a sub-classification full-connection layer and a sub-regression full-connection layer, wherein the sub-classification full-connection layer and the sub-regression full-connection layer correspond to the task where the first task object is located.
According to the method provided by the embodiment of the application, the parts shared by different tasks in the perception network, namely the backbone network, the RPN, the region of interest extraction module, and the hidden layer of the classification regression network, participate in training based on the annotation data of all the different tasks, so that the shared parts can learn the common characteristics of the tasks. The parts corresponding to individual tasks in the perception network, namely the sub-classification full-connection layer and the sub-regression full-connection layer corresponding to each task in the classification regression network, participate in training only when the annotation data of their own task is used, so that these parts can learn the specific characteristics of their tasks, which improves the accuracy of the model.
In a third aspect, an object recognition method is provided, applied to a perception network that includes a backbone network, a candidate region generation network (RPN), a region of interest extraction module, and a classification regression network. The method includes: performing convolution processing on an input image by using the backbone network to obtain a first feature map of the input image; outputting, by using the RPN, position information of candidate two-dimensional (2D) frames of a target object based on a second feature map, where the target object includes objects to be detected in a plurality of tasks, each task of the plurality of tasks includes at least one category, and the second feature map is determined from the first feature map; extracting, by using the region of interest extraction module, first characteristic information on a third characteristic map based on the position information of the candidate 2D frames, where the first characteristic information is the feature of the region in which the candidate 2D frame is located, and the third characteristic map is determined from the first feature map; and processing the first characteristic information by using the classification regression network to obtain target 2D frames of the target object and first indication information, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information indicates the category to which the target object belongs.
According to the solution of the embodiment of the application, a plurality of perception tasks are completed by a single perception network, and the plurality of tasks share one RPN; that is, one RPN predicts the regions where the objects to be detected in the plurality of tasks are located. This reduces the number of parameters and the amount of computation of the perception network while maintaining its performance, improves processing efficiency, facilitates deployment in scenarios with high real-time requirements, reduces the load on the hardware, and saves cost.
With reference to the third aspect, in some implementations of the third aspect, processing the first feature information by using a classification regression network to obtain a target 2D frame of the target object and first indication information includes: processing the first characteristic information by using a classification regression network to obtain the confidence that the candidate 2D frames belong to each category in the plurality of tasks; adjusting the position information of the candidate 2D frames by using a classification regression network to obtain adjusted candidate 2D frames; determining a target 2D frame according to the adjusted candidate 2D frame; and determining first indication information according to the confidence that the target 2D frame belongs to each category.
With reference to the third aspect, in some implementations of the third aspect, the classification regression network includes a first regional convolutional neural network RCNN, the first RCNN includes a hidden layer, a plurality of sub-classification full-connection layers, and a plurality of sub-regression full-connection layers, the hidden layer is connected to the plurality of sub-classification full-connection layers, the hidden layer is connected to the plurality of sub-regression full-connection layers, the plurality of sub-classification full-connection layers are in one-to-one correspondence with the plurality of tasks, and the plurality of sub-regression full-connection layers are in one-to-one correspondence with the plurality of tasks; and processing the first feature information by using a classification regression network, outputting a target 2D frame of the target object and first indication information, including: processing the first characteristic information by using the hidden layer to obtain second characteristic information; obtaining the confidence coefficient of the object category of the candidate 2D frame belonging to the task corresponding to the sub-classification full-connection layer according to the second characteristic information by using the sub-classification full-connection layer; and adjusting the position information of the candidate 2D frames according to the second characteristic information by utilizing the sub-regression full-connection layer to obtain the adjusted candidate 2D frames.
With reference to the third aspect, in some implementations of the third aspect, the classification regression network includes a second RCNN, the second RCNN including a hidden layer, a classification full-connection layer, and a regression full-connection layer, the hidden layer being connected to the classification full-connection layer, the hidden layer being connected to the regression full-connection layer; and processing the first feature information by using a classification regression network, outputting a target 2D frame of the target object and first indication information, including: processing the first characteristic information by using the hidden layer to obtain third characteristic information; obtaining the confidence coefficient of the candidate 2D frames belonging to each category according to the third characteristic information by using the classification full-connection layer; and adjusting the position information of the candidate 2D frames according to the third characteristic information by using the regression full connection layer to obtain the adjusted candidate 2D frames.
With reference to the third aspect, in some implementations of the third aspect, the classification full-connection layer is obtained by combining a plurality of sub-classification full-connection layers in the first RCNN, the regression full-connection layer is obtained by combining a plurality of sub-regression full-connection layers in the first RCNN, the first RCNN includes a hidden layer, a plurality of sub-classification full-connection layers and a plurality of sub-regression full-connection layers, the hidden layer is connected with the plurality of sub-classification full-connection layers, the hidden layer is connected with the plurality of sub-regression full-connection layers, the plurality of sub-classification full-connection layers are in one-to-one correspondence with the plurality of tasks, and the plurality of sub-regression full-connection layers are in one-to-one correspondence with the plurality of tasks; the sub-classification full-connection layer is used for obtaining the confidence coefficient of the object category of the candidate 2D frame belonging to the task corresponding to the sub-classification full-connection layer according to the third characteristic information; and the sub-regression full-connection layer is used for adjusting the position information of the candidate 2D frames according to the third characteristic information to obtain the adjusted candidate 2D frames.
The perception network may be obtained by using the training method of the perception network in the second aspect; that is, the perception network may be a trained image recognition model, and the image to be processed may be processed using the trained perception network.
In a fourth aspect, a training apparatus for a perception network is provided, and the apparatus includes modules or units for performing the method in any one of the implementations of the second aspect.
In a fifth aspect, an object recognition apparatus is provided, and the apparatus includes modules or units for performing the method in any one of the implementations of the third aspect.
It should be appreciated that the extensions, definitions, explanations and illustrations of the relevant content in the first and second aspects described above also apply to the same content in the third, fourth and fifth aspects.
In a sixth aspect, a training apparatus for a perception network is provided, and the apparatus includes: a processor and a transmission interface, where the processor receives or transmits data through the transmission interface, and the processor is configured to invoke program instructions stored in a memory to perform the method in any one of the implementations of the second aspect.
The processor in the sixth aspect may be a central processing unit (central processing unit, CPU) or a combination of a CPU and a neural network operation processor, where the neural network operation processor may include a graphics processing unit (graphics processing unit, GPU), a neural network processing unit (neural-network processing unit, NPU), a tensor processing unit (tensor processing unit, TPU), and the like. The TPU is an artificial intelligence accelerator application-specific integrated circuit fully customized by Google for machine learning.
In a seventh aspect, there is provided an object recognition apparatus comprising: a processor and a transmission interface, the processor receiving or transmitting data through the transmission interface, the processor being configured to invoke program instructions stored in the memory to perform the method of the third aspect and any implementation of the third aspect.
The processor in the seventh aspect may be a central processing unit or a combination of a CPU and a neural network operation processor, where the neural network operation processor may include a graphics processing unit, a neural network processing unit, a tensor processing unit, and the like. The TPU is an artificial intelligence accelerator application-specific integrated circuit fully customized by Google for machine learning.
In an eighth aspect, there is provided a computer readable storage medium storing program code for execution by a device, the program code when run on a computer or processor causing the computer or processor to perform the method of any one of the implementations of the second or third aspects.
A ninth aspect provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of the implementations of the second or third aspects described above.
In a tenth aspect, a chip is provided, the chip comprising a processor and a data interface, the processor reading instructions stored on a memory through the data interface, performing the method of any one of the implementations of the second or third aspects.
Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, where the instructions, when executed, are used to perform the method in any one of the implementations of the second aspect or the third aspect.
The chip may be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
An eleventh aspect provides an electronic device comprising the apparatus of any one of the fourth to seventh aspects.
Drawings
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic diagram of another application scenario provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a system architecture according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application;
fig. 5 is a schematic diagram of a hardware structure of a chip according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 7 is a schematic block diagram of a multi-head multi-task perception network;
fig. 8 is a schematic block diagram of a perception network according to an embodiment of the present application;
FIG. 9 is a schematic block diagram of another perception network provided by an embodiment of the present application;
FIG. 10 is a schematic block diagram of another perception network provided by an embodiment of the present application;
FIG. 11 is a schematic block diagram of another perception network provided by an embodiment of the present application;
FIG. 12 is a schematic flow chart of a training method of a perception network provided by an embodiment of the present application;
fig. 13 is a schematic diagram of a training process of a perception network according to an embodiment of the present application;
FIG. 14 is a schematic block diagram of a perception network in the training process provided by an embodiment of the present application;
FIG. 15 is a schematic flow chart of an object recognition method according to an embodiment of the present application;
FIG. 16 is a schematic diagram of an object recognition process according to an embodiment of the present application;
FIG. 17 is a schematic block diagram of a perception network in the inference process provided by an embodiment of the present application;
fig. 18 is a schematic diagram of a switching process of a perception network provided by an embodiment of the present application;
FIG. 19 is a schematic block diagram of an apparatus provided by an embodiment of the present application;
fig. 20 is a schematic block diagram of another apparatus provided in an embodiment of the present application.
Detailed Description
The technical scheme of the application will be described below with reference to the accompanying drawings.
The embodiments of the present application can be applied to fields that need to complete multiple perception tasks, such as driving assistance, autonomous driving, mobile phone terminals, monitoring, and security. An image is input into the perception network of the application to obtain the detection results of the objects of interest in the image. The detection results may be input into a post-processing module for further processing; for example, in an autonomous driving system the detection results are sent to a planning control unit for decision making, or in a security system they are sent for detection of abnormal situations.
The following briefly describes three application scenarios: the visual perception system of an advanced driving assistance system (advanced driving assistant system, ADAS)/autonomous driving system (autonomous driving system, ADS), album picture classification, and monitoring.
ADAS/ADS visual perception system:
as shown in fig. 1, in ADAS and ADS, multiple types of targets need to be detected in real time. The detection targets include dynamic obstacles, static obstacles, and traffic signs, for example, pedestrians (pedestrian), riders (cyclist), tricycles (tricycle), cars (car), trucks (truck), buses (bus), wheels (wheel), car lights (car light), traffic cones (traffic cone), traffic rods (traffic stick), fire hydrants (fire hydrant), motorcycles (motorbike), bicycles (bicycle), traffic signs (traffic sign), guide signs (guide sign), billboards (billboard), road poles (pole), traffic lights (traffic lights), road markings, and the like. Traffic lights include red traffic lights (traffic lights_red), yellow traffic lights (traffic lights_yellow), green traffic lights (traffic lights_green), black traffic lights (traffic lights_black), and the like. Pavement markings (pavement sign) include turn-around, straight, left-turn, right-turn, straight-and-left-turn, straight-and-right-turn, straight-and-turn-around, left-turn-and-turn-around, and similar lane markings.
By utilizing the scheme of the embodiment of the application, the detection tasks of the multiple targets can be realized in one perception network, namely objects to be detected of the multiple tasks can be detected in one perception network, and the detection results can be sent to a planning control unit for decision making after being processed, such as obstacle avoidance, traffic light decision making or traffic sign decision making.
Album picture classification:
when a user stores a large number of pictures on terminal equipment (for example, a mobile phone) or a cloud disk, the user or the system can conveniently manage the album in a classified mode by identifying the images in the album, and user experience is improved.
By utilizing the scheme of the embodiment of the application, the perception network suitable for classifying the photo album pictures can be obtained or optimized. And classifying the pictures by utilizing the perception network, for example, classifying the pictures into different categories such as pictures containing animals, pictures containing people and the like, so as to label the pictures in the different categories, and facilitate the user to check and find. In addition, the classification labels of the pictures can also be provided for an album management system to carry out classification management, so that the management time of a user is saved, the album management efficiency is improved, and the user experience is improved.
And (3) monitoring:
the monitoring scene comprises: smart city, field monitoring, indoor monitoring, outdoor monitoring, in-car monitoring, etc.
As shown in fig. 2, in the smart city sensing system, various detection tasks, such as detecting vehicles, license plates, people, faces, etc., are required to be completed, and the detection results can be used for judging traffic violation behaviors, predicting traffic congestion level, etc.
By adopting the scheme of the embodiment of the application, the input road picture can be processed in one perception network, so that the detection tasks of the multiple targets can be completed. In addition, the detection task of the sensing network can be increased or decreased according to the actual situation. For example, the current detection tasks of the sensing network include a vehicle detection task and a person detection task, and if the detection task of the traffic sign is added to the detection task of the sensing network, the structure of the sensing network is adjusted, and the detection task is added. For a specific description, see later, for example fig. 14.
Since embodiments of the present application relate to a large number of applications of neural networks, for ease of understanding, the following description will first discuss the terms and concepts related to neural networks that may be involved in embodiments of the present application.
(1) Neural network
The neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the arithmetic unit may be:

$$h_{W,b}(x) = f(W^T x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function (activation function) of the neural unit, used to introduce a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input to the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by connecting many of the above single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of the previous layer to extract features of the local receptive field, and the local receptive field may be an area composed of several neural units.
(2) Deep neural network
Deep neural networks (deep neural network, DNN), also known as multi-layer neural networks, can be understood as neural networks with multiple hidden layers. The DNNs are divided according to the positions of different layers, and the neural networks inside the DNNs can be divided into three types: input layer, hidden layer, output layer. Typically the first layer is the input layer, the last layer is the output layer, and the intermediate layers are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
Although DNN appears to be complex, the work of each layer is actually not complex; it is simply the following linear relational expression:

$$\vec{y} = \alpha(W \vec{x} + \vec{b})$$

where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called coefficients), and $\alpha(\cdot)$ is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since a DNN has many layers, the number of coefficients $W$ and offset vectors $\vec{b}$ is also large. These parameters are defined in the DNN as follows. Taking the coefficient $W$ as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^3_{24}$. The superscript 3 represents the layer in which the coefficient $W$ is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4.
In summary, the coefficient from the kth neuron of layer L-1 to the jth neuron of layer L is defined as $W^L_{jk}$.
It should be noted that the input layer is devoid of W parameters. In deep neural networks, more hidden layers make the network more capable of characterizing complex situations in the real world. Theoretically, the more parameters the higher the model complexity, the greater the "capacity", meaning that it can accomplish more complex learning tasks. The process of training the deep neural network, i.e. learning the weight matrix, has the final objective of obtaining a weight matrix (a weight matrix formed by a number of layers of vectors W) for all layers of the trained deep neural network.
(3) Convolutional neural network
The convolutional neural network (convolutional neuron network, CNN) is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of a convolutional layer and a sub-sampling layer. The feature extractor can be seen as a filter and the convolution process can be seen as a convolution with an input image or convolution feature plane (feature map) using a trainable filter. The convolution layer refers to a neuron layer in the convolution neural network, which performs convolution processing on an input signal. In the convolutional layer of the convolutional neural network, one neuron may be connected with only a part of adjacent layer neurons. A convolutional layer typically contains a number of feature planes, each of which may be composed of a number of neural elements arranged in a rectangular pattern. Neural elements of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights can be understood as the way image information is extracted is independent of location. The underlying principle in this is: the statistics of a certain part of the image are the same as other parts. I.e. meaning that the image information learned in one part can also be used in another part. The same learned image information can be used for all locations on the image. In the same convolution layer, a plurality of convolution kernels may be used to extract different image information, and in general, the greater the number of convolution kernels, the more abundant the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix with random size, and reasonable weight can be obtained through learning in the training process of the convolution neural network. In addition, the direct benefit of sharing weights is to reduce the connections between layers of the convolutional neural network, while reducing the risk of overfitting.
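As a small, generic illustration of weight sharing (not part of the patented network), the parameter count of a convolutional layer depends only on its kernels and is independent of the spatial size of the input:

```python
import torch
from torch import nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
# 16 kernels of size 3x3x3 plus 16 biases: 16*3*3*3 + 16 = 448 parameters,
# shared across every spatial location of the input image.
print(sum(p.numel() for p in conv.parameters()))   # 448
small = conv(torch.randn(1, 3, 64, 64))     # works on a 64x64 image ...
large = conv(torch.randn(1, 3, 512, 512))   # ... and on a 512x512 image with the same weights
```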
(4) Loss function
In training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is actually desired, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the actually desired target value (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the loss function (loss function) or the objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, and training the deep neural network then becomes a process of reducing this loss as much as possible. Generally, the smaller the loss, the higher the training quality of the deep neural network, and the larger the loss, the lower the training quality. Similarly, the smaller the loss fluctuation, the more stable the training; the larger the loss fluctuation, the less stable the training.
(5) Back propagation algorithm
During training, the neural network can use the error back propagation (back propagation, BP) algorithm to correct the values of the parameters in the initial model, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, the input signal is propagated forward until an error loss is produced at the output, and the parameters in the initial super-resolution model are updated by propagating the error loss information backward, so that the error loss converges. The back propagation algorithm is a backward-propagation movement dominated by the error loss, and aims to obtain the parameters of the optimal super-resolution model, such as the weight matrices.
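As a hedged sketch of the back propagation process described above (PyTorch is assumed here only for illustration; the model, data, and learning rate are hypothetical):

    import torch

    w = torch.randn(3, 1, requires_grad=True)   # parameters of a tiny hypothetical model
    x = torch.randn(8, 3)                        # input samples
    y = torch.randn(8, 1)                        # desired target values
    optimizer = torch.optim.SGD([w], lr=0.1)

    for _ in range(100):
        pred = x @ w                             # forward propagation of the input signal
        loss = ((pred - y) ** 2).mean()          # error loss at the output
        optimizer.zero_grad()
        loss.backward()                          # error loss information propagated backward
        optimizer.step()                         # parameters updated so that the error loss converges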
The method provided by the present application is described below from the model training side and the model application side.
The training method of the perception network provided by the embodiments of the present application relates to computer vision processing, and can be specifically applied to data processing methods such as data training, machine learning, and deep learning, to perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training, and the like on training data, and finally obtain a trained perception network. In addition, the object recognition method provided by the embodiments of the present application can use the trained perception network: input data (such as the image to be processed in the present application) is input into the trained perception network to obtain output data (such as the first indication information and the target 2D frame of the target object in the present application). It should be noted that the training method of the perception network and the object recognition method provided by the embodiments of the present application are based on the same concept, and can also be understood as two parts of one system, or two stages of an overall process: for example, a model training stage and a model application stage.
As shown in fig. 3, an embodiment of the present application provides a system architecture 100. In fig. 3, a data acquisition device 160 is used to acquire training data. For the training method of the perception network according to the embodiment of the present application, the training data may include a sample image, labeling data of the sample image, and a pseudo frame on the sample image.
After the training data is collected, the data collection device 160 stores the training data in the database 130 and the training device 120 trains the target model/rule 101 based on the training data maintained in the database 130.
The specific manner in which the training device 120 obtains the target model/rule 101 based on the training data will be described in detail later. The target model/rule 101 can be used to implement the object recognition method according to the embodiment of the present application, that is, the image to be processed is input into the target model/rule 101, so as to obtain the detection result of the object of interest in the image to be processed. In practical applications, the training data maintained in the database 130 is not necessarily collected by the data collecting device 160, but may be received from other devices. It should be noted that the training device 120 is not necessarily completely based on the training data maintained by the database 130 to perform training of the target model/rule 101, and it is also possible to obtain the training data from the cloud or other places to perform model training, which should not be taken as a limitation of the embodiments of the present application.
The target model/rule 101 obtained by training with the training device 120 may be applied to different systems or devices, such as the execution device 110 shown in fig. 3. The execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (augmented reality, AR)/virtual reality (virtual reality, VR) device, or a vehicle-mounted terminal, or may be a server or a cloud. In fig. 3, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through the client device 140. In an embodiment of the present application, the input data may include: an image to be processed.
In preprocessing input data by the execution device 110, or in performing processing related to computation or the like by the computation module 111 of the execution device 110, the execution device 110 may call data, codes or the like in the data storage system 150 for corresponding processing, or may store data, instructions or the like obtained by corresponding processing in the data storage system 150.
Finally, the I/O interface 112 returns the processing result, such as the detection result obtained as described above, to the client device 140, thereby providing the processing result to the user.
For example, the client device 140 may be a planning control unit in an autopilot system.
It should be noted that the training device 120 may generate, based on different training data, a corresponding target model/rule 101 for different targets or different tasks, where the corresponding target model/rule 101 may be used to achieve the targets or to complete the tasks, thereby providing the user with the desired result.
In the case shown in FIG. 3, the user may manually give input data, which may be manipulated through an interface provided by the I/O interface 112. In another case, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data requiring the user's authorization, the user may set the corresponding permissions in the client device 140. The user may view the results output by the execution device 110 at the client device 140, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 140 may also be used as a data collection terminal to collect input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data as shown in the figure, and store the new sample data in the database 130. Of course, instead of being collected by the client device 140, the I/O interface 112 may directly store the input data input to the I/O interface 112 and the output result output from the I/O interface 112 as new sample data into the database 130.
It should be noted that fig. 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawing is not limited in any way, for example, in fig. 3, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may be disposed in the execution device 110.
As shown in fig. 3, the target model/rule 101 is trained according to the training device 120, and the target model/rule 101 may be a perception network in an embodiment of the present application.
Since CNN is a very common neural network, the structure of CNN will be described in detail with reference to fig. 4. As described in the above introduction to the basic concepts, the convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture performs multiple levels of learning at different levels of abstraction through machine learning algorithms. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to the image input into it.
The structure of the neural network specifically adopted by the image recognition method in the embodiment of the application can be shown in fig. 4. In fig. 4, a Convolutional Neural Network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a fully-connected layer (fully connected layer) 230. The input layer 210 may acquire an image to be processed, and process the acquired image to be processed by the convolution layer/pooling layer 220 and the following full connection layer 230, so as to obtain a processing result of the image. The internal layer structure of the CNN200 of fig. 4 is described in detail below.
Convolution layer/pooling layer 220:
convolution layer:
the convolutional layer/pooling layer 220 shown in fig. 4 may include, for example, layers 221-226. In one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
The internal principle of operation of one convolution layer will be described below using the convolution layer 221 as an example.
The convolutional layer 221 may include many convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually processed on the input image pixel by pixel (or two pixels by two pixels, and so on, depending on the value of the stride) in the horizontal direction, so as to complete the task of extracting specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension (depth dimension) of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Therefore, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same type, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the dimension can be understood as being determined by the "multiple" mentioned above. Different weight matrices may be used to extract different features in the image: for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a particular color of the image, and yet another weight matrix is used to blur unwanted noise in the image. The sizes (rows × columns) of these weight matrices are the same, so the sizes of the convolutional feature maps extracted by the weight matrices of the same size are also the same, and the extracted convolutional feature maps of the same size are then combined to form the output of the convolution operation.
The weight values in the weight matrices are required to be obtained through a large amount of training in practical application, and each weight matrix formed by the weight values obtained through training can be used for extracting information from an input image, so that the convolutional neural network 200 can perform correct prediction.
When convolutional neural network 200 has multiple convolutional layers, the shallow convolutional layers (e.g., 221) tend to extract more general features, which may also be referred to as low-level features; as the depth of the convolutional neural network 200 increases, features extracted by the later convolutional layers (e.g., 226) become more complex, such as features of high level semantics, which are more suitable for the problem to be solved.
Pooling layer:
since it is often desirable to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. In the layers 221-226 exemplified by 220 in fig. 4, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator may compute the pixel values in the image within a particular range to produce an average value as the result of average pooling. The max pooling operator may take the pixel with the largest value within a particular range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
Full connection layer 230:
after processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet sufficient to output the desired output information. Because, as previously described, the convolution/pooling layer 220 will only extract features and reduce the parameters imposed by the input image. However, in order to generate the final output information (the required class information or other relevant information), convolutional neural network 200 needs to utilize fully-connected layer 230 to generate the output of the required number of classes or groups. Thus, multiple hidden layers (231, 232 to 23n as shown in fig. 4) may be included in the fully connected layer 230, and the output layer 240, where parameters included in the multiple hidden layers may be pre-trained according to relevant training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
After the hidden layers in the fully connected layer 230, the final layer of the entire convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to categorical cross entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 200 (for example, propagation from 210 to 240 in fig. 4 is forward propagation) is completed, back propagation (for example, propagation from 240 to 210 in fig. 4 is back propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200, that is, the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
It should be noted that the convolutional neural network shown in fig. 4 is only an example of one possible convolutional neural network, and the convolutional neural network may also exist in the form of other network models in a specific application.
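Purely as an illustration of the layer structure described above (an input layer, alternating convolutional/pooling layers, fully connected layers, and an output layer), the following PyTorch-style sketch uses hypothetical channel counts and a hypothetical input size; it is not the concrete configuration of any embodiment.

    import torch
    import torch.nn as nn

    cnn = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
        nn.ReLU(),
        nn.MaxPool2d(2),                              # pooling layer: reduces spatial size
        nn.Conv2d(16, 32, kernel_size=3, padding=1),  # convolutional layer
        nn.ReLU(),
        nn.MaxPool2d(2),                              # pooling layer
        nn.Flatten(),
        nn.Linear(32 * 56 * 56, 128),                 # hidden fully connected layer
        nn.ReLU(),
        nn.Linear(128, 31),                           # output layer: scores for 31 classes
    )

    image = torch.randn(1, 3, 224, 224)               # input layer: one RGB image
    scores = cnn(image)                               # forward propagation from input to output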
Fig. 5 is a hardware structure of a chip according to an embodiment of the present application, where the chip includes a neural network processor 50. The chip may be provided in an execution device 110 as shown in fig. 3 for performing the calculation of the calculation module 111. The chip may also be provided in the training device 120 as shown in fig. 3 for completing the training work of the training device 120 and outputting the target model/rule 101. The method of the embodiment of the present application may be implemented in a chip as shown in fig. 5.
The neural network processor NPU 50 is mounted as a coprocessor on a main central processing unit (central processing unit, CPU) (host CPU), and tasks are distributed by the main CPU. The core part of the NPU is the arithmetic circuit 503, and the controller 504 controls the arithmetic circuit 503 to extract data from a memory (weight memory or input memory) and perform operations.
In some implementations, the arithmetic circuitry 503 internally includes a plurality of processing units (PEs). In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 502 and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit takes matrix a data from the input memory 501 and performs matrix operation with matrix B, and the obtained partial result or final result of the matrix is stored in an accumulator (accumulator) 508.
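The following sketch (plain Python/NumPy, with hypothetical matrix sizes) only illustrates the idea of accumulating partial results of C = A x B in an accumulator; it does not model the actual circuit.

    import numpy as np

    A = np.random.rand(4, 6)        # input matrix A (conceptually from the input memory)
    B = np.random.rand(6, 5)        # weight matrix B (conceptually from the weight memory)

    C = np.zeros((4, 5))            # accumulator holding partial results
    for k in range(A.shape[1]):     # accumulate one partial product at a time
        C += np.outer(A[:, k], B[k, :])

    assert np.allclose(C, A @ B)    # the accumulated partial results equal the final matrix product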
The vector calculation unit 507 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 507 may be used for network calculations of non-convolutional/non-FC layers in a neural network, such as pooling (pooling), batch normalization (batch normalization, BN), local response normalization (local response normalization), and the like.
In some implementations, the vector computation unit 507 can store the vector of processed outputs to the unified buffer 506. For example, the vector calculation unit 507 may apply a nonlinear function to an output of the operation circuit 503, such as a vector of accumulated values, to generate an activation value. In some implementations, the vector calculation unit 507 generates a normalized value, a combined value, or both. In some implementations, the vector of processed outputs can be used as an activation input to the operational circuitry 503, for example for use in subsequent layers in a neural network.
The operations of the sensing network provided by the embodiments of the present application may be performed by the operation circuit 503 or the vector calculation unit 507.
The unified memory 506 is used for storing input data and output data.
The storage unit access controller 505 (direct memory access controller, DMAC) transfers input data in the external memory to the input memory 501 and/or the unified memory 506, stores the weight data in the external memory into the weight memory 502, and stores the data in the unified memory 506 into the external memory.
A bus interface unit (bus interface unit, BIU) 510 is used for interaction among the main CPU, the DMAC, and the instruction fetch memory 509 via a bus.
An instruction fetch memory (instruction fetch buffer) 509 connected to the controller 504 for storing instructions used by the controller 504;
and a controller 504 for calling the instruction cached in the instruction memory 509 to control the operation of the operation accelerator.
Typically, the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are on-chip (On-Chip) memories, and the external memory is a memory external to the NPU. The external memory may be a double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM), a high bandwidth memory (high bandwidth memory, HBM), or another readable and writable memory.
The execution device 110 in fig. 3 or the chip in fig. 5 described above can perform the steps of the object recognition method of the embodiments of the present application. The training device 120 in fig. 3 or the chip in fig. 5 described above can perform the steps of the training method of the perception network of the embodiments of the present application.
As shown in fig. 6, an embodiment of the present application provides a system architecture 300. The system architecture comprises a local device 301, a local device 302, and an executing device 310 and a data storage system 350, wherein the local device 301 and the local device 302 are connected to the executing device 310 through a communication network.
In one implementation, the execution device 310 may be implemented by one or more servers. Alternatively, the execution device 310 may be used together with other computing devices, such as data storage devices, routers, and load balancers. The execution device 310 may be disposed at one physical site or distributed across multiple physical sites. The execution device 310 may use data in the data storage system 350 or invoke program code in the data storage system 350 to implement the training method of the perception network of the embodiments of the present application.
Specifically, in one implementation, the perception network includes a candidate region generation network (RPN) configured to predict position information of candidate two-dimensional (2D) frames of a target object in a sample image, where the target object includes objects to be detected in a plurality of tasks, and each of the plurality of tasks includes at least one category; the target object includes a first task object and a second task object.
The execution device 110 may perform the following process:
Training data is obtained, where the training data includes a sample image, labeling data of a first task object on the sample image, and pseudo frames of a second task object on the sample image; the labeling data includes class labels of the first task object and labeled 2D frames of the first task object, and the pseudo frames of the second task object are target 2D frames of the second task object obtained by performing inference on the sample image with another perception network. The perception network is then trained based on the training data.
Through the foregoing process, the execution device 110 can obtain a perception network, and the perception network can be used to perform detection for a plurality of tasks.
The user may operate respective user devices (e.g., local device 301 and local device 302) to interact with execution device 310. Each local device may represent any computing device, such as a monitoring camera, personal computer, computer workstation, smart phone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set top box or game console, etc.
The local device of each user may interact with the performing device 310 through a communication network of any communication mechanism/communication standard, which may be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
In one implementation, the local device 301, 302 obtains relevant parameters of the sensing network from the executing device 310, deploys the sensing network on the local device 301, 302, and uses the sensing network for target detection.
In another implementation, the perceived network may be deployed directly on the executing device 310, and the executing device 310 processes the image to be processed by acquiring the image to be processed from the local device 301 and the local device 302, and employing the perceived network.
The executing device 310 may also be a cloud device, where the executing device 310 may be deployed at the cloud; alternatively, the executing device 310 may be a terminal device, and in this case, the executing device 310 may be disposed on the user terminal side, which is not limited in the embodiment of the present application.
The perception network may be deployed on a vehicle-mounted visual perception device, a safe-city perception device, or a computing node of a security protection perception device, and processes the image to be processed to obtain a detection result of an object of interest in the image to be processed. For example, the computing node may be the execution device 110 in fig. 3, the execution device 310 in fig. 6, or a local device.
Most current perception networks can only complete one detection task; if multiple detection tasks are to be realized, different networks usually need to be deployed to realize the different detection tasks. However, running multiple perception networks at the same time increases the power consumption of the hardware and reduces the running speed of the models. Existing perception networks that can complete multiple detection tasks also have problems such as long running time. For example, as shown in fig. 7, a multi-head (head) multi-task perception network includes a backbone network (backbone) and a plurality of heads, where each head includes a candidate region generation network (region proposal network, RPN), a region of interest extraction (region of interest Align, ROI-Align) module, and a region convolutional neural network (region convolutional neural networks, RCNN). The time taken by the RPN to generate candidate regions (generate proposal) is long, which makes such a network difficult to apply in scenarios with high real-time requirements; moreover, the number of heads increases with the number of detection tasks, so memory consumption, computing power, and computation time increase rapidly. Chips used in many fields have low computing power, making it difficult to deploy a large-scale perception network on them, let alone multiple perception networks.
The embodiment of the application provides a perception network, which can reduce the quantity of parameters and calculated quantity in the perception network, reduce the power consumption of hardware and improve the running speed of a model.
Fig. 8 shows a schematic diagram of a perception network in an embodiment of the present application. The perception network 800 in fig. 8 includes a backbone network (backbone) 810 and a header.
The sensing network in the embodiment of the application can be realized by hardware, software or a combination of the hardware and the software.
The backbone network 810 is configured to perform convolution processing on the input image to obtain a first feature map of the input image.
The backbone network 810 may extract the underlying features through a series of convolution processes, providing corresponding features for subsequent detection.
In the embodiment of the present application, the "first feature map" refers to a feature map (feature map) output by the backbone network. The feature maps output by the backbone network may all be referred to as first feature maps.
The first feature map of the input image may be one or a plurality of first feature maps.
Illustratively, the backbone network 810 may output feature maps of the input image at different scales. The feature maps at different scales can be understood as the first feature map of the input image, which feature maps can provide basic features for subsequent detection.
The feature maps at different scales can be understood as feature maps of different resolutions or, in other words, feature maps of different sizes.
Illustratively, the backbone network 810 may take various forms of networks; for example, the backbone network 810 may use a visual geometry group (visual geometry group, VGG) network, a residual neural network (residual neural network, resnet), an Inception network (Inception-net), which is the core structure of GoogLeNet, or the like.
The header is used for detecting the target object according to the second feature map, and outputting a target two-dimensional (2D) frame of the target object and first indication information. The target object includes objects to be detected in a plurality of tasks. The second feature map is determined from the first feature map. The first indication information is used for indicating the category to which the target object belongs.
That is, the header is configured to implement target detection according to the second feature map, and output a target 2D frame of the target object and the first indication information.
For example, the first indication information may include a confidence that the target object belongs to each category. I.e. the class to which the target object belongs may be indicated by a confidence that the target object belongs to the respective class. The higher the confidence, the greater the probability that the target object belongs to the category to which the confidence corresponds. For example, the category corresponding to the highest confidence is the category to which the target object belongs. Alternatively, the first indication information may be a category to which the target object belongs. Alternatively, the first indication information may include a confidence level of a category to which the target object belongs. Each category includes object categories in a plurality of tasks. The embodiment of the application does not limit the specific form of the first indication information.
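As a small, hedged illustration of the first form of the first indication information (per-category confidences, with the category of the highest confidence indicating the category to which the target object belongs), assuming 31 categories and PyTorch-style tensors:

    import torch

    # Hypothetical confidences of one target object over 31 categories.
    confidences = torch.sigmoid(torch.randn(31))

    # The category corresponding to the highest confidence is taken as the category
    # to which the target object belongs.
    predicted_category = int(torch.argmax(confidences))
    print(predicted_category, float(confidences[predicted_category]))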
One header may perform detection of an object to be detected in a plurality of tasks, that is, to detect whether the object to be detected in the plurality of tasks exists in an input image.
Multiple tasks can also be understood as multiple broad categories. One broad class includes at least one class. Alternatively, a major class is a collection of at least one class. The division criteria of the tasks may be set as desired. For example, objects to be detected are divided into a plurality of tasks according to the similarity of the objects to be detected.
The object to be detected is divided into a plurality of tasks. It is also understood that the objects to be detected are divided into a plurality of broad categories. The object categories in the respective tasks may be the same or different.
Illustratively, 31 classes of objects to be detected are divided into 8 major classes, i.e., 8 tasks, according to the similarity of the objects to be detected and the abundance and scarcity of the training samples, as shown in table 1.
TABLE 1
It should be noted that the division manner in table 1 is merely an example, and in other embodiments, a different task division manner from table 1 may be used, which is not limited in the embodiment of the present application.
One header may be used to accomplish a variety of target detection tasks. For example, a header may perform 8 tasks in table 1 above, outputting the target 2D frame of the target object and the confidence that the target object belongs to the class 31 object.
Alternatively, the awareness network 800 may also include other processing modules connected to the header. And the other processing modules are used for obtaining other detection information of the target object according to the target 2D frame of the target object output by the header.
For example, the other processing modules may extract, from the feature map output by the backbone network, features of an area where the target 2D frame is located according to the target 2D frame output by the header, and complete 3D detection or key point detection of the target object in the target 2D frame according to the extracted features.
It should be understood that the foregoing is merely illustrative, and other processing modules are optional modules, which may be set according to actual needs, which is not limited in this embodiment of the present application.
The header is described in detail below.
Specifically, the header includes an RPN820, a region of interest extraction module 830, and a classification regression network 840.
And the RPN820 is configured to predict an area where the target object is located on the second feature map, and output position information of the candidate 2D frame matched with the area where the target object is located, that is, position information of the candidate 2D frame of the target object. The target object comprises an object to be detected in a plurality of tasks, each task in the plurality of tasks comprising at least one category, and the second feature map is determined according to the first feature map.
The region of interest extraction module 830 is configured to extract first feature information on a third feature map based on the location information of the candidate 2D frame, where the first feature information is a feature of the region where the candidate 2D frame is located, and the third feature map is determined according to the first feature map.
The classification regression network 840 is configured to process the first feature information, output target 2D frames of the target object, and first indication information, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate a class to which the target object belongs.
For example, as shown in fig. 8, the classification regression network may output a target 2D box (box) and a class label (label) of an object to be detected in a plurality of tasks. The class label of the target object may be used as the first indication information. It should be understood that, in fig. 8, class labels are merely examples and are not limited to the solution of the embodiment of the present application.
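The data flow through the header can be sketched as follows (a minimal PyTorch-style outline; the module names, interfaces, and return values are assumptions made for illustration and do not define the embodiment):

    import torch.nn as nn

    class Header(nn.Module):
        """One RPN shared by all tasks, a region of interest extraction module,
        and a classification regression network (all interfaces are illustrative)."""
        def __init__(self, rpn, roi_extractor, cls_reg_net):
            super().__init__()
            self.rpn = rpn
            self.roi_extractor = roi_extractor
            self.cls_reg_net = cls_reg_net

        def forward(self, second_feature_map, third_feature_map):
            proposals = self.rpn(second_feature_map)               # candidate 2D frames for all tasks
            roi_feats = self.roi_extractor(third_feature_map, proposals)
            target_boxes, class_conf = self.cls_reg_net(roi_feats, proposals)
            return target_boxes, class_conf                        # target 2D frames + first indication information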
The multiple tasks in the embodiment of the application share the same RPN, and the RPN can also be called a single-head multi-task RPN.
The RPN820 may predict regions where the target object may exist on the second feature map and give boxes that match the regions where the target object may exist, which may be referred to as candidate regions (proposals), i.e., candidate 2D boxes. The box that matches proposal may also be referred to as the 2D box of proposal.
The target object includes objects to be detected in a plurality of tasks, for example, the objects to be detected in the 8 tasks in table 1, and the RPN 820 is configured to predict the regions where the objects to be detected in the 8 tasks may exist.
By way of example, the target object may comprise an object to be detected in all tasks of the perception network. That is, the RPN may be used to predict the region of the object to be detected in all tasks that may exist on the second feature map. Alternatively, all tasks of the awareness network share the same RPN.
The second feature map may be one or more.
Illustratively, the second feature map may include one or more of the first feature maps.
Optionally, the perception network 800 also includes a feature pyramid network (feature pyramid networks, FPN).
The FPN is connected to the backbone 810, and is configured to perform feature fusion on the feature maps output by the backbone 810, that is, perform feature fusion on the first feature maps of the input image, and output the fused feature maps. The fused feature maps are input into the RPN. In this case, the second feature map may include one or more of the fused feature maps.
Specifically, the FPN takes the feature maps of different scales output by the backbone 810 as input, generates feature maps with stronger expressive power through feature fusion in the longitudinal direction of the FPN and feature fusion with the same layer of the backbone 810 in the transverse direction, and provides the feature maps with stronger expressive power to subsequent modules, thereby improving the performance of the model.
That is, FPN may be used to achieve multi-scale feature fusion.
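A hedged sketch of the top-down (longitudinal) and lateral (same-layer) fusion performed by an FPN is given below; the channel counts, number of levels, and upsampling mode are assumptions, and c4/c5 stand for two backbone feature maps whose spatial sizes differ by a factor of two.

    import torch.nn as nn
    import torch.nn.functional as F

    class TinyFPN(nn.Module):
        """Illustrative fusion of two backbone feature maps into feature maps for the RPN."""
        def __init__(self):
            super().__init__()
            self.lateral_c4 = nn.Conv2d(512, 256, 1)    # lateral (same-layer) 1x1 convolution
            self.lateral_c5 = nn.Conv2d(1024, 256, 1)
            self.smooth = nn.Conv2d(256, 256, 3, padding=1)

        def forward(self, c4, c5):
            p5 = self.lateral_c5(c5)
            # top-down fusion: upsample the deeper map and add it to the lateral output
            p4 = self.lateral_c4(c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
            return self.smooth(p4), p5                  # fused feature maps with stronger expressive power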
In the case where the perception network does not include an FPN, the backbone 810 is connected to the RPN 820 as shown in fig. 8.
The region of interest extraction module 830 is configured to, according to the candidate 2D frame output by the RPN 820, crop out the features of the region where the candidate 2D frame is located on the third feature map.
Illustratively, that the third feature map is determined from the first feature map includes:
in the case where the perception network includes an FPN, the third feature map may be one of the feature maps output by the backbone (i.e., the first feature maps) or one of the fused feature maps output by the FPN;
in the case where the perception network does not include an FPN, the third feature map may be one of the feature maps output by the backbone (i.e., the first feature maps).
For example, according to the proposals provided by the RPN 820, the region of interest extraction module 830 crops out the features of the region where each proposal is located on a feature map output by the backbone or the FPN, and resizes (resize) them to a fixed size to obtain the features of each proposal.
Illustratively, the region of interest extraction module 830 may employ feature extraction methods such as region of interest pooling (ROI-pooling), region of interest extraction (ROI-Align), position-sensitive region of interest pooling (position sensitive ROI pooling, PS-ROIPooling), or position-sensitive region of interest extraction (position sensitive ROI Align, PS-ROIAlign).
For example, the region of interest extraction module 830 uses interpolation and sampling within the region where a proposal is located to crop out features with a fixed resolution, and inputs the cropped-out features into the subsequent module.
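If ROI-Align is chosen as the extraction method, one way to sketch the cropping of fixed-size features is shown below, using torchvision's roi_align operator; the feature map shape, spatial scale, and proposal coordinates are hypothetical.

    import torch
    from torchvision.ops import roi_align

    feature_map = torch.randn(1, 256, 100, 168)          # third feature map (hypothetical shape)
    # Candidate 2D frames from the RPN, as (x1, y1, x2, y2) in image coordinates.
    proposals = [torch.tensor([[10., 20., 90., 120.],
                               [200., 40., 300., 160.]])]

    # Crop the features of the region where each proposal is located and resize to a fixed 7x7 size.
    roi_features = roi_align(feature_map, proposals, output_size=(7, 7),
                             spatial_scale=1.0 / 8, sampling_ratio=2, aligned=True)
    print(roi_features.shape)                            # torch.Size([2, 256, 7, 7])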
Optionally, the classification regression network 840 is specifically configured to: processing the first characteristic information to obtain the confidence that the candidate 2D frames belong to each category in the plurality of tasks; adjusting the position information of the candidate 2D frames to obtain adjusted candidate 2D frames; determining a target 2D frame according to the adjusted candidate 2D frame; and determining first indication information according to the confidence that the target 2D frame belongs to each category.
Illustratively, the position information of the candidate 2D frame is adjusted such that the adjusted candidate 2D frame is more matched to the shape of the real object than the candidate 2D frame, i.e. the adjusted candidate 2D frame is a more compact candidate 2D frame.
Further, a frame merging operation is performed on the adjusted candidate 2D frames to obtain the target 2D frames. For example, the adjusted candidate 2D frames are merged through non-maximum suppression (non maximum suppression, NMS) to obtain the target 2D frames.
For example, for the 8 tasks in table 1, the classification regression network 840 refines each proposal provided by the region of interest extraction module 830 to obtain the confidence that each proposal belongs to 31 categories in the 8 tasks, and adjusts the coordinates of the 2D frame of each proposal to obtain the adjusted candidate 2D frame. Further, after the adjusted candidate 2D frames are combined by the NMS, the target 2D frame and the first indication information are obtained. The number of candidate 2D frames is greater than or equal to the number of target 2D frames.
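The frame merging step can be sketched as follows with torchvision's non-maximum suppression; the boxes, scores, and IoU threshold are hypothetical values used only to show that the number of target 2D frames is at most the number of candidates.

    import torch
    from torchvision.ops import nms

    # Hypothetical adjusted candidate 2D frames and their highest class confidences.
    boxes = torch.tensor([[10., 20., 90., 120.],
                          [12., 22., 92., 118.],        # near-duplicate of the first frame
                          [200., 40., 300., 160.]])
    scores = torch.tensor([0.9, 0.8, 0.7])

    keep = nms(boxes, scores, iou_threshold=0.5)         # frame merging by non-maximum suppression
    target_boxes = boxes[keep]                           # number of target 2D frames <= number of candidates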
In one possible implementation, the classification regression network 840 includes a plurality of third RCNNs, wherein the plurality of third RCNNs are in one-to-one correspondence with the plurality of tasks. I.e. each third RCNN performs detection of the object to be detected in a different task, respectively.
Fig. 9 shows a schematic block diagram of a perception network provided by an embodiment of the application. For example, as shown in FIG. 9, the perception network includes backbone, FPN, RPN, ROI-Align modules and n third RCNNs.
Specifically, the third RCNN is configured to: process the features of the region where the candidate 2D frame is located to obtain the confidence that the candidate 2D frame belongs to the object categories of the task corresponding to the third RCNN; and adjust the position information of the candidate 2D frame to obtain the adjusted candidate 2D frame.
That is, any one of the plurality of third RCNNs can predict the confidence that the candidate 2D frame belongs to the object category in the task corresponding to the third RCNN and obtain the adjusted candidate 2D frame. The plurality of third RCNNs may obtain confidence that the candidate 2D frames belong to each category, and adjusted candidate 2D frames obtained by each third RCNN.
Further, after the adjusted candidate 2D frames are combined by the NMS, the target 2D frame and the first indication information are obtained.
For example, if the task corresponding to the third rcnn1# is the detection task of the car in table 1, the third rcnn1# outputs the confidence that each proposal belongs to three categories of cars, trucks and buses, and the adjusted candidate 2D frame. The task corresponding to the third rcnn2# is the detection task of the wheel and the vehicle lamp in table 1, and then the third rcnn2# outputs the confidence that each proposal belongs to two categories of the wheel and the vehicle lamp and the adjusted candidate 2D frame. Thus, for either proposal, a total of five categories of confidence and adjusted candidate 2D frames can be obtained from the third rcnn1# and third rcnn2# processing.
The awareness network in FIG. 9 is used to implement n tasks (tasks), including, for example, task 0, task 1 …, task n-1 in FIG. 9. n is an integer greater than 1. The n third RCNNs are respectively in one-to-one correspondence with the n tasks. Taking task 0 as an example, the third RCNN corresponding to task 0 outputs the confidence that each proposal belongs to each object class in task 0 and the adjusted candidate 2D frame. And n third RCNNs corresponding to the n tasks respectively obtain the confidence coefficient of each proposal belonging to each object category in the corresponding task, and the confidence coefficient of each proposal belonging to each category can be obtained by classifying the regression network.
Note that the FPN in fig. 9 is an optional module. In fig. 9, the ROI-alignment module is taken as an example of the region of interest extraction module, and other manners of extracting corresponding features may be also adopted, which are described in detail in the foregoing, and are not repeated here.
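A hedged sketch of this implementation (one third RCNN per task) follows; the feature dimension, the per-task class counts, and the use of a sigmoid are assumptions for illustration.

    import torch
    import torch.nn as nn

    class ThirdRCNN(nn.Module):
        """Per-task RCNN: confidences for that task's object categories plus box adjustment."""
        def __init__(self, in_dim, num_classes):
            super().__init__()
            self.cls = nn.Linear(in_dim, num_classes)
            self.reg = nn.Linear(in_dim, 4)

        def forward(self, roi_feat):
            return torch.sigmoid(self.cls(roi_feat)), self.reg(roi_feat)

    # One third RCNN per task; the class counts (3, 2, 4) are hypothetical.
    rcnns = nn.ModuleList([ThirdRCNN(1024, n) for n in (3, 2, 4)])
    roi_feat = torch.randn(10, 1024)                     # features of 10 proposals
    per_task = [rcnn(roi_feat) for rcnn in rcnns]        # each third RCNN detects its own task
    all_conf = torch.cat([conf for conf, _ in per_task], dim=1)   # confidences over all categories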
In another possible implementation, the classification regression network includes a first RCNN including a hidden layer, a plurality of sub-classification full-connection layers (classification fully connected layers, cls fc) and a plurality of sub-regression full-connection layers (regression fully connected layers, reg fc), the hidden layer being connected to the plurality of sub-classification full-connection layers, the hidden layer being connected to the plurality of sub-regression full-connection layers, the plurality of sub-classification full-connection layers being in one-to-one correspondence with the plurality of tasks, the plurality of sub-regression full-connection layers being in one-to-one correspondence with the plurality of tasks.
Alternatively, the first RCNN includes a hidden layer, a plurality of sub-cls fc and a plurality of sub-reg fc corresponding to a plurality of tasks. Each task may have a separate one of sub-classifications fc and sub-regressions fc.
Fig. 10 shows a schematic block diagram of another perception network provided by an embodiment of the application. For example, as shown in FIG. 10, the perception network includes backbone, FPN, RPN, ROI-Align modules and a first RCNN.
And the hidden layer is used for processing the first characteristic information to obtain second characteristic information.
That is, the hidden layer is used for processing the characteristics of the region where the candidate 2D frame is located, and the processed results are respectively input into the multiple sub-classification full-connection layers and the multiple sub-regression full-connection layers.
Illustratively, the hidden layer may include at least one of: a convolutional layer or a fully-concatenated layer. Since multiple tasks share hidden layers, the convolution layers in hidden layers may also be referred to as shared convolution layers (shared convolutional), and the full-connection layers in hidden layers may also be referred to as shared full-connection layers (shared fully connected layers).
The sub-classification full-connection layer is used for obtaining the confidence coefficient of the object category of the candidate 2D frame belonging to the task corresponding to the sub-classification full-connection layer according to the second characteristic information.
And the sub-regression full-connection layer is used for adjusting the position information of the candidate 2D frames according to the second characteristic information to obtain the adjusted candidate 2D frames. Further, the sub-regression full connection layer may output more compact candidate 2D frames by a frame merging operation, e.g., removing duplicate frames by an NMS operation.
The sub-classification full-connection layer and the sub-regression full-connection layer corresponding to each task can finish detection of an object to be detected in the task, specifically, the sub-classification full-connection layer can output confidence that the candidate 2D frame belongs to an object category in the task, and the sub-regression full-connection layer can output the adjusted candidate 2D frame. That is, one first RCNN may perform detection of an object to be detected in a plurality of tasks. The first RCNN may also be referred to as a single-head multi-tasking RCNN.
The first RCNN may predict a confidence that the candidate 2D frame belongs to an object class in a plurality of tasks corresponding to the first RCNN and obtain an adjusted candidate frame.
For example, the tasks corresponding to the first RCNN include the 8 tasks in table 1. In this case, the first RCNN includes 8 sub-cls fc and 8 sub-reg fc corresponding one-to-one to the 8 tasks; each sub-cls fc outputs the confidence of the object categories in the task corresponding to that sub-cls fc, and each sub-reg fc outputs the adjusted candidate 2D frames, so that the first RCNN can obtain the confidence of the 31 classes of objects in the 8 tasks and the adjusted candidate 2D frames.
The awareness network in FIG. 10 is used to implement n tasks (tasks), including, for example, task 0, task 1 …, task n-1 in FIG. 10. n is an integer greater than 1. The first RCNN includes a hidden layer, n sub-cls fc and n sub-reg fc corresponding to n tasks, respectively. The hidden layer may include Shared fc and/or Shared conv.
Taking task 0 as an example, a sub-cls fc corresponding to task 0 in the first RCNN outputs a confidence level that each proposal belongs to each object class in task 0, and a sub-reg fc corresponding to task 0 outputs an adjusted candidate 2D frame. In this way, n sub-cls fc corresponding to n tasks respectively obtain the confidence coefficient of each proposal belonging to each object class in the corresponding task, and the first RCNN can obtain the confidence coefficient of each proposal belonging to each class.
Note that the FPN in fig. 10 is an optional module. In fig. 10, the ROI-alignment module is taken as an example of the region of interest extraction module, and other manners of extracting corresponding features may be also adopted, which are described in detail in the foregoing, and are not repeated here.
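A minimal sketch of the first RCNN structure is given below, assuming a shared fully connected hidden layer and hypothetical dimensions and per-task class counts.

    import torch
    import torch.nn as nn

    class FirstRCNN(nn.Module):
        """Shared hidden layer plus one sub-classification fc and one sub-regression fc per task."""
        def __init__(self, in_dim=1024, hidden=1024, classes_per_task=(3, 2, 4)):
            super().__init__()
            self.hidden = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())   # shared fc
            self.sub_cls = nn.ModuleList([nn.Linear(hidden, c) for c in classes_per_task])
            self.sub_reg = nn.ModuleList([nn.Linear(hidden, 4) for _ in classes_per_task])

        def forward(self, roi_feat):
            h = self.hidden(roi_feat)                                 # second feature information
            confs = [torch.sigmoid(fc(h)) for fc in self.sub_cls]     # per-task category confidences
            deltas = [fc(h) for fc in self.sub_reg]                   # per-task box adjustments
            return confs, deltas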
In another possible implementation, the classification regression network includes a second RCNN, and the second RCNN includes a hidden layer, a classification fully connected layer, and a regression fully connected layer, where the hidden layer is connected to the classification fully connected layer and to the regression fully connected layer.
Fig. 11 shows a schematic block diagram of yet another perception network provided by an embodiment of the application. For example, as shown in FIG. 11, the perception network includes backbone, FPN, RPN, ROI-Align modules and a second RCNN.
And the hidden layer is used for processing the first characteristic information to obtain third characteristic information.
That is, the hidden layer is used for processing the characteristics of the region where the candidate 2D frame is located, and the processed results are respectively input to the classification full-connection layer and the regression full-connection layer.
Illustratively, the hidden layer may include at least one of: a convolutional layer or a fully-concatenated layer. For a detailed description, reference may be made to the first RCNN, which is not described here again.
And the classification full-connection layer is used for obtaining the confidence coefficient of the candidate 2D frame belonging to each category according to the third characteristic information.
And the regression full-connection layer is used for adjusting the position information of the candidate 2D frames according to the third characteristic information to obtain the adjusted candidate 2D frames. Further, performing frame merging operation on the adjusted candidate 2D frames to obtain target 2D frames.
That is, one second RCNN performs detection of the objects to be detected in a plurality of tasks. The second RCNN may also be referred to as a single-head multi-task RCNN.
Specifically, the classification fully connected layer is obtained by combining the multiple sub-classification fully connected layers in the first RCNN, and the regression fully connected layer is obtained by combining the multiple sub-regression fully connected layers in the first RCNN. In this case, the first feature information and the third feature information are the same.
Combining the multiple sub-classification fully connected layers can be understood as stitching the weight matrices of the multiple sub-classification fully connected layers. Combining the multiple sub-regression fully connected layers can be understood as stitching the weight matrices of the multiple sub-regression fully connected layers.
The first RCNN may normalize the label logit values (label logits) obtained by each sub-classification fc with a sigmoid function, which is equivalent to performing a separate classification for each class, so that the confidence that the target object belongs to one class is independent of the other classes. Therefore, combining the sub-classification fc of the tasks in the model into one classification fc does not affect the inference result of the model; that is, the output of the sub-classification fc is the same as the output of the classification fc obtained by combining the sub-classification fc.
That is, the tasks performed by the second RCNN and the first RCNN may be exactly the same, and the output results are the same. However, an accelerator such as an NPU completes only one matrix operation at a time. In the first RCNN, the output result of the hidden layer needs to be input into the sub-classification fc and the sub-regression fc corresponding to each task, and the matrix operation is performed multiple times; the number of matrix multiplications in the first RCNN increases with the number of tasks, whereas the number of matrix multiplications performed in the second RCNN is not affected by the number of tasks. That is, when the parameter quantities of the first RCNN and the second RCNN are the same, the time required to execute the second RCNN is less than the time required to execute the first RCNN.
Therefore, the sub-classification full-connection layers corresponding to the tasks in the first RCNN are combined to obtain the classification full-connection layer of the second RCNN, and the sub-regression full-connection layers corresponding to the tasks in the first RCNN are combined to obtain the regression full-connection layer of the second RCNN, so that the operation times of matrix multiplication in the neural network accelerator can be reduced, the neural network accelerator is more friendly to hardware, and the time consumption is further reduced.
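The merging of sub fully connected layers by stitching their weight matrices can be sketched as follows, assuming the FirstRCNN sketch above and PyTorch linear layers; merge_fcs is a hypothetical helper introduced for illustration, not a named component of the embodiment.

    import torch
    import torch.nn as nn

    def merge_fcs(sub_fcs):
        """Stitch the weight matrices (and biases) of several sub fully connected layers
        into a single fully connected layer; the output order follows the task order."""
        merged = nn.Linear(sub_fcs[0].in_features,
                           sum(fc.out_features for fc in sub_fcs))
        with torch.no_grad():
            merged.weight.copy_(torch.cat([fc.weight for fc in sub_fcs], dim=0))
            merged.bias.copy_(torch.cat([fc.bias for fc in sub_fcs], dim=0))
        return merged

    # Assuming first_rcnn is a trained FirstRCNN as sketched earlier:
    # cls_fc = merge_fcs(list(first_rcnn.sub_cls))   # classification fc of the second RCNN
    # reg_fc = merge_fcs(list(first_rcnn.sub_reg))   # regression fc of the second RCNN
    # A single matrix multiplication then covers all tasks, independent of the task count.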
The perception network in FIG. 11 is used to implement n tasks (tasks), for example, including task 0, task 1, ..., task n-1 in FIG. 11. n is an integer greater than 1. The second RCNN includes a hidden layer, cls fc and reg fc. The hidden layer may include Shared fc and/or Shared conv. cls fc may be obtained by combining the n sub-cls fc in fig. 10, and reg fc may be obtained by combining the n sub-reg fc in fig. 10.
Thus, cls fc may output the confidence that each proposal belongs to a respective category, reg fc may output the adjusted candidate 2D frame.
Note that the FPN in fig. 11 is an optional module. In fig. 11, the ROI-alignment module is taken as an example of the region of interest extraction module, and other manners of extracting corresponding features may be also adopted, which are described in detail in the foregoing, and are not repeated here.
Illustratively, during training of the perception network, the classification regression network uses the first RCNN; after training is completed, the second RCNN is obtained based on the first RCNN. That is, in the perception network used for inference, the classification regression network may use the second RCNN.
Illustratively, the sensing network in fig. 10 may be applied to a training side, and the first RCNN in the trained sensing network is combined to obtain the sensing network shown in fig. 11, that is, the model parameters in fig. 11 are obtained according to the model parameters in fig. 10. The awareness network in fig. 11 can be applied to the inference side to reduce time consumption.
According to the solution of the embodiments of the present application, a plurality of perception tasks are completed by one perception network, the plurality of tasks share one RPN, and the regions where the objects to be detected in the plurality of tasks are located are predicted by the one RPN. In this way, while the performance of the perception network is ensured, the parameter quantity and the calculation amount of the perception network are reduced and the processing efficiency is improved, which facilitates deploying the perception network in scenarios with high real-time requirements, reduces the pressure on hardware, and saves cost.
In addition, in the scheme of the embodiment of the application, the first RCNN or the second RCNN is adopted as the classification regression network, and a plurality of tasks share the hidden layer of the RCNN, so that the parameter quantity and the calculated quantity of the perception network are further reduced, and the processing efficiency is improved. Moreover, when the first RCNN training is adopted, each task corresponds to an independent sub-classification fc and sub-regression fc, so that the expandability of the sensing network is improved, and the sensing network can flexibly realize functional configuration by increasing or decreasing the sub-classification fc and the sub-regression fc, namely, flexibly increase or decrease the detection task by increasing or decreasing the sub-classification fc and the sub-regression fc.
In addition, in the scheme of the embodiment of the application, the sub-classifications fc and the sub-regressions fc in the first RCNN are combined, and the second RCNN is adopted as the classification regression network, so that the operation of matrix operation can be further reduced, the hardware is more friendly, the operation time consumption is further reduced, and the processing efficiency is improved.
The sensing network in the embodiment of the application can be trained by adopting the existing training method.
However, when training is performed with the existing training method, if fully labeled sample data is used, all the objects to be detected of all tasks present on the sample images in the dataset need to be labeled, and the labeling cost is high. In addition, if the perception network needs to be extended, that is, a new task is added, the sample images in the whole dataset need to be re-labeled once to supplement the objects to be detected in the new task, which further increases the labeling cost and reduces the extensibility of the perception network.
If partially labeled sample images are used for training, the objects to be detected of all tasks do not need to be labeled on one sample image, so the labeling cost can be reduced. However, because the tasks share one RPN, the training data of different tasks may suppress each other when the RPN is trained, so that the RPN cannot predict the candidate regions of the objects to be detected in all tasks, which affects the accuracy of the perception network. Specifically, because the labeling data is only partial labeling data, for example, only the labeling data of the objects to be detected of one task is labeled on a sample image, when training is performed with the labeling data of the objects to be detected of that task, the parameters of the RPN are adjusted so that the RPN can more accurately predict the candidate 2D frames of the objects to be detected of that task, but cannot accurately predict the candidate 2D frames of the objects to be detected of the other tasks on the sample image. When training is then performed with the labeling data of the objects to be detected of another task, the parameters of the RPN are adjusted again, so that the adjusted RPN may no longer accurately predict the candidate 2D frames of the objects to be detected of the other tasks. In this way, the training data of different tasks may suppress each other, with the result that the RPN cannot predict all target objects in the image.
The embodiment of the application provides a training method of a perception network, which uses other perception networks to perform inference on the sample images in the training set to provide pseudo frames (Pseudo bounding boxes, Pseudo Bboxes) for the objects to be detected that are not labeled in the sample images, and further trains the RPN based on the pseudo frames and the labeling data together, which is beneficial to obtaining candidate 2D frames of the objects to be detected in a plurality of tasks.
Fig. 12 shows a training method 1200 of a neural network model according to an embodiment of the present application, where the method 1200 may be performed by a training apparatus of the neural network model. The training apparatus may be a cloud service device, or may be a terminal device, for example, a device with computing power sufficient for executing the training method of the neural network model, such as a computer or a server, or may be a system formed by the cloud service device and the terminal device. Illustratively, the method 1200 may be performed by the training device 120 of fig. 3, the neural network processor 50 of fig. 5, or the execution device 310 of fig. 6. The perception network includes an RPN, which is used for predicting position information of a candidate 2D frame of a target object in the sample image, where the target object includes objects to be detected of a plurality of tasks, and each of the plurality of tasks includes at least one category.
Alternatively, the sensing network may be the sensing network shown in fig. 8. In order to avoid unnecessary repetition, the description of the correlation is appropriately omitted when describing the training method. In the training process, the input image is replaced by the sample image.
The method 1200 includes steps S1210 to S1220, and the following description will explain steps S1210 to S1220.
S1210, obtaining training data.
The target object includes a first task object and a second task object. The training data comprises a sample image, labeling data of a first task object on the sample image and a pseudo frame of a second task object on the sample image, wherein the labeling data comprises a class label of the first task object and a labeling 2D frame of the first task object, and the pseudo frame of the second task object is a target 2D frame of the second task object obtained by reasoning the sample image through other perception networks.
The annotation data can also be understood as a true value (ground truth). The labeled class labels are used to indicate the true class to which the task object belongs. The annotation data of the first task object can also be understood as the annotation data of the sample image. The full annotation data of a sample image includes class labels and labeled 2D boxes of the objects to be detected in all tasks on the sample image, i.e., annotation information of all objects of interest. Partial annotation data includes class labels and labeled 2D boxes of the objects to be detected in only part of the tasks on the sample image, i.e., annotation information of only a part of the objects of interest.
The first task object may comprise an object to be detected in one or more tasks. The one or more tasks are tasks where the first task object is located. The first task objects in the different sample images in the training set may be the same or different. The "first" of the "first task objects" in the embodiment of the present application is only used to define the object to be detected having the true value in the sample image, and has no other defining effect.
For example, the labeling data of the sample image 1# is labeling data of a vehicle, that is, the first task object in the sample image 1# includes an object to be detected in a detection task of the vehicle, for example, a truck, a sedan, a bus, and the like; the labeling data of the sample image 2# is labeling data of wheels and car lights, namely, a first task object in the sample image 2# comprises an object to be detected in a detection task of the wheels and the car lights, such as the wheels, the car lights and the like; the labeling data of the sample image 3# comprises labeling data of the vehicle and labeling data of wheels and vehicle lamps, namely, a first task object in the sample image 3# comprises an object in a detection task of the vehicle and an object to be detected in the detection task of the wheels and the vehicle lamps.
That is, the labeling data of the sample images in the embodiment of the application can be partial labeling data, so that targeted acquisition can be performed, that is, the required sample images are acquired for specific tasks, and the objects to be detected of all tasks do not need to be labeled in each sample image, which reduces the acquisition cost and the labeling cost of the data. In addition, the scheme adopting partial labeling data has flexible expansibility: when a task is added, only the labeling data of the newly added task needs to be provided, and new objects to be detected do not need to be labeled on the basis of the original training data.
The Pseudo Bboxes on the sample image are target 2D boxes of the second task object obtained by inference on the sample image through other perception networks. The Pseudo Bboxes on the sample image can also be understood as the Pseudo Bboxes of the second task object.
Other perception networks refer to perception networks other than the perception network to be trained. By way of example, the other perception network may be a multi-head multi-task perception network.
For example, the sample images in the training set are inferred by using the perception network as shown in fig. 7, so as to obtain an inference result of the sample images, wherein the inference result comprises a target 2D frame of the target object on the sample images.
By way of example, other sensing networks may also include multiple single-tasked sensing networks.
For example, the sample images in the training set are respectively inferred by adopting a plurality of perception networks of single tasks, the inference results of the sample images are respectively obtained, the inference results of the perception networks of each single task comprise the target 2D frames of the objects to be detected in the tasks on the sample images, and the target 2D frames of the objects to be detected in the tasks on the sample images can be obtained according to the inference results of the perception networks of the single tasks.
The second task object may comprise objects to be detected in one or more tasks. The one or more tasks are the tasks where the second task object is located. The second task object and the first task object may include the same object to be detected. The second task objects in different sample images in the training set may be the same or different. The "second" of the "second task object" in the embodiment of the present application is only used to define the object to be detected having a pseudo frame in the sample image, and has no other defining effect.
For example, in the case that the same object to be detected exists in both the first task object and the second task object, the labeling frame in the labeling data is taken as the target output of the RPN. The labeling data is usually manually labeled data, whose accuracy is usually higher than that of the pseudo frames obtained by inference of other perception networks; taking the labeling frame as the target output can therefore improve the accuracy of the trained model.
For example, the plurality of tasks that the sensing network needs to complete include 8 tasks in table 1, the labeling data of the sample image 1# is labeling data of the vehicle, the first task object in the sample image 1# includes an object to be detected in the detection task of the vehicle, for example, a truck, a car and a bus, that is, the labeling data of the sample image 1# is part of the labeling data. And reasoning the sample image 1# through other perception networks to obtain a target 2D frame of the second task object, namely a pseudo frame. For example, the sample image 1# is inferred by 7 single-task awareness networks for completing 7 tasks other than the detection task of the vehicle in table 1, resulting in a target 2D frame of the second task object, which in this case may include an object of the 7 tasks other than the detection task of the vehicle in table 1. For another example, a multi-head multi-task perception network as shown in fig. 7 may be used to complete 8 tasks in table 1, and by using the perception network to infer the sample image 1# a target 2D frame of the second task object may be obtained, where the second task object may include an object to be detected in 8 tasks in table 1. Thus, after the pseudo frame and the labeling frame are combined, the region where the object to be detected is located in 8 tasks in the sample image 1# can be obtained.
The pseudo frames are used to supplement the unlabeled objects to be detected in the sample image, so that when the RPN is trained based on partial labeling data, the mutual inhibition between the partial labeling data of different tasks, which would affect the training of the RPN, is avoided. This improves the recall rate of the RPN and helps the RPN predict the regions where the objects to be detected in all tasks to be detected are located.
Further, other perception networks infer the sample image, so that a target 2D frame of the second task object on the sample image and the confidence level of the category to which the second task object belongs can be obtained. And under the condition that the confidence is larger than or equal to a first threshold value, taking the target 2D frame of the second task object on the sample image obtained by reasoning of other perception networks as a pseudo frame on the sample image. That is, in the event that the confidence level is greater than or equal to the first threshold, training is performed using the inference results of other perceptual networks.
For example, a low threshold may be employed for filtering. For example, the first threshold may be 0.05, that is, a target 2D box with a confidence greater than or equal to 0.05 is used as a pseudo box on the sample image and participates, together with the annotation data, in the training of the perception network. It should be understood that the first threshold may be set as desired, which is not limited by the embodiment of the present application.
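For illustration only, a minimal sketch of this confidence-based filtering is given below. The record layout, field names and helper name are assumptions made for the example and are not part of the embodiment.

```python
# Illustrative sketch: keep only sufficiently confident inference results as Pseudo Bboxes.
# The record layout and field names ('box', 'class', 'confidence') are assumptions.

FIRST_THRESHOLD = 0.05  # the "first threshold"; a low value so that few objects are missed

def select_pseudo_boxes(inference_results, first_threshold=FIRST_THRESHOLD):
    """inference_results: list of dicts with keys 'box', 'class' and 'confidence'."""
    return [r for r in inference_results if r["confidence"] >= first_threshold]
```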
S1220, training the perception network based on the training data.
Specifically, step S1220 may include steps S1221 to S1223.
S1221, calculating a first loss function value according to the difference between the labeling 2D frame of the first task object and the target 2D frame of the second task object and the candidate 2D frame of the target object in the sample image obtained by the RPN prediction.
That is, the labeling 2D frame of the first task object and the target 2D frame of the second task object are compared with the candidate 2D frame of the target object predicted by the RPN, so as to obtain a loss function value in the RPN stage, that is, a first loss function value.
Forward propagation of the sensing network is performed based on the sample image, and the candidate 2D frame of the target object on the sample image is obtained by RPN prediction, and the specific forward propagation process is referred to in fig. 8, which is not described herein.
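The first loss function value can thus be pictured as an ordinary RPN loss whose targets are the union of the labeled 2D frames and the pseudo frames. The sketch below illustrates this under the assumption that a standard RPN loss routine (anchor matching plus classification/regression terms) is available; `rpn_loss_fn` and the tensor layouts are illustrative assumptions.

```python
import torch

def compute_first_loss(rpn_outputs, labeled_boxes, pseudo_boxes, rpn_loss_fn):
    """Sketch of the RPN-stage (first) loss.

    rpn_outputs:   objectness scores and box deltas predicted by the shared RPN.
    labeled_boxes: (N, 4) labeled 2D boxes of the first task object.
    pseudo_boxes:  (M, 4) pseudo boxes of the second task object.
    rpn_loss_fn:   a standard RPN loss (anchor matching + cls/reg terms), assumed given.
    """
    # The RPN targets are the union of labeled boxes and pseudo boxes, so the
    # candidate regions of all tasks are supervised on the same sample image.
    target_boxes = torch.cat([labeled_boxes, pseudo_boxes], dim=0)
    return rpn_loss_fn(rpn_outputs, target_boxes)
```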
S1222, calculating a second loss function value of the perception network according to the labeling data of the sample image.
The second loss function value of the sensing network is the second loss function value of the part of the sensing network which needs training. The part of the perception network required to be trained comprises the part of the classification regression network required to be trained, an interested region extraction module, an RPN and a backbone network, wherein the part of the classification regression network required to be trained is determined according to the first task object.
The portion of the perceptual network that requires training refers to the portion of the perceptual network that is determined by the sample image.
The classification regression network may predict a confidence that the candidate 2D frame belongs to each class and a target 2D frame of the target object.
Specifically, after the candidate 2D frame of the target object is obtained by RPN prediction, the feature of the region where the candidate 2D frame is located is cropped from the feature map by the region of interest extraction module, and the feature of the candidate 2D frame is input into the part of the classification regression network to be trained, so as to obtain the confidence that the candidate 2D frame belongs to the object classes in the task corresponding to the first task object. The part requiring training in the classification regression network is determined according to the first task object. Alternatively, the part of the classification regression network that requires training is determined based on the task in which the first task object is located.
Optionally, the classification regression network includes a plurality of third RCNNs, and the portion of the classification regression network requiring training includes the third RCNNs corresponding to the task where the first task object is located.
Illustratively, the awareness network may be as shown in fig. 9. The task in which the first task object in the sample image # 1 (an example of the sample image) is located includes a detection task of the vehicle, and the first task object includes an object to be detected in the detection task of the vehicle. The features of the candidate 2D frames are input into a third RCNN corresponding to the detection task of the vehicle, and further confidence that the candidate 2D frames belong to three categories of cars, trucks and buses and target 2D frames are obtained. For the sample image 1#, the part of the classification regression network required to be trained is the third RCNN corresponding to the detection task of the vehicle.
Optionally, the classification regression network includes a first RCNN, and the portion of the classification regression network that needs to be trained includes a hidden layer in the first RCNN and a sub-classification fc and a sub-regression fc corresponding to a task where the first task object is located.
Illustratively, the awareness network may be as shown in fig. 10. The task where the first task object in the sample image 1# is located comprises a detection task of the vehicle, and the first task object comprises an object to be detected in the detection task of the vehicle. Features of the candidate 2D frames are input into the sub-classification fc and the sub-regression fc corresponding to the detection task of the vehicle after passing through the hidden layer in the first RCNN, so that confidence degrees that the candidate 2D frames belong to three categories of cars, trucks and buses and target 2D frames are obtained. For the sample image 1#, the part to be trained in the classification regression network is the hidden layer in the first RCNN and the sub-classification fc and the sub-regression fc corresponding to the detection task of the vehicle.
The labeling data of the sample image is compared with the output result of the classification regression network to obtain the loss function value, at the classification regression network stage, of the task where the first task object is located, i.e., the second loss function value. That is, the losses of other tasks not involved in the annotation data of the sample image are not calculated.
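A sketch of how the second loss touches only the heads of the tasks that actually have annotations on the sample is given below; the head container, loss routine and field names are illustrative assumptions, not the exact implementation of the embodiment.

```python
def compute_second_loss(roi_features, heads, annotations, rcnn_loss_fn):
    """Sketch of the (second) loss at the classification regression network stage.

    roi_features: features of the candidate 2D frames after the shared hidden layer.
    heads:        dict mapping task name -> (sub_cls_fc, sub_reg_fc) modules.
    annotations:  dict mapping task name -> labeled classes and 2D boxes on this sample.
    Only the tasks present in `annotations` are run, so the heads of the other
    tasks receive no loss (and later no gradient) from this sample.
    """
    losses = []
    for task, ann in annotations.items():
        sub_cls_fc, sub_reg_fc = heads[task]
        cls_logits = sub_cls_fc(roi_features)   # confidences for this task's classes
        box_deltas = sub_reg_fc(roi_features)   # adjustment of the candidate 2D frames
        losses.append(rcnn_loss_fn(cls_logits, box_deltas, ann))
    return sum(losses) / max(len(losses), 1)
```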
S1223, back-propagating based on the first loss function value and the second loss function value, adjusting parameters of the portion of the perceptual network that needs to be trained.
Based on the back propagation of the first loss function value, calculating the gradient of the parameter related to the first loss function value, and further adjusting the parameter related to the first loss function value based on the gradient of the parameter, so as to realize the adjustment of the sensing network, and enable the RPN to more comprehensively predict the candidate frame.
The parameters related to the first loss function value are the parameters in the perception network used in the process of calculating the first loss function value, for example, the parameters of the backbone and the parameters of the RPN. Further, in the case that the perception network comprises an FPN, the parameters related to the first loss function value further comprise the parameters of the FPN.
Based on the back propagation of the second loss function value, calculating the gradient of the parameter related to the second loss function value, and further adjusting the parameter related to the second loss function value based on the gradient of the parameter, so as to realize the adjustment of the perception network, enable the classification regression network to be capable of correcting the output 2D frame better, and improve the accuracy of classification prediction.
The parameters related to the second loss function value are the parameters in the perception network used in the process of calculating the second loss function value, for example, the parameters of the backbone, the parameters of the RPN, the parameters of the region of interest extraction module, and the parameters of the part of the classification regression network that requires training. Further, in the case that the perception network comprises an FPN, the parameters related to the second loss function value further comprise the parameters of the FPN. The parameters related to the second loss function value are the parameters of the part of the perception network that needs to be trained.
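Putting the two stages together, one training iteration can be sketched as follows. Because the heads of un-annotated tasks never appear in the computation graph, back-propagation leaves their parameters untouched. The sketch reuses the illustrative helpers above; the dictionary keys and attribute names are assumptions.

```python
def train_step(perception_network, optimizer, sample, rpn_loss_fn, rcnn_loss_fn):
    """Sketch of one training iteration under the assumptions above."""
    optimizer.zero_grad()
    outputs = perception_network(sample["image"])            # forward propagation
    first_loss = compute_first_loss(outputs["rpn"],
                                    sample["labeled_boxes"],
                                    sample["pseudo_boxes"],
                                    rpn_loss_fn)
    second_loss = compute_second_loss(outputs["roi_features"],
                                      perception_network.heads,
                                      sample["annotations"],
                                      rcnn_loss_fn)
    (first_loss + second_loss).backward()   # gradients reach only parameters used above
    optimizer.step()                        # adjust the part that needs training
```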
And under the condition that the training termination condition is met, terminating training to obtain a trained perception network.
For example, in the case of convergence of the sensing network, the training is terminated, and the weight of the trained sensing network is output.
It should be understood that steps S1221 through S1223 are only one implementation of step S1220, and step S1220 may be implemented in other manners.
Illustratively, step S1220 includes the following steps S1 through S3.
S1, calculating a first loss function value according to the difference between the marked 2D frame of the first task object, the target 2D frame of the second task object and the candidate 2D frame of the target object on the sample image obtained by RPN prediction.
That is, the labeling 2D frame of the first task object and the target 2D frame of the second task object are compared with the candidate 2D frame of the target object predicted by the RPN, so as to obtain a loss function value in the RPN stage, that is, a first loss function value.
Forward propagation of the sensing network is performed based on the sample image, and the candidate 2D frame of the target object on the sample image is obtained by RPN prediction, and the specific forward propagation process is referred to in fig. 8, which is not described herein.
S2, calculating a second loss function value of a part needing training in the perception network according to the labeling data of the sample image, the pseudo frame on the sample image and the pseudo tag of the second task object on the sample image, wherein the part needing training in the perception network comprises the part needing training in the classification regression network, an interested region extraction module, an RPN and a backbone network, and the part needing training in the classification regression network is determined according to the first task object and the second task object. The pseudo tag on the sample image is a class tag of a second task object on the sample image, which is obtained by reasoning the sample image through other perception networks.
The classification regression network may predict a confidence that the candidate 2D frame belongs to each class and a target 2D frame of the target object.
Specifically, after the candidate 2D frame of the target object is obtained by RPN prediction, the feature of the region where the candidate 2D frame is located is cropped from the feature map by the region of interest extraction module, and the feature of the candidate 2D frame is input into the part of the classification regression network to be trained, so as to obtain the confidence that the candidate 2D frame belongs to the object classes in the task where the first task object is located and the confidence that the candidate 2D frame belongs to the object classes in the task where the second task object is located. The part of the classification regression network that requires training is determined based on the first task object and the second task object. Alternatively, the part of the classification regression network that requires training is determined based on the task where the first task object is located and the task where the second task object is located.
Illustratively, the classification regression network includes a plurality of third RCNNs, for example, the perceptive network may be as shown in FIG. 9. The task where the first task object in the sample image 1# is located comprises a detection task of the vehicle, and the first task object comprises an object to be detected in the detection task of the vehicle. The features of the candidate 2D frames are input into a third RCNN corresponding to the detection task of the vehicle, and further confidence that the candidate 2D frames belong to three categories of cars, trucks and buses and target 2D frames are obtained. The task of the second task object in the sample image 1# comprises a detection task of a wheel and a car light, and the second task object comprises an object in the detection task of the wheel and the car light. And inputting the characteristics of the candidate 2D frames into a third RCNN corresponding to the detection tasks of the wheels and the car lights, and further obtaining the confidence that the candidate 2D frames belong to the two categories of the wheels and the car lights and the target 2D frames.
For the sample image 1#, the part to be trained in the classification regression network is the third RCNN corresponding to the detection task of the vehicle and the third RCNN corresponding to the detection tasks of the wheels and the vehicle lamps.
Illustratively, the classification regression network includes a first RCNN, e.g., the perceptive network may be as shown in FIG. 10. The task where the first task object in the sample image 1# is located comprises a detection task of the vehicle, and the first task object comprises an object to be detected in the detection task of the vehicle. Features of the candidate 2D frames are input into the sub-classification fc and the sub-regression fc corresponding to the detection task of the vehicle after passing through the hidden layer in the first RCNN, so that confidence degrees that the candidate 2D frames belong to three categories of cars, trucks and buses and target 2D frames are obtained. The task of the second task object in the sample image 1# comprises a detection task of a wheel and a car light, and the second task object comprises an object in the detection task of the wheel and the car light. Features of the candidate 2D frames are input into the sub-classification fc and the sub-regression fc corresponding to detection tasks of the wheels and the lamps after passing through the hidden layer in the first RCNN, and then confidence degrees that the candidate 2D frames belong to the two categories of the wheels and the lamps and the target 2D frames are obtained.
For sample image 1#, the part of the classification regression network that needs training includes hidden layer in the first RCNN, sub-classification fc and sub-regression fc corresponding to the detection task of the vehicle, and sub-classification fc and sub-regression fc corresponding to the detection task of the vehicle wheels and lights.
The labeling data of the sample image is compared with the output result of the classification regression network to obtain the loss function value of the task corresponding to the first task object and the loss function value of the task corresponding to the second task object at the classification regression network stage, i.e., the second loss function value. That is, the losses of other tasks not involved in the annotation data and the pseudo labels of the sample image are not calculated.
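In this variant the pseudo frames and pseudo labels also supervise the heads of the tasks they belong to. A short sketch, reusing the illustrative names above and making the same assumptions, is given below.

```python
def compute_second_loss_with_pseudo(roi_features, heads,
                                    annotations, pseudo_annotations, rcnn_loss_fn):
    """Sketch of the second loss when pseudo labels also supervise the RCNN heads.

    annotations:        labeled data of the first task object, per task.
    pseudo_annotations: pseudo boxes and pseudo labels of the second task object, per task.
    The heads of both groups of tasks contribute to the loss; all remaining heads
    are untouched by this sample.
    """
    losses = []
    for group in (annotations, pseudo_annotations):
        for task, ann in group.items():
            sub_cls_fc, sub_reg_fc = heads[task]
            losses.append(rcnn_loss_fn(sub_cls_fc(roi_features),
                                       sub_reg_fc(roi_features), ann))
    return sum(losses) / max(len(losses), 1)
```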
And S3, back propagation is carried out based on the first loss function value and the second loss function value, and parameters of a part needing training in the perception network are adjusted.
Based on the back propagation of the first loss function value, calculating the gradient of the parameter related to the first loss function value, and further adjusting the parameter related to the first loss function value based on the gradient of the parameter, so as to realize the adjustment of the sensing network, and enable the RPN to more comprehensively predict the candidate frame.
The parameters related to the first loss function value are the parameters in the perception network used in the process of calculating the first loss function value, for example, the parameters of the backbone and the parameters of the RPN. Further, in the case that the perception network comprises an FPN, the parameters related to the first loss function value further comprise the parameters of the FPN.
Based on the back propagation of the second loss function value, calculating the gradient of the parameter related to the second loss function value, and further adjusting the parameter related to the second loss function value based on the gradient of the parameter, so as to realize the adjustment of the perception network, enable the classification regression network to be capable of correcting the output 2D frame better, and improve the accuracy of classification prediction.
The parameters related to the second loss function value are the parameters in the perception network used in the process of calculating the second loss function value, for example, the parameters of the backbone, the parameters of the RPN, the parameters of the region of interest extraction module, and the parameters of the part of the classification regression network that requires training. Further, in the case that the perception network comprises an FPN, the parameters related to the second loss function value further comprise the parameters of the FPN. The parameters related to the second loss function value are the parameters of the part of the perception network that needs to be trained.
And under the condition that the training termination condition is met, terminating training to obtain a trained perception network.
For example, in the case of convergence of the sensing network, the training is terminated, and the weight of the trained sensing network is output.
According to the scheme in the embodiment of the application, the sensing network is trained based on the pseudo frame and the annotation data together, and the pseudo frame of the second task object is provided under the condition that the annotation data only comprises the annotation data of the first task object, namely, under the condition of partial annotation data, so that a more comprehensive frame of the object to be detected is provided on the same sample image as the target output of the RPN, the parameters of the RPN are adjusted to enable the output of the RPN to be continuously close to the target data, mutual inhibition among different tasks is avoided, the RPN can obtain more comprehensive and accurate candidate 2D frames, and the recall rate is improved. The labeling data of the sample images in the embodiment of the application can be part of labeling data, so that the sample images required by specific tasks can be acquired, objects to be detected of all tasks do not need to be marked in each sample image, the acquisition cost and the labeling cost of the data are reduced, and the training data of different tasks are balanced. In addition, the scheme adopting part of the labeling data has flexible expansibility, and under the condition of adding tasks, only the labeling data of the newly added tasks are needed to be provided, and new objects to be detected do not need to be labeled on the basis of the original training data.
Moreover, according to the scheme in the embodiment of the application, parts shared by different tasks in the perception network, such as a backbone network, an RPN (remote procedure network), an interested region extraction module and the like, participate in training in the training process based on the labeling data of the different tasks, so that the shared parts of the different tasks in the perception network can learn the common characteristics of the tasks. Different parts corresponding to different tasks in the perception network, for example, the parts corresponding to the tasks in the classification regression network, only participate in training in the training process based on the labeling data of the respective tasks, so that the different parts corresponding to the different tasks in the perception network can learn the specific characteristics of the tasks, and the accuracy of the model is improved. Meanwhile, in the training process, the part to be trained in the classification regression network is determined according to the task, and different parts in the classification regression network corresponding to different tasks are not mutually affected in the training process, so that the independence of each task is ensured, and the model has stronger flexibility.
Fig. 13 illustrates a training method for a perception network according to an embodiment of the present application. The method illustrated in fig. 13 may be regarded as a specific implementation of the method illustrated in fig. 12; the related description may be referred to in the description of the method 1200 and is appropriately omitted when describing the method 1300 to avoid unnecessary repetition.
The following describes the scheme of the embodiment of the application in detail by taking an ADAS/ADS visual perception system as an example. Target detection in visual perception systems of ADAS/ADS requires a variety of tasks, such as: dynamic obstacles, static obstacles, traffic signs, traffic lights, road signs (e.g., left turn signs or straight running signs), zebra crossings, and the like.
By adopting the scheme in the embodiment of the application, the target detection of the tasks can be completed in one perception network, and the scheme in the embodiment of the application is described in detail below.
The following describes the training method of the perception network in the embodiment of the present application in detail, taking the task division of table 1 as an example.
Before training is started, training data is prepared, the target object comprises a first task object and a second task object, the training data comprises a sample image, labeling data of the first task object on the sample image and a pseudo frame of the second task object on the sample image, and the labeling data comprises a class label of the first task object and a labeling 2D frame of the first task object.
Labeling data is provided for each task according to the task division of table 1. For example, labeling data of vehicles is provided for the training process of task0, and 2D frames and class labels of Car/Truck/Bus are labeled on one or more sample images in the dataset; labeling data of persons is provided for training task1, and 2D frames and class labels of Pedestrian/Cyclist/Tricycle are labeled on one or more sample images in the dataset; labeling data of wheels and car lights is provided for task2, and 2D frames and class labels of Wheel/Car_light are labeled on one or more sample images in the dataset; labeling data of traffic lights is provided for task3, and 2D frames and class labels of TrafficLight_Red/Yellow/Green/Black are labeled on one or more sample images in the dataset, and so on. Thus, each sample image is provided with the labeling data of at least one task.
In one possible implementation, the sample image includes labeling information for all objects of interest. That is, all objects of interest are marked in each sample image. Illustratively, the object of interest is the object to be detected in 8 broad categories in table 1.
In another possible implementation, each annotation data only requires the annotation of a specific type of object. That is, the annotation data for each sample image may be part of the annotation data.
Illustratively, each sample image is labeled with class labels and 2D boxes of objects to be detected in only one task.
Alternatively, each sample image may be labeled with class labels and 2D boxes of the objects to be detected in multiple tasks, i.e., data with hybrid labels is provided. For example, 2D frames and class labels of Car/Truck/Bus/Pedestrian/Cyclist/Tricycle are labeled on the sample image at the same time. In this way, the training data can be used to train the parts of the perception network, corresponding to the two tasks, that need to be trained.
For example, a task tag may be assigned to each sample image, and the task tag may be used to indicate which part of the perception network the sample image is used to train.
The labeling data of the sample image can be obtained in the above manner. For example, the annotation data can be stored in an annotation file. The annotation file is a ground truth file.
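For concreteness, a partially labeled sample could be organized as a record like the one below. The field names, file paths and layout are assumptions for illustration and are not the format used by the embodiment.

```python
# Sketch of one partially labeled training record (illustrative field names and values).
sample_record = {
    "image": "images/000123.jpg",
    "tasks": ["task0"],                      # task tag: which heads this sample trains
    "annotations": {                         # ground truth file: labeled 2D boxes + class labels
        "task0": [
            {"class": "Car",   "box": [120, 80, 260, 200]},
            {"class": "Truck", "box": [300, 60, 520, 240]},
        ],
    },
    "pseudo_bboxes": {                       # inference result file, confidence >= 0.05
        "task2": [
            {"class": "Wheel", "box": [130, 170, 170, 205], "confidence": 0.62},
        ],
    },
}
```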
The sample image is inferred through other perception networks to obtain an inference result. The inference result includes the Pseudo Bboxes on the sample image. The Pseudo Bboxes can be used to supplement the objects to be detected that belong to other tasks and are not labeled in the labeling data of the sample image. The inference results may be stored, for example, in an inference result file. The inference result file is a Pseudo Bboxes file.
Each sample image may correspond to an annotation file and an inference result file. In one possible implementation, after the labeled 2D frames in the labeling data of the sample image are combined with the Pseudo Bboxes, the 2D frames of the objects to be detected in all tasks on the sample image can be obtained.
For example, the sample image is inferred by using a multi-head multi-task perception network, and an inference result is obtained.
For another example, the sample images are respectively inferred by utilizing the perception networks of a plurality of single tasks to obtain the inference results of the plurality of tasks, and the inference results of the plurality of tasks are fused together.
Further, the reasoning result also comprises the confidence of the category to which the second task object belongs on the sample image. And filtering the reasoning result by adopting a low threshold value. I.e. filtering out the inference results with confidence below the first threshold. Confidence levels corresponding to the Pseudo Bboxes for training are all greater than or equal to a first threshold. For example, the first threshold is 0.05.
The perception network is trained based on the partial annotation data and the Pseudo Bboxes. Specifically, the method 1300 includes steps S1310 to S1360.
S1310, obtaining training data.
Training data is input into the perception network, the training data including a sample image, annotation data of a first task object on the sample image, and a pseudo-frame of a second task object on the sample image.
For example, the sample image, the annotation file corresponding to the sample image, and the Pseudo Bboxes file are input into the perception network.
Step S1310 corresponds to step S1210 in the method 1200, and details of step S1210 are described in detail.
Forward propagation of the perceptual network is performed based on the training data.
Illustratively, according to the task division in table 1, the structure of the perception network employed in the training process is shown in fig. 14. As shown in fig. 14, the perception network includes: a backbone, an RPN, a region of interest extraction module and a first RCNN. The perception network shown in fig. 14 may be regarded as a specific implementation of the perception network shown in fig. 10. The perception network in fig. 14 is capable of completing target detection for the 8 broad classes in table 1 simultaneously. Alternatively, the perception network of fig. 14 is capable of performing target detection for the 8 tasks in table 1 simultaneously. Specifically, the 8 sub-classifications fc and sub-regressions fc in the first RCNN in fig. 14 complete the 2D target detection of the 8 broad classes in table 1 at the same time. As can be seen from fig. 14, the perception network of the present application can flexibly add or remove sub-classifications fc and sub-regressions fc in the first RCNN according to the service requirement, so as to train perception networks capable of achieving target detection for different numbers of tasks.
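The first RCNN of fig. 14 can be pictured as a shared hidden layer followed by one pair of fully connected heads per task. The sketch below is a minimal illustration under the assumption that the hidden layer is a single fc layer and that box regression uses 4 outputs per class; the class counts, module names and dimensions are assumptions.

```python
import torch.nn as nn

class FirstRCNN(nn.Module):
    """Sketch: shared hidden layer + one (sub-classification fc, sub-regression fc) pair per task."""

    def __init__(self, in_dim, hidden_dim, classes_per_task):
        super().__init__()
        # Shared hidden layer (the real network may use fc and/or conv layers here).
        self.hidden = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        # e.g. classes_per_task = {"task0": 3, "task1": 3, "task2": 2, ...}
        self.sub_cls_fc = nn.ModuleDict(
            {t: nn.Linear(hidden_dim, n) for t, n in classes_per_task.items()})
        self.sub_reg_fc = nn.ModuleDict(
            {t: nn.Linear(hidden_dim, 4 * n) for t, n in classes_per_task.items()})

    def forward(self, roi_features, tasks):
        h = self.hidden(roi_features)
        # Only the requested tasks are run; heads can be added or removed freely.
        return {t: (self.sub_cls_fc[t](h), self.sub_reg_fc[t](h)) for t in tasks}
```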
S1320, calculating the loss of the RPN stage by using the annotation data and the Pseudo Bboxes.
The annotation data of the sample image comprises an annotation 2D frame and a class label of the first task object. The Pseudo Bboxes on the sample image comprise Pseudo Bboxes of the second task object.
Step S1320 includes: the loss of the RPN stage, i.e. the first loss function value, is calculated using the labeled 2D boxes of the first task object and the Pseudo Bboxes of the second task object.
For example, the Pseudo Bboxes with confidence level greater than or equal to 0.05 in the Pseudo Bboxes file are combined with the labeling 2D frames in the labeling data to obtain 2D frames of all target objects on the sample image. And comparing the 2D frames of all the target objects with the candidate 2D frames obtained by the RPN prediction to obtain a loss function value of the RPN stage, namely a first loss function value.
Step S1320 corresponds to step S1221 of method 1200, and is described in detail in step S1221.
S1330, calculating the loss of the classified regression network stage by using the labeling data.
The sample image may be classified into one or more tasks according to the type of data labeled on it, or, in other words, according to the task corresponding to its first task object. For example, if only traffic signs are labeled in a sample image, the sample image belongs only to the traffic sign task; if a sample image is labeled with both persons and vehicles, the sample image belongs to both the person task and the vehicle task. When the loss of the classification regression network stage is calculated, only the loss of the part corresponding to the tasks to which the current sample image belongs is calculated, and the losses of the remaining tasks are not calculated. For example, when the currently input sample image belongs to the person or vehicle task, only the loss of the part corresponding to persons or vehicles is calculated, and the losses of the parts corresponding to the remaining tasks (such as traffic lights and traffic signs) are not calculated.
For example, as shown in fig. 14, the region of interest extraction module crops features from a feature map according to the candidate 2D frames predicted by the RPN; the features then pass through the shared fc and shared conv and enter the sub-classification fc and sub-regression fc corresponding to the task to which the sample image belongs, so as to obtain the prediction result, that is, the confidence that the candidate 2D frame belongs to the object classes in that task, and the target 2D frame. The labeled data is compared with the prediction result to obtain the loss, i.e., the loss at the classification regression network stage corresponding to that task.
If the annotation data of the current sample image only includes the annotation data of one task, when the sample image is input into the network for training, for a plurality of sub-classifications fc and sub-regressions fc in the first RCNN, only the sub-classifications fc and sub-regressions fc corresponding to the task in the first RCNN are trained, and the sub-classifications fc and sub-regressions fc corresponding to other tasks in the first RCNN are not affected.
For example, as shown in fig. 14, if the current sample image is only labeled with 2D frames of traffic lights, and as shown in table 1 the task of traffic lights is task 3, then during training the prediction result of the traffic lights in the sample image is obtained only through the sub-classification fc and sub-regression fc corresponding to task 3, and is compared with the true value to obtain the loss value. That is, the sample image of the traffic light passes through only the backbone, the RPN, the region of interest extraction module, and the sub-classification fc and sub-regression fc corresponding to the traffic lights in the first RCNN, and the sub-classifications fc and sub-regressions fc corresponding to the other tasks do not participate in the calculation of the loss value.
If the labeling data of the current sample image includes labeling data of a plurality of tasks, when the sample image is input into the network for training, for a plurality of sub-classifications fc and sub-regressions fc in the first RCNN, only the sub-classifications fc and sub-regressions fc corresponding to the plurality of tasks in the first RCNN are trained, and the sub-classifications fc and sub-regressions fc corresponding to other tasks in the first RCNN are not affected.
For example, as shown in fig. 14, if the current sample image is labeled with 2D frames of traffic lights and 2D frames of persons, then, as shown in table 1, the task of traffic lights is task3 and the task of persons is task1. During training, the prediction result of the traffic lights in the sample image is obtained through the sub-classification fc and sub-regression fc corresponding to task3, the prediction result of the persons in the sample image is obtained through the sub-classification fc and sub-regression fc corresponding to task1, and the prediction results are compared with the true values to obtain the loss values corresponding to the two tasks. That is, the sample image passes through only the backbone, the RPN, the region of interest extraction module, the sub-classification fc and sub-regression fc corresponding to task3 in the first RCNN, and the sub-classification fc and sub-regression fc corresponding to task1 in the first RCNN. The sub-classifications fc and sub-regressions fc corresponding to the other tasks do not participate in the calculation of the loss value. In this way, the loss values of the two tasks at the classification regression stage can be obtained, and the loss value of the classification regression stage can be the average of the multiple loss values.
S1340, gradient back.
After the loss is calculated, a gradient back-pass, i.e., back-propagation, is required.
And calculating the gradient of the related parameters based on the loss (first loss function value) of the RPN stage and the loss (second loss function value) of the classification regression network to carry out gradient back transmission.
Gradient back-propagation is performed on the part of the perception network that needs training; this part is determined according to the tasks to which the sample image belongs, and the parts not corresponding to those tasks do not participate in the gradient back-propagation.
For example, as shown in fig. 14, the gradient is returned along the sub-classification fc and sub-regression fc corresponding to the task to which the sample image belongs, without affecting the sub-classifications fc and sub-regressions fc corresponding to other tasks, while the shared fc or conv of the first RCNN, the RPN and the backbone all participate in the gradient return.
S1350, adjusting parameters of the sensing network.
And updating the weight parameters of the part to be trained in the perception network by using the returned gradient.
Therefore, the part corresponding to the task to which the sample image belongs in the perception network can be adjusted in a targeted manner, so that the part corresponding to the task to which the sample image belongs can learn the task to which the sample image belongs better.
S1360, judging whether the sensing network is converged.
And if the sensing network converges, outputting the weight parameters of the sensing network.
If the sensing network has not converged, the process goes to step S1310 to continue the training process.
The labeling data of the sample image in the embodiment of the application can be part of the labeling data, so that the sample image required by specific tasks can be acquired, all interested objects are not required to be labeled in each picture, and the acquisition cost and the labeling cost of the data are reduced. In addition, the mode of preparing the training data by adopting the scheme of partially labeling the data has flexible expansibility, and under the condition of adding the detection task, only the part corresponding to the detection task is needed to be added in the classification regression network, for example, the sub-classification fc and the sub-regression fc corresponding to the detection task are added, and the sample image with the labeling data of the newly added object is provided, so that the newly added object to be detected does not need to be labeled on the basis of the original training data.
Moreover, the pseudo frames are used to supplement the unlabeled objects to be detected in the sample image, so that when the RPN is trained based on partial labeling data, the mutual inhibition between the partial labeling data of different tasks, which would affect the training of the RPN, is avoided, and the RPN can predict the regions where the objects to be detected in all tasks to be detected are located.
In addition, the part of the perception network corresponding to each task only detects the objects to be detected in that task, so in the training process, accidental suppression of the unlabeled objects of other tasks can be avoided. Furthermore, the shared part in the perception network, such as the backbone, the RPN and the region of interest extraction module, learns the common features of the respective tasks, while the parts corresponding to the respective tasks in the classification regression network learn their task-specific features; for example, the sub-classification fc and sub-regression fc corresponding to each task in the first RCNN learn the features specific to that task.
The embodiment of the application also provides an object recognition method 1500, and the method 1500 can be executed by an object recognition device. The object recognition device may be a cloud service device, a terminal device, for example, a device having a sufficient computing capability to execute an object recognition method, such as a vehicle, an unmanned aerial vehicle, a robot, a computer, a server, or a mobile phone, or a system composed of the cloud service device and the terminal device. Illustratively, the method 1500 may be performed by the executing device 110 in fig. 3, the neural network processor 50 in fig. 5, or the executing device 310 or the local device in fig. 6.
For example, the object recognition method may be specifically performed by the performing device 110 as shown in fig. 3.
Alternatively, the object recognition method may be processed by a GPU, or may be processed by a CPU and the GPU together, or may not use the GPU, and other processors suitable for neural network computation may be used, which is not limited by the present application.
The image is processed by using the sensing network in the embodiment of the present application in the method 1500, and in order to avoid unnecessary repetition, the repetitive description is omitted when describing the method 1500.
The method 1500 includes steps S1510 to S1540, and steps S1510 to S1540 are described below.
The sensing network comprises a backbone network, an RPN, a region of interest extraction module and a classification regression network.
S1510, convolving the input image by using the backbone network, and outputting a first feature map of the input image.
The input image may be an image captured by a terminal device (or other apparatus or device such as a computer, a server, or the like) through a camera, or the input image may be an image obtained from inside the terminal device (or other apparatus or device such as a computer, a server, or the like) (for example, an image stored in an album of the terminal device, or an image obtained from a cloud end by the terminal device), which is not limited by the embodiment of the present application.
S1520, outputting position information of a candidate two-dimensional 2D frame of the target object on the second feature map by using the RPN, wherein the target object comprises an object to be detected in a plurality of tasks, each task in the plurality of tasks at least comprises a category, and the second feature map is determined according to the first feature map.
S1530, extracting first feature information on a third feature map based on the position information of the candidate 2D frame by using the region of interest extraction module, wherein the first feature information is the feature of the region where the candidate 2D frame is located, and the third feature map is determined according to the first feature map.
S1540, processing the first characteristic information by using a classification regression network to obtain target 2D frames of the target object and first indication information, wherein the number of the target 2D frames is smaller than or equal to that of the candidate 2D frames, and the first indication information is used for indicating the category to which the target object belongs.
Optionally, processing the first feature information by using a classification regression network to obtain a target 2D frame of the target object and first indication information, including: processing the first characteristic information by using a classification regression network to obtain the confidence that the candidate 2D frame belongs to each category in a plurality of tasks; adjusting the position information of the candidate 2D frames by using a classification regression network to obtain adjusted candidate 2D frames; determining a target 2D frame according to the adjusted candidate 2D frame; and determining first indication information according to the confidence that the target 2D frame belongs to each category.
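For illustration, a sketch of this post-processing is given below, assuming a standard NMS routine such as torchvision.ops.nms is available; the score threshold, IoU threshold and function names are illustrative assumptions.

```python
import torch
from torchvision.ops import nms

def postprocess(adjusted_boxes, class_confidences, score_thresh=0.5, iou_thresh=0.5):
    """adjusted_boxes:     (N, 4) candidate 2D boxes after regression adjustment.
    class_confidences:     (N, C) per-class confidences from the classification branch.
    Returns the target 2D boxes and, for each, the class it belongs to
    (the first indication information)."""
    scores, labels = class_confidences.max(dim=1)     # best class per candidate box
    keep = scores >= score_thresh                     # drop low-confidence candidates
    boxes, scores, labels = adjusted_boxes[keep], scores[keep], labels[keep]
    keep = nms(boxes, scores, iou_thresh)             # merge overlapping boxes
    return boxes[keep], labels[keep]                  # fewer target boxes than candidates
```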
Optionally, the classification regression network includes a first regional convolutional neural network RCNN, where the first RCNN includes a hidden layer, a plurality of sub-classification full-connection layers, and a plurality of sub-regression full-connection layers, where the hidden layer is connected with the plurality of sub-classification full-connection layers, the hidden layer is connected with the plurality of sub-regression full-connection layers, the plurality of sub-classification full-connection layers are in one-to-one correspondence with the plurality of tasks, and the plurality of sub-regression full-connection layers are in one-to-one correspondence with the plurality of tasks; and processing the first feature information by using a classification regression network, outputting a target 2D frame of the target object and first indication information, including: processing the first characteristic information by using the hidden layer to obtain second characteristic information; obtaining the confidence coefficient of the object category of the candidate 2D frame belonging to the task corresponding to the sub-classification full-connection layer according to the second characteristic information by using the sub-classification full-connection layer; and adjusting the position information of the candidate 2D frames according to the second characteristic information by utilizing the sub-regression full-connection layer to obtain the adjusted candidate 2D frames.
Optionally, the classified regression network includes a second RCNN, where the second RCNN includes a hidden layer, a classified full-connection layer, and a regressive full-connection layer, where the hidden layer is connected with the classified full-connection layer, and where the hidden layer is connected with the regressive full-connection layer; and processing the first feature information by using a classification regression network, outputting a target 2D frame of the target object and first indication information, including: processing the first characteristic information by using the hidden layer to obtain third characteristic information; obtaining the confidence coefficient of the candidate 2D frames belonging to each category according to the third characteristic information by using the classification full-connection layer; and adjusting the position information of the candidate 2D frames according to the third characteristic information by using the regression full connection layer to obtain the adjusted candidate 2D frames.
Optionally, the classified full-connection layer is obtained by combining a plurality of sub-classified full-connection layers in the first RCNN, the regression full-connection layer is obtained by combining a plurality of sub-regressive full-connection layers in the first RCNN, the first RCNN comprises a hidden layer, a plurality of sub-classified full-connection layers and a plurality of sub-regressive full-connection layers, the hidden layer is connected with the plurality of sub-classified full-connection layers, the hidden layer is connected with the plurality of sub-regressive full-connection layers, the plurality of sub-classified full-connection layers are in one-to-one correspondence with a plurality of tasks, and the plurality of sub-regressive full-connection layers are in one-to-one correspondence with the plurality of tasks; the sub-classification full-connection layer is used for obtaining the confidence coefficient of the object category of the candidate 2D frame belonging to the task corresponding to the sub-classification full-connection layer according to the third characteristic information; and the sub-regression full-connection layer is used for adjusting the position information of the candidate 2D frames according to the third characteristic information to obtain the adjusted candidate 2D frames.
Fig. 16 illustrates a process flow of the object recognition method according to the embodiment of the present application, where the process flow in fig. 16 may be regarded as a specific implementation of the method shown in fig. 15, the method in fig. 16 may be performed using the sensing network shown in fig. 8, and the description thereof may be referred to the description in the sensing network 800, so that in order to avoid unnecessary repetition, the repeated description is omitted when describing the method 1600.
The following describes the scheme of the embodiment of the application in detail by taking an ADAS/ADS visual perception system as an example.
According to the task division manner in table 1, the structure of the perception network adopted in the embodiment of the present application is shown in fig. 17. As shown in fig. 17, the perception network includes: a backbone, an RPN, a region of interest extraction module and a second RCNN. The perception network shown in fig. 17 may be viewed as a specific implementation of the perception network shown in fig. 11. The perception network in fig. 17 is capable of completing target detection for the 8 broad classes in table 1 simultaneously. Alternatively, the perception network of fig. 17 is capable of performing target detection for the 8 tasks in table 1 simultaneously. The perception network shown in fig. 17 may be determined from the perception network shown in fig. 14. For example, as shown in fig. 18, the classification fc in the second RCNN is obtained by combining the plurality of sub-classifications fc in the first RCNN, and the regression fc in the second RCNN is obtained by combining the sub-regressions fc in the first RCNN. As can be seen from fig. 18, the perception network of the present application can flexibly add or remove sub-classifications fc and sub-regressions fc in the first RCNN according to the service requirement, so as to implement target detection for different numbers of tasks.
Specifically, method 1600 includes steps S1610 to S1650.
S1610, inputting an image to be processed.
S1620, generating a basic feature.
Illustratively, step S1620 may be performed by the backbone in fig. 17.
Specifically, the input image is convolved by the backbone to generate a plurality of feature maps with different scales, namely a first feature map.
Illustratively, the backbone may adopt various forms of convolutional networks, such as VGG16, ResNet50, or Inception-Net, among others.
Further, in the case where the sensing network further includes the FPN, step S1620 may further include: carrying out feature fusion based on the first feature map and outputting the fused feature map.
The feature map output by the backbone network or the FPN may be provided as a base feature to the subsequent modules.
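As an illustration of step S1620 only, the following minimal PyTorch sketch builds a toy three-stage backbone and a simplified top-down fusion in the spirit of an FPN; the class names, channel numbers and strides are assumptions made for this example and are not taken from the embodiment or from a specific backbone such as ResNet50.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBackbone(nn.Module):
    # Produces feature maps at 1/4, 1/8 and 1/16 of the input resolution (the first feature map).
    def __init__(self, channels=256):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, channels, 3, stride=4, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        c1 = self.stage1(x)
        c2 = self.stage2(c1)
        c3 = self.stage3(c2)
        return [c1, c2, c3]

class ToyFPN(nn.Module):
    # Top-down fusion: upsample the coarser map and add it to the finer one.
    def forward(self, feats):
        c1, c2, c3 = feats
        p3 = c3
        p2 = c2 + F.interpolate(p3, size=c2.shape[-2:], mode="nearest")
        p1 = c1 + F.interpolate(p2, size=c1.shape[-2:], mode="nearest")
        return [p1, p2, p3]  # fused maps provided to the RPN and the region of interest extraction module

image = torch.randn(1, 3, 720, 1280)  # a 720p input image
base_features = ToyFPN()(ToyBackbone()(image))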
S1630, predicting the candidate 2D frame.
Illustratively, step S1630 may be performed by the RPN in fig. 17.
The RPN predicts the region where the target object is located on the second feature map and outputs candidate 2D frames matching the region where the target object is located, where the target object includes the objects to be detected in the plurality of tasks. The second feature map may be a feature map output by the backbone network or the FPN.
Specifically, the RPN predicts regions where the target object may exist based on the feature map provided by the backbone or the FPN, and outputs the coordinates of the candidate boxes or candidate regions (proposals) of these regions. In the embodiment of the present application, the RPN can predict the candidate frames of the objects to be detected that may exist in the 8 major classes in Table 1.
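As an illustration of step S1630 only, the following minimal PyTorch sketch shows a class-agnostic RPN head shared by all tasks: it outputs an objectness score and box adjustments for every anchor without distinguishing which of the 8 tasks an object belongs to. The anchor count, the top-k value and the feature-map size are assumptions made for this example; anchor generation and box decoding are omitted.

import torch
import torch.nn as nn

class ToyRPNHead(nn.Module):
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, 1)      # object vs. background, class-agnostic
        self.box_deltas = nn.Conv2d(in_channels, num_anchors * 4, 1)  # per-anchor (dx, dy, dw, dh)

    def forward(self, feature_map):
        t = torch.relu(self.conv(feature_map))
        return self.objectness(t), self.box_deltas(t)

head = ToyRPNHead()
scores, deltas = head(torch.randn(1, 256, 45, 80))            # one level of the second feature map
top_anchor_idx = scores.flatten().topk(1000).indices          # keep the most confident anchors as proposals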
S1640, extracting features of the candidate 2D frames.
Illustratively, step S1640 may be performed by the region of interest extraction module in fig. 17.
The region of interest extraction module extracts the features of the region where each candidate 2D frame is located on the third feature map. The third feature map may be a feature map provided by the backbone or the FPN.
Specifically, the region of interest extraction module extracts, according to the coordinates of each proposal provided by the RPN, the feature of the region where the proposal is located on a feature map provided by the backbone or the FPN, and extracts the feature of each proposal at a fixed size.
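As an illustration of step S1640 only, the following minimal sketch uses torchvision's roi_align to pool the feature of each proposal to a fixed spatial size; the 7x7 output size, the 1/8 feature-map scale and the example proposal coordinates are assumptions made for this example.

import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 90, 160)                # e.g. a 1/8-resolution feature map
proposals = torch.tensor([[0., 100., 200., 300., 400.],   # (batch index, x1, y1, x2, y2) in image coordinates
                          [0.,  50.,  60., 120., 180.]])
roi_features = roi_align(feature_map, proposals,
                         output_size=(7, 7),
                         spatial_scale=1.0 / 8,            # maps image coordinates onto the feature map
                         sampling_ratio=2)
print(roi_features.shape)                                  # torch.Size([2, 256, 7, 7]), one fixed-size feature per proposal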
S1650, correcting and classifying the candidate 2D frames.
Illustratively, step S1650 may be performed by the second RCNN in fig. 17.
Specifically, the hidden layer in the second RCNN, for example a shared fc/conv, further extracts features from the features of each proposal extracted by the region of interest extraction module and sends them to the cls fc and the reg fc. The cls fc classifies each proposal to obtain the confidence that the proposal belongs to each category, and the reg fc adjusts the coordinates of the 2D frame of the proposal to obtain tighter 2D frame coordinates. A box merging operation, for example an NMS operation, is then performed to merge the adjusted 2D frames, and the target 2D frames and the classification result are output. The classification result may be used as the first indication information.
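As an illustration of step S1650 only, the following minimal PyTorch sketch of the merged second RCNN uses a shared hidden layer followed by one classification fc and one regression fc, sigmoid confidences, and an NMS call to merge the adjusted 2D frames. The hidden feature length of 1024 and the 31 categories follow the dimension example given later in this section; the box adjustment is deliberately simplified and the NMS threshold is an assumption.

import torch
import torch.nn as nn
from torchvision.ops import nms

class ToySecondRCNN(nn.Module):
    def __init__(self, in_dim=256 * 7 * 7, hidden=1024, num_classes=31):
        super().__init__()
        self.hidden = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, hidden), nn.ReLU())  # shared fc
        self.cls_fc = nn.Linear(hidden, num_classes)       # merged classification fc
        self.reg_fc = nn.Linear(hidden, num_classes * 4)   # merged regression fc

    def forward(self, roi_features, boxes):
        h = self.hidden(roi_features)
        conf = torch.sigmoid(self.cls_fc(h))                       # confidence of each proposal for each category
        deltas = self.reg_fc(h).view(-1, conf.shape[1], 4)         # per-category box adjustments
        best = conf.argmax(dim=1)                                  # simplified: adjust with the best category's deltas
        adjusted = boxes + deltas[torch.arange(len(best)), best]
        keep = nms(adjusted, conf.max(dim=1).values, iou_threshold=0.5)
        return adjusted[keep], conf[keep]                          # target 2D frames and classification result

roi_features = torch.randn(2, 256, 7, 7)                           # pooled proposal features from step S1640
proposal_boxes = torch.tensor([[100., 200., 300., 400.],
                               [ 50.,  60., 120., 180.]])          # (x1, y1, x2, y2) of the proposals
target_boxes, confidences = ToySecondRCNN()(roi_features, proposal_boxes)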
The weights of the classification fc in the second RCNN in fig. 17 are obtained by combining the weights of the sub-classification fc of the first RCNN in fig. 16, and the weights of the regression fc in the second RCNN in fig. 17 are obtained by combining the weights of the plurality of sub-regression fc of the first RCNN in fig. 16. In the training process, the first RCNN in fig. 16 normalizes the label logits obtained by each sub-classification fc with the sigmoid function to obtain the confidence of each class, which is equivalent to performing binary classification for each class; the confidence of the current class is independent of the other classes, so the sub-classification fc of all tasks of the model can be combined into one classification fc during inference. The sub-regression fc can likewise be combined into one regression fc.
For example, suppose the candidate 2D frame is a rectangular frame whose position information is represented by 4 values, the length of the feature output by the hidden layer is 1024, and the number of categories in each task is n. Then the weight of the sub-regression fc in each task is a tensor of 1024×4n, and the weight of the sub-classification fc in each task is a tensor of 1024×n. In Table 1 there are 8 tasks and 31 object categories in total, so the weight of the regression fc formed after merging is a 1024×124 tensor and the weight of the classification fc is a 1024×31 tensor. That is, the second RCNN obtained after merging includes only one classification fc and one regression fc, and their input and output dimensions are consistent with the tensor shape of the merged weights: the input of the classification fc and the regression fc has length 1024, the output of the classification fc has length 31, and the output of the regression fc has length 124.
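The merge can be illustrated with the following minimal PyTorch sketch: because every sub-classification fc is followed by its own per-class sigmoid, the per-task sub-classification fc (and likewise the sub-regression fc) can be concatenated along the output dimension into a single fc at inference time without changing any confidence value. The per-task category counts used below (summing to 31) are an illustrative assumption; note that PyTorch stores a Linear weight as (out, in), i.e. 31x1024 here, the transpose of the 1024x31 convention used in the text.

import torch
import torch.nn as nn

hidden = 1024
cats_per_task = [4, 5, 3, 6, 2, 4, 3, 4]                        # 8 tasks, 31 categories in total (assumed split)
sub_cls = [nn.Linear(hidden, n) for n in cats_per_task]         # per-task sub-classification fc
sub_reg = [nn.Linear(hidden, 4 * n) for n in cats_per_task]     # per-task sub-regression fc

merged_cls = nn.Linear(hidden, sum(cats_per_task))              # classification fc, output length 31
merged_cls.weight.data = torch.cat([m.weight.data for m in sub_cls], dim=0)
merged_cls.bias.data = torch.cat([m.bias.data for m in sub_cls], dim=0)

merged_reg = nn.Linear(hidden, 4 * sum(cats_per_task))          # regression fc, output length 124
merged_reg.weight.data = torch.cat([m.weight.data for m in sub_reg], dim=0)
merged_reg.bias.data = torch.cat([m.bias.data for m in sub_reg], dim=0)

feat = torch.randn(1, hidden)                                   # feature output by the hidden layer
per_task = torch.cat([torch.sigmoid(m(feat)) for m in sub_cls], dim=1)
assert torch.allclose(torch.sigmoid(merged_cls(feat)), per_task, atol=1e-6)  # identical confidences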
Table 2 shows statistics of the parameter quantity and calculation amount required to implement the 8 tasks, with an input image size of 720×1280 (@720p), for the single-head end multi-task network in the embodiment of the present application and an existing multi-head end multi-task network.
TABLE 2
8 Task-Model@720p GFlops Parameters(M)
Multi-head multi-task network (8 task) 413.96 142.76
Single-head end multitasking network (8 task) 139.61 41.29
As shown in Table 2, if target detection for the 8 tasks in the embodiment of the present application is implemented with a multi-head end multi-task network, the total calculation amount required is 413.96 GFLOPs and the number of network parameters is 142.76 M; such a large calculation amount and parameter quantity place great stress on the hardware. With the single-head end multi-task network provided by the embodiment of the present application, the calculation amount can be reduced by more than 60% and the parameter quantity by 71%, which greatly reduces calculation consumption and relieves the hardware pressure.
Table 3 shows a comparison of the inference time consumption of the single-head end multi-task network in the embodiment of the present application and an existing multi-head end multi-task network.
TABLE 3
8 Task-Model@720p 720p latency(ms) 1080p latency(ms)
Multi-head multi-task network (8 task) 28 40
Single-head end multitasking network (8 task) 23 31
As shown in Table 3, compared with the multi-head end multi-task network, the time consumption (latency) of the single-head end multi-task network in the embodiment of the present application on images with resolutions of 720p and 1080p is reduced by 17% and 22%, respectively. This significantly improves processing efficiency and facilitates deployment in scenarios with high real-time requirements.
In addition, the single-head end multi-task network in the embodiment of the present application can achieve detection performance comparable to that of the multi-head end multi-task network. Table 4 shows a performance comparison of the single-head end multi-task network and the multi-head end multi-task network on some categories.
TABLE 4
Category Multi-head end multi-task network (AP) Single-head end multi-task network (AP)
Pedestrian 75.66 72.76
Cyclist 84.77 81.92
Car 96.09 97.56
Truck 88.18 90.21
Tram 88.48 94.62
TrafficCone 83.11 87.65
TrafficStick 73.31 86.68
FireHydrant 63.3 77.86
TrafficLight_Red 96.51 95.71
TrafficLight_Yellow 98.66 96.49
TrafficLight_Green 95.85 94.21
TrafficSign 83.87 86.14
GuideSign 56.57 59.96
As shown in Table 4, the average precision (AP) of the single-head end multi-task network in the embodiment of the present application differs little from that of the existing multi-head end multi-task network, i.e., the performance of the two is equivalent. Therefore, the single-head end multi-task network in the embodiment of the present application can save calculation amount and video memory on the premise of ensuring model performance.
An apparatus according to an embodiment of the present application will be described with reference to fig. 19 to 20. It should be understood that the apparatus described below is capable of performing the method of the foregoing embodiments of the present application, and in order to avoid unnecessary repetition, the repeated description is appropriately omitted when describing the apparatus of the embodiments of the present application.
Fig. 19 is a schematic block diagram of an apparatus of an embodiment of the application. The apparatus 4000 shown in fig. 19 includes an acquisition unit 4010 and a processing unit 4020.
In one implementation, the apparatus 4000 may be used as a training apparatus for a perception network, and the acquisition unit 4010 and the processing unit 4020 may be used to perform the training method of the perception network according to the embodiments of the present application, for example, the method 1200 or the method 1300.
Specifically, the perception network comprises a candidate region generation network RPN, the RPN being used for predicting position information of a candidate two-dimensional 2D frame of a target object in the sample image, the target object comprising objects to be detected of a plurality of tasks, each task of the plurality of tasks comprising at least one category, the target object comprising a first task object and a second task object.
The obtaining unit 4010 is configured to obtain training data, where the training data includes a sample image, labeling data of a first task object on the sample image, and a pseudo frame of a second task object on the sample image; the labeling data includes a class label of the first task object and a labeling 2D frame of the first task object, and the pseudo frame of the second task object is a target 2D frame of the second task object obtained by performing inference on the sample image using another perception network.
The processing unit 4020 is configured to train the sensing network based on the training data.
Optionally, as an embodiment, the sensing network further includes a backbone network, a region of interest extraction module, and a classification regression network, and the processing unit 4020 is specifically configured to: calculate a first loss function value according to the difference between the labeling 2D frame of the first task object together with the target 2D frame (pseudo frame) of the second task object, on the one hand, and the candidate 2D frames of the target object in the sample image predicted by the RPN, on the other hand; calculate a second loss function value of the sensing network according to the labeling data; and back-propagate the first loss function value and the second loss function value to adjust parameters of the part of the perception network that needs to be trained, where the part of the perception network that needs to be trained includes the part of the classification regression network that needs to be trained, the region of interest extraction module, the RPN and the backbone network, and the part of the classification regression network that needs to be trained is determined according to the first task object.
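As an illustration of this selective update only, the following minimal PyTorch sketch assumes one sub-classification fc and one sub-regression fc per task: the sub-heads of tasks that have no labels in the current sample are frozen, a toy classification loss is computed only for the labelled task, and the two loss values are back-propagated together (the first, RPN-side loss is represented by a placeholder). The loss functions, targets and category count are assumptions made for this example.

import torch
import torch.nn as nn
import torch.nn.functional as F

num_tasks, hidden, cats = 8, 1024, 4
sub_cls = nn.ModuleList([nn.Linear(hidden, cats) for _ in range(num_tasks)])
sub_reg = nn.ModuleList([nn.Linear(hidden, 4 * cats) for _ in range(num_tasks)])

labelled_task = 2                                    # the task to which the first task object belongs
for t in range(num_tasks):                           # freeze the sub-heads of every other task
    for p in list(sub_cls[t].parameters()) + list(sub_reg[t].parameters()):
        p.requires_grad = (t == labelled_task)

feat = torch.randn(16, hidden)                       # hidden-layer features of 16 proposals
targets = torch.randint(0, 2, (16, cats)).float()    # toy per-category labels
second_loss = F.binary_cross_entropy_with_logits(sub_cls[labelled_task](feat), targets)
first_loss = torch.zeros((), requires_grad=True)     # placeholder for the RPN loss (labeled + pseudo frames)
(first_loss + second_loss).backward()                # back-propagate both loss function values

assert sub_cls[labelled_task].weight.grad is not None   # the labelled task's sub-head is updated
assert sub_cls[0].weight.grad is None                   # the other tasks' sub-heads are left untouched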
Optionally, as an embodiment, the backbone network is configured to perform convolution processing on the sample image, and output a first feature map of the sample image; the RPN is used for outputting the position information of the candidate 2D frame of the target object on the basis of a second characteristic diagram, and the second characteristic diagram is determined according to the first characteristic diagram; the region of interest extraction module is used for extracting first characteristic information on a third characteristic map based on the position information of the candidate 2D frame, wherein the first characteristic information is the characteristic of the region where the candidate 2D frame is positioned, and the third characteristic map is determined according to the first characteristic map; the classification regression network is used for processing the first characteristic information, outputting target 2D frames of the target object and first indication information, wherein the number of the target 2D frames of the target object is smaller than or equal to the number of the candidate 2D frames of the target object, and the first indication information is used for indicating the class to which the target object belongs.
Optionally, as an embodiment, the classification regression network includes a first regional convolutional neural network RCNN, where the first RCNN includes a hidden layer, a plurality of sub-classification full-connection layers, and a plurality of sub-regression full-connection layers, the hidden layer is connected to the plurality of sub-classification full-connection layers, the hidden layer is connected to the plurality of sub-regression full-connection layers, the plurality of sub-classification full-connection layers are in one-to-one correspondence with the plurality of tasks, and the plurality of sub-regression full-connection layers are in one-to-one correspondence with the plurality of tasks; the hidden layer is used for processing the first characteristic information to obtain second characteristic information; the sub-classification full-connection layer is used for obtaining the confidence coefficient of the object category of the candidate 2D frame belonging to the task corresponding to the sub-classification full-connection layer according to the second characteristic information; the sub-regression full-connection layer is used for adjusting the position information of the candidate 2D frames according to the result after hidden layer processing to obtain adjusted candidate 2D frames; and the part of the classification regression network, which is required to be trained, comprises a hidden layer, a sub-classification full-connection layer and a sub-regression full-connection layer, wherein the sub-classification full-connection layer and the sub-regression full-connection layer correspond to the task where the first task object is located.
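As an illustration of this training-time structure only, the following minimal PyTorch sketch builds a first RCNN with one shared hidden layer, one sub-classification fc and one sub-regression fc per task, and an independent sigmoid on each sub-classification output; the three tasks with four categories each are an assumption made for this example.

import torch
import torch.nn as nn

class ToyFirstRCNN(nn.Module):
    def __init__(self, in_dim=256 * 7 * 7, hidden=1024, cats_per_task=(4, 4, 4)):
        super().__init__()
        self.hidden = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, hidden), nn.ReLU())  # shared hidden layer
        self.sub_cls = nn.ModuleList([nn.Linear(hidden, n) for n in cats_per_task])      # one per task
        self.sub_reg = nn.ModuleList([nn.Linear(hidden, 4 * n) for n in cats_per_task])  # one per task

    def forward(self, roi_features):
        h = self.hidden(roi_features)                                  # second characteristic information
        confs = [torch.sigmoid(cls(h)) for cls in self.sub_cls]        # per-task category confidences
        deltas = [reg(h) for reg in self.sub_reg]                      # per-task box adjustments
        return confs, deltas

confs, deltas = ToyFirstRCNN()(torch.randn(2, 256, 7, 7))              # features of two candidate 2D frames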
In another implementation, the apparatus 4000 may act as an object recognition apparatus. The object recognition apparatus includes an acquisition unit 4010 and a processing unit 4020. The awareness network includes: the system comprises a backbone network, a candidate region generation network, a region of interest extraction module and a classification regression network.
The acquisition unit 4010 and the processing unit 4020 may be used to perform the object recognition method of the embodiment of the application, for example, may be used to perform the method 1500 or the method 1600.
The acquisition unit 4010 is used for acquiring an input image.
The processing unit 4020 is configured to perform convolution processing on an input image by using a backbone network to obtain a first feature map of the input image; outputting position information of a candidate two-dimensional 2D frame of a target object based on a second feature map by using the RPN, wherein the target object comprises an object to be detected in a plurality of tasks, each task in the plurality of tasks comprises at least one category, and the second feature map is determined according to the first feature map; extracting first characteristic information on a third characteristic map based on the position information of the candidate 2D frame by using a region-of-interest extraction module, wherein the first characteristic information is the characteristic of the region where the candidate 2D frame is positioned, and the third characteristic map is determined according to the first characteristic map; and processing the first characteristic information by using a classification regression network to obtain target 2D frames of the target object and first indication information, wherein the number of the target 2D frames of the target object is smaller than or equal to the number of the candidate 2D frames of the target object, and the first indication information is used for indicating the category to which the target object belongs.
Optionally, as an embodiment, the processing unit 4020 is specifically configured to: processing the first characteristic information by using a classification regression network to obtain the confidence that the candidate 2D frames belong to each category in the plurality of tasks; adjusting the position information of the candidate 2D frames by using a classification regression network to obtain adjusted candidate 2D frames; determining a target 2D frame according to the adjusted candidate 2D frame; and determining first indication information according to the confidence that the target 2D frame belongs to each category.
Optionally, as an embodiment, the classification regression network includes a first regional convolutional neural network RCNN, where the first RCNN includes a hidden layer, a plurality of sub-classification full-connection layers, and a plurality of sub-regression full-connection layers, the hidden layer is connected to the plurality of sub-classification full-connection layers, the hidden layer is connected to the plurality of sub-regression full-connection layers, the plurality of sub-classification full-connection layers are in one-to-one correspondence with the plurality of tasks, and the plurality of sub-regression full-connection layers are in one-to-one correspondence with the plurality of tasks; the processing unit is specifically configured to: processing the first characteristic information by using the hidden layer to obtain second characteristic information; obtaining the confidence coefficient of the object category of the candidate 2D frame belonging to the task corresponding to the sub-classification full-connection layer according to the second characteristic information by using the sub-classification full-connection layer; and adjusting the position information of the candidate 2D frames according to the second characteristic information by utilizing the sub-regression full-connection layer to obtain the adjusted candidate 2D frames.
Optionally, as an embodiment, the classification regression network includes a second RCNN, where the second RCNN includes a hidden layer, a classification full-connection layer, and a regression full-connection layer, where the hidden layer is connected to the classification full-connection layer, and the hidden layer is connected to the regression full-connection layer; the processing unit 4020 is specifically configured to: process the first characteristic information by using the hidden layer to obtain third characteristic information; obtain, by using the classification full-connection layer, the confidence coefficient of the candidate 2D frames belonging to each category according to the third characteristic information; and adjust the position information of the candidate 2D frames by using the regression full-connection layer according to the third characteristic information to obtain the adjusted candidate 2D frames.
Optionally, as an embodiment, the classification full-connection layer is obtained by combining multiple sub-classification full-connection layers in the first RCNN, the regression full-connection layer is obtained by combining multiple sub-regression full-connection layers in the first RCNN, the first RCNN includes a hidden layer, multiple sub-classification full-connection layers and multiple sub-regression full-connection layers, the hidden layer is connected with the multiple sub-classification full-connection layers, the hidden layer is connected with the multiple sub-regression full-connection layers, the multiple sub-classification full-connection layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression full-connection layers are in one-to-one correspondence with the multiple tasks; the sub-classification full-connection layer is used for obtaining the confidence coefficient of the object category of the candidate 2D frame belonging to the task corresponding to the sub-classification full-connection layer according to the obtained third characteristic information; and the sub-regression full-connection layer is used for adjusting the position information of the candidate 2D frames according to the obtained third characteristic information to obtain the adjusted candidate 2D frames.
It should be noted that the above-mentioned apparatus 4000 is embodied in the form of a functional unit. The term "unit" herein may be implemented in software and/or hardware, without specific limitation.
For example, a "unit" may be a software program, a hardware circuit or a combination of both that implements the functions described above. The hardware circuitry may include application specific integrated circuits (application specific integrated circuit, ASICs), electronic circuits, processors (e.g., shared, proprietary, or group processors, etc.) and memory for executing one or more software or firmware programs, merged logic circuits, and/or other suitable components that support the described functions.
Thus, the elements of the examples described in the embodiments of the present application can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Fig. 20 is a schematic diagram of a hardware structure of an apparatus according to an embodiment of the present application. The apparatus 6000 as shown in fig. 20 (the apparatus 6000 may in particular be a computer device) comprises a memory 6001, a processor 6002, a communication interface 6003 and a bus 6004. The memory 6001, the processor 6002, and the communication interface 6003 are connected to each other by a bus 6004.
In one implementation, the device 6000 may act as a training device for the sensory network.
The memory 6001 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 6001 may store a program; when the program stored in the memory 6001 is executed by the processor 6002, the processor 6002 is configured to perform the steps of the training method of the perception network according to the embodiments of the present application. Specifically, the processor 6002 may perform step S1220 in the method shown in fig. 12 above.
The processor 6002 may employ a general-purpose central processing unit (central processing unit, CPU), a microprocessor, an application specific integrated circuit (application specific integrated circuit, ASIC), a graphics processor (graphics processing unit, GPU), or one or more integrated circuits for executing related programs, so as to implement the training method of the perception network in the method embodiments of the present application.
The processor 6002 may also be an integrated circuit chip with signal processing capability, for example, the chip shown in fig. 5. In an implementation process, the steps of the training method of the perception network of the present application may be completed by an integrated logic circuit of hardware in the processor 6002 or by instructions in the form of software.
The processor 6002 may also be a general purpose processor, a digital signal processor (digital signal processing, DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 6001, and the processor 6002 reads the information in the memory 6001, and combines the hardware thereof to perform the functions required to be performed by the units included in the training apparatus in the embodiment of the application, or to perform the training method of the perception network shown in fig. 12 in the method embodiment of the application.
The communication interface 6003 enables communication between the apparatus 6000 and other devices or communication networks using transceiving means such as, but not limited to, a transceiver. For example, training data may be acquired through the communication interface 6003.
Bus 6004 may include a path to transfer information between components of device 6000 (e.g., memory 6001, processor 6002, communication interface 6003).
In another implementation, the device 6000 may act as an object recognition device.
The memory 6001 may be a ROM, a static storage device, and a RAM. The memory 6001 may store a program, and the processor 6002 and the communication interface 6003 are configured to execute respective steps of the object recognition method of the embodiment of the present application when the program stored in the memory 6001 is executed by the processor 6002. Specifically, the processor 6002 may perform steps S1520 to S1540 in the method shown in fig. 15 above.
The processor 6002 may employ a general-purpose CPU, microprocessor, ASIC, GPU, or one or more integrated circuits for performing the procedures required to implement the functions performed by the elements of the object recognition device of an embodiment of the application or to perform the object recognition method of an embodiment of the application.
The processor 6002 may also be an integrated circuit chip with signal processing capabilities, for example, the chip shown in fig. 6. In implementation, the steps of the object recognition method according to the embodiment of the present application may be completed by an integrated logic circuit of hardware in the processor 6002 or an instruction in the form of software.
The processor 6002 may also be a general purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 6001, and the processor 6002 reads information in the memory 6001, and in combination with hardware thereof, performs functions to be executed by units included in the object recognition apparatus of the embodiment of the application, or executes the object recognition method of the method embodiment of the application.
The communication interface 6003 enables communication between the apparatus 6000 and other devices or communication networks using transceiving means such as, but not limited to, a transceiver. For example, data to be processed can be acquired through the communication interface 6003.
Bus 6004 may include a path to transfer information between components of device 6000 (e.g., memory 6001, processor 6002, communication interface 6003).
It should be noted that although the above-described apparatus 6000 only shows a memory, a processor, a communication interface, in a specific implementation, it will be appreciated by those skilled in the art that the apparatus 6000 may also include other devices necessary to achieve normal operation. Also, as will be appreciated by those skilled in the art, the apparatus 6000 may also include hardware devices that perform other additional functions, as desired. Furthermore, it will be appreciated by those skilled in the art that the apparatus 6000 may also include only the devices necessary to implement the embodiments of the present application, and not necessarily all of the devices shown in fig. 20.
Embodiments of the present application provide a computer-readable medium storing program code for execution by a device, the program code including relevant content for performing an object recognition method as shown in fig. 15 or 16.
Embodiments of the present application provide a computer readable medium storing program code for execution by a device, the program code including relevant content for performing a training method as shown in fig. 12 or 13.
Embodiments of the present application provide a computer program product which, when run on a computer, causes the computer to perform the relevant content of the object recognition method as shown in fig. 15 or fig. 16.
Embodiments of the present application provide a computer program product which, when run on a computer, causes the computer to perform the relevant content of the training method as shown in fig. 12 or fig. 13.
The embodiment of the application provides a chip, which comprises a processor and a data interface, wherein the processor reads instructions on a memory through the data interface to execute the object identification method shown in fig. 15 or 16.
The embodiment of the application provides a chip, which comprises a processor and a data interface, wherein the processor reads instructions on a memory through the data interface to execute the training method shown in fig. 12 or fig. 13.
Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, and when the instructions are executed, the processor is configured to execute the object recognition method of fig. 15 or 16 or the training method of fig. 12 or 13.
It is to be appreciated that the processor in embodiments of the application may be a central processing unit (central processing unit, CPU), but may also be other general purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), off-the-shelf programmable gate arrays (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It should also be appreciated that the memory in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both a volatile memory and a nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (random access memory, RAM), which acts as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or the computer program are loaded or executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, or digital subscriber line) or wireless (e.g., infrared, radio, or microwave) manner. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: there are three cases, a alone, a and B together, and B alone, wherein a, B may be singular or plural. In addition, the character "/" herein generally indicates that the associated object is an "or" relationship, but may also indicate an "and/or" relationship, and may be understood by referring to the context.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (27)

  1. A perception network, comprising: a backbone network, a candidate region generation network RPN, a region of interest extraction module and a classification regression network;
    the backbone network is used for carrying out convolution processing on the input image and outputting a first feature map of the input image;
    the RPN is used for outputting position information of a candidate two-dimensional 2D frame of a target object based on a second feature map, the target object comprises an object to be detected in a plurality of tasks, each task in the plurality of tasks comprises at least one category, and the second feature map is determined according to the first feature map;
    the region of interest extraction module is configured to extract first feature information on a third feature map based on the position information of the candidate 2D frame, where the first feature information is a feature of a region where the candidate 2D frame is located, and the third feature map is determined according to the first feature map;
    The classification regression network is configured to process the first feature information, output target 2D frames of the target object and first indication information, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate a class to which the target object belongs.
  2. The perception network of claim 1, wherein the classification regression network is specifically configured to:
    processing the first characteristic information to obtain the confidence that the candidate 2D frames belong to each category in the plurality of tasks;
    adjusting the position information of the candidate 2D frames to obtain adjusted candidate 2D frames;
    determining the target 2D frame according to the adjusted candidate 2D frame;
    and determining the first indication information according to the confidence that the target 2D frame belongs to each category.
  3. The perception network of claim 2, wherein said classification regression network comprises a first regional convolutional neural network RCNN, said first RCNN comprising a hidden layer, a plurality of sub-classification fully-connected layers and a plurality of sub-regression fully-connected layers, said hidden layer being connected to said plurality of sub-classification fully-connected layers, said plurality of sub-classification fully-connected layers being in one-to-one correspondence with said plurality of tasks, said plurality of sub-regression fully-connected layers being in one-to-one correspondence with said plurality of tasks;
    The hidden layer is used for processing the first characteristic information to obtain second characteristic information;
    the sub-classification full-connection layer is used for obtaining the confidence coefficient of the object category of the candidate 2D frame belonging to the task corresponding to the sub-classification full-connection layer according to the second characteristic information;
    and the sub-regression full-connection layer is used for adjusting the position information of the candidate 2D frames according to the second characteristic information to obtain the adjusted candidate 2D frames.
  4. The perception network of claim 2, wherein said classification regression network comprises a second RCNN comprising a hidden layer, a classification fully connected layer, and a regression fully connected layer, said hidden layer being connected to said regression fully connected layer;
    the hidden layer is used for processing the first characteristic information to obtain third characteristic information;
    the classification full-connection layer is used for obtaining the confidence that the candidate 2D frames belong to each category according to the third characteristic information;
    and the regression full-connection layer is used for adjusting the position information of the candidate 2D frames according to the third characteristic information to obtain the adjusted candidate 2D frames.
  5. The perception network of claim 4, wherein the classification full-connection layer is obtained by merging a plurality of sub-classification full-connection layers in a first RCNN, the regression full-connection layer is obtained by merging a plurality of sub-regression full-connection layers in the first RCNN,
    the first RCNN comprises a hidden layer, a plurality of sub-classification full-connection layers and a plurality of sub-regression full-connection layers, wherein the hidden layer is connected with the plurality of sub-classification full-connection layers, the hidden layer is connected with the plurality of sub-regression full-connection layers, the plurality of sub-classification full-connection layers are in one-to-one correspondence with the plurality of tasks, and the plurality of sub-regression full-connection layers are in one-to-one correspondence with the plurality of tasks;
    the sub-classification full-connection layer is used for obtaining the confidence coefficient of the object category of the candidate 2D frame belonging to the task corresponding to the sub-classification full-connection layer according to the third characteristic information;
    and the sub-regression full-connection layer is used for adjusting the position information of the candidate 2D frames according to the third characteristic information to obtain the adjusted candidate 2D frames.
  6. A method of training a perception network, the perception network comprising a candidate region generation network (RPN), wherein the RPN is used for predicting the position information of a candidate two-dimensional (2D) frame of a target object in a sample image, the target object comprises an object to be detected of a plurality of tasks, and each task in the plurality of tasks comprises at least one category; the target object comprises a first task object and a second task object;
    The method comprises the following steps:
    acquiring training data, wherein the training data comprises the sample image, labeling data of the first task object on the sample image and a pseudo frame of the second task object on the sample image, the labeling data comprises class labels of the first task object and labeling 2D frames of the first task object, and the pseudo frame of the second task object is a target 2D frame of the second task object obtained by performing inference on the sample image using another perception network;
    training the perception network based on the training data.
  7. The training method of claim 6, wherein the perception network further comprises a backbone network, a region of interest extraction module, and a classification regression network,
    the training the perception network based on the training data comprises:
    calculating a first loss function value according to the difference between the marked 2D frame of the first task object, the target 2D frame of the second task object and the candidate 2D frame of the target object in the sample image obtained by the RPN prediction;
    calculating a second loss function value of the perception network according to the labeling data;
    And back-propagating the first loss function value and the second loss function value, and adjusting parameters of a part required to be trained in the perception network, wherein the part required to be trained in the perception network comprises the part required to be trained in the classification regression network, the region of interest extraction module, the RPN and the backbone network, and the part required to be trained in the classification regression network is determined according to the first task object.
  8. The training method of claim 6 or 7, wherein,
    the backbone network is used for carrying out convolution processing on the sample image and outputting a first feature map of the sample image;
    the RPN is used for outputting the position information of the candidate 2D frame of the target object based on a second characteristic diagram, and the second characteristic diagram is determined according to the first characteristic diagram;
    the region of interest extraction module is configured to extract first feature information on a third feature map based on the position information of the candidate 2D frame, where the first feature information is a feature of a region where the candidate 2D frame is located, and the third feature map is determined according to the first feature map;
    the classification regression network is configured to process the first feature information, output target 2D frames of the target object and first indication information, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate a class to which the target object belongs.
  9. The training method of claim 8, wherein the classification regression network comprises a first regional convolutional neural network RCNN, the first RCNN comprising a hidden layer, a plurality of sub-classification fully-connected layers, and a plurality of sub-regression fully-connected layers, the hidden layer being connected to the plurality of sub-classification fully-connected layers, the plurality of sub-classification fully-connected layers being in one-to-one correspondence with the plurality of tasks, the plurality of sub-regression fully-connected layers being in one-to-one correspondence with the plurality of tasks;
    the hidden layer is used for processing the first characteristic information to obtain second characteristic information;
    the sub-classification full-connection layer is used for obtaining the confidence coefficient of the object category of the candidate 2D frame belonging to the task corresponding to the sub-classification full-connection layer according to the second characteristic information;
    the sub-regression full-connection layer is used for adjusting the position information of the candidate 2D frames according to the second characteristic information to obtain the adjusted candidate 2D frames;
    and
    the part to be trained in the classification regression network comprises the hidden layer, a sub-classification full-connection layer and a sub-regression full-connection layer, wherein the sub-classification full-connection layer and the sub-classification full-connection layer correspond to the task where the first task object is located.
  10. An object recognition method, wherein a perception network comprises a backbone network, a candidate region generation network (RPN), a region of interest extraction module and a classification regression network, and the method comprises the following steps:
    carrying out convolution processing on an input image by using the backbone network to obtain a first feature map of the input image;
    outputting position information of a candidate two-dimensional 2D frame of a target object on a second feature map by utilizing the RPN, wherein the target object comprises an object to be detected in a plurality of tasks, each task in the plurality of tasks comprises at least one category, and the second feature map is determined according to the first feature map;
    extracting first characteristic information on a third characteristic map based on the position information of the candidate 2D frame by using the region-of-interest extraction module, wherein the first characteristic information is the characteristic of the region where the candidate 2D frame is located, and the third characteristic map is determined according to the first characteristic map;
    and processing the first characteristic information by using the classification regression network to obtain target 2D frames of the target object and first indication information, wherein the number of the target 2D frames is smaller than or equal to that of the candidate 2D frames, and the first indication information is used for indicating the category to which the target object belongs.
  11. The method according to claim 10, wherein the processing the first feature information by using the classification regression network to obtain a target 2D frame of the target object and first indication information includes:
    processing the first characteristic information by using the classification regression network to obtain the confidence that the candidate 2D frame belongs to each category in the plurality of tasks;
    adjusting the position information of the candidate 2D frames by using the classification regression network to obtain adjusted candidate 2D frames;
    determining the target 2D frame according to the adjusted candidate 2D frame;
    and determining the first indication information according to the confidence that the target 2D frame belongs to each category.
  12. The method of claim 11, wherein the classification regression network comprises a first regional convolutional neural network RCNN, the first RCNN comprising a hidden layer, a plurality of sub-classification fully-connected layers, and a plurality of sub-regression fully-connected layers, the hidden layer being connected to the plurality of sub-classification fully-connected layers, the plurality of sub-classification fully-connected layers being in one-to-one correspondence with the plurality of tasks, the plurality of sub-regression fully-connected layers being in one-to-one correspondence with the plurality of tasks; and
    The processing the first feature information by using the classification regression network, outputting a target 2D frame of the target object and the first indication information, including:
    processing the first characteristic information by using the hidden layer to obtain second characteristic information;
    obtaining the confidence coefficient of the object category of the task corresponding to the sub-classification full-connection layer, which belongs to the candidate 2D frame, by using the sub-classification full-connection layer according to the second characteristic information;
    and adjusting the position information of the candidate 2D frames by utilizing the sub-regression full-connection layer according to the second characteristic information to obtain the adjusted candidate 2D frames.
  13. The method of claim 11, wherein the classification regression network comprises a second RCNN comprising a hidden layer, a classification fully connected layer, and a regression fully connected layer, the hidden layer being connected to the regression fully connected layer; and
    the processing the first feature information by using the classification regression network, outputting a target 2D frame of the target object and first indication information, including:
    processing the first characteristic information by using the hidden layer to obtain third characteristic information;
    Obtaining the confidence that the candidate 2D frames belong to each category according to the third characteristic information by using the classification full-connection layer;
    and adjusting the position information of the candidate 2D frames according to the third characteristic information by using the regression full connection layer to obtain the adjusted candidate 2D frames.
  14. The method of claim 13, wherein the categorized full-connectivity layer is obtained by merging a plurality of sub-categorized full-connectivity layers in a first RCNN, wherein the regressive full-connectivity layer is obtained by merging a plurality of sub-regressive full-connectivity layers in the first RCNN,
    the first RCNN comprises the hidden layer, the plurality of sub-classification full-connection layers and the plurality of sub-regression full-connection layers, wherein the hidden layer is connected with the plurality of sub-classification full-connection layers, the hidden layer is connected with the plurality of sub-regression full-connection layers, the plurality of sub-classification full-connection layers are in one-to-one correspondence with the plurality of tasks, and the plurality of sub-regression full-connection layers are in one-to-one correspondence with the plurality of tasks;
    the sub-classification full-connection layer is used for obtaining the confidence coefficient of the object category of the candidate 2D frame belonging to the task corresponding to the sub-classification full-connection layer according to the third characteristic information;
    And the sub-regression full-connection layer is used for adjusting the position information of the candidate 2D frames according to the third characteristic information to obtain the adjusted candidate 2D frames.
  15. A training device of a perception network, characterized in that the perception network comprises a candidate region generation network RPN, the RPN is used for predicting position information of a candidate two-dimensional 2D frame of a target object in a sample image, the target object comprises an object to be detected of a plurality of tasks, each task of the plurality of tasks comprises at least one category, and the target object comprises a first task object and a second task object; the training device comprises:
    the acquisition unit is used for acquiring training data, wherein the training data comprises the sample image, labeling data of the first task object on the sample image and a pseudo frame of the second task object on the sample image, the labeling data comprises a class label of the first task object and a labeling 2D frame of the first task object, and the pseudo frame of the second task object is a target 2D frame of the second task object obtained by performing inference on the sample image using another perception network;
    And the processing unit is used for training the perception network based on the training data.
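
For illustration only: one way to assemble such training data is to run an already-trained perception network in inference mode on the sample image to obtain the pseudo frames of the second task object. In the sketch below, `teacher_net` and the structure of its output are assumptions, not an interface defined by the claims.

```python
# Hedged sketch: build one training sample combining manual annotations for the
# first task with pseudo frames produced by another perception network.
import torch

def build_training_sample(image, first_task_labels, first_task_boxes, teacher_net):
    with torch.no_grad():                                   # inference only, no gradients
        pseudo_boxes, _scores = teacher_net(image.unsqueeze(0))
    return {
        "image": image,
        "gt_boxes_task1": first_task_boxes,     # annotated 2D frames of the first task object
        "gt_labels_task1": first_task_labels,   # class labels of the first task object
        "pseudo_boxes_task2": pseudo_boxes,     # target 2D frames inferred by the other network
    }
```
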
  16. The training device of claim 15, wherein the perception network further comprises a backbone network, a region of interest extraction module, and a classification regression network, and wherein the processing unit is specifically configured to:
    calculating a first loss function value according to differences between the candidate 2D frames of the target object in the sample image predicted by the RPN and both the annotated 2D frame of the first task object and the target 2D frame of the second task object;
    calculating a second loss function value of the perception network according to the annotation data; and
    back-propagating the first loss function value and the second loss function value, and adjusting parameters of the parts of the perception network that need to be trained, wherein the parts of the perception network that need to be trained comprise the part of the classification regression network that needs to be trained, the region of interest extraction module, the RPN and the backbone network, and the part of the classification regression network that needs to be trained is determined according to the first task object.
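
For illustration only: a minimal training-step sketch along the lines of claim 16, assuming a PyTorch-style network object with per-task heads exposed as `net.rcnn.cls_heads` / `net.rcnn.reg_heads` (hypothetical attribute names) and caller-supplied loss functions `rpn_loss_fn` and `head_loss_fn` (placeholders, not defined by the source).

```python
# Hedged sketch: compute the RPN loss against annotated plus pseudo boxes, the head
# loss against the annotations only, and update only the heads of the first task.
import torch

def train_step(net, optimizer, batch, first_task_id, rpn_loss_fn, head_loss_fn):
    # Only the sub-classification / sub-regression heads of the first task's task are trained.
    for task_id, (cls_head, reg_head) in enumerate(zip(net.rcnn.cls_heads, net.rcnn.reg_heads)):
        for p in list(cls_head.parameters()) + list(reg_head.parameters()):
            p.requires_grad_(task_id == first_task_id)

    candidates, cls_scores, box_deltas = net(batch["image"].unsqueeze(0))
    all_boxes = torch.cat([batch["gt_boxes_task1"], batch["pseudo_boxes_task2"]], dim=0)

    loss1 = rpn_loss_fn(candidates, all_boxes)              # first loss function value
    loss2 = head_loss_fn(cls_scores, box_deltas, batch)     # second loss function value

    optimizer.zero_grad()
    (loss1 + loss2).backward()                              # back-propagation
    optimizer.step()
```
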
  17. The training device according to claim 15 or 16, wherein
    the backbone network is configured to perform convolution processing on the sample image and output a first feature map of the sample image;
    the RPN is configured to output the position information of the candidate 2D frame of the target object based on a second feature map, wherein the second feature map is determined according to the first feature map;
    the region of interest extraction module is configured to extract first feature information on a third feature map based on the position information of the candidate 2D frame, where the first feature information is a feature of a region where the candidate 2D frame is located, and the third feature map is determined according to the first feature map;
    the classification regression network is configured to process the first feature information, and output target 2D frames of the target object and first indication information, where the number of target 2D frames of the target object is less than or equal to the number of candidate 2D frames of the target object, and the first indication information is used to indicate a class to which the target object belongs.
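
For illustration only: the data flow of claim 17 (and, at inference time, of claims 19 and 20) can be sketched as below. `torchvision.ops.roi_align` stands in for the region of interest extraction module, the same feature map is reused as the second and third feature maps, and the pooling size and spatial scale are assumptions.

```python
# Hedged sketch: backbone -> RPN -> region of interest extraction -> classification
# regression network.
import torch
from torchvision.ops import roi_align

def perceive(backbone, rpn, rcnn_head, image):
    feat = backbone(image)                                # first feature map (reused as 2nd/3rd)
    candidate_boxes = rpn(feat)                           # [N, 4] candidate 2D frames
    batch_idx = torch.zeros(len(candidate_boxes), 1)      # single-image batch index
    rois = torch.cat([batch_idx, candidate_boxes], dim=1)
    roi_feats = roi_align(feat, rois, output_size=(7, 7), spatial_scale=1 / 16)
    cls_scores, box_deltas = rcnn_head(roi_feats.flatten(1))
    return candidate_boxes, cls_scores, box_deltas        # inputs to the final box selection
```
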
  18. The training device of claim 17, wherein the classification regression network comprises a first regional convolutional neural network RCNN, the first RCNN comprising a hidden layer, a plurality of sub-classification fully-connected layers, and a plurality of sub-regression fully-connected layers, the hidden layer being connected to the plurality of sub-classification fully-connected layers, the plurality of sub-classification fully-connected layers being in one-to-one correspondence with the plurality of tasks, the plurality of sub-regression fully-connected layers being in one-to-one correspondence with the plurality of tasks;
    the hidden layer is configured to process the first feature information to obtain second feature information;
    the sub-classification fully connected layer is configured to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to an object category of the task corresponding to the sub-classification fully connected layer;
    the sub-regression fully connected layer is configured to adjust the position information of the candidate 2D frames according to the second feature information to obtain the adjusted candidate 2D frames; and
    the part of the classification regression network that needs to be trained comprises the hidden layer and the sub-classification fully connected layer and the sub-regression fully connected layer that correspond to the task where the first task object is located.
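
For illustration only: the first RCNN of claim 18 can be sketched as a shared hidden layer followed by one sub-classification and one sub-regression fully connected layer per task. Layer sizes, the class counts per task and the use of PyTorch are assumptions.

```python
# Hedged sketch: shared hidden layer with per-task classification and regression heads.
import torch.nn as nn

class MultiTaskRCNNHead(nn.Module):
    def __init__(self, in_dim=256 * 7 * 7, hidden_dim=1024, classes_per_task=(3, 2)):
        super().__init__()
        # Shared hidden layer: first feature information -> second feature information.
        self.hidden = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(inplace=True))
        # One sub-classification and one sub-regression fully connected layer per task.
        self.cls_heads = nn.ModuleList([nn.Linear(hidden_dim, c) for c in classes_per_task])
        self.reg_heads = nn.ModuleList([nn.Linear(hidden_dim, 4 * c) for c in classes_per_task])

    def forward(self, roi_feats):                               # roi_feats: [num_rois, in_dim]
        h = self.hidden(roi_feats)                              # second feature information
        cls_scores = [head(h) for head in self.cls_heads]       # per-task class confidences
        box_deltas = [head(h) for head in self.reg_heads]       # per-task box adjustments
        return cls_scores, box_deltas
```
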
  19. An object recognition device, characterized in that a perception network is deployed on the device, the perception network comprising a backbone network, a candidate region generation network RPN, a region of interest extraction module and a classification regression network; and the device comprises:
    an acquisition unit configured to acquire an input image;
    a processing unit for:
    carrying out convolution processing on the input image by using the backbone network to obtain a first feature map of the input image;
    outputting position information of a candidate two-dimensional 2D frame of a target object based on a second feature map by using the RPN, wherein the target object comprises objects to be detected in a plurality of tasks, each task in the plurality of tasks comprises at least one category, and the second feature map is determined according to the first feature map;
    extracting first feature information on a third feature map based on the position information of the candidate 2D frame by using the region of interest extraction module, wherein the first feature information is a feature of the region where the candidate 2D frame is located, and the third feature map is determined according to the first feature map;
    and processing the first feature information by using the classification regression network to obtain target 2D frames of the target object and first indication information, wherein the number of the target 2D frames of the target object is less than or equal to the number of the candidate 2D frames of the target object, and the first indication information is used to indicate the category to which the target object belongs.
  20. The apparatus according to claim 19, wherein the processing unit is specifically configured to:
    processing the first feature information by using the classification regression network to obtain the confidence that the candidate 2D frame belongs to each category in the plurality of tasks;
    adjusting the position information of the candidate 2D frames by using the classification regression network to obtain adjusted candidate 2D frames;
    determining the target 2D frame according to the adjusted candidate 2D frame;
    and determining the first indication information according to the confidence that the target 2D frame belongs to each category.
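
For illustration only: the last two steps of claim 20 (selecting target 2D frames from the adjusted candidates and deriving the first indication information from the confidences) are commonly implemented with score thresholding and non-maximum suppression. The thresholds and the use of `torchvision.ops.nms` are assumptions; background-class handling is omitted.

```python
# Hedged sketch: keep high-confidence, non-overlapping adjusted candidate frames and
# report the most confident category for each kept frame.
import torch
from torchvision.ops import nms

def select_targets(adjusted_boxes, class_scores, score_thresh=0.5, iou_thresh=0.5):
    probs = class_scores.softmax(dim=1)                 # [N, num_classes] confidences
    best_scores, best_classes = probs.max(dim=1)
    keep = best_scores > score_thresh
    boxes, scores, classes = adjusted_boxes[keep], best_scores[keep], best_classes[keep]
    kept = nms(boxes, scores, iou_thresh)               # fewer target frames than candidates
    return boxes[kept], classes[kept]                   # target 2D frames, first indication info
```
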
  21. The apparatus of claim 20, wherein the classification regression network comprises a first regional convolutional neural network RCNN, the first RCNN comprising a hidden layer, a plurality of sub-classification fully-connected layers, and a plurality of sub-regression fully-connected layers, the hidden layer being connected to the plurality of sub-classification fully-connected layers, the plurality of sub-classification fully-connected layers being in one-to-one correspondence with the plurality of tasks, the plurality of sub-regression fully-connected layers being in one-to-one correspondence with the plurality of tasks; the processing unit is specifically configured to:
    processing the first feature information by using the hidden layer to obtain second feature information; obtaining, by using the sub-classification fully connected layer according to the second feature information, the confidence that the candidate 2D frame belongs to an object category of the task corresponding to the sub-classification fully connected layer; and
    adjusting the position information of the candidate 2D frames according to the second feature information by using the sub-regression fully connected layer, to obtain the adjusted candidate 2D frames.
  22. The apparatus of claim 20, wherein the classification regression network comprises a second RCNN comprising a hidden layer, a classification fully connected layer, and a regression fully connected layer, the hidden layer being connected to the regression fully connected layer; the processing unit is specifically configured to:
    processing the first feature information by using the hidden layer to obtain third feature information;
    obtaining, by using the classification fully connected layer, the confidence that the candidate 2D frame belongs to each category according to the third feature information; and
    adjusting the position information of the candidate 2D frames according to the third feature information by using the regression fully connected layer, to obtain the adjusted candidate 2D frames.
  23. The apparatus of claim 22, wherein the classification fully connected layer is obtained by merging a plurality of sub-classification fully connected layers in a first RCNN, and the regression fully connected layer is obtained by merging a plurality of sub-regression fully connected layers in the first RCNN,
    wherein the first RCNN comprises the hidden layer, the plurality of sub-classification fully connected layers and the plurality of sub-regression fully connected layers, the hidden layer is connected to the plurality of sub-classification fully connected layers and to the plurality of sub-regression fully connected layers, the plurality of sub-classification fully connected layers are in one-to-one correspondence with the plurality of tasks, and the plurality of sub-regression fully connected layers are in one-to-one correspondence with the plurality of tasks;
    the sub-classification fully connected layer is configured to obtain, according to the third feature information, the confidence that the candidate 2D frame belongs to an object category of the task corresponding to the sub-classification fully connected layer; and
    the sub-regression fully connected layer is configured to adjust the position information of the candidate 2D frames according to the third feature information to obtain the adjusted candidate 2D frames.
  24. A training device for a perception network, comprising a processor and a transmission interface, the processor receiving or transmitting data through the transmission interface, the processor being configured to invoke program instructions stored in a memory to perform the method of any of claims 6 to 9.
  25. An object recognition device comprising a processor and a transmission interface, the processor receiving or transmitting data through the transmission interface, the processor being configured to invoke program instructions stored in a memory to perform the method of any of claims 10 to 14.
  26. A computer readable storage medium storing program code for device execution, which when run on a computer or processor causes the computer or processor to perform the method of any one of claims 6 to 9 or 10 to 14.
  27. A computer program product comprising instructions which, when run on a computer or processor, cause the computer or processor to perform the method of any of claims 6 to 9 or 10 to 14.
CN202180096605.3A 2021-04-12 2021-04-12 Perception network, training method of perception network, object recognition method and device Pending CN117157679A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/086643 WO2022217434A1 (en) 2021-04-12 2021-04-12 Cognitive network, method for training cognitive network, and object recognition method and apparatus

Publications (1)

Publication Number Publication Date
CN117157679A (en) 2023-12-01

Family

ID=83639331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180096605.3A Pending CN117157679A (en) 2021-04-12 2021-04-12 Perception network, training method of perception network, object recognition method and device

Country Status (2)

Country Link
CN (1) CN117157679A (en)
WO (1) WO2022217434A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115713500A (en) * 2022-11-07 2023-02-24 广州汽车集团股份有限公司 Visual perception method and device
CN116821699B (en) * 2023-08-31 2024-01-19 山东海量信息技术研究院 Perception model training method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10223610B1 (en) * 2017-10-15 2019-03-05 International Business Machines Corporation System and method for detection and classification of findings in images
CN109784194B (en) * 2018-12-20 2021-11-23 北京图森智途科技有限公司 Target detection network construction method, training method and target detection method
CN110298262B (en) * 2019-06-06 2024-01-02 华为技术有限公司 Object identification method and device
CN110533067A (en) * 2019-07-22 2019-12-03 杭州电子科技大学 The end-to-end Weakly supervised object detection method that frame based on deep learning returns
CN110705544B (en) * 2019-09-05 2023-04-07 中国民航大学 Self-adaptive rapid target detection method based on fast-RCNN

Also Published As

Publication number Publication date
WO2022217434A1 (en) 2022-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination