WO2022217434A1 - Cognitive network, cognitive network training method, and object recognition method and apparatus - Google Patents


Info

Publication number
WO2022217434A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
network
regression
candidate
classification
Prior art date
Application number
PCT/CN2021/086643
Other languages
English (en)
Chinese (zh)
Inventor
周凯强
江立辉
黄梓钊
秘谧
王鑫
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2021/086643 (published as WO2022217434A1)
Priority to CN202180096605.3A (published as CN117157679A)
Publication of WO2022217434A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology

Definitions

  • the present application relates to the field of computer vision, and more particularly, to a perceptual network, a training method for a perceptual network, and an object recognition method and apparatus.
  • Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, and medical diagnostics. Figuratively speaking, computer vision equips computers with eyes (cameras) and a brain (algorithms) so that computers can perceive the environment. Computer vision uses various imaging systems in place of the visual organs to obtain input information, and then the computer takes the place of the brain to process and interpret that input information.
  • perception networks deployed in advanced driving assistance systems (ADAS) and autonomous driving systems (ADS) can be used to identify obstacles on the road.
  • Most of the current perception networks can only complete one detection task.
  • To achieve multiple detection tasks it is usually necessary to deploy different networks to achieve different detection tasks.
  • the simultaneous operation of multiple perception networks will increase the power consumption of the hardware and reduce the running speed of the model.
  • the computing power of chips used in many fields is low, making it difficult to deploy a large-scale perception network, and even more difficult to deploy multiple perception networks.
  • the present application provides a perceptual network, a training method for a perceptual network, an object recognition method and a device, which can reduce the amount of parameters and calculations in a multi-task perceptual network, reduce the power consumption of hardware, and improve the running speed of the model.
  • In a first aspect, a perception network is provided, including: a backbone network, a region proposal network (RPN), a region of interest extraction module, and a classification and regression network. The RPN is used to output, based on a second feature map, the position information of candidate two-dimensional (2D) frames of a target object, where the target object includes objects to be detected in multiple tasks, each of the multiple tasks includes at least one category, and the second feature map is determined according to a first feature map. The region of interest extraction module is used to extract first feature information on a third feature map based on the position information of the candidate 2D frames, where the first feature information is the feature of the region where a candidate 2D frame is located, and the third feature map is determined according to the first feature map. The classification and regression network is used to process the first feature information and output target 2D frames of the target object and first indication information, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate the category to which the target object belongs.
  • In this way, one perception network is used to complete a variety of perception tasks, and the multiple tasks share one RPN, that is, one RPN predicts the regions where the objects to be detected in the multiple tasks are located, while the performance of the perception network is ensured.
  • the "first feature map” refers to the feature map output by the backbone network.
  • the feature maps output by the backbone network can all be referred to as first feature maps.
  • A broad category includes at least one category; that is, a broad category is a collection of at least one category.
  • Task division criteria can be set as needed; for example, the objects to be detected may be divided into multiple tasks according to the similarity of the objects to be detected.
  • RPN may also be referred to as a single-head multi-task RPN.
  • the second feature map may be one or multiple.
  • the second feature map may include one or more of the first feature map.
  • the third feature map may be one of the first feature maps.
  • the perception network further includes a feature pyramid network (FPN); the FPN is connected to the backbone network and is used to perform feature fusion on the first feature maps and then output the fused feature maps.
  • the second feature map may include one or more of the fused feature maps.
  • the third feature map may be one of the first feature maps or one of the fused feature maps output by the FPN.
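As a reading aid, the data flow described so far (backbone, optional FPN, shared RPN, region of interest extraction, classification and regression network) can be summarized in a minimal PyTorch-style sketch. All module names, layer sizes, the simplified single-anchor RPN, and the merged classification/regression head are illustrative assumptions rather than the patent's implementation (the per-task and merged head variants are described further below).

```python
# Minimal sketch of the described data flow, assuming PyTorch/torchvision.
# Names, sizes and the simplified RPN are illustrative, not from the patent.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class PerceptionNet(nn.Module):
    def __init__(self, categories_per_task):
        super().__init__()
        # Backbone: produces the "first feature map" from the input image (stride 4 here).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        # One RPN shared by all tasks: objectness + box deltas (single anchor for brevity).
        self.rpn_conv = nn.Conv2d(128, 128, 3, padding=1)
        self.rpn_cls = nn.Conv2d(128, 1, 1)
        self.rpn_reg = nn.Conv2d(128, 4, 1)
        # Classification and regression network over the extracted ROI features.
        total = sum(categories_per_task)
        self.hidden = nn.Linear(128 * 7 * 7, 256)
        self.cls_fc = nn.Linear(256, total + 1)   # category confidences (+ background)
        self.reg_fc = nn.Linear(256, 4 * total)   # box refinements

    def forward(self, image, candidate_boxes):
        # candidate_boxes stands in for the candidate 2D frames the RPN would emit
        # after anchor decoding and NMS, which are omitted for brevity.
        feat = self.backbone(image)                       # first/second feature map
        t = torch.relu(self.rpn_conv(feat))
        objectness, deltas = self.rpn_cls(t), self.rpn_reg(t)
        # Region of interest extraction: features of each candidate 2D frame.
        rois = roi_align(feat, [candidate_boxes], output_size=(7, 7), spatial_scale=0.25)
        h = torch.relu(self.hidden(rois.flatten(1)))      # shared hidden layer
        return objectness, deltas, self.cls_fc(h), self.reg_fc(h)

net = PerceptionNet(categories_per_task=[3, 4])
out = net(torch.randn(1, 3, 256, 256), torch.tensor([[16.0, 16.0, 96.0, 96.0]]))
```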
  • the classification and regression network is specifically used to: process the first feature information to obtain the confidence that the candidate 2D frame belongs to each category in the multiple tasks; adjust the position information of the candidate 2D frame to obtain an adjusted candidate 2D frame; determine the target 2D frame according to the adjusted candidate 2D frame; and determine the first indication information according to the confidence that the target 2D frame belongs to each category.
  • the position information of the candidate 2D frame is adjusted so that the adjusted candidate 2D frame matches the shape of the actual object more closely than the candidate 2D frame, that is, the adjusted candidate 2D frame is a more compact candidate 2D frame.
  • a frame merging operation is performed on the adjusted candidate 2D frame to obtain the target 2D frame.
  • the adjusted 2D boxes are merged with non-maximum suppression (NMS) to obtain the target 2D boxes.
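As a concrete illustration of this frame merging step, the short sketch below uses torchvision's `nms` operator as one possible implementation; the IoU threshold and the example boxes are arbitrary.

```python
import torch
from torchvision.ops import nms

def merge_boxes(adjusted_boxes, scores, iou_threshold=0.5):
    """Merge adjusted candidate 2D frames into target 2D frames with NMS.

    adjusted_boxes: Tensor[N, 4] in (x1, y1, x2, y2) format
    scores:         Tensor[N], confidence of the best category of each box
    """
    keep = nms(adjusted_boxes, scores, iou_threshold)   # indices of boxes to keep
    return adjusted_boxes[keep], scores[keep]

# Two heavily overlapping candidates collapse into a single target 2D frame.
boxes = torch.tensor([[10., 10., 50., 50.], [12., 11., 52., 49.], [80., 80., 120., 120.]])
scores = torch.tensor([0.9, 0.8, 0.7])
target_boxes, target_scores = merge_boxes(boxes, scores)
```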
  • the classification and regression network includes a first region convolutional neural network (RCNN). The first RCNN includes a hidden layer, multiple sub-classification fully connected layers, and multiple sub-regression fully connected layers; the hidden layer is connected to the multiple sub-classification fully connected layers and to the multiple sub-regression fully connected layers, the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks. The hidden layer is used to process the first feature information to obtain second feature information; a sub-classification fully connected layer is used to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to an object category in the task corresponding to that sub-classification fully connected layer; and a sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the second feature information to obtain the adjusted candidate 2D frame.
  • the hidden layer may include at least one of the following: a convolutional layer or a fully connected layer. Since multiple tasks share the hidden layer, the convolutional layer in the hidden layer can also be called a shared convolutional layer (shared conv), and the fully connected layer in the hidden layer can also be called a shared fully connected layer (shared fc).
  • the first RCNN includes hidden layers and multiple sub-classification fully connected layers and multiple sub-regression fully connected layers corresponding to multiple tasks.
  • Each task can have an independent sub-classification fully-connected layer and sub-regression fully-connected layer.
  • the sub-category fully-connected layer and the sub-regression fully-connected layer corresponding to each task can complete the detection of the object to be detected in the task.
  • the sub-category fully connected layer can output the confidence level that the candidate 2D frame belongs to the object category in the task
  • the sub-regression fully connected layer can output the adjusted candidate 2D box.
  • a first RCNN includes multiple sub-classification fully-connected layers and sub-regression fully-connected layers. Therefore, a first RCNN can complete the detection of objects to be detected in multiple tasks.
  • the first RCNN can also be called a single-head multi-task RCNN.
  • each task corresponds to an independent sub-classification fully connected layer (fc) and sub-regression fc, which improves the scalability of the perception network.
  • the perception network can flexibly implement functional configuration by adding or reducing sub-classification fc and sub-regression fc .
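A minimal sketch of this first RCNN structure, assuming PyTorch; the shared hidden layer is a single fully connected layer here, and all sizes and the task split are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FirstRCNN(nn.Module):
    """Single-head multi-task RCNN: shared hidden layer + one fc pair per task."""
    def __init__(self, in_features, hidden_dim, categories_per_task):
        super().__init__()
        self.hidden = nn.Linear(in_features, hidden_dim)          # shared fc
        self.cls_heads = nn.ModuleList(                            # sub-classification fc
            nn.Linear(hidden_dim, n + 1) for n in categories_per_task)
        self.reg_heads = nn.ModuleList(                            # sub-regression fc
            nn.Linear(hidden_dim, 4 * n) for n in categories_per_task)

    def forward(self, roi_features):
        h = torch.relu(self.hidden(roi_features))     # second feature information
        scores = [cls(h) for cls in self.cls_heads]   # per-task category confidences
        deltas = [reg(h) for reg in self.reg_heads]   # per-task box adjustments
        return scores, deltas

# Three tasks with 2, 5 and 3 categories respectively (illustrative values).
rcnn = FirstRCNN(in_features=6272, hidden_dim=256, categories_per_task=[2, 5, 3])
scores, deltas = rcnn(torch.randn(10, 6272))          # 10 candidate 2D frames
```

In this picture, adding or removing a task amounts to adding or removing one entry in `cls_heads` and `reg_heads`, which is the scalability property mentioned above.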
  • the classification and regression network includes a second RCNN
  • the second RCNN includes a hidden layer, a classification fully connected layer and a regression fully connected layer
  • the hidden layer is connected with the classification fully connected layer
  • the hidden layer is connected to the regression fully connected layer
  • the hidden layer is used to process the first feature information to obtain the third feature information
  • the classification fully connected layer is used to obtain the confidence that the candidate 2D frame belongs to each category according to the third feature information
  • the regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.
  • a second RCNN can complete the detection of objects to be detected in multiple tasks.
  • the second RCNN can also be called a single-head multi-task RCNN.
  • the second RCNN is used as the classification and regression network, and multiple tasks share the hidden layer of the second RCNN, which further reduces the amount of parameters and calculation of the perception network, and improves the processing efficiency.
  • the output of the hidden layer in the first RCNN needs to be input to all sub-classification fully connected layers and sub-regression fully connected layers for multiple matrix operations, whereas the output of the hidden layer in the second RCNN only needs to be input to the classification fully connected layer and the regression fully connected layer for the matrix operations. In this way, the number of matrix operations can be further reduced, which is more hardware-friendly, further reduces the time consumed by the operations, and improves the processing efficiency.
  • the classification fully connected layer is obtained by merging the multiple sub-classification fully connected layers in the first RCNN, and the regression fully connected layer is obtained by merging the multiple sub-regression fully connected layers in the first RCNN.
  • wherein the first RCNN includes a hidden layer, multiple sub-classification fully connected layers, and multiple sub-regression fully connected layers; the hidden layer is connected to the multiple sub-classification fully connected layers and to the multiple sub-regression fully connected layers, the sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks; a sub-classification fully connected layer is used to obtain the confidence that the candidate 2D frame belongs to an object category in the task corresponding to that sub-classification fully connected layer; and a sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.
  • the multiple sub-classification fc and sub-regression fc in the first RCNN are merged, and the second RCNN is used as the classification and regression network, which can further reduce the number of matrix operations, is more hardware-friendly, further reduces the time consumed by the operations, and improves the processing efficiency.
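The merge of the per-task fc layers into one classification fc (and, analogously, one regression fc) can be pictured as concatenating their weight matrices and biases along the output dimension, so a single matrix multiplication replaces several. The sketch below assumes PyTorch; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_fc_layers(sub_fcs):
    """Concatenate several nn.Linear layers sharing the same input size into one."""
    weight = torch.cat([fc.weight for fc in sub_fcs], dim=0)   # (sum_out, in)
    bias = torch.cat([fc.bias for fc in sub_fcs], dim=0)       # (sum_out,)
    merged = nn.Linear(weight.shape[1], weight.shape[0])
    merged.weight.copy_(weight)
    merged.bias.copy_(bias)
    return merged

# The merged layer reproduces the outputs of the separate per-task fc layers.
sub_cls_fcs = [nn.Linear(256, 3), nn.Linear(256, 6), nn.Linear(256, 4)]
merged_cls_fc = merge_fc_layers(sub_cls_fcs)
x = torch.randn(5, 256)
separate = torch.cat([fc(x) for fc in sub_cls_fcs], dim=1)
assert torch.allclose(merged_cls_fc(x), separate, atol=1e-6)
```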
  • In a second aspect, a training method for a perception network is provided. The perception network includes a candidate region generation network (RPN), where the RPN is used to predict the position information of candidate two-dimensional (2D) frames of a target object in a sample image, the target object includes the objects to be detected in multiple tasks, each of the multiple tasks includes at least one category, and the target object includes a first task object and a second task object. The method includes: acquiring training data, where the training data includes a sample image, annotation data of the first task object on the sample image, and a pseudo frame of the second task object on the sample image.
  • the annotation data includes the class label of the first task object and the annotated 2D frame of the first task object.
  • the pseudo frame of the second task object is the target 2D frame of the second task object obtained by another perception network performing inference on the sample image; the perception network is then trained based on the training data.
  • the annotation data may be partial annotation data, that is, it only includes the annotation data of the first task object. If the perception network is trained based on such annotation data alone, then, since multiple tasks share one RPN, the training data of different tasks may suppress each other when the RPN is trained. Specifically, since the annotation data is partial, for example, only the objects to be detected for one task are annotated on a sample image, when the annotation data of that task is used for training, the parameters of the RPN are adjusted so that the RPN can more accurately predict the candidate 2D frames of the objects to be detected for that task, but the adjusted RPN may no longer be able to accurately predict the candidate 2D frames of the objects to be detected for the other tasks on the sample image. In this way, the training data of different tasks may suppress each other, causing the RPN to fail to predict all the target objects in the image.
  • In the embodiments of the present application, the perception network is jointly trained based on the pseudo frames and the annotation data. In the case where the annotation data only includes the annotation data of the first task object, that is, in the case of partial annotation data, the pseudo frames of the second task object provide a more comprehensive set of frames of the objects to be detected on the same sample image as the target output of the RPN. The parameters of the RPN are adjusted so that the output of the RPN constantly approaches this target data, which avoids mutual suppression between different tasks, helps the RPN obtain more comprehensive and accurate candidate 2D frames, and improves the recall rate.
  • the labeling data of the sample images in the embodiments of the present application may be partial labeling data, so that targeted collection can be performed, that is, the required sample images are collected for specific tasks, and there is no need to mark the objects to be detected for all tasks in each sample image. It reduces the cost of data collection and the cost of labeling, which is conducive to balancing the training data of different tasks.
  • the scheme using partial annotation data is flexibly scalable: when a task is added, it is only necessary to provide the annotation data of the new task, and there is no need to annotate the newly added objects to be detected in the original training data.
  • the first task objects may include objects to be detected in one or more tasks.
  • the one or more tasks are the tasks where the first task object is located.
  • the first task objects in different sample images in the training set may be the same or different.
  • the second task objects may include objects to be detected in one or more tasks.
  • the one or more tasks are the tasks where the second task object is located.
  • the same object to be detected may exist in the second task object and the first task object. That is to say, the first task object and the second task object may have overlapping objects to be detected, and the first task object and the second task object may also be completely different.
  • the second task objects in different sample images in the training set can be the same or different.
  • Other perceptual networks refer to other perceptual networks than the one to be trained.
  • the other perception networks may be a multi-head multi-task perception network, multiple single-task perception networks, or the like.
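One way to picture how the pseudo frames enter the training data: a sample image annotated only for the first task is run through the other, already trained perception network(s), and the target 2D frames they output for the second task objects are stored alongside the annotation data. The sketch below is a hedged illustration; `StubTeacher`, `infer_boxes` and the sample structure are invented for this example and are not interfaces from the patent.

```python
import torch

class StubTeacher:
    """Stands in for an already trained perception network covering the second task."""
    def eval(self):
        return self
    @torch.no_grad()
    def infer_boxes(self, image):
        # A real teacher network would run inference here; a fixed box is returned
        # purely so the example is self-contained.
        return torch.tensor([[20.0, 30.0, 120.0, 180.0]])

def build_training_sample(image, first_task_annotations, teacher_nets):
    """Combine real annotations (first task) with pseudo frames (second task)."""
    pseudo = [net.eval().infer_boxes(image) for net in teacher_nets]
    pseudo_boxes = torch.cat(pseudo, dim=0) if pseudo else torch.empty(0, 4)
    return {"image": image,
            "annotations": first_task_annotations,  # drives both RPN and RCNN losses
            "pseudo_boxes": pseudo_boxes}            # extra RPN targets only

sample = build_training_sample(
    torch.randn(3, 256, 256),
    {"boxes": torch.tensor([[5.0, 5.0, 60.0, 60.0]]), "labels": torch.tensor([1])},
    teacher_nets=[StubTeacher()])
```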
  • the perception network further includes a backbone network, a region of interest extraction module, and a classification and regression network, and training the perception network based on the training data includes: calculating a first loss function value according to the difference between the annotated 2D frame of the first task object and the target 2D frame of the second task object, on the one hand, and the candidate 2D frames of the target object in the sample image predicted by the RPN, on the other hand; calculating a second loss function value of the perception network according to the annotation data; and backpropagating the first loss function value and the second loss function value to adjust the parameters of the parts of the perception network that need to be trained, where the parts of the perception network that need to be trained include the part to be trained in the classification and regression network, the region of interest extraction module, the RPN, and the backbone network, and the part to be trained in the classification and regression network is determined according to the first task object.
  • the labeled 2D frame of the first task object and the target 2D frame of the second task object are compared with the candidate 2D frame of the target object predicted by RPN, and the loss function value of the RPN stage is obtained, that is, the first loss function value.
  • the annotation data of the sample image is compared with the output result of the classification and regression network to obtain the loss function value, at the classification and regression network stage, of the task to which the first task object belongs, that is, the second loss function value.
  • the parameters related to the first loss function value are the parameters in the perception network used in the process of obtaining the first loss function value, for example, the parameters of the backbone and the parameters of the RPN. Further, in the case where the perception network includes an FPN, the parameters related to the first loss function value also include the parameters of the FPN.
  • the gradients of the parameters related to the second loss function value are calculated, and then those parameters are adjusted based on the gradients to adjust the perception network, so that the classification and regression network can better correct the output 2D frames and improve the accuracy of category prediction.
  • the parameters related to the second loss function value are the parameters in the perception network used in the process of calculating the second loss function value, for example, the parameters of the backbone, the parameters of the RPN, the parameters of the region of interest extraction module, and the parameters of the part of the classification and regression network that needs to be trained. Further, in the case where the perception network includes an FPN, the parameters related to the second loss function value also include the parameters of the FPN.
  • the parameters related to the second loss function are the parameters of the part of the perception network that needs to be trained.
  • the parts of the perception network shared by different tasks all participate in the training process based on the annotation data of the different tasks, which enables these shared parts to learn the features common to all tasks. The parts of the perception network corresponding to different tasks, for example, the parts of the classification and regression network corresponding to each task, only participate in the training process based on the annotation data of their respective tasks, which enables these task-specific parts to learn their task-specific features and improves the accuracy of the model.
  • the part of the classification and regression network that needs to be trained is determined according to the task, and the parts of the classification and regression network corresponding to different tasks do not affect each other during the training process, which ensures the independence of each task and gives the model strong flexibility.
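A hedged sketch of one training step as described above: the first loss compares the RPN's candidate frames with the union of annotated 2D frames and pseudo frames, the second loss is computed only from the annotation data of the first task, and because the second loss only touches the head of the annotated task, backpropagation updates exactly the shared parts plus that task's head. The network output structure and the loss functions (`rpn_loss_fn`, `rcnn_loss_fn`) are placeholders, not the patent's definitions.

```python
import torch

def train_step(net, optimizer, sample, rpn_loss_fn, rcnn_loss_fn):
    """One step of joint training with partial annotations and pseudo frames."""
    ann = sample["annotations"]
    # RPN targets: annotated 2D frames of the first task + pseudo frames of the second.
    rpn_targets = torch.cat([ann["boxes"], sample["pseudo_boxes"]], dim=0)

    out = net(sample["image"].unsqueeze(0))   # assumed to return proposals and per-task heads
    loss_rpn = rpn_loss_fn(out["proposals"], rpn_targets)                    # first loss
    task = int(ann["task_id"])                                               # annotated task
    loss_rcnn = rcnn_loss_fn(out["scores"][task], out["deltas"][task], ann)  # second loss

    optimizer.zero_grad()
    (loss_rpn + loss_rcnn).backward()
    # Gradients from loss_rcnn reach only the sub-fc pair of the annotated task;
    # the shared backbone/FPN/RPN/ROI-extraction/hidden layer receive gradients
    # from both losses. optimizer.step() therefore updates exactly those parts.
    optimizer.step()
    return loss_rpn.item(), loss_rcnn.item()
```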
  • the backbone network is used to perform convolution processing on the sample image and output the first feature map of the sample image; the RPN is used to output, based on the second feature map, the position information of the candidate 2D frames of the target object; the region of interest extraction module is used to extract the first feature information on the third feature map based on the position information of the candidate 2D frames, where the first feature information is the feature of the area where a candidate 2D frame is located, and the third feature map is determined according to the first feature map; and the classification and regression network is used to process the first feature information and output the target 2D frames of the target object and the first indication information, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate the category to which the target object belongs.
  • the classification and regression network includes a first regional convolutional neural network RCNN
  • the first RCNN includes a hidden layer, multiple sub-classification fully connected layers and multiple sub-regression fully connected layers
  • the hidden layer is connected with multiple sub-classification fully connected layers
  • the hidden layer is connected with multiple sub-regression fully connected layers
  • multiple sub-classification fully connected layers are in one-to-one correspondence with multiple tasks
  • multiple sub-regression fully connected layers are in one-to-one correspondence with multiple tasks
  • the hidden layer is used to process the first feature information to obtain the second feature information
  • the sub-classification fully connected layer is used to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to an object category in the task corresponding to that sub-classification fully connected layer.
  • the sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the second feature information to obtain the adjusted candidate 2D frame; and the part to be trained in the classification and regression network includes the hidden layer and the sub-classification fully connected layer and sub-regression fully connected layer corresponding to the task where the first task object is located.
  • the parts of the perception network shared by different tasks, that is, the backbone network, the RPN, the region of interest extraction module, and the hidden layer of the classification and regression network, all participate in the training based on the annotation data of the different tasks, so that the parts shared by different tasks in the perception network can learn the features common to all tasks.
  • the parts of the perception network corresponding to different tasks, that is, the sub-classification fully connected layer and the sub-regression fully connected layer corresponding to each task in the classification and regression network, only participate in the training based on the annotation data of their respective tasks, so that these task-specific parts of the perception network can learn their task-specific features, which improves the accuracy of the model.
  • In a third aspect, an object recognition method is provided, which is applied to a perception network including a backbone network, a candidate region generation network (RPN), a region of interest extraction module, and a classification and regression network.
  • the method includes: using the backbone network to perform convolution processing on an input image to obtain the first feature map of the input image; using the RPN to output, based on the second feature map, the position information of the candidate two-dimensional (2D) frames of the target object, where the target object includes the objects to be detected in multiple tasks, each of the multiple tasks includes at least one category, and the second feature map is determined according to the first feature map; using the region of interest extraction module to extract the first feature information on the third feature map based on the position information of the candidate 2D frames, where the first feature information is the feature of the area where a candidate 2D frame is located, and the third feature map is determined according to the first feature map; and using the classification and regression network to process the first feature information to obtain the target 2D frames of the target object and the first indication information, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate the category to which the target object belongs.
  • In this way, one perception network is used to complete a variety of perception tasks, and the multiple tasks share one RPN, that is, one RPN predicts the regions where the objects to be detected in the multiple tasks are located, while the performance of the perception network is ensured.
  • the processing of the first feature information by the classification and regression network to obtain the target 2D frame of the target object and the first indication information includes: using the classification and regression network to process the first feature information to obtain the confidence that the candidate 2D frame belongs to each category in the multiple tasks; adjusting the position information of the candidate 2D frame with the classification and regression network to obtain the adjusted candidate 2D frame; determining the target 2D frame according to the adjusted candidate 2D frame; and determining the first indication information according to the confidence that the target 2D frame belongs to each category.
  • the classification and regression network includes a first regional convolutional neural network RCNN, and the first RCNN includes a hidden layer, multiple sub-classification fully connected layers and multiple sub-regression fully connected layers,
  • the hidden layer is connected with multiple sub-classification fully connected layers
  • the hidden layer is connected with multiple sub-regression fully connected layers
  • multiple sub-classification fully connected layers are in one-to-one correspondence with multiple tasks
  • multiple sub-regression fully connected layers are in one-to-one correspondence with multiple tasks
  • using the classification and regression network to process the first feature information and output the target 2D frame of the target object and the first indication information includes: using the hidden layer to process the first feature information to obtain the second feature information; using the sub-classification fully connected layer to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to an object category in the task corresponding to that sub-classification fully connected layer; and using the sub-regression fully connected layer to adjust the position information of the candidate 2D frame according to the second feature information to obtain the adjusted candidate 2D frame.
  • the classification and regression network includes a second RCNN
  • the second RCNN includes a hidden layer, a classification fully connected layer and a regression fully connected layer
  • the hidden layer is connected with the classification fully connected layer
  • the hidden layer is connected with the regression fully connected layer
  • using the classification and regression network to process the first feature information and output the target 2D frame of the target object and the first indication information includes: using the hidden layer to process the first feature information to obtain the third feature information; using the classification fully connected layer to obtain, according to the third feature information, the confidence that the candidate 2D frame belongs to each category; and using the regression fully connected layer to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.
  • the classification fully connected layer is obtained by merging the multiple sub-classification fully connected layers in the first RCNN, and the regression fully connected layer is obtained by merging the multiple sub-regression fully connected layers in the first RCNN.
  • wherein the first RCNN includes a hidden layer, multiple sub-classification fully connected layers, and multiple sub-regression fully connected layers; the hidden layer is connected to the multiple sub-classification fully connected layers and to the multiple sub-regression fully connected layers, the sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks; a sub-classification fully connected layer is used to obtain the confidence that the candidate 2D frame belongs to an object category in the task corresponding to that sub-classification fully connected layer; and a sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.
  • the target perception network can be obtained by using the training method for a perception network in the second aspect.
  • the target perception network may be a trained image recognition model, and the trained image recognition model can be used to process the image to be processed.
  • an apparatus for training a perceptual network includes a module or unit for performing the method in the second aspect and any one of the implementation manners of the second aspect.
  • an object recognition device comprising a module or unit for executing the method in the third aspect and any one of the implementation manners of the third aspect.
  • an apparatus for training a perception network is provided, comprising: a processor and a transmission interface, where the processor receives or sends data through the transmission interface, and the processor is configured to call program instructions stored in a memory to execute the method in the second aspect and any one of the implementation manners of the second aspect.
  • the processor in the sixth aspect above may be either a central processing unit (CPU), or a combination of a CPU and a neural network computing processor, where the neural network computing processor may include a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and so on.
  • TPU is Google's fully customized artificial intelligence accelerator application-specific integrated circuit for machine learning.
  • an object recognition device is provided, comprising: a processor and a transmission interface, where the processor receives or sends data through the transmission interface, and the processor is configured to call program instructions stored in a memory to execute the method in the third aspect and any one of the implementation manners of the third aspect.
  • the processor in the above seventh aspect can be either a central processing unit, or a combination of a CPU and a neural network computing processor, where the neural network computing processor can include a graphics processing unit, a neural-network processing unit, a tensor processing unit, and so on.
  • TPU is Google's fully customized artificial intelligence accelerator application-specific integrated circuit for machine learning.
  • a computer-readable storage medium is provided, which stores program code for execution by a device; when the program code is run on a computer or a processor, it causes the computer or processor to execute the method in any one of the implementation manners of the second aspect or the third aspect.
  • a ninth aspect provides a computer program product comprising instructions; when the computer program product runs on a computer, it causes the computer to execute the method in any one of the implementation manners of the second aspect or the third aspect.
  • a tenth aspect provides a chip, the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface and executes the method in any one of the implementation manners of the second aspect or the third aspect above.
  • the chip may further include a memory in which instructions are stored, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to execute the method in any one of the implementations of the first aspect or the second aspect.
  • the above chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
  • In an eleventh aspect, an electronic device is provided, which includes the apparatus in any one of the above-mentioned fourth to seventh aspects.
  • FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of another application scenario provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the hardware structure of a chip according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 7 is a schematic block diagram of a multi-headed multi-task perception network
  • FIG. 8 is a schematic structural diagram of a perception network according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of another perception network provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of another perception network provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of another perception network provided by an embodiment of the present application.
  • FIG. 12 is a schematic flowchart of a training method for a perceptual network provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a training process of a perception network provided by an embodiment of the present application.
  • FIG. 14 is a schematic block diagram of a perceptual network in a training process provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of an object recognition process provided by an embodiment of the present application.
  • FIG. 17 is a schematic block diagram of a perception network in an inference process provided by an embodiment of the present application.
  • FIG. 18 is a schematic diagram of a conversion process of a perception network provided by an embodiment of the present application.
  • FIG. 19 is a schematic block diagram of an apparatus provided by an embodiment of the present application.
  • FIG. 20 is a schematic block diagram of another apparatus provided by an embodiment of the present application.
  • the embodiments of the present application can be applied to fields that need to complete various sensing tasks, such as driving assistance, automatic driving, mobile phone terminals, monitoring, and security.
  • the image is input into the perception network of the present application, and the detection result of the object of interest in the image is obtained.
  • the detection results can be input to the post-processing module for processing, for example, sent to the planning control unit for decision-making in the autonomous driving system, or sent to the security system for abnormal situation detection.
  • the detection targets include dynamic obstacles, static obstacles, and traffic signs, such as pedestrians, cyclists, tricycles, cars, trucks, buses, wheels, car lights, traffic cones, traffic sticks, fire hydrants, motorcycles, bicycles, traffic signs, guide signs, billboards, road signs, poles, traffic lights, pavement markings, and so on.
  • Traffic lights include red traffic lights (trafficlight_red), yellow traffic lights (trafficlight_yellow), green traffic lights (trafficlight_green), and black traffic lights (trafficlight_black).
  • Pavement markings include turn-around/straight, left/right, straight-and-left, straight-and-right, straight-and-turn-around, left-turn-and-turn-around, left-and-right, left bend, right bend, and other pavement markings, etc.
  • the detection tasks for all of the above targets can be realized in one perception network, that is, the objects to be detected for multiple tasks can be detected by one perception network, and after processing, the detection results can be sent to the planning control unit for decision-making, such as obstacle avoidance, traffic light decisions, or traffic sign decisions.
  • identifying the images in the album can facilitate the user or the system to classify and manage the album and improve user experience.
  • With the solutions of the embodiments of the present application, it is possible to obtain or optimize a perception network suitable for album picture classification.
  • The perception network can be used to classify pictures, for example, into different categories such as photos containing animals and photos containing people, so as to label pictures of different categories, which is convenient for users to view and find.
  • the classification tags of these pictures can also be provided to the album management system for classification management, which saves the user's management time, improves the efficiency of album management, and enhances the user experience.
  • Monitoring scenarios include: smart city, field monitoring, indoor monitoring, outdoor monitoring, and in-vehicle monitoring.
  • a variety of detection tasks need to be completed in a smart city perception system; for example, vehicles, license plates, people, and faces need to be detected. After processing, the detection results can be used to judge traffic violations, predict traffic congestion, and so on.
  • the input road picture can be processed in a perception network, and the detection tasks of the above-mentioned various targets can be completed.
  • the detection tasks of the perception network can also be increased or decreased according to the actual situation.
  • the current detection tasks of the perception network include vehicle detection tasks and human detection tasks. If the detection task of traffic signs needs to be added to the detection tasks of the perception network, the structure of the perception network can be adjusted to add the detection task. The specific description can be found later, for example, FIG. 14 .
  • a neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes $x_s$ and an intercept of 1 as inputs; the output of the operation unit can be $h_{W,b}(x) = f(W^T x) = f\big(\sum_{s} W_s x_s + b\big)$, where $W_s$ is the weight of $x_s$ and $b$ is the bias of the neural unit.
  • f is an activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting a plurality of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • a deep neural network also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers.
  • the DNN is divided according to the positions of different layers.
  • the neural network inside the DNN can be divided into three categories: input layer, hidden layer, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the middle layers are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • Although a DNN looks complicated, the work of each layer is not complicated. In short, each layer computes the linear relationship expression $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset (bias) vector, $W$ is the weight matrix (also called the coefficients), and $\alpha()$ is the activation function.
  • Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has a large number of layers, the number of coefficient matrices $W$ and offset vectors $\vec{b}$ is also large.
  • These parameters are defined in the DNN as follows. Take the coefficient $W$ as an example: suppose that in a three-layer DNN, the linear coefficient from the fourth neuron in the second layer to the second neuron in the third layer is defined as $W^3_{24}$; the superscript 3 represents the layer where the coefficient $W$ is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.
  • In summary, the coefficient from the $k$-th neuron in the $(L-1)$-th layer to the $j$-th neuron in the $L$-th layer is defined as $W^L_{jk}$.
  • the input layer does not have a W parameter.
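To make the layer expression and the coefficient indexing concrete, here is a small PyTorch sketch with arbitrarily chosen layer widths; `Ws[1][1, 3]` plays the role of the coefficient written $W^3_{24}$ above (0-based indices in code).

```python
import torch

# Arbitrary widths: input layer, layer 2, layer 3, output layer.
widths = [5, 4, 6, 3]
Ws = [torch.randn(o, i) for i, o in zip(widths[:-1], widths[1:])]  # weight matrices W
bs = [torch.randn(o) for o in widths[1:]]                          # offset vectors b

h = torch.randn(widths[0])          # input vector x
for W, b in zip(Ws, bs):
    h = torch.sigmoid(W @ h + b)    # y = alpha(W x + b), applied layer by layer

# Ws[1] maps layer 2 (width 4) to layer 3 (width 6); element [1, 3] links the
# 4th neuron of layer 2 to the 2nd neuron of layer 3, i.e. the W^3_24 of the text.
coeff = Ws[1][1, 3]
```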
  • more hidden layers allow the network to better capture the complexities of the real world.
  • a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vectors W of many layers).
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of convolutional and subsampling layers.
  • the feature extractor can be viewed as a filter, and the convolution process can be viewed as convolution with an input image or a convolutional feature map using a trainable filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • a neuron can only be connected to some of its neighbors.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle.
  • Neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as the way to extract image information is independent of location. The underlying principle is that the statistics of one part of the image are the same as the other parts. This means that image information learned in one part can also be used in another part. So for all positions on the image, the same learned image information can be used.
  • multiple convolution kernels can be used to extract different image information. Generally, the more convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the loss function loss function
  • objective function objective function
  • the training of the deep neural network becomes the process of reducing the loss as much as possible.
  • the smaller the loss the higher the training quality of the deep neural network, and the larger the loss, the lower the training quality of the deep neural network.
  • the smaller the loss fluctuation the more stable the training; the larger the loss fluctuation, the more unstable the training.
  • the neural network can use the error back propagation (BP) algorithm to correct the values of the parameters in the initial model during the training process, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, forward-propagating the input signal until the output produces an error loss, and the parameters in the initial model are updated by back-propagating the error loss information, so that the error loss converges.
  • the back-propagation algorithm is a back-propagation process dominated by the error loss, aiming to obtain the parameters of the optimal model, such as the weight matrices.
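A minimal, generic illustration of this loop (assuming PyTorch; the model, data and learning rate are arbitrary and unrelated to the patent): the forward pass produces a prediction, the loss function measures the gap to the target, and error backpropagation plus a parameter update make the loss shrink.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)                                # stand-in for an initial model
loss_fn = nn.MSELoss()                                 # loss / objective function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, target = torch.randn(32, 8), torch.randn(32, 1)     # toy data
for step in range(100):
    pred = model(x)                                    # forward propagation
    loss = loss_fn(pred, target)                       # how far prediction is from target
    optimizer.zero_grad()
    loss.backward()                                    # error back propagation (BP)
    optimizer.step()                                   # update parameters to reduce the loss
```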
  • the training method for a perceptual network involves the processing of computer vision.
  • data processing methods such as data training, machine learning, and deep learning can be used to perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and so on on the training data, so as to finally obtain a trained perception network. In addition, the object recognition method provided by the embodiments of the present application can use the above-mentioned trained perception network: input data (such as the image to be processed in this application) is input into the trained perception network, and output data (such as the first indication information and the target 2D frame of the target object in this application) are obtained.
  • the perceptual network training method and the object recognition method provided by the embodiments of the present application are based on the same concept, and can also be understood as two parts in a system, or two stages of an overall process: such as the model training stage and model application stage.
  • an embodiment of the present application provides a system architecture 100 .
  • a data collection device 160 is used to collect training data.
  • the training data may include sample images, labeled data of the sample images, and pseudo frames on the sample images.
  • After collecting the training data, the data collection device 160 stores the training data in the database 130, and the training device 120 obtains the target model/rule 101 by training based on the training data maintained in the database 130.
  • the target model/rule 101 can be used to realize the object recognition method of the embodiment of the present application, that is, the image to be processed is input into the target model/rule 101, and the detection result of the object of interest in the image to be processed can be obtained.
  • the training data maintained in the database 130 may not necessarily come from the collection of the data collection device 160, and may also be received from other devices.
  • the training device 120 may not necessarily train the target model/rule 101 completely based on the training data maintained by the database 130, and may also obtain training data from the cloud or other places for model training; the above description should not be construed as a limitation on the embodiments of this application.
  • the target model/rule 101 trained by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 3. The execution device 110 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, and it can also be a server or a cloud.
  • the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices.
  • the user can input data to the I/O interface 112 through the client device 140, and the input data may include the image to be processed.
  • the execution device 110 When the execution device 110 preprocesses the input data, or the calculation module 111 of the execution device 110 performs calculations and other related processing, the execution device 110 can call the data, codes, etc. in the data storage system 150 for corresponding processing , the data and instructions obtained by corresponding processing may also be stored in the data storage system 150 .
  • the I/O interface 112 returns the processing result, such as the detection result obtained above, to the client device 140, thereby providing it to the user.
  • client device 140 may be a planning control unit in an automated driving system.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above-mentioned goals or complete the above-mentioned tasks, thereby providing the user with the desired result.
  • the user can manually specify input data, which can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the user's authorization is required to request the client device 140 to automatically send the input data, the user can set the corresponding permission in the client device 140 .
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data as shown in the figure, and store them in the database 130 .
  • Alternatively, the I/O interface 112 may directly store the input data input into the I/O interface 112 and the output result of the I/O interface 112, as shown in the figure, in the database 130 as new sample data.
  • FIG. 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • For example, in FIG. 3, the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.
  • a target model/rule 101 is obtained by training according to the training device 120 , and the target model/rule 101 may be a perceptual network in this embodiment of the present application.
  • CNN is a very common neural network
  • a convolutional neural network is a deep neural network with a convolutional structure. It is a deep learning architecture.
  • a deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to images fed into it.
  • a convolutional neural network (CNN) 200 may include an input layer 210 , a convolutional/pooling layer 220 (where the pooling layer is optional), and a fully connected layer 230 .
  • the input layer 210 can obtain the image to be processed, and submit the obtained image to be processed by the convolutional layer/pooling layer 220 and the fully connected layer 230 for processing, and the processing result of the image can be obtained.
  • the internal layer structure of the CNN 200 in FIG. 4 is described in detail below.
  • the convolutional layer/pooling layer 220 may include layers 221-226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or it can be used as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 221 may include many convolution operators.
  • the convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, which is usually predefined. During the convolution operation on the image, the weight matrix is usually moved one pixel at a time (or two pixels at a time, depending on the value of the stride) along the horizontal direction on the input image, so as to complete the work of extracting specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • each weight matrix is stacked to form the depth dimension of the convolutional image, where the dimension can be understood as determined by the "multiple" described above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to extract unwanted noise in the image.
  • the multiple weight matrices have the same size (row × column), the convolution feature maps extracted by the multiple weight matrices of the same size also have the same size, and the multiple extracted convolution feature maps of the same size are then combined to form the output of the convolution operation.
  • the weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions.
  • the shallow convolutional layers (such as 221) often extract more general features, which can also be called low-level features; as the depth of the convolutional neural network 200 increases, the features extracted by the later convolutional layers (such as 226) become increasingly complex, such as high-level semantic features, and features with higher-level semantics are more suitable for the problem to be solved.
  • in the layers 221 to 226, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a certain range to produce an average value as the result of average pooling.
  • the max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
  • after being processed by the convolutional layer/pooling layer 220, the convolutional neural network 200 is still not sufficient to output the required output information, because, as mentioned before, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate one output or a set of outputs whose number equals the number of required classes. Therefore, the fully connected layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 4) and an output layer 240; the parameters contained in the multiple hidden layers may be pre-trained on relevant training data for a specific task type, for example, image recognition, image classification, image super-resolution reconstruction and so on.
  • after the multiple hidden layers in the fully connected layer 230, the last layer of the entire convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to the classification cross entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 200 is completed (as shown in FIG. 4, the propagation from the 210 to 240 direction is forward propagation), the back propagation (as shown in FIG. 4, the propagation from the 240 to 210 direction is back propagation) starts to update the weight values of the aforementioned layers, so as to reduce the prediction error of the convolutional neural network 200.
  • the convolutional neural network shown in FIG. 4 is only used as an example of a possible convolutional neural network, and in specific applications, the convolutional neural network may also exist in the form of other network models.
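  • As an illustrative, non-limiting sketch of the layered structure described above (input layer 210, convolutional/pooling layers 220, fully connected layers 230 and output layer 240), a convolutional neural network of this kind could be written as follows; the channel counts, layer sizes and class number are assumptions for the example only, not part of the embodiment:

    import torch
    import torch.nn as nn

    class TinyCNN(nn.Module):
        # A minimal stand-in for CNN 200: conv/pool stages (220) followed by fully connected layers (230).
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(                      # layers 221-226: convolution and pooling
                nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                                 # pooling reduces the spatial size
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.classifier = nn.Sequential(                     # fully connected layers 230 and output 240
                nn.Flatten(),
                nn.Linear(32 * 56 * 56, 128), nn.ReLU(),
                nn.Linear(128, num_classes),
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    logits = TinyCNN()(torch.randn(1, 3, 224, 224))              # input layer 210 receives the image to be processed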
  • FIG. 5 is a hardware structure of a chip provided by an embodiment of the application, and the chip includes a neural network processor 50 .
  • the chip can be set in the execution device 110 as shown in FIG. 3 to complete the calculation work of the calculation module 111 .
  • the chip can also be set in the training device 120 as shown in FIG. 3 to complete the training work of the training device 120 and output the target model/rule 101 .
  • the methods in the embodiments of the present application may be implemented in the chip as shown in FIG. 5 .
  • the neural network processor NPU 50 is mounted on the main central processing unit (CPU) (host CPU) as a coprocessor, and tasks are allocated by the main CPU.
  • the core part of the NPU is the operation circuit 503, and the controller 504 controls the operation circuit 503 to extract the data in the memory (weight memory or input memory) and perform operations.
  • the arithmetic circuit 503 includes multiple processing units (process engines, PEs). In some implementations, arithmetic circuit 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 503 is a general-purpose matrix processor.
  • the operation circuit fetches the data corresponding to the matrix B from the weight memory 502 and buffers it on each PE in the operation circuit.
  • the arithmetic circuit fetches the data of matrix A and matrix B from the input memory 501 to perform matrix operation, and stores the partial result or final result of the matrix in an accumulator 508 .
  • the vector calculation unit 507 can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector computing unit 507 can be used for network computation of non-convolutional/non-FC layers in the neural network, such as pooling, batch normalization (BN), local response normalization, and the like.
  • vector computation unit 507 can store the processed output vectors to unified buffer 506 .
  • the vector calculation unit 507 may apply a nonlinear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate activation values.
  • vector computation unit 507 generates normalized values, merged values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 503, eg, for use in subsequent layers in a neural network.
  • the operation of the perceptual network provided by the embodiment of the present application may be performed by the operation circuit 503 or the vector calculation unit 507 .
  • Unified memory 506 is used to store input data and output data.
  • the storage unit access controller 505 (direct memory access controller, DMAC) is used to transfer the input data in the external memory to the input memory 501 and/or the unified memory 506, store the weight data in the external memory into the weight memory 502, and store the data in the unified memory 506 into the external memory.
  • a bus interface unit (BIU) 510 is used to realize the interaction between the main CPU, the DMAC and the instruction fetch memory 509 through the bus.
  • the instruction fetch memory (instruction fetch buffer) 509 connected with the controller 504 is used to store the instructions used by the controller 504;
  • the controller 504 is used for invoking the instructions cached in the memory 509 to control the working process of the operation accelerator.
  • the unified memory 506, the input memory 501, the weight memory 502 and the instruction fetch memory 509 are all on-chip (On-Chip) memories, and the external memory is the memory outside the NPU, and the external memory can be double data rate synchronous dynamic random access Memory (double data rate synchronous dynamic random access memory, DDR SDRAM), high bandwidth memory (high bandwidth memory, HBM) or other readable and writable memory.
  • the execution device 110 in FIG. 3 or the chip in FIG. 5 described above can execute each step of the object recognition method of the embodiment of the present application.
  • the training device 120 in FIG. 3 or the chip in FIG. 5 described above can perform various steps of the training method for the perceptual network according to the embodiment of the present application.
  • an embodiment of the present application provides a system architecture 300 .
  • the system architecture includes a local device 301, a local device 302, an execution device 310 and a data storage system 350, wherein the local device 301 and the local device 302 are connected with the execution device 310 through a communication network.
  • execution device 310 may be implemented by one or more servers.
  • the execution device 310 may be used in conjunction with other computing devices, such as data storage, routers, load balancers and other devices.
  • the execution device 310 may be arranged on one physical site, or distributed across multiple physical sites.
  • the execution device 310 may use the data in the data storage system 350 or call the program code in the data storage system 350 to implement the training method of the perception network in this embodiment of the present application.
  • the perception network includes: a candidate region generation network (RPN), where the RPN is used to predict the position information of the candidate two-dimensional (2D) frame of the target object in the sample image, the target object includes the objects to be detected in multiple tasks, and each of the multiple tasks includes at least one category; the target object includes a first task object and a second task object.
  • the execution device 310 may perform the following processes:
  • the training data includes the sample image, the label data of the first task object on the sample image, and the pseudo frame of the second task object on the sample image; the label data includes the class label of the first task object and the labeled 2D frame of the first task object, and the pseudo frame of the second task object is the target 2D frame of the second task object obtained by inferring the sample image through other perceptual networks; the perceptual network is trained based on the training data.
  • a perception network can be acquired, and the perception network can be used for detection of various tasks.
  • a user may operate respective user devices (eg, local device 301 and local device 302 ) to interact with execution device 310 .
  • Each local device can represent any computing device, such as a surveillance camera, a personal computer, a computer workstation, a smartphone, a tablet, a smart camera, a smart car or other type of cellular phone, a media consumption device, a wearable device, a set-top box, a game console, etc.
  • Each user's local device can interact with the execution device 310 through any communication mechanism/standard communication network, which can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • the local device 301 and the local device 302 obtain the relevant parameters of the perception network from the execution device 310, deploy the perception network on the local device 301 and the local device 302, and use the perception network to detect objects.
  • a perceptual network may be directly deployed on the execution device 310, and the execution device 310 obtains the image to be processed from the local device 301 and the local device 302, and uses the perceptual network to process the image to be processed.
  • the above execution device 310 may also be a cloud device, in which case the execution device 310 may be deployed in the cloud; or the above execution device 310 may also be a terminal device, in which case the execution device 310 may be deployed on the user terminal side, which is not limited in this embodiment of the present application.
  • the perception network can be deployed on a computing node on a vehicle-mounted visual perception device, a safe city perception device, or a security perception device to process the image to be processed to obtain the detection result of the object of interest in the image to be processed.
  • the computing node may be the execution device 110 in FIG. 3, the execution device 310 in FIG. 6, a local device, or the like.
  • the multi-header multi-task perception network includes a backbone network (backbone) and multiple headers, and each header includes a region proposal network (RPN), a Region of interest align (ROI-Align) module and region convolutional neural networks (RCNN).
  • Chips used in many fields have low computing power, making it difficult to deploy large-scale sensor networks, and even more difficult to deploy multiple sensor networks.
  • the embodiments of the present application provide a perception network, which can reduce the amount of parameters and computation in the perception network, reduce the power consumption of hardware, and improve the running speed of the model.
  • FIG. 8 shows a schematic diagram of a perception network in an embodiment of the present application.
  • the perception network 800 in FIG. 8 includes a backbone network (backbone) 810 and a head end (header).
  • the perception network in the embodiments of the present application may be implemented by hardware, software, or a combination of software and hardware.
  • the backbone network 810 is configured to perform convolution processing on the input image to obtain the first feature map of the input image.
  • the backbone network 810 can extract basic features through a series of convolution processing to provide corresponding features for subsequent detection.
  • the "first feature map” refers to a feature map (feature map) output by the backbone network.
  • the feature maps output by the backbone network can all be referred to as first feature maps.
  • the backbone network 810 can output feature maps of the input image at different scales.
  • Feature maps at different scales can be understood as the first feature maps of the input image, and these feature maps can provide basic features for subsequent detection.
  • Feature maps at different scales can be understood as feature maps of different resolutions, or in other words, feature maps of different sizes.
  • the backbone network 810 may adopt various forms of networks, for example, a visual geometry group (VGG) network, a residual neural network (Resnet), or an inception network (inception-net, the core structure of GoogLeNet), and the like.
  • the header is used to detect the target object according to the second feature map, and output the target 2-dimensional (2D) frame of the target object and the first indication information.
  • the target objects include objects to be detected in multiple tasks.
  • the second feature map is determined from the first feature map.
  • the first indication information is used to indicate the category to which the target object belongs.
  • the header is used to realize target detection according to the second feature map, and output the target 2D frame of the target object and the first indication information.
  • the first indication information may include confidence that the target object belongs to each category. That is, the category to which the target object belongs can be indicated by the confidence of the target object belonging to each category. The higher the confidence, the greater the probability that the target object belongs to the category corresponding to the confidence. For example, the category corresponding to the highest confidence is the category to which the target object belongs.
  • the first indication information may be a category to which the target object belongs.
  • the first indication information may include the confidence level of the category to which the target object belongs.
  • The categories here include the object categories in the multiple tasks. This embodiment of the present application does not limit the specific form of the first indication information.
  • a header can complete the detection of objects to be detected in various tasks, that is, it is used to detect whether there are objects to be detected in the various tasks in the input image.
  • a task can be understood as a broad category, and a broad category includes at least one category; in other words, a broad category is a collection of at least one category.
  • Task division criteria can be set as needed. For example, the objects to be detected are divided into multiple tasks according to the similarity of the objects to be detected.
  • the 31 types of objects to be detected are divided into 8 categories, namely 8 tasks, as shown in Table 1.
  • a header can be used to complete a variety of object detection tasks. For example, a header can complete the 8 tasks in Table 1 above, and output the target 2D frame of the target object and the confidence that the target object belongs to the 31 types of objects.
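  • For illustration, such a partition of categories into tasks can be represented as a simple mapping; only the vehicle task (cars, trucks, buses) and the wheel/light task are named explicitly in this description, so the remaining entries in the sketch below are hypothetical placeholders, not the actual content of Table 1:

    # Hypothetical partition of detection categories into tasks (broad categories).
    # Only "vehicle" and "wheel_light" are taken from the description; the rest are placeholders.
    TASKS = {
        "vehicle": ["car", "truck", "bus"],
        "wheel_light": ["wheel", "light"],
        # ... six further tasks covering the remaining categories (31 categories in total)
    }
    num_categories = sum(len(cats) for cats in TASKS.values())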
  • the perception network 800 may further include other processing modules connected to the header.
  • Other processing modules are used to obtain other detection information of the target object according to the target 2D frame of the target object output by the header.
  • for example, the other processing modules can extract, from the feature map output by the backbone network, the features of the region where the target 2D frame output by the header is located, and complete 3D detection, keypoint detection or the like of the target object in the target 2D frame according to the extracted features.
  • the header is described in detail below.
  • the header includes an RPN 820 , a region of interest extraction module 830 and a classification and regression network 840 .
  • RPN820 is used to predict the area where the target object is located on the second feature map, and output the position information of the candidate 2D frame matching the area where the target object is located, that is, the position information of the candidate 2D frame of the target object.
  • the target object includes objects to be detected in multiple tasks, each of the multiple tasks includes at least one category, and the second feature map is determined according to the first feature map.
  • the region of interest extraction module 830 is used to extract the first feature information on the third feature map based on the position information of the candidate 2D frame; the first feature information is the feature of the region where the candidate 2D frame is located, and the third feature map is determined according to the first feature map.
  • the classification and regression network 840 is used to process the first feature information and output the target 2D frame of the target object and the first indication information; the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate the category to which the target object belongs.
  • the classification and regression network can output the target 2D box (box) and class label (label) of the object to be detected in multiple tasks.
  • the class label of the target object can be used as the first indication information. It should be understood that the use of the class label as the first indication information in FIG. 8 is only an example, and does not constitute a limitation to the solutions of the embodiments of the present application.
  • RPN may also be referred to as a single-head multi-task RPN.
  • the RPN 820 can predict the regions where the target object may exist on the second feature map, and give boxes that match those regions. These regions can be called candidate regions (proposals), and the boxes that match the candidate regions are the candidate 2D boxes. The box that matches a proposal can also be called the 2D box of the proposal.
  • the target object includes the objects to be detected in multiple tasks, for example, the objects to be detected in the 8 tasks in Table 1, and the RPN 820 is used to predict the regions where the objects to be detected in the 8 tasks may exist.
  • the target objects may include objects to be detected in all tasks of the perception network. That is, RPN can be used to predict the region of the object to be detected in all tasks that may exist on the second feature map. In other words, all tasks of the perception network share the same RPN.
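  • A minimal sketch of such a single shared RPN head is given below, assuming a conventional anchor-based design; the channel counts and anchor number are assumptions, not part of the embodiment:

    import torch.nn as nn

    class SharedRPNHead(nn.Module):
        # One RPN shared by all tasks: it only separates "object of any task" from background
        # and regresses a coarse candidate 2D box for each anchor.
        def __init__(self, in_channels=256, num_anchors=3):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
            self.objectness = nn.Conv2d(in_channels, num_anchors, 1)       # object vs. background score
            self.box_deltas = nn.Conv2d(in_channels, num_anchors * 4, 1)   # offsets of the candidate 2D box

        def forward(self, feature_map):                                     # the "second feature map"
            x = self.conv(feature_map).relu()
            return self.objectness(x), self.box_deltas(x)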
  • the second feature map may be one or multiple.
  • the second feature map may include one or more of the first feature map.
  • the perceptual network 800 further includes feature pyramid networks (FPN).
  • FPN feature pyramid networks
  • the FPN is connected to the backbone 810, and is used to perform feature fusion on the feature map output by the backbone 810, that is, perform feature fusion on the first feature map of the input image, and output the fused feature map.
  • the fused feature map is input into the RPN.
  • the second feature map may include one or more of the fused feature maps.
  • FPN takes the feature maps of different scales output by the backbone 810 as input, and through the internal vertical (top-down) feature fusion of the FPN and the horizontal feature fusion with the same layer of the backbone 810, generates feature maps with stronger expressive ability and provides them to the subsequent modules, thereby improving the performance of the model.
  • FPN can be used to achieve multi-scale feature fusion.
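  • A minimal sketch of this kind of multi-scale feature fusion is shown below, assuming the usual lateral 1x1 convolutions and top-down upsampling; the channel counts are assumptions:

    import torch.nn as nn
    import torch.nn.functional as F

    class TinyFPN(nn.Module):
        # Fuses backbone feature maps of different scales: 1x1 lateral convs align channels,
        # then each level is fused with the upsampled (top-down) level above it.
        def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
            super().__init__()
            self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
            self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                        for _ in in_channels)

        def forward(self, feats):              # feats: first feature maps, high resolution first
            laterals = [l(f) for l, f in zip(self.lateral, feats)]
            for i in range(len(laterals) - 2, -1, -1):
                laterals[i] = laterals[i] + F.interpolate(laterals[i + 1],
                                                          size=laterals[i].shape[-2:], mode="nearest")
            return [s(x) for s, x in zip(self.smooth, laterals)]   # fused feature maps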
  • the backbone 810 is connected to the RPN 820.
  • the region of interest extraction module 830 is configured to extract, from the third feature map, the feature of the region where the candidate 2D frame is located, according to the candidate 2D frame output by the RPN 820.
  • the third feature map is determined according to the first feature map, including:
  • the third feature map may be one of the feature maps output by the backbone (ie, the first feature map) or one of the fused feature maps output by the FPN;
  • the third feature map may be one of the feature maps (ie, the first feature map) output by the backbone.
  • the region of interest extraction module 830 extracts the features of the region where each proposal is located from a certain feature map output by the backbone or FPN according to the proposals provided by the RPN 820, and resizes them to a fixed size to obtain the features of each proposal.
  • the region of interest extraction module 830 may adopt region of interest pooling (ROI-pooling), region of interest extraction (ROI-Align), position sensitive region of interest pooling (position sensitive ROI pooling, PS-ROIPOOLING) ) or position sensitive ROI align (PS-ROIALIGN) and other feature extraction methods.
  • for example, the region of interest extraction module 830 uses interpolation and sampling in the region where the proposal is located to extract features of a fixed resolution, and inputs the extracted features into the subsequent modules.
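  • For illustration only, the torchvision roi_align operator can play the role of such a region of interest extraction module; the feature-map stride, output resolution and box coordinates below are assumptions for the example:

    import torch
    from torchvision.ops import roi_align

    feature_map = torch.randn(1, 256, 50, 80)                 # third feature map (stride 16 is assumed)
    proposals = [torch.tensor([[ 40.,  60., 200., 220.],
                               [300., 100., 420., 260.]])]    # candidate 2D boxes from the RPN, image coordinates
    roi_features = roi_align(feature_map, proposals,
                             output_size=(7, 7),              # every proposal is resized to a fixed 7x7 resolution
                             spatial_scale=1.0 / 16,          # maps image coordinates onto the feature map
                             sampling_ratio=2)
    # roi_features: (num_proposals, 256, 7, 7), fed to the classification and regression network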
  • the classification and regression network 840 is specifically configured to: process the first feature information to obtain the confidence that the candidate 2D frame belongs to each category in the multiple tasks; adjust the position information of the candidate 2D frame to obtain the adjusted candidate 2D frame; determine the target 2D frame according to the adjusted candidate 2D frame; and determine the first indication information according to the confidence that the target 2D frame belongs to each category.
  • the position information of the candidate 2D frame is adjusted so that the adjusted candidate 2D frame matches the shape of the actual object more closely than the candidate 2D frame, that is, the adjusted candidate 2D frame is a more compact candidate 2D frame.
  • a frame merging operation is performed on the adjusted candidate 2D frame to obtain the target 2D frame.
  • the adjusted 2D boxes are merged with non-maximum suppression (NMS) to obtain the target 2D boxes.
  • for example, the classification and regression network 840 refines each proposal provided by the region of interest extraction module 830 to obtain the confidence that each proposal belongs to the 31 categories in the 8 tasks, and at the same time adjusts the coordinates of the 2D box of each proposal to obtain the adjusted candidate 2D boxes. Further, after the adjusted candidate 2D frames are merged by NMS, the target 2D frame and the first indication information are obtained. The number of candidate 2D boxes is greater than or equal to the number of target 2D boxes.
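  • A minimal sketch of the frame merging step using non-maximum suppression is shown below; the score and IoU thresholds are assumptions:

    import torch
    from torchvision.ops import nms

    def merge_boxes(adjusted_boxes, scores, iou_thr=0.5, score_thr=0.05):
        # Keep only sufficiently confident adjusted candidate 2D boxes, then suppress duplicates.
        keep = scores >= score_thr
        boxes, scores = adjusted_boxes[keep], scores[keep]
        kept = nms(boxes, scores, iou_thr)
        return boxes[kept], scores[kept]      # target 2D boxes (fewer than or equal to the candidates)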
  • the classification and regression network 840 includes multiple third RCNNs, wherein the multiple third RCNNs correspond to multiple tasks one-to-one. That is, each third RCNN separately completes the detection of objects to be detected in different tasks.
  • FIG. 9 shows a schematic block diagram of a cognitive network provided by an embodiment of the present application.
  • the perceptual network includes backbone, FPN, RPN, ROI-Align module, and n third RCNNs.
  • the third RCNN is used to: process the features of the region where the candidate 2D frame is located to obtain the confidence that the candidate 2D frame belongs to the object categories in the task corresponding to the third RCNN; and adjust the position information of the candidate 2D frame to obtain the adjusted candidate 2D frame.
  • any third RCNN among the plurality of third RCNNs can predict the confidence that the candidate 2D frame belongs to the object category in the task corresponding to the third RCNN, and obtain the adjusted candidate 2D frame.
  • the multiple third RCNNs can obtain the confidence of the candidate 2D frame belonging to each category, and the adjusted candidate 2D frame obtained by each third RCNN.
  • further, after frame merging is performed on the adjusted candidate 2D frames, the target 2D frame and the first indication information are obtained.
  • for example, if the task corresponding to the third RCNN 1# is the vehicle detection task in Table 1, the third RCNN 1# outputs the confidence that each proposal belongs to the three categories of cars, trucks and buses, and the adjusted candidate 2D frames.
  • if the task corresponding to the third RCNN 2# is the detection task of wheels and lights in Table 1, the third RCNN 2# outputs the confidence that each proposal belongs to the two categories of wheels and lights, and the adjusted candidate 2D frames. In this way, for any proposal, confidences for a total of five categories and the adjusted candidate 2D frames can be obtained after processing by the third RCNN 1# and the third RCNN 2#.
  • the perceptual network in FIG. 9 is used to implement n tasks, for example, the n tasks include task 0, task 1 . . . task n-1 in FIG. 9 .
  • n is an integer greater than 1.
  • the n third RCNNs correspond to each of the n tasks one-to-one. Taking task 0 as an example, the third RCNN corresponding to task 0 outputs the confidence that each proposal belongs to each object category in task 0 and the adjusted candidate 2D box.
  • the n third RCNNs corresponding to the n tasks obtain the confidence of each object category in each corresponding task, and the classification and regression network can obtain the confidence that each proposal belongs to each category.
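  • A minimal sketch of such per-task third RCNNs (each with its own hidden layer, classification fc and regression fc) is given below; the layer sizes and the example category counts are assumptions:

    import torch.nn as nn

    class ThirdRCNN(nn.Module):
        # One RCNN per task: its own hidden layer, classification fc and regression fc.
        def __init__(self, num_categories, in_features=256 * 7 * 7, hidden=1024):
            super().__init__()
            self.hidden = nn.Sequential(nn.Flatten(), nn.Linear(in_features, hidden), nn.ReLU())
            self.cls_fc = nn.Linear(hidden, num_categories)        # per-category confidences
            self.reg_fc = nn.Linear(hidden, num_categories * 4)    # box adjustments

        def forward(self, roi_features):
            h = self.hidden(roi_features)
            return self.cls_fc(h), self.reg_fc(h)

    # n third RCNNs, one per task (the category counts here are assumptions)
    third_rcnns = nn.ModuleList([ThirdRCNN(3), ThirdRCNN(2)])      # e.g. vehicle task, wheel/light task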
  • the FPN in FIG. 9 is an optional module.
  • the ROI-Align module is used as the region of interest extraction module only as an example, and other methods may also be used to extract corresponding features.
  • in a possible implementation, the classification and regression network includes a first RCNN.
  • the first RCNN includes a hidden layer, multiple sub-classification fully connected layers (classification fully connected layers, cls fc) and multiple sub-regression fully connected layers (regression fully connected layers, reg fc); the hidden layer is connected to the multiple sub-classification fully connected layers and to the multiple sub-regression fully connected layers, the multiple sub-classification fully connected layers correspond to the multiple tasks one-to-one, and the multiple sub-regression fully connected layers correspond to the multiple tasks one-to-one.
  • the first RCNN includes a hidden layer and multiple sub-cls fc and multiple sub-reg fc corresponding to multiple tasks.
  • Each task can have an independent sub-classification fc and sub-regression fc.
  • FIG. 10 shows a schematic block diagram of another cognitive network provided by an embodiment of the present application.
  • the perceptual network includes backbone, FPN, RPN, ROI-Align module and the first RCNN.
  • the hidden layer is used to process the first feature information to obtain the second feature information.
  • the hidden layer is used to process the features of the region where the candidate 2D box is located, and the processed results are respectively input to multiple sub-classification fully connected layers and multiple sub-regression fully connected layers.
  • the hidden layer may include at least one of the following: a convolutional layer or a fully connected layer. Since multiple tasks share the hidden layer, the convolutional layer in the hidden layer can also be called a shared convolutional layer (shared conv), and the fully connected layer in the hidden layer can also be called a shared fully connected layer (shared fc).
  • the sub-category fully-connected layer is used to obtain, according to the second feature information, the confidence level that the candidate 2D frame belongs to the object category in the task corresponding to the sub-category fully-connected layer.
  • the sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the second feature information to obtain the adjusted candidate 2D frame. Further, the sub-regression fully connected layer can use box merging operations, such as NMS operations, to remove duplicate boxes and output more compact candidate 2D boxes.
  • the sub-category fully-connected layer and the sub-regression fully-connected layer corresponding to each task can complete the detection of the object to be detected in the task.
  • the sub-category fully connected layer can output the confidence level that the candidate 2D frame belongs to the object category in the task
  • the sub-regression fully connected layer can output the adjusted candidate 2D box. That is to say, a first RCNN can complete the detection of objects to be detected in multiple tasks.
  • the first RCNN can also be called a single-head multi-task RCNN.
  • the first RCNN can predict the confidence that the candidate 2D frame belongs to the object category in the multiple tasks corresponding to the first RCNN, and obtain the adjusted candidate frame.
  • for example, if the multiple tasks corresponding to the first RCNN include the 8 tasks in Table 1, the first RCNN includes 8 sub-cls fc and 8 sub-reg fc corresponding to the 8 tasks respectively; each sub-cls fc outputs the confidence that each proposal belongs to the object categories in the task corresponding to that sub-cls fc, and each sub-reg fc outputs the adjusted candidate 2D frame, so that the first RCNN can obtain the confidence that each proposal belongs to the 31 categories of objects in the 8 tasks, and the adjusted candidate 2D frames.
  • the perceptual network in FIG. 10 is used to implement n tasks, for example, the n tasks include task 0, task 1 . . . task n-1 in FIG. 10 .
  • n is an integer greater than 1.
  • the first RCNN includes a hidden layer and n sub-cls fc and n sub-reg fc corresponding to n tasks, respectively.
  • Hidden layers can include Shared fc and/or Shared conv.
  • the sub-cls fc corresponding to task 0 in the first RCNN outputs the confidence that each proposal belongs to each object category in task 0, and the sub-reg fc corresponding to task 0 outputs the adjusted candidate 2D frame.
  • the n sub-cls fc corresponding to the n tasks obtain the confidence that each proposal belongs to each object category in the corresponding task, and the first RCNN can obtain the confidence that each proposal belongs to each category.
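  • A minimal sketch of the first RCNN, with one shared hidden layer and per-task sub-classification fc and sub-regression fc, is given below; the layer sizes are assumptions:

    import torch.nn as nn

    class FirstRCNN(nn.Module):
        # Shared hidden layer followed by one sub-cls fc and one sub-reg fc per task.
        def __init__(self, tasks, in_features=256 * 7 * 7, hidden=1024):
            # tasks: dict mapping task name -> number of categories in that task (an assumption).
            super().__init__()
            self.hidden = nn.Sequential(nn.Flatten(), nn.Linear(in_features, hidden), nn.ReLU())
            self.sub_cls = nn.ModuleDict({t: nn.Linear(hidden, n) for t, n in tasks.items()})
            self.sub_reg = nn.ModuleDict({t: nn.Linear(hidden, n * 4) for t, n in tasks.items()})

        def forward(self, roi_features):
            h = self.hidden(roi_features)                          # second feature information, shared by all tasks
            cls = {t: fc(h) for t, fc in self.sub_cls.items()}     # per-task category confidences
            reg = {t: fc(h) for t, fc in self.sub_reg.items()}     # per-task box adjustments
            return cls, reg

    rcnn = FirstRCNN({"vehicle": 3, "wheel_light": 2})             # example task/category counts (assumptions)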
  • the FPN in Figure 10 is an optional module.
  • the ROI-Align module is used as the region of interest extraction module only as an example, and other methods can also be used to extract corresponding features. For the specific description, refer to the foregoing, which will not be repeated here.
  • in another possible implementation, the classification and regression network includes a second RCNN.
  • the second RCNN includes a hidden layer, a classification fully connected layer and a regression fully connected layer; the hidden layer is connected to the classification fully connected layer, and the hidden layer is connected to the regression fully connected layer.
  • FIG. 11 shows a schematic block diagram of yet another cognitive network provided by an embodiment of the present application.
  • the perceptual network includes backbone, FPN, RPN, ROI-Align module, and a second RCNN.
  • the hidden layer is used to process the first feature information to obtain the third feature information.
  • the hidden layer is used to process the features of the region where the candidate 2D box is located, and the processed results are input to the classification fully connected layer and the regression fully connected layer respectively.
  • the hidden layer may include at least one of the following: a convolutional layer or a fully connected layer. For a specific description, refer to the related description of the hidden layer in the first RCNN, which will not be repeated here.
  • the classification fully connected layer is used to obtain the confidence that the candidate 2D frame belongs to each category according to the third feature information.
  • the regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame. Further, a frame merging operation is performed on the adjusted candidate 2D frame to obtain the target 2D frame.
  • a second RCNN completes the detection of objects to be detected in multiple tasks.
  • the second RCNN can also be called a single-head multi-task RCNN.
  • the classification fully connected layer is obtained by merging multiple sub-classification fully connected layers in the first RCNN.
  • the regressive fully connected layer is obtained by merging multiple sub-regressive fully connected layers in the first RCNN.
  • in this case, the third feature information is the same as the second feature information.
  • Combining multiple sub-category fully-connected layers can be understood as splicing the weight matrices of multiple sub-category fully-connected layers.
  • Combining multiple sub-regression fully connected layers can be understood as splicing the weight matrices of multiple sub-regression fully connected layers.
  • the first RCNN can use the sigmoid function to normalize the logits obtained by each sub-classification fc, which is equivalent to performing a binary classification process for each category.
  • therefore, merging the sub-classification fc of the multiple tasks in the model into one classification fc does not affect the inference result of the model; that is, the output of the multiple sub-classification fc is the same as the output of the classification fc obtained by merging the multiple sub-classification fc.
  • the tasks performed by the second RCNN and the first RCNN can be the same, and the output results are also the same.
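  • The merging can be sketched as splicing the weight matrices and biases of the sub-classification fc (the sub-regression fc are merged analogously); because each output is normalized element-wise with the sigmoid function, the merged layer produces the same result with a single matrix multiplication. The layer sizes below are assumptions:

    import torch
    import torch.nn as nn

    hidden = 1024
    sub_cls = [nn.Linear(hidden, 3), nn.Linear(hidden, 2)]       # e.g. vehicle task and wheel/light task

    # Splice the weight matrices and biases of the sub-classification fc into one classification fc.
    merged = nn.Linear(hidden, sum(fc.out_features for fc in sub_cls))
    with torch.no_grad():
        merged.weight.copy_(torch.cat([fc.weight for fc in sub_cls], dim=0))
        merged.bias.copy_(torch.cat([fc.bias for fc in sub_cls], dim=0))

    x = torch.randn(5, hidden)                                    # hidden-layer output for 5 proposals
    separate = torch.cat([torch.sigmoid(fc(x)) for fc in sub_cls], dim=1)
    combined = torch.sigmoid(merged(x))
    assert torch.allclose(separate, combined, atol=1e-6)          # one matrix multiplication, same output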
  • in an accelerator such as an NPU, only one matrix operation is completed at a time.
  • in the first RCNN, the output of the hidden layer needs to be input to the sub-classification fc and sub-regression fc corresponding to each task for multiple matrix operations.
  • therefore, the number of matrix multiplications in the first RCNN increases with the number of tasks, while the number of matrix multiplications performed in the second RCNN is not affected by the number of tasks. That is to say, in the case where the parameters of the first RCNN and the second RCNN are the same, the time required to execute the second RCNN is less than the time required to execute the first RCNN.
  • the classification fully connected layer of the second RCNN is obtained by merging the sub-classification fully connected layers corresponding to the multiple tasks in the first RCNN, and the regression fully connected layer of the second RCNN is obtained by merging the sub-regression fully connected layers corresponding to the multiple tasks in the first RCNN, which reduces the number of matrix multiplication operations in the neural network accelerator, is more friendly to hardware, and further reduces time consumption.
  • the perceptual network of FIG. 11 is used to implement n tasks, and the n tasks include task 0, task 1 ... task n-1 in FIG. 11.
  • n is an integer greater than 1.
  • the second RCNN includes hidden layers and cls fc and reg fc.
  • Hidden layers can include Shared fc and/or Shared conv.
  • cls fc may be obtained by merging n sub-cls fcs in FIG. 10
  • reg fc may be obtained by merging n sub-reg fcs in FIG. 10 .
  • cls fc can output the confidence that each proposal belongs to each category, and reg fc can output the adjusted candidate 2D box.
  • the FPN in Figure 11 is an optional module.
  • the ROI-Align module is used as the region of interest extraction module only as an example, and other methods may also be used to extract corresponding features. For details, refer to the foregoing description, which will not be repeated here.
  • in a possible implementation, during training the classification and regression network adopts the first RCNN, and after the training is completed, the second RCNN is obtained based on the first RCNN; that is, in the perceptual network used for inference, the classification and regression network can adopt the second RCNN.
  • the perceptual network in FIG. 10 can be applied to the training side, and the first RCNN in the trained perceptual network is merged to obtain the perceptual network shown in FIG. 11; that is, the model parameters in FIG. 11 are obtained from the trained model parameters in FIG. 10.
  • the perceptual network in Figure 11 can be applied to the inference side to reduce time-consuming.
  • in the embodiment of the present application, one perception network is used to complete multiple perception tasks: the multiple tasks share one RPN, and one RPN predicts the regions where the objects to be detected in the multiple tasks are located. While ensuring the performance of the perception network, this reduces the amount of parameters and computation of the perception network, improves processing efficiency, facilitates deployment in scenarios with high real-time requirements, reduces the pressure on hardware, and saves costs.
  • the first RCNN or the second RCNN is used as the classification and regression network, and multiple tasks share the hidden layer of the RCNN, which further reduces the amount of parameters and calculation of the perception network, and improves the processing efficiency.
  • in addition, each task corresponds to an independent sub-classification fc and sub-regression fc, which improves the scalability of the perception network: detection tasks can be flexibly added or removed by adding or removing sub-classification fc and sub-regression fc.
  • furthermore, the multiple sub-classification fc and sub-regression fc in the first RCNN are combined and the second RCNN is used as the classification and regression network, which can further reduce the number of matrix operations, is more friendly to hardware, further reduces time consumption, and improves processing efficiency.
  • the perception network in the embodiment of the present application may be trained by using an existing training method.
  • however, when the labeling data is partial labeling data, for example, when a sample image is labeled only with the labeling data of the objects to be detected in one task, training with that labeling data adjusts the parameters of the RPN so that the RPN can more accurately predict the candidate 2D frames of the objects to be detected in that task, but may not accurately predict the candidate 2D frames of the objects to be detected in the other tasks on the sample image. In this way, the training data of different tasks may suppress each other, causing the RPN to fail to predict all the target objects in the image.
  • in view of this, the embodiment of the present application provides a training method for a perceptual network, which uses other perceptual networks to infer the sample images in the training set so as to provide pseudo boxes (pseudo bounding boxes, Pseudo Bboxes) for the objects to be detected that are not labeled in the sample images, and then jointly trains the RPN based on the pseudo frames and the labeled data, which is beneficial to obtaining the candidate 2D frames of the objects to be detected in the multiple tasks.
  • FIG. 12 shows a method 1200 for training a perceptual network provided by an embodiment of the present application.
  • the method 1200 may be performed by a training device for a neural network model; the training device may be a cloud service device or a terminal device, for example, a device with sufficient computing power to execute the training method of the neural network model, such as a computer or a server, or may be a system composed of a cloud service device and a terminal device.
  • the method 1200 may be performed by the training device 120 in FIG. 3 , the neural network processor 50 in FIG. 5 , or the execution device 310 in FIG. 6 .
  • the perception network includes: RPN, where the RPN is used to predict the position information of the candidate 2D frame of the target object in the sample image, the target object includes objects to be detected for multiple tasks, and each task in the multiple tasks includes at least one category.
  • the sensory network may be the sensory network shown in FIG. 8 .
  • relevant descriptions are appropriately omitted when describing the training method. During training, just replace the input image with a sample image.
  • the method 1200 includes steps S1210 to S1220, and steps S1210 to S1220 are described below.
  • the target objects include a first task object and a second task object.
  • the training data includes the sample image, the labeling data of the first task object on the sample image, and the pseudo-frame of the second task object on the sample image, and the labeling data includes the class label of the first task object and the labeled 2D frame of the first task object,
  • the pseudo frame of the second task object is the target 2D frame of the second task object obtained by inferring the sample image through other perceptual networks.
  • Labeled data can also be understood as ground truth.
  • Annotated class labels are used to indicate the true class to which the task object belongs.
  • the labeled data of the first task object can also be understood as the labeled data of the sample image.
  • the fully annotated data of the sample image includes the class labels and annotated 2D boxes of the objects to be detected in all tasks on the sample image.
  • the fully annotated data includes the annotation information of all objects of interest.
  • Part of the annotation data includes the class label and annotated 2D frame of the object to be detected in some tasks on the sample image.
  • Part of the annotation data only includes the annotation information of some objects of interest.
  • the first task objects may include objects to be detected in one or more tasks.
  • the one or more tasks are the tasks where the first task object is located.
  • the first task objects in different sample images in the training set may be the same or different.
  • the "first” in the "first task object” in the embodiment of the present application is only used to define the object to be detected that has a true value in the sample image, and has no other limiting role.
  • for example, the annotation data of sample image 1# is the annotation data of the car, that is, the first task object in sample image 1# includes the objects to be detected in the detection task of the car, such as trucks, cars, buses, etc.
  • the labeled data of sample image 2# is the labeled data of wheels and lights, that is, the first task object in sample image 2# includes the objects to be detected in the detection task of wheels and lights, such as wheels, lights, etc.
  • the labeling data of sample image 3# includes the labeling data of the car and the labeling data of the wheels and lights, that is, the first task object in sample image 3# includes the objects to be detected in the detection task of the car and the objects to be detected in the detection task of wheels and lights.
  • the labeling data of the sample images in the embodiments of the present application may be partial labeling data, so that targeted collection can be carried out; that is, the required sample images are collected for specific tasks, and it is not necessary to label the objects to be detected for all tasks in every sample image, which reduces the cost of data collection and labeling.
  • in addition, the scheme using partial labeling data has flexible scalability: when tasks are added, it is only necessary to provide the labeling data of the new tasks, and there is no need to label new objects to be detected on the basis of the original training data.
  • the Pseudo Bboxes on the sample image are the target 2D boxes of the second task object obtained by inferring the sample image through other perceptual networks.
  • the Pseudo Bboxes on the sample image can also be understood as the Pseudo Bboxes of the second task object.
  • Other perceptual networks refer to other perceptual networks than the one to be trained.
  • the other perceptual network may be a multi-head multi-tasking perceptual network.
  • the perceptual network as shown in FIG. 7 is used to infer the sample images in the training set, and the inference result of the sample image is obtained, and the inference result includes the target 2D frame of the target object on the sample image.
  • other perceptual networks may also include multiple single-task perceptual networks.
  • multiple single-task perceptual networks are used to infer the sample images in the training set respectively, and the inference results of the sample images are obtained respectively.
  • the inference results of each single-task perceptual network include the object to be detected in the task on the sample image.
  • the target 2D frames of the objects to be detected in multiple tasks on the sample image can be obtained.
  • the second task objects may include objects to be detected in one or more tasks.
  • the one or more tasks are the tasks where the second task object is located.
  • the same object to be detected may exist in the second task object and the first task object.
  • the second task objects in different sample images in the training set can be the same or different.
  • the "second" in the "second task object” in the embodiment of the present application is only used to define the object to be detected with a pseudo frame in the sample image, and has no other limiting role.
  • the annotation frame in the annotation data is used as the target output of the RPN.
  • Labeled data is usually human-labeled data, and the accuracy of labeling data is usually higher than that of pseudo-frames obtained by inference from other perception networks. Using labelled frames as the target output can improve the accuracy of the training model.
  • for example, the multiple tasks that the perception network needs to complete include the 8 tasks in Table 1, the labeled data of sample image 1# is the labeled data of the car, and the first task object in sample image 1# includes the objects to be detected in the car detection task, such as trucks, cars and buses; that is, the labeled data of sample image 1# is partial labeled data.
  • the sample image 1# is reasoned through other perceptual networks to obtain the target 2D frame of the second task object, that is, the pseudo frame.
  • the sample image 1# is inferred by 7 single-task perception networks used to complete the 7 tasks except the car detection task in Table 1, and the target 2D frame of the second task object is obtained, in this case , the second task object may include objects in the seven tasks in Table 1 except for the vehicle detection task.
  • the multi-head and multi-task perceptual network shown in Figure 7 can be used to complete the 8 tasks in Table 1.
  • the target 2D frame of the second task object can be obtained.
  • the second task object may include the objects to be detected in the eight tasks in Table 1. In this way, the regions where the objects to be detected are located in the eight tasks in the sample image 1# can be obtained after the pseudo frame and the annotation frame are combined.
  • the RPN predicts the area where the object to be detected is located in all tasks that need to be detected.
  • other perceptual networks perform reasoning on the sample image, and can obtain the target 2D frame of the second task object on the sample image and the confidence level of the category to which the second task object belongs.
  • when the confidence is greater than or equal to a first threshold, the target 2D frame of the second task object on the sample image obtained by inference of the other perceptual networks is used as the pseudo frame on the sample image. That is, when the confidence is greater than or equal to the first threshold, the inference results of the other perceptual networks are used for training.
  • a low threshold may be used for filtering.
  • the first threshold is 0.05, that is, the target 2D frame with a confidence level greater than or equal to 0.05 can be used as a pseudo frame on the sample image to participate in the training of the perceptual network together with the labeled data. It should be understood that the first threshold may be set as required, which is not limited in this embodiment of the present application.
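  • A minimal sketch of generating pseudo frames from the inference results of another perceptual network is shown below, assuming a simple dictionary format for the detections; the field names are assumptions:

    def make_pseudo_boxes(inference_results, first_threshold=0.05):
        # Keep a predicted target 2D box as a pseudo frame when its confidence reaches the first threshold.
        pseudo = []
        for det in inference_results:          # det: {"box": [x1, y1, x2, y2], "category": ..., "score": ...}
            if det["score"] >= first_threshold:
                pseudo.append({"box": det["box"], "category": det["category"]})
        return pseudo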
  • step S1220 may include steps S1221 to S1223.
  • S1221 Calculate a first loss function value according to the difference between the marked 2D frame of the first task object and the target 2D frame of the second task object and the candidate 2D frame of the target object in the sample image predicted by the RPN.
  • the labeled 2D frame of the first task object and the target 2D frame of the second task object are compared with the candidate 2D frame of the target object predicted by RPN, and the loss function value of the RPN stage is obtained, that is, the first loss function value.
  • the forward propagation of the perceptual network is performed based on the sample image, and the candidate 2D frame of the target object on the sample image is predicted by the RPN.
  • the specific forward propagation process is shown in FIG. 8 and will not be repeated here.
  • S1222 Calculate a second loss function value of the perceptual network according to the labeled data of the sample image.
  • the second loss function value of the perceptual network is the second loss function value of the part of the perceptual network that needs to be trained.
  • the part to be trained in the perception network includes the part to be trained in the classification and regression network, the region of interest extraction module, the RPN and the backbone network, and the part to be trained in the classification and regression network is determined according to the first task object.
  • the part of the perceptual network that needs to be trained refers to the part of the perceptual network that needs to be trained as determined by the sample images.
  • the classification and regression network can predict the confidence that the candidate 2D box belongs to each category and the target 2D box of the target object.
  • specifically, the region of interest extraction module extracts the features of the candidate 2D frame from the feature map, the features of the candidate 2D frame are input into the part to be trained in the classification and regression network, and the confidence that the candidate 2D frame belongs to the object categories in the task where the first task object is located is obtained.
  • the part of the classification and regression network that needs to be trained is determined according to the first task object. In other words, the part to be trained in the classification and regression network is determined according to the task where the first task object is located.
  • the classification and regression network includes a plurality of third RCNNs, and the part to be trained in the classification and regression network includes the third RCNN corresponding to the task where the first task object is located.
  • the perception network may be as shown in FIG. 9 .
  • the task where the first task object in the sample image 1# (an example of the sample image) is located includes a vehicle detection task, and the first task object includes an object to be detected in the vehicle detection task.
  • the features of the candidate 2D box are input into the third RCNN corresponding to the vehicle detection task, and then the confidence level of the candidate 2D box belonging to the three categories of cars, trucks and buses, and the target 2D box are obtained.
  • the part to be trained in the classification and regression network is the third RCNN corresponding to the vehicle detection task.
  • the classification and regression network includes a first RCNN, and the part to be trained in the classification and regression network includes the hidden layer in the first RCNN and the sub-classification fc and sub-regression fc corresponding to the task where the first task object is located.
  • the perception network can be as shown in FIG. 10 .
  • the task where the first task object in the sample image 1# is located includes the detection task of the car, and the first task object includes the object to be detected in the detection task of the car.
  • the features of the candidate 2D box, after passing through the hidden layer in the first RCNN, are input into the sub-classification fc and sub-regression fc corresponding to the car detection task, and then the confidence that the candidate 2D box belongs to the three categories of cars, trucks and buses, and the target 2D box, are obtained.
  • the part that needs to be trained in the classification and regression network is the hidden layer in the first RCNN and the sub-classification fc and sub-regression fc corresponding to the car detection task.
  • the labeled data of the sample image is compared with the output result of the classification and regression network, and the loss function value of the task where the first task object in the classification and regression network stage is located, that is, the second loss function value is obtained. That is, the loss of other tasks not involved in the annotation data of the sample images is not calculated.
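  • A minimal sketch of computing the second loss only over the tasks present in the (partial) labeling data of a sample image is shown below; the function and argument names are assumptions:

    def second_loss(cls_outputs, reg_outputs, labels, labeled_tasks, cls_loss_fn, reg_loss_fn):
        # cls_outputs / reg_outputs: per-task predictions of the classification and regression network.
        # labeled_tasks: tasks for which this sample image carries labeling data (the first task objects).
        loss = 0.0
        for task in labeled_tasks:
            loss = loss + cls_loss_fn(cls_outputs[task], labels[task]["classes"])
            loss = loss + reg_loss_fn(reg_outputs[task], labels[task]["boxes"])
        return loss        # losses of tasks not present in the annotation data are not computed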
  • S1223 Perform backpropagation based on the first loss function value and the second loss function value, and adjust the parameters of the part of the perception network that needs to be trained.
• the gradient of the parameters related to the first loss function value is calculated, and then the parameters related to the first loss function value are adjusted based on the gradient, so as to adjust the perception network so that the RPN can predict candidate boxes more comprehensively.
• the parameters related to the first loss function value are the parameters in the perception network used in the process of obtaining the first loss function value, for example, the parameters of the backbone and the parameters of the RPN. Further, in the case where the perception network includes an FPN, the parameters related to the first loss function value also include the parameters of the FPN.
• the gradient of the parameters related to the second loss function value is calculated, and then the parameters related to the second loss function value are adjusted based on the gradient, so as to adjust the perception network so that the classification and regression network can better correct the output 2D box and improve the accuracy of category prediction.
• the parameters related to the second loss function value are the parameters in the perception network used in the process of calculating the second loss function value, for example, the parameters of the backbone, the parameters of the RPN, the parameters of the region of interest extraction module, and the parameters of the part of the classification and regression network that needs to be trained. Further, in the case where the perception network includes an FPN, the parameters related to the second loss function value also include the parameters of the FPN.
  • the parameters related to the second loss function are the parameters of the part of the perception network that needs to be trained.
• if the training termination condition is met, the training is terminated, and a trained perception network is obtained.
  • the training is terminated and the weights of the trained perceptual network are output.
  • steps S1221 to S1223 are only an implementation manner of step S1220, and step S1220 may also be implemented in other manners.
  • step S1220 includes the following steps S1 to S3.
  • S1 Calculate a first loss function value according to the difference between the labeled 2D frame of the first task object and the target 2D frame of the second task object and the candidate 2D frame of the target object on the sample image predicted by the RPN.
  • the labeled 2D frame of the first task object and the target 2D frame of the second task object are compared with the candidate 2D frame of the target object predicted by RPN, and the loss function value of the RPN stage is obtained, that is, the first loss function value.
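• A minimal sketch of how the RPN target boxes described above can be assembled, assuming the labeled 2D frames and the pseudo frames are available as arrays of box coordinates; the function name and array layout are assumptions for illustration.

```python
import numpy as np

def build_rpn_target_boxes(labeled_boxes, pseudo_boxes):
    """Union of the labeled 2D frames of the first task object and the pseudo frames
    of the second task object, used as the target boxes of the RPN stage.

    labeled_boxes, pseudo_boxes: float arrays of shape (N, 4) in (x1, y1, x2, y2) format.
    """
    labeled_boxes = np.asarray(labeled_boxes, dtype=np.float32).reshape(-1, 4)
    pseudo_boxes = np.asarray(pseudo_boxes, dtype=np.float32).reshape(-1, 4)
    return np.concatenate([labeled_boxes, pseudo_boxes], axis=0)

# usage: the RPN loss is then computed against these class-agnostic target boxes,
# e.g. rpn_loss = rpn_loss_fn(predicted_proposals, build_rpn_target_boxes(gt_boxes, pseudo_boxes))
```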
  • the forward propagation of the perceptual network is performed based on the sample image, and the candidate 2D frame of the target object on the sample image is predicted by the RPN.
  • the specific forward propagation process is shown in FIG. 8 and will not be repeated here.
• the second loss function value of the part of the perception network that needs to be trained is calculated, where the part of the perception network that needs to be trained includes the part that needs to be trained in the classification and regression network, the region of interest extraction module, the RPN and the backbone network.
  • the part that needs to be trained in the classification and regression network is determined according to the first task object and the second task object.
  • the pseudo-label on the sample image is the class label of the second task object on the sample image obtained by inferring the sample image through other perceptual networks.
  • the classification and regression network can predict the confidence that the candidate 2D box belongs to each category and the target 2D box of the target object.
• the region of interest extraction module extracts the features of the candidate 2D frame from the feature map, the features of the candidate 2D frame are input into the part to be trained in the classification and regression network, and the confidence that the candidate 2D frame belongs to the object category in the corresponding task is obtained.
  • the part of the classification and regression network that needs to be trained is determined according to the first task object and the second task object. In other words, the part to be trained in the classification and regression network is determined according to the task where the first task object is located and the task where the second task object is located.
  • the classification and regression network includes a plurality of third RCNNs, for example, the perceptual network may be as shown in FIG. 9 .
  • the task where the first task object in the sample image 1# is located includes the detection task of the car, and the first task object includes the object to be detected in the detection task of the car.
  • the features of the candidate 2D box are input into the third RCNN corresponding to the vehicle detection task, and then the confidence level of the candidate 2D box belonging to the three categories of cars, trucks and buses, and the target 2D box are obtained.
  • the task where the second task object in the sample image 1# is located includes the detection task of wheels and lights, and the second task object includes objects in the detection task of wheels and lights.
  • the features of the candidate 2D frame are input into the third RCNN corresponding to the detection task of wheels and lights, and then the confidence of the candidate 2D frame belonging to the two categories of wheels and lights, and the target 2D frame are obtained.
  • the parts to be trained in the classification and regression network are the third RCNN corresponding to the detection task of the car and the third RCNN corresponding to the detection task of the wheels and lights.
  • the classification and regression network includes the first RCNN, for example, the perceptual network may be as shown in FIG. 10 .
  • the task where the first task object in the sample image 1# is located includes the detection task of the car, and the first task object includes the object to be detected in the detection task of the car.
• the features of the candidate 2D box pass through the hidden layer in the first RCNN and are then input into the sub-classification fc and sub-regression fc corresponding to the vehicle detection task, and the confidence that the candidate 2D box belongs to the three categories of car, truck and bus, and the target 2D box, are obtained.
  • the task where the second task object in the sample image 1# is located includes the detection task of wheels and lights, and the second task object includes objects in the detection task of wheels and lights.
• the features of the candidate 2D box pass through the hidden layer in the first RCNN and are then input into the sub-classification fc and sub-regression fc corresponding to the detection task of wheels and lights, and the confidence that the candidate 2D box belongs to the two categories of wheel and light, and the target 2D box, are obtained.
• the parts that need to be trained in the classification and regression network include the hidden layer in the first RCNN, the sub-classification fc and sub-regression fc corresponding to the detection task of the car, and the sub-classification fc and sub-regression fc corresponding to the detection task of wheels and lights.
• the gradient of the parameters related to the first loss function value is calculated, and then the parameters related to the first loss function value are adjusted based on the gradient, so as to adjust the perception network so that the RPN can predict candidate boxes more comprehensively.
• the parameters related to the first loss function value are the parameters in the perception network used in the process of obtaining the first loss function value, for example, the parameters of the backbone and the parameters of the RPN. Further, in the case where the perception network includes an FPN, the parameters related to the first loss function value also include the parameters of the FPN.
• the gradient of the parameters related to the second loss function value is calculated, and then the parameters related to the second loss function value are adjusted based on the gradient, so as to adjust the perception network so that the classification and regression network can better correct the output 2D box and improve the accuracy of category prediction.
• the parameters related to the second loss function value are the parameters in the perception network used in the process of calculating the second loss function value, for example, the parameters of the backbone, the parameters of the RPN, the parameters of the region of interest extraction module, and the parameters of the part of the classification and regression network that needs to be trained. Further, in the case where the perception network includes an FPN, the parameters related to the second loss function value also include the parameters of the FPN.
  • the parameters related to the second loss function are the parameters of the part of the perception network that needs to be trained.
• if the training termination condition is met, the training is terminated, and a trained perception network is obtained.
  • the training is terminated and the weights of the trained perceptual network are output.
• the perception network is jointly trained based on the pseudo frames and the labeled data. In the case that the labeled data only includes the labeled data of the first task object, that is, in the case of partial labeled data, the pseudo frames of the second task object are provided so as to supply a more comprehensive set of frames of the objects to be detected on the same sample image as the target output of the RPN. The parameters of the RPN are adjusted so that the output of the RPN constantly approaches the target data, which avoids mutual suppression between different tasks, helps the RPN obtain more comprehensive and accurate candidate 2D boxes, and improves the recall rate.
• the labeling data of the sample images in the embodiments of the present application may be partial labeling data, so that targeted collection can be performed, that is, the required sample images are collected for specific tasks, and there is no need to mark the objects to be detected for all tasks in each sample image. This reduces the cost of data collection and labeling, which is conducive to balancing the training data of different tasks.
• the scheme using partial labeled data has flexible scalability. When tasks are added, it is only necessary to provide the labeled data of the new tasks, and there is no need to label the new objects to be detected on the basis of the original training data.
• the parts of the perception network that are shared by different tasks all participate in the training process based on the labeled data of the different tasks, which enables the shared parts of the perception network to learn the common features of each task.
• the different parts of the perception network corresponding to different tasks, for example, the parts of the classification and regression network corresponding to each task, only participate in the training process based on the labeled data of their respective tasks, so that each part can learn its task-specific features, improving the accuracy of the model.
• the part of the classification and regression network that needs to be trained is determined according to the task, and the different parts of the classification and regression network corresponding to different tasks do not affect each other during the training process, ensuring the independence of each task and giving the model strong flexibility.
  • FIG. 13 shows a training method of a perceptual network provided by an embodiment of the present application.
  • the method shown in FIG. 13 may be regarded as a specific implementation of the method shown in FIG. 12 .
  • appropriate omissions are made when describing the method 1300 .
  • the solution of the embodiment of the present application is described in detail below by taking the visual perception system of ADAS/ADS as an example.
  • the visual perception system of ADAS/ADS needs to perform target detection for various tasks, such as: dynamic obstacles, static obstacles, traffic signs, traffic lights, road signs (such as left turn signs or straight signs) and zebra crossings.
  • the target objects include the first task object and the second task object.
  • the training data includes the sample image, the annotation data of the first task object on the sample image, and the pseudo frame of the second task object on the sample image.
  • the annotation data includes the class label of the first task object and the labeled 2D frame of the first task object.
• labeled data is provided for each task. For example, the labeled data of vehicles is provided for the training of task0: the 2D frames and class labels of Car/Truck/Bus are marked on one or more sample images in the dataset. The labeled data of persons is provided for the training of task1: the 2D frames and class labels of Pedestrian/Cyclist/Tricycle are marked on one or more sample images in the dataset. The labeled data of wheels and lights is provided for task2: the 2D frames and class labels of Wheel/Car_light are marked on one or more sample images in the dataset. The labeled data of traffic lights is provided for task3: the 2D frames and class labels of TrafficLight_Red/Yellow/Green/Black are marked on one or more sample images in the dataset, and so on. In this way, each sample image has annotated data for at least one task.
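• The following is a small sketch of what one partially labeled training record could look like under this scheme; the field names (image, task_labels, annotations, ...) are hypothetical and only illustrate that each sample carries annotations for a subset of tasks.

```python
# One training sample with partial annotation: only the vehicle task (task0) is labeled,
# boxes are (x1, y1, x2, y2); objects of other tasks are supplied later as pseudo boxes.
sample = {
    "image": "frame_000123.jpg",
    "task_labels": [0],                     # tasks actually annotated in this image
    "annotations": [
        {"task": 0, "category": "Car",   "box": [100.0, 200.0, 260.0, 320.0]},
        {"task": 0, "category": "Truck", "box": [400.0, 180.0, 640.0, 360.0]},
    ],
}
```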
  • the sample image includes annotation information of all objects of interest. That is, all objects of interest are annotated in each sample image.
  • the object of interest is the object to be detected in the eight categories in Table 1.
  • each type of annotation data only needs to annotate a specific type of object. That is, the labeled data of each sample image may be partial labeled data.
  • class labels and 2D boxes of objects to be detected in multiple tasks can also be labeled on each sample image, that is, to provide mixed labeled data.
  • the training data can be used to train the required training part of the perceptual network corresponding to the two tasks at the same time.
  • a task label may be assigned to each sample image, and the task label may be used to indicate that the sample image is used to train the required training portion of the perceptual network.
  • the labeled data of the sample image can be obtained in the above manner.
  • the annotation data may be stored in an annotation file.
  • the annotation file is the ground truth file.
  • the sample images are inferred through other perceptual networks, and the inference results are obtained.
  • Inference results include Pseudo Bboxes on sample images. Pseudo Bboxes can be used to complement objects to be detected belonging to other tasks that are not labeled in the labeled data of the sample image.
  • the inference result may be stored in an inference result file.
  • the inference result file is the Pseudo Bboxes file.
  • Each sample image can correspond to an annotation file and an inference result file.
  • the labeled 2D boxes in the labeled data of the sample image and the Pseudo Bboxes can be combined to obtain the 2D boxes of the objects to be detected in all tasks on the sample image.
  • the inference result also includes the confidence level of the category to which the second task object on the sample image belongs.
  • a low threshold is used to filter the inference results. That is, inference results whose confidence is less than the first threshold are filtered out.
  • the confidence levels corresponding to the Pseudo Bboxes used for training are all greater than or equal to the first threshold.
  • the first threshold is 0.05.
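• A minimal sketch of the low-threshold filtering step, assuming each inference result carries a confidence score; names and the record layout are illustrative.

```python
FIRST_THRESHOLD = 0.05  # inference results below this confidence are discarded

def filter_pseudo_boxes(inference_results, threshold=FIRST_THRESHOLD):
    """Keep only pseudo boxes whose confidence is greater than or equal to the threshold.

    inference_results: list of dicts like
        {"box": [x1, y1, x2, y2], "score": 0.42, "category": "Wheel"}.
    """
    return [r for r in inference_results if r["score"] >= threshold]
```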
  • a perceptual network is trained based on partially labeled data and Pseudo Bboxes. Specifically, the method 1300 includes steps S1310 to S1350.
  • the training data is input into the perception network, and the training data includes a sample image, annotated data of the first task object on the sample image, and a pseudo frame of the second task object on the sample image.
  • Step S1310 corresponds to step S1210 in the method 1200.
  • the structure of the perceptual network used in the training process is shown in Figure 14.
  • the perceptual network includes: backbone, RPN, region of interest extraction module and first RCNN.
  • the sensory network shown in FIG. 14 can be regarded as a specific implementation of the sensory network shown in FIG. 10 .
• the perceptual network in Figure 14 can simultaneously complete the object detection of the 8 categories in Table 1. In other words, the perceptual network in Figure 14 can simultaneously complete the target detection of the 8 tasks in Table 1.
  • the 8 sub-classifications fc and the sub-regression fc in the first RCNN in Figure 14 simultaneously complete the 2D object detection of the 8 categories in Table 1.
• the perception network of the present application can flexibly add or delete the sub-classification fc and sub-regression fc in the first RCNN according to the needs of the business, so as to train perception networks that can achieve target detection for different numbers of tasks.
  • the labeled data of the sample image includes the labeled 2D box and the class label of the first task object.
  • the Pseudo Bboxes on the sample image include the Pseudo Bboxes of the second task object.
  • Step S1320 includes: using the labeled 2D box of the first task object and the Pseudo Bboxes of the second task object to calculate the loss in the RPN stage, that is, the first loss function value.
  • Step S1320 corresponds to step S1221 in the method 1200. For details, please refer to step S1221.
• the sample image may belong to one or more tasks according to the type of data it is annotated with. In other words, the sample image may belong to one or more tasks according to the task corresponding to the first task object. For example, if a sample image is only marked with traffic signs, the sample image only belongs to the task of traffic signs. If a sample image is marked with people and cars at the same time, then the sample image belongs to the two tasks of people and cars.
• for the loss of the classification and regression network stage, only the loss of the part corresponding to the task to which the current sample image belongs is calculated, and the losses of other tasks are not calculated. For example, if the currently input sample image belongs to the tasks of people and cars, only the loss of the parts corresponding to people and cars is calculated, and the loss of the parts corresponding to the other tasks (such as traffic lights and traffic signs) is not calculated.
• the region of interest extraction module extracts features from a feature map according to the candidate 2D frame predicted by the RPN; after passing through the shared fc and shared conv, the features enter the sub-classification fc and sub-regression fc corresponding to the task of the sample image, and the prediction result is obtained, that is, the confidence that the candidate 2D frame belongs to the object category in the task, and the target 2D frame. Then, the labeled data is compared with the prediction result to obtain the loss, which is the loss of the classification and regression network stage corresponding to the task.
• if the labeled data of the current sample image only includes the labeled data of one task, when the sample image is input to the network for training, among the multiple sub-classification fc and sub-regression fc in the first RCNN, only the sub-classification fc and sub-regression fc corresponding to that task are trained, without affecting the sub-classification fc and sub-regression fc corresponding to other tasks in the first RCNN.
• For example, if the sample image is only labeled with traffic lights, the sub-classification fc and sub-regression fc corresponding to the traffic light task obtain the prediction result of the traffic lights in the sample image, which is compared with the true value to obtain the loss value. That is to say, the sample image of the traffic light only passes through the backbone, the RPN, the region of interest extraction module, and the sub-classification fc and sub-regression fc corresponding to the traffic light in the first RCNN; the sub-classification fc and sub-regression fc corresponding to other tasks do not participate in the calculation of the loss value.
• if the labeled data of the current sample image includes the labeled data of multiple tasks, when the sample image is input to the network for training, among the multiple sub-classification fc and sub-regression fc in the first RCNN, only the sub-classification fc and sub-regression fc corresponding to these tasks are trained, and the sub-classification fc and sub-regression fc corresponding to each task do not affect the sub-classification fc and sub-regression fc corresponding to other tasks in the first RCNN.
• For example, if a sample image is labeled with both traffic lights and people, where the task of the traffic light is task3 and the task of people is task1, the prediction results of the traffic lights in the sample image are obtained through the sub-classification fc and sub-regression fc corresponding to task3, the prediction results of the people in the sample image are obtained through the sub-classification fc and sub-regression fc corresponding to task1, and the prediction results are compared with the true values to obtain the loss values corresponding to the two tasks.
  • the sample image only passes through the backbone, RPN, region of interest extraction module, sub-classification fc and sub-regression fc corresponding to task3 in the first RCNN, and sub-classification fc and sub-regression fc corresponding to task1 in the first RCNN.
  • the sub-classification fc and sub-regression fc corresponding to other tasks do not participate in the calculation of the loss value. In this way, the losses of the classification and regression stages corresponding to the two tasks will be obtained, and the overall loss value of the classification and regression stages can be the average of the multiple losses.
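• The per-task loss selection described above can be sketched as follows, assuming the head outputs and ground truth are keyed by task id; all names are hypothetical.

```python
def classification_regression_loss(head_outputs, ground_truth, labeled_tasks, loss_fn):
    """Compute the classification-and-regression-stage loss only for the tasks the
    current sample is labeled with; heads of other tasks contribute no loss (and
    therefore receive no gradient).

    head_outputs:  {task_id: (cls_logits, box_deltas)}
    ground_truth:  {task_id: (cls_targets, box_targets)}
    labeled_tasks: tasks covered by the annotation data of this sample, e.g. [1, 3]
    """
    per_task_losses = []
    for t in labeled_tasks:
        cls_logits, box_deltas = head_outputs[t]
        cls_targets, box_targets = ground_truth[t]
        per_task_losses.append(loss_fn(cls_logits, box_deltas, cls_targets, box_targets))
    # overall loss of the classification and regression stage: average over labeled tasks
    return sum(per_task_losses) / len(per_task_losses)
```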
• the gradients of the relevant parameters are calculated, and the gradients are passed back.
• gradient back-propagation is performed on the part of the perception network that needs to be trained.
• the part of the perception network that needs to be trained is determined according to the task to which the sample image belongs, and the parts not corresponding to the task to which the sample image belongs do not participate in the gradient back-propagation.
• the gradient is passed back along the sub-classification fc and sub-regression fc corresponding to the task to which the sample image belongs, without affecting the sub-classification fc and sub-regression fc corresponding to other tasks; the shared fc/conv of the first RCNN, the RPN and the backbone all participate in the gradient back-propagation.
• the weight parameters of the part of the perception network that needs to be trained are updated using the back-propagated gradients.
  • the part corresponding to the task to which the sample image belongs in the perceptual network can be adjusted in a targeted manner, so that the part corresponding to the task to which the sample image belongs can better learn the task to which the sample image belongs.
• if the perception network does not converge, go to step S1310 to continue the training process.
• the labeling data of the sample images in the embodiments of the present application may be partial labeling data, so that targeted collection can be performed, that is, the required sample images are collected for specific tasks without labeling all objects of interest in each picture, reducing the cost of data collection and labeling.
• the method of preparing training data by using partial labeling data has very flexible scalability. When a detection task is added, only the part corresponding to the detection task, that is, the sub-classification fc and sub-regression fc corresponding to the detection task, needs to be added to the classification and regression network, and sample images with the annotation data of the newly added objects can be provided; the newly added objects to be detected do not need to be marked on the basis of the original training data.
• the pseudo frames are used to supplement the unlabeled objects to be detected in the sample images, so as to avoid mutual suppression between the partial labeled data of different tasks when the RPN is trained based on partial labeled data, which would affect the training of the RPN; this helps the RPN predict the regions where the objects to be detected are located for all tasks that need to be detected.
• the part corresponding to each task in the perception network only detects the objects to be detected in that task, and during the training process it can avoid mistakenly penalizing unlabeled objects belonging to other tasks.
• the shared parts of the perception network, such as the backbone, the RPN and the region of interest extraction module, learn the common features of each task, while the parts of the classification and regression network corresponding to each task learn task-specific features; for example, the sub-classification fc and sub-regression fc corresponding to each task in the first RCNN learn the task-specific features of that task.
  • This embodiment of the present application further provides an object recognition method 1500, and the method 1500 can be executed by an object recognition apparatus.
• the object recognition device may be a cloud service device or a terminal device, for example, a vehicle, drone, robot, computer, server, mobile phone or other device with sufficient computing power to execute the object recognition method, or may be a system consisting of a cloud service device and a terminal device.
  • the method 1500 may be executed by the execution device 110 in FIG. 3 , the neural network processor 50 in FIG. 5 , or the execution device 310 in FIG. 6 , or a local device.
  • the object recognition method may be specifically executed by the execution device 110 shown in FIG. 3 .
  • the object recognition method may be processed by the GPU, or may be jointly processed by the CPU and the GPU, or other processors suitable for neural network computing may be used without using the GPU, which is not limited in this application.
  • the perceptual network in the embodiment of the present application is used to process the image.
  • the repeated description is appropriately omitted when introducing the method 1500 below.
  • the method 1500 includes steps S1510 to S1540, which are described below.
  • the perception network includes backbone network, RPN, region of interest extraction module and classification regression network.
• the input image may be an image captured by a terminal device (or other apparatus such as a computer or a server) through a camera, or the input image may be an image obtained from inside the terminal device (or other apparatus such as a computer or a server), for example, an image stored in an album of the terminal device, or an image obtained by the terminal device from the cloud.
• the target object includes objects to be detected in multiple tasks, each task in the multiple tasks includes at least one category, and the second feature map is determined according to the first feature map.
  • S1540 Use a classification and regression network to process the first feature information to obtain a target 2D frame and first indication information of the target object, where the number of target 2D frames is less than or equal to the number of candidate 2D frames, and the first indication information is used to indicate the target object the category to which it belongs.
• using the classification and regression network to process the first feature information to obtain the target 2D frame of the target object and the first indication information includes: using the classification and regression network to process the first feature information to obtain the confidence that the candidate 2D frame belongs to each category in the multiple tasks; using the classification and regression network to adjust the position information of the candidate 2D frame to obtain the adjusted candidate 2D frame; determining the target 2D frame according to the adjusted candidate 2D frame; and determining the first indication information according to the confidence that the target 2D frame belongs to each category.
  • the classification and regression network includes a first regional convolutional neural network RCNN
  • the first RCNN includes a hidden layer, multiple sub-category fully connected layers and multiple sub-regression fully connected layers
  • the hidden layer is connected to the multiple sub-category fully connected layers
  • the hidden layer is The layer is connected with multiple sub-regression fully-connected layers
  • the multiple sub-classification fully-connected layers are in one-to-one correspondence with multiple tasks
  • the multiple sub-regression fully-connected layers are in one-to-one correspondence with multiple tasks
• using the classification and regression network to process the first feature information and outputting the target 2D frame of the target object and the first indication information includes: using the hidden layer to process the first feature information to obtain second feature information; using the sub-classification fully connected layer to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to the object category in the task corresponding to the sub-classification fully connected layer; and using the sub-regression fully connected layer to adjust the position information of the candidate 2D frame according to the second feature information to obtain the adjusted candidate 2D frame.
  • the classification and regression network includes a second RCNN
  • the second RCNN includes a hidden layer, a classification fully connected layer and a regression fully connected layer, the hidden layer is connected to the classification fully connected layer, and the hidden layer is connected to the regression fully connected layer
• using the classification and regression network to process the first feature information and outputting the target 2D frame of the target object and the first indication information includes: using the hidden layer to process the first feature information to obtain third feature information; using the classification fully connected layer to obtain, according to the third feature information, the confidence that the candidate 2D frame belongs to each category; and using the regression fully connected layer to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.
  • the classification fully connected layer is obtained by merging multiple sub-classification fully connected layers in the first RCNN
  • the regression fully connected layer is obtained by merging multiple sub-regression fully connected layers in the first RCNN
  • the first RCNN includes a hidden layer, multiple sub-category fully-connected layers, and multiple sub-regression fully-connected layers.
  • the hidden layer is connected to multiple sub-category fully-connected layers, the hidden layer is connected to multiple sub-regression fully-connected layers, and the multiple sub-category fully-connected layers are connected to Multiple tasks are in one-to-one correspondence, and multiple sub-regression fully-connected layers are in one-to-one correspondence with multiple tasks;
• the sub-classification fully-connected layer is used to obtain, according to the third feature information, the confidence that the candidate 2D frame belongs to the object category in the task corresponding to the sub-classification fully-connected layer; the sub-regression fully-connected layer is used to adjust the position information of the candidate 2D frame according to the third feature information to obtain the adjusted candidate 2D frame.
  • FIG. 16 shows the processing flow of the object recognition method provided by the embodiment of the present application.
• the processing flow in FIG. 16 can be regarded as a specific implementation of the method shown in FIG. 15, and the method in FIG. 16 is executed by the perceptual network shown in the following figure.
  • the structure of the perception network adopted in this embodiment of the present application is shown in FIG. 17 .
  • the perceptual network includes: backbone, RPN, region of interest extraction module and second RCNN.
  • the sensory network shown in FIG. 17 can be regarded as a specific implementation of the sensory network shown in FIG. 11 .
  • the perceptual network in Figure 17 can simultaneously complete the object detection of the 8 categories in Table 1. In other words, the perceptual network in Figure 17 can simultaneously complete the target detection of the 8 tasks in Table 1.
• the perception network shown in FIG. 17 may be determined according to the perception network shown in FIG. 14. For example, the classification fc in the second RCNN is obtained by merging the multiple sub-classification fc in the first RCNN, and the regression fc in the second RCNN is obtained by merging the multiple sub-regression fc in the first RCNN.
  • the perceptual network of the present application can flexibly add or delete the sub-classification fc and sub-regression fc in the first RCNN according to the needs of the business, so as to achieve target detection of different numbers of tasks.
  • the method 1600 includes steps S1610 to S1650.
  • step S1620 may be performed by the backbone in FIG. 17 .
  • the backbone performs convolution processing on the input image to generate several feature maps of different scales, that is, the first feature map.
  • the backbone can adopt various forms of convolutional networks, such as VGG16, Resnet50, or Inception-Net, etc.
  • step S1620 may further include: performing feature fusion based on the first feature map, and outputting the fused feature map.
  • the feature maps output by the backbone network or FPN can be provided as basic features to subsequent modules.
  • step S1630 may be performed by the RPN in FIG. 17 .
  • the RPN predicts the region where the target object is located on the second feature map, and outputs a candidate 2D frame matching the region where the target object is located.
  • the target object includes objects to be detected in multiple tasks.
  • the second feature map may include the feature map output by the backbone network or FPN.
  • RPN predicts areas where target objects may exist based on the feature map provided by backbone or FPN, and outputs candidate frames of these areas, or the coordinates of candidate areas (proposal).
• the RPN can predict candidate frames of the regions where objects to be detected in the eight categories in Table 1 may exist.
  • step S1640 may be performed by the region of interest extraction module in FIG. 17 .
  • the region of interest extraction module extracts the features of the region where the candidate 2D frame is located on the third feature map.
  • the third feature map can be a feature map provided by backbone or FPN.
• the region of interest extraction module extracts the features of the region where each proposal is located on a feature map provided by the backbone or FPN according to the coordinates of the proposal provided by the RPN, and resizes them to a fixed size to obtain the feature of each proposal.
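• As a sketch, this per-proposal feature extraction could be implemented with an RoI Align operator, for example torchvision's roi_align; the spatial_scale, feature-map shape and output size below are assumptions, not values specified in the embodiments.

```python
import torch
from torchvision.ops import roi_align

# feature_map: (1, C, H, W) map provided by the backbone or FPN
# proposals:   (N, 4) candidate 2D boxes in image coordinates (x1, y1, x2, y2)
feature_map = torch.randn(1, 256, 96, 160)
proposals = torch.tensor([[32.0, 48.0, 128.0, 144.0],
                          [200.0, 80.0, 320.0, 220.0]])

# prepend the batch index expected by roi_align, then crop and resize each region
rois = torch.cat([torch.zeros(len(proposals), 1), proposals], dim=1)
proposal_feats = roi_align(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 8)
print(proposal_feats.shape)  # torch.Size([2, 256, 7, 7]) -- one fixed-size feature per proposal
```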
  • step S1650 may be performed by the second RCNN in FIG. 17 .
• the hidden layer in the second RCNN, for example, the shared fc/conv, further performs feature extraction on the features of each proposal extracted by the region of interest extraction module, and sends them to the cls fc and reg fc. The cls fc classifies the proposal to obtain the confidence that each proposal belongs to each category, the reg fc adjusts the coordinates of the 2D frame of the proposal to obtain more compact 2D frame coordinates, and then a box merging operation, such as an NMS operation, is performed to merge the adjusted 2D boxes and output the target 2D box and the classification result.
  • the classification result can be used as the first indication information.
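• The post-processing described above (per-category confidences followed by box merging) could look roughly like the following sketch, using torchvision's nms as the box-merging operation; the score threshold and IoU threshold are illustrative assumptions.

```python
import torch
from torchvision.ops import nms

def postprocess(cls_logits, refined_boxes, score_thresh=0.3, iou_thresh=0.5):
    """cls_logits:    (N, num_categories) raw outputs of the classification fc
    refined_boxes: (N, 4) candidate 2D boxes already adjusted by the regression fc
    Returns the target 2D boxes, their categories, and their confidences."""
    scores = torch.sigmoid(cls_logits)       # per-category confidence (independent per class)
    conf, labels = scores.max(dim=1)         # best category per candidate box
    keep = conf >= score_thresh              # drop low-confidence candidates
    boxes, conf, labels = refined_boxes[keep], conf[keep], labels[keep]
    keep_idx = nms(boxes, conf, iou_thresh)  # merge overlapping boxes
    return boxes[keep_idx], labels[keep_idx], conf[keep_idx]
```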
  • the weights of the classification fc in the second RCNN in FIG. 17 are obtained by combining the weights of multiple sub-classifications fc of the first RCNN in FIG. 16 .
  • the weights of the regression fc in the second RCNN in FIG. 17 are obtained by combining the weights of the multiple sub-regression fcs of the first RCNN in FIG. 16 .
• the first RCNN in Figure 16 uses the sigmoid function to normalize the logits obtained by each sub-classification fc to obtain the confidence of each category, which is equivalent to performing a binary classification for each category, so the confidence of the current category has no relationship with the other categories. Therefore, during inference, the sub-classification fc of all tasks of the model can be combined into one classification fc, and the sub-regression fc can also be combined into one regression fc.
• For example, the candidate 2D frame is a rectangular frame, the position information of the candidate 2D frame is represented by 4 values, the length of the feature output by the hidden layer is 1024, and the number of categories in each task is n. Then the weight of the sub-regression fc in each task is a tensor of 1024*4n, and the weight of the sub-classification fc in each task is a tensor of 1024*n. The weight of the regression fc formed after merging is a tensor of 1024*124, and the weight of the classification fc is a tensor of 1024*31.
  • the second RCNN obtained after merging only includes one classification fc and one regression fc, and its input and output are consistent with the tensor shape of the combined weight, that is, the input of classification fc and regression fc is 1024, the output of classification fc is 31, and the output of regression fc is 124.
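• Because each category's sigmoid output is independent, merging the per-task fc layers at inference time amounts to concatenating their weight matrices; a minimal sketch, assuming PyTorch Linear layers with the dimensions from the example above (the helper name is hypothetical).

```python
import torch
import torch.nn as nn

def merge_fc_layers(sub_fcs):
    """Concatenate several nn.Linear(1024, n_i) layers into one nn.Linear(1024, sum(n_i)).
    Valid here because each category's confidence is an independent sigmoid output."""
    in_dim = sub_fcs[0].in_features
    out_dim = sum(fc.out_features for fc in sub_fcs)
    merged = nn.Linear(in_dim, out_dim)
    with torch.no_grad():
        merged.weight.copy_(torch.cat([fc.weight for fc in sub_fcs], dim=0))
        merged.bias.copy_(torch.cat([fc.bias for fc in sub_fcs], dim=0))
    return merged

# e.g. sub-classification fc layers with 31 outputs in total -> one 1024*31 classification fc,
# and sub-regression fc layers with 4*31 = 124 outputs in total -> one 1024*124 regression fc
```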
• Table 2 shows a comparison of the parameter amount and calculation amount of the single-head-end multi-task network in the embodiment of the present application and the existing multi-head-end multi-task network when implementing 8 tasks with an input image size of 720*1280 (@720p). That is, Table 2 compares the parameter amount and calculation amount of the 8-task single-head-end multi-task network and the multi-head-end multi-task network.
  • Table 3 shows the comparison of inference time consumption between the single-head multi-task network in the embodiment of the present application and the existing multi-head multi-task network.
• the single-head-end multi-task network of the embodiment of the present application reduces the latency by 17% and 22% on images with resolutions of 720p and 1080p, respectively, which significantly improves processing efficiency and is conducive to deployment in scenarios with high real-time requirements.
  • the single-head multitasking network in the embodiment of the present application can achieve the same detection performance as the multi-heading multitasking network.
  • Table 4 shows the performance comparison of the single-head-end multi-task network and the multi-head-end multi-task network on some categories.
• Table 4 columns: category, AP of the multi-head-end multi-task network, AP of the single-head-end multi-task network.
• the average precision (AP) of the single-head-end multi-task network in the embodiment of the present application and that of the existing multi-head-end multi-task network differ little, that is, the performance of the two is comparable. It can be seen from this that the single-head-end multi-task network in the embodiment of the present application can save computation and memory on the premise of ensuring the performance of the model.
  • FIG. 19 is a schematic block diagram of an apparatus according to an embodiment of the present application.
  • the apparatus 4000 shown in FIG. 19 includes an acquisition unit 4010 and a processing unit 4020 .
  • the apparatus 4000 may be used as a training apparatus for a perceptual network, and the acquiring unit 4010 and the processing unit 4020 may be used to perform the training method of the perceptual network of the embodiments of the present application, for example, may be used to perform the method 1200 or Method 1300.
  • the perception network includes a candidate region generation network RPN, and the RPN is used to predict the position information of the candidate two-dimensional 2D frame of the target object in the sample image.
  • the target object includes objects to be detected for multiple tasks. Each task includes at least one category, and the target objects include a first task object and a second task object.
  • the obtaining unit 4010 is used to obtain training data, the training data includes the sample image, the labeling data of the first task object on the sample image, and the pseudo frame of the second task object on the sample image, and the labeling data includes the class label of the first task object and
  • the labeled 2D frame of the first task object, and the pseudo frame of the second task object is the target 2D frame of the second task object obtained by inferring the sample image through other perceptual networks.
  • the processing unit 4020 is configured to train the perception network based on the training data.
  • the perception network further includes a backbone network, a region of interest extraction module, and a classification and regression network
  • the processing unit 4020 is specifically configured to: according to the marked 2D frame of the first task object and the target 2D of the second task object.
  • the difference between the frame and the candidate 2D frame of the target object in the sample image predicted by RPN calculates the first loss function value; calculates the second loss function value of the perceptual network according to the labeled data;
  • the function value is back-propagated, and the parameters of the part to be trained in the perception network are adjusted.
  • the part to be trained in the perception network includes the part to be trained in the classification and regression network, the region of interest extraction module, the RPN and the backbone network, and the classification
  • the part of the regression network that needs to be trained is determined according to the first task object.
  • the backbone network is used to perform convolution processing on the sample image and output the first feature map of the sample image;
  • RPN is used to output the position of the candidate 2D frame of the target object based on the second feature map.
  • the second feature map is determined according to the first feature map;
  • the region of interest extraction module is used to extract the first feature information on the third feature map based on the position information of the candidate 2D frame, and the first feature information is the candidate 2D frame
  • the feature of the area, the third feature map is determined according to the first feature map;
  • the classification and regression network is used to process the first feature information, output the target 2D frame of the target object and the first indication information, the target 2D of the target object
  • the number of boxes is less than or equal to the number of candidate 2D boxes of the target object, and the first indication information is used to indicate the category to which the target object belongs.
• the classification and regression network includes a first regional convolutional neural network RCNN, the first RCNN includes a hidden layer, multiple sub-classification fully connected layers and multiple sub-regression fully connected layers, the hidden layer is connected with the multiple sub-classification fully connected layers, the hidden layer is connected with the multiple sub-regression fully connected layers, the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks;
• the hidden layer is used to process the first feature information to obtain second feature information;
• the sub-classification fully connected layer is used to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to the object category in the task corresponding to the sub-classification fully connected layer;
• the sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the result of the hidden layer processing to obtain the adjusted candidate 2D frame; and the part to be trained in the classification and regression network includes the hidden layer and the sub-classification fully connected layer and sub-regression fully connected layer corresponding to the task where the first task object is located.
  • the device 4000 may function as an object recognition device.
  • the object recognition apparatus includes an acquisition unit 4010 and a processing unit 4020 .
  • the perception network includes: backbone network, candidate region generation network, region of interest extraction module and classification regression network.
  • the acquiring unit 4010 and the processing unit 4020 may be used to execute the object recognition method of the embodiments of the present application, for example, may be used to execute the method 1500 or the method 1600 .
  • the acquisition unit 4010 is used to acquire an input image.
  • the processing unit 4020 is configured to use the backbone network to perform convolution processing on the input image to obtain the first feature map of the input image; use the RPN to output the position information of the candidate two-dimensional 2D frame of the target object based on the second feature map, and the target object includes a plurality of The object to be detected in the task, each task in the multiple tasks includes at least one category, and the second feature map is determined according to the first feature map; the region of interest extraction module is used based on the position information of the candidate 2D frame.
  • Extract the first feature information from the feature map the first feature information is the feature of the area where the candidate 2D frame is located, and the third feature map is determined according to the first feature map; use the classification and regression network to process the first feature information to obtain the target object
  • the target 2D frame of the target object and the first indication information the number of target 2D frames of the target object is less than or equal to the number of candidate 2D frames of the target object, and the first indication information is used to indicate the category to which the target object belongs.
  • the processing unit 4020 is specifically configured to: use a classification and regression network to process the first feature information to obtain the confidence that the candidate 2D frame belongs to each category in multiple tasks;
• use the classification and regression network to adjust the position information of the candidate 2D frame to obtain the adjusted candidate 2D frame;
  • the target 2D frame is determined according to the adjusted candidate 2D frame;
  • the first indication information is determined according to the confidence that the target 2D frame belongs to each category.
• the classification and regression network includes a first regional convolutional neural network RCNN, the first RCNN includes a hidden layer, multiple sub-classification fully connected layers and multiple sub-regression fully connected layers, the hidden layer is connected with the multiple sub-classification fully connected layers, the hidden layer is connected with the multiple sub-regression fully connected layers, the multiple sub-classification fully connected layers are in one-to-one correspondence with the multiple tasks, and the multiple sub-regression fully connected layers are in one-to-one correspondence with the multiple tasks; and the processing unit is specifically configured to: use the hidden layer to process the first feature information to obtain the second feature information; use the sub-classification fully connected layer to obtain, according to the second feature information, the confidence that the candidate 2D frame belongs to the object category in the task corresponding to the sub-classification fully connected layer; and use the sub-regression fully connected layer to adjust the position information of the candidate 2D frame according to the second feature information to obtain the adjusted candidate 2D frame.
  • the classification and regression network includes a second RCNN
  • the second RCNN includes a hidden layer, a classification fully connected layer, and a regression fully connected layer
  • the hidden layer is connected to the classification fully connected layer
  • the hidden layer is connected to the regression fully connected layer.
• the processing unit 4020 is specifically configured to: use the hidden layer to process the first feature information to obtain the third feature information; use the classification fully connected layer to obtain, according to the obtained third feature information, the confidence that the candidate 2D frame belongs to each category; and use the regression fully connected layer to adjust the position information of the candidate 2D frame according to the obtained third feature information to obtain the adjusted candidate 2D frame.
  • the classification fully connected layer is obtained by merging multiple sub-classification fully connected layers in the first RCNN, and the regression fully connected layer is obtained by combining multiple sub-regression fully connected layers in the first RCNN.
• the first RCNN includes a hidden layer, multiple sub-classification fully connected layers and multiple sub-regression fully connected layers, the hidden layer is connected with the multiple sub-classification fully connected layers, the hidden layer is connected with the multiple sub-regression fully connected layers, the multiple sub-classification fully connected layers correspond to the multiple tasks one-to-one, and the multiple sub-regression fully connected layers correspond to the multiple tasks one-to-one;
• the sub-classification fully connected layer is used to obtain, according to the obtained third feature information, the confidence that the candidate 2D frame belongs to the object category in the task corresponding to the sub-classification fully connected layer; the sub-regression fully connected layer is used to adjust the position information of the candidate 2D frame according to the obtained third feature information to obtain the adjusted candidate 2D frame.
  • apparatus 4000 is embodied in the form of functional units.
  • unit here can be implemented in the form of software and/or hardware, which is not specifically limited.
  • a "unit” may be a software program, a hardware circuit, or a combination of the two that realizes the above-mentioned functions.
• the hardware circuits may include application specific integrated circuits (ASICs), electronic circuits, processors for executing one or more software or firmware programs (for example, shared processors, dedicated processors, or group processors) and memory, combinational logic circuits, and/or other suitable components that support the described functions.
  • the units of each example described in the embodiments of the present application can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.
  • FIG. 20 is a schematic diagram of a hardware structure of an apparatus provided by an embodiment of the present application.
  • the apparatus 6000 shown in FIG. 20 (the apparatus 6000 may specifically be a computer device) includes a memory 6001 , a processor 6002 , a communication interface 6003 and a bus 6004 .
  • the memory 6001 , the processor 6002 , and the communication interface 6003 are connected to each other through the bus 6004 for communication.
  • the apparatus 6000 may serve as a training apparatus for a perceptual network.
  • the memory 6001 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 6001 may store a program, and when the program stored in the memory 6001 is executed by the processor 6002, the processor 6002 is configured to execute each step of the method for training a perceptual network according to the embodiment of the present application. Specifically, the processor 6002 may perform step S1220 in the method shown in FIG. 12 above.
• the processor 6002 may adopt a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute the relevant program so as to implement the training method of the perceptual network of the method embodiments of the present application.
  • the processor 6002 may also be an integrated circuit chip with signal processing capability, for example, the chip shown in FIG. 5 .
  • each step of the training method of the perceptual network of the present application can be completed by the hardware integrated logic circuit in the processor 6002 or the instructions in the form of software.
• the above-mentioned processor 6002 may also be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
• the storage medium is located in the memory 6001, and the processor 6002 reads the information in the memory 6001 and, in combination with its hardware, completes the functions required to be performed by the units included in the training device of the embodiments of the present application, or executes the training method of the perceptual network shown in FIG. 12 of the method embodiments of the present application.
  • the communication interface 6003 uses a transceiver apparatus, such as but not limited to a transceiver, to implement communication between the device 6000 and other devices or a communication network. For example, training data can be obtained through the communication interface 6003.
  • the bus 6004 may include a pathway for communicating information between the various components of the device 6000 (eg, the memory 6001, the processor 6002, the communication interface 6003).
  • the device 6000 may function as an object recognition device.
  • the memory 6001 may be a ROM, a static storage device or a RAM.
  • the memory 6001 may store a program, and when the program stored in the memory 6001 is executed by the processor 6002, the processor 6002 and the communication interface 6003 are used to execute each step of the object recognition method of the embodiments of the present application.
  • the processor 6002 may perform steps S1520 to S1540 in the method shown in FIG. 15 above.
  • the processor 6002 may adopt a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is used to execute the relevant programs so as to realize the functions required to be performed by the units in the object recognition apparatus of the embodiments of the present application, or to execute the object recognition method of the method embodiments of the present application.
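  • As an illustrative, non-limiting sketch (not taken from this application), the following fragment shows the corresponding inference path: an image obtained, for example, through communication interface 6003 is processed by processor 6002 running the perception model, and responses above a confidence threshold are treated as detections. The function name (recognize), the stand-in model and the threshold are hypothetical placeholders.

```python
# Hypothetical sketch of the object recognition path on apparatus 6000.
# The stand-in model, threshold and function name are placeholders only.
import torch
import torch.nn as nn


@torch.no_grad()
def recognize(model: nn.Module, image: torch.Tensor,
              score_threshold: float = 0.5) -> torch.Tensor:
    """Run the model on one image and keep responses above a score threshold."""
    model.eval()
    scores = torch.sigmoid(model(image.unsqueeze(0)))   # pseudo confidence map
    return (scores > score_threshold).nonzero()         # indices treated as detections


if __name__ == "__main__":
    stand_in_model = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # placeholder network
    image = torch.randn(3, 64, 64)        # stand-in for data obtained via interface 6003
    print(recognize(stand_in_model, image).shape)
```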
  • the processor 6002 may also be an integrated circuit chip with signal processing capability, for example, the chip shown in FIG. 6 .
  • each step of the object recognition method of the embodiment of the present application may be completed by an integrated logic circuit of hardware in the processor 6002 or an instruction in the form of software.
  • the above-mentioned processor 6002 may also be a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor 6002 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 6001, and the processor 6002 reads the information in the memory 6001 and, in combination with its hardware, completes the functions required to be performed by the units included in the object recognition apparatus of the embodiments of the present application, or performs the object recognition method of the method embodiments of the present application.
  • the communication interface 6003 uses a transceiver device such as but not limited to a transceiver to implement communication between the device 6000 and other devices or a communication network.
  • the data to be processed can be acquired through the communication interface 6003 .
  • the bus 6004 may include a pathway for communicating information between the various components of the device 6000 (eg, the memory 6001, the processor 6002, the communication interface 6003).
  • the apparatus 6000 may also include other devices necessary for normal operation. Meanwhile, those skilled in the art should understand that, according to specific needs, the apparatus 6000 may further include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the apparatus 6000 may only include the devices necessary for implementing the embodiments of the present application, and does not necessarily include all the devices shown in FIG. 20.
  • An embodiment of the present application provides a computer-readable medium, where the computer-readable medium stores program code executed by a device, where the program code includes relevant content for executing the object recognition method shown in FIG. 15 or FIG. 16 .
  • An embodiment of the present application provides a computer-readable medium, where the computer-readable medium stores program code executed by a device, where the program code includes relevant content for executing the training method shown in FIG. 12 or FIG. 13 .
  • An embodiment of the present application provides a computer program product, which, when the computer program product runs on a computer, enables the computer to execute the relevant content of the object recognition method shown in FIG. 15 or FIG. 16 .
  • An embodiment of the present application provides a computer program product, which, when the computer program product runs on a computer, enables the computer to execute the relevant content of the training method shown in FIG. 12 or FIG. 13 .
  • An embodiment of the present application provides a chip, where the chip includes a processor and a data interface, the processor reads an instruction on a memory through the data interface, and executes the object recognition method as shown in FIG. 15 or FIG. 16 .
  • An embodiment of the present application provides a chip, where the chip includes a processor and a data interface, the processor reads an instruction on a memory through the data interface, and executes the training method as shown in FIG. 12 or FIG. 13 .
  • the chip may further include a memory in which instructions are stored; the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the object recognition method of FIG. 15 or FIG. 16, or the training method of FIG. 12 or FIG. 13.
  • the processor in the embodiments of the present application may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory in the embodiments of the present application may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM) or a flash memory.
  • Volatile memory may be random access memory (RAM), which acts as an external cache.
  • Many forms of RAM can be used, for example: static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof.
  • the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center in a wired or wireless (for example, infrared, radio or microwave) manner.
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, or the like that contains one or more sets of available media.
  • the usable media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media.
  • the semiconductor medium may be a solid state drive.
  • In the present application, "at least one" means one or more, and "a plurality of" means two or more.
  • "At least one of the following items" or a similar expression refers to any combination of these items, including any combination of a single item or plural items.
  • For example, at least one of a, b, or c can represent: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b and c may be single or multiple.
  • the size of the sequence numbers of the above-mentioned processes does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling, direct coupling or communication connection may be implemented through some interfaces; the indirect coupling or communication connection between apparatuses or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Perception network (800), training method (1200) for a perception network, and object recognition method (1500) and apparatus. The perception network (800) comprises a backbone network (810), a region proposal network (RPN) (820), a region-of-interest extraction module (830), and a classification and regression network (840). A plurality of perception tasks share the RPN (820): one RPN (820) predicts the regions in which the objects to be detected in the plurality of tasks are located, and the classification and regression network obtains a target 2D frame and a classification result. The perception network (800) can reduce the number of parameters and the amount of computation in a multi-task perception network, reduce hardware power consumption, and improve the running speed of a model.
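The following Python/PyTorch sketch is a hypothetical, non-limiting illustration of the structure summarised in the abstract: a backbone, a single region proposal stage shared by several perception tasks, region feature extraction, and per-task classification and box-regression heads producing a 2D frame and a classification result. All names, layer sizes and the simplified proposal logic are placeholders and are not taken from the application.

```python
# Hypothetical sketch of a multi-task perception network sharing one RPN stage.
# SharedRPNPerceptionNet and all layer choices are illustrative placeholders.
import torch
import torch.nn as nn


class SharedRPNPerceptionNet(nn.Module):
    def __init__(self, num_tasks: int = 2, classes_per_task: int = 4) -> None:
        super().__init__()
        self.backbone = nn.Sequential(               # stand-in backbone
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.rpn_score = nn.Conv2d(16, 1, 1)         # one proposal stage shared by all tasks
        self.pool = nn.AdaptiveAvgPool2d((1, 1))     # stand-in for region feature extraction
        # one classification head and one box-regression head per task
        self.cls_heads = nn.ModuleList(nn.Linear(16, classes_per_task) for _ in range(num_tasks))
        self.reg_heads = nn.ModuleList(nn.Linear(16, 4) for _ in range(num_tasks))

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)                        # shared feature map
        objectness = torch.sigmoid(self.rpn_score(feats))    # shared proposal scores
        roi_feat = self.pool(feats * objectness).flatten(1)  # crude region pooling
        # each task reuses the same proposals but applies its own heads
        return [(cls(roi_feat), reg(roi_feat))
                for cls, reg in zip(self.cls_heads, self.reg_heads)]


if __name__ == "__main__":
    net = SharedRPNPerceptionNet()
    outputs = net(torch.randn(1, 3, 64, 64))
    for task_id, (scores, box) in enumerate(outputs):
        print(task_id, scores.shape, box.shape)      # per-task class scores and 2D box
```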
PCT/CN2021/086643 2021-04-12 2021-04-12 Réseau cognitif, procédé de formation de réseau cognitif, et procédé et appareil de reconnaissance d'objet WO2022217434A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/086643 WO2022217434A1 (fr) 2021-04-12 2021-04-12 Réseau cognitif, procédé de formation de réseau cognitif, et procédé et appareil de reconnaissance d'objet
CN202180096605.3A CN117157679A (zh) 2021-04-12 2021-04-12 感知网络、感知网络的训练方法、物体识别方法及装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/086643 WO2022217434A1 (fr) 2021-04-12 2021-04-12 Réseau cognitif, procédé de formation de réseau cognitif, et procédé et appareil de reconnaissance d'objet

Publications (1)

Publication Number Publication Date
WO2022217434A1 true WO2022217434A1 (fr) 2022-10-20

Family

ID=83639331

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/086643 WO2022217434A1 (fr) 2021-04-12 2021-04-12 Réseau cognitif, procédé de formation de réseau cognitif, et procédé et appareil de reconnaissance d'objet

Country Status (2)

Country Link
CN (1) CN117157679A (fr)
WO (1) WO2022217434A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115713500A (zh) * 2022-11-07 2023-02-24 广州汽车集团股份有限公司 一种视觉感知方法及装置
CN116821699A (zh) * 2023-08-31 2023-09-29 山东海量信息技术研究院 一种感知模型训练方法、装置及电子设备和存储介质
WO2024103803A1 (fr) * 2022-11-16 2024-05-23 华为技术有限公司 Procédé et appareil de détection de cible, et support d'enregistrement

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10223610B1 (en) * 2017-10-15 2019-03-05 International Business Machines Corporation System and method for detection and classification of findings in images
CN109784194A (zh) * 2018-12-20 2019-05-21 上海图森未来人工智能科技有限公司 目标检测网络构建方法和训练方法、目标检测方法
CN110298262A (zh) * 2019-06-06 2019-10-01 华为技术有限公司 物体识别方法及装置
CN110533067A (zh) * 2019-07-22 2019-12-03 杭州电子科技大学 基于深度学习的边框回归的端到端弱监督目标检测方法
CN110705544A (zh) * 2019-09-05 2020-01-17 中国民航大学 基于Faster-RCNN的自适应快速目标检测方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10223610B1 (en) * 2017-10-15 2019-03-05 International Business Machines Corporation System and method for detection and classification of findings in images
CN109784194A (zh) * 2018-12-20 2019-05-21 上海图森未来人工智能科技有限公司 目标检测网络构建方法和训练方法、目标检测方法
CN110298262A (zh) * 2019-06-06 2019-10-01 华为技术有限公司 物体识别方法及装置
CN110533067A (zh) * 2019-07-22 2019-12-03 杭州电子科技大学 基于深度学习的边框回归的端到端弱监督目标检测方法
CN110705544A (zh) * 2019-09-05 2020-01-17 中国民航大学 基于Faster-RCNN的自适应快速目标检测方法

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115713500A (zh) * 2022-11-07 2023-02-24 广州汽车集团股份有限公司 一种视觉感知方法及装置
WO2024103803A1 (fr) * 2022-11-16 2024-05-23 华为技术有限公司 Procédé et appareil de détection de cible, et support d'enregistrement
CN116821699A (zh) * 2023-08-31 2023-09-29 山东海量信息技术研究院 一种感知模型训练方法、装置及电子设备和存储介质
CN116821699B (zh) * 2023-08-31 2024-01-19 山东海量信息技术研究院 一种感知模型训练方法、装置及电子设备和存储介质

Also Published As

Publication number Publication date
CN117157679A (zh) 2023-12-01

Similar Documents

Publication Publication Date Title
WO2020253416A1 (fr) Procédé et dispositif de détection d'objet et support de stockage informatique
WO2020244653A1 (fr) Procédé et dispositif d'identification d'objet
WO2021043112A1 (fr) Procédé et appareil de classification d'images
Mendes et al. Exploiting fully convolutional neural networks for fast road detection
WO2021147325A1 (fr) Procédé et appareil de détection d'objets, et support de stockage
WO2021155792A1 (fr) Appareil de traitement, procédé et support de stockage
WO2022217434A1 (fr) Réseau cognitif, procédé de formation de réseau cognitif, et procédé et appareil de reconnaissance d'objet
CN111368972B (zh) 一种卷积层量化方法及其装置
US20220148291A1 (en) Image classification method and apparatus, and image classification model training method and apparatus
CN110222718B (zh) 图像处理的方法及装置
Ayachi et al. Pedestrian detection based on light-weighted separable convolution for advanced driver assistance systems
CN113591872A (zh) 一种数据处理系统、物体检测方法及其装置
CN111401517A (zh) 一种感知网络结构搜索方法及其装置
Grigorev et al. Depth estimation from single monocular images using deep hybrid network
WO2022179606A1 (fr) Procédé de traitement d'image et appareil associé
WO2022156475A1 (fr) Procédé et appareil de formation de modèle de réseau neuronal, et procédé et appareil de traitement de données
CN114972182A (zh) 一种物体检测方法及其装置
Li et al. Pedestrian detection based on light perception fusion of visible and thermal images
US20230401826A1 (en) Perception network and data processing method
WO2023179593A1 (fr) Procédé et dispositif de traitement de données
CN113449550A (zh) 人体重识别数据处理的方法、人体重识别的方法和装置
CN114881096A (zh) 多标签的类均衡方法及其装置
EP4361885A1 (fr) Procédé, appareil et système de traitement de données
US20240169733A1 (en) Method and electronic device with video processing
CN115731530A (zh) 一种模型训练方法及其装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21936329

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21936329

Country of ref document: EP

Kind code of ref document: A1