WO2021218786A1

WO2021218786A1 - Data processing system, object detection method and apparatus thereof

Info

Publication number: WO2021218786A1
Application number: PCT/CN2021/089118
Authority: WO
Inventors: 应江勇; 朱雄威; 高敬; 陈雷
Original assignee: 华为技术有限公司
Priority date: 2020-04-30
Filing date: 2021-04-23
Publication date: 2021-11-04
Also published as: US20230076266A1; CN113591872A

Abstract

Disclosed in the present application are a data processing system, an object detection method and an apparatus thereof, which are applied to the field of artificial intelligence. In the present application, a second feature map generation unit introduces texture detail information of a shallow layer of an original feature map (multiple first feature maps generated by a convolutional processing unit) into a deep feature map (multiple second feature maps generated by a first feature map generation unit), so as to generate multiple third feature maps, and the third feature maps having rich texture detail information of the shallow layers are used as input data for a detection unit to perform target detection, and the detection accuracy of subsequent object detection can be improved.

Description

Data processing system, object detection method and device

This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on April 30, 2020, the application number is 202010362601.2, and the invention title is "a data processing system, object detection method and device", the entire content of which is by reference Incorporated in this application.

Technical field

This application relates to the field of artificial intelligence, and in particular to a data processing system, object detection method and device.

Background technique

Computer vision is an inseparable part of various intelligent/autonomous systems in various application fields, such as manufacturing, inspection, document analysis, medical diagnosis, and military. It is about how to use cameras/video cameras and computers to obtain What we need is the knowledge of the data and information of the subject. To put it vividly, it is to install eyes (camera or video camera) and brain (algorithm) to the computer to replace the human eye to identify, track and measure the target, so that the computer can perceive the environment. Because perception can be seen as extracting information from sensory signals, computer vision can also be seen as a science that studies how to make artificial systems "perceive" from images or multi-dimensional data. In general, computer vision uses various imaging systems to replace the visual organs to obtain input information, and then the computer replaces the brain to complete the processing and interpretation of the input information. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision like humans, and have the ability to adapt to the environment autonomously.

The perception network can be a neural network model that processes and analyzes images and obtains the processing results. At present, the perception network can complete more and more functions, such as image classification, 2D detection, semantic segmentation, and key point detection. , Linear object detection (such as lane line or stop line detection in automatic driving technology), drivable area detection, etc. In addition, the visual perception system has the characteristics of low cost, non-contact, small size, and large amount of information. With the continuous improvement of the accuracy of visual perception algorithms, it has become the key technology of many artificial intelligence systems today, and has been more and more widely used, such as: advanced driving assistance system (ADAS) and automatic driving system (autonomous driving system). Recognition of dynamic obstacles (people or cars) and static objects (traffic lights, traffic signs or traffic cones) on the road in the driving system (ADS), through the recognition of human masks in the terminal visual camera beauty function Mask and key points to achieve slimming effect, etc.

Perceptual networks usually include feature pyramid networks (FPN), which introduces a top-to-bottom network structure, and introduces horizontal connection branches derived from the original feature extraction network, and the original feature network corresponds to the resolution feature map and After upsampling the deep feature maps for fusion, the top-to-bottom network structure introduced in FPN has a larger receptive field, but its detection accuracy for small objects is low.

Summary of the invention

In the first aspect, the present application provides an object detection method, the method is used in a first perception network, and the method includes:

Receiving an input first image, and performing convolution processing on the first image to generate a plurality of first feature maps;

It should be noted that “convolution processing on the input image” here should not be understood as only performing convolution processing on the input image. In some implementations, the input image can be Perform convolution processing, pooling operations, and so on.

It should be noted that “convolution processing on the first image to generate multiple first feature maps” here should not only be understood as performing multiple convolution processing on the first image each time Convolution processing can generate a first feature map, that is, it should not be understood that each first feature map is obtained based on convolution processing on the first image, but, on the whole, the first image is multiple The source of the first feature map; in one implementation, the first image can be convolved to obtain a first feature map, and then the generated first feature map can be convolved to obtain another first feature map. Feature maps, and so on, can get multiple first feature maps.

It should be noted that a series of convolution processing may be performed on the input image. Specifically, in each convolution processing, the first feature map obtained by the previous convolution processing may be subjected to convolution processing. , And then obtain a first feature map, and multiple first feature maps can be obtained by the above method.

It should be noted that the multiple first feature maps may be feature maps with multi-scale resolution, that is, the multiple first feature maps are not feature maps with the same resolution. In an optional implementation, the multiple first feature maps The feature map can form a feature pyramid.

The input image may be received, and the input image may be subjected to convolution processing to generate multiple first feature maps with multi-scale resolution; the convolution processing unit may perform a series of convolution processing on the input image , To obtain feature maps at different scales (with different resolutions). The convolution processing unit can take many forms, such as visual geometry group (VGG), residual neural network (residual neural network, resnet), GoogLeNet core structure (Inception-net), and so on.

Generate multiple second feature maps according to the multiple first feature maps; wherein the multiple first feature maps include more texture details and/or of the input image than the multiple second feature maps Or the location details in the input image.

It should be noted that the "generating multiple second feature maps based on the multiple first feature maps" here should not be understood to mean that the source of each second feature map generated in the multiple second feature maps is Multiple first feature maps; in one implementation, a part of the second feature maps in the multiple second feature maps are directly generated based on one or more first feature maps in the multiple first feature maps; one In this implementation, a part of the second feature maps in the multiple second feature maps are directly generated based on one or more first feature maps in the multiple first feature maps, and other second feature maps other than itself ; In one implementation, a part of the second feature maps in the multiple second feature maps are directly generated based on other second feature maps other than itself. At this time, due to "other second feature maps other than itself The map is generated based on one or more first characteristic maps "in a plurality of first characteristic maps", therefore, it can be understood as generating a plurality of second characteristic maps according to the plurality of first characteristic maps.

It should be noted that the multiple second feature maps may be feature maps with multi-scale resolution, that is, the multiple second feature maps are not feature maps with the same resolution. In an optional implementation, the multiple second feature maps The feature map can form a feature pyramid.

Among them, the convolution operation can be performed on the topmost feature map C4 in the multiple first feature maps generated by the convolution processing unit. Exemplarily, the hole convolution and 1×1 convolution can be used to convert the topmost feature map C4 The number of channels is reduced to 256, which is the top feature map P4 of the feature pyramid; the output results of the top and next feature maps C3 are horizontally linked and the number of channels is reduced to 256 using 1×1 convolution. The pixels are added to obtain the feature map P3; and so on, from top to bottom, a first feature pyramid is constructed, and the first feature pyramid may include a plurality of second feature maps.

Among them, the texture details can be used to express the detailed information of small targets and edge features. Compared with the second feature map, the first feature map includes more texture detail information, so that it can be used in the detection results of small target detection. The detection accuracy is higher. Wherein, the position details can be information that expresses the position of the object in the image and the relative position between the objects.

Compared with the first feature map, multiple second feature maps can include more deep features. Deep features contain rich semantic information, which has a good effect on classification tasks. At the same time, deep features have larger receptive fields. It has a good detection effect on large targets; in one implementation, by introducing a top-down path to generate multiple second feature maps, the rich semantic information contained in deep features can naturally be propagated downwards. Make the second feature maps of each scale contain rich semantic information.

A plurality of third characteristic maps are generated according to the plurality of first characteristic maps and the plurality of second characteristic maps.

It should be noted that, "generating multiple third feature maps based on the multiple first feature maps and the multiple second feature maps" here should not be understood as each of the multiple second feature maps The sources of the third feature maps are all multiple first feature maps and multiple second feature maps, but, as a whole, the multiple first feature maps and the multiple second feature maps are multiple third feature maps. The source of the feature map; in one implementation, a part of the third feature maps in the multiple third feature maps are based on one or more first feature maps and multiple second feature maps in the multiple first feature maps One or more of the second feature maps are generated; in one implementation, a part of the third feature maps in the multiple third feature maps are based on one or more first feature maps in the multiple first feature maps, One or more second feature maps in multiple second feature maps, and other third feature maps other than itself; in one implementation, a part of the third feature maps in multiple third feature maps It is generated based on a third feature map other than itself.

It should be noted that the multiple third feature maps may be feature maps with multi-scale resolution, that is, the multiple third feature maps are not feature maps with the same resolution. In an optional implementation, the multiple third feature maps The feature map can form a feature pyramid.

According to at least one third feature map of the plurality of third feature maps, a first detection result of the object included in the image is output.

In one implementation, the object can be a person, animal, plant, object, etc.

In an implementation, object detection can be performed on the image according to at least one third feature map of the plurality of third feature maps, where the object detection is to identify the type of object included in the image and where the object is located. The location and so on.

In an existing implementation, the second feature map generation unit (such as the feature pyramid network) introduces a top-down path to propagate the rich semantic information contained in deep features downwards, so that the second feature map generation unit of each scale Feature maps contain rich semantic information, and deep features have a large receptive field, which makes it possible to have a better detection effect on large targets. However, in the existing implementation, the more detailed position detail information and texture detail information contained in the shallower feature maps are ignored, which has a great impact on the detection accuracy of medium and small targets. In the embodiment of this application, the second feature map generation unit introduces the shallow texture detail information of the original feature map (multiple first feature maps generated by the convolution processing unit) into the deep feature map (generated by the first feature map generation unit). In multiple second feature maps), multiple third feature maps are generated, and the third feature map with shallow rich texture detail information is used as the input data of the detection unit for target detection, which can improve the detection of subsequent object detection Accuracy.

It should be noted that this embodiment does not mean that the detection accuracy of object detection for any image that includes small targets will be higher, but for a large number of samples, this embodiment can have higher comprehensive detection. Accuracy.

It should be noted that the above object detection method can be implemented by a data processing system, such as a trained perceptual network, where the perceptual network can include a convolution processing unit, a first feature map generating unit, a second feature map generating unit, and A detection unit, the convolution processing unit is respectively connected to the first feature map generating unit and the second feature map generating unit, the first feature map generating unit and the second feature map generating unit are connected, so The second feature map generating unit is connected to the detection unit, and the convolution processing unit is configured to receive an input image and perform convolution processing on the input image to generate a plurality of first feature maps; The first characteristic map generating unit is configured to generate a plurality of second characteristic maps according to the plurality of first characteristic maps; wherein, the plurality of first characteristic maps include more than the plurality of second characteristic maps The texture details of the input image and/or the position details in the input image; the second feature map generating unit is configured to perform according to the plurality of first feature maps and the plurality of second feature maps A plurality of third feature maps are generated; the detection unit is configured to output a detection result of an object included in the image according to at least one third feature map in the plurality of third feature maps.

In an implementation, the perception network may include: a backbone network, a first feature pyramid network FPN, a second FPN, and a head-end header, the backbone network is respectively connected to the first FPN and the second FPN, The first FPN is connected to the second FPN, and the second FPN is connected to the head (wherein, the convolution processing unit is the backbone network, and the first feature map generation unit is the first feature pyramid network FPN , The second feature map generating unit is a second feature pyramid network FPN, and the detection unit is a head).

The backbone network can be used to receive input images and perform convolution processing on the input images to generate multiple first feature maps with multi-scale resolution; the backbone network can perform a series of input images Convolution processing obtains feature maps at different scales (with different resolutions). The backbone network can take many forms, such as visual geometry group (VGG), residual neural network (residual neural network, resnet), the core structure of GoogLeNet (Inception-net), and so on.

The first FPN may be used to generate a first feature pyramid according to the multiple first feature maps, where the first feature pyramid includes multiple second feature maps with multi-scale resolution; wherein, the backbone network is generated Perform the convolution operation on the top-most feature map C4 of. Exemplarily, the number of channels of the top-most feature map C4 can be reduced to 256 using hole convolution and 1×1 convolution, as the top-most feature map P4 of the feature pyramid; After linking the output result of the feature map C3 of the top and the next layer and using 1×1 convolution to reduce the number of channels to 256, add it with the feature map P4 channel by pixel to obtain the feature map P3; and so on, from top to bottom, Construct the first characteristic pyramid.

The second FPN may be used to generate a second feature pyramid based on the multiple first feature maps and the multiple second feature maps, and the second feature pyramid includes multiple third feature maps with multi-scale resolution. Feature map

The head is used to detect the target object in the image according to at least one third feature map of the plurality of third feature maps, and output the detection result.

In the embodiment of this application, the second FPN introduces the rich edge, texture and other detailed information of the shallow layer of the original feature map (multiple first feature maps generated by the backbone network) into the deep feature map (multiple first feature maps generated by the first FPN). In the second feature map), a second feature pyramid is generated, and the third feature map with shallow, rich edge, texture, and other detailed information is used as the input data for head target detection, which can improve the detection accuracy of subsequent object detection.

In an optional implementation, the plurality of first characteristic maps include a first target characteristic map, the plurality of second characteristic maps include a second target characteristic map, and the plurality of third characteristic maps include a third target characteristic map. A target feature map and a fourth target feature map, the resolution of the third target feature map is smaller than that of the fourth target feature map; said generating according to the plurality of first feature maps and the plurality of second feature maps Multiple third feature maps, including:

Down-sampling the third target feature map to obtain a fifth target feature map, the fifth target feature map and the second target feature map have the same number of channels and resolution; for the first target The feature map is subjected to down-sampling and convolution processing to obtain a sixth target feature map, where the sixth target feature map has the same number of channels and resolution as the second target feature map; and the fifth target feature map , The second target feature map and the sixth target feature map are superimposed on channels to generate the fourth target feature map;

In the embodiment of the present application, the third target feature map can be down-sampling and convolution processing to obtain the fifth target feature map, wherein the purpose of down-sampling is to make the feature maps of each channel in the fifth target feature map distinguish The rate is the same as the second target feature map, and the purpose of the convolution processing is to make the number of channels of the fifth target feature map the same as the second target feature map. Through the above method, the fifth target feature map and the second target feature map have the same resolution and number of channels, and the fifth target feature map, the sixth target feature map, and the second target feature map can be combined. The channels are added to obtain the fourth target feature map.

Or, down-sampling the third target feature map to obtain a fifth target feature map, where the fifth target feature map has the same number of channels and resolution as the second target feature map; A target feature map is down-sampled to obtain a sixth target feature map. The sixth target feature map and the second target feature map have the same resolution; the fifth target feature map and the second target feature map are The feature map and the sixth target feature map are subjected to channel superposition and convolution processing to generate the fourth target feature map, and the fourth target feature map and the second target feature map have the same number of channels.

In the embodiments of this application, the expression "channel superimposition" can be understood as performing the corresponding elements in the feature map of the corresponding channel (that is, the channel with the same semantic information) (corresponding elements can be understood as elements located at the same position on the feature map). Superposition (such as doing addition, etc.).

In the embodiment of the present application, the second target feature map may be a feature map including multiple channels, and the feature map corresponding to each channel may be a feature map including a kind of semantic information.

In the embodiment of the present application, the third target feature map can be down-sampled to obtain a fifth target feature map, where the purpose of down-sampling is to make the resolution of the feature maps of each channel in the fifth target feature map equal to that of the second target feature map. The target feature map is the same. Through the above method, the fifth target feature map and the second target feature map have the same resolution, and the fifth target feature map and the second target feature can be added channel by channel; The first target feature map is down-sampled to obtain a sixth target feature map, where the purpose of down-sampling is to make the resolution of the feature maps of each channel in the sixth target feature map the same as the second target feature map. Through the above method, the fifth target feature map, the sixth target feature map, and the second target feature map have the same resolution, so that the fifth target feature map, the sixth target feature map can be converted to the same resolution. The feature map and the second target feature map are added channel by channel, and then processed by convolution, so that the obtained fourth target feature map and the second target feature map have the same number of channels, where the convolution process can be It is a concatenation operation.

In this embodiment, a third target feature map and a fourth target feature map with different resolutions can be generated, the resolution of the third target feature map is smaller than the fourth target feature map, and the fourth target with a larger resolution The feature map is generated based on a first feature map in a plurality of first feature maps, a second feature map in a plurality of second feature maps, and a third target feature map. At this time, the multiple third feature maps generated by the second feature map generation unit retain the advantages of the feature pyramid network, which has bottom-up (from the feature map with small resolution to generate feature maps with larger resolution in turn), and will be shallower. The rich texture detail information and/or position detail information of the layer neural network is introduced into the deep convolutional layer, and the detection network uses multiple third feature maps generated in this way to detect small target detection results with higher detection accuracy .

In an implementation manner, the method further includes:

Performing convolution on at least one third feature map of the plurality of third feature maps by the first convolution layer to obtain at least one fourth feature map;

And convolving at least one third feature map in the plurality of third feature maps to obtain at least one fifth feature map; wherein the receptive field corresponding to the fourth feature map is larger than the fifth feature map Corresponding receptive field;

Correspondingly, the outputting the first detection result of the object included in the image according to at least one third feature map of the plurality of third feature maps includes:

According to the at least one fourth feature map and the at least one fifth feature map, a first detection result of the object included in the image is output.

In this embodiment, the fourth feature map and the fifth feature map obtained by processing the third feature map have different receptive fields. Since they have different receptive field feature maps, they can adapt to targets of different sizes. For detection, for example, the third feature map can be processed by convolutional layers with different dilation rates, and the processing results obtained can include information about large targets or information about small targets, so that subsequent targets In the detection process, both large targets and small targets can be detected.

In an implementation manner, the method further includes:

Processing the fourth feature map according to the first weight value to obtain a processed fourth feature map;

It should be noted that the first weight value may perform a channel-based multiplication operation or other numerical processing on the fourth feature map, so that the elements in the fourth feature map undergo a corresponding gain.

The fifth feature map is processed according to the second weight value to obtain a processed fifth feature map; it should be noted that the second weight value can be compared to the fifth feature map based on the channel. The multiplication operation or other numerical processing makes the elements in the fifth feature map perform corresponding gains.

Wherein, in a case where the object to be detected included in the fourth feature map is greater than the object to be detected included in the fifth feature map, the first weight value is greater than the second weight value;

According to the processed fourth feature map and the processed fifth feature map, a first detection result of the object included in the image is output.

In this embodiment, if the object to be detected included in the fourth feature map is greater than the object to be detected included in the fifth feature map, correspondingly, the first weight value is greater than the second weight value, then the processed The fourth feature map has a greater gain than the processed fifth feature map. Since the receptive field corresponding to the fourth feature map itself is larger than the receptive field corresponding to the fifth feature map, the larger the receptive field, the more it has For more large target information, the detection accuracy of the large target in the target detection performed by using it is also higher. In this embodiment, when there is a larger target in the image, the gain corresponding to the fourth feature map is compared with There are also more fifth feature maps. When the detection unit performs target detection on the image based on the processed fourth feature map and the processed fifth feature map, the overall receptive field will be larger. Yes, the detection accuracy is also higher.

In this embodiment, the intermediate feature extraction layer can learn the recognition law of the weight value through training: for the feature map that includes a large target, the first weight value of the first convolutional layer determined by it is larger, and the determined The second weight value of the second convolutional layer is smaller. For a feature map that includes a small target, the first weight value of the first convolutional layer determined by the feature map is relatively small, and the second weight value of the second convolutional layer determined by the feature map is relatively large.

In an implementation manner, the method further includes:

Performing hole convolution processing on at least one third feature map of the plurality of third feature maps;

The target object in the image is detected according to the at least one third feature map after the hole convolution processing, and the first detection result is output.

In the embodiment of the present application, a 3x3 convolution may function as a sliding window in the candidate region extraction network (RPN), and the convolution kernel may be moved on at least one third feature map to pass subsequent intermediate layer and category judgments and The frame regression layer can obtain whether there is a target in each anchor frame and the difference between the predicted frame and the real frame, and a better frame extraction result can be obtained by training the candidate region extraction network. In this embodiment, the 3x3 sliding window convolution kernel is replaced with a 3x3 hole convolution kernel, and at least one third feature map of the plurality of third feature maps is subjected to hole convolution processing, according to the hole convolution processing The latter at least one third feature map detects the target object in the image and outputs the detection result. Without increasing the amount of calculation, the receptive field is increased, and the impact on large targets and partially occluded targets is reduced. Missed inspection.

In an implementation manner, the first detection result includes a first detection frame, and the method further includes:

Acquire a second detection result of the first image, the second detection result is obtained by object detection of the first image through a second perception network, and the object detection accuracy of the first perception network is higher than that of the Object detection accuracy of the second perception network, the second detection result includes a second detection frame, and there is an intersection between the area where the second detection frame is located and the area where the first detection frame is located;

If the ratio of the area of the intersection to the area of the first detection frame is less than a preset value, the second detection result is updated so that the updated second detection result includes the first detection frame.

In the embodiment of the present application, if the ratio of the area of the first intersection to the area of the first detection frame is less than the preset value, it can be considered that the first detection frame is omitted from the second detection result, and the updated The second detection result is such that the updated second detection result includes the first detection frame. The timing characteristics are introduced into the model to assist in determining whether the suspected missed test result is a true missed test result, and the category of the missed test result is judged, which improves the efficiency of verification.

In an implementation manner, the second detection result includes multiple detection frames, and there is an intersection between the area where each detection frame of the multiple detection frames is located and the area where the first detection frame is located, so The multiple detection frames include the second detection frame, wherein, in the area of the intersection of the area where each detection frame is located in the multiple detection frames and the area where the first detection frame is located, the second detection frame The area of the intersection between the area where the detection frame is located and the area where the first detection frame is located is the smallest.

In an implementation manner, the first image is an image frame in the video, the second image is an image frame in the video, and the frame distance between the first image and the second image in the video Less than the preset value, the method further includes:

Acquiring a third detection result of the second image, where the third detection result includes a fourth detection frame and an object category corresponding to the fourth detection frame;

When the shape difference and position difference between the fourth detection frame and the first detection frame are within a preset range, the first detection frame corresponds to the object category corresponding to the fourth detection frame.

In an implementation manner, the detection confidence of the fourth detection frame is greater than a preset threshold.

In a second aspect, the present application provides a data processing system, the data processing system includes: a convolution processing unit, a first feature map generation unit, a second feature map generation unit, and a detection unit, the convolution processing units are respectively Is connected to the first characteristic map generating unit and the second characteristic map generating unit, the first characteristic map generating unit is connected to the second characteristic map generating unit, and the second characteristic map generating unit is connected to the Detection unit connection;

The convolution processing unit is configured to receive an input image, and perform convolution processing on the input image to generate a plurality of first feature maps;

The first characteristic map generating unit is configured to generate a plurality of second characteristic maps according to the plurality of first characteristic maps; wherein, the plurality of first characteristic maps include more than the plurality of second characteristic maps Texture details of the input image and/or location details in the input image;

The second characteristic map generating unit is configured to generate a plurality of third characteristic maps according to the plurality of first characteristic maps and the plurality of second characteristic maps;

The detection unit is configured to output a detection result of an object included in the image according to at least one third feature map of the plurality of third feature maps.

For example, the data processing system may be a sensory network system for realizing the function of the sensory network.

Since the third aspect is a device corresponding to the first aspect, please refer to the description of the first aspect for its various implementation modes, explanations and corresponding technical effects, which will not be repeated here.

In the third aspect, this application provides a perceptual network training method, the method includes:

Obtaining a pre-labeled detection frame of the target object in the image;

Acquiring a target detection frame corresponding to the image and the first perception network, where the target detection frame is used to identify the target object;

In one design, the detection result of the image may be obtained, the detection result is obtained by object detection of the image through a first perception network, and the detection result includes the target detection frame corresponding to the first object ；

Performing iterative training on the first perception network according to a loss function to output a second perception network; wherein the intersection of the loss function and the pre-labeled detection frame and the target detection frame is more related to IoU;

In one design, the first perceptual network may be iteratively trained according to the loss function to update the parameters included in the first perceptual network to obtain the second perceptual network; wherein, the loss function and the pre- The intersection between the labeled detection frame and the target detection frame is more related to IoU.

In one design, the second perceptual network may be output.

In the embodiment of this application, the newly designed frame regression loss function uses scale invariance and is applied to the IoU loss item of the target detection measurement method, the loss item considering the aspect ratio of the predicted frame and the real frame, and the predicted frame center coordinates and The loss term of the ratio between the center coordinate distance of the real border and the distance between the lower right corner coordinate of the predicted border and the upper left corner coordinate of the real border. The IoU loss term naturally introduces a constant-scale border prediction quality evaluation index. The loss term of the aspect ratio of the two borders measures the two borders. The degree of fit between the shapes, the third distance ratio metric item is used to solve the problem that when IoU=0, the relative position relationship between the predicted frame and the real frame cannot be known, and the backward propagation is difficult.

In an implementation manner, the preset loss function is also related to the shape difference between the target detection frame and the pre-labeled detection frame, wherein the shape difference is negatively related to the area of the pre-labeled detection frame .

In an implementation manner, the preset loss function is also related to the position difference between the target detection frame and the pre-labeled detection frame in the image, wherein the position difference is related to the pre-labeled detection frame. The area of the frame is negatively correlated; or the position difference is negatively correlated with the area of the smallest circumscribed rectangle of the convex hull of the pre-labeled detection frame and the target detection frame.

In an implementation manner, the target detection frame includes a first corner point and a first center point, the pre-labeled detection frame includes a second corner point and a second center point, and the first corner point and the first center point The two corner points are the two end points of the diagonal of the rectangle, and the position difference is also positively correlated with the position difference between the first center point and the second center point in the image, and with the first corner The point is negatively related to the length of the second corner point.

In an implementation manner, the preset loss function includes a target loss item related to the position difference, and the target loss item changes as the position difference changes; wherein, when the position difference is greater than When the preset value is set, the rate of change of the target loss item is greater than the first preset rate of change; and/or, when the position difference is less than the preset value, the rate of change of the target loss item is less than the second preset rate of change Rate. When the position difference is greater than the preset value, the rate of change of the target loss item is greater than the first preset rate of change; and/or, when the position difference is less than the preset value, the rate of change of the target loss item Less than the second preset rate of change, the effect of rapid convergence can be achieved during the training process.

In a fourth aspect, the present application provides a perceptual network training device, the device includes:

An acquiring module for acquiring a pre-labeled detection frame of a target object in an image; and acquiring a target detection frame corresponding to the image and the first perception network, the target detection frame being used to identify the target object;

The iterative training module is used to iteratively train the first perception network according to the loss function to output the second perception network; wherein the loss function and the intersection between the pre-labeled detection frame and the target detection frame And more related to IoU.

In an implementation manner, the preset loss function includes a target loss item related to the position difference, and the target loss item changes as the position difference changes; wherein,

When the position difference is greater than the preset value, the rate of change of the target loss item is greater than the first preset rate of change; and/or, when the position difference is less than the preset value, the rate of change of the target loss item Less than the second preset rate of change.

Since the fourth aspect is a device corresponding to the second aspect, please refer to the description of the first aspect for its various implementation modes, explanations and corresponding technical effects, which will not be repeated here.

In a fifth aspect, an embodiment of the present application provides an object detection device, which may include a memory, a processor, and a bus system, where the memory is used to store a program, and the processor is used to execute the program in the memory to execute the above-mentioned second aspect. And any optional method of the second aspect.

In a sixth aspect, an embodiment of the present application provides an object detection device, which may include a memory, a processor, and a bus system. The memory is used to store a program, and the processor is used to execute the program in the memory to execute the above-mentioned third aspect. And any optional method of the third aspect.

In a seventh aspect, an embodiment of the present invention also provides a sensory network application system. The sensory network application system includes at least one processor, at least one memory, at least one communication interface, and at least one display device. The processor, the memory, the display device and the communication interface are connected through the communication bus and complete the mutual communication.

Communication interface, used to communicate with other devices or communication networks;

The memory is used to store application program codes for executing the above solutions, and the processor controls the execution. The processor is configured to execute the application program code stored in the memory;

The code stored in the memory can execute one of the object detection methods provided above or the method of training the perceptual network provided in the above embodiments;

The display device is used to display the image to be recognized, 2D, 3D, Mask, key points and other information of the object of interest in the image to be recognized.

In an eighth aspect, an embodiment of the present application provides a computer-readable storage medium having a computer program stored in the computer-readable storage medium, and when it runs on a computer, the computer executes the second aspect and any of the above-mentioned second aspects. Optional method or the first aspect and any optional method thereof.

In a ninth aspect, an embodiment of the present application provides a computer-readable storage medium in which a computer program is stored. When the computer program is run on a computer, the computer can execute the third aspect and any one thereof. Optional method.

In a tenth aspect, an embodiment of the present application provides a computer program, which when running on a computer, causes the computer to execute the first aspect and any optional method thereof.

In an eleventh aspect, an embodiment of the present application provides a computer program that, when run on a computer, causes the computer to execute the third aspect and any optional method thereof.

In a twelfth aspect, this application provides a chip system that includes a processor for supporting execution devices or training devices to implement the functions involved in the above aspects, for example, sending or processing data involved in the above methods ; Or, information. In a possible design, the chip system further includes a memory for storing program instructions and data necessary for the execution device or the training device. The chip system can be composed of chips, and can also include chips and other discrete devices.

This application provides a data processing system. The second feature map generation unit introduces the shallow texture detail information of the original feature map (multiple first feature maps generated by the convolution processing unit) into the deep feature map (first The multiple second feature maps generated by the feature map generating unit) are used to generate multiple third feature maps, and the third feature map with shallow rich texture detail information is used as the input data for the detection unit to perform target detection. Improve the detection accuracy of subsequent object detection.

Description of the drawings

Figure 1 is a schematic diagram of a structure of the main frame of artificial intelligence;

Figure 2 is an application scenario of an embodiment of the application;

Figure 3 is an application scenario of an embodiment of the application;

Figure 4 is an application scenario of an embodiment of the application;

FIG. 5 is a schematic diagram of a system architecture provided by an embodiment of the application;

FIG. 6 is a schematic diagram of the structure of a convolutional neural network used in an embodiment of the application;

FIG. 7 is a schematic diagram of the structure of a convolutional neural network used in an embodiment of the application;

FIG. 8 is a hardware structure of a chip provided by an embodiment of the application;

FIG. 9 is a schematic structural diagram of a sensing network provided by an embodiment of this application;

Figure 10 is a schematic diagram of the structure of a backbone network;

FIG. 11 is a schematic diagram of the structure of a first FPN;

Figure 12a is a schematic diagram of the structure of a second FPN;

Figure 12b is a schematic diagram of the structure of a second FPN;

Figure 12c is a schematic diagram of the structure of a second FPN;

Figure 12d is a schematic diagram of the structure of a second FPN;

Figure 12e is a schematic diagram of the structure of a second FPN;

Figure 13a is a schematic diagram of the structure of a head;

Figure 13b is a schematic diagram of the structure of a head;

FIG. 14a is a schematic diagram of the structure of a sensing network provided by an embodiment of this application;

FIG. 14b is a schematic diagram of a hole convolution kernel provided by an embodiment of this application;

Figure 14c is a schematic diagram of a processing flow of intermediate feature extraction;

FIG. 15 is a schematic flowchart of an object detection method provided by an embodiment of this application;

FIG. 16 is a schematic flow chart of a perceptual network training method provided by an embodiment of this application;

FIG. 17 is a schematic flowchart of an object detection method provided by an embodiment of this application;

FIG. 18 is a schematic diagram of a perceptual network training device provided by an embodiment of this application;

FIG. 19 is a schematic diagram of an object detection device provided by an embodiment of the application;

FIG. 20 is a schematic structural diagram of an execution device provided by an embodiment of this application;

FIG. 21 is a schematic structural diagram of a training device provided by an embodiment of the present application;

FIG. 22 is a schematic diagram of a structure of a chip provided by an embodiment of the application.

Detailed ways

The embodiments of the present invention will be described below in conjunction with the drawings in the embodiments of the present invention. The terms used in the embodiment of the present invention are only used to explain specific embodiments of the present invention, and are not intended to limit the present invention.

The embodiments of the present application will be described below in conjunction with the drawings. A person of ordinary skill in the art knows that with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.

The terms "first", "second", etc. in the description and claims of the application and the above-mentioned drawings are used to distinguish similar objects, and are not necessarily used to describe a specific sequence or sequence. It should be understood that the terms used in this way can be interchanged under appropriate circumstances, and this is merely a way of distinguishing objects with the same attributes in the description of the embodiments of the present application. In addition, the terms "including" and "having" and any variations of them are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units, but may include Listed or inherent to these processes, methods, products, or equipment.

First, the overall workflow of the artificial intelligence system is described. Please refer to Figure 1. Figure 1 shows a schematic diagram of the main framework of artificial intelligence. (Vertical axis) Two dimensions explain the above-mentioned artificial intelligence theme framework. Among them, the "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has gone through the condensing process of "data-information-knowledge-wisdom". The "IT value chain" from the underlying infrastructure of human intelligence, information (providing and processing technology realization) to the industrial ecological process of the system, reflects the value that artificial intelligence brings to the information technology industry.

(1) Infrastructure

The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the basic platform. Communicate with the outside through sensors; computing capabilities are provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA); basic platforms include distributed computing frameworks and network related platform guarantees and support, which can include cloud storage and Computing, interconnection network, etc. For example, sensors communicate with the outside to obtain data, and these data are provided to the smart chip in the distributed computing system provided by the basic platform for calculation.

(2) Data

The data in the upper layer of the infrastructure is used to represent the data source in the field of artificial intelligence. The data involves graphics, images, voice, and text, as well as the Internet of Things data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.

(3) Data processing

Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.

Among them, machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, training, etc.

Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formal information to conduct machine thinking and solving problems based on reasoning control strategies. The typical function is search and matching.

Decision-making refers to the process of making decisions after intelligent information is reasoned, and usually provides functions such as classification, ranking, and prediction.

(4) General ability

After the above-mentioned data processing is performed on the data, some general capabilities can be formed based on the results of the data processing, such as an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image Recognition and so on.

(5) Smart products and industry applications

Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. It is an encapsulation of the overall solution of artificial intelligence, productizing intelligent information decision-making and realizing landing applications. Its application fields mainly include: intelligent terminals, intelligent transportation, Smart medical care, autonomous driving, safe city, etc.

The embodiments of the present application are mainly applied in fields such as driving assistance, automatic driving, and mobile phone terminals that need to complete various perception tasks. The application system framework of the present invention is shown in Figures 2 and 3. Figure 2 shows the application scenario of the embodiment of the present application. The embodiment of the present application is in the module of automatic data labeling in the data processing platform. The dotted line outlines the location of the present invention. s position. The system is an intelligent data platform for human-machine collaboration, to build artificial intelligence capabilities with higher efficiency, faster training, and stronger models. The automatic data labeling module is an intelligent labeling system framework that solves the problem of high manual labeling costs and few manual labeling sets.

The product implementation form of the embodiment of this application is the program code included in the intelligent data storage system and deployed on the server hardware. The network element whose function is enhanced or modified due to this solution belongs to soft modification and belongs to a relatively independent module. Taking the application scenario shown in FIG. 3 as an example, the program code of the embodiment of the application exists in the runtime training module of the intelligent data system component. When the program is running, the program code of the embodiment of the application is stored and accelerated by the host of the server. Hardware (GPU/FPGA/dedicated chip) memory, the possible impact in the future is that before data is read into the module, data may be read from a certain ftp, file, database, or memory. After adopting the embodiment of this application, you can only The data source is updated to the interface of the functional module involved in the scheme.

Figure 3 shows the implementation form of the present invention in the server and platform software, where the label generating device and the automatic calibration device are newly added modules based on the existing platform software of the present invention.

The following is a brief introduction to the two application scenarios of ADAS/ADS visual perception system and mobile phone beauty.

Application scenario 1: ADAS/ADS visual perception system

As shown in Figure 4, in ADAS and ADS, real-time detection of multiple types of 2D targets is required, including: dynamic obstacles (Pedestrian, Cyclist, Tricycle, Car, Truck) (Truck), bus (Bus), static obstacles (TrafficCone, TrafficStick, FireHydrant, Motocycle, Bicycle), traffic signs ( (TrafficSign), guide signs (GuideSign), billboards (Billboard), red traffic lights (TrafficLight_Red) / yellow traffic lights (TrafficLight_Yellow) / green traffic lights (TrafficLight_Green) / black traffic lights (TrafficLight_Black), road signs (RoadSign)). In addition, in order to accurately obtain the area occupied by the dynamic obstacle in the 3-dimensional space, it is also necessary to perform a 3D estimation of the dynamic obstacle and output a 3D frame. In order to integrate with the lidar data, it is necessary to obtain the mask of the dynamic obstacle, so as to filter out the laser point cloud hitting the dynamic obstacle; in order to carry out accurate parking space, it is necessary to detect the 4 key points of the parking space at the same time ; In order to locate the composition, it is necessary to detect the key points of the static target. Using the technical solutions provided in the embodiments of the present application, all or part of the above-mentioned functions can be completed in the perception network.

Application scenario 2: mobile phone beauty function

In the mobile phone, the mask and key points of the human body are detected through the perception network provided by the embodiments of the present application, and the corresponding parts of the human body can be zoomed in and out, such as waist reduction and hip beautification operations, so as to output beautiful images.

Application scenario 3: Image classification scenario:

After obtaining the image to be classified, the object recognition device adopts the object recognition method of the present application to obtain the category of the object in the image to be classified, and then can classify the image to be classified according to the category of the object in the image to be classified. For photographers, many photos are taken every day, including animals, people, and plants. Using the method of the present application, photos can be quickly classified according to the content of the photos, which can be divided into photos containing animals, photos containing people, and photos containing plants.

In the case of a large number of images, the manual classification method is relatively inefficient, and people are prone to fatigue when dealing with the same thing for a long time. At this time, the classification result will have a large error; the method of this application is adopted. Images can be classified quickly without errors.

Application scenario 4: Product classification:

After the object recognition device obtains the image of the product, it then uses the object recognition method of the present application to obtain the category of the product in the image of the product, and then classifies the product according to the category of the product. For a wide variety of commodities in large shopping malls or supermarkets, the object recognition method of the present application can quickly complete the classification of commodities, reducing time and labor costs.

The following describes the method provided in this application from the model training side and the model application side:

The method for training a perceptual network provided by the embodiments of this application involves computer vision processing, and can be specifically applied to data processing methods such as data training, machine learning, and deep learning. And the category of objects) to perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc., and finally obtain a trained perceptual network; and the embodiment of this application will input data (such as the object in this application). The image of) is input into the trained perceptual network to obtain output data (for example, the 2D, 3D, Mask, key points and other information of the object of interest in the image are obtained in this application).

Since the embodiments of the present application involve a large number of applications of neural networks, in order to facilitate understanding, the following first introduces related terms, neural networks and other related concepts involved in the embodiments of the present application.

(1) Object detection. Using image processing, machine learning, computer graphics and other related methods, object detection can determine the category of image objects and determine the detection frame used to locate the object.

(2) Convolutional Neural Network (Convosutionas Neuras Network, CNN) is a deep neural network with a convolutional structure. The convolutional neural network contains a feature extractor composed of a convolutional layer and a sub-sampling layer. The feature extractor can be regarded as a filter. The perceptual network in this embodiment may include a convolutional neural network, which is used to perform convolution processing on an image or perform convolution processing on a feature map to generate a feature map.

(3) Backpropagation algorithm

Convolutional neural networks can use backpropagation (BP) algorithms to modify the size of the parameters in the initial super-resolution model during the training process, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, forwarding the input signal to the output will cause error loss, and the parameters in the initial super-resolution model are updated by backpropagating the error loss information, so that the error loss is converged. The backpropagation algorithm is a backpropagation motion dominated by error loss, and aims to obtain the optimal super-resolution model parameters, such as a weight matrix. In this embodiment, when training the perception network, the perception network may be updated based on the back propagation algorithm.

5 is a schematic diagram of a system architecture provided by an embodiment of the present application. In FIG. 5, the execution device 110 is configured with an input/output (input/output, I/O) interface 112 for data interaction with external devices, and the user Data may be input to the I/O interface 112 through the client device 140, and the input data may include the image to be recognized or the image block or image in the embodiment of the present application.

When the execution device 120 preprocesses the input data, or when the calculation module 111 of the execution device 120 executes calculations and other related processing (such as performing the function realization of the neural network in this application), the execution device 120 may call the data storage system 150 The data, codes, etc. are used for corresponding processing, and the data, instructions, etc. obtained by corresponding processing can also be stored in the data storage system 150.

Finally, the I/O interface 112 returns the processing result, such as the image or image block obtained above, or at least one of the 2D, 3D, Mask, and key points of the object of interest in the image, to the client device 140, so as to provide it to the client device 140. user.

Optionally, the client device 140 may be a control unit in an automatic driving system or a functional algorithm module in a mobile phone terminal. For example, the functional algorithm module may be used to implement tasks related to perception.

It is worth noting that the training device 120 can generate corresponding target models/rules based on different training data for different goals or different tasks, and the corresponding target models/rules can be used to achieve the above goals or complete the above tasks. , So as to provide users with the desired results. The target model/rule may be the perceptual network described in the subsequent embodiment, and the result provided to the user may be the object detection result in the subsequent embodiment.

In the case shown in FIG. 5, the user can manually set input data, and the manual setting can be operated through the interface provided by the I/O interface 112. In another case, the client device 140 can automatically send input data to the I/O interface 112. If the client device 140 is required to automatically send the input data and the user's authorization is required, the user can set the corresponding authority in the client device 140. The user can view the result output by the execution device 110 on the client device 140, and the specific display form may be a specific manner such as display, sound, and action. The client device 140 can also be used as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data and store it in the database 130 as shown in the figure. Of course, it is also possible not to collect through the client device 140, but the I/O interface 112 directly uses the input data input to the I/O interface 112 and the output result of the output I/O interface 112 as a new sample as shown in the figure. The data is stored in the database 130.

It is worth noting that FIG. 5 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation. For example, in FIG. 5, the data The storage system 150 is an external memory relative to the execution device 110. In other cases, the data storage system 150 may also be placed in the execution device 110.

As shown in FIG. 5, the perceptual network can be obtained by training according to the training device 120.

Among them, the perceptual network may include a deep neural network with a convolutional structure. The structure of the convolutional neural network used in the embodiment of the present application may be as shown in FIG. 6. In FIG. 6, a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230. Among them, the input layer 210 can obtain the image to be processed, and pass the obtained image to be processed to the convolutional layer/pooling layer 220 and the subsequent neural network layer 230 for processing, and the processing result of the image can be obtained. The following describes the internal layer structure of CNN 200 in Figure 6 in detail.

Convolutional layer/pooling layer 220:

Convolutional layer:

As shown in FIG. 6, the convolutional layer/pooling layer 220 may include layers 221-226, for example: in an implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, and layer 223 is a convolutional layer. Layers, 224 is the pooling layer, 225 is the convolutional layer, and 226 is the pooling layer; in another implementation, 221 and 222 are the convolutional layers, 223 is the pooling layer, and 224 and 225 are the convolutional layers. Layer, 226 is the pooling layer. That is, the output of the convolutional layer can be used as the input of the subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.

The following will take the convolutional layer 221 as an example to introduce the internal working principle of a convolutional layer.

The convolution layer 221 can include many convolution operators. The convolution operator is also called a kernel. Its function in image processing is equivalent to a filter that extracts specific information from the input image matrix. The convolution operator is essentially It can be a weight matrix. This weight matrix is usually pre-defined. In the process of convolution on the image, the weight matrix is usually one pixel after one pixel (or two pixels after two pixels) along the horizontal direction on the input image. ...It depends on the value of stride) to complete the work of extracting specific features from the image.

The weight values in these weight matrices need to be obtained through a lot of training in practical applications. Each weight matrix formed by the weight values obtained through training can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions. .

When the convolutional neural network 200 has multiple convolutional layers, the initial convolutional layer (such as 221) often extracts more general features, which can also be called low-level features; with the convolutional neural network With the deepening of the network 200, the features extracted by the subsequent convolutional layers (for example, 226) become more and more complex, such as features such as high-level semantics, and features with higher semantics are more suitable for the problem to be solved.

Pooling layer:

Since it is often necessary to reduce the number of training parameters, it is often necessary to periodically introduce a pooling layer after the convolutional layer. In the 221-226 layers as illustrated by 220 in Figure 6, it can be a convolutional layer followed by a layer. The pooling layer can also be a multi-layer convolutional layer followed by one or more pooling layers.

Neural network layer 230:

After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not enough to output the required output information. Because as mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image.

It should be noted that the convolutional neural network 210 shown in FIG. 6 is only used as an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models.

The perception network of the embodiment of the present application may include a deep neural network with a convolutional structure, where the structure of the convolutional neural network may be as shown in FIG. 7. In FIG. 7, a convolutional neural network (CNN) 200 may include an input layer 110, a convolutional layer/pooling layer 120 (where the pooling layer is optional), and a neural network layer 130. Compared with FIG. 6, multiple convolutional layers/pooling layers in the convolutional layer/pooling layer 120 in FIG. 7 are parallel, and the respectively extracted features are input to the full neural network layer 130 for processing.

Refer to FIG. 8, which is a schematic structural diagram of a data processing system provided by an embodiment of the application. As shown in Figure 8, the data processing system may include:

A convolution processing unit 801, a first feature map generating unit 802, a second feature map generating unit 803, and a detection unit 804, the convolution processing unit 801 is respectively connected to the first feature map generating unit 802 and the second feature map generating unit 802 and the second feature map generating unit 804. The map generating unit 803 is connected, the first characteristic map generating unit 802 is connected with the second characteristic map generating unit 803, and the second characteristic map generating unit 803 is connected with the detecting unit 804.

In one implementation, the data processing system can implement the function of a perceptual network, wherein the convolution processing unit 801 is the backbone network, and the first feature map generating unit 802 and the second feature map generating unit 803 are feature maps. In the pyramid network, the detection unit 804 is a head.

Refer to FIG. 9, which is a schematic structural diagram of a sensing network provided by an embodiment of this application. As shown in Figure 9, the sensing network includes:

The backbone network 901, the first feature pyramid network FPN 902, the second FPN 903, and the head-end header 904, the backbone network 901 is connected to the first FPN 902 and the second FPN 903, and the first FPN 901 Connected to the second FPN 902, and the second FPN 903 connected to the head 904;

In the embodiment of the present application, the architecture of the sensing network may be the architecture shown in FIG. 9, which mainly consists of a backbone network 901, a first FPN 902, a second FPN 903, and a head-end header 904.

In one implementation, the convolution processing unit 801 is the backbone network, and the convolution processing unit 801 is configured to receive an input image and perform convolution processing on the input image to generate multiple first images. Feature map.

It should be noted that “convolution processing on the input image” here should not be understood as only performing convolution processing on the input image. In some implementations, the input image can be Perform convolution processing and other processing.

Wherein, the convolution processing unit may be used to receive an input image, and perform convolution processing on the input image to generate multiple first feature maps with multi-scale resolution; the convolution processing unit may perform convolution processing on the input A series of convolution processing is performed on the image to obtain feature maps at different scales (with different resolutions). The convolution processing unit can take many forms, such as visual geometry group (VGG), residual neural network (residual neural network, resnet), GoogLeNet core structure (Inception-net), and so on.

In the embodiment of the present application, the convolution processing unit may be a backbone network, the backbone network 901, which is used to receive an input image and perform convolution processing on the input image to generate multiple first features with multi-scale resolution. picture.

10, FIG. 10 is a schematic diagram of the structure of a backbone network provided by an embodiment of the application. As shown in FIG. 10, the backbone network is used to receive input images, perform convolution processing on the input images, and output Feature maps with different resolutions corresponding to the image (feature map C1, feature map C2, feature map C3, feature map C4); that is to say, feature maps of different sizes corresponding to the image are output, and the backbone network completes the basic features The extraction provides corresponding features for subsequent detection.

Specifically, the backbone network can perform a series of convolution processing on the input image to obtain feature maps at different scales (with different resolutions). These feature maps will provide basic features for subsequent detection modules. The backbone network can take many forms, such as visual geometry group (VGG), residual neural network (residual neural network, resnet), the core structure of GoogLeNet (Inception-net), and so on.

The backbone network can perform convolution processing on the input image to generate several convolution feature maps of different scales. Each feature map is a matrix of H*W*C, where H is the height of the feature map and W is the width of the feature map , C is the number of channels in the feature map.

The backbone can use a variety of existing convolutional network frameworks, such as VGG16, Resnet50, Inception-Net, etc. The following is an example of Resnet18 as Backbone.

Assume that the resolution of the input image is H*W*3 (height H, width W, the number of channels is 3, that is, three channels of RBG). The input image can be convolved through a convolutional layer Res18-Conv1 of Resnet18 to generate Featuremap (feature map) C1. This feature map is down-sampled 2 times with respect to the input image, and the number of channels is expanded to 64, so the C1 The resolution is H/4*W/4*64. C1 can be convolved through Resnet18's Res18-Conv2 to get Featuremap C2, the resolution of this feature map is the same as C1; C2 continues through Res18-Conv3 for convolution operation to generate Featuremap C3, this feature map is further downsampled relative to C2 , The number of channels is doubled, and its resolution is H/8*W/8*128; finally C3 undergoes Res18-Conv4 convolution operation to generate Featuremap C4, and its resolution is H/16*W/16*256.

It should be noted that the backbone network in the embodiments of the present application may also be referred to as a backbone network, which is not limited here.

It should be noted that the backbone network shown in FIG. 10 is only an implementation manner, and does not constitute a limitation to this application.

The first characteristic map generating unit 802 is configured to generate a plurality of second characteristic maps according to the plurality of first characteristic maps; wherein, the plurality of first characteristic maps include more features than the plurality of second characteristic maps. More texture detail information and/or location detail information.

It should be noted that the texture detail information here can be shallow detail information used to express small targets and edge features. Compared with the second feature map, the first feature map includes more texture detail information, so that The detection accuracy of the detection results for small target detection is higher. Wherein, the position details can be information that expresses the position of the object in the image and the relative position between the objects.

It should be noted that, compared to the first feature map, multiple second feature maps can include more deep features. Deep features contain rich semantic information, which has a good effect on classification tasks. At the same time, deep features have more The large receptive field can have a good detection effect on large targets; in one implementation, by introducing a top-down path to generate multiple second feature maps, the rich semantic information contained in deep features can be naturally combined Propagate downward so that the second feature maps of various scales contain rich semantic information.

In this embodiment of the present application, the first feature map generating unit 802 may be the first FPN 902.

In the embodiment of the present application, the first FPN 902 is configured to generate a first feature pyramid according to the multiple first feature maps, and the first feature pyramid includes multiple second feature maps with multi-scale resolution. In the embodiment of the present application, the first FPN is connected to the backbone network, and the first FPN may perform convolution processing and merging processing on multiple feature maps of different resolutions generated by the backbone network to construct the first feature pyramid.

Referring to FIG. 11, FIG. 11 is a schematic diagram of the structure of a first FPN. The first FPN 902 can generate a first feature pyramid according to the multiple first feature maps, and the first feature pyramid includes a multi-scale resolution. A second feature map (feature map P2, feature map P3, feature map P4, feature map P5). Among them, the convolution operation is performed on the topmost feature map C4 generated by the backbone network 901. Illustratively, the hole convolution and 1×1 convolution can be used to reduce the number of channels of the topmost feature map C4 to 256, which is used as the feature pyramid. The top-most feature map P4; horizontally link the output results of the top-level and next-level feature maps C3 and use 1×1 convolution to reduce the number of channels to 256, then add it with the feature map P4 channel by pixel to obtain the feature map P3; By analogy, from top to bottom, the first feature pyramid ΦP={feature map P2, feature map P3, feature map P4, feature map P5} is constructed.

In the embodiment of the present application, in order to obtain a larger receptive field, the first feature pyramid may further include a feature map P5, which can be generated by directly performing a convolution operation on the feature map P4. The feature map in the middle of the first feature pyramid can take the rich semantic information contained in the deep features through the top-down structure and introduce each feature layer layer by layer, so that feature maps of different scales contain richer semantic information. Provide better semantic information for small targets and improve the classification performance of small targets.

It should be noted that the first FPN shown in FIG. 11 is only an implementation manner, and does not constitute a limitation of the present application.

The second feature map generating unit 803 is configured to generate multiple third feature maps according to the multiple first feature maps and the multiple second feature maps.

In this embodiment of the present application, the second feature map generating unit 803 may be a second FPN 903.

In the embodiment of the present application, the second FPN 903 is used to generate a second feature pyramid based on the multiple first feature maps and the multiple second feature maps, and the second feature pyramid includes multi-scale resolution Multiple third characteristic maps of the rate.

Referring to Fig. 12a, Fig. 12a is a schematic diagram of the structure of a second FPN. The second FPN 903 can generate the first feature map according to the multiple first feature maps generated by the backbone network 901 and the multiple second feature maps generated by the first FPN 902. Two feature pyramids, where the second feature pyramid may include multiple third feature maps (for example, the feature map Q1, the feature map Q2, the feature map Q3, and the feature map Q4 shown in FIG. 12a).

In the embodiment of the present application, the second feature pyramid includes a plurality of third feature maps with multi-scale resolution, and the lowest-level feature map (that is, the feature map with the lowest resolution) in the plurality of third feature maps can be based on the backbone network A first feature map generated and a second feature map generated by the first FPN are generated.

Specifically, in an embodiment, the plurality of first feature maps include a first target feature map, the plurality of second feature maps include a second target feature map, and the plurality of third feature maps include a first target feature map. Three target feature maps, the third target feature map is the feature map with the smallest resolution in the plurality of third feature maps, and the second FPN is used to generate the third target feature map through the following steps:

Perform down-sampling and convolution processing on the first target feature map to obtain a fourth target feature map, where the fourth target feature map has the same number of channels and resolution as the second target feature map; The fourth target feature map and the second target feature map are added channel by channel to generate the third target feature map.

12b, the plurality of first feature maps include a first target feature map, the first target feature map may be the feature map C2 in FIG. 12b, the plurality of second feature maps include a second target feature map, and the second target The feature map may be the feature map P3 in FIG. 12b, the plurality of third feature maps include a third target feature map, and the third target feature map may be the feature map Q1 in FIG. 12b.

In the embodiment of the present application, the first target feature map can be down-sampled and convolved to obtain a fourth target feature map. As shown in FIG. 12b, the feature map C2 can be down-sampled and convolved. Processing, where the purpose of downsampling is to make the resolution of each channel feature map of the fourth target feature map the same as the feature map P3, and the purpose of convolution processing is to make the number of channels of the fourth target feature map the same as the feature map P3. Through the above method, the fourth target feature map and the second target feature map have the same number of channels and resolution, and the fourth target feature map and the second target feature can be added channel by channel, such as As shown in FIG. 12b, the fourth target feature map and the second target feature are added channel by channel to obtain the feature map Q1.

Specifically, in another embodiment, the plurality of first feature maps include a first target feature map, the plurality of second feature maps include a second target feature map, and the plurality of third feature maps include A third target feature map, the third target feature map is a feature map with the smallest resolution in the plurality of third feature maps, and the second FPN is used to generate the third target feature map through the following steps:

Down-sampling the first target feature map to obtain a fourth target feature map. The fourth target feature map has the same resolution as the second target feature map; The second target feature map is added channel by channel and convolution processing is performed to generate the third target feature map, and the third target feature map has the same number of channels as the second target feature map.

12c, the plurality of first feature maps include a first target feature map, the first target feature map may be the feature map C2 in FIG. 12c, the plurality of second feature maps include a second target feature map, and the second target The feature map may be the feature map P3 in FIG. 12c, the plurality of third feature maps include a third target feature map, and the third target feature map may be the feature map Q1 in FIG. 12c.

In the embodiment of the present application, the first target feature map can be down-sampled to obtain the fourth target feature map. As shown in FIG. 12c, the feature map C2 can be down-sampled, where the purpose of down-sampling is The resolution of the feature map of each channel in the fourth target feature map is made the same as the feature map P3. Through the above method, the fourth target feature map and the second target feature map have the same resolution, and then the fourth target feature map and the second target feature can be added channel by channel, and then passed through the volume. Convolution processing, so that the obtained third target feature map and the feature map P3 have the same number of channels, wherein the above-mentioned convolution processing may be a concatenation operation.

In the embodiment of the present application, the second feature pyramid includes multiple third feature maps with multi-scale resolution, wherein the non-bottom-most feature maps (that is, the feature maps with not the lowest resolution) in the multiple third feature maps can be based on A first feature map generated by the backbone network, a second feature map generated by the first FPN, and a third feature map of the adjacent bottom layer are generated.

Specifically, in an embodiment, the plurality of first feature maps include a first target feature map, the plurality of second feature maps include a second target feature map, and the plurality of third feature maps include a first target feature map. A three-target feature map and a fourth target feature map, the resolution of the third target feature map is smaller than that of the fourth target feature map, and the second feature map generating unit is configured to generate the fourth target through the following steps Feature map:

Down-sampling the third target feature map to obtain a fifth target feature map, the fifth target feature map and the second target feature map have the same number of channels and resolution; for the first target The feature map is subjected to down-sampling and convolution processing to obtain a sixth target feature map, where the sixth target feature map has the same number of channels and resolution as the second target feature map; and the fifth target feature map , The second target feature map and the sixth target feature map are superimposed based on respective channels to generate the fourth target feature map.

12d, the plurality of first feature maps include a first target feature map, the first target feature map may be the feature map C3 in FIG. 12d, the plurality of second feature maps include a second target feature map, and the second target The feature map may be the feature map P4 in FIG. 12d, the plurality of third feature maps include a third target feature map and a fourth target feature map, and the third target feature map may be the feature map Q2 in FIG. 12d.

In the embodiment of the present application, the third target feature map can be down-sampled to obtain the fifth target feature map. As shown in FIG. 12d, the feature map Q1 can be down-sampled, where the purpose of down-sampling is The resolution of the feature map of each channel in the fifth target feature map is made the same as the feature map P4.

Specifically, in another embodiment, the third target feature map may be down-sampled to obtain a fifth target feature map, and the fifth target feature map has the same characteristics as the second target feature map. Number of channels and resolution; down-sampling the first target feature map to obtain a sixth target feature map, the sixth target feature map having the same resolution as the second target feature map; Five target feature maps, the second target feature maps, and the sixth target feature maps are superimposed on their respective channels and subjected to convolution processing to generate the fourth target feature map, and the fourth target feature map is The second target feature maps have the same number of channels.

12e, the plurality of first feature maps include a first target feature map, the first target feature map may be the feature map C3 in FIG. 12e, the plurality of second feature maps include a second target feature map, and the second target The feature map may be the feature map P4 in FIG. 12e, the plurality of third feature maps include a third target feature map and a fourth target feature map, and the third target feature map may be the feature map Q2 in FIG. 12e.

In this embodiment, the second feature map generating unit may generate a third target feature map and a fourth target feature map with different resolutions. The resolution of the third target feature map is smaller than that of the fourth target feature map, wherein the resolution The larger fourth target feature map is generated based on a first feature map in a plurality of first feature maps, a second feature map in a plurality of second feature maps, and a third target feature map. At this time, the multiple third feature maps generated by the second feature map generation unit retain the advantages of the feature pyramid network, which has bottom-up (from the feature map with small resolution to generate feature maps with larger resolution in turn), and will be shallower. The rich texture detail information and/or position detail information of the layer neural network is introduced into the deep convolutional layer, and the detection network uses multiple third feature maps generated in this way to detect small target detection results with higher detection accuracy .

The object detection task is different from the image classification. In the image classification task, the model only needs to answer the question of what is in the image. Therefore, the image classification task has translation invariance such as translation and scale. In the object detection task, the model needs to know where the target in the image is and which category the target belongs to. These two parallel tasks, but in the actual obtained visible light image, the size and position of the target are always changing , There is the problem of multi-scale target detection. The original one-stage and two-stage models are based on the image classification task and directly migrate a part of the image classification network for detection, and in order to obtain sufficient expressive power and The feature map of semantic information generally selects the deeper features of the network for subsequent processing. In order to obtain better expression and higher precision results, the current deep neural network models are all moving towards deeper and multi-branch topological structures. Expansion, as the network layer deepens, the subsequent network degradation problem also appears. The ResNet network structure of identity mapping and jump connection can solve the network degradation problem well. At present, in the high-precision detection model, the number of network layers is in the order of 10 to 100 depth, which allows the network to obtain better expressive ability, but the problem with this is that the deeper the network layer, the acquired features have The larger the receptive field, this will cause small targets to be missed in the detection.

Moreover, in the traditional detection algorithm, only the features acquired in a certain layer are used, which will produce a fixed receptive field, although anchor points (Anchor) will be introduced in the subsequent processing, by setting anchors of different sizes and aspect ratios. The points are mapped back to the original image through the feature map for processing to solve the multi-scale problem, but it is inevitable that there will still be a certain target missed detection problem. The feature pyramid network (FPN) can solve the problem of multi-scale target detection. Through the observation of the deep convolutional neural network, the deep network model itself has its own multi-level and multi-scale feature maps, and the shallow feature maps have more Small receptive fields, deep feature maps have larger receptive fields, so the direct use of such feature maps with pyramidal hierarchical structure can introduce multi-scale information, but there will be a problem. Although shallow feature maps have smaller receptive fields , It is beneficial to detect small targets, but the semantic information contained in the shallow feature map is relatively small, which can be understood as the shallow features are not abstract enough, and it is difficult to classify the detected targets. Feature Pyramid Network (FPN) uses an ingenious structural design to solve the problem of insufficient semantic information of shallow features. For a basic deep convolutional neural network, there is a bottom-to-top forward calculation to obtain hierarchical feature maps, and each scale feature has a scaling factor, and the resolution of the feature map gradually decreases. The Feature Pyramid Network (FPN) introduces a top-to-bottom network structure, gradually enlarges the resolution of the feature map, and introduces the horizontal connection branch derived from the original feature extraction network, and the original feature network corresponds to the resolution feature map with the up-sampling Then the deep feature maps are fused.

In the embodiment of this application, considering that the shallow information generated by the backbone network is once again integrated into each feature layer, a bottom-up jump connection multi-scale feature layer design is designed, and the shallow feature map contains very rich edges. , Texture and detail information, through the introduction of jump connections between the original feature map and the bottom-to-top multi-scale network layer, and the down-sampled original feature map is connected to the second top-to-bottom multi-degree feature layer horizontally Feature maps with corresponding resolutions are fused. This network layer improvement has better detection results for small targets and partially occluded targets, and can introduce rich semantic information and detailed information features in the multi-scale feature pyramid layer.

In an existing implementation, the second feature map generation unit (such as the feature pyramid network) introduces a top-down path to propagate the rich semantic information contained in deep features downwards, so that the second feature map generation unit of each scale Feature maps contain rich semantic information, and deep features have a large receptive field, which makes it possible to have a better detection effect on large targets. However, in the existing implementation, the more detailed position detail information and texture detail information contained in the shallower feature maps are ignored, which has a great impact on the detection accuracy of medium and small targets. In the embodiment of the present application, the second feature map generation unit introduces the shallow texture detail information of the original feature map (multiple first feature maps generated by the convolution processing unit) into the deep feature map (the first feature map generation unit). In the generated multiple second feature maps), multiple third feature maps are generated, and the third feature map with shallow rich texture detail information is used as the input data of the detection unit for target detection, which can improve subsequent object detection The detection accuracy.

The detection unit 804 is configured to perform target detection on the image according to at least one third feature map of the plurality of third feature maps, and output the detection result.

In this embodiment of the present application, the detection unit 804 may be a head.

In the embodiment of the present application, the head is used to detect the target object in the image according to at least one third feature map of the plurality of third feature maps, and output the detection result.

In the embodiment of the present application, the sensing network may include one or more heads. As shown in FIG. 13a, each parallel head-end head is used to perform a task in one task according to the third feature map output by the second FPN. The task object is detected, and the 2D frame of the area where the task object is located and the confidence level corresponding to each 2D frame are output; wherein, each parallel head can complete the detection of different task objects; wherein, the task object is The object that needs to be detected in the task; the higher the confidence, the greater the probability that the object corresponding to the task exists in the 2D box corresponding to the confidence.

In the embodiment of this application, different heads can complete different 2D detection tasks. For example, one head of multiple heads can complete car detection and output the 2D frame and confidence of Car/Truck/Bus; head1 of multiple heads It can complete human detection and output the 2D frame and confidence of Pedestrian/Cyclist/Tricyle; multiple heads can complete the detection of traffic lights, and output the 2D frame and confidence of Red_Trafficligh/Green_Trafficlight/Yellow_TrafficLight/Black_TrafficLight.

In the embodiment of the present application, the sensing network may include multiple serial heads; the serial head is connected to a parallel head; it should be emphasized that, in fact, the serial head is not necessary, and it is only necessary to detect the 2D frame. In the scenario, there is no need to include the serial head.

The serial head can be used to extract the features of the region where the 2D frame is located from one or more feature maps on the second FPN using the 2D frame of the task object of the task provided by the parallel head connected to it. Predicting the 3D information, Mask information or Keypiont information of the task object of the task to which the 2D frame is located according to the characteristics of the area where the 2D frame is located. The serial head is optionally connected in series behind the parallel head, and on the basis of detecting the 2D frame of the task, the 3D/Mask/Keypoint detection of the objects inside the 2D frame is completed. For example, serial 3D_head0 completes the estimation of the vehicle's orientation, center of mass and length, width and height, thereby outputting the 3D frame of the vehicle; serial Mask_head0 predicts the fine mask of the vehicle, thereby dividing the vehicle; serial Keypont_head0 completes the key points of the vehicle Estimate. Serial head is not necessary, some tasks do not require 3D/Mask/Keypoint detection, serial head does not need to be connected in series, such as traffic light detection, only need to detect 2D frame, without serial head. In addition, some tasks can choose to connect one or more serial heads in series according to the specific needs of the task, such as the detection of parking lots (Parking Slot). In addition to the 2D frame, it also needs the key points of the parking space. Therefore, in this task It only needs to connect a serial Keypoint_head in series, without the head of 3D and Mask.

In the embodiment of this application, the header is connected with the FPN. The header can complete the detection of the 2D frame of a task according to the feature map provided by the FPN, and output the 2D frame of the object of the task and the corresponding confidence level, etc., the following describes one For a schematic diagram of the header structure, refer to FIG. 13b, which is a schematic diagram of a header. As shown in FIG. 13b, the head includes three modules: a Region Proposal Network (RPN), ROI-ALIGN, and RCNN.

Among them, the RPN module can be used to predict the area where the task object is located on one or more third feature maps provided by the second FPN, and output candidate 2D boxes that match the area; or it can be understood that the RPN is in the FPN The output one or more horizontal images predict the areas where the task object may exist, and give the boxes of these areas, these areas are called candidate areas (Proposal). For example, when the head is responsible for detecting a car, its RPN layer predicts a candidate frame that may have a car; when the head is responsible for detecting a person, its RPN layer predicts a candidate frame that may have a person. Of course, these Proposals are not accurate. On the one hand, they do not necessarily contain the object of the task, and on the other hand, these frames are not compact.

The 2D candidate region prediction process can be implemented by the RPN module of the head, which predicts the regions where the task object may exist based on the feature map provided by the FPN, and gives candidate frames (also called candidate regions, Proposal) of these regions. In this embodiment, if the head is responsible for detecting a car, its RPN layer predicts that there may be a candidate frame for the car.

The RPN layer may generate a feature map RPN Hidden through, for example, a 3*3 convolution on the third feature map provided by the second FPN. The RPN layer of the back head will predict Proposal from RPN Hidden. Specifically, the RPN layer of the head respectively predicts the coordinates and confidence of the proposal at each position of the RPN Hidden through a 1*1 convolution. The higher the confidence, the greater the probability that the proposal exists for the object of the task. For example, the greater the score of a certain proposal in the head, the greater the probability of its existence. The Proposal predicted by each RPN layer needs to go through the Proposal merging module, and the excess Proposal is removed according to the degree of overlap between Proposals (this process can be used but not limited to the NMS algorithm), and the largest score is selected from the remaining K Proposals N (N<K) proposals are used as candidate areas where objects may exist. These Proposals are inaccurate. On the one hand, they do not necessarily contain the object of the task, and on the other hand, these frames are not compact. Therefore, the RPN module is only a rough detection process, and the subsequent RCNN module is required for subdivision. When the RPN module returns to the Proposal coordinates, it does not directly return the absolute value of the coordinates, but returns the coordinates relative to the Anchor. The higher the match between these Anchors and the actual object, the greater the probability that the RPN can detect the object.

The ROI-ALIGN module is used to extract the features of the region where the candidate 2D frame is located from a feature map provided by the FPN according to the region predicted by the RPN module; that is, the ROI-ALIGN module is mainly based on the RPN module For the provided Proposal, take out the features of the area where each Proposal is located on a certain feature map, and resize to a fixed size to obtain the features of each Proposal. It is understandable that the ROI-ALIGN module can be used but not limited to ROI-POOLING (region of interest pooling)/ROI-ALIGN (region of interest extraction)/PS-ROIPOOLING (position-sensitive region of interest pooling)/ Feature extraction methods such as PS-ROIALIGN (position-sensitive region of interest extraction).

The RCNN module is used to perform convolution processing on the features of the region where the candidate 2D box is located through a neural network to obtain the confidence that the candidate 2D box belongs to each object category; adjust the coordinates of the candidate area 2D box through the neural network , Making the adjusted 2D candidate frame more match the shape of the actual object than the candidate 2D frame, and selecting the adjusted 2D candidate frame with a confidence greater than a preset threshold as the 2D frame of the region. In other words, the RCNN module mainly refines the features of each Proposal proposed by the ROI-ALIGN module, and obtains the confidence that each Proposal belongs to each category (for example, for the task of car, Backgroud/Car/Truck will be given /Bus 4 points), and adjust the coordinates of the Proposal 2D frame to output a more compact 2D frame. After these 2D boxes are combined by non-maximum suppression (NMS), they are output as the final 2D box.

The sub-classification of 2D candidate regions is mainly implemented by the RCNN module of head in Figure 13b. According to the features of each Proposal extracted by the ROI-ALIGN module, it further returns to more compact 2D frame coordinates, and at the same time classifies this Proposal. Output the confidence that it belongs to each category. RCNN can be implemented in many forms. The feature size output by the ROI-ALIGN module can be N*14*14*256 (Feature of proposals), which is first processed by the Resnet18 convolution module 4 (Res18-Conv5) in the RCNN module. The output feature size is N*7*7*512, and then processed through a Global Avg Pool (average pooling layer), and the 7*7 features in each channel in the input features are averaged to obtain N*512 Features, where each 1*512-dimensional feature vector represents the feature of each Proposal. Next, return the precise coordinates of the frame through the two fully connected layers FC (output N*4 vector, these 4 numerical sub-tables represent the x/y coordinates of the center point of the frame, the width and height of the frame), and the confidence of the frame category Degree (in head0, the score that this box is Backgroud/Car/Truck/Bus needs to be given). Finally, through the box merging operation, several boxes with the largest scores are selected, and the repeated boxes are removed through the NMS operation, so as to obtain a compact box output.

In some practical application scenarios, the sensing network may also include other heads, and 3D/Mask/Keypoint detection can be further performed on the basis of detecting the 2D frame. Exemplarily, taking 3D as an example, the ROI-ALIGN module extracts the features of the region where each 2D box is located on the feature map output by the FPN according to the accurate 2D box provided by the head. Assuming that the number of 2D boxes is M, then The feature size output by the ROI-ALIGN module is M*14*14*256, which is first processed by Resnet18's Res18-Conv5, and the output feature size is N*7*7*512, and then through a Global Avg Pool (average pooling) Layer) for processing, and average the 7*7 features of each channel in the input features to obtain M*512 features, where each 1*512-dimensional feature vector represents the feature of each 2D box. Next, the orientation angle of the object in the frame (orientation, M*1 vector), centroid point coordinates (centroid, M*2 vector, these 2 values represent the x/y coordinates of the centroid) and Length, width and height (dimention).

It should be noted that the header shown in FIG. 13a and FIG. 13b is only an implementation manner, and does not constitute a limitation to the present application.

In the embodiment of the present application, the perception network may further include: a hole convolution layer, configured to perform hole convolution processing on at least one third feature map of the plurality of third feature maps; correspondingly, the head , Which is specifically configured to detect the target object in the image according to the at least one third feature map after the hole convolution processing, and output the detection result.

14a and 14b, FIG. 14a is a schematic diagram of a perceptual network structure provided by an embodiment of this application, and FIG. 14b is a schematic diagram of a hollow convolution kernel provided by an embodiment of this application. In the embodiment of this application, there will be a 3x3 convolution in the candidate region extraction network (RPN) to act as a sliding window, and move the convolution kernel on at least one third feature map to pass subsequent intermediate layer and category judgments and borders The regression layer can obtain whether there is a target in each anchor frame and the difference between the predicted frame and the real frame, and a better frame extraction result can be obtained by training the candidate region extraction network. In this embodiment, the 3x3 sliding window convolution kernel is replaced with a 3x3 hole convolution kernel, and at least one third feature map of the plurality of third feature maps is subjected to hole convolution processing, according to the hole convolution processing The latter at least one third feature map detects the target object in the image and outputs the detection result. Without increasing the amount of calculation, the receptive field is increased, and the impact on large targets and partially occluded targets is reduced. Missed inspection. Among them, assuming that the original ordinary convolution kernel space size is k×k, introduce a new parameter d, and stuff (d-1) spaces into the original convolution kernel, then the new convolution kernel size is:

n=k+(k-1)×(d-1)

For example, for a 3x3 convolution kernel, d=2 is set, and the receptive field of the new convolution kernel is 5x5.

The existing candidate region extraction network (RPN) sets a 3x3 ordinary convolution kernel as a sliding window to slide on the feature map for subsequent processing. In this embodiment, the 3x3 ordinary convolution kernel is replaced with a 3x3 hole convolution kernel for improvement The network corrects the missed detection of large targets and occluded targets. In this embodiment, through the improvement of the feature extraction network and the improvement of the candidate region extraction network (RPN), the finally obtained annotation model can obtain good detection results, and After replacing the ordinary convolution with the 3x3 hole convolution, a larger receptive field can be obtained without increasing the amount of calculation. At the same time, because of the larger receptive field, better context information can be obtained and the background can be reduced. The misjudgment is the occurrence of prospects.

The present application provides a data processing system, which is characterized in that the data processing system includes: a convolution processing unit, a first feature map generating unit, a second feature map generating unit, and a detection unit, and the convolution processing units are respectively Is connected to the first characteristic map generating unit and the second characteristic map generating unit, the first characteristic map generating unit is connected to the second characteristic map generating unit, and the second characteristic map generating unit is connected to the The detection unit is connected; the convolution processing unit is configured to receive an input image and perform convolution processing on the input image to generate a plurality of first feature maps; the first feature map generation unit is configured to Generate multiple second feature maps according to the multiple first feature maps; wherein the multiple first feature maps include more texture details and/or of the input image than the multiple second feature maps Or position details in the input image; the second feature map generating unit is configured to generate multiple third feature maps according to the multiple first feature maps and the multiple second feature maps; the The detection unit is configured to output a detection result of the object included in the image according to at least one third feature map of the plurality of third feature maps. In the embodiment of this application, the second feature map generation unit introduces the shallow texture detail information of the original feature map (multiple first feature maps generated by the convolution processing unit) into the deep feature map (generated by the first feature map generation unit). In multiple second feature maps), multiple third feature maps are generated, and the third feature map with shallow rich texture detail information is used as the input data of the detection unit for target detection, which can improve the detection of subsequent object detection Accuracy.

Optionally, the data processing system in the embodiment of the present application may further include:

The intermediate feature extraction layer is used to convolve at least one third feature map of the plurality of third feature maps to obtain at least one fourth feature map; and for at least one of the plurality of third feature maps A third feature map is convolved to obtain at least one fifth feature map; wherein the receptive field corresponding to the fourth feature map is larger than the receptive field corresponding to the fifth feature map; the detection unit is specifically configured to According to the at least one fourth feature map and the at least one fifth feature map, a detection result of the object included in the image is output.

Referring to Figure 14c, Figure 14c is a schematic diagram of the processing flow of an intermediate feature extraction. As shown in Figure 14c, the third feature map can be convolved by using convolutional layers with different dilation rates. , Get the corresponding feature map (the number of channels in each feature map is c), and perform the splicing operation to obtain feature map 4 (the number of channels is 3c), and then obtain the global Info descriptor through global average pooling and arithmetic tie processing, and Give nonlinear features through the first fully connected layer, and process through the second fully connected layer and sigmoid function to limit each weight value to the range of 0-1, and then perform the weight value and the corresponding feature map channel by channel Multiply to get the processed feature map.

Referring to FIG. 15, FIG. 15 is a schematic flowchart of an object detection method provided by an embodiment of the application. As shown in FIG. 15, the object detection method includes:

1501. Receive an input first image.

In the embodiment of the present application, when object detection needs to be performed on the first image, the input first image may be received.

1502. Perform object detection on the first image through the first perception network to obtain a first detection result, where the first detection result includes a first detection frame.

In the embodiment of the present application, after receiving the input first image, object detection can be performed on the first image through the first perception network to obtain the first detection result, and the first detection result includes the first detection frame, wherein: The first detection frame may indicate the pixel position of the detected object in the first image.

1503. Perform object detection on the first image through a second perception network to obtain a second detection result, where the second detection result includes a second detection frame, between the second detection frame and the first detection frame There is a first intersection.

In the embodiment of the present application, the first image may be detected by an object through a second perception network to obtain a second detection result. The second detection result includes a second detection frame, and the first detection frame may indicate the detection The pixel position of an object in the first image.

In the embodiment of the present application, there is a first intersection between the second detection frame and the first detection frame, that is, on the first image, the pixel position where the second detection frame is located and the pixel position where the first detection frame is located There is overlap.

1504. If the ratio of the area of the intersection to the area of the first detection frame is less than a preset value, update the second detection result so that the updated second detection result includes the first detection frame.

In the embodiment of the present application, if the ratio of the area of the first intersection to the area of the first detection frame is less than the preset value, it can be considered that the first detection frame is omitted from the second detection result, and the updated The second detection result is such that the updated second detection result includes the first detection frame.

In the embodiment of the present application, the second detection result may further include a third detection frame, and there is a second intersection between the third detection frame and the first detection frame, and the area of the second intersection is smaller than the The area of the first intersection.

In the embodiment of the present application, there may be multiple detection frames that have an intersection with the first detection frame in the second detection result, and the size of the intersection between the second detection frame and the first detection frame is the largest , That is, the size of the intersection with the first detection frame is the largest, the ratio of the area of the intersection between the second detection frame and the first detection frame to the area of the first detection frame is less than the preset value , Then the ratio of the area of the intersection between the remaining detection frames that have an intersection with the first detection frame and the area of the first detection frame to the area of the first detection frame is also less than the preset value.

In the embodiment of the present application, for the same image, the object detection accuracy of the first perception network is higher than the object detection accuracy of the second perception network, and the object detection accuracy is related to at least one of the following features: The shape, position, or category of the object corresponding to the detection frame.

That is, in the embodiment of the present application, the object detection accuracy of the first perception network is higher than the object detection accuracy of the second perception network, that is, the detection result of the first perception network can be used to update the detection result of the second perception network.

Optionally, the input first image may be received, and convolution processing is performed on the first image to generate multiple first feature maps; multiple second feature maps are generated according to the multiple first feature maps; Wherein, the multiple first feature maps include more texture detail information and/or location detail information than the multiple second feature maps; according to the multiple first feature maps and the multiple second feature maps The image generates a plurality of third feature maps; according to at least one third feature map of the plurality of third feature maps, target detection is performed on the first image, and a first detection result is output.

Optionally, the plurality of second feature maps include more semantic information than the plurality of first feature maps.

Optionally, the plurality of first feature maps include a first target feature map, the plurality of second feature maps include a second target feature map, and the plurality of third feature maps include a third target feature map and a first target feature map. Four target feature maps, the resolution of the third target feature map is smaller than that of the fourth target feature map; down-sampling the third target feature map to obtain a fifth target feature map, the fifth target feature The image has the same number of channels and resolution as the second target feature map; down-sampling and convolution processing are performed on the first target feature map to obtain a sixth target feature map, which is the same as The second target feature map has the same number of channels and resolution; the fifth target feature map, the second target feature map, and the sixth target feature map are superimposed based on the respective channels to generate the A fourth target feature map; or, down-sampling the third target feature map to obtain a fifth target feature map, the fifth target feature map having the same number of channels and resolution as the second target feature map Rate; down-sampling the first target feature map to obtain a sixth target feature map, the sixth target feature map and the second target feature map have the same resolution; the fifth target feature map , The second target feature map and the sixth target feature map are superimposed on the respective channels and subjected to convolution processing to generate the fourth target feature map, the fourth target feature map and the second target The feature maps have the same number of channels.

Optionally, at least one third feature map of the plurality of third feature maps can be convolved by the first convolution layer to obtain at least one fourth feature map; At least one third feature map in the plurality of third feature maps is convolved to obtain at least one fifth feature map; wherein the receptive field corresponding to the fourth feature map is larger than the receptive field corresponding to the fifth feature map ; According to the fourth feature map and the fifth feature map, perform target detection on the first image, and output a first detection result.

Optionally, determine a first weight value corresponding to the fourth feature map and a second weight value corresponding to the fifth feature map; process the fourth feature map according to the first weight value to obtain The processed fourth feature map; the fifth feature map is processed according to the second weight value to obtain a processed fifth feature map; wherein, the object to be detected included in the fourth feature map is greater than In the case where the fifth feature map includes the object to be detected, the first weight value is greater than the second weight value; according to the processed fourth feature map and the processed fifth feature map, Perform target detection on the first image, and output a first detection result.

Optionally, the first detection result includes a first detection frame, and a second detection result of the first image can be obtained, and the second detection result is object detection of the first image through a second perception network Obtained, the object detection accuracy of the first perception network is higher than the object detection accuracy of the second perception network, and the second detection result includes a second detection frame, and the area where the second detection frame is located corresponds to the There is an intersection between the areas where the first detection frame is located;

Optionally, the second detection result includes multiple detection frames, and there is an intersection between the area where each detection frame of the multiple detection frames is located and the area where the first detection frame is located. The detection frame includes the second detection frame, wherein, in the area of the intersection of the area where each detection frame is located in the multiple detection frames and the area where the first detection frame is located, the second detection frame is located The area of the intersection between the area and the area where the first detection frame is located is the smallest.

Optionally, the first image is an image frame in the video, the second image is an image frame in the video, and the frame distance between the first image and the second image in the video is smaller than a preset Value, the third detection result of the second image is acquired, the third detection result includes a fourth detection frame and the object category corresponding to the fourth detection frame; in the fourth detection frame and the first detection When the shape difference and the position difference between the frames are within a preset range, the first detection frame corresponds to the object category corresponding to the fourth detection frame.

Optionally, the detection confidence of the fourth detection frame is greater than a preset threshold.

In the embodiment of the present application, the first image is an image frame in the video, the second image is an image frame in the video, and the frame distance between the first image and the second image in the video is less than A preset value, a third detection result of the second image can also be obtained, and the third detection result is obtained by performing object detection on the second image through the first perception network or the second perception network , The third detection result includes a fourth detection frame and an object category corresponding to the fourth detection frame; if the shape difference and position difference between the fourth detection frame and the first detection frame are within a preset range , It is determined that the first detection frame corresponds to the object category corresponding to the fourth detection frame. Optionally, the detection confidence of the fourth detection frame is greater than a preset threshold.

In the embodiment of the present application, if the first image is obtained through video frame extraction, a timing detection algorithm can be considered. For the missed target detected by the missed detection algorithm, select several frames of images before and after the current image, and compare whether the area near the center of the missed target in the previous and next frames is similar to the area, aspect ratio, and center coordinates of the missed target Select a specific number of similar target frames, and then compare the similar target frames with other target frames in the suspected missed image whether there are similar target frames, and remove target frames that are similar to other target frames from the detected similar target frames. Obtain the most similar target frame in each of the front and rear frames based on the content similarity and feature similarity algorithm, and get the most similar target frame that is suspected to be missed. If the confidence of the most similar target frame is higher than a certain threshold, you can Determine the category of the missed target. If it is lower than a certain threshold, compare the suspected missed target with the most similar target frame category. If the category is the same, it is judged as a missed test. If the category is different, manual verification can be performed.

An embodiment of the present application provides an object detection method. The method includes: receiving an input first image; performing object detection on the first image through a first perception network to obtain a first detection result, and the first detection The result includes a first detection frame; object detection is performed on the first image through a second perception network, and a second detection result is obtained. The second detection result includes a second detection frame. There is a first intersection between a detection frame; if the ratio of the area of the first intersection to the area of the first detection frame is less than the preset value, the second detection result is updated, and the updated second detection result includes The first detection frame. Through the above method, the timing characteristics are introduced into the model to assist in determining whether the suspected missed test result is a true missed test result, and the category of the missed test result is judged, which improves the verification efficiency.

Referring to FIG. 16, FIG. 16 is a schematic flowchart of a perceptual network training method provided by an embodiment of this application. The perceptual network training method can be used to train the perceptual network in the foregoing embodiment. It should be noted that in this embodiment The first perception network may be the initial network of the perception network in the foregoing embodiment, and the second perception network obtained by training the first perception network may be the perception network in the foregoing embodiment.

As shown in Figure 16, the perceptual network training method includes:

1601. Obtain a pre-labeled detection frame of a target object in an image.

1602. Acquire a target detection frame corresponding to the image and the first perception network, where the target detection frame is used to identify the target object.

In the embodiment of the present application, the detection result of the image may be obtained. The detection result is obtained by object detection on the image through a first perception network, and the detection result includes the target detection frame corresponding to the first object. .

1603. Perform iterative training on the first perception network according to a loss function to output a second perception network; wherein the intersection of the loss function and the pre-labeled detection frame and the target detection frame is more related to IoU .

In the embodiment of the present application, the preset loss function is also related to the shape difference between the target detection frame and the pre-labeled detection frame, wherein the shape difference is negatively related to the area of the pre-labeled detection frame.

In the embodiment of the present application, the rectangular detection frame includes a first side and a second side that are connected, and the circumscribed rectangular frame includes a third side corresponding to the first side, and a first side corresponding to the second side. Four sides, the area difference is also positively correlated with the length difference between the first side and the third side, and positively correlated with the length difference between the second side and the fourth side.

In the embodiment of the present application, the preset loss function is also related to the position difference between the target detection frame and the pre-labeled detection frame in the image, wherein the position difference is related to the pre-labeled detection frame. The area of ?? is negatively correlated; or the position difference is negatively correlated with the area of the smallest circumscribed rectangle of the convex hull of the pre-labeled detection frame and the target detection frame.

In the embodiment of the present application, the target detection frame includes a first corner point and a first center point, the pre-labeled detection frame includes a second corner point and a second center point, the first corner point and the second center point are The corner points are the two end points of the diagonal of the rectangle, and the position difference is also positively correlated with the position difference between the first center point and the second center point in the image, and with the first corner point. It is negatively related to the length of the second corner point.

Exemplarily, the loss function may be as follows:

In the embodiment of this application, the newly designed frame regression loss function uses scale invariance and is applied to the IoU loss item of the target detection measurement method, the loss item considering the aspect ratio of the predicted frame and the real frame, and the predicted frame center coordinates and The loss item of the ratio of the distance between the center coordinates of the real frame and the coordinates of the lower right corner of the predicted frame and the coordinates of the upper left corner of the real frame. The IoU loss item naturally introduces a constant-scale frame prediction quality evaluation index. The loss item of the aspect ratio of the two frames measures the two frames. The degree of fit between the shapes, the third distance ratio metric is used to solve the problem that when IoU=0, the relative position relationship between the predicted frame and the real frame cannot be known, and the back propagation is difficult to minimize the loss function. , after the introduction ratio of the distance, the distance is smaller than the naturally narrow the distance between the o _p o _g, remote

with

the distance between. The three loss functions are assigned different weights to balance the impact of each item. Among them, the aspect ratio and distance ratio will be introduced and

Proportional weight coefficient to reduce the influence of frame scale, large-scale frame has smaller weight, and small-scale frame has larger weight. The frame regression loss function proposed in this patent is suitable for various two-stage and one-stage algorithms, and has good versatility, as well as the fit between the target scale, the frame, and the center point and the corner of the frame. Played a very good role in promoting.

In the embodiment of the application, the IoU, which measures the degree of fit between the detected frame and the real frame in the target detection, is used as the loss function of frame regression. Due to the inherent scale invariance of IoU, it solves that other frame regression functions in the past are more sensitive to scale changes. In addition, the loss of the aspect ratio between the predicted frame and the real frame is introduced, which promotes the predicted frame to fit the real frame more closely during the training process, and through the introduction of

Proportional weight coefficient to reduce the impact of scale changes; introduce the ratio of the distance between the center coordinates of the predicted border and the center coordinates of the real border and the distance between the lower right corner coordinates of the predicted border and the upper left corner coordinates of the real border, which is used to solve the problem of not knowing when IoU=0 the relative position between the two border, then the distance, the problem can not be back-propagation, the introduction ratio of the distance, the distance is smaller than the naturally narrow the distance between the o _p o _g, remote

with

The distance between promotes the border to change in the right direction.

The preset loss function includes a target loss item related to the position difference, and the target loss item changes with the change of the position difference; wherein, when the position difference is greater than a preset value, the The rate of change of the target loss item is greater than the first preset rate of change; and/or, when the position difference is less than the preset value, the rate of change of the target loss item is less than the second preset rate of change.

Optionally, the bounding box regression loss function can also be the following loss function:

The above-mentioned frame regression loss function utilizes scale invariance and is widely used in the detection measurement method of the IoU loss item, the loss item that takes into account the aspect ratio of the predicted frame and the real frame, and the pull loss item that narrows the distance between the predicted frame and the real frame. , The IoU loss term naturally introduces a constant-scale border prediction quality evaluation index. The loss term of the aspect ratio of the two borders measures the fit of the shape between the two borders. The third distance ratio metric is used to solve the problem when IoU=0. It is impossible to know the relative position relationship between the predicted frame and the real frame, and it is difficult to backpropagate to minimize the loss function. Therefore, the third pull loss item is introduced to shorten the distance between the predicted frame and the real frame. The ratio of the area area1 of the diagonal rectangle between the center of the predicted frame and the center of the real frame and the area2 of the smallest convex rectangle between the predicted frame and the real frame is used as the distance_IoU term, and f(x)=-ln(1-x) is used as the loss Function calculation item, it can be seen that the range of distance_IoU is [0,1). From the curve of f(x)=-ln(1-x), it can be seen that it has a rapid convergence trend between [0,1), so Setting x to x=distance_IoU to obtain the third loss item can achieve the purpose of quickly narrowing the predicted bounding box and the real bounding box.

Referring to FIG. 17, FIG. 17 is a schematic flowchart of an object detection method provided by an embodiment of the application. As shown in FIG. 17, the object detection method includes:

1701. Receive an input first image;

1702. Perform convolution processing on the first image to generate multiple first feature maps;

1703. Generate multiple second feature maps according to the multiple first feature maps; wherein the multiple first feature maps include more texture details of the input image than the multiple second feature maps And/or location details in the input image;

Compared with the first feature map, multiple second feature maps can include more deep features. Deep features contain rich semantic information, which has a good effect on classification tasks. At the same time, deep features have larger receptive fields. It has a good detection effect for large targets; in one implementation, by introducing a top-down path to generate multiple second feature maps, the rich semantic information contained in deep features can naturally be propagated downward. Make the second feature maps of each scale contain rich semantic information.

1704. Generate multiple third feature maps according to the multiple first feature maps and the multiple second feature maps.

1705. According to at least one third feature map of the plurality of third feature maps, output a first detection result of an object included in the image.

In an existing implementation, the second feature map generation unit (such as the feature pyramid network) introduces a top-down path to propagate the rich semantic information contained in deep features downwards, so that the second feature map generation unit of each scale Feature maps contain rich semantic information, and deep features have a large receptive field, which makes it possible to have a better detection effect on large targets. However, in the existing implementation, the more detailed position detail information and texture detail information contained in the shallower feature maps are ignored, which has a great impact on the detection accuracy of medium and small targets. In the embodiment of this application, the second feature map generation unit introduces the shallow texture detail information of the original feature map (multiple first feature maps generated by the convolution processing unit) into the deep feature map (generated by the first feature map generation unit). In multiple second feature maps), multiple third feature maps are generated, and the third feature map with shallow layer of rich texture detail information is used as the input data of the detection unit for target detection, which can improve the detection of subsequent object detection Accuracy.

Optionally, the plurality of first feature maps, the plurality of second feature maps, and the plurality of third feature maps are feature maps with multi-scale resolution.

Optionally, the plurality of first feature maps include a first target feature map, the plurality of second feature maps include a second target feature map, and the plurality of third feature maps include a third target feature map and a first target feature map. Four target feature maps, the resolution of the third target feature map is smaller than that of the fourth target feature map; said generating a plurality of third features according to the plurality of first feature maps and the plurality of second feature maps Figures, including:

Down-sampling the third target feature map to obtain a fifth target feature map, the fifth target feature map and the second target feature map have the same number of channels and resolution; for the first target The feature map is subjected to down-sampling and convolution processing to obtain a sixth target feature map. The sixth target feature map has the same number of channels and resolution as the second target feature map; and the fifth target feature map , The second target feature map and the sixth target feature map are superimposed on channels to generate the fourth target feature map; or,

Down-sampling the third target feature map to obtain a fifth target feature map, the fifth target feature map and the second target feature map have the same number of channels and resolution; for the first target The feature map is down-sampled to obtain a sixth target feature map. The sixth target feature map and the second target feature map have the same resolution; the fifth target feature map and the second target feature map are Perform channel superposition and convolution processing with the sixth target feature map to generate the fourth target feature map, and the fourth target feature map has the same number of channels as the second target feature map.

Optionally, the method further includes:

The fifth feature map is processed according to the second weight value to obtain a processed fifth feature map; wherein, the object to be detected included in the fourth feature map is larger than the object to be detected included in the fifth feature map In the case of an object, the first weight value is greater than the second weight value;

Optionally, the first detection result includes a first detection frame, and the method further includes:

Optionally, the second detection result includes multiple detection frames, and there is an intersection between the area where each detection frame of the multiple detection frames is located and the area where the first detection frame is located. The detection frame includes the second detection frame, wherein, in the area of the intersection of the area where each detection frame is located in the plurality of detection frames and the area where the first detection frame is located, the second detection frame is located The area of the intersection between the area and the area where the first detection frame is located is the smallest.

Optionally, the first image is an image frame in the video, the second image is an image frame in the video, and the frame distance between the first image and the second image in the video is less than a preset Value, the method further includes:

Referring to FIG. 18, FIG. 18 is a schematic diagram of a perceptual network training device provided by an embodiment of the application. As shown in Fig. 18, the perceptual network training device 1800 includes:

An obtaining module 1801, configured to obtain a pre-labeled detection frame of a target object in an image; and obtain a target detection frame corresponding to the image and the first perception network, the target detection frame being used to identify the target object;

The iterative training module 1802 is configured to iteratively train the first perceptual network according to the loss function to output the second perceptual network; wherein the loss function and the difference between the pre-labeled detection frame and the target detection frame Consolidation is more related to IoU.

Optionally, the preset loss function is also related to the shape difference between the target detection frame and the pre-labeled detection frame, wherein the shape difference is negatively related to the area of the pre-labeled detection frame.

Optionally, the preset loss function is also related to the position difference between the target detection frame and the pre-labeled detection frame in the image, wherein the position difference is related to the area of the pre-labeled detection frame. Negative correlation; or the position difference is negatively correlated with the area of the smallest circumscribed rectangle of the convex hull of the pre-labeled detection frame and the target detection frame.

Optionally, the target detection frame includes a first corner point and a first center point, the pre-labeled detection frame includes a second corner point and a second center point, the first corner point and the second corner point Are the two end points of the diagonal of the rectangle, the position difference is also positively correlated with the position difference between the first center point and the second center point in the image, and is also positively correlated with the first corner point and the The length of the second corner point is negatively related.

Optionally, the preset loss function includes a target loss item related to the position difference, and the target loss item changes as the position difference changes; wherein,

Referring to FIG. 19, FIG. 19 is a schematic diagram of an object detection device provided by an embodiment of the application. As shown in FIG. 19, the object detection device 1900 includes:

The receiving module 1901 is configured to receive an input first image, and perform convolution processing on the first image to generate multiple first feature maps;

The convolution processing module 1902 is configured to perform convolution processing on the first image to generate multiple first feature maps;

The first feature map generating module 1903 is configured to generate multiple second feature maps according to the multiple first feature maps; wherein, the multiple first feature maps include more than the multiple second feature maps. Texture details of the input image and/or location details in the input image;

The second feature map generating module 1904 is configured to generate multiple third feature maps according to the multiple first feature maps and the multiple second feature maps;

The detection module 1905 is configured to output a first detection result of an object included in the image according to at least one third feature map of the plurality of third feature maps.

Optionally, the plurality of first feature maps include a first target feature map, the plurality of second feature maps include a second target feature map, and the plurality of third feature maps include a third target feature map and a first target feature map. Four target feature maps, the resolution of the third target feature map is smaller than that of the fourth target feature map, and the second feature map generating module is specifically configured to:

Down-sampling the third target feature map to obtain a fifth target feature map, the fifth target feature map and the second target feature map have the same number of channels and resolution; for the first target The feature map is subjected to down-sampling and convolution processing to obtain a sixth target feature map, where the sixth target feature map has the same number of channels and resolution as the second target feature map; and the fifth target feature map , The second target feature map and the sixth target feature map are superimposed on channels to generate the fourth target feature map; or,

Optionally, the device further includes:

An intermediate feature extraction module, configured to perform convolution on at least one third feature map of the plurality of third feature maps through the first convolution layer to obtain at least one fourth feature map;

Correspondingly, the detection module is specifically used for:

Optionally, the device further includes:

Correspondingly, the detection module is specifically used for:

Optionally, the first detection result includes a first detection frame, and the acquisition module is further configured to:

An update module, configured to update the second detection result if the ratio of the area of the intersection to the area of the first detection frame is less than a preset value, so that the updated second detection result includes the first Check box.

Optionally, the first image is an image frame in the video, the second image is an image frame in the video, and the frame distance between the first image and the second image in the video is smaller than a preset Value, the acquisition module is also used to:

In the case that the shape difference and position difference between the fourth detection frame and the first detection frame are within a preset range, the first detection frame corresponds to the object category corresponding to the fourth detection frame.

Next, an execution device provided by an embodiment of this application will be introduced. Please refer to FIG. 20. FIG. 20 is a schematic structural diagram of an execution device provided by an embodiment of this application. Tablets, laptops, smart wearable devices, monitoring data processing devices, etc., are not limited here. Wherein, the object detection device described in the embodiment corresponding to FIG. 19 may be deployed on the execution device 2000 to implement the function of object detection in the embodiment corresponding to FIG. 19. Specifically, the execution device 2000 includes: a receiver 2001, a transmitter 2002, a processor 2003, and a memory 2004 (the number of processors 2003 in the execution device 2000 may be one or more, and one processor is taken as an example in FIG. 20) , Where the processor 2003 may include an application processor 20031 and a communication processor 20032. In some embodiments of the present application, the receiver 2001, the transmitter 2002, the processor 2003, and the memory 2004 may be connected by a bus or other methods.

The memory 2004 may include a read-only memory and a random access memory, and provides instructions and data to the processor 2003. A part of the memory 2004 may also include a non-volatile random access memory (NVRAM). The memory 2004 stores a processor and operating instructions, executable modules or data structures, or a subset of them, or an extended set of them. The operating instructions may include various operating instructions for implementing various operations.

The processor 2003 controls the operation of the execution device. In a specific application, the various components of the execution device are coupled together through a bus system, where the bus system may include a power bus, a control bus, and a status signal bus in addition to a data bus. However, for the sake of clear description, various buses are referred to as bus systems in the figure.

The method disclosed in the above embodiments of the present application may be applied to the processor 2003 or implemented by the processor 2003. The processor 2003 may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by an integrated logic circuit of hardware in the processor 2003 or instructions in the form of software. The aforementioned processor 2003 may be a general-purpose processor, a digital signal processing (DSP), a microprocessor or a microcontroller, and may further include an application specific integrated circuit (ASIC), field programmable Field-programmable gate array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components. The processor 2003 can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor. The software module can be located in a mature storage medium in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers. The storage medium is located in the memory 2004, and the processor 2003 reads the information in the memory 2004, and completes the steps of the above method in combination with its hardware.

The receiver 2001 can be used to receive input digital or character information, and generate signal input related to the relevant settings and function control of the execution device. The transmitter 2002 can be used to output digital or character information through the first interface; the transmitter 2002 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 2002 can also include display devices such as a display screen .

In the embodiment of the present application, in one case, the processor 2003 is configured to execute the image processing method executed by the execution device in the embodiment corresponding to FIG. 9 to FIG. 11. Specifically, the application processor 20031 is configured to execute the object detection method in the foregoing embodiment.

An embodiment of this application also provides a training device. Please refer to FIG. 21. FIG. 21 is a schematic structural diagram of a training device provided in an embodiment of this application. The training device 2100 may be deployed with the perception described in the embodiment corresponding to FIG. 18 The network training device is used to realize the function of the perceptual network training device in the embodiment corresponding to FIG. 18. Specifically, the training device 2100 is implemented by one or more servers, and the training device 2100 may have relatively large differences due to different configurations or performances. It may include one or more central processing units (CPU) 2121 (for example, one or more processors) and a memory 2132, and one or more storage media 2130 (for example, one or A storage device in Shanghai). Among them, the memory 2132 and the storage medium 2130 may be short-term storage or persistent storage. The program stored in the storage medium 2130 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the training device. Furthermore, the central processing unit 2121 may be configured to communicate with the storage medium 2130, and execute a series of instruction operations in the storage medium 2130 on the training device 2100.

The training device 2100 may also include one or more power supplies 2126, one or more wired or wireless network interfaces 2150, and one or more input and output interfaces 2158; or, one or more operating systems 2141, such as Windows ServerTM, Mac OS XTM , UnixTM, LinuxTM, FreeBSDTM and so on.

In the embodiment of the present application, the central processing unit 2121 is configured to execute the relevant steps of the perceptual network training method in the foregoing embodiment.

The embodiments of the present application also provide a product including a computer program, which when running on a computer, causes the computer to execute the steps executed by the aforementioned execution device, or causes the computer to execute the steps executed by the aforementioned training device.

The embodiments of the present application also provide a computer-readable storage medium that stores a program for signal processing, and when it runs on a computer, the computer executes the steps performed by the aforementioned execution device. Or, make the computer execute the steps performed by the aforementioned training device.

The execution device, training device, or terminal device provided by the embodiments of the present application may specifically be a chip. The chip includes a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, Pins or circuits, etc. The processing unit can execute the computer-executable instructions stored in the storage unit to make the chip in the execution device execute the data processing method described in the foregoing embodiment, or to make the chip in the training device execute the data processing method described in the foregoing embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, etc., and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as Read-only memory (ROM) or other types of static storage devices that can store static information and instructions, random access memory (RAM), etc.

Specifically, please refer to Fig. 22. Fig. 22 is a schematic diagram of a structure of a chip provided by an embodiment of the application. On the CPU), the Host CPU assigns tasks. The core part of the NPU is the arithmetic circuit 2203, and the controller 2204 controls the arithmetic circuit 2203 to extract matrix data from the memory and perform multiplication operations.

In some implementations, the arithmetic circuit 2203 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 2203 is a two-dimensional systolic array. The arithmetic circuit 2203 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2203 is a general-purpose matrix processor.

For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the corresponding data of matrix B from the weight memory 2202 and caches it on each PE in the arithmetic circuit. The arithmetic circuit fetches matrix A data and matrix B from the input memory 2201 to perform matrix operations, and the partial result or final result of the obtained matrix is stored in an accumulator 2208.

The unified memory 2206 is used to store input data and output data. The weight data directly passes through the storage unit access controller (Direct Memory Access Controller, DMAC) 2205, and the DMAC is transferred to the weight memory 2202. The input data is also transferred to the unified memory 2206 through the DMAC.

The BIU is the Bus Interface Unit, that is, the bus interface unit 2210, which is used for the interaction of the AXI bus with the DMAC and the instruction fetch buffer (IFB) 2209.

The bus interface unit 2210 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 2209 to obtain instructions from the external memory, and is also used for the storage unit access controller 2205 to obtain the original data of the input matrix A or the weight matrix B from the external memory.

The DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 2206 or to transfer the weight data to the weight memory 2202 or to transfer the input data to the input memory 2201.

The vector calculation unit 2207 includes a plurality of arithmetic processing units. If necessary, further processing is performed on the output of the arithmetic circuit 2203, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on. It is mainly used in the calculation of non-convolutional/fully connected layer networks in neural networks, such as Batch Normalization, pixel-level summation, and upsampling of feature planes.

In some implementations, the vector calculation unit 2207 can store the processed output vector to the unified memory 2206. For example, the vector calculation unit 2207 may apply a linear function; or, apply a nonlinear function to the output of the arithmetic circuit 2203, for example, perform linear interpolation on the feature plane extracted by the convolutional layer, and for example, a vector of accumulated values to generate the activation value. In some implementations, the vector calculation unit 2207 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 2203, for example for use in a subsequent layer in a neural network.

The instruction fetch buffer 2209 connected to the controller 2204 is used to store instructions used by the controller 2204;

The unified memory 2206, the input memory 2201, the weight memory 2202, and the fetch memory 2209 are all On-Chip memories. The external memory is private to the NPU hardware architecture.

Among them, the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.

In addition, it should be noted that the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physically separate. The physical unit can be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, in the drawings of the device embodiments provided in the present application, the connection relationship between the modules indicates that they have a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.

Through the description of the above embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general hardware. Of course, it can also be implemented by dedicated hardware including dedicated integrated circuits, dedicated CPUs, dedicated memory, Dedicated components and so on to achieve. Under normal circumstances, all functions completed by computer programs can be easily implemented with corresponding hardware. Moreover, the specific hardware structures used to achieve the same function can also be diverse, such as analog circuits, digital circuits or special-purpose circuits. Circuit etc. However, for this application, software program implementation is a better implementation in more cases. Based on this understanding, the technical solution of this application essentially or the part that contributes to the existing technology can be embodied in the form of a software product, and the computer software product is stored in a readable storage medium, such as a computer floppy disk. , U disk, mobile hard disk, ROM, RAM, magnetic disk or optical disk, etc., including several instructions to make a computer device (which can be a personal computer, training device, or network device, etc.) execute the various embodiments described in this application method.

In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, it can be implemented in the form of a computer program product in whole or in part.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, training device, or data. The center transmits to another website, computer, training equipment, or data center through wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be stored by a computer or a data storage device such as a training device or a data center integrated with one or more available media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

Claims

A data processing system, characterized in that the data processing system comprises: a convolution processing unit, a first feature map generating unit, a second feature map generating unit, and a detection unit, the convolution processing unit and the first feature map generating unit, respectively A feature map generating unit is connected to the second feature map generating unit, the first feature map generating unit is connected to the second feature map generating unit, and the second feature map generating unit is connected to the detection unit;

The convolution processing unit is configured to receive an input image, and perform convolution processing on the input image to generate a plurality of first feature maps;

The first characteristic map generating unit is configured to generate a plurality of second characteristic maps according to the plurality of first characteristic maps; wherein, the plurality of first characteristic maps include more than the plurality of second characteristic maps Texture details of the input image and/or location details in the input image;

The second characteristic map generating unit is configured to generate a plurality of third characteristic maps according to the plurality of first characteristic maps and the plurality of second characteristic maps;

The detection unit is configured to output a detection result of an object included in the image according to at least one third feature map of the plurality of third feature maps.
The data processing system according to claim 1, wherein the plurality of second feature maps include more semantic information than the plurality of first feature maps.
The data processing system according to claim 1 or 2, wherein the convolution processing unit is a backbone network, and the first feature map generating unit and the second feature map generating unit are feature pyramid network FPN, The detection unit is head.
The data processing system according to any one of claims 1 to 3, wherein the plurality of first feature maps include a first target feature map, and the plurality of second feature maps include a second target feature map, so The multiple third feature maps include a third target feature map and a fourth target feature map, the resolution of the third target feature map is smaller than the fourth target feature map, and the second feature map generating unit is used to pass The following steps are used to generate the fourth target feature map:

Down-sampling the third target feature map to obtain a fifth target feature map, the fifth target feature map and the second target feature map have the same number of channels and resolution; for the first target The feature map is subjected to down-sampling and convolution processing to obtain a sixth target feature map. The sixth target feature map has the same number of channels and resolution as the second target feature map; and the fifth target feature map , The second target feature map and the sixth target feature map are superimposed on channels to generate the fourth target feature map; or,

Down-sampling the third target feature map to obtain a fifth target feature map, the fifth target feature map and the second target feature map have the same number of channels and resolution; for the first target The feature map is down-sampled to obtain a sixth target feature map. The sixth target feature map and the second target feature map have the same resolution; the fifth target feature map and the second target feature map are Perform channel superposition and convolution processing with the sixth target feature map to generate the fourth target feature map, and the fourth target feature map has the same number of channels as the second target feature map.
The data processing system according to any one of claims 1 to 4, wherein the data processing system further comprises:

An intermediate feature extraction layer, configured to perform convolution on at least one third feature map of the plurality of third feature maps to obtain at least one fourth feature map;

And convolving at least one third feature map in the plurality of third feature maps to obtain at least one fifth feature map; wherein the receptive field corresponding to the fourth feature map is larger than the fifth feature map Corresponding receptive field;

The detection unit is specifically configured to output a detection result of the object included in the image according to the at least one fourth feature map and the at least one fifth feature map.
The data processing system according to claim 5, wherein the intermediate feature extraction layer is further used for:

Processing the at least one fourth feature map according to the first weight value to obtain a processed fourth feature map;

The at least one fifth feature map is processed according to the second weight value to obtain a processed fifth feature map; wherein, the object to be detected included in the fourth feature map is larger than that included in the fifth feature map In the case of an object to be detected, the first weight value is greater than the second weight value;

Correspondingly, the detection unit is specifically configured to output the detection result of the object included in the image according to the processed fourth feature map and the processed fifth feature map.
A perceptual network training method, characterized in that the method includes:

Obtaining a pre-labeled detection frame of the target object in the image;

Acquiring a target detection frame corresponding to the image and the first perception network, where the target detection frame is used to identify the target object;

The first perceptual network is iteratively trained according to the loss function to output the second perceptual network; wherein the intersection of the loss function and the pre-labeled detection frame and the target detection frame is more related to IoU.
The method according to claim 7, wherein the preset loss function is also related to the shape difference between the target detection frame and the pre-labeled detection frame, wherein the shape difference is related to the pre-labeled detection frame. The area of the detection frame is negatively correlated.
The method according to claim 7 or 8, wherein the preset loss function is also related to the position difference between the target detection frame and the pre-labeled detection frame in the image, wherein the The position difference is negatively related to the area of the pre-labeled detection frame; or the position difference is negatively related to the area of the smallest circumscribed rectangle of the convex hull of the pre-labeled detection frame and the target detection frame.
The method according to claim 9, wherein the target detection frame includes a first corner point and a first center point, the pre-labeled detection frame includes a second corner point and a second center point, and the first The corner point and the second corner point are the two end points of the diagonal of the rectangle, and the position difference is also positively correlated with the position difference between the first center point and the second center point in the image, and It is negatively related to the length of the first corner point and the second corner point.
The method according to claim 7 or 8, wherein the preset loss function includes a target loss item related to the position difference, and the target loss item changes as the position difference changes; in,

When the position difference is greater than the preset value, the rate of change of the target loss item is greater than the first preset rate of change; and/or, when the position difference is less than the preset value, the rate of change of the target loss item Less than the second preset rate of change.
An object detection method, characterized in that the method is used in a first perception network, and the method includes:

Receiving an input first image, and performing convolution processing on the first image to generate a plurality of first feature maps;

Generate multiple second feature maps according to the multiple first feature maps; wherein the multiple first feature maps include more texture details and/or of the input image than the multiple second feature maps Or the location details in the input image;

Generating a plurality of third characteristic maps according to the plurality of first characteristic maps and the plurality of second characteristic maps;

According to at least one third feature map of the plurality of third feature maps, a first detection result of the object included in the image is output.
The method according to claim 12, wherein the plurality of second feature maps include more semantic information than the plurality of first feature maps.
The method according to claim 12 or 13, wherein the plurality of first feature maps, the plurality of second feature maps, and the plurality of third feature maps are feature maps with multi-scale resolution .
The method according to any one of claims 12 to 14, wherein the multiple first feature maps include a first target feature map, the multiple second feature maps include a second target feature map, and the multiple The third feature map includes a third target feature map and a fourth target feature map. The resolution of the third target feature map is smaller than that of the fourth target feature map; The multiple second feature maps generate multiple third feature maps, including:

Down-sampling the third target feature map to obtain a fifth target feature map, the fifth target feature map and the second target feature map have the same number of channels and resolution; for the first target The feature map is subjected to down-sampling and convolution processing to obtain a sixth target feature map. The sixth target feature map has the same number of channels and resolution as the second target feature map; and the fifth target feature map , The second target feature map and the sixth target feature map are superimposed on channels to generate the fourth target feature map; or,

Down-sampling the third target feature map to obtain a fifth target feature map, the fifth target feature map and the second target feature map have the same number of channels and resolution; for the first target The feature map is down-sampled to obtain a sixth target feature map. The sixth target feature map and the second target feature map have the same resolution; the fifth target feature map and the second target feature map are Perform channel superposition and convolution processing with the sixth target feature map to generate the fourth target feature map, and the fourth target feature map has the same number of channels as the second target feature map.
The method according to any one of claims 12 to 15, wherein the method further comprises:

Performing convolution on at least one third feature map of the plurality of third feature maps by the first convolution layer to obtain at least one fourth feature map;

And convolving at least one third feature map in the plurality of third feature maps to obtain at least one fifth feature map; wherein the receptive field corresponding to the fourth feature map is larger than the fifth feature map Corresponding receptive field;

Correspondingly, the outputting the first detection result of the object included in the image according to at least one third feature map of the plurality of third feature maps includes:

According to the at least one fourth feature map and the at least one fifth feature map, a first detection result of the object included in the image is output.
The method according to claim 16, wherein the method further comprises:

Processing the fourth feature map according to the first weight value to obtain a processed fourth feature map;

The fifth feature map is processed according to the second weight value to obtain a processed fifth feature map; wherein, the object to be detected included in the fourth feature map is larger than the object to be detected included in the fifth feature map In the case of an object, the first weight value is greater than the second weight value;

Correspondingly, the outputting the first detection result of the object included in the image according to at least one third feature map of the plurality of third feature maps includes:

According to the processed fourth feature map and the processed fifth feature map, a first detection result of the object included in the image is output.
The method according to any one of claims 12 to 17, wherein the first detection result comprises a first detection frame, and the method further comprises:

Acquire a second detection result of the first image, the second detection result is obtained by object detection of the first image through a second perception network, and the object detection accuracy of the first perception network is higher than that of the Object detection accuracy of the second perception network, the second detection result includes a second detection frame, and there is an intersection between the area where the second detection frame is located and the area where the first detection frame is located;

If the ratio of the area of the intersection to the area of the first detection frame is less than a preset value, the second detection result is updated so that the updated second detection result includes the first detection frame.
The method according to claim 18, wherein the second detection result includes a plurality of detection frames, and the area where each detection frame of the plurality of detection frames is located is the same as the area where the first detection frame is located. There is an intersection, and the plurality of detection frames includes the second detection frame, wherein the area of the intersection of the area where each detection frame is located in the multiple detection frames and the area where the first detection frame is located Wherein, the area of the intersection between the area where the second detection frame is located and the area where the first detection frame is located is the smallest.
The method according to claim 18 or 19, wherein the first image is an image frame in the video, the second image is an image frame in the video, and the first image and the second image are The frame spacing in the video is less than a preset value, and the method further includes:

Acquiring a third detection result of the second image, where the third detection result includes a fourth detection frame and an object category corresponding to the fourth detection frame;

When the shape difference and position difference between the fourth detection frame and the first detection frame are within a preset range, the first detection frame corresponds to the object category corresponding to the fourth detection frame.
A perceptual network training device, characterized in that the device includes:

An acquiring module for acquiring a pre-labeled detection frame of a target object in an image; and acquiring a target detection frame corresponding to the image and the first perception network, the target detection frame being used to identify the target object;

The iterative training module is used to iteratively train the first perception network according to the loss function to output the second perception network; wherein the loss function and the intersection between the pre-labeled detection frame and the target detection frame And more related to IoU.
The device according to claim 21, wherein the preset loss function is also related to the shape difference between the target detection frame and the pre-labeled detection frame, wherein the shape difference is related to the pre-labeled detection frame. The area of the detection frame is negatively correlated.
The device according to claim 21 or 22, wherein the preset loss function is also related to the position difference between the target detection frame and the pre-labeled detection frame in the image, wherein the The position difference is negatively related to the area of the pre-labeled detection frame; or the position difference is negatively related to the area of the smallest circumscribed rectangle of the convex hull of the pre-labeled detection frame and the target detection frame.
The apparatus according to claim 23, wherein the preset loss function comprises a target loss item related to the position difference, and the target loss item changes with the change of the position difference; wherein,

When the position difference is greater than the preset value, the rate of change of the target loss item is greater than the first preset rate of change; and/or, when the position difference is less than the preset value, the rate of change of the target loss item Less than the second preset rate of change.
An object detection device, characterized by comprising a storage medium, a processing circuit, and a bus system; wherein the storage medium is used to store instructions, and the processing circuit is used to execute instructions in the memory to execute the claim 7. To the steps of the method of any one of 20.
A computer-readable storage medium having a computer program stored thereon, wherein the program is executed by a processor to implement the steps of the method according to any one of claims 7 to 20.