WO2021218786A1 - Data processing system, object detection method and apparatus thereof - Google Patents


Info

Publication number
WO2021218786A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map, target, detection
Application number
PCT/CN2021/089118
Other languages
French (fr)
Chinese (zh)
Inventor
Ying Jiangyong (应江勇)
Zhu Xiongwei (朱雄威)
Gao Jing (高敬)
Chen Lei (陈雷)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021218786A1
Priority to US18/050,051 (published as US20230076266A1)


Classifications

    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06V10/40 Extraction of image or video features
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a data processing system, object detection method and device.
  • Computer vision is an integral part of intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis, and military. It studies how to use cameras/video cameras and computers to obtain the data and information we need about a photographed subject. Vividly speaking, it gives the computer eyes (a camera or video camera) and a brain (an algorithm) so that it can replace the human eye in identifying, tracking, and measuring targets, enabling the computer to perceive the environment. Because perception can be seen as extracting information from sensory signals, computer vision can also be seen as the science of how to make artificial systems "perceive" from images or multi-dimensional data.
  • computer vision uses various imaging systems to replace the visual organs to obtain input information, and then the computer replaces the brain to complete the processing and interpretation of the input information.
  • the ultimate research goal of computer vision is to enable computers to observe and understand the world through vision like humans, and have the ability to adapt to the environment autonomously.
  • the perception network can be a neural network model that processes and analyzes images and obtains the processing results.
  • The perception network can complete more and more functions, such as image classification, 2D detection, semantic segmentation, key point detection, linear object detection (such as lane line or stop line detection in automatic driving technology), drivable area detection, and so on.
  • The visual perception system has the advantages of low cost, non-contact operation, small size, and a large amount of information. With the continuous improvement of the accuracy of visual perception algorithms, it has become a key technology of many artificial intelligence systems and is more and more widely used, for example: recognizing dynamic obstacles (people or cars) and static objects (traffic lights, traffic signs, or traffic cones) on the road in an advanced driving assistance system (ADAS) or an autonomous driving system (ADS), or achieving a slimming effect in a terminal camera's beauty function by recognizing the mask and key points of the human body.
  • ADAS advanced driving assistance system
  • ADS autonomous driving system
  • Perception networks usually include a feature pyramid network (FPN), which introduces a top-down network structure together with lateral connection branches from the original feature extraction network: the feature map of the corresponding resolution from the original feature network is fused with the upsampled deep feature map. The top-down structure introduced by the FPN has a larger receptive field, but its detection accuracy for small objects is low.
  • FPN feature pyramid networks
  • The present application provides an object detection method; the method is used in a first perception network and includes:
  • CNN processing on the input image should not be understood as only performing convolution processing on the input image.
  • Convolution processing, pooling operations, and so on can be performed on the input image.
  • Performing convolution processing on the first image to generate multiple first feature maps should not be understood as meaning that each convolution of the first image directly generates one first feature map, that is, that every first feature map is obtained by convolving the first image itself; rather, the first image is, as a whole, the source of the multiple first feature maps. In one implementation, the first image is convolved to obtain a first feature map, the generated first feature map is then convolved to obtain another first feature map, and so on, yielding multiple first feature maps.
  • Specifically, a series of convolution operations may be performed on the input image: in each step, the first feature map obtained by the previous convolution is convolved to produce a new first feature map, and multiple first feature maps are obtained in this way.
  • The multiple first feature maps may be feature maps with multi-scale resolution; that is, the multiple first feature maps do not all have the same resolution.
  • The multiple first feature maps can form a feature pyramid.
  • The input image may be received and subjected to convolution processing to generate multiple first feature maps with multi-scale resolution; the convolution processing unit may perform a series of convolution operations on the input image to obtain feature maps at different scales (with different resolutions).
  • The convolution processing unit can take many forms, such as a visual geometry group (VGG) network, a residual neural network (ResNet), the core structure of GoogLeNet (Inception-Net), and so on.
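  • To make the convolution-processing step concrete, the following is a minimal PyTorch sketch (not the patent's actual network) of a backbone whose stages each convolve the previous stage's output and halve the resolution, so the collected outputs are multiple first feature maps with multi-scale resolution; the stage count and channel widths are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Sketch of a convolution processing unit: each stage convolves the first
    feature map produced by the previous stage and halves the resolution, so
    the collected outputs form multi-scale first feature maps."""
    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        stages, in_ch = [], 3
        for out_ch in channels:
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True)))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)        # convolve the previous first feature map
            feats.append(x)
        return feats            # multiple first feature maps (multi-scale)

c1, c2, c3, c4 = TinyBackbone()(torch.randn(1, 3, 256, 256))
print([f.shape[-1] for f in (c1, c2, c3, c4)])   # [128, 64, 32, 16]
```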
  • the "generating multiple second feature maps based on the multiple first feature maps” here should not be understood to mean that the source of each second feature map generated in the multiple second feature maps is Multiple first feature maps; in one implementation, a part of the second feature maps in the multiple second feature maps are directly generated based on one or more first feature maps in the multiple first feature maps; one In this implementation, a part of the second feature maps in the multiple second feature maps are directly generated based on one or more first feature maps in the multiple first feature maps, and other second feature maps other than itself ; In one implementation, a part of the second feature maps in the multiple second feature maps are directly generated based on other second feature maps other than itself.
  • The multiple second feature maps may be feature maps with multi-scale resolution; that is, the multiple second feature maps do not all have the same resolution.
  • The multiple second feature maps can form a feature pyramid.
  • A convolution operation can be performed on the topmost feature map C4 among the multiple first feature maps generated by the convolution processing unit.
  • Hole (dilated) convolution and 1×1 convolution can be used to reduce the number of channels of the topmost feature map C4 to 256, giving the topmost feature map P4 of the feature pyramid; the output of the next feature map C3 is laterally linked and its channel count reduced to 256 with a 1×1 convolution, and the result is added pixel by pixel with the upsampled P4 to obtain the feature map P3; and so on, from top to bottom, a first feature pyramid is constructed, which may include a plurality of second feature maps.
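  • The top-down construction just described can be sketched as follows; this is a hedged PyTorch illustration with the dilated-plus-1×1 reduction of C4 to 256 channels, 1×1 lateral links, and pixel-wise addition with the upsampled deeper level, where the input channel counts are assumptions rather than values from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Sketch of the first feature pyramid: reduce the topmost map C4 to 256
    channels (dilated 3x3 then 1x1 conv), laterally link shallower maps with
    1x1 convs, and add each to the upsampled deeper level pixel by pixel."""
    def __init__(self, in_channels=(128, 256, 512), out_ch=256):
        super().__init__()
        self.top = nn.Sequential(
            nn.Conv2d(in_channels[-1], in_channels[-1], 3, padding=2, dilation=2),
            nn.Conv2d(in_channels[-1], out_ch, 1))      # C4 -> P4, 256 channels
        self.laterals = nn.ModuleList(
            nn.Conv2d(ch, out_ch, 1) for ch in in_channels[:-1])

    def forward(self, feats):                # feats = [C2, C3, C4]
        p = self.top(feats[-1])              # topmost pyramid map P4
        outs = [p]
        for lat, c in zip(reversed(self.laterals), reversed(feats[:-1])):
            up = F.interpolate(p, scale_factor=2, mode="nearest")
            p = lat(c) + up                  # lateral link + pixel-wise addition
            outs.insert(0, p)                # P2, P3, P4: second feature maps
        return outs

c2 = torch.randn(1, 128, 64, 64)
c3 = torch.randn(1, 256, 32, 32)
c4 = torch.randn(1, 512, 16, 16)
p2, p3, p4 = TopDownFPN()([c2, c3, c4])      # all with 256 channels
```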
  • Texture details can express the detailed information of small targets and edge features.
  • Because the first feature maps include more texture detail information, the detection accuracy of small-target detection based on them is higher.
  • the position details can be information that expresses the position of the object in the image and the relative position between the objects.
  • The multiple second feature maps can include more deep features.
  • Deep features contain rich semantic information, which benefits classification tasks.
  • Deep features also have larger receptive fields and therefore work well for detecting large targets. In one implementation, by introducing a top-down path to generate the multiple second feature maps, the rich semantic information contained in deep features naturally propagates downwards, so that the second feature maps at each scale contain rich semantic information.
  • A plurality of third feature maps are generated according to the plurality of first feature maps and the plurality of second feature maps.
  • In one implementation, some of the third feature maps are generated from one or more of the first feature maps and one or more of the second feature maps; in another implementation, some of the third feature maps are generated from one or more first feature maps, one or more second feature maps, and other third feature maps; in yet another implementation, some of the third feature maps are generated from other third feature maps.
  • The multiple third feature maps may be feature maps with multi-scale resolution; that is, the multiple third feature maps do not all have the same resolution.
  • The multiple third feature maps can form a feature pyramid.
  • a first detection result of the object included in the image is output.
  • the object can be a person, animal, plant, object, etc.
  • Object detection can be performed on the image according to at least one third feature map of the plurality of third feature maps, where object detection means identifying the type of object included in the image, the location of the object, and so on.
  • The first feature map generation unit (such as a feature pyramid network) introduces a top-down path to propagate the rich semantic information contained in deep features downwards, so that the second feature maps at each scale contain rich semantic information; deep features also have a large receptive field, giving a better detection effect on large targets.
  • However, the more detailed position detail information and texture detail information contained in the shallower feature maps are ignored, which greatly affects the detection accuracy of medium and small targets.
  • In this embodiment, the second feature map generation unit introduces the shallow texture detail information of the original feature maps (the multiple first feature maps generated by the convolution processing unit) into the deep feature maps (the multiple second feature maps generated by the first feature map generation unit) to generate multiple third feature maps; using third feature maps carrying rich shallow texture detail information as the input of the detection unit for target detection can improve the accuracy of subsequent object detection.
  • This embodiment does not mean that the detection accuracy will be higher for every image that includes small targets; rather, over a large number of samples, this embodiment can achieve higher overall detection accuracy.
  • The above object detection method can be implemented by a data processing system, such as a trained perception network. The perception network can include a convolution processing unit, a first feature map generating unit, a second feature map generating unit, and a detection unit; the convolution processing unit is connected to the first feature map generating unit and the second feature map generating unit, the first feature map generating unit is connected to the second feature map generating unit, and the second feature map generating unit is connected to the detection unit. The convolution processing unit is configured to receive an input image and perform convolution processing on the input image to generate a plurality of first feature maps;
  • The first feature map generating unit is configured to generate a plurality of second feature maps according to the plurality of first feature maps, wherein the plurality of first feature maps include more texture details of the input image and/or location details in the input image than the plurality of second feature maps;
  • The second feature map generating unit is configured to generate a plurality of third feature maps according to the plurality of first feature maps and the plurality of second feature maps.
  • The perception network may include a backbone network, a first feature pyramid network (FPN), a second FPN, and a head; the backbone network is connected to the first FPN and the second FPN, the first FPN is connected to the second FPN, and the second FPN is connected to the head (here, the convolution processing unit is the backbone network, the first feature map generating unit is the first FPN, the second feature map generating unit is the second FPN, and the detection unit is the head).
  • The backbone network can be used to receive input images and perform convolution processing on them to generate multiple first feature maps with multi-scale resolution; the backbone network can perform a series of convolution operations on the input image to obtain feature maps at different scales (with different resolutions).
  • The backbone network can take many forms, such as a visual geometry group (VGG) network, a residual neural network (ResNet), the core structure of GoogLeNet (Inception-Net), and so on.
  • The first FPN may be used to generate a first feature pyramid from the multiple first feature maps, where the first feature pyramid includes multiple second feature maps with multi-scale resolution; a convolution operation is performed on the topmost feature map C4 generated by the backbone network.
  • The number of channels of the topmost feature map C4 can be reduced to 256 using hole convolution and 1×1 convolution, giving the topmost feature map P4 of the feature pyramid; the output of the next feature map C3 is laterally linked, its channel count is reduced to 256 with a 1×1 convolution, and it is added pixel by pixel with the feature map P4 to obtain the feature map P3; and so on, from top to bottom, the first feature pyramid is constructed.
  • the second FPN may be used to generate a second feature pyramid based on the multiple first feature maps and the multiple second feature maps, and the second feature pyramid includes multiple third feature maps with multi-scale resolution.
  • the head is used to detect the target object in the image according to at least one third feature map of the plurality of third feature maps, and output the detection result.
  • The second FPN introduces the rich shallow edge, texture, and other detail information of the original feature maps (the multiple first feature maps generated by the backbone network) into the deep feature maps (the multiple second feature maps generated by the first FPN) to generate a second feature pyramid; using third feature maps carrying rich shallow edge, texture, and other detail information as the input for target detection in the head can improve the accuracy of subsequent object detection.
  • In one implementation, the plurality of first feature maps include a first target feature map, the plurality of second feature maps include a second target feature map, and the plurality of third feature maps include a third target feature map and a fourth target feature map, where the resolution of the third target feature map is smaller than that of the fourth target feature map. Generating the plurality of third feature maps according to the plurality of first feature maps and the plurality of second feature maps includes:
  • down-sampling the third target feature map to obtain a fifth target feature map, the fifth target feature map having the same number of channels and resolution as the second target feature map; performing down-sampling and convolution processing on the first target feature map to obtain a sixth target feature map, the sixth target feature map having the same number of channels and resolution as the second target feature map; and superimposing the fifth target feature map, the second target feature map, and the sixth target feature map on channels to generate the fourth target feature map.
  • The third target feature map can be subjected to down-sampling and convolution processing to obtain the fifth target feature map, where the purpose of the down-sampling is to make the resolution of the feature map of each channel in the fifth target feature map the same as that of the second target feature map, and the purpose of the convolution processing is to make the number of channels of the fifth target feature map the same as that of the second target feature map.
  • The first target feature map is down-sampled to obtain a sixth target feature map, the sixth target feature map having the same resolution as the second target feature map; the fifth target feature map, the second target feature map, and the sixth target feature map are subjected to channel superposition and convolution processing to generate the fourth target feature map, which has the same number of channels as the second target feature map.
  • Channel superimposition can be understood as superimposing (for example, adding) the corresponding elements (elements at the same position on the feature map) in the feature maps of corresponding channels (that is, channels carrying the same semantic information).
  • The second target feature map may be a feature map including multiple channels, and the feature map corresponding to each channel may include one kind of semantic information.
  • The third target feature map can be down-sampled to obtain a fifth target feature map, where the purpose of the down-sampling is to make the resolution of the feature map of each channel in the fifth target feature map the same as that of the second target feature map.
  • Because the fifth target feature map and the second target feature map then have the same resolution, the fifth target feature map and the second target feature map can be added channel by channel.
  • The first target feature map is down-sampled to obtain a sixth target feature map, where the purpose of the down-sampling is to make the resolution of the feature map of each channel in the sixth target feature map the same as that of the second target feature map.
  • The fifth target feature map, the sixth target feature map, and the second target feature map have the same resolution, so they can be added channel by channel and then processed by convolution, so that the obtained fourth target feature map has the same number of channels as the second target feature map; the convolution processing here can be a concatenation operation.
  • In this way, a third target feature map and a fourth target feature map with different resolutions can be generated, the resolution of the third target feature map being smaller than that of the fourth target feature map, and the fourth target feature map with the larger resolution being generated based on a first feature map among the plurality of first feature maps, a second feature map among the plurality of second feature maps, and the third target feature map.
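  • A minimal sketch of one such fusion step follows; since the translation is ambiguous about the resampling direction, this sketch simply assumes both the first and third target feature maps are resampled to the resolution of the second target feature map before 1×1 channel alignment and channel-wise superposition, and the channel widths are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleFusion(nn.Module):
    """Sketch of one fusion step of the second feature-map generating unit;
    the channel widths below are assumptions, not values from the patent."""
    def __init__(self, first_ch=64, third_ch=256, second_ch=256):
        super().__init__()
        self.align_first = nn.Conv2d(first_ch, second_ch, 1)   # match channels
        self.align_third = nn.Conv2d(third_ch, second_ch, 1)

    def forward(self, first_map, second_map, third_map):
        size = second_map.shape[-2:]                 # target resolution
        # fifth/sixth target feature maps: resampled + channel-aligned inputs
        fifth = self.align_third(F.interpolate(third_map, size=size))
        sixth = self.align_first(F.interpolate(first_map, size=size))
        # channel superposition: add corresponding elements channel by channel
        return fifth + second_map + sixth            # fourth target feature map

fuse = CrossScaleFusion()
fourth = fuse(torch.randn(1, 64, 64, 64),            # first target feature map
              torch.randn(1, 256, 32, 32),           # second target feature map
              torch.randn(1, 256, 16, 16))           # third target feature map
print(fourth.shape)                                  # (1, 256, 32, 32)
```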
  • The multiple third feature maps generated by the second feature map generation unit retain the advantages of the feature pyramid network and are built bottom-up (feature maps with larger resolution are generated in turn from the feature map with small resolution), while the rich texture detail information and/or position detail information of the shallower layers of the neural network is introduced into the deep convolutional layers; a detection network using third feature maps generated in this way achieves higher detection accuracy on small targets.
  • the method further includes:
  • the outputting the first detection result of the object included in the image according to at least one third feature map of the plurality of third feature maps includes:
  • a first detection result of the object included in the image is output.
  • The fourth feature map and the fifth feature map obtained by processing the third feature map have different receptive fields; because the feature maps have different receptive fields, they can adapt to targets of different sizes.
  • The third feature map can be processed by convolutional layers with different dilation rates, and the resulting processing results can include information about large targets or information about small targets, so that in the subsequent target detection process both large targets and small targets can be detected.
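  • For illustration, two parallel 3×3 convolutions over the same third feature map with different dilation rates (the rates 1 and 3 are assumptions) yield outputs with small and large receptive fields, playing the roles of the fifth and fourth feature maps above:

```python
import torch
import torch.nn as nn

# Two parallel 3x3 convolutions with different dilation rates over the same
# third feature map; the padding keeps the resolution unchanged in both.
branch_large = nn.Conv2d(256, 256, 3, padding=3, dilation=3)  # large receptive field
branch_small = nn.Conv2d(256, 256, 3, padding=1, dilation=1)  # small receptive field

third_map = torch.randn(1, 256, 32, 32)
fourth_map = branch_large(third_map)   # better suited to large targets
fifth_map = branch_small(third_map)    # better suited to small targets
```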
  • the method further includes:
  • The first weight value may be used to perform a channel-wise multiplication or other numerical processing on the fourth feature map, so that the elements in the fourth feature map undergo a corresponding gain.
  • The fifth feature map is processed according to the second weight value to obtain a processed fifth feature map; it should be noted that the second weight value may likewise be used to perform a channel-wise multiplication or other numerical processing on the fifth feature map, so that the elements in the fifth feature map undergo a corresponding gain.
  • the first weight value is greater than the second weight value
  • the outputting the first detection result of the object included in the image according to at least one third feature map of the plurality of third feature maps includes:
  • a first detection result of the object included in the image is output.
  • The processed fourth feature map then has a greater gain than the processed fifth feature map. Since the receptive field corresponding to the fourth feature map is larger than that corresponding to the fifth feature map, and a larger receptive field carries more information about large targets, target detection based on it is more accurate for large targets. In this embodiment, when there is a larger target in the image, the gain applied to the fourth feature map exceeds that applied to the fifth feature map, so when the detection unit performs target detection on the image based on the processed fourth feature map and the processed fifth feature map, the overall receptive field is larger and the detection accuracy is higher.
  • Through training, the intermediate feature extraction layer can learn a rule for assigning the weight values: for a feature map that includes a large target, the first weight value determined for the first convolutional layer is larger and the second weight value determined for the second convolutional layer is smaller; for a feature map that includes a small target, the first weight value determined for the first convolutional layer is smaller and the second weight value determined for the second convolutional layer is larger.
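  • A hedged sketch of such a weight-predicting intermediate layer: global pooling followed by a softmax over two branch logits stands in for whatever layers the patent actually uses, which the summary does not specify:

```python
import torch
import torch.nn as nn

class BranchGate(nn.Module):
    """Predicts a first and second weight value from the input feature map and
    scales the large- and small-receptive-field branches before fusing them."""
    def __init__(self, ch=256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, 2, 1),          # one logit per branch
            nn.Softmax(dim=1))

    def forward(self, third_map, fourth_map, fifth_map):
        w = self.gate(third_map)          # shape (N, 2, 1, 1)
        w1, w2 = w[:, 0:1], w[:, 1:2]     # first / second weight values
        # after training, a large target should push w1 (large field) above w2
        return w1 * fourth_map + w2 * fifth_map

x = torch.randn(1, 256, 32, 32)
out = BranchGate()(x, x, x)               # stand-in branch outputs
```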
  • the method further includes:
  • the target object in the image is detected according to the at least one third feature map after the hole convolution processing, and the first detection result is output.
  • A 3×3 convolution may function as a sliding window in the candidate region extraction network (RPN): the convolution kernel is moved over at least one third feature map, and through the subsequent intermediate layer, category judgment layer, and frame regression layer, one obtains whether there is a target in each anchor frame and the difference between the predicted frame and the real frame; a better frame extraction result can be obtained by training the candidate region extraction network.
  • The 3×3 sliding-window convolution kernel is replaced with a 3×3 hole (dilated) convolution kernel, at least one third feature map of the plurality of third feature maps is subjected to hole convolution processing, the target object in the image is detected according to the at least one third feature map after the hole convolution processing, and the detection result is output. Without increasing the amount of computation, the receptive field is increased, reducing missed detections of large targets and partially occluded targets.
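  • The replacement amounts to changing one convolution, as the sketch below shows: a dilated 3×3 kernel still has nine weights, so the sliding-window cost is unchanged while the receptive field grows (the channel and anchor counts here are assumptions):

```python
import torch
import torch.nn as nn

num_anchors = 3                            # assumed anchor count per position
# Dilated 3x3 "sliding window": nine weights, same cost, larger receptive field.
rpn_conv = nn.Conv2d(256, 256, 3, padding=2, dilation=2)
cls_head = nn.Conv2d(256, num_anchors, 1)       # is there a target in each anchor?
reg_head = nn.Conv2d(256, num_anchors * 4, 1)   # predicted-vs-real frame offsets

feat = torch.relu(rpn_conv(torch.randn(1, 256, 32, 32)))
objectness, deltas = cls_head(feat), reg_head(feat)
```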
  • the first detection result includes a first detection frame
  • the method further includes:
  • The second detection result is obtained by performing object detection on the first image through a second perception network, the object detection accuracy of the first perception network being higher than that of the second perception network; the second detection result includes a second detection frame, and there is an intersection between the area where the second detection frame is located and the area where the first detection frame is located;
  • the second detection result is updated so that the updated second detection result includes the first detection frame.
  • If the ratio of the area of the first intersection to the area of the first detection frame is less than a preset value, it can be considered that the first detection frame is missing from the second detection result, and the second detection result is updated so that the updated second detection result includes the first detection frame.
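  • The update rule can be sketched with plain box arithmetic; the 0.5 threshold stands in for the unspecified preset value, and testing against the largest overlap is a simplification of the per-frame comparison described here:

```python
def intersection_area(a, b):
    """Overlap area of two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def patch_missed_boxes(first_boxes, second_boxes, thresh=0.5):
    """If no frame in the weaker network's result covers enough of a frame
    found by the stronger network, treat it as missed and add it."""
    updated = list(second_boxes)
    for fb in first_boxes:
        fb_area = (fb[2] - fb[0]) * (fb[3] - fb[1])
        best = max((intersection_area(fb, sb) for sb in second_boxes), default=0.0)
        if best / fb_area < thresh:     # first detection frame was omitted
            updated.append(fb)
    return updated

print(patch_missed_boxes([(0, 0, 10, 10)], [(40, 40, 60, 60)]))
```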
  • In one implementation, temporal (timing) characteristics are introduced into the model to assist in determining whether a suspected missed detection is a true missed detection and to judge the category of the missed detection, which improves verification efficiency.
  • The second detection result includes multiple detection frames, and there is an intersection between the area where each of the multiple detection frames is located and the area where the first detection frame is located.
  • The multiple detection frames include the second detection frame, and among the intersections between the areas of the multiple detection frames and the area of the first detection frame, the intersection between the area of the second detection frame and the area of the first detection frame has the smallest area.
  • the method further includes:
  • the third detection result includes a fourth detection frame and an object category corresponding to the fourth detection frame
  • the first detection frame corresponds to the object category corresponding to the fourth detection frame.
  • the detection confidence of the fourth detection frame is greater than a preset threshold.
  • The present application provides a data processing system. The data processing system includes a convolution processing unit, a first feature map generating unit, a second feature map generating unit, and a detection unit; the convolution processing unit is connected to the first feature map generating unit and the second feature map generating unit, the first feature map generating unit is connected to the second feature map generating unit, and the second feature map generating unit is connected to the detection unit;
  • the convolution processing unit is configured to receive an input image, and perform convolution processing on the input image to generate a plurality of first feature maps
  • The first feature map generating unit is configured to generate a plurality of second feature maps according to the plurality of first feature maps, wherein the plurality of first feature maps include more texture details of the input image and/or location details in the input image than the plurality of second feature maps;
  • The second feature map generating unit is configured to generate a plurality of third feature maps according to the plurality of first feature maps and the plurality of second feature maps;
  • the detection unit is configured to output a detection result of an object included in the image according to at least one third feature map of the plurality of third feature maps.
  • The data processing system may be a perception network system for realizing the functions of a perception network.
  • The first feature map generation unit (such as a feature pyramid network) introduces a top-down path to propagate the rich semantic information contained in deep features downwards, so that the second feature maps at each scale contain rich semantic information; deep features also have a large receptive field, giving a better detection effect on large targets.
  • However, the more detailed position detail information and texture detail information contained in the shallower feature maps are ignored, which greatly affects the detection accuracy of medium and small targets.
  • In this embodiment, the second feature map generation unit introduces the shallow texture detail information of the original feature maps (the multiple first feature maps generated by the convolution processing unit) into the deep feature maps (the multiple second feature maps generated by the first feature map generation unit) to generate multiple third feature maps; using third feature maps carrying rich shallow texture detail information as the input of the detection unit for target detection can improve the accuracy of subsequent object detection.
  • This embodiment does not mean that the detection accuracy will be higher for every image that includes small targets; rather, over a large number of samples, this embodiment can achieve higher overall detection accuracy.
  • the third aspect is a device corresponding to the first aspect, please refer to the description of the first aspect for its various implementation modes, explanations and corresponding technical effects, which will not be repeated here.
  • this application provides a perceptual network training method, the method includes:
  • the detection result of the image may be obtained, the detection result is obtained by object detection of the image through a first perception network, and the detection result includes the target detection frame corresponding to the first object ;
  • The first perception network may be iteratively trained according to a loss function to update the parameters included in the first perception network and obtain a second perception network, wherein the loss function is related to the intersection over union (IoU) between the pre-labeled detection frame and the target detection frame.
  • The second perception network may be output.
  • The newly designed frame regression loss function uses an IoU loss term (scale-invariant and aligned with the evaluation metric of target detection), a loss term considering the aspect ratios of the predicted frame and the real frame, and a loss term based on the ratio between the distance of the center coordinates of the predicted frame and the real frame and the distance between the lower-right corner coordinate of the predicted frame and the upper-left corner coordinate of the real frame.
  • The IoU loss term naturally introduces a scale-invariant evaluation index of frame prediction quality.
  • The aspect-ratio loss term measures the shape difference between the two frames.
  • The preset loss function is also related to the shape difference between the target detection frame and the pre-labeled detection frame, wherein the shape difference is negatively correlated with the area of the pre-labeled detection frame.
  • The preset loss function is also related to the position difference between the target detection frame and the pre-labeled detection frame in the image, wherein the position difference is negatively correlated with the area of the pre-labeled detection frame, or the position difference is negatively correlated with the area of the smallest circumscribed rectangle of the convex hull of the pre-labeled detection frame and the target detection frame.
  • The target detection frame includes a first corner point and a first center point, and the pre-labeled detection frame includes a second corner point and a second center point; the first corner point and the first center point are the two end points of a diagonal of the rectangle.
  • The position difference is positively correlated with the distance between the first center point and the second center point in the image, and negatively correlated with the distance between the first corner point and the second corner point.
  • The preset loss function includes a target loss term related to the position difference, and the target loss term changes as the position difference changes: when the position difference is greater than a preset value, the rate of change of the target loss term is greater than a first preset rate of change; and/or, when the position difference is less than the preset value, the rate of change of the target loss term is less than a second preset rate of change.
  • Because the loss term changes quickly for large position differences and slowly for small ones, the effect of rapid convergence can be achieved during training.
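  • The following is a sketch of a frame-regression loss of the kind described, combining an IoU term, an aspect-ratio term, and a center-distance-over-corner-distance term; the atan form of the aspect-ratio term is borrowed from the CIoU loss, and the equal weighting of the three terms is an assumption, since the summary does not give the exact formula:

```python
import math
import torch

def box_regression_loss(pred, gt, eps=1e-7):
    """Sketch of the described loss; boxes are (x1, y1, x2, y2) tensors."""
    # IoU term (scale-invariant, matches the detection evaluation metric)
    iw = (torch.min(pred[2], gt[2]) - torch.max(pred[0], gt[0])).clamp(min=0)
    ih = (torch.min(pred[3], gt[3]) - torch.max(pred[1], gt[1])).clamp(min=0)
    inter = iw * ih
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter + eps)

    # shape term: aspect-ratio difference (atan form borrowed from CIoU)
    ar_p = (pred[2] - pred[0]) / (pred[3] - pred[1] + eps)
    ar_g = (gt[2] - gt[0]) / (gt[3] - gt[1] + eps)
    shape = (4 / math.pi ** 2) * (torch.atan(ar_g) - torch.atan(ar_p)) ** 2

    # position term: center-to-center distance over a corner-to-corner distance
    cp = (pred[0:2] + pred[2:4]) / 2
    cg = (gt[0:2] + gt[2:4]) / 2
    center_d2 = ((cp - cg) ** 2).sum()
    corner_d2 = ((pred[2:4] - gt[0:2]) ** 2).sum()  # pred lower-right vs gt upper-left
    position = center_d2 / (corner_d2 + eps)

    return 1 - iou + shape + position

loss = box_regression_loss(torch.tensor([10., 10., 50., 60.]),
                           torch.tensor([12., 11., 48., 58.]))
```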
  • the present application provides a perceptual network training device, the device includes:
  • An acquiring module for acquiring a pre-labeled detection frame of a target object in an image; and acquiring a target detection frame corresponding to the image and the first perception network, the target detection frame being used to identify the target object;
  • The iterative training module is used to iteratively train the first perception network according to the loss function to output the second perception network, wherein the loss function is related to the intersection over union (IoU) between the pre-labeled detection frame and the target detection frame.
  • The preset loss function is also related to the shape difference between the target detection frame and the pre-labeled detection frame, wherein the shape difference is negatively correlated with the area of the pre-labeled detection frame.
  • The preset loss function is also related to the position difference between the target detection frame and the pre-labeled detection frame in the image, wherein the position difference is negatively correlated with the area of the pre-labeled detection frame, or the position difference is negatively correlated with the area of the smallest circumscribed rectangle of the convex hull of the pre-labeled detection frame and the target detection frame.
  • The target detection frame includes a first corner point and a first center point, and the pre-labeled detection frame includes a second corner point and a second center point; the first corner point and the first center point are the two end points of a diagonal of the rectangle.
  • The position difference is positively correlated with the distance between the first center point and the second center point in the image, and negatively correlated with the distance between the first corner point and the second corner point.
  • The preset loss function includes a target loss term related to the position difference, and the target loss term changes as the position difference changes: when the position difference is greater than the preset value, the rate of change of the target loss term is greater than the first preset rate of change; and/or, when the position difference is less than the preset value, the rate of change of the target loss term is less than the second preset rate of change.
  • The fourth aspect is a device corresponding to the second aspect; for its various implementations, explanations, and corresponding technical effects, refer to the description of the second aspect, which will not be repeated here.
  • An embodiment of the present application provides an object detection device, which may include a memory, a processor, and a bus system, where the memory is used to store a program and the processor is used to execute the program in the memory to perform the method of the above second aspect and any optional method of the second aspect.
  • an embodiment of the present application provides an object detection device, which may include a memory, a processor, and a bus system.
  • the memory is used to store a program
  • The processor is used to execute the program in the memory to perform the method of the above third aspect and any optional method of the third aspect.
  • An embodiment of the present invention also provides a perception network application system.
  • The perception network application system includes at least one processor, at least one memory, at least one communication interface, and at least one display device.
  • The processor, the memory, the display device, and the communication interface are connected through a communication bus and communicate with each other.
  • The communication interface is used to communicate with other devices or communication networks.
  • the memory is used to store application program codes for executing the above solutions, and the processor controls the execution.
  • the processor is configured to execute the application program code stored in the memory;
  • the code stored in the memory can execute one of the object detection methods provided above or the method of training the perceptual network provided in the above embodiments;
  • the display device is used to display the image to be recognized, 2D, 3D, Mask, key points and other information of the object of interest in the image to be recognized.
  • An embodiment of the present application provides a computer-readable storage medium having a computer program stored therein; when the program runs on a computer, the computer executes the method of the above second aspect and any optional method of the second aspect.
  • An embodiment of the present application provides a computer-readable storage medium in which a computer program is stored.
  • When the computer program is run on a computer, the computer executes the method of the above third aspect and any optional method of the third aspect.
  • an embodiment of the present application provides a computer program, which when running on a computer, causes the computer to execute the first aspect and any optional method thereof.
  • an embodiment of the present application provides a computer program that, when run on a computer, causes the computer to execute the third aspect and any optional method thereof.
  • This application provides a chip system that includes a processor for supporting an execution device or a training device in implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods.
  • the chip system further includes a memory for storing program instructions and data necessary for the execution device or the training device.
  • the chip system can be composed of chips, and can also include chips and other discrete devices.
  • In the embodiments of the present application, the second feature map generation unit introduces the shallow texture detail information of the original feature maps (the multiple first feature maps generated by the convolution processing unit) into the deep feature maps (the multiple second feature maps generated by the first feature map generating unit) to generate multiple third feature maps; using third feature maps carrying rich shallow texture detail information as the input data for the detection unit to perform target detection can improve the detection accuracy of subsequent object detection.
  • Figure 1 is a schematic diagram of a structure of the main framework of artificial intelligence
  • Figure 2 is an application scenario of an embodiment of the application
  • Figure 3 is an application scenario of an embodiment of the application
  • Figure 4 is an application scenario of an embodiment of the application
  • FIG. 5 is a schematic diagram of a system architecture provided by an embodiment of the application.
  • FIG. 6 is a schematic diagram of the structure of a convolutional neural network used in an embodiment of the application.
  • FIG. 7 is a schematic diagram of the structure of a convolutional neural network used in an embodiment of the application.
  • FIG. 8 is a hardware structure of a chip provided by an embodiment of the application.
  • FIG. 9 is a schematic structural diagram of a sensing network provided by an embodiment of this application.
  • Figure 10 is a schematic diagram of the structure of a backbone network
  • FIG. 11 is a schematic diagram of the structure of a first FPN
  • Figure 12a is a schematic diagram of the structure of a second FPN
  • Figure 12b is a schematic diagram of the structure of a second FPN
  • Figure 12c is a schematic diagram of the structure of a second FPN
  • Figure 12d is a schematic diagram of the structure of a second FPN
  • Figure 12e is a schematic diagram of the structure of a second FPN
  • Figure 13a is a schematic diagram of the structure of a head
  • Figure 13b is a schematic diagram of the structure of a head
  • FIG. 14a is a schematic diagram of the structure of a sensing network provided by an embodiment of this application.
  • FIG. 14b is a schematic diagram of a hole convolution kernel provided by an embodiment of this application.
  • Figure 14c is a schematic diagram of a processing flow of intermediate feature extraction
  • FIG. 15 is a schematic flowchart of an object detection method provided by an embodiment of this application.
  • FIG. 16 is a schematic flow chart of a perceptual network training method provided by an embodiment of this application.
  • FIG. 17 is a schematic flowchart of an object detection method provided by an embodiment of this application.
  • FIG. 18 is a schematic diagram of a perceptual network training device provided by an embodiment of this application.
  • FIG. 19 is a schematic diagram of an object detection device provided by an embodiment of the application.
  • FIG. 20 is a schematic structural diagram of an execution device provided by an embodiment of this application.
  • FIG. 21 is a schematic structural diagram of a training device provided by an embodiment of the present application.
  • FIG. 22 is a schematic diagram of a structure of a chip provided by an embodiment of the application.
  • Figure 1 shows a schematic diagram of the main framework of artificial intelligence.
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has gone through the condensing process of "data-information-knowledge-wisdom".
  • the "IT value chain” from the underlying infrastructure of human intelligence, information (providing and processing technology realization) to the industrial ecological process of the system, reflects the value that artificial intelligence brings to the information technology industry.
  • The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through the basic platform.
  • Smart chips: hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs.
  • Basic platforms: distributed computing frameworks, networks, and related platform guarantees and support, which can include cloud storage and computing, interconnection networks, and so on.
  • Sensors communicate with the outside to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
  • the data in the upper layer of the infrastructure is used to represent the data source in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as the Internet of Things data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, training, etc.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formal information to conduct machine thinking and solving problems based on reasoning control strategies.
  • the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, and usually provides functions such as classification, ranking, and prediction.
  • Some general capabilities can be formed based on the results of the data processing, such as an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. It is an encapsulation of the overall solution of artificial intelligence, productizing intelligent information decision-making and realizing landing applications. Its application fields mainly include: intelligent terminals, intelligent transportation, Smart medical care, autonomous driving, safe city, etc.
  • the embodiments of the present application are mainly applied in fields such as driving assistance, automatic driving, and mobile phone terminals that need to complete various perception tasks.
  • the application system framework of the present invention is shown in Figures 2 and 3.
  • Figure 2 shows the application scenario of the embodiment of the present application.
  • The embodiment of the present application resides in the automatic data labeling module of the data processing platform.
  • The dotted line in the figure outlines the position of the present invention.
  • The system is an intelligent data platform for human-machine collaboration, built to deliver artificial intelligence capabilities with higher efficiency, faster training, and stronger models.
  • the automatic data labeling module is an intelligent labeling system framework that solves the problem of high manual labeling costs and few manual labeling sets.
  • The product implementation form of this embodiment is program code included in the intelligent data storage system and deployed on server hardware.
  • The network elements whose functions are enhanced or modified by this solution are soft modifications and belong to a relatively independent module.
  • The program code of this embodiment exists in the runtime training module of the intelligent data system components.
  • The program code of this embodiment is stored on the host of the server and runs with acceleration hardware (GPU/FPGA/dedicated chips).
  • A possible future change is that, before data is read into the module, the data may be read from an ftp server, a file, a database, or memory; in that case, only the data-source interface of the functional module involved in this solution needs to be updated.
  • Figure 3 shows the implementation form of the present invention in server and platform software, where the label generating device and the automatic calibration device are modules newly added by the present invention on the basis of the existing platform software.
  • Application scenario 1: ADAS/ADS visual perception system
  • Application scenario 2: mobile phone beauty function
  • the mask and key points of the human body are detected through the perception network provided by the embodiments of the present application, and the corresponding parts of the human body can be zoomed in and out, such as waist reduction and hip beautification operations, so as to output beautiful images.
  • Application scenario 3 Image classification scenario:
  • the object recognition device After obtaining the image to be classified, the object recognition device adopts the object recognition method of the present application to obtain the category of the object in the image to be classified, and then can classify the image to be classified according to the category of the object in the image to be classified.
  • For photographers, many photos are taken every day: animals, people, and plants. Using the method of the present application, photos can be quickly classified according to their content, for example into photos containing animals, photos containing people, and photos containing plants.
  • the object recognition device After the object recognition device obtains the image of the product, it then uses the object recognition method of the present application to obtain the category of the product in the image of the product, and then classifies the product according to the category of the product. For a wide variety of commodities in large shopping malls or supermarkets, the object recognition method of the present application can quickly complete the classification of commodities, reducing time and labor costs.
  • The method for training a perception network provided in the embodiments of this application involves computer vision processing and can be specifically applied to data processing methods such as data training, machine learning, and deep learning: training data (such as images of objects and the categories of the objects in this application) is used for symbolic and formalized intelligent information modeling, extraction, preprocessing, and training to finally obtain a trained perception network. In addition, in the embodiments of this application, input data (such as the image of the object in this application) is input into the trained perception network to obtain output data (for example, the 2D, 3D, Mask, key point, and other information of the object of interest in the image).
  • Object detection: using image processing, machine learning, computer graphics, and other related methods, object detection can determine the category of objects in an image and determine the detection frames used to locate them.
  • A convolutional neural network (CNN) is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor composed of a convolutional layer and a sub-sampling layer.
  • the feature extractor can be regarded as a filter.
  • the perceptual network in this embodiment may include a convolutional neural network, which is used to perform convolution processing on an image or perform convolution processing on a feature map to generate a feature map.
  • Convolutional neural networks can use the backpropagation (BP) algorithm to correct the parameter values in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, forward-passing the input signal to the output produces an error loss, and the parameters in the initial super-resolution model are updated by backpropagating the error loss information, so that the error loss converges.
  • The backpropagation algorithm is a backpropagation movement dominated by the error loss, aiming to obtain the optimal parameters of the super-resolution model, such as a weight matrix.
  • the perception network may be updated based on the back propagation algorithm.
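• As an illustration of the update process described above, the following is a minimal sketch of a single backpropagation update step. PyTorch is assumed here purely for illustration; the tiny model, loss function, and dummy data are hypothetical and not taken from the embodiments.

```python
import torch
import torch.nn as nn

# A toy network standing in for the perception network (shapes are arbitrary)
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(2, 3, 64, 64)      # dummy input batch
targets = torch.tensor([0, 3])          # dummy class labels

optimizer.zero_grad()
loss = loss_fn(model(images), targets)  # forward pass produces the error loss
loss.backward()                         # backpropagate the error loss information
optimizer.step()                        # update parameters so the loss converges
```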
• The execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices; a user may input data to the I/O interface 112 through the client device 140, and the input data may include the image to be recognized, or the image block or image, in the embodiments of the present application.
• The execution device 110 may call data, codes, etc. from the data storage system 150 for corresponding processing, and the data, instructions, etc. obtained by the corresponding processing may also be stored in the data storage system 150.
• The I/O interface 112 returns the processing result, such as the image or image block obtained above, or at least one of the 2D, 3D, Mask, and key point information of the object of interest in the image, to the client device 140, so as to provide it to the user.
  • the client device 140 may be a control unit in an automatic driving system or a functional algorithm module in a mobile phone terminal.
  • the functional algorithm module may be used to implement tasks related to perception.
• The training device 120 can generate corresponding target models/rules based on different training data for different goals or different tasks, and the corresponding target models/rules can be used to achieve the above goals or complete the above tasks, so as to provide users with the desired results.
  • the target model/rule may be the perceptual network described in the subsequent embodiment, and the result provided to the user may be the object detection result in the subsequent embodiment.
  • the user can manually set input data, and the manual setting can be operated through the interface provided by the I/O interface 112.
• The client device 140 can automatically send input data to the I/O interface 112. If automatic sending by the client device 140 requires the user's authorization, the user can set the corresponding permission in the client device 140.
• The user can view the result output by the execution device 110 on the client device 140, and the specific presentation form may be display, sound, action, or another specific manner.
• The client device 140 can also be used as a data collection terminal to collect the input data of the I/O interface 112 and the output results of the I/O interface 112 as new sample data, and store them in the database 130, as shown in the figure.
• Of course, the client device 140 may also not perform the collection; instead, the I/O interface 112 directly stores the input data of the I/O interface 112 and the output results of the I/O interface 112 in the database 130 as new samples, as shown in the figure.
• It is worth noting that FIG. 5 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationships between the devices, components, modules, etc. shown in the figure do not constitute any limitation.
• For example, in FIG. 5 the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed inside the execution device 110.
  • the perceptual network can be obtained by training according to the training device 120.
  • the perceptual network may include a deep neural network with a convolutional structure.
  • the structure of the convolutional neural network used in the embodiment of the present application may be as shown in FIG. 6.
  • a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230.
  • the input layer 210 can obtain the image to be processed, and pass the obtained image to be processed to the convolutional layer/pooling layer 220 and the subsequent neural network layer 230 for processing, and the processing result of the image can be obtained.
• The convolutional layer/pooling layer 220 may include layers 221-226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 221 can include many convolution operators.
  • the convolution operator is also called a kernel. Its function in image processing is equivalent to a filter that extracts specific information from the input image matrix.
• The convolution operator can essentially be a weight matrix, and this weight matrix is usually predefined. In the process of convolving the image, the weight matrix slides along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride), so as to extract specific features from the image.
• The weight values in these weight matrices need to be obtained through extensive training in practical applications. Each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions.
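• The sliding weight matrix and stride described above can be illustrated with a short sketch; PyTorch is assumed purely for illustration, and the channel counts and image size below are arbitrary examples.

```python
import torch
import torch.nn as nn

# A convolution operator as a learned weight matrix sliding over the input;
# stride=2 means the window moves two pixels at a time along each direction.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3,
                 stride=2, padding=1)
x = torch.randn(1, 3, 224, 224)
y = conv(x)
print(y.shape)  # torch.Size([1, 16, 112, 112]) - spatial size halved by stride
```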
• When the convolutional neural network 200 has multiple convolutional layers, the initial convolutional layers (such as 221) often extract more general features, which can also be called low-level features; as the depth of the convolutional neural network 200 increases, the features extracted by later convolutional layers (for example, 226) become more and more complex, such as high-level semantic features, and features with higher-level semantics are more suitable for the problem to be solved.
• Pooling layer: since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer.
• For the layers 221-226 illustrated by 220 in Figure 6, this can be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
• After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is still not able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image.
• The convolutional neural network 200 shown in FIG. 6 is only used as an example of a convolutional neural network; in specific applications, the convolutional neural network may also exist in the form of other network models.
  • the perception network of the embodiment of the present application may include a deep neural network with a convolutional structure, where the structure of the convolutional neural network may be as shown in FIG. 7.
  • a convolutional neural network (CNN) 200 may include an input layer 110, a convolutional layer/pooling layer 120 (where the pooling layer is optional), and a neural network layer 130.
• Unlike FIG. 6, the multiple convolutional layers/pooling layers in the convolutional layer/pooling layer 120 in FIG. 7 are parallel, and the separately extracted features are all input to the neural network layer 130 for processing.
  • FIG. 8 is a schematic structural diagram of a data processing system provided by an embodiment of the application. As shown in Figure 8, the data processing system may include:
• a convolution processing unit 801, a first feature map generating unit 802, a second feature map generating unit 803, and a detection unit 804, where the convolution processing unit 801 is connected to the first feature map generating unit 802 and the second feature map generating unit 803, the first feature map generating unit 802 is connected to the second feature map generating unit 803, and the second feature map generating unit 803 is connected to the detection unit 804.
• The data processing system can implement the function of a perceptual network, where the convolution processing unit 801 is the backbone network, the first feature map generating unit 802 and the second feature map generating unit 803 are feature pyramid networks, and the detection unit 804 is a header (head).
  • FIG. 9 is a schematic structural diagram of a sensing network provided by an embodiment of this application. As shown in Figure 9, the sensing network includes:
• the backbone network 901, the first feature pyramid network (FPN) 902, the second FPN 903, and the head-end header 904; the backbone network 901 is connected to the first FPN 902 and the second FPN 903, the first FPN 902 is connected to the second FPN 903, and the second FPN 903 is connected to the header 904;
  • the architecture of the sensing network may be the architecture shown in FIG. 9, which mainly consists of a backbone network 901, a first FPN 902, a second FPN 903, and a head-end header 904.
• The convolution processing unit 801 is the backbone network, and the convolution processing unit 801 is configured to receive an input image and perform convolution processing on the input image to generate multiple first feature maps.
• It should be noted that performing convolution processing on the input image should not be understood as performing only convolution processing on the input image; the input image may undergo convolution processing as well as other processing.
• Likewise, "performing convolution processing on the input image to generate multiple first feature maps" should not be understood as meaning that each first feature map is obtained by convolving the input image directly; rather, the input image is, on the whole, the source of the multiple first feature maps. In one implementation, the input image can be convolved to obtain one first feature map, the generated first feature map can then be convolved to obtain another first feature map, and so on, so that multiple first feature maps are obtained.
• In one implementation, a series of convolution processing may be performed on the input image; specifically, in each convolution processing, the first feature map obtained by the previous convolution processing is convolved to obtain a further first feature map, and multiple first feature maps can be obtained in this way.
  • the multiple first feature maps may be feature maps with multi-scale resolution, that is, the multiple first feature maps are not feature maps with the same resolution.
• The multiple first feature maps can form a feature pyramid.
• The convolution processing unit may be used to receive an input image and perform convolution processing on the input image to generate multiple first feature maps with multi-scale resolution; specifically, the convolution processing unit may perform a series of convolution processing on the input image to obtain feature maps at different scales (with different resolutions).
  • the convolution processing unit can take many forms, such as visual geometry group (VGG), residual neural network (residual neural network, resnet), GoogLeNet core structure (Inception-net), and so on.
• The convolution processing unit may be a backbone network, i.e., the backbone network 901, which is used to receive an input image and perform convolution processing on the input image to generate multiple first feature maps with multi-scale resolution.
  • FIG. 10 is a schematic diagram of the structure of a backbone network provided by an embodiment of the application.
• The backbone network is used to receive the input image, perform convolution processing on the input image, and output feature maps with different resolutions corresponding to the image (feature map C1, feature map C2, feature map C3, feature map C4); that is to say, feature maps of different sizes corresponding to the image are output, and the backbone network completes the extraction of basic features.
  • the backbone network can perform a series of convolution processing on the input image to obtain feature maps at different scales (with different resolutions). These feature maps will provide basic features for subsequent detection modules.
  • the backbone network can take many forms, such as visual geometry group (VGG), residual neural network (residual neural network, resnet), the core structure of GoogLeNet (Inception-net), and so on.
  • the backbone network can perform convolution processing on the input image to generate several convolution feature maps of different scales.
• Each feature map is a matrix of H*W*C, where H is the height of the feature map, W is the width of the feature map, and C is the number of channels of the feature map.
  • the backbone can use a variety of existing convolutional network frameworks, such as VGG16, Resnet50, Inception-Net, etc.
• The following takes Resnet18 as the backbone as an example.
• Assume the resolution of the input image is H*W*3 (height H, width W, and 3 channels, that is, the three RGB channels).
  • the input image can be convolved through a convolutional layer Res18-Conv1 of Resnet18 to generate Featuremap (feature map) C1.
• This feature map is down-sampled twice relative to the input image and the number of channels is expanded to 64, so the resolution of C1 is H/4*W/4*64.
• C1 can then be convolved through Resnet18's Res18-Conv2 to obtain Featuremap C2; the resolution of this feature map is the same as C1. C2 continues through Res18-Conv3 to generate Featuremap C3; this feature map is further down-sampled relative to C2 and the number of channels is doubled, giving a resolution of H/8*W/8*128. Finally, C3 undergoes the Res18-Conv4 convolution operation to generate Featuremap C4, with a resolution of H/16*W/16*256.
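• A minimal sketch of a backbone emitting the multi-scale feature maps C1-C4 follows, assuming PyTorch. The plain stride-2 stages below only stand in for the residual stages Res18-Conv1..Conv4 and mirror the H/4, H/4, H/8, H/16 resolutions and 64/64/128/256 channel counts given above; a real Resnet18 backbone would use residual blocks.

```python
import torch
import torch.nn as nn

def stage(cin, cout, stride):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = stage(3, 64, 4)      # C1: H/4 x W/4 x 64
        self.conv2 = stage(64, 64, 1)     # C2: same resolution as C1
        self.conv3 = stage(64, 128, 2)    # C3: H/8 x W/8 x 128
        self.conv4 = stage(128, 256, 2)   # C4: H/16 x W/16 x 256

    def forward(self, x):
        c1 = self.conv1(x)
        c2 = self.conv2(c1)
        c3 = self.conv3(c2)
        c4 = self.conv4(c3)
        return c1, c2, c3, c4

feats = Backbone()(torch.randn(1, 3, 512, 512))
print([f.shape for f in feats])  # 128x128, 128x128, 64x64, 32x32 spatial sizes
```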
• The backbone network in the embodiments of the present application may also be referred to simply as the backbone, which is not limited here.
• The first feature map generating unit 802 is configured to generate multiple second feature maps according to the multiple first feature maps, where the multiple first feature maps include more texture detail information and/or location detail information than the multiple second feature maps.
  • the "generating multiple second feature maps based on the multiple first feature maps” here should not be understood to mean that the source of each second feature map generated in the multiple second feature maps is Multiple first feature maps; in one implementation, a part of the second feature maps in the multiple second feature maps are directly generated based on one or more first feature maps in the multiple first feature maps; one In this implementation, a part of the second feature maps in the multiple second feature maps are directly generated based on one or more first feature maps in the multiple first feature maps, and other second feature maps other than itself ; In one implementation, a part of the second feature maps in the multiple second feature maps are directly generated based on other second feature maps other than itself.
  • the multiple second feature maps may be feature maps with multi-scale resolution, that is, the multiple second feature maps are not feature maps with the same resolution.
• The multiple second feature maps can form a feature pyramid.
  • the convolution operation can be performed on the topmost feature map C4 in the multiple first feature maps generated by the convolution processing unit.
• Specifically, hole (dilated) convolution and 1×1 convolution can be used to reduce the number of channels of the topmost feature map C4 to 256, yielding the top feature map P4 of the feature pyramid; the next feature map C3 is horizontally linked, its number of channels is reduced to 256 using 1×1 convolution, and the result is added pixel by pixel with the output of the level above to obtain the feature map P3. Proceeding in this way from top to bottom, a first feature pyramid is constructed, and the first feature pyramid may include multiple second feature maps.
• The texture detail information here can be shallow detail information used to express small targets and edge features; because the first feature maps include more texture detail information, detection results for small-target detection based on them have higher accuracy.
• The position detail information can be information expressing the position of an object in the image and the relative positions between objects.
• The multiple second feature maps can include more deep features. Deep features contain rich semantic information, which benefits classification tasks; at the same time, deep features have a larger receptive field, which gives a good detection effect on large targets. In one implementation, by introducing a top-down path to generate the multiple second feature maps, the rich semantic information contained in deep features can naturally be propagated downward, so that the second feature maps of the various scales all contain rich semantic information.
  • the first feature map generating unit 802 may be the first FPN 902.
  • the first FPN 902 is configured to generate a first feature pyramid according to the multiple first feature maps, and the first feature pyramid includes multiple second feature maps with multi-scale resolution.
  • the first FPN is connected to the backbone network, and the first FPN may perform convolution processing and merging processing on multiple feature maps of different resolutions generated by the backbone network to construct the first feature pyramid.
  • FIG. 11 is a schematic diagram of the structure of a first FPN.
• As shown in FIG. 11, the first FPN 902 can generate a first feature pyramid according to the multiple first feature maps, and the first feature pyramid includes multiple second feature maps with multi-scale resolution (feature map P2, feature map P3, feature map P4, feature map P5).
• First, the convolution operation is performed on the topmost feature map C4 generated by the backbone network 901: hole convolution and 1×1 convolution can be used to reduce the number of channels of the topmost feature map C4 to 256, and the result serves as the topmost feature map P4 of the feature pyramid; the next-level feature map C3 is horizontally linked, its number of channels is reduced to 256 using 1×1 convolution, and it is then added with the feature map P4 channel by channel, pixel by pixel, to obtain the feature map P3;
  • the first feature pyramid may further include a feature map P5, which can be generated by directly performing a convolution operation on the feature map P4.
• The feature maps of the first feature pyramid can, through the top-down structure, take the rich semantic information contained in the deep features and introduce it layer by layer into each feature level, so that feature maps of different scales contain richer semantic information; this provides better semantic information for small targets and improves the classification performance for small targets.
  • the first FPN shown in FIG. 11 is only an implementation manner, and does not constitute a limitation of the present application.
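• The top-down construction of the first feature pyramid can be sketched as follows, assuming PyTorch. The class name, the nearest-neighbor up-sampling, and the stride-2 convolution for P5 are choices made for the example, and the hole convolution on C4 mentioned above is simplified here to a 1x1 convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    def __init__(self, channels=(64, 128, 256)):    # channels of C2, C3, C4
        super().__init__()
        self.lat2 = nn.Conv2d(channels[0], 256, 1)  # 1x1 lateral reductions
        self.lat3 = nn.Conv2d(channels[1], 256, 1)
        self.lat4 = nn.Conv2d(channels[2], 256, 1)
        self.p5_conv = nn.Conv2d(256, 256, 3, stride=2, padding=1)

    def forward(self, c2, c3, c4):
        p4 = self.lat4(c4)                                   # top level, 256 channels
        p3 = self.lat3(c3) + F.interpolate(p4, size=c3.shape[-2:])
        p2 = self.lat2(c2) + F.interpolate(p3, size=c2.shape[-2:])
        p5 = self.p5_conv(p4)                                # extra level from P4
        return p2, p3, p4, p5

fpn = TopDownFPN()
c2, c3, c4 = (torch.randn(1, 64, 128, 128),
              torch.randn(1, 128, 64, 64),
              torch.randn(1, 256, 32, 32))
print([p.shape for p in fpn(c2, c3, c4)])
```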
  • the second feature map generating unit 803 is configured to generate multiple third feature maps according to the multiple first feature maps and the multiple second feature maps.
• In one implementation, some of the third feature maps are generated based on one or more of the first feature maps and one or more of the second feature maps; in another implementation, some of the third feature maps are generated based on one or more of the first feature maps, one or more of the second feature maps, and third feature maps other than themselves; in yet another implementation, some of the third feature maps are generated based on third feature maps other than themselves.
  • the multiple third feature maps may be feature maps with multi-scale resolution, that is, the multiple third feature maps are not feature maps with the same resolution.
• The multiple third feature maps can form a feature pyramid.
  • the second feature map generating unit 803 may be a second FPN 903.
  • the second FPN 903 is used to generate a second feature pyramid based on the multiple first feature maps and the multiple second feature maps, and the second feature pyramid includes multi-scale resolution Multiple third characteristic maps of the rate.
  • Fig. 12a is a schematic diagram of the structure of a second FPN.
• As shown in FIG. 12a, the second FPN 903 can generate the second feature pyramid according to the multiple first feature maps generated by the backbone network 901 and the multiple second feature maps generated by the first FPN 902, where the second feature pyramid may include multiple third feature maps (for example, the feature map Q1, the feature map Q2, the feature map Q3, and the feature map Q4 shown in FIG. 12a).
• The second feature pyramid includes multiple third feature maps with multi-scale resolution, and the bottom-most feature map (that is, the feature map with the largest resolution) among the multiple third feature maps can be generated based on a first feature map generated by the backbone network and a second feature map generated by the first FPN.
• In one implementation, the multiple first feature maps include a first target feature map, the multiple second feature maps include a second target feature map, and the multiple third feature maps include a third target feature map; the third target feature map is the bottom-most (largest-resolution) feature map among the multiple third feature maps, and the second FPN is used to generate the third target feature map through the following steps:
• Taking FIG. 12b as an example, the first target feature map may be the feature map C2 in FIG. 12b, the second target feature map may be the feature map P3 in FIG. 12b, and the third target feature map may be the feature map Q1 in FIG. 12b.
• Specifically, the first target feature map can be down-sampled and convolved to obtain a fourth target feature map; for example, the feature map C2 can be subjected to down-sampling and convolution processing, where the purpose of the down-sampling is to make the resolution of each channel feature map of the fourth target feature map the same as that of the feature map P3, and the purpose of the convolution processing is to make the number of channels of the fourth target feature map the same as that of the feature map P3.
• At this time, the fourth target feature map and the second target feature map have the same number of channels and resolution, and the fourth target feature map and the second target feature map can be added channel by channel; as shown in FIG. 12b, adding them channel by channel yields the feature map Q1.
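• A minimal sketch of this generation of the feature map Q1 follows, assuming PyTorch; the shapes correspond to a 512x512 input and are illustrative assumptions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

c2 = torch.randn(1, 64, 128, 128)   # first target feature map (H/4)
p3 = torch.randn(1, 256, 64, 64)    # second target feature map (H/8)

reduce = nn.Conv2d(64, 256, kernel_size=1)       # match P3's channel count
c2_down = F.interpolate(c2, size=p3.shape[-2:])  # match P3's resolution
q1 = reduce(c2_down) + p3                        # channel-by-channel addition
print(q1.shape)  # torch.Size([1, 256, 64, 64])
```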
• In one implementation, the multiple first feature maps include a first target feature map, the multiple second feature maps include a second target feature map, and the multiple third feature maps include a third target feature map, where the third target feature map is the bottom-most (largest-resolution) feature map among the multiple third feature maps, and the second FPN is used to generate the third target feature map through the following steps:
• down-sampling the first target feature map to obtain a fourth target feature map, where the fourth target feature map has the same resolution as the second target feature map;
• adding the fourth target feature map and the second target feature map channel by channel and performing convolution processing to generate the third target feature map, where the third target feature map has the same number of channels as the second target feature map.
• Taking FIG. 12c as an example, the first target feature map may be the feature map C2 in FIG. 12c, the second target feature map may be the feature map P3 in FIG. 12c, and the third target feature map may be the feature map Q1 in FIG. 12c.
• Specifically, the first target feature map can be down-sampled to obtain the fourth target feature map; for example, the feature map C2 can be down-sampled, where the purpose of the down-sampling is to make the resolution of each channel feature map of the fourth target feature map the same as that of the feature map P3.
• At this time, the fourth target feature map and the second target feature map have the same resolution; the fourth target feature map and the second target feature map can then be added channel by channel and passed through convolution processing, so that the obtained third target feature map has the same number of channels as the feature map P3, where the above convolution processing may be a concatenation operation.
• The second feature pyramid includes multiple third feature maps with multi-scale resolution, where a non-bottom-most feature map (that is, a feature map whose resolution is not the largest) among the multiple third feature maps can be generated based on a first feature map generated by the backbone network, a second feature map generated by the first FPN, and the third feature map of the adjacent lower level.
• In one implementation, the multiple first feature maps include a first target feature map, the multiple second feature maps include a second target feature map, and the multiple third feature maps include a third target feature map and a fourth target feature map; the second feature map generating unit is configured to generate the fourth target feature map through the following steps:
• down-sampling the third target feature map to obtain a fifth target feature map, where the fifth target feature map has the same number of channels and resolution as the second target feature map; subjecting the first target feature map to down-sampling and convolution processing to obtain a sixth target feature map, where the sixth target feature map has the same number of channels and resolution as the second target feature map; and superimposing the fifth target feature map, the second target feature map, and the sixth target feature map based on their respective channels to generate the fourth target feature map.
• Taking FIG. 12d as an example, the first target feature map may be the feature map C3 in FIG. 12d, the second target feature map may be the feature map P4 in FIG. 12d, the third target feature map may be the feature map Q1 in FIG. 12d, and the fourth target feature map may be the feature map Q2 in FIG. 12d.
• Specifically, the third target feature map can be down-sampled to obtain the fifth target feature map; for example, the feature map Q1 can be down-sampled, where the purpose of the down-sampling is to make the resolution of each channel feature map of the fifth target feature map the same as that of the feature map P4.
• In one implementation, the third target feature map can be subjected to down-sampling and convolution processing to obtain the fifth target feature map, where the purpose of the down-sampling is to make the resolution of each channel feature map of the fifth target feature map the same as that of the second target feature map, and the purpose of the convolution processing is to make the number of channels of the fifth target feature map the same as that of the second target feature map.
• In one implementation, the third target feature map may be down-sampled to obtain a fifth target feature map, the fifth target feature map having the same number of channels and resolution as the second target feature map; the first target feature map is down-sampled to obtain a sixth target feature map, the sixth target feature map having the same resolution as the second target feature map; and the fifth target feature map, the second target feature map, and the sixth target feature map are superimposed on their respective channels and subjected to convolution processing to generate the fourth target feature map, where the fourth target feature map and the second target feature map have the same number of channels.
• Taking FIG. 12e as an example, the first target feature map may be the feature map C3 in FIG. 12e, the second target feature map may be the feature map P4 in FIG. 12e, the third target feature map may be the feature map Q1 in FIG. 12e, and the fourth target feature map may be the feature map Q2 in FIG. 12e.
• Specifically, the third target feature map can be down-sampled to obtain a fifth target feature map, where the purpose of the down-sampling is to make the resolution of each channel feature map of the fifth target feature map the same as that of the second target feature map.
• At this time, the fifth target feature map and the second target feature map have the same resolution, and the fifth target feature map and the second target feature map can be added channel by channel.
• The first target feature map is down-sampled to obtain a sixth target feature map, where the purpose of the down-sampling is to make the resolution of each channel feature map of the sixth target feature map the same as that of the second target feature map.
• At this time, the fifth target feature map, the sixth target feature map, and the second target feature map have the same resolution, so the fifth target feature map, the sixth target feature map, and the second target feature map can be added channel by channel and then processed by convolution, so that the obtained fourth target feature map has the same number of channels as the second target feature map, where the convolution processing may be a concatenation operation.
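• The generation of a non-bottom third feature map such as Q2 can be sketched as follows, assuming PyTorch; the shapes are illustrative, and the FIG. 12d-style variant with channel-wise addition is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

q1 = torch.randn(1, 256, 64, 64)    # third target feature map (H/8)
c3 = torch.randn(1, 128, 64, 64)    # first target feature map (H/8)
p4 = torch.randn(1, 256, 32, 32)    # second target feature map (H/16)

proj = nn.Conv2d(128, 256, 1)                       # match C3's channels to P4
fifth = F.interpolate(q1, size=p4.shape[-2:])       # down-sample Q1 to P4's size
sixth = proj(F.interpolate(c3, size=p4.shape[-2:])) # down-sample + convolve C3
q2 = fifth + p4 + sixth                             # superimpose channel by channel
print(q2.shape)  # torch.Size([1, 256, 32, 32])
```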
• The second feature map generating unit may generate a third target feature map and a fourth target feature map with different resolutions, where the resolution of the fourth target feature map is smaller than that of the third target feature map, and the fourth target feature map is generated based on a first feature map among the multiple first feature maps, a second feature map among the multiple second feature maps, and the third target feature map.
• The multiple third feature maps generated by the second feature map generating unit retain the advantages of the feature pyramid network while adding a bottom-up path (successively generating feature maps of smaller resolution from feature maps of larger resolution), introducing the rich texture detail information and/or position detail information of the shallower neural network layers into the deep convolutional layers; a detection network that uses the multiple third feature maps generated in this way obtains detection results with higher accuracy for small targets.
  • the object detection task is different from the image classification.
• In the image classification task, the model only needs to answer the question of what is in the image; therefore, the image classification task enjoys invariances such as translation and scale invariance.
• In the object detection task, the model needs to know where the target in the image is and to which category the target belongs.
• Current deep neural network models are all expanding towards deeper and multi-branch topological structures; as the network deepens, the problem of network degradation appears.
• The ResNet network structure, with identity mappings and skip connections, can solve the network degradation problem well.
• The number of network layers is on the order of tens to hundreds, which gives the network better expressive ability, but the problem this brings is that the deeper the network, the larger the receptive field of the acquired features, and this causes small targets to be missed in detection.
• Anchor points are mapped back to the original image through the feature map for processing, which alleviates the multi-scale problem, but a certain degree of missed target detection is still inevitable.
• The feature pyramid network (FPN) can address the problem of multi-scale target detection: the deep network model itself has multi-level, multi-scale feature maps, where shallow feature maps have smaller receptive fields and deep feature maps have larger receptive fields, so directly using such feature maps with a pyramidal hierarchical structure can introduce multi-scale information. But there is a problem: shallow feature maps, with their smaller receptive fields, are beneficial for detecting small targets, yet the semantic information they contain is relatively limited; it can be understood that shallow features are not abstract enough, making it difficult to classify the detected targets.
  • Feature Pyramid Network uses an ingenious structural design to solve the problem of insufficient semantic information of shallow features.
• The feature pyramid network introduces a top-down network structure that gradually enlarges the resolution of the feature maps, and introduces horizontal connection branches derived from the original feature extraction network, so that the original feature maps of corresponding resolution are fused with the up-sampled deep feature maps.
• On this basis, a bottom-up skip-connected multi-scale feature layer design is used: the shallow feature maps contain very rich edge, texture, and detail information, which is introduced through skip connections between the original feature maps and the bottom-up multi-scale network layers, and the down-sampled original feature maps are horizontally connected and fused with the feature maps of corresponding resolution in the top-down multi-scale feature layers.
  • This network layer improvement has better detection results for small targets and partially occluded targets, and can introduce rich semantic information and detailed information features in the multi-scale feature pyramid layer.
• A feature pyramid network introduces a top-down path to propagate the rich semantic information contained in deep features downwards, so that the feature maps of each scale contain rich semantic information; deep features also have a large receptive field, which allows a good detection effect on large targets.
• However, the finer position detail information and texture detail information contained in the shallower feature maps is ignored, which has a great impact on the detection accuracy of medium and small targets.
• In this embodiment, the second feature map generating unit introduces the shallow texture detail information of the original feature maps (the multiple first feature maps generated by the convolution processing unit) into the deep feature maps (the multiple second feature maps generated by the first feature map generating unit) to generate multiple third feature maps, and the third feature maps carrying rich shallow texture detail information are used as the input data of the detection unit for target detection, which can improve the detection accuracy of subsequent object detection.
• It should be noted that this embodiment does not mean that the detection accuracy will be higher for every image that includes small targets; rather, over a large number of samples, this embodiment achieves higher overall detection accuracy.
  • the detection unit 804 is configured to perform target detection on the image according to at least one third feature map of the plurality of third feature maps, and output the detection result.
  • the detection unit 804 may be a head.
  • the head is used to detect the target object in the image according to at least one third feature map of the plurality of third feature maps, and output the detection result.
  • the sensing network may include one or more heads.
• Each parallel head-end header is used to detect the task object of one task according to the third feature maps output by the second FPN, and to output the 2D frame of the area where the task object is located together with the confidence corresponding to each 2D frame; each parallel header completes the detection of a different task object, where the task object is the object that needs to be detected in that task; the higher the confidence, the greater the probability that the object corresponding to the task exists in the 2D box corresponding to that confidence.
  • different heads can complete different 2D detection tasks.
• For example, one head (head0) of the multiple heads can complete car detection and output the 2D frames and confidences of Car/Truck/Bus; head1 of the multiple heads can complete person detection and output the 2D frames and confidences of Pedestrian/Cyclist/Tricycle; another head of the multiple heads can complete traffic-light detection and output the 2D frames and confidences of Red_TrafficLight/Green_TrafficLight/Yellow_TrafficLight/Black_TrafficLight.
• The sensing network may further include multiple serial heads, each connected to a parallel head; it should be emphasized that the serial head is not necessary, and in scenarios where only the 2D frame needs to be detected there is no need to include a serial head.
• The serial head can use the 2D frame of the task object provided by the parallel head connected to it to extract, from one or more feature maps of the second FPN, the features of the region where the 2D frame is located, and predict the 3D information, Mask information, or Keypoint information of the task object according to the features of that region.
  • the serial head is optionally connected in series behind the parallel head, and on the basis of detecting the 2D frame of the task, the 3D/Mask/Keypoint detection of the objects inside the 2D frame is completed.
  • serial 3D_head0 completes the estimation of the vehicle's orientation, center of mass and length, width and height, thereby outputting the 3D frame of the vehicle;
  • serial Mask_head0 predicts the fine mask of the vehicle, thereby dividing the vehicle;
• serial Keypoint_head0 completes the estimation of the key points of the vehicle.
• The serial head is not necessary: for tasks that do not require 3D/Mask/Keypoint detection, no serial head needs to be connected, such as traffic-light detection, where only the 2D frame needs to be detected.
• Some tasks can choose to connect one or more serial heads according to the specific needs of the task; for example, parking slot detection needs the key points of the parking space in addition to the 2D frame, so in this task only a serial Keypoint_head needs to be connected, without the 3D and Mask heads.
• The header is connected to the FPN; based on the feature maps provided by the FPN, the header can complete the detection of the 2D frame of a task and output the 2D frame of the object of that task together with the corresponding confidence. The following describes the structure of one header.
  • FIG. 13b is a schematic diagram of a header.
  • the head includes three modules: a Region Proposal Network (RPN), ROI-ALIGN, and RCNN.
• The RPN module can be used to predict, on one or more third feature maps provided by the second FPN, the areas where the task object is located, and to output candidate 2D boxes that match those areas; in other words, the RPN predicts, on one or more feature maps output by the FPN, the areas where the task object may exist, and gives the boxes of those areas, which are called candidate regions (Proposals).
• For example, when the head is responsible for detecting cars, its RPN layer predicts candidate frames that may contain a car; when the head is responsible for detecting people, its RPN layer predicts candidate frames that may contain a person.
  • these Proposals are not accurate. On the one hand, they do not necessarily contain the object of the task, and on the other hand, these frames are not compact.
  • the 2D candidate region prediction process can be implemented by the RPN module of the head, which predicts the regions where the task object may exist based on the feature map provided by the FPN, and gives candidate frames (also called candidate regions, Proposal) of these regions.
• For example, when the head is responsible for detecting cars, its RPN layer predicts candidate frames in which a car may exist.
  • the RPN layer may generate a feature map RPN Hidden through, for example, a 3*3 convolution on the third feature map provided by the second FPN.
• The RPN layer of each head then predicts Proposals from RPN Hidden: specifically, it predicts the coordinates and confidence of Proposals at each position of RPN Hidden through a 1*1 convolution. The higher the confidence, the greater the probability that the object of the task exists in that Proposal; for example, the greater the score of a certain Proposal in a head, the greater the probability that the corresponding object exists.
• The Proposals predicted by each RPN layer need to pass through a Proposal merging module, where redundant Proposals are removed according to the degree of overlap between Proposals (this process can use, but is not limited to, the NMS algorithm), and the N Proposals (N ≤ K) with the largest scores are selected from the remaining K Proposals as the candidate regions where objects may exist. These Proposals are inaccurate: on the one hand they do not necessarily contain the object of the task, and on the other hand the frames are not compact. Therefore, the RPN module performs only a coarse detection, and the subsequent RCNN module is required for refinement. When the RPN module regresses the Proposal coordinates, it does not directly regress the absolute values of the coordinates, but regresses coordinates relative to Anchors; the higher the match between these Anchors and the actual objects, the greater the probability that the RPN can detect the objects.
• The ROI-ALIGN module is used to extract, according to the regions predicted by the RPN module, the features of the region where each candidate 2D frame is located from a feature map provided by the FPN.
• The ROI-ALIGN module can use, but is not limited to, feature extraction methods such as ROI-POOLING (region-of-interest pooling), ROI-ALIGN (region-of-interest extraction), PS-ROIPOOLING (position-sensitive region-of-interest pooling), and PS-ROIALIGN (position-sensitive region-of-interest extraction).
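• The per-Proposal feature extraction can be illustrated with torchvision's roi_align, used here only as one possible realization of the ROI-ALIGN operation; the stride-8 scale and the box coordinates below are assumptions for the example.

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 64, 64)   # one FPN output, stride 8 assumed
# Proposals given as (batch_index, x1, y1, x2, y2) in input-image coordinates
proposals = torch.tensor([[0, 32.0, 32.0, 160.0, 160.0],
                          [0, 200.0, 80.0, 300.0, 240.0]])
feats = roi_align(feature_map, proposals, output_size=(14, 14),
                  spatial_scale=1.0 / 8)    # map image coords onto the feature map
print(feats.shape)  # torch.Size([2, 256, 14, 14]) - N*14*14*256 as in the text
```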
• The RCNN module is used to perform convolution processing, through a neural network, on the features of the region where each candidate 2D box is located, to obtain the confidence that the candidate 2D box belongs to each object category; it adjusts the coordinates of the candidate 2D box through the neural network so that the adjusted 2D candidate frame matches the shape of the actual object better than the original candidate 2D frame, and selects adjusted 2D candidate frames whose confidence is greater than a preset threshold as the 2D frames of the region.
• That is, the RCNN module mainly refines the features of each Proposal produced by the ROI-ALIGN module to obtain the confidence that each Proposal belongs to each category (for example, for the car task, four scores for Background/Car/Truck/Bus are given), and adjusts the coordinates of the Proposal's 2D frame to output a more compact 2D frame. After these 2D boxes are merged by non-maximum suppression (NMS), they are output as the final 2D boxes.
• The fine classification of 2D candidate regions is mainly implemented by the RCNN module of the head in Figure 13b: based on the features of each Proposal extracted by the ROI-ALIGN module, it further regresses more compact 2D frame coordinates, and at the same time classifies the Proposal, outputting the confidence that it belongs to each category.
  • RCNN can be implemented in many forms.
• The feature size output by the ROI-ALIGN module can be N*14*14*256 (features of the Proposals), which is first processed by the Resnet18 convolution module Res18-Conv5 in the RCNN module.
  • the output feature size is N*7*7*512, and then processed through a Global Avg Pool (average pooling layer), and the 7*7 features in each channel in the input features are averaged to obtain N*512 Features, where each 1*512-dimensional feature vector represents the feature of each Proposal.
• A fully connected (FC) layer outputs an N*4 vector (these 4 values respectively represent the x/y coordinates of the center point of the frame and the width and height of the frame), together with the confidence of the frame's category (in head0, the scores that the box is Background/Car/Truck/Bus need to be given).
• Finally, in the box merging operation, the several boxes with the largest scores are selected and repeated boxes are removed through the NMS operation, so as to obtain a compact box output.
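• A minimal NMS sketch matching the box-merging step just described follows (plain PyTorch; the 0.5 IoU threshold is an arbitrary example value):

```python
import torch

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) as x1, y1, x2, y2; scores: (N,). Returns kept indices."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)                       # keep the highest-scoring box
        if order.numel() == 1:
            break
        rest = boxes[order[1:]]
        # intersection of the kept box with the remaining boxes
        lt = torch.maximum(boxes[i, :2], rest[:, :2])
        rb = torch.minimum(boxes[i, 2:], rest[:, 2:])
        inter = (rb - lt).clamp(min=0).prod(dim=1)
        area_i = (boxes[i, 2:] - boxes[i, :2]).prod()
        area_r = (rest[:, 2:] - rest[:, :2]).prod(dim=1)
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavily overlapping boxes
    return torch.tensor(keep)

boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [50., 50., 60., 60.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # tensor([0, 2]) - the near-duplicate box is removed
```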
  • the sensing network may also include other heads, and 3D/Mask/Keypoint detection can be further performed on the basis of detecting the 2D frame.
  • the ROI-ALIGN module extracts the features of the region where each 2D box is located on the feature map output by the FPN according to the accurate 2D box provided by the head.
• The feature size output by the ROI-ALIGN module is M*14*14*256; it is first processed by Resnet18's Res18-Conv5, giving an output feature size of M*7*7*512, and then processed through a Global Avg Pool (average pooling layer), which averages the 7*7 features of each channel in the input features to obtain M*512 features, where each 1*512-dimensional feature vector represents the features of one 2D box.
• From these features, the orientation angle of the object in the frame (orientation, an M*1 vector), the centroid coordinates (centroid, an M*2 vector whose 2 values represent the x/y coordinates of the centroid), and the length, width, and height (dimension) are then predicted.
• It should be noted that FIG. 13a and FIG. 13b show only one implementation manner and do not constitute a limitation of the present application.
• The perception network may further include: a hole convolution layer, configured to perform hole convolution processing on at least one third feature map of the multiple third feature maps; correspondingly, the head is specifically configured to detect the target object in the image according to the at least one third feature map after the hole convolution processing, and output the detection result.
  • FIG. 14a is a schematic diagram of a perceptual network structure provided by an embodiment of this application
  • FIG. 14b is a schematic diagram of a hollow convolution kernel provided by an embodiment of this application.
• In the candidate region extraction network (RPN), the regression layer can determine whether there is a target in each anchor frame and the difference between the predicted frame and the real frame, and better frame extraction results can be obtained by training the candidate region extraction network.
• In this embodiment, the 3x3 sliding-window convolution kernel is replaced with a 3x3 hole convolution kernel, at least one third feature map of the multiple third feature maps is subjected to hole convolution processing, the target object in the image is detected according to the at least one third feature map after the hole convolution processing, and the detection result is output.
• The equivalent kernel size n of a hole convolution satisfies n = k + (k-1) × (d-1), where k is the original kernel size and d is the dilation rate.
  • the existing candidate region extraction network sets a 3x3 ordinary convolution kernel as a sliding window to slide on the feature map for subsequent processing.
• In this embodiment, the 3x3 ordinary convolution kernel is replaced with a 3x3 hole convolution kernel, which improves the network's handling of missed detections of large targets and occluded targets.
• The finally obtained annotation model can achieve good detection results; after replacing the ordinary convolution with the 3x3 hole convolution, a larger receptive field is obtained without increasing the amount of computation, and at the same time, because of the larger receptive field, better context information can be obtained, reducing cases in which background is misjudged as foreground.
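• The effect of the replacement can be illustrated as follows (PyTorch assumed): with k=3 and d=2, the equivalent kernel size is n = 3 + (3-1) × (2-1) = 5, so the 3x3 hole convolution covers a 5x5 extent at the computational cost of a 3x3 one.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 64, 64)
plain = nn.Conv2d(256, 256, kernel_size=3, padding=1)                # n = 3
dilated = nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2)  # n = 5
# Both have the same number of weights (3x3 per channel pair), but the
# dilated kernel sees a 5x5 neighborhood, enlarging the receptive field.
print(plain(x).shape, dilated(x).shape)  # both keep the 64x64 resolution
```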
• The present application provides a data processing system, including: a convolution processing unit, a first feature map generating unit, a second feature map generating unit, and a detection unit, where the convolution processing unit is respectively connected to the first feature map generating unit and the second feature map generating unit, the first feature map generating unit is connected to the second feature map generating unit, and the second feature map generating unit is connected to the detection unit. The convolution processing unit is configured to receive an input image and perform convolution processing on the input image to generate multiple first feature maps; the first feature map generating unit is configured to generate multiple second feature maps according to the multiple first feature maps, where the multiple first feature maps include more texture details of the input image and/or position details in the input image than the multiple second feature maps; the second feature map generating unit is configured to generate multiple third feature maps according to the multiple first feature maps and the multiple second feature maps; and the detection unit is configured to output a detection result of the object included in the image according to at least one third feature map of the multiple third feature maps.
• In this way, the second feature map generating unit introduces the shallow texture detail information of the original feature maps (the multiple first feature maps generated by the convolution processing unit) into the deep feature maps (the multiple second feature maps generated by the first feature map generating unit) to generate multiple third feature maps, and the third feature maps carrying rich shallow texture detail information are used as the input data of the detection unit for target detection, which can improve the detection accuracy of subsequent object detection.
  • the data processing system in the embodiment of the present application may further include:
• an intermediate feature extraction layer, used to convolve at least one third feature map of the multiple third feature maps to obtain at least one fourth feature map, and to convolve at least one third feature map of the multiple third feature maps to obtain at least one fifth feature map, where the receptive field corresponding to the fourth feature map is larger than the receptive field corresponding to the fifth feature map; the detection unit is specifically configured to output a detection result of the object included in the image according to the at least one fourth feature map and the at least one fifth feature map.
  • Figure 14c is a schematic diagram of the processing flow of an intermediate feature extraction.
• As shown in FIG. 14c, the third feature map can be convolved by convolutional layers with different dilation rates to obtain corresponding feature maps (the number of channels of each is c), which are spliced to obtain feature map 4 (with 3c channels); a global information descriptor is then obtained through global average pooling, nonlinearity is applied through a first fully connected layer, and processing through a second fully connected layer and a sigmoid function limits each weight value to the range 0-1; the weight values are then multiplied with the corresponding feature map channels one by one to obtain the processed feature map.
• When a large target exists in the image, the processed fourth feature map has a greater gain than the processed fifth feature map. Since the receptive field corresponding to the fourth feature map is larger than that corresponding to the fifth feature map, and a larger receptive field carries more information about large targets, target detection that uses it achieves higher detection accuracy for large targets. In this embodiment, when a larger target exists in the image, the gain corresponding to the fourth feature map is greater than that of the fifth feature map, so when the detection unit performs target detection on the image based on the processed fourth feature map and the processed fifth feature map, the overall receptive field is larger and the detection accuracy is correspondingly higher.
• The intermediate feature extraction layer can learn the rule for assigning weight values through training: for a feature map that includes a large target, the first weight value it determines for the first convolutional layer is larger and the second weight value it determines for the second convolutional layer is smaller; for a feature map that includes a small target, the first weight value it determines for the first convolutional layer is smaller and the second weight value it determines for the second convolutional layer is larger.
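• A sketch of this intermediate feature extraction layer follows, assuming PyTorch; the module name, the dilation rates (1, 2, 3), and the hidden width of the first fully connected layer are assumptions made for the example.

```python
import torch
import torch.nn as nn

class MultiDilationWeighting(nn.Module):
    def __init__(self, c=256, rates=(1, 2, 3)):
        super().__init__()
        # parallel convolutions with different dilation rates (c channels each)
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, 3, padding=r, dilation=r) for r in rates)
        hidden = c * len(rates) // 4
        self.fc1 = nn.Linear(c * len(rates), hidden)  # applies nonlinearity
        self.fc2 = nn.Linear(hidden, c * len(rates))  # produces the weights

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)  # 3c channels
        desc = feats.mean(dim=(2, 3))                 # global average pooling
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(desc))))  # weights in 0-1
        return feats * w.unsqueeze(-1).unsqueeze(-1)  # weight channel by channel

print(MultiDilationWeighting()(torch.randn(1, 256, 32, 32)).shape)  # (1, 768, 32, 32)
```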
  • FIG. 15 is a schematic flowchart of an object detection method provided by an embodiment of the application. As shown in FIG. 15, the object detection method includes:
• When object detection needs to be performed on a first image, the input first image may be received.
• After receiving the input first image, object detection can be performed on the first image through the first perception network to obtain a first detection result, where the first detection result includes a first detection frame:
  • the first detection frame may indicate the pixel position of the detected object in the first image.
• Object detection may also be performed on the first image through a second perception network to obtain a second detection result.
• The second detection result includes a second detection frame, and the second detection frame may indicate the pixel position of a detected object in the first image.
• If the ratio of the area of the first intersection to the area of the first detection frame is less than the preset value, it can be considered that the first detection frame is missing from the second detection result, and the second detection result is updated so that the updated second detection result includes the first detection frame.
• The second detection result may further include a third detection frame, where there is a second intersection between the third detection frame and the first detection frame, and the area of the second intersection is smaller than the area of the first intersection.
• The object detection accuracy of the first perception network is higher than that of the second perception network, where the object detection accuracy relates to at least one of the following features: the shape, position, or category of the object corresponding to the detection frame.
  • the object detection accuracy of the first perception network is higher than the object detection accuracy of the second perception network, that is, the detection result of the first perception network can be used to update the detection result of the second perception network.
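The update rule described above can be sketched as follows. This is a hedged illustration: the helper names are hypothetical, boxes are assumed to be non-degenerate (x1, y1, x2, y2) tuples, and the 0.5 threshold merely stands in for the unspecified preset value.

```python
def box_area(box):
    # box: (x1, y1, x2, y2)
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def intersection_area(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def update_second_result(first_boxes, second_boxes, threshold=0.5):
    """For each box found by the higher-accuracy network, check how much of
    it is covered by any box from the lower-accuracy network; if the best
    coverage ratio is below the threshold, treat the box as a missed
    detection and add it to the second result."""
    updated = list(second_boxes)
    for fb in first_boxes:
        area = box_area(fb)
        best = max((intersection_area(fb, sb) for sb in second_boxes), default=0.0)
        if area > 0 and best / area < threshold:
            updated.append(fb)   # considered omitted from the second result
    return updated
```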
• the input first image may be received, and convolution processing is performed on the first image to generate multiple first feature maps; multiple second feature maps are generated according to the multiple first feature maps, where the multiple first feature maps include more texture detail information and/or location detail information than the multiple second feature maps; multiple third feature maps are generated according to the multiple first feature maps and the multiple second feature maps; and, according to at least one third feature map of the multiple third feature maps, target detection is performed on the first image and a first detection result is output.
  • the plurality of second feature maps include more semantic information than the plurality of first feature maps.
• the plurality of first feature maps include a first target feature map, the plurality of second feature maps include a second target feature map, and the plurality of third feature maps include a third target feature map and a fourth target feature map, where the resolution of the third target feature map is smaller than that of the fourth target feature map. The third target feature map is down-sampled to obtain a fifth target feature map, which has the same number of channels and resolution as the second target feature map; the first target feature map is down-sampled and convolved to obtain a sixth target feature map, which also has the same number of channels and resolution as the second target feature map; and the fifth target feature map, the second target feature map, and the sixth target feature map are superimposed based on the respective channels to generate the fourth target feature map. Alternatively, the third target feature map is down-sampled to obtain a fifth target feature map having the same number of channels and resolution as the second target feature map; the first target feature map is down-sampled to obtain a sixth target feature map having the same resolution as the second target feature map; and channel superposition and convolution processing are performed on the fifth target feature map, the second target feature map, and the sixth target feature map to generate the fourth target feature map, which has the same number of channels as the second target feature map.
• At least one third feature map of the plurality of third feature maps can be convolved by a first convolution layer to obtain at least one fourth feature map, and the at least one third feature map can be convolved by a second convolution layer to obtain at least one fifth feature map, where the receptive field corresponding to the fourth feature map is larger than the receptive field corresponding to the fifth feature map; according to the fourth feature map and the fifth feature map, target detection is performed on the first image, and a first detection result is output.
• the first detection result includes a first detection frame, and a second detection result of the first image can be obtained, where the second detection result is obtained by performing object detection on the first image through a second perception network; the object detection accuracy of the first perception network is higher than that of the second perception network; the second detection result includes a second detection frame, and there is an intersection between the area where the second detection frame is located and the area where the first detection frame is located;
• if the ratio of the area of the intersection to the area of the first detection frame is less than a preset value, the second detection result is updated so that the updated second detection result includes the first detection frame.
• the second detection result includes multiple detection frames, and there is an intersection between the area where each of the multiple detection frames is located and the area where the first detection frame is located; the multiple detection frames include the second detection frame, and, among the areas of the intersections between the area where each of the multiple detection frames is located and the area where the first detection frame is located, the area of the intersection between the area where the second detection frame is located and the area where the first detection frame is located is the smallest.
• the first image is an image frame in a video, the second image is an image frame in the video, and the frame distance between the first image and the second image in the video is smaller than a preset value.
• the third detection result of the second image is acquired, where the third detection result includes a fourth detection frame and the object category corresponding to the fourth detection frame; if the shape difference and position difference between the fourth detection frame and the first detection frame are within a preset range, it is determined that the first detection frame corresponds to the object category corresponding to the fourth detection frame.
  • the detection confidence of the fourth detection frame is greater than a preset threshold.
• the first image is an image frame in a video, the second image is an image frame in the video, and the frame distance between the first image and the second image in the video is less than a preset value.
  • a third detection result of the second image can also be obtained, and the third detection result is obtained by performing object detection on the second image through the first perception network or the second perception network
• the third detection result includes a fourth detection frame and an object category corresponding to the fourth detection frame; if the shape difference and position difference between the fourth detection frame and the first detection frame are within a preset range, it is determined that the first detection frame corresponds to the object category corresponding to the fourth detection frame.
  • the detection confidence of the fourth detection frame is greater than a preset threshold.
• a timing (temporal) detection algorithm can be considered. For a missed target found by the missed-detection algorithm, several frames of images before and after the current image are selected, and the regions near the center of the missed target in the preceding and following frames are compared with the missed target in terms of area, aspect ratio, and center coordinates to select a specific number of similar target frames. The similar target frames are then compared with the other target frames in the suspected missed-detection image, and any similar target frame that resembles another target frame is removed from the detected similar target frames. Based on content similarity and feature similarity algorithms, the most similar target frame in each of the preceding and following frames is obtained, yielding the most similar target frame for the suspected missed target.
• if the confidence of the most similar target frame is higher than a certain threshold, the category of the missed target can be determined. If it is lower than the threshold, the category of the suspected missed target is compared with that of the most similar target frame: if the categories are the same, it is judged as a missed detection; if the categories are different, manual verification can be performed.
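A heavily hedged sketch of this temporal verification step is given below. The patent does not specify the similarity measure, so the one used here (combining area ratio, aspect ratio, and centre distance) is only an illustrative assumption, as are the dictionary keys, the confidence threshold, and the assumption that the neighbouring frames contain at least one detection.

```python
def _area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def _center(b):
    return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)

def box_similarity(a, b):
    """Illustrative similarity from area ratio, aspect ratio and centre
    distance; boxes are assumed non-degenerate (x1, y1, x2, y2) tuples."""
    area_sim = min(_area(a), _area(b)) / max(_area(a), _area(b))
    ar_a = (a[2] - a[0]) / (a[3] - a[1])
    ar_b = (b[2] - b[0]) / (b[3] - b[1])
    ar_sim = min(ar_a, ar_b) / max(ar_a, ar_b)
    (xa, ya), (xb, yb) = _center(a), _center(b)
    dist = ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5
    return area_sim * ar_sim / (1.0 + dist)

def verify_missed_target(missed_box, neighbor_detections, conf_threshold=0.5):
    """neighbor_detections: detections from the frames before and after the
    suspect frame, as dicts with 'box', 'conf' and 'category' (assumed keys)."""
    best = max(neighbor_detections, key=lambda d: box_similarity(missed_box, d["box"]))
    if best["conf"] > conf_threshold:
        return best["category"]      # confident enough: adopt the category
    return None                      # otherwise compare categories / verify manually
```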
  • An embodiment of the present application provides an object detection method.
• the method includes: receiving an input first image; performing object detection on the first image through a first perception network to obtain a first detection result, where the first detection result includes a first detection frame; performing object detection on the first image through a second perception network to obtain a second detection result, where the second detection result includes a second detection frame and there is a first intersection between the second detection frame and the first detection frame; and, if the ratio of the area of the first intersection to the area of the first detection frame is less than a preset value, updating the second detection result so that the updated second detection result includes the first detection frame.
  • FIG. 16 is a schematic flowchart of a perceptual network training method provided by an embodiment of this application.
  • the perceptual network training method can be used to train the perceptual network in the foregoing embodiment.
  • the first perception network may be the initial network of the perception network in the foregoing embodiment
  • the second perception network obtained by training the first perception network may be the perception network in the foregoing embodiment.
  • the perceptual network training method includes:
  • the detection result of the image may be obtained.
• the detection result is obtained by performing object detection on the image through a first perception network, and the detection result includes the target detection frame corresponding to the first object.
  • the preset loss function is also related to the shape difference between the target detection frame and the pre-labeled detection frame, wherein the shape difference is negatively related to the area of the pre-labeled detection frame.
• the rectangular detection frame includes a first side and a second side that are connected, and the circumscribed rectangular frame includes a third side corresponding to the first side and a fourth side corresponding to the second side; the area difference is positively correlated with the length difference between the first side and the third side, and positively correlated with the length difference between the second side and the fourth side.
• the preset loss function is also related to the position difference between the target detection frame and the pre-labeled detection frame in the image, wherein the position difference is negatively correlated with the area of the pre-labeled detection frame; or the position difference is negatively correlated with the area of the smallest circumscribed rectangle of the convex hull of the pre-labeled detection frame and the target detection frame.
• the target detection frame includes a first corner point and a first center point, and the pre-labeled detection frame includes a second corner point and a second center point, where the first corner point and the second corner point are the two end points of a diagonal of a rectangle; the position difference is positively correlated with the positional difference between the first center point and the second center point in the image, and negatively correlated with the length between the first corner point and the second corner point.
  • the loss function may be as follows:
• the newly designed frame regression loss function uses the scale-invariant IoU loss item applied in target detection measurement, a loss item that considers the aspect ratios of the predicted frame and the real frame, and a loss item based on the ratio of the distance between the center coordinates of the predicted frame and the center coordinates of the real frame to the distance between the coordinates of the lower right corner of the predicted frame and the coordinates of the upper left corner of the real frame.
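The formula itself appears only as an image in the source and did not survive extraction. Based on the three loss items just described, a plausible reconstruction (an assumption, not the patent's exact formula) is:

```latex
% Hedged reconstruction of the regression loss inferred from the description.
\mathcal{L}_{\mathrm{reg}}
  = 1 - \mathrm{IoU}(B_p, B_g)
  + \alpha \left( \arctan\frac{w_g}{h_g} - \arctan\frac{w_p}{h_p} \right)^{2}
  + \beta \, \frac{d\!\left(o_p,\, o_g\right)}{d\!\left(\mathrm{br}(B_p),\, \mathrm{tl}(B_g)\right)}
```

Here $B_p$ and $B_g$ are the predicted and real frames, $w$ and $h$ their widths and heights, $o_p$ and $o_g$ their center points, $\mathrm{br}(\cdot)$ and $\mathrm{tl}(\cdot)$ the lower-right and upper-left corners, $d(\cdot,\cdot)$ the Euclidean distance, and $\alpha$, $\beta$ the scale-dependent weight coefficients discussed below; the arctan form of the aspect-ratio item is one common choice, assumed here.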
• the IoU loss item naturally introduces a scale-invariant frame prediction quality evaluation index.
• the loss item based on the aspect ratios of the two frames measures the fit of the shape between the two frames, and the distance-ratio loss item naturally narrows the distance between the center points o_p and o_g of the predicted frame and the real frame, with the pull growing as the distance between them grows.
• the three loss items are assigned different weights to balance the impact of each item. Among them, the aspect-ratio and distance-ratio items are given weight coefficients tied to the frame scale to reduce the influence of scale: a large-scale frame has a smaller weight and a small-scale frame has a larger weight.
• the frame regression loss function proposed in this patent is suitable for various two-stage and one-stage algorithms and has good versatility; it also plays a strong promoting role in the fit between the target scale and the frame, and between the center point and the corners of the frame.
• the IoU, which measures the degree of fit between the detected frame and the real frame in target detection, is used as the loss function of frame regression. Owing to the inherent scale invariance of IoU, this alleviates the sensitivity to scale changes exhibited by earlier frame regression functions.
• the preset loss function includes a target loss item related to the position difference, and the target loss item changes with the change of the position difference; wherein, when the position difference is greater than a preset value, the rate of change of the target loss item is greater than a first preset rate of change; and/or, when the position difference is less than the preset value, the rate of change of the target loss item is less than a second preset rate of change.
  • the bounding box regression loss function can also be the following loss function:
• the above-mentioned frame regression loss function uses the scale-invariant IoU loss item widely applied in detection measurement, a loss item that takes into account the aspect ratios of the predicted frame and the real frame, and a pull loss item that narrows the distance between the predicted frame and the real frame. The IoU loss item naturally introduces a scale-invariant border prediction quality evaluation index.
  • the loss term of the aspect ratio of the two borders measures the fit of the shape between the two borders.
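This second formula is likewise an image in the source. Given the description, a plausible reading (again an assumption) keeps the IoU and aspect-ratio items and replaces the corner-distance ratio with a direct pull term on the centre distance, for example normalised by the diagonal $c$ of the smallest rectangle enclosing both frames, a DIoU-style choice assumed here:

```latex
% Equally hedged reconstruction of the second loss variant.
\mathcal{L}'_{\mathrm{reg}}
  = 1 - \mathrm{IoU}(B_p, B_g)
  + \alpha \left( \arctan\frac{w_g}{h_g} - \arctan\frac{w_p}{h_p} \right)^{2}
  + \gamma \, \frac{d\!\left(o_p,\, o_g\right)^{2}}{c^{2}}
```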
  • FIG. 17 is a schematic flowchart of an object detection method provided by an embodiment of the application. As shown in FIG. 17, the object detection method includes:
• performing CNN processing on the input image here should not be understood as only performing convolution processing on the input image; in some implementations, convolution processing, pooling operations, and so on may be performed on the input image.
• performing convolution processing on the first image to generate multiple first feature maps should not be understood only as performing multiple rounds of convolution processing on the first image, with each round generating one first feature map; that is, it should not be understood that each first feature map is obtained directly by convolving the first image. Rather, on the whole, the first image is the source of the multiple first feature maps. In one implementation, the first image can be convolved to obtain a first feature map, the generated first feature map can then be convolved to obtain another first feature map, and so on, so that multiple first feature maps are obtained.
• a series of convolution processing may be performed on the input image; specifically, each convolution processing may operate on the first feature map obtained by the previous convolution processing to obtain a further first feature map, and multiple first feature maps can be obtained in this way.
• the multiple first feature maps may be feature maps with multi-scale resolution, that is, the multiple first feature maps do not all have the same resolution; in an optional implementation, the multiple first feature maps can form a feature pyramid.
• the input image may be received and subjected to convolution processing to generate multiple first feature maps with multi-scale resolution; the convolution processing unit may perform a series of convolution processing on the input image to obtain feature maps at different scales (with different resolutions).
  • the convolution processing unit can take many forms, such as visual geometry group (VGG), residual neural network (residual neural network, resnet), GoogLeNet core structure (Inception-net), and so on.
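As a concrete stand-in for the convolution processing unit, the following minimal PyTorch sketch produces multiple first feature maps with multi-scale resolution by repeatedly convolving the previous stage's output. It is deliberately simplistic: the patent's unit would typically be a VGG, resnet, or Inception-net backbone, and the stage widths here are arbitrary.

```python
import torch.nn as nn

class SimpleBackbone(nn.Module):
    """Minimal stand-in for the convolution processing unit: each stage
    halves the resolution, and the per-stage outputs (e.g. C1..C4) serve as
    the multiple first feature maps with multi-scale resolution."""
    def __init__(self, in_ch=3, widths=(64, 128, 256, 512)):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_ch
        for w in widths:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(w),
                nn.ReLU(inplace=True),
            ))
            prev = w

    def forward(self, x):
        first_maps = []
        for stage in self.stages:
            x = stage(x)           # each stage convolves the previous map
            first_maps.append(x)   # collect C1..C4
        return first_maps
```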
• generating multiple second feature maps according to the multiple first feature maps here should not be understood to mean that every one of the multiple second feature maps is generated directly from the multiple first feature maps. In one implementation, some of the second feature maps are generated directly from one or more of the first feature maps; in another implementation, some of the second feature maps are generated from one or more of the first feature maps together with second feature maps other than themselves; in yet another implementation, some of the second feature maps are generated directly from second feature maps other than themselves. In the last case, since those other second feature maps are themselves generated from one or more of the first feature maps, this can still be understood as generating the multiple second feature maps according to the multiple first feature maps.
• the multiple second feature maps may be feature maps with multi-scale resolution, that is, the multiple second feature maps do not all have the same resolution; in an optional implementation, the multiple second feature maps can form a feature pyramid.
• a convolution operation can be performed on the topmost feature map C4 among the multiple first feature maps generated by the convolution processing unit. Exemplarily, dilated convolution and 1×1 convolution can be used to reduce the number of channels of the topmost feature map C4 to 256, producing the topmost feature map P4 of the feature pyramid; the output of the next feature map down, C3, is laterally linked, its channel count is reduced to 256 using a 1×1 convolution, and it is added channel by channel and pixel by pixel to feature map P4 to obtain feature map P3; and so on, from top to bottom, a first feature pyramid is constructed, which may include multiple second feature maps.
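The top-down construction just described can be sketched as follows. The dilated 3×3 plus 1×1 convolution on C4 and the 256-channel laterals follow the description above, but the channel counts of C2 to C4 and the dilation rate are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownPyramid(nn.Module):
    """Sketch of the first feature pyramid: reduce the topmost map C4 to
    256 channels, then repeatedly add 1x1-reduced laterals to the upsampled
    running map, yielding P4, P3, ... (the multiple second feature maps)."""
    def __init__(self, c_channels=(128, 256, 512), out_ch=256):
        super().__init__()
        self.top = nn.Sequential(
            nn.Conv2d(c_channels[-1], c_channels[-1], 3, padding=2, dilation=2),
            nn.Conv2d(c_channels[-1], out_ch, 1),
        )
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, out_ch, 1) for c in c_channels[:-1]]
        )

    def forward(self, c_maps):                      # e.g. [C2, C3, C4]
        p = self.top(c_maps[-1])                    # P4 from C4
        second_maps = [p]
        for i in range(len(c_maps) - 2, -1, -1):    # C3, then C2
            lateral = self.laterals[i](c_maps[i])   # 1x1 conv to 256 channels
            p = lateral + F.interpolate(p, size=c_maps[i].shape[2:], mode="nearest")
            second_maps.insert(0, p)                # prepend P3, P2
        return second_maps
```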
• texture details can be used to express the detailed information of small targets and edge features. Compared with the second feature maps, the first feature maps include more texture detail information, so detection results based on them for small-target detection have higher detection accuracy.
  • the position details can be information that expresses the position of the object in the image and the relative position between the objects.
  • multiple second feature maps can include more deep features.
  • Deep features contain rich semantic information, which has a good effect on classification tasks.
• deep features have larger receptive fields and thus a good detection effect for large targets; in one implementation, by introducing a top-down path to generate the multiple second feature maps, the rich semantic information contained in deep features can naturally be propagated downward, so that the second feature maps at each scale contain rich semantic information.
• according to at least one third feature map of the plurality of third feature maps, a first detection result of an object included in the image is output.
• the first feature map generation unit (such as a feature pyramid network) introduces a top-down path to propagate the rich semantic information contained in deep features downward, so that the second feature maps at each scale contain rich semantic information; deep features also have a large receptive field, which makes a better detection effect on large targets possible.
• however, the finer position detail information and texture detail information contained in the shallower feature maps are ignored, which has a great impact on the detection accuracy for medium and small targets.
• the second feature map generation unit introduces the shallow texture detail information of the original feature maps (the multiple first feature maps generated by the convolution processing unit) into the deep feature maps (the multiple second feature maps generated by the first feature map generation unit) to generate multiple third feature maps; using the third feature maps, which carry rich shallow texture detail information, as the input data of the detection unit for target detection can improve the detection accuracy of subsequent object detection.
• this embodiment does not mean that the detection accuracy of object detection will be higher for every image that includes small targets; rather, over a large number of samples, this embodiment can achieve higher comprehensive detection accuracy.
  • the plurality of second feature maps include more semantic information than the plurality of first feature maps.
  • the plurality of first feature maps, the plurality of second feature maps, and the plurality of third feature maps are feature maps with multi-scale resolution.
• the plurality of first feature maps include a first target feature map, the plurality of second feature maps include a second target feature map, and the plurality of third feature maps include a third target feature map and a fourth target feature map, where the resolution of the third target feature map is smaller than that of the fourth target feature map; the generating of a plurality of third feature maps according to the plurality of first feature maps and the plurality of second feature maps includes:
• down-sampling the third target feature map to obtain a fifth target feature map, where the fifth target feature map and the second target feature map have the same number of channels and resolution; performing down-sampling and convolution processing on the first target feature map to obtain a sixth target feature map, where the sixth target feature map has the same number of channels and resolution as the second target feature map; and superimposing the fifth target feature map, the second target feature map, and the sixth target feature map channel by channel to generate the fourth target feature map; or,
• down-sampling the third target feature map to obtain a fifth target feature map, where the fifth target feature map and the second target feature map have the same number of channels and resolution; down-sampling the first target feature map to obtain a sixth target feature map, where the sixth target feature map and the second target feature map have the same resolution; and performing channel superposition and convolution processing on the fifth target feature map, the second target feature map, and the sixth target feature map to generate the fourth target feature map, where the fourth target feature map has the same number of channels as the second target feature map.
  • the method further includes:
• the outputting of the first detection result of the object included in the image according to at least one third feature map of the plurality of third feature maps includes: outputting, according to the at least one fourth feature map, a first detection result of the object included in the image.
  • the method further includes:
• the fourth feature map is processed according to a first weight value to obtain a processed fourth feature map, and the fifth feature map is processed according to a second weight value to obtain a processed fifth feature map; where, when the object to be detected included in the fourth feature map is larger than the object to be detected included in the fifth feature map, the first weight value is greater than the second weight value;
• the outputting of the first detection result of the object included in the image according to at least one third feature map of the plurality of third feature maps includes: outputting, according to the processed fourth feature map and the processed fifth feature map, a first detection result of the object included in the image.
  • the first detection result includes a first detection frame
  • the method further includes:
• a second detection result of the first image is obtained, where the second detection result is obtained by performing object detection on the first image through a second perception network; the object detection accuracy of the first perception network is higher than that of the second perception network; the second detection result includes a second detection frame, and there is an intersection between the area where the second detection frame is located and the area where the first detection frame is located;
• if the ratio of the area of the intersection to the area of the first detection frame is less than a preset value, the second detection result is updated so that the updated second detection result includes the first detection frame.
  • the second detection result includes multiple detection frames, and there is an intersection between the area where each detection frame of the multiple detection frames is located and the area where the first detection frame is located.
• the multiple detection frames include the second detection frame, and, among the areas of the intersections between the area where each of the multiple detection frames is located and the area where the first detection frame is located, the area of the intersection between the area where the second detection frame is located and the area where the first detection frame is located is the smallest.
• the first image is an image frame in a video, the second image is an image frame in the video, and the frame distance between the first image and the second image in the video is less than a preset value.
  • the method further includes:
• a third detection result of the second image is acquired, where the third detection result includes a fourth detection frame and an object category corresponding to the fourth detection frame; if the shape difference and position difference between the fourth detection frame and the first detection frame are within a preset range, it is determined that the first detection frame corresponds to the object category corresponding to the fourth detection frame.
  • the detection confidence of the fourth detection frame is greater than a preset threshold.
  • FIG. 18 is a schematic diagram of a perceptual network training device provided by an embodiment of the application. As shown in Fig. 18, the perceptual network training device 1800 includes:
• An obtaining module 1801, configured to obtain a pre-labeled detection frame of a target object in an image, and to obtain a target detection frame corresponding to the image through the first perception network, the target detection frame being used to identify the target object;
• an iterative training module 1802, configured to iteratively train the first perception network according to the loss function to output the second perception network, where the loss function is related to the intersection over union (IoU) between the pre-labeled detection frame and the target detection frame.
  • the preset loss function is also related to the shape difference between the target detection frame and the pre-labeled detection frame, wherein the shape difference is negatively related to the area of the pre-labeled detection frame.
• the preset loss function is also related to the position difference between the target detection frame and the pre-labeled detection frame in the image, wherein the position difference is negatively correlated with the area of the pre-labeled detection frame; or the position difference is negatively correlated with the area of the smallest circumscribed rectangle of the convex hull of the pre-labeled detection frame and the target detection frame.
• the target detection frame includes a first corner point and a first center point, and the pre-labeled detection frame includes a second corner point and a second center point, where the first corner point and the second corner point are the two end points of a diagonal of a rectangle; the position difference is positively correlated with the positional difference between the first center point and the second center point in the image, and negatively correlated with the length between the first corner point and the second corner point.
• the preset loss function includes a target loss item related to the position difference, and the target loss item changes as the position difference changes; when the position difference is greater than a preset value, the rate of change of the target loss item is greater than a first preset rate of change; and/or, when the position difference is less than the preset value, the rate of change of the target loss item is less than a second preset rate of change.
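For concreteness, the regression loss reconstructed earlier can be rendered in PyTorch as below. This is a hedged sketch rather than the patent's exact loss: alpha and beta are placeholder weights, and the scale-dependent weighting of the aspect-ratio and distance items described earlier is omitted for brevity.

```python
import torch

def iou_regression_loss(pred, gt, alpha=1.0, beta=1.0, eps=1e-7):
    """Sketch of the described loss: an IoU item, an aspect-ratio item, and
    the ratio of the centre distance to the distance between the predicted
    bottom-right and real top-left corners. Boxes are (..., 4) tensors in
    (x1, y1, x2, y2) form; alpha and beta are assumed placeholder weights."""
    x1 = torch.max(pred[..., 0], gt[..., 0])
    y1 = torch.max(pred[..., 1], gt[..., 1])
    x2 = torch.min(pred[..., 2], gt[..., 2])
    y2 = torch.min(pred[..., 3], gt[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_g = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    iou = inter / (area_p + area_g - inter + eps)

    # aspect-ratio fit between the two frames
    ar_p = (pred[..., 2] - pred[..., 0]) / (pred[..., 3] - pred[..., 1] + eps)
    ar_g = (gt[..., 2] - gt[..., 0]) / (gt[..., 3] - gt[..., 1] + eps)
    v = (torch.atan(ar_g) - torch.atan(ar_p)) ** 2

    # centre distance over corner distance (pulls the centres together)
    cp = (pred[..., 0:2] + pred[..., 2:4]) / 2
    cg = (gt[..., 0:2] + gt[..., 2:4]) / 2
    d_center = ((cp - cg) ** 2).sum(-1).sqrt()
    d_corner = ((pred[..., 2:4] - gt[..., 0:2]) ** 2).sum(-1).sqrt()
    return 1 - iou + alpha * v + beta * d_center / (d_corner + eps)
```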
  • FIG. 19 is a schematic diagram of an object detection device provided by an embodiment of the application. As shown in FIG. 19, the object detection device 1900 includes:
• a receiving module 1901, configured to receive an input first image;
• a convolution processing module 1902, configured to perform convolution processing on the first image to generate multiple first feature maps;
• a first feature map generating module 1903, configured to generate multiple second feature maps according to the multiple first feature maps, where the multiple first feature maps include more texture details of the input image and/or location details in the input image than the multiple second feature maps;
  • the second feature map generating module 1904 is configured to generate multiple third feature maps according to the multiple first feature maps and the multiple second feature maps;
  • the detection module 1905 is configured to output a first detection result of an object included in the image according to at least one third feature map of the plurality of third feature maps.
  • the plurality of second feature maps include more semantic information than the plurality of first feature maps.
  • the plurality of first feature maps, the plurality of second feature maps, and the plurality of third feature maps are feature maps with multi-scale resolution.
  • the plurality of first feature maps include a first target feature map
  • the plurality of second feature maps include a second target feature map
• the plurality of third feature maps include a third target feature map and a fourth target feature map, where the resolution of the third target feature map is smaller than that of the fourth target feature map.
  • the second feature map generating module is specifically configured to:
• down-sampling the third target feature map to obtain a fifth target feature map, where the fifth target feature map and the second target feature map have the same number of channels and resolution; performing down-sampling and convolution processing on the first target feature map to obtain a sixth target feature map, where the sixth target feature map has the same number of channels and resolution as the second target feature map; and superimposing the fifth target feature map, the second target feature map, and the sixth target feature map channel by channel to generate the fourth target feature map; or,
• down-sampling the third target feature map to obtain a fifth target feature map, where the fifth target feature map and the second target feature map have the same number of channels and resolution; down-sampling the first target feature map to obtain a sixth target feature map, where the sixth target feature map and the second target feature map have the same resolution; and performing channel superposition and convolution processing on the fifth target feature map, the second target feature map, and the sixth target feature map to generate the fourth target feature map, where the fourth target feature map has the same number of channels as the second target feature map.
  • the device further includes:
• an intermediate feature extraction module, configured to perform convolution on at least one third feature map of the plurality of third feature maps through a first convolution layer to obtain at least one fourth feature map, and to perform convolution on the at least one third feature map through a second convolution layer to obtain at least one fifth feature map, where the receptive field corresponding to the fourth feature map is larger than the receptive field corresponding to the fifth feature map;
• the detection module is specifically configured to: output, according to the at least one fourth feature map, a first detection result of the object included in the image.
  • the device further includes:
• the fourth feature map is processed according to a first weight value to obtain a processed fourth feature map, and the fifth feature map is processed according to a second weight value to obtain a processed fifth feature map; where, when the object to be detected included in the fourth feature map is larger than the object to be detected included in the fifth feature map, the first weight value is greater than the second weight value;
• the detection module is specifically configured to: output, according to the processed fourth feature map and the processed fifth feature map, a first detection result of the object included in the image.
  • the first detection result includes a first detection frame
  • the acquisition module is further configured to:
• obtain a second detection result of the first image, where the second detection result is obtained by performing object detection on the first image through a second perception network; the object detection accuracy of the first perception network is higher than that of the second perception network; the second detection result includes a second detection frame, and there is an intersection between the area where the second detection frame is located and the area where the first detection frame is located;
• An update module, configured to update the second detection result if the ratio of the area of the intersection to the area of the first detection frame is less than a preset value, so that the updated second detection result includes the first detection frame.
  • the acquisition module is also used to:
• acquire a third detection result of the second image, where the third detection result includes a fourth detection frame and an object category corresponding to the fourth detection frame; and, if the shape difference and position difference between the fourth detection frame and the first detection frame are within a preset range, determine that the first detection frame corresponds to the object category corresponding to the fourth detection frame.
  • the detection confidence of the fourth detection frame is greater than a preset threshold.
• FIG. 20 is a schematic structural diagram of an execution device provided by an embodiment of this application. The execution device 2000 may be a tablet, a laptop, a smart wearable device, a monitoring data processing device, or the like, which is not limited here.
  • the object detection device described in the embodiment corresponding to FIG. 19 may be deployed on the execution device 2000 to implement the function of object detection in the embodiment corresponding to FIG. 19.
• the execution device 2000 includes: a receiver 2001, a transmitter 2002, a processor 2003, and a memory 2004 (the number of processors 2003 in the execution device 2000 may be one or more; one processor is taken as an example in FIG. 20), where the processor 2003 may include an application processor 20031 and a communication processor 20032.
  • the receiver 2001, the transmitter 2002, the processor 2003, and the memory 2004 may be connected by a bus or other methods.
  • the memory 2004 may include a read-only memory and a random access memory, and provides instructions and data to the processor 2003. A part of the memory 2004 may also include a non-volatile random access memory (NVRAM).
• the memory 2004 stores operating instructions, executable modules or data structures, or a subset of them, or an extended set of them.
  • the operating instructions may include various operating instructions for implementing various operations.
  • the processor 2003 controls the operation of the execution device.
  • the various components of the execution device are coupled together through a bus system, where the bus system may include a power bus, a control bus, and a status signal bus in addition to a data bus.
• the various buses are referred to as the bus system in the figure.
  • the method disclosed in the above embodiments of the present application may be applied to the processor 2003 or implemented by the processor 2003.
  • the processor 2003 may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by an integrated logic circuit of hardware in the processor 2003 or instructions in the form of software.
• the aforementioned processor 2003 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor 2003 can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
• the software module can be located in a mature storage medium in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 2004, and the processor 2003 reads the information in the memory 2004, and completes the steps of the above method in combination with its hardware.
  • the receiver 2001 can be used to receive input digital or character information, and generate signal input related to the relevant settings and function control of the execution device.
• the transmitter 2002 can be used to output digital or character information through a first interface; the transmitter 2002 can also be used to send instructions to a disk group through the first interface to modify data in the disk group; and the transmitter 2002 can also include a display device such as a display screen.
  • the processor 2003 is configured to execute the image processing method executed by the execution device in the embodiment corresponding to FIG. 9 to FIG. 11.
  • the application processor 20031 is configured to execute the object detection method in the foregoing embodiment.
  • FIG. 21 is a schematic structural diagram of a training device provided in an embodiment of this application.
• the perceptual network training device described in the embodiment corresponding to FIG. 18 may be deployed on the training device 2100 to realize the function of the perceptual network training device in the embodiment corresponding to FIG. 18.
• the training device 2100 is implemented by one or more servers, and may differ greatly due to different configurations or performance. It may include one or more central processing units (CPU) 2121 (for example, one or more processors), a memory 2132, and one or more storage media 2130 (for example, one or more mass storage devices).
  • the memory 2132 and the storage medium 2130 may be short-term storage or persistent storage.
  • the program stored in the storage medium 2130 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the training device.
  • the central processing unit 2121 may be configured to communicate with the storage medium 2130, and execute a series of instruction operations in the storage medium 2130 on the training device 2100.
• the training device 2100 may also include one or more power supplies 2126, one or more wired or wireless network interfaces 2150, one or more input/output interfaces 2158, and/or one or more operating systems 2141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
  • the central processing unit 2121 is configured to execute the relevant steps of the perceptual network training method in the foregoing embodiment.
  • the embodiments of the present application also provide a product including a computer program, which when running on a computer, causes the computer to execute the steps executed by the aforementioned execution device, or causes the computer to execute the steps executed by the aforementioned training device.
• the embodiments of the present application also provide a computer-readable storage medium that stores a program for signal processing; when the program runs on a computer, it causes the computer to execute the steps performed by the aforementioned execution device, or causes the computer to execute the steps performed by the aforementioned training device.
  • the execution device, training device, or terminal device provided by the embodiments of the present application may specifically be a chip.
  • the chip includes a processing unit and a communication unit.
• the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, pins, or circuits.
  • the processing unit can execute the computer-executable instructions stored in the storage unit to make the chip in the execution device execute the data processing method described in the foregoing embodiment, or to make the chip in the training device execute the data processing method described in the foregoing embodiment.
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
• the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
  • Fig. 22 is a schematic diagram of a structure of a chip provided by an embodiment of the application.
• the chip may be a neural network processor (NPU), which is mounted on a host CPU as a coprocessor, and the Host CPU assigns tasks.
  • the core part of the NPU is the arithmetic circuit 2203, and the controller 2204 controls the arithmetic circuit 2203 to extract matrix data from the memory and perform multiplication operations.
  • the arithmetic circuit 2203 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 2203 is a two-dimensional systolic array. The arithmetic circuit 2203 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2203 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the corresponding data of matrix B from the weight memory 2202 and caches it on each PE in the arithmetic circuit.
• the arithmetic circuit fetches the data of matrix A from the input memory 2201, performs matrix operations with matrix B, and stores partial results or final results of the obtained matrix in an accumulator 2208.
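The data flow through the arithmetic circuit and accumulator can be illustrated conceptually with a tiled matrix multiplication in NumPy; this models only the accumulate-partial-results behaviour, not the systolic-array hardware itself, and the tile size is arbitrary.

```python
import numpy as np

def tiled_matmul(A, B, tile=16):
    """Conceptual illustration of the arithmetic-circuit data flow: B tiles
    are cached (as on the PEs), A tiles stream in, and partial products are
    accumulated into the result, mirroring the role of the accumulator."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=np.result_type(A, B))  # plays the accumulator
    for kk in range(0, k, tile):
        A_t = A[:, kk:kk + tile]      # fetched from the input memory
        B_t = B[kk:kk + tile, :]      # fetched from the weight memory
        C += A_t @ B_t                # partial results accumulated
    return C
```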
  • the unified memory 2206 is used to store input data and output data.
• the weight data is transferred directly to the weight memory 2202 through the direct memory access controller (DMAC) 2205.
  • the input data is also transferred to the unified memory 2206 through the DMAC.
• the BIU (bus interface unit 2210) is used for the interaction of the AXI bus with the DMAC and the instruction fetch buffer (IFB) 2209.
  • the bus interface unit 2210 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 2209 to obtain instructions from the external memory, and is also used for the storage unit access controller 2205 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 2206 or to transfer the weight data to the weight memory 2202 or to transfer the input data to the input memory 2201.
  • the vector calculation unit 2207 includes a plurality of arithmetic processing units. If necessary, further processing is performed on the output of the arithmetic circuit 2203, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on. It is mainly used in the calculation of non-convolutional/fully connected layer networks in neural networks, such as Batch Normalization, pixel-level summation, and upsampling of feature planes.
  • the vector calculation unit 2207 can store the processed output vector to the unified memory 2206.
• the vector calculation unit 2207 may apply a linear function or a nonlinear function to the output of the arithmetic circuit 2203, for example, performing linear interpolation on the feature planes extracted by the convolutional layers, or, for another example, applying a nonlinear function to a vector of accumulated values to generate activation values.
  • the vector calculation unit 2207 generates normalized values, pixel-level summed values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 2203, for example for use in a subsequent layer in a neural network.
  • the instruction fetch buffer 2209 connected to the controller 2204 is used to store instructions used by the controller 2204;
  • the unified memory 2206, the input memory 2201, the weight memory 2202, and the fetch memory 2209 are all On-Chip memories.
  • the external memory is private to the NPU hardware architecture.
  • the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
• the device embodiments described above are only illustrative; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units.
  • the physical unit can be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the connection relationship between the modules indicates that they have a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
• this application can be implemented by means of software plus the necessary general-purpose hardware; it can also be implemented by dedicated hardware, including dedicated integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and so on. In general, all functions completed by computer programs can easily be implemented with corresponding hardware, and the specific hardware structures used to achieve the same function can be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, a software program implementation is the better implementation in most cases.
• the technical solution of this application, or the part that contributes to the existing technology, can essentially be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes several instructions to make a computer device (which can be a personal computer, a training device, or a network device, etc.) execute the methods described in the various embodiments of this application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
• the computer instructions may be transmitted from a website, computer, training device, or data center to another website, computer, training device, or data center through wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
• the computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a training device or a data center that integrates one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

Abstract

Disclosed in the present application are a data processing system, an object detection method and an apparatus thereof, which are applied to the field of artificial intelligence. In the present application, a second feature map generation unit introduces texture detail information of a shallow layer of an original feature map (multiple first feature maps generated by a convolutional processing unit) into a deep feature map (multiple second feature maps generated by a first feature map generation unit), so as to generate multiple third feature maps, and the third feature maps having rich texture detail information of the shallow layers are used as input data for a detection unit to perform target detection, and the detection accuracy of subsequent object detection can be improved.

Description

Data processing system, object detection method and device thereof

This application claims priority to Chinese patent application No. 202010362601.2, filed with the Chinese Patent Office on April 30, 2020 and entitled "Data processing system, object detection method and device thereof", the entire content of which is incorporated in this application by reference.

Technical field

This application relates to the field of artificial intelligence, and in particular to a data processing system, an object detection method, and a device thereof.

Background

Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis, and the military. It is the study of how to use cameras/video cameras and computers to obtain the data and information we need about a photographed subject. Vividly speaking, it equips the computer with eyes (a camera or video camera) and a brain (algorithms) to identify, track, and measure targets in place of human eyes, so that the computer can perceive the environment. Because perception can be seen as extracting information from sensory signals, computer vision can also be seen as the science of how to make artificial systems "perceive" from images or multi-dimensional data. In general, computer vision uses various imaging systems in place of the visual organs to obtain input information, and then the computer takes the place of the brain to process and interpret that input. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision as humans do, with the ability to adapt to the environment autonomously.

A perception network can be a neural network model that processes and analyzes images and obtains processing results. At present, perception networks can complete more and more functions, for example, image classification, 2D detection, semantic segmentation, key point detection, linear object detection (such as lane line or stop line detection in automatic driving technology), drivable area detection, and so on. In addition, a visual perception system has the characteristics of low cost, non-contact operation, small size, and a large amount of information. With the continuous improvement of the accuracy of visual perception algorithms, they have become a key technology in many of today's artificial intelligence systems and are more and more widely applied, for example: the recognition of dynamic obstacles (people or cars) and static objects (traffic lights, traffic signs, or traffic cones) on the road in advanced driving assistant systems (ADAS) and autonomous driving systems (ADS), and achieving slimming effects by recognizing human body masks and key points in the photo beautification function of terminal vision.

Perception networks usually include a feature pyramid network (FPN), which introduces a top-to-bottom network structure as well as lateral connection branches drawn from the original feature extraction network, fusing the corresponding-resolution feature maps of the original feature network with up-sampled deep feature maps. The top-to-bottom structure introduced in the FPN has a large receptive field, but its detection accuracy for small objects is low.
发明内容Summary of the invention
第一方面,本申请提供了一种物体检测方法,所述方法用于第一感知网络,所述方法包括:In the first aspect, the present application provides an object detection method, the method is used in a first perception network, and the method includes:
接收输入的第一图像,并对所述第一图像进行卷积处理,以生成多个第一特征图;Receiving an input first image, and performing convolution processing on the first image to generate a plurality of first feature maps;
需要说明的是,这里的“对所述输入的图像进行卷积处理”,不应理解为,仅仅对对所述输入的图像进行卷积处理,在一些实现中,可以对所述输入的图像进行卷积处理、池化操作等等。It should be noted that “convolution processing on the input image” here should not be understood as only performing convolution processing on the input image. In some implementations, the input image can be Perform convolution processing, pooling operations, and so on.
需要说明的是,这里的“对所述第一图像进行卷积处理,以生成多个第一特征图”,不应仅理解为,对所述第一图像进行多次卷积处理,每次卷积处理可以生成一个第一特征图,即不应该理解为每张第一特征图都是基于对第一图像进行卷积处理得到的,而是,从整体上来看,第一图像是多个第一特征图的来源;在一种实现中,可以对所述第一图像进行卷积处理得到一个第一特征图,之后可以对生成的第一特征图进行卷积处理,得到另一个第一特征图,以此类推,就可以得到多个第一特征图。It should be noted that “convolution processing on the first image to generate multiple first feature maps” here should not only be understood as performing multiple convolution processing on the first image each time Convolution processing can generate a first feature map, that is, it should not be understood that each first feature map is obtained based on convolution processing on the first image, but, on the whole, the first image is multiple The source of the first feature map; in one implementation, the first image can be convolved to obtain a first feature map, and then the generated first feature map can be convolved to obtain another first feature map. Feature maps, and so on, can get multiple first feature maps.
需要说明的是,可以是对所述输入的图像进行一系列的卷积处理,具体的,在每次卷积处理时,可以是对前一次卷积处理得到的第一特征图进行卷积处理,进而得到一个第一特征图,通过上述方式,可以得到多个第一特征图。It should be noted that a series of convolution processing may be performed on the input image. Specifically, in each convolution processing, the first feature map obtained by the previous convolution processing may be subjected to convolution processing. , And then obtain a first feature map, and multiple first feature maps can be obtained by the above method.
需要说明的是,多个第一特征图可以是具有多尺度分辨率的特征图,即多个第一特征图并不是分辨率相同的特征图,在一种可选实现中,多个第一特征图可以构成一个特征金字塔。It should be noted that the multiple first feature maps may be feature maps with multi-scale resolution, that is, the multiple first feature maps are not feature maps with the same resolution. In an optional implementation, the multiple first feature maps The feature map can form a feature pyramid.
其中,可以接收输入的图像,并对所述输入的图像进行卷积处理,生成具有多尺度分辨率的多个第一特征图;卷积处理单元可以对输入的图像进行一系列的卷积处理,得到在不同的尺度(具有不同分辨率)下的特征图(feature map)。卷积处理单元可以采用多种形式,比如视觉几何组(visual geometry group,VGG)、残差神经网络(residual neural network,resnet)、GoogLeNet的核心结构(Inception-net)等。The input image may be received, and the input image may be subjected to convolution processing to generate multiple first feature maps with multi-scale resolution; the convolution processing unit may perform a series of convolution processing on the input image , To obtain feature maps at different scales (with different resolutions). The convolution processing unit can take many forms, such as visual geometry group (VGG), residual neural network (residual neural network, resnet), GoogLeNet core structure (Inception-net), and so on.
generating a plurality of second feature maps according to the plurality of first feature maps, wherein the plurality of first feature maps include more texture details of the input image and/or more position details of the input image than the plurality of second feature maps.
It should be noted that "generating a plurality of second feature maps according to the plurality of first feature maps" should not be understood as meaning that every one of the second feature maps is generated from all of the first feature maps. In one implementation, some of the second feature maps are generated directly from one or more of the first feature maps. In another implementation, some of the second feature maps are generated directly from one or more of the first feature maps together with second feature maps other than themselves. In yet another implementation, some of the second feature maps are generated directly from second feature maps other than themselves; in this case, since those other second feature maps are themselves generated from one or more of the first feature maps, the process can still be understood as generating the plurality of second feature maps according to the plurality of first feature maps.
It should be noted that the plurality of second feature maps may be feature maps with multi-scale resolutions, that is, the plurality of second feature maps do not all have the same resolution. In an optional implementation, the plurality of second feature maps may form a feature pyramid.
A convolution operation may be performed on the topmost feature map C4 among the plurality of first feature maps generated by the convolution processing unit. For example, dilated convolution and 1×1 convolution may be used to reduce the number of channels of the topmost feature map C4 to 256, yielding the topmost feature map P4 of the feature pyramid. The output of the next-lower feature map C3 is laterally connected and its channel count is reduced to 256 by a 1×1 convolution, after which it is added channel by channel and pixel by pixel to feature map P4 to obtain feature map P3. Proceeding in this manner from top to bottom builds a first feature pyramid, which may include the plurality of second feature maps.
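As a rough sketch of the top-down construction described above (not this application's exact pyramid), the following PyTorch module reduces each first feature map to 256 channels with a 1×1 lateral convolution and then adds the upsampled deeper map channel by channel and pixel by pixel. The dilated convolution on C4 is omitted and nearest-neighbor upsampling is assumed.

```python
# A minimal top-down pyramid sketch: lateral 1x1 convs to 256 channels,
# followed by upsample-and-add from the topmost map downwards.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )

    def forward(self, feats):               # feats = [C1, C2, C3, C4]
        laterals = [l(c) for l, c in zip(self.laterals, feats)]
        p = [None] * len(laterals)
        p[-1] = laterals[-1]                # P4 from the topmost map C4
        for i in range(len(laterals) - 2, -1, -1):
            up = F.interpolate(p[i + 1], size=laterals[i].shape[-2:],
                               mode="nearest")
            p[i] = laterals[i] + up         # channel-wise, pixel-wise add
        return p                            # [P1, P2, P3, P4]

feats = [torch.randn(1, c, s, s)
         for c, s in zip((64, 128, 256, 512), (128, 64, 32, 16))]
pyramid = TopDownFPN()(feats)
```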
The texture details can express the detailed information of small objects and edge features. Compared with the second feature maps, the first feature maps include more texture detail information, so that detection results based on them have higher detection accuracy for small objects. The position details can be information expressing the positions of objects in the image and the relative positions between objects.
Compared with the first feature maps, the plurality of second feature maps may include more deep features. Deep features contain rich semantic information, which works well for classification tasks, and they have large receptive fields, which gives good detection performance on large objects. In one implementation, the plurality of second feature maps are generated by introducing a top-down path, which naturally propagates the rich semantic information contained in the deep features downwards, so that the second feature maps at all scales contain rich semantic information.
generating a plurality of third feature maps according to the plurality of first feature maps and the plurality of second feature maps.
It should be noted that "generating a plurality of third feature maps according to the plurality of first feature maps and the plurality of second feature maps" should not be understood as meaning that every one of the third feature maps is generated from all of the first feature maps and all of the second feature maps. Rather, viewed as a whole, the plurality of first feature maps and the plurality of second feature maps are the sources of the plurality of third feature maps. In one implementation, some of the third feature maps are generated from one or more of the first feature maps and one or more of the second feature maps. In another implementation, some of the third feature maps are generated from one or more of the first feature maps, one or more of the second feature maps, and third feature maps other than themselves. In yet another implementation, some of the third feature maps are generated from third feature maps other than themselves.
It should be noted that the plurality of third feature maps may be feature maps with multi-scale resolutions, that is, the plurality of third feature maps do not all have the same resolution. In an optional implementation, the plurality of third feature maps may form a feature pyramid.
outputting, according to at least one third feature map of the plurality of third feature maps, a first detection result for an object included in the image.
In one implementation, the object may be a person, an animal, a plant, an article, or the like.
In one implementation, object detection may be performed on the image according to at least one third feature map of the plurality of third feature maps, where the purpose of object detection is to identify the categories of the objects included in the image, the positions of the objects, and so on.
In one existing implementation, a second feature map generation unit (for example, a feature pyramid network) introduces a top-down path to propagate the rich semantic information contained in deep features downwards, so that the second feature maps at all scales contain rich semantic information; the deep features also have large receptive fields, which gives good detection performance on large objects. However, that implementation ignores the finer position detail information and texture detail information contained in the shallower feature maps, which strongly affects the detection accuracy for medium and small objects. In the embodiments of this application, the second feature map generation unit introduces the shallow texture detail information of the original feature maps (the plurality of first feature maps generated by the convolution processing unit) into the deep feature maps (the plurality of second feature maps generated by the first feature map generation unit) to generate the plurality of third feature maps. Using these third feature maps, which carry rich shallow texture detail information, as the input data of the detection unit for target detection can improve the detection accuracy of subsequent object detection.
It should be noted that this does not mean that the detection accuracy of object detection is higher for every image that includes small objects; rather, over a large number of samples, this embodiment can achieve higher overall detection accuracy.
It should be noted that the above object detection method may be implemented by a data processing system, for example a trained perception network, where the perception network may include a convolution processing unit, a first feature map generation unit, a second feature map generation unit, and a detection unit. The convolution processing unit is connected to the first feature map generation unit and to the second feature map generation unit, the first feature map generation unit is connected to the second feature map generation unit, and the second feature map generation unit is connected to the detection unit. The convolution processing unit is configured to receive an input image and perform convolution processing on the input image to generate a plurality of first feature maps. The first feature map generation unit is configured to generate a plurality of second feature maps according to the plurality of first feature maps, the plurality of first feature maps including more texture details of the input image and/or more position details of the input image than the plurality of second feature maps. The second feature map generation unit is configured to generate a plurality of third feature maps according to the plurality of first feature maps and the plurality of second feature maps. The detection unit is configured to output a detection result for an object included in the image according to at least one third feature map of the plurality of third feature maps.
In an implementation, the perception network may include a backbone network, a first feature pyramid network (FPN), a second FPN, and a head. The backbone network is connected to the first FPN and to the second FPN, the first FPN is connected to the second FPN, and the second FPN is connected to the head (here, the convolution processing unit is the backbone network, the first feature map generation unit is the first FPN, the second feature map generation unit is the second FPN, and the detection unit is the head).
The backbone network may be configured to receive an input image and perform convolution processing on it to generate a plurality of first feature maps with multi-scale resolutions; that is, the backbone network performs a series of convolution operations on the input image to obtain feature maps at different scales (with different resolutions). The backbone network may take many forms, such as a visual geometry group (VGG) network, a residual neural network (ResNet), or the core structure of GoogLeNet (Inception-net).
The first FPN may be configured to generate a first feature pyramid according to the plurality of first feature maps, the first feature pyramid including a plurality of second feature maps with multi-scale resolutions. A convolution operation is performed on the topmost feature map C4 generated by the backbone network; for example, dilated convolution and 1×1 convolution may be used to reduce the number of channels of C4 to 256, yielding the topmost feature map P4 of the feature pyramid. The output of the next-lower feature map C3 is laterally connected and its channel count is reduced to 256 by a 1×1 convolution, after which it is added channel by channel and pixel by pixel to P4 to obtain feature map P3; proceeding in this manner from top to bottom builds the first feature pyramid.
The second FPN may be configured to generate a second feature pyramid according to the plurality of first feature maps and the plurality of second feature maps, the second feature pyramid including a plurality of third feature maps with multi-scale resolutions.
The head is configured to detect a target object in the image according to at least one third feature map of the plurality of third feature maps and output a detection result.
In the embodiments of this application, the second FPN introduces the rich shallow detail information (edges, textures, and the like) of the original feature maps (the plurality of first feature maps generated by the backbone network) into the deep feature maps (the plurality of second feature maps generated by the first FPN) to generate the second feature pyramid. Using the third feature maps, which carry this rich shallow edge and texture detail information, as the input data of the head for target detection can improve the detection accuracy of subsequent object detection.
In an optional implementation, the plurality of first feature maps include a first target feature map, the plurality of second feature maps include a second target feature map, and the plurality of third feature maps include a third target feature map and a fourth target feature map, the resolution of the third target feature map being smaller than that of the fourth target feature map. Generating the plurality of third feature maps according to the plurality of first feature maps and the plurality of second feature maps includes:
downsampling the third target feature map to obtain a fifth target feature map, the fifth target feature map having the same number of channels and the same resolution as the second target feature map; downsampling and convolving the first target feature map to obtain a sixth target feature map, the sixth target feature map having the same number of channels and the same resolution as the second target feature map; and superimposing the fifth target feature map, the second target feature map, and the sixth target feature map channel by channel to generate the fourth target feature map;
In this embodiment of the application, the third target feature map may be downsampled and convolved to obtain the fifth target feature map, where the purpose of the downsampling is to make the resolution of each channel of the fifth target feature map the same as that of the second target feature map, and the purpose of the convolution is to make the number of channels of the fifth target feature map the same as that of the second target feature map. In this way, the fifth target feature map and the second target feature map have the same resolution and number of channels, so the fifth target feature map, the sixth target feature map, and the second target feature map can be added channel by channel to obtain the fourth target feature map.
Or: downsampling the third target feature map to obtain a fifth target feature map, the fifth target feature map having the same number of channels and the same resolution as the second target feature map; downsampling the first target feature map to obtain a sixth target feature map, the sixth target feature map having the same resolution as the second target feature map; and performing channel stacking and convolution on the fifth target feature map, the second target feature map, and the sixth target feature map to generate the fourth target feature map, the fourth target feature map having the same number of channels as the second target feature map.
In the embodiments of this application, the expression "channel superposition" can be understood as superimposing (for example, by addition) the corresponding elements (that is, elements at the same position) in the feature maps of corresponding channels (that is, channels carrying the same semantic information).
In the embodiments of this application, the second target feature map may be a feature map including multiple channels, and the feature map corresponding to each channel may be a feature map that includes one kind of semantic information.
In the embodiments of this application, the third target feature map may be downsampled to obtain the fifth target feature map, where the purpose of the downsampling is to make the resolution of each channel of the fifth target feature map the same as that of the second target feature map; in this way, the fifth target feature map and the second target feature map have the same resolution and can be added channel by channel. The first target feature map may be downsampled to obtain the sixth target feature map, where the purpose of the downsampling is to make the resolution of each channel of the sixth target feature map the same as that of the second target feature map. In this way, the fifth target feature map, the sixth target feature map, and the second target feature map all have the same resolution and can be added channel by channel, after which convolution processing is applied so that the resulting fourth target feature map has the same number of channels as the second target feature map; the channel stacking here may be a concatenation operation.
In this embodiment, a third target feature map and a fourth target feature map with different resolutions can be generated, the resolution of the third target feature map being smaller than that of the fourth target feature map, where the fourth target feature map, with the larger resolution, is generated based on one of the first feature maps, one of the second feature maps, and the third target feature map. The plurality of third feature maps generated by the second feature map generation unit in this way retain the advantages of the feature pyramid network while adding a bottom-up path (generating feature maps of larger resolution in turn from feature maps of smaller resolution), which introduces the rich texture detail information and/or position detail information of the shallow layers of the neural network into the deep convolutional layers; a detection network using third feature maps generated in this way achieves higher detection accuracy on small objects.
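A minimal sketch of one such fusion step is given below, assuming nearest-neighbor interpolation for the downsampling and a 1×1 convolution for channel alignment; the tensor sizes are illustrative, and the element-wise addition variant is shown.

```python
# A minimal sketch (one illustrative variant, not this application's
# exact procedure): fuse the previously generated third feature map and
# a shallow first feature map into a second feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_step(third_prev, first_shallow, second, align_conv):
    # downsample the previous third feature map to the second map's size
    n5 = F.interpolate(third_prev, size=second.shape[-2:], mode="nearest")
    # downsample + 1x1 conv the shallow first feature map so that its
    # resolution and channel count match the second feature map
    n6 = align_conv(F.interpolate(first_shallow, size=second.shape[-2:],
                                  mode="nearest"))
    # channel-by-channel, element-wise superposition
    return n5 + second + n6

align = nn.Conv2d(64, 256, kernel_size=1)
third_prev = torch.randn(1, 256, 64, 64)
first_shallow = torch.randn(1, 64, 128, 128)
second = torch.randn(1, 256, 32, 32)
fourth = fuse_step(third_prev, first_shallow, second, align)
print(fourth.shape)  # torch.Size([1, 256, 32, 32])
```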
In an implementation, the method further includes:
convolving at least one third feature map of the plurality of third feature maps through a first convolution layer to obtain at least one fourth feature map;
and convolving the at least one third feature map of the plurality of third feature maps to obtain at least one fifth feature map, wherein the receptive field corresponding to the fourth feature map is larger than the receptive field corresponding to the fifth feature map;
correspondingly, outputting the first detection result for the object included in the image according to the at least one third feature map of the plurality of third feature maps includes:
outputting the first detection result for the object included in the image according to the at least one fourth feature map and the at least one fifth feature map.
In this embodiment, the fourth feature map and the fifth feature map obtained by processing the third feature map have different receptive fields. Because feature maps with different receptive fields are available, objects of different sizes can be detected adaptively. For example, the third feature map may be processed by convolution layers with different dilation rates, so that the processing results include information on both large objects and small objects, allowing the subsequent detection process to detect large objects as well as small objects.
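The following sketch illustrates this idea with two parallel 3×3 convolutions over the same third feature map, one with dilation rate 2 (larger receptive field, yielding the fourth feature map) and one with dilation rate 1 (yielding the fifth feature map); the kernel size and channel count are assumptions.

```python
# A minimal sketch of two parallel branches with different dilation
# rates; padding equals dilation so both outputs keep the input size.
import torch
import torch.nn as nn

class DualReceptiveField(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.large_rf = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.small_rf = nn.Conv2d(channels, channels, 3, padding=1, dilation=1)

    def forward(self, third):
        fourth = self.large_rf(third)  # larger receptive field
        fifth = self.small_rf(third)   # smaller receptive field
        return fourth, fifth

fourth, fifth = DualReceptiveField()(torch.randn(1, 256, 32, 32))
```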
In an implementation, the method further includes:
processing the fourth feature map according to a first weight value to obtain a processed fourth feature map;
It should be noted that the first weight value may be applied to the fourth feature map through a channel-wise multiplication operation or other numerical processing, so that the elements of the fourth feature map are amplified by a corresponding gain.
processing the fifth feature map according to a second weight value to obtain a processed fifth feature map. It should be noted that the second weight value may be applied to the fifth feature map through a channel-wise multiplication operation or other numerical processing, so that the elements of the fifth feature map are amplified by a corresponding gain.
wherein, in a case where the object to be detected included in the fourth feature map is larger than the object to be detected included in the fifth feature map, the first weight value is greater than the second weight value;
correspondingly, outputting the first detection result for the object included in the image according to the at least one third feature map of the plurality of third feature maps includes:
outputting the first detection result for the object included in the image according to the processed fourth feature map and the processed fifth feature map.
In this embodiment, if the object to be detected included in the fourth feature map is larger than the object to be detected included in the fifth feature map, then correspondingly the first weight value is greater than the second weight value, and the processed fourth feature map receives a larger gain than the processed fifth feature map. Because the receptive field corresponding to the fourth feature map is larger than that corresponding to the fifth feature map, and a feature map with a larger receptive field carries more information about large objects, target detection that uses it achieves higher detection accuracy on large objects. In this embodiment, when the image contains a larger object, the fourth feature map is amplified more than the fifth feature map, so when the detection unit performs target detection on the image based on the processed fourth feature map and the processed fifth feature map, the overall receptive field is larger and the detection accuracy is correspondingly higher.
In this embodiment, through training, an intermediate feature extraction layer can learn the rule for determining the weight values: for a feature map that includes a large object, the first weight value determined for the first convolution layer is larger and the second weight value determined for the second convolution layer is smaller; for a feature map that includes a small object, the first weight value determined for the first convolution layer is smaller and the second weight value determined for the second convolution layer is larger.
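As an illustrative assumption (this application does not fix the exact sub-network that predicts the weights), the sketch below derives the first and second weight values from the input feature map using global average pooling, a linear layer, and a softmax, and applies them as channel-wise gains to the two branches.

```python
# A minimal sketch, assuming a squeeze-and-excitation style predictor:
# one scalar weight per branch, learned from the third feature map.
import torch
import torch.nn as nn

class BranchWeighting(nn.Module):
    def __init__(self, channels=256, branches=2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, branches)

    def forward(self, third, fourth, fifth):
        w = torch.softmax(self.fc(self.pool(third).flatten(1)), dim=1)
        w1 = w[:, 0].view(-1, 1, 1, 1)   # first weight value
        w2 = w[:, 1].view(-1, 1, 1, 1)   # second weight value
        return w1 * fourth, w2 * fifth   # channel-wise gains

m = BranchWeighting()
f4, f5 = m(torch.randn(1, 256, 32, 32),
           torch.randn(1, 256, 32, 32),
           torch.randn(1, 256, 32, 32))
```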
In an implementation, the method further includes:
performing dilated convolution processing on at least one third feature map of the plurality of third feature maps;
detecting the target object in the image according to the at least one third feature map after the dilated convolution processing, and outputting the first detection result.
In the embodiments of this application, a 3×3 convolution in the region proposal network (RPN) acts as a sliding window. By moving this convolution kernel over the at least one third feature map, the subsequent intermediate layer, classification layer, and bounding box regression layer can determine whether a target exists within each anchor box and the difference between the predicted box and the ground-truth box; training the region proposal network yields better box extraction results. In this embodiment, the 3×3 sliding-window convolution kernel is replaced with a 3×3 dilated convolution kernel: dilated convolution processing is performed on at least one third feature map, the target object in the image is detected according to the at least one third feature map after the dilated convolution processing, and the detection result is output. This enlarges the receptive field without increasing the amount of computation and reduces missed detections of large objects and partially occluded objects.
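A minimal sketch of such an RPN head is shown below, with the 3×3 sliding-window convolution replaced by a 3×3 dilated convolution; the anchor count, channel width, and dilation rate of 2 are assumptions.

```python
# A minimal RPN-head sketch with a dilated 3x3 sliding-window conv;
# dilation=2 enlarges the receptive field at the same FLOPs.
import torch
import torch.nn as nn

class DilatedRPNHead(nn.Module):
    def __init__(self, channels=256, num_anchors=9):
        super().__init__()
        self.slide = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.cls = nn.Conv2d(channels, num_anchors, 1)       # objectness
        self.reg = nn.Conv2d(channels, num_anchors * 4, 1)   # box deltas

    def forward(self, feat):
        h = torch.relu(self.slide(feat))
        return self.cls(h), self.reg(h)

scores, deltas = DilatedRPNHead()(torch.randn(1, 256, 32, 32))
```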
In an implementation, the first detection result includes a first detection frame, and the method further includes:
acquiring a second detection result of the first image, the second detection result being obtained by performing object detection on the first image through a second perception network, the object detection accuracy of the first perception network being higher than that of the second perception network, the second detection result including a second detection frame, and an intersection existing between the region where the second detection frame is located and the region where the first detection frame is located;
if the ratio of the area of the intersection to the area of the first detection frame is less than a preset value, updating the second detection result so that the updated second detection result includes the first detection frame.
In the embodiments of this application, if the ratio of the area of the first intersection to the area of the first detection frame is less than the preset value, it can be considered that the first detection frame was missed in the second detection result, and the second detection result can then be updated so that the updated second detection result includes the first detection frame. Introducing temporal characteristics into the model to assist in judging whether a suspected missed detection is a true missed detection, and judging the category of the missed detection, improves verification efficiency.
In an implementation, the second detection result includes a plurality of detection frames, an intersection exists between the region where each of the plurality of detection frames is located and the region where the first detection frame is located, and the plurality of detection frames include the second detection frame, wherein, among the areas of the intersections between the region of each of the plurality of detection frames and the region of the first detection frame, the area of the intersection between the region of the second detection frame and the region of the first detection frame is the smallest.
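For illustration, the following sketch implements the missed-detection check described above on (x1, y1, x2, y2) boxes: among the boxes of the second result that intersect the first detection frame, the smallest intersection is compared against the first frame's area, and the first detection frame is added when the ratio falls below the preset value. The preset value of 0.5 is an assumption.

```python
# A minimal sketch of the missed-detection update described above.

def intersection_area(a, b):
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def update_result(first_box, second_result, preset=0.5):
    area1 = (first_box[2] - first_box[0]) * (first_box[3] - first_box[1])
    overlaps = [intersection_area(first_box, b) for b in second_result]
    overlaps = [o for o in overlaps if o > 0]   # boxes intersecting it
    # if no box overlaps, or even the smallest overlap ratio is below the
    # preset value, treat the first detection frame as a missed detection
    if not overlaps or min(overlaps) / area1 < preset:
        second_result.append(first_box)
    return second_result

boxes = update_result((10, 10, 50, 50), [(100, 100, 150, 150)])
```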
In an implementation, the first image is an image frame in a video, the second image is an image frame in the video, the frame distance between the first image and the second image in the video is less than a preset value, and the method further includes:
acquiring a third detection result of the second image, the third detection result including a fourth detection frame and an object category corresponding to the fourth detection frame;
in a case where the shape difference and the position difference between the fourth detection frame and the first detection frame are within a preset range, the first detection frame corresponds to the object category corresponding to the fourth detection frame.
In an implementation, the detection confidence of the fourth detection frame is greater than a preset threshold.
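The sketch below illustrates this temporal check under simple assumptions: the shape difference is measured as the difference in box widths and heights, the position difference as the distance between box centers, and the tolerances and confidence threshold are placeholder values, none of which are fixed by this application.

```python
# A minimal sketch: a confident box from a nearby frame whose shape and
# position are close enough lends its category to the first detection frame.

def inherit_category(first_box, fourth_box, category, confidence,
                     pos_tol=20.0, shape_tol=20.0, conf_thr=0.8):
    if confidence <= conf_thr:
        return None
    cx = lambda b: (b[0] + b[2]) / 2
    cy = lambda b: (b[1] + b[3]) / 2
    pos_diff = abs(cx(first_box) - cx(fourth_box)) + \
               abs(cy(first_box) - cy(fourth_box))
    shape_diff = abs((first_box[2] - first_box[0]) -
                     (fourth_box[2] - fourth_box[0])) + \
                 abs((first_box[3] - first_box[1]) -
                     (fourth_box[3] - fourth_box[1]))
    if pos_diff <= pos_tol and shape_diff <= shape_tol:
        return category   # the first detection frame inherits this category
    return None

label = inherit_category((10, 10, 50, 50), (12, 11, 52, 49), "car", 0.9)
```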
In a second aspect, the present application provides a data processing system. The data processing system includes a convolution processing unit, a first feature map generation unit, a second feature map generation unit, and a detection unit. The convolution processing unit is connected to the first feature map generation unit and to the second feature map generation unit, the first feature map generation unit is connected to the second feature map generation unit, and the second feature map generation unit is connected to the detection unit.
The convolution processing unit is configured to receive an input image and perform convolution processing on the input image to generate a plurality of first feature maps.
The first feature map generation unit is configured to generate a plurality of second feature maps according to the plurality of first feature maps, wherein the plurality of first feature maps include more texture details of the input image and/or more position details of the input image than the plurality of second feature maps.
The second feature map generation unit is configured to generate a plurality of third feature maps according to the plurality of first feature maps and the plurality of second feature maps.
The detection unit is configured to output a detection result for an object included in the image according to at least one third feature map of the plurality of third feature maps.
For example, the data processing system may be a perception network system configured to implement the functions of a perception network.
In one existing implementation, a second feature map generation unit (for example, a feature pyramid network) introduces a top-down path to propagate the rich semantic information contained in deep features downwards, so that the second feature maps at all scales contain rich semantic information; the deep features also have large receptive fields, which gives good detection performance on large objects. However, that implementation ignores the finer position detail information and texture detail information contained in the shallower feature maps, which strongly affects the detection accuracy for medium and small objects. In the embodiments of this application, the second feature map generation unit introduces the shallow texture detail information of the original feature maps (the plurality of first feature maps generated by the convolution processing unit) into the deep feature maps (the plurality of second feature maps generated by the first feature map generation unit) to generate the plurality of third feature maps. Using these third feature maps, which carry rich shallow texture detail information, as the input data of the detection unit for target detection can improve the detection accuracy of subsequent object detection.
It should be noted that this does not mean that the detection accuracy of object detection is higher for every image that includes small objects; rather, over a large number of samples, this embodiment can achieve higher overall detection accuracy.
Since the second aspect provides the apparatus corresponding to the method of the first aspect, for its various implementations, explanations, and corresponding technical effects, refer to the description of the first aspect; details are not repeated here.
In a third aspect, this application provides a perception network training method. The method includes:
acquiring a pre-labeled detection frame of a target object in an image;
acquiring a target detection frame corresponding to the image and a first perception network, the target detection frame being used to identify the target object;
In one design, a detection result of the image may be acquired, the detection result being obtained by performing object detection on the image through the first perception network, and the detection result including the target detection frame corresponding to the first object;
performing iterative training on the first perception network according to a loss function to output a second perception network, wherein the loss function is related to the intersection over union (IoU) between the pre-labeled detection frame and the target detection frame;
In one design, the first perception network may be iteratively trained according to the loss function to update the parameters included in the first perception network, thereby obtaining the second perception network, wherein the loss function is related to the IoU between the pre-labeled detection frame and the target detection frame.
In one design, the second perception network may be output.
In the embodiments of this application, the newly designed bounding box regression loss function uses an IoU loss term, which is scale-invariant and serves as an object detection quality metric; a loss term that considers the aspect ratios of the predicted box and the ground-truth box; and a loss term based on the ratio of the distance between the center coordinates of the predicted box and the center coordinates of the ground-truth box to the distance between the lower-right corner coordinates of the predicted box and the upper-left corner coordinates of the ground-truth box. The IoU loss term naturally introduces a scale-invariant evaluation of box prediction quality; the aspect-ratio loss term measures how well the shapes of the two boxes fit each other; and the third, distance-ratio term addresses the fact that when IoU = 0 the relative positional relationship between the predicted box and the ground-truth box cannot otherwise be known and backpropagation is difficult.
In an implementation, the preset loss function is further related to the shape difference between the target detection frame and the pre-labeled detection frame, wherein the shape difference is negatively correlated with the area of the pre-labeled detection frame.
In an implementation, the preset loss function is further related to the position difference between the target detection frame and the pre-labeled detection frame in the image, wherein the position difference is negatively correlated with the area of the pre-labeled detection frame, or the position difference is negatively correlated with the area of the smallest enclosing rectangle of the convex hull of the pre-labeled detection frame and the target detection frame.
In an implementation, the target detection frame includes a first corner point and a first center point, the pre-labeled detection frame includes a second corner point and a second center point, and the first corner point and the second corner point are the two endpoints of a diagonal of a rectangle; the position difference is further positively correlated with the difference in position between the first center point and the second center point in the image, and negatively correlated with the distance between the first corner point and the second corner point.
In an implementation, the preset loss function includes a target loss term related to the position difference, the target loss term changing as the position difference changes, wherein, when the position difference is greater than a preset value, the rate of change of the target loss term is greater than a first preset rate of change; and/or, when the position difference is less than the preset value, the rate of change of the target loss term is less than a second preset rate of change. This behavior enables fast convergence during the training process.
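To make the structure of such a loss concrete, the following is an illustrative reconstruction in the spirit of the terms described above rather than the exact formula of this application: an IoU term, an aspect-ratio term measuring shape fit, and the ratio of the center-to-center distance to the distance between the predicted box's lower-right corner and the ground-truth box's upper-left corner, which remains informative even when IoU = 0. The equal weighting of the three terms is an assumption.

```python
# A minimal sketch of a box regression loss with IoU, aspect-ratio, and
# center-distance-ratio terms; boxes are (x1, y1, x2, y2), shape (N, 4).
import torch

def box_regression_loss(pred, gt, eps=1e-7):
    ix1, iy1 = torch.max(pred[:, 0], gt[:, 0]), torch.max(pred[:, 1], gt[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], gt[:, 2]), torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)

    # aspect-ratio term: how well the two box shapes fit each other
    ar_p = (pred[:, 2] - pred[:, 0]) / (pred[:, 3] - pred[:, 1] + eps)
    ar_g = (gt[:, 2] - gt[:, 0]) / (gt[:, 3] - gt[:, 1] + eps)
    shape_term = (ar_p - ar_g).abs() / (ar_g + eps)

    # ratio of center-to-center distance to the distance between the
    # predicted lower-right corner and the ground-truth upper-left corner
    cpx, cpy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cgx, cgy = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
    center_d = (cpx - cgx) ** 2 + (cpy - cgy) ** 2
    corner_d = (pred[:, 2] - gt[:, 0]) ** 2 + (pred[:, 3] - gt[:, 1]) ** 2
    dist_term = center_d / (corner_d + eps)

    return (1 - iou + shape_term + dist_term).mean()

loss = box_regression_loss(torch.tensor([[10., 10., 50., 50.]]),
                           torch.tensor([[12., 12., 48., 52.]]))
```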
In a fourth aspect, the present application provides a perception network training apparatus. The apparatus includes:
an acquisition module, configured to acquire a pre-labeled detection frame of a target object in an image, and to acquire a target detection frame corresponding to the image and a first perception network, the target detection frame being used to identify the target object;
an iterative training module, configured to perform iterative training on the first perception network according to a loss function to output a second perception network, wherein the loss function is related to the IoU between the pre-labeled detection frame and the target detection frame.
In an implementation, the preset loss function is further related to the shape difference between the target detection frame and the pre-labeled detection frame, wherein the shape difference is negatively correlated with the area of the pre-labeled detection frame.
In an implementation, the preset loss function is further related to the position difference between the target detection frame and the pre-labeled detection frame in the image, wherein the position difference is negatively correlated with the area of the pre-labeled detection frame, or the position difference is negatively correlated with the area of the smallest enclosing rectangle of the convex hull of the pre-labeled detection frame and the target detection frame.
In an implementation, the target detection frame includes a first corner point and a first center point, the pre-labeled detection frame includes a second corner point and a second center point, and the first corner point and the second corner point are the two endpoints of a diagonal of a rectangle; the position difference is further positively correlated with the difference in position between the first center point and the second center point in the image, and negatively correlated with the distance between the first corner point and the second corner point.
In an implementation, the preset loss function includes a target loss term related to the position difference, the target loss term changing as the position difference changes, wherein:
when the position difference is greater than a preset value, the rate of change of the target loss term is greater than a first preset rate of change; and/or, when the position difference is less than the preset value, the rate of change of the target loss term is less than a second preset rate of change.
Since the fourth aspect provides the apparatus corresponding to the method of the third aspect, for its various implementations, explanations, and corresponding technical effects, refer to the description of the third aspect; details are not repeated here.
In a fifth aspect, an embodiment of this application provides an object detection apparatus, which may include a memory, a processor, and a bus system, wherein the memory is configured to store a program and the processor is configured to execute the program in the memory so as to perform the method according to any one of the second aspect and its optional implementations.
In a sixth aspect, an embodiment of this application provides an object detection apparatus, which may include a memory, a processor, and a bus system, wherein the memory is configured to store a program and the processor is configured to execute the program in the memory so as to perform the method according to any one of the third aspect and its optional implementations.
In a seventh aspect, an embodiment of the present invention further provides a perception network application system. The system includes at least one processor, at least one memory, at least one communication interface, and at least one display device. The processor, the memory, the display device, and the communication interface are connected through a communication bus and communicate with one another.
The communication interface is configured to communicate with other devices or communication networks.
The memory is configured to store application program code for executing the above solutions, and execution is controlled by the processor. The processor is configured to execute the application program code stored in the memory.
The code stored in the memory can execute the object detection method provided above, or the method for training the perception network provided in the above embodiments.
The display device is configured to display the image to be recognized and information such as the 2D, 3D, mask, and key point information of the object of interest in the image to be recognized.
In an eighth aspect, an embodiment of this application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program that, when run on a computer, causes the computer to perform the method according to the second aspect and any optional implementation thereof, or the first aspect and any optional implementation thereof.
In a ninth aspect, an embodiment of this application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program that, when run on a computer, causes the computer to perform the method according to the third aspect and any optional implementation thereof.
In a tenth aspect, an embodiment of this application provides a computer program that, when run on a computer, causes the computer to perform the method according to the first aspect and any optional implementation thereof.
In an eleventh aspect, an embodiment of this application provides a computer program that, when run on a computer, causes the computer to perform the method according to the third aspect and any optional implementation thereof.
In a twelfth aspect, this application provides a chip system. The chip system includes a processor configured to support an execution device or a training device in implementing the functions involved in the above aspects, for example, sending or processing the data or information involved in the above methods. In a possible design, the chip system further includes a memory configured to store the program instructions and data necessary for the execution device or the training device. The chip system may consist of a chip, or may include a chip and other discrete devices.
This application provides a data processing system. The second feature map generation unit introduces the shallow texture detail information of the original feature maps (the plurality of first feature maps generated by the convolution processing unit) into the deep feature maps (the plurality of second feature maps generated by the first feature map generation unit) to generate a plurality of third feature maps; using the third feature maps, which carry rich shallow texture detail information, as the input data of the detection unit for target detection can improve the detection accuracy of subsequent object detection.
Description of the drawings
Figure 1 is a schematic structural diagram of the main framework of artificial intelligence;
Figure 2 is an application scenario of an embodiment of this application;
Figure 3 is an application scenario of an embodiment of this application;
Figure 4 is an application scenario of an embodiment of this application;
Figure 5 is a schematic diagram of a system architecture provided by an embodiment of this application;
Figure 6 is a schematic diagram of the structure of a convolutional neural network used in an embodiment of this application;
Figure 7 is a schematic diagram of the structure of a convolutional neural network used in an embodiment of this application;
Figure 8 is a hardware structure of a chip provided by an embodiment of this application;
Figure 9 is a schematic structural diagram of a perception network provided by an embodiment of this application;
Figure 10 is a schematic diagram of the structure of a backbone network;
Figure 11 is a schematic diagram of the structure of a first FPN;
Figure 12a is a schematic diagram of the structure of a second FPN;
Figure 12b is a schematic diagram of the structure of a second FPN;
Figure 12c is a schematic diagram of the structure of a second FPN;
Figure 12d is a schematic diagram of the structure of a second FPN;
Figure 12e is a schematic diagram of the structure of a second FPN;
Figure 13a is a schematic diagram of the structure of a head;
Figure 13b is a schematic diagram of the structure of a head;
Figure 14a is a schematic structural diagram of a perception network provided by an embodiment of this application;
Figure 14b is a schematic diagram of a dilated convolution kernel provided by an embodiment of this application;
Figure 14c is a schematic diagram of a processing flow of intermediate feature extraction;
Figure 15 is a schematic flowchart of an object detection method provided by an embodiment of this application;
Figure 16 is a schematic flowchart of a perception network training method provided by an embodiment of this application;
Figure 17 is a schematic flowchart of an object detection method provided by an embodiment of this application;
Figure 18 is a schematic diagram of a perception network training apparatus provided by an embodiment of this application;
Figure 19 is a schematic diagram of an object detection apparatus provided by an embodiment of this application;
Figure 20 is a schematic structural diagram of an execution device provided by an embodiment of this application;
Figure 21 is a schematic structural diagram of a training device provided by an embodiment of this application;
Figure 22 is a schematic structural diagram of a chip provided by an embodiment of this application.
Detailed description of embodiments
The embodiments of the present invention are described below with reference to the drawings in the embodiments of the present invention. The terms used in the embodiments of the present invention are only intended to explain specific embodiments of the present invention and are not intended to limit the present invention.
The embodiments of this application are described below with reference to the drawings. A person of ordinary skill in the art may appreciate that, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.
The terms "first", "second", and so on in the specification, the claims, and the above drawings of this application are used to distinguish similar objects, and are not necessarily used to describe a specific sequence or order. It should be understood that terms used in this way are interchangeable where appropriate; this is merely a way of distinguishing objects with the same attributes in the description of the embodiments of this application. In addition, the terms "including" and "having" and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units, but may include other units that are not explicitly listed or that are inherent to such a process, method, product, or device.
An overall working procedure of an artificial intelligence system is first described. FIG. 1 is a schematic structural diagram of a main framework of artificial intelligence. The framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to data processing, for example, a general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes a condensing process of "data - information - knowledge - wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technical implementations) to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and provides support through a basic platform. Communication with the outside is performed through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); and the basic platform includes related platform assurance and support such as a distributed computing framework and a network, and may include cloud storage and computing, an interconnection network, and the like. For example, a sensor communicates with the outside to obtain data, and the data is provided, for computation, to an intelligent chip in the distributed computing system provided by the basic platform.
(2) Data
Data at the layer above the infrastructure indicates a data source in the field of artificial intelligence. The data relates to graphics, images, speech, and text, and also relates to Internet of Things data of conventional devices, including service data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing usually includes data training, machine learning, deep learning, searching, reasoning, decision-making, and the like.
Machine learning and deep learning may perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, and the like on data.
Reasoning is a process of simulating an intelligent human reasoning manner in a computer or an intelligent system, and performing machine thinking and problem solving by using formal information according to a reasoning control strategy. Typical functions are searching and matching.
Decision-making is a process of making a decision after reasoning is performed on intelligent information, and usually provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the foregoing data processing is performed on data, some general capabilities may be further formed based on a result of the data processing, for example, an algorithm or a general system, such as translation, text analysis, computer vision processing, speech recognition, and image recognition.
(5) Intelligent products and industry applications
Intelligent products and industry applications are products and applications of the artificial intelligence system in various fields, and are an encapsulation of the overall artificial intelligence solution that productizes intelligent information decision-making and implements practical applications. The application fields mainly include intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, safe city, and the like.
The embodiments of this application are mainly applied in fields that need to complete a plurality of perception tasks, such as driving assistance, autonomous driving, and mobile phone terminals. The application system framework of the present invention is shown in FIG. 2 and FIG. 3. FIG. 2 shows an application scenario of an embodiment of this application: the embodiment of this application resides in the automatic data labeling module of a data processing platform, and the dashed box marks the location of the present invention. The system is an intelligent data platform for human-machine collaboration, built to achieve artificial intelligence capabilities with higher efficiency, faster training, and stronger models. The automatic data labeling module is an intelligent labeling system framework that addresses the high cost of manual labeling and the scarcity of manually labeled data sets.
The product implementation form of the embodiments of this application is program code that is included in an intelligent data storage system and deployed on server hardware. A network element whose functions are enhanced or modified by this solution is a software modification and belongs to a relatively independent module. Taking the application scenario shown in FIG. 3 as an example, the program code of the embodiments of this application resides in the runtime training module of the intelligent data system software. When the program runs, the program code of the embodiments of this application runs in the host memory of the server and the memory of the acceleration hardware (GPU/FPGA/dedicated chip). A possible impact is that, whereas a data reading module previously might read data from an FTP server, a file, a database, or memory, after the embodiments of this application are adopted, the data source only needs to be updated to the interface of the functional modules involved in this solution.
FIG. 3 shows the implementation form of the present invention in servers and platform software, where the label generation apparatus and the automatic calibration apparatus are modules newly added by the present invention on the basis of existing platform software.
The following briefly describes two application scenarios: an ADAS/ADS visual perception system and mobile phone beautification.
Application scenario 1: ADAS/ADS visual perception system
As shown in FIG. 4, in ADAS and ADS, multiple types of 2D targets need to be detected in real time, including dynamic obstacles (Pedestrian, Cyclist, Tricycle, Car, Truck, Bus), static obstacles (TrafficCone, TrafficStick, FireHydrant, Motocycle, Bicycle), and traffic signs (TrafficSign, GuideSign, Billboard, TrafficLight_Red/TrafficLight_Yellow/TrafficLight_Green/TrafficLight_Black, RoadSign). In addition, to accurately obtain the region occupied by a dynamic obstacle in 3-dimensional space, 3D estimation also needs to be performed on the dynamic obstacle to output a 3D box. For fusion with lidar data, the mask of the dynamic obstacle needs to be obtained, so that the laser point cloud hitting the dynamic obstacle can be filtered out. For precise parking, the 4 key points of a parking space need to be detected at the same time. For composition-based positioning, the key points of static targets need to be detected. Using the technical solutions provided in the embodiments of this application, all or some of the foregoing functions can be completed in one perception network.
Application scenario 2: Mobile phone beautification
In a mobile phone, the mask and key points of a human body are detected by using the perception network provided in the embodiments of this application, and corresponding parts of the human body can be enlarged or reduced, for example, through waist-slimming and hip-enhancing operations, to output a beautified image.
Application scenario 3: Image classification
After obtaining a to-be-classified image, an object recognition apparatus obtains the category of an object in the to-be-classified image by using the object recognition method of this application, and may then classify the to-be-classified image according to the category of the object in the image. A photographer takes many photos every day, of animals, of people, and of plants. Using the method of this application, photos can be quickly classified according to their content, for example, into photos containing animals, photos containing people, and photos containing plants.
When the number of images is large, manual classification is inefficient, and a person is prone to fatigue when handling the same task for a long time, in which case the classification results contain large errors. In contrast, the method of this application can classify images quickly and without such errors.
Application scenario 4: Commodity classification
After obtaining an image of a commodity, the object recognition apparatus obtains the category of the commodity in the image by using the object recognition method of this application, and then classifies the commodity according to its category. For the wide variety of commodities in large shopping malls or supermarkets, the object recognition method of this application can quickly complete commodity classification, reducing time overhead and labor costs.
The method provided in this application is described below from the model training side and the model application side.
The perception network training method provided in the embodiments of this application relates to computer vision processing, and may be specifically applied to data processing methods such as data training, machine learning, and deep learning, to perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, and the like on training data (such as the images or image blocks of objects and the categories of objects in this application), to finally obtain a trained perception network. In addition, in the embodiments of this application, input data (such as an image of an object in this application) is input into the trained perception network to obtain output data (such as the 2D, 3D, mask, and key point information of an object of interest in the image in this application).
Because the embodiments of this application involve extensive application of neural networks, for ease of understanding, related terms and concepts such as neural networks involved in the embodiments of this application are first described below.
(1) Object detection: using image processing, machine learning, computer graphics, and other related methods, object detection can determine the category of an image object and determine a detection box for locating the object.
(2) A convolutional neural network (Convolutional Neural Network, CNN) is a deep neural network with a convolutional structure. The convolutional neural network includes a feature extractor consisting of a convolutional layer and a subsampling layer. The feature extractor may be regarded as a filter. The perception network in this embodiment may include a convolutional neural network, configured to perform convolution processing on an image, or on a feature map, to generate a feature map.
(3) Backpropagation algorithm
A convolutional neural network may use an error backpropagation (back propagation, BP) algorithm to correct the values of the parameters in an initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes increasingly small. Specifically, forward propagation of an input signal through to the output produces an error loss, and the parameters in the initial super-resolution model are updated by backpropagating the error loss information, so that the error loss converges. The backpropagation algorithm is a backpropagation movement dominated by the error loss, and aims to obtain the parameters of an optimal super-resolution model, for example, a weight matrix. In this embodiment, when the perception network is trained, the perception network may be updated based on the backpropagation algorithm.
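As a minimal illustration, the following Python sketch shows one such training step using PyTorch (assumed here as the framework; the tiny model, loss function, and random data are placeholders for illustration, not the perception network of this application):

```python
import torch
import torch.nn as nn

# Placeholder network: one convolutional layer followed by a classifier head.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 64, 64)    # placeholder input batch
labels = torch.randint(0, 10, (8,))   # placeholder labels

logits = model(images)                # forward pass of the input signal
loss = criterion(logits, labels)      # error loss at the output
optimizer.zero_grad()
loss.backward()                       # backpropagate the error loss information
optimizer.step()                      # update parameters so the loss converges
```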
FIG. 5 is a schematic diagram of a system architecture according to an embodiment of this application. In FIG. 5, an execution device 110 is configured with an input/output (input/output, I/O) interface 112, configured to exchange data with an external device. A user may input data to the I/O interface 112 through a client device 140. In this embodiment of this application, the input data may include a to-be-recognized image or image block.
When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing (for example, implementing the functions of the neural network in this application), the execution device 110 may invoke data, code, and the like in a data storage system 150 for corresponding processing, and may also store, in the data storage system 150, data, instructions, and the like obtained through the corresponding processing.
Finally, the I/O interface 112 returns a processing result, for example, the image or image block obtained above, or at least one of the 2D, 3D, mask, and key point information of an object of interest in the image, to the client device 140, to provide it to the user.
Optionally, the client device 140 may be a control unit in an autonomous driving system or a functional algorithm module in a mobile phone terminal. For example, the functional algorithm module may be used to implement perception-related tasks.
It should be noted that the training device 120 may generate corresponding target models/rules based on different training data for different goals or different tasks, and the corresponding target models/rules may be used to achieve the foregoing goals or complete the foregoing tasks, thereby providing the user with desired results. The target model/rule may be the perception network described in subsequent embodiments, and the result provided to the user may be the object detection result in subsequent embodiments.
In the case shown in FIG. 5, the user may manually provide the input data, and this manual provision may be performed through an interface provided by the I/O interface 112. In another case, the client device 140 may automatically send input data to the I/O interface 112. If the user's authorization is required for the client device 140 to automatically send the input data, the user may set corresponding permission in the client device 140. The user may view, on the client device 140, the result output by the execution device 110, and the specific presentation form may be display, sound, action, or another specific manner. The client device 140 may also serve as a data collection end, collecting, as new sample data, the input data input to the I/O interface 112 and the output result output from the I/O interface 112 as shown in the figure, and storing the new sample data in a database 130. Certainly, the collection may alternatively be performed without the client device 140; instead, the I/O interface 112 directly stores, in the database 130 as new sample data, the input data input to the I/O interface 112 and the output result output from the I/O interface 112 as shown in the figure.
It should be noted that FIG. 5 is merely a schematic diagram of a system architecture according to an embodiment of this application, and the positional relationships between the devices, components, modules, and the like shown in the figure do not constitute any limitation. For example, in FIG. 5, the data storage system 150 is external memory relative to the execution device 110; in other cases, the data storage system 150 may alternatively be placed in the execution device 110.
As shown in FIG. 5, the perception network may be obtained through training by the training device 120.
The perception network may include a deep neural network with a convolutional structure. The structure of the convolutional neural network used in this embodiment of this application may be as shown in FIG. 6. In FIG. 6, a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230. The input layer 210 may obtain a to-be-processed image and pass it to the convolutional layer/pooling layer 220 and the subsequent neural network layer 230 for processing, to obtain a processing result of the image. The internal layer structure of the CNN 200 in FIG. 6 is described in detail below.
Convolutional layer/pooling layer 220:
Convolutional layer:
As shown in FIG. 6, the convolutional layer/pooling layer 220 may include, for example, layers 221 to 226. In one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer. In another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
The following uses the convolutional layer 221 as an example to describe the internal working principle of one convolutional layer.
The convolutional layer 221 may include many convolution operators. A convolution operator is also called a kernel, and its role in image processing is equivalent to a filter that extracts specific information from an input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved along the horizontal direction on the input image one pixel at a time (or two pixels at a time, depending on the value of the stride), to extract specific features from the image.
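The following is a minimal Python/PyTorch illustration of a convolution operator as a learned weight matrix slid over the input with a configurable stride (the shapes and parameter values are illustrative assumptions, not taken from this application):

```python
import torch
import torch.nn as nn

# One convolution operator (kernel): a learned weight matrix slid over the
# input one pixel at a time (stride=1) or two pixels at a time (stride=2).
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3,
                 stride=2, padding=1)
x = torch.randn(1, 3, 224, 224)   # a batch of one H*W*C input image
y = conv(x)
print(y.shape)                    # torch.Size([1, 64, 112, 112])
```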
In actual application, the weight values in these weight matrices need to be obtained through extensive training. Each weight matrix formed by the weight values obtained through training may be used to extract information from the input image, so that the convolutional neural network 200 makes correct predictions.
When the convolutional neural network 200 has a plurality of convolutional layers, an initial convolutional layer (for example, 221) usually extracts more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 200 increases, the features extracted by later convolutional layers (for example, 226) become increasingly complex, for example, features with high-level semantics; features with higher semantics are more applicable to the problem to be solved.
Pooling layer:
Because the number of training parameters often needs to be reduced, a pooling layer often needs to be periodically introduced after a convolutional layer. For the layers 221 to 226 illustrated by 220 in FIG. 6, one convolutional layer may be followed by one pooling layer, or a plurality of convolutional layers may be followed by one or more pooling layers.
Neural network layer 230:
After the processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is still insufficient to output the required output information, because, as described above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters introduced by the input image.
It should be noted that the convolutional neural network 200 shown in FIG. 6 is merely an example of a convolutional neural network. In specific applications, the convolutional neural network may alternatively exist in the form of another network model.
The perception network in this embodiment of this application may include a deep neural network with a convolutional structure, where the structure of the convolutional neural network may be as shown in FIG. 7. In FIG. 7, a convolutional neural network (CNN) 200 may include an input layer 110, a convolutional layer/pooling layer 120 (where the pooling layer is optional), and a neural network layer 130. Compared with FIG. 6, the plurality of convolutional layers/pooling layers in the convolutional layer/pooling layer 120 in FIG. 7 are parallel, and the separately extracted features are all input to the neural network layer 130 for processing.
Refer to FIG. 8, which is a schematic structural diagram of a data processing system according to an embodiment of this application. As shown in FIG. 8, the data processing system may include:
a convolution processing unit 801, a first feature map generation unit 802, a second feature map generation unit 803, and a detection unit 804, where the convolution processing unit 801 is connected to the first feature map generation unit 802 and the second feature map generation unit 803, the first feature map generation unit 802 is connected to the second feature map generation unit 803, and the second feature map generation unit 803 is connected to the detection unit 804.
In one implementation, the data processing system may implement the functions of a perception network, where the convolution processing unit 801 is a backbone network, the first feature map generation unit 802 and the second feature map generation unit 803 are feature pyramid networks, and the detection unit 804 is a head.
Refer to FIG. 9, which is a schematic structural diagram of a perception network according to an embodiment of this application. As shown in FIG. 9, the perception network includes:
a backbone network 901, a first feature pyramid network FPN 902, a second FPN 903, and a head 904, where the backbone network 901 is connected to the first FPN 902 and the second FPN 903, the first FPN 902 is connected to the second FPN 903, and the second FPN 903 is connected to the head 904.
In this embodiment of this application, the architecture of the perception network may be the architecture shown in FIG. 9, which mainly consists of the backbone network 901, the first FPN 902, the second FPN 903, and the head 904.
In one implementation, the convolution processing unit 801 is a backbone network, and the convolution processing unit 801 is configured to receive an input image and perform convolution processing on the input image to generate a plurality of first feature maps.
It should be noted that "performing convolution processing on the input image" here should not be understood as performing only convolution processing on the input image; in some implementations, convolution processing and other processing may be performed on the input image.
It should be noted that "performing convolution processing on the first image to generate a plurality of first feature maps" here should not be understood only as performing convolution processing on the first image a plurality of times, with each convolution processing generating one first feature map. That is, it should not be understood that each first feature map is obtained by performing convolution processing directly on the first image; rather, viewed as a whole, the first image is the source of the plurality of first feature maps. In one implementation, convolution processing may be performed on the first image to obtain one first feature map, then convolution processing may be performed on the generated first feature map to obtain another first feature map, and so on, to obtain a plurality of first feature maps.
It should be noted that a series of convolution processing may be performed on the input image. Specifically, in each convolution processing, convolution processing may be performed on the first feature map obtained in the previous convolution processing to obtain one first feature map; in this way, a plurality of first feature maps can be obtained.
It should be noted that the plurality of first feature maps may be feature maps with multi-scale resolutions, that is, the plurality of first feature maps do not all have the same resolution. In an optional implementation, the plurality of first feature maps may form a feature pyramid.
The convolution processing unit may be configured to receive an input image and perform convolution processing on the input image to generate a plurality of first feature maps with multi-scale resolutions. The convolution processing unit may perform a series of convolution processing on the input image to obtain feature maps (feature maps) at different scales (with different resolutions). The convolution processing unit may take a plurality of forms, for example, a visual geometry group (visual geometry group, VGG), a residual neural network (residual neural network, resnet), or the core structure of GoogLeNet (Inception-net).
In this embodiment of this application, the convolution processing unit may be a backbone network. The backbone network 901 is configured to receive an input image and perform convolution processing on the input image to generate a plurality of first feature maps with multi-scale resolutions.
Refer to FIG. 10, which is a schematic structural diagram of a backbone network according to an embodiment of this application. As shown in FIG. 10, the backbone network is configured to receive an input image, perform convolution processing on the input image, and output feature maps with different resolutions corresponding to the image (feature map C1, feature map C2, feature map C3, and feature map C4); that is, it outputs feature maps of different sizes corresponding to the image. The backbone network completes the extraction of basic features and provides corresponding features for subsequent detection.
Specifically, the backbone network may perform a series of convolution processing on the input image to obtain feature maps (feature maps) at different scales (with different resolutions). These feature maps provide basic features for subsequent detection modules. The backbone network may take a plurality of forms, for example, a visual geometry group (visual geometry group, VGG), a residual neural network (residual neural network, resnet), or the core structure of GoogLeNet (Inception-net).
The backbone network may perform convolution processing on the input image to generate several convolutional feature maps of different scales. Each feature map is an H*W*C matrix, where H is the height of the feature map, W is the width of the feature map, and C is the number of channels of the feature map.
The backbone may use any of a variety of existing convolutional network frameworks, such as VGG16, Resnet50, and Inception-Net. The following uses Resnet18 as the backbone as an example for description.
Assume that the resolution of the input image is H*W*3 (height H, width W, and 3 channels, that is, the three RGB channels). The input image may undergo a convolution operation through a convolutional layer Res18-Conv1 of Resnet18 to generate feature map C1. This feature map is downsampled twice relative to the input image, and the number of channels is expanded to 64; therefore, the resolution of C1 is H/4*W/4*64. C1 may undergo a convolution operation through Res18-Conv2 of Resnet18 to obtain feature map C2, whose resolution is the same as that of C1. C2 then undergoes a convolution operation through Res18-Conv3 to generate feature map C3, which is further downsampled relative to C2 with the number of channels doubled; its resolution is H/8*W/8*128. Finally, C3 undergoes a convolution operation through Res18-Conv4 to generate feature map C4, whose resolution is H/16*W/16*256.
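As a concrete illustration, the following Python/PyTorch sketch reproduces the stated output shapes. The plain strided convolutions are stand-ins assumed here for the real Res18-Conv1 to Res18-Conv4 stages (which are stacks of residual blocks); only the resolutions and channel counts follow the text:

```python
import torch
import torch.nn as nn

# Stand-ins for Res18-Conv1..Conv4; strides chosen to reproduce the
# stated resolutions H/4, H/4, H/8, H/16.
conv1 = nn.Conv2d(3, 64, 7, stride=4, padding=3)     # C1: H/4 x W/4 x 64
conv2 = nn.Conv2d(64, 64, 3, stride=1, padding=1)    # C2: same resolution as C1
conv3 = nn.Conv2d(64, 128, 3, stride=2, padding=1)   # C3: H/8 x W/8 x 128
conv4 = nn.Conv2d(128, 256, 3, stride=2, padding=1)  # C4: H/16 x W/16 x 256

x = torch.randn(1, 3, 512, 512)                      # example H = W = 512
c1 = conv1(x); c2 = conv2(c1); c3 = conv3(c2); c4 = conv4(c3)
print(c1.shape, c2.shape, c3.shape, c4.shape)
# [1,64,128,128] [1,64,128,128] [1,128,64,64] [1,256,32,32]
```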
It should be noted that the backbone network in the embodiments of this application may also be referred to by the equivalent term "trunk network"; this is not limited here.
It should be noted that the backbone network shown in FIG. 10 is merely one implementation and does not constitute a limitation on this application.
The first feature map generation unit 802 is configured to generate a plurality of second feature maps according to the plurality of first feature maps, where the plurality of first feature maps include more texture detail information and/or position detail information than the plurality of second feature maps.
It should be noted that "generating a plurality of second feature maps according to the plurality of first feature maps" here should not be understood to mean that the source of each of the plurality of second feature maps is the plurality of first feature maps. In one implementation, some of the second feature maps are generated directly based on one or more of the first feature maps. In one implementation, some of the second feature maps are generated directly based on one or more of the first feature maps and on second feature maps other than themselves. In one implementation, some of the second feature maps are generated directly based on second feature maps other than themselves; in this case, because those other second feature maps are generated based on one or more of the first feature maps, this can still be understood as generating a plurality of second feature maps according to the plurality of first feature maps.
It should be noted that the plurality of second feature maps may be feature maps with multi-scale resolutions, that is, the plurality of second feature maps do not all have the same resolution. In an optional implementation, the plurality of second feature maps may form a feature pyramid.
A convolution operation may be performed on the topmost feature map C4 among the plurality of first feature maps generated by the convolution processing unit. For example, dilated convolution and 1×1 convolution may be used to reduce the number of channels of the topmost feature map C4 to 256, to serve as the topmost feature map P4 of the feature pyramid. The output of the next feature map down, C3, is laterally linked, and after a 1×1 convolution reduces its number of channels to 256, it is added to feature map P4 channel by channel and pixel by pixel to obtain feature map P3. By analogy, from top to bottom, a first feature pyramid is constructed, and the first feature pyramid may include the plurality of second feature maps.
It should be noted that the texture detail information here may be shallow-layer detail information used to express small targets and edge features. Compared with the second feature maps, the first feature maps include more texture detail information, so that the detection accuracy of detection results for small-target detection is higher. The position detail information may be information expressing the positions of objects in the image and the relative positions between objects.
It should be noted that, compared with the first feature maps, the plurality of second feature maps may include more deep-layer features. Deep-layer features contain rich semantic information, which works well for classification tasks; at the same time, deep-layer features have a larger receptive field and can achieve a better detection effect for large targets. In one implementation, by introducing a top-down path to generate the plurality of second feature maps, the rich semantic information contained in the deep-layer features can be naturally propagated downward, so that the second feature maps at all scales contain rich semantic information.
In this embodiment of this application, the first feature map generation unit 802 may be the first FPN 902.
In this embodiment of this application, the first FPN 902 is configured to generate a first feature pyramid according to the plurality of first feature maps, where the first feature pyramid includes a plurality of second feature maps with multi-scale resolutions. In this embodiment of this application, the first FPN is connected to the backbone network, and the first FPN may perform convolution processing and merging processing on the plurality of feature maps of different resolutions generated by the backbone network, to construct the first feature pyramid.
Refer to FIG. 11, which is a schematic structural diagram of a first FPN. The first FPN 902 may generate a first feature pyramid according to the plurality of first feature maps, where the first feature pyramid includes a plurality of second feature maps with multi-scale resolutions (feature map P2, feature map P3, feature map P4, and feature map P5). A convolution operation is performed on the topmost feature map C4 generated by the backbone network 901. For example, dilated convolution and 1×1 convolution may be used to reduce the number of channels of the topmost feature map C4 to 256, to serve as the topmost feature map P4 of the feature pyramid. The output of the next feature map down, C3, is laterally linked, and after a 1×1 convolution reduces its number of channels to 256, it is added to feature map P4 channel by channel and pixel by pixel to obtain feature map P3. By analogy, from top to bottom, the first feature pyramid ΦP = {feature map P2, feature map P3, feature map P4, feature map P5} is constructed.
In this embodiment of this application, to obtain a larger receptive field, the first feature pyramid may further include feature map P5, which may be generated by directly performing a convolution operation on feature map P4. The intermediate feature maps of the first feature pyramid can introduce the rich semantic information contained in the deep-layer features into each feature layer, layer by layer, through the top-down structure, so that the feature maps at different scales all contain relatively rich semantic information, which can provide better semantic information for small targets and improve the classification performance of small targets.
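The following is a minimal Python/PyTorch sketch of this top-down construction, under the channel and resolution assumptions of the Resnet18 example above (C3: H/8 x W/8 x 128, C4: H/16 x W/16 x 256). The 2x upsampling before the channel-by-channel, pixel-by-pixel addition is standard FPN practice and is assumed here so that the resolutions match; the module names are illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dilated = nn.Conv2d(256, 256, 3, padding=2, dilation=2)  # dilated conv on C4
lateral_c4 = nn.Conv2d(256, 256, 1)   # 1x1 conv: reduce C4 to 256 channels
lateral_c3 = nn.Conv2d(128, 256, 1)   # 1x1 conv: reduce C3 to 256 channels
down = nn.Conv2d(256, 256, 3, stride=2, padding=1)  # P5 directly from P4

c3 = torch.randn(1, 128, 64, 64)      # assumed C3 for a 512x512 input
c4 = torch.randn(1, 256, 32, 32)      # assumed C4

p4 = lateral_c4(dilated(c4))          # topmost feature map P4
p3 = lateral_c3(c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
p5 = down(p4)                         # extra top level for a larger receptive field
print(p3.shape, p4.shape, p5.shape)   # [1,256,64,64] [1,256,32,32] [1,256,16,16]
```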
It should be noted that the first FPN shown in FIG. 11 is merely one implementation and does not constitute a limitation on this application.
The second feature map generation unit 803 is configured to generate a plurality of third feature maps according to the plurality of first feature maps and the plurality of second feature maps.
It should be noted that "generating a plurality of third feature maps according to the plurality of first feature maps and the plurality of second feature maps" here should not be understood to mean that the source of each of the plurality of third feature maps is the plurality of first feature maps and the plurality of second feature maps; rather, viewed as a whole, the plurality of first feature maps and the plurality of second feature maps are the source of the plurality of third feature maps. In one implementation, some of the third feature maps are generated based on one or more of the first feature maps and one or more of the second feature maps. In one implementation, some of the third feature maps are generated based on one or more of the first feature maps, one or more of the second feature maps, and third feature maps other than themselves. In one implementation, some of the third feature maps are generated based on third feature maps other than themselves.
It should be noted that the plurality of third feature maps may be feature maps with multi-scale resolutions, that is, the plurality of third feature maps do not all have the same resolution. In an optional implementation, the plurality of third feature maps may form a feature pyramid.
In this embodiment of this application, the second feature map generation unit 803 may be the second FPN 903.
In this embodiment of this application, the second FPN 903 is configured to generate a second feature pyramid according to the plurality of first feature maps and the plurality of second feature maps, where the second feature pyramid includes a plurality of third feature maps with multi-scale resolutions.
Refer to FIG. 12a, which is a schematic structural diagram of a second FPN. The second FPN 903 may generate a second feature pyramid according to the plurality of first feature maps generated by the backbone network 901 and the plurality of second feature maps generated by the first FPN 902, where the second feature pyramid may include a plurality of third feature maps (for example, feature map Q1, feature map Q2, feature map Q3, and feature map Q4 shown in FIG. 12a).
In this embodiment of this application, the second feature pyramid includes a plurality of third feature maps with multi-scale resolutions, where the bottom-most feature map among the plurality of third feature maps (that is, the feature map with the lowest resolution) may be generated according to one first feature map generated by the backbone network and one second feature map generated by the first FPN.
Specifically, in an embodiment, the plurality of first feature maps include a first target feature map, the plurality of second feature maps include a second target feature map, and the plurality of third feature maps include a third target feature map, where the third target feature map is the feature map with the smallest resolution among the plurality of third feature maps, and the second FPN is configured to generate the third target feature map through the following steps:
performing downsampling and convolution processing on the first target feature map to obtain a fourth target feature map, where the fourth target feature map has the same number of channels and resolution as the second target feature map; and adding the fourth target feature map and the second target feature map channel by channel to generate the third target feature map.
Refer to FIG. 12b. The plurality of first feature maps include a first target feature map, which may be feature map C2 in FIG. 12b; the plurality of second feature maps include a second target feature map, which may be feature map P3 in FIG. 12b; and the plurality of third feature maps include a third target feature map, which may be feature map Q1 in FIG. 12b.
In this embodiment of this application, downsampling and convolution processing may be performed on the first target feature map to obtain the fourth target feature map. As shown in FIG. 12b, downsampling and convolution processing may be performed on feature map C2, where the purpose of the downsampling is to make the resolution of each channel of the fourth target feature map the same as that of feature map P3, and the purpose of the convolution processing is to make the number of channels of the fourth target feature map the same as that of feature map P3. In this way, the fourth target feature map has the same number of channels and resolution as the second target feature map, so that the fourth target feature map and the second target feature map can be added channel by channel. As shown in FIG. 12b, feature map Q1 can be obtained by adding the fourth target feature map and the second target feature map channel by channel.
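A minimal Python/PyTorch sketch of this step, assuming C2 is H/4 x W/4 x 64 and P3 is H/8 x W/8 x 256 as in the examples above (the interpolation-based downsampling and the 1x1 convolution are illustrative choices, not mandated by the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

c2 = torch.randn(1, 64, 128, 128)   # assumed first target feature map (C2)
p3 = torch.randn(1, 256, 64, 64)    # assumed second target feature map (P3)

reduce = nn.Conv2d(64, 256, 1)      # convolution: match P3's channel count
c2_down = F.interpolate(c2, size=p3.shape[-2:], mode="nearest")  # match P3's resolution
q1 = reduce(c2_down) + p3           # channel-by-channel addition -> Q1
print(q1.shape)                     # torch.Size([1, 256, 64, 64])
```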
Specifically, in another embodiment, the plurality of first feature maps include a first target feature map, the plurality of second feature maps include a second target feature map, and the plurality of third feature maps include a third target feature map, where the third target feature map is the feature map with the smallest resolution among the plurality of third feature maps, and the second FPN is configured to generate the third target feature map through the following steps:
downsampling the first target feature map to obtain a fourth target feature map, where the fourth target feature map has the same resolution as the second target feature map; and combining the fourth target feature map and the second target feature map channel by channel and performing convolution processing, to generate the third target feature map, where the third target feature map has the same number of channels as the second target feature map.
Refer to FIG. 12c. The plurality of first feature maps include a first target feature map, which may be feature map C2 in FIG. 12c; the plurality of second feature maps include a second target feature map, which may be feature map P3 in FIG. 12c; and the plurality of third feature maps include a third target feature map, which may be feature map Q1 in FIG. 12c.
In this embodiment of this application, the first target feature map may be downsampled to obtain the fourth target feature map. As shown in FIG. 12c, feature map C2 may be downsampled, where the purpose of the downsampling is to make the resolution of each channel of the fourth target feature map the same as that of feature map P3. In this way, the fourth target feature map has the same resolution as the second target feature map, so that the fourth target feature map and the second target feature map can be combined channel by channel and then undergo convolution processing, so that the obtained third target feature map has the same number of channels as feature map P3; the channel-wise combination here may be a concatenation operation.
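A sketch of this second variant under the same shape assumptions, reading the combination step as channel concatenation followed by a convolution that restores the channel count (one plausible reading of the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

c2 = torch.randn(1, 64, 128, 128)   # assumed first target feature map (C2)
p3 = torch.randn(1, 256, 64, 64)    # assumed second target feature map (P3)

c2_down = F.interpolate(c2, size=p3.shape[-2:], mode="nearest")  # resolution only
fuse = nn.Conv2d(64 + 256, 256, 1)  # convolution back to P3's channel count
q1 = fuse(torch.cat([c2_down, p3], dim=1))  # concatenate channels, then convolve
print(q1.shape)                     # torch.Size([1, 256, 64, 64])
```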
In this embodiment of this application, the second feature pyramid includes a plurality of third feature maps with multi-scale resolutions, where a non-bottom-most feature map among the plurality of third feature maps (that is, a feature map whose resolution is not the lowest) may be generated according to one first feature map generated by the backbone network, one second feature map generated by the first FPN, and one third feature map of the adjacent lower layer.
Specifically, in an embodiment, the plurality of first feature maps include a first target feature map, the plurality of second feature maps include a second target feature map, and the plurality of third feature maps include a third target feature map and a fourth target feature map, where the resolution of the third target feature map is smaller than that of the fourth target feature map, and the second feature map generation unit is configured to generate the fourth target feature map through the following steps:
downsampling the third target feature map to obtain a fifth target feature map, where the fifth target feature map has the same number of channels and resolution as the second target feature map; performing downsampling and convolution processing on the first target feature map to obtain a sixth target feature map, where the sixth target feature map has the same number of channels and resolution as the second target feature map; and superimposing the fifth target feature map, the second target feature map, and the sixth target feature map channel by channel to generate the fourth target feature map.
Refer to FIG. 12d. The plurality of first feature maps include a first target feature map, which may be feature map C3 in FIG. 12d; the plurality of second feature maps include a second target feature map, which may be feature map P4 in FIG. 12d; and the plurality of third feature maps include a third target feature map and a fourth target feature map, where the third target feature map may be feature map Q1 in FIG. 12d and the fourth target feature map may be feature map Q2 in FIG. 12d.
In this embodiment of this application, the third target feature map may be downsampled to obtain the fifth target feature map. As shown in FIG. 12d, feature map Q1 may be downsampled, where the purpose of the downsampling is to make the resolution of each channel of the fifth target feature map the same as that of feature map P4.
In this embodiment of this application, downsampling and convolution processing may be performed on the first target feature map to obtain the sixth target feature map, where the purpose of the downsampling is to make the resolution of each channel of the sixth target feature map the same as that of the second target feature map, and the purpose of the convolution processing is to make the number of channels of the sixth target feature map the same as that of the second target feature map. In this way, the fifth target feature map and the sixth target feature map have the same resolution and number of channels as the second target feature map, so that the fifth target feature map, the sixth target feature map, and the second target feature map can be added channel by channel to obtain the fourth target feature map.
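A minimal Python/PyTorch sketch of this step, assuming C3 is H/8 x W/8 x 128, P4 is H/16 x W/16 x 256, and Q1 is H/8 x W/8 x 256 as in the examples above; the downsampling operator is an illustrative choice:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

c3 = torch.randn(1, 128, 64, 64)    # assumed first target feature map (C3)
p4 = torch.randn(1, 256, 32, 32)    # assumed second target feature map (P4)
q1 = torch.randn(1, 256, 64, 64)    # assumed third target feature map (Q1)

reduce_c3 = nn.Conv2d(128, 256, 1)  # convolution: match P4's channel count
q1_down = F.interpolate(q1, size=p4.shape[-2:], mode="nearest")  # fifth target feature map
c3_down = reduce_c3(F.interpolate(c3, size=p4.shape[-2:], mode="nearest"))  # sixth target feature map
q2 = q1_down + p4 + c3_down         # channel-by-channel addition of the three maps
print(q2.shape)                     # torch.Size([1, 256, 32, 32])
```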
Specifically, in another embodiment, the third target feature map may be down-sampled to obtain a fifth target feature map, where the fifth target feature map has the same number of channels and resolution as the second target feature map; the first target feature map may be down-sampled to obtain a sixth target feature map, where the sixth target feature map has the same resolution as the second target feature map; and the fifth target feature map, the second target feature map, and the sixth target feature map may be stacked channel by channel and convolved to generate the fourth target feature map, where the fourth target feature map has the same number of channels as the second target feature map.
Referring to FIG. 12e, the plurality of first feature maps include a first target feature map, which may be the feature map C3 in FIG. 12e; the plurality of second feature maps include a second target feature map, which may be the feature map P4 in FIG. 12e; and the plurality of third feature maps include a third target feature map and a fourth target feature map, where the third target feature map may be the feature map Q2 in FIG. 12e.
In the embodiment of the present application, the third target feature map may be down-sampled to obtain the fifth target feature map, where the purpose of the down-sampling is to make the resolution of each channel of the fifth target feature map the same as that of the second target feature map. In this way, the fifth target feature map and the second target feature map have the same resolution and can be added channel by channel. The first target feature map may likewise be down-sampled to obtain the sixth target feature map, where the purpose of the down-sampling is to make the resolution of each channel of the sixth target feature map the same as that of the second target feature map. In this way, the fifth target feature map, the sixth target feature map, and the second target feature map all have the same resolution, so they can be stacked channel by channel (the stacking may be a concatenation operation) and then convolved, so that the resulting fourth target feature map has the same number of channels as the second target feature map.
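As an illustration only, the second variant, which stacks before a channel-reducing convolution, might look like this; again the operator choices (nearest-neighbor resampling, a 1x1 convolution after the concatenation) are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseConcat(nn.Module):
    """Sketch of the second fusion variant: stack along channels, then convolve back
    to the channel count of the second target feature map."""
    def __init__(self, c_first, c_second, c_third):
        super().__init__()
        # Convolution after concatenation restores the channel count of the second map.
        self.reduce = nn.Conv2d(c_third + c_second + c_first, c_second, kernel_size=1)

    def forward(self, first, second, third):
        size = second.shape[-2:]
        fifth = F.interpolate(third, size=size, mode="nearest")  # resample third -> resolution of second
        sixth = F.interpolate(first, size=size, mode="nearest")  # resample first -> resolution of second
        fused = torch.cat([fifth, second, sixth], dim=1)         # channel-wise stacking (concatenation)
        return self.reduce(fused)                                # fourth target feature map
```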
In this embodiment, the second feature map generating unit may generate a third target feature map and a fourth target feature map with different resolutions, where the resolution of the third target feature map is smaller than that of the fourth target feature map, and the fourth target feature map, with the larger resolution, is generated based on one of the plurality of first feature maps, one of the plurality of second feature maps, and the third target feature map. The plurality of third feature maps generated by the second feature map generating unit in this way retain the advantage of the feature pyramid network: they are generated bottom-up (feature maps of larger resolution are generated successively from feature maps of smaller resolution), and the rich texture detail information and/or position detail information of the shallow layers of the neural network is introduced into the deep convolutional layers, so a detection network that uses the plurality of third feature maps generated in this way achieves higher detection accuracy for small targets.
The object detection task differs from image classification. In an image classification task, the model only needs to answer what is in the image; image classification is therefore invariant to transformations such as translation and scale. In an object detection task, the model must handle two parallel tasks: where the target is in the image, and which category the target belongs to. In actually acquired visible-light images, the size and position of targets change constantly, so multi-scale target detection is a problem. The original one-stage and two-stage models directly migrate part of an image classification network for detection, and, in order to obtain feature maps with sufficient expressive power and semantic information, generally select the deeper features of the network for subsequent processing. To obtain better expressive power and higher-accuracy results, current deep neural network models are being extended toward deeper and multi-branch topologies; as the network layers deepen, the problem of network degradation appears, and the ResNet structure, with its identity mappings and skip connections, solves the degradation problem well. In current high-accuracy detection models, the number of network layers is on the order of 10 to 100, which gives the network better expressive power; the problem, however, is that the deeper the network layer, the larger the receptive field of the extracted features, which causes small targets to be missed in detection.
Moreover, traditional detection algorithms use only the features extracted by a single layer, which produces a fixed receptive field. Although anchors of different sizes and aspect ratios are introduced in subsequent processing and mapped back to the original image through the feature map to address the multi-scale problem, a certain amount of missed detection is still unavoidable. The feature pyramid network (FPN) solves the multi-scale target detection problem well. Observing deep convolutional neural networks, a deep network model inherently produces multi-level, multi-scale feature maps: shallow feature maps have smaller receptive fields, while deep feature maps have larger ones, so directly using feature maps with this pyramidal hierarchy can introduce multi-scale information. There is one problem, however: although shallow feature maps have smaller receptive fields, which benefits the detection of small targets, they contain relatively little semantic information; in other words, shallow features are not abstract enough and are hard to use for classifying the detected targets. The FPN adopts an ingenious structural design to address the lack of semantic information in shallow features. A basic deep convolutional neural network has a bottom-to-top forward computation that produces hierarchical feature maps, with a scaling factor at each scale and gradually decreasing feature map resolution. The FPN introduces a top-to-bottom network structure that gradually enlarges the feature map resolution, and introduces lateral connection branches drawn from the original feature extraction network, fusing the feature maps of corresponding resolution from the original feature network with the up-sampled deep feature maps.
In the embodiment of the present application, with a view to fusing the shallow information generated by the backbone network into each feature layer once more, a multi-scale feature layer with bottom-up skip connections is designed. Shallow feature maps contain very rich edge, texture, and detail information. By introducing skip connections between the original feature maps and the bottom-up multi-scale network layers, and fusing the down-sampled original feature maps with the feature maps of corresponding resolution in the second, laterally connected multi-scale feature layer, this network-layer improvement achieves better detection of small targets and partially occluded targets, and introduces features rich in both semantic information and detail information into the multi-scale feature pyramid layers.
In an existing implementation, the second feature map generating unit (for example, a feature pyramid network) introduces a top-down path to propagate the rich semantic information contained in deep features downward, so that the second feature maps of every scale contain rich semantic information; at the same time, the deep features have large receptive fields, giving good detection results for large targets. However, the existing implementation ignores the finer position detail information and texture detail information contained in the shallower feature maps, which greatly affects the detection accuracy for medium and small targets. In the embodiment of the present application, the second feature map generating unit introduces the shallow texture detail information of the original feature maps (the plurality of first feature maps generated by the convolution processing unit) into the deep feature maps (the plurality of second feature maps generated by the first feature map generating unit) to generate the plurality of third feature maps; using these third feature maps, rich in shallow texture detail information, as the input data with which the detection unit performs target detection can improve the detection accuracy of subsequent object detection.
It should be noted that this embodiment does not mean that the detection accuracy of object detection will be higher for every image that includes small targets; rather, over a large number of samples, this embodiment can achieve higher overall detection accuracy.
The detection unit 804 is configured to perform target detection on the image according to at least one third feature map of the plurality of third feature maps, and output a detection result.
In this embodiment of the present application, the detection unit 804 may be a head.
In the embodiment of the present application, the head is configured to detect the target object in the image according to at least one third feature map of the plurality of third feature maps, and output a detection result.
In the embodiment of the present application, the perception network may include one or more heads. As shown in FIG. 13a, each parallel head is configured to detect the task objects of one task according to the third feature maps output by the second FPN, and to output the 2D boxes of the regions where the task objects are located together with the confidence corresponding to each 2D box. Each parallel head can complete the detection of different task objects, where a task object is an object that needs to be detected in that task; the higher the confidence, the greater the probability that an object corresponding to the task exists in the 2D box corresponding to that confidence.
In the embodiment of this application, different heads can complete different 2D detection tasks. For example, one of the heads can complete vehicle detection and output the 2D boxes and confidences for Car/Truck/Bus; head1 can complete person detection and output the 2D boxes and confidences for Pedestrian/Cyclist/Tricycle; another head can complete traffic-light detection and output the 2D boxes and confidences for Red_TrafficLight/Green_TrafficLight/Yellow_TrafficLight/Black_TrafficLight.
In the embodiment of the present application, the perception network may include multiple serial heads, each connected to a parallel head. It should be emphasized that serial heads are not actually mandatory; for scenarios that only require detecting 2D boxes, no serial head needs to be included.
The serial head may be configured to: using the 2D box of the task object of its task provided by the parallel head to which it is connected, extract the features of the region where the 2D box is located from one or more feature maps of the second FPN, and predict the 3D information, Mask information, or Keypoint information of the task object of that task according to the features of the region where the 2D box is located. The serial head is optionally cascaded behind the parallel head; on the basis of the detected 2D box of the task, it completes 3D/Mask/Keypoint detection of the object inside the 2D box. For example, serial 3D_head0 estimates the orientation, centroid, and length/width/height of a vehicle, thereby outputting the vehicle's 3D box; serial Mask_head0 predicts the fine mask of the vehicle, thereby segmenting the vehicle out; serial Keypoint_head0 estimates the key points of the vehicle. A serial head is not mandatory: some tasks do not require 3D/Mask/Keypoint detection and thus need no cascaded serial head, for example traffic-light detection, which only needs a 2D box. In addition, some tasks may cascade one or more serial heads according to the specific requirements of the task. For example, parking-slot detection needs, besides the 2D box, the key points of the parking space, so this task only needs to cascade one serial Keypoint_head and needs no 3D or Mask head.
In the embodiment of this application, a header is connected to the FPN. The header can complete the detection of the 2D boxes of a task according to the feature maps provided by the FPN, outputting the 2D boxes of the objects of this task, the corresponding confidences, and so on. A structure of such a header is described next. Referring to FIG. 13b, which is a schematic diagram of a header, the head includes three modules: a region proposal network (RPN), ROI-ALIGN, and RCNN.
The RPN module may be configured to predict, on one or more third feature maps provided by the second FPN, the region where the task object is located, and output candidate 2D boxes matching the region. It can also be understood as follows: the RPN predicts, on one or more feature maps output by the FPN, the regions where the task object may exist, and gives the boxes of these regions, which are called candidate regions (Proposals). For example, when a head is responsible for detecting cars, its RPN layer predicts candidate boxes where cars may exist; when a head is responsible for detecting persons, its RPN layer predicts candidate boxes where persons may exist. Of course, these Proposals are inaccurate: on the one hand they do not necessarily contain an object of the task, and on the other hand the boxes are not tight.
The 2D candidate region prediction process may be implemented by the RPN module of the head, which predicts the regions where the task object may exist according to the feature maps provided by the FPN, and gives the candidate boxes (also called candidate regions, Proposals) of these regions. In this embodiment, if the head is responsible for detecting cars, its RPN layer predicts candidate boxes where cars may exist.
The RPN layer may generate a feature map, RPN Hidden, on the third feature maps provided by the second FPN through, for example, a 3*3 convolution. The RPN layer of the head then predicts Proposals from RPN Hidden. Specifically, the RPN layer of the head predicts, through 1*1 convolutions, the coordinates and the confidence of the Proposal at each position of RPN Hidden. The higher this confidence, the greater the probability that the Proposal contains an object of the task; for example, the larger the score of a Proposal in the head, the greater the probability that a car exists there. The Proposals predicted by each RPN layer need to pass through a Proposal merging module, which removes redundant Proposals according to the degree of overlap between them (this process may use, but is not limited to, the NMS algorithm) and selects, from the remaining K Proposals, the N (N<K) Proposals with the largest scores as candidate regions where objects may exist. These Proposals are inaccurate: on the one hand they do not necessarily contain an object of the task, and on the other hand the boxes are not tight. Therefore, the RPN module is only a coarse detection process, and the subsequent RCNN module is needed for refinement. When the RPN module regresses the coordinates of a Proposal, it does not directly regress the absolute values of the coordinates but regresses coordinates relative to the Anchors. The better these Anchors match actual objects, the greater the probability that the RPN can detect the objects.
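As an illustration only, the RPN layer described above can be sketched in PyTorch as follows; the ReLU activation, the channel counts, and the per-anchor output layout are assumptions the text leaves open:

```python
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    """Sketch of the RPN layer: a 3*3 convolution produces RPN Hidden, then 1*1 convolutions
    predict, at every position, the per-anchor confidence and the coordinates relative to the Anchors."""
    def __init__(self, in_channels, num_anchors):
        super().__init__()
        self.hidden = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.score = nn.Conv2d(in_channels, num_anchors, kernel_size=1)      # objectness per anchor
        self.delta = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)  # (dx, dy, dw, dh) per anchor

    def forward(self, feature_map):
        h = F.relu(self.hidden(feature_map))  # RPN Hidden
        return self.score(h), self.delta(h)
```

The predicted Proposals would then be merged and filtered (for example with NMS, as described above) before being passed to ROI-ALIGN.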
The ROI-ALIGN module is configured to extract, according to the regions predicted by the RPN module, the features of the regions where the candidate 2D boxes are located from a feature map provided by the FPN. That is, the ROI-ALIGN module mainly extracts, on a certain feature map and according to the Proposals provided by the RPN module, the features of the region where each Proposal is located, and resizes them to a fixed size to obtain the features of each Proposal. It can be understood that the ROI-ALIGN module may use, but is not limited to, feature extraction methods such as ROI-POOLING (region-of-interest pooling), ROI-ALIGN (region-of-interest extraction), PS-ROIPOOLING (position-sensitive region-of-interest pooling), and PS-ROIALIGN (position-sensitive region-of-interest extraction).
The RCNN module is configured to perform convolution processing, through a neural network, on the features of the region where a candidate 2D box is located to obtain the confidence that the candidate 2D box belongs to each object category; to adjust, through the neural network, the coordinates of the candidate 2D box so that the adjusted 2D candidate box matches the shape of the actual object better than the candidate 2D box does; and to select adjusted 2D candidate boxes whose confidence is greater than a preset threshold as the 2D boxes of the region. That is, the RCNN module mainly refines the features of each Proposal provided by the ROI-ALIGN module to obtain the confidence of each Proposal belonging to each category (for example, for the car task, the four scores Background/Car/Truck/Bus are given), and adjusts the coordinates of the Proposal's 2D box to output a tighter 2D box. After being merged by non-maximum suppression (NMS), these 2D boxes are output as the final 2D boxes.
The fine classification of 2D candidate regions is mainly implemented by the RCNN module of the head in FIG. 13b. According to the features of each Proposal extracted by the ROI-ALIGN module, it further regresses tighter 2D box coordinates while classifying the Proposal and outputting the confidence of each category. RCNN can be implemented in many forms. The feature output by the ROI-ALIGN module may be of size N*14*14*256 (features of proposals); in the RCNN module it is first processed by convolution module 4 of ResNet18 (Res18-Conv5), giving an output feature size of N*7*7*512, and is then processed by a Global Avg Pool (average pooling layer), which averages the 7*7 features in each channel of the input features to obtain N*512 features, where each 1*512-dimensional feature vector represents the features of one Proposal. Next, two fully connected layers (FC) respectively regress the precise coordinates of the box (outputting an N*4 vector whose four values represent the x/y coordinates of the box center and the width and height of the box) and the confidence of the box category (in head0, the scores for Background/Car/Truck/Bus need to be given). Finally, through a box merging operation, the several boxes with the largest scores are selected, and duplicate boxes are removed through the NMS operation, so as to obtain a tight box output.
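As an illustration only, the RCNN refinement with the shapes stated above might be sketched as follows; the stand-in for Res18-Conv5 and the four-class head are assumptions:

```python
import torch.nn as nn

class RCNNHead(nn.Module):
    """Sketch of the RCNN module: N*14*14*256 proposal features -> conv stage (stand-in for
    Res18-Conv5) -> N*7*7*512 -> global average pooling -> N*512 -> two FC heads."""
    def __init__(self, num_classes=4):  # e.g. Background/Car/Truck/Bus
        super().__init__()
        # Stand-in for ResNet-18's conv5 stage (stride 2, 256 -> 512 channels).
        self.conv5 = nn.Sequential(
            nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.fc_box = nn.Linear(512, 4)            # box center x/y, width, height
        self.fc_cls = nn.Linear(512, num_classes)  # per-category confidence

    def forward(self, roi_feats):   # roi_feats: (N, 256, 14, 14)
        x = self.conv5(roi_feats)   # (N, 512, 7, 7)
        x = x.mean(dim=(2, 3))      # global average pooling -> (N, 512)
        return self.fc_box(x), self.fc_cls(x)
```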
In some practical application scenarios, the perception network may also include other heads, which can further perform 3D/Mask/Keypoint detection on the basis of the detected 2D boxes. Exemplarily, taking 3D as an example, the ROI-ALIGN module extracts the features of the region where each 2D box is located from the feature map output by the FPN, according to the accurate 2D boxes provided by the head. Assuming the number of 2D boxes is M, the feature output by the ROI-ALIGN module is of size M*14*14*256; it is first processed by Res18-Conv5 of ResNet18, giving an output feature size of M*7*7*512, and is then processed by a Global Avg Pool (average pooling layer), which averages the 7*7 features of each channel in the input features to obtain M*512 features, where each 1*512-dimensional feature vector represents the features of one 2D box. Next, three fully connected layers (FC) respectively regress the orientation angle of the object in the box (orientation, an M*1 vector), the centroid coordinates (centroid, an M*2 vector whose two values represent the x/y coordinates of the centroid), and the length/width/height (dimension).
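Under the same assumptions, the final regression stage of such a serial 3D head might be sketched as three parallel FC layers over the M*512 pooled features; the 3-dimensional output for length/width/height is an assumption, since the text does not state the size of the dimension vector:

```python
import torch.nn as nn

class Serial3DHead(nn.Module):
    """Sketch of the serial 3D head: from (M, 512) pooled box features, three FC layers
    regress orientation (M*1), centroid (M*2), and dimensions (assumed M*3)."""
    def __init__(self):
        super().__init__()
        self.fc_orientation = nn.Linear(512, 1)  # orientation angle
        self.fc_centroid = nn.Linear(512, 2)     # centroid x/y
        self.fc_dimensions = nn.Linear(512, 3)   # length/width/height

    def forward(self, box_feats):  # box_feats: (M, 512)
        return (self.fc_orientation(box_feats),
                self.fc_centroid(box_feats),
                self.fc_dimensions(box_feats))
```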
It should be noted that the headers shown in FIG. 13a and FIG. 13b are only one implementation and do not constitute a limitation on the present application.
In the embodiment of the present application, the perception network may further include a dilated convolution layer, configured to perform dilated convolution processing on at least one third feature map of the plurality of third feature maps; correspondingly, the head is specifically configured to detect the target object in the image according to the at least one third feature map after dilated convolution processing, and output a detection result.
Referring to FIG. 14a and FIG. 14b, FIG. 14a is a schematic structural diagram of a perception network provided by an embodiment of this application, and FIG. 14b is a schematic diagram of a dilated convolution kernel provided by an embodiment of this application. In the embodiment of this application, the candidate region extraction network (RPN) contains a 3x3 convolution that acts as a sliding window; by moving this kernel over at least one third feature map, the subsequent intermediate layer together with the category judgment and box regression layers can determine whether a target exists in each Anchor box and the difference between the predicted box and the ground-truth box, and by training the candidate region extraction network, better box extraction results can be obtained. In this embodiment, the 3x3 sliding-window convolution kernel is replaced with a 3x3 dilated convolution kernel: dilated convolution processing is performed on at least one third feature map of the plurality of third feature maps, the target object in the image is detected according to the at least one third feature map after dilated convolution processing, and a detection result is output. Without increasing the amount of computation, this enlarges the receptive field and reduces missed detections of large targets and partially occluded targets. Assuming the spatial size of the original ordinary convolution kernel is k×k, and a new parameter d is introduced so that (d-1) gaps are inserted into the original kernel, the new kernel size is:
n = k + (k - 1) × (d - 1)
For example, for a 3x3 convolution kernel, setting d=2 gives a new kernel with a 5x5 receptive field.
The existing candidate region extraction network (RPN) slides a 3x3 ordinary convolution kernel as a sliding window over the feature map for subsequent processing. In this embodiment, this 3x3 ordinary convolution kernel is replaced with a 3x3 dilated convolution kernel to improve the network's correction of missed detections of large and occluded targets. With the improvement of the feature extraction network and of the candidate region extraction network (RPN) in this embodiment, the resulting annotation model achieves good detection results. After the ordinary convolution is replaced with the 3x3 dilated convolution, a larger receptive field is obtained without increasing the amount of computation; at the same time, the larger receptive field captures better context information, reducing the occurrence of background being misjudged as foreground.
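As an illustration only, the receptive-field formula and the replacement of the ordinary sliding-window convolution can be sketched as follows; the channel count of 256 is an assumption:

```python
import torch.nn as nn

def effective_kernel_size(k: int, d: int) -> int:
    # n = k + (k - 1) * (d - 1): effective size of a k x k kernel with dilation d.
    return k + (k - 1) * (d - 1)

assert effective_kernel_size(3, 2) == 5  # matches the 5x5 receptive field in the example above

# Replacing the ordinary 3x3 sliding-window convolution in the RPN with a dilated one;
# padding=2 keeps the output resolution unchanged for a 3x3 kernel with dilation 2.
rpn_sliding_window = nn.Conv2d(256, 256, kernel_size=3, dilation=2, padding=2)
```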
The present application provides a data processing system, including: a convolution processing unit, a first feature map generating unit, a second feature map generating unit, and a detection unit, where the convolution processing unit is connected to the first feature map generating unit and the second feature map generating unit respectively, the first feature map generating unit is connected to the second feature map generating unit, and the second feature map generating unit is connected to the detection unit. The convolution processing unit is configured to receive an input image and perform convolution processing on the input image to generate a plurality of first feature maps. The first feature map generating unit is configured to generate a plurality of second feature maps according to the plurality of first feature maps, where the plurality of first feature maps include more texture details of the input image and/or more position details in the input image than the plurality of second feature maps. The second feature map generating unit is configured to generate a plurality of third feature maps according to the plurality of first feature maps and the plurality of second feature maps. The detection unit is configured to output a detection result of the objects included in the image according to at least one third feature map of the plurality of third feature maps. In the embodiment of the present application, the second feature map generating unit introduces the shallow texture detail information of the original feature maps (the plurality of first feature maps generated by the convolution processing unit) into the deep feature maps (the plurality of second feature maps generated by the first feature map generating unit) to generate the plurality of third feature maps; using these third feature maps, rich in shallow texture detail information, as the input data with which the detection unit performs target detection can improve the detection accuracy of subsequent object detection.
Optionally, the data processing system in the embodiment of the present application may further include:
an intermediate feature extraction layer, configured to convolve at least one third feature map of the plurality of third feature maps to obtain at least one fourth feature map, and to convolve at least one third feature map of the plurality of third feature maps to obtain at least one fifth feature map, where the receptive field corresponding to the fourth feature map is larger than the receptive field corresponding to the fifth feature map; the detection unit is specifically configured to output a detection result of the objects included in the image according to the at least one fourth feature map and the at least one fifth feature map.
Referring to FIG. 14c, FIG. 14c is a schematic diagram of the processing flow of intermediate feature extraction. As shown in FIG. 14c, the third feature map can be convolved by convolution layers with different dilation rates to obtain corresponding feature maps (each with c channels), which are concatenated to obtain feature map 4 (with 3c channels). A global info descriptor is then obtained through global average pooling and arithmetic averaging; non-linearity is introduced through a first fully connected layer; and processing through a second fully connected layer and a sigmoid function limits each weight value to the range 0-1. The weight values are then multiplied channel by channel with the corresponding feature maps to obtain the processed feature maps.
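As an illustration only, the flow of FIG. 14c resembles a squeeze-and-excitation-style channel weighting over multi-dilation branches and might be sketched as follows; the dilation rates (1, 2, 3), the reduction ratio r, and the ReLU in the first FC layer are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiDilationAttention(nn.Module):
    """Sketch of the intermediate feature extraction of Fig. 14c: parallel convolutions with
    different dilation rates, channel concatenation, global average pooling into a descriptor,
    two FC layers with a sigmoid, and channel-wise re-weighting."""
    def __init__(self, c, rates=(1, 2, 3), r=4):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, kernel_size=3, padding=d, dilation=d) for d in rates)
        n = c * len(rates)
        self.fc1 = nn.Linear(n, n // r)  # first FC layer: adds non-linearity (with ReLU)
        self.fc2 = nn.Linear(n // r, n)  # second FC layer: one weight per channel

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)  # (B, 3c, H, W), "feature map 4"
        desc = feats.mean(dim=(2, 3))                            # global info descriptor (B, 3c)
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(desc))))      # weight values limited to [0, 1]
        return feats * w[:, :, None, None]                       # channel-by-channel multiplication
```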
In this embodiment, if the object to be detected included in the fourth feature map is larger than the object to be detected included in the fifth feature map, then correspondingly the first weight value is greater than the second weight value, and the gain obtained by the processed fourth feature map relative to the processed fifth feature map is larger. Since the receptive field corresponding to the fourth feature map is itself larger than that corresponding to the fifth feature map, and a feature map with a larger receptive field carries more information about large targets, target detection using it accordingly achieves higher detection accuracy for large targets. In this embodiment, when the image contains a larger target, the fourth feature map receives more gain than the fifth feature map, so when the detection unit performs target detection on the image based on the processed fourth feature map and the processed fifth feature map, the whole has a larger receptive field and, correspondingly, higher detection accuracy.
In this embodiment, through training, the intermediate feature extraction layer can learn the rule for determining the weight values: for a feature map that includes a large target, the first weight value it determines for the first convolution layer is larger and the second weight value it determines for the second convolution layer is smaller; for a feature map that includes a small target, the first weight value it determines for the first convolution layer is smaller and the second weight value it determines for the second convolution layer is larger.
Referring to FIG. 15, FIG. 15 is a schematic flowchart of an object detection method provided by an embodiment of the application. As shown in FIG. 15, the object detection method includes:
1501. Receive an input first image.
In the embodiment of the present application, when object detection needs to be performed on the first image, the input first image may be received.
1502. Perform object detection on the first image through a first perception network to obtain a first detection result, where the first detection result includes a first detection box.
In the embodiment of the present application, after the input first image is received, object detection may be performed on the first image through the first perception network to obtain the first detection result, where the first detection result includes the first detection box, and the first detection box may indicate the pixel position of a detected object in the first image.
1503. Perform object detection on the first image through a second perception network to obtain a second detection result, where the second detection result includes a second detection box, and there is a first intersection between the second detection box and the first detection box.
In the embodiment of the present application, object detection may be performed on the first image through the second perception network to obtain the second detection result, where the second detection result includes the second detection box, and the second detection box may indicate the pixel position of a detected object in the first image.
In the embodiment of the present application, there is a first intersection between the second detection box and the first detection box; that is, on the first image, the pixel region where the second detection box is located overlaps the pixel region where the first detection box is located.
1504. If the ratio of the area of the intersection to the area of the first detection box is less than a preset value, update the second detection result so that the updated second detection result includes the first detection box.
In the embodiment of the present application, if the ratio of the area of the first intersection to the area of the first detection box is less than the preset value, it can be considered that the first detection box is missing from the second detection result, and the second detection result can then be updated so that the updated second detection result includes the first detection box.
In the embodiment of the present application, the second detection result may further include a third detection box, where there is a second intersection between the third detection box and the first detection box, and the area of the second intersection is smaller than the area of the first intersection.
In the embodiment of the present application, the second detection result may contain multiple detection boxes that intersect the first detection box, among which the intersection between the second detection box and the first detection box is the largest. That is, if the ratio of the area of the largest such intersection (between the second detection box and the first detection box) to the area of the first detection box is less than the preset value, then the ratios of the intersection areas of all the other intersecting detection boxes to the area of the first detection box are also less than the preset value.
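As an illustration only, the check of steps 1501-1504 can be sketched in plain Python; the (x1, y1, x2, y2) box format and the threshold value are assumptions:

```python
def update_missed_detections(boxes_a, boxes_b, thresh=0.5):
    """Sketch of steps 1501-1504: boxes_a come from the higher-accuracy first perception
    network, boxes_b from the second. If no box in boxes_b covers a box in boxes_a by more
    than `thresh` of that box's area, the box is treated as missed and added to the result."""
    def overlap_ratio(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        return (ix * iy) / area_a if area_a > 0 else 0.0

    updated = list(boxes_b)
    for a in boxes_a:
        if all(overlap_ratio(a, b) < thresh for b in boxes_b):
            updated.append(a)  # first detection box added to the updated second result
    return updated
```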
In the embodiment of the present application, for the same image, the object detection accuracy of the first perception network is higher than that of the second perception network, where the object detection accuracy is related to at least one of the following features: the shape of a detection box, its position, or the category of the object corresponding to the detection box.
That is, in the embodiment of the present application, since the object detection accuracy of the first perception network is higher than that of the second perception network, the detection result of the first perception network can be used to update the detection result of the second perception network.
Optionally, the input first image may be received and convolved to generate a plurality of first feature maps; a plurality of second feature maps are generated according to the plurality of first feature maps, where the plurality of first feature maps include more texture detail information and/or position detail information than the plurality of second feature maps; a plurality of third feature maps are generated according to the plurality of first feature maps and the plurality of second feature maps; and target detection is performed on the first image according to at least one third feature map of the plurality of third feature maps, and a first detection result is output.
Optionally, the plurality of second feature maps include more semantic information than the plurality of first feature maps.
Optionally, the plurality of first feature maps include a first target feature map, the plurality of second feature maps include a second target feature map, and the plurality of third feature maps include a third target feature map and a fourth target feature map, where the resolution of the third target feature map is smaller than that of the fourth target feature map. The third target feature map is down-sampled to obtain a fifth target feature map, where the fifth target feature map has the same number of channels and resolution as the second target feature map; the first target feature map is down-sampled and convolved to obtain a sixth target feature map, where the sixth target feature map has the same number of channels and resolution as the second target feature map; and the fifth target feature map, the second target feature map, and the sixth target feature map are superimposed channel by channel to generate the fourth target feature map. Alternatively, the third target feature map is down-sampled to obtain a fifth target feature map with the same number of channels and resolution as the second target feature map; the first target feature map is down-sampled to obtain a sixth target feature map with the same resolution as the second target feature map; and the fifth target feature map, the second target feature map, and the sixth target feature map are stacked channel by channel and convolved to generate the fourth target feature map, where the fourth target feature map has the same number of channels as the second target feature map.
Optionally, at least one third feature map of the plurality of third feature maps may be convolved by a first convolution layer to obtain at least one fourth feature map, and at least one third feature map of the plurality of third feature maps may be convolved by a second convolution layer to obtain at least one fifth feature map, where the receptive field corresponding to the fourth feature map is larger than the receptive field corresponding to the fifth feature map; target detection is performed on the first image according to the fourth feature map and the fifth feature map, and a first detection result is output.
Optionally, a first weight value corresponding to the fourth feature map and a second weight value corresponding to the fifth feature map are determined; the fourth feature map is processed according to the first weight value to obtain a processed fourth feature map, and the fifth feature map is processed according to the second weight value to obtain a processed fifth feature map, where, in the case that the object to be detected included in the fourth feature map is larger than the object to be detected included in the fifth feature map, the first weight value is greater than the second weight value; target detection is performed on the first image according to the processed fourth feature map and the processed fifth feature map, and a first detection result is output.
Optionally, the first detection result includes a first detection box, and a second detection result of the first image may be obtained, where the second detection result is obtained by performing object detection on the first image through a second perception network, the object detection accuracy of the first perception network is higher than that of the second perception network, the second detection result includes a second detection box, and there is an intersection between the region where the second detection box is located and the region where the first detection box is located.
If the ratio of the area of the intersection to the area of the first detection box is less than a preset value, the second detection result is updated so that the updated second detection result includes the first detection box.
Optionally, the second detection result includes multiple detection boxes, there is an intersection between the region where each of the multiple detection boxes is located and the region where the first detection box is located, and the multiple detection boxes include the second detection box, where, among the areas of the intersections between the region of each of the multiple detection boxes and the region of the first detection box, the area of the intersection between the region where the second detection box is located and the region where the first detection box is located is the smallest.
Optionally, the first image is an image frame in a video, the second image is an image frame in the video, and the frame distance between the first image and the second image in the video is less than a preset value; a third detection result of the second image is obtained, where the third detection result includes a fourth detection box and the object category corresponding to the fourth detection box; in the case that the shape difference and the position difference between the fourth detection box and the first detection box are within a preset range, the first detection box corresponds to the object category corresponding to the fourth detection box.
Optionally, the detection confidence of the fourth detection box is greater than a preset threshold.
In the embodiment of the present application, the first image is an image frame in a video, the second image is an image frame in the video, and the frame distance between the first image and the second image in the video is less than a preset value; a third detection result of the second image may also be obtained, where the third detection result is obtained by performing object detection on the second image through the first perception network or the second perception network, and the third detection result includes a fourth detection box and the object category corresponding to the fourth detection box. If the shape difference and the position difference between the fourth detection box and the first detection box are within a preset range, it is determined that the first detection box corresponds to the object category corresponding to the fourth detection box. Optionally, the detection confidence of the fourth detection box is greater than a preset threshold.
In the embodiment of the present application, if the first image is obtained by extracting frames from a video, a temporal detection algorithm may be considered. For a missed target detected by the missed-detection algorithm, several frames before and after the current image are selected. By checking, in the region near the center of the missed target in the preceding and following frames, whether there are targets similar to the missed target in area, aspect ratio, and center coordinates, a specific number of similar target boxes are selected; the similar target boxes are then compared with the other target boxes in the image with the suspected missed detection, and any detected similar target box that resembles one of those other target boxes is removed. Based on content-similarity and feature-similarity algorithms, the most similar target box in each of the preceding and following frames is obtained, yielding the target box most similar to the suspected missed one. If the confidence of the most similar target box is higher than a certain threshold, the category of the missed target can be determined; if it is lower than that threshold, the suspected missed target is compared with the category of the most similar target box: if the categories are the same, it is judged as a missed detection, and if the categories are different, manual verification may be performed.
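As an illustration only, the core of this temporal check might be sketched as follows; the box format, the equal weighting of the similarity terms, and the confidence threshold are all assumptions:

```python
def assign_category_from_neighbors(missed_box, neighbor_boxes, conf_thresh=0.7):
    """Sketch of the temporal check: among boxes from nearby frames, find the one most similar
    to the suspected missed box in center, area, and aspect ratio; if its confidence clears the
    threshold, adopt its category. missed_box is (cx, cy, w, h); each neighbor box is
    (cx, cy, w, h, category, confidence)."""
    def similarity(a, b):
        center = abs(a[0] - b[0]) + abs(a[1] - b[1])
        area = abs(a[2] * a[3] - b[2] * b[3])
        aspect = abs(a[2] / a[3] - b[2] / b[3])
        return -(center + area + aspect)  # larger is more similar

    best = max(neighbor_boxes, key=lambda b: similarity(missed_box, b), default=None)
    if best is not None and best[5] >= conf_thresh:
        return best[4]  # category of the most similar, sufficiently confident neighbor box
    return None         # fall back to category comparison or manual verification
```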
An embodiment of the present application provides an object detection method, including: receiving an input first image; performing object detection on the first image through a first perception network to obtain a first detection result, where the first detection result includes a first detection box; performing object detection on the first image through a second perception network to obtain a second detection result, where the second detection result includes a second detection box and there is a first intersection between the second detection box and the first detection box; and, if the ratio of the area of the first intersection to the area of the first detection box is less than a preset value, updating the second detection result so that the updated second detection result includes the first detection box. In this way, temporal characteristics are introduced into the model to assist in judging whether a suspected missed detection is a true missed detection and to judge the category of the missed detection, improving verification efficiency.
Referring to FIG. 16, FIG. 16 is a schematic flowchart of a perception network training method provided by an embodiment of this application. This training method can be used to train the perception network in the foregoing embodiments. It should be noted that the first perception network in this embodiment may be the initial network of the perception network in the foregoing embodiments, and the second perception network obtained by training the first perception network may be the perception network in the foregoing embodiments.
As shown in FIG. 16, the perception network training method includes:
1601、获取图像中目标物体的预标注检测框。1601. Obtain a pre-labeled detection frame of a target object in an image.
1602、获取对应于所述图像以及第一感知网络的目标检测框,所述目标检测框用于标识所述目标物体。1602. Acquire a target detection frame corresponding to the image and the first perception network, where the target detection frame is used to identify the target object.
本申请实施例中,可以获取所述图像的检测结果,所述检测结果为通过第一感知网络对所述图像进行物体检测得到的,所述检测结果包括所述第一物体对应的目标检测框。In the embodiment of the present application, the detection result of the image may be obtained. The detection result is obtained by object detection on the image through a first perception network, and the detection result includes the target detection frame corresponding to the first object. .
1603、根据损失函数对所述第一感知网络进行迭代训练,以输出第二感知网络;其中,所述损失函数与所述预标注检测框和所述目标检测框之间的交并比IoU有关。1603. Perform iterative training on the first perception network according to a loss function to output a second perception network; wherein the intersection of the loss function and the pre-labeled detection frame and the target detection frame is more related to IoU .
本申请实施例中,所述预设的损失函数还与所述目标检测框与所述预标注检测框的形状差异有关,其中,所述形状差异与所述预标注检测框的面积负相关。In the embodiment of the present application, the preset loss function is also related to the shape difference between the target detection frame and the pre-labeled detection frame, wherein the shape difference is negatively related to the area of the pre-labeled detection frame.
本申请实施例中,所述矩形检测框包括相连的第一边和第二边,所述外接矩形框包括与所述第一边对应的第三边,以及与所述第二边对应的第四边,所述面积差异还与所述第一边和所述第三边的长度差异正相关、以及与所述第二边和所述第四边的长度差异正相关。In the embodiment of the present application, the rectangular detection frame includes a first side and a second side that are connected, and the circumscribed rectangular frame includes a third side corresponding to the first side, and a first side corresponding to the second side. Four sides, the area difference is also positively correlated with the length difference between the first side and the third side, and positively correlated with the length difference between the second side and the fourth side.
本申请实施例中,所述预设的损失函数还与所述目标检测框与所述预标注检测框在所述图像中的位置差异有关,其中,所述位置差异与所述预标注检测框的面积负相关;或所述位置差异与所述预标注检测框和所述目标检测框的凸包的最小外接矩形的面积负相关。In the embodiment of the present application, the preset loss function is also related to the position difference between the target detection frame and the pre-labeled detection frame in the image, wherein the position difference is related to the pre-labeled detection frame. The area of ?? is negatively correlated; or the position difference is negatively correlated with the area of the smallest circumscribed rectangle of the convex hull of the pre-labeled detection frame and the target detection frame.
本申请实施例中,所述目标检测框包括第一角点和第一中心点,所述预标注检测框包括第二角点和第二中心点,所述第一角点和所述第二角点为矩形对角线的两个端点,所述位置差异还与所述第一中心点和所述第二中心点在所述图像中的位置差异正相关、以及与所述第一角点和所述第二角点的长度负相关。In the embodiment of the present application, the target detection frame includes a first corner point and a first center point, the pre-labeled detection frame includes a second corner point and a second center point, the first corner point and the second center point are The corner points are the two end points of the diagonal of the rectangle, and the position difference is also positively correlated with the position difference between the first center point and the second center point in the image, and with the first corner point. It is negatively related to the length of the second corner point.
Exemplarily, the loss function may take the following form, consistent with the three terms described below:

$$L_{box}=\lambda_{1}\,(1-\mathrm{IoU})+\lambda_{2}\,L_{aspect}+\lambda_{3}\,\frac{d\left(o_{p},o_{g}\right)}{d\left(p_{br},g_{tl}\right)}$$

where $\mathrm{IoU}$ is the intersection-over-union between the predicted frame and the ground-truth frame, $L_{aspect}$ is a term measuring the aspect-ratio difference between the two frames, $o_{p}$ and $o_{g}$ are the center points of the predicted frame and the ground-truth frame, $p_{br}$ and $g_{tl}$ are the bottom-right corner of the predicted frame and the top-left corner of the ground-truth frame, $d(\cdot,\cdot)$ is the distance between two points, and $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$ are the weights balancing the three terms.
In this embodiment of the present application, the newly designed frame regression loss function combines: an IoU loss term, which has scale invariance and is taken from the metric used in object detection; a loss term that accounts for the aspect ratio of the predicted frame relative to the ground-truth frame; and a loss term given by the ratio of the distance between the predicted frame center and the ground-truth frame center to the distance between the bottom-right corner of the predicted frame and the top-left corner of the ground-truth frame. The IoU loss term naturally introduces a scale-invariant measure of frame prediction quality, and the aspect-ratio loss term measures how well the shapes of the two frames fit each other. The third term, the distance-ratio term, addresses the problem that when IoU = 0 the relative position between the predicted frame and the ground-truth frame cannot be determined and it is difficult to back-propagate and minimize the loss function: after the distance ratio is introduced, making the ratio smaller naturally pulls the centers o_p and o_g closer together while pushing the bottom-right corner of the predicted frame and the top-left corner of the ground-truth frame farther apart. Each of the three loss terms is assigned a different weight to balance its influence; the aspect-ratio term and the distance-ratio term introduce a weight coefficient that is inversely related to the frame scale, so that the influence of frame scale is reduced: large-scale frames receive smaller weights, and small-scale frames receive larger weights. The frame regression loss function proposed in this patent is applicable to all kinds of two-stage and one-stage algorithms, has good generality, and provides an excellent promoting effect on the fit of the target scale, the fit between the frames, and the fit between the center points and the frame corner points.
In this embodiment of the present application, the IoU, which in object detection measures how well the detected frame fits the ground-truth frame, is used as the loss function for frame regression; because IoU is inherently scale invariant, this solves the problem that previous frame regression functions were relatively sensitive to scale changes. A loss term for the aspect ratio between the predicted frame and the ground-truth frame is introduced, which drives the predicted frame to fit the ground-truth frame more closely during training, and a weight coefficient that is inversely related to the frame scale is introduced to reduce the influence of scale changes. The ratio of the distance between the predicted frame center and the ground-truth frame center to the distance between the bottom-right corner of the predicted frame and the top-left corner of the ground-truth frame is introduced to solve the problem that when IoU = 0, the relative position of, and distance between, the two frames cannot be determined and back-propagation cannot be performed: after the distance ratio is introduced, making the ratio smaller naturally pulls the centers o_p and o_g closer together and pushes the bottom-right corner of the predicted frame and the top-left corner of the ground-truth frame farther apart, driving the frame to change in the correct direction.
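The composite frame regression loss described above can be sketched as follows. This is a minimal illustration in which the (x1, y1, x2, y2) box layout, the logarithmic form of the aspect-ratio term, the inverse-area scale weight, and the default weights lam1, lam2, lam3 are assumptions rather than the patent's exact formula.

import math

def iou(p, g):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(p[2], g[2]) - max(p[0], g[0]))
    ih = max(0.0, min(p[3], g[3]) - max(p[1], g[1]))
    inter = iw * ih
    union = (p[2]-p[0])*(p[3]-p[1]) + (g[2]-g[0])*(g[3]-g[1]) - inter
    return inter / union if union > 0 else 0.0

def box_regression_loss(p, g, lam1=1.0, lam2=1.0, lam3=1.0):
    pw, ph = p[2]-p[0], p[3]-p[1]
    gw, gh = g[2]-g[0], g[3]-g[1]
    # Scale-dependent weight: larger ground-truth frames get smaller weight.
    w_scale = 1.0 / (gw * gh)
    # Term 1: scale-invariant IoU loss.
    l_iou = 1.0 - iou(p, g)
    # Term 2: aspect-ratio fit between predicted and ground-truth frames.
    l_aspect = abs(math.log((pw / ph) / (gw / gh)))
    # Term 3: ratio of the center-to-center distance to the distance
    # between the predicted bottom-right corner and the ground-truth
    # top-left corner; stays informative even when IoU == 0.
    c_p = ((p[0]+p[2])/2, (p[1]+p[3])/2)
    c_g = ((g[0]+g[2])/2, (g[1]+g[3])/2)
    num = math.dist(c_p, c_g)
    den = math.dist((p[2], p[3]), (g[0], g[1])) + 1e-9
    l_dist = num / den
    return lam1*l_iou + lam2*w_scale*l_aspect + lam3*w_scale*l_dist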
The preset loss function includes a target loss item related to the position difference, and the target loss item changes as the position difference changes, where, when the position difference is greater than a preset value, the rate of change of the target loss item is greater than a first preset rate of change; and/or, when the position difference is less than the preset value, the rate of change of the target loss item is less than a second preset rate of change.
Optionally, the frame regression loss function may also take the following form, consistent with the description below:

$$L_{box}=\lambda_{1}\,(1-\mathrm{IoU})+\lambda_{2}\,L_{aspect}+\lambda_{3}\,L_{pull}$$

$$L_{pull}=-\ln\left(1-\mathrm{distance\_IoU}\right),\qquad \mathrm{distance\_IoU}=\frac{area_{1}}{area_{2}}$$

where $area_{1}$ is the area of the rectangle whose diagonal connects the center of the predicted frame and the center of the ground-truth frame, and $area_{2}$ is the area of the smallest convex-hull rectangle enclosing the predicted frame and the ground-truth frame.
The above frame regression loss function uses: an IoU loss term, which has scale invariance and is widely used in detection metrics; a loss term that accounts for the aspect ratio of the predicted frame relative to the ground-truth frame; and a pull loss term that narrows the distance between the predicted frame and the ground-truth frame. The IoU loss term naturally introduces a scale-invariant measure of frame prediction quality, and the aspect-ratio loss term measures how well the shapes of the two frames fit. The third term addresses the problem that when IoU = 0, the relative position between the predicted frame and the ground-truth frame cannot be determined and it is difficult to back-propagate and minimize the loss function; therefore a pull loss term that narrows the distance between the predicted frame and the ground-truth frame is introduced. The ratio of area1, the area of the rectangle whose diagonal connects the center of the predicted frame and the center of the ground-truth frame, to area2, the area of the smallest convex-hull rectangle of the predicted frame and the ground-truth frame, is taken as the distance_IoU term, and f(x) = -ln(1-x) is used as the loss calculation. Since distance_IoU lies in the interval [0, 1), and the curve of f(x) = -ln(1-x) shows a fast-convergence trend on [0, 1), setting x = distance_IoU to obtain the third loss term serves to quickly pull the predicted frame toward the ground-truth frame.
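The pull loss admits a direct sketch from the description above; the axis-aligned (x1, y1, x2, y2) box layout is an assumption, and area1 and area2 are computed literally as described.

import math

def pull_loss(p, g):
    """Pull loss L = -ln(1 - distance_IoU) described above."""
    # area1: rectangle whose diagonal joins the two box centers.
    cpx, cpy = (p[0]+p[2])/2, (p[1]+p[3])/2
    cgx, cgy = (g[0]+g[2])/2, (g[1]+g[3])/2
    area1 = abs(cpx - cgx) * abs(cpy - cgy)
    # area2: smallest rectangle enclosing both boxes.
    area2 = ((max(p[2], g[2]) - min(p[0], g[0]))
             * (max(p[3], g[3]) - min(p[1], g[1])))
    distance_iou = area1 / area2          # lies in [0, 1)
    return -math.log(1.0 - distance_iou)  # grows fast as the ratio nears 1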
Referring to FIG. 17, FIG. 17 is a schematic flowchart of an object detection method provided by an embodiment of this application. As shown in FIG. 17, the object detection method includes the following steps.

1701. Receive an input first image.

1702. Perform convolution processing on the first image to generate multiple first feature maps.

It should be noted that "performing convolution processing on the input image" here should not be understood as performing only convolution processing on the input image; in some implementations, convolution processing, pooling operations, and the like may be performed on the input image.

It should be noted that "performing convolution processing on the first image to generate multiple first feature maps" should not be understood to mean only that the first image is convolved multiple times, with each convolution generating one first feature map; that is, it should not be understood that every first feature map is obtained by directly convolving the first image. Rather, viewed as a whole, the first image is the source of the multiple first feature maps. In one implementation, the first image may be convolved to obtain one first feature map, the generated first feature map may then be convolved to obtain another first feature map, and so on, so that multiple first feature maps are obtained.

It should be noted that a series of convolution operations may be performed on the input image. Specifically, each convolution operation may be performed on the first feature map obtained by the previous convolution operation to produce a further first feature map; in this way, multiple first feature maps can be obtained.

It should be noted that the multiple first feature maps may be feature maps with multi-scale resolution; that is, the multiple first feature maps do not all have the same resolution. In an optional implementation, the multiple first feature maps may form a feature pyramid.

The input image may be received and convolved to generate multiple first feature maps with multi-scale resolution. A convolution processing unit may perform a series of convolution operations on the input image to obtain feature maps at different scales (with different resolutions). The convolution processing unit can take many forms, such as a visual geometry group (VGG) network, a residual neural network (resnet), or the core structure of GoogLeNet (Inception-net).

1703. Generate multiple second feature maps according to the multiple first feature maps, where the multiple first feature maps include more texture details of the input image and/or more position details in the input image than the multiple second feature maps.

It should be noted that "generating multiple second feature maps according to the multiple first feature maps" should not be understood to mean that every second feature map is generated from the multiple first feature maps. In one implementation, some of the second feature maps are generated directly based on one or more of the first feature maps; in another implementation, some of the second feature maps are generated directly based on one or more of the first feature maps together with second feature maps other than themselves; in yet another implementation, some of the second feature maps are generated directly based only on second feature maps other than themselves. In this last case, because those other second feature maps are themselves generated based on one or more of the first feature maps, the process can still be understood as generating the multiple second feature maps according to the multiple first feature maps.

It should be noted that the multiple second feature maps may be feature maps with multi-scale resolution; that is, the multiple second feature maps do not all have the same resolution. In an optional implementation, the multiple second feature maps may form a feature pyramid.

A convolution operation may be performed on the topmost feature map C4 among the multiple first feature maps generated by the convolution processing unit. Exemplarily, dilated convolution and 1×1 convolution may be used to reduce the number of channels of the topmost feature map C4 to 256, yielding the topmost feature map P4 of the feature pyramid. The output of the next-lower feature map C3 is laterally connected and its channel count is reduced to 256 by a 1×1 convolution, and the result is added channel by channel and pixel by pixel to feature map P4 to obtain feature map P3. By analogy, working from top to bottom, a first feature pyramid is constructed, and the first feature pyramid may include the multiple second feature maps.
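The top-down construction just described can be sketched in PyTorch as follows. The module names, backbone channel counts, and the nearest-neighbor up-sampling used to align resolutions before the element-wise addition are illustrative assumptions, not the patent's exact network definition.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownPyramid(nn.Module):
    """Builds P-maps from backbone maps C2..C4 (channel counts assumed)."""
    def __init__(self, c_channels=(64, 128, 256), out_channels=256):
        super().__init__()
        # Dilated 3x3 conv + 1x1 conv for the topmost map, as described.
        self.top_dilated = nn.Conv2d(c_channels[-1], c_channels[-1], 3,
                                     padding=2, dilation=2)
        self.top_reduce = nn.Conv2d(c_channels[-1], out_channels, 1)
        # 1x1 lateral convs reduce each lower C-map to 256 channels.
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, 1) for c in c_channels[:-1])

    def forward(self, c_maps):                          # c_maps = [C2, C3, C4]
        p = self.top_reduce(self.top_dilated(c_maps[-1]))   # P4
        outs = [p]
        for c, lat in zip(reversed(c_maps[:-1]), list(self.laterals)[::-1]):
            # Up-sample the higher P-map to the lower map's resolution,
            # then add the lateral 1x1-reduced C-map element-wise.
            p = F.interpolate(p, size=c.shape[-2:], mode="nearest") + lat(c)
            outs.append(p)
        return outs[::-1]                               # [P2, P3, P4]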
Texture details can be used to express the detailed information of small targets and edge features. Compared with the second feature maps, the first feature maps include more texture detail information, so that detection results for small targets have higher detection precision. Position details can be information expressing the positions of objects in the image and the relative positions between objects.

Compared with the first feature maps, the multiple second feature maps may include more deep features. Deep features contain rich semantic information, which works well for classification tasks, and they have larger receptive fields, which gives a better detection effect for large targets. In one implementation, by introducing a top-down path to generate the multiple second feature maps, the rich semantic information contained in the deep features can naturally be propagated downward, so that the second feature maps at every scale contain rich semantic information.

1704. Generate multiple third feature maps according to the multiple first feature maps and the multiple second feature maps.

1705. Output a first detection result of an object included in the image according to at least one third feature map of the multiple third feature maps.

In an existing implementation, a second feature map generation unit (for example, a feature pyramid network) introduces a top-down path to propagate the rich semantic information contained in deep features downward, so that the second feature maps at every scale contain rich semantic information; meanwhile, the deep features have large receptive fields, giving a good detection effect for large targets. However, the existing implementation ignores the finer position detail information and texture detail information contained in shallower feature maps, which greatly affects the detection precision for medium and small targets. In this embodiment of the present application, the second feature map generation unit introduces the shallow-layer texture detail information of the original feature maps (the multiple first feature maps generated by the convolution processing unit) into the deep feature maps (the multiple second feature maps generated by the first feature map generation unit) to generate multiple third feature maps. Using these third feature maps, which carry the shallow layers' rich texture detail information, as the input data for target detection by the detection unit can improve the precision of subsequent object detection.

It should be noted that this embodiment does not mean that object detection precision is higher for every image that includes small targets; rather, over a large number of samples, this embodiment can achieve higher overall detection precision.

Optionally, the multiple second feature maps include more semantic information than the multiple first feature maps.

Optionally, the multiple first feature maps, the multiple second feature maps, and the multiple third feature maps are feature maps with multi-scale resolution.

Optionally, the multiple first feature maps include a first target feature map, the multiple second feature maps include a second target feature map, and the multiple third feature maps include a third target feature map and a fourth target feature map, where the resolution of the third target feature map is smaller than that of the fourth target feature map. Generating the multiple third feature maps according to the multiple first feature maps and the multiple second feature maps includes:

down-sampling the third target feature map to obtain a fifth target feature map, where the fifth target feature map has the same number of channels and the same resolution as the second target feature map; down-sampling and convolving the first target feature map to obtain a sixth target feature map, where the sixth target feature map has the same number of channels and the same resolution as the second target feature map; and stacking the fifth target feature map, the second target feature map, and the sixth target feature map in the channel dimension to generate the fourth target feature map; or,

down-sampling the third target feature map to obtain a fifth target feature map, where the fifth target feature map has the same number of channels and the same resolution as the second target feature map; down-sampling the first target feature map to obtain a sixth target feature map, where the sixth target feature map has the same resolution as the second target feature map; and stacking the fifth target feature map, the second target feature map, and the sixth target feature map in the channel dimension and performing convolution processing to generate the fourth target feature map, where the fourth target feature map has the same number of channels as the second target feature map.
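The second alternative above (down-sample, stack in the channel dimension, then convolve back to the original channel count) can be sketched as follows; the tensor shapes, the adaptive max-pooling used for down-sampling, and the 3×3 reduction convolution are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def build_fourth_map(third_map, second_map, first_map, reduce_conv):
    """Sketch of the second alternative: down-sample, stack channels,
    then convolve back to second_map's channel count."""
    # Fifth map: third map down-sampled to second_map's resolution.
    fifth = F.adaptive_max_pool2d(third_map, second_map.shape[-2:])
    # Sixth map: first map down-sampled to the same resolution.
    sixth = F.adaptive_max_pool2d(first_map, second_map.shape[-2:])
    stacked = torch.cat([fifth, second_map, sixth], dim=1)
    return reduce_conv(stacked)   # restores second_map's channel count

# Hypothetical shapes: channels (256, 256, 64), resolutions 52/26/104.
third_map = torch.randn(1, 256, 52, 52)
second_map = torch.randn(1, 256, 26, 26)
first_map = torch.randn(1, 64, 104, 104)
reduce_conv = nn.Conv2d(256 + 256 + 64, 256, kernel_size=3, padding=1)
fourth_map = build_fourth_map(third_map, second_map, first_map, reduce_conv)
print(fourth_map.shape)  # torch.Size([1, 256, 26, 26])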
Optionally, the method further includes:

convolving at least one third feature map of the multiple third feature maps through a first convolution layer to obtain at least one fourth feature map; and

convolving at least one third feature map of the multiple third feature maps to obtain at least one fifth feature map, where the receptive field corresponding to the fourth feature map is larger than the receptive field corresponding to the fifth feature map.

Correspondingly, outputting the first detection result of the object included in the image according to the at least one third feature map of the multiple third feature maps includes:

outputting the first detection result of the object included in the image according to the at least one fourth feature map and the at least one fifth feature map.
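One plausible way to obtain two branches with different receptive fields over the same third feature map is to vary the dilation of otherwise identical parallel convolutions; the sketch below uses this mechanism purely for illustration, as the embodiment does not mandate it.

import torch.nn as nn

class DualReceptiveField(nn.Module):
    """Two parallel convs over the same input; the dilated branch
    (fourth map) sees a larger receptive field than the plain branch
    (fifth map)."""
    def __init__(self, channels=256):
        super().__init__()
        self.large_rf = nn.Conv2d(channels, channels, 3,
                                  padding=3, dilation=3)  # fourth map
        self.small_rf = nn.Conv2d(channels, channels, 3,
                                  padding=1)              # fifth map

    def forward(self, third_map):
        return self.large_rf(third_map), self.small_rf(third_map)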
Optionally, the method further includes:

processing the fourth feature map according to a first weight value to obtain a processed fourth feature map; and

processing the fifth feature map according to a second weight value to obtain a processed fifth feature map, where, in a case where the object to be detected included in the fourth feature map is larger than the object to be detected included in the fifth feature map, the first weight value is greater than the second weight value.

Correspondingly, outputting the first detection result of the object included in the image according to the at least one third feature map of the multiple third feature maps includes:

outputting the first detection result of the object included in the image according to the processed fourth feature map and the processed fifth feature map.
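A minimal sketch of the weighting step, assuming scalar per-map weights (the weights could equally be learned or chosen per target scale):

def weight_maps(fourth_map, fifth_map, w1=0.7, w2=0.3):
    """Scale each branch before detection; when larger objects dominate,
    w1 (the large-receptive-field branch) should exceed w2."""
    return w1 * fourth_map, w2 * fifth_map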
Optionally, the first detection result includes a first detection frame, and the method further includes:

obtaining a second detection result of the first image, where the second detection result is obtained by performing object detection on the first image through a second perception network, the object detection precision of the first perception network is higher than that of the second perception network, the second detection result includes a second detection frame, and an intersection exists between the area where the second detection frame is located and the area where the first detection frame is located; and

if the ratio of the area of the intersection to the area of the first detection frame is less than a preset value, updating the second detection result so that the updated second detection result includes the first detection frame.

Optionally, the second detection result includes multiple detection frames, an intersection exists between the area where each of the multiple detection frames is located and the area where the first detection frame is located, and the multiple detection frames include the second detection frame, where, among the areas of intersection between the area of each of the multiple detection frames and the area of the first detection frame, the area of the intersection between the area where the second detection frame is located and the area where the first detection frame is located is the smallest.

Optionally, the first image is an image frame in a video, a second image is an image frame in the video, and the frame distance between the first image and the second image in the video is less than a preset value; the method further includes:

obtaining a third detection result of the second image, where the third detection result includes a fourth detection frame and an object category corresponding to the fourth detection frame; and

in a case where the shape difference and the position difference between the fourth detection frame and the first detection frame are within preset ranges, determining that the first detection frame corresponds to the object category corresponding to the fourth detection frame.

Optionally, the detection confidence of the fourth detection frame is greater than a preset threshold.
Referring to FIG. 18, FIG. 18 is a schematic diagram of a perception network training apparatus provided by an embodiment of this application. As shown in FIG. 18, the perception network training apparatus 1800 includes:

an obtaining module 1801, configured to obtain a pre-labeled detection frame of a target object in an image, and to obtain a target detection frame corresponding to the image and a first perception network, where the target detection frame is used to identify the target object; and

an iterative training module 1802, configured to perform iterative training on the first perception network according to a loss function, to output a second perception network, where the loss function is related to the intersection-over-union (IoU) between the pre-labeled detection frame and the target detection frame.

Optionally, the preset loss function is further related to the shape difference between the target detection frame and the pre-labeled detection frame, where the shape difference is negatively correlated with the area of the pre-labeled detection frame.

Optionally, the preset loss function is further related to the position difference between the target detection frame and the pre-labeled detection frame in the image, where the position difference is negatively correlated with the area of the pre-labeled detection frame; or the position difference is negatively correlated with the area of the smallest circumscribed rectangle of the convex hull of the pre-labeled detection frame and the target detection frame.

Optionally, the target detection frame includes a first corner point and a first center point, and the pre-labeled detection frame includes a second corner point and a second center point, where the first corner point and the second corner point are the two end points of a rectangle diagonal; the position difference is further positively correlated with the position difference between the first center point and the second center point in the image, and negatively correlated with the distance between the first corner point and the second corner point.

Optionally, the preset loss function includes a target loss item related to the position difference, and the target loss item changes as the position difference changes, where:

when the position difference is greater than a preset value, the rate of change of the target loss item is greater than a first preset rate of change; and/or, when the position difference is less than the preset value, the rate of change of the target loss item is less than a second preset rate of change.
Referring to FIG. 19, FIG. 19 is a schematic diagram of an object detection apparatus provided by an embodiment of this application. As shown in FIG. 19, the object detection apparatus 1900 includes:

a receiving module 1901, configured to receive an input first image;

a convolution processing module 1902, configured to perform convolution processing on the first image to generate multiple first feature maps;

a first feature map generation module 1903, configured to generate multiple second feature maps according to the multiple first feature maps, where the multiple first feature maps include more texture details of the input image and/or more position details in the input image than the multiple second feature maps;

a second feature map generation module 1904, configured to generate multiple third feature maps according to the multiple first feature maps and the multiple second feature maps; and

a detection module 1905, configured to output a first detection result of an object included in the image according to at least one third feature map of the multiple third feature maps.

Optionally, the multiple second feature maps include more semantic information than the multiple first feature maps.

Optionally, the multiple first feature maps, the multiple second feature maps, and the multiple third feature maps are feature maps with multi-scale resolution.

Optionally, the multiple first feature maps include a first target feature map, the multiple second feature maps include a second target feature map, and the multiple third feature maps include a third target feature map and a fourth target feature map, where the resolution of the third target feature map is smaller than that of the fourth target feature map; the second feature map generation module is specifically configured to:

down-sample the third target feature map to obtain a fifth target feature map, where the fifth target feature map has the same number of channels and the same resolution as the second target feature map; down-sample and convolve the first target feature map to obtain a sixth target feature map, where the sixth target feature map has the same number of channels and the same resolution as the second target feature map; and stack the fifth target feature map, the second target feature map, and the sixth target feature map in the channel dimension to generate the fourth target feature map; or,

down-sample the third target feature map to obtain a fifth target feature map, where the fifth target feature map has the same number of channels and the same resolution as the second target feature map; down-sample the first target feature map to obtain a sixth target feature map, where the sixth target feature map has the same resolution as the second target feature map; and stack the fifth target feature map, the second target feature map, and the sixth target feature map in the channel dimension and perform convolution processing to generate the fourth target feature map, where the fourth target feature map has the same number of channels as the second target feature map.
Optionally, the apparatus further includes:

an intermediate feature extraction module, configured to convolve at least one third feature map of the multiple third feature maps through a first convolution layer to obtain at least one fourth feature map, and to convolve at least one third feature map of the multiple third feature maps to obtain at least one fifth feature map, where the receptive field corresponding to the fourth feature map is larger than the receptive field corresponding to the fifth feature map.

Correspondingly, the detection module is specifically configured to:

output the first detection result of the object included in the image according to the at least one fourth feature map and the at least one fifth feature map.

Optionally, the apparatus further includes a module configured to:

process the fourth feature map according to a first weight value to obtain a processed fourth feature map; and

process the fifth feature map according to a second weight value to obtain a processed fifth feature map, where, in a case where the object to be detected included in the fourth feature map is larger than the object to be detected included in the fifth feature map, the first weight value is greater than the second weight value.

Correspondingly, the detection module is specifically configured to:

output the first detection result of the object included in the image according to the processed fourth feature map and the processed fifth feature map.

Optionally, the first detection result includes a first detection frame, and the obtaining module is further configured to:

obtain a second detection result of the first image, where the second detection result is obtained by performing object detection on the first image through a second perception network, the object detection precision of the first perception network is higher than that of the second perception network, the second detection result includes a second detection frame, and an intersection exists between the area where the second detection frame is located and the area where the first detection frame is located.

The apparatus further includes an update module, configured to: if the ratio of the area of the intersection to the area of the first detection frame is less than a preset value, update the second detection result so that the updated second detection result includes the first detection frame.

Optionally, the first image is an image frame in a video, a second image is an image frame in the video, and the frame distance between the first image and the second image in the video is less than a preset value; the obtaining module is further configured to:

obtain a third detection result of the second image, where the third detection result includes a fourth detection frame and an object category corresponding to the fourth detection frame, and

in a case where the shape difference and the position difference between the fourth detection frame and the first detection frame are within preset ranges, the first detection frame corresponds to the object category corresponding to the fourth detection frame.

Optionally, the detection confidence of the fourth detection frame is greater than a preset threshold.
Next, an execution device provided by an embodiment of this application is described. Referring to FIG. 20, FIG. 20 is a schematic structural diagram of the execution device provided by an embodiment of this application. The execution device 2000 may specifically be embodied as a virtual reality (VR) device, a mobile phone, a tablet, a laptop computer, a smart wearable device, a monitoring data processing device, or the like, which is not limited here. The object detection apparatus described in the embodiment corresponding to FIG. 19 may be deployed on the execution device 2000 to implement the object detection function of the embodiment corresponding to FIG. 19. Specifically, the execution device 2000 includes a receiver 2001, a transmitter 2002, a processor 2003, and a memory 2004 (the execution device 2000 may contain one or more processors 2003; one processor is taken as an example in FIG. 20), where the processor 2003 may include an application processor 20031 and a communication processor 20032. In some embodiments of the present application, the receiver 2001, the transmitter 2002, the processor 2003, and the memory 2004 may be connected by a bus or in other ways.

The memory 2004 may include a read-only memory and a random access memory, and provides instructions and data to the processor 2003. A part of the memory 2004 may further include a non-volatile random access memory (NVRAM). The memory 2004 stores operation instructions executable by the processor, executable modules, or data structures, or subsets or extended sets thereof, where the operation instructions may include various operation instructions for implementing various operations.

The processor 2003 controls the operation of the execution device. In a specific application, the components of the execution device are coupled together through a bus system, where the bus system may include, in addition to a data bus, a power bus, a control bus, a status signal bus, and the like. For clarity of description, however, the various buses are all referred to as the bus system in the figure.

The methods disclosed in the foregoing embodiments of the present application may be applied to the processor 2003 or implemented by the processor 2003. The processor 2003 may be an integrated circuit chip with signal processing capability. In an implementation process, the steps of the foregoing methods may be completed by an integrated logic circuit of hardware in the processor 2003 or by instructions in the form of software. The processor 2003 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 2003 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in a decoding processor. The software module may be located in a mature storage medium in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 2004, and the processor 2003 reads information in the memory 2004 and completes the steps of the foregoing methods in combination with its hardware.

The receiver 2001 may be configured to receive input digital or character information and to generate signal inputs related to relevant settings and function control of the execution device. The transmitter 2002 may be configured to output digital or character information through a first interface; the transmitter 2002 may further be configured to send instructions to a disk group through the first interface to modify data in the disk group; and the transmitter 2002 may further include a display device such as a display screen.

In this embodiment of the present application, in one case, the processor 2003 is configured to execute the image processing methods executed by the execution device in the embodiments corresponding to FIG. 9 to FIG. 11. Specifically, the application processor 20031 is configured to execute the object detection method in the foregoing embodiments.
An embodiment of this application further provides a training device. Referring to FIG. 21, FIG. 21 is a schematic structural diagram of the training device provided by an embodiment of this application. The perception network training apparatus described in the embodiment corresponding to FIG. 18 may be deployed on the training device 2100 to implement the functions of that apparatus. Specifically, the training device 2100 is implemented by one or more servers, and may vary considerably with configuration or performance. It may include one or more central processing units (CPU) 2121 (for example, one or more processors), a memory 2132, and one or more storage media 2130 (for example, one or more mass storage devices) storing application programs 2142 or data 2144. The memory 2132 and the storage medium 2130 may provide short-term storage or persistent storage. The program stored in the storage medium 2130 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the training device. Further, the central processing unit 2121 may be configured to communicate with the storage medium 2130 and execute, on the training device 2100, the series of instruction operations in the storage medium 2130.

The training device 2100 may further include one or more power supplies 2126, one or more wired or wireless network interfaces 2150, and one or more input/output interfaces 2158; and/or one or more operating systems 2141, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, or FreeBSDTM.

In this embodiment of the present application, the central processing unit 2121 is configured to execute the steps related to the perception network training method in the foregoing embodiments.
An embodiment of the present application further provides a computer program product that, when run on a computer, causes the computer to perform the steps performed by the foregoing execution device, or causes the computer to perform the steps performed by the foregoing training device.

An embodiment of the present application further provides a computer-readable storage medium storing a program for signal processing that, when run on a computer, causes the computer to perform the steps performed by the foregoing execution device, or causes the computer to perform the steps performed by the foregoing training device.

The execution device, training device, or terminal device provided in the embodiments of the present application may specifically be a chip. The chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit can execute computer-executable instructions stored in a storage unit, so that a chip in the execution device performs the data processing methods described in the foregoing embodiments, or so that a chip in the training device performs the data processing methods described in the foregoing embodiments. Optionally, the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may alternatively be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
Specifically, referring to FIG. 22, FIG. 22 is a schematic structural diagram of a chip provided by an embodiment of this application. The chip may be embodied as a neural-network processing unit (NPU) 2200. The NPU 2200 is mounted on a host CPU as a coprocessor, and the host CPU assigns tasks. The core part of the NPU is an arithmetic circuit 2203; a controller 2204 controls the arithmetic circuit 2203 to extract matrix data from memory and perform multiplication operations.

In some implementations, the arithmetic circuit 2203 internally includes multiple processing engines (PE). In some implementations, the arithmetic circuit 2203 is a two-dimensional systolic array. The arithmetic circuit 2203 may alternatively be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2203 is a general-purpose matrix processor.

For example, assume there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from a weight memory 2202 and caches it on each PE in the arithmetic circuit. The arithmetic circuit fetches the matrix A data from an input memory 2201 and performs matrix operations with matrix B, and partial or final results of the resulting matrix are stored in an accumulator 2208.
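In software terms, the fetch, multiply, and accumulate flow just described corresponds to the classic matrix multiplication below, written in pure Python with the accumulator made explicit:

def matmul_accumulate(A, B):
    """C = A @ B with an explicit accumulator per output element,
    mirroring the NPU flow: weights (B) are held while rows of A stream in."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0                      # role of accumulator 2208
            for k in range(inner):
                acc += A[i][k] * B[k][j]   # multiply-accumulate per PE
            C[i][j] = acc
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_accumulate(A, B))  # [[19.0, 22.0], [43.0, 50.0]]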
A unified memory 2206 is used to store input data and output data. Weight data is transferred to the weight memory 2202 through a direct memory access controller (DMAC) 2205, and input data is also transferred to the unified memory 2206 through the DMAC.

A bus interface unit (BIU) 2210 is used for interaction between the AXI bus and both the DMAC and an instruction fetch buffer (IFB) 2209.

The bus interface unit 2210 is used by the instruction fetch buffer 2209 to obtain instructions from an external memory, and is further used by the storage unit access controller 2205 to obtain the original data of the input matrix A or the weight matrix B from the external memory.

The DMAC is mainly used to transfer input data in an external memory DDR to the unified memory 2206, to transfer weight data to the weight memory 2202, or to transfer input data to the input memory 2201.

A vector calculation unit 2207 includes multiple arithmetic processing units and, when needed, performs further processing on the output of the arithmetic circuit 2203, such as vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. It is mainly used for non-convolutional/non-fully-connected layer computations in a neural network, such as batch normalization, pixel-level summation, and up-sampling of feature planes.

In some implementations, the vector calculation unit 2207 can store a processed output vector to the unified memory 2206. For example, the vector calculation unit 2207 may apply a linear function or a nonlinear function to the output of the arithmetic circuit 2203, for example performing linear interpolation on a feature plane extracted by a convolutional layer or, for another example, applying the function to a vector of accumulated values to generate an activation value. In some implementations, the vector calculation unit 2207 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 2203, for example for use in subsequent layers of the neural network.

The instruction fetch buffer 2209 connected to the controller 2204 is used to store instructions used by the controller 2204.

The unified memory 2206, the input memory 2201, the weight memory 2202, and the instruction fetch buffer 2209 are all on-chip memories. The external memory is private to the NPU hardware architecture.
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述程序执行的集成电路。Among them, the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
In addition, it should be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, in the accompanying drawings of the apparatus embodiments provided in this application, a connection relationship between modules indicates that they have a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
From the description of the foregoing implementations, a person skilled in the art can clearly understand that this application may be implemented by software plus necessary general-purpose hardware, and certainly may also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memory, dedicated components, and the like. Generally, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to achieve the same function may also be diverse, for example, analog circuits, digital circuits, or dedicated circuits. However, for this application, a software program implementation is a better implementation in most cases. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes several instructions for instructing a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods described in the embodiments of this application.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used for implementation, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, training device, or data center to another website, computer, training device, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device such as a training device or a data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

Claims (26)

  1. A data processing system, wherein the data processing system comprises: a convolution processing unit, a first feature map generating unit, a second feature map generating unit, and a detection unit, wherein the convolution processing unit is connected to the first feature map generating unit and the second feature map generating unit respectively, the first feature map generating unit is connected to the second feature map generating unit, and the second feature map generating unit is connected to the detection unit;
    the convolution processing unit is configured to receive an input image and perform convolution processing on the input image to generate a plurality of first feature maps;
    the first feature map generating unit is configured to generate a plurality of second feature maps according to the plurality of first feature maps, wherein the plurality of first feature maps comprise more texture details of the input image and/or more location details in the input image than the plurality of second feature maps;
    the second feature map generating unit is configured to generate a plurality of third feature maps according to the plurality of first feature maps and the plurality of second feature maps; and
    the detection unit is configured to output a detection result of an object included in the image according to at least one third feature map of the plurality of third feature maps. (A hypothetical sketch of this data flow appears after the claims.)
  2. The data processing system according to claim 1, wherein the plurality of second feature maps comprise more semantic information than the plurality of first feature maps.
  3. The data processing system according to claim 1 or 2, wherein the convolution processing unit is a backbone network, the first feature map generating unit and the second feature map generating unit form a feature pyramid network (FPN), and the detection unit is a head.
  4. The data processing system according to any one of claims 1 to 3, wherein the plurality of first feature maps comprise a first target feature map, the plurality of second feature maps comprise a second target feature map, the plurality of third feature maps comprise a third target feature map and a fourth target feature map, a resolution of the third target feature map is smaller than that of the fourth target feature map, and the second feature map generating unit is configured to generate the fourth target feature map through the following steps:
    downsampling the third target feature map to obtain a fifth target feature map, wherein the fifth target feature map has the same channel quantity and resolution as the second target feature map; performing downsampling and convolution processing on the first target feature map to obtain a sixth target feature map, wherein the sixth target feature map has the same channel quantity and resolution as the second target feature map; and performing channel superposition on the fifth target feature map, the second target feature map, and the sixth target feature map to generate the fourth target feature map; or,
    downsampling the third target feature map to obtain a fifth target feature map, wherein the fifth target feature map has the same channel quantity and resolution as the second target feature map; downsampling the first target feature map to obtain a sixth target feature map, wherein the sixth target feature map has the same resolution as the second target feature map; and performing channel superposition and convolution processing on the fifth target feature map, the second target feature map, and the sixth target feature map to generate the fourth target feature map, wherein the fourth target feature map has the same channel quantity as the second target feature map. (See the illustrative fusion sketch after the claims.)
  5. The data processing system according to any one of claims 1 to 4, wherein the data processing system further comprises:
    an intermediate feature extraction layer, configured to perform convolution on at least one third feature map of the plurality of third feature maps to obtain at least one fourth feature map,
    and to perform convolution on at least one third feature map of the plurality of third feature maps to obtain at least one fifth feature map, wherein a receptive field corresponding to the fourth feature map is larger than a receptive field corresponding to the fifth feature map; and
    the detection unit is specifically configured to output the detection result of the object included in the image according to the at least one fourth feature map and the at least one fifth feature map.
  6. The data processing system according to claim 5, wherein the intermediate feature extraction layer is further configured to:
    process the at least one fourth feature map according to a first weight value to obtain a processed fourth feature map; and
    process the at least one fifth feature map according to a second weight value to obtain a processed fifth feature map, wherein, in a case where an object to be detected included in the fourth feature map is larger than an object to be detected included in the fifth feature map, the first weight value is greater than the second weight value; and
    correspondingly, the detection unit is specifically configured to output the detection result of the object included in the image according to the processed fourth feature map and the processed fifth feature map. (An illustrative two-branch sketch appears after the claims.)
  7. A perception network training method, wherein the method comprises:
    obtaining a pre-labeled detection frame of a target object in an image;
    obtaining a target detection frame corresponding to the image and a first perception network, wherein the target detection frame is used to identify the target object; and
    iteratively training the first perception network according to a loss function to output a second perception network, wherein the loss function is related to an intersection over union (IoU) between the pre-labeled detection frame and the target detection frame. (A reference IoU computation appears after the claims.)
  8. The method according to claim 7, wherein the loss function is further related to a shape difference between the target detection frame and the pre-labeled detection frame, and the shape difference is negatively correlated with an area of the pre-labeled detection frame.
  9. The method according to claim 7 or 8, wherein the loss function is further related to a position difference between the target detection frame and the pre-labeled detection frame in the image, and the position difference is negatively correlated with the area of the pre-labeled detection frame, or the position difference is negatively correlated with an area of a minimum circumscribed rectangle of a convex hull of the pre-labeled detection frame and the target detection frame.
  10. The method according to claim 9, wherein the target detection frame comprises a first corner point and a first center point, the pre-labeled detection frame comprises a second corner point and a second center point, the first corner point and the second corner point are two end points of a rectangle diagonal, and the position difference is further positively correlated with a difference between positions of the first center point and the second center point in the image and negatively correlated with a length between the first corner point and the second corner point. (See the hedged position-term sketch after the claims.)
  11. The method according to claim 7 or 8, wherein the loss function comprises a target loss item related to the position difference, and the target loss item varies with the position difference, wherein:
    when the position difference is greater than a preset value, a change rate of the target loss item is greater than a first preset change rate; and/or, when the position difference is less than the preset value, the change rate of the target loss item is less than a second preset change rate.
  12. An object detection method, wherein the method is applied to a first perception network, and the method comprises:
    receiving an input first image, and performing convolution processing on the first image to generate a plurality of first feature maps;
    generating a plurality of second feature maps according to the plurality of first feature maps, wherein the plurality of first feature maps comprise more texture details of the input image and/or more location details in the input image than the plurality of second feature maps;
    generating a plurality of third feature maps according to the plurality of first feature maps and the plurality of second feature maps; and
    outputting a first detection result of an object included in the image according to at least one third feature map of the plurality of third feature maps.
  13. The method according to claim 12, wherein the plurality of second feature maps comprise more semantic information than the plurality of first feature maps.
  14. The method according to claim 12 or 13, wherein the plurality of first feature maps, the plurality of second feature maps, and the plurality of third feature maps are feature maps with multi-scale resolutions.
  15. The method according to any one of claims 12 to 14, wherein the plurality of first feature maps comprise a first target feature map, the plurality of second feature maps comprise a second target feature map, the plurality of third feature maps comprise a third target feature map and a fourth target feature map, and a resolution of the third target feature map is smaller than that of the fourth target feature map; and the generating a plurality of third feature maps according to the plurality of first feature maps and the plurality of second feature maps comprises:
    downsampling the third target feature map to obtain a fifth target feature map, wherein the fifth target feature map has the same channel quantity and resolution as the second target feature map; performing downsampling and convolution processing on the first target feature map to obtain a sixth target feature map, wherein the sixth target feature map has the same channel quantity and resolution as the second target feature map; and performing channel superposition on the fifth target feature map, the second target feature map, and the sixth target feature map to generate the fourth target feature map; or,
    downsampling the third target feature map to obtain a fifth target feature map, wherein the fifth target feature map has the same channel quantity and resolution as the second target feature map; downsampling the first target feature map to obtain a sixth target feature map, wherein the sixth target feature map has the same resolution as the second target feature map; and performing channel superposition and convolution processing on the fifth target feature map, the second target feature map, and the sixth target feature map to generate the fourth target feature map, wherein the fourth target feature map has the same channel quantity as the second target feature map.
  16. The method according to any one of claims 12 to 15, wherein the method further comprises:
    performing convolution on at least one third feature map of the plurality of third feature maps through a first convolution layer to obtain at least one fourth feature map,
    and performing convolution on at least one third feature map of the plurality of third feature maps to obtain at least one fifth feature map, wherein a receptive field corresponding to the fourth feature map is larger than a receptive field corresponding to the fifth feature map; and
    correspondingly, the outputting a first detection result of the object included in the image according to at least one third feature map of the plurality of third feature maps comprises:
    outputting the first detection result of the object included in the image according to the at least one fourth feature map and the at least one fifth feature map.
  17. The method according to claim 16, wherein the method further comprises:
    processing the fourth feature map according to a first weight value to obtain a processed fourth feature map; and
    processing the fifth feature map according to a second weight value to obtain a processed fifth feature map, wherein, in a case where an object to be detected included in the fourth feature map is larger than an object to be detected included in the fifth feature map, the first weight value is greater than the second weight value; and
    correspondingly, the outputting a first detection result of the object included in the image according to at least one third feature map of the plurality of third feature maps comprises:
    outputting the first detection result of the object included in the image according to the processed fourth feature map and the processed fifth feature map.
  18. The method according to any one of claims 12 to 17, wherein the first detection result comprises a first detection frame, and the method further comprises:
    obtaining a second detection result of the first image, wherein the second detection result is obtained by performing object detection on the first image through a second perception network, an object detection precision of the first perception network is higher than that of the second perception network, the second detection result comprises a second detection frame, and an intersection exists between a region where the second detection frame is located and a region where the first detection frame is located; and
    if a ratio of an area of the intersection to an area of the first detection frame is less than a preset value, updating the second detection result so that the updated second detection result comprises the first detection frame. (A minimal sketch of this update rule appears after the claims.)
  19. The method according to claim 18, wherein the second detection result comprises a plurality of detection frames, an intersection exists between a region where each detection frame of the plurality of detection frames is located and the region where the first detection frame is located, and the plurality of detection frames comprise the second detection frame, wherein, among the areas of the intersections between the regions where the plurality of detection frames are located and the region where the first detection frame is located, the area of the intersection between the region where the second detection frame is located and the region where the first detection frame is located is the smallest.
  20. The method according to claim 18 or 19, wherein the first image is an image frame in a video, a second image is an image frame in the video, a frame interval between the first image and the second image in the video is less than a preset value, and the method further comprises:
    obtaining a third detection result of the second image, wherein the third detection result comprises a fourth detection frame and an object category corresponding to the fourth detection frame; and
    in a case where a shape difference and a position difference between the fourth detection frame and the first detection frame are within a preset range, the first detection frame corresponds to the object category corresponding to the fourth detection frame. (An illustrative sketch appears after the claims.)
  21. A perception network training apparatus, wherein the apparatus comprises:
    an obtaining module, configured to obtain a pre-labeled detection frame of a target object in an image, and to obtain a target detection frame corresponding to the image and a first perception network, wherein the target detection frame is used to identify the target object; and
    an iterative training module, configured to iteratively train the first perception network according to a loss function to output a second perception network, wherein the loss function is related to an intersection over union (IoU) between the pre-labeled detection frame and the target detection frame.
  22. The apparatus according to claim 21, wherein the loss function is further related to a shape difference between the target detection frame and the pre-labeled detection frame, and the shape difference is negatively correlated with an area of the pre-labeled detection frame.
  23. The apparatus according to claim 21 or 22, wherein the loss function is further related to a position difference between the target detection frame and the pre-labeled detection frame in the image, and the position difference is negatively correlated with the area of the pre-labeled detection frame, or the position difference is negatively correlated with an area of a minimum circumscribed rectangle of a convex hull of the pre-labeled detection frame and the target detection frame.
  24. The apparatus according to claim 23, wherein the loss function comprises a target loss item related to the position difference, and the target loss item varies with the position difference, wherein:
    when the position difference is greater than a preset value, a change rate of the target loss item is greater than a first preset change rate; and/or, when the position difference is less than the preset value, the change rate of the target loss item is less than a second preset change rate.
  25. An object detection apparatus, comprising a storage medium, a processing circuit, and a bus system, wherein the storage medium is configured to store instructions, and the processing circuit is configured to execute the instructions in the memory to perform the steps of the method according to any one of claims 7 to 20.
  26. A computer-readable storage medium, on which a computer program is stored, wherein, when the program is executed by a processor, the steps of the method according to any one of claims 7 to 20 are implemented.
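The sketches below are editorial illustrations only; they are not part of the claims and do not limit or define the claimed subject matter.

Claim 1 recites a backbone, two feature map generating units, and a detection head connected in sequence. As a minimal sketch only, assuming PyTorch and treating every module as a caller-supplied black box, the data flow could look like this (all names are hypothetical):

```python
import torch.nn as nn

class IllustrativePipeline(nn.Module):
    """Hypothetical rendering of the claim-1 data flow:
    backbone -> first feature maps -> second feature maps
    -> third feature maps -> detection head."""
    def __init__(self, backbone, first_gen, second_gen, head):
        super().__init__()
        self.backbone = backbone      # convolution processing unit
        self.first_gen = first_gen    # first feature map generating unit
        self.second_gen = second_gen  # second feature map generating unit
        self.head = head              # detection unit

    def forward(self, image):
        first_maps = self.backbone(image)              # plurality of first feature maps
        second_maps = self.first_gen(first_maps)       # richer semantics, fewer details
        third_maps = self.second_gen(first_maps, second_maps)  # fused maps
        return self.head(third_maps)                   # detection result
```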
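The first alternative recited in claims 4 and 15 resamples the third and first target feature maps to the second map's resolution and concatenates the three along the channel dimension. A hedged sketch under assumed layer choices (nearest-neighbor resampling and a caller-supplied 1x1 convolution) follows:

```python
import torch
import torch.nn.functional as F

def fuse_target_features(third_map, first_map, second_map, conv1x1):
    """Hypothetical sketch of the first fusion branch of claims 4/15."""
    # Resample the third target feature map to the second map's resolution.
    fifth_map = F.interpolate(third_map, size=second_map.shape[-2:], mode="nearest")
    # Resample and convolve the first target feature map so that it matches
    # the second map's channel quantity and resolution.
    sixth_map = conv1x1(F.interpolate(first_map, size=second_map.shape[-2:], mode="nearest"))
    # Channel superposition yields the fourth target feature map.
    return torch.cat([fifth_map, second_map, sixth_map], dim=1)
```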
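Claims 5 and 6 describe an intermediate feature extraction layer that produces a fourth feature map with a larger receptive field and a fifth with a smaller one, each scaled by a weight value. One hedged way to obtain different receptive fields is dilated convolution, as in this sketch; the dilation rates and the scalar weights are assumptions, not prescribed by the claims:

```python
import torch.nn as nn

class IntermediateFeatureExtraction(nn.Module):
    """Hypothetical two-branch extraction: a large-receptive-field branch
    (fourth feature map) and a small-receptive-field branch (fifth)."""
    def __init__(self, channels, w_large=1.0, w_small=1.0):
        super().__init__()
        # A larger dilation enlarges the receptive field at equal kernel size.
        self.large_rf = nn.Conv2d(channels, channels, 3, padding=3, dilation=3)
        self.small_rf = nn.Conv2d(channels, channels, 3, padding=1, dilation=1)
        self.w_large = w_large  # first weight value (favored for large objects)
        self.w_small = w_small  # second weight value

    def forward(self, third_map):
        fourth_map = self.w_large * self.large_rf(third_map)
        fifth_map = self.w_small * self.small_rf(third_map)
        return fourth_map, fifth_map
```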
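For reference, the intersection over union (IoU) that claims 7 and 21 tie the loss function to is the standard ratio computed below; the (x1, y1, x2, y2) box format is an assumption:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```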
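The position term of claim 10 (positively correlated with the center-point distance, negatively correlated with the diagonal length between the corner points) is reminiscent of a DIoU-style penalty. The sketch below is one possible reading, not the claimed formula; normalizing by the diagonal of the enclosing box is an assumption:

```python
def position_penalty(pred_box, gt_box):
    """Illustrative DIoU-like position term: squared center distance
    normalized by the squared diagonal of the enclosing box."""
    pcx, pcy = (pred_box[0] + pred_box[2]) / 2, (pred_box[1] + pred_box[3]) / 2
    gcx, gcy = (gt_box[0] + gt_box[2]) / 2, (gt_box[1] + gt_box[3]) / 2
    center_dist_sq = (pcx - gcx) ** 2 + (pcy - gcy) ** 2
    # Diagonal of the smallest box enclosing both detection frames.
    ex1, ey1 = min(pred_box[0], gt_box[0]), min(pred_box[1], gt_box[1])
    ex2, ey2 = max(pred_box[2], gt_box[2]), max(pred_box[3], gt_box[3])
    diag_sq = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    return center_dist_sq / diag_sq if diag_sq > 0 else 0.0
```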
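Claims 18 and 19 keep a high-precision detection frame when its smallest overlap with the other network's frames covers less than a preset fraction of its own area. A minimal sketch, assuming (x1, y1, x2, y2) boxes and an arbitrary threshold:

```python
def update_second_result(first_box, second_result, threshold=0.5):
    """Hypothetical sketch of the claim-18 update rule."""
    def inter_area(a, b):
        w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        return w * h

    first_area = (first_box[2] - first_box[0]) * (first_box[3] - first_box[1])
    # Use the smallest intersection among the overlapping frames (cf. claim 19).
    overlaps = [inter_area(first_box, b) for b in second_result]
    if overlaps and min(overlaps) / first_area < threshold:
        second_result = second_result + [first_box]
    return second_result
```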
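Claim 20 propagates an object category between nearby video frames when two detection frames agree in shape and position. The difference measures and tolerances below are assumptions chosen only to make the sketch concrete:

```python
def propagate_category(first_box, fourth_box, category,
                       shape_tol=0.2, pos_tol=20.0):
    """Hypothetical sketch of claim 20: assign `category` to the first
    detection frame when the shape and position differences are in range."""
    w1, h1 = first_box[2] - first_box[0], first_box[3] - first_box[1]
    w4, h4 = fourth_box[2] - fourth_box[0], fourth_box[3] - fourth_box[1]
    # Relative shape difference and center-point distance, both assumptions.
    shape_diff = abs(w1 - w4) / max(w4, 1e-6) + abs(h1 - h4) / max(h4, 1e-6)
    c1 = ((first_box[0] + first_box[2]) / 2, (first_box[1] + first_box[3]) / 2)
    c4 = ((fourth_box[0] + fourth_box[2]) / 2, (fourth_box[1] + fourth_box[3]) / 2)
    pos_diff = ((c1[0] - c4[0]) ** 2 + (c1[1] - c4[1]) ** 2) ** 0.5
    return category if shape_diff <= shape_tol and pos_diff <= pos_tol else None
```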
PCT/CN2021/089118 2020-04-30 2021-04-23 Data processing system, object detection method and apparatus thereof WO2021218786A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/050,051 US20230076266A1 (en) 2020-04-30 2022-10-27 Data processing system, object detection method, and apparatus thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010362601.2A CN113591872A (en) 2020-04-30 2020-04-30 Data processing system, object detection method and device
CN202010362601.2 2020-04-30

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/050,051 Continuation US20230076266A1 (en) 2020-04-30 2022-10-27 Data processing system, object detection method, and apparatus thereof

Publications (1)

Publication Number Publication Date
WO2021218786A1 (en)

Family

ID=78237350

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/089118 WO2021218786A1 (en) 2020-04-30 2021-04-23 Data processing system, object detection method and apparatus thereof

Country Status (3)

Country Link
US (1) US20230076266A1 (en)
CN (1) CN113591872A (en)
WO (1) WO2021218786A1 (en)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11893710B2 (en) * 2020-11-16 2024-02-06 Boe Technology Group Co., Ltd. Image reconstruction method, electronic device and computer-readable storage medium
CN112633258B (en) * 2021-03-05 2021-05-25 天津所托瑞安汽车科技有限公司 Target determination method and device, electronic equipment and computer readable storage medium
CN114220188A (en) * 2021-12-27 2022-03-22 上海高德威智能交通系统有限公司 Parking space inspection method, device and equipment
CN115228092B (en) * 2022-09-22 2022-12-23 腾讯科技(深圳)有限公司 Game battle force evaluation method, device and computer readable storage medium
CN116805284B (en) * 2023-08-28 2023-12-19 之江实验室 Feature migration-based super-resolution reconstruction method and system between three-dimensional magnetic resonance planes
CN117473880B (en) * 2023-12-27 2024-04-05 中国科学技术大学 Sample data generation method and wireless fall detection method


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106056647B (en) * 2016-05-30 2019-01-11 南昌大学 A kind of magnetic resonance fast imaging method based on the sparse double-deck iterative learning of convolution

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190325243A1 (en) * 2018-04-20 2019-10-24 Sri International Zero-shot object detection
CN109472298A (en) * 2018-10-19 2019-03-15 天津大学 Depth binary feature pyramid for the detection of small scaled target enhances network
CN111062413A (en) * 2019-11-08 2020-04-24 深兰科技(上海)有限公司 Road target detection method and device, electronic equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114648802A (en) * 2022-05-19 2022-06-21 深圳市海清视讯科技有限公司 Method, device and equipment for identifying facial expressions of users
CN114648802B (en) * 2022-05-19 2022-08-23 深圳市海清视讯科技有限公司 User facial expression recognition method, device and equipment
CN115760990A (en) * 2023-01-10 2023-03-07 华南理工大学 Identification and positioning method of pineapple pistil, electronic equipment and storage medium
CN115760990B (en) * 2023-01-10 2023-04-21 华南理工大学 Pineapple pistil identification and positioning method, electronic equipment and storage medium

Also Published As

Publication number Publication date
US20230076266A1 (en) 2023-03-09
CN113591872A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
JP7289918B2 (en) Object recognition method and device
WO2021227726A1 (en) Methods and apparatuses for training face detection and image detection neural networks, and device
CN110070107B (en) Object recognition method and device
US10748033B2 (en) Object detection method using CNN model and object detection apparatus using the same
CN114220035A (en) Rapid pest detection method based on improved YOLO V4
US20220215227A1 (en) Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium
CN111401517B (en) Method and device for searching perceived network structure
CN111368972B (en) Convolutional layer quantization method and device
Yang et al. A multi-task Faster R-CNN method for 3D vehicle detection based on a single image
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN110222718B (en) Image processing method and device
CN111310604A (en) Object detection method and device and storage medium
WO2021249114A1 (en) Target tracking method and target tracking device
Yang et al. A fusion network for road detection via spatial propagation and spatial transformation
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
Yuan et al. Mask-RCNN with spatial attention for pedestrian segmentation in cyber–physical systems
CN110909656B (en) Pedestrian detection method and system integrating radar and camera
CN115375781A (en) Data processing method and device
Lian et al. Towards unified on-road object detection and depth estimation from a single image
Lv et al. Memory‐augmented neural networks based dynamic complex image segmentation in digital twins for self‐driving vehicle
WO2022179599A1 (en) Perceptual network and data processing method
CN116503602A (en) Unstructured environment three-dimensional point cloud semantic segmentation method based on multi-level edge enhancement
Liu et al. Pedestrian detection based on Faster R-CNN
CN114972182A (en) Object detection method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21795783

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21795783

Country of ref document: EP

Kind code of ref document: A1