CN116959026A - Object detection method, device, apparatus, storage medium and computer program product - Google Patents

Object detection method, device, apparatus, storage medium and computer program product

Info

Publication number
CN116959026A
CN116959026A (application number CN202310786732.7A)
Authority
CN
China
Prior art keywords
detection
feature
mapping
scale
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310786732.7A
Other languages
Chinese (zh)
Inventor
李德辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202310786732.7A
Publication of CN116959026A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757 Matching configurations of points or features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the present application provides a target detection method, apparatus, device, storage medium and computer program product, applicable at least to the fields of artificial intelligence and image recognition. The method includes: performing multi-scale feature fusion on feature maps of different resolutions in an image to be detected to obtain output features at multiple scales; determining a plurality of detection categories for target detection and the detection head corresponding to each detection category; performing feature mapping on the output features at the multiple scales to obtain mapping features matched with each detection head; invoking each detection head to perform target detection on the image to be detected based on the matched mapping features, obtaining at least one detection frame for each detection category; and determining, based on the at least one detection frame, the target detection result of the image to be detected under each detection category. With the present application, accurate target detection can be performed on the image to be detected based on the matched mapping features, improving the accuracy of target detection.

Description

Object detection method, device, apparatus, storage medium and computer program product
Technical Field
Embodiments of the present application relate to the field of artificial intelligence, and in particular, though not exclusively, to a target detection method, apparatus, device, storage medium and computer program product.
Background
Object detection is one of the core problems in computer vision: the task of finding all objects of interest in an image and determining their class and location. Because objects differ in appearance, shape and posture, and imaging is further disturbed by factors such as illumination and occlusion, target detection has long been among the most challenging problems in the field of computer vision.
In the related art, a backbone network is generally used to extract a feature map of an image, and target detection is performed based on that feature map. Because the feature map extracted by the backbone network alone cannot accurately represent the image information, the detection head cannot accurately determine the positions of targets of different detection categories in the subsequent detection process, which reduces the accuracy of target detection.
Disclosure of Invention
Embodiments of the present application provide a target detection method, apparatus, device, storage medium and computer program product, applicable at least to the fields of artificial intelligence and image recognition, which can accurately obtain the features required by a detection head when detecting its corresponding detection category, so that the detection head can accurately perform target detection on the image to be detected based on the matched mapping features, thereby improving the accuracy of target detection.
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides a target detection method, which includes: performing multi-scale feature fusion on feature maps of different resolutions in an image to be detected to obtain output features at multiple scales, where the multi-scale feature fusion fuses feature maps having at least one resolution; determining a plurality of detection categories for performing target detection on the image to be detected and the detection head corresponding to each detection category, where each detection category corresponds to one detection head; for each detection head, performing feature mapping on the output features at the multiple scales to obtain mapping features matched with the corresponding detection head, where a mapping feature being matched with its corresponding detection head means that the mapping feature is the feature required by that detection head when detecting the corresponding detection category; invoking the detection head corresponding to each detection category and performing target detection on the image to be detected based on the matched mapping features to obtain at least one detection frame for each detection category; and determining, based on the at least one detection frame, the target detection result of the image to be detected under each detection category.
An embodiment of the present application provides a target detection apparatus, which includes: a multi-scale feature fusion module, configured to perform multi-scale feature fusion on feature maps of different resolutions in an image to be detected to obtain output features at multiple scales, where the multi-scale feature fusion fuses feature maps having at least one resolution; a first determining module, configured to determine a plurality of detection categories for performing target detection on the image to be detected and the detection head corresponding to each detection category, where each detection category corresponds to one detection head; a feature mapping module, configured to perform feature mapping on the output features at the multiple scales for each detection head to obtain mapping features matched with the corresponding detection head, where a mapping feature being matched with its corresponding detection head means that the mapping feature is the feature required by that detection head when detecting the corresponding detection category; a target detection module, configured to invoke the detection head corresponding to each detection category and perform target detection on the image to be detected based on the matched mapping features to obtain at least one detection frame for each detection category; and a second determining module, configured to determine, based on the at least one detection frame, the target detection result of the image to be detected under each detection category.
In some embodiments, the apparatus further includes: a maximum pooling processing module, configured to invoke a trunk layer in a backbone network to perform maximum pooling processing on the image to be detected to obtain maximum pooling features; and a multi-resolution feature extraction module, configured to invoke a plurality of residual convolution layers in the backbone network to sequentially perform feature extraction on the maximum pooling features at different resolutions to obtain the feature maps of different resolutions; where each residual convolution layer corresponds to a downsampling scale and the resolution corresponding to that downsampling scale, the residual convolution layers are connected sequentially after the trunk layer, and the resolutions corresponding to the sequentially connected residual convolution layers decrease layer by layer.
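As an illustration of such a backbone, the following sketch reuses a standard ResNet (stem convolution plus maximum pooling, followed by four sequentially connected residual convolution layers with decreasing resolution); the choice of ResNet-50, PyTorch and torchvision (0.13 or later for the weights argument) is an assumption, as no specific network, layer widths or module counts are prescribed here.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50  # assumed stand-in for the backbone network

class ResNetBackbone(nn.Module):
    """Trunk layer (conv + max pooling) followed by sequentially connected
    residual convolution layers whose output resolution decreases layer by layer."""
    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)
        # trunk layer: convolution stem plus maximum pooling
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        # residual convolution layers, each made of several residual modules
        self.layers = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])

    def forward(self, x):
        x = self.stem(x)                 # maximum pooling features
        feature_maps = []
        for layer in self.layers:        # the N-th layer consumes the (N-1)-th output
            x = layer(x)
            feature_maps.append(x)
        return feature_maps              # resolutions 1/4, 1/8, 1/16, 1/32 of the input

feats = ResNetBackbone()(torch.randn(1, 3, 640, 640))
print([tuple(f.shape) for f in feats])
```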
In some embodiments, each residual convolution layer includes a plurality of residual convolution modules, and the multi-resolution feature extraction module is further configured to: if the current residual convolution layer is the first residual convolution layer in the backbone network, invoke the plurality of residual convolution modules in the current residual convolution layer to perform convolution processing on the maximum pooling features to obtain convolution features; if the current residual convolution layer is the N-th residual convolution layer in the backbone network, invoke the plurality of residual convolution modules in the current residual convolution layer to perform convolution processing on the convolution features output by the (N-1)-th residual convolution layer to obtain iterative convolution features, where N is an integer greater than 1 and N is less than or equal to the total number of residual convolution layers; and determine a feature map for each of the plurality of residual convolution layers based on the convolution features and the iterative convolution features output by the respective residual convolution layers, where the feature maps corresponding to different residual convolution layers have different resolutions.
In some embodiments, the multi-scale feature fusion module is further configured to: invoke a plurality of single-scale feature modules in a multi-scale feature network to respectively perform multi-scale feature fusion on the feature maps corresponding to the residual convolution layers, obtaining the output features at multiple scales, where each single-scale feature module outputs the output feature at one scale.
In some embodiments, the feature map corresponding to the N-th residual convolution layer includes the feature information in the feature maps corresponding to the first through (N-1)-th residual convolution layers; each residual convolution layer corresponds to one single-scale feature module; the single-scale feature modules are connected sequentially; and each single-scale feature module corresponds to one feature scale. The multi-scale feature fusion module is further configured to: for the last single-scale feature module among the plurality of single-scale feature modules, invoke that single-scale feature module to perform feature mapping on the feature map corresponding to the last residual convolution layer, obtaining the output feature of the last single-scale feature module at its corresponding feature scale; and invoke the (N-1)-th single-scale feature module to perform multi-scale feature fusion on the output feature of the N-th single-scale feature module at its feature scale and the feature map corresponding to the (N-1)-th residual convolution layer, obtaining the output feature of the (N-1)-th single-scale feature module at its corresponding feature scale, where N is an integer greater than 1 and N is less than or equal to the total number of single-scale feature modules.
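The top-down fusion described above might be sketched as follows, assuming a PyTorch, FPN-style implementation in which 1×1 lateral convolutions align channel counts and nearest-neighbour interpolation upsamples the deeper output; these details are assumptions rather than requirements.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFeatureNetwork(nn.Module):
    """Top-down fusion: the last single-scale module maps the deepest feature map
    directly; module N-1 fuses the upsampled output of module N with the feature
    map of residual convolution layer N-1 (addition-style fusion)."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        # 1x1 lateral convolutions align channel counts (an assumption)
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)

    def forward(self, feats):                      # feats ordered shallow -> deep
        outputs = [None] * len(feats)
        outputs[-1] = self.lateral[-1](feats[-1])  # last single-scale feature module
        for i in range(len(feats) - 2, -1, -1):    # modules N-1, ..., 1
            up = F.interpolate(outputs[i + 1], size=feats[i].shape[-2:], mode="nearest")
            outputs[i] = self.lateral[i](feats[i]) + up
        return outputs                             # output features at multiple scales
```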
In some embodiments, the multi-scale feature fusion module is further configured to: invoke the (N-1)-th single-scale feature module to add the output feature of the N-th single-scale feature module at its feature scale to the feature map corresponding to the (N-1)-th residual convolution layer, obtaining the output feature of the (N-1)-th single-scale feature module at its corresponding feature scale; or invoke the (N-1)-th single-scale feature module to concatenate the output feature of the N-th single-scale feature module at its feature scale with the feature map corresponding to the (N-1)-th residual convolution layer to obtain a concatenated feature map, and perform convolutional dimension reduction on the concatenated feature map to obtain the output feature of the (N-1)-th single-scale feature module at its corresponding feature scale.
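The two fusion variants, element-wise addition and concatenation followed by convolutional dimension reduction, might look like this; the 1×1 reduction kernel, and the assumption in the addition variant that the two inputs already share a channel count, are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_by_addition(upper_output, lower_feature_map):
    # Variant 1: element-wise addition of the upsampled deeper output and the
    # shallower feature map (channel counts must already agree).
    up = F.interpolate(upper_output, size=lower_feature_map.shape[-2:], mode="nearest")
    return lower_feature_map + up

class FuseByConcat(nn.Module):
    # Variant 2: concatenation followed by convolutional dimension reduction
    # (a 1x1 convolution is assumed for the reduction).
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, upper_output, lower_feature_map):
        up = F.interpolate(upper_output, size=lower_feature_map.shape[-2:], mode="nearest")
        concatenated = torch.cat([lower_feature_map, up], dim=1)
        return self.reduce(concatenated)
```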
In some embodiments, each detection head corresponds to one feature mapping network, and the feature mapping module is further configured to: for each detection head, invoke the feature mapping network corresponding to that detection head, and perform feature mapping on the output features at the multiple scales respectively through the feature mapping units in that feature mapping network, obtaining the mapping features matched with the corresponding detection head.
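A sketch of one such per-head feature mapping network is given below; the internal structure of each feature mapping unit (two 3×3 convolutions with batch normalization) and the channel width are assumptions.

```python
import torch.nn as nn

class FeatureMappingNetwork(nn.Module):
    """One feature mapping network per detection head: each feature mapping unit
    transforms the shared output feature at one scale into the mapping feature
    that this head needs (the unit structure shown is an assumption)."""
    def __init__(self, num_scales=4, channels=256):
        super().__init__()
        self.units = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(num_scales)
        )

    def forward(self, multi_scale_outputs):
        # one mapping unit per scale; heads for different categories hold
        # different weights here even though the inputs are shared
        return [unit(f) for unit, f in zip(self.units, multi_scale_outputs)]
```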
In some embodiments, the object detection module is further to: invoking the detection head corresponding to each detection category, and performing target detection on the image to be detected based on the matched mapping characteristics to obtain position information and size information of each detection frame under each detection category; and rendering the detection frame into the image to be detected based on the position information and the size information.
In some embodiments, the target detection method is implemented by a target detection model, and the apparatus further includes a model training module configured to: construct a sample data set corresponding to each detection category, and obtain the total number of anchor frames corresponding to a plurality of specified scales; and train the target detection model by iterating the following steps until the target detection model converges: performing multi-scale feature fusion on sample feature maps of different resolutions in a sample detection image of a sample data set to obtain sample output features at multiple scales; for each detection head, performing feature mapping on the sample output features at the multiple scales to obtain sample mapping features matched with the corresponding detection head, where a sample mapping feature being matched with its corresponding detection head means that the sample mapping feature is the sample feature required by that detection head when detecting the corresponding detection category; acquiring a plurality of rectangular frames in the sample detection image; clustering the plurality of rectangular frames into a number of classes equal to the total number of anchor frames and obtaining the class center of each class; determining a plurality of anchor frames in the current training iteration based on each class and its class center; applying the anchor frames to the sample mapping feature map corresponding to the sample mapping features matched with each detection head; performing loss calculation on the target detection model in the current training iteration based on the position information of the anchor frames in the sample mapping feature map and the pre-labelled annotation information to obtain a loss result; and correcting the model parameters of the target detection model based on the loss result to obtain a trained target detection model.
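As an illustration of the clustering step, the following sketch clusters the labelled rectangular frames by width and height with k-means and takes the class centers as anchor frames; the use of scikit-learn, width/height features and Euclidean distance is an assumption (anchor clustering is also commonly done with an IoU-based distance).

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchor_frames(boxes_wh, total_anchor_frames=12, seed=0):
    """Cluster the labelled rectangular frames by width/height and use the class
    centers as anchor frames.

    boxes_wh: (N, 2) array of frame widths and heights from the sample images.
    Returns (total_anchor_frames, 2) anchor sizes sorted by area, largest first.
    """
    km = KMeans(n_clusters=total_anchor_frames, n_init=10, random_state=seed)
    km.fit(boxes_wh)
    anchors = km.cluster_centers_                          # one class center per class
    order = np.argsort(-(anchors[:, 0] * anchors[:, 1]))   # descending area
    return anchors[order]
```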
In some embodiments, the model training module is further to: for any detection category, collecting a sample image with the detection category from an image library, and labeling a detection target corresponding to the detection category in the sample image to form a labeled sample detection image; the sample detection image corresponds to a piece of labeling information; adding the sample detection image to a sample data set corresponding to the detection category; wherein the same sample image belongs to a sample detection image in a sample data set of at least one detection class.
In some embodiments, the target detection model includes a backbone network, a multi-scale feature network, a feature mapping network and at least one detection head, and the model training module is further configured to: input the sample detection image into the target detection model, and perform feature extraction on the sample detection image at different resolutions through the backbone network to obtain the sample feature maps of different resolutions; perform multi-scale feature fusion on the sample feature maps of different resolutions through the multi-scale feature network to obtain the sample output features at multiple scales; and, for each detection head, perform feature mapping on the sample output features at the multiple scales through the feature mapping network to obtain the sample mapping features matched with the corresponding detection head.
In some embodiments, the feature mapping network includes a plurality of feature mapping units, the number of feature mapping units being the same as the number of single-scale feature modules in the multi-scale feature network; the sample mapping features include sub-sample mapping features at multiple scales, and each sub-sample mapping feature corresponds to one sub-mapping feature map. The model training module is further configured to: determine the area of each anchor frame; sort the anchor frames in descending order of area to form an anchor frame sequence; divide the anchor frames into a plurality of anchor frame groups according to their order in the anchor frame sequence, the number of anchor frame groups being the same as the number of feature mapping units in the feature mapping network; and apply all anchor frames in each anchor frame group to the sub-sample mapping feature map corresponding to one group of sub-sample mapping features according to the scale of each sub-sample mapping feature and the areas of the anchor frames in the anchor frame group.
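A sketch of the grouping step follows, assuming that the largest anchor frames are assigned to the lowest-resolution sub-sample mapping feature map (a common pairing, not stated explicitly here).

```python
import numpy as np

def group_anchor_frames(anchors_sorted, num_mapping_units):
    """Split area-sorted anchor frames into as many groups as there are feature
    mapping units; group 0 (largest frames) is intended for the lowest-resolution
    sub-sample mapping feature map (that pairing is an assumption)."""
    return np.array_split(anchors_sorted, num_mapping_units)

# Example: 12 anchor frames over 4 mapping units -> 3 anchor frames per scale.
anchors = np.array([[s, s] for s in range(120, 0, -10)], dtype=float)
for scale, group in enumerate(group_anchor_frames(anchors, 4)):
    print(f"scale {scale}: {len(group)} anchor frames, areas {group[:, 0] * group[:, 1]}")
```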
In some embodiments, the model training module is further to: and when the sample detection image is an image in a sample data set corresponding to any detection category, correcting weights in a backbone network, a multi-scale feature network, a feature mapping network and a detection head corresponding to the detection category in the target detection model based on the loss result to obtain the trained target detection model.
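One possible way to realize this category-decoupled correction is sketched below: gradients flowing into the feature mapping networks and detection heads of the other categories are discarded before the optimizer step, so only the backbone, the multi-scale feature network and the branch of the sampled category are updated. The model.mapping_nets and model.heads dictionaries keyed by category (e.g. an nn.ModuleDict) are a hypothetical layout, not one prescribed here.

```python
def decoupled_training_step(model, loss, optimizer, category):
    """Update the shared backbone and multi-scale feature network together with
    only the feature mapping network and detection head of the sampled category;
    gradients of the other categories' branches are discarded."""
    optimizer.zero_grad()
    loss.backward()
    branches = list(model.mapping_nets.items()) + list(model.heads.items())
    for name, module in branches:
        if name != category:
            for p in module.parameters():
                p.grad = None   # parameters without gradients are skipped by the optimizer
    optimizer.step()
```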
An embodiment of the present application provides an electronic device, including: a memory for storing executable instructions; and the processor is used for realizing the target detection method when executing the executable instructions stored in the memory.
Embodiments of the present application provide a computer program product comprising executable instructions stored in a computer readable storage medium; the processor of the electronic device reads the executable instructions from the computer readable storage medium and executes the executable instructions to implement the target detection method.
The embodiment of the application provides a computer readable storage medium, which stores executable instructions for causing a processor to execute the executable instructions to implement the target detection method.
The embodiments of the present application have the following beneficial effects: multi-scale feature fusion is performed on feature maps of different resolutions in the image to be detected to obtain output features at multiple scales; for each detection head, feature mapping is performed on the output features at the multiple scales to obtain mapping features matched with the corresponding detection head, where a mapping feature being matched with its corresponding detection head means that the mapping feature is the feature required by that detection head when detecting the corresponding detection category; then the detection head corresponding to each detection category is invoked to perform target detection on the image to be detected based on the matched mapping features, so that at least one detection frame for each detection category can be obtained accurately; and on this basis the target detection result of the image to be detected under each detection category is accurately determined from the at least one detection frame. In this way, by performing multi-scale feature fusion on the feature maps of different resolutions in the image to be detected and feature mapping on the output features at multiple scales, the features required by each detection head when detecting its corresponding detection category are obtained accurately, so that each detection head can accurately perform target detection on the image to be detected based on the matched mapping features, improving the accuracy of target detection.
Drawings
FIG. 1 is a schematic diagram of an alternative architecture of an object detection system provided by an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of an alternative method for detecting an object according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of another alternative method for detecting an object according to an embodiment of the present application;
FIG. 5 is a flowchart of a training method of a target detection model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an implementation flow for constructing a sample data set corresponding to each detection class according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an implementation flow of applying an anchor frame to a sample mapping feature map according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an implementation flow of a target detection method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of data set decoupling provided by an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a multitasking detection network according to an embodiment of the present application;
FIG. 11 is a schematic diagram of the internal structure of each module in the multitasking detection network according to an embodiment of the present application.
Detailed Description
The present application will be further described in detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present application more apparent, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which embodiments of this application belong. The terminology used in the embodiments of the application is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before explaining the target detection method according to the embodiment of the present application, the technical terms involved in the embodiment of the present application are explained first.
(1) Sample imbalance: meaning that there is a serious imbalance in the number of samples of different categories in one dataset. Typically, the number of samples of some classes is much smaller than that of other classes, which can have a non-negligible negative impact on the training of the model.
(2) Category decoupling: refers to the separate processing of the target categories of category imbalance, including but not limited to processing at both the data set and model level.
The method in the related art will be described below.
In the related art, for the problem of category imbalance in target detection, the common improvement methods mainly include the following:
(1) Weighted loss. When training on class-imbalanced data, because the rare classes appear with low frequency, the model tends to preferentially fit the more frequent classes in order to keep the overall loss small (an extreme example: if class A has 99 samples and class B has 1 sample, a model that fits only class A still reaches 99% overall accuracy). The most straightforward remedy is to increase the loss weight of the rare classes so that the model pays more attention to fitting them during training.
However, although weighted loss allows the model to better fit the rare-class samples in the training set, it tends to over-fit those samples: the training loss is typically small, but the test loss becomes larger instead, so the imbalance problem is not truly improved.
(2) Resampling. Resampling is also a relatively straightforward method: training samples of rarely occurring categories are fed into training multiple times, intuitively increasing the number of rare targets. Since direct resampling may cause over-fitting to these samples, some image augmentation is usually added during resampling so that each resampled image is somewhat different.
However, resampling also suffers from over-fitting: although image augmentation makes the resampled images differ, the variation is limited and the augmented samples remain distributed around the original images, so they cannot represent the true sample distribution well.
(3) Category cut-and-paste. This is an image modification method: the region of a rare target is cut out and then pasted to other positions of the same image or into other images, thereby increasing the number of rare targets.
However, although this image modification method can greatly increase the number of rare categories, the artificial modification destroys the original data distribution and can affect the robustness of the model when it is deployed.
In addition, in the target detection methods of the related art, a backbone network is generally used to extract a feature map of an image, and target detection is performed based on that feature map. Because the feature map extracted by the backbone network alone cannot accurately represent the image information, the detection head cannot accurately determine the positions of targets of different detection categories in the subsequent detection process, which reduces the accuracy of target detection.
To address at least one of the above problems, embodiments of the present application provide a sample-imbalance training method based on category decoupling, and a target detection method that uses the target detection model trained with it, which, compared with the methods in the related art, can better alleviate the sample imbalance problem and improve the accuracy of target detection.
In the target detection method provided by the embodiments of the present application, multi-scale feature fusion is first performed on feature maps of different resolutions in the image to be detected to obtain output features at multiple scales, where the multi-scale feature fusion fuses feature maps having at least one resolution; meanwhile, a plurality of detection categories for performing target detection on the image to be detected and the detection head corresponding to each detection category are determined, where each detection category corresponds to one detection head. Then, for each detection head, feature mapping is performed on the output features at the multiple scales to obtain mapping features matched with the corresponding detection head, where a mapping feature being matched with its corresponding detection head means that the mapping feature is the feature required by that detection head when detecting the corresponding detection category. Next, the detection head corresponding to each detection category is invoked to perform target detection on the image to be detected based on the matched mapping features, obtaining at least one detection frame for each detection category. Finally, the target detection result of the image to be detected under each detection category is determined based on the at least one detection frame. In this way, by performing multi-scale feature fusion on the feature maps of different resolutions in the image to be detected and feature mapping on the output features at multiple scales, the features required by each detection head when detecting its corresponding detection category are obtained accurately, so that each detection head can accurately perform target detection on the image to be detected based on the matched mapping features, improving the accuracy of target detection.
Here, first, an exemplary application of the object detection device, which is an electronic device for realizing the object detection method, of the embodiment of the present application will be described. In one implementation manner, the object detection device (i.e., the electronic device) provided in the embodiment of the present application may be implemented as a terminal or may be implemented as a server. In one implementation manner, the object detection device provided by the embodiment of the application can be implemented as any terminal with an image recognition function and a data processing function, such as a notebook computer, a tablet computer, a desktop computer, a mobile phone, a portable music player, a personal digital assistant, a special message device, a portable game device, an intelligent robot, an intelligent household appliance, an intelligent vehicle-mounted device and the like; in another implementation manner, the object detection device provided by the embodiment of the present application may be implemented as a server, where the server may be an independent physical server, or may be a server cluster or a distributed system formed by multiple physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content distribution networks (CDN, content Delivery Network), and basic cloud computing services such as big data and artificial intelligence platforms. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application. In the following, an exemplary application when the object detection device is implemented as a server will be described.
Referring to fig. 1, fig. 1 is a schematic diagram of an optional architecture of the target detection system provided by an embodiment of the present application. The embodiment is illustrated with the target detection method applied to a vehicle autopilot product: the autopilot product corresponds to an autopilot application deployed on an autonomous vehicle, and the autonomous vehicle constitutes a terminal in the target detection system. During automatic driving, the autonomous vehicle can collect images to be detected through image acquisition devices located around the vehicle, perform target detection on the collected images using the target detection method provided by the embodiments of the present application to obtain a target detection result, and be controlled to drive automatically based on that result.
In the embodiment of the present application, the object detection system 10 includes at least an autonomous vehicle 100, a network 200 and a server 300. Wherein the server 300 may be a server of an autopilot application. The server 300 may constitute an object detection device of an embodiment of the present application. The autonomous vehicle 100 is connected to the server 300 via the network 200, and the network 200 may be a wide area network or a local area network, or a combination of both.
In the embodiment of the present application, the autopilot vehicle 100 is provided with an autopilot application, and the autopilot vehicle 100 can collect the image to be detected through the image collecting devices located around the vehicle during the autopilot process. After the image to be detected is acquired, the image to be detected may be transmitted to the server 300 through the network 200. After receiving the image to be detected, the server 300 adopts the target detection method provided by the embodiment of the application to perform multi-scale feature fusion on feature images with different resolutions in the image to be detected, so as to obtain output features under multiple scales; determining a plurality of detection categories for carrying out target detection on the image to be detected and detection heads corresponding to each detection category; wherein each detection category corresponds to one detection head; performing feature mapping on the output features under a plurality of scales aiming at each detection head to obtain mapping features matched with the corresponding detection heads; invoking a detection head corresponding to each detection category, and performing target detection on the image to be detected based on the matched mapping characteristics to obtain at least one detection frame aiming at each detection category; and finally, determining a target detection result of the image to be detected under each detection category based on at least one detection frame. After obtaining the target detection result, the server 300 may control the current driving policy according to the target detection result, generate a control policy for the automatic driving vehicle 100, and send a control instruction corresponding to the control policy to the automatic driving vehicle 100, so as to realize safe and reliable driving of the automatic driving vehicle 100.
In some embodiments, the target detection method may also be implemented by the autopilot vehicle 100, that is, the autopilot vehicle 100 is provided with an autopilot application, in the autopilot process, the autopilot vehicle 100 may collect an image to be detected through an image collecting device located around the vehicle, and after the autopilot vehicle 100 collects the image to be detected, the autopilot vehicle 100 performs multi-scale feature fusion on feature graphs with different resolutions in the image to be detected, so as to obtain output features under multiple scales; determining a plurality of detection categories for carrying out target detection on the image to be detected and detection heads corresponding to each detection category; then, the automated driving vehicle 100 performs feature mapping on the output features under multiple scales for each detection head, to obtain mapping features matched with the corresponding detection heads; the automatic driving vehicle 100 calls a detection head corresponding to each detection category, and performs target detection on the image to be detected based on the matched mapping characteristics to obtain at least one detection frame aiming at each detection category; finally, the autonomous vehicle 100 determines a target detection result of the image to be detected under each detection category based on at least one detection frame. After obtaining the target detection result, the autonomous vehicle 100 may control the current driving strategy according to the target detection result, generate a control command corresponding to the current control strategy, and execute the control command corresponding to the control strategy.
The target detection method provided by the embodiment of the application can also be implemented based on a cloud platform and through cloud technology, for example, the server 300 can be a cloud server. The method comprises the steps that multi-scale feature fusion is conducted on feature graphs with different resolutions in an image to be detected through a cloud server, or a plurality of detection categories for carrying out target detection on the image to be detected and detection heads corresponding to the detection categories are determined through the cloud server, or feature mapping is conducted on output features under a plurality of scales through the cloud server, or target detection is conducted on the image to be detected through the cloud server based on matched mapping features, or target detection results of the image to be detected under each detection category are determined through the cloud server.
In some embodiments, a cloud storage may be further provided, and the image to be detected and the corresponding target detection result may be stored in the cloud storage, or model parameters of a target detection model for implementing the target detection method may be further stored in the cloud storage. Therefore, when a target detection request aiming at a new image to be detected is received, model parameters of a target detection model can be directly acquired from a cloud memory, so that the target detection model is quickly called to carry out target detection on the image to be detected.
Here, cloud technology refers to a hosting technology that unifies resources such as hardware, software and networks in a wide area network or a local area network to implement computation, storage, processing and sharing of data. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied under the cloud computing business model; it can form a resource pool that is used on demand in a flexible and convenient way, and cloud computing technology will become an important support for this. Background services of technical network systems, such as video websites, image websites and other portals, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, every object may have its own identification mark in the future, which needs to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data require strong backing from the system, which can be realized through cloud computing.
Fig. 2 is a schematic structural diagram of an electronic device provided in an embodiment of the present application, where the electronic device shown in fig. 2 may be an object detection device, and the object detection device includes: at least one processor 310, a memory 350, at least one network interface 320, and a user interface 330. The various components in the object detection device are coupled together by a bus system 340. It is understood that the bus system 340 is used to enable connected communications between these components. The bus system 340 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled in fig. 2 as bus system 340.
The processor 310 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor (which may be a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like.
The user interface 330 includes one or more output devices 331 that enable presentation of media content, and one or more input devices 332.
Memory 350 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 350 optionally includes one or more storage devices physically located remote from processor 310. Memory 350 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a random access Memory (RAM, random Access Memory). The memory 350 described in embodiments of the present application is intended to comprise any suitable type of memory. In some embodiments, memory 350 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
The operating system 351 includes system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer and a driver layer, for implementing various basic services and handling hardware-based tasks; the network communication module 352 is used for reaching other computing devices via one or more (wired or wireless) network interfaces 320, exemplary network interfaces 320 including Bluetooth, wireless fidelity (Wi-Fi), universal serial bus (USB, Universal Serial Bus) and the like; and the input processing module 353 is used for detecting one or more user inputs or interactions from one of the one or more input devices 332 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2 shows an object detection apparatus 354 stored in a memory 350, where the object detection apparatus 354 may be an object detection apparatus in an electronic device, and may be software in the form of a program and a plug-in, and includes the following software modules: the multi-scale feature fusion module 3541, the first determination module 3542, the feature mapping module 3543, the object detection module 3544, and the second determination module 3545 are logical, and thus can be arbitrarily combined or further split depending on the functionality implemented. The functions of the respective modules will be described hereinafter.
In some embodiments, the apparatus provided by the embodiments of the present application may be implemented in hardware. By way of example, the apparatus may be a processor in the form of a hardware decoding processor that is programmed to perform the target detection method provided by the embodiments of the present application; for example, such a processor may employ one or more application-specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, Programmable Logic Device), complex programmable logic devices (CPLD, Complex Programmable Logic Device), field-programmable gate arrays (FPGA, Field-Programmable Gate Array), or other electronic components.
The target detection method provided by the embodiments of the present application may be performed by an electronic device, where the electronic device may be a server or a terminal, that is, the target detection method of the embodiments of the present application may be performed by the server or the terminal, or may be performed by interaction between the server and the terminal.
Fig. 3 is a schematic flowchart of an alternative target detection method provided by an embodiment of the present application. As shown in fig. 3, the method includes the following steps S101 to S105, described below with a server as the execution subject of the target detection method:
Step S101, carrying out multi-scale feature fusion on feature graphs with different resolutions in an image to be detected to obtain output features under multiple scales.
Here, the image to be detected may or may not have a detection target, and the detection target may have various detection categories, for example, the detection target may be a vehicle, a building, a pedestrian, a tree, an animal, or the like.
In the embodiment of the application, after the image to be detected is acquired, the image to be detected can be subjected to feature extraction with different resolutions, so that the feature images with different resolutions are obtained, and the feature images have the feature information of the image to be detected.
Multi-scale feature fusion is the fusion of feature maps having at least one resolution. Because feature maps of different resolutions have different scales, smaller targets can be detected in feature maps of larger resolution, since larger-resolution feature maps retain more of the detailed information extracted from the image to be detected.
In the embodiment of the present application, detection targets differ in size and span multiple detection categories, and the size difference between detection targets of different categories can be large. To detect both large and small targets accurately, the embodiment of the present application extracts feature maps of different resolutions and performs multi-scale feature fusion on them. That is, when multi-scale feature fusion is performed, feature maps of different scales are fused so that the information of different levels of detail in these feature maps is combined, and the resulting output features at multiple scales each carry information of different levels of detail in the image to be detected.
Step S102, determining a plurality of detection categories for performing target detection on the image to be detected and detection heads corresponding to each detection category.
Here, each detection category corresponds to one detection head; the plurality of detection categories for performing target detection on the image to be detected may be preset detection categories, for example, the detection categories to be detected may be preset as vehicles and pedestrians. In some embodiments, a detection head may be pre-trained for each detection class, through which detection of detection targets of the respective detection class can be achieved.
Step S103, for each detection head, performing feature mapping on the output features under a plurality of scales to obtain mapping features matched with the corresponding detection head.
Here, feature mapping refers to performing multiple rounds of residual convolution processing on the output features at the multiple scales; for example, the output features at the multiple scales may be processed sequentially by a plurality of residual convolution modules, where a residual convolution module (Residual Block) is composed of two or three convolution layers, batch normalization (Batch Normalization) and a cross-layer connection.
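A sketch of such a residual convolution module, assuming the two-convolution variant with a 1×1 projection on the cross-layer connection when the shape changes (a common choice, not one prescribed here):

```python
import torch.nn as nn

class ResidualConvModule(nn.Module):
    """Residual convolution module: convolution layers, batch normalization and a
    cross-layer (skip) connection."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.skip = (
            nn.Identity()
            if stride == 1 and in_ch == out_ch
            else nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))   # cross-layer connection
```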
In the embodiment of the application, aiming at any detection head, the output characteristics under a plurality of scales are subjected to characteristic mapping, the output characteristics under a plurality of scales are fused when the characteristic mapping is carried out, the output characteristics under a plurality of scales are used as input parameters in the residual convolution processing process, and the mapping characteristics matched with the detection head are output by carrying out residual convolution processing on the output characteristics under a plurality of scales. Similarly, for another detection head, feature mapping may still be performed on the output features under multiple scales, and when feature mapping is performed, the output features under multiple scales are fused, the output features under multiple scales are used as input parameters in the residual convolution processing process, and the mapping features matched with the detection head are output by performing residual convolution processing on the output features under multiple scales. The difference is that the processing parameters in performing the residual convolution processing are different for each detection head. For example, for a detection head for detecting a vehicle and a detection head for detecting a pedestrian, parameters in a residual convolution module corresponding to the detection head for detecting a vehicle are different from parameters in a residual convolution module corresponding to the detection head for detecting a pedestrian.
The mapping features are matched with the corresponding detection heads for characterization: the mapping features are features required by the corresponding detection head when detecting the corresponding detection class. That is, the mapping features are features required by the corresponding detection head when detecting the corresponding detection category, that is, the features required by the corresponding detection head can be extracted by performing feature mapping on the output features under multiple scales through residual convolution processing. For example, with respect to a detection head for detecting a vehicle and a detection head for detecting a pedestrian, by performing feature mapping with respect to parameters in a residual convolution module corresponding to the detection head for detecting a vehicle, a mapped feature matching with the detection head for detecting a vehicle can be obtained; similarly, by performing feature mapping on parameters in the residual convolution module corresponding to the detection head for detecting pedestrians, mapping features matched with the detection head for detecting pedestrians can be obtained.
Step S104, calling a detection head corresponding to each detection category, and carrying out target detection on the image to be detected based on the matched mapping characteristics to obtain at least one detection frame aiming at each detection category.
In the embodiment of the present application, each detection head is responsible for detecting one category of target. Based on the matched mapping features, the detection head can determine at least one detection frame for each detection category, that is, the position information of at least one detection target on the image to be detected. In some embodiments, confidence information for each detection frame on the image to be detected may also be determined. The position information includes, but is not limited to, the position and size of the detection frame on the image to be detected; the confidence information includes whether a detection target exists in the detection frame and whether the detection target in the detection frame belongs to the corresponding detection category.
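For illustration, a detection head of this kind might predict, for every anchor at every location of a mapping feature map, the box position and size terms together with the two confidence terms just described; the six-channel-per-anchor layout below is an assumption.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Per anchor and per location of a mapping feature map, predict box
    position/size terms plus objectness and class confidence."""
    def __init__(self, channels=256, anchors_per_scale=3):
        super().__init__()
        self.anchors_per_scale = anchors_per_scale
        self.pred = nn.Conv2d(channels, anchors_per_scale * 6, kernel_size=1)

    def forward(self, mapping_feature):
        b, _, h, w = mapping_feature.shape
        out = self.pred(mapping_feature)
        # (batch, anchors, [dx, dy, dw, dh, objectness, class confidence], H, W)
        return out.view(b, self.anchors_per_scale, 6, h, w)
```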
Step S105, determining a target detection result of the image to be detected under each detection category based on at least one detection frame.
In the embodiment of the application, after the at least one detection frame is obtained, whether each detection frame accurately detects a detection target can be determined based on the position information and the confidence information corresponding to that detection frame. If so, the detection frame is retained as a final detection frame; if not, the detection frame is deleted. After the detection frames that do not accurately detect a detection target are deleted, the positions of the detection targets under each detection category are determined from the remaining detection frames, and the remaining detection frames are rendered into the image to be detected based on their position information, so as to obtain the target detection result of the image to be detected under each detection category. The target detection result characterizes the result of successfully detecting the detection targets under each detection category, i.e., the detection frames remaining after those that do not accurately detect a detection target have been deleted.
According to the target detection method provided by the embodiment of the application, output features at a plurality of scales are obtained by performing multi-scale feature fusion on feature maps with different resolutions in the image to be detected; for each detection head, feature mapping is performed on the output features at the plurality of scales to obtain mapping features matched with the corresponding detection head, where the mapping features being matched with the corresponding detection head means that the mapping features are the features required by the corresponding detection head when detecting the corresponding detection category; then, the detection head corresponding to each detection category is called to perform target detection on the image to be detected based on the matched mapping features, so that at least one detection frame for each detection category can be accurately obtained; and the target detection result of the image to be detected under each detection category is therefore accurately determined based on the at least one detection frame. In this way, by performing multi-scale feature fusion on the feature maps with different resolutions and performing feature mapping on the output features at the plurality of scales, the features required by each detection head when detecting its corresponding detection category are accurately obtained, so that the detection head can accurately perform target detection on the image to be detected based on the matched mapping features, which improves the accuracy of target detection.
In some embodiments, application of the object detection method to the field of automatic driving is taken as an example. The object detection system comprises at least an autonomous vehicle and a server; an automatic driving application is installed on the in-vehicle terminal of the autonomous vehicle, and an image to be detected can be acquired through image acquisition devices arranged around the vehicle during automatic driving. After the image to be detected is acquired, it may be sent to the server, which is a background server of the automatic driving application.
Fig. 4 is another optional flowchart of the target detection method according to the embodiment of the present application, as shown in fig. 4, the method includes the following steps S201 to S214:
step S201, the image acquisition device acquires images around the automatic driving vehicle in the automatic driving process of the automatic driving vehicle to obtain an image to be detected.
In the embodiment of the application, the image acquisition equipment can be arranged around the automatic driving vehicle, and the images around the automatic driving vehicle can be periodically or aperiodically acquired in the automatic driving process of the automatic driving vehicle to obtain the image to be detected.
In step S202, the autonomous vehicle encapsulates the image to be detected into a target detection request.
In the embodiment of the application, when one image to be detected is acquired, the image to be detected is synchronously packaged into the target detection request and sent to the server, so that the efficiency of target detection of the image to be detected by the server is improved, the real-time control of an automatic driving vehicle is further ensured, and the driving safety of the vehicle is ensured.
In some embodiments, the method for detecting the target in the embodiment of the present application may also be implemented by an autonomous vehicle, that is, after the image to be detected is collected, the autonomous vehicle itself completes the step of detecting the target based on the image to be detected, without data interaction with a server, so as to improve the efficiency of target detection.
In step S203, the autonomous vehicle transmits the target detection request to the server.
In step S204, the server responds to the target detection request, and invokes a backbone layer in the backbone network to perform maximum pooling processing on the image to be detected, so as to obtain the maximum pooling feature.
Step S205, the server calls a plurality of residual convolution layers in the backbone network to sequentially extract the features of the maximum pooling feature under different resolutions, and a feature map with different resolutions is obtained.
In an embodiment of the present application, the backbone network includes a backbone layer (which may be denoted as B-stem) and a plurality of residual convolution layers (each of which may be denoted as a B-layer) connected in sequence. The backbone layer is used for performing maximum pooling processing on the image to be detected, so as to extract features from the image to be detected and obtain the maximum pooling feature.
In implementation, the backbone layer consists of a maximum pooling layer and a batch normalization (BN, Batch Normalization) layer. The stride of the maximum pooling layer may be 2. Each residual convolution layer is formed by a plurality of residual convolution modules (which may be denoted as ResBlock), where the stride of a residual convolution module may be 1 or 2, and the strides of the plurality of residual convolution modules may be the same or different.
In the same residual convolution layer, each residual convolution module is used for performing convolution processing on its input feature. In implementation, since the plurality of residual convolution modules are connected in sequence, the output of the previous residual convolution module forms the input of the next adjacent residual convolution module, and the input of the first residual convolution module is the output of the residual convolution layer located before the current residual convolution layer. For the first residual convolution layer, the input is the maximum pooling feature output by the backbone layer.
In the embodiment of the application, each residual convolution layer corresponds to a downsampling scale and a resolution corresponding to the downsampling scale; the residual convolution layers are sequentially connected, and the residual convolution layers are connected behind the trunk layer; the resolutions corresponding to the plurality of residual convolution layers which are connected in sequence decrease in sequence, and the scales of the feature graphs output by the plurality of residual convolution layers which are connected in sequence increase in sequence.
In some embodiments, each residual convolution layer includes a plurality of residual convolution modules; the feature extraction process performed at different resolutions in step S205 includes the following two cases:
case one: if the current residual convolution layer is the first residual convolution layer in the backbone network, a plurality of residual convolution modules in the current residual convolution layer can be called to carry out convolution processing on the maximum pooling feature, so as to obtain the convolution feature. Then, a feature map of the residual convolutional layer is determined based on the convolutional features.
And a second case: if the current residual convolution layer is the N residual convolution layer in the backbone network, a plurality of residual convolution modules in the current residual convolution layer can be called to carry out convolution processing on the convolution characteristics output by the N-1 residual convolution layer, so as to obtain iterative convolution characteristics; n is an integer greater than 1 and N is less than or equal to the total number of residual convolutional layers. Then, a feature map of the residual convolution layer is determined based on the iterative convolution features.
It should be noted that, the feature map corresponding to each residual convolution layer has a different resolution.
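As an illustration of the backbone structure described above, the following is a minimal sketch in PyTorch; the channel widths, strides and number of residual convolution modules per layer are assumptions chosen to reproduce the 128×H/8×W/8, 256×H/16×W/16 and 512×H/32×W/32 scales used in the example below, not values fixed by the embodiment.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """A residual convolution module; stride 2 halves the spatial resolution."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)
        self.proj = None  # 1x1 projection when the shortcut changes shape
        if stride != 1 or in_ch != out_ch:
            self.proj = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        identity = x if self.proj is None else self.proj(x)
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + identity)

class Backbone(nn.Module):
    """B-stem (max pooling + batch normalization) followed by sequential B-layers."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.MaxPool2d(kernel_size=2, stride=2),
                                  nn.BatchNorm2d(3))
        # Each B-layer halves the spatial resolution with its first ResBlock.
        self.layer1 = nn.Sequential(ResBlock(3, 64, 2), ResBlock(64, 64))
        self.layer2 = nn.Sequential(ResBlock(64, 128, 2), ResBlock(128, 128))
        self.layer3 = nn.Sequential(ResBlock(128, 256, 2), ResBlock(256, 256))
        self.layer4 = nn.Sequential(ResBlock(256, 512, 2), ResBlock(512, 512))

    def forward(self, x):
        x = self.stem(x)
        c2 = self.layer1(x)
        c3 = self.layer2(c2)   # 128 x H/8  x W/8
        c4 = self.layer3(c3)   # 256 x H/16 x W/16
        c5 = self.layer4(c4)   # 512 x H/32 x W/32
        return c3, c4, c5      # the last three feature maps go to the next stage

feats = Backbone()(torch.randn(1, 3, 640, 640))
print([tuple(f.shape) for f in feats])  # (1,128,80,80), (1,256,40,40), (1,512,20,20)
```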
Step S206, the server calls a plurality of single-scale feature modules in the multi-scale feature network, and performs multi-scale feature fusion on the feature graphs corresponding to each residual convolution layer to obtain output features under a plurality of scales.
Here, the multi-scale feature network includes a plurality of single-scale feature modules (which may be denoted as F-layers) that are sequentially connected, and each single-scale feature module outputs an output feature under one scale; multiscale feature fusion is the fusion of feature maps having at least one resolution.
The number of single-scale feature modules in the multi-scale feature network is the same as the number of residual convolution layers in the backbone network. The input of the (N-1)-th single-scale feature module comprises the output features of the N-th single-scale feature module and the feature map output by the (N-1)-th residual convolution layer corresponding to the (N-1)-th single-scale feature module; the input of the last single-scale feature module in the multi-scale feature network is the output of the last residual convolution layer in the backbone network, i.e. the feature map at the lowest resolution output by the backbone network.
In some embodiments, the feature map corresponding to the nth residual convolution layer includes feature information in the feature maps corresponding to the first through N-1 th residual convolution layers; each residual convolution layer corresponds to a single-scale feature module; the single-scale feature modules are sequentially connected; each single-scale feature module corresponds to a feature scale.
In step S206, the multi-scale feature fusion is performed on the feature map corresponding to each residual convolution layer, which may be implemented in the following manner: aiming at the last single-scale feature module in the plurality of single-scale feature modules, invoking the single-scale feature module to perform feature mapping on the feature map corresponding to the last residual convolution layer to obtain the output feature of the last single-scale feature module under the corresponding feature scale; invoking an N-1 single-scale feature module, and carrying out multi-scale feature fusion on the output features of the N single-scale feature module under the feature scale and the feature map corresponding to the N-1 residual convolution layer to obtain the output features of the N-1 single-scale feature module under the corresponding feature scale; n is an integer greater than 1 and N is less than or equal to the total number of single-scale feature modules.
In some embodiments, the N-1 single-scale feature module is called, the multi-scale feature fusion is performed on the output features of the N single-scale feature module under the feature scale and the feature map corresponding to the N-1 residual convolution layer, so that the output features of the N-1 single-scale feature module under the corresponding feature scale can be achieved through any one of the following two modes:
Mode one: and calling the N-1 single-scale feature module, and adding the output features of the N single-scale feature module under the feature scale and the feature map corresponding to the N-1 residual convolution layer to obtain the output features of the N-1 single-scale feature module under the corresponding feature scale.
Mode two: invoking an N-1 single-scale feature module, and performing splicing treatment on the output features of the N single-scale feature module under the feature scale and the feature map corresponding to the N-1 residual convolution layer to obtain a spliced feature map; and then, carrying out convolution dimension reduction processing on the spliced feature map to obtain the output features of the N-1 single-scale feature module under the corresponding feature scale.
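As an illustration of the two fusion modes above, the following is a minimal sketch in PyTorch; the upsampling step and the 1×1 convolutions used to align channel numbers are assumptions needed to make the shapes compatible, since the embodiment does not prescribe how the scale difference between adjacent modules is bridged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionByAddition(nn.Module):
    """Mode one: align channels, upsample, then element-wise addition."""
    def __init__(self, deep_ch, shallow_ch):
        super().__init__()
        self.align = nn.Conv2d(deep_ch, shallow_ch, kernel_size=1)

    def forward(self, deep_feat, shallow_map):
        deep = F.interpolate(self.align(deep_feat),
                             size=shallow_map.shape[-2:], mode="nearest")
        return deep + shallow_map

class FusionByConcat(nn.Module):
    """Mode two: upsample, concatenate, then a 1x1 convolution for dimension reduction."""
    def __init__(self, deep_ch, shallow_ch):
        super().__init__()
        self.reduce = nn.Conv2d(deep_ch + shallow_ch, shallow_ch, kernel_size=1)

    def forward(self, deep_feat, shallow_map):
        deep = F.interpolate(deep_feat, size=shallow_map.shape[-2:], mode="nearest")
        return self.reduce(torch.cat([deep, shallow_map], dim=1))

# Example: fuse the 512 x H/32 x W/32 output of the N-th single-scale feature
# module with the 256 x H/16 x W/16 feature map of the (N-1)-th residual
# convolution layer (spatial sizes 20 and 40 here stand for H/32 and H/16).
fuse = FusionByConcat(deep_ch=512, shallow_ch=256)
out = fuse(torch.randn(1, 512, 20, 20), torch.randn(1, 256, 40, 40))
print(tuple(out.shape))  # (1, 256, 40, 40)
```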
In step S207, the server determines a plurality of detection categories for performing target detection on the image to be detected and detection heads corresponding to each detection category.
Here, each detection category corresponds to one detection head; each detection head corresponds to a feature mapping network.
Step S208, for each detection head, the server calls a feature mapping network corresponding to the detection head, and performs feature mapping on the output features under multiple scales through feature mapping units in the feature mapping network to obtain mapping features matched with the corresponding detection head.
Here, the mapping features are matched to the corresponding detection heads for characterization: the mapping features are features required by the corresponding detection head when detecting the corresponding detection class.
Each detection head corresponds to one feature mapping network, each feature mapping network comprises a plurality of feature mapping units, and the number of feature mapping units in each feature mapping network is the same as the number of single-scale feature modules in the multi-scale feature network and the number of residual convolution layers in the backbone network. That is, at the same scale, there are one residual convolution layer, one single-scale feature module and one feature mapping unit.
The output features under multiple scales are respectively input into different feature mapping units in the same feature mapping network, and are input into feature mapping units matched with the scales of the output features.
For example, if the number of feature mapping units, the number of single-scale feature modules and the number of residual convolution layers are all 3, the scale corresponding to the first feature mapping unit, the first single-scale feature module and the first residual convolution layer may be 128×H/8×W/8, the scale corresponding to the second feature mapping unit, the second single-scale feature module and the second residual convolution layer may be 256×H/16×W/16, and the scale corresponding to the third feature mapping unit, the third single-scale feature module and the third residual convolution layer may be 512×H/32×W/32. In the implementation process, the feature map with the scale of 512×H/32×W/32 output by the third residual convolution layer is input into the third single-scale feature module; the output feature with the scale of 512×H/32×W/32 output by the third single-scale feature module is input into the third feature mapping unit of the feature mapping network, and this feature mapping unit performs feature mapping on it to obtain a mapping feature with the scale of k×H/32×W/32 matched with the corresponding detection head, where k refers to the number of channels of the mapping feature and its determination will be described below. Similarly, the feature map with the scale of 256×H/16×W/16 output by the second residual convolution layer is input into the second single-scale feature module, together with the output feature with the scale of 512×H/32×W/32 output by the third single-scale feature module; the second single-scale feature module performs multi-scale feature fusion on the two input features to obtain an output feature with the scale of 256×H/16×W/16, which is input into the second feature mapping unit of the feature mapping network to obtain a mapping feature with the scale of k×H/16×W/16 matched with the corresponding detection head. Likewise, the feature map with the scale of 128×H/8×W/8 output by the first residual convolution layer is input into the first single-scale feature module, together with the output feature with the scale of 256×H/16×W/16 output by the second single-scale feature module; the first single-scale feature module performs multi-scale feature fusion on the two input features to obtain an output feature with the scale of 128×H/8×W/8, which is input into the first feature mapping unit of the feature mapping network to obtain a mapping feature with the scale of k×H/8×W/8 matched with the corresponding detection head. At this point, the three feature mapping units of the feature mapping network output mapping features with the scales of k×H/8×W/8, k×H/16×W/16 and k×H/32×W/32, respectively, and the detection head connected to the feature mapping network can perform target detection based on the mapping features at these three scales.
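The per-head feature mapping described above can be sketched as follows in PyTorch; the internal structure of a feature mapping (Trans) unit is an assumption, the embodiment only requiring that each unit maps the F-layer output at its scale to a k-channel mapping feature and that each detection head has its own set of parameters.

```python
import torch
import torch.nn as nn

class TransUnit(nn.Module):
    """One feature mapping unit: a residual convolution followed by a 1x1 mapping."""
    def __init__(self, in_ch, k):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True))
        self.mapping = nn.Conv2d(in_ch, k, kernel_size=1)

    def forward(self, x):
        return self.mapping(x + self.body(x))

class FeatureMappingNetwork(nn.Module):
    """One network per detection head; each head keeps its own parameters."""
    def __init__(self, in_channels=(128, 256, 512), k=18):
        super().__init__()
        self.units = nn.ModuleList(TransUnit(c, k) for c in in_channels)

    def forward(self, feats):          # feats: the outputs of the F-layers
        return [unit(f) for unit, f in zip(self.units, feats)]

# Separate mapping networks, e.g. one for the vehicle detection head and one
# for the pedestrian detection head.
vehicle_mapping = FeatureMappingNetwork()
pedestrian_mapping = FeatureMappingNetwork()
feats = [torch.randn(1, 128, 80, 80), torch.randn(1, 256, 40, 40),
         torch.randn(1, 512, 20, 20)]
print([tuple(m.shape) for m in vehicle_mapping(feats)])  # k x H/8, k x H/16, k x H/32
```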
Step S209, the server calls the detection head corresponding to each detection category, and performs target detection on the image to be detected based on the matched mapping characteristics, so as to obtain the position information and the size information of each detection frame under each detection category.
Here, the position information may be a coordinate value of a center point of the detection frame, and the size information may be a width and a height of the detection frame.
Step S210, the server renders the detection frame into the image to be detected based on the position information and the size information.
In step S211, the server determines a target detection result of the image to be detected under each detection category based on at least one detection frame.
In step S212, the server determines a driving strategy applicable to the automatically driven vehicle based on the target detection result, and generates a driving instruction corresponding to the driving strategy.
Here, the driving strategy includes, but is not limited to, at least one of: acceleration running, deceleration running, steering running, straight running, and the like.
In step S213, the server transmits the driving instruction to the autonomous vehicle.
In step S214, the autonomous vehicle executes the driving instruction to perform the autonomous driving according to the driving strategy described above.
In some embodiments, the above-described target detection method may be implemented by artificial intelligence techniques, for example, by a target detection model. Here, the object detection model includes a backbone network, a multi-scale feature network, a feature mapping network, and at least one detection head.
For the target detection model, the embodiment of the present application further provides a training method of the target detection model. Fig. 5 is a schematic flowchart of the training method of the target detection model provided by the embodiment of the present application; the training method may be executed by a model training module. The model training module may be a module in the target detection device (namely the electronic device), that is, it may be located in the server or the terminal; alternatively, the model training module may be another device independent of the target detection device, that is, an electronic device other than the server and the terminal used for implementing the target detection method. As shown in fig. 5, the training method of the target detection model includes the following steps S301 to S309:
step S301, a sample data set corresponding to each detection category is constructed, and the total anchor frame number corresponding to a plurality of specified scales is obtained.
In some embodiments, referring to fig. 6, fig. 6 shows that constructing a sample data set corresponding to each detection category may be achieved by the following steps S3011 to S3012:
step S3011, for any detection category, collecting a sample image with the detection category from the image library, and labeling a detection target corresponding to the detection category in the sample image, so as to form a labeled sample detection image.
Here, the sample detection image corresponds to a piece of labeling information;
step S3012, the sample detection image is added to the sample data set corresponding to the detection category.
Here, the same sample image belongs to a sample detection image in the sample data set of at least one detection class.
After that, the following steps S302 to S309 may be iterated in a loop to train the target detection model until the target detection model meets the preset convergence condition and converges:
step S302, multi-scale feature fusion is carried out on sample feature graphs with different resolutions in sample detection images in a sample data set, and sample output features under multiple scales are obtained.
In the embodiment of the application, the sample detection image can be input into the target detection model, and the feature extraction under different resolutions is carried out on the sample detection image through a backbone network in the target detection model, so that the sample feature images with different resolutions are obtained; and then, carrying out multi-scale feature fusion on the sample feature graphs with different resolutions through a multi-scale feature network to obtain sample output features under a plurality of scales.
Step S303, for each detection head, performing feature mapping on the sample output features under a plurality of scales to obtain sample mapping features matched with the corresponding detection head. Wherein, the sample mapping features are matched with the corresponding detection heads for characterization: the sample mapping feature is a sample feature required by the corresponding detection head when detecting the corresponding detection category.
In the embodiment of the application, for each detection head, the feature mapping network is used for carrying out feature mapping on the sample output features under a plurality of scales to obtain the sample mapping features matched with the corresponding detection head.
Step S304, a plurality of rectangular frames in the sample detection image are acquired.
Step S305, clustering the plurality of rectangular frames to obtain a plurality of classes, the number of which corresponds to the total anchor frame number, and the class center of each class.
Step S306, based on each class and class center, a plurality of anchor frames in the current iterative training process are determined.
In step S307, a plurality of anchor frames are applied to the sample mapping feature map corresponding to the sample mapping feature matched with each detection head.
In the embodiment of the application, the feature mapping network comprises a plurality of feature mapping units, and the number of the feature mapping units is the same as that of single-scale feature modules in the multi-scale feature network; the sample mapping features include sub-sample mapping features at multiple scales; each sub-sample mapping feature corresponds to a sub-mapping feature map.
In the process of implementation, referring to fig. 7, fig. 7 shows that step S307 may be implemented by the following steps S3071 to S3074:
in step S3071, the area of each anchor frame is determined.
Step S3072, the anchor frames are ordered according to the sequence from large area to small area, and an anchor frame sequence is formed.
Step S3073, dividing the anchor frames into a plurality of anchor frame groups according to the order of the anchor frames in the anchor frame sequence; the number of anchor frame groups is the same as the number of feature mapping units in the feature mapping network, and each anchor frame group comprises at least one anchor frame.
Step S3074, according to the scale of each mapping feature and the area of the anchor frames in the anchor frame groups, all the anchor frames in each anchor frame group are applied to the sub-sample mapping feature map corresponding to a group of sub-sample mapping features.
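A minimal sketch of steps S3071 to S3074 follows; the even split of the sorted anchor frames into equally sized groups is an assumption (consistent with the B/3 split described later in the training section), and the printed group index merely stands for the sub-sample mapping feature map the group is applied to.

```python
def assign_anchors_to_groups(anchors, num_groups=3):
    """anchors: list of (width, height); returns num_groups groups ordered from
    the largest-area anchors to the smallest, one group per feature mapping unit."""
    ordered = sorted(anchors, key=lambda wh: wh[0] * wh[1], reverse=True)  # S3071-S3072
    group_size = len(ordered) // num_groups                                # S3073
    return [ordered[i * group_size:(i + 1) * group_size] for i in range(num_groups)]

# Toy anchors standing in for the clustered class centers (B = 9, three scales).
anchors = [(12, 20), (30, 25), (45, 90), (60, 40), (80, 160),
           (110, 70), (140, 250), (200, 120), (320, 280)]
for idx, group in enumerate(assign_anchors_to_groups(anchors)):
    # S3074: each group is applied to one sub-sample mapping feature map.
    print("anchor group", idx, "->", group)
```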
Step S308, performing loss calculation on the target detection model in the current iterative training process based on the position information of the anchor frame in the sample mapping feature map and the pre-labeled labeling information to obtain a loss result.
Step S309, based on the loss result, correcting the model parameters in the target detection model to obtain the trained target detection model.
In the embodiment of the application, when the sample detection image is an image in a sample data set corresponding to any detection category, the weight values in the backbone network, the multi-scale feature network, the feature mapping network and the detection head corresponding to the detection category in the target detection model can be corrected based on the loss result to obtain the trained target detection model. That is, if the number of detection categories is plural, only the detection head corresponding to one detection category is trained in each round of iterative training, and therefore, only the weight value in the detection head corresponding to the detection category is corrected in the one round of training. And in the next training round, training the detection head corresponding to the other detection category, and correspondingly, correcting the weight in the detection head corresponding to the other detection category in the next training round.
In some embodiments, after the above loop iteration process is completed, the weight in the detection head corresponding to each detection category may be further fine-tuned, so as to further improve the detection accuracy of the target detection model. In the implementation process, the weights of the backbone network and the multi-scale feature network can be frozen first, the backbone network and the multi-scale feature network are set to be shared partial networks, and the weight (namely the fine adjustment process) of the detection head corresponding to any detection category is updated after freezing, so that the effects of other detection heads are not affected. After the common part is frozen, the target detection model is continuously trained on the sample data set of any detection category, and a smaller learning rate can be used at the moment, for example, the target detection model can be learned by using the learning rate smaller than the learning rate threshold value, and only the weight of the detection head of the detection category is updated in each iteration. The operations in the above fine tuning process are also performed for the detection heads corresponding to the other detection categories.
In the following, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
Aiming at the problem of sample imbalance during training of a target detection model in the related art, the embodiment of the application provides a class decoupling method, which is used for training a class imbalance Computer Vision (CV) task and can better solve the problem of class imbalance during training of the target detection model.
First, an application of the target detection method in the embodiment of the present application on the product side will be described.
The target detection method provided by the embodiment of the application can effectively solve the problem of unbalanced target detection category, and can realize the following performances and benefits on products: the accuracy and recall rate of target detection are improved, and the conditions of missed detection and false detection are reduced; the generalization capability of the model is improved, and the model can adapt to the changes of different scenes and data distribution; the training efficiency of the model is improved, and the training time and the consumption of computing resources are reduced; the robustness of the model is improved, and the influence of noise and interference can be resisted.
The application scenario of the target detection method provided in the embodiment of the present application is illustrated herein:
example one: in automatic driving, the accuracy and recall rate of the algorithm on environmental perception can be improved, the conditions of false detection, missing recall and the like are reduced, more reliable environmental information is provided for the realization of downstream regulation, so that the safety of automatic driving is improved, and the floor application of an automatic driving technology is promoted.
Example two: in the field of security protection, the detection precision of few targets such as pedestrians, bicycles, motorcycles and the like can be improved, so that false alarm and missing alarm are reduced, and the security protection and management efficiency is improved. For example, a target detection system can be deployed at an intersection or a parking lot, information such as the number, the position, the direction and the like of vehicles and pedestrians can be monitored in real time, and functions such as traffic flow analysis, violation identification, accident early warning and the like can be performed.
Example three: in the medical field, the detection precision of focus can be improved, thereby reducing missed diagnosis and misdiagnosis and improving the quality of diagnosis and treatment. For example, the target detection system may be deployed in medical imaging to detect information such as the position, size, and morphology of lesions such as lung nodules, breast cancer, and liver cancer in real time, and perform functions such as lesion segmentation, classification, and localization.
Example four: in the industrial field, the detection precision of defects can be improved, so that waste products and reworks are reduced, and the production and quality control level is improved. For example, the object detection system can be deployed in industrial vision, and the position, size, type and other information of defects such as scratches, cracks, bubbles and the like on the surface of a product can be detected in real time, so that the functions such as defect classification, positioning, evaluation and the like can be performed.
The following specifically describes a target detection method provided in the embodiment of the present application.
As shown in fig. 8, the target detection method provided by the embodiment of the application mainly includes the following aspects:
step S801, the data set of the sample imbalance is decoupled.
Step S802, designing a multi-task detection network.
Step S803, training the multi-task detection network.
Step S804, fusing the detection results and applying to the network.
As shown in fig. 9, the data set decoupling in step S801 is a schematic diagram of data set decoupling provided in the embodiment of the present application, where the data set decoupling includes the following steps, taking vehicles and pedestrians commonly found in an autopilot scenario as an example:
in step S901, image data of a vehicle is acquired, and a vehicle target in the image is marked with a target frame 91.
In step S902, image data of the pedestrian is collected, and the pedestrian target in the image is marked with the target frame 92.
Through the above steps S901 and S902, one data set with unbalanced original categories can be decoupled into two data sets, and since the target has no limitation of "appearing on the same image", each data set can freely increase the number of samples, so as to control the sample number balance between different categories.
For the design of the multi-task detection network (i.e., the target detection model) in step S802, referring to the schematic structural diagram of the multi-task detection network shown in fig. 10 and the schematic internal structural diagram of each module in the multi-task detection network shown in fig. 11, it can be seen that the multi-task detection network design includes the following steps:
first, a backbone network 1001 is constructed. The backbone network comprises a backbone layer B-stem and a plurality of residual convolution layers B-layers, and each B-layer comprises a plurality of residual convolution modules. The backbone network contains a number of downsampling scales and the output of the last three B-layers will be passed to the next stage, i.e. to the multi-scale feature network.
Then, a multi-scale feature network 1002 is constructed. In order to better adapt to the scale change of the target, three different scale features (i.e. feature graphs with different resolutions) output by the backbone network are fused in the multi-scale feature network, and then corresponding output features are generated on each scale.
Then, a plurality of detection heads 1004 are constructed. Each detection head 1004 is responsible for detecting one type of object; in the embodiment of the present application, there are two types of objects, pedestrians and vehicles, and therefore two detection heads. Each feature mapping unit (Trans module) in the feature mapping network 1003 is then designed, and the Trans module performs feature mapping on the features output by each F-layer in the multi-scale feature network 1002, so as to extract the features required by the different detection heads. The multi-scale feature network is composed of a plurality of single-scale feature modules (F-layers).
Finally, the number of channels k of the output feature is determined. Here, each channel of the output feature corresponds to specific information: k = b(4+1+c), where b is the number of anchor frames at each position of the output feature produced by each F-layer (determined in the previous step), 4 represents the regression offsets of each anchor frame in center abscissa, center ordinate, width and height, 1 represents the confidence of whether a target is present, and c is the number of target categories. In the embodiment of the present application, each detection head has one category (c = 1), and thus the number of output feature channels is 6b.
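A small illustrative check of the channel count follows; the values b = 3 (B = 9 anchor frames spread evenly over three scales) and c = 1 are assumptions consistent with the example above.

```python
def output_channels(b, c):
    # 4 box regression offsets + 1 objectness confidence + c class scores per anchor
    return b * (4 + 1 + c)

print(output_channels(b=3, c=1))  # 18, i.e. 6b with one category per detection head
```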
The training process of the multitasking network is described below.
When training the multi-task detection network, an anchor frame design is performed first. For each detection head, the total number of anchor frames B over all scales is specified (for example, B equals 9); taking the widths and heights of the labelled rectangular frames as features, all rectangular frames are clustered into B classes using the k-means clustering algorithm, and the class centers of the B classes are taken as the widths and heights of the corresponding anchor frames. Finally, the anchor frames are sorted by area from large to small: the first B/3 anchor frames are used on the largest feature map, the last B/3 anchor frames are used on the smallest feature map, and the remaining anchor frames are used on the feature map of intermediate size.
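A minimal sketch of this anchor design step follows, assuming scikit-learn is available for the k-means clustering; the toy box sizes are placeholders for the labelled rectangular frames.

```python
import numpy as np
from sklearn.cluster import KMeans

def design_anchors(box_whs, total_anchors=9, seed=0):
    """box_whs: (N, 2) array of labelled rectangular-frame widths and heights."""
    km = KMeans(n_clusters=total_anchors, n_init=10, random_state=seed)
    km.fit(np.asarray(box_whs, dtype=np.float32))
    anchors = km.cluster_centers_                          # class centers = anchor sizes
    order = np.argsort(-(anchors[:, 0] * anchors[:, 1]))   # sort by area, large first
    return anchors[order]

rng = np.random.default_rng(0)
boxes = rng.uniform(10, 300, size=(500, 2))   # toy stand-in for labelled boxes
print(design_anchors(boxes))
```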
Then, for the loss function design, for each detection head, the loss function on the n-th branch is determined as the following equation (1):

$$
\begin{aligned}
\mathrm{loss}_n ={}& \alpha \sum_{i=1}^{S_n}\sum_{j=1}^{S_n}\sum_{b=1}^{B_n} \mathbb{1}_{ij}^{\mathrm{obj}}\Big[(x_{ij}-\hat{x}_{ij})^2+(y_{ij}-\hat{y}_{ij})^2\Big] \\
&+ \beta \sum_{i=1}^{S_n}\sum_{j=1}^{S_n}\sum_{b=1}^{B_n} \mathbb{1}_{ij}^{\mathrm{obj}}\Big[(w_{ij}-\hat{w}_{ij})^2+(h_{ij}-\hat{h}_{ij})^2\Big] \\
&+ \gamma \sum_{i=1}^{S_n}\sum_{j=1}^{S_n}\sum_{b=1}^{B_n} \big(C_{ij}-\hat{C}_{ij}\big)^2 \\
&+ \sum_{i=1}^{S\times S} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{k\in c} \big(p_i(k)-\hat{p}_i(k)\big)^2 \qquad (1)
\end{aligned}
$$

where the first row is the regression loss of the offset of the prediction frame relative to the anchor frame center, and the second row is the regression loss of the width and height of the prediction frame relative to the anchor frame; $S_n$ represents the width and height of the output feature map on the n-th branch; $B_n$ is the number of anchor frames at each position of the output feature map on the n-th branch; $S$ is the width and height of the output feature map corresponding to the detection head; $\mathbb{1}_{ij}^{\mathrm{obj}}$ indicates whether position $(i, j)$ of the output feature map contains a target, taking the value 1 if so and 0 otherwise; the third row is the confidence loss for whether a target is present; the fourth row is the category loss computed on the output feature map, where $k\in c$ is evaluated only for prediction frames that contain a target, that is, the category loss is only computed for prediction frames with targets; $\alpha$, $\beta$ and $\gamma$ are the weights of the respective losses and are adjustable hyper-parameters.

$x_{ij}$ and $y_{ij}$ denote the abscissa and ordinate of the center point of the prediction frame, and $\hat{x}_{ij}$ and $\hat{y}_{ij}$ the abscissa and ordinate of the center point of the anchor frame; $w_{ij}$ and $h_{ij}$ denote the width and height of the prediction frame, and $\hat{w}_{ij}$ and $\hat{h}_{ij}$ the width and height of the anchor frame; $C_{ij}$ is the predicted value of whether a target exists, and $\hat{C}_{ij}$ the corresponding true value; $p_i(k)$ is the predicted probability of being a target of category $k$ (i.e. vehicle or pedestrian), and $\hat{p}_i(k)$ the corresponding true value.
Since the prediction targets include two categories, vehicle and pedestrian, the losses on the two branches, $\mathrm{loss}_1$ and $\mathrm{loss}_2$, are calculated.

The total loss of the model is the sum of the losses of the branches, see the following equation (2):

$$\mathrm{loss} = \mathrm{loss}_1 + \mathrm{loss}_2 \qquad (2)$$
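A simplified sketch of the per-branch loss in equation (1) follows, assuming PyTorch, squared-error terms throughout and a dense target tensor; the tensor layout and the handling of the confidence and category terms are illustrative assumptions rather than the exact formulation of the embodiment.

```python
import torch

def branch_loss(pred, target, obj_mask, alpha=1.0, beta=1.0, gamma=1.0):
    """pred, target: (B_n, S_n, S_n, 5 + c) tensors holding x, y, w, h,
    confidence and class scores per anchor position; obj_mask: (B_n, S_n, S_n)
    boolean tensor that is True where a target is assigned."""
    m = obj_mask
    xy_loss = ((pred[..., 0:2] - target[..., 0:2]) ** 2).sum(-1)[m].sum()
    wh_loss = ((pred[..., 2:4] - target[..., 2:4]) ** 2).sum(-1)[m].sum()
    conf_loss = ((pred[..., 4] - target[..., 4]) ** 2).sum()
    cls_loss = ((pred[..., 5:] - target[..., 5:]) ** 2).sum(-1)[m].sum()
    return alpha * xy_loss + beta * wh_loss + gamma * conf_loss + cls_loss

p, t = torch.rand(3, 13, 13, 6), torch.rand(3, 13, 13, 6)
print(branch_loss(p, t, obj_mask=t[..., 4] > 0.5))
# Per equation (2), the total loss sums the branch losses of the two detection heads.
```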
When the multi-task detection network is trained, a Batch training method can be adopted: a sample batch (Batch) is first obtained from the pedestrian data set, the loss is calculated through a forward pass, and the weights of the backbone network, the multi-scale feature network and the pedestrian detection head are then updated by back-propagating the error gradients; next, a Batch is obtained from the vehicle data set, the loss is calculated through a forward pass, and the weights of the backbone network, the multi-scale feature network and the vehicle detection head are updated by back-propagating the error gradients. The above steps are performed in alternating loops until the training end condition is satisfied (e.g., the validation error reaches the set value or the number of training rounds reaches the set value).
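A minimal sketch of this alternating Batch training loop follows, assuming PyTorch DataLoaders and a model whose forward pass can be restricted to one detection head; all names (compute_loss, the head argument, and so on) are placeholders rather than APIs defined by the embodiment.

```python
import itertools

def train_alternating(model, ped_loader, veh_loader, optimizer, compute_loss,
                      max_steps=10000):
    batches = zip(itertools.cycle(ped_loader), itertools.cycle(veh_loader))
    for step, (ped_batch, veh_batch) in enumerate(batches):
        if step >= max_steps:            # stand-in for the real training end condition
            break
        for (images, targets), head_name in ((ped_batch, "pedestrian"),
                                             (veh_batch, "vehicle")):
            preds = model(images, head=head_name)   # forward pass through one head
            loss = compute_loss(preds, targets, head=head_name)
            optimizer.zero_grad()
            loss.backward()              # gradients reach the backbone, the
            optimizer.step()             # multi-scale network and this head only
```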
In some embodiments, the detection head for each task may also be fine-tuned. Since the training of the multi-task detection network has been completed in the above steps, each detection head can be further fine-tuned to obtain a better detection effect. The specific operation is to first freeze the weights of the backbone network and the multi-scale feature network; because they form the shared part of the network, updating the weights of a detection head after freezing does not affect the effects of the other detection heads. After the shared part is frozen, the multi-task detection network is continuously trained on the pedestrian data set with a smaller learning rate, and only the weights of the pedestrian detection head are updated in each iteration. The same operation is also performed for the vehicle detection head.
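A minimal sketch of this fine-tuning stage follows, assuming PyTorch and a model whose shared part (backbone and multi-scale feature network) and per-category detection heads are accessible as submodules; the attribute names are illustrative.

```python
import torch

def finetune_head(model, head, loader, compute_loss, lr=1e-4, epochs=1):
    # Freeze the shared part so the other detection heads are unaffected.
    for module in (model.backbone, model.neck):   # neck = multi-scale feature network
        for p in module.parameters():
            p.requires_grad = False
    optimizer = torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, targets in loader:
            loss = compute_loss(model(images), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()             # only this head's weights are updated

# e.g. finetune_head(model, model.pedestrian_head, pedestrian_loader, compute_loss)
```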
The training of the multi-task detection network can be completed through the steps.
Next, the process of fusing the detection results and applying the network in step S804 will be described.
In the implementation process, the target frames output by the different detection heads can be combined together; then, non-maximum suppression (NMS, Non-Maximum Suppression) is performed on the combined target frames to obtain the final target frames; finally, the resulting target frames are used for downstream applications.
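A minimal sketch of this fusion step follows, assuming torchvision is available for the non-maximum suppression; class-agnostic NMS over the merged set of frames is used here, matching the description of suppressing the combined target frames.

```python
import torch
from torchvision.ops import nms

def fuse_detections(head_outputs, iou_thresh=0.5):
    """head_outputs: list of (boxes, scores, labels) tuples, one per detection
    head, with boxes in (x1, y1, x2, y2) format."""
    boxes = torch.cat([b for b, _, _ in head_outputs])
    scores = torch.cat([s for _, s, _ in head_outputs])
    labels = torch.cat([l for _, _, l in head_outputs])
    keep = nms(boxes, scores, iou_thresh)   # suppress overlaps in the merged set
    return boxes[keep], scores[keep], labels[keep]
```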
It should be noted that, the class decoupling method provided by the embodiment of the application can better solve the problem of unbalanced samples of different classes in target detection, thereby improving the detection effect under unbalanced samples.
It can be understood that, in the embodiments of the present application, where user information (for example, information such as the image to be detected and the target detection result) or data related to user information or enterprise information is involved, when the embodiments of the present application are applied to specific products or technologies, user permission or consent is required, or the information is blurred so as to eliminate the correspondence between the information and the user; and the collection and processing of related data should strictly comply with the requirements of relevant national laws and regulations, obtain the informed consent or separate consent of the personal information subject, and carry out subsequent data use and processing within the scope authorized by laws, regulations and the personal information subject.
Continuing with the description below, the object detection device 354 provided in accordance with embodiments of the present application is implemented as an exemplary architecture of a software module, and in some embodiments, as shown in fig. 2, the object detection device 354 includes: the multi-scale feature fusion module 3541 is configured to perform multi-scale feature fusion on feature graphs with different resolutions in an image to be detected, so as to obtain output features under multiple scales; wherein the multi-scale feature fusion is to fuse feature maps with at least one resolution; a first determining module 3542, configured to determine a plurality of detection categories for performing target detection on the image to be detected and detection heads corresponding to each detection category; wherein each detection category corresponds to one detection head; the feature mapping module 3543 is configured to perform feature mapping on the output features of the multiple scales for each detection head, so as to obtain mapping features matched with the corresponding detection head; wherein the mapping features are matched with the corresponding detection heads for characterization: the mapping features are features required by the corresponding detection heads when detecting the corresponding detection categories; the target detection module 3544 is configured to invoke the detection head corresponding to each detection category, and perform target detection on the image to be detected based on the matched mapping feature, so as to obtain at least one detection frame for each detection category; a second determining module 3545, configured to determine, based on the at least one detection frame, a target detection result of the image to be detected under each detection category.
In some embodiments, the apparatus further comprises: the maximum pooling processing module is used for calling a backbone layer in a backbone network to carry out maximum pooling processing on the image to be detected so as to obtain maximum pooling characteristics; the multi-resolution feature extraction module is used for calling a plurality of residual convolution layers in the backbone network to sequentially extract features of the maximum pooling features under different resolutions, so as to obtain the feature images with different resolutions; wherein each residual convolution layer corresponds to a downsampling scale and a resolution corresponding to the downsampling scale; the residual convolution layers are sequentially connected, and the residual convolution layers are connected behind the trunk layer; and the resolutions corresponding to the residual convolution layers which are connected in sequence are decreased.
In some embodiments, each of the residual convolution layers comprises a plurality of residual convolution modules; the multi-resolution feature extraction module is further configured to: if the current residual convolution layer is the first residual convolution layer in the backbone network, a plurality of residual convolution modules in the current residual convolution layer are called to carry out convolution processing on the maximum pooling feature to obtain a convolution feature; if the current residual convolution layer is the N-th residual convolution layer in the backbone network, calling a plurality of residual convolution modules in the current residual convolution layer to carry out convolution processing on the convolution characteristics output by the N-1-th residual convolution layer to obtain iterative convolution characteristics; n is an integer greater than 1 and N is less than or equal to the total number of residual convolution layers; determining a feature map for each of the plurality of residual convolutional layers based on the convolutional features and the iterative convolutional features output by the respective residual convolutional layer; wherein, each characteristic diagram corresponding to the residual convolution layer has different resolution.
In some embodiments, the multi-scale feature fusion module is further to: invoking a plurality of single-scale feature modules in a multi-scale feature network, and respectively carrying out multi-scale feature fusion on the feature graphs corresponding to each residual convolution layer to obtain output features under a plurality of scales; wherein each of the single-scale feature modules outputs an output feature at one scale.
In some embodiments, the feature map corresponding to the nth residual convolution layer includes feature information in the feature maps corresponding to the first through N-1 th residual convolution layers; each residual convolution layer corresponds to a single-scale feature module; the single-scale feature modules are sequentially connected; each single-scale feature module corresponds to a feature scale; the multi-scale feature fusion module is further configured to: aiming at the last single-scale feature module in the plurality of single-scale feature modules, calling the single-scale feature module to perform feature mapping on a feature map corresponding to the last residual convolution layer to obtain the output feature of the last single-scale feature module under the corresponding feature scale; invoking an N-1 single-scale feature module, and carrying out multi-scale feature fusion on the output features of the N single-scale feature module under the feature scale and the feature map corresponding to the N-1 residual convolution layer to obtain the output features of the N-1 single-scale feature module under the corresponding feature scale; n is an integer greater than 1 and N is less than or equal to the total number of single-scale feature modules.
In some embodiments, the multi-scale feature fusion module is further to: calling an N-1 single-scale feature module, and adding the output features of the N single-scale feature module under the feature scale and the feature map corresponding to the N-1 residual convolution layer to obtain the output features of the N-1 single-scale feature module under the corresponding feature scale; or, calling an N-1 single-scale feature module, and performing splicing treatment on the output features of the N single-scale feature module under the feature scale and the feature map corresponding to the N-1 residual convolution layer to obtain a spliced feature map; and carrying out convolution dimension reduction processing on the spliced feature map to obtain the output features of the N-1 single-scale feature module under the corresponding feature scale.
In some embodiments, each detection head corresponds to a feature mapping network; the feature mapping module is further configured to: and for each detection head, invoking a feature mapping network corresponding to the detection head, and respectively performing feature mapping on the output features under the multiple scales through a feature mapping unit in the feature mapping network to obtain mapping features matched with the corresponding detection head.
In some embodiments, the object detection module is further to: invoking the detection head corresponding to each detection category, and performing target detection on the image to be detected based on the matched mapping characteristics to obtain position information and size information of each detection frame under each detection category; and rendering the detection frame into the image to be detected based on the position information and the size information.
In some embodiments, the target detection method is implemented by a target detection model; the apparatus further comprises: model training module for: constructing a sample data set corresponding to each detection category, and acquiring the total anchor frame quantity corresponding to a plurality of specified scales; training the target detection model by iterating the following steps in a loop until the target detection model converges: carrying out multi-scale feature fusion on sample feature images with different resolutions in sample detection images in a sample data set to obtain sample output features under a plurality of scales; performing feature mapping on the sample output features under the multiple scales aiming at each detection head to obtain sample mapping features matched with the corresponding detection heads; wherein the sample mapping features are matched with the corresponding detection heads for characterization: the sample mapping characteristics are sample characteristics required by the corresponding detection head when detecting the corresponding detection category; acquiring a plurality of rectangular frames in the sample detection image; clustering the plurality of rectangular frames to obtain a plurality of classes corresponding to the total anchor frames and class cores of each class; determining a plurality of anchor frames in the current iterative training process based on each class and the class center; applying the anchor frames to a sample mapping feature map corresponding to the sample mapping feature matched with each detection head; based on the position information of the anchor frame in the sample mapping feature map and the labeling information marked in advance, performing loss calculation on a target detection model in the current iterative training process to obtain a loss result; and correcting model parameters in the target detection model based on the loss result to obtain a trained target detection model.
In some embodiments, the model training module is further to: for any detection category, collecting a sample image with the detection category from an image library, and labeling a detection target corresponding to the detection category in the sample image to form a labeled sample detection image; the sample detection image corresponds to a piece of labeling information; adding the sample detection image to a sample data set corresponding to the detection category; wherein the same sample image belongs to a sample detection image in a sample data set of at least one detection class.
In some embodiments, the object detection model includes a backbone network, a multi-scale feature network, a feature mapping network, and at least one detection head; the model training module is further configured to: inputting the sample detection image into the target detection model, and extracting features of the sample detection image under different resolutions through the backbone network to obtain the sample feature images with different resolutions; carrying out multi-scale feature fusion on the sample feature graphs with different resolutions through the multi-scale feature network to obtain sample output features under multiple scales; and for each detection head, performing feature mapping on the sample output features under the multiple scales through the feature mapping network to obtain sample mapping features matched with the corresponding detection head.
In some embodiments, the feature mapping network includes a plurality of feature mapping units, the number of feature mapping units being the same as the number of single-scale feature modules in the multi-scale feature network; the sample mapping features include sub-sample mapping features at multiple scales; each sub-sample mapping feature corresponds to a sub-mapping feature map; the model training module is further configured to: determining the area of each anchor frame; sequencing the anchor frames according to the sequence from the large area to the small area to form an anchor frame sequence; dividing the anchor frames into a plurality of anchor frame groups according to the order of the anchor frames in the anchor frame sequence; the number of the anchor frame groups is the same as the number of the feature mapping units in the feature mapping network; and applying all anchor frames in each anchor frame group to a sub-sample mapping feature map corresponding to a group of sub-sample mapping features according to the scale of each mapping feature and the area of the anchor frames in the anchor frame group.
In some embodiments, the model training module is further to: and when the sample detection image is an image in a sample data set corresponding to any detection category, correcting weights in a backbone network, a multi-scale feature network, a feature mapping network and a detection head corresponding to the detection category in the target detection model based on the loss result to obtain the trained target detection model.
It should be noted that, the description of the apparatus according to the embodiment of the present application is similar to the description of the embodiment of the method described above, and has similar beneficial effects as the embodiment of the method, so that a detailed description is omitted. For technical details not disclosed in the present apparatus embodiment, please refer to the description of the method embodiment of the present application for understanding.
Embodiments of the present application provide a computer program product comprising executable instructions, the executable instructions being computer instructions stored in a computer-readable storage medium. When a processor of an electronic device reads the executable instructions from the computer-readable storage medium and executes them, the electronic device is caused to perform the method of the embodiments of the present application described above.
Embodiments of the present application provide a storage medium having stored therein executable instructions which, when executed by a processor, cause the processor to perform a method provided by embodiments of the present application, for example, as shown in fig. 3.
In some embodiments, the storage medium may be a computer-readable storage medium, such as a ferroelectric memory (FRAM, Ferroelectric Random Access Memory), a read-only memory (ROM, Read Only Memory), a programmable read-only memory (PROM, Programmable Read Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read Only Memory), a flash memory, a magnetic surface memory, an optical disk, or a compact disc read-only memory (CD-ROM); or may be various devices including one or any combination of the above memories. In some embodiments, the executable instructions may be in the form of programs, software, software modules, scripts or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system; they may be stored as part of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). As an example, the executable instructions may be deployed to be executed on one electronic device, on multiple electronic devices located at one site, or on multiple electronic devices distributed across multiple sites and interconnected by a communication network.
The foregoing describes merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, or improvement made within the spirit and scope of the present application falls within the protection scope of the present application.

Claims (17)

1. A method of target detection, the method comprising:
performing multi-scale feature fusion on feature maps with different resolutions in an image to be detected to obtain output features at multiple scales; wherein the multi-scale feature fusion fuses feature maps of at least one resolution;
determining a plurality of detection categories for performing target detection on the image to be detected and a detection head corresponding to each detection category; wherein each detection category corresponds to one detection head;
performing, for each detection head, feature mapping on the output features at the multiple scales to obtain mapping features matched with the corresponding detection head; wherein a mapping feature being matched with the corresponding detection head characterizes that the mapping feature is a feature required by the corresponding detection head when detecting the corresponding detection category;
invoking the detection head corresponding to each detection category, and performing target detection on the image to be detected based on the matched mapping features to obtain at least one detection frame for each detection category;
and determining a target detection result of the image to be detected under each detection category based on the at least one detection frame.
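For orientation only, a minimal Python sketch of the flow recited in claim 1 is given below; the callables `fusion_net`, `mapping_nets`, and `heads` and their interfaces are hypothetical placeholders rather than the claimed implementation.

```python
# Sketch of the claim-1 flow: multi-scale fusion -> per-category feature
# mapping -> per-category detection head -> detection result per category.
# All module names and signatures are assumptions for illustration.
from typing import Dict, List

def detect(image, fusion_net, mapping_nets: Dict[str, object],
           heads: Dict[str, object]) -> Dict[str, List[dict]]:
    # 1) Multi-scale feature fusion over feature maps of different resolutions.
    multi_scale_features = fusion_net(image)          # one output feature per scale

    results: Dict[str, List[dict]] = {}
    for category, head in heads.items():
        # 2) Map the shared multi-scale features into features matched
        #    to this category's detection head.
        mapped = mapping_nets[category](multi_scale_features)
        # 3) Run the category-specific head to get detection frames.
        boxes = head(image, mapped)                   # e.g. [{"xywh": ..., "score": ...}, ...]
        # 4) Keep the frames as the detection result for this category.
        results[category] = boxes
    return results
```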
2. The method according to claim 1, wherein the method further comprises:
invoking a trunk layer in a backbone network to perform max pooling processing on the image to be detected to obtain a max pooling feature;
invoking a plurality of residual convolution layers in the backbone network to sequentially perform feature extraction on the max pooling feature at different resolutions to obtain the feature maps with different resolutions;
wherein each residual convolution layer corresponds to a downsampling scale and the resolution corresponding to that downsampling scale; the residual convolution layers are sequentially connected and follow the trunk layer; and the resolutions corresponding to the sequentially connected residual convolution layers decrease in turn.
3. The method of claim 2, wherein each of the residual convolution layers comprises a plurality of residual convolution modules;
the invoking the plurality of residual convolution layers in the backbone network to sequentially perform feature extraction on the max pooling feature at different resolutions to obtain the feature maps with different resolutions comprises:
if the current residual convolution layer is the first residual convolution layer in the backbone network, invoking the plurality of residual convolution modules in the current residual convolution layer to perform convolution processing on the max pooling feature to obtain a convolution feature;
if the current residual convolution layer is the N-th residual convolution layer in the backbone network, invoking the plurality of residual convolution modules in the current residual convolution layer to perform convolution processing on the convolution feature output by the (N-1)-th residual convolution layer to obtain an iterative convolution feature; wherein N is an integer greater than 1 and N is less than or equal to the total number of residual convolution layers;
determining a feature map for each of the plurality of residual convolution layers based on the convolution features and the iterative convolution features output by the respective residual convolution layers; wherein the feature maps corresponding to the residual convolution layers have different resolutions.
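As a non-limiting illustration of claims 2 and 3, a PyTorch-style backbone of this shape might look like the sketch below; the channel widths, the number of residual modules per layer, and the stride-2 downsampling placement are assumptions of the sketch.

```python
# Sketch of a backbone: a trunk layer with max pooling, followed by
# sequentially connected residual convolution layers whose output
# resolutions decrease in turn. Channel counts and depths are assumed.
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))  # residual connection

class Backbone(nn.Module):
    def __init__(self, widths=(64, 128, 256, 512), modules_per_layer=2):
        super().__init__()
        # Trunk layer: convolution + max pooling applied to the input image.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, widths[0], 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(widths[0]), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        layers, in_ch = [], widths[0]
        for out_ch in widths:
            blocks = [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False)]
            blocks += [ResidualModule(out_ch) for _ in range(modules_per_layer)]
            layers.append(nn.Sequential(*blocks))
            in_ch = out_ch
        self.layers = nn.ModuleList(layers)  # resolutions decrease layer by layer

    def forward(self, x):
        feats = []
        x = self.trunk(x)
        for layer in self.layers:
            x = layer(x)       # the N-th layer consumes the (N-1)-th layer's output
            feats.append(x)    # one feature map per residual convolution layer
        return feats

if __name__ == "__main__":
    maps = Backbone()(torch.randn(1, 3, 640, 640))
    print([tuple(m.shape) for m in maps])
```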
4. The method according to claim 3, wherein the performing multi-scale feature fusion on feature maps with different resolutions in the image to be detected to obtain output features at multiple scales comprises:
invoking a plurality of single-scale feature modules in a multi-scale feature network, and respectively performing multi-scale feature fusion on the feature map corresponding to each residual convolution layer to obtain output features at multiple scales; wherein each of the single-scale feature modules outputs an output feature at one scale.
5. The method of claim 4, wherein the feature map corresponding to the N-th residual convolution layer includes feature information in the feature maps corresponding to the first through (N-1)-th residual convolution layers; each residual convolution layer corresponds to one single-scale feature module; the single-scale feature modules are sequentially connected; and each single-scale feature module corresponds to one feature scale;
the invoking the plurality of single-scale feature modules in the multi-scale feature network, and respectively performing multi-scale feature fusion on the feature map corresponding to each residual convolution layer to obtain output features at multiple scales comprises:
for the last single-scale feature module among the plurality of single-scale feature modules, invoking the single-scale feature module to perform feature mapping on the feature map corresponding to the last residual convolution layer to obtain the output feature of the last single-scale feature module at the corresponding feature scale;
invoking the (N-1)-th single-scale feature module, and performing multi-scale feature fusion on the output feature of the N-th single-scale feature module at its feature scale and the feature map corresponding to the (N-1)-th residual convolution layer to obtain the output feature of the (N-1)-th single-scale feature module at the corresponding feature scale; wherein N is an integer greater than 1 and N is less than or equal to the total number of single-scale feature modules.
6. The method of claim 5, wherein the invoking the (N-1)-th single-scale feature module, and performing multi-scale feature fusion on the output feature of the N-th single-scale feature module at its feature scale and the feature map corresponding to the (N-1)-th residual convolution layer to obtain the output feature of the (N-1)-th single-scale feature module at the corresponding feature scale comprises:
invoking the (N-1)-th single-scale feature module, and adding the output feature of the N-th single-scale feature module at its feature scale to the feature map corresponding to the (N-1)-th residual convolution layer to obtain the output feature of the (N-1)-th single-scale feature module at the corresponding feature scale;
or,
invoking the (N-1)-th single-scale feature module, and concatenating the output feature at the feature scale output by the N-th single-scale feature module with the feature map corresponding to the (N-1)-th residual convolution layer to obtain a concatenated feature map;
and performing convolutional dimension reduction on the concatenated feature map to obtain the output feature of the (N-1)-th single-scale feature module at the corresponding feature scale.
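The two fusion variants of claim 6 (element-wise addition, or concatenation followed by convolutional dimension reduction) can be illustrated with the sketch below; the nearest-neighbour upsampling and the 1x1 lateral convolutions are assumptions of the sketch, not requirements of the claims.

```python
# Sketch of top-down multi-scale fusion: the last single-scale module maps
# the coarsest feature map; each earlier module fuses the next module's
# output with its own backbone feature map, either by addition or by
# concatenation + 1x1 convolutional dimension reduction (both variants shown).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    def __init__(self, in_channels=(128, 256, 512), out_channels=128, mode="add"):
        super().__init__()
        self.mode = mode
        # Lateral 1x1 convs bring every backbone feature map to a common width.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # Used only in the "concat" variant to reduce 2*C channels back to C.
        self.reduce = nn.ModuleList(nn.Conv2d(2 * out_channels, out_channels, 1)
                                    for _ in in_channels[:-1])

    def forward(self, backbone_maps):
        # backbone_maps are ordered from highest to lowest resolution.
        laterals = [lat(f) for lat, f in zip(self.lateral, backbone_maps)]
        outputs = [laterals[-1]]                  # last module: plain feature mapping
        for i in range(len(laterals) - 2, -1, -1):
            upper = F.interpolate(outputs[0], size=laterals[i].shape[-2:], mode="nearest")
            if self.mode == "add":
                fused = laterals[i] + upper       # element-wise addition
            else:
                fused = self.reduce[i](torch.cat([laterals[i], upper], dim=1))
            outputs.insert(0, fused)              # keep high-to-low resolution order
        return outputs

if __name__ == "__main__":
    maps = [torch.randn(1, 128, 80, 80), torch.randn(1, 256, 40, 40), torch.randn(1, 512, 20, 20)]
    print([tuple(o.shape) for o in MultiScaleFusion(mode="concat")(maps)])
```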
7. The method of any one of claims 1 to 6, wherein each of the detection heads corresponds to a feature mapping network;
the performing, for each detection head, feature mapping on the output features at the multiple scales to obtain mapping features matched with the corresponding detection head comprises:
for each detection head, invoking the feature mapping network corresponding to the detection head, and respectively performing feature mapping on the output features at the multiple scales through the feature mapping units in the feature mapping network to obtain mapping features matched with the corresponding detection head.
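As an illustration of claim 7, a per-head feature mapping network might be organized as in the sketch below, where each feature mapping unit is assumed, purely for the sketch, to be a small 3x3 convolution block and the category names are hypothetical examples.

```python
# Sketch: one feature mapping network per detection head, holding one
# feature mapping unit per scale. Treating each unit as a conv block
# is an assumption of this sketch.
import torch.nn as nn

class FeatureMappingNetwork(nn.Module):
    def __init__(self, channels: int, num_scales: int):
        super().__init__()
        self.units = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for _ in range(num_scales)
        )

    def forward(self, multi_scale_features):
        # Map each scale's output feature with its own feature mapping unit,
        # producing mapping features matched to this network's detection head.
        return [unit(feat) for unit, feat in zip(self.units, multi_scale_features)]

# One feature mapping network per detection category / detection head
# (category names here are illustrative only).
mapping_nets = nn.ModuleDict({
    "pedestrian": FeatureMappingNetwork(channels=128, num_scales=3),
    "vehicle": FeatureMappingNetwork(channels=128, num_scales=3),
})
```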
8. The method according to any one of claims 1 to 6, wherein the invoking the detection head corresponding to each detection category, and performing target detection on the image to be detected based on the matched mapping features to obtain at least one detection frame for each detection category comprises:
invoking the detection head corresponding to each detection category, and performing target detection on the image to be detected based on the matched mapping features to obtain position information and size information of each detection frame under each detection category;
and rendering each detection frame into the image to be detected based on the position information and the size information.
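Rendering detection frames from their position and size information, as in claim 8, could be done along the lines of the following sketch; the center-based box format and the use of OpenCV are assumptions of the sketch.

```python
# Sketch: draw detection frames onto the image to be detected using their
# position (center) and size information. Box format and colors are assumed.
import cv2  # assumed available; any drawing library would do

def render_detections(image, detections):
    # detections: {category: [(cx, cy, w, h, score), ...]} in pixel units (assumed format)
    for category, boxes in detections.items():
        for cx, cy, w, h, score in boxes:
            x1, y1 = int(cx - w / 2), int(cy - h / 2)
            x2, y2 = int(cx + w / 2), int(cy + h / 2)
            cv2.rectangle(image, (x1, y1), (x2, y2), color=(0, 255, 0), thickness=2)
            cv2.putText(image, f"{category} {score:.2f}", (x1, max(y1 - 5, 0)),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return image
```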
9. The method according to any one of claims 1 to 6, wherein the target detection method is implemented by a target detection model; the method further comprises:
constructing a sample data set corresponding to each detection category, and acquiring a total number of anchor frames corresponding to a plurality of specified scales;
training the target detection model by iterating the following steps in a loop until the target detection model converges:
performing multi-scale feature fusion on sample feature maps with different resolutions in a sample detection image in the sample data set to obtain sample output features at multiple scales;
performing, for each detection head, feature mapping on the sample output features at the multiple scales to obtain sample mapping features matched with the corresponding detection head; wherein a sample mapping feature being matched with the corresponding detection head characterizes that the sample mapping feature is a sample feature required by the corresponding detection head when detecting the corresponding detection category;
acquiring a plurality of rectangular frames in the sample detection image;
clustering the plurality of rectangular frames to obtain a plurality of classes corresponding to the total number of anchor frames and a class center of each class;
determining a plurality of anchor frames in the current iterative training process based on each class and the class center;
applying the anchor frames to a sample mapping feature map corresponding to the sample mapping feature matched with each detection head;
performing, based on position information of the anchor frames in the sample mapping feature map and pre-labeled annotation information, loss calculation on the target detection model in the current iterative training process to obtain a loss result;
and correcting model parameters in the target detection model based on the loss result to obtain a trained target detection model.
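The clustering of rectangular frames into anchor classes in claim 9 can be illustrated with a k-means-style procedure over box sizes; the Euclidean metric on (width, height) used below is an assumption of the sketch rather than a metric fixed by the claim.

```python
# Sketch: cluster the labeled rectangular frames into K classes and use the
# class centers as anchor frames for the current training iteration.
# Plain (width, height) k-means with a Euclidean metric is an assumption here.
import numpy as np

def cluster_anchor_frames(box_sizes: np.ndarray, k: int, iters: int = 50,
                          seed: int = 0) -> np.ndarray:
    # box_sizes: (N, 2) array of (width, height) of the rectangular frames.
    rng = np.random.default_rng(seed)
    centers = box_sizes[rng.choice(len(box_sizes), size=k, replace=False)]
    for _ in range(iters):
        # Assign every rectangular frame to its nearest class center.
        dists = np.linalg.norm(box_sizes[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each class center as the mean of its assigned frames.
        new_centers = np.array([box_sizes[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers  # the class centers serve as the anchor frames

if __name__ == "__main__":
    sizes = np.abs(np.random.default_rng(1).normal(loc=(80, 120), scale=(30, 40), size=(500, 2)))
    print(cluster_anchor_frames(sizes, k=9).round(1))
```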
10. The method of claim 9, wherein the constructing a sample data set corresponding to each detection category comprises:
for any detection category, collecting a sample image containing the detection category from an image library, and labeling the detection target corresponding to the detection category in the sample image to form a labeled sample detection image; wherein the sample detection image corresponds to one piece of labeling information;
adding the sample detection image to the sample data set corresponding to the detection category; wherein the same sample image may belong, as a sample detection image, to the sample data set of at least one detection category.
11. The method of claim 9, wherein the object detection model comprises a backbone network, a multi-scale feature network, a feature mapping network, and at least one detection head;
the performing multi-scale feature fusion on sample feature maps with different resolutions in a sample detection image in the sample data set to obtain sample output features at multiple scales, and performing, for each detection head, feature mapping on the sample output features at the multiple scales to obtain sample mapping features matched with the corresponding detection head comprises:
inputting the sample detection image into the target detection model, and performing feature extraction on the sample detection image at different resolutions through the backbone network to obtain the sample feature maps with different resolutions;
performing multi-scale feature fusion on the sample feature maps with different resolutions through the multi-scale feature network to obtain sample output features at multiple scales;
and performing, for each detection head, feature mapping on the sample output features at the multiple scales through the feature mapping network to obtain sample mapping features matched with the corresponding detection head.
12. The method of claim 11, wherein the feature mapping network comprises a plurality of feature mapping units, the number of feature mapping units being the same as the number of single-scale feature modules in the multi-scale feature network; the sample mapping features include sub-sample mapping features at multiple scales; each sub-sample mapping feature corresponds to a sub-mapping feature map;
the applying the plurality of anchor frames to the sample mapping feature map corresponding to the sample mapping feature matched with each detection head includes:
determining the area of each anchor frame;
sorting the anchor frames in descending order of area to form an anchor frame sequence;
dividing the anchor frames into a plurality of anchor frame groups according to their order in the anchor frame sequence; wherein the number of anchor frame groups is the same as the number of feature mapping units in the feature mapping network;
and applying all anchor frames in each anchor frame group to the sub-sample mapping feature map corresponding to one group of sub-sample mapping features according to the scale of each mapping feature and the areas of the anchor frames in the anchor frame group.
13. The method of claim 11, wherein the correcting model parameters in the target detection model based on the loss result to obtain a trained target detection model comprises:
when the sample detection image is an image in the sample data set corresponding to any detection category,
correcting, based on the loss result, the weights of the backbone network, the multi-scale feature network, the feature mapping network, and the detection head corresponding to that detection category in the target detection model, to obtain the trained target detection model.
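The selective correction in claim 13, where a sample from one category's data set updates the shared networks plus only that category's detection head, might be sketched as follows; the `model(images, category=...)` interface and the `model.heads` attribute are hypothetical assumptions of the sketch.

```python
# Sketch: when the current sample belongs to the data set of one detection
# category, correct the shared backbone / multi-scale / feature mapping
# weights and only that category's detection head. The model interface and
# the per-category head registry (model.heads) are assumptions of this sketch.

def train_step(model, optimizer, criterion, images, targets, sample_category: str):
    optimizer.zero_grad()
    predictions = model(images, category=sample_category)   # hypothetical interface
    loss = criterion(predictions, targets)
    loss.backward()
    # Discard gradients of every detection head that does not match the
    # sample's category, so optimizer.step() leaves those heads unchanged.
    for category, head in model.heads.items():
        if category != sample_category:
            for p in head.parameters():
                p.grad = None
    optimizer.step()   # corrects the shared networks and the matching head
    return loss.item()
```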
14. A target detection device, the device comprising:
a multi-scale feature fusion module, configured to perform multi-scale feature fusion on feature maps with different resolutions in an image to be detected to obtain output features at multiple scales; wherein the multi-scale feature fusion fuses feature maps of at least one resolution;
a first determining module, configured to determine a plurality of detection categories for performing target detection on the image to be detected and a detection head corresponding to each detection category; wherein each detection category corresponds to one detection head;
a feature mapping module, configured to perform, for each detection head, feature mapping on the output features at the multiple scales to obtain mapping features matched with the corresponding detection head; wherein a mapping feature being matched with the corresponding detection head characterizes that the mapping feature is a feature required by the corresponding detection head when detecting the corresponding detection category;
a target detection module, configured to invoke the detection head corresponding to each detection category and perform target detection on the image to be detected based on the matched mapping features to obtain at least one detection frame for each detection category;
and a second determining module, configured to determine a target detection result of the image to be detected under each detection category based on the at least one detection frame.
15. An electronic device, comprising:
a memory for storing executable instructions; and a processor for implementing the target detection method according to any one of claims 1 to 13 when executing the executable instructions stored in the memory.
16. A computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to implement the target detection method according to any one of claims 1 to 13.
17. A computer program product or computer program, comprising executable instructions stored in a computer-readable storage medium;
wherein the target detection method according to any one of claims 1 to 13 is implemented when a processor of an electronic device reads the executable instructions from the computer-readable storage medium and executes the executable instructions.
Application CN202310786732.7A, filed 2023-06-29 (priority date 2023-06-29): Object detection method, device, apparatus, storage medium and computer program product. Legal status: Pending. Publication: CN116959026A (en).

Priority Applications (1)

Application Number: CN202310786732.7A | Priority Date: 2023-06-29 | Filing Date: 2023-06-29 | Title: Object detection method, device, apparatus, storage medium and computer program product

Publications (1)

Publication Number: CN116959026A (en) | Publication Date: 2023-10-27

Family

ID=88457432

Family Applications (1)

Application Number: CN202310786732.7A | Status: Pending, published as CN116959026A (en) | Priority Date: 2023-06-29 | Filing Date: 2023-06-29 | Title: Object detection method, device, apparatus, storage medium and computer program product

Country Status (1)

Country: CN | Publication: CN116959026A (en)


Legal Events

Code: PB01 | Event: Publication