CN116071557A - Long tail target detection method, computer readable storage medium and driving device - Google Patents


Info

Publication number
CN116071557A
CN116071557A
Authority
CN
China
Prior art keywords
target detection
feature
long
detection model
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310136841.4A
Other languages
Chinese (zh)
Inventor
谢涛 (Xie Tao)
王百超 (Wang Baichao)
刘国翌 (Liu Guoyi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Weilai Zhijia Technology Co Ltd
Original Assignee
Anhui Weilai Zhijia Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Weilai Zhijia Technology Co Ltd filed Critical Anhui Weilai Zhijia Technology Co Ltd
Priority to CN202310136841.4A priority Critical patent/CN116071557A/en
Publication of CN116071557A publication Critical patent/CN116071557A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention relates to the technical field of target detection, and in particular to a long-tail target detection method, a computer-readable storage medium, and a driving device, aiming to solve the problems that existing long-tail target detection methods are limited in applicable scenes and low in accuracy. To this end, the long-tail target detection method of the present invention comprises: based on candidate frames obtained by a first target detection model and a second target detection model each detecting an image to be identified, extracting the features corresponding to the respective candidate frames, namely a first feature and a second feature, from the feature maps produced by the feature extraction layers of the corresponding target detection models; fusing the first feature and the second feature to obtain a fused feature; and inputting the fused feature into a detection head to obtain a first long-tail target detection result for the image to be identified. By combining target detection models trained on training sets from different scenes, the method helps enrich scene diversity and improves both the accuracy of long-tail target detection and the range of applicable scenes.

Description

Long tail target detection method, computer readable storage medium and driving device
Technical Field
The invention relates to the technical field of target detection, and in particular provides a long-tail target detection method, a computer-readable storage medium, and a driving device.
Background
Real natural-scene data sets are the basis for developing autonomous driving capability, and building a vehicle's perception capability requires different data for different perception tasks. However, data distribution in real scenes exhibits a pronounced long-tail effect: across the whole data set, a small number of categories account for most of the training samples, while a large number of categories have very few. For example, in real scenes, common categories such as vehicles, pedestrians, non-motor vehicles, roads, and buildings occur frequently, while tail categories such as obstacles, animals, and car accidents, as well as tail scenes, occur relatively rarely. If such category-imbalanced data is used directly to develop a perception model, the model's recognition rate for long-tail objects drops sharply, which can easily cause safety problems.
In view of the above problems, some related technologies use a neural network model to simulate various real scenes, acquire simulation data in different scenes, and train a perception model on the simulation data. However, simulation data cannot fully reflect the diversity of real scenes, and high-value long-tail data is hard to obtain this way. Other related technologies use a model trained on a generative adversarial network to encode a plurality of training pictures, obtaining a plurality of hidden variables corresponding to scene information; randomly select a candidate picture and, from the hidden variables and the candidate picture, generate augmented pictures of the candidate picture under various scene information using a model trained on a second generative adversarial network; and finally add the augmented pictures to the data set. Such methods can rapidly generate a large amount of combined augmentation data from a few scene template pictures and candidate pictures, and can alleviate the model's low recognition rate for long-tail objects to some extent. However, because the range of pictures that can be generated is limited by the template and candidate-picture features, unseen scene samples cannot be generated, so the scene-mobility problem of long-tail objects remains unsolved.
Disclosure of Invention
The invention aims to solve the technical problem that existing long-tail target detection methods are limited in applicable scenes and low in accuracy.
In a first aspect, the present invention provides a long tail target detection method, which may include:
detecting an image to be identified by using a first target detection model and a second target detection model to obtain a first candidate frame and a second candidate frame, wherein the first target detection model and the second target detection model adopt training sets from different scenes for training;
extracting a first feature corresponding to the first candidate frame from a feature map obtained by a feature extraction layer of the first target detection model, and extracting a second feature corresponding to the second candidate frame from a feature map obtained by a feature extraction layer of the second target detection model;
fusing the first feature and the second feature to obtain a fused feature; and
and inputting the fusion characteristic into a detection head to obtain a first long tail target detection result of the image to be identified.
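The four steps above can be sketched in Python as follows. This is a minimal illustration only: `detect_and_extract`, `fuse`, and `detection_head` are hypothetical stand-ins (the two "models" here return random fixed-size ROI features, fusion is plain concatenation for brevity rather than the bilinear pooling the patent prefers, and the head is a fixed linear projection), not the patent's implementation.

```python
import numpy as np

# Stand-in "models": each returns a candidate frame and a fixed-size ROI
# feature taken from that model's own feature extraction layer.
def detect_and_extract(image, seed):
    rng = np.random.default_rng(seed)
    box = (10, 10, 50, 50)              # candidate frame (x1, y1, x2, y2)
    feature = rng.standard_normal(64)   # ROI feature from the feature map
    return box, feature

def fuse(f1, f2):
    # Concatenation stands in for fusion here; the patent itself fuses
    # the two features by bilinear pooling.
    return np.concatenate([f1, f2])

def detection_head(fused):
    # Stand-in detection head: a fixed linear projection to class scores.
    rng = np.random.default_rng(42)
    weights = rng.standard_normal((fused.size, 4))
    return fused @ weights

image = np.zeros((3, 224, 224))                   # image to be identified
box1, feat1 = detect_and_extract(image, seed=0)   # first detection model
box2, feat2 = detect_and_extract(image, seed=1)   # second detection model
fused = fuse(feat1, feat2)                        # fused feature
scores = detection_head(fused)                    # long-tail detection result
print(fused.shape, scores.shape)                  # (128,) (4,)
```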
In some embodiments, the extracting the first feature corresponding to the first candidate frame from the feature map obtained by the feature extraction layer of the first object detection model includes:
extracting a first feature corresponding to the first candidate frame from a feature map obtained from a backbone network and/or a neck network of the first target detection model;
and/or,
the extracting the second feature corresponding to the second candidate frame from the feature map obtained by the feature extraction layer of the second target detection model includes:
and extracting a second feature corresponding to the second candidate frame from a feature map obtained by the backbone network and/or the neck network of the second target detection model.
In some embodiments, the extracting the first feature corresponding to the first candidate box from the feature map obtained from the backbone network and/or the neck network of the first object detection model includes:
extracting initial first features corresponding to the first candidate frames from feature graphs obtained by a backbone network and/or a neck network of the first target detection model;
upsampling the initial first feature of relatively small size to obtain a first feature of a target size.
In some embodiments, the extracting the second feature corresponding to the second candidate box from the feature map obtained from the backbone network and/or the neck network of the second object detection model includes:
extracting initial second features corresponding to the second candidate frames from feature graphs obtained by a backbone network and/or a neck network of the second target detection model;
upsampling the initial second feature of relatively small size to obtain a second feature of the target size.
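The upsampling step above brings the smaller ROI feature up to a common target size so the two features can be fused. A minimal numpy sketch using nearest-neighbour upsampling is shown below; a real implementation would more likely use bilinear interpolation (e.g. `torch.nn.functional.interpolate`), and all shapes are illustrative.

```python
import numpy as np

def upsample_nearest(feat, target_hw):
    """Nearest-neighbour upsample of a (C, H, W) feature to (C, H', W')."""
    c, h, w = feat.shape
    th, tw = target_hw
    rows = np.arange(th) * h // th   # source row index for each target row
    cols = np.arange(tw) * w // tw   # source col index for each target col
    return feat[:, rows][:, :, cols]

small = np.arange(2 * 2 * 2, dtype=float).reshape(2, 2, 2)  # initial 2x2 feature
big = upsample_nearest(small, (4, 4))                        # target size 4x4
print(big.shape)  # (2, 4, 4)
```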
In some embodiments, the fusing the first feature and the second feature to obtain a fused feature includes:
performing a bilinear pooling operation on the first feature and the second feature to obtain the fused feature;
and/or,
after the fused feature is obtained and before the fused feature is input into the detection head, the method further comprises: performing feature dimension reduction and normalization on the fused feature.
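A numpy sketch of the fusion path described above: outer-product bilinear pooling of the two ROI feature vectors, followed by dimension reduction and normalization. The random projection and the signed-square-root-plus-L2 normalization are illustrative assumptions standing in for the learned layers, and the dimensions are arbitrary.

```python
import numpy as np

def bilinear_pool(f1, f2):
    """Outer-product bilinear pooling of two ROI feature vectors."""
    return np.outer(f1, f2).ravel()

def normalize(v, eps=1e-12):
    """Signed square root followed by L2 normalization."""
    v = np.sign(v) * np.sqrt(np.abs(v))
    return v / (np.linalg.norm(v) + eps)

rng = np.random.default_rng(0)
f1, f2 = rng.standard_normal(16), rng.standard_normal(16)
pooled = bilinear_pool(f1, f2)                     # 256-dim bilinear feature
# A random projection stands in for the learned dimension-reduction layer.
reduced = pooled @ rng.standard_normal((256, 64))
fused = normalize(reduced)                         # final 64-dim fused feature
print(pooled.shape, fused.shape)                   # (256,) (64,)
```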
In some embodiments, the long tail target detection method further comprises:
determining a final long-tail target detection result based on the first long-tail target detection result and a second long-tail target detection result detected by the second target detection model.
In some embodiments, the determining the final long tail target detection result based on the first long tail target detection result and the second long tail target detection result detected by the second target detection model includes:
when there is one second long-tail target detection result, comparing the second long-tail target detection result with the first long-tail target detection result, and if they are consistent, determining the first long-tail target detection result as the final long-tail target detection result; and
when there are a plurality of second long-tail target detection results coming from a plurality of second target detection models, determining the final long-tail target detection result by a voting mechanism based on the plurality of second long-tail target detection results and the first long-tail target detection result.
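The voting mechanism above can be sketched as a simple majority vote, here assuming each detection result reduces to a class label and that ties fall back to the first model's result (an assumption; the patent does not fix a tie-breaking rule):

```python
from collections import Counter

def vote(first_result, second_results):
    """Majority vote over the first model's result and several second
    models' results. Ties are broken in favour of the first model."""
    tally = Counter([first_result, *second_results])
    best, best_count = first_result, tally[first_result]
    for label, count in tally.items():
        if count > best_count:   # strict > keeps the first result on ties
            best, best_count = label, count
    return best

print(vote("cone", ["cone", "animal", "cone"]))  # cone
print(vote("animal", ["cone", "debris"]))        # animal (tie -> first model)
```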
In some embodiments, the long tail target detection method further comprises:
acquiring an initial sample;
processing the initial sample to construct a long-tail target sample set, wherein the long-tail target sample set corresponds to an automatic driving scene;
and training a first target detection model to be trained according to the long tail target sample set to obtain the trained first target detection model.
In some embodiments, the processing the initial sample to construct a long tail target sample set includes:
performing image classification on the image sample in the initial sample to obtain a long tail target sample related to an automatic driving scene; and/or
performing image segmentation on the image sample in the initial sample to obtain a long-tail target sample related to an automatic driving scene; and/or
performing image detection on the image sample in the initial sample to obtain a long-tail target sample related to an automatic driving scene; and/or
extracting, using a convolutional neural network, image features of a source image and of the image sample in the initial sample respectively, and calculating a feature distance between the source image and the image sample according to the image features; selecting a long-tail target sample related to the automatic driving scene based on comparison of the feature distance with a first distance threshold; and/or
extracting text features and image features from the text sample and the image sample in the initial sample respectively; fusing the extracted text features and target image features to obtain a first multi-modal fusion feature; extracting image features of a source image to obtain source image features; fusing the text features and the source image features to obtain a second multi-modal fusion feature; calculating a cosine distance between the second multi-modal fusion feature and the first multi-modal fusion feature; and selecting a long-tail target sample related to the automatic driving scene based on comparison of the cosine distance with a second distance threshold.
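The cosine-distance screening step above can be sketched as follows, with the feature extraction and multi-modal fusion stubbed out. The threshold value and the convention that a smaller distance means a more relevant sample are illustrative assumptions:

```python
import numpy as np

def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity; 0 means identical direction."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def select_samples(query_feat, sample_feats, threshold):
    """Keep indices of samples whose (fused) feature lies within the
    cosine-distance threshold of the query's fused feature."""
    return [i for i, f in enumerate(sample_feats)
            if cosine_distance(query_feat, f) < threshold]

query = np.array([1.0, 0.0, 0.0])        # stand-in fused source feature
samples = [np.array([0.9, 0.1, 0.0]),    # semantically close -> kept
           np.array([0.0, 1.0, 0.0])]    # far -> discarded
print(select_samples(query, samples, threshold=0.5))  # [0]
```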
In some embodiments, the training the first target detection model to be trained according to the long tail target sample set to obtain a trained first target detection model includes:
Preliminary training stage:
inputting the long-tail target sample set into the first target detection model to be trained to obtain a detection result of each long-tail target sample in the long-tail target sample set, wherein the detection result comprises a candidate frame and a corresponding category;
iterative training phase:
when the category is a long tail target, determining a pseudo long tail target according to a candidate frame and a truth box corresponding to the category;
constructing a difficult negative sample set based on the pseudo long tail target;
performing iterative training on the first target detection model to be trained by at least using the difficult negative sample set;
judging whether a preset condition is met, and if so, stopping iterative training to obtain the trained first target detection model.
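One plausible reading of the pseudo-long-tail step above: a candidate frame predicted as a long-tail category but overlapping no truth box is a false positive, and is added to the difficult (hard) negative set. The 0.5 IoU threshold below is an illustrative assumption:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def is_pseudo_long_tail(candidate, truth_boxes, iou_thresh=0.5):
    """True if a long-tail prediction matches no truth box (false positive)."""
    return all(iou(candidate, t) < iou_thresh for t in truth_boxes)

truths = [(0, 0, 10, 10)]
print(is_pseudo_long_tail((0, 0, 10, 10), truths))   # False: matches a truth box
print(is_pseudo_long_tail((50, 50, 60, 60), truths)) # True: overlaps nothing
```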
In some embodiments, the determining whether the preset condition is satisfied, if so, stopping the iterative training to obtain a trained first target detection model includes:
judging whether the false recognition rate of the first target detection model to be trained is within a preset threshold range or not;
if yes, determining that the preset condition is met, and stopping iterative training.
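The stopping check above reduces to a predicate on the measured false recognition rate; the bounds below are illustrative assumptions, not values from the patent:

```python
def should_stop(false_recognition_rate, low=0.0, high=0.05):
    """Stop iterative training once the false recognition rate falls
    within the preset threshold range (bounds are illustrative)."""
    return low <= false_recognition_rate <= high

print(should_stop(0.03))  # True: within range, stop training
print(should_stop(0.20))  # False: keep iterating
```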
In a second aspect of the present invention, there is provided a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the long tail target detection method of any one of the above.
In a third aspect of the present invention, there is provided a driving apparatus comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the long tail target detection method of any one of the above.
With the above technical solution, the invention can extract a first feature corresponding to the first candidate frame, obtained by the first target detection model detecting the image to be identified, from the feature map produced by the feature extraction layer of the first target detection model, and extract a second feature corresponding to the second candidate frame, obtained by the second target detection model detecting the image to be identified, from the feature map produced by the feature extraction layer of the second target detection model; fuse the first feature and the second feature to obtain a fused feature; and finally input the fused feature into a detection head to obtain a first long-tail target detection result for the image to be identified. Combining target detection models trained on training sets from different scenes helps enrich scene diversity and improves both the accuracy of long-tail target detection and the range of applicable scenes.
Scheme 1. A long tail target detection method comprising:
detecting an image to be identified by using a first target detection model and a second target detection model to obtain a first candidate frame and a second candidate frame, wherein the first target detection model and the second target detection model adopt training sets from different scenes for training;
extracting a first feature corresponding to the first candidate frame from a feature map obtained by a feature extraction layer of the first target detection model, and extracting a second feature corresponding to the second candidate frame from a feature map obtained by a feature extraction layer of the second target detection model;
fusing the first feature and the second feature to obtain a fused feature; and
and inputting the fusion characteristic into a detection head to obtain a first long tail target detection result of the image to be identified.
Scheme 2. According to the method of scheme 1, the extracting the first feature corresponding to the first candidate frame from the feature map obtained by the feature extraction layer of the first object detection model includes:
extracting first features corresponding to the first candidate frames from feature graphs obtained from a backbone network and/or a neck network of the first target detection model;
and/or,
the extracting the second feature corresponding to the second candidate frame from the feature map obtained by the feature extraction layer of the second target detection model includes:
and extracting a second feature corresponding to the second candidate frame from a feature map obtained by the backbone network and/or the neck network of the second target detection model.
Scheme 3. The method according to scheme 2, wherein extracting the first feature corresponding to the first candidate frame from the feature map obtained from the backbone network and/or the neck network of the first object detection model comprises:
extracting initial first features corresponding to the first candidate frames from feature graphs obtained by a backbone network and/or a neck network of the first target detection model;
upsampling the initial first feature of relatively small size to obtain a first feature of a target size.
Scheme 4. The method according to scheme 3, wherein extracting the second feature corresponding to the second candidate frame from the feature map obtained from the backbone network and/or the neck network of the second object detection model comprises:
extracting initial second features corresponding to the second candidate frames from feature graphs obtained by a backbone network and/or a neck network of the second target detection model;
upsampling the initial second feature of relatively small size to obtain a second feature of the target size.
Scheme 5. According to the method of scheme 1, the fusing the first feature and the second feature to obtain a fused feature includes:
performing a bilinear pooling operation on the first feature and the second feature to obtain the fused feature;
and/or,
after the fused feature is obtained and before the fused feature is input into the detection head, the method further comprises: performing feature dimension reduction and normalization on the fused feature.
Scheme 6. The method of scheme 1, the method further comprising:
and determining a final long tail target detection result based on the first long tail target detection result and the second long tail target detection result detected by the second target detection model.
Scheme 7. The method of scheme 6, wherein determining the final long tail target detection result based on the first long tail target detection result and the second long tail target detection result detected by the second target detection model comprises:
when there is one second long-tail target detection result, comparing the second long-tail target detection result with the first long-tail target detection result, and if they are consistent, determining the first long-tail target detection result as the final long-tail target detection result; and
when there are a plurality of second long-tail target detection results coming from a plurality of second target detection models, determining the final long-tail target detection result by a voting mechanism based on the plurality of second long-tail target detection results and the first long-tail target detection result.
Scheme 8. The method according to any one of schemes 1 to 7, the method further comprising:
acquiring an initial sample;
processing the initial sample to construct a long-tail target sample set, wherein the long-tail target sample set corresponds to an automatic driving scene;
and training a first target detection model to be trained according to the long tail target sample set to obtain the trained first target detection model.
Scheme 9. The method according to scheme 8, said processing said initial sample to construct a long tail target sample set, comprising:
performing image classification on the image sample in the initial sample to obtain a long tail target sample related to an automatic driving scene; and/or
performing image segmentation on the image sample in the initial sample to obtain a long-tail target sample related to an automatic driving scene; and/or
performing image detection on the image sample in the initial sample to obtain a long-tail target sample related to an automatic driving scene; and/or
extracting, using a convolutional neural network, image features of a source image and of the image sample in the initial sample respectively, and calculating a feature distance between the source image and the image sample according to the image features; selecting a long-tail target sample related to the automatic driving scene based on comparison of the feature distance with a first distance threshold; and/or
extracting text features and image features from the text sample and the image sample in the initial sample respectively; fusing the extracted text features and target image features to obtain a first multi-modal fusion feature; extracting image features of a source image to obtain source image features; fusing the text features and the source image features to obtain a second multi-modal fusion feature; calculating a cosine distance between the second multi-modal fusion feature and the first multi-modal fusion feature; and selecting a long-tail target sample related to the automatic driving scene based on comparison of the cosine distance with a second distance threshold.
Scheme 10. According to the method of scheme 8, the training the first target detection model to be trained according to the long tail target sample set to obtain a trained first target detection model includes:
Preliminary training stage:
inputting the long-tail target sample set into the first target detection model to be trained to obtain a detection result of each long-tail target sample in the long-tail target sample set, wherein the detection result comprises a candidate frame and a corresponding category;
iterative training phase:
when the category is a long tail target, determining a pseudo long tail target according to a candidate frame and a truth box corresponding to the category;
constructing a difficult negative sample set based on the pseudo long tail target;
performing iterative training on the first target detection model to be trained by at least using the difficult negative sample set;
judging whether a preset condition is met, and if so, stopping iterative training to obtain the trained first target detection model.
Scheme 11. According to the method of scheme 10, the determining whether the preset condition is satisfied or not, if the preset condition is satisfied, stopping the iterative training, and obtaining a trained first target detection model includes:
judging whether the false recognition rate of the first target detection model to be trained is within a preset threshold range or not;
if yes, determining that the preset condition is met, and stopping iterative training.
Scheme 12. A computer readable storage medium having stored therein a computer program which when executed by a processor implements the long tail target detection method of any one of schemes 1 to 11.
Scheme 13. A driving apparatus comprising a memory and a processor, the memory having stored therein a computer program which when executed by the processor implements the long tail target detection method of any one of schemes 1 to 11.
Drawings
Preferred embodiments of the present invention are described below with reference to the accompanying drawings, in which:
FIG. 1 is a schematic flow chart of a long tail target detection method provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a long tail target detection model architecture according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a fusion process of a first feature and a second feature according to an embodiment of the present invention;
FIG. 4 is a flowchart of a training method of a first object detection model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a first object detection model training architecture according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of another long tail target detection model architecture provided by an embodiment of the present invention;
FIG. 7 is a flowchart of a method for long tail target detection according to another embodiment of the present invention;
fig. 8 is a schematic structural diagram of a driving apparatus according to an embodiment of the present invention.
Detailed Description
Some embodiments of the invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a long tail target detection method according to an embodiment of the present invention, which may include:
step S11: detecting an image to be identified by using a first target detection model and a second target detection model to obtain a first candidate frame and a second candidate frame, wherein the first target detection model and the second target detection model adopt training sets from different scenes for training;
step S12: extracting first features corresponding to the first candidate frames from the feature graphs obtained by the feature extraction layer of the first target detection model, and extracting second features corresponding to the second candidate frames from the feature graphs obtained by the feature extraction layer of the second target detection model;
step S13: fusing the first feature and the second feature to obtain a fused feature;
step S14: and inputting the fusion characteristics into a detection head to obtain a first long tail target detection result of the image to be identified.
The first target detection model is trained on a training set from a first scene, and the second target detection model is trained on a training set from a second scene, the first scene and the second scene being different. Combining the first and second target detection models trained on training sets from different scenes helps enrich scene diversity and improves both the accuracy of long-tail target detection and the range of applicable scenes. This alleviates the cold-start and large scene-migration problems encountered when long-tail target detection relies on a single target detection model, given that long-tail targets occur only rarely in real scenes.
In some embodiments, the second scene may be a non-driving scene and the first scene may be a driving scene. As an example, the second target detection model may be a model trained on a large-scale image dataset containing millions of images, such as ImageNet or OpenImages; the first target detection model may employ Faster R-CNN (Faster Region-based Convolutional Neural Network) or SSD (Single Shot MultiBox Detector).
In an embodiment of the present invention, long-tail target detection may be performed based on the long-tail target detection model shown in fig. 2, which may include:
an input module; the first target detection model and the second target detection model, arranged in parallel; a feature fusion module; a detection head; and an output module.
The long-tail target detection method provided by the present invention will be described hereinafter based on the long-tail target detection model shown in fig. 2.
In some embodiments, step S11 may specifically be: the input module feeds the image to be identified into the first target detection model and the second target detection model, and the two models each detect the image to be identified, yielding the first candidate frame and the second candidate frame respectively.
In some embodiments, the first target detection model may include a backbone network and a detection head, the backbone network being configured to perform feature extraction on the image to be identified and the detection head being configured to produce the detection result. In other embodiments, the first target detection model may include a backbone network, a neck network, and a detection head, where the neck network may be used to fuse and otherwise process the feature maps output by the backbone network.
In some embodiments, extracting the first feature corresponding to the first candidate frame from the feature map obtained by the feature extraction layer of the first target detection model in step S12 may include: extracting the first feature corresponding to the first candidate frame from a feature map obtained from the backbone network and/or the neck network of the first target detection model.
In some embodiments, extracting the first feature corresponding to the first candidate frame from the feature map obtained from the backbone network and/or the neck network of the first target detection model may include: extracting initial first features corresponding to the first candidate frame from the feature maps obtained from the backbone network and/or the neck network of the first target detection model; and upsampling the relatively small initial first features to obtain first features at a target size. A plurality of initial first features may be obtained from a plurality of feature maps of different sizes based on the first candidate frame, and these initial first features form a first feature pyramid, as shown in fig. 3, which illustrates the fusion process of the first feature and the second feature provided by an embodiment of the present invention; the first features at the target size are obtained by upsampling the relatively small initial first features.
As a specific example, the first feature corresponding to the first candidate frame may be extracted from feature maps obtained from the backbone network and the neck network of the first target detection model. The backbone network may employ a Residual Network (ResNet), and the neck network may employ an RPN (Region Proposal Network) and ROI Pooling (Region of Interest pooling) arranged in cascade. Specifically, the feature maps output by ResNet at one half, one quarter, and one eighth of the original size of the image to be detected, together with the feature map output by ROI Pooling, may be selected to obtain initial first features at multiple sizes; the relatively small initial first features are each upsampled to obtain the first features at the target size, where the target size may be one half of the original size.
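The upsampling of the smaller pyramid levels to the target size can be illustrated with a minimal nearest-neighbour routine; the function name and the list-of-lists feature-map representation are assumptions made for illustration, not part of the patent.

```python
def upsample_nearest(fmap, target_h, target_w):
    """Nearest-neighbour upsampling of a 2-D feature map (list of lists),
    standing in for the upsampling of smaller pyramid levels to the target size."""
    h, w = len(fmap), len(fmap[0])
    # map each output cell back to its nearest source cell
    return [[fmap[i * h // target_h][j * w // target_w]
             for j in range(target_w)]
            for i in range(target_h)]
```

For example, a 2x2 level is expanded to 4x4 by replicating each source cell into a 2x2 block.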
In some embodiments, extracting the second feature corresponding to the second candidate box from the feature map obtained from the feature extraction layer of the second object detection model in step S12 may include extracting the second feature corresponding to the second candidate box from the feature map obtained from the backbone and/or neck network of the second object detection model.
In some embodiments, extracting the second feature corresponding to the second candidate frame from the feature map obtained from the backbone network and/or the neck network of the second target detection model may include: extracting initial second features corresponding to the second candidate frame from feature maps obtained from the backbone network and/or the neck network of the second target detection model; and upsampling the relatively small initial second features to obtain second features at the target size. A plurality of initial second features may be obtained from feature maps of different sizes based on the second candidate frame, and these initial second features form a second feature pyramid, as shown in fig. 3.
As a specific example, the second feature corresponding to the second candidate frame may be extracted from a feature map obtained from the neck network of the second target detection model. The neck network may employ an RPN and ROI Pooling arranged in cascade; since the output of ROI Pooling is a feature map of fixed size, it can be upsampled directly to obtain the second feature at the target size.
As shown in fig. 3, in some embodiments, step S13 may specifically be performing a bilinear pooling operation on the first feature and the second feature to obtain the fused feature. In other embodiments, after the fused feature is obtained, feature dimension reduction and normalization may further be performed on it, so that the first long tail target detection result of the image to be identified is obtained from the reduced and normalized fused feature. As an example, the fused feature may be reduced in dimension by sum pooling and PCA (Principal Component Analysis).
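A minimal sketch of the bilinear pooling step, assuming the two features are given as per-position vectors. The signed-square-root and L2 normalisation used here are a common companion to bilinear pooling and stand in for the normalisation the text mentions — they are an assumption, and PCA dimension reduction would follow separately.

```python
import math

def bilinear_pool(feats_a, feats_b):
    """Bilinear pooling sketch: sum the outer products of the two feature
    vectors over spatial positions, then signed-sqrt + L2 normalise."""
    da, db = len(feats_a[0]), len(feats_b[0])
    pooled = [[0.0] * db for _ in range(da)]
    for va, vb in zip(feats_a, feats_b):      # sum pooling over positions
        for i in range(da):
            for j in range(db):
                pooled[i][j] += va[i] * vb[j]
    flat = [x for row in pooled for x in row]
    # signed square root, a standard bilinear-pooling normalisation
    flat = [math.copysign(math.sqrt(abs(x)), x) for x in flat]
    norm = math.sqrt(sum(x * x for x in flat)) or 1.0
    return [x / norm for x in flat]           # L2 normalisation
```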
In some embodiments, the detection head may employ an existing network; as an example, it may be a fully connected layer. Step S14 may then specifically be inputting the fused feature into the fully connected layer to obtain the first long tail target detection result of the image to be identified. The output module is used to output the first long tail target detection result.
In some embodiments, before the feature extraction of the image to be identified by using the first target detection model, the first target detection model may be further trained, as shown in fig. 4, fig. 4 shows a flowchart of a training method of the first target detection model according to the embodiment of the present invention, which may include:
step S41: acquiring an initial sample;
step S42: processing the initial sample to construct a long-tail target sample set, wherein the long-tail target sample set corresponds to an automatic driving scene;
step S43: and training the first target detection model to be trained according to the long tail target sample set to obtain a trained first target detection model.
Referring to fig. 5, fig. 5 illustrates a schematic diagram of a first object detection model training architecture provided by an embodiment of the present invention, which may include an initial sample input module, a cold start module, and a self-iteration module that are arranged in cascade. Hereinafter, a training method of the first object detection model will be described based on the first object detection model training architecture shown in fig. 5.
In some embodiments, step S41 may specifically be inputting the initial sample through the initial sample input module. In some embodiments, the initial sample may contain an image sample; in other embodiments, it may contain both an image sample and a text sample.
In some embodiments, the cold start module may include at least one long-tail data mining model, used to mine long-tail target samples and construct a long-tail target sample set from the resulting samples. As an example, the cold start module may include at least one long-tail data mining model among an image classification model, an image segmentation model, an image detection model, a single-mode image retrieval model, and a multi-mode image retrieval model.
When the initial sample contains only an image sample, step S42 may specifically be inputting the image sample into at least one of the image classification model, the image segmentation model, the image detection model, and the single-mode image retrieval model, and processing the image sample to obtain long tail target samples; and constructing the long tail target sample set based on those samples.
In some embodiments, the image samples in the initial sample may be image classified by an image classification model to obtain long tail target samples related to the automated driving scenario. The image classification model may be used to classify full-view features of an autopilot scene, such as for weather, location, road, lighting, or image quality, among others.
In some embodiments, the image samples in the initial sample may be segmented by an image segmentation model to obtain long tail target samples related to the autopilot scenario. The image segmentation model may be used to segment non-rigid objects in long tail target samples, such as at least one of lane lines, roads, handrails, mud and standing water, and greenery in the image samples.
In some embodiments, the image samples in the initial sample may be processed by an image detection model to obtain long tail target samples related to the autopilot scenario. The image detection model can be used to detect rigid objects such as dynamic and static obstacles, for example vehicles, pedestrians, pier columns, or animals.
In some embodiments, the single-mode image retrieval model may be used for large-scale retrieval among the source images of a network image library. Specifically, a CNN (Convolutional Neural Network) extracts image features from the source image and from the image sample respectively, and the feature distance between the two is calculated from those features; long tail target samples related to the automatic driving scene are then selected by comparing the feature distance with a first distance threshold. For example, threshold filtering based on the first distance threshold can select images that are more similar to the image sample as long tail target samples.
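The threshold-filtering retrieval step can be sketched as follows. Euclidean distance stands in for whatever feature distance the implementation actually uses, and the function names are illustrative assumptions.

```python
def retrieve_similar(query_feat, source_feats, dist_threshold):
    """Threshold filtering sketch: keep the indices of source images whose
    feature distance to the query image sample is below the first threshold."""
    def dist(a, b):
        # Euclidean distance as a stand-in feature distance
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return [idx for idx, f in enumerate(source_feats)
            if dist(query_feat, f) < dist_threshold]
```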
When the initial sample contains both an image sample and a text sample, step S42 may specifically be inputting the text sample and the image sample into the multi-mode image retrieval model, retrieving among the source images of a network image library with that model to obtain long tail target samples, and constructing the long tail target sample set based on those samples. The multi-mode image retrieval model may extract text features from the input text sample with a text encoding module and image features from the input image sample with an image encoding module; fuse the extracted text features and target image features to obtain a first multi-mode fusion feature; extract image features from each source image with the image encoding module to obtain source image features; fuse the text features and the source image features to obtain a second multi-mode fusion feature; calculate the cosine distance between the second multi-mode fusion feature and the first multi-mode fusion feature; and select long tail target samples related to the automatic driving scene by comparing the cosine distance with a second distance threshold. Specifically, threshold filtering based on the second distance threshold can select source images that are more similar to the image sample as long tail target samples. This method can be applied when image samples are rare or the image retrieval recall rate of the model is low, so that long tail target samples can still be retrieved effectively.
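The cosine-distance comparison between the two multi-mode fusion features can be sketched as below; the feature vectors are assumed to be plain lists, and `select_samples` is a hypothetical helper for the threshold filtering, not a name from the patent.

```python
def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two fused feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return 1.0 - dot / (na * nb)

def select_samples(first_feat, second_feats, threshold):
    # keep source images whose second fusion feature is close to the first
    return [i for i, f in enumerate(second_feats)
            if cosine_distance(first_feat, f) < threshold]
```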
In some preferred embodiments, the cold start module may include an image classification model, an image segmentation model, an image detection model, a single-mode image retrieval model, and a multi-mode image retrieval model, so as to obtain more long tail target samples, which is further beneficial to improving the recognition accuracy of the first target detection model.
In some embodiments, constructing the long tail target sample set based on the long tail target samples may involve screening at least a portion of the obtained long tail target samples, as required, to form the long tail target sample set.
In some embodiments, step S43 may specifically include:
preliminary training stage: inputting the long-tail target sample set into a first target detection model to be trained, and obtaining a detection result of each long-tail target sample in the long-tail target sample set; wherein the detection result comprises a candidate frame and a corresponding category;
iterative training phase: when the category is a long tail target, determining a pseudo long tail target according to a candidate frame and a truth box corresponding to the category; constructing a difficult negative sample set based on the pseudo long tail target; performing iterative training on a first target detection model to be trained by at least utilizing a difficult negative sample set; judging whether a preset condition is met, and if so, stopping iterative training to obtain a trained first target detection model.
In some embodiments, the long tail target sample set may also be expanded by data enhancement before the preliminary training, for example by flipping, scaling, or cropping. The first target detection model to be trained performs target detection on the long tail target samples, producing candidate frames and, for each candidate frame, the confidence that it belongs to a long tail target category. When the confidence corresponding to a candidate frame is greater than a preset confidence, the category corresponding to that candidate frame is determined to be a long tail target.
In some embodiments, the long-tail target sample set may include a plurality of long-tail target samples, which may also be labeled prior to inputting the long-tail target sample set into the first target detection model to be trained, where the true class and truth box of the long-tail target samples may be labeled.
In some embodiments, when the detected category is a long tail target, determining a pseudo long tail target according to the candidate frame and the truth box corresponding to the category may specifically include calculating the intersection over union (IoU) of the candidate frame and the truth box corresponding to the long tail target; when the IoU is zero, the long tail target is determined to be a pseudo long tail target, that is, a difficult negative sample.
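The IoU test for pseudo long tail targets can be sketched as follows, assuming boxes are given as (x1, y1, x2, y2) tuples; the helper names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def is_pseudo_long_tail(candidate, truth_boxes):
    # zero overlap with every truth box -> pseudo long tail (difficult negative)
    return all(iou(candidate, t) == 0.0 for t in truth_boxes)
```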
In some embodiments, constructing the difficult negative sample set based on the pseudo long tail targets and iteratively training the first target detection model to be trained with at least that set may specifically include pasting the image region of the candidate frame corresponding to a pseudo long tail target into a long tail target negative sample using a data enhancement method such as Mixup or CutMix, thereby obtaining a plurality of difficult negative samples; a difficult negative sample set is then constructed from these samples, and the first target detection model to be trained is iteratively trained with at least this set. The long tail target samples may be divided into long tail target positive samples and long tail target negative samples, a long tail target negative sample being an image that does not contain a long tail target. In other embodiments, the difficult negative sample set may instead be constructed by cropping, from the long tail target sample, the image region of the candidate frame corresponding to a pseudo long tail target as a difficult negative sample, and building the set from the resulting samples. In some embodiments, the first target detection model to be trained may also be iteratively trained with both the difficult negative sample set and the long tail target sample set.
In some embodiments, judging whether the preset condition is satisfied, and stopping the iterative training to obtain the trained first target detection model if it is, may include: judging whether the false recognition rate of the first target detection model to be trained is within a preset threshold range; if so, determining that the preset condition is satisfied and stopping the iterative training; if not, continuing to construct difficult negative sample sets from the falsely recognised pseudo long tail targets through data enhancement, and labeling the difficult negative samples in the currently constructed set so as to train the first target detection model to be trained on the labeled difficult negative samples.
The false recognition rate of the first target detection model under iterative training may be determined as follows: the false recognition rate is the ratio of the number of falsely recognised pseudo long tail targets to the total number of samples input to the first target detection model to be trained.
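The stopping criterion reduces to simple arithmetic; a minimal sketch with hypothetical helper names:

```python
def false_recognition_rate(num_pseudo, num_samples):
    """Ratio of falsely recognised pseudo long tail targets to total inputs."""
    return num_pseudo / num_samples if num_samples else 0.0

def should_stop(rate, threshold):
    # stop iterative training once the rate falls within the preset range
    return rate <= threshold
```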
A smaller false recognition rate indicates a better-trained first target detection model; the preset threshold range may be set as required.
In the embodiment of the invention, the recognition capability of the first target detection model on the pseudo long tail target can be improved by determining the pseudo long tail target, constructing a difficult negative sample set by using the pseudo long tail target and training the first target detection model to be trained at least based on the difficult negative sample set, so that the follow-up improvement of the accuracy of long tail target detection is facilitated.
In some embodiments, in order to further improve accuracy of long tail target detection, integrated learning may be further performed in combination with a detection result of the second target detection model to obtain a final long tail target detection result, which may be specifically shown in fig. 6 and fig. 7.
Fig. 6 is a schematic diagram of another long tail target detection model architecture provided in the embodiment of the present invention, which may further include an integrated learning module on the basis of the schematic diagram shown in fig. 2, where the integrated learning module is configured to perform integrated learning according to a second long tail target detection result obtained by the second target detection model and a first long tail target detection result obtained based on the fusion feature, so as to obtain a final long tail target detection result.
Fig. 7 is a schematic flow chart of a long tail target detection method according to another embodiment of the present invention, which may include:
step S71: detecting the image to be identified by using a first target detection model and a second target detection model to obtain a first candidate frame and a second candidate frame;
step S72: extracting a first feature corresponding to the first candidate frame from the feature map produced by the feature extraction layer of the first target detection model, and extracting a second feature corresponding to the second candidate frame from the feature map produced by the feature extraction layer of the second target detection model;
Step S73: fusing the first feature and the second feature to obtain a fused feature;
step S74: inputting the fusion characteristics into a detection head to obtain a first long tail target detection result of the image to be identified;
step S75: and determining a final long tail target detection result based on the first long tail target detection result and the second long tail target detection result detected by the second target detection model.
The steps S71 to S74 may be performed in the same manner as the steps S11 to S14.
In some embodiments, the second long tail target detection result may be obtained when the second target detection model detects the image to be identified in step S71, and it may include the second candidate frame.
When one second target detection model is set, one second long tail target detection result is obtained, and step S75 may specifically be: comparing the first long tail target detection result with the second long tail target detection result; when they are consistent, determining the second long tail target detection result as the final long tail target detection result.
When there are multiple second target detection models, multiple second long tail target detection results are obtained, and step S75 may specifically be determining the final long tail target detection result with a voting mechanism based on the multiple second long tail target detection results and the first long tail target detection result.
The second long tail target detection results and the first long tail target detection result may each include the categories corresponding to their candidate frames; the voting mechanism counts the categories across the plurality of second long tail target detection results and the first long tail target detection result, and the category with the most votes is taken as the final long tail target detection result.
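The voting mechanism can be sketched as a majority vote over the predicted categories. The tie-breaking behaviour here (first-seen category wins, via `Counter`'s insertion-order preservation) is an assumption not specified by the text.

```python
from collections import Counter

def vote_final_category(first_result, second_results):
    """Majority vote over the category predicted by the first result and
    by each of the second results."""
    votes = [first_result] + list(second_results)
    return Counter(votes).most_common(1)[0][0]
```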
It will be appreciated by those skilled in the art that the present invention may implement all or part of the procedures in the methods of the above embodiments, or may be implemented by a computer program for instructing relevant hardware, where the computer program may be stored in a computer readable storage medium, and the computer program may implement the steps of each of the method embodiments when executed by a processor. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable storage medium may include: any entity or device, medium, usb disk, removable hard disk, magnetic disk, optical disk, computer memory, read-only memory, random access memory, electrical carrier wave signals, telecommunications signals, software distribution media, and the like capable of carrying the computer program code.
Another aspect of the present invention also provides a computer readable storage medium, where a computer program is stored, where the computer program can implement the long tail target detection method according to any one of the foregoing embodiments when executed by a processor. The computer readable storage medium may be a storage device including various electronic devices, and optionally, the computer readable storage medium in the embodiments of the present invention is a non-transitory computer readable storage medium.
Referring to fig. 8, another aspect of the present invention further provides a driving apparatus, which may include a memory 81 and a processor 82, where the memory 81 stores a computer program, and the computer program when executed by the processor 82 implements the long tail target detection method according to any one of the above embodiments.
The memory 81 and the processor 82 may be connected by a bus or other means; fig. 8 exemplarily shows a configuration in which the memory 81 and the processor 82 are connected by a bus, and only one processor 82 is shown.
In other embodiments, the driving apparatus may include a plurality of memories 81 and a plurality of processors 82. The program for executing the long tail target detection method of any of the above embodiments may be divided into a plurality of sub-programs, each of which may be loaded and executed by a processor to perform the different steps of the long tail target detection method of the above method embodiments, respectively. Specifically, each segment of the subroutine may be stored in a different memory 81, respectively, and each processor 82 may be configured to execute the programs in one or more memories 81 to collectively implement the long tail target detection method of the above-described method embodiment.
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will fall within the scope of the present invention.

Claims (10)

1. A long tail target detection method, comprising:
detecting an image to be identified by using a first target detection model and a second target detection model to obtain a first candidate frame and a second candidate frame, wherein the first target detection model and the second target detection model adopt training sets from different scenes for training;
extracting a first feature corresponding to the first candidate frame from a feature map obtained by a feature extraction layer of the first target detection model, and extracting a second feature corresponding to the second candidate frame from a feature map obtained by a feature extraction layer of the second target detection model;
fusing the first feature and the second feature to obtain a fused feature; the method comprises the steps of,
And inputting the fusion characteristic into a detection head to obtain a first long tail target detection result of the image to be identified.
2. The method according to claim 1, wherein extracting the first feature corresponding to the first candidate box from the feature map obtained from the feature extraction layer of the first object detection model includes:
extracting a first feature corresponding to the first candidate frame from a feature map obtained from a backbone network and/or a neck network of the first target detection model;
and/or,
the extracting the second feature corresponding to the second candidate frame from the feature map obtained by the feature extraction layer of the second target detection model includes:
and extracting a second feature corresponding to the second candidate frame from a feature map obtained by the backbone network and/or the neck network of the second target detection model.
3. The method according to claim 2, wherein extracting the first feature corresponding to the first candidate box from the feature map obtained from the backbone network and/or the neck network of the first object detection model comprises:
extracting initial first features corresponding to the first candidate frame from feature maps obtained from the backbone network and/or the neck network of the first target detection model;
The initial first features of relatively small size are up-sampled to obtain first features of a target size.
4. A method according to claim 3, wherein extracting the second feature corresponding to the second candidate box from the feature map obtained from the backbone network and/or the neck network of the second object detection model comprises:
extracting initial second features corresponding to the second candidate frame from feature maps obtained from the backbone network and/or the neck network of the second target detection model;
upsampling the initial second feature of relatively small size results in a second feature of the target size.
5. The method of claim 1, wherein fusing the first feature and the second feature to obtain a fused feature comprises:
performing bilinear pooling operation on the first feature and the second feature to obtain the fusion feature;
and/or,
after the fused feature is obtained and before the fused feature is input into a detection head, the method further comprises: and carrying out feature dimension reduction and normalization on the fusion features.
6. The method according to claim 1, wherein the method further comprises:
And determining a final long tail target detection result based on the first long tail target detection result and the second long tail target detection result detected by the second target detection model.
7. The method of claim 6, wherein the determining a final long tail target detection result based on the first long tail target detection result and a second long tail target detection result detected by the second target detection model comprises:
when the second long-tail target detection result is one, comparing the second long-tail target detection result with the first long-tail target detection result, and if the comparison is consistent, determining the first long-tail target detection result as the final long-tail target detection result;
and when there are a plurality of second long tail target detection results, the plurality of second long tail target detection results coming from a plurality of second target detection models, determining the final long tail target detection result by a voting mechanism based on the plurality of second long tail target detection results and the first long tail target detection result.
8. The method according to any one of claims 1 to 7, further comprising:
Acquiring an initial sample;
processing the initial sample to construct a long-tail target sample set, wherein the long-tail target sample set corresponds to an automatic driving scene;
and training a first target detection model to be trained according to the long tail target sample set to obtain the trained first target detection model.
9. The method of claim 8, wherein processing the initial sample to construct the long-tail target sample set comprises:
performing image classification on the image samples in the initial sample to obtain long-tail target samples related to an autonomous driving scenario; and/or
performing image segmentation on the image samples in the initial sample to obtain long-tail target samples related to an autonomous driving scenario; and/or
performing image detection on the image samples in the initial sample to obtain long-tail target samples related to an autonomous driving scenario; and/or
extracting, with a convolutional neural network, image features from a source image and from the image samples in the initial sample respectively; computing the feature distance between the source image and each image sample from the image features; and selecting long-tail target samples related to the autonomous driving scenario by comparing the feature distance against a first distance threshold; and/or
extracting text features and image features from the text samples and the image samples in the initial sample respectively; fusing the extracted text features and target image features to obtain a first multi-modal fusion feature; extracting image features of the source image to obtain source image features; fusing the text features and the source image features to obtain a second multi-modal fusion feature; computing the cosine distance between the second multi-modal fusion feature and the first multi-modal fusion feature; and selecting long-tail target samples related to the autonomous driving scenario by comparing the cosine distance against a second distance threshold.
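The last two branches above both reduce to a distance test against a threshold. The sketch below illustrates that idea under stated assumptions: the claim does not name the fusion operator, so plain concatenation is used in `fuse` purely for illustration, and the feature vectors and threshold are made-up toy values.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fuse(text_feat, image_feat):
    # The claim does not specify the fusion operator; concatenation is
    # assumed here only for illustration.
    return np.concatenate([text_feat, image_feat])

def select_long_tail(source_feat, sample_feats, threshold):
    """Keep indices of samples whose distance to the source-image feature
    falls below the threshold, i.e. samples similar to the rare example."""
    return [i for i, f in enumerate(sample_feats)
            if cosine_distance(source_feat, f) < threshold]

src = np.array([1.0, 0.0])
samples = [np.array([0.9, 0.1]), np.array([0.0, 1.0])]
print(select_long_tail(src, samples, 0.5))  # [0]
```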
10. The method of claim 8, wherein training the first target detection model to be trained on the long-tail target sample set to obtain the trained first target detection model comprises:
a preliminary training stage:
inputting the long-tail target sample set into the first target detection model to be trained to obtain a detection result for each long-tail target sample in the long-tail target sample set, wherein the detection result comprises a candidate box and a corresponding category;
an iterative training stage:
when the category is a long-tail target, determining a pseudo long-tail target from the candidate box and the truth box corresponding to the category;
constructing a hard negative sample set based on the pseudo long-tail target;
iteratively training the first target detection model to be trained using at least the hard negative sample set;
and judging whether a preset condition is met, and if so, stopping the iterative training to obtain the trained first target detection model.
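The pseudo-long-tail / hard-negative step of the iterative stage can be sketched as below. The claim names the candidate box and truth box but not the exact overlap rule, so the standard IoU-against-truth-box criterion and the `iou_thresh` parameter are assumptions for illustration only.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def mine_hard_negatives(long_tail_candidates, truth_boxes, iou_thresh=0.5):
    """Candidate boxes classified as a long-tail category that overlap no
    truth box are treated as pseudo long-tail targets and collected as
    hard negatives for the next training round."""
    return [box for box in long_tail_candidates
            if all(iou(box, t) < iou_thresh for t in truth_boxes)]

preds = [(0, 0, 10, 10), (100, 100, 110, 110)]
truths = [(1, 1, 10, 10)]
print(mine_hard_negatives(preds, truths))  # [(100, 100, 110, 110)]
```

In practice the mined hard negatives would be appended to the training set before the next iteration, and the loop stops once the claim's "preset condition" (e.g. a fixed iteration count or converged loss, not specified in the claim) is met.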
CN202310136841.4A 2023-02-10 2023-02-10 Long tail target detection method, computer readable storage medium and driving device Pending CN116071557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310136841.4A CN116071557A (en) 2023-02-10 2023-02-10 Long tail target detection method, computer readable storage medium and driving device

Publications (1)

Publication Number Publication Date
CN116071557A true CN116071557A (en) 2023-05-05

Family

ID=86180047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310136841.4A Pending CN116071557A (en) 2023-02-10 2023-02-10 Long tail target detection method, computer readable storage medium and driving device

Country Status (1)

Country Link
CN (1) CN116071557A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116977810A (en) * 2023-09-25 2023-10-31 之江实验室 Multi-mode post-fusion long tail category detection method and system
CN116977810B (en) * 2023-09-25 2024-01-09 之江实验室 Multi-mode post-fusion long tail category detection method and system

Similar Documents

Publication Publication Date Title
CN111476284B (en) Image recognition model training and image recognition method and device and electronic equipment
US20190042888A1 (en) Training method, training apparatus, region classifier, and non-transitory computer readable medium
US20210312232A1 (en) Domain alignment for object detection domain adaptation tasks
CN111369581A (en) Image processing method, device, equipment and storage medium
CN110956081B (en) Method and device for identifying position relationship between vehicle and traffic marking and storage medium
CN110717863B (en) Single image snow removing method based on generation countermeasure network
CN112287983B (en) Remote sensing image target extraction system and method based on deep learning
CN112949578B (en) Vehicle lamp state identification method, device, equipment and storage medium
CN111160205A (en) Embedded multi-class target end-to-end unified detection method for traffic scene
CN112613434A (en) Road target detection method, device and storage medium
CN116071557A (en) Long tail target detection method, computer readable storage medium and driving device
CN115131634A (en) Image recognition method, device, equipment, storage medium and computer program product
CN113743300A (en) Semantic segmentation based high-resolution remote sensing image cloud detection method and device
KR102026280B1 (en) Method and system for scene text detection using deep learning
CN114724128B (en) License plate recognition method, device, equipment and medium
CN116597270A (en) Road damage target detection method based on attention mechanism integrated learning network
CN116977484A (en) Image desensitizing method, device, electronic equipment and storage medium
CN115830399A (en) Classification model training method, apparatus, device, storage medium, and program product
CN114155524A (en) Single-stage 3D point cloud target detection method and device, computer equipment and medium
CN113706636A (en) Method and device for identifying tampered image
CN113392837A (en) License plate recognition method and device based on deep learning
CN112348011A (en) Vehicle damage assessment method and device and storage medium
CN111832463A (en) Deep learning-based traffic sign detection method
CN116884003B (en) Picture automatic labeling method and device, electronic equipment and storage medium
CN116612466B (en) Content identification method, device, equipment and medium based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination