CN113780256A - Image target detection method combining coarse-fine classification and related device - Google Patents

Image target detection method combining coarse-fine classification and related device

Info

Publication number
CN113780256A
CN113780256A
Authority
CN
China
Prior art keywords: fine, classification, candidate region, coarse, image
Legal status: Granted
Application number
CN202111338401.4A
Other languages: Chinese (zh)
Other versions: CN113780256B (en)
Inventor
孙萍
金博伟
王旭
许琢
鲁盈悦
金玥
高逸晨
支洪平
Current Assignee
Iflytek Suzhou Technology Co Ltd
Original Assignee
Iflytek Suzhou Technology Co Ltd
Application filed by Iflytek Suzhou Technology Co Ltd
Priority to CN202111338401.4A
Publication of CN113780256A
Application granted
Publication of CN113780256B
Status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The application provides a method, an apparatus, a device and a storage medium for image target detection combining coarse and fine classification, wherein the method comprises the following steps: extracting a candidate region from an image to be detected, and acquiring candidate region features; performing target detection processing based on the candidate region features to obtain a first detection result, wherein the first detection result at least comprises a coarse classification result of the candidate region; and extracting fine-grained features from the candidate region features, and performing image classification processing based on the extracted fine-grained features to obtain a fine classification result of the candidate region. In image target detection tasks with large intra-class differences where coarse classes contain subclasses, this method performs coarse classification and fine classification of the image separately, each based on image features of the corresponding granularity, so that the target object in the image is identified at the coarse granularity and, at the same time, its subclass under the coarse class is identified, allowing image targets to be recognized more accurately and comprehensively.

Description

Image target detection method combining coarse-fine classification and related device
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a method, an apparatus, a device, and a storage medium for image target detection combining coarse and fine classification.
Background
In recent years, target detection technology has mainly been based on deep learning models, that is, a convolutional neural network (such as LeNet, VGGNet, ResNet50, etc.) is used as a feature extraction network to extract deep features of a target object for classification and positioning. Since the types and number of target objects are uncertain in the detection process, and targets differ in shape, size and material, the detection task frequently faces the problem of detecting targets whose intra-class differences are large and whose coarse classes contain multiple separable subclasses.
For the problem that, in the target detection process, the differences within certain target classes are large and the coarse classes contain multiple separable subclasses, two processing modes are generally adopted in the conventional detection flow. One is to treat such a coarse class directly as a single category for detection; because the features within the class are mixed, the model finds it difficult to extract effective common features that accurately and comprehensively represent all the subclass targets. The other is to split out all the subclasses and detect each of them as a single class. However, because the detected classes are diverse, some classes need to be split while others have stable characteristics; for example, the lighter class basically always has the structure of a shell containing a lighter movement and does not need to be split. If the split fine classes and the large classes that do not need splitting are detected together, a problem of inconsistent granularity of the distinguishing features arises: the model must distinguish a knife from a lighter using overall features, and must also distinguish a folding knife from a blade using the detailed features of the knife. The overall features are coarser-grained and the detail features are finer-grained, and it is difficult for the model to learn features of different granularities simultaneously in the same classifier, so the final identification effect is unsatisfactory.
Disclosure of Invention
In view of this state of the art, the application provides a method, an apparatus, a device and a storage medium for image target detection combining coarse and fine classification, which can perform coarse classification and fine classification of images separately based on image features of different granularities in image target detection tasks with large intra-class differences where coarse classes contain subclasses, thereby achieving ideal image target detection and classification effects.
In order to achieve the technical effect, the application specifically provides the following technical scheme:
a method for detecting an image target by combining thickness classification comprises the following steps:
extracting a candidate region from an image to be detected, and acquiring candidate region features;
performing target detection processing based on the candidate region features to obtain a first detection result, wherein the first detection result at least comprises a coarse classification result of the candidate region;
and extracting fine-grained features from the candidate region features, and performing image classification processing based on the extracted fine-grained features to obtain a fine classification result of the candidate region.
An image target detection apparatus combining coarse and fine classification comprises:
a feature extraction module, used for extracting a candidate region from the image to be detected and acquiring candidate region features;
a coarse classification module, used for performing target detection processing based on the candidate region features to obtain a first detection result, wherein the first detection result at least comprises a coarse classification result of the candidate region;
and a fine classification module, used for extracting fine-grained features from the candidate region features and performing image classification processing based on the extracted fine-grained features to obtain a fine classification result of the candidate region.
An image target detection device combining coarse and fine classification, comprising:
a memory and a processor;
wherein the memory is connected with the processor and is used for storing a program;
and the processor is used for implementing the above image target detection method combining coarse and fine classification by running the program stored in the memory.
A storage medium having stored thereon a computer program which, when executed by a processor, implements the above image target detection method combining coarse and fine classification.
The image target detection method combining coarse and fine classification provided by the embodiments of the application extracts both coarse-grained and fine-grained features of the candidate region, performs coarse classification of the candidate region based on the extracted coarse-grained features, and performs fine classification of the candidate region based on the extracted fine-grained features, so that a coarse classification result and a fine classification result of the candidate region are obtained simultaneously. In image target detection tasks with large intra-class differences where coarse classes contain subclasses, the method performs coarse classification and fine classification of the image separately, each based on image features of the corresponding granularity, so that the target object in the image is identified at the coarse granularity and, at the same time, its subclass under the coarse class is identified, allowing image targets to be recognized more accurately and comprehensively.
Further, compared with methods that perform coarse-to-fine progressive classification in a tree-shaped cascade structure, the image target detection method provided by the embodiments of the application performs feature extraction separately for the coarse and the fine classification and applies features of the corresponding granularity to each, so that the fine classification result does not depend on the coarse classification result and the accuracy of the classification results can be fully guaranteed.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present application, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a schematic flowchart of an image target detection method combining coarse and fine classification according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a target detection model provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of the fine classification models provided by an embodiment of the present application;
fig. 4 is a schematic diagram of a fine-grained feature extraction process provided by an embodiment of the present application;
fig. 5 is a schematic diagram of the process of extracting a candidate region from an image to be detected and acquiring candidate region features according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an image target detection model combining coarse and fine classification provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of an image target detection apparatus combining coarse and fine classification according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an image target detection device combining coarse and fine classification according to an embodiment of the present application.
Detailed Description
The technical scheme of the embodiments of the application is applied to image target detection scenarios. By adopting this technical scheme, a target object can be detected from an image, the target object can be coarsely classified and its position labeled, and the target object in the image can also be finely classified, achieving a good detection effect in image target detection tasks with large intra-class differences where coarse classes contain subclasses.
Currently mainstream target detection algorithms fall into two major categories: two-stage detection algorithms and one-stage detection algorithms. A typical two-stage detection algorithm first generates candidate regions (region proposals) through an RPN (Region Proposal Network) for coarse positioning, and then performs fine positioning and classification prediction on the candidate regions at the model head; typical two-stage algorithms include Faster R-CNN, R-FCN and similar methods. A one-stage detection algorithm directly regresses the class probability and position information of a target object from the features extracted by the backbone network, without generating candidate regions for coarse positioning; typical algorithms include YOLOv1/v2/v3, SSD, etc.
The two-stage target detection model mainly comprises the following modules:
1) A Backbone + Neck feature extraction module: the Backbone comprises a series of operations such as convolution (Conv), pooling (Pooling), normalization (BN/GN) and nonlinear activation layers (ReLU), and is used to extract deep convolutional features of images; commonly used Backbones include VGG19, ResNet50, InceptionNet, etc. The Neck part mainly fuses, step by step, top-layer features carrying high-level semantic information with bottom-layer features carrying distinct geometric information, so as to enrich the feature expression capability. The resulting deep convolutional features can be represented by a four-dimensional tensor N × C × H × W, where N is the batch size of the input images, C is the number of feature channels, often set to 256 or more, and H/W are the spatial dimensions of the feature map.
2) An RPN candidate region generation module: based on the extracted convolutional features, the RPN network presets a series of anchors centered at each pixel point of the feature map and preliminarily locates candidate regions (proposals) that may contain a target object. This module only analyses whether a candidate region contains a target object (foreground/background); it does not consider the specific category of the target object and only preliminarily locates its position.
3) An RCNN Head module: after the candidate regions are generated, they can be mapped back onto the Backbone convolutional feature map, and the features of the corresponding parts are cropped out for classification and position regression prediction. Because the proposals differ in size, an ROI pooling operation is first performed to pool these partial features into fully-connected-layer input features of fixed dimension; if the number of pooled blocks is set to m × m, candidate region feature vectors of dimension n × c × m × m are obtained. Several fully connected (FC) layers are then attached for feature mapping, and finally an n × cls_num classifier and an n × 4 regressor are attached to predict the target class probabilities and regress the target positions accurately. The model head can also adopt a cascade mode, in which the target boxes predicted by the previous stage serve as the candidate regions of the next stage; multi-stage regression and prediction yield a more accurate detection result.
Because the type and number of target objects are uncertain in the image target detection process, and targets differ in shape, size and material, image target detection tasks with large intra-class differences where coarse classes contain subclasses are frequently encountered. For example, in an X-ray security inspection scene, contraband may appear in a variety of categories including knives, tools, pressure vessels, lighters, and fireworks and firecrackers, and the features within some categories differ greatly; knives, for example, include various shapes such as folding knives, straight-handle knives and blades, and range in size from small razors to large razors. At present, most detection methods detect all types of knives as one class; the recognition effect is not ideal due to the large intra-class differences, and the detected category is not fine enough, which has become a pain point in X-ray security inspection scenarios. If the various types of knives are directly split out and detected together with the other categories, the feature learning process of the model becomes more difficult because the granularity of the distinguishing features between the subclasses and the large classes is inconsistent, and the subclass detection precision can hardly meet practical requirements.
For the problem that, in the target detection process, the differences within certain target classes are large and the coarse classes contain multiple separable subclasses, two processing modes are generally adopted in the conventional detection flow. One is to treat such a coarse class directly as a single category for detection; because the features within the class are mixed, the model finds it difficult to extract effective common features that accurately and comprehensively represent all the subclass targets. The other is to split out all the subclasses and detect each of them as a single class. However, because the detected classes are diverse, some classes need to be split while others have stable characteristics; for example, the lighter class basically always has the structure of a shell containing a lighter movement and does not need to be split. If the split fine classes and the large classes that do not need splitting are detected together, a problem of inconsistent granularity of the distinguishing features arises: the model must distinguish a knife from a lighter using overall features, and must also distinguish a folding knife from a blade using the detailed features of the knife. The overall features are coarser-grained and the detail features are finer-grained, and it is difficult for the model to learn features of different granularities simultaneously in the same classifier, so the final identification effect is unsatisfactory.
Some methods draw on the human cognitive process and adopt a coarse-to-fine progressive classification approach, that is, a coarse classifier is first set up to predict the coarse class to which the target belongs, and a fine classifier then predicts the fine class in detail according to the recognition result of the coarse classifier. This coarse-to-fine progressive method appears to separate the coarse and fine classification processes, but the first-step coarse classification still faces the problem that the coarse class cannot be predicted accurately because of the large intra-class differences, and a fine classifier is difficult to train effectively under the guidance of inaccurate coarse classification predictions.
Therefore, the mainstream target detection algorithms cannot achieve satisfactory detection results on image target detection tasks with large intra-class differences where coarse classes contain subclasses.
Based on this state of the art, the embodiments of the application provide a technical scheme for image target detection scenarios that is particularly suitable for image target detection tasks with large intra-class differences where coarse classes contain subclasses. For this special detection task, the application provides a multi-branch fine-grained image target recognition scheme combining coarse and fine classification. The scheme builds several additional fine classification branches on top of the Faster R-CNN model structure, so that the original classifier only detects the merged large classes while the fine classifiers are responsible for detecting the subclasses split out from the large classes; the coarse and fine classification branches are trained synchronously and promote each other, jointly improving the detection effect of the model. Experimental results show that this approach effectively improves the detection precision of each coarse class and extends the model with good fine-class detection capability, so that better detection results are obtained in image target detection tasks with large intra-class differences where coarse classes contain subclasses.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a method for image target detection combining coarse and fine classification, which, referring to fig. 1, includes:
S101, extracting a candidate region from the image to be detected, and acquiring candidate region features.
Specifically, a candidate region is an image region in the image to be detected that may contain a target object. A target object is an object that the image target detection process aims to detect or identify from the image. For example, in X-ray security inspection, contraband in passenger baggage must be detected and identified, and the various kinds of contraband, such as tools, pressure vessels, lighters, and fireworks and firecrackers, are the target objects of image target detection in that scenario.
In the image target detection method combining coarse and fine classification provided by the embodiments of the application, after the image to be detected is obtained, candidate regions are extracted from it. It will be appreciated that, when extracting candidate regions, which image regions may contain target objects can be determined according to the objects that the detection task aims to find. For example, in an X-ray security inspection task where contraband in passenger baggage is to be detected, image regions that may contain contraband can be extracted from the X-ray image to be detected as candidate regions; similarly, in a face recognition scenario where faces in natural scene images are to be detected and recognized, image regions that may contain faces can be extracted from the natural scene image to be detected as candidate regions.
For example, following the mainstream two-stage target detection algorithms, feature extraction may first be performed on the image to be detected, and then, based on the extracted feature map, candidate regions (proposals) that may contain target objects are extracted from the image through an RPN (Region Proposal Network).
Since the image to be detected may contain more than one target object, and since even a single target object may give rise to more than one candidate region during extraction, the number of candidate regions extracted from the image to be detected may be one or several. To avoid missed detections, multiple candidate regions are usually extracted, and the final detection result is then obtained through the subsequent classification of the candidate regions.
Further, after the candidate region is extracted from the image to be detected, feature extraction is performed on the candidate region to obtain the candidate region features.
As another optional implementation, after the feature map of the image to be detected has been extracted following the mainstream two-stage target detection algorithms and candidate regions have been extracted from the image based on that feature map, each candidate region is remapped onto the feature map of the image, the feature map of the corresponding part is cropped out, and the cropped features are pooled into feature vectors of a fixed spatial dimension (e.g., 7 × 7). Assuming the number of feature channels is set to 256, n candidate regions finally yield feature vectors of dimension n × 256 × 7 × 7, i.e., the candidate region features.
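By way of a non-limiting sketch, this crop-and-pool step can be expressed with torchvision's roi_align; the shapes (256 channels, 7 × 7 output, an assumed 1/16 feature stride) follow the example above, while the feature map and boxes below are dummy placeholders rather than values from this application.

```python
import torch
from torchvision.ops import roi_align

# Dummy stand-ins: a 256-channel feature map at 1/16 the resolution of an
# 800x800 input image, and n = 4 candidate boxes (x1, y1, x2, y2) in
# input-image coordinates.
feature_map = torch.randn(1, 256, 50, 50)
proposals = torch.tensor([[0.,   0.,   64.,  64.],
                          [100., 80.,  300., 240.],
                          [400., 400., 480., 560.],
                          [10.,  500., 200., 700.]])
batch_idx = torch.zeros(len(proposals), 1)          # all boxes belong to image 0
rois = torch.cat([batch_idx, proposals], dim=1)     # (n, 5) as roi_align expects

# Map each proposal back onto the feature map and pool to a fixed 7x7 grid,
# yielding the n x 256 x 7 x 7 candidate region features described above.
region_feats = roi_align(feature_map, rois, output_size=(7, 7), spatial_scale=1/16)
print(region_feats.shape)  # torch.Size([4, 256, 7, 7])
```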
S102, target detection processing is carried out based on the candidate region characteristics to obtain a first detection result, and the first detection result at least comprises a coarse classification result of the candidate region.
Specifically, the position regression prediction and the rough classification are performed based on the candidate region features obtained in the above steps, so as to obtain a first detection result, where the first detection result includes two detection results, that is, a position prediction result of four vertices of the candidate region obtained by performing the position regression prediction on the candidate region on the one hand, and a classification result obtained by classifying the candidate region on the other hand.
The technical scheme of the embodiment of the application aims to realize both coarse classification and fine classification of the targets in the image. In step S102, the detection process is performed based on the overall features of the candidate regions, and the candidate regions are classified for the first time, so that the classification result is actually a result of rough classification. Therefore, for the sake of convenience of distinction from the results of the fine classification described later, the classification result obtained by classifying the candidate region included in the first detection result is defined as a coarse classification result.
As an alternative implementation, the target detection processing may be performed based on the candidate region features by means of a conventional target detection model or algorithm.
For example, the candidate region features are input into the target detection model shown in fig. 2: they first pass through two shared fully connected (FC) layers, which map them to n × 1024-dimensional feature vectors, and then enter an n × 4 position regression branch and an n × cls_num_coarse coarse classification branch respectively, yielding the position regression prediction result and the coarse classification result of the candidate regions. Here cls_num_coarse is the total number of coarse categories.
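A minimal PyTorch sketch of such a head is given below; the n × 1024 shared FC mapping and the two branches follow the description above, while cls_num_coarse = 5 is an illustrative assumption, not a value from this application.

```python
import torch
import torch.nn as nn

class CoarseDetectionHead(nn.Module):
    """Two shared FC layers followed by an n x 4 box-regression branch and an
    n x cls_num_coarse coarse-classification branch, as described above."""
    def __init__(self, in_dim=256 * 7 * 7, hidden=1024, cls_num_coarse=5):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
        )
        self.bbox_reg = nn.Linear(hidden, 4)                 # position regression
        self.cls_coarse = nn.Linear(hidden, cls_num_coarse)  # coarse classification

    def forward(self, region_feats):             # (n, 256, 7, 7)
        h = self.shared(region_feats)            # (n, 1024)
        return self.bbox_reg(h), self.cls_coarse(h)

boxes, coarse_logits = CoarseDetectionHead()(torch.randn(8, 256, 7, 7))
print(boxes.shape, coarse_logits.shape)  # torch.Size([8, 4]) torch.Size([8, 5])
```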
S103, extracting fine-grained features from the candidate region features, and carrying out image classification processing based on the extracted fine-grained features to obtain a fine classification result of the candidate region.
Specifically, the candidate region feature is a feature vector obtained for the entire candidate region and characterizes the content of the candidate region as a whole.
Starting from the candidate region features, the embodiments of the application further extract finer-grained features from them. Relative to the overall content expressed by the candidate region features, a fine-grained feature represents the detailed content of the candidate region more specifically and can be used to distinguish the image content of the candidate region at a finer granularity.
Performing image classification of the candidate region based on the fine-grained features extracted from the candidate region features thus classifies the image content based on the detail information of the candidate region. Since this classification applies the detail features of the candidate region, the image content in the candidate region can be classified more finely, yielding the fine classification result of the candidate region.
As an optional implementation manner, the aforementioned extracting of the fine-grained features from the candidate region features and the image classification processing based on the extracted fine-grained features may be implemented by using a model structure or an algorithm of "feature extraction module + classifier".
The execution sequence of the steps S102 and S103 may be flexibly set, for example, the steps may be executed synchronously, or the steps may be executed sequentially according to requirements.
As can be seen from the above description, the image target detection method combining coarse and fine classification provided in the embodiments of the application extracts both coarse-grained and fine-grained features of the candidate region, coarsely classifies the candidate region based on the extracted coarse-grained features, and finely classifies it based on the extracted fine-grained features, so that the coarse and fine classification results of the candidate region are obtained simultaneously. In image target detection tasks with large intra-class differences where coarse classes contain subclasses, the method performs coarse and fine classification of the image separately, each based on image features of the corresponding granularity, so that the target object is identified at the coarse granularity and, at the same time, its subclass under the coarse class is identified, allowing image targets to be recognized more accurately and comprehensively.
Furthermore, compared with methods that perform coarse-to-fine progressive classification in a tree-shaped cascade structure, the method performs feature extraction separately for the coarse and the fine classification and applies features of the corresponding granularity to each, so that the fine classification result does not depend on the coarse classification result and the accuracy of the classification results is fully guaranteed.
As an optional implementation, extracting fine-grained features from the candidate region features and performing image classification processing based on them to obtain the fine classification result of the candidate region specifically includes:
extracting, from the candidate region features, the fine-grained features corresponding to each coarse class respectively, and performing fine-grained image classification within the corresponding coarse class range based on the extracted fine-grained features to obtain the fine classification result of the candidate region.
Specifically, in an actual image target detection scenario, not all categories of articles need to be subdivided. For example, in an X-ray security inspection scenario, the coarse category of tools can be further subdivided: some tools are dangerous articles not allowed on board, such as daggers and choppers, while others are daily articles that passengers are allowed to carry, such as razors and nail clippers, so detected tools need to be subdivided to determine whether they are dangerous. The lighter category, however, is generally not subdivided, since in some circumstances no lighter of any form may be carried; it is therefore sufficient to identify the lighter without regard to its specific type.
Based on the above situation, the embodiments of the application predetermine, in combination with the application scenario of the scheme, the coarse classes that should be subdivided.
Meanwhile, different categories focus on different content when being classified. For a tool, for example, different fine categories are usually distinguished by the handle shape, overall form, size, material and the like, so subdividing tools requires extracting from the image the feature information related to these aspects. Therefore, for each coarse category, fine-grained features should be extracted according to the distinguishing characteristics between the fine categories under that coarse category, so that the extracted fine-grained features can effectively separate the fine categories within the coarse category range and fine classification can be achieved on their basis.
Therefore, in the embodiments of the application, the fine-grained features corresponding to each coarse class are specifically extracted from the candidate region features. Here, "each coarse class" refers to the coarse classes occurring in the current application scenario that can or need to be subdivided. For example, in an X-ray security inspection scenario, the coarse category "tool" needs to be subdivided to identify which specific tool is present, so it is a coarse category requiring subdivision in the current application scenario. The specific set of coarse categories can be configured flexibly according to the actual application scenario or requirements under the idea introduced in the embodiments of the application.
After the fine-grained features corresponding to a certain coarse class have been extracted from the candidate region features, fine-grained image classification is performed within that coarse class range based on them to obtain the fine classification result of the candidate region.
Therefore, the embodiments of the application perform independent, parallel fine classification for several different coarse classes; the fine classifications of the coarse classes do not affect one another, and since the fine classification of each coarse class is performed on the basis of the fine-grained features corresponding to that coarse class, no confusion of fine-grained features occurs and the accuracy of the fine classification is guaranteed.
As a preferred implementation, referring to fig. 3, the embodiments of the application set up a corresponding fine classification model for each coarse class that can or needs to be finely classified, so each fine classification model corresponds to one coarse class. Assuming there are k coarse classes that need or can be subdivided, k fine classification models are set up accordingly.
Each fine classification model consists of a fine-grained feature extraction module and a fine classification branch. After training, each fine classification model has the capability of extracting, from the candidate region features, the fine-grained features corresponding to its coarse class, and of performing fine-grained image classification within that coarse class range based on the extracted features, thereby obtaining the fine classification result of the candidate region.
For example, assuming that the first fine classification model corresponds to the coarse class "tool": when the candidate region features extracted from the image to be detected are input into this model, it extracts the fine-grained features corresponding to the tool category from the candidate region features and then performs fine-grained image classification within the tool category range based on them, obtaining the fine classification result of the candidate region within the tool category, i.e., determining which specific tool the image content in the candidate region belongs to.
In an image target detection scenario where coarse and fine classification coexist, at least one coarse class needs to be finely classified, so at least one fine classification model is set up. The acquired candidate region features are input into each pre-trained fine classification model respectively, and the fine classification result of the candidate region is obtained through the fine classification models.
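The multi-branch arrangement of fig. 3 can be sketched as follows in PyTorch; the coarse classes "knife" and "tool" and their subclass counts are hypothetical examples, and each branch's internals are simplified placeholders here, while the actual fine-grained feature extraction is detailed in steps A1-A3 below.

```python
import torch
import torch.nn as nn

class FineClassifierBank(nn.Module):
    """One independent fine classification model per subdividable coarse
    class. Every branch receives the same candidate region features and
    outputs subclass logits restricted to its own coarse class; the branches
    do not influence one another."""
    def __init__(self, subclass_counts, in_dim=256 * 7 * 7):
        super().__init__()
        self.branches = nn.ModuleDict({
            coarse: nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 256),
                                  nn.ReLU(inplace=True), nn.Linear(256, n_sub))
            for coarse, n_sub in subclass_counts.items()})

    def forward(self, region_feats):
        # Each branch classifies every proposal within its own coarse class.
        return {coarse: branch(region_feats)
                for coarse, branch in self.branches.items()}

bank = FineClassifierBank({'knife': 9, 'tool': 6})   # k = 2 coarse classes
logits = bank(torch.randn(8, 256, 7, 7))
print({name: out.shape for name, out in logits.items()})
```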
As can be seen from the foregoing description, in the embodiments of the application, fine-grained features are extracted from the candidate region features for each coarse class that needs subdividing, that is, the fine-grained features corresponding to each such coarse class are extracted from the candidate region features. If several coarse classes need fine classification, fine-grained features are extracted separately for each of them.
As an optional implementation, the fine-grained features corresponding to one coarse class are extracted from the candidate region features, for example by the fine classification model corresponding to that coarse class, through the following steps A1-A3, shown in fig. 4:
A1, corresponding to the coarse class, a feature vector is extracted from the candidate region features that contains, for each of a first number of subclasses, a second number of local feature maps; the first number of subclasses all belong to the same coarse class.
Specifically, assuming the number of fine categories contained in the coarse class is f, a 3 × 3 convolution is first applied to the input candidate region features (n × 256 × 7 × 7), with the number of output channels set to c_out = f × r, where r denotes the second number of local feature maps assigned to each subclass. The f × r output channels mean that each subclass is individually given r local feature maps dedicated to detecting its local features. After the convolution operation, the feature vector has dimension n × c_out × 7 × 7.
A2, for each subclass, the second number (r) of local feature maps in the extracted feature vector are aggregated respectively, yielding, for each candidate region, feature vectors corresponding to the first number (f) of subclasses.
Specifically, the r local feature maps of each subclass are aggregated by cross-channel maximum pooling: every r channel feature maps form one group, and at each spatial position the maximum of the r corresponding feature values is taken as the final aggregated feature. In this way the discriminative feature points within each group of features are aggregated onto one feature map, finally yielding an n × f × 7 × 7 local feature aggregation map.
A3, bilinear pooling is performed on the feature vectors of the first number of subclasses corresponding to each candidate region to obtain the fine-grained features corresponding to the coarse class.
Specifically, the f channels of the aggregated feature map correspond to the f subclasses respectively. Since the f classes in a branch all belong to one large class, the differences between their features are small; to capture the correlations between features, bilinear pooling is further performed, that is, the outer product between the features at each spatial position on the feature map is computed and normalized. Each spatial position corresponds to an f-dimensional vector, and the resulting n × f²-dimensional bilinear pooling feature has a stronger fine-grained characterization capability and serves as the fine-grained feature corresponding to the coarse class.
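A sketch of steps A1-A3 for a single branch is given below, assuming illustrative values f = 9 and r = 4; the signed-square-root and L2 normalization after bilinear pooling are common practice for bilinear features and are an assumption here, not something stated in this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedFeature(nn.Module):
    """Steps A1-A3 for one coarse class: a 3x3 convolution emits r local
    feature maps per subclass (A1), cross-channel max pooling aggregates each
    group of r maps (A2), and bilinear pooling of the f aggregated maps
    produces the f^2-dimensional fine-grained feature (A3)."""
    def __init__(self, in_ch=256, f=9, r=4):
        super().__init__()
        self.f, self.r = f, r
        self.local_maps = nn.Conv2d(in_ch, f * r, kernel_size=3, padding=1)

    def forward(self, region_feats):                           # (n, 256, 7, 7)
        n, _, h, w = region_feats.shape
        x = self.local_maps(region_feats)                      # A1: (n, f*r, 7, 7)
        x = x.view(n, self.f, self.r, h, w).max(dim=2).values  # A2: (n, f, 7, 7)
        flat = x.view(n, self.f, h * w)
        bilinear = torch.bmm(flat, flat.transpose(1, 2)) / (h * w)  # A3: (n, f, f)
        bilinear = bilinear.view(n, -1)                        # (n, f^2)
        bilinear = torch.sign(bilinear) * torch.sqrt(bilinear.abs() + 1e-8)
        return F.normalize(bilinear, dim=1)                    # fine-grained feature

feat = FineGrainedFeature()(torch.randn(8, 256, 7, 7))
print(feat.shape)  # torch.Size([8, 81])
```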
In a preferred implementation, especially when the fine-grained features are extracted from the candidate region features by a fine classification model, a step A4 is executed after step A1 and before step A2: among the second number of local feature maps corresponding to each subclass in the extracted feature vector, a predetermined proportion of local feature maps is randomly selected and inactivated.
Specifically, during fine classification model training, in order to make the local feature detection maps of each class learn more discriminative information, a channel attention mechanism (Channel Attention) is used here: a series of coefficients [0]*a + [1]*(r - a), i.e., a zeros followed by (r - a) ones, is applied as weights over the channel domain of the features, so that a of the r feature map portions are randomly inactivated. If a = r/2 is set, the inactivation probability is 0.5, so that in each iteration only part of the feature maps are in effect; to still obtain a correct classification result, those feature maps must learn a more powerful feature expression. At the same time, random inactivation reduces redundancy among the features and increases the complementarity among the local feature detection maps.
Based on this training scheme, the practical application process can match the model training process: after the feature vectors containing, for each candidate region, the r local feature maps of each of the f subclasses have been extracted from the candidate region features, a set proportion of the r local feature maps corresponding to each subclass is randomly selected and inactivated, and the aggregation processing is then performed on the r local feature maps of each subclass.
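The random inactivation of step A4 might be sketched as follows; the per-subclass random permutation of the coefficient series [0]*a + [1]*(r - a) is one plausible reading of the mechanism described above, not a confirmed implementation detail.

```python
import torch

def randomly_inactivate(local_maps, f, r, a):
    """Step A4 (training time): zero out a of the r local feature maps of
    every subclass, i.e. weight the channel domain with the coefficient
    series [0]*a + [1]*(r - a), randomly permuted per subclass."""
    n = local_maps.shape[0]                            # local_maps: (n, f*r, h, w)
    coeff = torch.cat([torch.zeros(a), torch.ones(r - a)])
    perm = torch.argsort(torch.rand(n, f, r), dim=2)   # random order per subclass
    mask = coeff[perm].view(n, f * r, 1, 1).to(local_maps.device)
    return local_maps * mask

x = torch.randn(2, 9 * 4, 7, 7)            # f = 9 subclasses, r = 4 maps each
y = randomly_inactivate(x, f=9, r=4, a=2)  # a = r/2 -> inactivation prob 0.5
```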
The image target detection method combining coarse and fine classification provided by the embodiments of the application executes the coarse classification and the fine classification separately, which ensures that the fine classification results are not influenced by the coarse classification results. Accordingly, every candidate region feature extracted from the image to be detected is input into every fine classification branch for fine-grained feature extraction and fine classification.
In general, in order to accurately detect all target objects in the image to be detected, a large number of candidate regions, for example 1000, are extracted from it.
When the candidate regions are finely classified, the features of every candidate region proposal need to be input into every fine classification model shown in fig. 3. If the number of coarse classes requiring subdivision is large, there are several fine classification models, and each of them must perform fine-grained feature extraction, fine classification prediction and related processing for every proposal. Thus, the larger the number of candidate region proposals, the greater the computation required by the fine classification processing based on the candidate region features.
In order to reduce the amount of computation and accelerate inference, the embodiments of the application use the classification results of the coarse branch to filter the candidate regions, thereby reducing the number of candidate regions that undergo fine classification.
When several candidate regions have been extracted from the image to be detected, target detection processing is first performed on the obtained candidate region features to coarsely classify the candidate regions. Then, before fine-grained features are extracted and the fine classification results are obtained, the candidate regions extracted from the image are screened according to the coarse classification result of each candidate region, keeping the N candidate regions with the highest classification confidence, and the feature vectors of these N candidate regions are taken as the updated candidate region features. Fine-grained features are then extracted from the updated candidate region features, and image classification processing is performed on them to obtain the fine classification results of the candidate regions.
Specifically, after n candidate region proposals have been extracted from the image to be detected and their features obtained, the features of the n (e.g., n = 1000) proposals are fed into the coarse classification branch to obtain an n × cls_num_coarse-dimensional class prediction vector, that is, the coarse classification results of the n candidate region proposals.
Then, through NMS non-maximum suppression, the top N (e.g., N = 300) detection boxes with the highest classification confidence are selected according to the coarse classification results, and back-indexing is performed to obtain the N original candidate region proposals corresponding to these detection boxes, which serve as the updated candidate regions for fine classification.
The subsequent fine classification processing is performed on the updated candidate region features, so the number of candidate region proposals actually finely classified drops from the original n to N, greatly reducing the computation of the fine classification processing. Experiments show that this proposal filtering has no obvious effect on the final detection results while increasing the detection speed by about 1.2 times.
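A sketch of this confidence-based filtering, assuming torchvision's nms and the coarse head sketched earlier; the 0.5 IOU threshold and the dummy inputs are illustrative assumptions.

```python
import torch
from torchvision.ops import nms

def filter_proposals(boxes, coarse_logits, region_feats, top_n=300, iou_thr=0.5):
    """Keep the top_n proposals with the highest coarse classification
    confidence after NMS, then back-index into the original region features
    so that the fine classification branches process far fewer proposals."""
    scores = coarse_logits.softmax(dim=1).max(dim=1).values  # best-class confidence
    keep = nms(boxes, scores, iou_thr)     # indices, sorted by descending score
    keep = keep[:top_n]
    return region_feats[keep], boxes[keep], keep

n = 1000                                   # proposals before filtering
feats = torch.randn(n, 256, 7, 7)
boxes = torch.rand(n, 4) * 100
boxes[:, 2:] += boxes[:, :2]               # guarantee x2 > x1 and y2 > y1
logits = torch.randn(n, 5)
kept_feats, kept_boxes, idx = filter_proposals(boxes, logits, feats)
print(kept_feats.shape)                    # at most (300, 256, 7, 7)
```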
As a preferred implementation, in the image target detection method combining coarse and fine classification provided by the embodiments of the application, the extraction of candidate regions from the image to be detected and the acquisition of the candidate region features may be realized by means of a pre-trained feature extraction model.
The process by which the feature extraction model extracts the candidate regions and their features from the image to be detected is shown in fig. 5.
After the image to be detected is input into the feature extraction model, a ResNet50 residual convolutional neural network serves as the Backbone to extract deep image features bottom-up. The residual network is formed by stacking residual modules; each module uses 1 × 1, 3 × 3 and 1 × 1 convolutional layers to form a bottleneck structure for residual fitting, and uses shortcut connections for identity mapping, which alleviates the vanishing-gradient problem of deep networks during backpropagation so that the model can still be trained effectively at greater depth. During the bottom-up feature extraction of the Backbone, the spatial size of the features decreases step by step, the receptive field grows, and the semantic expression capability improves gradually. After the top-level features are obtained, a Feature Pyramid Network (FPN) upsamples them top-down step by step and fuses them with the corresponding lower-level feature maps through lateral connections, yielding multi-level feature vectors rich in both semantic information and geometric characteristics. The resulting deep convolutional features can be represented by a four-dimensional tensor N × C × H × W, where N is the batch size of the input images, C is the number of feature channels, typically set to 256 or more, and H/W are the spatial dimensions of the feature map.
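For illustration, torchvision (version 0.13 or later is assumed for this keyword API) provides a ready-made ResNet50 + FPN combination matching this Backbone + Neck arrangement; the input size and batch below are arbitrary placeholders.

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet50 bottom-up Backbone plus FPN top-down Neck, as described above.
backbone = resnet_fpn_backbone(backbone_name='resnet50', weights=None)
images = torch.randn(2, 3, 800, 800)   # an N x C x H x W input batch
features = backbone(images)            # dict of multi-level feature maps
for level, fmap in features.items():
    print(level, tuple(fmap.shape))    # every pyramid level has 256 channels
```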
Based on the extracted convolutional features, the RPN network presets a series of anchors centered at each pixel point of the feature map and locates the candidate region proposals that may contain target objects. After the candidate region proposals are generated, they can be mapped back onto the Backbone convolutional feature map and the features of the corresponding parts cropped out to obtain the candidate region features. Because the proposals differ in size, an ROI pooling operation is first performed to pool these partial features into fully-connected-layer input features of fixed dimension; if the number of pooled blocks is set to m × m, candidate region feature vectors of dimension n × c × m × m are obtained.
As a preferred training approach, the training loss of the feature extraction model at least includes the loss of the coarse classification performed based on the candidate region features extracted by the model and the loss of the fine classification performed based on those candidate region features.
Specifically, during the training of the feature extraction model, when candidate region features are extracted, the coarse classification of the candidate regions described in the above embodiments is performed on their basis, and the loss of the coarse classification result is computed as the first loss function; at the same time, the fine classification of the candidate regions described in the above embodiments is performed on the same basis, and the loss of the fine classification result is computed as the second loss function.
Then, based on the first loss function and the second loss function, the parameters of the feature extraction model are optimized, so that the extracted candidate region features can support the subsequent coarse and fine classification in obtaining more accurate classification results.
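A minimal sketch of this joint objective, assuming plain cross-entropy for both branches and an illustrative weighting factor; the exact loss forms and weights are not specified in this application.

```python
import torch
import torch.nn.functional as F

def joint_loss(coarse_logits, coarse_labels, fine_logits_list, fine_labels_list,
               fine_weight=1.0):
    """First loss: coarse classification on the candidate region features.
    Second loss: one fine classification term per subdividable coarse class.
    Both terms backpropagate into the shared feature extraction model."""
    loss_coarse = F.cross_entropy(coarse_logits, coarse_labels)
    loss_fine = sum(F.cross_entropy(logits, labels)
                    for logits, labels in zip(fine_logits_list, fine_labels_list))
    return loss_coarse + fine_weight * loss_fine

# Dummy example: 8 proposals, 5 coarse classes, one fine branch of 9 subclasses.
loss = joint_loss(torch.randn(8, 5, requires_grad=True),
                  torch.randint(0, 5, (8,)),
                  [torch.randn(8, 9, requires_grad=True)],
                  [torch.randint(0, 9, (8,))])
loss.backward()  # gradients flow through both classification branches
```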
In addition, the candidate region extraction capability of the feature extraction model is trained according to the candidate region labels of the training sample data.
Specifically, when the feature extraction model extracts candidate regions, a series of anchors are preset according to the likely size and aspect-ratio attributes of the target objects; for example, three sizes of 64 × 64, 128 × 128 and 256 × 256 are set, each with the three aspect ratios 1:1, 1:2 and 2:1, giving 9 anchors. On the feature map extracted from the input image, a group of anchors is placed centered at each pixel point. Each anchor is then classified as background or foreground (a 0/1 binary classification) according to its overlap (IOU) with the candidate region annotation box gt_box, and position regression prediction is performed according to the position offsets (t_x, t_y, t_w, t_h) of the anchor relative to gt_box. The position regression formulas of the candidate region are:

t_x = (x - x_a) / w_a,  t_y = (y - y_a) / h_a,  t_w = log(w / w_a),  t_h = log(h / h_a)

where (x_a, y_a, w_a, h_a) are the center coordinates and width/height values of the preset anchor, and (x, y, w, h) are the center coordinates and width/height values of the predicted candidate box. Because the number of preset anchors is large, NMS non-maximum suppression is usually adopted to select the n (e.g., n = 1000) regression boxes with the highest confidence as the final candidate regions.
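The offset encoding can be sketched directly from the formulas above; the anchor and gt_box values below are arbitrary placeholders.

```python
import torch

def encode_boxes(anchors, gt_boxes):
    """Compute the regression targets (t_x, t_y, t_w, t_h) of each anchor
    relative to its matched gt_box, following the formulas above.
    Boxes are given as (x1, y1, x2, y2)."""
    xa = (anchors[:, 0] + anchors[:, 2]) / 2
    ya = (anchors[:, 1] + anchors[:, 3]) / 2
    wa = anchors[:, 2] - anchors[:, 0]
    ha = anchors[:, 3] - anchors[:, 1]
    x = (gt_boxes[:, 0] + gt_boxes[:, 2]) / 2
    y = (gt_boxes[:, 1] + gt_boxes[:, 3]) / 2
    w = gt_boxes[:, 2] - gt_boxes[:, 0]
    h = gt_boxes[:, 3] - gt_boxes[:, 1]
    return torch.stack([(x - xa) / wa, (y - ya) / ha,
                        torch.log(w / wa), torch.log(h / ha)], dim=1)

anchors = torch.tensor([[0., 0., 64., 64.]])
gt = torch.tensor([[8., 4., 72., 76.]])
print(encode_boxes(anchors, gt))  # the (t_x, t_y, t_w, t_h) targets
```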
This training approach applies both the fine classification loss and the coarse classification loss to train and optimize the feature extraction model, which on the one hand improves the model's extraction of the overall features of the candidate regions and on the other hand encourages it to extract more useful fine-grained features, improving the overall performance of image target detection combining coarse and fine classification.
The most preferred implementation of the image target detection method combining coarse and fine classification provided by the embodiments of the application is to train an image target detection model combining coarse and fine classification and to let this model execute the processing of the method, realizing the coarse and fine classification of the target objects in the image to be detected.
Referring to fig. 6, the image target detection model combining coarse and fine classification consists of a feature extraction backbone network, a coarse classification network and at least one fine classification network. As an optional construction, the feature extraction backbone network has the same structure and function as the feature extraction model in the above embodiments, the coarse classification network has the same structure and function as the target detection model realizing the coarse classification in the above embodiments, and each fine classification network has the same structure and function as a fine classification model in the above embodiments.
In the training process, the feature extraction backbone network is trained simultaneously based on the coarse classification loss of the coarse classification network and the fine classification loss of at least one fine classification network, and has the capabilities of extracting features of an image to be detected, generating a candidate region and extracting features of the candidate region, and the extracted features can better support the coarse classification network and the fine classification network to perform coarse classification and fine classification respectively.
After training, the rough classification network has the capability of performing target detection processing on the candidate region based on the candidate region features output by the feature extraction backbone network, so as to obtain a first detection result; the first detection result comprises a coarse classification result of the candidate region and a position prediction result of the candidate region;
each fine classification network has the capability of extracting a fine-grained feature corresponding to a coarse class from the candidate region features output by the feature extraction backbone network, and classifying fine-grained images in the coarse class range based on the extracted fine-grained feature, so as to obtain a fine classification result of the candidate region.
After the training of the image target detection model combined with the coarse and fine classification is finished, the image to be detected is input into the image target detection model combined with the coarse and fine classification obtained through pre-training, the image to be detected firstly enters a feature extraction backbone network, and the feature extraction backbone network extracts a candidate region from the image to be detected and obtains the features of the candidate region. Then, the rough classification network of the image target detection model combined with the rough classification carries out target detection processing on the basis of the candidate regional characteristics output by the characteristic extraction backbone network to obtain a first detection result; and at least one fine classification network of the image target detection model combined with the rough classification extracts fine-grained features from the candidate region features output by the feature extraction backbone network respectively, and carries out image classification processing based on the extracted fine-grained features to obtain a fine classification result of the candidate region.
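To make this data flow concrete, the following is a minimal sketch (an editor's illustration, not the patent's implementation; all module names are hypothetical) of a backbone feeding one coarse detection head and several parallel fine classification heads:

    import torch.nn as nn

    class CoarseFineDetector(nn.Module):
        """Hypothetical sketch: backbone + coarse head + k parallel fine heads."""

        def __init__(self, backbone, coarse_head, fine_heads):
            super().__init__()
            self.backbone = backbone                      # candidate region features
            self.coarse_head = coarse_head                # coarse class + box offsets
            self.fine_heads = nn.ModuleList(fine_heads)   # one per split coarse class

        def forward(self, image):
            region_feats = self.backbone(image)           # features per candidate region
            coarse_logits, box_offsets = self.coarse_head(region_feats)
            fine_logits = [head(region_feats) for head in self.fine_heads]
            return coarse_logits, box_offsets, fine_logits

All fine heads read the same candidate region features, so the fine classification results do not depend on the coarse classification result, matching the parallel-branch design described above.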
For example, the above-mentioned image object detection model combining the coarse classification and the fine classification can be obtained by performing the following steps S1-S4:
S1, acquiring a training sample image and an annotation label of the training sample image; the labeling labels of the training sample images comprise labeling labels corresponding to all target objects in the training sample images, the labeling labels of the target objects comprise category labels and labeling frame position labels, and the category labels of the target objects comprise rough labels of the target objects and fine labels of the target objects.
Specifically, training sample images are acquired in batch, and label labeling is performed on each acquired training sample image.
For a training image containing a plurality of target objects, each target object is labeled with a class label cls and a labeling frame position label gt_box(x1, y1, x2, y2). In the technical solution of the embodiment of the present application, some target objects belong to fine classes split from a coarse class; for these, the class label is set in the form of a two-level label comprising a coarse class label cls_coarse and a fine class label cls_refined. For those target objects whose morphological characteristics are stable and which do not need to be split, the class label is set with a default fine class label, and only the cls_coarse label is provided. Taking the tool category in an X-ray security inspection intelligent identification system as an example, the coarse class label cls_coarse is set to knife, and the fine class label cls_refined is set, according to morphological characteristics, to one of approximately shell-free blade, shell-same-color blade, special blade, same-color folding blade, different-color folding blade, same-color shank blade, different-color shank blade, linear tool, large-area blade and the like. In the training process, the coarse class label cls_coarse is applied only to the coarse classification network training process, and the fine class label cls_refined is applied only to the fine classification network training process.
Furthermore, a series of data enhancement operations are performed on the obtained training sample images to increase the diversity of the data and improve the robustness and generalization capability of the trained model. The data enhancement operations used here include: 1) geometric transformation: translation, rotation, horizontal/vertical flipping, random cropping and the like, to adapt to target objects with different postures and positions; 2) multi-scale scaling: for example, the image is scaled by a coefficient randomly sampled in steps of 0.05 within the range 0.8-1.5, to adapt to target objects of different sizes; 3) digital image transformation: the brightness, contrast, saturation and the like of the image are adjusted, to adapt to images obtained from different device models and different imaging environments.
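As a rough sketch of such a pipeline (editor's illustration; torchvision is assumed only for convenience, and all parameter values are illustrative rather than taken from the patent):

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(),             # geometric: horizontal flip
        transforms.RandomVerticalFlip(),               # geometric: vertical flip
        transforms.RandomAffine(degrees=15,            # geometric: rotation/translation
                                translate=(0.1, 0.1),
                                scale=(0.8, 1.5)),     # multi-scale scaling in 0.8-1.5
        transforms.ColorJitter(brightness=0.2,         # digital image transformation
                               contrast=0.2,
                               saturation=0.2),
    ])

Note that in a real detection pipeline the labeling frames gt_box must be transformed consistently with the image, which the sketch above omits.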
S2, inputting the training sample image into the image target detection model combining coarse and fine classification to obtain a first detection result output by the coarse classification network and a fine classification result output by each fine classification network.
Specifically, after a training sample image is input into the image target detection model combining coarse and fine classification, the feature extraction backbone network in the model extracts the candidate region features, the coarse classification network performs coarse classification on the candidate regions based on the candidate region features, and each fine classification network performs fine classification on the candidate regions based on the candidate region features; finally, the coarse classification result and position prediction result output by the coarse classification network and the fine classification result output by each fine classification network are obtained.
The above process of extracting the candidate region from the training sample image and obtaining the feature of the candidate region by the feature extraction backbone network may refer to the corresponding descriptions of the structure and the function of the feature extraction model. Similarly, the specific working contents of the coarse classification network and the fine classification network may be referred to the working contents of the target detection model and the fine classification model in the above embodiments, respectively. The specific operation of each network is not repeated here.
In order to improve the training accuracy, the position prediction result of the candidate region output by the above-described coarse classification network is represented by the offsets of the four vertex coordinates of the predicted candidate region relative to the four vertex coordinates of the real target labeling frame.
S3, calculating and determining rough classification loss and position regression loss according to a first detection result output by the rough classification network, and a rough label and a label frame position label in a label of a training sample image; and calculating and determining the fine classification loss of each fine classification network according to the fine classification result output by each fine classification network and the fine classification label in the labeling label of the training sample image.
Specifically, according to the position offset of the candidate region in the first detection result output by the rough classification network and the labeling frame position label gt_box of the training sample image, the position regression loss is calculated with the SmoothL1 function:

$$\mathrm{SmoothL1}(x)=\begin{cases}0.5\,x^{2}, & \text{if } |x|<1\\ |x|-0.5, & \text{otherwise}\end{cases}$$

applied to $x = t - t^{*}$, where $t = (t_x, t_y, t_w, t_h)$ is the position offset of the candidate region labeling box relative to gt_box predicted by the coarse classification network, and $t^{*} = (t_x^{*}, t_y^{*}, t_w^{*}, t_h^{*})$ is the true offset of the candidate box relative to gt_box.
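For concreteness, a minimal sketch of this loss (editor's illustration; PyTorch also ships an equivalent nn.SmoothL1Loss):

    import torch

    def smooth_l1(pred_offsets, true_offsets):
        """Element-wise SmoothL1 over (tx, ty, tw, th), averaged over candidates."""
        x = pred_offsets - true_offsets
        loss = torch.where(x.abs() < 1, 0.5 * x ** 2, x.abs() - 0.5)
        return loss.mean()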
The result of the coarse classification of the candidate regions by the coarse classification network takes the form of probability labels over the coarse classes to which each candidate region may belong. Assuming that the number of coarse classes is cls_num_coarse, a category prediction value of dimension n × cls_num_coarse is finally obtained for n candidate regions, and softmax is applied:

$$S_i=\frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

to obtain the converted class probabilities, where $x_i$ denotes the predicted score of the i-th coarse category to which the candidate region belongs, and $S_i$ denotes the normalized probability value of the i-th coarse category.
Then, according to the target object coarse class label of the training sample image, the coarse classification loss of the coarse classification network can be calculated as the cross-entropy

$$\mathrm{cls\_loss\_coarse}=-\sum_{i} y_i \log P(x_i)$$

where $P(x_i)$ is the above $S_i$, and $y_i$ is the one-hot coarse class label of the candidate region.
The fine classification loss for each fine classification network can be calculated with reference to the coarse classification loss scheme described above.
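An editor's sketch of this classification loss (the fine classification losses follow the same softmax cross-entropy scheme over their own label sets):

    import torch.nn.functional as F

    def classification_loss(logits, labels):
        """Softmax + cross-entropy over n candidate regions.

        logits: (n, cls_num) raw category prediction values.
        labels: (n,) integer class indices (coarse or fine, per network).
        """
        return F.cross_entropy(logits, labels)  # applies log-softmax internally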
As a more preferred implementation, the embodiment of the present application calculates and determines the fine classification loss of each fine classification network according to the following steps S31 and S32:
S31, respectively determining the target object labeling labels corresponding to each fine classification network from the labeling labels of the training sample images; the target object labeling labels corresponding to a fine classification network are specifically the labeling labels of target objects of the coarse class that this fine classification network finely classifies.
Specifically, each fine classification network performs fine classification corresponding to a different coarse class, so within the same batch of training sample images the targets addressed by each fine classification network differ. In the training process, positive and negative samples are determined according to the overlap degree between the candidate region proposals and the real labeling frames: a candidate frame with a high overlap degree with gt_box (for example, IOU ≥ 0.7) is treated as a positive sample, and a candidate frame with a low overlap degree (for example, IOU < 0.3) is treated as a negative sample. In the coarse classification network, all labeled target objects are treated as valid gt_box and participate in the calculation; in a fine classification network, in order to perform subclass identification only for one specific coarse class, only the labeled targets belonging to that coarse class are treated as gt_box and participate in the calculation, while the remaining labeling frames are treated as background. At this time, each target frame takes its fine class label cls_refined as its class, and the training model predicts the fine class attribution of the target frame.
According to the above rules, the embodiment of the present application determines the target objects corresponding to each fine classification network and their labeling labels respectively.
S32, calculating and determining the fine classification loss of each fine classification network according to the fine classification result output by each fine classification network and the fine class label in the target object labeling label corresponding to each fine classification network.
Specifically, after the category labels cls_refined and the new gt_box_new corresponding to a fine classification network are determined, the fine classification loss of that network can be calculated by comparing the fine classification result output by the network with its corresponding category labels cls_refined; the specific calculation may refer to the coarse classification loss calculation method.
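A minimal sketch of this per-branch label filtering (editor's illustration; the annotation field names are hypothetical):

    def labels_for_fine_branch(annotations, coarse_class):
        """Keep only targets of the given coarse class as gt boxes for one
        fine classification branch; all other targets count as background."""
        gt_box_new, cls_refined = [], []
        for ann in annotations:                         # one record per labeled target
            if ann["cls_coarse"] == coarse_class:
                gt_box_new.append(ann["gt_box"])        # participates in IOU matching
                cls_refined.append(ann["cls_refined"])  # fine class is the label
        return gt_box_new, cls_refined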
S4, performing parameter correction on the feature extraction backbone network and the coarse classification network according to the coarse classification loss and the position regression loss, and performing parameter correction on the feature extraction backbone network and each fine classification network according to the fine classification loss of each fine classification network.
As can be understood from the above description, the feature extraction backbone network mainly extracts coarse-grained overall category features; the coarse classification network performs coarse classification directly with these features, while each fine classification network continues to extract fine-grained features on top of them and uses the extracted fine-grained features for intra-class classification, thereby realizing fine classification of the candidate regions. In the model training process, the coarse classification network is reversely optimized based on the coarse classification loss, each fine classification network is reversely optimized based on its fine classification loss, and at the same time the back-propagated gradients of both the coarse classification network and the fine classification networks optimize the parameters of the feature extraction backbone network.
Specifically, the above step S4 may be subdivided into the following steps S41 and S42:
S41, determining the overall loss of the image target detection model combining coarse and fine classification based on the coarse classification loss, the position regression loss and the fine classification loss of each fine classification network.
In the model training process, the overall loss of the image target detection model combining the rough classification and the fine classification comprises the rough classification loss, the position regression loss and the fine classification loss of each fine classification network. Specifically, the overall loss of the image target detection model combined by the coarse classification and the fine classification can be determined by summing the coarse classification loss, the position regression loss and the fine classification losses of the respective fine classification networks.
In a preferred embodiment, when determining the overall loss of the image target detection model combining the rough classification and the fine classification, firstly, determining the fine classification loss weight corresponding to each fine classification network according to the class label of the target object in the training sample image; when all training sample images do not contain target objects belonging to a coarse class which is subjected to fine classification by a fine classification network, the fine classification loss weight corresponding to the fine classification network is 0; when all the training sample images contain target objects belonging to the coarse class subdivided by the fine classification network, the fine classification loss weight corresponding to the fine classification network is a value other than 0.
Specifically, in the actual training process, since the number of fine class targets is smaller than the total number of targets to be recognized, a training sample batch of size batch_size may contain no target of a given finely classified coarse class at all. If, at that point, all candidate frames were still treated as background for training, the negative-sample loss would dominate over many iterations, affecting the convergence direction of the model and reducing the detection rate of the classifier. Therefore, in the embodiment of the present application, a dynamic fine classification loss weight dynamic_weight is set for each fine classification network: the fine classification network only takes effect when the input batch contains targets of its class, and when the batch contains no such targets at all, the fine classification loss weight dynamic_weight of that network is set to 0. Dynamic loss weights can, on the one hand, alleviate the imbalance between background and positive samples in a fine classification network; on the other hand, they reduce the negative-sample loss value to avoid over-suppression of other classes (false backgrounds), since part of the background in one fine classification network is a positive sample relative to other fine classification networks.
Based on this idea, when the overall loss of the image target detection model combining coarse and fine classification is calculated at each training iteration, whether the training sample images contain image targets belonging to the coarse class of a fine classification network is determined according to the class labels of the target objects in the training sample images, so as to decide whether the fine classification loss weight corresponding to that fine classification network is set to 0; in this way, the fine classification loss weight of each fine classification network can be determined respectively.
For example, when the fine-classification loss weight of the fine-classification network is not 0, its specific value may be set to 1, or a fractional value between 0 and 1.
Then, based on the fine classification loss weight corresponding to each fine classification network, the fine classification losses of the fine classification networks are weighted and summed, and the weighted summation result is summed with the coarse classification loss and the position regression loss to obtain the overall loss of the image target detection model combining coarse and fine classification.
Specifically, assume there are k fine classification networks in total, with fine classification losses cls_loss_fine_i, i = 1, …, k; the coarse classification loss is cls_loss_coarse and the position regression loss is reg_loss. The overall loss of the image target detection model combining coarse and fine classification can then be calculated according to the following formula:

$$\mathrm{loss}=\mathrm{cls\_loss\_coarse}+\mathrm{reg\_loss}+\lambda\sum_{i=1}^{k}\mathrm{dynamic\_weight}_i\cdot\mathrm{cls\_loss\_fine}_i$$

where $\lambda$ is a coefficient used to balance the coarse classification loss and the fine classification losses, and $\mathrm{dynamic\_weight}_i$ is the fine classification loss weight of the i-th fine classification network.
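An editor's sketch of this overall loss (the variable names follow the patent's notation; the dynamic weights are computed per batch as described above):

    def overall_loss(cls_loss_coarse, reg_loss, cls_loss_fine, dynamic_weight, lam=1.0):
        """Overall loss = coarse loss + regression loss + weighted fine losses.

        cls_loss_fine:  list of k fine classification losses.
        dynamic_weight: list of k weights; 0 when the batch contains no target
                        of the coarse class handled by that fine branch.
        lam:            balancing coefficient (the lambda in the formula above).
        """
        fine_term = sum(w * l for w, l in zip(dynamic_weight, cls_loss_fine))
        return cls_loss_coarse + reg_loss + lam * fine_term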
S42, performing reverse parameter correction on the feature extraction backbone network and the coarse classification network according to the coarse classification loss and the position regression loss, and performing reverse parameter correction on the feature extraction backbone network and each fine classification network according to the fine classification loss of each fine classification network, so as to reduce the overall loss of the image target detection model combining coarse and fine classification by gradient descent.
Specifically, using the classical stochastic gradient descent algorithm, reverse parameter optimization is performed on the feature extraction backbone network and the coarse classification network with the coarse classification loss and the position regression loss, and on the feature extraction backbone network and each fine classification network with the fine classification loss of each fine classification network, so that the overall loss of the image target detection model combining coarse and fine classification decreases along its gradient.
In the optimization process, the feature extraction backbone network mainly extracts coarse-grained overall category features; the coarse classification network performs coarse classification directly with these features, while each fine classification network attaches a fine-grained feature extraction module on top of them and uses the generated fine-grained features for intra-class classification. During iteration, the back-propagated gradients of the fine classification branches further optimize the feature extraction backbone network, thereby helping to improve the recognition results of the coarse classification branch. On the whole, the method improves the coarse class detection precision on the one hand, and expands the category detection capability of the model on the other hand, so that the various fine classes can be detected together.
In addition, as a preferred model training mode: in the model training process, the fine-grained features of the target need to be extracted based on the candidate region, and the candidate region positions predicted in the early stage of training are usually not accurate enough. To avoid interfering with the feature learning process of the fine classification networks, the coarse classification network of the model can first be trained alone for several epochs (e.g., 5 epochs), and the fine-grained feature extraction modules are started for fine-grained identification only after the coarse classification network tends to be stable.
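A schematic training loop reflecting this warm-up strategy (editor's sketch; coarse_loss and fine_losses stand for the loss computations described above and are hypothetical helpers):

    def train(model, loader, optimizer, epochs=20, warmup_epochs=5):
        """Train the coarse branch alone first, then enable the fine branches."""
        for epoch in range(epochs):
            use_fine = epoch >= warmup_epochs     # fine branches join after warm-up
            for images, targets in loader:
                coarse_logits, box_offsets, fine_logits = model(images)
                loss = coarse_loss(coarse_logits, box_offsets, targets)
                if use_fine:
                    loss = loss + fine_losses(fine_logits, targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()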
From the above introduction to the structure, function and training process of the image target detection model combining coarse and fine classification, it should be understood that the image target detection scheme combining coarse and fine classification provided by the embodiment of the present application is mainly an improvement on the structure of the fast_rcnn model. A conventional detection model usually contains only a single class detector. For a coarse class with mixed subclasses, if the coarse class is directly detected as one class, the detection precision is low because of the large intra-class difference; if instead all the subclasses are expanded and detected together with the other classes, the fine classes are difficult to detect because of the inconsistency of feature granularity. The present scheme adopts a multi-branch fine-grained identification scheme combining coarse and fine classification, in which multiple branches predict synchronously and promote each other, so that the detection precision of the coarse class is improved while good fine class detection capability is maintained.
Compared with methods that perform coarse-to-fine progressive classification with a tree-shaped cascade structure, the present scheme emphasizes a coarse-to-fine feature extraction process. Progressive classification relies on a coarse-to-fine tree-shaped classifier and places high demands on the feature extraction capability of the feature extraction backbone: the extracted features must simultaneously characterize the intra-class components and the overall differences between classes, which is generally difficult to achieve. Even if features of different granularities such as coarse/fine can be extracted at the same time, because the extracted features enter from the same interface and flow to the fine classifier only after passing through the coarse classifier, the features facing each stage of classifier carry redundancy: fine-grained features are redundant for the coarse classifier, while coarse-grained features are unnecessary for the fine classifier. Under such redundancy, it is difficult to guarantee that each stage of classifier actually uses the features of its corresponding granularity. In addition, the fine classification branch in a tree structure depends on, and is easily affected by, the recognition result of the coarse classification branch. In the technical scheme of the embodiment of the present application, new fine classification branches are constructed directly at the head of the model in parallel, one branch corresponding to one split class, without relying on the recognition result of the coarse classifier; and the model generates features of different granularities through the combination of backbone coarse-grained features and fine-grained feature extraction modules, feeding the coarse classification branch and the fine classification branches respectively. In the optimization process, the gradients of the fine-grained feature extraction modules are also propagated back to the feature extraction backbone, so that the branches assist each other and are optimized jointly.
In addition, the model structure provided by the technical scheme of the embodiment of the present application supports adding any number of fine classification branches at any stage of the cascade structure. Moreover, this mode of splitting and recombining classes has strong generalizability: classes that do not belong to one large class in the traditional sense but are similar in appearance, such as an electric shock device and a flashlight, can be merged into one large class detected by the coarse classification branch and then distinguished in detail in a fine classification branch.
In correspondence to the above-mentioned image object detection method combining coarse and fine classification, an embodiment of the present application provides an image object detection apparatus combining coarse and fine classification, as shown in fig. 7, the apparatus includes:
a feature extraction module 100, configured to extract a candidate region from an image to be detected and obtain a candidate region feature;
a rough classification module 110, configured to perform target detection processing based on the candidate region features to obtain a first detection result, where the first detection result at least includes a rough classification result for the candidate region;
and a fine classification module 120, configured to extract fine-grained features from the candidate region features, and perform image classification processing based on the extracted fine-grained features to obtain a fine classification result for the candidate region.
The image target detection device combining coarse and fine classification provided by the embodiment of the present application can extract the features of the candidate region from coarse granularity to fine granularity, perform coarse classification on the candidate region based on the extracted coarse-grained features, and perform fine classification on the candidate region based on the extracted fine-grained features, so that a coarse classification result and a fine classification result of the candidate region can be obtained simultaneously. In image target detection tasks in which the intra-class difference is large and the coarse class contains subclasses, this scheme can perform coarse classification and fine classification on the image respectively based on image features of different granularities, so that the target object in the image can be identified at coarse granularity and its subclass under the coarse class can also be identified, and the image target can be recognized more accurately and comprehensively.
Furthermore, compared with schemes that perform coarse-to-fine progressive classification with a tree-shaped cascade structure, the image target detection device combining coarse and fine classification provided by the embodiment of the present application realizes a coarse-to-fine image feature extraction process in which the coarse classification and the fine classification each use features of the corresponding granularity for image classification; therefore, the fine classification result does not depend on the coarse classification result, and the accuracy of the classification results can be fully guaranteed.
Optionally, the extracting fine-grained features from the candidate region features, and performing image classification processing based on the extracted fine-grained features to obtain a fine classification result of the candidate region, including:
and respectively extracting fine-grained features corresponding to each coarse class from the candidate region features, and classifying fine-grained images in a corresponding coarse class range based on the extracted fine-grained features to obtain a fine classification result of the candidate region.
Optionally, the step of respectively extracting fine-grained features corresponding to each coarse category from the candidate region features, and performing fine-grained image classification processing in a corresponding coarse category range based on the extracted fine-grained features to obtain a fine classification result of the candidate region includes:
respectively inputting the characteristics of the candidate areas into at least one fine classification model trained in advance to obtain fine classification results of the candidate areas;
each fine classification model is respectively provided with the capability of extracting a fine-grained feature corresponding to a coarse class from the candidate region features and classifying fine-grained images in the coarse class range based on the extracted fine-grained feature so as to obtain a fine classification result of the candidate region.
Optionally, extracting a fine-grained feature corresponding to a coarse class from the candidate region features comprises:

corresponding to the coarse class, extracting from the candidate region features the feature vectors of the local feature maps of each candidate region, wherein each candidate region corresponds to a first number of subclasses and each subclass corresponds to a second number of local feature maps, the first number of subclasses belonging to the same coarse class;

respectively aggregating the second number of local feature maps corresponding to each subclass in the extracted feature vectors, to obtain the feature vectors of the first number of subclasses corresponding to each candidate region;

and performing bilinear pooling on the feature vectors of the first number of subclasses corresponding to each candidate region, to obtain the fine-grained feature corresponding to the coarse class, as illustrated in the sketch below.
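A minimal editor's sketch of the three steps just listed (the shapes, module names and the 1×1 projection are assumptions, not the patent's code):

    import torch
    import torch.nn as nn

    class FineGrainedExtractor(nn.Module):
        """n_sub subclasses per coarse class, n_local local maps per subclass."""

        def __init__(self, in_channels, n_sub, n_local):
            super().__init__()
            self.n_sub, self.n_local = n_sub, n_local
            self.proj = nn.Conv2d(in_channels, n_sub * n_local, kernel_size=1)

        def forward(self, region_feats):                  # (n, c, h, w)
            n = region_feats.size(0)
            local = self.proj(region_feats)               # (n, n_sub*n_local, h, w)
            local = local.view(n, self.n_sub, self.n_local, -1)
            sub_vec = local.mean(dim=2)                   # aggregate: (n, n_sub, h*w)
            # bilinear pooling between subclass feature vectors
            bilinear = torch.einsum('nsd,ntd->nst', sub_vec, sub_vec) / sub_vec.size(-1)
            return bilinear.flatten(1)                    # (n, n_sub * n_sub)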
Optionally, after the feature vectors of the local feature maps corresponding to the first number of sub-classes and the second number of sub-classes are extracted and obtained, the fine classification module 120 is further configured to:
and randomly selecting local feature maps with a set proportion from the second number of local feature maps corresponding to each subclass in the extracted feature vectors to be inactivated.
Optionally, the apparatus further comprises:
the candidate region screening module is used for screening the candidate regions extracted from the image to be detected according to the coarse classification result of each candidate region to obtain N candidate regions with the highest classification confidence; and acquiring the feature vectors of the N candidate regions as the updated candidate region features.
Optionally, the extracting a candidate region from an image to be detected and acquiring a candidate region feature includes:
inputting an image to be detected into a feature extraction model, so that the feature extraction model extracts a candidate region from the image to be detected and obtains the feature of the candidate region;
the feature extraction model is obtained by training at least according to a first loss function and a second loss function;
the first loss function is a loss function of a coarse classification result of the candidate region, which is obtained by performing target detection processing on the candidate region feature output by the feature extraction model; the second loss function is a loss function of a fine classification result of the candidate region obtained by extracting fine-grained features from the candidate region features output by the feature extraction model and performing image classification processing based on the extracted fine-grained features.
Optionally, extracting a candidate region from the image to be detected, and acquiring the characteristics of the candidate region; performing target detection processing based on the candidate region characteristics to obtain a first detection result, wherein the first detection result at least comprises a coarse classification result of the candidate region; and extracting fine-grained features from the candidate region features, and performing image classification processing based on the extracted fine-grained features to obtain a fine classification result of the candidate region, wherein the fine classification result comprises the following steps:
inputting an image to be detected into a pre-trained image target detection model combining coarse and fine classification, so that the feature extraction backbone network of the model extracts a candidate region from the image to be detected and obtains the candidate region features; the coarse classification network of the model performs target detection processing based on the candidate region features to obtain a first detection result; and at least one fine classification network of the model extracts fine-grained features from the candidate region features respectively, and performs image classification processing based on the extracted fine-grained features to obtain a fine classification result of the candidate region;
the feature extraction backbone network is obtained based on rough classification loss of the rough classification network and fine classification loss training of the at least one fine classification network, and has the capability of performing feature extraction on an image to be detected, generating a candidate region and extracting features of the candidate region;
the rough classification network has the capability of performing target detection processing on the candidate region based on the candidate region features output by the feature extraction backbone network, to obtain a first detection result; the first detection result comprises a coarse classification result of the candidate region and a position prediction result of the candidate region;
each fine classification network has the capability of extracting a fine-grained feature corresponding to a coarse class from the candidate region features output by the feature extraction backbone network, and classifying fine-grained images in the coarse class range based on the extracted fine-grained feature, so as to obtain a fine classification result of the candidate region.
Optionally, the training process of the image target detection model combining the coarse and fine classification includes:
acquiring a training sample image and an annotation label of the training sample image; the labeling labels of the training sample images comprise labeling labels corresponding to all target objects in the training sample images, the labeling labels of the target objects comprise category labels and labeling frame position labels, and the category labels of the target objects comprise rough labels of the target objects and fine labels of the target objects;
inputting the training sample image into an image target detection model combining coarse and fine classification to obtain a first detection result output by a coarse classification network and a fine classification result output by each fine classification network;
calculating and determining coarse classification loss and position regression loss according to a first detection result output by the coarse classification network, and a coarse label and a label frame position label in a label of a training sample image; calculating and determining the fine classification loss of each fine classification network according to the fine classification result output by each fine classification network and the fine classification label in the label of the training sample image;
and according to the coarse classification loss and the position regression loss, performing parameter correction on the feature extraction backbone network and the coarse classification network, and according to the fine classification loss of each fine classification network, performing parameter correction on the feature extraction backbone network and each fine classification network.
Optionally, the calculating and determining the fine classification loss of each fine classification network according to the fine classification result output by each fine classification network and the fine classification label in the label of the training sample image includes:
respectively determining target object labeling labels corresponding to the fine classification networks from the labeling labels of the training sample images; the label of the target object corresponding to the fine classification network is specifically a label of a coarse class target object belonging to the fine classification network for fine classification;
and calculating and determining the fine classification loss of each fine classification network according to the fine classification result output by each fine classification network and the fine classification label in the target object labeling label corresponding to each fine classification network.
Optionally, performing parameter correction on the feature extraction backbone network and the rough classification network according to the rough classification loss and the position regression loss, and performing parameter correction on the feature extraction backbone network and each fine classification network according to the fine classification loss of each fine classification network, including:
determining the overall loss of the image target detection model combined by the coarse classification and the fine classification based on the coarse classification loss, the position regression loss and the fine classification loss of each fine classification network;
and according to the coarse classification loss and the position regression loss, reverse parameter correction is carried out on the feature extraction backbone network and the coarse classification network, and according to the fine classification loss of each fine classification network, reverse parameter correction is carried out on the feature extraction backbone network and each fine classification network, so that the overall loss gradient of the image target detection model combined by the coarse classification and the fine classification is reduced.
Optionally, the determining the overall loss of the image target detection model combined by the coarse classification and the fine classification based on the coarse classification loss, the position regression loss and the fine classification loss of each fine classification network includes:
determining fine classification loss weights corresponding to the fine classification networks according to the class labels of the target objects in the training sample images; when all training sample images do not contain target objects belonging to a coarse class which is subjected to fine classification by a fine classification network, the fine classification loss weight corresponding to the fine classification network is 0; when all training sample images contain target objects belonging to a coarse class which is subjected to fine classification by a fine classification network, the fine classification loss weight corresponding to the fine classification network is a value other than 0;
and carrying out weighted summation on the fine classification losses of each fine classification network based on the fine classification loss weight corresponding to each fine classification network, and summing the weighted summation result of the fine classification losses of each fine classification network with the rough classification loss and the position regression loss to obtain the overall loss of the image target detection model combined with the rough classification.
Specifically, please refer to the content of the corresponding method embodiment for the specific working content of each module of the image target detection apparatus with the combination of the coarse and fine classifications, which is not described herein again.
Another embodiment of the present application further provides an image object detection apparatus combining coarse and fine classification, and as shown in fig. 8, the apparatus includes:
a memory 200 and a processor 210;
wherein, the memory 200 is connected to the processor 210 for storing programs;
the processor 210 is configured to execute the program stored in the memory 200 to implement the method for detecting an image object by combining coarse and fine classification disclosed in any of the above embodiments.
Specifically, the image target detection device combining the coarse and fine classification may further include: a bus, a communication interface 220, an input device 230, and an output device 240.
The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are connected to each other through a bus. Wherein:
a bus may include a path that transfers information between components of a computer system.
The processor 210 may be a general-purpose processor, such as a general-purpose Central Processing Unit (CPU) or a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the present invention. It may also be a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The processor 210 may include a main processor and may also include a baseband chip, modem, and the like.
The memory 200 stores programs for executing the technical solution of the present invention, and may also store an operating system and other key services. In particular, the program may include program code including computer operating instructions. More specifically, memory 200 may include a read-only memory (ROM), other types of static storage devices that may store static information and instructions, a Random Access Memory (RAM), other types of dynamic storage devices that may store information and instructions, a disk storage, a flash, and so forth.
The input device 230 may include a means for receiving data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 240 may include equipment that allows output of information to a user, such as a display screen, a printer, speakers, and the like.
Communication interface 220 may include any device that uses any transceiver or the like to communicate with other devices or communication networks, such as an ethernet network, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), etc.
The processor 210 executes the program stored in the memory 200 and invokes other devices, which can be used to implement the steps of the image object detection method combining coarse and fine classification provided in the embodiments of the present application.
Another embodiment of the present application further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the image object detection method combining coarse and fine classification provided in any of the above embodiments.
Specifically, the specific work content of each part of the image object detection device combining the coarse and fine classifications and the specific processing content of the computer program on the storage medium when being executed by the processor can refer to the content of each embodiment of the image object detection method combining the coarse and fine classifications, which is not described herein again.
While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present application is not limited by the order of acts or acts described, as some steps may occur in other orders or concurrently with other steps in accordance with the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The steps in the method of each embodiment of the present application may be sequentially adjusted, combined, and deleted according to actual needs, and technical features described in each embodiment may be replaced or combined.
The modules and sub-modules in the device and the terminal in the embodiments of the application can be combined, divided and deleted according to actual needs.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of a module or a sub-module is only one logical division, and there may be other divisions when the terminal is actually implemented, for example, a plurality of sub-modules or modules may be combined or integrated into another module, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules or sub-modules described as separate parts may or may not be physically separate, and parts that are modules or sub-modules may or may not be physical modules or sub-modules, may be located in one place, or may be distributed over a plurality of network modules or sub-modules. Some or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, each functional module or sub-module in the embodiments of the present application may be integrated into one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated into one module. The integrated modules or sub-modules may be implemented in the form of hardware, or may be implemented in the form of software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software cells may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. A method for detecting an image target by combining coarse and fine classification, characterized by comprising the following steps:
extracting a candidate region from an image to be detected, and acquiring the characteristics of the candidate region;
performing target detection processing based on the candidate region characteristics to obtain a first detection result, wherein the first detection result at least comprises a coarse classification result of the candidate region;
and extracting fine-grained features from the candidate region features, and carrying out image classification processing based on the extracted fine-grained features to obtain a fine classification result of the candidate region.
2. The method according to claim 1, wherein the extracting fine-grained features from the candidate region features and performing image classification processing based on the extracted fine-grained features to obtain a fine classification result for the candidate region, comprises:
and respectively extracting fine-grained features corresponding to each coarse class from the candidate region features, and classifying fine-grained images in a corresponding coarse class range based on the extracted fine-grained features to obtain a fine classification result of the candidate region.
3. The method of claim 2, wherein the step of extracting fine-grained features corresponding to each coarse class from the candidate region features respectively, and performing fine-grained image classification processing within a corresponding coarse class range based on the extracted fine-grained features to obtain a fine classification result for the candidate region comprises:
respectively inputting the characteristics of the candidate areas into at least one fine classification model trained in advance to obtain fine classification results of the candidate areas;
each fine classification model is respectively provided with the capability of extracting a fine-grained feature corresponding to a coarse class from the candidate region features and classifying fine-grained images in the coarse class range based on the extracted fine-grained feature so as to obtain a fine classification result of the candidate region.
4. The method of claim 2 or 3, wherein extracting fine-grained features corresponding to a coarse class from the candidate region features comprises:
corresponding to the coarse class, extracting, from the candidate region features, the feature vectors of the local feature maps of each candidate region, wherein each candidate region corresponds to a first number of subclasses and each subclass corresponds to a second number of local feature maps; wherein the first number of subclasses belong to the same coarse class;
respectively carrying out aggregation processing on the second quantity of local feature maps corresponding to each subclass in the extracted feature vectors to obtain the feature vectors corresponding to the first quantity of subclasses in each candidate region;
and performing bilinear pooling on the feature vectors of the first number of subclasses corresponding to each candidate area to obtain fine-grained features corresponding to the coarse classes.
5. The method of claim 4, wherein after extracting feature vectors of the local feature maps corresponding to a first number of sub-classes for each candidate region and a second number for each sub-class, the method further comprises:
and randomly selecting local feature maps with a set proportion from the second number of local feature maps corresponding to each subclass in the extracted feature vectors to be inactivated.
6. The method according to claim 1, wherein when there are a plurality of candidate regions extracted from the image to be detected, before extracting fine-grained features from the candidate region features and performing image classification processing based on the extracted fine-grained features to obtain a result of fine classification of the candidate regions, the method further comprises:
screening candidate regions extracted from the image to be detected according to the coarse classification result of each candidate region to obtain N candidate regions with the highest classification confidence;
and acquiring the feature vectors of the N candidate regions as the updated candidate region features.
7. The method according to claim 1, wherein the extracting candidate regions from the image to be detected and obtaining candidate region features comprises:
inputting an image to be detected into a feature extraction model, so that the feature extraction model extracts a candidate region from the image to be detected and obtains the feature of the candidate region;
the feature extraction model is obtained by training at least according to a first loss function and a second loss function;
the first loss function is a loss function of a coarse classification result of the candidate region, which is obtained by performing target detection processing on the candidate region feature output by the feature extraction model; the second loss function is a loss function of a fine classification result of the candidate region obtained by extracting fine-grained features from the candidate region features output by the feature extraction model and performing image classification processing based on the extracted fine-grained features.
8. The method according to claim 1, wherein extracting a candidate region from the image to be detected and obtaining candidate region features, performing target detection processing based on the candidate region features to obtain a first detection result that at least comprises a coarse classification result of the candidate region, and extracting fine-grained features from the candidate region features and performing image classification processing based on the extracted fine-grained features to obtain a fine classification result of the candidate region, comprises:
inputting the image to be detected into a pre-trained image target detection model, so that a feature extraction backbone network of the model extracts candidate regions from the image to be detected and obtains the candidate region features; a coarse classification network of the model performs target detection processing based on the candidate region features to obtain the first detection result; and at least one fine classification network of the model each extracts fine-grained features from the candidate region features and performs image classification processing based on the extracted fine-grained features to obtain the fine classification results of the candidate regions;
wherein the feature extraction backbone network is trained on the coarse classification loss of the coarse classification network and the fine classification losses of the at least one fine classification network, and has the capability of extracting features from the image to be detected, generating candidate regions, and extracting candidate region features;
the coarse classification network has the capability of performing target detection processing on the candidate region features output by the feature extraction backbone network to obtain the first detection result, the first detection result comprising the coarse classification result and the position prediction result of the candidate region;
and each fine classification network has the capability of extracting, from the candidate region features output by the feature extraction backbone network, fine-grained features corresponding to one coarse class, and of performing fine-grained image classification within that coarse class based on the extracted features, so as to obtain a fine classification result of the candidate region.
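For orientation, a skeleton of the three-part model of claim 8: one backbone, one coarse (detection) network, and one fine classification network per coarse class. Everything below the module boundaries is an assumption; the claim does not prescribe the implementation.

```python
import torch.nn as nn

class CoarseFineDetector(nn.Module):
    """Claim 8 skeleton: backbone -> coarse detection head + per-coarse-class fine heads."""
    def __init__(self, backbone, coarse_net, fine_nets):
        super().__init__()
        self.backbone = backbone                   # feature extraction backbone network
        self.coarse_net = coarse_net               # coarse classification + box regression
        self.fine_nets = nn.ModuleList(fine_nets)  # one fine network per coarse class

    def forward(self, image):
        region_feats = self.backbone(image)        # candidate region features
        first_result = self.coarse_net(region_feats)   # coarse classes + box predictions
        fine_results = [net(region_feats) for net in self.fine_nets]
        return first_result, fine_results
```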
9. The method of claim 8, wherein the training process of the image target detection model comprises:
acquiring training sample images and their annotation labels, wherein the annotation labels of a training sample image comprise a label for each target object in the image, the label of a target object comprises a category label and a bounding-box position label, and the category label comprises both a coarse label and a fine label of the target object;
inputting a training sample image into the image target detection model to obtain a first detection result output by the coarse classification network and a fine classification result output by each fine classification network;
computing the coarse classification loss and the position regression loss from the first detection result output by the coarse classification network and the coarse labels and bounding-box position labels of the training sample image, and computing the fine classification loss of each fine classification network from its fine classification result and the fine labels of the training sample image;
and correcting the parameters of the feature extraction backbone network and the coarse classification network according to the coarse classification loss and the position regression loss, and correcting the parameters of the feature extraction backbone network and each fine classification network according to the fine classification loss of that network.
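A sketch of one optimisation step under claim 9, reusing the CoarseFineDetector skeleton above. The loss functions and the layout of the label dictionary are assumptions; summing the losses and calling backward once realises the parameter corrections described, since each loss only reaches the subnetworks on its own computation path.

```python
def training_step(model, optimizer, image, labels,
                  coarse_loss_fn, box_loss_fn, fine_loss_fns):
    """Claim 9 sketch: coarse + box losses correct backbone and coarse net;
    each fine loss corrects the backbone and its own fine net."""
    first_result, fine_results = model(image)
    coarse_loss = coarse_loss_fn(first_result, labels["coarse"])   # coarse classification loss
    box_loss = box_loss_fn(first_result, labels["boxes"])          # position regression loss
    fine_losses = [fn(out, labels["fine"][i])                      # one loss per fine network
                   for i, (fn, out) in enumerate(zip(fine_loss_fns, fine_results))]
    total = coarse_loss + box_loss + sum(fine_losses)
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```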
10. The method of claim 9, wherein computing the fine classification loss of each fine classification network from its fine classification result and the fine labels of the training sample image comprises:
determining, from the annotation labels of the training sample image, the target object labels corresponding to each fine classification network, namely the labels of those target objects belonging to the coarse class that the fine classification network refines;
and computing the fine classification loss of each fine classification network from its fine classification result and the fine labels among its corresponding target object labels.
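A minimal sketch of the label routing in claim 10; the annotation record fields are hypothetical.

```python
def labels_for_fine_net(annotations, coarse_class):
    """Claim 10 sketch: keep only the target object labels whose coarse class
    matches the class this fine classification network refines."""
    return [ann for ann in annotations if ann["coarse_label"] == coarse_class]
```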
11. The method of claim 9, wherein correcting the parameters of the feature extraction backbone network and the coarse classification network according to the coarse classification loss and the position regression loss, and correcting the parameters of the feature extraction backbone network and each fine classification network according to the fine classification loss of each fine classification network, comprises:
determining an overall loss of the image target detection model based on the coarse classification loss, the position regression loss, and the fine classification losses of the fine classification networks;
and back-propagating the coarse classification loss and the position regression loss to correct the parameters of the feature extraction backbone network and the coarse classification network, and back-propagating the fine classification loss of each fine classification network to correct the parameters of the feature extraction backbone network and that fine classification network, so that the overall loss of the image target detection model decreases by gradient descent.
12. The method of claim 11, wherein determining the overall loss of the image target detection model based on the coarse classification loss, the position regression loss, and the fine classification losses of the fine classification networks comprises:
determining a fine classification loss weight for each fine classification network according to the category labels of the target objects in the training sample images, wherein, when none of the training sample images contains a target object belonging to the coarse class that a fine classification network refines, the fine classification loss weight of that network is 0, and when the training sample images do contain such a target object, the weight is nonzero;
and computing a weighted sum of the fine classification losses of the fine classification networks using these weights, then summing the result with the coarse classification loss and the position regression loss to obtain the overall loss of the image target detection model.
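A sketch of the claim 12 weighting rule: a fine network's loss is masked out when the current samples contain no target of its coarse class. Using weight 1.0 for the nonzero case is an assumption; the claim only requires it be nonzero.

```python
def overall_loss(coarse_loss, box_loss, fine_losses, batch_coarse_labels, fine_net_classes):
    """Claim 12 sketch: weight = 0 if no sample contains the fine net's coarse
    class, else a nonzero weight (1.0 assumed here)."""
    weighted = 0.0
    for loss, cls in zip(fine_losses, fine_net_classes):
        weight = 1.0 if cls in batch_coarse_labels else 0.0
        weighted = weighted + weight * loss
    return coarse_loss + box_loss + weighted   # overall loss of the detection model
```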
13. An image target detection device combining coarse and fine classification, comprising:
a feature extraction module, configured to extract a candidate region from an image to be detected and obtain candidate region features;
a coarse classification module, configured to perform target detection processing based on the candidate region features to obtain a first detection result that at least comprises a coarse classification result of the candidate region;
and a fine classification module, configured to extract fine-grained features from the candidate region features and perform image classification processing based on the extracted fine-grained features to obtain a fine classification result of the candidate region.
14. An image target detection apparatus combining coarse and fine classification, comprising:
a memory and a processor;
wherein the memory is connected to the processor and configured to store a program;
and the processor is configured to implement the image target detection method combining coarse and fine classification according to any one of claims 1 to 12 by running the program stored in the memory.
15. A storage medium having stored thereon a computer program which, when executed by a processor, implements the image target detection method combining coarse and fine classification according to any one of claims 1 to 12.
CN202111338401.4A 2021-11-12 2021-11-12 Image target detection method combining coarse and fine classification, and related device Active CN113780256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111338401.4A CN113780256B (en) 2021-11-12 2021-11-12 Image target detection method combining coarse and fine classification, and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111338401.4A CN113780256B (en) 2021-11-12 2021-11-12 Image target detection method combining coarse and fine classification, and related device

Publications (2)

Publication Number Publication Date
CN113780256A true CN113780256A (en) 2021-12-10
CN113780256B CN113780256B (en) 2022-03-15

Family

ID=78873875

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111338401.4A Active CN113780256B (en) Image target detection method combining coarse and fine classification, and related device

Country Status (1)

Country Link
CN (1) CN113780256B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008899B (en) * 2019-04-02 2021-02-26 北京市遥感信息研究所 Method for extracting and classifying candidate targets of visible light remote sensing image
CN113095373A (en) * 2021-03-22 2021-07-09 南京邮电大学 Ship detection method and system based on self-adaptive position prediction and capable of detecting any rotation angle

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TANG Rongnian et al., "Coarse iris classification based on histogram statistical features", Opto-Electronic Engineering *
ZHAO Haoru et al., "Research on fine-grained image classification algorithms based on RPN and B-CNN", Computer Applications and Software *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113963220A (en) * 2021-12-22 2022-01-21 熵基科技股份有限公司 Security check image classification model training method, security check image classification method and device
CN114004838A (en) * 2022-01-04 2022-02-01 深圳比特微电子科技有限公司 Target class identification method, training method and readable storage medium
CN114004838B (en) * 2022-01-04 2022-04-12 深圳比特微电子科技有限公司 Target class identification method, training method and readable storage medium
CN114419349A (en) * 2022-03-30 2022-04-29 中国科学技术大学 Image matching method and device
CN115527070A (en) * 2022-11-01 2022-12-27 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Target detection method, device and equipment based on traffic scene and storage medium
CN116071773A (en) * 2023-03-15 2023-05-05 广东电网有限责任公司东莞供电局 Method, device, medium and equipment for detecting form in power grid construction type archive

Also Published As

Publication number Publication date
CN113780256B (en) 2022-03-15

Similar Documents

Publication Publication Date Title
CN113780256B (en) Image target detection method combining coarse and fine classification, and related device
CN108460389B (en) Type prediction method and device for identifying object in image and electronic equipment
Chen et al. Accurate and efficient traffic sign detection using discriminative adaboost and support vector regression
US10332266B2 (en) Method and device for traffic sign recognition
US10204283B2 (en) Image recognizing apparatus, image recognizing method, and storage medium
EP4035064B1 (en) Object detection based on pixel differences
CN113361495B (en) Method, device, equipment and storage medium for calculating similarity of face images
KR101896357B1 (en) Method, device and program for detecting an object
Liu et al. A novel cell detection method using deep convolutional neural network and maximum-weight independent set
CN111062885B (en) Mark detection model training and mark detection method based on multi-stage transfer learning
JP4588575B2 (en) Method, apparatus and program for detecting multiple objects in digital image
Demir et al. Clustering-based extraction of border training patterns for accurate SVM classification of hyperspectral images
CN111476319B (en) Commodity recommendation method, commodity recommendation device, storage medium and computing equipment
CN115631344A (en) Target detection method based on feature adaptive aggregation
CN107392105B (en) Expression recognition method based on reverse collaborative salient region features
KR101906663B1 (en) Collaborative facial color feature learning method of multiple color spaces for face recognition and apparatus therefor
Kowkabi et al. Hybrid preprocessing algorithm for endmember extraction using clustering, over-segmentation, and local entropy criterion
CN110659631A (en) License plate recognition method and terminal equipment
CN111476226B (en) Text positioning method and device and model training method
Taskin et al. Earthquake-induced damage classification from postearthquake satellite image using spectral and spatial features with support vector selection and adaptation
CN113537158B (en) Image target detection method, device, equipment and storage medium
Henriques et al. Classification of multispectral images in coral environments using a hybrid of classifier ensembles
Wang et al. Cascading classifier with discriminative multi-features for a specific 3D object real-time detection
Rabe et al. Simplifying Support Vector Machines for classification of hyperspectral imagery and selection of relevant features
Balasubramaniam et al. Active learning-based optimized training library generation for object-oriented image classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant