CN116524263A - Semi-automatic labeling method for fine-grained images - Google Patents
- Publication number: CN116524263A
- Application number: CN202310485129.5A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06N20/00—Machine learning
- G06V10/42—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
- G06V10/761—Proximity, similarity or dissimilarity measures
- G06V10/762—Arrangements using clustering, e.g. of similar faces in social networks
- G06V10/774—Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a semi-automatic labeling method for fine-grained images, comprising the following steps: acquiring a fine-grained image dataset to be labeled; manually labeling each fine-grained image to be labeled in the dataset; and inputting the manually labeled fine-grained images into a preset machine learning model, which outputs the labeled fine-grained images. The preset machine learning model comprises a K-Means clustering model built on SIFT features of labeled images and a hierarchical classification model trained on the labeled dataset. The hierarchical classification training method first extracts coarse-grained class information from the fine-grained class information, then trains the fine-grained and coarse-grained branches in parallel, and finally relies on image features stabilized by a BatchNorm layer to better distinguish images that differ only locally, greatly improving the accuracy of classification prediction.
Description
Technical Field
The invention relates to a semi-automatic labeling method for fine-grained images and belongs to the technical field of target detection in computer vision.
Background
Image annotation, also known as image tagging, is a technique for processing the meta-information of digital images. Such meta-information may take various forms, such as titles or keywords, and helps organize and locate images within a dataset. Without meta-information, retrieving a specific image from an image dataset alone is very difficult. By adding meta-information to each image in the dataset, the required images can be quickly retrieved and located, which greatly facilitates image annotation for the user.
Image labeling techniques currently fall into three modes: manual labeling, semi-automatic labeling, and automatic labeling. Because manual labeling is time-consuming and labor-intensive, semi-automatic and automatic labeling have become research hot spots. However, automatic labeling still requires manual assistance to correct labeling errors, because the images retrieved by an algorithm sometimes deviate greatly from what people expect. Semi-automatic labeling, which combines manual and machine effort, is therefore becoming the mainstream approach.
A conventional semi-automatic image labeling method generally comprises the following key steps: first, train a deep neural network classifier on images with known meta-information; then, use the classifier to predict classes for unlabeled images without known meta-information; finally, manually correct the predicted classes, thereby establishing the correspondence between the images to be labeled and the known meta-information.
However, this semi-automatic labeling method has an obvious drawback: the set of known meta-information for the images to be labeled is fixed, and new concepts cannot be extended from it to annotate never-before-seen images. This makes image annotation systems that rely on a fixed classification unsuitable for fine-grained labeling scenarios.
In fine-grained labeling scenarios, the number of categories is often very large, and distinguishing small local differences between categories becomes the key to labeling quality. Moreover, under interference factors such as occlusion, lighting, and shooting angle, the human eye finds it even harder to tell fine-grained images apart, so fine-grained image annotation has long been a challenging task in the field of computer vision.
A new image labeling method therefore needs to be developed. In the field of electronic retail in particular, there are often tens of thousands of commodity categories, and new commodities keep emerging, which makes image annotation in this field a serious challenge. This patent makes corresponding improvements for problems common in fine-grained labeling tasks: the labeled subject is easily interfered with by background noise, semi-automatic labeling algorithms under-utilize meta-information, and local differences between fine-grained categories are hard to distinguish.
There are many conventional semi-automatic image labeling methods. One is a semi-automatic image labeling method based on online learning (CN 202110177362.8), which addresses the time cost of manually preparing training data for target detection by labeling while learning. However, its label category predictions become unreliable as the number of categories grows to fine-grained levels. Another semi-automatic labeling method, apparatus, medium, and device (CN 202010617664.8) generates labeled candidate regions for the user to select from based on picture similarity scores, but it can predict only one category at a time and cannot reveal other potentially similar categories. The present invention provides a semi-automatic labeling method for fine-grained images that combines the advantages of traditional image algorithms and deep learning and can accurately extract image features. A fine-grained classification task can be completed with only one round of manual labeling, and a label system is introduced as a memory aid, so that similar pictures can easily be distinguished manually without repeatedly screening wrong classifications.
Disclosure of Invention
The invention aims to provide a semi-automatic labeling method for fine-grained images, to remedy the defect that prior-art image labeling systems relying on a fixed classification are unsuitable for fine-grained labeling scenarios.
A method for semi-automatic labeling of fine-grained images, the method comprising:
acquiring a fine-grained image data set to be marked;
manually labeling each fine-grained image to be labeled in the fine-grained image data set;
inputting the manually marked fine-grained images into a preset machine learning model, and outputting the marked fine-grained images;
the preset machine learning model comprises a K-Means clustering model constructed based on SIFT features of the marked image and a hierarchical classification model trained by using the marked data set.
Further, the preset machine learning model is trained on labeled samples, where the labeled samples are a manually labeled fine-grained image dataset; the training method comprises the following steps:
constructing an initial machine learning model;
manually labeling each fine-grained image to be labeled in the fine-grained image data set;
inputting the manually marked fine-grained images into an initial machine learning model, and calculating the similarity between the candidate category and the current marked area;
sorting the candidate categories by similarity, comparing the sorted results with the current labeled region, and selecting the category most similar to the one to be labeled;
and putting the labeling result into a training data set of the model, and iteratively updating the initial machine learning model to obtain a preset machine learning model.
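The iterative update in the steps above can be sketched as a small human-in-the-loop routine; `model_predict` and `confirm` are hypothetical callables standing in for the preset model and the annotator (an illustrative sketch, not the patented implementation):

```python
def labeling_round(unlabeled_regions, model_predict, confirm, train_set):
    """One semi-automatic round: the model ranks candidate categories for
    each region by similarity, a human confirms or corrects the choice,
    and the confirmed pair joins the training set for the next update."""
    for region in unlabeled_regions:
        candidates = model_predict(region)   # candidate categories, most similar first
        label = confirm(region, candidates)  # annotator picks or corrects
        train_set.append((region, label))
    return train_set
```

After each round, the model is retrained (or fine-tuned) on the grown training set, which is what turns the initial model into the preset model.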
Further, the machine learning model includes a K-Means clustering model and a hierarchical classification model:
the K-Means clustering model consists of K center points and the feature subspaces they divide, wherein the K center points are obtained by running the K-Means clustering algorithm on SIFT features extracted from labeled regions, and the K feature subspaces are formed by partitioning the labeled regions' SIFT features according to their distances to the K center points;
the hierarchical classification model adopts ResNet as a backbone network for extracting features of labeled regions, and connects two branches after the pooling layer to train image features at different levels, one for the fine-grained level and the other for the coarse-grained level.
Further, each branch consists of a Self-Attention layer, a BatchNorm layer, and a fully connected layer, connected in sequence from front to back;
the Self-Attention layer extracts the salient parts of the features; the BatchNorm layer balances the scales of the features; and the fully connected layer classifies the features. After the fully connected layer, the hierarchical classification model uses a cross entropy loss function, a Focal loss function, and an ArcFace loss function: the cross entropy loss maximizes the predicted probability of the correct class, the Focal loss handles classification on imbalanced datasets, and the ArcFace loss increases the angular-space separation between features of different classes.
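A minimal numpy sketch of the three losses just named, for a single sample; the ArcFace scale s=30 and margin m=0.5 are common defaults assumed here, not values stated in the patent:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, y):
    """Maximizes the predicted probability of the correct class y."""
    return -np.log(softmax(logits)[y])

def focal_loss(logits, y, gamma=2.0):
    """Down-weights easy examples to handle imbalanced datasets."""
    p = softmax(logits)[y]
    return -((1.0 - p) ** gamma) * np.log(p)

def arcface_logits(feat, weights, y, s=30.0, m=0.5):
    """Adds an angular margin m to the target-class angle and rescales by s,
    increasing angular separation between classes before cross entropy."""
    f = feat / np.linalg.norm(feat)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = w @ f
    theta = np.arccos(np.clip(cos[y], -1.0 + 1e-7, 1.0 - 1e-7))
    logits = cos.copy()
    logits[y] = np.cos(theta + m)
    return s * logits
```

In training, the ArcFace logits would be fed into the cross-entropy loss, and the three losses summed (weighting between them is not specified in the text).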
Further, the training method of the K-Means clustering model comprises the following steps:
extracting SIFT features from all marked areas, and generating K center points by adopting a K-Means clustering algorithm;
dividing the SIFT features of all marked areas into K feature subspaces according to distances from K center points;
the K center points and K feature subspaces form an initial K-Means clustering model;
after determining the category of SIFT features of the region to be marked, merging the SIFT features with the SIFT features of the marked region, and updating K center points and K feature subspaces of an initial K-Means clustering model to obtain the K-Means clustering model.
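The clustering steps above can be sketched as follows; a minimal Lloyd's K-Means in numpy stands in for the clustering of SIFT descriptors (a real pipeline would extract 128-dimensional descriptors, e.g. with an OpenCV SIFT extractor, here replaced by toy 2-D points):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's K-Means: returns K center points and the feature
    subspace (cluster) label of each descriptor in X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def nearest_subspace(desc, centers):
    """Feature subspace a new descriptor falls into: index of nearest center."""
    return int(np.linalg.norm(centers - desc, axis=1).argmin())
```

Updating the model after a region's category is confirmed then amounts to appending its descriptors to X and re-running (or warm-starting) the clustering.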
Further, the training phase of the hierarchical classification model comprises the following steps:
all marked areas form a marked data set;
training a pre-training network for the marked data set by adopting a hierarchical classification model;
after the category of the region to be marked is determined, the region to be marked and the marked region are combined to be used as a new marked data set to iteratively train a pre-training network, and a hierarchical classification model is obtained.
The dataset generated by early-stage labeling is trained using a ResNet model initialized with ImageNet pre-trained parameters as the backbone network, combined with a custom detection head.
Further, different sample recommendation algorithms are adopted depending on the number of labeled samples, and a tag information search algorithm is introduced to improve the fit between retrieved labeled samples and the target region.
Further, the sample recommendation algorithm comprises two stages:
Early-stage recommendation: in the early stage, the sample recommendation algorithm builds a K-Means clustering model based on SIFT features of the labeled images, then recommends the labeled samples most similar to the target region according to this model.
Late-stage recommendation: in the later stage, the sample recommendation algorithm works in a deep-learning manner, training a hierarchical classification model on the labeled dataset, predicting the unlabeled data with this model, and selecting the several categories with the highest predicted probability as sample recommendations.
Compared with the prior art, the invention has the beneficial effects that:
1. Suppressing interference of background noise with the labeled subject: a background segmentation technique based on the Self-Attention mechanism effectively separates background from subject and thereby suppresses noise interference. This technique applies different attention weights to different parts of an image, extracting the subject more accurately. Repeated experiments and verification show higher accuracy and stability across a variety of scenes, with labeling results clearly better than traditional methods;
2. Fully utilizing the semantic information of known categories: the invention assists human memory through image retrieval and a label system. A model trained on existing data retrieves categories and tags for unlabeled data by similarity; the model's predictions are manually corrected and recorded, and the model is then updated together with the manual corrections and supplementary labels. Making full use of the semantic information of known categories improves the generalization ability of the model, greatly reduces the burden of manual labeling, and lets the model adapt better to a continuously changing dataset;
3. The invention provides a hierarchical classification training method: coarse-grained class information is first extracted from the fine-grained class information, the fine-grained and coarse-grained branches are then trained in parallel, and finally the image features stabilized by the BatchNorm layer are used to better distinguish local differences between images, greatly improving the accuracy of classification prediction.
Drawings
FIG. 1 is a front-end and back-end interaction flow diagram of the present invention;
FIG. 2 is a schematic diagram of a system interface of the present invention;
FIG. 3 is a schematic diagram of the early labeling algorithm of the present invention;
FIG. 4 is a schematic diagram of a post-labeling algorithm of the present invention;
FIG. 5 is a schematic diagram of a feature search of the present invention;
FIG. 6 is a schematic diagram of a tag information retrieval algorithm of the present invention;
FIG. 7 is a flowchart of the labeling operation of the present invention;
FIG. 8 is a post-labeling training model of the present invention.
Detailed Description
The invention is further described below with reference to the detailed description, so that the technical means, creative features, objectives, and effects of the invention are easy to understand.
The invention discloses a semi-automatic labeling method for fine-grained images that recommends similar labeled samples for the target region a user is labeling, attaching category information and tag information so the user can determine the category of the labeled region. The key components are a labeled-sample recommendation algorithm and a tag information search algorithm:
the labeled sample recommendation algorithm comprises a pre-labeling algorithm and a post-labeling algorithm, firstly extracting the characteristics of a target area which is labeled currently by a user, comparing the characteristics with the characteristics of the labeled sample one by one, and recommending similar categories for the user according to the sequence of the similarities.
The principle of the early-stage labeling algorithm is shown in fig. 3, and the model's feature retrieval process in fig. 5. Early-stage labeling suits scenarios where a single person produces a small number of labeled samples, and no expensive equipment such as a GPU is needed. In the early labeling stage, when the number of labeled regions is below a certain threshold, the sample recommendation algorithm builds a K-Means clustering model based on SIFT features of the labeled regions.
In the model's training stage, SIFT features are first extracted from all labeled regions, and K center points are generated with the K-Means clustering algorithm; the SIFT features of all labeled regions are then divided into K feature subspaces by their distances to the K center points, and the K center points and K feature subspaces form the initial K-Means clustering model; finally, once the category of a to-be-labeled region's SIFT features is determined, those features are merged with the labeled regions' SIFT features, and the K center points and K feature subspaces of the initial model are updated to obtain the K-Means clustering model.
During early-stage labeling, first determine which center point is nearest to the SIFT features of the region to be labeled, i.e., which feature subspace they belong to; second, compare the region's SIFT features one by one with all SIFT features in that subspace using the FLANN matching algorithm; finally, select the N nearest SIFT features in the subspace and convert them into categories via a hash table; these are the possible categories of the region to be labeled.
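A sketch of the subspace matching step, with brute-force nearest-neighbour search standing in for FLANN and a plain dict playing the role of the hash table from descriptor index to category (an illustrative assumption, not the patented implementation):

```python
import numpy as np

def match_categories(query_descs, subspace_descs, desc_to_cat, top_n=5):
    """For each query descriptor, find its top_n nearest descriptors inside
    the chosen feature subspace, map them to categories via desc_to_cat,
    and return candidate categories ordered by vote count."""
    votes = {}
    for q in query_descs:
        dists = np.linalg.norm(subspace_descs - q, axis=1)
        for i in np.argsort(dists)[:top_n]:
            cat = desc_to_cat[int(i)]
            votes[cat] = votes.get(cat, 0) + 1
    return sorted(votes, key=votes.get, reverse=True)
```

Brute force is O(n) per query; FLANN's approximate k-d tree index serves the same role at scale.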
The principle of the late-stage labeling algorithm is shown in fig. 4, its training model in fig. 8, and the model's feature retrieval process in fig. 5. Late-stage labeling suits scenarios where many people collaborate to produce a large number of labeled samples, and requires a GPU deployed on a server for training. In the later labeling stage, when the number of labeled regions exceeds a certain threshold, the sample recommendation algorithm is based on a hierarchical classification model trained on the labeled dataset.
The hierarchical classification model adopts ResNet as the backbone network for extracting features of labeled regions. After the pooling layer, the model connects two branches used to train image features at different levels, one for the fine-grained level and the other for the coarse-grained level. Each branch consists of a Self-Attention layer, a BatchNorm layer, and a fully connected layer connected in sequence. The Self-Attention layer extracts the salient parts of the features and effectively suppresses noise interference; the BatchNorm layer balances the differing scales of the features; and the fully connected layer classifies the features. After the fully connected layer, the model uses three different loss functions: a cross entropy loss, a Focal loss, and an ArcFace loss. The cross entropy loss maximizes the predicted probability of the correct class, the Focal loss addresses classification on imbalanced datasets, and the ArcFace loss increases the angular-space separation between features of different classes.
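The forward pass of one branch can be sketched in numpy as follows, using identity Q/K/V projections in the Self-Attention for brevity (a real branch would learn these projections and the FC weights, and the tokens would come from the ResNet feature map):

```python
import numpy as np

def self_attention(X):
    """Single-head dot-product self-attention over spatial tokens, with
    identity Q/K/V projections for brevity (real branches learn them)."""
    scores = X @ X.T / np.sqrt(X.shape[1])
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ X

def batch_norm(X, eps=1e-5):
    """Normalizes each feature dimension across the batch to balance scales."""
    return (X - X.mean(axis=0)) / np.sqrt(X.var(axis=0) + eps)

def branch_forward(feats, W_fc):
    """One branch: Self-Attention over each region's tokens, mean pooling,
    BatchNorm across the batch, then a fully connected layer to logits."""
    pooled = np.stack([self_attention(f).mean(axis=0) for f in feats])
    return batch_norm(pooled) @ W_fc
```

The fine-grained and coarse-grained branches would share the backbone features but each hold their own attention and FC parameters.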
In the model's training stage, the hierarchical classification model is trained on the labeled dataset. A recommended initial learning rate lies between 0.0001 and 0.001 and is reduced by a factor of 10 every 20 epochs; stochastic gradient descent (SGD) can be used as the optimizer, and a recommended batch size lies between 64 and 256.
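The learning-rate schedule just described, sketched as a function of the epoch; the base rate 3e-4 is an arbitrary choice from the recommended [1e-4, 1e-3] interval:

```python
def step_lr(epoch, base_lr=3e-4, drop=0.1, every=20):
    """Learning rate for a given epoch: start at base_lr (chosen here from
    the recommended [1e-4, 1e-3] range) and divide by 10 every 20 epochs."""
    return base_lr * drop ** (epoch // every)
```

With SGD, the optimizer would simply read `step_lr(epoch)` at the start of each epoch.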
During late-stage labeling, first remove the two branches after the pre-trained network's pooling layer and use the remaining network to extract pooling-layer features of the region to be labeled; second, compare the region's pooling-layer features one by one with those of all labeled regions by cosine similarity; finally, select the N nearest pooling-layer features among the labeled regions and convert them into categories via a hash table; these are the possible categories of the region to be labeled.
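The cosine-similarity retrieval step can be sketched as follows; a plain category list plays the role of the hash table mapping feature indices to categories (an illustrative assumption):

```python
import numpy as np

def top_n_categories(query_feat, labeled_feats, labeled_cats, n=3):
    """Cosine similarity between the query region's pooling-layer feature
    and every labeled region's feature; the nearest features are converted
    to categories by index lookup, preserving rank order, duplicates dropped."""
    q = query_feat / np.linalg.norm(query_feat)
    F = labeled_feats / np.linalg.norm(labeled_feats, axis=1, keepdims=True)
    ranked = []
    for i in np.argsort(-(F @ q)):
        if labeled_cats[i] not in ranked:
            ranked.append(labeled_cats[i])
        if len(ranked) == n:
            break
    return ranked
```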
The tag information search algorithm attaches tag information to labeled samples and, through tag retrieval, helps the user filter the labeled samples that best fit the current target region, improving the accuracy of category identification.
The tag information search algorithm is shown in fig. 6. The tags of labeled samples are organized in advance into a tag set as an inverted file: each tag serves as a primary key, and the labeled samples under that tag are arranged into a list as the key's value. Then, for the tags the user enters, the primary keys of the inverted file are traversed; if a key is identical or similar, the labeled samples under it are retrieved, preserving the ordering produced by the labeled-sample recommendation algorithm. Tag similarity can be compared with the TF-IDF algorithm.
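A minimal sketch of the inverted file and a simplified TF-IDF scoring (tag lists per sample are short, so the term-frequency part is effectively 1 and only IDF is used; this simplification is an assumption):

```python
import math

def build_inverted_index(samples):
    """samples: {sample_id: [tags]} -> inverted file {tag: [sample_ids]}."""
    index = {}
    for sid, tags in samples.items():
        for t in tags:
            index.setdefault(t, []).append(sid)
    return index

def search_by_tags(index, query_tags, n_samples):
    """Scores each labeled sample by the summed IDF of its matching tags and
    returns sample ids ordered by score (rarer tags weigh more)."""
    scores = {}
    for t in query_tags:
        for sid in index.get(t, []):
            scores[sid] = scores.get(sid, 0.0) + math.log(n_samples / len(index[t]))
    return sorted(scores, key=scores.get, reverse=True)
```

A production version would also fold in approximate key matching for "similar" tags, which this sketch omits.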
The labeling flow of the present invention is shown in FIG. 7. A round of labeling comprises multiple labeling cycles; one cycle is shown in fig. 7. Within a cycle, several annotators receive unlabeled datasets and complete labeling tasks on their own hosts. During labeling, besides identifying target regions using known categories, an annotator who finds a new, unknown category names it with descriptive nouns (such as colors and shapes) and feeds it back to the inspector. The inspector gathers the samples uploaded by each annotator through the management server, judges the quality of the fed-back samples and whether their categories already exist in the original dataset, and in the meantime adjusts the model and maintains the dataset, so that the generated category dictionary serves as a reference when annotators identify target regions. After the whole labeling process finishes, the inspector sorts the non-duplicated categories out of the dataset, which helps check the dataset for wrongly labeled or missed samples, and performs the final category correction.
The method comprises the following specific steps:
step 1, acquiring a fine-grained image dataset to be marked;
step 2, manually labeling each fine-grained image to be labeled in the fine-grained image data set;
step 3, inputting the manually marked fine-grained images into a preset machine learning model, and outputting the marked fine-grained images;
the preset machine learning model comprises a K-Means clustering model constructed based on SIFT features of marked images and a hierarchical classification model trained by using marked data sets, wherein the preset machine learning model is obtained by training on the basis of marked samples, and the marked samples are manually marked fine-grained image data sets.
The specific training method of the machine learning model comprises the following steps:
constructing an initial machine learning model;
manually labeling each fine-grained image to be labeled in the fine-grained image data set;
inputting the manually marked fine-grained images into an initial machine learning model, and calculating the similarity between the candidate category and the current marked area;
sorting the candidate categories by similarity, comparing the sorted results with the current labeled region, and selecting the category most similar to the one to be labeled;
and putting the labeling result into a training data set of the model, and iteratively updating the initial machine learning model to obtain a preset machine learning model.
The machine learning model comprises a K-Means clustering model and a hierarchical classification model:
the K-Means clustering model consists of K center points and the feature subspaces they divide, wherein the K center points are obtained by running the K-Means clustering algorithm on SIFT features extracted from labeled regions, and the K feature subspaces are formed by partitioning the labeled regions' SIFT features according to their distances to the K center points. The K center points and K feature subspaces form a K-ary tree structure, which makes it fast to determine which labeled-region SIFT features are similar to those of the region to be labeled, and thus to determine the region's possible categories.
The hierarchical classification model adopts ResNet as a backbone network for extracting features of labeled regions, and connects two branches after the pooling layer to train image features at different levels, one for the fine-grained level and the other for the coarse-grained level; each branch consists of a Self-Attention layer, a BatchNorm layer, and a fully connected layer connected in sequence from front to back;
the Self-Attention layer extracts the salient parts of the features; the BatchNorm layer balances the scales of the features; and the fully connected layer classifies the features. After the fully connected layer, the hierarchical classification model uses a cross entropy loss function, a Focal loss function, and an ArcFace loss function: the cross entropy loss maximizes the predicted probability of the correct class, the Focal loss handles classification on imbalanced datasets, and the ArcFace loss increases the angular-space separation between features of different classes.
The training method of the K-Means clustering model comprises the following steps:
extracting SIFT features from all marked areas, and generating K center points by adopting a K-Means clustering algorithm;
dividing the SIFT features of all marked areas into K feature subspaces according to distances from K center points;
an initial K-Means clustering model is formed through K center points and K characteristic subspaces;
after determining the category of SIFT features of the region to be marked, merging the SIFT features with the SIFT features of the marked region, and updating K center points and K feature subspaces of an initial K-Means clustering model to obtain the K-Means clustering model.
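The update step above might look like the following sketch, which simply merges the newly confirmed descriptors into the marked pool and refits; the data, dimensions, and K are hypothetical:

```python
# Update of the K-Means clustering model: once the category of a
# to-be-marked region's SIFT features is confirmed, those features are
# merged into the marked pool and the K centers / subspaces recomputed.
import numpy as np
from sklearn.cluster import KMeans

def update_kmeans_model(labeled_sift, new_sift, K=8):
    """Merge newly confirmed descriptors and refit the K-Means model."""
    merged = np.vstack([labeled_sift, new_sift])
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(merged)
    subspaces = {k: merged[km.labels_ == k] for k in range(K)}
    return km.cluster_centers_, subspaces, merged

rng = np.random.default_rng(1)
old = rng.normal(size=(300, 128))     # previously marked descriptors
new = rng.normal(size=(20, 128))      # newly confirmed descriptors
centers, subspaces, pool = update_kmeans_model(old, new)
```

A full refit is the simplest reading of the description; an incremental center update would also be possible but is not specified.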
The training phase of the hierarchical classification model comprises the following steps:
all marked areas form a marked data set;
training a pre-trained network on the marked data set with the hierarchical classification structure;
after the category of the region to be marked is determined, the region to be marked and the marked region are combined to be used as a new marked data set to iteratively train a pre-training network, and a hierarchical classification model is obtained.
The data set generated by early-stage annotation is used to train a ResNet model initialized with ImageNet pre-trained parameters as the backbone network, combined with a custom detection head.
In the invention, different sample recommendation algorithms are adopted depending on the number of labeled samples, and a label information search algorithm is introduced so that the retrieved labeled samples fit the target region more closely. The sample recommendation algorithm comprises the following steps:
Early-stage recommendation algorithm: in the early stage of labeling, when the number of marked regions is below a certain threshold, the sample recommendation algorithm builds a K-Means clustering model from the SIFT features of the marked regions. First, it determines which center point is nearest to the SIFT features of the region to be marked, i.e., which feature subspace those features belong to; second, it compares the SIFT features of the region to be marked one by one with all SIFT features in that subspace using the FLANN matching algorithm; finally, it selects the N nearest SIFT features in the subspace and converts them into categories via a hash table, giving the possible categories of the region to be marked;
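The three steps above can be sketched as follows; a NumPy brute-force nearest-neighbor search stands in for FLANN, and the descriptors, category ids, K, and N are all hypothetical:

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
labeled_sift = rng.normal(size=(200, 128))   # marked-region descriptors
labels = rng.integers(0, 5, size=200)        # hash table: descriptor -> category id

K, N = 4, 10
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(labeled_sift)

def recommend_early(query):
    # Step 1: route the query to its nearest center / feature subspace.
    k = int(km.predict(query[None])[0])
    idx = np.where(km.labels_ == k)[0]
    # Step 2: one-by-one comparison within the subspace (FLANN stand-in).
    d = np.linalg.norm(labeled_sift[idx] - query, axis=1)
    # Step 3: N nearest descriptors, converted to candidate categories.
    top = idx[np.argsort(d)[:N]]
    return Counter(labels[top]).most_common()

cands = recommend_early(rng.normal(size=128))
```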
Late-stage recommendation algorithm: in the later stage of labeling, when the number of marked regions exceeds a certain threshold, the sample recommendation algorithm is based on the hierarchical classification model trained on the marked data set. First, the two branches after the pooling layer of the pre-trained network are removed, and the remaining network extracts the pooling-layer features of the region to be marked; second, these features are compared one by one with the pooling-layer features of all marked regions by cosine similarity; finally, the N nearest pooling-layer features among the marked regions are selected and converted into categories via a hash table, giving the possible categories of the region to be marked.
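Likewise, the late-stage cosine-similarity comparison can be sketched as follows; the pooled features, labels, and N are random hypothetical stand-ins:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)
labeled_feats = rng.normal(size=(100, 256))  # pooling-layer features of marked regions
labels = rng.integers(0, 8, size=100)        # hash table: feature -> category id

def recommend_late(query, N=5):
    # Cosine similarity = dot product of L2-normalized vectors.
    a = labeled_feats / np.linalg.norm(labeled_feats, axis=1, keepdims=True)
    b = query / np.linalg.norm(query)
    sim = a @ b
    top = np.argsort(-sim)[:N]               # N most similar marked regions
    return Counter(labels[top]).most_common()

cands = recommend_late(rng.normal(size=256))
```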
Compared with traditional fine-grained image semi-automatic labeling methods, the invention uses different sample recommendation algorithms in the early and later stages according to the number of labeled samples, which is more comprehensive, and introduces a label information search algorithm so that the retrieved labeled samples match the appearance of the target region more closely. The corresponding interface operation and labeling flow of the invention are as follows:
the interface operation of the present invention is shown in fig. 2. The user can type a category name or several label names into the "edit text" field to search, helping narrow the sample categories recommended by the semi-automatic labeling method. If the category or label name of the target region has already been determined, the user can type it into the "edit text" field to add it, or click a sample thumbnail in the table below, whereupon the system automatically displays the category name in the "edit text" column. Below the "edit text" field is a table of samples arranged in descending order of similarity, from left to right and top to bottom; each sample contains three pieces of information (thumbnail, category, and label), which serve as a reference for the user when deciding the labeling information of the target region.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.
Claims (8)
1. A method for semi-automatically labeling fine-grained images, the method comprising:
acquiring a fine-grained image data set to be marked;
manually labeling each fine-grained image to be labeled in the fine-grained image data set;
inputting the manually marked fine-grained images into a preset machine learning model, and outputting the marked fine-grained images;
the preset machine learning model comprises a K-Means clustering model constructed based on SIFT features of the marked image and a hierarchical classification model trained by using the marked data set.
2. The method for semi-automatically labeling fine-grained images according to claim 1, wherein the preset machine learning model is obtained by training on labeling samples, the labeling samples being manually labeled fine-grained image data sets, and the training method comprises:
constructing an initial machine learning model;
manually labeling each fine-grained image to be labeled in the fine-grained image data set;
inputting the manually marked fine-grained images into an initial machine learning model, and calculating the similarity between the candidate category and the current marked area;
sorting the candidate categories by similarity, comparing the sorted results with the current labeling area, and selecting the category most similar to the area to be labeled;
and putting the labeling result into a training data set of the model, and iteratively updating the initial machine learning model to obtain a preset machine learning model.
3. The fine grain image semiautomatic labeling method of claim 1, wherein the machine learning model comprises a K-Means clustering model and a hierarchical classification model:
the K-Means clustering model consists of K center points and the feature subspaces they divide, wherein the K center points are obtained by running the K-Means clustering algorithm on SIFT features extracted from the marked regions, and the K feature subspaces are formed by assigning each SIFT feature of the marked regions to its nearest center point;
the hierarchical classification model adopts ResNet as the backbone network for extracting features of the marked regions, and connects two branches after the pooling layer to train image features at different levels, one branch for the fine-grained level and the other for the coarse-grained level.
4. The method for semi-automatically labeling a fine-grained image according to claim 3, wherein each branch consists of a Self-Attention layer, a BatchNorm layer, and a fully connected layer connected in sequence from front to back;
the Self-Attention layer extracts the salient parts of the features, the BatchNorm layer normalizes the scale of the features, and the fully connected layer classifies the features; after the fully connected layer, the hierarchical classification model uses a cross-entropy loss function, a Focal loss function, and an ArcFace loss function, wherein the cross-entropy loss maximizes the predicted probability of the correct class, the Focal loss handles classification on imbalanced data sets, and the ArcFace loss increases the angular separation between features of different classes.
5. The fine-grained image semiautomatic labeling method according to claim 1, wherein the training method of the K-Means clustering model comprises the following steps:
extracting SIFT features from all marked areas, and generating K center points by adopting a K-Means clustering algorithm;
dividing the SIFT features of all marked areas into K feature subspaces according to distances from K center points;
an initial K-Means clustering model is formed through K center points and K characteristic subspaces;
after determining the category of SIFT features of the region to be marked, merging the SIFT features with the SIFT features of the marked region, and updating K center points and K feature subspaces of an initial K-Means clustering model to obtain the K-Means clustering model.
6. The method for semi-automatic labeling of fine-grained images according to claim 1, characterized in that the training phase of the hierarchical classification model comprises the following steps:
all marked areas form a marked data set;
training a pre-trained network on the marked data set with the hierarchical classification structure;
after the category of the region to be marked is determined, the region to be marked and the marked region are combined to be used as a new marked data set to iteratively train a pre-training network, and a hierarchical classification model is obtained.
The data set generated by early-stage annotation is used to train a ResNet model initialized with ImageNet pre-trained parameters as the backbone network, combined with a custom detection head.
7. The method for semi-automatically labeling the fine-grained image according to claim 1, wherein different sample recommendation algorithms are adopted depending on the number of labeled samples, and a label information search algorithm is introduced to improve the degree of fit between the retrieved labeled samples and the target area.
8. The method for semi-automatic labeling of fine-grained images according to claim 7, wherein the sample recommendation algorithm comprises the following steps:
early stage recommendation algorithm: in the early stage, the sample recommendation algorithm builds a K-Means clustering model based on SIFT features of the marked images, and then selects marked samples which are most similar to the target area for recommendation according to the model.
Late-stage recommendation algorithm: in the later stage, the sample recommendation algorithm is based on deep learning; a hierarchical classification model is trained with the marked data set, unmarked data are predicted by the model, and the several categories with the highest prediction probability are selected as sample recommendations.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310485129.5A CN116524263A (en) | 2023-04-28 | 2023-04-28 | Semi-automatic labeling method for fine-grained images |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116524263A true CN116524263A (en) | 2023-08-01 |
Family
ID=87391654
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310485129.5A Pending CN116524263A (en) | 2023-04-28 | 2023-04-28 | Semi-automatic labeling method for fine-grained images |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116524263A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN117456291A (en) * | 2023-12-26 | 2024-01-26 | 苏州镁伽科技有限公司 | Defect classification method and device, electronic equipment and storage medium |
CN117456291B (en) * | 2023-12-26 | 2024-04-16 | 苏州镁伽科技有限公司 | Defect classification method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||