CN113095429A

CN113095429A - Robust weak supervision classification method for incremental new image data

Info

Publication number: CN113095429A
Application number: CN202110447613.XA
Authority: CN
Inventors: 李宇峰; 周志华; 朱永南
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2021-04-25
Filing date: 2021-04-25
Publication date: 2021-07-09

Abstract

The invention discloses a robust weak supervision classification method for incremental new image data. In an open scenario, new data arrives incrementally and includes labeled images, unlabeled images, and images of new classes. A labeled image is added directly to the labeled data set and the model is updated. For a new unlabeled image, its class membership is judged using techniques such as feature description and weakly supervised learning. If the image belongs to a known class, its label is predicted with a weakly supervised learning technique such as a label propagation algorithm, and the image is then added to the existing labeled data set; if the image is detected as belonging to a previously unseen new class, it is placed in a buffer, and once a sufficient number of such images have been collected, the model is updated and the buffer is emptied. The invention can effectively realize robust classification of incremental new image data.

Description

Robust weak supervision classification method for incremental new image data

Technical Field

The invention relates to a robust weak supervision classification method for incremental new image data, and belongs to the technical field of artificial intelligence and pattern recognition tasks in a big data environment.

Background

Digital images are important information carriers in the internet era and play an important role in daily life and in many practical applications. With the rapid growth of digital images held by individual users, managing and using them effectively has become an important and challenging task, and image labeling is one of its most critical parts. Most previous image labeling methods operate in a static scene, that is, they assume that the data labels are complete and that the set of classes does not change. However, as the image labeling task moves to an open scene, the form of the data changes greatly and the prior art is difficult to apply directly. This shows in two aspects. First, manual labeling of images is far slower than image generation, so newly emerging digital images contain a large amount of unlabeled data. Second, digital image data of new classes often appears over time; if such data is not distinguished from the known classes, performance degrades and important information may be missed, so timely detection and reasonable use are urgently needed. The prior art proposes solutions for each of these aspects individually, but few techniques address both difficulties simultaneously and improve the efficiency of image labeling classification in an open scene.

Disclosure of Invention

Purpose of the invention: the invention addresses two common phenomena faced by image data in an open scene that the prior art does not handle well. First, data labeling cannot keep up with data generation, so the data contains a large number of unlabeled samples, and a large amount of image data is wasted if these samples are not used. Second, image data of new, previously unseen classes often appears over time, and if it cannot be detected in time and used properly, the classification model degrades or even fails completely. The invention provides a robust weak supervision classification method for incremental new image data that can discover new-class images and make effective use of a large number of unlabeled images, overcoming the shortcomings of traditional image labeling mechanisms, which require a large number of image labels or have difficulty detecting new-class images, and improving image labeling in an open scene. The invention takes digital image labeling as its subject to explain its key points: the incremental data corresponds to continuously emerging digital images, and the classification of the data corresponds to the image labeling task.

Technical scheme: a robust weak supervision classification method for incremental new image data adopts a divide-and-conquer strategy for the classification task of incremental image data. Labeled data is added directly and the labeling unit is updated. For unlabeled data, a new-class discovery unit based on a random forest evaluates the data to judge whether the new image belongs to an unknown new class. If the data belongs to a known class, its label is quickly predicted using the historical data of the known classes and a label propagation algorithm. If the image data is detected as belonging to an unknown new class, it is first placed in a buffer; when a set number of new-class images have been collected, the new-class discovery unit and the known-class labeling unit are updated, and the data buffer is emptied so that more new-class images can continue to be discovered.

On the one hand, image data labeled with different classes differs markedly in its features. Starting from this difference in feature space, newly emerging image data is distinguished from historical labeled image data of known classes, and new-class images are detected. Because the historical labeled data is huge, comparing each newly emerging image with all historical data one by one is infeasible, so sampling and ensemble learning are used to reduce the computational overhead. First, several small subsets are selected from the historical labeled images of known classes; then the new image is evaluated independently against each subset; finally, all evaluation results are integrated to judge whether the image to be labeled belongs to a new class. Sampling greatly reduces the data scale and improves efficiency; ensemble learning effectively avoids sampling bias and improves robustness. Overall, the approach is both efficient and robust.

On the other hand, digital images collected continuously in an open environment often contain a large amount of unlabeled data. If the labeling unit uses only the small amount of labeled images and ignores the large amount of unlabeled data, it misses much information that would help improve performance. Through the label propagation technique of weakly supervised learning, the image labeling unit automatically combines the limited labeled data with the large amount of unlabeled data and thereby obtains higher-quality labeling results. In addition, to respond quickly in an open scene, the labeling unit builds its model from selected representative historical data, which reduces the time overhead of updating the labeling unit and greatly improves labeling efficiency.

Based on these two aspects, the invention uses an ensemble scoring strategy to judge whether an image is new-class data, and uses unlabeled images efficiently to label data quickly, thereby overcoming the shortcomings of the prior art, which requires a large number of image labels or cannot detect new-class data, and improving image labeling in an open scene. In addition, once enough new-class images have been collected, the method automatically adds this new-class data to the known classes, so that it can be used to keep discovering further new-class images.

The new-class image data discovery unit works as follows: a completely random forest is built on the image data of the known classes, and for each decision tree the average distance between all leaf nodes and the root node is recorded as

$$l_0 = \frac{1}{|\mathrm{leafset}|} \sum_{\mathrm{leaf}_i \in \mathrm{leafset}} \mathrm{dist}(\mathrm{leaf}_i, \mathrm{root}),$$

where leafset and $\mathrm{leaf}_i$ denote the set of all leaf nodes and the i-th leaf node, respectively, and $\mathrm{dist}(\mathrm{leaf}_i, \mathrm{root})$ is the distance from that leaf node to the root node. The spherical radius of the image data under each node is also recorded (formula (1)), where O is the set of data samples under the current node and c is the mean of the data samples under the current node.
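As a minimal sketch of the two quantities just described, the following Python computes the average leaf-to-root distance of one tree and the center and radius of the samples under one node. The nested-dictionary tree layout and the function names are hypothetical. Because the radius formula is not reproduced in this text, the sketch assumes the radius is the largest Euclidean distance from the center c to any sample in O; the actual formula (1) may differ.

```python
import math

def leaf_depths(node, depth=0):
    """Return the depth (number of edges from the root) of every leaf under `node`."""
    children = [c for c in (node.get("left"), node.get("right")) if c is not None]
    if not children:
        return [depth]
    return [d for child in children for d in leaf_depths(child, depth + 1)]

def average_leaf_depth(root):
    """l0: mean leaf-to-root distance of one decision tree."""
    depths = leaf_depths(root)
    return sum(depths) / len(depths)

def center_and_radius(samples):
    """Center c (feature-wise mean) and radius r of the samples O under one node."""
    center = [sum(col) / len(samples) for col in zip(*samples)]
    # ASSUMPTION: r is the largest distance from c to a sample in O.
    radius = max(math.dist(s, center) for s in samples)
    return center, radius

# Toy tree with leaves at depths 2, 2 and 1.
toy_tree = {"left": {"left": {}, "right": {}}, "right": {}}
l0 = average_leaf_depth(toy_tree)
center, radius = center_and_radius([[0.0, 0.0], [2.0, 0.0]])
```

Here each node stores its own center and radius at construction time, so the prediction stage only needs one distance computation per visited node.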

For new image data, the image is put into each decision tree for prediction and descends recursively from the root node according to the splitting criterion; at each node, the distance between the image and the node's cluster center is computed. If this distance is greater than the spherical radius, the distance between the current node and the root node of the decision tree is recorded. If the recorded distance is greater than the average distance $l_0$, the new image differs greatly from the existing historical data and is predicted to be a new class; otherwise it is predicted to be a known class. For robustness, the predictions of the multiple decision trees are combined by voting to obtain the final result.

The labeling unit for known-class images works as follows: using weakly supervised learning algorithms such as label propagation, the image labeling unit automatically combines the limited labeled data with the large amount of unlabeled data and thereby obtains high-quality labeling results. In addition, to respond quickly in an open scene, the labeling unit builds its model from selected representative historical data, which reduces the time overhead of updating the labeling unit and greatly improves labeling efficiency.

When the set number of new-class images has been collected, an update of the labeling unit is triggered, with the following workflow: once the data buffer holds the set number of new-class images, they are added to the labeling unit as image data of a known class, and the data buffer is emptied so that new-class images can continue to be discovered. The labeling unit adds the new-class images, processes them in the same way as the existing classes, and uses them to find labels for more unlabeled images; the new-class discovery unit combines the new-class images with the known-class images and rebuilds the completely random forest so that it can continue to discover new-class images.

The implementation of the invention mainly comprises the following steps:

Step (1): add an image and judge whether it has a label; if so, add the image data to the image data set of known classes and go to step (2); otherwise go to step (3).

Step (2): update the known-class image labeling unit according to the new data set to obtain the predicted label.

Step (3): use the new-class discovery unit to judge whether the image belongs to a new class; if so, go to step (4); otherwise add the image to the image data set of known classes and go to step (2).

Step (4): if the amount of new-class data is less than the set threshold, end the processing of the input image; otherwise add the collected new-class data to the image data set of known classes and go to step (2).

And repeating the steps until all the newly added images are classified.
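The routing in steps (1) to (4) can be sketched as a single dispatch function. This is an illustrative sketch only: the `state` dictionary, the callables `is_new_class` and `predict_known`, the string marker `"new"`, and the way a fresh label id is assigned to flushed buffer contents are all assumptions introduced here, not details given in the text.

```python
def process_image(image, label, state, is_new_class, predict_known, buffer_limit):
    """Route one incoming image through steps (1)-(4) and return its label."""
    if label is not None:
        # Step (1): labeled image, add to the known-class data set.
        state["known"].append((image, label))
        return label  # step (2) would refresh the labeling unit here
    if not is_new_class(image):
        # Step (3), known class: predict the label, then store the image.
        predicted = predict_known(image)
        state["known"].append((image, predicted))
        return predicted
    # Step (3), new class: buffer the image, then apply step (4).
    state["buffer"].append(image)
    if len(state["buffer"]) >= buffer_limit:
        # ASSUMPTION: buffered images become one new known class with a fresh id.
        new_label = state["next_label"]
        state["known"].extend((img, new_label) for img in state["buffer"])
        state["buffer"].clear()
        state["next_label"] += 1
        state["updates"] += 1  # both units would be rebuilt at this point
    return "new"

def toy_detector(image):
    return image > 10  # hypothetical stand-in for the discovery unit

def toy_labeler(image):
    return 0  # hypothetical stand-in for the labeling unit

state = {"known": [], "buffer": [], "next_label": 100, "updates": 0}
stream = [(1, 0), (2, None), (50, None), (60, None)]
results = [process_image(x, y, state, toy_detector, toy_labeler, buffer_limit=2)
           for x, y in stream]
```

Passing the two units in as callables keeps the control flow independent of how the forest and the label propagation model are actually implemented.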

Drawings

FIG. 1 is an overall workflow diagram of a digital image annotation unit in an open scene;

FIG. 2 is a flowchart of the operation of the new class image data discovery unit;

FIG. 3 is a flowchart of the operation of the known class image labeling unit.

Detailed Description

The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.

FIG. 1 is the overall workflow diagram of the digital image annotation unit in an open scene. A new image is input at step 10 and checked for a label at step 11. If the new image is labeled, it is added to the image data set of known classes and the process proceeds to step 12, where a better model is constructed from the updated data set and a predicted label is returned. If the new image is unlabeled, step 13 determines whether it belongs to a new class or a known class; this evaluation is carried out by the new-class image discovery unit (step 14). If the image is judged to be a known class, it is added to the image data set of known classes, its specific class is determined by the known-class image labeling unit, and the result is output at step 18. If it is judged to be a new class, it is added to the new-class data buffer in step 15, and it is determined whether the amount of new-class data in the buffer exceeds the set threshold. If the threshold is exceeded, the new-class discovery unit and the labeling unit are updated and the new-class data buffer is emptied in step 16; otherwise the new-class prediction is output in step 17. The image labeling process then ends (step 19).

FIG. 2 is a flowchart of the new-class discovery unit for digital images in the present invention, which determines whether a new image belongs to a new class or a known class and corresponds to step 14 in FIG. 1. Beginning at step 20, feature representations (organized as vectors) of all images, including the historical image data (denoted D) and the current image (denoted x), are collected at step 21. Note that this step uses a database of image representations that is extended incrementally, which avoids repeated computation. The extracted features include, but are not limited to, image features such as color, texture, and shape, obtained with classical digital image processing methods. After feature extraction, each image is represented as a feature vector.

Because the invention targets a big data environment, the scale of the historical data makes exhaustive offline processing impractical, and an efficient approximation is needed. As shown in step 22, S subsets of size M are randomly sampled from the historical data, and in step 23 S new-class detection trees are constructed, each from M image data. The new-class detection result is the integrated result of the S new-class detection trees.
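The sampling in step 22 can be illustrated with a few lines of Python. This is a sketch under assumptions: the function name and the `seed` parameter are hypothetical, and drawing without replacement inside each subset is a choice made here, since the text only says the subsets are randomly sampled.

```python
import random

def sample_subsets(history, num_subsets, subset_size, seed=None):
    """Draw S random subsets of size M from the historical data (step 22)."""
    rng = random.Random(seed)
    size = min(subset_size, len(history))
    # ASSUMPTION: no repeats inside a subset; different subsets may overlap.
    return [rng.sample(history, size) for _ in range(num_subsets)]

history = list(range(1000))  # stand-in for 1000 historical feature vectors
subsets = sample_subsets(history, num_subsets=5, subset_size=16, seed=0)
```

Each of the S subsets is then used to build one detection tree, so the cost of construction depends on S and M rather than on the full size of the historical data.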

The new-class detection tree is built as a recursive binary decision tree, which is simple and effective. Specifically, let the currently input image data set be O and the current recursion level (i.e., the height of the tree) be h (initially h = 0). The center of all samples in O is computed and recorded as the feature vector c. The radius r of the spherical feature space formed by the current samples is computed according to formula (1); r is used later for new-class detection.

In the subtree splitting stage, the random-splitting idea of random forests is reused to enhance robustness: k features are randomly selected from O to obtain a new data set O'. The classical K-means algorithm is applied to O' to obtain two clusters, and O is divided into two disjoint subsets $O_1$ and $O_2$ according to the cluster indices; $(O_1, h+1)$ and $(O_2, h+1)$ are then used in turn as the inputs for constructing the left and right subtrees. The recursion ends when $|O| = 1$ or $h = h_m$, where $h_m$ is a preset algorithm parameter that prevents overfitting and controls the complexity of the tree.
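The recursive construction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary node layout, the function names, the basic Lloyd-style 2-means routine, and the stop on a degenerate split are hypothetical, and the radius is assumed to be the largest distance from the center to a sample because formula (1) is not reproduced in this text.

```python
import math
import random

def two_means(points, rng, iterations=10):
    """Assign each point to one of two clusters with a basic K-means (K = 2)."""
    centers = rng.sample(points, 2)
    assign = [0] * len(points)
    for _ in range(iterations):
        assign = [0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def build_tree(samples, num_features, max_depth, rng, depth=0):
    """Recursively build one new-class detection tree from the data set O."""
    center = [sum(col) / len(samples) for col in zip(*samples)]
    # ASSUMPTION: radius is the largest distance from the center to a sample.
    radius = max(math.dist(s, center) for s in samples)
    node = {"center": center, "radius": radius, "depth": depth,
            "left": None, "right": None}
    if len(samples) <= 1 or depth >= max_depth:  # |O| = 1 or h = h_m
        return node
    # O': the samples restricted to k randomly chosen features.
    features = rng.sample(range(len(center)), min(num_features, len(center)))
    projected = [[s[j] for j in features] for s in samples]
    assign = two_means(projected, rng)
    left = [s for s, a in zip(samples, assign) if a == 0]
    right = [s for s, a in zip(samples, assign) if a == 1]
    if not left or not right:  # degenerate split: keep this node as a leaf
        return node
    node["left"] = build_tree(left, num_features, max_depth, rng, depth + 1)
    node["right"] = build_tree(right, num_features, max_depth, rng, depth + 1)
    return node

samples = [[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]]
tree = build_tree(samples, num_features=2, max_depth=3, rng=random.Random(0))
```

Note that the clustering runs on the projected data O' while the center and radius are computed on the full feature vectors in O, matching the description that O itself is divided according to the cluster indices.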

In the prediction stage, for the current image data x to be detected, each new-class detection tree computes a new-class result in step 24. Specifically, for each tree, x is passed down from the root node toward the leaf nodes. At each node, the distance between x and the current data center c is computed, denoted d, and compared with the current data radius r.

(1) If d is less than or equal to r, the distances between x and the data cluster centers of the two subtrees are computed, and the subtree with the smaller distance is selected for further recursive detection;

(2) otherwise, the length l between the current node and the root node is returned.

The value of l reflects the degree to which the current data x violates the overall data distribution. The larger the value of l, the less x violates the distribution, and the lower the likelihood that x is a new class. Without loss of generality, the average distance between all leaf nodes and the root node is recorded as $l_0$. If $l \ge l_0$, x is predicted to be a new-class image and the predicted value 1 is output; otherwise 0 is output. The predicted values of the S trees are accumulated to obtain V (step 25); if more than half of the trees vote for the new class, that is, $V > S/2$ (step 26), x is judged to be a new class, otherwise a known class. The algorithm then ends (step 27). Once constructed, the new-class detection trees need not be rebuilt for every detection and can be reused as many times as needed.
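The prediction and voting procedure can be sketched in Python as below. The sketch follows the threshold rule exactly as worded in the text (a tree votes 1 when l is at least l0). The node layout and function names are hypothetical, and the text does not say what a tree returns when x stays inside every ball down to a leaf; the sketch assumes such a tree votes for the known class.

```python
import math

def violation_depth(tree, x):
    """Depth l of the first node whose ball does not contain x, else None."""
    node = tree
    while True:
        if math.dist(x, node["center"]) > node["radius"]:
            return node["depth"]  # case (2): return l
        children = [c for c in (node.get("left"), node.get("right")) if c is not None]
        if not children:
            return None  # reached a leaf without any violation
        # Case (1): continue into the subtree whose center is closer to x.
        node = min(children, key=lambda c: math.dist(x, c["center"]))

def tree_vote(tree, x, l0):
    """Vote of one tree: 1 for new class, 0 for known class."""
    depth = violation_depth(tree, x)
    # ASSUMPTION: no violation at all counts as a known-class vote.
    return 1 if depth is not None and depth >= l0 else 0

def is_new_class(trees, l0_values, x):
    """Majority vote over the S trees: new class if V > S/2."""
    votes = sum(tree_vote(t, x, l0) for t, l0 in zip(trees, l0_values))
    return votes > len(trees) / 2

toy = {"center": [0.0, 0.0], "radius": 10.0, "depth": 0,
       "left": {"center": [-5.0, 0.0], "radius": 1.0, "depth": 1},
       "right": {"center": [5.0, 0.0], "radius": 1.0, "depth": 1}}
trees, l0_values = [toy, toy, toy], [1.0, 1.0, 1.0]
outside_leaf = is_new_class(trees, l0_values, [5.0, 3.0])
inside_leaf = is_new_class(trees, l0_values, [5.0, 0.5])
```

In this toy forest the point (5, 3) passes the root ball but falls outside the closest leaf ball at depth 1, so every tree votes for the new class, while (5, 0.5) stays inside the leaf ball and is judged a known class.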

FIG. 3 shows the labeling unit for digital images of known classes, corresponding to step 12 in FIG. 1. If the unlabeled image x to be predicted is judged to be a known class by the new-class discovery unit in step 14 of FIG. 1, it is likewise handed to the known-class image labeling unit to determine its specific class.

The algorithm first performs image preprocessing to improve real-time efficiency, handling input images separately according to whether they are labeled (step 31). Specifically, the labeled images of the K known classes are organized into a feature set $L = \{X_1, X_2, \ldots, X_K\}$ and a label set $Y^L = \{y_1, y_2, \ldots, y_K\}$, where $X_i$ denotes the feature representations of the images of the i-th class and $y_i$ denotes the label of the i-th class. In step 33, the labeled images of the i-th class are grouped and represented together as $X_i$, which improves the efficiency of the subsequent algorithm. As required by the label propagation algorithm, the adjacency distance matrix $G_{UL}$ between the labeled image set L and the unlabeled image set U is computed, in which the distance between a labeled class and an unlabeled image is represented by the sum of the distances between all images in that class and the unlabeled image. Specifically, the distance between two feature vectors w and v is usually measured with a Gaussian kernel function, as shown in formula (2), where σ is the Gaussian kernel parameter.

Thus the distance between an unlabeled image $x_u$ and the labeled images of the i-th class is expressed as the sum of the distances between $x_u$ and all labeled data of the i-th class, as shown in formula (3), which yields $G_{UL}$.

The adjacency matrix $G_{UU}$ of the unlabeled images in step 32 is computed in the same way, following formula (2).
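The construction of the two matrices can be sketched as follows. Formulas (2) and (3) are not reproduced in this text, so the sketch assumes the common Gaussian kernel form exp(-||w - v||^2 / (2σ^2)) and a zero diagonal for the unlabeled-to-unlabeled matrix; the function names are hypothetical and the actual formulas may differ.

```python
import math

def gaussian_kernel(w, v, sigma):
    """Kernel value between two feature vectors.

    ASSUMPTION: exp(-||w - v||^2 / (2 * sigma^2)); formula (2) may differ.
    """
    return math.exp(-math.dist(w, v) ** 2 / (2.0 * sigma ** 2))

def build_g_ul(unlabeled, classes, sigma):
    """G_UL[u][i]: sum of kernel values between x_u and every image in X_i."""
    return [[sum(gaussian_kernel(x_u, x, sigma) for x in X_i) for X_i in classes]
            for x_u in unlabeled]

def build_g_uu(unlabeled, sigma):
    """G_UU[p][q]: kernel value between unlabeled images p and q."""
    n = len(unlabeled)
    # ASSUMPTION: the diagonal is set to zero (no self-connection).
    return [[0.0 if p == q else gaussian_kernel(unlabeled[p], unlabeled[q], sigma)
             for q in range(n)] for p in range(n)]

classes = [[[0.0, 0.0], [0.0, 1.0]],   # X_1: two labeled images
           [[5.0, 5.0]]]               # X_2: one labeled image
unlabeled = [[0.0, 0.5], [5.0, 4.0]]
g_ul = build_g_ul(unlabeled, classes, sigma=1.0)
g_uu = build_g_uu(unlabeled, sigma=1.0)
```

Because each class is collapsed into one column, G_UL has one column per class rather than one per labeled image, which is what keeps the matrix small as labeled data accumulates.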

Since the number of unlabeled images is much greater than the number of labeled images, a threshold τ is introduced in step 34 to keep the computation efficient in a big data environment. If the number of unlabeled images exceeds τ, the older unlabeled image data z is discarded. The update of the distance $G_{p,q}$ between two unlabeled images p and q is given by formula (4), and $G_{UU}$ is updated in step 35.
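A simple way to picture the thresholding in steps 34 and 35 is to drop the rows and columns of the oldest unlabeled images. The sketch below is an assumption-laden illustration: it assumes the unlabeled list is ordered oldest first, the function name is hypothetical, and since formula (4) is not reproduced in this text, it only removes entries and does not apply any further update that formula may prescribe.

```python
def trim_unlabeled(unlabeled, g_uu, g_ul, tau):
    """Discard the oldest unlabeled images so that at most tau remain."""
    # ASSUMPTION: index 0 is the oldest image, the last index the newest.
    excess = max(0, len(unlabeled) - tau)
    kept = unlabeled[excess:]
    new_g_uu = [row[excess:] for row in g_uu[excess:]]  # drop rows and columns
    new_g_ul = g_ul[excess:]                            # drop matching rows
    return kept, new_g_uu, new_g_ul

unlabeled = ["img_a", "img_b", "img_c"]
g_uu = [[0.0, 0.2, 0.3],
        [0.2, 0.0, 0.6],
        [0.3, 0.6, 0.0]]
g_ul = [[0.9, 0.1], [0.4, 0.5], [0.1, 0.8]]
kept, g_uu_new, g_ul_new = trim_unlabeled(unlabeled, g_uu, g_ul, tau=2)
```

Only slicing is involved, so the cost of the update is proportional to the size of the retained matrices and no kernel value has to be recomputed.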

Based on the constructed adjacency matrices $G_{UU}$ and $G_{UL}$ and the label information $Y^L$, formula (5) is solved in step 36 using a label propagation algorithm, so that the predicted labels of the unlabeled images are obtained quickly; step 37 ends the label prediction process for known-class images. Throughout the workflow, steps 33 and 34 reduce the data scale and the computational cost by grouping the labeled data by class and by discarding old unlabeled data, respectively, so that label prediction is fast. Moreover, updating the model only requires updating the adjacency matrices and involves no other parameter learning, which suits the modeling of incremental data.
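Formula (5) is not reproduced in this text, so the following is only one common form of label propagation, swapped in for illustration: each unlabeled image repeatedly takes the weighted average of its neighbors' class scores, with the class-aggregated matrix G_UL acting as fixed one-hot evidence from the labeled classes. The function name, the iteration count, and the uniform initialization are assumptions.

```python
def propagate_labels(g_uu, g_ul, iterations=200):
    """Iterative label propagation over unlabeled images.

    ASSUMPTION: harmonic-style update
        F[u] = (G_UL[u] + sum_v G_UU[u][v] * F[v]) / degree(u),
    which may differ from formula (5).
    """
    n, k = len(g_ul), len(g_ul[0])
    scores = [[1.0 / k] * k for _ in range(n)]
    for _ in range(iterations):
        updated = []
        for u in range(n):
            degree = sum(g_uu[u]) + sum(g_ul[u])
            if degree == 0.0:  # isolated image: keep its current scores
                updated.append(scores[u])
                continue
            updated.append([
                (g_ul[u][c] + sum(g_uu[u][v] * scores[v][c] for v in range(n))) / degree
                for c in range(k)])
        scores = updated
    labels = [max(range(k), key=lambda c: row[c]) for row in scores]
    return labels, scores

g_uu = [[0.0, 0.5],
        [0.5, 0.0]]
g_ul = [[1.0, 0.0],   # image 0 is close to class 0
        [0.0, 1.0]]   # image 1 is close to class 1
labels, scores = propagate_labels(g_uu, g_ul)
```

For this toy input the scores settle at 0.75 and 0.25 for each image, so image 0 receives class 0 and image 1 receives class 1. Consistent with the description, the procedure has no trainable parameters beyond the adjacency matrices themselves.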

When a sufficient number of new-class images have been obtained through step 26 in FIG. 2, they can be regarded as a new known class, and the labeling unit for digital images of known classes is updated again according to step 33 in FIG. 3, so that further new-class images can continue to be discovered. The whole digital image annotation unit is updated automatically through the cooperation of the new-class discovery unit in FIG. 2 and the known-class labeling unit in FIG. 3, so that new classes are continuously detected and known classes are continuously labeled.

Claims

1. A robust weak supervision classification method for incremental new image data, characterized in that a divide-and-conquer strategy is adopted for the classification task of incremental image data: a labeled image is added directly and the labeling unit is updated; for an unlabeled image, a new-class discovery unit based on a random forest evaluates the unlabeled data to judge whether the input new image belongs to a previously unseen new class; if the image belongs to a known class, its predicted label is quickly obtained using the historical data of the known classes and a label propagation algorithm, and the labeling unit is then updated; if the image data belongs to a previously unseen new class, it is first placed in a buffer, and when a set number of new-class images have been collected, the new-class discovery unit and the known-class labeling unit are updated, and the data buffer is emptied so that more new-class images can continue to be discovered.

2. The robust weak supervision classification method for handling incremental new-class image data according to claim 1, wherein the new-class image data discovery unit operates as follows: a random forest is built on the image data of the known classes, and for each decision tree the average distance between all leaf nodes and the root node is recorded as

$$l_0 = \frac{1}{|\mathrm{leafset}|} \sum_{\mathrm{leaf}_i \in \mathrm{leafset}} \mathrm{dist}(\mathrm{leaf}_i, \mathrm{root}),$$

where leafset and $\mathrm{leaf}_i$ denote the set of all leaf nodes and the i-th leaf node, respectively, and $\mathrm{dist}(\mathrm{leaf}_i, \mathrm{root})$ is the distance from that leaf node to the root node; the spherical radius of the image data under each node is also recorded (formula (1)), where O is the set of data samples under the current node and c is the mean of the data samples under the current node;

the new image data is put into each decision tree for prediction and descends recursively from the root node according to the splitting criterion, and the distance between the new image data and the node's cluster center is computed; if this distance is greater than the spherical radius, the distance between the current node and the root node of the decision tree is recorded; if the recorded distance is greater than the average distance $l_0$, the new image differs greatly from the existing historical data and is predicted to be a new class, otherwise it is predicted to be a known class; for robustness, the predictions of the multiple decision trees are combined by voting to obtain the final result.

3. The robust weak supervision classification method for handling incremental new-class image data according to claim 1, characterized in that the labeling unit for known-class images works as follows: using a label propagation algorithm of weakly supervised learning, the image labeling unit automatically uses the limited labeled data and the large amount of unlabeled data to obtain a labeling result; the labeling unit builds its model by sampling multiple groups of historical data to improve robustness and efficiency.

4. The robust weak supervision classification method for handling incremental new-class image data according to claim 1, wherein the model updating unit has the following workflow: when the data buffer has collected a set number of new-class images, the new-class images are added to the labeling unit as image data of a known class, and the data buffer is emptied; the labeling unit adds the new-class images, processes them in the same way as the existing classes, and uses them to find labels for more unlabeled images; and the new-class image data discovery unit combines the new-class images with the known-class images and rebuilds the completely random forest so that it can continue to discover new-class images.

5. The robust weak supervision classification method for handling incremental new-class image data according to claim 1, wherein the weak supervision classification method for image data is implemented by the following steps:

step (4) if the number of the new data is less than the set threshold, ending the processing of the input image, otherwise, adding the collected new data into the image data set of the known type, and turning to step (2);

and repeating the steps until all the newly added images are classified.

6. The robust weak supervision classification method for handling incremental new-class image data according to claim 2, wherein in the discovery unit of new-class image data, S subsets with size M are randomly sampled from historical data to construct S new-class detection trees, each tree is composed of M image data; the new class detection inference consists of the integrated results of the S new class detection trees;

constructing a new detection tree by adopting a recursive binary decision tree, setting a currently input image data set as O, setting the number of current recursive layers as h, calculating the centers of all samples in the O, and marking as a characteristic vector c; calculating the radius r of a spherical characteristic space formed by the current sample, wherein the r is used for the subsequent new detection and discrimination;

in the subtree splitting stage, the random-splitting idea of random forests is reused to enhance robustness, and k features are randomly selected from O to obtain a new data set O'; the classical K-means algorithm is applied to O' to obtain two clusters, and O is divided into two disjoint subsets $O_1$ and $O_2$ according to the cluster indices; $(O_1, h+1)$ and $(O_2, h+1)$ are used in turn as the inputs for constructing the left and right subtrees; the recursion ends when $|O| = 1$ or $h = h_m$; $h_m$ is a preset algorithm parameter;

in the prediction stage, for the current image data x to be detected, a new-class result is computed by every new-class detection tree; for each tree, x is passed down from the root node toward the leaf nodes; at each node, the distance between x and the current data center c is computed, denoted d, and compared with the current data radius r;

(2) otherwise, returning the length l of the current node and the root node;

the value l reflects the degree to which the current data x violates the overall data distribution, and the average of the distances between all leaf nodes and the root node is recorded as $l_0$; if $l \ge l_0$, x is predicted to be a new-class image and the predicted value 1 is output; otherwise 0 is output; the predicted values of the S trees are accumulated to obtain V, and if more than half of the trees vote for the new class, that is, $V > S/2$, x is judged to be a new class, otherwise a known class; the algorithm then ends.

7. The robust weak supervision classification method for handling incremental new-class image data according to claim 6, wherein the new-class detection tree, once constructed, need not be rebuilt for every detection and can be reused as many times as needed.

8. The robust weak supervision classification method for handling incremental new-class image data according to claim 3, characterized in that, in the labeling unit for digital images of known classes, the unlabeled image x to be predicted first passes through the new-class discovery unit, and if it is judged to be a known class, it is likewise handed to the known-class image labeling unit to determine its specific class;

the labeled images of the K known classes are organized into a feature set $L = \{X_1, X_2, \ldots, X_K\}$ and a label set $Y^L = \{y_1, y_2, \ldots, y_K\}$, where $X_i$ denotes the feature representations of the images of the i-th class and $y_i$ denotes the label of the i-th class; the labeled images of the i-th class are grouped and represented together as $X_i$; as required by the label propagation algorithm, the adjacency distance matrix $G_{UL}$ between the labeled image set L and the unlabeled image set U is computed, in which the distance between a labeled class and an unlabeled image is represented by the sum of the distances between all images in that class and the unlabeled image; specifically, the distance between two feature vectors w and v is usually measured with a Gaussian kernel function, as shown in formula (2), where σ is the Gaussian kernel parameter;

thus the distance between an unlabeled image $x_u$ and the labeled images of the i-th class is expressed as the sum of the distances between $x_u$ and all labeled data of the i-th class, as shown in formula (3), which yields $G_{UL}$;

the adjacency matrix $G_{UU}$ of the unlabeled images is computed in the same way, following formula (2);

a threshold τ is introduced, and if the number of unlabeled images exceeds τ, the older unlabeled image data z is discarded; the update of the distance $G_{p,q}$ between two unlabeled images p and q is given by formula (4), and $G_{UU}$ is updated accordingly;

based on the constructed adjacency matrices $G_{UU}$ and $G_{UL}$ and the label information $Y^L$, formula (5) is solved in step 36 using a label propagation algorithm.