CN113642610B - Distributed asynchronous active labeling method - Google Patents

Distributed asynchronous active labeling method

Info

Publication number
CN113642610B
CN113642610B (application CN202110801168.2A)
Authority
CN
China
Prior art keywords
worker
sample
server
model
unlabeled
Prior art date
Legal status
Active
Application number
CN202110801168.2A
Other languages
Chinese (zh)
Other versions
CN113642610A (en)
Inventor
黄圣君
宗辰辰
宁鲲鹏
唐英鹏
Current Assignee
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202110801168.2A priority Critical patent/CN113642610B/en
Publication of CN113642610A publication Critical patent/CN113642610A/en
Application granted granted Critical
Publication of CN113642610B publication Critical patent/CN113642610B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a distributed asynchronous active labeling method. In a distributed scenario with multiple server nodes and worker nodes, the server nodes are responsible for training and updating the prediction model, while the worker nodes select samples to be queried and send them to annotators for labeling. During model updating, each server independently trains a prediction model; during instance selection, each worker selects actively from the unlabeled data pool it maintains, and diversified sampling strategies are adopted across the workers. Under this framework, information is exchanged efficiently among the server nodes, the worker nodes and the annotators through two shared data pools, so that the three can work asynchronously. On the one hand, the method removes the synchronization among the three steps of active learning, namely model training, instance selection and label querying, so annotators no longer need to wait and labeling efficiency is improved. On the other hand, the Multi-Server update mode increases the frequency of model updates, and diversified sampling strategies are introduced among the workers, so the effectiveness of active-learning sample selection is maintained.

Description

Distributed asynchronous active labeling method
Technical Field
The invention relates to the field of computer technology application, in particular to a distributed asynchronous active labeling method.
Background
Active learning is an important learning method for task scenarios where learning must be done with limited labeled data, and it has attracted considerable attention in recent years. Existing active learning methods have achieved great success in selecting the most cost-effective samples to improve model performance, but these methods typically work in a synchronous manner. Specifically, in each iteration of active learning, three steps need to be performed in sequence: first, a model is trained on the currently labeled data; second, the most cost-effective unlabeled samples are selected according to the trained model; finally, the labels of the selected samples are queried from the annotator. These three steps are repeated until the model meets the performance requirements or the query budget is exhausted. Clearly, each of the three steps depends on the completion of the previous one, which means they cannot be executed asynchronously.
In practical application scenarios, there are usually many annotators who jointly label the selected unlabeled data. In crowdsourcing environments in particular, a large number of online users provide labeling services simultaneously. Model training and sample selection, however, are typically computationally intensive, which means that producing the samples to be queried for labeling takes a long time in each iteration. As a result, after finishing each round of annotation, the annotators have to wait for the unlabeled samples of the next round. This greatly impairs the annotators' user experience, reduces labeling efficiency, and severely limits the application of active learning. A more practical mechanism is therefore strongly needed, one that allows annotators to label data continuously without waiting for model training and query selection.
Disclosure of Invention
The invention aims to: in order to solve the problems described in the background art, the invention provides a distributed asynchronous active labeling method, which can improve the user experience and the labeling efficiency in practical active learning annotation scenarios and further exploit the potential advantages of active learning.
Technical solution: in order to achieve the above purpose, the invention adopts the following technical solution:
a distributed asynchronous active labeling method comprises the following steps:
Step S1, configuring the active learning annotation scenario parameters;
the active learning annotation scenario comprises m servers for model training and k workers for instance selection; each server learns a prediction model based on the labeled sample pool L; each worker independently maintains a portion of the unlabeled samples, and all these unlabeled sample sets together are defined as the unlabeled sample pool D; S is defined as the set of selected samples to be queried;
Step S2, model training stage;
the prediction model trained by each server is independent of the others; all servers are updated during the learning stage; whenever a model training stage finishes, the servers monitor the labeled sample pool L; when the labeled sample pool L has received new labeled sample data through iterative updating, a new round of model training and updating is started;
Step S3, instance selection stage;
each worker independently maintains part of the unlabeled sample set, and the workers operate independently of one another; each worker obtains the latest model from the model library updated by all servers, and selects the most cost-effective samples to be queried from the unlabeled sample data it maintains; specifically,
first, each worker directly retrieves, from the models {g_1, g_2, …, g_m} trained by all servers, the most recently updated model g* at the current moment; the worker then directly uses an active sample selection algorithm to continuously select the most cost-effective samples to be queried from the unlabeled data pool it maintains, as follows:
S_j = S_j(g*, D_j)
where S_j denotes the active sample selection algorithm adopted by the j-th worker and D_j denotes the unlabeled sample set maintained by the j-th worker; taking as input the most recently updated model g* at the current moment and the unlabeled sample set maintained by the j-th worker, the most cost-effective samples to be queried selected by the j-th worker are obtained; in practice, each worker may adopt a different active sample selection algorithm according to its own requirements;
finally, the samples to be queried selected by all workers are gathered into the set S, which is shared by all workers.
Step S4, label query stage;
the annotator randomly selects unlabeled samples from the to-be-queried sample set S, labels the selected samples, and adds the labeled samples to the labeled sample pool L.
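As an illustration of the selection rule S_j = S_j(g*, D_j) in step S3, the following is a minimal sketch of one possible strategy, using margin-based uncertainty sampling purely as an example; the invention does not prescribe any particular strategy, and the assumption that g* exposes a scikit-learn style predict_proba method, as well as the batch size, are made only for illustration.

```python
# Minimal sketch of one possible per-worker selection strategy (step S3).
# Assumptions: g_star offers predict_proba, D_j is a 2-D numpy array of
# unlabeled feature vectors, and batch_size is an illustrative parameter.
import numpy as np

def select_to_query(g_star, D_j, batch_size=8):
    """Return indices of the most uncertain samples in the worker's pool D_j."""
    proba = np.asarray(g_star.predict_proba(D_j))   # shape (n_samples, n_classes)
    sorted_p = np.sort(proba, axis=1)                # ascending per row
    margin = sorted_p[:, -1] - sorted_p[:, -2]       # best-vs-second-best margin
    order = np.argsort(margin)                       # small margin = high uncertainty
    return order[:batch_size]                        # S_j: samples worth querying
```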
Further, in step S1, the labeled sample pool L is specifically expressed as follows:
L represents the set of n_l labeled samples;
the unlabeled sample pool D is specifically represented as follows:
D = D_1 ∪ D_2 ∪ … ∪ D_k
D represents the k partitioned unlabeled sample sets, maintained by the k workers respectively.
Further, the unlabeled sample sets in the unlabeled sample pool D are partitioned by means of random selection or clustering.
Further, several resource management nodes are arranged among the labeled sample pool L, the to-be-queried sample set S, the servers and the workers, to coordinate the transmission of information among the sample sets, the servers and the workers; and several server management nodes are set up within the server group formed by the servers, to maintain the model library updated by all servers, so that the worker side can obtain the most recently updated model in time.
Further, the model training mode of the servers in step S2 is specifically as follows:
g_i = A(L)
where A represents the specific algorithm used when learning a prediction model on a server, and g_i is the trained model; the learning algorithms A used by different servers are mutually independent; when the servers are updated, the model is updated by employing a Multi-Server update mechanism.
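As one concrete possibility for the learner A in the formula above, the following sketch fits a logistic regression on the labeled pool; this choice, and the representation of L as a list of (feature vector, label) pairs, are assumptions made only for illustration, since each server is free to use its own algorithm A.

```python
# Minimal sketch of the server-side update g_i = A(L).
# Assumptions: L is a list of (x, y) pairs; logistic regression is just one
# possible choice of the learning algorithm A.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_on_labeled_pool(L):
    X = np.array([x for x, _ in L])
    y = np.array([label for _, label in L])
    model = LogisticRegression(max_iter=1000)  # placeholder for the algorithm A
    model.fit(X, y)
    return model                               # g_i, published to the model library
```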
The beneficial effects are that:
the distributed asynchronous active labeling method provided by the invention can well cope with large-scale active learning task scenes with large data magnitude or large model scale under the distributed scene with a plurality of server nodes and worker nodes. In the framework, the server node is responsible for training the predictive model, and the worker node focuses on selecting the samples that are most cost-effective for model promotion. The worker nodes exist independently, and the sample to be queried is selected from unmarked data maintained by the worker nodes by downloading a model of the latest version at the current moment from the server node. Further, the sample to be queried is added into a sample pool to be queried commonly maintained by each worker node, so that a marker can conveniently retrieve and mark the sample. When the sample to be queried is marked, the sample to be queried is added into a marking pool, and a server node can update a model according to the new marking pool. When the model is updated, the prediction model is independently trained on each server, and the model updating frequency is improved through the alternate operation of a plurality of servers; in case of instance selection, each worker actively selects from a self-maintained unlabeled data pool, and adopts diversified sampling strategies among a plurality of workers. By implementing the framework, information can be efficiently communicated among the server node, the worker node and the annotators through two shared data pools, so that the server node, the worker node and the annotators can work asynchronously. On one hand, the method avoids the synchronization among the three steps of model training in active learning, instance selection and label inquiry, thereby avoiding the waiting of a labeling person and improving the labeling efficiency. On the other hand, under the Multi-Server Multi-Worker framework, the model updating frequency is increased in a pipeline working mode, and various sampling strategies are introduced between workers, so that the effectiveness of active learning sample selection can be maintained.
Drawings
FIG. 1 is a flow chart of a distributed asynchronous active labeling method provided by the invention;
FIG. 2 is a framework structure diagram of the distributed asynchronous active labeling method provided by the invention;
FIG. 3 is a schematic diagram of interaction of management nodes in the distributed asynchronous active labeling method provided by the invention;
FIG. 4 is a schematic diagram of a Multi-Server update mechanism in the distributed asynchronous active labeling method provided by the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
The distributed asynchronous active labeling method provided by the invention is shown in figure 1. The method specifically comprises the following steps:
Step S1, configuring the active learning annotation scenario parameters;
the active learning annotation scenario comprises m servers for model training and k workers for instance selection; each server learns a prediction model based on the labeled sample pool L; each worker independently maintains a portion of the unlabeled samples, and all these unlabeled sample sets together are defined as the unlabeled sample pool D; S is defined as the set of selected samples to be queried. Specifically,
the labeled sample pool L is expressed as follows:
L represents the set of n_l labeled samples;
the unlabeled sample pool D is represented as follows:
D = D_1 ∪ D_2 ∪ … ∪ D_k
D represents the k partitioned unlabeled sample sets, maintained by the k workers respectively. The unlabeled sample sets in the unlabeled sample pool D are partitioned by means of random selection or clustering, as sketched below.
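The following is a minimal sketch of the two partitioning options mentioned above, splitting the unlabeled pool D into the k per-worker sets D_1, ..., D_k; the use of scikit-learn's KMeans for the clustering variant is an assumption made only for illustration.

```python
# Minimal sketch of partitioning the unlabeled pool D into k worker sets.
# Assumptions: D is a 2-D numpy array of feature vectors; KMeans is only one
# possible clustering choice.
import numpy as np
from sklearn.cluster import KMeans

def partition_random(D, k, seed=0):
    """Split D into k roughly equal parts by random selection."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(D))
    return [D[part] for part in np.array_split(idx, k)]

def partition_by_clustering(D, k, seed=0):
    """Split D into k parts so that each worker receives one cluster."""
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(D)
    return [D[labels == j] for j in range(k)]
```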
Once this configuration has been carried out, the method provided by the invention can be started to perform the active learning task. During execution, it must be checked continually whether the model has reached the expected performance or the query budget has been exhausted, in order to decide whether the learning process should terminate. If the stopping criterion is reached, the best-performing model is taken from the historically updated model library and the learning process ends.
Fig. 2 shows the scenario structure of the learning method of the present invention. As in traditional synchronous active learning, the framework is still composed of three parts, namely model training, instance selection and label querying, but it allows them to work asynchronously without waiting for one another. In this framework, three data pools need to be maintained: the labeled sample pool L, the unlabeled sample pool D, and the to-be-queried sample pool S. These three data pools are maintained by the framework itself; apart from the configuration required at initial start-up, no further intervention is needed.
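To make the later loops concrete, the following is a minimal in-process sketch of the three data pools and the shared model library, protected by a single lock so that servers, workers and annotators can access them asynchronously; a real deployment would typically use distributed storage and the management nodes described below, so this class is an illustrative assumption only.

```python
# Minimal in-memory sketch of the shared state: labeled pool L, partitioned
# unlabeled pool D, to-be-queried pool S, and the model library filled by the
# servers. A single lock keeps the sketch simple; this is an assumption, not
# the invention's prescribed storage layout.
import threading

class SharedPools:
    def __init__(self, D_partitions):
        self.L = []                    # labeled pool: list of (x, y) pairs
        self.D = D_partitions          # list of k unlabeled sets D_1..D_k
        self.S = []                    # to-be-queried pool shared by all workers
        self.models = []               # model library published by the servers
        self.lock = threading.Lock()

    def publish_model(self, g):
        with self.lock:
            self.models.append(g)      # newest model is models[-1]

    def latest_model(self):
        with self.lock:
            return self.models[-1] if self.models else None
```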
Specifically, Step S2, model training stage;
the prediction model trained by each server is independent of the others; all servers are updated during the learning stage; whenever a model training stage finishes, the servers monitor the labeled sample pool L; when the labeled sample pool L has received new labeled sample data through iterative updating, a new round of model training and updating is started. Specifically, the model training mode of a server is as follows:
g_i = A(L)
where A represents the specific algorithm used when learning a prediction model on a server, and g_i is the trained model; the learning algorithms A used by different servers are mutually independent; when the servers are updated, the model is updated by employing a Multi-Server update mechanism, each server running its own training loop as sketched below.
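A minimal sketch of one server's loop is given below, assuming the SharedPools container and the train_on_labeled_pool stand-in for A sketched earlier in this description; each of the m servers runs such a loop independently, which is what produces the Multi-Server update behaviour. The polling interval and all names are illustrative assumptions.

```python
# Minimal sketch of one server's training loop (step S2): watch the labeled
# pool L, retrain whenever new labeled data has arrived, and publish the new
# model to the shared library.
import time

def server_loop(pools, stop_event, poll=0.5):
    seen = 0
    while not stop_event.is_set():
        with pools.lock:
            snapshot = list(pools.L)                # current labeled pool L
        if len(snapshot) > seen:                    # new labeled samples detected
            g_i = train_on_labeled_pool(snapshot)   # g_i = A(L)
            pools.publish_model(g_i)                # add to the model library
            seen = len(snapshot)
        else:
            time.sleep(poll)                        # keep monitoring L
```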
Step S3, instance selection stage;
each worker independently maintains part of the unlabeled sample set, and the workers operate independently of one another; each worker obtains the latest model from the model library updated by all servers, and selects the most cost-effective samples to be queried from the unlabeled sample data it maintains; specifically,
first, each worker directly retrieves, from the models {g_1, g_2, …, g_m} trained by all servers, the most recently updated model g* at the current moment; the worker then directly uses an active sample selection algorithm to continuously select, from the unlabeled data pool it maintains, the most cost-effective samples to be queried, i.e. those most valuable for improving model performance, as follows:
S_j = S_j(g*, D_j)
where S_j denotes the active sample selection algorithm adopted by the j-th worker and D_j denotes the unlabeled sample set maintained by the j-th worker; taking as input the most recently updated model g* at the current moment and the unlabeled sample set maintained by the j-th worker, the most cost-effective samples to be queried selected by the j-th worker are obtained; in practice, each worker may adopt a different active sample selection algorithm according to its own requirements.
It should be noted that the worker directly retrieves the most recently updated model from the servers to perform active sample selection, without regard to whether g* has fully utilized all samples in the current latest labeled sample pool. Thus, the worker does not need to wait for model training when performing instance selection.
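A minimal sketch of a worker loop follows, combining the SharedPools container and the select_to_query strategy sketched earlier; it illustrates that the worker simply takes whatever model is newest at the moment of selection and never blocks on training. The selection cadence is an illustrative assumption, and removal of already-selected samples from D_j is omitted for brevity.

```python
# Minimal sketch of one worker's loop (step S3): fetch the latest model,
# select samples from its own unlabeled set D_j, and push them into the
# shared to-be-queried pool S.
import time

def worker_loop(pools, j, stop_event, period=1.0):
    while not stop_event.is_set():
        g_star = pools.latest_model()              # most recently updated model
        if g_star is not None and len(pools.D[j]) > 0:
            picked = select_to_query(g_star, pools.D[j])
            with pools.lock:
                pools.S.extend((j, int(i)) for i in picked)   # queue for annotation
        time.sleep(period)                         # selection cadence of worker j
```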
Finally, the samples to be queried selected by all workers are gathered into the set S, which is shared by all workers.
Step S4, label query stage;
the annotator randomly selects unlabeled samples from the to-be-queried sample set S, labels the selected samples, and adds the labeled samples to the labeled sample pool L.
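The label-query stage can likewise be sketched as a loop run by each annotator; the oracle callable standing in for the human annotator and the brief sleep while S is empty are assumptions made for illustration.

```python
# Minimal sketch of the annotator loop (step S4): draw a random entry from
# the shared pool S, obtain its label, and append the result to the labeled
# pool L, which in turn lets the servers retrain.
import random
import time

def annotator_loop(pools, oracle, stop_event):
    while not stop_event.is_set():
        with pools.lock:
            item = pools.S.pop(random.randrange(len(pools.S))) if pools.S else None
        if item is None:
            time.sleep(0.1)                 # nothing queued yet; wait briefly
            continue
        j, i = item
        x = pools.D[j][i]                   # the sample chosen for labeling
        y = oracle(x)                       # human annotation, modeled as a call
        with pools.lock:
            pools.L.append((x, y))          # new supervision for the servers
```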
Because implementing the task in a distributed mode is rather complex, coordination and synchronization among all nodes need to be considered comprehensively. Therefore, in a concrete deployment, additional management nodes have to be introduced to coordinate the server nodes and worker nodes, as shown in fig. 3. A resource management node is arranged among the labeled sample pool L, the to-be-queried sample set S, the servers and the workers, to coordinate the transmission of information among the sample sets, the servers and the workers; and server management nodes are set up within the server group formed by the servers, to maintain the model library updated by all servers, so that the worker side can obtain the most recently updated models in time. The number of these two types of nodes can be set according to the size of the specific task. The basic functions the management nodes must implement are fixed, but the concrete algorithms they run, or any functional extensions, will vary with the requirements of the actual scenario. With the management nodes in place, only the model update algorithm needs to be specified for the server nodes, and only the sample query strategy needs to be chosen for the worker nodes. Furthermore, while the method of the invention is running, annotators can label data continuously without excessive intervention, with little or no waiting for model training and query selection.
FIG. 4 is a schematic diagram of the Multi-Server update mechanism used by the servers during model training and updating in the present invention. In synchronous active learning, once new labeled samples are added to the training set, the model is updated promptly based on the latest labeled sample pool. This ensures that the model makes full use of all supervision information, and the subsequently selected samples to be labeled can therefore fully reflect the needs of the current model. In an asynchronous scenario, this property is not preserved. To cope with this problem, the invention borrows the idea of a pipeline and adopts the Multi-Server update mechanism to update the model more frequently, so that label data is exploited as promptly as possible. To understand the Multi-Server update mechanism more intuitively, a specific embodiment is given below.
Assume that one round of model training costs 3 times as much time as one instance selection. With 3 servers, the first 4 instance selections performed by the worker side are all based on the version-0 model, but every round of instance selection after that always receives a newer model. By greatly increasing the model update frequency, the Multi-Server update mechanism therefore allows the worker side to select instances for the annotator more effectively, and effectively alleviates the reduced cost effectiveness of labels in the asynchronous scenario.
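The timing behaviour in this embodiment can be checked with a small simulation; the staggered start of the servers (one selection period apart) and the rule that a model counts only if it finished strictly before the round starts are assumptions used to reproduce the schedule described above.

```python
# Minimal timing sketch of the Multi-Server pipeline: training costs 3 time
# units, one instance selection costs 1, and the 3 servers start training
# staggered by one unit. Under these assumptions the first 4 selection
# rounds see model version 0, and every later round sees one newer version.
TRAIN_COST = 3      # one training = 3x one selection
M_SERVERS = 3
N_ROUNDS = 10

finish_times = []                     # times at which some server publishes a model
for s in range(M_SERVERS):
    t = s                             # server s starts its first training at time s
    while t < N_ROUNDS:
        t += TRAIN_COST
        finish_times.append(t)

for r in range(1, N_ROUNDS + 1):
    start = r - 1                     # selection round r starts at time r-1
    version = sum(1 for ft in finish_times if ft < start)   # models already published
    print(f"round {r:2d}: starts at t={start:2d}, uses model version {version}")
```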
The foregoing is only a preferred embodiment of the invention, it being noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.

Claims (5)

1. The distributed asynchronous active labeling method is characterized by comprising the following steps of:
Step S1, configuring the active learning annotation scenario parameters;
the active learning annotation scenario comprises m servers for model training and k workers for instance selection; each server learns a prediction model based on the labeled sample pool L; each worker independently maintains a portion of the unlabeled samples, and all these unlabeled sample sets together are defined as the unlabeled sample pool D; S is defined as the set of selected samples to be queried;
Step S2, model training stage;
the prediction model trained by each server is independent of the others; all servers are updated during the learning stage; whenever a model training stage finishes, the servers monitor the labeled sample pool L; when the labeled sample pool L has received new labeled sample data through iterative updating, a new round of model training and updating is started;
Step S3, instance selection stage;
each worker independently maintains part of the unlabeled sample set, and the workers operate independently of one another; each worker obtains the latest model from the model library updated by all servers, and selects the most cost-effective samples to be queried from the unlabeled sample data it maintains; specifically,
first, each worker directly retrieves, from the models {g_1, g_2, …, g_m} trained by all servers, the most recently updated model g* at the current moment; the worker then directly uses an active sample selection algorithm to continuously select the most cost-effective samples to be queried from the unlabeled data pool it maintains, as follows:
S_j = S_j(g*, D_j)
where S_j denotes the active sample selection algorithm adopted by the j-th worker and D_j denotes the unlabeled sample set maintained by the j-th worker; taking as input the most recently updated model g* at the current moment and the unlabeled sample set maintained by the j-th worker, the most cost-effective samples to be queried selected by the j-th worker are obtained; in practice, each worker may adopt a different active sample selection algorithm according to its own requirements;
finally, the samples to be queried selected by all workers are gathered into the set S, which is shared by all workers;
Step S4, label query stage;
the annotator randomly selects unlabeled samples from the to-be-queried sample set S, labels the selected samples, and adds the labeled samples to the labeled sample pool L.
2. The distributed asynchronous active labeling method according to claim 1, wherein in step S1, the labeled sample pool L is specifically represented as follows:
L represents the set of n_l labeled samples;
the unlabeled sample pool D is specifically represented as follows:
D = D_1 ∪ D_2 ∪ … ∪ D_k
D represents the k partitioned unlabeled sample sets, maintained by the k workers respectively.
3. The distributed asynchronous active labeling method according to claim 2, wherein the unlabeled sample sets in the unlabeled sample pool D are partitioned by means of random selection or clustering.
4. The distributed asynchronous active labeling method according to claim 1, wherein a plurality of resource management nodes are arranged among the labeled sample pool L, the to-be-queried sample set S, the servers and the workers, to coordinate the transmission of information among the sample sets, the servers and the workers; and a plurality of server management nodes are set up within the server group formed by the servers, to maintain the model library updated by all servers, so that the worker side can obtain the most recently updated model in time.
5. The distributed asynchronous active labeling method according to claim 1, wherein the model training mode of the server in step S2 is specifically as follows:
g_i = A(L)
where A represents the specific algorithm used when learning a prediction model on a server, and g_i is the trained model; the learning algorithms A used by different servers are mutually independent; when the servers are updated, the model is updated by employing a Multi-Server update mechanism.
CN202110801168.2A 2021-07-15 2021-07-15 Distributed asynchronous active labeling method Active CN113642610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110801168.2A CN113642610B (en) 2021-07-15 2021-07-15 Distributed asynchronous active labeling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110801168.2A CN113642610B (en) 2021-07-15 2021-07-15 Distributed asynchronous active labeling method

Publications (2)

Publication Number Publication Date
CN113642610A (en) 2021-11-12
CN113642610B (en) 2024-04-02

Family

ID=78417470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110801168.2A Active CN113642610B (en) 2021-07-15 2021-07-15 Distributed asynchronous active labeling method

Country Status (1)

Country Link
CN (1) CN113642610B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345661A (en) * 2018-01-31 2018-07-31 华南理工大学 A kind of Wi-Fi clustering methods and system based on extensive Embedding technologies
CN111160469A (en) * 2019-12-30 2020-05-15 湖南大学 Active learning method of target detection system
WO2021022572A1 (en) * 2019-08-07 2021-02-11 南京智谷人工智能研究院有限公司 Active sampling method based on meta-learning
CN113033098A (en) * 2021-03-26 2021-06-25 山东科技大学 Ocean target detection deep learning model training method based on AdaRW algorithm

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345661A (en) * 2018-01-31 2018-07-31 华南理工大学 A kind of Wi-Fi clustering methods and system based on extensive Embedding technologies
WO2021022572A1 (en) * 2019-08-07 2021-02-11 南京智谷人工智能研究院有限公司 Active sampling method based on meta-learning
CN111160469A (en) * 2019-12-30 2020-05-15 湖南大学 Active learning method of target detection system
CN113033098A (en) * 2021-03-26 2021-06-25 山东科技大学 Ocean target detection deep learning model training method based on AdaRW algorithm

Also Published As

Publication number Publication date
CN113642610A (en) 2021-11-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant