CN113642610B - Distributed asynchronous active labeling method - Google Patents
- Publication number
- CN113642610B CN113642610B CN202110801168.2A CN202110801168A CN113642610B CN 113642610 B CN113642610 B CN 113642610B CN 202110801168 A CN202110801168 A CN 202110801168A CN 113642610 B CN113642610 B CN 113642610B
- Authority
- CN
- China
- Prior art keywords
- worker
- sample
- server
- model
- unlabeled
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a distributed asynchronous active labeling method. In a distributed scenario with a plurality of server nodes and worker nodes, the server nodes are responsible for training and updating the prediction model, while the worker nodes select samples to be queried and send them to an annotator for labeling. During model updating, each server independently trains a prediction model; during instance selection, each worker actively selects from the unlabeled data pool it maintains, and diversified sampling strategies are adopted across the workers. Under this framework, information is efficiently communicated among the server nodes, the worker nodes, and the annotators through two shared data pools, so that all three can work asynchronously. On the one hand, the method avoids synchronization among the three steps of active learning, namely model training, instance selection, and label querying, thereby sparing the annotators from waiting and improving labeling efficiency; on the other hand, under the Multi-Server update mode, the model update frequency is increased, and diversified sampling strategies are introduced across the workers, so the effectiveness of active-learning sample selection is maintained.
Description
Technical Field
The invention relates to the field of computer technology application, in particular to a distributed asynchronous active labeling method.
Background
Active learning is a primary learning paradigm for task scenarios where only limited labeled data are available, and it has attracted much attention in recent years. Existing active learning methods have achieved great success in selecting the most cost-effective samples for improving model performance, but these methods typically work in a synchronous manner. Specifically, in each iteration of active learning, three steps must be performed in sequence: first, a model is trained on the currently labeled data; second, the most cost-effective unlabeled samples are selected according to the trained model; finally, the labels of the selected samples are queried from the annotator. These three steps are repeated until the model meets the performance requirements or the query budget is exhausted. Clearly, each step depends on the completion of the previous one, which means the steps cannot be executed asynchronously.
In practical application scenarios, there are typically many annotators labeling the selected unlabeled data together; particularly in crowdsourcing environments, a large number of online users provide labeling services simultaneously. Model training and sample selection, however, are typically computationally intensive, which means it takes a long time in each iteration to generate the samples to be queried for labeling. The annotators therefore have to wait, after each round of labeling, for the unlabeled samples of the next round. This greatly impairs the annotators' user experience, reduces labeling efficiency, and severely limits the application of active learning. A more practical mechanism is strongly needed, one that allows annotators to label data continuously without waiting for model training and query selection.
Disclosure of Invention
The invention aims to: to solve the problems in the background art, the invention provides a distributed asynchronous active labeling method, which improves the user experience and labeling efficiency of annotators in practical active-learning labeling scenarios and further exploits the potential advantages of active learning.
The technical scheme is as follows: in order to achieve the above purpose, the invention adopts the following technical scheme:
a distributed asynchronous active labeling method comprises the following steps:
step S1, configuring active learning annotation scene parameters;
the active learning annotation scene comprises m servers for model training and k workers for instance selection; each server learns a prediction model based on the labeled sample pool L; each worker independently maintains a portion of the unlabeled samples, the union of all these sets being defined as the unlabeled sample pool D; S is defined as the set of selected samples to be queried;
Step S2, the model training stage;
the prediction models trained by the servers are mutually independent; all servers are updated during the learning stage; when a model training round finishes, each server monitors the labeled sample pool L; whenever the labeled sample pool L receives newly labeled sample data through iterative updating, a new round of model training is started;
Step S3, the instance selection stage;
each worker independently maintains part of the unlabeled sample set, and the workers operate directly and independently of one another; each worker acquires the latest model from the training model library updated by all servers, and selects the most cost-effective samples to be queried from the unlabeled sample data it maintains; specifically,
first, each worker directly retrieves, from the models {g_1, g_2, …, g_m} trained by all servers, the most recently updated model g* at the current moment; the worker then directly applies an active sample selection algorithm to continuously select the most cost-effective samples to be queried from its own unlabeled data pool, as follows:
S_j = S_j(g*, D_j)
where S_j denotes the active sample selection algorithm adopted by the j-th worker and D_j denotes the unlabeled sample set maintained by the j-th worker; given the most recently updated model g* and the unlabeled sample set of the j-th worker as inputs, it returns the most cost-effective samples to be queried as selected by the j-th worker; in practice, each worker may adopt a different active sample selection algorithm according to its own needs;
and finally, summarizing the samples to be queried selected by all the workers into a set S, and sharing the set S by all the workers.
Step S4, the label querying stage;
the annotator randomly selects unlabeled samples from the sample set S to be queried, marks the selected samples, and adds the marked samples into the marked sample set L.
Further, in step S1, the labeled sample pool L is specifically expressed as follows:
L denotes a set of n_l labeled samples;
the unlabeled sample pool D is specifically expressed as follows:
D = D_1 ∪ D_2 ∪ … ∪ D_k
where D consists of k partitioned unlabeled sample sets, maintained respectively by the k workers.
Further, the unlabeled sample sets in the unlabeled sample pool D are partitioned by random selection or by clustering.
Further, a plurality of resource management nodes are arranged among the labeled sample pool L, the sample set S to be queried, the servers, and the workers, for coordinating information transmission among the sample sets, the servers, and the workers; and a plurality of server management nodes are set within the server group formed by the servers, for maintaining the library of models updated by all servers, so that the worker side can obtain the most recently updated model in time.
Further, the model training performed by a server in step S2 is specifically:
g_i = A(L)
where A denotes the specific algorithm used when learning a prediction model on a server and g_i is the trained model; the learning algorithms A used by different servers are mutually independent; when the servers are updated, the model is updated via a Multi-Server update mechanism.
The beneficial effects are that:
the distributed asynchronous active labeling method provided by the invention can well cope with large-scale active learning task scenes with large data magnitude or large model scale under the distributed scene with a plurality of server nodes and worker nodes. In the framework, the server node is responsible for training the predictive model, and the worker node focuses on selecting the samples that are most cost-effective for model promotion. The worker nodes exist independently, and the sample to be queried is selected from unmarked data maintained by the worker nodes by downloading a model of the latest version at the current moment from the server node. Further, the sample to be queried is added into a sample pool to be queried commonly maintained by each worker node, so that a marker can conveniently retrieve and mark the sample. When the sample to be queried is marked, the sample to be queried is added into a marking pool, and a server node can update a model according to the new marking pool. When the model is updated, the prediction model is independently trained on each server, and the model updating frequency is improved through the alternate operation of a plurality of servers; in case of instance selection, each worker actively selects from a self-maintained unlabeled data pool, and adopts diversified sampling strategies among a plurality of workers. By implementing the framework, information can be efficiently communicated among the server node, the worker node and the annotators through two shared data pools, so that the server node, the worker node and the annotators can work asynchronously. On one hand, the method avoids the synchronization among the three steps of model training in active learning, instance selection and label inquiry, thereby avoiding the waiting of a labeling person and improving the labeling efficiency. 
On the other hand, under the Multi-Server Multi-Worker framework, the model updating frequency is increased in a pipeline working mode, and various sampling strategies are introduced between workers, so that the effectiveness of active learning sample selection can be maintained.
Drawings
FIG. 1 is a flow chart of a distributed asynchronous active labeling method provided by the invention;
FIG. 2 is a framework structure diagram of the distributed asynchronous active labeling method provided by the invention;
FIG. 3 is a schematic diagram of interaction of management nodes in the distributed asynchronous active labeling method provided by the invention;
FIG. 4 is a schematic diagram of a Multi-Server update mechanism in the distributed asynchronous active labeling method provided by the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
The distributed asynchronous active labeling method provided by the invention is shown in figure 1. The method specifically comprises the following steps:
step S1, configuring active learning annotation scene parameters;
the active learning annotation scene comprises m servers for model training and k workers for instance selection; each server learns a prediction model based on the labeled sample pool L; each worker independently maintains a portion of the unlabeled samples, the union of all these sets being defined as the unlabeled sample pool D; S is defined as the set of selected samples to be queried. Specifically,
the label-sample cell L is specifically expressed as follows:
l represents n l A set of individual marker samples;
unlabeled sample cell D is specifically represented as follows:
D=D 1 ∪D 2 ∪…∪D k
d represents k partitioned unlabeled exemplar sets, maintained by k worker respectively. The unlabeled exemplar sets in the labeled exemplar pool D are partitioned by way of random selection or clustering.
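For the random-selection variant of this partitioning, a minimal sketch is given below (the function name and round-robin dealing scheme are illustrative choices, not specified by the patent; the clustering variant would group similar samples into the same D_j instead):

```python
import random

def partition_random(unlabeled_pool, k, seed=0):
    """Split the unlabeled pool D into k disjoint subsets D_1..D_k,
    one per worker, by shuffling and dealing round-robin so the
    subset sizes differ by at most one."""
    rng = random.Random(seed)
    pool = list(unlabeled_pool)
    rng.shuffle(pool)
    return [pool[j::k] for j in range(k)]

# D = {0..9} split across k = 3 workers; the union recovers D exactly
parts = partition_random(range(10), 3)
assert sorted(x for p in parts for x in p) == list(range(10))
```

Because the subsets are disjoint, no two workers can ever propose the same sample, which is what lets them run selection without coordinating.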
After this configuration, the method provided by the invention can be launched to carry out the active learning task. During execution, it must be continually determined whether model training has met expectations or the query budget has been exhausted, in order to decide whether to terminate the learning process. Once the stopping criterion is reached, the best-performing model is retrieved from the historically updated model library and the learning process ends.
Fig. 2 shows the framework structure of the learning method of the invention. Unlike traditional synchronous active learning, the framework still consists of three parts, namely model training, instance selection, and label querying, but allows them to work asynchronously without waiting for one another. In this framework, three data pools must be maintained: the labeled sample pool L, the unlabeled sample pool D, and the to-be-queried sample pool S. These three data pools are maintained by the framework itself; apart from the configuration required at initial start-up, no later intervention is needed.
Specifically, step S2, model training stage;
the prediction models trained by the servers are mutually independent; all servers are updated during the learning stage; when a model training round finishes, each server monitors the labeled sample pool L; whenever the labeled sample pool L receives newly labeled sample data through iterative updating, a new round of model training is started. Specifically, a server trains its model as follows:
g_i = A(L)
where A denotes the specific algorithm used when learning a prediction model on a server and g_i is the trained model; the learning algorithms A used by different servers are mutually independent; when the servers are updated, the model is updated via a Multi-Server update mechanism.
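A single server's retrain-and-publish behavior, plus the shared model library the workers read from, might be sketched as follows. This is a hedged illustration: `ModelLibrary` and `server_loop` are hypothetical names, `train` stands in for the algorithm A, and a simple length check stands in for "the labeled pool L received new data".

```python
import threading
import time

class ModelLibrary:
    """Shared library of trained models; workers always read the most
    recently published version (illustrative, not from the patent)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._latest = None

    def publish(self, model):
        with self._lock:
            self._version += 1
            self._latest = (self._version, model)

    def latest(self):
        """Return (version, model) of the newest model, or None if empty."""
        with self._lock:
            return self._latest

def server_loop(labeled_pool, library, train, stop, poll=0.01):
    """One server node: monitor L and retrain g_i = A(L) whenever new
    labeled samples have arrived, then publish the trained model."""
    trained_on = 0
    while not stop.is_set():
        if len(labeled_pool) > trained_on:    # L grew since last training
            snapshot = list(labeled_pool)     # train on a snapshot of L
            trained_on = len(snapshot)
            library.publish(train(snapshot))  # g_i = A(L)
        time.sleep(poll)

lib = ModelLibrary()
lib.publish("g_1")
lib.publish("g_2")
assert lib.latest() == (2, "g_2")
```

Several independent `server_loop` threads sharing one `ModelLibrary` give the alternating multi-server operation described above: each publishes whenever its own run finishes, so the library's version advances more often than any single server retrains.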
Step S3, the instance selection stage;
each worker independently maintains part of the unlabeled sample set, and the workers operate directly and independently of one another; each worker acquires the latest model from the training model library updated by all servers, and selects the most cost-effective samples to be queried from the unlabeled sample data it maintains. Specifically,
first, each worker directly retrieves, from the models {g_1, g_2, …, g_m} trained by all servers, the most recently updated model g* at the current moment; the worker then directly applies an active sample selection algorithm to continuously select, from its own unlabeled data pool, the most cost-effective samples to be queried, i.e., those most valuable for improving model performance, as follows:
S_j = S_j(g*, D_j)
where S_j denotes the active sample selection algorithm adopted by the j-th worker and D_j denotes the unlabeled sample set maintained by the j-th worker; given the most recently updated model g* and the unlabeled sample set of the j-th worker as inputs, it returns the most cost-effective samples to be queried as selected by the j-th worker. In practice, each worker may adopt a different active sample selection algorithm according to its own needs.
It should be noted that a worker directly retrieves the most recently updated model from the servers to perform active sample selection, without being limited by whether g* has already made full use of all samples in the current latest labeled sample pool. Thus, a worker never needs to wait for model training when performing instance selection.
And finally, summarizing the samples to be queried selected by all the workers into a set S, and sharing the set S by all the workers.
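One selection step for a single worker might look like the sketch below. The patent leaves S_j open, so uncertainty sampling is used here purely as one plausible choice of strategy; `worker_step`, `get_latest_model`, and the toy probability model are all illustrative assumptions.

```python
def uncertainty_select(model, unlabeled, batch=1):
    """One possible strategy S_j: pick the samples whose predicted
    positive-class probability is closest to 0.5 (least confident)."""
    return sorted(unlabeled, key=lambda x: abs(model(x) - 0.5))[:batch]

def worker_step(get_latest_model, unlabeled_set, query_set,
                select=uncertainty_select):
    """One selection step for worker j: fetch g* (no waiting for training
    to finish), then move the chosen samples from D_j into the shared S."""
    g_star = get_latest_model()
    if g_star is None:              # no model published yet: nothing to do
        return []
    chosen = select(g_star, unlabeled_set)
    for x in chosen:
        unlabeled_set.remove(x)
        query_set.append(x)
    return chosen

# toy model: predicted probability of sample x is x / 10
model = lambda x: x / 10
D_j = [0, 2, 5, 9]
S = []
worker_step(lambda: model, D_j, S)
assert S == [5] and 5 not in D_j   # 5 is the least-confident sample
```

Because different workers can pass different `select` functions, this step also realizes the diversified-strategy idea: one worker might use uncertainty, another a diversity- or cluster-based criterion, all against the same g*.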
Step S4, the label querying stage;
the annotator randomly selects unlabeled samples from the sample set S to be queried, marks the selected samples, and adds the marked samples into the marked sample set L.
Because task implementation in a distributed setting is complex, coordination and synchronization among all nodes must be considered comprehensively. Therefore, in a concrete deployment, additional management nodes are introduced to coordinate the server nodes and worker nodes, as shown in Fig. 3. Resource management nodes are arranged among the labeled sample pool L, the sample set S to be queried, the servers, and the workers, for coordinating information transmission among the sample sets, the servers, and the workers; server management nodes are set within the server groups formed by the servers, for maintaining the library of models updated by all servers, so that the worker side can obtain the most recently updated model in time. The number of these two types of nodes can be set according to the task scale. The basic functions the management nodes must implement are fixed, but the specific algorithms they execute and their functional extensions will vary with the actual scenario requirements. With the management nodes in place, only a model update algorithm needs to be configured for the server nodes, and only a sample query strategy needs to be chosen for the worker nodes. Furthermore, while the method runs, annotators can label data continuously without excessive intervention, waiting as little as possible, ideally not at all, for model training and query selection.
FIG. 4 is a schematic diagram of the Multi-Server update mechanism used by the servers during model training and updating. In synchronous active learning, as soon as newly labeled samples are added to the training set, the model is promptly updated on the latest labeled sample pool. This ensures the model makes full use of all supervision information, so the subsequently selected samples to be labeled fully reflect the needs of the current model. In an asynchronous scenario, this property is not preserved. To cope with this problem, the invention borrows the idea of a pipeline and adopts a Multi-Server update mechanism to update the model more frequently, thereby utilizing the label data as promptly as possible. To make the Multi-Server update mechanism more intuitive, a specific embodiment is given below.
Assume the time cost of one model training run is 3 times that of one instance selection. With 3 servers, the first 4 instance selections performed by the worker side are all based on the version-0 model, but every round of instance selection thereafter always receives a new model. The Multi-Server update mechanism therefore greatly increases the model update frequency, enabling the worker side to select instances for the annotator more effectively and alleviating the reduced cost-effectiveness of labels in asynchronous scenarios.
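The embodiment's timing can be checked with a small simulation. The assumptions are spelled out in the code and are mine, not the patent's exact schedule: server i starts its first run at step i (staggered by one selection), every run takes `train_cost` steps, selection takes one step, and a model finished at step f becomes visible at step f + 1.

```python
def versions_seen(n_servers, train_cost, n_steps):
    """Model version visible to the worker at each selection step, under
    the pipeline schedule: server i's runs finish at i + m * train_cost."""
    finish = sorted(i + m * train_cost
                    for i in range(n_servers)
                    for m in range(1, n_steps // train_cost + 2))
    # version seen at step t = number of runs finished strictly before t
    return [sum(1 for f in finish if f < t) for t in range(n_steps)]

seen = versions_seen(n_servers=3, train_cost=3, n_steps=8)
# first 4 selections use version 0; every later step gets a strictly newer model
assert seen[:4] == [0, 0, 0, 0]
assert all(b > a for a, b in zip(seen[4:], seen[5:]))
```

Under these assumptions the simulation reproduces the numbers in the embodiment: with training 3 times as costly as selection and 3 staggered servers, the worker is stuck on version 0 for exactly 4 selections and then receives a fresh model every round, whereas a single server would deliver a new model only every 3rd round.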
The foregoing is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and such modifications and adaptations are intended to fall within the scope of the invention.
Claims (5)
1. The distributed asynchronous active labeling method is characterized by comprising the following steps of:
step S1, configuring active learning annotation scene parameters;
the active learning annotation scene comprises m servers for model training and k workers for instance selection; each server learns a prediction model based on the labeled sample pool L; each worker independently maintains a portion of the unlabeled samples, the union of all these sets being defined as the unlabeled sample pool D; S is defined as the set of selected samples to be queried;
Step S2, the model training stage;
the prediction models trained by the servers are mutually independent; all servers are updated during the learning stage; when a model training round finishes, each server monitors the labeled sample pool L; whenever the labeled sample pool L receives newly labeled sample data through iterative updating, a new round of model training is started;
Step S3, the instance selection stage;
each worker independently maintains part of the unlabeled sample set, and the workers operate directly and independently of one another; each worker acquires the latest model from the training model library updated by all servers, and selects the most cost-effective samples to be queried from the unlabeled sample data it maintains; specifically,
first, each worker directly retrieves, from the models {g_1, g_2, …, g_m} trained by all servers, the most recently updated model g* at the current moment; the worker then directly applies an active sample selection algorithm to continuously select the most cost-effective samples to be queried from its own unlabeled data pool, as follows:
S_j = S_j(g*, D_j)
where S_j denotes the active sample selection algorithm adopted by the j-th worker and D_j denotes the unlabeled sample set maintained by the j-th worker; given the most recently updated model g* and the unlabeled sample set of the j-th worker as inputs, it returns the most cost-effective samples to be queried as selected by the j-th worker; in practice, each worker may adopt a different active sample selection algorithm according to its own needs;
finally, summarizing the samples to be queried selected by all the workers into a set S, and sharing the set S by all the workers;
Step S4, the label querying stage;
the annotator randomly selects unlabeled samples from the sample set S to be queried, marks the selected samples, and adds the marked samples into the marked sample set L.
2. The distributed asynchronous active labeling method according to claim 1, wherein in step S1 the labeled sample pool L is specifically expressed as follows:
L denotes a set of n_l labeled samples;
the unlabeled sample pool D is specifically expressed as follows:
D = D_1 ∪ D_2 ∪ … ∪ D_k
where D consists of k partitioned unlabeled sample sets, maintained respectively by the k workers.
3. The distributed asynchronous active labeling method according to claim 2, wherein the unlabeled sample sets in the unlabeled sample pool D are partitioned by random selection or by clustering.
4. The distributed asynchronous active labeling method according to claim 1, wherein a plurality of resource management nodes are arranged among the labeled sample pool L, the sample set S to be queried, the servers, and the workers, for coordinating information transmission among the sample sets, the servers, and the workers; and a plurality of server management nodes are set within the server group formed by the servers, for maintaining the library of models updated by all servers, so that the worker side can obtain the most recently updated model in time.
5. The distributed asynchronous active labeling method according to claim 1, wherein the model training performed by a server in step S2 is specifically:
g_i = A(L)
where A denotes the specific algorithm used when learning a prediction model on a server and g_i is the trained model; the learning algorithms A used by different servers are mutually independent; when the servers are updated, the model is updated via a Multi-Server update mechanism.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110801168.2A CN113642610B (en) | 2021-07-15 | 2021-07-15 | Distributed asynchronous active labeling method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110801168.2A CN113642610B (en) | 2021-07-15 | 2021-07-15 | Distributed asynchronous active labeling method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113642610A CN113642610A (en) | 2021-11-12 |
CN113642610B true CN113642610B (en) | 2024-04-02 |
Family
ID=78417470
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110801168.2A Active CN113642610B (en) | 2021-07-15 | 2021-07-15 | Distributed asynchronous active labeling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113642610B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108345661A (en) * | 2018-01-31 | 2018-07-31 | 华南理工大学 | A kind of Wi-Fi clustering methods and system based on extensive Embedding technologies |
CN111160469A (en) * | 2019-12-30 | 2020-05-15 | 湖南大学 | Active learning method of target detection system |
WO2021022572A1 (en) * | 2019-08-07 | 2021-02-11 | 南京智谷人工智能研究院有限公司 | Active sampling method based on meta-learning |
CN113033098A (en) * | 2021-03-26 | 2021-06-25 | 山东科技大学 | Ocean target detection deep learning model training method based on AdaRW algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN113642610A (en) | 2021-11-12 |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||