CN113642610B - Distributed asynchronous active labeling method - Google Patents

Distributed asynchronous active labeling method

Info

Publication number
CN113642610B
CN113642610B (application CN202110801168.2A)
Authority
CN
China
Prior art keywords
worker
sample
server
model
unlabeled
Prior art date
Legal status
Active
Application number
CN202110801168.2A
Other languages
Chinese (zh)
Other versions
CN113642610A (en)
Inventor
黄圣君
宗辰辰
宁鲲鹏
唐英鹏
Current Assignee
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202110801168.2A priority Critical patent/CN113642610B/en
Publication of CN113642610A publication Critical patent/CN113642610A/en
Application granted granted Critical
Publication of CN113642610B publication Critical patent/CN113642610B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a distributed asynchronous active labeling method. In a distributed scenario with multiple server nodes and worker nodes, the server nodes are responsible for training and updating the prediction model, while the worker nodes select samples to be queried and send them to annotators for labeling. During model updating, each server independently trains a prediction model; during instance selection, each worker selects actively from the unlabeled data pool it maintains, and diversified sampling strategies are adopted across the workers. Under this framework, information is exchanged efficiently among the server nodes, the worker nodes and the annotators through two shared data pools, so that the three can work asynchronously. On the one hand, the method removes the synchronization among the three steps of active learning, namely model training, instance selection and label querying, so annotators no longer need to wait and labeling efficiency is improved. On the other hand, the Multi-Server update mode increases the frequency of model updates, and diversified sampling strategies are introduced among the workers, so the effectiveness of active-learning sample selection is maintained.

Description

Distributed asynchronous active labeling method
Technical Field
The invention relates to the field of computer technology application, in particular to a distributed asynchronous active labeling method.
Background
Active learning is an important learning method for task scenarios where learning must be done with limited labeled data, and it has attracted considerable attention in recent years. Existing active learning methods have achieved great success in selecting the most cost-effective samples to improve model performance, but these methods typically work in a synchronous manner. Specifically, in each iteration of active learning, three steps need to be performed in sequence: first, a model is trained on the currently labeled data; second, the most cost-effective unlabeled samples are selected according to the trained model; finally, the labels of the selected samples are queried from the annotator. These three steps are repeated until the model meets the performance requirements or the query budget is exhausted. Clearly, each of the three steps depends on the completion of the previous one, which means they cannot be executed asynchronously.
In practical application scenarios, there are usually many annotators who jointly label the selected unlabeled data. In crowdsourcing environments in particular, a large number of online users provide labeling services simultaneously. Model training and sample selection, however, are typically computationally intensive, which means that producing the samples to be queried for labeling takes a long time in each iteration. As a result, after finishing each round of annotation, the annotators have to wait for the unlabeled samples of the next round. This greatly impairs the annotators' user experience, reduces labeling efficiency, and severely limits the application of active learning. A more practical mechanism is therefore strongly needed, one that allows annotators to label data continuously without waiting for model training and query selection.
Disclosure of Invention
The invention aims to: in order to solve the problems described in the background art, the invention provides a distributed asynchronous active labeling method, which can improve the user experience and the labeling efficiency in practical active learning annotation scenarios and further exploit the potential advantages of active learning.
Technical solution: in order to achieve the above purpose, the invention adopts the following technical solution:
a distributed asynchronous active labeling method comprises the following steps:
Step S1, configuring the active learning annotation scenario parameters;
the active learning annotation scenario comprises m servers for model training and k workers for instance selection; each server learns a prediction model based on the labeled sample pool L; each worker independently maintains a portion of the unlabeled samples, and all these unlabeled sample sets together are defined as the unlabeled sample pool D; S is defined as the set of selected samples to be queried;
Step S2, model training stage;
the prediction model trained by each server is independent of the others; all servers are updated during the learning stage; whenever a model training stage finishes, the servers monitor the labeled sample pool L; when the labeled sample pool L has received new labeled sample data through iterative updating, a new round of model training and updating is started;
Step S3, instance selection stage;
each worker independently maintains part of the unlabeled sample set, and the workers operate independently of one another; each worker obtains the latest model from the model library updated by all servers, and selects the most cost-effective samples to be queried from the unlabeled sample data it maintains; specifically,
first, each worker directly retrieves, from the models {g_1, g_2, …, g_m} trained by all servers, the most recently updated model g* at the current moment; the worker then directly uses an active sample selection algorithm to continuously select the most cost-effective samples to be queried from the unlabeled data pool it maintains, as follows:
S_j = S_j(g*, D_j)
where S_j denotes the active sample selection algorithm adopted by the j-th worker and D_j denotes the unlabeled sample set maintained by the j-th worker; taking as input the most recently updated model g* at the current moment and the unlabeled sample set maintained by the j-th worker, the most cost-effective samples to be queried selected by the j-th worker are obtained; in practice, each worker may adopt a different active sample selection algorithm according to its own requirements;
finally, the samples to be queried selected by all workers are gathered into the set S, which is shared by all workers.
Step S4, label query stage;
the annotator randomly selects unlabeled samples from the to-be-queried sample set S, labels the selected samples, and adds the labeled samples to the labeled sample pool L.
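As an illustration of the selection rule S_j = S_j(g*, D_j) in step S3, the following is a minimal sketch of one possible strategy, using margin-based uncertainty sampling purely as an example; the invention does not prescribe any particular strategy, and the assumption that g* exposes a scikit-learn style predict_proba method, as well as the batch size, are made only for illustration.

```python
# Minimal sketch of one possible per-worker selection strategy (step S3).
# Assumptions: g_star offers predict_proba, D_j is a 2-D numpy array of
# unlabeled feature vectors, and batch_size is an illustrative parameter.
import numpy as np

def select_to_query(g_star, D_j, batch_size=8):
    """Return indices of the most uncertain samples in the worker's pool D_j."""
    proba = np.asarray(g_star.predict_proba(D_j))   # shape (n_samples, n_classes)
    sorted_p = np.sort(proba, axis=1)                # ascending per row
    margin = sorted_p[:, -1] - sorted_p[:, -2]       # best-vs-second-best margin
    order = np.argsort(margin)                       # small margin = high uncertainty
    return order[:batch_size]                        # S_j: samples worth querying
```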
Further, in step S1, the labeled sample pool L is specifically expressed as follows:
L represents the set of n_l labeled samples;
the unlabeled sample pool D is specifically represented as follows:
D = D_1 ∪ D_2 ∪ … ∪ D_k
D represents the k partitioned unlabeled sample sets, maintained by the k workers respectively.
Further, the unlabeled sample sets in the unlabeled sample pool D are partitioned by means of random selection or clustering.
Further, several resource management nodes are arranged among the labeled sample pool L, the to-be-queried sample set S, the servers and the workers, to coordinate the transmission of information among the sample sets, the servers and the workers; and several server management nodes are set up within the server group formed by the servers, to maintain the model library updated by all servers, so that the worker side can obtain the most recently updated model in time.
Further, the model training mode of the servers in step S2 is specifically as follows:
g_i = A(L)
where A represents the specific algorithm used when learning a prediction model on a server, and g_i is the trained model; the learning algorithms A used by different servers are mutually independent; when the servers are updated, the model is updated by employing a Multi-Server update mechanism.
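As one concrete possibility for the learner A in the formula above, the following sketch fits a logistic regression on the labeled pool; this choice, and the representation of L as a list of (feature vector, label) pairs, are assumptions made only for illustration, since each server is free to use its own algorithm A.

```python
# Minimal sketch of the server-side update g_i = A(L).
# Assumptions: L is a list of (x, y) pairs; logistic regression is just one
# possible choice of the learning algorithm A.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_on_labeled_pool(L):
    X = np.array([x for x, _ in L])
    y = np.array([label for _, label in L])
    model = LogisticRegression(max_iter=1000)  # placeholder for the algorithm A
    model.fit(X, y)
    return model                               # g_i, published to the model library
```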
The beneficial effects are that:
the distributed asynchronous active labeling method provided by the invention can well cope with large-scale active learning task scenes with large data magnitude or large model scale under the distributed scene with a plurality of server nodes and worker nodes. In the framework, the server node is responsible for training the predictive model, and the worker node focuses on selecting the samples that are most cost-effective for model promotion. The worker nodes exist independently, and the sample to be queried is selected from unmarked data maintained by the worker nodes by downloading a model of the latest version at the current moment from the server node. Further, the sample to be queried is added into a sample pool to be queried commonly maintained by each worker node, so that a marker can conveniently retrieve and mark the sample. When the sample to be queried is marked, the sample to be queried is added into a marking pool, and a server node can update a model according to the new marking pool. When the model is updated, the prediction model is independently trained on each server, and the model updating frequency is improved through the alternate operation of a plurality of servers; in case of instance selection, each worker actively selects from a self-maintained unlabeled data pool, and adopts diversified sampling strategies among a plurality of workers. By implementing the framework, information can be efficiently communicated among the server node, the worker node and the annotators through two shared data pools, so that the server node, the worker node and the annotators can work asynchronously. On one hand, the method avoids the synchronization among the three steps of model training in active learning, instance selection and label inquiry, thereby avoiding the waiting of a labeling person and improving the labeling efficiency. On the other hand, under the Multi-Server Multi-Worker framework, the model updating frequency is increased in a pipeline working mode, and various sampling strategies are introduced between workers, so that the effectiveness of active learning sample selection can be maintained.
Drawings
FIG. 1 is a flow chart of a distributed asynchronous active labeling method provided by the invention;
FIG. 2 is a framework structure diagram of the distributed asynchronous active labeling method provided by the invention;
FIG. 3 is a schematic diagram of interaction of management nodes in the distributed asynchronous active labeling method provided by the invention;
FIG. 4 is a schematic diagram of a Multi-Server update mechanism in the distributed asynchronous active labeling method provided by the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
The distributed asynchronous active labeling method provided by the invention is shown in figure 1. The method specifically comprises the following steps:
Step S1, configuring the active learning annotation scenario parameters;
the active learning annotation scenario comprises m servers for model training and k workers for instance selection; each server learns a prediction model based on the labeled sample pool L; each worker independently maintains a portion of the unlabeled samples, and all these unlabeled sample sets together are defined as the unlabeled sample pool D; S is defined as the set of selected samples to be queried. Specifically,
the labeled sample pool L is expressed as follows:
L represents the set of n_l labeled samples;
the unlabeled sample pool D is represented as follows:
D = D_1 ∪ D_2 ∪ … ∪ D_k
D represents the k partitioned unlabeled sample sets, maintained by the k workers respectively. The unlabeled sample sets in the unlabeled sample pool D are partitioned by means of random selection or clustering, as sketched below.
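The following is a minimal sketch of the two partitioning options mentioned above, splitting the unlabeled pool D into the k per-worker sets D_1, ..., D_k; the use of scikit-learn's KMeans for the clustering variant is an assumption made only for illustration.

```python
# Minimal sketch of partitioning the unlabeled pool D into k worker sets.
# Assumptions: D is a 2-D numpy array of feature vectors; KMeans is only one
# possible clustering choice.
import numpy as np
from sklearn.cluster import KMeans

def partition_random(D, k, seed=0):
    """Split D into k roughly equal parts by random selection."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(D))
    return [D[part] for part in np.array_split(idx, k)]

def partition_by_clustering(D, k, seed=0):
    """Split D into k parts so that each worker receives one cluster."""
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(D)
    return [D[labels == j] for j in range(k)]
```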
Once this configuration has been carried out, the method provided by the invention can be started to perform the active learning task. During execution, it must be checked continually whether the model has reached the expected performance or the query budget has been exhausted, in order to decide whether the learning process should terminate. If the stopping criterion is reached, the best-performing model is taken from the historically updated model library and the learning process ends.
Fig. 2 shows the scenario structure of the learning method of the present invention. As in traditional synchronous active learning, the framework is still composed of three parts, namely model training, instance selection and label querying, but it allows them to work asynchronously without waiting for one another. In this framework, three data pools need to be maintained: the labeled sample pool L, the unlabeled sample pool D, and the to-be-queried sample pool S. These three data pools are maintained by the framework itself; apart from the configuration required at initial start-up, no further intervention is needed.
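To make the later loops concrete, the following is a minimal in-process sketch of the three data pools and the shared model library, protected by a single lock so that servers, workers and annotators can access them asynchronously; a real deployment would typically use distributed storage and the management nodes described below, so this class is an illustrative assumption only.

```python
# Minimal in-memory sketch of the shared state: labeled pool L, partitioned
# unlabeled pool D, to-be-queried pool S, and the model library filled by the
# servers. A single lock keeps the sketch simple; this is an assumption, not
# the invention's prescribed storage layout.
import threading

class SharedPools:
    def __init__(self, D_partitions):
        self.L = []                    # labeled pool: list of (x, y) pairs
        self.D = D_partitions          # list of k unlabeled sets D_1..D_k
        self.S = []                    # to-be-queried pool shared by all workers
        self.models = []               # model library published by the servers
        self.lock = threading.Lock()

    def publish_model(self, g):
        with self.lock:
            self.models.append(g)      # newest model is models[-1]

    def latest_model(self):
        with self.lock:
            return self.models[-1] if self.models else None
```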
Specifically, Step S2, model training stage;
the prediction model trained by each server is independent of the others; all servers are updated during the learning stage; whenever a model training stage finishes, the servers monitor the labeled sample pool L; when the labeled sample pool L has received new labeled sample data through iterative updating, a new round of model training and updating is started. Specifically, the model training mode of a server is as follows:
g_i = A(L)
where A represents the specific algorithm used when learning a prediction model on a server, and g_i is the trained model; the learning algorithms A used by different servers are mutually independent; when the servers are updated, the model is updated by employing a Multi-Server update mechanism, each server running its own training loop as sketched below.
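A minimal sketch of one server's loop is given below, assuming the SharedPools container and the train_on_labeled_pool stand-in for A sketched earlier in this description; each of the m servers runs such a loop independently, which is what produces the Multi-Server update behaviour. The polling interval and all names are illustrative assumptions.

```python
# Minimal sketch of one server's training loop (step S2): watch the labeled
# pool L, retrain whenever new labeled data has arrived, and publish the new
# model to the shared library.
import time

def server_loop(pools, stop_event, poll=0.5):
    seen = 0
    while not stop_event.is_set():
        with pools.lock:
            snapshot = list(pools.L)                # current labeled pool L
        if len(snapshot) > seen:                    # new labeled samples detected
            g_i = train_on_labeled_pool(snapshot)   # g_i = A(L)
            pools.publish_model(g_i)                # add to the model library
            seen = len(snapshot)
        else:
            time.sleep(poll)                        # keep monitoring L
```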
Step S3, instance selection stage;
each worker independently maintains part of the unlabeled sample set, and the workers operate independently of one another; each worker obtains the latest model from the model library updated by all servers, and selects the most cost-effective samples to be queried from the unlabeled sample data it maintains; specifically,
first, each worker directly retrieves, from the models {g_1, g_2, …, g_m} trained by all servers, the most recently updated model g* at the current moment; the worker then directly uses an active sample selection algorithm to continuously select, from the unlabeled data pool it maintains, the most cost-effective samples to be queried, i.e. those most valuable for improving model performance, as follows:
S_j = S_j(g*, D_j)
where S_j denotes the active sample selection algorithm adopted by the j-th worker and D_j denotes the unlabeled sample set maintained by the j-th worker; taking as input the most recently updated model g* at the current moment and the unlabeled sample set maintained by the j-th worker, the most cost-effective samples to be queried selected by the j-th worker are obtained; in practice, each worker may adopt a different active sample selection algorithm according to its own requirements.
It should be noted that the worker directly retrieves the most recently updated model from the servers to perform active sample selection, without regard to whether g* has fully utilized all samples in the current latest labeled sample pool. Thus, the worker does not need to wait for model training when performing instance selection.
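A minimal sketch of a worker loop follows, combining the SharedPools container and the select_to_query strategy sketched earlier; it illustrates that the worker simply takes whatever model is newest at the moment of selection and never blocks on training. The selection cadence is an illustrative assumption, and removal of already-selected samples from D_j is omitted for brevity.

```python
# Minimal sketch of one worker's loop (step S3): fetch the latest model,
# select samples from its own unlabeled set D_j, and push them into the
# shared to-be-queried pool S.
import time

def worker_loop(pools, j, stop_event, period=1.0):
    while not stop_event.is_set():
        g_star = pools.latest_model()              # most recently updated model
        if g_star is not None and len(pools.D[j]) > 0:
            picked = select_to_query(g_star, pools.D[j])
            with pools.lock:
                pools.S.extend((j, int(i)) for i in picked)   # queue for annotation
        time.sleep(period)                         # selection cadence of worker j
```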
Finally, the samples to be queried selected by all workers are gathered into the set S, which is shared by all workers.
Step S4, label query stage;
the annotator randomly selects unlabeled samples from the to-be-queried sample set S, labels the selected samples, and adds the labeled samples to the labeled sample pool L.
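The label-query stage can likewise be sketched as a loop run by each annotator; the oracle callable standing in for the human annotator and the brief sleep while S is empty are assumptions made for illustration.

```python
# Minimal sketch of the annotator loop (step S4): draw a random entry from
# the shared pool S, obtain its label, and append the result to the labeled
# pool L, which in turn lets the servers retrain.
import random
import time

def annotator_loop(pools, oracle, stop_event):
    while not stop_event.is_set():
        with pools.lock:
            item = pools.S.pop(random.randrange(len(pools.S))) if pools.S else None
        if item is None:
            time.sleep(0.1)                 # nothing queued yet; wait briefly
            continue
        j, i = item
        x = pools.D[j][i]                   # the sample chosen for labeling
        y = oracle(x)                       # human annotation, modeled as a call
        with pools.lock:
            pools.L.append((x, y))          # new supervision for the servers
```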
Because implementing the task in a distributed mode is rather complex, coordination and synchronization among all nodes need to be considered comprehensively. Therefore, in a concrete deployment, additional management nodes have to be introduced to coordinate the server nodes and worker nodes, as shown in fig. 3. A resource management node is arranged among the labeled sample pool L, the to-be-queried sample set S, the servers and the workers, to coordinate the transmission of information among the sample sets, the servers and the workers; and server management nodes are set up within the server group formed by the servers, to maintain the model library updated by all servers, so that the worker side can obtain the most recently updated models in time. The number of these two types of nodes can be set according to the size of the specific task. The basic functions the management nodes must implement are fixed, but the concrete algorithms they run, or any functional extensions, will vary with the requirements of the actual scenario. With the management nodes in place, only the model update algorithm needs to be specified for the server nodes, and only the sample query strategy needs to be chosen for the worker nodes. Furthermore, while the method of the invention is running, annotators can label data continuously without excessive intervention, with little or no waiting for model training and query selection.
FIG. 4 is a schematic diagram of the Multi-Server update mechanism used by the servers during model training and updating in the present invention. In synchronous active learning, once new labeled samples are added to the training set, the model is updated promptly based on the latest labeled sample pool. This ensures that the model makes full use of all supervision information, and the subsequently selected samples to be labeled can therefore fully reflect the needs of the current model. In an asynchronous scenario, this property is not preserved. To cope with this problem, the invention borrows the idea of a pipeline and adopts the Multi-Server update mechanism to update the model more frequently, so that label data is exploited as promptly as possible. To understand the Multi-Server update mechanism more intuitively, a specific embodiment is given below.
Assume that one round of model training costs 3 times as much time as one instance selection. With 3 servers, the first 4 instance selections performed by the worker side are all based on the version-0 model, but every round of instance selection after that always receives a newer model. By greatly increasing the model update frequency, the Multi-Server update mechanism therefore allows the worker side to select instances for the annotator more effectively, and effectively alleviates the reduced cost effectiveness of labels in the asynchronous scenario.
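The timing behaviour in this embodiment can be checked with a small simulation; the staggered start of the servers (one selection period apart) and the rule that a model counts only if it finished strictly before the round starts are assumptions used to reproduce the schedule described above.

```python
# Minimal timing sketch of the Multi-Server pipeline: training costs 3 time
# units, one instance selection costs 1, and the 3 servers start training
# staggered by one unit. Under these assumptions the first 4 selection
# rounds see model version 0, and every later round sees one newer version.
TRAIN_COST = 3      # one training = 3x one selection
M_SERVERS = 3
N_ROUNDS = 10

finish_times = []                     # times at which some server publishes a model
for s in range(M_SERVERS):
    t = s                             # server s starts its first training at time s
    while t < N_ROUNDS:
        t += TRAIN_COST
        finish_times.append(t)

for r in range(1, N_ROUNDS + 1):
    start = r - 1                     # selection round r starts at time r-1
    version = sum(1 for ft in finish_times if ft < start)   # models already published
    print(f"round {r:2d}: starts at t={start:2d}, uses model version {version}")
```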
The foregoing is only a preferred embodiment of the invention, it being noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.

Claims (5)

1. The distributed asynchronous active labeling method is characterized by comprising the following steps of:
Step S1, configuring the active learning annotation scenario parameters;
the active learning annotation scenario comprises m servers for model training and k workers for instance selection; each server learns a prediction model based on the labeled sample pool L; each worker independently maintains a portion of the unlabeled samples, and all these unlabeled sample sets together are defined as the unlabeled sample pool D; S is defined as the set of selected samples to be queried;
Step S2, model training stage;
the prediction model trained by each server is independent of the others; all servers are updated during the learning stage; whenever a model training stage finishes, the servers monitor the labeled sample pool L; when the labeled sample pool L has received new labeled sample data through iterative updating, a new round of model training and updating is started;
Step S3, instance selection stage;
each worker independently maintains part of the unlabeled sample set, and the workers operate independently of one another; each worker obtains the latest model from the model library updated by all servers, and selects the most cost-effective samples to be queried from the unlabeled sample data it maintains; specifically,
first, each worker directly retrieves, from the models {g_1, g_2, …, g_m} trained by all servers, the most recently updated model g* at the current moment; the worker then directly uses an active sample selection algorithm to continuously select the most cost-effective samples to be queried from the unlabeled data pool it maintains, as follows:
S_j = S_j(g*, D_j)
where S_j denotes the active sample selection algorithm adopted by the j-th worker and D_j denotes the unlabeled sample set maintained by the j-th worker; taking as input the most recently updated model g* at the current moment and the unlabeled sample set maintained by the j-th worker, the most cost-effective samples to be queried selected by the j-th worker are obtained; in practice, each worker may adopt a different active sample selection algorithm according to its own requirements;
finally, the samples to be queried selected by all workers are gathered into the set S, which is shared by all workers;
Step S4, label query stage;
the annotator randomly selects unlabeled samples from the to-be-queried sample set S, labels the selected samples, and adds the labeled samples to the labeled sample pool L.
2. The distributed asynchronous active labeling method according to claim 1, wherein in step S1, the labeled sample pool L is specifically represented as follows:
L represents the set of n_l labeled samples;
the unlabeled sample pool D is specifically represented as follows:
D = D_1 ∪ D_2 ∪ … ∪ D_k
D represents the k partitioned unlabeled sample sets, maintained by the k workers respectively.
3. The distributed asynchronous active labeling method according to claim 2, wherein the unlabeled sample sets in the unlabeled sample pool D are partitioned by means of random selection or clustering.
4. The distributed asynchronous active labeling method according to claim 1, wherein a plurality of resource management nodes are arranged among the labeled sample pool L, the to-be-queried sample set S, the servers and the workers, to coordinate the transmission of information among the sample sets, the servers and the workers; and a plurality of server management nodes are set up within the server group formed by the servers, to maintain the model library updated by all servers, so that the worker side can obtain the most recently updated model in time.
5. The distributed asynchronous active labeling method according to claim 1, wherein the model training mode of the server in step S2 is specifically as follows:
g_i = A(L)
where A represents the specific algorithm used when learning a prediction model on a server, and g_i is the trained model; the learning algorithms A used by different servers are mutually independent; when the servers are updated, the model is updated by employing a Multi-Server update mechanism.
CN202110801168.2A 2021-07-15 2021-07-15 Distributed asynchronous active labeling method Active CN113642610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110801168.2A CN113642610B (en) 2021-07-15 2021-07-15 Distributed asynchronous active labeling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110801168.2A CN113642610B (en) 2021-07-15 2021-07-15 Distributed asynchronous active labeling method

Publications (2)

Publication Number Publication Date
CN113642610A (en) 2021-11-12
CN113642610B (en) 2024-04-02

Family

ID=78417470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110801168.2A Active CN113642610B (en) 2021-07-15 2021-07-15 Distributed asynchronous active labeling method

Country Status (1)

Country Link
CN (1) CN113642610B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345661A (en) * 2018-01-31 2018-07-31 华南理工大学 A kind of Wi-Fi clustering methods and system based on extensive Embedding technologies
CN111160469A (en) * 2019-12-30 2020-05-15 湖南大学 Active learning method of target detection system
WO2021022572A1 (en) * 2019-08-07 2021-02-11 南京智谷人工智能研究院有限公司 Active sampling method based on meta-learning
CN113033098A (en) * 2021-03-26 2021-06-25 山东科技大学 Ocean target detection deep learning model training method based on AdaRW algorithm

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345661A (en) * 2018-01-31 2018-07-31 华南理工大学 A kind of Wi-Fi clustering methods and system based on extensive Embedding technologies
WO2021022572A1 (en) * 2019-08-07 2021-02-11 南京智谷人工智能研究院有限公司 Active sampling method based on meta-learning
CN111160469A (en) * 2019-12-30 2020-05-15 湖南大学 Active learning method of target detection system
CN113033098A (en) * 2021-03-26 2021-06-25 山东科技大学 Ocean target detection deep learning model training method based on AdaRW algorithm

Also Published As

Publication number Publication date
CN113642610A (en) 2021-11-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant