CN112766390A - Method, device and equipment for determining training sample - Google Patents

Method, device and equipment for determining training sample

Info

Publication number
CN112766390A
CN112766390A
Authority
CN
China
Prior art keywords
sample
evaluation
samples
training
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110102500.6A
Other languages
Chinese (zh)
Inventor
白强伟
黄艳香
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Minglue Artificial Intelligence Group Co Ltd
Original Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority to CN202110102500.6A
Publication of CN112766390A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of machine learning and discloses a method for determining training samples, which comprises: acquiring an unlabeled sample set and a plurality of candidate models, and allocating a corresponding labeled sample set to each candidate model, the candidate models being active learning models with different sample selection strategies; training each candidate model with its labeled sample set to obtain a corresponding evaluation model; evaluating the unlabeled sample set with the evaluation models to obtain a first evaluation result; and determining training samples according to the first evaluation result. Active learning models with different sample selection strategies are trained on labeled sample sets to obtain corresponding evaluation models, and the unlabeled sample set is evaluated by these evaluation models to determine the training samples, which avoids the bias of any single active learning algorithm and improves the diversity of the training samples. The application also discloses an apparatus and a device for determining training samples.

Description

Method, device and equipment for determining training sample
Technical Field
The present application relates to the field of machine learning technology, and for example, to a method, an apparatus, and a device for determining training samples.
Background
With the advent of the big data era, data analysis tasks have become more difficult. Because such data is large in volume but low in quality, data with accurate labeling information is particularly scarce. How to select the most valuable subset of massive data for manual annotation, so as to reduce the cost of data annotation, has therefore become a difficult problem.
In the process of implementing the embodiments of the present disclosure, it was found that the related art has at least the following problem: samples are determined by a single active learning algorithm, and because different active learning algorithms have different biases, the finally selected samples lack diversity.
Disclosure of Invention
The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview, nor is it intended to identify key or critical elements or to delineate the scope of such embodiments; rather, it serves as a prelude to the more detailed description that is presented later.
The embodiment of the disclosure provides a method, a device and equipment for determining a training sample, so as to improve the diversity of the training sample of an active learning model.
In some embodiments, the method comprises: acquiring an unlabeled sample set and a plurality of candidate models, and allocating a corresponding labeled sample set to each candidate model, the candidate models being active learning models with different sample selection strategies; training each candidate model with its labeled sample set to obtain a corresponding evaluation model; evaluating the unlabeled sample set with the evaluation models to obtain a first evaluation result; and determining training samples according to the first evaluation result.
In some embodiments, the apparatus comprises: a processor and a memory storing program instructions, characterized in that the processor is configured to perform the above-described method for determining training samples when executing the program instructions.
In some embodiments, the apparatus comprises the above-described means for determining training samples.
The method, apparatus, and device for determining training samples provided by the embodiments of the present disclosure can achieve the following technical effects: an unlabeled sample set and multiple active learning models are acquired, each active learning model is allocated a corresponding labeled sample set, and the labeled sample sets are used to train the multiple active learning models with different sample selection strategies, yielding the evaluation model corresponding to each active learning model; the evaluation models then evaluate the unlabeled sample set to determine the training samples, which avoids the bias of determining training samples with a single active learning algorithm and improves the diversity of the training samples.
The foregoing general description and the following description are exemplary and explanatory only and are not restrictive of the application.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, which are illustrative and not limiting; in the drawings, elements bearing the same reference numerals denote like elements, wherein:
FIG. 1 is a schematic diagram of a method for determining training samples provided by embodiments of the present disclosure;
FIG. 2 is a schematic diagram of a method for determining training samples provided by embodiments of the present disclosure;
FIG. 3 is a schematic diagram of an apparatus for determining training samples according to an embodiment of the present disclosure.
Detailed Description
So that the manner in which the features and elements of the disclosed embodiments can be understood in detail, a more particular description of the disclosed embodiments, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. In the following technical description, for purposes of explanation, numerous details are set forth in order to provide a thorough understanding of the disclosed embodiments. However, one or more embodiments may be practiced without these details. In other instances, well-known structures and devices are shown in simplified form in order to simplify the drawings.
The terms "first," "second," and the like in the description, the claims, and the above-described drawings of embodiments of the present disclosure are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances so that the embodiments of the present disclosure described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions.
The term "plurality" means two or more unless otherwise specified.
In the embodiment of the present disclosure, the character "/" indicates that the preceding and following objects are in an or relationship. For example, A/B represents: a or B.
The term "and/or" describes an association relationship between objects and indicates that three relationships may exist. For example, A and/or B represents: A alone, B alone, or both A and B.
In conjunction with fig. 1, an embodiment of the present disclosure provides a method for determining a training sample, including:
step S101: acquiring an unlabeled sample set and a plurality of alternative models, and distributing corresponding labeled sample sets for the alternative models; the multiple candidate models are active learning models with different sample selection strategies.
Step S102: training each corresponding candidate model with its labeled sample set to obtain the evaluation model corresponding to each candidate model.
Step S103: and evaluating the unlabeled sample set by using the evaluation model to obtain a first evaluation result.
Step S104: and determining a training sample according to the first evaluation result.
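As a non-authoritative illustration, steps S101 to S104 can be sketched in Python. The helper name `train_model` and the toy threshold models used below are assumptions for illustration, not part of the disclosure; the only behavior taken from the text is "train one evaluation model per candidate model, then keep any unlabeled sample that at least one model judges worth labeling."

```python
def select_training_samples(unlabeled, strategies, labeled_sets, train_model):
    """Sketch of steps S101-S104: train one evaluation model per
    (strategy, labeled set) pair, then keep any unlabeled sample that
    at least one evaluation model judges worth labeling (returns 1)."""
    # S102: one evaluation model per candidate model
    models = [train_model(s, l) for s, l in zip(strategies, labeled_sets)]
    # S103-S104: the first evaluation results decide the training samples
    return [x for x in unlabeled if any(f(x) == 1 for f in models)]
```

A toy usage: with two stand-in "strategies" that are simple numeric thresholds, a sample is selected if it clears at least one threshold, mirroring the union-of-strategies selection described above.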
By adopting the method for determining training samples provided by the embodiment of the present disclosure, an unlabeled sample set and multiple active learning models are acquired, each active learning model is allocated a corresponding labeled sample set, and the labeled sample sets are used to train the multiple active learning models with different sample selection strategies, yielding the evaluation model corresponding to each active learning model; the evaluation models then evaluate the unlabeled sample set to determine the training samples, which avoids the bias of determining training samples with a single active learning algorithm and improves the diversity of the training samples.
Optionally, when the labeled sample sets corresponding to the multiple candidate models are the same, the corresponding candidate models are each trained with that shared labeled sample set.
Optionally, determining a training sample according to the first evaluation result comprises: determining the unlabeled samples whose first evaluation results meet a first preset condition as training samples.
Optionally, the first evaluation result meeting the first preset condition includes: unlabeled samples are worth labeling. Optionally, when the first evaluation result is a score, determining an unlabeled sample corresponding to the first evaluation result with the score within a preset range as a sample worth labeling. Optionally, a sample worth labeling is determined as the training sample.
In some embodiments, the candidate model is trained with an uncertainty-based sample selection strategy to obtain a binary-classification evaluation model; when the score output by this evaluation model falls within a preset range, the unlabeled sample corresponding to the first evaluation result is determined to be a sample worth labeling.
In some embodiments, since the sample selection strategy in an active learning algorithm is used to evaluate whether a sample is worth labeling, the process of evaluating an unlabeled sample with a sample selection strategy is represented by y = f(x), where x is an unlabeled sample, f is the evaluation model that evaluates the unlabeled sample, and y is the evaluation result. Optionally, when y = 0, the first evaluation result of the evaluation model is that the unlabeled sample is not worth labeling; when y = 1, the first evaluation result is that the unlabeled sample is worth labeling.
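A minimal sketch of such an evaluation function y = f(x), assuming the uncertainty-based binary-classification setup described above. The probability range [0.4, 0.6] below is an illustrative choice for the "preset range," not a value from the disclosure:

```python
def make_evaluator(predict_proba, low=0.4, high=0.6):
    """Wrap a binary classifier's positive-class probability into the
    0/1 evaluation y = f(x): a score inside the preset range is taken
    as maximal uncertainty, so the sample is worth labeling (y = 1)."""
    def f(x):
        return 1 if low <= predict_proba(x) <= high else 0
    return f
```

Samples the classifier is already confident about (scores near 0 or 1) receive y = 0, while uncertain samples receive y = 1.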
Optionally, when the first evaluation result of any evaluation model is that an unlabeled sample is worth labeling, the unlabeled sample corresponding to the first evaluation result is determined as a training sample.
In this way, each evaluation model evaluates the unlabeled sample set to obtain a first evaluation result, and when the first evaluation result of any evaluation model is that a sample is worth labeling, the unlabeled sample corresponding to that result is determined as a training sample, which avoids the bias of a single active learning algorithm in determining training samples and improves the diversity of the training samples.
Optionally, determining a training sample according to the first evaluation result comprises: acquiring a newly added labeled sample set according to the first evaluation result; obtaining a new evaluation model from the newly added labeled sample set; evaluating the unlabeled samples in the unlabeled sample set with the new evaluation model to obtain a second evaluation result corresponding to the new evaluation model; and determining the unlabeled samples whose second evaluation results meet the first preset condition as training samples. In this way, the newly added labeled sample set is obtained from the first evaluation result and used to obtain a new evaluation model, which in turn evaluates the remaining unlabeled samples to determine training samples; since new evaluation models can be obtained during sample evaluation, the bias of a single active learning algorithm in determining training samples is avoided and the diversity of the training samples is improved.
Optionally, obtaining a newly added labeled sample set according to the first evaluation result includes: labeling the unlabeled samples whose first evaluation results meet the first preset condition to obtain labeled samples, and adding the labeled samples to the labeled sample set to obtain the newly added labeled sample set.
Optionally, the unlabeled samples whose first evaluation results meet the first preset condition are labeled to obtain labeled samples, the labeled samples are determined as newly added labeled samples, and the newly added labeled samples are added to the labeled sample set to obtain the newly added labeled sample set.
Optionally, the first evaluation result meeting the first preset condition includes: unlabeled samples are worth labeling. Optionally, the unlabeled samples worth labeling are labeled to obtain newly added labeled samples.
Optionally, obtaining a newly added labeled sample set according to the first evaluation result includes: when the number of unlabeled samples whose first evaluation results meet the first preset condition satisfies a second preset condition, labeling those unlabeled samples to obtain labeled samples, and adding the labeled samples to the labeled sample set to obtain the newly added labeled sample set.
Optionally, satisfying the second preset condition includes: the number of unlabeled samples whose first evaluation results meet the first preset condition exceeds a first set threshold. Optionally, the first set threshold is 60. Optionally, the unlabeled samples corresponding to first evaluation results meeting the first preset condition include the unlabeled samples worth labeling. In some embodiments, when the number of unlabeled samples worth labeling exceeds 60, they are labeled to obtain newly added labeled samples, which are added to the labeled sample set to obtain the newly added labeled sample set.
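A hedged sketch of this batching rule, assuming the accumulate-then-flush behavior described above; the function and parameter names are illustrative, and the default threshold of 60 is the first set threshold from the text:

```python
def flush_if_ready(worth_labeling, labeled_set, label, threshold=60):
    """Label the accumulated worth-labeling samples and merge them into
    the labeled sample set only once their number exceeds the first set
    threshold (60 in the text); otherwise keep accumulating."""
    if len(worth_labeling) > threshold:
        labeled_set.extend(label(x) for x in worth_labeling)
        worth_labeling.clear()
        return True   # a newly added labeled sample set was produced
    return False
```

The boolean return signals that a new labeled set exists, at which point retraining (as described below) may be triggered.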
Optionally, a sample counter is allocated to the newly added labeled sample set corresponding to each candidate model. Optionally, the number of newly added labeled samples in the newly added labeled sample set corresponding to each candidate model is recorded by a sample counter. Optionally, under the condition that the number of the newly added labeled samples exceeds a first set threshold, a newly added evaluation model is obtained according to the newly added labeled sample set.
Optionally, after the new evaluation model is obtained, the method further includes: clearing the sample counter and re-recording the number of newly added labeled samples in the newly added labeled sample set corresponding to each candidate model.
Optionally, whenever the number of newly added labeled samples in the newly added labeled sample set exceeds the first set threshold, the sample counter is cleared and the number of newly added labeled samples in the newly added labeled sample set corresponding to each candidate model is recorded anew, until the count next exceeds the first set threshold.
In this way, newly added labeled samples are acquired and, once their number exceeds the first set threshold, used to iterate a new evaluation model, so the evaluation model can be updated during sample evaluation, avoiding the bias of a single active learning algorithm in determining training samples and improving the diversity of the training samples.
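The per-strategy sample counter and retraining trigger might look like the following sketch. The class shape is an assumption; only the counter-exceeds-threshold-then-clear behavior comes from the text:

```python
class CandidateState:
    """Tracks one candidate model's newly added labeled samples.  When
    the counter (delta|L_i|) exceeds the first set threshold, the model
    is retrained on the grown labeled set and the counter is cleared."""
    def __init__(self, retrain, threshold=60):
        self.retrain = retrain        # callable: labeled set -> new model
        self.threshold = threshold    # epsilon_i, the first set threshold
        self.labeled = []             # the labeled sample set L_i
        self.counter = 0              # delta|L_i|

    def add(self, sample):
        self.labeled.append(sample)
        self.counter += 1
        if self.counter > self.threshold:
            self.counter = 0          # clear the counter and recount
            return self.retrain(self.labeled)
        return None                   # no retraining yet
```

`add` returns the new evaluation model when retraining fires and `None` otherwise, so the caller can swap models in place.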
Optionally, when the second evaluation result corresponding to a new evaluation model satisfies a third preset condition, that new evaluation model stops being used to evaluate unlabeled samples.
Optionally, satisfying the third preset condition includes: the number of unlabeled samples whose second evaluation results meet the first preset condition exceeds a second set threshold. Optionally, the second set threshold is 1000.
Optionally, an overall sample counter is allocated to the newly added labeled sample set corresponding to each candidate model. Optionally, the total number of the newly added labeled samples in the newly added labeled sample set corresponding to each candidate model is recorded by a total sample counter. Optionally, in a case that the total number of the newly added labeled samples exceeds a second set threshold, stopping evaluating the unlabeled samples by using the newly added evaluation model.
Optionally, obtaining a new evaluation model from the newly added labeled sample set includes: training the candidate model corresponding to the evaluation model with the newly added labeled sample set to obtain the new evaluation model. In this way, the evaluation model can be updated during sample evaluation, avoiding the bias of a single active learning algorithm in determining training samples and improving the diversity of the training samples.
Optionally, when determining training samples according to the first evaluation result, the method further includes: randomly determining, in the unlabeled sample set, a number of unlabeled samples whose first evaluation results do not meet the first preset condition as training samples. Randomly selecting samples that do not meet the first preset condition increases the richness of the training samples.
Optionally, when all the evaluation models have stopped evaluating unlabeled samples, a number of unlabeled samples whose first evaluation results do not meet the first preset condition are randomly determined as training samples from the unlabeled sample set.
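A sketch of this random fallback step; the sample count `k` and the seed handling are assumptions added so the example is reproducible:

```python
import random

def random_fallback(rejected, k, seed=None):
    """After every evaluation model has stopped, randomly pick k of the
    unlabeled samples whose first evaluation result did not meet the
    first preset condition and use them as additional training samples."""
    rng = random.Random(seed)
    return rng.sample(rejected, min(k, len(rejected)))
```

Sampling without replacement from the rejected pool adds samples no strategy favored, which is the stated source of extra richness.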
Optionally, after determining the training samples according to the first evaluation result, the method further includes: labeling the training samples. By evaluating the information content of unlabeled samples in advance to obtain the first evaluation result, determining the training samples accordingly, and labeling only those training samples, the amount of sample data to be labeled is reduced and the labeling cost is lowered; the labeled training samples can then be used to train a model, and the generalization performance of the model obtained by subsequent training can be improved.
In conjunction with fig. 2, an embodiment of the present disclosure provides a method for determining a training sample, including:
step S201: and acquiring an unlabeled sample set u and an active learning algorithm set S, wherein the active learning algorithm set S comprises a plurality of active learning algorithms with different sample selection strategies.
Step S202: acquiring a training sample data set L; allocating a corresponding labeled sample set L_i to each active learning algorithm, and allocating a corresponding sample counter Δ|L_i| to each labeled sample set, where L_i is the labeled sample set corresponding to the i-th active learning algorithm and Δ|L_i| is the sample counter corresponding to the i-th labeled sample set. Each sample counter records the number of newly added labeled samples in its labeled sample set. Initially, the labeled sample sets corresponding to the active learning algorithms are all the same, the training sample data set is the same as each labeled sample set, and the number of newly added labeled samples recorded by each sample counter is 0, i.e., Δ|L_i| ← 0.
Step S203: training the candidate models corresponding to the multiple active learning algorithms in the active learning algorithm set S with the labeled sample sets to obtain the evaluation model f_i corresponding to each candidate model.
Step S204: selecting an unlabeled sample m from the unlabeled sample set u and updating u ← u − {m}; evaluating m with each evaluation model f_i, i.e., y_i = f_i(m), to obtain each evaluation model's first evaluation result y_i for the unlabeled sample m, where y_i is the evaluation result corresponding to the i-th active learning algorithm in the algorithm set S and takes the value 0 or 1.
Step S205: if Σ_{i∈S} y_i > 0, i.e., the first evaluation result of at least one evaluation model is that the unlabeled sample m is worth labeling, determining the unlabeled sample m as a training sample; labeling the training sample and adding the labeled training sample m′ to the training sample data set L, i.e., L ← L ∪ {m′}.
Step S206: when the number of unlabeled samples worth labeling exceeds the first set threshold, labeling them to obtain newly added labeled samples and adding the newly added labeled samples to the labeled sample sets corresponding to the evaluation models: for all i ∈ S, if y_i = 1, updating L_i ← L_i ∪ {m′} to obtain the newly added labeled sample set, and updating the sample counter Δ|L_i| ← Δ|L_i| + 1.
Step S207: for all i ∈ S, if the number of newly added labeled samples in the labeled sample set satisfies Δ|L_i| > ε_i, obtaining a new evaluation model f_i from the newly added labeled sample set and clearing the sample counter, i.e., Δ|L_i| ← 0, where ε_i is the first set threshold.
Step S208: evaluating an unlabeled sample n in the unlabeled sample set with the new evaluation model f_i to obtain the second evaluation result corresponding to the new evaluation model.
Step S209: when the second evaluation result of any new evaluation model on the unlabeled sample n is that n is worth labeling, determining the unlabeled sample n as a training sample.
Step S210: for all i ∈ S, if the total number of newly added labeled samples in the newly added labeled sample set exceeds the second set threshold, i.e., |L_i| > γ_i, deleting the active learning algorithm of the candidate model corresponding to that labeled sample set from the active learning algorithm set S, i.e., S ← S − {i}.
Step S211: judging whether any active learning algorithm remains in the active learning algorithm set S; if no active learning algorithm remains, i.e., S = ∅, executing step S212; otherwise, executing step S204.
Step S212: if unlabeled samples remain in the unlabeled sample set, i.e., u ≠ ∅, randomly determining, in the unlabeled sample set, a number of unlabeled samples whose first evaluation results do not meet the first preset condition as training samples.
Step S213: labeling the training samples and adding the labeled training samples to the training sample data set L.
In some embodiments, the active learning algorithm set S = {A, B, C}, where A is an active learning algorithm based on an uncertainty sample selection strategy, B is an active learning algorithm based on a query-by-committee sample selection strategy, and C is an active learning algorithm based on a model-change sample selection strategy. Optionally, the three active learning algorithms A, B, and C are allocated corresponding labeled sample sets L_A, L_B, and L_C; initially, the three labeled sample sets L_A, L_B, and L_C are the same, and L is the same as each of L_A, L_B, and L_C. Optionally, the three labeled sample sets are allocated corresponding sample counters Δ|L_A| ← 0, Δ|L_B| ← 0, and Δ|L_C| ← 0, which record the number of newly added labeled samples in their respective labeled sample sets. Optionally, the labeled sample sets are used to train the candidate models corresponding to the three active learning algorithms in the set S, obtaining the evaluation models f_A, f_B, and f_C. Optionally, an unlabeled sample m is selected from the unlabeled sample set u and evaluated with each evaluation model to obtain the first evaluation results for m: if A ∈ S, m is evaluated with f_A, i.e., y_A = f_A(m); if B ∈ S, m is evaluated with f_B, i.e., y_B = f_B(m); if C ∈ S, m is evaluated with f_C, i.e., y_C = f_C(m).
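The loop of steps S201 to S213 for a strategy set such as S = {A, B, C} can be gathered into one compact, non-authoritative Python sketch. The `train`/`label` signatures and the `eps`/`gamma` defaults are assumptions standing in for ε_i and γ_i, and step S212's random fill is omitted for brevity:

```python
def active_learning_loop(unlabeled, strategies, initial_labeled,
                         train, label, eps=60, gamma=1000):
    """Assumed sketch of steps S201-S213.  train(strategy, labeled)
    returns an evaluation model f_i (a callable giving 0 or 1);
    label(x) returns the labeled sample m'.  eps is the first set
    threshold, gamma the second."""
    S = dict(strategies)                        # algorithm set S
    L = {i: list(initial_labeled) for i in S}   # labeled sets L_i (initially equal)
    f = {i: train(S[i], L[i]) for i in S}       # evaluation models f_i
    delta = {i: 0 for i in S}                   # sample counters delta|L_i|
    T, U = [], list(unlabeled)                  # training set L, unlabeled pool u
    while S and U:
        m = U.pop()                             # S204: u <- u - {m}
        y = {i: f[i](m) for i in S}
        if sum(y.values()) > 0:                 # S205: some model says "worth labeling"
            mp = label(m)
            T.append(mp)                        # L <- L U {m'}
            for i in list(S):
                if y[i] == 1:
                    L[i].append(mp)             # S206: L_i <- L_i U {m'}
                    delta[i] += 1
                    if delta[i] > eps:          # S207: retrain, clear counter
                        f[i] = train(S[i], L[i])
                        delta[i] = 0
                    if len(L[i]) - len(initial_labeled) > gamma:
                        del S[i]                # S210: S <- S - {i}
    return T, U
```

In a toy run with threshold "strategies," the returned training set is exactly the union of what the individual models would select, matching the diversity argument of the text.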
By adopting the method for determining training samples provided by the embodiment of the present disclosure, an unlabeled sample set and multiple active learning models can be acquired, each active learning model allocated a corresponding labeled sample set, and the labeled sample sets used to train the multiple active learning models with different sample selection strategies, yielding an evaluation model for each; each evaluation model then evaluates the unlabeled sample set to determine the training samples, which avoids the bias of determining training samples with a single active learning algorithm and improves the diversity of the training samples.
As shown in fig. 3, an apparatus for determining training samples according to an embodiment of the present disclosure includes a processor (processor)100 and a memory (memory)101 storing program instructions. Optionally, the apparatus may also include a Communication Interface (Communication Interface)102 and a bus 103. The processor 100, the communication interface 102, and the memory 101 may communicate with each other via a bus 103. The communication interface 102 may be used for information transfer. The processor 100 may call program instructions in the memory 101 to perform the method for determining training samples of the above embodiments.
Further, the program instructions in the memory 101 may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product.
The memory 101, which is a computer-readable storage medium, may be used for storing software programs, computer-executable programs, such as program instructions/modules corresponding to the methods in the embodiments of the present disclosure. The processor 100 executes functional applications and data processing, i.e. implements the method for determining training samples in the above embodiments, by executing program instructions/modules stored in the memory 101.
The memory 101 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal device, and the like. In addition, the memory 101 may include a high-speed random access memory, and may also include a nonvolatile memory.
By adopting the apparatus for determining training samples provided by the embodiment of the present disclosure, an unlabeled sample set and multiple active learning models can be acquired, each active learning model allocated a corresponding labeled sample set, and the labeled sample sets used to train the multiple active learning models with different sample selection strategies, yielding the evaluation model corresponding to each; the evaluation models then evaluate the unlabeled sample set to determine the training samples, which avoids the bias of a single active learning algorithm in determining training samples and improves the diversity of the training samples.
The embodiment of the present disclosure provides an apparatus including the above-mentioned device for determining a training sample.
Optionally, the apparatus comprises: computers, servers, etc.
By acquiring an unlabeled sample set and multiple active learning models, allocating each active learning model a corresponding labeled sample set, and training the multiple active learning models with different sample selection strategies on the labeled sample sets to obtain the evaluation model corresponding to each, the device can evaluate the unlabeled sample set with the evaluation models and determine the training samples, which avoids the bias of determining training samples with a single active learning algorithm and improves the diversity of the training samples.
Embodiments of the present disclosure provide a computer-readable storage medium storing computer-executable instructions configured to perform the above-described method for determining training samples.
Embodiments of the present disclosure provide a computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the above-described method for determining training samples.
The computer-readable storage medium described above may be a transitory computer-readable storage medium or a non-transitory computer-readable storage medium.
The technical solution of the embodiments of the present disclosure may be embodied as a software product stored in a storage medium and including one or more instructions that enable a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present disclosure. The storage medium may be a non-transitory storage medium, including a USB flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code, or it may be a transitory storage medium.
The above description and drawings sufficiently illustrate embodiments of the disclosure to enable those skilled in the art to practice them. Other embodiments may incorporate structural, logical, electrical, process, and other changes. The examples merely typify possible variations. Individual components and functions are optional unless explicitly required, and the sequence of operations may vary. Portions and features of some embodiments may be included in or substituted for those of others. Furthermore, the words used in the specification are words of description only and are not intended to limit the claims. As used in the description of the embodiments and the claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Similarly, the term "and/or" as used in this application is meant to encompass any and all possible combinations of one or more of the associated listed items. Furthermore, the terms "comprises" and/or "comprising," when used in this application, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Without further limitation, an element defined by the phrase "comprising an …" does not exclude the presence of other like elements in a process, method, or apparatus that comprises the element. In this document, each embodiment may be described with emphasis on its differences from the other embodiments, and the same and similar parts of the respective embodiments may be referred to one another. For the methods, products, and the like disclosed in the embodiments, where they correspond to the method sections disclosed herein, reference may be made to the descriptions of those method sections.
Those of skill in the art would appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software may depend upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments. It can be clearly understood by the skilled person that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments disclosed herein, the disclosed methods, products (including but not limited to devices, apparatuses, etc.) may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units may be merely a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to implement the present embodiment. In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. In the description corresponding to the flowcharts and block diagrams in the figures, operations or steps corresponding to different blocks may also occur in different orders than disclosed in the description, and sometimes there is no specific order between the different operations or steps. For example, two sequential operations or steps may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Claims (10)

1. A method for determining training samples, comprising:
acquiring an unlabeled sample set and a plurality of alternative models, and allocating a corresponding labeled sample set to each alternative model, wherein the plurality of alternative models are active learning models with different sample selection strategies;
training each alternative model with its labeled sample set to obtain an evaluation model corresponding to the alternative model;
evaluating the unlabeled sample set with the evaluation models to obtain a first evaluation result; and
determining a training sample according to the first evaluation result.
2. The method of claim 1, wherein determining the training sample according to the first evaluation result comprises:
determining an unlabeled sample whose first evaluation result meets a first preset condition as the training sample.
3. The method of claim 1, wherein determining the training sample according to the first evaluation result comprises:
acquiring a newly added labeled sample set according to the first evaluation result;
acquiring a newly added evaluation model according to the newly added labeled sample set;
evaluating the unlabeled samples in the unlabeled sample set with the newly added evaluation model to obtain a second evaluation result corresponding to the newly added evaluation model; and
determining an unlabeled sample whose second evaluation result meets the first preset condition as the training sample.
4. The method of claim 3, wherein acquiring the newly added labeled sample set according to the first evaluation result comprises:
labeling the unlabeled sample whose first evaluation result meets the first preset condition to obtain a labeled sample; and
adding the labeled sample to the labeled sample set to obtain the newly added labeled sample set.
5. The method of claim 3, wherein acquiring the newly added labeled sample set according to the first evaluation result comprises:
labeling, in a case where the number of unlabeled samples whose first evaluation results meet the first preset condition meets a second preset condition, the unlabeled samples whose first evaluation results meet the first preset condition to obtain labeled samples; and
adding the labeled samples to the labeled sample set to obtain the newly added labeled sample set.
6. The method of claim 3, wherein acquiring the newly added evaluation model according to the newly added labeled sample set comprises:
training the alternative model corresponding to the evaluation model with the newly added labeled sample set to obtain the newly added evaluation model.
7. The method according to any one of claims 2 to 6, wherein determining the training sample according to the first evaluation result further comprises:
randomly determining, in the unlabeled sample set, a plurality of unlabeled samples whose first evaluation results do not meet the first preset condition as training samples.
8. The method of claim 7, wherein determining the training sample according to the first evaluation result further comprises:
and labeling the training samples.
9. An apparatus for determining training samples, comprising a processor and a memory having stored thereon program instructions, characterized in that the processor is configured to perform the method for determining training samples according to any one of claims 1 to 8 when executing the program instructions.
10. An apparatus, characterized in that it comprises a device for determining training samples as claimed in claim 9.
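Claims 3 to 6 describe an iterative loop: evaluate the unlabeled pool, annotate the qualifying samples once enough of them accumulate, add them to the labeled set, retrain, and re-evaluate with the new model. The following is a minimal sketch of that loop under stated assumptions: the helper callables `train`, `evaluate`, and `annotate`, the score `threshold` (standing in for the first preset condition), and `min_batch` (standing in for the second preset condition) are all hypothetical names introduced for illustration, not part of the claims.

```python
def iterative_selection(labeled, unlabeled, train, evaluate, annotate,
                        threshold, min_batch, rounds):
    """Sketch of the loop in claims 3-6. `train(labeled)` returns an
    evaluation model, `evaluate(model, samples)` returns one score per
    sample, and `annotate(sample)` returns a label. The first preset
    condition is assumed to be score >= threshold, and the second preset
    condition to be at least `min_batch` qualifying samples."""
    model = train(labeled)
    for _ in range(rounds):
        scores = evaluate(model, unlabeled)             # evaluation result
        picked = [x for x, s in zip(unlabeled, scores) if s >= threshold]
        if len(picked) < min_batch:                     # second condition fails
            break
        labeled = labeled + [(x, annotate(x)) for x in picked]
        picked_set = set(picked)
        unlabeled = [x for x in unlabeled if x not in picked_set]
        model = train(labeled)                          # newly added evaluation model
    return labeled, unlabeled
```

Stopping when too few samples qualify (the second preset condition) avoids retraining on a handful of samples; the loop otherwise keeps growing the labeled set until the pool no longer yields informative candidates or the round budget is exhausted.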
CN202110102500.6A 2021-01-26 2021-01-26 Method, device and equipment for determining training sample Pending CN112766390A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110102500.6A CN112766390A (en) 2021-01-26 2021-01-26 Method, device and equipment for determining training sample


Publications (1)

Publication Number Publication Date
CN112766390A true CN112766390A (en) 2021-05-07

Family

ID=75705691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110102500.6A Pending CN112766390A (en) 2021-01-26 2021-01-26 Method, device and equipment for determining training sample

Country Status (1)

Country Link
CN (1) CN112766390A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245721A (en) * 2019-06-25 2019-09-17 深圳市腾讯计算机系统有限公司 Training method, device and the electronic equipment of neural network model
CN110378396A (en) * 2019-06-26 2019-10-25 北京百度网讯科技有限公司 Sample data mask method, device, computer equipment and storage medium
CN110766080A (en) * 2019-10-24 2020-02-07 腾讯科技(深圳)有限公司 Method, device and equipment for determining labeled sample and storage medium
CN111126574A (en) * 2019-12-30 2020-05-08 腾讯科技(深圳)有限公司 Method and device for training machine learning model based on endoscopic image and storage medium
CN111310799A (en) * 2020-01-20 2020-06-19 中国人民大学 Active learning algorithm based on historical evaluation result
CN112085219A (en) * 2020-10-13 2020-12-15 北京百度网讯科技有限公司 Model training method, short message auditing method, device, equipment and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JOHNSON: "Recommending Label Studio, a good open-source tool for closed-loop data annotation and model training" *
Zhihu: "Is it feasible to use an already-trained neural network to assist data annotation?" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113284141A (en) * 2021-07-09 2021-08-20 武汉精创电子技术有限公司 Model determination method, device and equipment for defect detection
CN114417871A (en) * 2021-12-17 2022-04-29 北京百度网讯科技有限公司 Model training and named entity recognition method and device, electronic equipment and medium
CN114417871B (en) * 2021-12-17 2023-01-31 北京百度网讯科技有限公司 Model training and named entity recognition method, device, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN110866181B (en) Resource recommendation method, device and storage medium
CN104915734B (en) Commodity popularity prediction method based on time sequence and system thereof
CN112766390A (en) Method, device and equipment for determining training sample
Qian et al. State reduction for network intervention in probabilistic Boolean networks
CN108921587B (en) Data processing method and device and server
CN111402579A (en) Road congestion degree prediction method, electronic device and readable storage medium
CN108536467A (en) Location processing method, device, terminal device and the storage medium of code
CN112070550A (en) Keyword determination method, device and equipment based on search platform and storage medium
CN110705489A (en) Training method and device of target recognition network, computer equipment and storage medium
CN112800178A (en) Answer generation method and device, electronic equipment and readable storage medium
CN115659226A (en) Data processing system for acquiring APP label
CN108415971B (en) Method and device for recommending supply and demand information by using knowledge graph
US10313457B2 (en) Collaborative filtering in directed graph
CN112215655A (en) Client portrait label management method and system
CN110929526A (en) Sample generation method and device and electronic equipment
CN112364169B (en) Nlp-based wifi identification method, electronic device and medium
CN111382342B (en) Method, device and equipment for acquiring hot search words and storage medium
CN112800226A (en) Method for obtaining text classification model, method, device and equipment for text classification
CN114168871A (en) Method and device for page jump, electronic equipment and storage medium
CN115114415A (en) Question-answer knowledge base updating method and device, computer equipment and storage medium
CN112667654A (en) Business work order updating method and system
CN112085030A (en) Similar image determining method and device
KR20170085396A (en) Feature Vector Clustering and Database Generating Method for Scanning Books Identification
CN116384473B (en) Calculation graph improvement and information pushing method and device
CN111104569A (en) Region segmentation method and device for database table and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination