CN113743431B - Data selection method and device - Google Patents

Data selection method and device

Info

Publication number
CN113743431B
CN113743431B (application CN202010475317.6A)
Authority
CN
China
Prior art keywords
sample
category
candidate
samples
uncertainty
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010475317.6A
Other languages
Chinese (zh)
Other versions
CN113743431A (en)
Inventor
付彬
孙健
唐呈光
李杨
赵学敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010475317.6A
Publication of CN113743431A
Application granted
Publication of CN113743431B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data selection method and device, relates to the technical field of data processing, and is mainly aimed at optimizing the selection of samples in the active learning process and avoiding skew in the selected sample data. The main technical scheme of the invention is as follows: obtaining a candidate sample set, the candidate sample set comprising a plurality of candidate samples belonging to different categories; calculating the uncertainty of a category according to the candidate samples belonging to that category; calculating the category uncertainty distribution of the candidate sample set according to the uncertainties of the different categories; and selecting, according to the category uncertainty distribution of the candidate sample set, a first candidate sample in a first category to enter a sample pool.

Description

Data selection method and device
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data selection method and apparatus.
Background
With the spread of artificial intelligence, deep learning has achieved major breakthroughs in many practical applications. However, obtaining large quantities of accurately labeled data is still costly, and training a model still demands a great deal of time and effort; these problems have become a limitation on current deep learning. Active learning, by screening unlabeled data, can reach a higher model accuracy with fewer labeled samples.
Active learning, also known as query learning and, in the statistical field, optimal experimental design, is a subfield of artificial intelligence. The algorithm comprises two basic modules: a learning module and a selection strategy. Active learning uses the selection strategy to actively pick some samples from the unlabeled sample set, hands them to experts in the relevant field for labeling, and then adds the labeled samples to the training data set on which the learning module trains. The process stops when the learning module meets a termination condition; otherwise it repeats to obtain more labeled samples for training. However, when existing active learning selects samples, it mainly considers the uncertainty of a sample (the degree to which the model cannot effectively identify or distinguish it) and the correlation between samples (their degree of similarity). When samples from many different categories must be selected, and some category contains many samples with high uncertainty, the selection tendency of active learning skews the selected sample data toward that category, which harms the optimization effect of model training.
Disclosure of Invention
In view of the above problems, the present invention provides a data selection method and device, mainly aimed at optimizing the selection of samples in the active learning process and avoiding skew in the selected sample data.
In order to achieve the above purpose, the present invention mainly provides the following technical solutions:
In one aspect, the present invention provides a data selection method, which specifically includes:
obtaining a set of candidate samples, the set of candidate samples comprising a plurality of candidate samples belonging to different categories;
calculating uncertainty of the category according to candidate samples belonging to the same category;
calculating the category uncertainty distribution of the candidate sample set according to the uncertainties of different categories;
and selecting a first candidate sample in a first category to enter a sample pool according to the category uncertainty distribution of the candidate sample set.
Preferably, selecting the first candidate sample in the first category to enter the sample pool includes:
if the first category is a category not yet in the sample pool, selecting a candidate sample with the largest sample uncertainty from the candidate samples of the first category as the first candidate sample;
and if the first category is a category already in the sample pool, obtaining a second sample belonging to the first category in the sample pool, and selecting the candidate sample with the smallest correlation mean with the second sample as the first candidate sample.
Preferably, the method further comprises:
a correlation matrix between candidate samples of the first class is calculated, the correlation matrix being used to calculate correlations between candidate samples.
Preferably, the sample uncertainty is calculated by:
predicting information entropy of the third sample belonging to the first category by a distance measurement method based on text characteristics;
and/or predicting the information entropy of the third sample belonging to the first category based on the machine learning model;
and/or predicting the information entropy of the third sample belonging to the first category based on the method of the pre-training model;
and determining the sample uncertainty of the third sample according to the information entropy of the third sample belonging to the first category.
Preferably, the method further comprises:
and stopping selecting samples in the candidate sample set if the number of samples in the sample pool reaches a preset threshold value.
Preferably, the method further comprises:
classifying and predicting the user data and the expanded corpus by using a preset model to generate sample data;
and selecting sample data with a prediction result within a preset range from the sample data to form a candidate sample set.
In another aspect, the present invention provides a data selecting apparatus, specifically including:
an obtaining unit configured to obtain a candidate sample set including a plurality of candidate samples belonging to different categories;
the first determining unit is used for calculating the uncertainty of the category according to the candidate samples belonging to the same category and obtained by the obtaining unit;
the second determining unit is used for calculating the category uncertainty distribution of the candidate sample set according to the uncertainty of different categories obtained by the first determining unit;
and the selection unit is used for selecting a first candidate sample in a first category to enter the sample pool according to the category uncertainty distribution of the candidate sample set determined by the second determination unit.
Preferably, the selecting unit includes:
the first selection module is used for selecting, if the first category is a category not yet in the sample pool, a candidate sample with the largest sample uncertainty from the candidate samples of the first category as the first candidate sample;
and the second selection module is used for obtaining, if the first category is a category already in the sample pool, a second sample belonging to the first category in the sample pool, and selecting the candidate sample with the smallest correlation mean with the second sample as the first candidate sample.
Preferably, the selecting unit further includes:
a calculation module for calculating a correlation matrix between candidate samples of the first category before the second selection module selects the first candidate sample, the correlation matrix being used for calculating correlations between candidate samples.
Preferably, when the first selection module selects the first candidate sample, the sample uncertainty is calculated by:
predicting information entropy of the third sample belonging to the first category by a distance measuring device based on text characteristics;
and/or predicting, by the machine learning model-based device, an information entropy of the third sample belonging to the first category;
and/or, predicting, by the pre-training model-based device, an information entropy of the third sample belonging to the first category;
and determining the sample uncertainty of the third sample according to the information entropy of the third sample belonging to the first category.
Preferably, the selecting unit is further configured to stop selecting samples from the candidate sample set if the number of samples in the sample pool reaches a preset threshold.
Preferably, the acquisition unit includes:
the generation module is used for carrying out classification prediction on the user data and the expanded corpus by using a preset model to generate sample data;
and the screening module is used for selecting sample data with a prediction result within a preset range from the sample data obtained by the generating module to form a candidate sample set.
In another aspect, the present invention provides a processor, where the processor is configured to run a program, and the program, when run, performs the data selection method described above.
By means of the above technical scheme, the data selection method and device are mainly applied to the active learning process. Extracting candidate samples by category improves the diversity of the selected samples: the uncertainties of the candidate samples are used to compute the uncertainty of each category in the candidate sample set, the uncertainty distribution over the categories is calculated from those per-category uncertainties, and the first candidate sample is selected from the first category drawn according to that distribution. Sample selection thus preserves the diversity of sample categories while still favoring samples from categories of high uncertainty. That is, sampling follows the statistical distribution of category uncertainty, so even categories with low uncertainty retain some probability of contributing samples, avoiding the skew of sample selection in the existing active learning process.
The foregoing is merely an overview of the technical scheme of the present invention. To enable a clearer understanding of its technical means so that it may be implemented in accordance with this specification, and to make the above and other objects, features, and advantages of the present invention more readily apparent, specific embodiments of the invention are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flow chart of a data selection method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another data selection method according to an embodiment of the present invention;
FIG. 3 is a block diagram showing a data selecting device according to an embodiment of the present invention;
fig. 4 shows a block diagram of another data selecting device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The data selection method provided by the embodiment of the invention optimizes and improves active learning in the model-training process, so that the samples selected through active learning are more diverse and, after manual labeling, train the model more effectively, improving the training effect of the model. The specific steps of the method, shown in fig. 1, are as follows:
Step 101, obtaining a candidate sample set.
Wherein the candidate sample set comprises a plurality of candidate samples belonging to different categories. The categories of the candidate samples in this step are preset based on the sample set. For example, for a sample set for training an emotion recognition model, the preset categories may include happy, sad, worried, excited, and so on.
The candidate samples in the candidate sample set obtained in this step are samples that cannot yet be used to train the model effectively, that is, samples whose corresponding category the model cannot accurately identify. In other words, the category information of a candidate sample is insufficient for effective training; for example, it contains class probabilities for several categories whose magnitudes are all similar. The object of the embodiment of the invention is to select some of these candidate samples for manual labeling, so that they can train the model more effectively and improve the accuracy with which the model identifies samples.
Step 102, calculating the uncertainty of the category according to the candidate samples belonging to the same category.
The uncertainty of a category in this step is calculated from the uncertainties of all candidate samples in that category; for example, it may be their mean or their sum. The uncertainty of a candidate sample measures how difficult it is for the model to identify or distinguish that sample, and can be quantified by different algorithms, for example from the prediction score obtained by performing category prediction with a model, where the model may be the model to be trained, a word-vector model, and so on.
In this embodiment, the category assigned to a candidate sample is generally the category with the highest class probability, so grouping by predicted category yields a different number of candidate samples in each category of the candidate sample set. A candidate sample is classified into only one category, as sketched below.
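As a minimal illustration of this grouping (the data layout and function name are assumptions for illustration, not from the patent), each candidate carries the preset model's per-category probability vector and is bucketed under its argmax category:

```python
from collections import defaultdict

import numpy as np


def group_by_predicted_class(candidates):
    """Assign each candidate to the category with the highest predicted
    probability, so every candidate lands in exactly one category bucket.

    `candidates` is assumed to be a list of (sample, probs) pairs, where
    `probs` is the preset model's per-category probability vector.
    """
    buckets = defaultdict(list)
    for sample, probs in candidates:
        buckets[int(np.argmax(probs))].append((sample, probs))
    return buckets
```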
Step 103, calculating the category uncertainty distribution of the candidate sample set according to the uncertainties of different categories.
This step determines, based on uncertainty, the distribution over all the categories in the candidate sample set. The distribution is used primarily to select a first category from among the categories, and can be obtained by comparing the uncertainties of the various categories, for example through normalization. The category uncertainty distribution can thus also be understood as the sampling probability of each category: the greater a category's sampling probability, the higher the uncertainty of its candidate samples and the greater the likelihood that a candidate sample is selected from that category.
Step 104, selecting a first candidate sample in the first category to enter a sample pool according to the category uncertainty distribution of the candidate sample set.
This step first determines the selected category based on the category uncertainty distribution: the higher a category's uncertainty, the greater its probability of being selected, while categories with low uncertainty retain a smaller but nonzero probability. A first candidate sample is then selected from the chosen first category. Selecting candidates in this way draws samples from more categories, improves the variety of sample categories, balances the uncertainty and diversity of the selected samples, and avoids selection skew.
In addition, the first candidate sample is placed in the sample pool so that the sample pool can be used to cap the total number of selected candidates. While the number of candidate samples in the pool is insufficient, the operation of selecting a first candidate sample is repeated until the pool holds enough samples, after which the selected candidate samples are output for manual labeling.
The data selection method provided by the embodiment of the invention improves and optimizes the active learning process: candidate samples are grouped by category, the uncertainty of each category is determined from the uncertainties of its candidate samples, the category uncertainty distribution is calculated from those category uncertainties, the first category from which to select is determined according to that distribution, and a first candidate sample is selected from the candidates belonging to the first category and enters the sample pool. Selecting in this way draws candidates from many categories, preserving the uncertainty of the chosen candidates while also diversifying their category of origin, so the samples submitted to experts for labeling are more representative and the manually labeled samples optimize model training more effectively.
The data selection method provided by the embodiment of the invention helps select samples that train a model more effectively, and can improve the training of many kinds of models on current service platforms. With the development of the Internet, it is especially useful for training models that provide human-machine dialogue services, which are widely applied in industries such as electronic commerce, telecommunications, government affairs, finance, education, entertainment, health, and travel. For example, in the electronic commerce industry, a user can request invoices, prompt delivery, check logistics, change an address, or arrange express pickup by conversing with an intelligent customer-service agent; in the telecommunications or wider operator industry, a user can check a phone bill, check data usage, buy packages, report faults, or change a password in the same way.
Further, building on the data selection method shown in fig. 1, the following embodiment generates candidate samples from corpora in text, performs active learning, and selects some of the candidate samples for manual labeling. The specific process, shown in fig. 2, comprises:
step 201, a candidate sample set is obtained.
In this step, a preset model performs classification prediction on corpus data in text to generate sample data, and sample data whose prediction results for the categories fall within a preset range are then selected to form the candidate sample set. The text source of the corpus data is not limited to user data or other expanded corpora, and can be determined according to application requirements.
In addition, the preset model may be an existing model to be trained, and its classification categories are preset before it performs classification prediction on the corpus data. The input of the preset model is corpus data, with an identifier preset for each category, and the output is the probability that the corpus belongs to each category. This step decides whether an output corpus item becomes a candidate sample by testing whether its prediction result lies within a preset range; generally, the preset range may be set to 0.3-0.7. When the output prediction contains a category whose probability exceeds 0.7, that category's identifier can be attached to the corpus data to obtain sample data with labeling information, i.e. the corpus data is judged to belong to that category. When a category's probability is below 0.3, the corpus data is judged not to belong to that category, and the probabilities of the other categories are examined further. When the largest class probability of the corpus data, or the probabilities of several of its classes, fall within the preset range, the preset model finds that corpus data difficult to classify; for such corpus data, active learning selects representative items for manual labeling, and the model is then optimized through training so that it identifies such samples better. Therefore, the sample data generated from corpus data whose prediction results lie within the preset range are the samples of the candidate sample set in this step, as the sketch after this paragraph illustrates.
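A minimal screening sketch under stated assumptions (the (item, probability-vector) pair layout and the helper name `build_candidate_set` are illustrative, not from the patent; the 0.3-0.7 bounds follow the example above):

```python
def build_candidate_set(corpus_probs, low=0.3, high=0.7):
    """Keep corpus items whose top class probability falls inside the preset
    range, i.e. items the preset model cannot confidently classify.

    `corpus_probs` is assumed to be a list of (item, probs) pairs, where
    `probs` is the preset model's probability vector over the categories.
    """
    candidates = []
    for item, probs in corpus_probs:
        top = max(probs)
        if low <= top <= high:   # model is unsure: route to active learning
            candidates.append((item, probs))
        # top > high: auto-label with that category; top < low for a class
        # merely rules that class out, and the other classes are examined
    return candidates
```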
Step 202, determining correlation between candidate samples in the same category.
The correlation between candidate samples measures their similarity; training the model with similar corpus samples generally does little to improve its recognition ability. Therefore, during active learning, candidate samples with high mutual correlation should be selected as rarely as possible, which improves the effect of the manually labeled samples on model training.
In this step, the correlation between candidate samples may be obtained by calculating the cosine similarity of word vectors, or by weighting the scores of model prediction results; this embodiment does not specifically limit the method. Finally, the correlations among all candidate samples in the same category can be represented as a correlation matrix, as in the sketch below.
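For instance, a cosine-similarity version might look like the following sketch (the `embeddings` layout is an assumption; per the text, weighting model prediction scores would work as well):

```python
import numpy as np


def correlation_matrix(embeddings):
    """Pairwise cosine similarity between the candidate samples of one
    category; entry (i, j) is the correlation between candidates i and j.

    `embeddings` is assumed to be an (n, d) array of word-vector
    representations, one row per candidate sample.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard zero vectors
    return unit @ unit.T
```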
Step 203, calculating the uncertainty of the category according to the candidate samples belonging to the same category.
For this step, the uncertainty of each candidate sample in the same category is determined first. This may be done by a preset algorithm, which this embodiment does not specifically limit. For example, this step uses a committee scoring mechanism with at least two prediction algorithms, such as a machine learning model's prediction score, word2vec, and term frequency-inverse document frequency (TF-IDF). Each of these algorithms can predict, for a candidate sample, a probability score that the sample belongs to its assigned category; these scores are used to calculate the information entropy of the sample over the categories, and that entropy is taken as the candidate sample's uncertainty for the category. The specific calculation method is as follows:
$$H(X) = -\sum_{i=1}^{N} \bar{P}(x_i) \log \bar{P}(x_i), \qquad \bar{P}(x_i) = \frac{1}{3}\left(P_{model}(x_i) + P_{word2vec}(x_i) + P_{tfidf}(x_i)\right)$$
(the original formula image is not reproduced in this text; the uniform committee average above is a reconstruction consistent with the surrounding description), where H(X) represents the entropy of the candidate sample, N is the number of categories, P_model(x_i) represents the score, predicted using the model, for the sample belonging to class x_i, P_word2vec(x_i) represents the prediction score calculated using word2vec, and P_tfidf(x_i) represents the prediction score calculated using the TF-IDF method.
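The same computation as a hedged Python sketch (the uniform 1/3 weighting mirrors the reconstruction above and is an assumption; the source only states that the three scores are combined):

```python
import numpy as np


def committee_entropy(p_model, p_word2vec, p_tfidf):
    """Committee-style uncertainty of one candidate sample.

    Each argument is one predictor's score vector over the N categories;
    the three are averaged into a consensus distribution whose Shannon
    entropy is taken as the sample's uncertainty H(X).
    """
    p = (np.asarray(p_model) + np.asarray(p_word2vec) + np.asarray(p_tfidf)) / 3.0
    p = p / p.sum()  # renormalize the consensus scores
    return float(-np.sum(p * np.log(p + 1e-12)))  # epsilon avoids log(0)
```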
Further, the uncertainty of a category is computed over the candidate samples belonging to that same category. A category's uncertainty must be determined over all of its candidate samples that have not yet been selected; in general, once a candidate sample of a category has been selected it is deleted from the category, so the candidates remaining in the category are exactly the unselected ones. Their uncertainties are determined in the manner described above, and the mean of those uncertainties is taken as the uncertainty of the category.
Step 204, calculating the category uncertainty distribution of the candidate sample set according to the uncertainties of different categories.
In this step, the category uncertainty distribution over the categories of the candidate sample set may be calculated with a normalized exponential function (softmax function). Besides the normalized exponential function, other distribution types may also be chosen in this embodiment, such as the Dirichlet distribution or the Beta distribution; note that the chosen distribution must be positively correlated with the magnitude of category uncertainty, so that the sampling probability of each category can be determined from the distribution, as sketched below.
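A minimal softmax sketch, assuming per-category uncertainties are kept in a dict (names are illustrative):

```python
import numpy as np


def class_sampling_distribution(class_uncertainty):
    """Softmax over per-category uncertainties: categories with higher
    uncertainty receive a larger sampling probability, but no category's
    probability drops to zero.
    """
    u = np.asarray(list(class_uncertainty.values()), dtype=float)
    z = np.exp(u - u.max())  # subtract the max for numerical stability
    probs = z / z.sum()
    return dict(zip(class_uncertainty.keys(), probs))
```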
Step 205, selecting a first candidate sample in a first category to enter a sample pool according to the category uncertainty distribution of the candidate sample set.
Since this step is repeated several times in this embodiment to ensure that enough candidate samples are selected, there are two cases when choosing the category of a candidate: either the category has already been selected in an earlier round, or it has not. This step uses a different candidate selection method for each case.
Specifically, it must be determined whether the currently selected first category has been chosen before; to this end, every selected category needs to be recorded.
If the first category is not yet in the sample pool, i.e. it has not been selected before, the candidate sample with the highest sample uncertainty among the candidates of the first category is selected as the first candidate sample for manual labeling, and that candidate is deleted from the category. The calculation of sample uncertainty is described in step 203: supposing the sample uncertainty of a third sample in the first category is to be calculated, it can be determined from the information entropy of the third sample belonging to the first category, where that entropy may be any one of the entropy predicted by a distance measurement method based on text features, the entropy predicted by a machine learning model, the entropy predicted by a pre-training model, or a combined value of several such entropies; see the formula for H(X) in step 203.
If the first category already exists in the sample pool, i.e. it has been selected before, the candidate with the highest uncertainty in that category has already been taken, and it is no longer appropriate to select by uncertainty alone. Instead, the correlation matrix from step 202 is used: the second samples belonging to the first category, i.e. the candidates already selected from it, are obtained from the sample pool, and from the candidates not yet selected, the one whose mean correlation with the second samples is smallest is chosen as the first candidate sample. The selected candidate is then deleted from the first category. Both cases are sketched below.
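A hedged sketch of the two selection cases (`buckets`, `pool`, `uncertainty`, and `corr` are assumed stand-ins for the structures described above, not the patent's own API):

```python
import numpy as np


def pick_from_class(cls, buckets, pool, uncertainty, corr):
    """Select one candidate from category `cls`: maximum uncertainty if the
    category is new to the pool, otherwise minimum mean correlation with the
    category's already-selected (second) samples.

    `buckets[cls]` holds the category's remaining candidates, `pool` holds
    (sample, category) pairs, `uncertainty(s)` scores a sample, and
    `corr(a, b)` looks up the precomputed correlation matrix.
    """
    remaining = buckets[cls]
    selected = [s for s, c in pool if c == cls]
    if not selected:  # case 1: category not yet in the sample pool
        best = max(remaining, key=uncertainty)
    else:             # case 2: category already represented in the pool
        best = min(remaining,
                   key=lambda s: np.mean([corr(s, t) for t in selected]))
    remaining.remove(best)   # delete the chosen candidate from the category
    pool.append((best, cls))
    return best
```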
Step 206, judging whether the number of candidate samples in the sample pool reaches a preset threshold.
The preset threshold can be set as desired. When the number of candidate samples reaches the preset threshold, selection from the candidate sample set stops, and the candidate samples in the sample pool are output for manual labeling by experts. When the number has not reached the threshold, step 207 is performed.
Step 207, updating the category uncertainty distribution of the candidate sample set according to the unselected candidate samples.
Since a first candidate sample has been selected, the number of candidate samples in the first category changes, so the uncertainty of that category changes, which in turn changes the category uncertainty distribution of the candidate sample set and hence the sampling probabilities. The update in this step is therefore performed as described above for step 204. Steps 204-207 of the embodiment of the present invention thus form a loop driven by the number of candidate samples in the sample pool: each time a candidate sample is selected, the category uncertainty distribution is updated, a category is drawn from the new distribution, and candidate samples continue to be selected from it and added to the sample pool until the number of candidates in the pool reaches the preset threshold, as sketched below.
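A sketch of the whole loop under stated assumptions; `class_uncertainty(cls)` and `pick(cls)` are hypothetical helpers in the spirit of the earlier sketches, not the patent's own API:

```python
import numpy as np


def active_select(buckets, class_uncertainty, pick, threshold):
    """Loop corresponding to steps 204-207: recompute the category
    distribution, draw a category, pick one candidate from it, and repeat
    until the sample pool reaches the preset threshold.
    """
    pool = []
    while len(pool) < threshold:
        classes = [c for c in buckets if buckets[c]]  # skip exhausted ones
        if not classes:  # every candidate has been selected
            break
        u = np.array([class_uncertainty(c) for c in classes])
        probs = np.exp(u - u.max())
        probs /= probs.sum()                          # softmax, step 204
        cls = np.random.choice(classes, p=probs)      # draw a category
        pool.append(pick(cls))                        # select one candidate
    return pool
```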
Further, as an implementation of the data selection methods shown in fig. 1 and fig. 2, an embodiment of the invention provides a data selection apparatus, mainly aimed at optimizing the selection of samples during active learning and avoiding skew in the selected sample data. For ease of reading, details from the foregoing method embodiments are not repeated one by one in this apparatus embodiment, but it should be clear that the apparatus can correspondingly implement all of those details. The apparatus, shown in fig. 3, specifically comprises:
an obtaining unit 31 for obtaining a set of candidate samples, the set of candidate samples comprising a plurality of candidate samples belonging to different categories;
a first determining unit 32, configured to calculate, according to the candidate samples belonging to the same category obtained by the obtaining unit 31, an uncertainty of the category;
a second determining unit 33, configured to calculate a category uncertainty distribution of the candidate sample set according to the uncertainties of different categories obtained by the first determining unit 32;
a selecting unit 34, configured to select a first candidate sample in a first category to enter the sample pool according to the category uncertainty distribution of the candidate sample set determined by the second determining unit 33.
Further, as shown in fig. 4, the selecting unit 34 includes:
a first selecting module 341, configured to select, from candidate samples of the first class, a candidate sample with a maximum sample uncertainty as a first candidate sample if the first class is a class that is not in the sample pool;
a second selecting module 342, configured to obtain a second sample belonging to the first category in the sample pool if the first category is an existing category in the sample pool, and select a candidate sample with the smallest correlation mean with the second sample as a first candidate sample.
Further, as shown in fig. 4, the selecting unit 34 further includes:
a calculating module 343, configured to calculate a correlation matrix between candidate samples of the first class before the second selecting module 342 selects the first candidate sample, where the correlation matrix is used to calculate correlation between candidate samples.
Further, as shown in fig. 4, when the first selection module 341 selects the first candidate sample, the sample uncertainty is calculated as follows:
predicting information entropy of the third sample belonging to the first category by a distance measuring device based on text characteristics;
and/or predicting, by the machine learning model-based device, an information entropy of the third sample belonging to the first category;
and/or, predicting, by the pre-training model-based device, an information entropy of the third sample belonging to the first category;
and determining the sample uncertainty of the third sample according to the information entropy of the third sample belonging to the first category.
Further, the selecting unit 34 is further configured to stop selecting samples from the candidate sample set if the number of samples in the sample pool reaches a preset threshold.
Further, as shown in fig. 4, the acquisition unit 31 includes:
the generating module 311 is configured to perform classification prediction on the user data and the expanded corpus by using a preset model, and generate sample data;
and a screening module 312, configured to select sample data with a prediction result within a preset range from the sample data obtained by the generating module 311 to form a candidate sample set.
In addition, an embodiment of the invention further provides a processor configured to run a program, where the program, when run, performs the data selection method provided by any of the above embodiments.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
It will be appreciated that the relevant features of the methods and apparatus described above may reference one another. In addition, terms such as "first" and "second" in the above embodiments are used to distinguish the embodiments and do not indicate their relative merits.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein, and the structure required to construct such a system is apparent from the description above. In addition, the present invention is not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided to disclose the enablement and best mode of the present invention.
Furthermore, the memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media) such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises it.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (11)

1. A method of data selection, the method comprising:
obtaining a candidate sample set, wherein the candidate sample set comprises a plurality of candidate samples belonging to different categories, and the candidate samples are generated based on corpus in text;
calculating uncertainty of the category according to candidate samples belonging to the same category;
calculating the category uncertainty distribution of the candidate sample set according to the uncertainty of different categories, wherein the category uncertainty distribution is the sampling probability corresponding to each category;
and selecting, according to the category uncertainty distribution of the candidate sample set, a first candidate sample in a first category to enter a sample pool, wherein the selecting comprises:
if the first category is a category not yet in the sample pool, selecting a candidate sample with the largest sample uncertainty from the candidate samples of the first category as the first candidate sample;
and if the first category is a category already in the sample pool, obtaining a second sample belonging to the first category in the sample pool, and selecting the candidate sample with the smallest correlation mean with the second sample as the first candidate sample.
2. The method according to claim 1, wherein the method further comprises:
a correlation matrix between candidate samples of the first class is calculated, the correlation matrix being used to calculate correlations between candidate samples.
3. The method of claim 1, wherein the sample uncertainty is calculated by:
predicting information entropy of the third sample belonging to the first category by a distance measurement method based on text characteristics;
and/or predicting the information entropy of the third sample belonging to the first category based on the machine learning model;
and/or predicting the information entropy of the third sample belonging to the first category based on the method of the pre-training model;
and determining the sample uncertainty of the third sample according to the information entropy of the third sample belonging to the first category.
4. A method according to any one of claims 1-3, characterized in that the method further comprises:
and stopping selecting samples in the candidate sample set if the number of samples in the sample pool reaches a preset threshold value.
5. The method of claim 1, wherein the obtaining a set of candidate samples comprises:
classifying and predicting the user data and the expanded corpus by using a preset model to generate sample data;
and selecting sample data with a prediction result within a preset range from the sample data to form a candidate sample set.
6. A data selection apparatus, the apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a candidate sample set, the candidate sample set comprises a plurality of candidate samples belonging to different categories, and the candidate samples are generated based on corpus in text;
the first determining unit is used for calculating the uncertainty of the category according to the candidate samples belonging to the same category and obtained by the obtaining unit;
the second determining unit is used for calculating the category uncertainty distribution of the candidate sample set according to the uncertainty of different categories obtained by the first determining unit, wherein the category uncertainty distribution is the sampling probability corresponding to each category;
a selecting unit, configured to select, according to the category uncertainty distribution of the candidate sample set determined by the second determining unit, a first candidate sample in a first category to enter a sample pool, where the selecting unit includes:
the first selection module is used for selecting, if the first category is a category not yet in the sample pool, a candidate sample with the largest sample uncertainty from the candidate samples of the first category as the first candidate sample;
and the second selection module is used for obtaining, if the first category is a category already in the sample pool, a second sample belonging to the first category in the sample pool, and selecting the candidate sample with the smallest correlation mean with the second sample as the first candidate sample.
7. The apparatus of claim 6, wherein the selection unit further comprises:
a calculation module for calculating a correlation matrix between candidate samples of the first category before the second selection module selects the first candidate sample, the correlation matrix being used for calculating correlations between candidate samples.
8. The apparatus of claim 6, wherein, when the first selection module selects the first candidate sample, the sample uncertainty is calculated by:
predicting information entropy of the third sample belonging to the first category by a distance measuring device based on text characteristics;
and/or predicting, by the machine learning model-based device, an information entropy of the third sample belonging to the first category;
and/or, predicting, by the pre-training model-based device, an information entropy of the third sample belonging to the first category;
and determining the sample uncertainty of the third sample according to the information entropy of the third sample belonging to the first category.
9. The apparatus according to any of claims 6-8, wherein the selection unit is further configured to stop selecting samples in the candidate sample set if the number of samples in the sample pool reaches a preset threshold.
10. The apparatus of claim 6, wherein the acquisition unit comprises:
the generation module is used for carrying out classification prediction on the user data and the expanded corpus by using a preset model to generate sample data;
and the screening module is used for selecting sample data with a prediction result within a preset range from the sample data obtained by the generating module to form a candidate sample set.
11. A processor, characterized in that the processor is arranged to run a program, wherein the program when run performs the data selection method according to any of claims 1-5.
CN202010475317.6A 2020-05-29 2020-05-29 Data selection method and device Active CN113743431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010475317.6A CN113743431B (en) 2020-05-29 2020-05-29 Data selection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010475317.6A CN113743431B (en) 2020-05-29 2020-05-29 Data selection method and device

Publications (2)

Publication Number Publication Date
CN113743431A (en) 2021-12-03
CN113743431B (en) 2024-04-02

Family

ID=78724675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010475317.6A Active CN113743431B (en) 2020-05-29 2020-05-29 Data selection method and device

Country Status (1)

Country Link
CN (1) CN113743431B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717547A (en) * 2018-03-30 2018-10-30 国信优易数据有限公司 The method and device of sample data generation method and device, training pattern
CN108830312A (en) * 2018-06-01 2018-11-16 苏州中科天启遥感科技有限公司 A kind of integrated learning approach adaptively expanded based on sample
CN110688909A (en) * 2019-09-05 2020-01-14 南京有春科技有限公司 Method, device and equipment for identifying urban black and odorous water body and storage medium
EP3654065A1 (en) * 2018-11-16 2020-05-20 Bayerische Motoren Werke Aktiengesellschaft Apparatus and method for characterizing an object based on measurement samples from one or more location sensors

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8972244B2 (en) * 2013-01-25 2015-03-03 Xerox Corporation Sampling and optimization in phrase-based machine translation using an enriched language model representation

Also Published As

Publication number Publication date
CN113743431A (en) 2021-12-03

Similar Documents

Publication Publication Date Title
CN110348580B (en) Method and device for constructing GBDT model, and prediction method and device
CN112632385A (en) Course recommendation method and device, computer equipment and medium
US11537930B2 (en) Information processing device, information processing method, and program
CN112163419B (en) Text emotion recognition method and device, computer equipment and storage medium
CN112732871B (en) Multi-label classification method for acquiring client intention labels through robot induction
CN114037876A (en) Model optimization method and device
CN113515629A (en) Document classification method and device, computer equipment and storage medium
US20170278013A1 (en) Stereoscopic learning for classification
Milea et al. Prediction of the msci euro index based on fuzzy grammar fragments extracted from european central bank statements
CN112100377A (en) Text classification method and device, computer equipment and storage medium
CN114722198A (en) Method, system and related device for determining product classification code
US9053434B2 (en) Determining an obverse weight
Escalante et al. Particle swarm model selection for authorship verification
CN113743431B (en) Data selection method and device
CN115408527B (en) Text classification method and device, electronic equipment and storage medium
CN116629716A (en) Intelligent interaction system work efficiency analysis method
CN110879821A (en) Method, device, equipment and storage medium for generating rating card model derivative label
CN117523218A (en) Label generation, training of image classification model and image classification method and device
Chiu et al. A hybrid wine classification model for quality prediction
CN113761918A (en) Data processing method and device
CN114443840A (en) Text classification method, device and equipment
CN111737465A (en) Method and device for realizing multi-level and multi-class Chinese text classification
CN116932487B (en) Quantized data analysis method and system based on data paragraph division
CN117521673B (en) Natural language processing system with analysis training performance
CN116304058B (en) Method and device for identifying negative information of enterprise, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant