CN113743431A - Data selection method and device - Google Patents
Data selection method and device
- Publication number
- CN113743431A (application CN202010475317.6A)
- Authority
- CN
- China
- Prior art keywords
- sample
- category
- candidate
- samples
- uncertainty
- Prior art date: 2020-05-29
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The invention discloses a data selection method and device, relating to the technical field of data processing, with the main aim of optimizing sample selection in the active learning process and avoiding skew in the selection of sample data. The main technical scheme of the invention is as follows: obtaining a candidate sample set comprising a plurality of candidate samples belonging to different categories; calculating the uncertainty of each category from the candidate samples belonging to that category; calculating the category uncertainty distribution of the candidate sample set from the uncertainties of the different categories; and selecting, according to the category uncertainty distribution of the candidate sample set, a first candidate sample in a first category to enter a sample pool.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data selection method and apparatus.
Background
With the popularization of artificial intelligence, deep learning has achieved great breakthroughs in many practical applications. However, obtaining large amounts of accurately labeled data remains costly, and training models still demands considerable time and effort, which has become a limitation of current deep learning. Active learning can reach higher model accuracy with fewer labeled samples by screening the unlabeled data.
Active learning is a subfield of artificial intelligence, also called query learning or optimal experimental design in statistics. An active learning algorithm comprises two basic modules: a learning module and a selection strategy. Active learning uses the selection strategy to actively select some samples from an unlabeled sample set, hands them to experts in the relevant field for labeling, and then adds the labeled samples to the training data set, which is fed to the learning module for training. The process stops when the learning module meets a termination condition; otherwise it repeats, continually obtaining more labeled samples for training. However, when selecting a sample, existing active learning mainly considers the uncertainty of the sample itself (i.e., the degree to which the model cannot effectively identify and distinguish the sample) and the correlation between the sample and other samples (i.e., the degree of similarity between samples). When a large number of samples of different classes must be selected, and one class has both many samples and high uncertainty, the samples selected by active learning tend to skew toward that class, impairing the optimization effect of model training.
Disclosure of Invention
In view of the above problems, the present invention provides a data selection method and apparatus, with the main aim of optimizing sample selection in the active learning process and avoiding skew in the selection of sample data.
In order to achieve the purpose, the invention mainly provides the following technical scheme:
in one aspect, the present invention provides a data selection method, which specifically includes:
obtaining a set of candidate samples comprising a plurality of candidate samples belonging to different categories;
calculating the uncertainty of the category according to the candidate samples belonging to the same category;
calculating the category uncertainty distribution of the candidate sample set according to the uncertainties of different categories;
and selecting a first candidate sample in a first category to enter a sample pool according to the category uncertainty distribution of the candidate sample set.
Preferably, the selecting a first candidate sample in the first category into the sample pool includes:
if the first category is a category not yet present in the sample pool, selecting the candidate sample with the largest sample uncertainty from the candidate samples of the first category as the first candidate sample;
and if the first category is a category already present in the sample pool, obtaining a second sample belonging to the first category from the sample pool, and selecting the candidate sample with the smallest mean correlation with the second sample as the first candidate sample.
Preferably, the method further comprises:
calculating a correlation matrix between the candidate samples of the first category, the correlation matrix being used to calculate the correlation between the candidate samples.
Preferably, the sample uncertainty is calculated by:
predicting, by a distance metric method based on text features, the information entropy of a third sample belonging to the first category;
and/or predicting, by a machine-learning-model-based method, the information entropy of the third sample belonging to the first category;
and/or predicting, by a pre-trained-model-based method, the information entropy of the third sample belonging to the first category;
and determining the sample uncertainty of the third sample according to the information entropy of the third sample belonging to the first category.
Preferably, the method further comprises:
and if the number of samples in the sample pool reaches a preset threshold value, stopping selecting samples in the candidate sample set.
Preferably, the method further comprises:
performing classification prediction on user data and expanded corpora by using a preset model, to generate sample data;
and selecting sample data with the prediction result within a preset range from the sample data to form a candidate sample set.
In another aspect, the present invention provides a data selecting apparatus, which specifically includes:
an obtaining unit configured to obtain a candidate sample set, the candidate sample set including a plurality of candidate samples belonging to different categories;
a first determining unit, configured to calculate the uncertainty of each category from the candidate samples, belonging to the same category, obtained by the obtaining unit;
a second determining unit, configured to calculate the category uncertainty distribution of the candidate sample set from the uncertainties of the different categories obtained by the first determining unit;
and a selecting unit, configured to select a first candidate sample in a first category to enter a sample pool according to the category uncertainty distribution of the candidate sample set determined by the second determining unit.
Preferably, the selection unit includes:
a first selecting module, configured to select, as the first candidate sample, the candidate sample with the largest sample uncertainty among the candidate samples of the first category if the first category is not yet present in the sample pool;
and a second selecting module, configured to, if the first category is already present in the sample pool, obtain a second sample belonging to the first category from the sample pool and select, as the first candidate sample, the candidate sample with the smallest mean correlation with the second sample.
Preferably, the selection unit further includes:
a calculating module, configured to calculate a correlation matrix between the candidate samples of the first category before the second selecting module selects the first candidate sample, where the correlation matrix is used to calculate a correlation between the candidate samples.
Preferably, when the first selection module selects the first candidate sample, the sample uncertainty is calculated as follows:
a distance metric device based on text features predicts the information entropy of a third sample belonging to the first category;
and/or a machine-learning-model-based device predicts the information entropy of the third sample belonging to the first category;
and/or a pre-trained-model-based device predicts the information entropy of the third sample belonging to the first category;
and determining the sample uncertainty of the third sample according to the information entropy of the third sample belonging to the first category.
Preferably, the selecting unit is further configured to stop selecting samples from the candidate sample set if the number of samples in the sample pool reaches a preset threshold.
Preferably, the acquiring unit includes:
the generating module is used for carrying out classification prediction on the user data and the expanded corpora by utilizing a preset model to generate sample data;
and the screening module is used for selecting sample data with the prediction result within a preset range from the sample data obtained by the generating module to form a candidate sample set.
In another aspect, the present invention provides a processor, configured to execute a program, where the program executes the data selection method described above.
By the above technical scheme, the data selection method and device provided by the invention are mainly applied to the active learning process. Extracting candidate samples by category improves the diversity of the selected samples. In addition, the invention uses the uncertainty of the candidate samples to compute the uncertainty of each category in the candidate sample set, computes the category uncertainty distribution from these per-category uncertainties, and selects the first candidate sample from the first category chosen according to that distribution. Sample selection thus preserves the diversity of sample categories while still favoring samples from categories with high uncertainty. That is, the invention samples according to the statistics of the category uncertainty distribution, so that categories with low uncertainty also retain some probability of being selected, avoiding the selection skew of the existing active learning process.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a flow chart illustrating a data selection method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating another data selection method proposed by an embodiment of the present invention;
fig. 3 is a block diagram illustrating a data selection apparatus according to an embodiment of the present invention;
fig. 4 is a block diagram showing another data selection apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The data selection method provided by the embodiment of the invention optimizes and improves active learning in the model training process, so that the samples selected through active learning are more diverse; after manual labeling, the model can therefore be trained more effectively, improving the training effect. The specific steps of the method are shown in fig. 1 and comprise the following:
Step 101, obtaining a candidate sample set.
The candidate sample set comprises a plurality of candidate samples belonging to different categories. The category of each candidate sample in this step is set in advance based on the sample set. For example, for a sample set for training an emotion recognition model, the preset categories may include happiness, sadness, worry, excitement, and the like.
The candidate samples in the candidate sample set obtained in this step are samples on which the model cannot yet be trained effectively, that is, samples whose category the model cannot accurately identify. In other words, the class information of a candidate sample is not sufficient to train the model effectively; for example, the class information contains class probabilities for multiple classes whose magnitudes are similar. The embodiment of the invention aims to select some of these candidate samples for manual labeling, so that they can be used to train the model more effectively and improve the accuracy with which the model identifies samples.
Step 102, calculating the uncertainty of each category from the candidate samples belonging to that category.
The uncertainty of a category in this step is calculated from the uncertainties of all candidate samples in the category; for example, it may be the mean, or the sum, of those samples' uncertainties. The uncertainty of a candidate sample measures how difficult the sample is for the model to identify or distinguish, and can be quantified by different algorithms, such as the prediction score obtained by performing category prediction with a model, where the model may be the model to be trained, a word vector model, and the like.
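As a concrete illustration of this step, the following minimal Python sketch aggregates per-sample uncertainties into a category uncertainty. The text leaves open whether the mean or the sum is used, so both readings are exposed; the function and parameter names are illustrative, not from the patent.

```python
import numpy as np

def class_uncertainty(sample_uncertainties, reduce="mean"):
    """Aggregate per-sample uncertainties into one category uncertainty.

    The embodiment allows either the mean or the sum over the
    category's candidate samples; `reduce` selects the reading.
    """
    u = np.asarray(sample_uncertainties, dtype=float)
    return float(u.mean()) if reduce == "mean" else float(u.sum())
```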
In this embodiment, the category assigned to a candidate sample is generally the category with the highest class probability; by partitioning according to category, each category of the candidate sample set thus contains some number of candidate samples. A candidate sample is assigned to exactly one category.
Step 103, calculating the category uncertainty distribution of the candidate sample set from the uncertainties of the different categories.
This step determines the distribution of uncertainty over all categories in the candidate sample set. The distribution is primarily used to select the first category from among the categories. The category uncertainty distribution can be obtained by comparing the uncertainties of the categories, for example via normalization. The category uncertainty distribution can therefore also be understood as a sampling probability for each category: the greater the sampling probability, the greater the uncertainty of the candidate samples in the corresponding category, and the more likely a candidate sample is to be selected from that category.
Step 104, selecting a first candidate sample in the first category to enter a sample pool according to the category uncertainty distribution of the candidate sample set.
In this step, the category to select from is determined first, based on the category uncertainty distribution: the higher a category's uncertainty, the higher the probability that it is selected, but categories with low uncertainty still retain a small probability of being selected. A first candidate sample is then selected from the chosen first category. This selection draws candidate samples from more categories, improving the diversity of sample categories while still accounting for the uncertainty of the selected samples, and thereby avoids selection skew.
Furthermore, the first candidate sample is placed into a sample pool, which is used to bound the total number of selected candidate samples. While the number of candidate samples in the sample pool is insufficient, the operation of selecting a first candidate sample is repeated; once the pool holds enough samples, the selected candidate samples are output for manual labeling.
As described above, the data selection method provided by the embodiments of the invention improves and optimizes the active learning process. Candidate samples are partitioned by category, the uncertainty of each category is determined from the uncertainties of its candidate samples, the category uncertainty distribution is computed from those per-category uncertainties, the first category from which to select is determined according to that distribution, and a first candidate sample is selected from the candidates belonging to the first category and placed into the sample pool. This effectively selects candidate samples across multiple categories, preserving sample uncertainty while also accounting for the diversity of category sources, so that the samples submitted to experts for labeling are more representative and the manually labeled samples optimize model training more effectively.
The data selection method provided by the embodiment of the invention helps select samples that train a model more effectively, and can improve the training effect of various models in current service platforms. In particular, with the development of the Internet, the invention can effectively train models that provide human-machine dialogue services, which are widely used in industries such as e-commerce, telecommunications, government affairs, finance, education, entertainment, health, and tourism. For example, in the e-commerce industry, a user can request invoices, expedite delivery, check logistics, change addresses, and arrange express pickup by conversing with an intelligent customer service agent; in the telecommunications industry, or the operator industry generally, a user can check call charges, check data usage, purchase packages, report faults, change passwords, and the like.
Further, building on the data selection method shown in fig. 1, the following embodiment generates candidate samples from corpora in text, performs active learning, and selects some of the candidate samples for manual labeling. The specific process, shown in fig. 2, comprises:
Step 201, obtaining a candidate sample set.
A preset model performs classification prediction on the corpus data in the text to generate sample data; then, according to the category predictions in the sample data, the sample data whose prediction results fall within a preset range are selected to form the candidate sample set. The text source of the corpus data is not limited: it may be user data or other expanded corpora, determined by the application's requirements.
In addition, the preset model may be an existing model to be trained, and when it performs classification prediction on the corpus data, its classification categories are preset. That is, the input of the preset model is corpus data, the identifier of each category is preset, and the output is the probability that the corpus belongs to each category. In this step, whether an output corpus datum becomes a candidate sample is decided by whether its prediction result lies within a preset range. In general, the preset range may be set to 0.3–0.7. When some category in the output prediction has a probability value greater than 0.7, the corpus datum can be marked with that category's identifier to obtain sample data with labeling information, i.e., the corpus datum is determined to belong to that category. When a category has a probability value less than 0.3, the corpus datum can be determined not to belong to that category, and the probabilities of the other categories are examined further. When the category with the highest probability value, or the probability values of several categories, fall within the preset range, the preset model has difficulty identifying the category of the corpus datum; such corpus data must go through active learning, with representative items selected for manual annotation, so that the optimally trained model identifies these samples better. Therefore, the sample data generated from corpus data whose prediction results lie in the preset range are the samples of the candidate sample set in this step.
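A minimal sketch of this screening step follows, assuming a `predict_proba` callable stands in for the preset model; the names and return layout are illustrative, and the 0.3–0.7 bounds follow the example above.

```python
import numpy as np

def build_candidate_set(corpus, predict_proba, low=0.3, high=0.7):
    """Keep only corpus items the preset model cannot decide confidently.

    corpus:        iterable of raw text samples
    predict_proba: callable mapping one sample to a vector of class
                   probabilities (a stand-in for the preset model)
    Returns (candidates, categories): the samples whose top class
    probability falls inside [low, high], paired with the argmax
    class used later to group candidates by category.
    """
    candidates, categories = [], []
    for sample in corpus:
        proba = np.asarray(predict_proba(sample), dtype=float)
        if low <= proba.max() <= high:   # model is unsure about this item
            candidates.append(sample)
            categories.append(int(proba.argmax()))
    return candidates, categories
```

Items whose top probability exceeds 0.7 would instead be labeled directly with that category, per the description above.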
Step 202, calculating a correlation matrix between the candidate samples of the same category.
The correlation between candidate samples measures their similarity; training a model with similar corpus samples generally does little to improve its recognition capability. Therefore, in the active learning process, as few highly correlated candidate samples as possible should be selected, which improves the effect of the manually labeled samples on model training.
In this step, the correlation between candidate samples may be obtained by computing the cosine similarity of word vectors, or by weighting the scores of model prediction results; this embodiment does not specifically limit the method. Finally, the correlations between all candidate samples in the same category can be represented as a correlation matrix.
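A sketch of the cosine-similarity variant, assuming each candidate sample of the category has already been embedded as a vector (e.g., an averaged word2vec representation; the embedding choice here is an assumption, since the embodiment also allows weighting model prediction scores):

```python
import numpy as np

def correlation_matrix(vectors):
    """Pairwise cosine similarity between candidate-sample embeddings.

    vectors: (n_samples, dim) array, one embedding per candidate sample
    of the category. Returns an (n_samples, n_samples) matrix whose
    entry [i, j] is the correlation between candidates i and j.
    """
    v = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    v = v / np.clip(norms, 1e-12, None)  # guard zero-length vectors
    return v @ v.T
```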
Step 203, calculating the uncertainty of each category from the candidate samples belonging to that category.
For this step, the uncertainty of each candidate sample in a category is determined first. The information entropy may be determined by a preset algorithm, which this embodiment does not specifically limit. For example, this step adopts a committee scoring mechanism, in which at least two predictors are provided, such as a machine learning model's prediction score, a word vector representation (word2vec), and term frequency–inverse document frequency (TF-IDF). Each predictor scores the probability that a candidate sample belongs to a given category; the information entropy of the sample over the categories is calculated from these scores (e.g., from their mean), and that entropy is taken as the candidate sample's uncertainty for the category. The specific calculation is as follows:
H(X) = -\sum_{i=1}^{N} \bar{P}(x_i)\,\log \bar{P}(x_i), \qquad \bar{P}(x_i) = \tfrac{1}{3}\left(P_{\mathrm{model}}(x_i) + P_{\mathrm{word2vec}}(x_i) + P_{\mathrm{tfidf}}(x_i)\right)
where H(X) represents the entropy of the candidate sample, N is the number of categories, P_model(x_i) is the score with which the model predicts that the sample belongs to category x_i, P_word2vec(x_i) is the prediction score calculated using word2vec, and P_tfidf(x_i) is the prediction score calculated using the TF-IDF method.
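The committee computation can be sketched as follows; the three score sources are fixed by the description, but treating H(X) as the entropy of the mean committee score is one consistent reading rather than the patent's definitive rule.

```python
import numpy as np

def committee_entropy(scores):
    """Uncertainty H(X) of one candidate sample from committee scores.

    scores: (n_members, n_classes) array of per-class probability
    scores, one row per committee member (model, word2vec, TF-IDF).
    The member mean is taken as the consensus distribution (an
    assumption; the combination rule is not fixed by the text).
    """
    p = np.asarray(scores, dtype=float).mean(axis=0)
    p = p / p.sum()                       # renormalize the consensus
    p = np.clip(p, 1e-12, None)           # guard against log(0)
    return float(-(p * np.log(p)).sum())  # H(X) = -sum_i P(x_i) log P(x_i)

# A maximally ambiguous two-class sample yields the maximum entropy:
# committee_entropy([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]) ≈ np.log(2)
```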
Then, for the candidate samples of each category, the category's uncertainty is computed. A category's uncertainty is determined over all candidate samples in the category that have not yet been selected; in general, once a candidate sample is selected it is deleted from the category, so the unselected candidate samples are exactly those remaining in the category. The uncertainties of these samples are determined as described above, and their mean is taken as the uncertainty of the category.
Step 204, calculating the category uncertainty distribution of the candidate sample set from the uncertainties of the different categories.
In this step, a normalized exponential function (softmax) may be used to compute the category uncertainty distribution over the categories of the candidate sample set. Besides the normalized exponential function, other distribution types may be chosen, such as the Dirichlet distribution or the beta distribution. Note that the chosen distribution must be positively correlated with the magnitude of the category uncertainty, so that each category's sampling probability can be determined from the distribution.
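A sketch of the softmax variant; the temperature parameter is an illustrative extra, not part of the embodiment.

```python
import numpy as np

def category_sampling_distribution(class_uncertainties, temperature=1.0):
    """Softmax over per-category uncertainties -> sampling probabilities.

    class_uncertainties: (n_categories,) array of category uncertainties.
    Higher uncertainty gives higher probability, but every category
    keeps a nonzero chance of being drawn, which is what prevents the
    selection from skewing toward a single category.
    """
    u = np.asarray(class_uncertainties, dtype=float) / temperature
    e = np.exp(u - u.max())         # subtract max for numerical stability
    return e / e.sum()

def draw_category(distribution, rng=None):
    """Draw one category index from the sampling distribution."""
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(len(distribution), p=distribution))
```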
Step 205, selecting a first candidate sample in the first category to enter the sample pool according to the category uncertainty distribution.
In this embodiment, this step is repeated multiple times to ensure that a sufficient number of candidate samples are selected. There are therefore two cases when selecting the category of a candidate sample: either the category has already been chosen in a previous round, or it has not; this step uses a different candidate-sample selection method for each case.
Specifically, it must first be determined whether the currently selected first category has been chosen before, and to this end the category chosen in each round is recorded.
If the first category is not yet present in the sample pool, i.e., it has not been selected before, the candidate sample with the highest sample uncertainty among the first category's candidates is selected as the first candidate sample for manual labeling, and that candidate sample is deleted from the category. The calculation of sample uncertainty is explained in step 203. Assuming the sample uncertainty of a third sample in the first category is to be computed, it can be determined from the information entropy of the third sample with respect to the first category. That information entropy may be any one, or a combination, of: the information entropy predicted by a distance metric method based on text features, the information entropy predicted by a machine-learning-model-based method, and the information entropy predicted by a pre-trained-model-based method; see the formula for H(X) in step 203.
If the first category is already present in the sample pool, i.e., it has been selected before, the most uncertain candidate in the category has already been taken. In that case the sample with the highest uncertainty need not be selected again; instead, the candidate that differs most from the previously selected candidates (the second samples) is chosen. To this end, the correlation matrix from step 202 is used: the second samples belonging to the first category, i.e., the candidates already selected in the first category, are obtained from the sample pool, and among the unselected candidates, the one with the smallest mean correlation to the second samples is selected as the first candidate sample. The selected candidate is likewise deleted from the first category.
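The two selection rules can be sketched together as follows; the container layout (`remaining`, `pool_idx`) is hypothetical, and only the two rules themselves come from the description.

```python
import numpy as np

def pick_first_candidate(remaining, uncertainty, corr, pool_idx):
    """Select one candidate index from the chosen category.

    remaining:   indices of this category's candidates not yet selected
    uncertainty: dict mapping candidate index -> sample uncertainty H(X)
    corr:        this category's correlation matrix from step 202
    pool_idx:    indices of this category already in the sample pool
    """
    if not pool_idx:
        # Category not yet in the pool: take the most uncertain sample.
        return max(remaining, key=lambda i: uncertainty[i])
    # Category already in the pool: take the candidate whose mean
    # correlation with the already-selected (second) samples is smallest.
    return min(remaining,
               key=lambda i: np.mean([corr[i][j] for j in pool_idx]))
```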
Step 206, judging whether the number of candidate samples in the sample pool reaches a preset threshold.
The preset threshold can be customized. When the number of candidate samples reaches the preset threshold, selection from the candidate sample set stops and the candidate samples in the sample pool are output for expert labeling. When the number of candidate samples has not reached the threshold, step 207 is executed.
Step 207, updating the category uncertainty distribution of the candidate sample set according to the unselected candidate samples.
Because a first candidate sample has been selected, the number of candidates in the first category changes, and so does the first category's uncertainty; a change in one category's uncertainty changes the category uncertainty distribution over the whole candidate sample set, and hence the sampling probabilities. The update in this step therefore re-executes step 204. Steps 204 through 207 of the embodiment thus form a loop governed by the number of candidate samples in the sample pool: each time a candidate sample is selected, the category uncertainty distribution is updated, a category is chosen from the new distribution, and candidate samples continue to be selected and added to the sample pool until their number reaches the preset threshold.
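Tying the steps together, the following sketch runs the step 204–207 loop using the helper functions from the sketches above; the data layout (dicts keyed by category) is an assumption made for illustration.

```python
import numpy as np

def select_samples(by_category, uncertainty, corr, pool_size, rng=None):
    """Loop of steps 204-207: fill the sample pool up to pool_size.

    by_category: dict category -> list of remaining candidate indices
    uncertainty: dict candidate index -> sample uncertainty H(X)
    corr:        dict category -> correlation matrix for that category
    """
    if rng is None:
        rng = np.random.default_rng()
    pool = {c: [] for c in by_category}            # sample pool, by category
    while sum(len(v) for v in pool.values()) < pool_size:
        classes = [c for c, v in by_category.items() if v]
        if not classes:
            break                                  # candidates exhausted
        # Steps 204/207: (re)compute the category uncertainty distribution.
        class_u = [np.mean([uncertainty[i] for i in by_category[c]])
                   for c in classes]
        dist = category_sampling_distribution(class_u)
        # Step 205: draw a category, then pick one candidate from it.
        c = classes[draw_category(dist, rng)]
        i = pick_first_candidate(by_category[c], uncertainty,
                                 corr[c], pool[c])
        by_category[c].remove(i)                   # delete from the category
        pool[c].append(i)                          # enter the sample pool
    return pool
```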
Further, as an implementation of the data selection methods shown in fig. 1 and fig. 2, an embodiment of the invention provides a data selection apparatus, which mainly aims to optimize sample selection in the active learning process and avoid skew in the selection of sample data. For ease of reading, details covered in the foregoing method embodiments are not repeated in this apparatus embodiment, but it should be clear that the apparatus in this embodiment can implement all the content of the foregoing method embodiments. As shown in fig. 3, the apparatus specifically includes:
an obtaining unit 31 configured to obtain a candidate sample set, where the candidate sample set includes a plurality of candidate samples belonging to different categories;
a first determining unit 32, configured to calculate the uncertainty of each category from the candidate samples, belonging to the same category, obtained by the obtaining unit 31;
a second determining unit 33, configured to calculate the category uncertainty distribution of the candidate sample set from the uncertainties of the different categories obtained by the first determining unit 32;
a selecting unit 34, configured to select a first candidate sample in the first category to enter the sample pool according to the category uncertainty distribution of the candidate sample set determined by the second determining unit 33.
Further, as shown in fig. 4, the selecting unit 34 includes:
a first selecting module 341, configured to select, as the first candidate sample, the candidate sample with the largest sample uncertainty among the candidate samples of the first category if the first category is not yet present in the sample pool;
a second selecting module 342, configured to, if the first category is already present in the sample pool, obtain a second sample belonging to the first category from the sample pool and select, as the first candidate sample, the candidate sample with the smallest mean correlation with the second sample.
Further, as shown in fig. 4, the selecting unit 34 further includes:
a calculating module 343, configured to calculate a correlation matrix between the candidate samples of the first category before the second selecting module 342 selects the first candidate sample, where the correlation matrix is used to calculate the correlation between the candidate samples.
Further, as shown in fig. 4, when the first selection module 341 selects the first candidate sample, the sample uncertainty is calculated as follows:
a distance metric device based on text features predicts the information entropy of a third sample belonging to the first category;
and/or a machine-learning-model-based device predicts the information entropy of the third sample belonging to the first category;
and/or a pre-trained-model-based device predicts the information entropy of the third sample belonging to the first category;
and determining the sample uncertainty of the third sample according to the information entropy of the third sample belonging to the first category.
Further, the selecting unit 34 is further configured to stop selecting samples from the candidate sample set if the number of samples in the sample pool reaches a preset threshold.
Further, as shown in fig. 4, the acquiring unit 31 includes:
the generating module 311 is configured to perform classification prediction on the user data and the expanded corpus by using a preset model, and generate sample data;
a screening module 312, configured to select sample data with a prediction result within a preset range from the sample data obtained by the generating module 311 to form a candidate sample set.
In addition, an embodiment of the present invention further provides a processor, where the processor is configured to execute a program, where the program executes the data selection method provided in any one of the above embodiments.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be appreciated that the related features of the method and apparatus described above may be cross-referenced. In addition, "first", "second", and the like in the above embodiments are used to distinguish the embodiments and do not represent the merits of any embodiment.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In addition, the memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (13)
1. A method of data selection, the method comprising:
obtaining a set of candidate samples comprising a plurality of candidate samples belonging to different categories;
calculating the uncertainty of the category according to the candidate samples belonging to the same category;
calculating the category uncertainty distribution of the candidate sample set according to the uncertainties of different categories;
and selecting a first candidate sample in a first category to enter a sample pool according to the category uncertainty distribution of the candidate sample set.
2. The method of claim 1, wherein selecting the first candidate sample in the first category into the sample pool comprises:
if the first category is a category not yet present in the sample pool, selecting the candidate sample with the largest sample uncertainty from the candidate samples of the first category as the first candidate sample;
and if the first category is a category already present in the sample pool, obtaining a second sample belonging to the first category from the sample pool, and selecting the candidate sample with the smallest mean correlation with the second sample as the first candidate sample.
3. The method of claim 2, further comprising:
calculating a correlation matrix between the candidate samples of the first category, the correlation matrix being used to calculate the correlation between the candidate samples.
4. The method of claim 2, wherein the sample uncertainty is calculated by:
predicting, by a distance metric method based on text features, the information entropy of a third sample belonging to the first category;
and/or predicting, by a machine-learning-model-based method, the information entropy of the third sample belonging to the first category;
and/or predicting, by a pre-trained-model-based method, the information entropy of the third sample belonging to the first category;
and determining the sample uncertainty of the third sample according to the information entropy of the third sample belonging to the first category.
5. The method according to any one of claims 2-4, further comprising:
and if the number of samples in the sample pool reaches a preset threshold value, stopping selecting samples in the candidate sample set.
6. The method of claim 1, wherein the obtaining a set of candidate samples comprises:
performing classification prediction on user data and expanded corpora by using a preset model, to generate sample data;
and selecting sample data with a prediction result within a preset range from the sample data to form a candidate sample set.
7. A data selection apparatus, the apparatus comprising:
an obtaining unit configured to obtain a candidate sample set, the candidate sample set including a plurality of candidate samples belonging to different categories;
a first determining unit, configured to calculate the uncertainty of each category from the candidate samples, belonging to the same category, obtained by the obtaining unit;
a second determining unit, configured to calculate the category uncertainty distribution of the candidate sample set from the uncertainties of the different categories obtained by the first determining unit;
and a selecting unit, configured to select a first candidate sample in a first category to enter a sample pool according to the category uncertainty distribution of the candidate sample set determined by the second determining unit.
8. The apparatus of claim 7, wherein the selection unit comprises:
a first selecting module, configured to select, as the first candidate sample, the candidate sample with the largest sample uncertainty among the candidate samples of the first category if the first category is not yet present in the sample pool;
and a second selecting module, configured to, if the first category is already present in the sample pool, obtain a second sample belonging to the first category from the sample pool and select, as the first candidate sample, the candidate sample with the smallest mean correlation with the second sample.
9. The apparatus of claim 8, wherein the selection unit further comprises:
a calculating module, configured to calculate a correlation matrix between the candidate samples of the first category before the second selecting module selects the first candidate sample, where the correlation matrix is used to calculate a correlation between the candidate samples.
10. The apparatus of claim 8, wherein the sample uncertainty is calculated by the first selection module when selecting the first candidate sample as follows:
a distance metric device based on text features predicts the information entropy of a third sample belonging to the first category;
and/or a machine-learning-model-based device predicts the information entropy of the third sample belonging to the first category;
and/or a pre-trained-model-based device predicts the information entropy of the third sample belonging to the first category;
and determining the sample uncertainty of the third sample according to the information entropy of the third sample belonging to the first category.
11. The apparatus according to any of claims 8-10, wherein the selecting unit is further configured to stop selecting samples from the candidate sample set if the number of samples in the sample pool reaches a preset threshold.
12. The apparatus of claim 7, wherein the obtaining unit comprises:
the generating module is used for carrying out classification prediction on the user data and the expanded corpora by utilizing a preset model to generate sample data;
and the screening module is used for selecting sample data with the prediction result within a preset range from the sample data obtained by the generating module to form a candidate sample set.
13. A processor for running a program, wherein the program is run to perform the data selection method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010475317.6A (granted as CN113743431B) | 2020-05-29 | 2020-05-29 | Data selection method and device
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010475317.6A (granted as CN113743431B) | 2020-05-29 | 2020-05-29 | Data selection method and device
Publications (2)
Publication Number | Publication Date |
---|---|
CN113743431A | 2021-12-03
CN113743431B | 2024-04-02
Family
Family ID: 78724675
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010475317.6A (granted as CN113743431B, status: Active) | | 2020-05-29 | 2020-05-29
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113743431B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140214397A1 (en) * | 2013-01-25 | 2014-07-31 | Xerox Corporation | Sampling and optimization in phrase-based machine translation using an enriched language model representation |
CN108717547A (en) * | 2018-03-30 | 2018-10-30 | 国信优易数据有限公司 | The method and device of sample data generation method and device, training pattern |
CN108830312A (en) * | 2018-06-01 | 2018-11-16 | 苏州中科天启遥感科技有限公司 | A kind of integrated learning approach adaptively expanded based on sample |
CN110688909A (en) * | 2019-09-05 | 2020-01-14 | 南京有春科技有限公司 | Method, device and equipment for identifying urban black and odorous water body and storage medium |
EP3654065A1 (en) * | 2018-11-16 | 2020-05-20 | Bayerische Motoren Werke Aktiengesellschaft | Apparatus and method for characterizing an object based on measurement samples from one or more location sensors |
Also Published As
Publication number | Publication date |
---|---|
CN113743431B (en) | 2024-04-02 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | GR01 | Patent grant |