CN111667028A - Reliable negative sample determination method and related device - Google Patents
- Publication number: CN111667028A
- Application number: CN202010657192.9A
- Authority
- CN
- China
- Prior art keywords
- samples
- unlabeled
- sample
- positive
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The embodiments of this application disclose a reliable negative sample determination method and a related apparatus. To save the time needed to determine reliable negative samples, the p positive samples and u unlabeled samples required for modeling are each described by features of n dimensions, and reliable negative samples are screened from the unlabeled samples according to the commonalities and differences that positive and unlabeled samples exhibit in the same feature dimensions. From the features the samples include, the positive sample probability and negative sample probability corresponding to each feature of the u unlabeled samples are determined; the label sample probability that each of the u unlabeled samples belongs to the negative class is then determined, and reliable negative samples are screened out according to these label sample probabilities. Because this scheme requires no model training and can be completed entirely offline, the time needed to determine reliable negative samples is greatly shortened, the method adapts well to application scenarios with strict timeliness requirements, and the model training efficiency for such scenarios is greatly improved.
Description
Technical Field
The present application relates to the field of data processing, and in particular, to a reliable negative sample determination method and related apparatus.
Background
Reliable Negative (RN) samples are samples in the unlabeled set that have a high probability of being negative. RN samples are commonly used in semi-supervised learning, for example in positive-unlabeled learning (PU learning).
Semi-supervised learning based on RN samples has wide application scenarios, such as lookalike audience expansion; for different application scenarios, the corresponding functions can be realized by a network model obtained through semi-supervised learning.
During model training, an RN set relative to the positive sample set (P set) needs to be determined from a massive number of unlabeled samples. In the related art, the RN set formed by RN samples is usually found through multiple rounds of training, which significantly prolongs model training; for application scenarios with strict timeliness requirements, this training time is intolerable.
Disclosure of Invention
To solve this technical problem, the application provides a reliable negative sample determination method and a related apparatus, which shorten the time needed to determine reliable negative samples, adapt well to application scenarios with strict timeliness requirements, and greatly improve the model training efficiency for such scenarios.
The embodiment of the application discloses the following technical scheme:
In one aspect, an embodiment of the present application provides a reliable negative sample determination method, where samples are described by features of n dimensions, the samples include p positive samples forming a positive sample set and u unlabeled samples forming an unlabeled sample set, and the method is performed by a data processing device; the method includes:
determining, according to the features included in the p positive samples and the u unlabeled samples, the positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples;
determining, according to the positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples, the label sample probabilities that the u unlabeled samples respectively belong to the negative class;
determining an unlabeled sample whose label sample probability is higher than a threshold as a reliable negative sample.
In another aspect, an embodiment of the present application provides a reliable negative sample determining apparatus, where samples are described by features of n dimensions, the samples include p positive samples forming a positive sample set and u unlabeled samples forming an unlabeled sample set, and the apparatus includes a first determining unit, a second determining unit, and a third determining unit:
the first determining unit is configured to determine, according to the features included in the p positive samples and the u unlabeled samples, the positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples;
the second determining unit is configured to determine, according to the positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples, the label sample probabilities that the u unlabeled samples respectively belong to the negative class;
the third determining unit is configured to determine an unlabeled sample whose label sample probability is higher than the threshold as a reliable negative sample.
In another aspect, an embodiment of the present application further provides a reliable negative example determining apparatus, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of the above aspect according to instructions in the program code.
In another aspect, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program, where the computer program is used to execute the method in the foregoing aspect.
According to the technical scheme, to save the time needed to determine reliable negative samples, the p positive samples and u unlabeled samples required for modeling are each described by features of n dimensions, and reliable negative samples are screened from the unlabeled samples using the commonalities and differences that positive and unlabeled samples exhibit in the same feature dimensions. The positive sample probability and negative sample probability corresponding to each feature included in the u unlabeled samples are determined from the features the samples include; from these, the label sample probability that each of the u unlabeled samples belongs to the negative class is determined, and reliable negative samples are screened from the u unlabeled samples according to the label sample probabilities. Because this scheme requires no model training and can be completed entirely offline, the time needed to determine reliable negative samples is greatly shortened, the method adapts well to application scenarios with strict timeliness requirements, and the model training efficiency for such scenarios is greatly improved.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a reliable negative example determination scenario provided by an embodiment of the present application;
- FIG. 2 is a flowchart of a reliable negative sample determination method provided by an embodiment of the present application;
- FIG. 3 is a block diagram of a reliable negative sample determining apparatus provided by an embodiment of the present application;
- FIG. 4 is a schematic structural diagram of a server provided by an embodiment of the present application;
- FIG. 5 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
Reliable negative samples are an important training basis for completing model training in semi-supervised learning. However, in the related art, reliable negative samples can be determined from the unlabeled samples only after an additional model training process, which is considerably time-consuming and lengthens the time needed to complete model training. For application scenarios with strict timeliness requirements, such training time can hardly be afforded.
Therefore, the embodiments of this application provide a reliable negative sample determination method to shorten the time needed to determine reliable negative samples. The method may involve Artificial Intelligence (AI): the theory, methods, techniques, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies.
In the embodiments of the present application, the artificial intelligence software technologies mainly involved are natural language processing and deep learning.
Next, the execution body of the embodiments of the present application is described. The reliable negative sample determination method provided by this application may be executed by a data processing device. The data processing device may be a terminal device such as a smartphone, a computer, a Personal Digital Assistant (PDA), a tablet computer, a Point-of-Sale (POS) terminal, or a vehicle-mounted computer. The data processing device may also be a server, such as an independent server, a server in a cluster, or a cloud server.
In the embodiments of the present application, the data processing device may have the capability of implementing Natural Language Processing (NLP), an important direction in the fields of computer science and artificial intelligence. NLP studies theories and methods that enable effective communication between humans and computers in natural language; it is a science integrating linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language people use every day, so it is closely related to linguistics. NLP techniques typically include text processing, semantic understanding, and the like.
For example, text preprocessing and semantic understanding in NLP may be involved, including word segmentation, word tagging, sentence classification, and the like.
In the embodiments of the present application, the data processing device may, by implementing NLP techniques, preprocess the features included in the positive samples and the unlabeled samples, for example through semantic understanding and semantic conversion.
The data processing device may also be provided with Machine Learning (ML) capabilities. ML is a multi-field interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or realize human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence; it is applied across all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks.
For example, deep learning in ML may be involved, including various types of artificial neural networks.
In the embodiments of the present application, the reliable negative sample determination method mainly relates to applications in various artificial neural networks; for example, in PU learning, the determined reliable negative samples are applied to the training of a network model.
In the scenario shown in FIG. 1, the data processing device is a server 100, which obtains p positive samples 100 and u unlabeled samples 102. It should be emphasized that an unlabeled sample is not a sample with no labels at all; rather, relative to the positive samples, it lacks the label associated with the positive class. For example, if the label of a positive sample identifies it as user information having a specific feature, an unlabeled sample carries no label indicating whether it is user information having that feature.
In a PU learning application scenario, the positive sample set may be a seed pack of users provided by the service party, and the goal is to find, according to the seed pack, similar users within the data related to the application scenario.
In the scenario shown in FIG. 1, each sample is described by features of n = 7 dimensions, which describe the properties of the sample from different perspectives. The more similar the features an unlabeled sample includes are to those included in the positive samples, the higher the likelihood that the unlabeled sample belongs to the positive class; the less similar they are, the higher the likelihood that it belongs to the negative class. Thus, the server 100 can screen reliable negative samples from the unlabeled samples based on the commonalities and differences that the positive samples and unlabeled samples exhibit in the same feature dimensions.
Specifically, the server 100 determines, according to the features included in the p positive samples 100 and u unlabeled samples 102, the positive sample probability and negative sample probability corresponding to each feature of the unlabeled samples; that is, each feature dimension of an unlabeled sample has a positive sample probability and a negative sample probability. The positive sample probability identifies the probability that an unlabeled sample having this feature is judged a positive sample, and the negative sample probability identifies the probability that an unlabeled sample having this feature is judged a negative sample.
Having determined the positive and negative sample probabilities corresponding to the features, the server 100 can determine the label sample probability that each of the u unlabeled samples 102 belongs to the negative class, and screen out the unlabeled samples with higher label sample probabilities as reliable negative samples.
This scheme requires no model training and can be completed entirely offline: it bypasses training-based methods and directly computes the probability that samples in the U set are negative, which greatly speeds up the calculation, and the computationally slowest part can be extracted into offline computation. The time needed to determine reliable negative samples is therefore greatly shortened, the method adapts well to application scenarios with strict timeliness requirements, and the model training efficiency for such scenarios is greatly improved.
FIG. 2 is a flowchart of a reliable negative sample determination method provided by an embodiment of the present application. In the method shown in FIG. 2, samples are described by features of n dimensions, including p positive samples constituting the positive sample set and u unlabeled samples constituting the unlabeled sample set.
S201: determining, according to the features included in the p positive samples and the u unlabeled samples, the positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples.
In the present application, it is found that if the features describing a sample are all null, the probabilities of an unlabeled sample being a positive sample and a negative sample are, respectively:

P(+) = |P| / (|P| + |U|) = p / (p + u)
P(-) = |U| / (|P| + |U|) = u / (p + u)

where P(+) is the probability that a sample in the union of the positive sample set P and the unlabeled sample set U belongs to the positive class, the |·| operation represents the number of samples in a set, and, in the same way, P(-) is the probability that a sample in P ∪ U belongs to the negative class.
When valid features are added to a sample, if n−1 features are already present and the n-th feature is added, the probability that the sample is a positive or negative sample becomes:

P(+ | x_1, …, x_n) ∝ P(+ | x_1, …, x_{n−1}) · P(x_n | +)
P(- | x_1, …, x_n) ∝ P(- | x_1, …, x_{n−1}) · P(x_n | -)

where P(x_i | +) and P(x_i | -) describe, after the i-th feature is added, the proportion by which the sample leans toward the positive and negative classes; they correspond to the positive sample probability and negative sample probability determined in this step. x_i is the i-th of the n-dimensional features.
S202: determining, according to the positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples, the label sample probabilities that the u unlabeled samples respectively belong to the negative class.
After the positive sample probability and negative sample probability corresponding to each feature are determined, the probability that each of the u unlabeled samples belongs to the negative class can be determined based on the features it includes. Optionally, the label sample probability that an unlabeled sample belongs to the negative class may be determined by Formula 1:

P(- | x_1, …, x_n) = [ P(-) · ∏_{i=1}^{n} P(x_i | -) ] / [ P(+) · ∏_{i=1}^{n} P(x_i | +) + P(-) · ∏_{i=1}^{n} P(x_i | -) ]    (Formula 1)
S203: determining an unlabeled sample whose label sample probability is higher than the threshold as a reliable negative sample.
The higher the determined label sample probability, the more likely the corresponding unlabeled sample is a reliable negative sample; in this application, a threshold is used as the basis for determining reliable negative samples.
By traversing the unlabeled sample set formed by the u unlabeled samples, the reliable negative samples forming the RN set are determined. This scheme requires no model training and can be completed entirely offline, so the time needed to determine reliable negative samples is greatly shortened.
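Steps S201–S203 can be sketched end-to-end as follows. The encoding (each sample as a set of feature indices it possesses) and the function name are assumptions for illustration, not the patent's implementation; the score is a naive-Bayes-style evaluation of Formula 1:

```python
def reliable_negatives(positives, unlabeled, n, threshold):
    """Screen reliable negative samples from `unlabeled` in the PU setting.

    positives / unlabeled: lists of samples; each sample is the set of
    feature indices (in range(n)) that the sample possesses.
    Returns indices of unlabeled samples whose probability of being
    negative exceeds `threshold`.
    """
    p, u = len(positives), len(unlabeled)
    prior_pos, prior_neg = p / (p + u), u / (p + u)

    # S201: per-feature counts -> positive / negative sample probabilities
    pos_count = [sum(i in s for s in positives) for i in range(n)]
    neg_count = [sum(i in s for s in unlabeled) for i in range(n)]

    rn = []
    for idx, sample in enumerate(unlabeled):
        # S202: label sample probability via Formula 1
        score_pos, score_neg = prior_pos, prior_neg
        for i in range(n):
            if i in sample:   # sample has feature i: use P(x_i|+), P(x_i|-)
                score_pos *= pos_count[i] / p
                score_neg *= neg_count[i] / u
            else:             # sample lacks feature i: use the complements
                score_pos *= (p - pos_count[i]) / p
                score_neg *= (u - neg_count[i]) / u
        total = score_pos + score_neg
        prob_neg = score_neg / total if total else 0.0
        # S203: keep samples whose label sample probability exceeds the threshold
        if prob_neg > threshold:
            rn.append(idx)
    return rn
```

For example, with positives that all share feature 0, an unlabeled sample lacking feature 0 scores a high negative probability and is returned as a reliable negative. The whole computation is counting and multiplication, which is why it can run offline without any model training.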
For application scenarios with strict timeliness requirements, a reliable negative sample set can be quickly determined from the unlabeled sample set according to the positive sample set corresponding to the scenario. This greatly accelerates the training of the network model for the scenario, so that a usable network model can be obtained as soon as possible to perform the related services.
Next, an optional way of determining the positive and negative sample probabilities of a feature is described. Both having a feature and lacking it affect the final judgment of whether an unlabeled sample is a reliable negative sample. Therefore, in this manner, not only the positive and negative sample probabilities corresponding to an unlabeled sample having a feature are determined, but also those corresponding to an unlabeled sample not having the feature.
For convenience of description, the i-th of the n-dimensional features is taken as an example; the positive and negative sample probabilities of any of the n-dimensional features can be determined in the same manner as described below for the i-th feature.
For the ith feature, one optional manner of S201 is:
s2011: determining a first number from the p positive samples and a second number from the u unlabeled samples according to the ith feature.
Wherein the first number is the number of positive samples including the ith feature in the p positive samples, and the second number is the number of unlabeled samples including the ith feature in the u unlabeled samples.
S2012: and determining a first positive sample probability and a first negative sample probability respectively corresponding to the unlabeled sample having the ith feature and a second positive sample probability and a second negative sample probability respectively corresponding to the unlabeled sample not having the ith feature according to the first quantity and the second quantity.
The first positive sample probability and the first negative sample probability may be expressed as:

P(x_i | +) = p_{x_i} / p
P(x_i | -) = u_{x_i} / u

where P(x_i | +) is the first positive sample probability, P(x_i | -) is the first negative sample probability, p_{x_i} is the number of positive samples having the i-th feature (i.e., the first number), and u_{x_i} is the number of unlabeled samples having the i-th feature (i.e., the second number).
The second positive sample probability and the second negative sample probability may be expressed as:

P(¬x_i | +) = (p − p_{x_i}) / p
P(¬x_i | -) = (u − u_{x_i}) / u
When the label sample probabilities of the u unlabeled samples are calculated in S202, for one unlabeled sample, the positive sample probability and negative sample probability corresponding to its i-th feature are assigned according to whether the sample has the i-th feature: the positive sample probability takes the value P(x_i | +) or P(¬x_i | +), and the negative sample probability takes the value P(x_i | -) or P(¬x_i | -).
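A minimal sketch of these four per-feature probabilities, computed from the first and second numbers (the function name and dict keys are illustrative assumptions):

```python
def feature_probs(first_count: int, second_count: int, p: int, u: int) -> dict:
    """Four probabilities for one feature.

    first_count:  number of positive samples having the feature (first number)
    second_count: number of unlabeled samples having it          (second number)
    p, u:         sizes of the positive and unlabeled sets
    """
    return {
        "pos_has":   first_count / p,         # first positive sample probability
        "neg_has":   second_count / u,        # first negative sample probability
        "pos_lacks": (p - first_count) / p,   # second positive sample probability
        "neg_lacks": (u - second_count) / u,  # second negative sample probability
    }
```

By construction, the "has" and "lacks" probabilities for each class sum to 1, since every sample either has the feature or does not.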
To further improve the accuracy of determining reliable negative samples based on features, the actual numerical values of the features in different samples can also be considered in the calculation, with more refined processing based on which of several partitions, defined over the feature's value range, the actual value falls into.
In this embodiment, the description continues with the i-th feature as an example; a plurality of partitions can be specified based on the value range of the i-th feature.
If the feature value of the ith feature of the unlabeled sample processed at this time is in the tth partition of the multiple partitions, the first number determined according to the ith feature is the number of positive samples including the ith feature in the tth partition in the p positive samples.
The second number determined according to the ith feature is the number of unlabeled samples in the u unlabeled samples including the ith feature in the t-th partition.
In this embodiment, to facilitate the partitioning, information corresponding to the ith feature may be quantized to obtain a corresponding number, for example, the ith feature corresponds to a job, and different jobs may be quantized to the corresponding number for partitioning.
The embodiment of the present application does not limit the partitioning manner. For example, the value range of the feature may be divided into several equal partitions: if the value range is 0 to 100, ten partitions of width 10 may be used. Besides equal-width partitioning, feature values may also be bucketed as required, or even transformed by a function before bucketing; it suffices that each feature value maps to exactly one of the T buckets. Such bucketing can adapt to application scenarios where feature values are non-uniformly distributed, so that the number of samples falling into each bucket of a feature is balanced.
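One admissible scheme among those described is equal-width bucketing; a sketch under that assumption (the function name is hypothetical):

```python
def bucket_index(value: float, lo: float, hi: float, T: int) -> int:
    """Map a feature value to one of T equal-width buckets over [lo, hi].

    Values at or beyond the boundaries land in the first/last bucket,
    so every value maps to exactly one bucket, as the scheme requires.
    """
    if value <= lo:
        return 0
    if value >= hi:
        return T - 1
    return int((value - lo) / (hi - lo) * T)
```

For skewed feature distributions, the same signature could instead wrap a quantile-based or function-transformed bucketing, as the text allows, to balance the per-bucket counts.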
Correspondingly, when the positive and negative sample probabilities of the i-th feature are determined on this basis, the partition t into which the feature value falls is used: the positive sample probability takes the value

P(x_i^{(t)} | +) = p_{x_i, t} / p

and the negative sample probability takes the value

P(x_i^{(t)} | -) = u_{x_i, t} / u

where p_{x_i, t} and u_{x_i, t} are the numbers of positive and unlabeled samples whose i-th feature falls in the t-th partition, and t ranges over the number of partitions.
Thus, the same feature may have a corresponding plurality of positive sample probabilities and a plurality of negative sample probabilities, respectively corresponding to different partitions. Therefore, the accuracy of the influence of the characteristics on the judgment of the sample as a positive sample and a negative sample is improved.
As mentioned above, although an unlabeled sample does not necessarily have all of the n-dimensional features, whether it has the i-th feature affects the judgment of whether it is a reliable negative sample. Therefore, after the first positive sample probability, second positive sample probability, first negative sample probability, and second negative sample probability corresponding to each feature are determined, they can be applied to determining the label sample probability of an unlabeled sample.
For example, the target sample is one of the u unlabeled samples, and for the target sample, one possible implementation manner of S202 is:
s2021: a first feature set and a second feature set are determined from the target sample.
Wherein the first set of features includes features of the n-dimensional features that the target sample has, and the second set of features includes features of the n-dimensional features that the target sample does not have.
S2022: and determining the label sample probability of the target sample belonging to the negative sample according to the first positive sample probability and the first negative sample probability corresponding to the features in the first feature set and the second positive sample probability and the second negative sample probability corresponding to the features in the second feature set.
Taking Formula 1 as an example: when the positive sample probability term for the ith feature is calculated, if the target sample has the ith feature, the term takes the value of the first positive sample probability; if the target sample does not have the ith feature, the term takes the value of the second positive sample probability. Likewise, when the negative sample probability term is calculated, if the target sample has the ith feature, the term takes the value of the first negative sample probability; if not, it takes the value of the second negative sample probability.
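Formula 1 itself is not reproduced in this text, so the combination rule below is an assumption on our part: a naive-Bayes-style product over the per-feature probabilities described above, normalised into a labeled sample probability. The function name and the tuple layout are likewise ours.

```python
def label_sample_probability(has_feature, pos_probs, neg_probs):
    """Combine per-feature probabilities into the probability that the
    target sample belongs to the negative samples (naive-Bayes-style
    assumption; not necessarily the patent's Formula 1).

    has_feature[i] -- whether the target sample has the ith feature
    pos_probs[i]   -- (first_pos, second_pos) probabilities for feature i
    neg_probs[i]   -- (first_neg, second_neg) probabilities for feature i
    """
    pos_score, neg_score = 1.0, 1.0
    for i, present in enumerate(has_feature):
        idx = 0 if present else 1  # pick first vs second probability
        pos_score *= pos_probs[i][idx]
        neg_score *= neg_probs[i][idx]
    total = pos_score + neg_score
    return neg_score / total if total > 0 else 0.0
```

The selection of the first versus second probability per feature follows the rule described above; only the multiplicative combination and normalisation are assumed.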
In order to further improve the accuracy of the scheme provided by the embodiment of the application, the features relied upon should be independent; that is, whether the current feature appears and what its value is should have no relation to whether other features appear and what their values are. If feature division takes this fully into account, the determined n-dimensional features may all be independent features; in some cases, however, the result of feature division is not ideal, and independent features need to be determined from the n-dimensional features before the positive and negative sample probabilities are calculated. Determining the independent features can be completed offline, which makes the calculation more efficient.
The manner in which the independent features are determined can be referred to the following flow:
s301: determining independent feature parameters between the features of the n dimensions.
S302: and according to the independent feature parameters, determining m-dimensional features from the n-dimensional features as independent features.
The independent feature parameters may be determined in various ways, such as mutual information, the chi-square test, information gain, the Pearson correlation coefficient or other correlation coefficients, or even by modeling the correlation between features. The independent feature parameter is used to represent the degree of independence between two features.
If the mode of determining the independent characteristic parameter is mutual information calculation, an optional implementation manner of S301 is:
s3011: and determining mutual information between every two characteristics in the characteristics of the n dimensions.
The mutual information between two features can be expressed by the following formula:

I(X;Y)=∑∑p(x,y)log(p(x,y)/(p(x)p(y)))
wherein X and Y represent the two features to be calculated, and x and y are their respective feature values, whose value range may be between 0 and 10. p(x) represents the probability that X takes the value x among the samples of the positive sample set plus the unlabeled sample set, p(y) represents the probability that Y takes the value y among those samples, and p(x,y) represents the probability that X and Y take the values x and y, respectively, among those samples.
Mutual Information is a useful information measure in information theory. It can be seen as the amount of information one random variable contains about another, or as the reduction in uncertainty of one random variable due to knowledge of another. However, mutual information alone does not intuitively reflect whether two features are independent of each other; the entropy corresponding to the features also needs to be introduced.
S3012: and determining a mutual information ratio as an independent characteristic parameter according to the mutual information and the entropy of the corresponding characteristic.
The entropy of each feature can be determined by:
H(X)=-∑p(x)logp(x)
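The mutual information ratio of S3012 could be sketched as follows. How the ratio is normalised (here, by the smaller of the two entropies) is an assumption, since the text does not fix it:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X;Y) = sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x) p(y)))."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)           # marginal counts
    pxy = Counter(zip(xs, ys))                  # joint counts
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint / (p(x) * p(y)) == c * n / (count_x * count_y)
        mi += p_joint * log2(c * n / (px[x] * py[y]))
    return mi

def entropy(xs):
    """H(X) = -sum over x of p(x) * log2 p(x)."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mutual_information_ratio(xs, ys):
    """Mutual information normalised by the smaller entropy (assumed rule):
    1.0 for identical features, 0.0 for fully independent ones."""
    h = min(entropy(xs), entropy(ys))
    return mutual_information(xs, ys) / h if h > 0 else 0.0
```

Two identical feature columns give a ratio of 1.0, while two statistically independent columns give 0.0, which is the intuitive independence signal the mutual information alone lacks.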
Through traversal, the mutual information ratio between any two of the n-dimensional features is determined. Specifically, the following process may be adopted:
a) Assume that there are N features, numbered 1 to N, and that the candidate set F includes all N features.
b) For X, the index i takes values from 1 to N-1.
c) For Y, the index j takes values from i+1 to N.
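Steps a) to c) above enumerate the feature pairs but leave the retention rule implicit. The sketch below assumes a greedy rule: a feature is kept only if its mutual information ratio with every already-kept feature stays below a threshold. Both the rule and the threshold value are assumptions, not the patent's stated procedure.

```python
def select_independent_features(features, mi_ratio, threshold=0.5):
    """Greedy pairwise traversal over features numbered 1..N.

    features  -- dict: feature id -> list of bucketed values per sample
    mi_ratio  -- function(xs, ys) -> mutual information ratio in [0, 1]
    threshold -- assumed cutoff: a candidate is independent if its ratio
                 with every already-kept feature is below this value
    """
    kept = []
    for fid in sorted(features):  # traverse features in numbered order
        xs = features[fid]
        if all(mi_ratio(xs, features[k]) < threshold for k in kept):
            kept.append(fid)
    return kept
```

Because each candidate is only compared against previously kept features, the traversal covers exactly the (i, j) pairs with j > i, matching the loop bounds in steps b) and c).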
Accordingly, an optional implementation manner of S201 includes:
and determining the probability of the positive sample and the probability of the negative sample corresponding to the independent features included in the u unlabeled samples according to the independent features included in the p positive samples and the u unlabeled samples.
That is to say, in the process of determining the reliable negative samples in the unlabeled samples based on the features, if independent features have been determined from the n-dimensional features, the reliable negative samples are determined according to the independent features only, without considering dependent features. This reduces the amount of calculation and further improves the calculation efficiency.
Assuming that the independent features determined from the n-dimensional features include f, the method for determining the labeled sample probability of the u unlabeled samples can be adjusted from Formula 1 to Formula 2 according to the positive sample probability and the negative sample probability respectively corresponding to each independent feature:
In this manner, compared with PU Learning algorithms in the related art that require additional training, the improvement in effect is essentially similar, but the time consumed is only about 25%, which greatly improves efficiency.
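Putting the pieces together, a minimal end-to-end sketch of screening reliable negatives might look as follows. The per-feature probability estimates (simple frequency shares in the positive and unlabeled sets), the 1e-9 floor, and the threshold are illustrative assumptions standing in for Formula 1 / Formula 2, which this text does not reproduce:

```python
def reliable_negatives(pos, unl, n, threshold=0.9):
    """Screen reliable negative samples from u unlabeled samples.

    pos, unl -- lists of samples; each sample is the set of feature ids
                (0..n-1) that the sample has
    For feature i, the fraction of positive / unlabeled samples having it
    stands in for the positive / negative sample probability (assumed).
    """
    p, u = len(pos), len(unl)
    out = []
    for s in unl:
        pos_score, neg_score = 1.0, 1.0
        for i in range(n):
            first = sum(i in x for x in pos) / p   # share of positives with i
            second = sum(i in x for x in unl) / u  # share of unlabeled with i
            if i in s:
                pos_score *= max(first, 1e-9)
                neg_score *= max(second, 1e-9)
            else:
                pos_score *= max(1 - first, 1e-9)
                neg_score *= max(1 - second, 1e-9)
        label_prob = neg_score / (pos_score + neg_score)
        if label_prob > threshold:  # labeled sample probability above threshold
            out.append(s)
    return out
```

No model is trained anywhere in this loop, which is the point of the scheme: the per-feature shares can be counted once offline, after which screening is a single pass over the unlabeled set.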
For the above-described reliable negative example determination method, an embodiment of the present application further provides a reliable negative example determination apparatus.
Referring to fig. 3, fig. 3 is a reliable negative example determining apparatus provided in an embodiment of the present application, where examples are described by features of n dimensions, and the examples include p positive examples constituting a positive example set and u unlabeled examples constituting an unlabeled example set, and the apparatus includes a first determining unit 301, a second determining unit 302, and a third determining unit 303:
the first determining unit 301 is configured to determine, according to features included in the p positive samples and the u unlabeled samples, positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples;
the second determining unit 302 is configured to determine, according to the positive sample probabilities and the negative sample probabilities respectively corresponding to the features included in the u unlabeled samples, the labeled sample probabilities that the u unlabeled samples respectively belong to the negative sample;
the third determining unit 303 is configured to determine an unlabeled exemplar with the probability of the labeled exemplar being higher than a threshold value as a reliable negative exemplar.
Optionally, the ith feature is an ith feature of the n dimensions, and for the ith feature, the first determining unit is further configured to:
determining a first number from the p positive samples and a second number from the u unlabeled samples according to the ith feature; wherein the first number is the number of positive samples including the ith feature in the p positive samples, and the second number is the number of unlabeled samples including the ith feature in the u unlabeled samples;
and determining a first positive sample probability and a first negative sample probability respectively corresponding to the unlabeled sample having the ith feature and a second positive sample probability and a second negative sample probability respectively corresponding to the unlabeled sample not having the ith feature according to the first quantity and the second quantity.
Optionally, if the characteristic value of the ith characteristic is in a tth partition, the tth partition is one of a plurality of partitions determined according to the variable range of the characteristic value of the ith characteristic; the first number determined according to the ith feature is the number of positive samples including the ith feature in the t-th partition in the p positive samples, and the second number determined according to the ith feature is the number of unlabeled samples including the ith feature in the t-th partition in the u unlabeled samples.
Optionally, the target sample is one of the u unlabeled samples, and for the target sample, the second determining unit is further configured to:
determining a first feature set and a second feature set according to the target sample; wherein the first set of features includes features of the n-dimensional features that the target sample has, and the second set of features includes features of the n-dimensional features that the target sample does not have;
and determining the label sample probability of the target sample belonging to the negative sample according to the first positive sample probability and the first negative sample probability corresponding to the features in the first feature set and the second positive sample probability and the second negative sample probability corresponding to the features in the second feature set.
Optionally, the apparatus further includes a fourth determining unit:
the fourth determining unit is configured to determine independent feature parameters between the features of the n dimensions;
according to the independent feature parameters, m dimensional features are determined from the n dimensional features to be used as independent features;
the first determining unit is further configured to determine, according to the independent features included in the p positive samples and the u unlabeled samples, positive sample probabilities and negative sample probabilities respectively corresponding to the independent features included in the u unlabeled samples.
Optionally, the fourth determining unit is further configured to:
determining mutual information between every two features in the n-dimension features;
and determining a mutual information ratio as an independent characteristic parameter according to the mutual information and the entropy of the corresponding characteristic.
Therefore, in order to save the time for determining the reliable negative samples, the p positive samples and the u unlabeled samples required for modeling are each described by features of n dimensions, and the reliable negative samples are screened from the unlabeled samples by using the commonalities and differences of the positive samples and the unlabeled samples in the same feature dimensions. Specifically, the positive sample probability and the negative sample probability corresponding to the features included in the u unlabeled samples are determined according to the features included in the samples; the labeled sample probability that each of the u unlabeled samples belongs to the negative samples is determined according to these probabilities; and the reliable negative samples are screened from the u unlabeled samples according to the labeled sample probabilities. This scheme requires no model training and can be completed directly offline, so the time for determining the reliable negative samples is greatly shortened, the scheme adapts well to application scenarios with high timeliness requirements, and the model training efficiency for such scenarios is greatly improved.
The embodiment of the application also provides a server and a terminal device for reliable negative sample determination, and the server and the terminal device can be the processing device. The server and the terminal device for reliable negative example determination provided by the embodiment of the present application will be described in terms of hardware implementation.
Referring to fig. 4, fig. 4 is a schematic diagram of a server 1400 provided by an embodiment of the present application. The server 1400 may vary considerably in configuration or performance, and may include one or more Central Processing Units (CPUs) 1422 (e.g., one or more processors), a memory 1432, and one or more storage media 1430 (e.g., one or more mass storage devices) storing applications 1442 or data 1444. The memory 1432 and the storage media 1430 may be transient or persistent storage. The program stored on a storage medium 1430 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processor 1422 may be configured to communicate with the storage medium 1430 to execute the series of instruction operations in the storage medium 1430 on the server 1400.
The server 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input-output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 4.
Wherein the samples are described by features of n dimensions, the samples include p positive samples constituting a positive sample set and u unlabeled samples constituting an unlabeled sample set, and the CPU 1422 is configured to perform the following steps:
determining positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples according to the features included in the p positive samples and the u unlabeled samples;
determining the probability that the u unlabeled samples belong to the labeled samples of the negative samples respectively according to the probability of the positive samples and the probability of the negative samples respectively corresponding to the characteristics of the u unlabeled samples;
determining an unlabeled exemplar with the labeled exemplar probability above a threshold as a reliable negative exemplar.
Optionally, the CPU 1422 may further execute the method steps of any specific implementation of the reliable negative example determination method in the embodiment of the present application.
For the above-described reliable negative example determination method, the present application further provides a terminal device for reliable negative example determination, so that the above-described reliable negative example determination method is practically implemented and applied.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application. For convenience of explanation, only the parts related to the embodiments of the present application are shown; for specific technical details that are not disclosed, please refer to the method embodiments of the present application.
Fig. 5 is a block diagram illustrating a partial structure related to a terminal provided in an embodiment of the present application. Referring to fig. 5, the terminal includes: a Radio Frequency (RF) circuit 1510, a memory 1520, an input unit 1530, a display unit 1540, a sensor 1550, an audio circuit 1560, a wireless fidelity (WiFi) module 1570, a processor 1580, and a power supply 1590. Those skilled in the art will appreciate that the terminal structure shown in fig. 5 is not intended to be limiting of the terminal, and the terminal may include more or fewer components than those shown, or combine some components, or use a different arrangement of components.
The following describes each component of the terminal in detail with reference to fig. 5:
the memory 1520 may be used to store software programs and modules, and the processor 1580 implements various functional applications of the terminal and data processing by operating the software programs and modules stored in the memory 1520. The memory 1520 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 1520 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The processor 1580 is the control center of the terminal; it connects the various parts of the entire terminal using various interfaces and lines, and performs the various functions of the terminal and processes data by running or executing the software programs and/or modules stored in the memory 1520 and calling the data stored in the memory 1520, thereby monitoring the terminal as a whole. Optionally, the processor 1580 may include one or more processing units. Preferably, the processor 1580 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communications. It is to be appreciated that the modem processor may also not be integrated into the processor 1580.
In the embodiment of the present application, the memory 1520 included in the terminal can store the program code and transmit the program code to the processor.
The processor 1580 included in the terminal may execute the reliable negative sample determination method provided by the above-described embodiments according to the instructions in the program code.
Embodiments of the present application further provide a computer-readable storage medium for storing a computer program, where the computer program is configured to execute the reliable negative example determining method provided in the foregoing embodiments.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium may be at least one of the following media: various media that can store program codes, such as read-only memory (ROM), RAM, magnetic disk, or optical disk.
It should be noted that, in the present specification, all the embodiments are described in a progressive manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus and system embodiments, since they are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described embodiments of the apparatus and system are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A method of reliable negative exemplar determination, wherein an exemplar is described by features of n dimensions, the exemplar comprising p positive exemplars that make up a set of positive exemplars and u unlabeled exemplars that make up a set of unlabeled exemplars, the method being performed by a data processing apparatus, the method comprising:
determining positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples according to the features included in the p positive samples and the u unlabeled samples;
determining the probability that the u unlabeled samples belong to the labeled samples of the negative samples respectively according to the probability of the positive samples and the probability of the negative samples respectively corresponding to the characteristics of the u unlabeled samples;
determining an unlabeled exemplar with the labeled exemplar probability above a threshold as a reliable negative exemplar.
2. The method according to claim 1, wherein the ith feature is an ith feature in the n-dimensional features, and for the ith feature, the determining, according to the features included in the p positive samples and the u unlabeled samples, positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples includes:
determining a first number from the p positive samples and a second number from the u unlabeled samples according to the ith feature; wherein the first number is the number of positive samples including the ith feature in the p positive samples, and the second number is the number of unlabeled samples including the ith feature in the u unlabeled samples;
and determining a first positive sample probability and a first negative sample probability respectively corresponding to the unlabeled sample having the ith feature and a second positive sample probability and a second negative sample probability respectively corresponding to the unlabeled sample not having the ith feature according to the first quantity and the second quantity.
3. The method according to claim 2, wherein if the eigenvalue of the ith characteristic is in a tth partition, the tth partition is one of a plurality of partitions determined according to the eigenvalue variable range of the ith characteristic; the first number determined according to the ith feature is the number of positive samples including the ith feature in the t-th partition in the p positive samples, and the second number determined according to the ith feature is the number of unlabeled samples including the ith feature in the t-th partition in the u unlabeled samples.
4. The method according to claim 2, wherein a target exemplar is one of the u unlabeled exemplars, and the determining, for the target exemplar, the probability that the u unlabeled exemplars belong to negative exemplars according to the probability of positive exemplars and the probability of negative exemplars respectively corresponding to the features included in the u unlabeled exemplars includes:
determining a first feature set and a second feature set according to the target sample; wherein the first set of features includes features of the n-dimensional features that the target sample has, and the second set of features includes features of the n-dimensional features that the target sample does not have;
and determining the label sample probability of the target sample belonging to the negative sample according to the first positive sample probability and the first negative sample probability corresponding to the features in the first feature set and the second positive sample probability and the second negative sample probability corresponding to the features in the second feature set.
5. The method of claim 1, further comprising:
determining independent feature parameters between the features of the n dimensions;
according to the independent feature parameters, m dimensional features are determined from the n dimensional features to be used as independent features;
determining, according to features included in the p positive samples and the u unlabeled samples, positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples, including:
and determining the probability of the positive sample and the probability of the negative sample corresponding to the independent features included in the u unlabeled samples according to the independent features included in the p positive samples and the u unlabeled samples.
6. The method of claim 5, wherein determining independent feature parameters between the n-dimensional features comprises:
determining mutual information between every two features in the n-dimension features;
and determining a mutual information ratio as an independent characteristic parameter according to the mutual information and the entropy of the corresponding characteristic.
7. A reliable negative exemplar determination apparatus, characterized in that an exemplar is described by features of n dimensions, the exemplar including p positive exemplars constituting a set of positive exemplars and u unlabeled exemplars constituting a set of unlabeled exemplars, the apparatus comprising a first determination unit, a second determination unit and a third determination unit:
the first determining unit is configured to determine, according to features included in the p positive samples and the u unlabeled samples, positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples;
the second determining unit is configured to determine, according to the positive sample probabilities and the negative sample probabilities respectively corresponding to the features included in the u unlabeled samples, the label sample probabilities that the u unlabeled samples respectively belong to the negative sample;
the third determining unit is configured to determine the unlabeled exemplar with the labeled exemplar probability higher than the threshold value as a reliable negative exemplar.
8. The apparatus of claim 7, wherein the ith feature is an ith one of the n-dimensional features, and for the ith feature, the first determining unit is further configured to:
determining a first number from the p positive samples and a second number from the u unlabeled samples according to the ith feature; wherein the first number is the number of positive samples including the ith feature in the p positive samples, and the second number is the number of unlabeled samples including the ith feature in the u unlabeled samples;
and determining a first positive sample probability and a first negative sample probability respectively corresponding to the unlabeled sample having the ith feature and a second positive sample probability and a second negative sample probability respectively corresponding to the unlabeled sample not having the ith feature according to the first quantity and the second quantity.
9. A reliable negative example determination device, the device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-6 according to instructions in the program code.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program for performing the method of any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010657192.9A CN111667028B (en) | 2020-07-09 | 2020-07-09 | Reliable negative sample determination method and related device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010657192.9A CN111667028B (en) | 2020-07-09 | 2020-07-09 | Reliable negative sample determination method and related device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111667028A true CN111667028A (en) | 2020-09-15 |
CN111667028B CN111667028B (en) | 2024-03-12 |
Family
ID=72391674
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010657192.9A Active CN111667028B (en) | 2020-07-09 | 2020-07-09 | Reliable negative sample determination method and related device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111667028B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112784883A (en) * | 2021-01-07 | 2021-05-11 | 厦门大学 | Cold water coral distribution prediction method and system based on sample selection expansion |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017143919A1 (en) * | 2016-02-26 | 2017-08-31 | 阿里巴巴集团控股有限公司 | Method and apparatus for establishing data identification model |
WO2018166457A1 (en) * | 2017-03-15 | 2018-09-20 | 阿里巴巴集团控股有限公司 | Neural network model training method and device, transaction behavior risk identification method and device |
CN109902708A (en) * | 2018-12-29 | 2019-06-18 | 华为技术有限公司 | A kind of recommended models training method and relevant apparatus |
CN109934249A (en) * | 2018-12-14 | 2019-06-25 | 网易(杭州)网络有限公司 | Data processing method, device, medium and calculating equipment |
CN111310814A (en) * | 2020-02-07 | 2020-06-19 | 支付宝(杭州)信息技术有限公司 | Method and device for training business prediction model by utilizing unbalanced positive and negative samples |
-
2020
- 2020-07-09 CN CN202010657192.9A patent/CN111667028B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017143919A1 (en) * | 2016-02-26 | 2017-08-31 | 阿里巴巴集团控股有限公司 | Method and apparatus for establishing data identification model |
WO2018166457A1 (en) * | 2017-03-15 | 2018-09-20 | 阿里巴巴集团控股有限公司 | Neural network model training method and device, transaction behavior risk identification method and device |
CN109934249A (en) * | 2018-12-14 | 2019-06-25 | 网易(杭州)网络有限公司 | Data processing method, device, medium and calculating equipment |
CN109902708A (en) * | 2018-12-29 | 2019-06-18 | 华为技术有限公司 | A kind of recommended models training method and relevant apparatus |
CN111310814A (en) * | 2020-02-07 | 2020-06-19 | 支付宝(杭州)信息技术有限公司 | Method and device for training business prediction model by utilizing unbalanced positive and negative samples |
Non-Patent Citations (1)
Title |
---|
裔阳;周绍光;赵鹏飞;胡屹群;: "基于正样本和未标记样本的遥感图像分类方法", 计算机工程与应用, no. 04, 28 February 2017 (2017-02-28), pages 161 - 165 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112784883A (en) * | 2021-01-07 | 2021-05-11 | 厦门大学 | Cold water coral distribution prediction method and system based on sample selection expansion |
Also Published As
Publication number | Publication date |
---|---|
CN111667028B (en) | 2024-03-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111382868B (en) | Neural network structure searching method and device | |
Xie et al. | A Survey on Machine Learning‐Based Mobile Big Data Analysis: Challenges and Applications | |
CN108108455B (en) | Destination pushing method and device, storage medium and electronic equipment | |
CN108629358B (en) | Object class prediction method and device | |
CN107894827B (en) | Application cleaning method and device, storage medium and electronic equipment | |
WO2019062418A1 (en) | Application cleaning method and apparatus, storage medium and electronic device | |
CN110598869B (en) | Classification method and device based on sequence model and electronic equipment | |
CN107678531B (en) | Application cleaning method and device, storage medium and electronic equipment | |
CN108197225B (en) | Image classification method and device, storage medium and electronic equipment | |
CN113837669B (en) | Evaluation index construction method of label system and related device | |
WO2019120007A1 (en) | Method and apparatus for predicting user gender, and electronic device | |
CN112949662B (en) | Image processing method and device, computer equipment and storage medium | |
CN111353303A (en) | Word vector construction method and device, electronic equipment and storage medium | |
CN111882048A (en) | Neural network structure searching method and related equipment | |
CN115879508A (en) | Data processing method and related device | |
Gao et al. | A deep learning framework with spatial-temporal attention mechanism for cellular traffic prediction | |
CN112862021B (en) | Content labeling method and related device | |
CN112925912B (en) | Text processing method, synonymous text recall method and apparatus | |
Ghebriout et al. | Harmonic-NAS: Hardware-Aware Multimodal Neural Architecture Search on Resource-constrained Devices | |
CN111667028A (en) | Reliable negative sample determination method and related device | |
CN114547308B (en) | Text processing method, device, electronic equipment and storage medium | |
CN116957006A (en) | Training method, device, equipment, medium and program product of prediction model | |
CN115512693B (en) | Audio recognition method, acoustic model training method, device and storage medium | |
CN115222047A (en) | Model training method, device, equipment and storage medium | |
CN115221316A (en) | Knowledge base processing method, model training method, computer device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||