CN111667028A - Reliable negative sample determination method and related device - Google Patents

Reliable negative sample determination method and related device

Info

Publication number
CN111667028A
CN111667028A (application CN202010657192.9A)
Authority
CN
China
Prior art keywords
samples
unlabeled
sample
positive
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010657192.9A
Other languages
Chinese (zh)
Other versions
CN111667028B (en)
Inventor
叶佳木
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010657192.9A priority Critical patent/CN111667028B/en
Publication of CN111667028A publication Critical patent/CN111667028A/en
Application granted granted Critical
Publication of CN111667028B publication Critical patent/CN111667028B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a reliable negative sample determination method and a related device. To save the time needed to determine reliable negative samples, the p positive samples and u unlabeled samples required for modeling are each described by features of n dimensions, and reliable negative samples are screened from the unlabeled samples according to the commonalities and differences that the positive samples and unlabeled samples exhibit in the same feature dimensions. According to the features the samples include, the positive sample probabilities and negative sample probabilities corresponding to the features of the u unlabeled samples are determined; from these, the label sample probability that each of the u unlabeled samples belongs to the negative class is determined, and reliable negative samples are screened out according to the label sample probabilities. Because the scheme requires no model training and can be completed entirely offline, the time for determining reliable negative samples is greatly shortened, the method adapts well to application scenarios with high timeliness requirements, and the efficiency of model training for such scenarios is greatly improved.

Description

Reliable negative sample determination method and related device
Technical Field
The present application relates to the field of data processing, and in particular, to a reliable negative sample determination method and related apparatus.
Background
Reliable Negative (RN) samples are samples in the unlabeled set that have a high probability of being negative samples. RN samples are commonly used in semi-supervised learning, such as positive-unlabeled learning (PU learning).
Semi-supervised learning based on RN samples has wide application scenarios, such as look-alike audience expansion (lookalike); for different application scenarios, the corresponding functions can be realized through a network model obtained by semi-supervised learning.
In the model training process, an RN set relative to the positive sample set (P set) needs to be determined from a massive number of unlabeled samples. In the related art, the RN set formed by RN samples is generally found through multiple rounds of training, which significantly prolongs model training; for some application scenarios with high timeliness requirements, such training time is intolerable.
Disclosure of Invention
To solve the above technical problem, the application provides a reliable negative sample determination method and a related device, which shorten the time for determining reliable negative samples, adapt well to application scenarios with high timeliness requirements, and greatly improve the efficiency of model training for such scenarios.
The embodiment of the application discloses the following technical scheme:
In one aspect, an embodiment of the present application provides a reliable negative sample determination method, where samples are described by features of n dimensions, the samples include p positive samples forming a positive sample set and u unlabeled samples forming an unlabeled sample set, and the method is performed by a data processing device. The method includes:
determining positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples according to the features included in the p positive samples and the u unlabeled samples;
determining, according to the positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples, the label sample probability that each of the u unlabeled samples belongs to the negative class;
determining an unlabeled sample whose label sample probability is higher than a threshold as a reliable negative sample.
In another aspect, an embodiment of the present application provides a reliable negative sample determining apparatus, where samples are described by features of n dimensions and include p positive samples forming a positive sample set and u unlabeled samples forming an unlabeled sample set, and the apparatus includes a first determining unit, a second determining unit, and a third determining unit:
the first determining unit is configured to determine, according to features included in the p positive samples and the u unlabeled samples, positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples;
the second determining unit is configured to determine, according to the positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples, the label sample probability that each of the u unlabeled samples belongs to the negative class;
the third determining unit is configured to determine an unlabeled sample whose label sample probability is higher than the threshold as a reliable negative sample.
In another aspect, an embodiment of the present application further provides a reliable negative example determining apparatus, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of the above aspect according to instructions in the program code.
In another aspect, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program, where the computer program is used to execute the method in the foregoing aspect.
According to the technical scheme, to save the time for determining reliable negative samples, the p positive samples and u unlabeled samples required for modeling are each described by features of n dimensions, and reliable negative samples are screened from the unlabeled samples using the commonalities and differences that the positive samples and unlabeled samples exhibit in the same feature dimensions. Specifically, the positive sample probabilities and negative sample probabilities corresponding to the features included in the u unlabeled samples are determined according to the features the samples include; from these, the label sample probability that each of the u unlabeled samples belongs to the negative class is determined; and reliable negative samples are then screened from the u unlabeled samples according to the label sample probabilities. Because this scheme requires no model training and can be completed entirely offline, the time for determining reliable negative samples is greatly shortened, the method adapts well to application scenarios with high timeliness requirements, and the efficiency of model training for such scenarios is greatly improved.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a reliable negative example determination scenario provided by an embodiment of the present application;
FIG. 2 is a flowchart of a reliable negative sample determination method according to an embodiment of the present application;
FIG. 3 is a block diagram of a reliable negative sample determining apparatus according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a server according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
Reliable negative samples are an important training basis for completing model training in semi-supervised learning. In the related art, however, reliable negative samples can be determined from unlabeled samples only through an additional model training process, which is considerably time-consuming, so the overall model training takes a long time. For some application scenarios with high timeliness requirements, such training time can hardly be afforded.
Therefore, the embodiment of the application provides a reliable negative sample determination method to shorten the time for determining reliable negative samples. The method provided by the embodiments of the present application may relate to Artificial Intelligence (AI). AI is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive subject that relates to a wide range of fields, including both hardware-level technologies and software-level technologies.
In the embodiment of the present application, the artificial intelligence software technologies mainly involved include natural language processing technology and the deep learning direction of machine learning.
Next, the execution body of the embodiment of the present application will be described. The reliable negative sample determination method provided by the application can be executed by a data processing device. The data processing device may be a terminal device, such as a smart phone, a computer, a Personal Digital Assistant (PDA), a tablet computer, a point-of-sale (POS) terminal, or a vehicle-mounted computer. The data processing device may also be a server, such as an independent server, a server in a cluster, or a cloud server.
In the embodiment of the present application, the data processing device may have the capability of implementing natural language processing, an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, and the like.
For example, text preprocessing and semantic understanding in Natural Language Processing (NLP) may be involved, including word segmentation, word tagging, sentence classification, and the like.
In the embodiment of the present application, the data processing device may, by implementing NLP technology, perform preprocessing such as semantic understanding and semantic conversion on the features included in the positive samples and the unlabeled samples.
The data processing device may also be provided with Machine Learning (ML) capabilities. ML is a multi-field interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how a computer simulates or realizes human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks.
For example, Deep Learning in Machine Learning (ML) may be involved, including various types of artificial neural networks.
In the embodiment of the application, the reliable negative sample determination method mainly relates to applications in various artificial neural networks; for example, in PU Learning, the determined reliable negative samples are applied to the training of a network model.
In the scenario shown in FIG. 1, the processing device is a server 100, which obtains p positive samples 100 and u unlabeled samples 102. It should be emphasized that "unlabeled" does not mean a sample has no labels at all; rather, relative to the positive samples, an unlabeled sample lacks the label that characterizes a positive sample. For example, if the label of a positive sample identifies it as user information having a specific feature, an unlabeled sample has no label indicating whether it is user information having that specific feature.
In the application scenario of PU Learning, the positive sample set may be a seed pack of users provided by the service party, the purpose being to find, according to the seed pack, similar users in the data pool related to the application scenario.
In the scenario shown in FIG. 1, a sample is described by features of n = 7 dimensions, which describe attributes the sample itself has from different dimensions. The more similar the features an unlabeled sample includes are to those included in the positive samples, the higher the likelihood that the unlabeled sample belongs to the positive class; the less similar they are, the higher the likelihood that it belongs to the negative class. Thus, the server 100 can screen reliable negative samples from the unlabeled samples based on the commonalities and differences exhibited by the positive samples and unlabeled samples in the same feature dimensions.
Specifically, the server 100 determines, according to the features included in the p positive samples 100 and the u unlabeled samples 102, the positive sample probability and negative sample probability corresponding to each feature included in the u unlabeled samples. That is, each feature dimension of an unlabeled sample has a positive sample probability and a negative sample probability: the positive sample probability identifies the probability that an unlabeled sample is determined to be a positive sample when it has the feature, and the negative sample probability identifies the probability that it is determined to be a negative sample when it has the feature.
Based on the positive sample probabilities and negative sample probabilities corresponding to the features, the server 100 may determine the label sample probability that each of the u unlabeled samples 102 belongs to the negative class, and then screen out the unlabeled samples with higher label sample probabilities as reliable negative samples.
This scheme requires no model training and can be completed directly offline: bypassing the model-training approach, it directly calculates the probability that each sample in the U set is a negative sample, which greatly speeds up the calculation; moreover, the least efficient part of the calculation can be extracted and performed offline. The time for determining reliable negative samples is thus greatly shortened, the method adapts well to application scenarios with high timeliness requirements, and the efficiency of model training for such scenarios is greatly improved.
FIG. 2 is a flowchart of a reliable negative sample determination method according to an embodiment of the present application. In the method shown in FIG. 2, samples are described by features of n dimensions and include p positive samples constituting the positive sample set and u unlabeled samples constituting the unlabeled sample set.
S201: determining, according to the features included in the p positive samples and the u unlabeled samples, the positive sample probability and the negative sample probability respectively corresponding to the features included in the u unlabeled samples.
In the present application, it is found that if the features describing a sample are all null, the probabilities that one unlabeled sample is a positive sample or a negative sample are, respectively:

P(+ | ∅) = P(+),   P(- | ∅) = P(-)

where P(+) is the probability that a sample in the union of the positive sample set P and the unlabeled sample set U belongs to a positive sample, i.e.

P(+) = |P| / (|P| + |U|)

The |·| operation represents the number of samples in a set. In the same way, P(-) is the probability that a sample in the union of P and U belongs to a negative sample, i.e.

P(-) = |U| / (|P| + |U|)

When valid features are added to a sample, if there are already n - 1 features and the nth feature x_n is added, the probability that the sample is a positive or a negative sample becomes:

P(± | x_1, ..., x_n) = P(± | x_1, ..., x_{n-1}) · P(x_n | ±) / P(x_n)

wherein

P(x_n) = P(x_n | +) · P(+ | x_1, ..., x_{n-1}) + P(x_n | -) · P(- | x_1, ..., x_{n-1})

After the ith feature is added, the degree to which the sample leans toward the positive class or the negative class corresponds to the positive sample probability and the negative sample probability determined in this step; x_i is the ith of the n-dimensional features.
S202: determining, according to the positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples, the label sample probability that each of the u unlabeled samples belongs to the negative class.
After the positive sample probability and the negative sample probability corresponding to each feature are determined, the label sample probability that each of the u unlabeled samples belongs to the negative class can be determined based on the features included in that sample. Optionally, the label sample probability that an unlabeled sample belongs to the negative class may be determined by Formula 1:

P(- | x_1, ..., x_n) = P(-) · ∏_{i=1..n} P(x_i | -) / [ P(+) · ∏_{i=1..n} P(x_i | +) + P(-) · ∏_{i=1..n} P(x_i | -) ]    (Formula 1)
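Formula 1 is a naive-Bayes-style posterior and can be evaluated directly once the per-feature probabilities are known. A hedged sketch (the function name and example probabilities are illustrative, and the log-space products are an implementation choice to avoid underflow with many features, not something the patent specifies):

```python
import math

def label_sample_probability(prior_pos, prior_neg, feat_probs_pos, feat_probs_neg):
    """Posterior probability that an unlabeled sample is negative, given
    per-feature probabilities under the positive class (feat_probs_pos)
    and under the negative class (feat_probs_neg)."""
    # Take the products in log space, then normalize.
    log_pos = math.log(prior_pos) + sum(math.log(q) for q in feat_probs_pos)
    log_neg = math.log(prior_neg) + sum(math.log(q) for q in feat_probs_neg)
    m = max(log_pos, log_neg)
    pos, neg = math.exp(log_pos - m), math.exp(log_neg - m)
    return neg / (pos + neg)
```

When the per-feature probabilities are identical under both classes, the posterior stays at the equal priors (0.5); a feature far more likely under the negative class pushes the result toward 1.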
S203: determining an unlabeled sample whose label sample probability is higher than a threshold as a reliable negative sample.
The higher the determined label sample probability, the higher the probability that the corresponding unlabeled sample is truly a negative sample; in the present application, a threshold is used as the basis for determining reliable negative samples.
By traversing the unlabeled sample set formed by the u unlabeled samples, the reliable negative samples forming the RN set can be determined. The scheme requires no model training and can be completed directly offline, so the time for determining reliable negative samples is greatly shortened.
For application scenarios with high timeliness requirements, a reliable negative sample set can be quickly determined from the unlabeled sample set according to the positive sample set corresponding to the scenario. This greatly accelerates the training of the network model for that scenario, so that a usable network model can be obtained as soon as possible to perform the related services.
Next, an optional way of determining the positive sample probability and negative sample probability of a feature is described. Whether an unlabeled sample has a given feature or not in fact affects the final judgment of whether that unlabeled sample is a reliable negative sample. Therefore, in this manner, not only the positive and negative sample probabilities corresponding to an unlabeled sample having the feature are determined, but also those corresponding to an unlabeled sample not having the feature.
For convenience of description, the ith of the n-dimensional features is taken as an example; for any of the n-dimensional features, the subsequent processing of the ith feature may be followed to determine its positive and negative sample probabilities.
For the ith feature, one optional manner of S201 is:
s2011: determining a first number from the p positive samples and a second number from the u unlabeled samples according to the ith feature.
Wherein the first number is the number of positive samples including the ith feature in the p positive samples, and the second number is the number of unlabeled samples including the ith feature in the u unlabeled samples.
S2012: determining, according to the first number and the second number, a first positive sample probability and a first negative sample probability respectively corresponding to an unlabeled sample having the ith feature, and a second positive sample probability and a second negative sample probability respectively corresponding to an unlabeled sample not having the ith feature.
The first positive sample probability and the first negative sample probability may be expressed as:

P(x_i | +) = p_i / p,   P(x_i | -) = u_i / u

wherein P(x_i | +) is the first positive sample probability, P(x_i | -) is the first negative sample probability, p_i is the number of positive samples having the ith feature (i.e., the first number), and u_i is the number of unlabeled samples having the ith feature (i.e., the second number).

The second positive sample probability and the second negative sample probability may be expressed as:

P(¬x_i | +) = (p - p_i) / p,   P(¬x_i | -) = (u - u_i) / u

wherein P(¬x_i | +) is the second positive sample probability, P(¬x_i | -) is the second negative sample probability, and ¬x_i denotes the absence of the ith feature.
When the label sample probabilities of the u unlabeled samples are calculated in S202, the positive sample probability and the negative sample probability used for the ith feature of one unlabeled sample are assigned according to whether that sample has the ith feature: the positive sample probability takes the value of the first positive sample probability or of the second positive sample probability, and the negative sample probability correspondingly takes the value of the first negative sample probability or of the second negative sample probability.
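The four per-feature quantities above can be computed in a single pass over both sample sets, with the unlabeled set standing in for the negative class as in the formulas. A sketch under the assumption that each sample is represented as a set of feature identifiers (the representation and names are illustrative, not the patent's implementation):

```python
def feature_probabilities(pos_samples, unl_samples, feature):
    """Return (first_pos, first_neg, second_pos, second_neg) for one feature.

    first_*  : probabilities used when an unlabeled sample HAS the feature
    second_* : probabilities used when it does NOT have the feature
    Each sample is a set of feature identifiers.
    """
    p, u = len(pos_samples), len(unl_samples)
    p_i = sum(1 for s in pos_samples if feature in s)   # the first number
    u_i = sum(1 for s in unl_samples if feature in s)   # the second number
    return p_i / p, u_i / u, (p - p_i) / p, (u - u_i) / u
```

For each of the n features this yields the two pairs of probabilities; which pair is used for a given unlabeled sample is decided at scoring time by feature presence.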
To further improve the accuracy of determining reliable negative samples based on the features, the actual numerical values of the features in different samples can also be considered in the calculation, with more refined processing based on where the actual value falls among a plurality of partitions determined by the feature's range of values.
In this embodiment, the description continues with the ith feature as an example; a plurality of partitions can be specified based on the range of values of the ith feature.
If the feature value of the ith feature of the unlabeled sample being processed falls in the tth of the partitions, then the first number determined according to the ith feature is the number of positive samples, among the p positive samples, whose ith feature falls in the tth partition; and the second number is the number of unlabeled samples, among the u unlabeled samples, whose ith feature falls in the tth partition.
In this embodiment, to facilitate partitioning, the information corresponding to the ith feature may be quantized into a number; for example, if the ith feature corresponds to a job, different jobs may be quantized into corresponding numbers for partitioning.
In the embodiment of the present application, the partitioning manner is not limited. For example, the range of feature values may be evenly divided into a plurality of partitions: if the range is 0 to 100, 10 partitions may be provided in units of 10. Besides equal division, feature values may also be randomly assigned to buckets as needed, or even transformed by a function before bucketing; it suffices that each feature value falls one-to-one into one of the T buckets. Such bucketing can adapt to application scenarios where feature values are non-uniformly distributed, so that the number of samples of a feature falling into each bucket is balanced.
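The equal-division bucketing described above can be sketched as follows (the function name and the clamping of the top edge into the last bucket are illustrative choices, not specified by the patent):

```python
def equal_width_bucket(value: float, lo: float, hi: float, t_buckets: int) -> int:
    """Map a feature value in [lo, hi] to one of t_buckets equal-width partitions."""
    if value >= hi:                 # clamp the top edge into the last bucket
        return t_buckets - 1
    width = (hi - lo) / t_buckets
    return int((value - lo) / width)

# Example from the text: range 0-100 split into 10 partitions in units of 10.
equal_width_bucket(37.0, 0.0, 100.0, 10)   # -> partition 3
```

A quantile-based or hash-based assignment could replace this function unchanged elsewhere, which is how the scheme accommodates non-uniform feature distributions.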
Correspondingly, when the positive and negative sample probabilities of the ith feature are determined on this basis, the partition t into which the feature value of the ith feature falls is used: the positive sample probability takes the value p_i^t / p and the negative sample probability takes the value u_i^t / u, wherein p_i^t and u_i^t are the numbers of positive samples and of unlabeled samples whose ith feature falls in partition t, and the value range of t is the number of partitions.
Thus, the same feature may have a plurality of positive sample probabilities and a plurality of negative sample probabilities, respectively corresponding to different partitions, which improves the accuracy with which the feature influences the judgment of a sample as positive or negative.
As mentioned above, although the features an unlabeled sample has may not cover all n dimensions, whether the unlabeled sample has the ith feature still affects whether it is determined to be a reliable negative sample. Therefore, after the first positive sample probability, second positive sample probability, first negative sample probability, and second negative sample probability corresponding to each feature are determined, they can all be applied in determining the label sample probability of an unlabeled sample.
For example, let the target sample be one of the u unlabeled samples. For the target sample, one possible implementation of S202 is:
S2021: determining a first feature set and a second feature set from the target sample.
Wherein the first set of features includes features of the n-dimensional features that the target sample has, and the second set of features includes features of the n-dimensional features that the target sample does not have.
S2022: determining the label sample probability that the target sample belongs to the negative class according to the first positive sample probabilities and first negative sample probabilities corresponding to the features in the first feature set, and the second positive sample probabilities and second negative sample probabilities corresponding to the features in the second feature set.
Taking Equation 1 as an example: when the term of Equation 1 corresponding to the positive class is calculated for the ith feature, if the target sample has the ith feature, the first positive sample probability is used as its value; if the target sample does not have the ith feature, the second positive sample probability is used. Likewise, when the term corresponding to the negative class is calculated for the ith feature, if the target sample has the ith feature, the first negative sample probability is used; if the target sample does not have the ith feature, the second negative sample probability is used.
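As an illustration of S2022, the sketch below combines the per-feature probabilities into a label sample probability using a naive-Bayes-style score. The exact form of Equation 1 is not reproduced in this text, so the normalization used here (the negative-class likelihood over the sum of both class likelihoods, computed in log space) is an assumption, and all names are hypothetical:

```python
import math

def label_sample_probability(sample_features, pos_prob, neg_prob, all_features):
    """Score how likely an unlabeled sample is a negative sample.
    pos_prob[i] / neg_prob[i]: probability that a sample HAS feature i
    under the positive / negative (unlabeled-proxy) class.
    For features the sample has, the first probabilities are used;
    for features it lacks, the second (complement) probabilities."""
    log_pos = log_neg = 0.0
    eps = 1e-9  # guards log(0) when a feature never occurs in one class
    for i in all_features:
        if i in sample_features:          # first positive / negative probability
            pp, nn = pos_prob[i], neg_prob[i]
        else:                             # second positive / negative probability
            pp, nn = 1.0 - pos_prob[i], 1.0 - neg_prob[i]
        log_pos += math.log(pp + eps)
        log_neg += math.log(nn + eps)
    # Normalize the two class scores into a probability of "negative".
    m = max(log_pos, log_neg)
    e_pos, e_neg = math.exp(log_pos - m), math.exp(log_neg - m)
    return e_neg / (e_pos + e_neg)
```

A sample whose present and absent features both look more like the unlabeled set than the positive set receives a score near 1, and would be retained as a reliable negative if the score exceeds the threshold.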
In order to further improve the accuracy of the scheme provided by the embodiments of the application, the features relied upon should be independent; that is, whether the current feature appears and what its value is should bear no relation to whether any other feature appears and what its value is. If these considerations are fully taken into account during feature division, the determined n-dimensional features may all be independent, but in some cases the result of feature division is not ideal, and independent features need to be determined from the n-dimensional features before the positive and negative sample probabilities are calculated. The process of determining the independent features can be completed offline, so the calculation efficiency is high.
The manner in which the independent features are determined can be referred to the following flow:
S301: determining independent feature parameters between the n-dimensional features.
S302: and according to the independent feature parameters, determining m-dimensional features from the n-dimensional features as independent features.
The independent feature parameters may be determined in various ways, such as mutual information, the chi-square test, information gain, the Pearson correlation coefficient, or even by modeling the correlations between features. The independent feature parameter is used to represent the degree of independence between two features.
If the independent feature parameters are determined by mutual information calculation, an optional implementation of S301 is:
S3011: determining the mutual information between every two of the n-dimensional features.
The mutual information between two features can be expressed by the following formula:
I(X;Y) = Σ_x Σ_y p(x,y) log( p(x,y) / ( p(x) p(y) ) )
where X and Y are the two features to be calculated, and x and y are their respective feature values (the feature values may range, for example, between 0 and 10). p(x) is the probability that X takes the value x over the samples of the positive sample set plus the unlabeled sample set, p(y) is the probability that Y takes the value y over those samples, and p(x,y) is the probability that X and Y take the values x and y, respectively, over those samples.
Mutual Information is a useful measure in information theory. It can be seen as the amount of information one random variable contains about another, or the reduction in the uncertainty of one random variable due to knowledge of another. However, mutual information alone does not intuitively reflect whether two features are independent of each other, so the entropy corresponding to each feature needs to be introduced.
S3012: and determining a mutual information ratio as an independent characteristic parameter according to the mutual information and the entropy of the corresponding characteristic.
The entropy of each feature can be determined by:
H(X) = -Σ_x p(x) log p(x)
and the mutual information ratio between every two features is obtained by dividing their mutual information by the entropy of the corresponding feature.
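The mutual information, entropy, and mutual information ratio can be computed directly from the observed feature columns. In the sketch below, normalizing by the smaller of the two entropies is an assumption (the text says only that the mutual information is divided by the entropy of the corresponding feature), and all names are illustrative:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ),
    estimated from two aligned columns of observed feature values."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        pj = c / n
        # pj / ((px/n)*(py/n)) simplified to pj * n * n / (px * py)
        mi += pj * math.log(pj * n * n / (px[x] * py[y]))
    return mi

def entropy(xs):
    """H(X) = -sum_x p(x) log p(x)."""
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())

def mi_ratio(xs, ys):
    """Mutual information normalized by entropy: close to 1.0 when one
    feature determines the other (fully dependent), close to 0 when the
    two features are independent."""
    h = min(entropy(xs), entropy(ys))
    return mutual_information(xs, ys) / h if h > 0 else 0.0
```

Two identical columns give a ratio of 1, while columns with no statistical relationship give a ratio near 0, which is the property the independence filter below relies on.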
Through a traversal calculation, the mutual information ratio between any two of the n-dimensional features is determined; specifically, the following process may be adopted:
a) assume that there are N features, numbered 1 to N, and the candidate set F includes all N features;
b) for X, take feature i from 1 to N-1;
c) for Y, take feature j from i+1 to N; if the mutual information ratio between X and Y is not below a preset threshold (i.e., the two features are not sufficiently independent), delete Y from the set F, then return to step b) and recalculate.
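The traversal in steps a) to c) can be sketched as a greedy filter. The threshold value and all names below are illustrative, and the mutual-information-ratio function is passed in as a parameter rather than fixed:

```python
def select_independent_features(feature_columns, mi_ratio, threshold=0.5):
    """Greedy traversal over the candidate set F: keep feature X, and drop
    every later feature Y whose mutual information ratio with X exceeds
    the threshold (i.e., Y is largely redundant given X).
    `feature_columns` maps feature number -> list of observed values;
    `mi_ratio(xs, ys)` returns the mutual information ratio of two columns.
    The threshold value is illustrative, not taken from the patent."""
    kept = sorted(feature_columns)          # candidate set F, numbered 1..N
    removed = set()
    for i, x in enumerate(kept):
        if x in removed:
            continue
        for y in kept[i + 1:]:
            if y in removed:
                continue
            if mi_ratio(feature_columns[x], feature_columns[y]) > threshold:
                removed.add(y)              # delete Y from the set F
    return [f for f in kept if f not in removed]
```

The surviving features are the m-dimensional independent features used in place of the full n dimensions, which is what reduces the calculation in the probability step.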
Accordingly, an optional implementation manner of S201 includes:
and determining the probability of the positive sample and the probability of the negative sample corresponding to the independent features included in the u unlabeled samples according to the independent features included in the p positive samples and the u unlabeled samples.
That is to say, in the process of determining the reliable negative samples among the unlabeled samples based on features, if independent features have been determined from the n-dimensional features, the reliable negative samples are determined according to the independent features alone, without considering the dependent features. This reduces the amount of calculation and further improves the calculation efficiency.
Assuming that f independent features are determined from the n-dimensional features, the manner of determining the label sample probabilities of the u unlabeled samples according to the positive sample probabilities and negative sample probabilities respectively corresponding to the independent features can be adjusted from Equation 1 to Equation 2, in which the terms are taken over only the f independent features rather than all n features.
In the above manner, compared with PU Learning algorithms in the related art that require additional training, the improvement in effect is essentially the same, but the time consumed is only about 25%, so the efficiency is greatly improved.
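Putting the pieces together, the three steps of the method (per-feature probabilities, label sample probabilities, thresholding) can be sketched end to end. In the sketch, samples are modeled as sets of feature indices, the score uses a naive-Bayes-style normalization assumed for illustration, and the threshold value is arbitrary:

```python
import math

def find_reliable_negatives(positive, unlabeled, n_features, threshold=0.8):
    """End-to-end sketch: (1) per-feature positive/negative probabilities from
    the p positive and u unlabeled samples, (2) a label sample probability for
    each unlabeled sample, (3) thresholding. Samples are sets of feature
    indices; the unlabeled samples act as the negative-class proxy."""
    p, u = len(positive), len(unlabeled)
    pos_prob = {i: sum(i in s for s in positive) / p for i in range(n_features)}
    neg_prob = {i: sum(i in s for s in unlabeled) / u for i in range(n_features)}
    eps = 1e-9  # guards log(0) when a feature never occurs in one class

    def neg_score(sample):
        lp = ln = 0.0
        for i in range(n_features):
            pp = pos_prob[i] if i in sample else 1 - pos_prob[i]
            nn = neg_prob[i] if i in sample else 1 - neg_prob[i]
            lp += math.log(pp + eps)
            ln += math.log(nn + eps)
        m = max(lp, ln)  # normalize in log space for stability
        return math.exp(ln - m) / (math.exp(lp - m) + math.exp(ln - m))

    # Unlabeled samples scoring above the threshold are reliable negatives.
    return [k for k, s in enumerate(unlabeled) if neg_score(s) > threshold]
```

No model is trained anywhere in this pipeline; every quantity is a count over the two sample sets, which is what allows the whole computation to be completed offline.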
For the above-described reliable negative example determination method, an embodiment of the present application further provides a reliable negative example determination apparatus.
Referring to fig. 3, fig. 3 is a reliable negative example determining apparatus provided in an embodiment of the present application, where examples are described by features of n dimensions, and the examples include p positive examples constituting a positive example set and u unlabeled examples constituting an unlabeled example set, and the apparatus includes a first determining unit 301, a second determining unit 302, and a third determining unit 303:
the first determining unit 301 is configured to determine, according to features included in the p positive samples and the u unlabeled samples, positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples;
the second determining unit 302 is configured to determine, according to the positive sample probabilities and the negative sample probabilities respectively corresponding to the features included in the u unlabeled samples, the labeled sample probabilities that the u unlabeled samples respectively belong to the negative sample;
the third determining unit 303 is configured to determine an unlabeled exemplar with the probability of the labeled exemplar being higher than a threshold value as a reliable negative exemplar.
Optionally, the ith feature is an ith feature of the n dimensions, and for the ith feature, the first determining unit is further configured to:
determining a first number from the p positive samples and a second number from the u unlabeled samples according to the ith feature; wherein the first number is the number of positive samples including the ith feature in the p positive samples, and the second number is the number of unlabeled samples including the ith feature in the u unlabeled samples;
and determining a first positive sample probability and a first negative sample probability respectively corresponding to the unlabeled sample having the ith feature and a second positive sample probability and a second negative sample probability respectively corresponding to the unlabeled sample not having the ith feature according to the first quantity and the second quantity.
Optionally, if the characteristic value of the ith characteristic is in a tth partition, the tth partition is one of a plurality of partitions determined according to the variable range of the characteristic value of the ith characteristic; the first number determined according to the ith feature is the number of positive samples including the ith feature in the t-th partition in the p positive samples, and the second number determined according to the ith feature is the number of unlabeled samples including the ith feature in the t-th partition in the u unlabeled samples.
Optionally, the target sample is one of the u unlabeled samples, and for the target sample, the second determining unit is further configured to:
determining a first feature set and a second feature set according to the target sample; wherein the first set of features includes features of the n-dimensional features that the target sample has, and the second set of features includes features of the n-dimensional features that the target sample does not have;
and determining the label sample probability of the target sample belonging to the negative sample according to the first positive sample probability and the first negative sample probability corresponding to the features in the first feature set and the second positive sample probability and the second negative sample probability corresponding to the features in the second feature set.
Optionally, the apparatus further includes a fourth determining unit:
the fourth determining unit is configured to determine independent feature parameters between the features of the n dimensions;
according to the independent feature parameters, m dimensional features are determined from the n dimensional features to be used as independent features;
the first determining unit is further configured to determine, according to the independent features included in the p positive samples and the u unlabeled samples, positive sample probabilities and negative sample probabilities respectively corresponding to the independent features included in the u unlabeled samples.
Optionally, the fourth determining unit is further configured to:
determining mutual information between every two features in the n-dimension features;
and determining a mutual information ratio as an independent characteristic parameter according to the mutual information and the entropy of the corresponding characteristic.
Therefore, in order to save the time needed to determine reliable negative samples, the p positive samples and u unlabeled samples required for modeling are each described by n-dimensional features, and reliable negative samples are screened from the unlabeled samples by using the commonalities and differences between the positive samples and the unlabeled samples in the same feature dimensions. Specifically, the positive sample probabilities and negative sample probabilities corresponding to the features included in the u unlabeled samples are determined according to the features included in the samples; the label sample probabilities that the u unlabeled samples belong to negative samples are then determined according to the determined positive and negative sample probabilities; and reliable negative samples are screened from the u unlabeled samples according to the label sample probabilities. This scheme requires no model training and can be completed directly offline, so the time for determining reliable negative samples is greatly shortened, the scheme adapts well to application scenarios with high timeliness requirements, and the model training efficiency for such scenarios is greatly improved.
The embodiment of the application also provides a server and a terminal device for reliable negative sample determination, and the server and the terminal device can be the processing device. The server and the terminal device for reliable negative example determination provided by the embodiment of the present application will be described in terms of hardware implementation.
Referring to fig. 4, fig. 4 is a schematic diagram of a server 1400 provided by an embodiment of the present application. The server 1400 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1422 (e.g., one or more processors), a memory 1432, and one or more storage media 1430 (e.g., one or more mass storage devices) storing applications 1442 or data 1444. The memory 1432 and the storage media 1430 may provide transient or persistent storage. Each program stored on the storage media 1430 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processing unit 1422 may be configured to communicate with the storage media 1430 to execute, on the server 1400, the series of instruction operations in the storage media 1430.
The server 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 4.
Wherein the samples are described by features of n dimensions, the samples include p positive samples constituting a positive sample set and u unlabeled samples constituting an unlabeled sample set, and the CPU 1422 is configured to perform the following steps:
determining positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples according to the features included in the p positive samples and the u unlabeled samples;
determining the probability that the u unlabeled samples belong to the labeled samples of the negative samples respectively according to the probability of the positive samples and the probability of the negative samples respectively corresponding to the characteristics of the u unlabeled samples;
determining an unlabeled exemplar with the labeled exemplar probability above a threshold as a reliable negative exemplar.
Optionally, the CPU 1422 may further execute the method steps of any specific implementation of the reliable negative example determination method in the embodiment of the present application.
For the above-described reliable negative example determination method, the present application further provides a terminal device for reliable negative example determination, so that the above-described reliable negative example determination method is practically implemented and applied.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application. For convenience of explanation, only the parts related to the embodiments of the present application are shown, and details of the specific technology are not disclosed.
Fig. 5 is a block diagram illustrating a partial structure of the terminal provided in an embodiment of the present application. Referring to fig. 5, the terminal includes: a Radio Frequency (RF) circuit 1510, a memory 1520, an input unit 1530, a display unit 1540, a sensor 1550, an audio circuit 1560, a wireless fidelity (WiFi) module 1570, a processor 1580, and a power supply 1590. Those skilled in the art will appreciate that the terminal structure shown in fig. 5 is not limiting, and the terminal may include more or fewer components than those shown, combine some components, or arrange the components differently.
The following describes each component of the terminal in detail with reference to fig. 5:
the memory 1520 may be used to store software programs and modules, and the processor 1580 implements various functional applications of the terminal and data processing by operating the software programs and modules stored in the memory 1520. The memory 1520 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 1520 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The processor 1580 is the control center of the terminal; it connects the various parts of the entire terminal using various interfaces and lines, and performs the various functions of the terminal and processes data by running or executing the software programs and/or modules stored in the memory 1520 and calling the data stored in the memory 1520, thereby monitoring the terminal as a whole. Optionally, the processor 1580 may include one or more processing units; preferably, the processor 1580 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communication. It is to be appreciated that the modem processor may also not be integrated into the processor 1580.
In the embodiment of the present application, the terminal includes a memory 1520 that can store the program code and transmit the program code to the processor.
The processor 1580 included in the terminal may execute the reliable negative sample determination method provided by the above-described embodiments according to instructions in the program code.
Embodiments of the present application further provide a computer-readable storage medium for storing a computer program, where the computer program is configured to execute the reliable negative example determining method provided in the foregoing embodiments.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium may be at least one of the following media: various media that can store program codes, such as read-only memory (ROM), RAM, magnetic disk, or optical disk.
It should be noted that, in the present specification, all the embodiments are described in a progressive manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus and system embodiments, since they are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described embodiments of the apparatus and system are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of reliable negative exemplar determination, wherein an exemplar is described by features of n dimensions, the exemplar comprising p positive exemplars that make up a set of positive exemplars and u unlabeled exemplars that make up a set of unlabeled exemplars, the method being performed by a data processing apparatus, the method comprising:
determining positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples according to the features included in the p positive samples and the u unlabeled samples;
determining the probability that the u unlabeled samples belong to the labeled samples of the negative samples respectively according to the probability of the positive samples and the probability of the negative samples respectively corresponding to the characteristics of the u unlabeled samples;
determining an unlabeled exemplar with the labeled exemplar probability above a threshold as a reliable negative exemplar.
2. The method according to claim 1, wherein the ith feature is an ith feature in the n-dimensional features, and for the ith feature, the determining, according to the features included in the p positive samples and the u unlabeled samples, positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples includes:
determining a first number from the p positive samples and a second number from the u unlabeled samples according to the ith feature; wherein the first number is the number of positive samples including the ith feature in the p positive samples, and the second number is the number of unlabeled samples including the ith feature in the u unlabeled samples;
and determining a first positive sample probability and a first negative sample probability respectively corresponding to the unlabeled sample having the ith feature and a second positive sample probability and a second negative sample probability respectively corresponding to the unlabeled sample not having the ith feature according to the first quantity and the second quantity.
3. The method according to claim 2, wherein if the feature value of the ith feature is in a tth partition, the tth partition is one of a plurality of partitions determined according to the value range of the feature value of the ith feature; the first number determined according to the ith feature is the number of positive samples including the ith feature in the tth partition among the p positive samples, and the second number determined according to the ith feature is the number of unlabeled samples including the ith feature in the tth partition among the u unlabeled samples.
4. The method according to claim 2, wherein a target exemplar is one of the u unlabeled exemplars, and the determining, for the target exemplar, the probability that the u unlabeled exemplars belong to negative exemplars according to the probability of positive exemplars and the probability of negative exemplars respectively corresponding to the features included in the u unlabeled exemplars includes:
determining a first feature set and a second feature set according to the target sample; wherein the first set of features includes features of the n-dimensional features that the target sample has, and the second set of features includes features of the n-dimensional features that the target sample does not have;
and determining the label sample probability of the target sample belonging to the negative sample according to the first positive sample probability and the first negative sample probability corresponding to the features in the first feature set and the second positive sample probability and the second negative sample probability corresponding to the features in the second feature set.
5. The method of claim 1, further comprising:
determining independent feature parameters between the features of the n dimensions;
according to the independent feature parameters, m dimensional features are determined from the n dimensional features to be used as independent features;
determining, according to features included in the p positive samples and the u unlabeled samples, positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples, including:
and determining the probability of the positive sample and the probability of the negative sample corresponding to the independent features included in the u unlabeled samples according to the independent features included in the p positive samples and the u unlabeled samples.
6. The method of claim 5, wherein determining independent feature parameters between the n-dimensional features comprises:
determining mutual information between every two features in the n-dimension features;
and determining a mutual information ratio as an independent characteristic parameter according to the mutual information and the entropy of the corresponding characteristic.
7. A reliable negative exemplar determination apparatus, characterized in that an exemplar is described by features of n dimensions, the exemplar including p positive exemplars constituting a set of positive exemplars and u unlabeled exemplars constituting a set of unlabeled exemplars, the apparatus comprising a first determination unit, a second determination unit and a third determination unit:
the first determining unit is configured to determine, according to features included in the p positive samples and the u unlabeled samples, positive sample probabilities and negative sample probabilities respectively corresponding to the features included in the u unlabeled samples;
the second determining unit is configured to determine, according to the positive sample probabilities and the negative sample probabilities respectively corresponding to the features included in the u unlabeled samples, the label sample probabilities that the u unlabeled samples respectively belong to the negative sample;
the third determining unit is configured to determine the unlabeled exemplar with the labeled exemplar probability higher than the threshold value as a reliable negative exemplar.
8. The apparatus of claim 7, wherein the ith feature is an ith one of the n-dimensional features, and for the ith feature, the first determining unit is further configured to:
determining a first number from the p positive samples and a second number from the u unlabeled samples according to the ith feature; wherein the first number is the number of positive samples including the ith feature in the p positive samples, and the second number is the number of unlabeled samples including the ith feature in the u unlabeled samples;
and determining a first positive sample probability and a first negative sample probability respectively corresponding to the unlabeled sample having the ith feature and a second positive sample probability and a second negative sample probability respectively corresponding to the unlabeled sample not having the ith feature according to the first quantity and the second quantity.
9. A reliable negative example determination device, the device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-6 according to instructions in the program code.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program for performing the method of any of claims 1-6.
CN202010657192.9A 2020-07-09 2020-07-09 Reliable negative sample determination method and related device Active CN111667028B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010657192.9A CN111667028B (en) 2020-07-09 2020-07-09 Reliable negative sample determination method and related device

Publications (2)

Publication Number Publication Date
CN111667028A true CN111667028A (en) 2020-09-15
CN111667028B CN111667028B (en) 2024-03-12

Family

ID=72391674


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784883A (en) * 2021-01-07 2021-05-11 厦门大学 Cold water coral distribution prediction method and system based on sample selection expansion

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017143919A1 (en) * 2016-02-26 2017-08-31 阿里巴巴集团控股有限公司 Method and apparatus for establishing data identification model
WO2018166457A1 (en) * 2017-03-15 2018-09-20 阿里巴巴集团控股有限公司 Neural network model training method and device, transaction behavior risk identification method and device
CN109902708A (en) * 2018-12-29 2019-06-18 华为技术有限公司 A kind of recommended models training method and relevant apparatus
CN109934249A (en) * 2018-12-14 2019-06-25 网易(杭州)网络有限公司 Data processing method, device, medium and calculating equipment
CN111310814A (en) * 2020-02-07 2020-06-19 支付宝(杭州)信息技术有限公司 Method and device for training business prediction model by utilizing unbalanced positive and negative samples


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YI Yang; ZHOU Shaoguang; ZHAO Pengfei; HU Yiqun: "Remote sensing image classification method based on positive and unlabeled samples", Computer Engineering and Applications, no. 04, 28 February 2017 (2017-02-28), pages 161-165 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant