CN113780367A - Classification model training and data classification method and device, and electronic equipment - Google Patents

Classification model training and data classification method and device, and electronic equipment Download PDF

Info

Publication number
CN113780367A
CN113780367A
Authority
CN
China
Prior art keywords
labeled
sample
classifier
data
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110956976.6A
Other languages
Chinese (zh)
Inventor
王康 (Wang Kang)
高洋波 (Gao Yangbo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202110956976.6A priority Critical patent/CN113780367A/en
Publication of CN113780367A publication Critical patent/CN113780367A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide a classification model training method, a data classification method, corresponding apparatuses, an electronic device, and a storage medium. The classification model training method includes: acquiring labeled training data and training data to be labeled; for each sample to be labeled, determining the similarity between the sample to be labeled and each labeled sample, and determining the positive sample probability of the sample to be labeled based on those similarities; performing multiple labeling operations on the training data to be labeled to obtain multiple labeled training data, where each labeled training data includes the plurality of samples to be labeled and a label for each sample to be labeled; and, for each classifier in the classification model, training the classifier using the labeled training data corresponding to that classifier.

Description

Classification model training and data classification method and device, and electronic equipment
Technical Field
The application relates to the field of machine learning, in particular to a classification model training and data classification method and device, electronic equipment and a storage medium.
Background
Identifying whether a user engages in abnormal behaviors, such as gambling or cash-out, while using an APP is a key link in the APP operation and maintenance process. To identify such behavior, the behavior data of a user's behavior can be classified as normal or abnormal by a classification model trained in advance.
When the classification model is trained in advance, the classification model is trained by using samples and labels of the samples, the samples are behavior data of behaviors of the user, and the labels of the samples indicate that the behaviors of the user are normal behaviors or abnormal behaviors.
Related personnel confirm from experience whether a user's behavior is abnormal and set the label of the user's sample, i.e., of the behavior data of the user's behavior, thereby marking the sample as a positive sample or a negative sample. If the personnel confirm that the behavior is normal, the sample is marked as a positive sample, yielding a label indicating normal behavior; if they confirm that the behavior is abnormal, the sample is marked as a negative sample, yielding a label indicating abnormal behavior.
Due to the labor cost of labeling, related personnel can label only a small proportion of the massive number of samples, so labels are obtained for only that small proportion.
On the one hand, the resulting shortage of positive and negative samples may leave the trained classification model with low precision. On the other hand, a large number of samples cannot be used for training the classification model because they are unlabeled, so the existing samples are not fully utilized.
Disclosure of Invention
The application provides a classification model training method, a data classification method, a classification model training device, a data classification device, an electronic device and a storage medium.
According to a first aspect of embodiments of the present application, there is provided a classification model training method, including:
acquiring labeled training data and training data to be labeled, wherein the training data to be labeled comprises: a plurality of samples to be labeled, and the labeled training data comprises: a plurality of labeled samples and a label for each of the labeled samples;
for each sample to be labeled, determining the similarity between the sample to be labeled and each labeled sample, determining the positive sample probability of the sample to be labeled based on the similarity between the sample to be labeled and each labeled sample, and performing multiple labeling operations on the training data to be labeled to obtain a plurality of labeled training data, wherein the labeled training data comprises: the plurality of samples to be labeled and the label of each sample to be labeled, wherein the labeling operation comprises the following steps: acquiring a random number, and labeling the samples to be labeled based on the positive sample probability of the samples to be labeled and the random number for each sample to be labeled to obtain a label of the sample to be labeled;
and for each classifier in the classification model, training the classifier by using labeled training data corresponding to the classifier, wherein each classifier respectively corresponds to one labeled training data in the plurality of labeled training data, and the labeled training data corresponding to each classifier is different.
According to a second aspect of the embodiments of the present application, there is provided a data classification method, including:
acquiring data to be classified;
for each classifier in a classification model, inputting the data to be classified into the classifier to obtain a prediction result corresponding to the classifier, wherein the classification model is trained in advance according to the classification model training method;
and determining the classification result of the data to be classified based on the corresponding prediction result of each classifier.
According to a third aspect of the embodiments of the present application, there is provided a classification model training apparatus, including:
a training data obtaining unit configured to obtain labeled training data and training data to be labeled, where the training data to be labeled includes: a plurality of samples to be labeled, the labeled training data comprising: a plurality of labeled samples, a label for each of the labeled samples;
the labeling unit is configured to determine, for each to-be-labeled sample, a similarity between the to-be-labeled sample and each labeled sample, determine a positive sample probability of the to-be-labeled sample based on the similarity between the to-be-labeled sample and each labeled sample, and perform multiple labeling operations on the to-be-labeled training data to obtain multiple labeled training data, where the labeled training data includes: the plurality of samples to be labeled and the label of each sample to be labeled, wherein the labeling operation comprises the following steps: acquiring a random number, and labeling the samples to be labeled based on the positive sample probability of the samples to be labeled and the random number for each sample to be labeled to obtain a label of the sample to be labeled;
a training unit configured to train, for each classifier in the classification model, the classifier using labeled training data corresponding to the classifier, wherein each classifier corresponds to one labeled training data in the plurality of labeled training data, and the labeled training data corresponding to each classifier is different.
According to a fourth aspect of the embodiments of the present application, there is provided a data sorting apparatus including:
a to-be-classified data acquisition unit configured to acquire to-be-classified data;
the sub-classification unit is configured to input the data to be classified into the classifier for each classifier in a classification model to obtain a prediction result corresponding to the classifier, wherein the classification model is trained in advance according to the classification model training method;
and the classification unit is configured to determine a classification result of the data to be classified based on the prediction result corresponding to each classifier.
According to a fifth aspect of embodiments of the present application, there is provided an electronic apparatus, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement a classification model training method and a data classification method.
According to a sixth aspect of embodiments herein, there is provided a computer-readable medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform a classification model training method and a data classification method.
The classification model training method, data classification method, corresponding apparatuses, electronic device, and storage medium provided by the embodiments of the present application determine the positive sample probability of an unlabeled sample to be labeled according to its similarity to the labeled samples, and label the sample based on that probability and a random number to obtain its label. A large number of unlabeled samples to be labeled can therefore be labeled automatically, which greatly increases the number of positive samples and negative samples available for training the classification model; the classification model can then be trained sufficiently with this large number of positive and negative samples, so that it reaches higher precision after training. The large number of existing unlabeled samples is fully utilized, improving the utilization efficiency of this existing resource; at the same time, each unlabeled sample is converted into a positive or negative sample that the classification model can process directly, so the unlabeled samples do not need to be kept in a separate storage space, which saves storage.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart of one of the classification model training methods provided by the embodiments of the present application;
FIG. 2 is a flow chart of one of the data classification methods provided by the embodiments of the present application;
FIG. 3 is a schematic structural diagram of a classification model training apparatus provided in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a data classification apparatus provided in an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 is a flowchart of a classification model training method provided in an embodiment of the present application. The method comprises the following steps:
step 101, obtaining labeled training data and training data to be labeled.
In the present application, the training data to be labeled includes: a plurality of samples to be labeled.
A sample to be labeled is a sample that still needs a label. A sample may be the behavior data of a user's behavior during use of the APP.
The training data to be labeled comprises a plurality of samples to be labeled but does not comprise a label for any of them; it therefore needs to be labeled so that each sample to be labeled in the training data has a label.
In the present application, labeled training data includes: a plurality of labeled exemplars, a label for each labeled exemplar.
Labeled samples are samples that have already been labeled; each labeled sample has a label.
The label of each labeled sample may be one of: a value such as 1 indicating that the behavior of the user to which the labeled sample belongs is normal, or a value such as -1 indicating that the behavior is abnormal. Abnormal behaviors may include gambling, cash-out, order brushing, cheating, abnormal goods return, non-owner transactions, and the like, performed by the user in the process of using the APP.
Step 102, acquiring the positive sample probability of each sample to be labeled, and performing multiple labeling operations on the training data to be labeled to obtain multiple labeled training data.
In the application, for each sample to be labeled in the data to be labeled, determining the similarity between the sample to be labeled and each labeled sample; and determining the positive sample probability of the sample to be labeled based on the similarity between the sample to be labeled and each labeled sample.
In the present application, for each to-be-labeled sample in the to-be-labeled data, the positive sample probability of the to-be-labeled sample indicates the probability that the to-be-labeled sample is a positive sample.
For a sample to be labeled in the data to be labeled, its similarity to each labeled sample is obtained by calculating, for every labeled sample in turn, the similarity between that labeled sample and the sample to be labeled.
For a labeled sample and a sample to be labeled, each can be vectorized to obtain a vector representing it; the vector similarity between the two vectors is then calculated and used as the similarity between the labeled sample and the sample to be labeled.
In some embodiments, determining the similarity between the sample to be labeled and each labeled sample comprises: calculating the Euclidean distance between the marked sample and the sample to be marked for each marked sample; and determining the similarity between the marked sample and the sample to be marked based on the Euclidean distance between the marked sample and the sample to be marked.
In the present application, for a labeled sample and a sample to be labeled, the Euclidean distance between them may be calculated, and the result of subtracting that distance from 1 is used as the similarity between the labeled sample and the sample to be labeled.
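To make the computation concrete, the following Python sketch pairs the vectorization view above with the Euclidean-distance rule of this embodiment. It is a minimal sketch, not code from the patent: the array names are illustrative, and it assumes feature vectors are scaled so pairwise distances fall within [0, 1], so that 1 minus the distance is a valid similarity.

```python
import numpy as np

def similarity_matrix(unlabeled: np.ndarray, labeled: np.ndarray) -> np.ndarray:
    """Similarity between every sample to be labeled and every labeled sample.

    Each row of `unlabeled` / `labeled` is assumed to be the feature vector of
    one behavior-data sample, scaled so Euclidean distances lie in [0, 1].
    """
    # Pairwise Euclidean distances, shape (n_unlabeled, n_labeled).
    diffs = unlabeled[:, None, :] - labeled[None, :, :]
    distances = np.linalg.norm(diffs, axis=-1)
    # Similarity of a labeled sample and a sample to be labeled: 1 - distance.
    return 1.0 - distances
```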
In the application, for each sample to be labeled in the data to be labeled, one way to determine its positive sample probability from its similarity to the labeled samples is to select all labeled samples whose similarity to it exceeds a similarity threshold, and to divide the number of positive samples among them by their total number; the quotient is the positive sample probability of the sample to be labeled.
In some embodiments, for a sample to be labeled in the data to be labeled, determining its positive sample probability based on its similarity to each labeled sample includes: determining a similar labeled sample set of the sample to be labeled based on those similarities, where the similar labeled sample set comprises a preset number of labeled samples in the labeled training data with the highest similarity to the sample to be labeled; and determining, as the positive sample probability of the sample to be labeled, the proportion of the number of labeled positive samples in the similar labeled sample set to the number of labeled samples in that set.
Let the preset number be K. For any sample to be labeled in the data to be labeled, when its positive sample probability is determined based on its similarity to each labeled sample, the K labeled samples with the highest similarity to it can be selected; these K labeled samples form the similar labeled sample set of the sample to be labeled.
In other words, for any sample to be labeled, all labeled samples are sorted from high to low by similarity to the sample to be labeled, and the first K labeled samples after sorting form the similar labeled sample set of the sample to be labeled.
In the present application, a labeled sample that is a positive sample may be referred to as a labeled positive sample.
For a sample to be labeled in the data to be labeled, when its positive sample probability is determined based on its similarity to each labeled sample, the number of labeled positive samples in its similar labeled sample set may be divided by the number of labeled samples in that set; the resulting proportion is used as the positive sample probability of the sample to be labeled, as sketched below.
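A sketch of this K-nearest-neighbour estimate, continuing the hypothetical `similarity_matrix` helper above; the default K is an assumed value, not one given in the application.

```python
def positive_sample_probability(similarities: np.ndarray,
                                labeled_labels: np.ndarray,
                                k: int = 10) -> np.ndarray:
    """Positive sample probability Cred_k(i) for each sample to be labeled.

    similarities: (n_unlabeled, n_labeled) matrix from similarity_matrix().
    labeled_labels: array of 1 (positive) / -1 (negative) labeled-sample labels.
    k: the preset number of most similar labeled samples (assumed default).
    """
    # Indices of the k labeled samples most similar to each unlabeled sample.
    top_k = np.argsort(-similarities, axis=1)[:, :k]
    # Fraction of labeled positive samples among those k neighbours.
    return (labeled_labels[top_k] == 1).mean(axis=1)
```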
In the application, after the positive sample probability of each sample to be labeled is obtained, multiple labeling operations may be performed on training data to be labeled to obtain multiple labeled training data.
In the application, each time the labeling operation is performed on the data to be labeled, one labeled training data can be obtained.
For example, N labeling operations are performed on the training data to be labeled, resulting in N labeled training data. And performing labeling operation on the training data to be labeled for the 1 st time to obtain the 1 st labeled training data, performing labeling operation on the training data to be labeled for the 2 nd time to obtain the 2 nd labeled training data, and so on, performing labeling operation on the training data to be labeled for the Nth time to obtain the Nth labeled training data.
In the present application, for each labeled training data, the labeled training data comprises: a plurality of samples to be labeled in the training data to be labeled and the label of each sample to be labeled.
For a sample i to be labeled in a plurality of samples to be labeled, each labeled training data comprises the sample i to be labeled, and each labeled training data comprises a label of the sample i to be labeled.
For a given sample i to be labeled, its label in one labeled training data may be the same as or different from its label in another labeled training data.
In this application, the annotation operation performed on the data to be annotated includes: and acquiring a random number, and labeling each sample to be labeled based on the positive sample probability of the sample to be labeled and the acquired random number to obtain a label of the sample to be labeled.
During one labeling operation performed on the data to be labeled, a random number can be generated by a random number generation algorithm; the generated random number lies in a preset interval. The left endpoint of the preset interval is a preset value greater than 0, and the right endpoint is a preset value less than 1. Then, for each sample to be labeled, the sample is labeled based on its positive sample probability and the random number obtained during this labeling operation, yielding the label of the sample to be labeled.
In the present application, in the process of executing one labeling operation on the data to be labeled, for any sample i to be labeled, the label of the sample may be obtained from its positive sample probability and the acquired random number using the following formulas:

$$\pi\bigl(\beta_i \le \mathrm{Cred}_k(i)\bigr) = \begin{cases} 1, & \beta_i \le \mathrm{Cred}_k(i) \\ 0, & \beta_i > \mathrm{Cred}_k(i) \end{cases}$$

$$y_i = 2\,\pi\bigl(\beta_i \le \mathrm{Cred}_k(i)\bigr) - 1$$

where $\mathrm{Cred}_k(i)$ denotes the positive sample probability of the sample to be labeled, $\beta_i$ denotes the random number acquired during the execution of this labeling operation, and $\pi(\cdot)$ is the indicator function of the condition $\beta_i \le \mathrm{Cred}_k(i)$: its value is 1 if $\beta_i \le \mathrm{Cred}_k(i)$ holds, i.e., the random number does not exceed the positive sample probability, and 0 if $\beta_i > \mathrm{Cred}_k(i)$. The label $y_i$ of the sample to be labeled is therefore 1 when $\pi = 1$ and $-1$ when $\pi = 0$. If the label of the sample to be labeled is 1, the sample is taken as a positive sample; if the label is $-1$, the sample is taken as a negative sample.
In some embodiments, in the process of executing any one annotation operation on the data to be annotated, acquiring the random number includes: random numbers are sampled from the uniform distribution of the preset interval.
The left endpoint of the preset interval is a preset value greater than 0, and the right endpoint is a preset value less than 1. During any labeling operation performed on the data to be labeled, the random number is acquired by drawing one value uniformly at random from this preset interval, as in the sketch below.
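The labeling operation itself then reduces to a single uniform draw and an elementwise comparison. This is a hedged sketch: the interval endpoints 0.05 and 0.95 and the count of three operations are assumed values for illustration, and `cred` is the positive-sample-probability vector from the previous sketch.

```python
def labeling_operation(cred: np.ndarray,
                       rng: np.random.Generator,
                       low: float = 0.05, high: float = 0.95) -> np.ndarray:
    """One labeling operation: draw one random number beta uniformly from the
    preset interval (low/high are assumed endpoints strictly inside (0, 1)),
    then label every sample to be labeled.

    Returns labels y_i = 1 if beta <= Cred_k(i), else -1.
    """
    beta = rng.uniform(low, high)        # one random number per operation
    return np.where(beta <= cred, 1, -1)

# N labeling operations yield N labeled training data, where N equals the
# number of classifiers in the classification model (assumed here: N = 3).
rng = np.random.default_rng(0)
labeled_datasets = [labeling_operation(cred, rng) for _ in range(3)]
```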
Step 103, for each classifier in the classification model, the classifier is trained by using the labeled training data corresponding to the classifier.
In this application, the classification model includes a plurality of classifiers. A classifier may be a decision tree, such as a gradient boosting decision tree (GBDT), or a neural network used for classification, such as a convolutional neural network.
In the present application, the number of times of performing labeling operations on data to be labeled is equal to the number of classifiers in the classification model, and correspondingly, the number of obtained labeled training data is equal to the number of classifiers in the classification model.
For example, if the classification model includes 3 classifiers, 3 labeling operations are performed on the data to be labeled, so as to obtain 3 labeled training data.
In the present application, each classifier corresponds to one labeled training data of the plurality of labeled training data, and the labeled training data corresponding to each classifier is different. The labeled training data corresponding to each classifier may be determined in a random manner.
In this application, for a classifier in the classification model, when the classifier is trained with the labeled training data corresponding to it, each training step uses samples from that labeled training data together with their labels, and the samples used in different training steps are different.
For a classifier in the classification model, each time the classifier is trained, the sample used for training the classifier is input into the classifier, and a prediction result corresponding to the classifier can be obtained, where the prediction result corresponding to the classifier can include: the probability that the behavior of the user to which the input sample belongs is normal behavior, and the probability that the behavior of the user to which the input sample belongs is abnormal behavior. The sum of the probability that the behavior of the user to which the input sample belongs is normal behavior and the probability that the behavior of the user to which the input sample belongs is abnormal behavior is 1.
For a classifier in the classification model, at each training step the classifier determines the maximum probability in its prediction result and outputs a predicted classification result accordingly. The predicted classification result may be 1 or -1, where 1 indicates that the behavior of the user to which the input sample belongs is normal and -1 indicates that it is abnormal. If the maximum probability in the prediction result corresponding to the classifier is the probability that the behavior is normal, the classifier outputs the predicted classification result 1; if it is the probability that the behavior is abnormal, the classifier outputs the predicted classification result -1.
For a classifier in the classification model, each time the classifier is trained, whether loss exists or not can be determined according to a prediction classification result output by the classifier and a label of a sample used for training the classifier, and if loss exists, a parameter value of a parameter of the classifier is updated.
In the present application, the training of the classification model is completed after the training of each classifier in the classification model is completed. The classification model may then be utilized to classify the data to be classified.
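Putting the pieces together, one classifier per labeled training data might be trained as below. This is a sketch under stated assumptions, not the patent's reference implementation: scikit-learn's GradientBoostingClassifier stands in for the GBDT mentioned above, `X_unlabeled` is the hypothetical feature matrix of the samples to be labeled, and any classifier exposing fit/predict_proba would serve equally.

```python
from sklearn.ensemble import GradientBoostingClassifier  # one possible classifier type

def train_classification_model(X_unlabeled, labeled_datasets):
    """Train one classifier per labeled training data.

    X_unlabeled: feature matrix of the samples to be labeled; each element of
    labeled_datasets pairs those samples with one set of generated 1/-1 labels.
    """
    classifiers = []
    for labels in labeled_datasets:      # a different labeled data per classifier
        clf = GradientBoostingClassifier()
        clf.fit(X_unlabeled, labels)
        classifiers.append(clf)
    return classifiers
```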
Fig. 2 is a flowchart of one data classification method provided in an embodiment of the present application. The method comprises the following steps:
step 201, obtaining data to be classified.
In the present application, the data to be classified may be behavior data of a behavior of the user in a process of using the APP.
Step 202, for each classifier in the classification model, inputting the data to be classified into the classifier to obtain a prediction result corresponding to the classifier.
Before step 201 is performed, the classification model is trained in advance according to the classification model training method described above. The classification model includes a plurality of classifiers, which may be decision trees such as gradient boosting decision trees or neural networks used for classification such as convolutional neural networks.
For each classifier in the classification model, inputting data to be classified into the classifier, and outputting a prediction result corresponding to the classifier by the classifier.
The prediction result corresponding to the classifier may include: the normal probability corresponding to the classifier is the probability that the behavior of the user to which the data to be classified belongs is normal behavior, and the abnormal probability corresponding to the classifier is the probability that the behavior of the user to which the data to be classified belongs is abnormal behavior.
Step 203, determining the classification result of the data to be classified based on the corresponding prediction result of each classifier.
In the application, after the prediction result corresponding to each classifier is obtained, the classification result of the data to be classified may be determined based on the prediction result corresponding to each classifier.
In the application, for each classifier, a sub-classification result corresponding to the classifier can be determined from its prediction result. The sub-classification result may be 1 or -1, where 1 indicates that the behavior of the user to which the data to be classified belongs is normal and -1 indicates that it is abnormal. If the maximum probability in the prediction result corresponding to the classifier is its normal probability, the sub-classification result corresponding to the classifier is 1; if it is its abnormal probability, the sub-classification result is -1.
In the application, the classification result of the data to be classified can be determined from the sub-classification results corresponding to the classifiers: the most frequent sub-classification result among them is taken as the classification result of the data to be classified, as sketched below.
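A minimal sketch of this majority vote, under the same assumed classifier interface as the training sketch above:

```python
from collections import Counter

def classify_by_vote(classifiers, x):
    """Majority vote over the per-classifier sub-classification results (1 or -1)."""
    votes = [int(clf.predict(x.reshape(1, -1))[0]) for clf in classifiers]
    # The most frequent sub-classification result becomes the final result.
    return Counter(votes).most_common(1)[0][0]
```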
In some embodiments, determining the classification result of the data to be classified based on the prediction result corresponding to each classifier comprises: calculating the average value of the abnormal probability corresponding to each classifier to obtain the average abnormal probability; when the average abnormal probability is smaller than the probability threshold, generating a classification result indicating that the behavior of the user to which the data to be classified belongs is normal; and when the average abnormal probability is larger than or equal to the probability threshold, generating a classification result indicating that the behavior of the user to which the data to be classified belongs is abnormal.
In the application, when the classification result of the data to be classified is determined based on the prediction results corresponding to the classifiers, the average of the abnormal probabilities corresponding to the classifiers may be calculated to obtain the average abnormal probability. When the average abnormal probability is smaller than the probability threshold, a classification result indicating that the behavior of the user to which the data to be classified belongs is normal may be generated, for example 1. When the average abnormal probability is greater than or equal to the probability threshold, a classification result indicating that the behavior is abnormal may be generated, for example -1.
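The averaging variant can be sketched as follows; the 0.5 probability threshold is an assumed value, and locating the abnormal class via `classes_` presumes the scikit-learn interface used in the sketches above.

```python
def classify_by_average(classifiers, x, threshold=0.5):
    """Average the per-classifier abnormal probabilities, compare to a threshold."""
    x = x.reshape(1, -1)
    abnormal_probs = [
        clf.predict_proba(x)[0][list(clf.classes_).index(-1)]  # P(abnormal), class -1
        for clf in classifiers
    ]
    avg_abnormal = sum(abnormal_probs) / len(abnormal_probs)
    # 1 indicates normal behavior, -1 indicates abnormal behavior.
    return 1 if avg_abnormal < threshold else -1
```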
Please refer to fig. 3, which illustrates a schematic structural diagram of a classification model training apparatus according to an embodiment of the present application. As shown in fig. 3, the classification model training apparatus includes: a training data acquisition unit 301, a labeling unit 302, and a training unit 303.
The training data obtaining unit 301 is configured to obtain labeled training data and training data to be labeled, where the training data to be labeled includes: a plurality of samples to be labeled, the labeled training data comprising: a plurality of labeled samples, a label for each of the labeled samples;
the labeling unit 302 is configured to, for each of the samples to be labeled, determine a similarity between the sample to be labeled and each of the labeled samples, determine a positive sample probability of the sample to be labeled based on the similarity between the sample to be labeled and each of the labeled samples, and perform a plurality of labeling operations on the training data to be labeled to obtain a plurality of labeled training data, where the labeled training data includes: the plurality of samples to be labeled and the label of each sample to be labeled, wherein the labeling operation comprises the following steps: acquiring a random number, and labeling the samples to be labeled based on the positive sample probability of the samples to be labeled and the random number for each sample to be labeled to obtain a label of the sample to be labeled;
the training unit 303 is configured to, for each classifier in the classification model, train the classifier using labeled training data corresponding to the classifier, where each classifier respectively corresponds to one labeled training data in the plurality of labeled training data, and the labeled training data corresponding to each classifier is different.
In some embodiments, the labeling unit 302 is further configured to determine a similar labeled sample set of the sample to be labeled based on the similarity between the sample to be labeled and each labeled sample in the labeled training data, where the similar labeled sample set comprises: a preset number of labeled samples in the labeled training data with the highest similarity to the sample to be labeled; and to determine, as the positive sample probability of the sample to be labeled, the proportion of the number of labeled positive samples in the similar labeled sample set to the number of labeled samples in that set.
In some embodiments, the labeling unit 302 is further configured to sample the random number from a uniform distribution of a preset interval.
In some embodiments, the labeling unit 302 is further configured to calculate, for each labeled sample, a euclidean distance between the labeled sample and the sample to be labeled; and determining the similarity between the marked sample and the sample to be marked based on the Euclidean distance.
Please refer to fig. 4, which illustrates a schematic structural diagram of a data classification apparatus according to an embodiment of the present application. As shown in fig. 4, the data sorting apparatus includes: a data to be classified acquisition unit 401, a sub-classification unit 402, and a classification unit 403.
The data to be classified acquisition unit 401 is configured to acquire data to be classified;
the sub-classification unit 402 is configured to, for each classifier in a classification model, input the data to be classified into the classifier to obtain a prediction result corresponding to the classifier, where the classification model is trained in advance according to the above classification model training method;
the classification unit 403 is configured to determine a classification result of the data to be classified based on the prediction result corresponding to each classifier.
In some embodiments, the classification unit 403 is further configured to calculate an average value of the abnormal probabilities corresponding to each classifier, resulting in an average abnormal probability; when the average abnormal probability is smaller than a probability threshold, generating a classification result indicating that the behavior of the user to which the data to be classified belongs is normal; and when the average abnormal probability is larger than or equal to a probability threshold, generating a classification result indicating that the behavior of the user to which the data to be classified belongs is abnormal.
The present application further provides an electronic device, which may be configured with one or more processors and a memory for storing one or more programs; the one or more programs may include instructions for performing the operations described in the above embodiments. The one or more programs, when executed by the one or more processors, cause the one or more processors to perform the operations described in the classification model training embodiments above.
The present application also provides a computer readable medium, which may be included in an electronic device; or the device can be independently arranged and not assembled into the electronic equipment. The computer-readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to perform the operations described in the data classification method embodiments above.
It should be noted that the readable storage medium described in this application can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable signal medium, by contrast, may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.
The above description is only a preferred embodiment of the present application and is illustrative of the principles of the technology employed. It will be understood by those skilled in the art that the scope of the invention referred to herein is not limited to technical solutions formed by the specific combination of the above technical features, but also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (10)

1. A classification model training method, the method comprising:
acquiring labeled training data and training data to be labeled, wherein the training data to be labeled comprises: a plurality of samples to be labeled, and the labeled training data comprises: a plurality of labeled samples and a label for each of the labeled samples;
for each sample to be labeled, determining the similarity between the sample to be labeled and each labeled sample, determining the positive sample probability of the sample to be labeled based on the similarity between the sample to be labeled and each labeled sample, and performing multiple labeling operations on the training data to be labeled to obtain a plurality of labeled training data, wherein the labeled training data comprises: the plurality of samples to be labeled and the label of each sample to be labeled, wherein the labeling operation comprises the following steps: acquiring a random number, and labeling the samples to be labeled based on the positive sample probability of the samples to be labeled and the random number for each sample to be labeled to obtain a label of the sample to be labeled;
for each classifier in the classification model, training the classifier by using labeled training data corresponding to the classifier, wherein each classifier respectively corresponds to one labeled training data in the plurality of labeled training data, and the labeled training data corresponding to each classifier is different.
2. The method of claim 1, wherein determining the positive sample probability of the to-be-labeled sample based on the similarity of the to-be-labeled sample and each labeled sample comprises:
determining a similar labeled sample set of the sample to be labeled based on the similarity between the sample to be labeled and each labeled sample, wherein the similar labeled sample set comprises: a preset number of labeled samples in the labeled training data with the highest similarity to the sample to be labeled;
and determining, as the positive sample probability of the sample to be labeled, the proportion of the number of labeled positive samples in the similar labeled sample set to the number of labeled samples in that set.
3. The method of claim 1, wherein obtaining the random number comprises:
and sampling the random numbers from the uniform distribution of the preset interval.
4. The method of claim 1, wherein determining the similarity between the sample to be labeled and each labeled sample comprises:
calculating the Euclidean distance between the labeled sample and the sample to be labeled for each labeled sample; and determining the similarity between the marked sample and the sample to be marked based on the Euclidean distance.
5. A method of data classification, the method comprising:
acquiring data to be classified;
for each classifier in a classification model, inputting the data to be classified into the classifier to obtain a prediction result corresponding to the classifier, wherein the classification model is trained in advance according to the classification model training method of any one of claims 1-4;
and determining the classification result of the data to be classified based on the corresponding prediction result of each classifier.
6. The method of claim 5, wherein the prediction result corresponding to the classifier comprises: an abnormal probability corresponding to the classifier, the abnormal probability being the probability that the behavior of the user to which the data to be classified belongs is abnormal;
determining the classification result of the data to be classified based on the prediction result corresponding to each classifier comprises:
calculating the average value of the abnormal probability corresponding to each classifier to obtain the average abnormal probability;
when the average abnormal probability is smaller than a probability threshold, generating a classification result indicating that the behavior of the user to which the data to be classified belongs is normal;
and when the average abnormal probability is larger than or equal to a probability threshold, generating a classification result indicating that the behavior of the user to which the data to be classified belongs is abnormal.
7. A classification model training apparatus, characterized in that the apparatus comprises:
a training data obtaining unit configured to obtain labeled training data and training data to be labeled, where the training data to be labeled includes: a plurality of samples to be labeled, the labeled training data comprising: a plurality of labeled samples, a label for each of the labeled samples;
the labeling unit is configured to determine, for each to-be-labeled sample, a similarity between the to-be-labeled sample and each labeled sample, determine a positive sample probability of the to-be-labeled sample based on the similarity between the to-be-labeled sample and each labeled sample, and perform multiple labeling operations on the to-be-labeled training data to obtain multiple labeled training data, where the labeled training data includes: the plurality of samples to be labeled and the label of each sample to be labeled, wherein the labeling operation comprises the following steps: acquiring a random number, and labeling the samples to be labeled based on the positive sample probability of the samples to be labeled and the random number for each sample to be labeled to obtain a label of the sample to be labeled;
a training unit configured to train, for each classifier in the classification model, the classifier using labeled training data corresponding to the classifier, wherein each classifier corresponds to one labeled training data in the plurality of labeled training data, and the labeled training data corresponding to each classifier is different.
8. An apparatus for classifying data, the apparatus comprising:
a to-be-classified data acquisition unit configured to acquire to-be-classified data;
a sub-classification unit, configured to, for each classifier in a classification model, input the data to be classified into the classifier to obtain a prediction result corresponding to the classifier, wherein the classification model is trained in advance according to the classification model training method of any one of claims 1 to 4;
and the classification unit is configured to determine a classification result of the data to be classified based on the prediction result corresponding to each classifier.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the classification model training method of any one of claims 1 to 4 and the data classification method of any one of claims 5 to 6.
10. A computer readable medium, wherein the instructions when executed by a processor of an electronic device, enable the electronic device to perform the classification model training method of any one of claims 1 to 4 and the data classification method of any one of claims 5 to 6.
CN202110956976.6A 2021-08-19 2021-08-19 Classification model training and data classification method and device, and electronic equipment Withdrawn CN113780367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110956976.6A CN113780367A (en) 2021-08-19 2021-08-19 Classification model training and data classification method and device, and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110956976.6A CN113780367A (en) 2021-08-19 2021-08-19 Classification model training and data classification method and device, and electronic equipment

Publications (1)

Publication Number Publication Date
CN113780367A true CN113780367A (en) 2021-12-10

Family

ID=78838442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110956976.6A Withdrawn CN113780367A (en) 2021-08-19 2021-08-19 Classification model training and data classification method and device, and electronic equipment

Country Status (1)

Country Link
CN (1) CN113780367A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241243A (en) * 2021-12-20 2022-03-25 百度在线网络技术(北京)有限公司 Training method and device of image classification model, electronic equipment and storage medium
CN114241243B (en) * 2021-12-20 2023-04-25 百度在线网络技术(北京)有限公司 Training method and device for image classification model, electronic equipment and storage medium
CN114612699A (en) * 2022-03-10 2022-06-10 京东科技信息技术有限公司 Image data processing method and device
CN114996590A (en) * 2022-08-04 2022-09-02 上海钐昆网络科技有限公司 Classification method, classification device, classification equipment and storage medium
CN117493514A (en) * 2023-11-09 2024-02-02 广州方舟信息科技有限公司 Text labeling method, text labeling device, electronic equipment and storage medium
CN117493514B (en) * 2023-11-09 2024-05-14 广州方舟信息科技有限公司 Text labeling method, text labeling device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113780367A (en) Classification model training and data classification method and device, and electronic equipment
CN110046254B (en) Method and apparatus for generating a model
CN110717039A (en) Text classification method and device, electronic equipment and computer-readable storage medium
CN109872162B (en) Wind control classification and identification method and system for processing user complaint information
CN110580308B (en) Information auditing method and device, electronic equipment and storage medium
CN112883990A (en) Data classification method and device, computer storage medium and electronic equipment
CN112181490B (en) Method, device, equipment and medium for identifying function category in function point evaluation method
CN109993391B (en) Method, device, equipment and medium for dispatching network operation and maintenance task work order
CN115565038A (en) Content audit, content audit model training method and related device
CN114792089A (en) Method, apparatus and program product for managing computer system
US20190266281A1 (en) Natural Language Processing and Classification
CN111354354B (en) Training method, training device and terminal equipment based on semantic recognition
CN113658002B (en) Transaction result generation method and device based on decision tree, electronic equipment and medium
CN108628863B (en) Information acquisition method and device
CN112989050A (en) Table classification method, device, equipment and storage medium
CN116578700A (en) Log classification method, log classification device, equipment and medium
CN116823164A (en) Business approval method, device, equipment and storage medium
CN113469237B (en) User intention recognition method, device, electronic equipment and storage medium
CN115345600A (en) RPA flow generation method and device
CN113139368B (en) Text editing method and system
CN110968690B (en) Clustering division method and device for words, equipment and storage medium
CN112465149A (en) Same-city part identification method and device, electronic equipment and storage medium
CN112115229A (en) Text intention recognition method, device and system and text classification system
CN112767022B (en) Mobile application function evolution trend prediction method and device and computer equipment
US10565189B2 (en) Augmentation of a run-time query

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211210