CN113919936A - Sample data processing method and device - Google Patents

Sample data processing method and device

Info

Publication number
CN113919936A
Authority
CN
China
Prior art keywords
sample data
preset
unlabeled
labeling
unmarked
Prior art date
Legal status
Granted
Application number
CN202111107477.6A
Other languages
Chinese (zh)
Other versions
CN113919936B (en)
Inventor
王珍
孙祥坤
陈昶汝
杨丽娟
Current Assignee
Bairong Zhixin Beijing Technology Co ltd
Original Assignee
Bairong Zhixin Beijing Credit Investigation Co Ltd
Priority date
Filing date
Publication date
Application filed by Bairong Zhixin Beijing Credit Investigation Co., Ltd.
Priority to CN202111107477.6A
Publication of CN113919936A
Application granted
Publication of CN113919936B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The application discloses a sample data processing method and device, and relates to the technical field of data processing. The method of the present application comprises: acquiring a sample data set, wherein the sample data set comprises a plurality of unlabeled sample data; respectively inputting each unlabeled sample data into a preset scoring model to obtain a scoring result corresponding to each unlabeled sample data; performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data to obtain a plurality of first positive sample data and a plurality of first negative sample data; performing multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data to obtain a first label prediction model and a second label prediction model; and performing multiple rounds of labeling processing on a plurality of remaining unlabeled sample data according to the first label prediction model and the second label prediction model.

Description

Sample data processing method and device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for processing sample data.
Background
With the continuous development of society and the gradual upgrading of consumption habits, more and more people improve their living standards by taking out loans. When a loan applicant applies for a loan service on a financial institution platform, in order to reduce loan risk, the financial institution platform determines, according to the personal information data and behavioral data of the borrower, whether the borrower is likely to default on repayment, that is, whether the borrower is a bad customer.
At present, a financial institution platform usually trains a prediction model based on a large amount of labeled sample data (personal information data and/or behavioral performance data of historical borrowers, each labeled as a good customer or a bad customer), and then inputs the personal information data and/or behavioral performance data of a borrower to be evaluated into the prediction model, so that the model outputs the probability that the borrower to be evaluated is a bad customer. However, in the credit domain, unlabeled sample data is relatively easy to acquire, while labeling a large amount of unlabeled sample data consumes considerable manpower and material resources. Therefore, how a financial institution platform can efficiently label a large amount of unlabeled sample data is a problem that urgently needs to be solved.
Disclosure of Invention
The embodiments of the present application provide a sample data processing method and device, with the main aim of efficiently labeling a large amount of unlabeled sample data.
In order to solve the above technical problem, an embodiment of the present application provides the following technical solutions:
in a first aspect, the present application provides a method for processing sample data, including:
acquiring a sample data set, wherein the sample data set comprises a plurality of unlabeled sample data;
inputting each unlabeled sample data into a preset scoring model respectively to obtain a scoring result corresponding to each unlabeled sample data;
performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data;
performing multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the first positive sample data and the first negative sample data until a second preset stop condition is reached to obtain a first label prediction model and a second label prediction model;
and performing multi-round labeling processing on a plurality of residual unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
Optionally, the performing, according to a scoring result corresponding to each unlabeled sample data, multiple rounds of labeling processing on the unlabeled sample data until a first preset stop condition is reached to obtain multiple first positive sample data and multiple first negative sample data includes:
for each round of labeling process:
sorting the plurality of unlabeled sample data in descending order according to the scoring result corresponding to each unlabeled sample data, so as to obtain a first sequence;
acquiring the X unlabeled sample data ranked first in the first sequence; clustering the X unlabeled sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the highest average score with a positive label, so as to obtain a plurality of first positive sample data;
acquiring the Y unlabeled sample data ranked last in the first sequence; clustering the Y unlabeled sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the lowest average score with a negative label, so as to obtain a plurality of first negative sample data.
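The first-stage labeling round described above (sort by score, then cluster the two ends of the ranking and label only the most confident cluster at each end) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the sample structure, the scores, and the tiny 1-D k-means used as a stand-in for the "preset unsupervised clustering algorithm" are all assumptions.

```python
import random

def kmeans_1d(values, k, iters=20, seed=0):
    """Tiny 1-D k-means, standing in for the unspecified preset
    unsupervised clustering algorithm. Returns the non-empty clusters."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return [c for c in clusters if c]

def label_round(samples, x, y, k=3):
    """One labeling round: sort samples by score in descending order,
    cluster the top-X and bottom-Y ends, and label only the cluster with
    the highest (resp. lowest) average score."""
    ordered = sorted(samples, key=lambda s: s["score"], reverse=True)
    top, bottom = ordered[:x], ordered[-y:]

    def pick(group, want_high):
        clusters = kmeans_1d([s["score"] for s in group], k)
        averages = [sum(c) / len(c) for c in clusters]
        best = averages.index(max(averages) if want_high else min(averages))
        chosen = set(clusters[best])
        return [s for s in group if s["score"] in chosen]

    positives = pick(top, want_high=True)      # first positive sample data
    negatives = pick(bottom, want_high=False)  # first negative sample data
    return positives, negatives

# Hypothetical credit scores for ten unlabeled samples.
samples = [{"id": i, "score": s} for i, s in enumerate(
    [920, 900, 880, 610, 590, 550, 320, 300, 280, 260])]
pos, neg = label_round(samples, x=4, y=4)
```

Labeling only the extreme cluster at each end, rather than all top-X/bottom-Y samples, keeps borderline samples unlabeled for a later round.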
Optionally, the performing, according to the first label prediction model and the second label prediction model, multiple rounds of labeling processing on multiple remaining unlabeled sample data until a third preset stop condition is reached includes:
for each round of labeling process:
respectively inputting each remaining unlabeled sample data into the first label prediction model to obtain a first prediction probability corresponding to each remaining unlabeled sample data, and sorting the plurality of remaining unlabeled sample data in descending order according to the first prediction probability corresponding to each remaining unlabeled sample data to obtain a second sequence, wherein the first prediction probability indicates the probability that the remaining unlabeled sample data should be marked with a negative label;
respectively inputting each remaining unlabeled sample data into the second label prediction model to obtain a second prediction probability corresponding to each remaining unlabeled sample data, and sorting the plurality of remaining unlabeled sample data in descending order according to the second prediction probability corresponding to each remaining unlabeled sample data to obtain a third sequence, wherein the second prediction probability indicates the probability that the remaining unlabeled sample data should be marked with a negative label;
determining the intersection of the A remaining unlabeled sample data ranked first in the second sequence and the A remaining unlabeled sample data ranked first in the third sequence as first target sample data, and determining the intersection of the B remaining unlabeled sample data ranked last in the second sequence and the B remaining unlabeled sample data ranked last in the third sequence as second target sample data;
clustering the first target sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each first target sample data; and labeling each first target sample data contained in the cluster with the lowest average score with a negative label, so as to obtain a plurality of second negative sample data;
clustering the second target sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each second target sample data; and labeling each second target sample data contained in the cluster with the highest average score with a positive label, so as to obtain a plurality of second positive sample data;
training the first label prediction model and the second label prediction model using a plurality of the first positive sample data, a plurality of the second positive sample data, a plurality of the first negative sample data, and a plurality of the second negative sample data.
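The co-labeling selection step above can be sketched as below. The two trained label prediction models are represented here by stand-in probability functions, and the sample and threshold values are illustrative assumptions; in the patent's flow, each target set is additionally clustered before labels are assigned.

```python
def co_label_round(samples, prob_model_1, prob_model_2, a, b):
    """Select candidates both models agree on: each model ranks all
    remaining unlabeled samples by predicted negative-label probability
    (descending). The intersection of the two top-`a` sets becomes the
    first target (likely negative), and the intersection of the two
    bottom-`b` sets becomes the second target (likely positive)."""
    seq2 = sorted(samples, key=prob_model_1, reverse=True)  # second sequence
    seq3 = sorted(samples, key=prob_model_2, reverse=True)  # third sequence
    ids = lambda part: {s["id"] for s in part}
    first_target = ids(seq2[:a]) & ids(seq3[:a])     # agreed likely-negative
    second_target = ids(seq2[-b:]) & ids(seq3[-b:])  # agreed likely-positive
    return first_target, second_target

# Stand-ins for the first (machine learning) and second (deep learning)
# label prediction models: both roughly agree on the sample ordering.
samples = [{"id": i} for i in range(10)]
model_1 = lambda s: s["id"] / 10
model_2 = lambda s: (s["id"] + (0.5 if s["id"] % 2 else -0.5) * 0.1) / 10
first_target, second_target = co_label_round(samples, model_1, model_2, a=3, b=3)
```

Taking the intersection means a sample is only labeled when both independently trained models rank it the same way, which reduces the risk of propagating one model's mistakes.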
Optionally, after the step of respectively inputting each unlabeled sample data into a preset scoring model to obtain a scoring result corresponding to each unlabeled sample data, the method further includes:
performing dimensionality reduction processing on each unlabeled sample data by using a preset dimensionality reduction algorithm;
the performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data, includes:
performing multiple rounds of labeling processing on the unlabeled sample data subjected to the dimensionality reduction processing according to the scoring result corresponding to each unlabeled sample data subjected to the dimensionality reduction processing until the first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data.
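The patent does not name the preset dimensionality reduction algorithm; the sketch below uses simple variance-based feature selection as one plausible stand-in (PCA or an autoencoder would fit the same slot). The sample rows are hypothetical.

```python
def reduce_dims(rows, keep):
    """Keep the `keep` features with the highest variance across samples;
    a stand-in for the unspecified preset dimensionality reduction
    algorithm."""
    n, d = len(rows), len(rows[0])
    variances = []
    for j in range(d):
        column = [row[j] for row in rows]
        mean = sum(column) / n
        variances.append(sum((v - mean) ** 2 for v in column) / n)
    kept = sorted(range(d), key=lambda j: variances[j], reverse=True)[:keep]
    kept.sort()  # preserve the original feature order
    return [[row[j] for j in kept] for row in rows]

# Three hypothetical samples; the constant first feature carries no
# information and is dropped.
rows = [[1.0, 50.0, 3.0], [1.0, 10.0, 4.0], [1.0, 90.0, 5.0]]
reduced = reduce_dims(rows, keep=2)
```

Reducing the feature dimensionality before clustering matters because distance-based clustering degrades as uninformative dimensions accumulate.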
Optionally, the first preset stop condition is any one of the following: the current number of labeling rounds reaches a first preset round-count threshold; the current labeling duration reaches a first preset duration threshold; or the ratio of labeled sample data to unlabeled sample data is greater than a preset threshold. The second preset stop condition is any one of the following: the current number of iterative training rounds reaches a second preset round-count threshold; or the current iterative training duration reaches a second preset duration threshold. The third preset stop condition is: every remaining unlabeled sample data has been labeled.
In a second aspect, the present application further provides a device for processing sample data, where the device includes:
the device comprises an acquisition unit, a storage unit and a processing unit, wherein the acquisition unit is used for acquiring a sample data set, and the sample data set comprises a plurality of unlabeled sample data;
the input unit is used for respectively inputting each unlabeled sample data into a preset scoring model so as to obtain a scoring result corresponding to each unlabeled sample data;
the first labeling unit is used for performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data;
the training unit is used for performing multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data until a second preset stop condition is reached, so as to obtain a first label prediction model and a second label prediction model;
and the second labeling unit is used for performing multi-round labeling processing on a plurality of residual unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
Optionally, the first labeling unit is specifically configured to, for each round of labeling processing:
sorting the plurality of unlabeled sample data in descending order according to the scoring result corresponding to each unlabeled sample data, so as to obtain a first sequence;
acquiring the X unlabeled sample data ranked first in the first sequence; clustering the X unlabeled sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the highest average score with a positive label, so as to obtain a plurality of first positive sample data;
acquiring the Y unlabeled sample data ranked last in the first sequence; clustering the Y unlabeled sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the lowest average score with a negative label, so as to obtain a plurality of first negative sample data.
Optionally, the second labeling unit is specifically configured to, for each round of labeling processing:
respectively inputting each remaining unlabeled sample data into the first label prediction model to obtain a first prediction probability corresponding to each remaining unlabeled sample data, and sorting the plurality of remaining unlabeled sample data in descending order according to the first prediction probability corresponding to each remaining unlabeled sample data to obtain a second sequence, wherein the first prediction probability indicates the probability that the remaining unlabeled sample data should be marked with a negative label;
respectively inputting each remaining unlabeled sample data into the second label prediction model to obtain a second prediction probability corresponding to each remaining unlabeled sample data, and sorting the plurality of remaining unlabeled sample data in descending order according to the second prediction probability corresponding to each remaining unlabeled sample data to obtain a third sequence, wherein the second prediction probability indicates the probability that the remaining unlabeled sample data should be marked with a negative label;
determining the intersection of the A remaining unlabeled sample data ranked first in the second sequence and the A remaining unlabeled sample data ranked first in the third sequence as first target sample data, and determining the intersection of the B remaining unlabeled sample data ranked last in the second sequence and the B remaining unlabeled sample data ranked last in the third sequence as second target sample data;
clustering the first target sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each first target sample data; and labeling each first target sample data contained in the cluster with the lowest average score with a negative label, so as to obtain a plurality of second negative sample data;
clustering the second target sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each second target sample data; and labeling each second target sample data contained in the cluster with the highest average score with a positive label, so as to obtain a plurality of second positive sample data;
training the first label prediction model and the second label prediction model using a plurality of the first positive sample data, a plurality of the second positive sample data, a plurality of the first negative sample data, and a plurality of the second negative sample data.
Optionally, the apparatus further comprises:
the dimensionality reduction unit is used for performing, after the input unit respectively inputs each unlabeled sample data into the preset scoring model to obtain the scoring result corresponding to each unlabeled sample data, dimensionality reduction processing on each unlabeled sample data by using a preset dimensionality reduction algorithm;
the first labeling unit is specifically configured to perform multiple rounds of labeling processing on the unlabeled sample data subjected to the dimensionality reduction processing according to the scoring result corresponding to each unlabeled sample data subjected to the dimensionality reduction processing until the first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data.
Optionally, the first preset stop condition is any one of the following: the current number of labeling rounds reaches a first preset round-count threshold; the current labeling duration reaches a first preset duration threshold; or the ratio of labeled sample data to unlabeled sample data is greater than a preset threshold. The second preset stop condition is any one of the following: the current number of iterative training rounds reaches a second preset round-count threshold; or the current iterative training duration reaches a second preset duration threshold. The third preset stop condition is: every remaining unlabeled sample data has been labeled.
In a third aspect, an embodiment of the present application provides a storage medium, where the storage medium includes a stored program, and when the program runs, a device in which the storage medium is located is controlled to execute the method for processing sample data in the first aspect.
In a fourth aspect, an embodiment of the present application provides an apparatus for processing sample data, where the apparatus includes a storage medium; and one or more processors, the storage medium coupled with the processors, the processors configured to execute program instructions stored in the storage medium; the program instructions, when executed, implement the method for processing sample data of the first aspect.
By means of the technical scheme, the technical scheme provided by the application at least has the following advantages:
the application provides a sample data processing method and a device, which can respectively input each unmarked sample data into a preset scoring model by a target financial institution platform after the target financial institution platform acquires the sample data set containing a plurality of unmarked sample data to obtain a scoring result corresponding to each unmarked sample data, perform multi-round labeling processing on the plurality of unmarked sample data according to the scoring result corresponding to each unmarked sample data until a first preset stopping condition is reached to obtain a plurality of first positive sample data and a plurality of first negative sample data, perform multi-round iterative training on a preset machine learning model and a preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data until a second preset stopping condition is reached to obtain a first label prediction model and a second label prediction model, and finally, performing multi-round labeling processing on a plurality of residual unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached, thereby completing the operation of labeling all unlabeled sample data, namely, the target financial institution platform can efficiently label a large amount of unlabeled sample data through two multi-round labeling processing before and after.
The foregoing description is only an overview of the technical solutions of the present application. In order that the technical means of the present application may be more clearly understood and implemented according to the content of the specification, and in order to make the above and other objects, features and advantages of the present application more comprehensible, detailed embodiments of the present application are described below.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present application will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present application are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts:
fig. 1 is a flowchart illustrating a sample data processing method according to an embodiment of the present application;
fig. 2 is a flowchart illustrating another sample data processing method provided in an embodiment of the present application;
fig. 3 is a block diagram illustrating a sample data processing apparatus according to an embodiment of the present disclosure;
fig. 4 is a block diagram illustrating another sample data processing apparatus according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.
An embodiment of the present application provides a method for processing sample data, as shown in fig. 1, the method includes:
101. and acquiring a sample data set.
The acquired sample data set includes a plurality of unlabeled sample data, that is, sample data that carries no label. The unlabeled sample data may be, but is not limited to: personal information data and/or behavioral data corresponding to historical borrowers, where a historical borrower is a borrower who has applied for a loan on the target financial institution platform (or another financial institution platform).
In the embodiment of the present application, the target financial institution platform first needs to acquire a sample data set containing a plurality of unlabeled sample data.
102. And respectively inputting each unlabeled sample data into a preset scoring model to obtain a scoring result corresponding to each unlabeled sample data.
The preset scoring model is a model which is trained in advance by a target financial institution platform and is used for determining a scoring result corresponding to a borrower, and after personal information data and/or behavior performance data corresponding to a certain borrower are input into the preset scoring model, the preset scoring model can output the scoring result corresponding to the borrower.
In the embodiment of the application, after obtaining a sample data set including a plurality of unlabeled sample data, the target financial institution platform needs to input each unlabeled sample data into the preset scoring model, so that the preset scoring model outputs a scoring result corresponding to each unlabeled sample data, thereby obtaining a scoring result corresponding to each unlabeled sample data, where the scoring result corresponding to any one unlabeled sample data is specifically a credit score corresponding to the unlabeled sample data.
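As one illustration of what such a preset scoring model could look like, the sketch below uses a logistic scorecard mapped onto a credit-score scale. The weights, bias, feature meanings, and scaling constants are hypothetical, since the patent leaves the model family unspecified.

```python
import math

def credit_score(features, weights, bias, base=600.0, scale=50.0):
    """Score one unlabeled sample: a linear model squashed through a
    sigmoid gives the probability of a good customer, and the log-odds
    are mapped onto a conventional credit-score scale."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p_good = 1.0 / (1.0 + math.exp(-z))
    return base + scale * math.log(p_good / (1.0 - p_good))

# Hypothetical two-feature samples (e.g. normalized income, debt ratio);
# the weights and bias are assumed to come from prior training.
scores = [credit_score(f, weights=[2.0, -1.0], bias=0.5)
          for f in ([1.0, 0.2], [0.1, 0.9])]
```

A higher score would then correspond to a more likely good customer, which is the ordering the subsequent labeling rounds rely on.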
103. And performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the grading result corresponding to each unlabeled sample data until a first preset stopping condition is reached so as to obtain a plurality of first positive sample data and a plurality of first negative sample data.
The first positive sample data is sample data with a positive label (namely a high-quality customer label), and the first negative sample data is sample data with a negative label (namely a low-quality customer label). The first preset stop condition may be, but is not limited to, any one of the following: the current number of labeling rounds reaches a first preset round-count threshold; the current labeling duration reaches a first preset duration threshold; or the ratio of labeled sample data to unlabeled sample data is greater than a preset threshold. The first preset round-count threshold may be, but is not limited to: 10, 20, 30, etc.; the first preset duration threshold may be, but is not limited to: 1 hour, 5 hours, 10 hours, etc.; and the preset threshold may be, but is not limited to: 0.2, 0.3, 0.4, etc.
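Because the first preset stop condition is a disjunction (any one of the thresholds suffices), a check like the following sketch would end the first labeling stage. The default thresholds mirror the example values in the text (10 rounds, 1 hour, ratio 0.2) and are otherwise arbitrary.

```python
import time

def first_stop_reached(round_no, start_time, n_labeled, n_unlabeled,
                       max_rounds=10, max_seconds=3600.0, max_ratio=0.2):
    """True when ANY of the three first-stage thresholds is hit:
    round count, elapsed labeling time, or labeled/unlabeled ratio."""
    elapsed = time.monotonic() - start_time
    ratio = n_labeled / n_unlabeled if n_unlabeled else float("inf")
    return (round_no >= max_rounds
            or elapsed >= max_seconds
            or ratio > max_ratio)

# After 3 rounds, 30 of 100 samples labeled: the ratio 0.3 > 0.2 stops
# the stage even though the round and time thresholds are not reached.
now = time.monotonic()
stage_over = first_stop_reached(round_no=3, start_time=now,
                                n_labeled=30, n_unlabeled=100)
```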
In the embodiment of the application, after obtaining the scoring result corresponding to each unlabeled sample data, the target financial institution platform may perform multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until the first preset stop condition is reached, thereby obtaining a plurality of first positive sample data and a plurality of first negative sample data.
104. And performing multiple rounds of iterative training on the preset machine learning model and the preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data until a second preset stop condition is reached, so as to obtain a first label prediction model and a second label prediction model.
The preset machine learning model is a model created in advance according to a target machine learning algorithm, and the target machine learning algorithm may be, but is not limited to: a decision tree algorithm, a light gradient boosting machine (LightGBM) algorithm, and the like. The preset deep learning model is a model created in advance according to a target deep learning algorithm, and the target deep learning algorithm may be, but is not limited to: a convolutional neural network algorithm, a recurrent neural network algorithm, a recursive neural network algorithm, and the like. The second preset stop condition may be, but is not limited to, any one of the following: the current number of iterative training rounds reaches a second preset round-count threshold; or the current iterative training duration reaches a second preset duration threshold. The second preset round-count threshold may be, but is not limited to: 10, 20, 30, etc., and the second preset duration threshold may be, but is not limited to: 1 hour, 5 hours, 10 hours, etc.
In the embodiment of the application, after obtaining a plurality of first positive sample data and a plurality of first negative sample data, the target financial institution platform may perform a plurality of rounds of iterative training on the preset machine learning model according to the plurality of first positive sample data and the plurality of first negative sample data until reaching a second preset stop condition, thereby obtaining a first label prediction model, and perform a plurality of rounds of iterative training on the preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data until reaching the second preset stop condition, thereby obtaining a second label prediction model.
When performing the multiple rounds of iterative training on the preset machine learning model and on the preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data, the target financial institution platform may use existing machine learning and deep learning model training methods; the details are not repeated in the embodiments of the present application.
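The two parallel training loops can be sketched as below. The single-round fit functions are caller-supplied stubs (real choices could be a LightGBM classifier and a neural network, per the algorithm examples above), and only the round-count form of the second preset stop condition is shown.

```python
def train_two_models(positives, negatives, fit_ml, fit_dl, max_rounds=10):
    """Train the preset machine learning model and the preset deep
    learning model on the same labeled pool for multiple rounds, stopping
    at the round-count threshold (second preset stop condition)."""
    data = [(s, 1) for s in positives] + [(s, 0) for s in negatives]
    ml_model = dl_model = None
    for _ in range(max_rounds):
        ml_model = fit_ml(ml_model, data)  # one training round, ML model
        dl_model = fit_dl(dl_model, data)  # one training round, DL model
    return ml_model, dl_model  # first and second label prediction models

# Counting stubs standing in for real single-round training steps.
def fit_stub(name, counter):
    def fit(model, data):
        counter[name] += 1
        return (name, counter[name], len(data))
    return fit

counts = {"ml": 0, "dl": 0}
first_model, second_model = train_two_models(
    positives=["a", "b"], negatives=["c"],
    fit_ml=fit_stub("ml", counts), fit_dl=fit_stub("dl", counts),
    max_rounds=5)
```

Training two models of different families on the same pool is what later lets the method cross-check their rankings when labeling the remaining samples.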
105. And performing multi-round labeling processing on the plurality of residual unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
The plurality of remaining unlabeled sample data are the sample data that were not labeled in step 103; the third preset stop condition is that each remaining unlabeled sample data has been labeled.
In the embodiment of the application, after the target financial institution platform trains and obtains the first label prediction model and the second label prediction model based on the first positive sample data and the negative sample data, the target financial institution platform can perform multiple rounds of labeling processing on the remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached, so that the labeling work on all the unlabeled sample data is completed.
The embodiment of the application provides a sample data processing method. After the target financial institution platform acquires a sample data set containing a plurality of unlabeled sample data, it inputs each unlabeled sample data into a preset scoring model to obtain a scoring result corresponding to each unlabeled sample data, and performs multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data. It then performs multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data until a second preset stop condition is reached, so as to obtain a first label prediction model and a second label prediction model. Finally, it performs multiple rounds of labeling processing on the plurality of remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached, thereby completing the labeling of all the unlabeled sample data. That is, through these two successive multi-round labeling processes, the target financial institution platform can efficiently label a large amount of unlabeled sample data.
To explain in more detail, an embodiment of the present application provides another sample data processing method, specifically as shown in fig. 2, and the method includes:
201. and acquiring a sample data set.
For step 201, the acquisition of the sample data set may refer to the description of the corresponding part in fig. 1 and will not be repeated here in this embodiment of the present application.
202. And respectively inputting each unlabeled sample data into a preset scoring model to obtain a scoring result corresponding to each unlabeled sample data.
For step 202, inputting each unlabeled sample data into the preset scoring model to obtain the corresponding scoring result may refer to the description of the corresponding part in fig. 1 and will not be repeated here in this embodiment of the present application.
203. And performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the grading result corresponding to each unlabeled sample data until a first preset stopping condition is reached so as to obtain a plurality of first positive sample data and a plurality of first negative sample data.
In the embodiment of the application, after obtaining the scoring result corresponding to each unmarked sample data, the target financial institution platform may perform multiple rounds of labeling processing on the multiple unmarked sample data according to the scoring result corresponding to each unmarked sample data until reaching a first preset stop condition, thereby obtaining multiple first positive sample data and multiple first negative sample data.
Specifically, in this step, the target financial institution platform may perform multiple rounds of tagging on the multiple unlabeled sample data according to the scoring result corresponding to each unlabeled sample data in the following manner until a first preset stop condition is reached, so as to obtain multiple first positive sample data and multiple first negative sample data:
for each round of labeling process:
(1) forward sorting the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data to obtain a first sequence;
(2a) firstly, acquiring the X top-ranked unlabeled sample data in the first sequence; secondly, clustering the X unlabeled sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; thirdly, calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data, that is, for any cluster, summing the scoring results corresponding to the unlabeled sample data in the cluster and dividing the sum by the number of unlabeled sample data in the cluster, so as to obtain the average score corresponding to the cluster; finally, labeling each unlabeled sample data contained in the cluster with the highest average score, namely labeling each such unlabeled sample data with a positive label (namely a high-quality client label), so as to obtain a plurality of first positive sample data;
(2b) firstly, acquiring the Y last-ranked unlabeled sample data in the first sequence; secondly, clustering the Y unlabeled sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; thirdly, calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data, that is, for any cluster, summing the scoring results corresponding to the unlabeled sample data in the cluster and dividing the sum by the number of unlabeled sample data in the cluster, so as to obtain the average score corresponding to the cluster; finally, labeling each unlabeled sample data contained in the cluster with the lowest average score, namely labeling each such unlabeled sample data with a negative label (namely an inferior client label), so as to obtain a plurality of first negative sample data.
The preset unsupervised clustering algorithm may be, but is not limited to: a K-means clustering algorithm, a hierarchical clustering algorithm, and the like; X may be, but is not limited to: 50, 100, 200, etc., and Y may be, but is not limited to: 50, 100, 200, etc.
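One labeling round of steps (1)-(2b) can be sketched as follows, assuming scikit-learn is available and the scoring results have already been computed. `KMeans` stands in for the preset unsupervised clustering algorithm; the feature matrix, scoring function, and the values of X, Y, and the cluster count are all invented for the sketch.

```python
# Hypothetical sketch of one round of the first labeling phase: sort by
# score, cluster the head and the tail of the ranking, and label the
# extreme-average clusters. Data and parameters are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n = 300
features = rng.normal(size=(n, 4))     # unlabeled sample data
scores = features.sum(axis=1)          # stand-in for the preset scoring model's results

# (1) forward (descending) sort by score to obtain the first sequence.
order = np.argsort(-scores)
X_cnt, Y_cnt = 100, 100                # the patent's X and Y (e.g. 50/100/200)
top_idx, bottom_idx = order[:X_cnt], order[-Y_cnt:]

def label_extreme_cluster(idx, pick_highest, k=3):
    """Cluster the selected samples, average the scores per cluster, and
    return the indices of the cluster with the highest (or lowest) mean."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features[idx])
    means = [scores[idx[labels == c]].mean() for c in range(k)]
    target = int(np.argmax(means)) if pick_highest else int(np.argmin(means))
    return idx[labels == target]

first_positive = label_extreme_cluster(top_idx, pick_highest=True)      # positive labels (high-quality clients)
first_negative = label_extreme_cluster(bottom_idx, pick_highest=False)  # negative labels (inferior clients)
```

Only the cluster with the most extreme average score is labeled in each round, which is why multiple rounds are needed before the first preset stop condition is reached.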
Further, in this embodiment of the present application, after the target financial institution platform inputs each unlabeled sample data into the preset scoring model to obtain the corresponding scoring result, it may additionally apply a preset dimensionality reduction algorithm to each unlabeled sample data, so as to merge or delete part of the features of each unlabeled sample data. This allows the platform to cluster the dimension-reduced unlabeled sample data more accurately during the multiple rounds of labeling performed according to their scoring results. The preset dimensionality reduction algorithm may be, but is not limited to: a principal component analysis (PCA) algorithm, a Lasso algorithm, a linear discriminant analysis (LDA) algorithm, and the like.
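The optional dimensionality-reduction step can be sketched with PCA, one of the algorithms named above; the feature count and component count are arbitrary illustrative choices.

```python
# Minimal sketch of the optional dimensionality reduction before clustering;
# the raw feature matrix and component count are illustrative only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
raw = rng.normal(size=(200, 10))                  # unlabeled samples with 10 raw features

reduced = PCA(n_components=4).fit_transform(raw)  # merges correlated features into 4 components
```

The reduced matrix would then replace the raw features in the clustering of steps (2a) and (2b).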
204. And performing multiple rounds of iterative training on the preset machine learning model and the preset deep learning model according to the first positive sample data and the negative sample data until a second preset stop condition is reached to obtain a first label prediction model and a second label prediction model.
For step 204, performing multiple rounds of iterative training on the preset machine learning model and the preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data until the second preset stop condition is reached, so as to obtain the first label prediction model and the second label prediction model, may refer to the description of the corresponding part in fig. 1 and will not be repeated here in this embodiment of the present application.
205. And performing multi-round labeling processing on the plurality of residual unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
In the embodiment of the application, after the target financial institution platform trains and obtains the first label prediction model and the second label prediction model based on the first positive sample data and the negative sample data, the target financial institution platform can perform multiple rounds of labeling processing on the remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached, so that the labeling work on all the unlabeled sample data is completed.
Specifically, in this step, the target financial institution platform may perform multiple rounds of tagging on a plurality of remaining unlabeled sample data according to the first tag prediction model and the second tag prediction model in the following manner until a third preset stop condition is reached:
for each round of labeling process:
(1) firstly, inputting each residual unmarked sample data into a first label prediction model respectively so as to obtain a first prediction probability corresponding to each residual unmarked sample data, and then carrying out forward sequencing on a plurality of residual unmarked sample data according to the first prediction probability corresponding to each residual unmarked sample data so as to obtain a second sequence, wherein the first prediction probability corresponding to any residual unmarked sample data is used for indicating the probability that the residual unmarked sample data should be marked with a negative label (namely, an inferior client label);
(2) inputting each residual unmarked sample data into a second label prediction model respectively to obtain a second prediction probability corresponding to each residual unmarked sample data, and then carrying out forward sequencing on a plurality of residual unmarked sample data according to the second prediction probability corresponding to each residual unmarked sample data to obtain a third sequence, wherein the second prediction probability corresponding to any residual unmarked sample data is used for indicating the probability that the residual unmarked sample data should be marked with a negative label (namely an inferior client label);
(3) firstly, determining the intersection of the A remaining unlabeled sample data ranked first in the second sequence and the A remaining unlabeled sample data ranked first in the third sequence as first target sample data, and then determining the intersection of the B remaining unlabeled sample data ranked last in the second sequence and the B remaining unlabeled sample data ranked last in the third sequence as second target sample data, where A and B may each be, but are not limited to, positive integers such as 50, 100, 200;
(4) firstly, clustering the plurality of first target sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; secondly, calculating the average score corresponding to each cluster according to the scoring result corresponding to each first target sample data, that is, for any cluster, summing the scoring results corresponding to the first target sample data in the cluster and dividing the sum by the number of first target sample data in the cluster, so as to obtain the average score corresponding to the cluster; finally, labeling each first target sample data contained in the cluster with the lowest average score, that is, labeling each such first target sample data with a negative label (i.e., an inferior customer label), thereby obtaining a plurality of second negative sample data, where the preset unsupervised clustering algorithm may be, but is not limited to: a K-means clustering algorithm, a hierarchical clustering algorithm, and the like;
(5) firstly, clustering the plurality of second target sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; secondly, calculating the average score corresponding to each cluster according to the scoring result corresponding to each second target sample data, that is, for any cluster, summing the scoring results corresponding to the second target sample data in the cluster and dividing the sum by the number of second target sample data in the cluster, so as to obtain the average score corresponding to the cluster; finally, labeling each second target sample data contained in the cluster with the highest average score, namely labeling each such second target sample data with a positive label (namely a high-quality client label), so as to obtain a plurality of second positive sample data;
(6) the first label prediction model is retrained using the plurality of first positive sample data, the plurality of second positive sample data, the plurality of first negative sample data, and the plurality of second negative sample data, and the second label prediction model is retrained using the plurality of first positive sample data, the plurality of second positive sample data, the plurality of first negative sample data, and the plurality of second negative sample data.
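Steps (1)-(3) of this second labeling phase can be sketched as follows. The two trained label prediction models are simulated here by two noisy scoring functions over a synthetic risk variable; A, B, the sample count, and the noise level are all illustrative assumptions.

```python
# Hedged sketch of one round of step 205: rank the remaining unlabeled
# samples by each model's predicted negative-label probability, then
# intersect the heads and tails of the two rankings to pick the target
# sample data. All data and parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(3)
n = 500
true_risk = rng.uniform(size=n)                               # hidden "inferior client" tendency
p_neg_1 = np.clip(true_risk + rng.normal(0, 0.05, n), 0, 1)   # first model's P(negative)
p_neg_2 = np.clip(true_risk + rng.normal(0, 0.05, n), 0, 1)   # second model's P(negative)

A = B = 100
seq_2 = np.argsort(-p_neg_1)   # second sequence: most-likely-negative first
seq_3 = np.argsort(-p_neg_2)   # third sequence

# Head intersection -> candidates for negative labels (first target sample data);
# tail intersection -> candidates for positive labels (second target sample data).
first_target = np.intersect1d(seq_2[:A], seq_3[:A])
second_target = np.intersect1d(seq_2[-B:], seq_3[-B:])
```

Steps (4)-(5) would then cluster `first_target` and `second_target` exactly as in the first labeling phase, and step (6) would retrain both models on the enlarged labeled set before the next round.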
In order to achieve the above object, according to another aspect of the present application, an embodiment of the present application further provides a storage medium, where the storage medium includes a stored program, and when the program runs, a device in which the storage medium is located is controlled to execute the above sample data processing method.
In order to achieve the above object, according to another aspect of the present application, an embodiment of the present application further provides a device for processing sample data, where the device includes a storage medium; and one or more processors, the storage medium coupled with the processors, the processors configured to execute program instructions stored in the storage medium; and when the program instruction runs, executing the sample data processing method.
Further, as an implementation of the method shown in fig. 1 and fig. 2, another embodiment of the present application further provides a device for processing sample data. The embodiment of the apparatus corresponds to the embodiment of the method, and for convenience of reading, details in the embodiment of the apparatus are not repeated one by one, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the embodiment of the method. The device is applied to efficiently labeling a large amount of unlabeled sample data, and specifically as shown in fig. 3, the device comprises:
an obtaining unit 31, configured to obtain a sample data set, where the sample data set includes a plurality of unlabeled sample data without labels;
the input unit 32 is configured to input each unlabeled sample data into a preset scoring model, so as to obtain a scoring result corresponding to each unlabeled sample data;
a first labeling unit 33, configured to perform multiple rounds of labeling processing on the multiple unlabeled sample data according to a scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain multiple first positive sample data and multiple first negative sample data;
a training unit 34, configured to perform multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to a plurality of first positive sample data and a plurality of first negative sample data until a second preset stop condition is reached, so as to obtain a first label prediction model and a second label prediction model;
and the second labeling unit 35 is configured to perform multiple rounds of labeling processing on multiple remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
Further, as shown in fig. 4, the first labeling unit 33 is specifically configured to, for each round of labeling processing:
according to a grading result corresponding to each unmarked sample data, carrying out forward sequencing on the unmarked sample data to obtain a first sequence;
acquiring the X top-ranked unlabeled sample data in the first sequence; clustering the X unlabeled sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the highest average score to obtain a plurality of first positive sample data;
acquiring the Y last-ranked unlabeled sample data in the first sequence; clustering the Y unlabeled sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the lowest average score to obtain a plurality of first negative sample data.
Further, as shown in fig. 4, the second labeling unit 35 is specifically configured to, for each round of labeling processing:
respectively inputting each residual unmarked sample data into the first label prediction model to obtain a first prediction probability corresponding to each residual unmarked sample data, and performing forward sequencing on the residual unmarked sample data according to the first prediction probability corresponding to each residual unmarked sample data to obtain a second sequence, wherein the first prediction probability is used for indicating the probability that the residual unmarked sample data should be marked with a negative label;
inputting each residual unmarked sample data into the second label prediction model respectively to obtain a second prediction probability corresponding to each residual unmarked sample data, and performing forward sequencing on the residual unmarked sample data according to the second prediction probability corresponding to each residual unmarked sample data to obtain a third sequence, wherein the second prediction probability is used for indicating the probability that the residual unmarked sample data should be marked with a negative label;
determining the intersection of the A remaining unlabeled sample data ranked first in the second sequence and the A remaining unlabeled sample data ranked first in the third sequence as first target sample data, and determining the intersection of the B remaining unlabeled sample data ranked last in the second sequence and the B remaining unlabeled sample data ranked last in the third sequence as second target sample data;
clustering the first target sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each first target sample data; labeling each first target sample data contained in the cluster with the lowest average score to obtain a plurality of second negative sample data;
clustering a plurality of second target sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each second target sample data; labeling each second target sample data contained in the cluster with the highest average score to obtain a plurality of second positive sample data;
training the first label prediction model and the second label prediction model using a plurality of the first positive sample data, a plurality of the second positive sample data, a plurality of the first negative sample data, and a plurality of the second negative sample data.
Further, as shown in fig. 4, the apparatus further includes:
the dimension reduction unit 36 is configured to, after the input unit 32 respectively inputs each unlabeled sample data into a preset scoring model to obtain a scoring result corresponding to each unlabeled sample data, perform dimension reduction processing on each unlabeled sample data by using a preset dimension reduction algorithm;
the first labeling unit 33 is specifically configured to perform multiple rounds of labeling processing on the unmarked sample data subjected to the dimension reduction processing according to a scoring result corresponding to each unmarked sample data subjected to the dimension reduction processing until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data.
Further, as shown in fig. 4, the first preset stop condition is any one of: the current labeling round number reaching a first preset round number threshold, the current labeling duration reaching a first preset duration threshold, and the ratio of labeled sample data to unlabeled sample data being greater than a preset threshold; the second preset stop condition is any one of: the current iteration training round number reaching a second preset round number threshold, and the current iteration training duration reaching a second preset duration threshold; the third preset stop condition is: each remaining unlabeled sample data has been labeled.
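The three stop conditions can be made concrete with a small helper; the threshold values below simply mirror the patent's example figures (10 rounds, 5 hours, etc.) and are hypothetical, not normative.

```python
# Illustrative check of the first and third preset stop conditions; the
# threshold values are the patent's example figures, not fixed choices.
def first_stop(round_no, elapsed_hours, labeled, unlabeled,
               max_rounds=10, max_hours=5, ratio_threshold=1.0):
    """First preset stop condition: any one of round count, duration,
    or labeled/unlabeled ratio exceeding its preset threshold."""
    return (round_no >= max_rounds
            or elapsed_hours >= max_hours
            or (unlabeled > 0 and labeled / unlabeled > ratio_threshold))

def third_stop(remaining_unlabeled):
    """Third preset stop condition: every remaining unlabeled sample
    data has been labeled."""
    return remaining_unlabeled == 0
```

The second preset stop condition has the same shape as `first_stop` with the ratio clause removed, applied to the iterative training loop instead of the labeling loop.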
The embodiment of the application provides a sample data processing method and device. After the target financial institution platform acquires a sample data set containing a plurality of unlabeled sample data, it inputs each unlabeled sample data into a preset scoring model to obtain a scoring result corresponding to each unlabeled sample data, and performs multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data. It then performs multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data until a second preset stop condition is reached, so as to obtain a first label prediction model and a second label prediction model. Finally, it performs multiple rounds of labeling processing on the plurality of remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached, thereby completing the labeling of all the unlabeled sample data. That is, through these two successive multi-round labeling processes, the target financial institution platform can efficiently label a large amount of unlabeled sample data.
The sample data processing device comprises a processor and a memory, wherein the acquisition unit, the input unit, the first labeling unit, the training unit, the second labeling unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. The kernel can be set to be one or more, and a large amount of sample data without labels can be effectively labeled by adjusting kernel parameters.
The embodiment of the application provides a storage medium, which includes a stored program, wherein when the program runs, a device where the storage medium is located is controlled to execute the sample data processing method.
The storage medium may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The embodiment of the application also provides a sample data processing device, which comprises a storage medium; and one or more processors, the storage medium coupled with the processors, the processors configured to execute program instructions stored in the storage medium; and when the program instruction runs, executing the sample data processing method.
The embodiment of the application provides equipment, the equipment comprises a processor, a memory and a program which is stored on the memory and can run on the processor, and the following steps are realized when the processor executes the program:
acquiring a sample data set, wherein the sample data set comprises a plurality of unlabeled sample data without labels;
inputting each unlabeled sample data into a preset scoring model respectively to obtain a scoring result corresponding to each unlabeled sample data;
performing multiple rounds of labeling processing on the multiple unlabeled sample data according to the grading result corresponding to each unlabeled sample data until a first preset stopping condition is reached so as to obtain multiple first positive sample data and multiple first negative sample data;
performing multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the first positive sample data and the first negative sample data until a second preset stop condition is reached to obtain a first label prediction model and a second label prediction model;
and performing multi-round labeling processing on a plurality of residual unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
Further, the performing multiple rounds of labeling processing on the multiple unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached to obtain multiple first positive sample data and multiple first negative sample data includes:
for each round of labeling process:
according to a grading result corresponding to each unmarked sample data, carrying out forward sequencing on the unmarked sample data to obtain a first sequence;
acquiring the X top-ranked unlabeled sample data in the first sequence; clustering the X unlabeled sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the highest average score to obtain a plurality of first positive sample data;
acquiring the Y last-ranked unlabeled sample data in the first sequence; clustering the Y unlabeled sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the lowest average score to obtain a plurality of first negative sample data.
Further, the performing multiple rounds of labeling processing on multiple remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached includes:
for each round of labeling process:
respectively inputting each residual unmarked sample data into the first label prediction model to obtain a first prediction probability corresponding to each residual unmarked sample data, and performing forward sequencing on the residual unmarked sample data according to the first prediction probability corresponding to each residual unmarked sample data to obtain a second sequence, wherein the first prediction probability is used for indicating the probability that the residual unmarked sample data should be marked with a negative label;
inputting each residual unmarked sample data into the second label prediction model respectively to obtain a second prediction probability corresponding to each residual unmarked sample data, and performing forward sequencing on the residual unmarked sample data according to the second prediction probability corresponding to each residual unmarked sample data to obtain a third sequence, wherein the second prediction probability is used for indicating the probability that the residual unmarked sample data should be marked with a negative label;
determining the intersection of the A remaining unlabeled sample data ranked first in the second sequence and the A remaining unlabeled sample data ranked first in the third sequence as first target sample data, and determining the intersection of the B remaining unlabeled sample data ranked last in the second sequence and the B remaining unlabeled sample data ranked last in the third sequence as second target sample data;
clustering the first target sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each first target sample data; labeling each first target sample data contained in the cluster with the lowest average score to obtain a plurality of second negative sample data;
clustering a plurality of second target sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each second target sample data; labeling each second target sample data contained in the cluster with the highest average score to obtain a plurality of second positive sample data;
training the first label prediction model and the second label prediction model using a plurality of the first positive sample data, a plurality of the second positive sample data, a plurality of the first negative sample data, and a plurality of the second negative sample data.
Further, after the step of inputting each unlabeled sample data into a preset scoring model respectively to obtain a scoring result corresponding to each unlabeled sample data, the method further includes:
performing dimensionality reduction processing on each unmarked sample data by using a preset dimensionality reduction algorithm;
performing multiple rounds of labeling processing on the multiple unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached to obtain multiple first positive sample data and multiple first negative sample data, including:
performing multiple rounds of labeling processing on the unlabeled sample data subjected to the dimensionality reduction processing according to the scoring result corresponding to each such unlabeled sample data until the first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data.
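A minimal, dependency-free sketch of this dimensionality-reduction step: the application names only a "preset dimensionality reduction algorithm" (PCA would be a typical choice), so the example below simply keeps the k highest-variance features instead; the function name and the parameter k are illustrative assumptions.

```python
# Hedged stand-in for the "preset dimensionality reduction algorithm":
# keep the k features with the largest variance across the sample set.

def reduce_dimensions(samples, k):
    """samples: list of equal-length feature lists; returns the same rows
    restricted to the k highest-variance feature columns."""
    n = len(samples)
    dims = len(samples[0])
    variances = []
    for d in range(dims):
        col = [s[d] for s in samples]
        mean = sum(col) / n
        variances.append(sum((v - mean) ** 2 for v in col) / n)
    # Indices of the k largest-variance features, restored to column order.
    keep = sorted(sorted(range(dims), key=lambda d: -variances[d])[:k])
    return [[s[d] for d in keep] for s in samples]
```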
Further, the first preset stop condition is any one of the following: the current number of labeling rounds reaches a first preset round-number threshold; the current labeling duration reaches a first preset duration threshold; or the ratio of labeled sample data to unlabeled sample data is greater than a preset ratio threshold. The second preset stop condition is either of the following: the current number of iterative training rounds reaches a second preset round-number threshold, or the current iterative training duration reaches a second preset duration threshold. The third preset stop condition is: each remaining unlabeled sample data has been labeled.
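The three stop conditions above can be expressed as simple predicates. The threshold names and the injectable `now` parameter below are illustrative assumptions, not values given in the application:

```python
import time

# Hedged sketch of the three preset stop conditions described above.

def first_stop(round_no, start_ts, labeled_n, unlabeled_n,
               max_rounds, max_seconds, max_ratio, now=None):
    # Any one of: round count, elapsed time, or labeled/unlabeled ratio.
    now = time.time() if now is None else now
    return (round_no >= max_rounds
            or now - start_ts >= max_seconds
            or (unlabeled_n > 0 and labeled_n / unlabeled_n > max_ratio))

def second_stop(iter_no, start_ts, max_iters, max_seconds, now=None):
    # Either of: iterative-training round count or elapsed training time.
    now = time.time() if now is None else now
    return iter_no >= max_iters or now - start_ts >= max_seconds

def third_stop(remaining_unlabeled):
    # Stop once every remaining unlabeled sample has been labeled.
    return len(remaining_unlabeled) == 0
```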
The present application further provides a computer program product adapted, when executed on a data processing device, to run program code that initializes the following method steps: acquiring a sample data set, wherein the sample data set comprises a plurality of unlabeled sample data without labels; inputting each unlabeled sample data into a preset scoring model respectively to obtain a scoring result corresponding to each unlabeled sample data; performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data; performing multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the first positive sample data and the first negative sample data until a second preset stop condition is reached, so as to obtain a first label prediction model and a second label prediction model; and performing multiple rounds of labeling processing on a plurality of remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for processing sample data, the method comprising:
acquiring a sample data set, wherein the sample data set comprises a plurality of unlabeled sample data without labels;
inputting each unlabeled sample data into a preset scoring model respectively to obtain a scoring result corresponding to each unlabeled sample data;
performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data;
performing multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the first positive sample data and the first negative sample data until a second preset stop condition is reached to obtain a first label prediction model and a second label prediction model;
and performing multiple rounds of labeling processing on a plurality of remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
2. The method according to claim 1, wherein said performing multiple rounds of labeling on a plurality of unlabeled sample data according to a scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached to obtain a plurality of first positive sample data and a plurality of first negative sample data comprises:
for each round of labeling process:
performing forward sequencing on the unlabeled sample data according to the scoring result corresponding to each unlabeled sample data to obtain a first sequence;
acquiring the X top-ranked unlabeled sample data in the first sequence; clustering the X unlabeled sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; labeling each unlabeled sample data contained in the cluster with the highest average score to obtain a plurality of first positive sample data;
acquiring the Y bottom-ranked unlabeled sample data in the first sequence; clustering the Y unlabeled sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the lowest average score to obtain a plurality of first negative sample data.
3. The method of claim 1, wherein the performing multiple rounds of labeling on the plurality of remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached comprises:
for each round of labeling process:
respectively inputting each remaining unlabeled sample data into the first label prediction model to obtain a first prediction probability corresponding to each remaining unlabeled sample data, and performing forward sequencing on the remaining unlabeled sample data according to the first prediction probability corresponding to each remaining unlabeled sample data to obtain a second sequence, wherein the first prediction probability is used for indicating the probability that the remaining unlabeled sample data should be marked with a negative label;
respectively inputting each remaining unlabeled sample data into the second label prediction model to obtain a second prediction probability corresponding to each remaining unlabeled sample data, and performing forward sequencing on the remaining unlabeled sample data according to the second prediction probability corresponding to each remaining unlabeled sample data to obtain a third sequence, wherein the second prediction probability is used for indicating the probability that the remaining unlabeled sample data should be marked with a negative label;
determining the intersection of the A top-ranked remaining unlabeled sample data in the second sequence and the A top-ranked remaining unlabeled sample data in the third sequence as first target sample data, and determining the intersection of the B bottom-ranked remaining unlabeled sample data in the second sequence and the B bottom-ranked remaining unlabeled sample data in the third sequence as second target sample data;
clustering the first target sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each first target sample data; labeling each first target sample data contained in the cluster with the lowest average score to obtain a plurality of second negative sample data;
clustering a plurality of second target sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each second target sample data; labeling each second target sample data contained in the cluster with the highest average score to obtain a plurality of second positive sample data;
training the first label prediction model and the second label prediction model using a plurality of the first positive sample data, a plurality of the second positive sample data, a plurality of the first negative sample data, and a plurality of the second negative sample data.
4. The method according to claim 1, wherein after said inputting each said unlabeled sample data into a preset scoring model respectively to obtain a scoring result corresponding to each said unlabeled sample data, the method further comprises:
performing dimensionality reduction processing on each unlabeled sample data by using a preset dimensionality reduction algorithm;
performing multiple rounds of labeling processing on the multiple unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached to obtain multiple first positive sample data and multiple first negative sample data, including:
performing multiple rounds of labeling processing on the unlabeled sample data subjected to the dimensionality reduction processing according to the scoring result corresponding to each such unlabeled sample data until the first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data.
5. The method according to claim 1, wherein the first preset stop condition is any one of the following: the current number of labeling rounds reaches a first preset round-number threshold; the current labeling duration reaches a first preset duration threshold; or the ratio of labeled sample data to unlabeled sample data is greater than a preset ratio threshold; the second preset stop condition is either of the following: the current number of iterative training rounds reaches a second preset round-number threshold, or the current iterative training duration reaches a second preset duration threshold; and the third preset stop condition is: each remaining unlabeled sample data has been labeled.
6. An apparatus for processing sample data, the apparatus comprising:
an acquisition unit, used for acquiring a sample data set, wherein the sample data set comprises a plurality of unlabeled sample data without labels;
the input unit is used for respectively inputting each unlabeled sample data into a preset scoring model so as to obtain a scoring result corresponding to each unlabeled sample data;
the first labeling unit is used for performing multi-round labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stopping condition is reached so as to obtain a plurality of first positive sample data and a plurality of first negative sample data;
the training unit is used for performing multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the first positive sample data and the first negative sample data until a second preset stop condition is reached, so as to obtain a first label prediction model and a second label prediction model;
and the second labeling unit is used for performing multiple rounds of labeling processing on a plurality of remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
7. The apparatus of claim 6,
the first labeling unit is specifically configured to, for each round of labeling processing:
performing forward sequencing on the unlabeled sample data according to the scoring result corresponding to each unlabeled sample data to obtain a first sequence;
acquiring the X top-ranked unlabeled sample data in the first sequence; clustering the X unlabeled sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; labeling each unlabeled sample data contained in the cluster with the highest average score to obtain a plurality of first positive sample data;
acquiring the Y bottom-ranked unlabeled sample data in the first sequence; clustering the Y unlabeled sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the lowest average score to obtain a plurality of first negative sample data.
8. The apparatus of claim 6,
the second labeling unit is specifically configured to, for each round of labeling processing:
respectively inputting each remaining unlabeled sample data into the first label prediction model to obtain a first prediction probability corresponding to each remaining unlabeled sample data, and performing forward sequencing on the remaining unlabeled sample data according to the first prediction probability corresponding to each remaining unlabeled sample data to obtain a second sequence, wherein the first prediction probability is used for indicating the probability that the remaining unlabeled sample data should be marked with a negative label;
respectively inputting each remaining unlabeled sample data into the second label prediction model to obtain a second prediction probability corresponding to each remaining unlabeled sample data, and performing forward sequencing on the remaining unlabeled sample data according to the second prediction probability corresponding to each remaining unlabeled sample data to obtain a third sequence, wherein the second prediction probability is used for indicating the probability that the remaining unlabeled sample data should be marked with a negative label;
determining the intersection of the A top-ranked remaining unlabeled sample data in the second sequence and the A top-ranked remaining unlabeled sample data in the third sequence as first target sample data, and determining the intersection of the B bottom-ranked remaining unlabeled sample data in the second sequence and the B bottom-ranked remaining unlabeled sample data in the third sequence as second target sample data;
clustering the first target sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each first target sample data; labeling each first target sample data contained in the cluster with the lowest average score to obtain a plurality of second negative sample data;
clustering a plurality of second target sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each second target sample data; labeling each second target sample data contained in the cluster with the highest average score to obtain a plurality of second positive sample data;
training the first label prediction model and the second label prediction model using a plurality of the first positive sample data, a plurality of the second positive sample data, a plurality of the first negative sample data, and a plurality of the second negative sample data.
9. A storage medium, comprising a stored program, wherein when the program is run, a device on which the storage medium is located is controlled to execute the method for processing sample data according to any one of claims 1 to 5.
10. An apparatus for processing sample data, the apparatus comprising a storage medium; and one or more processors, the storage medium coupled with the processors, the processors configured to execute program instructions stored in the storage medium; the program instructions when executed perform a method of processing sample data as claimed in any one of claims 1 to 5.
CN202111107477.6A 2021-09-22 2021-09-22 Sample data processing method and device Active CN113919936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111107477.6A CN113919936B (en) 2021-09-22 2021-09-22 Sample data processing method and device

Publications (2)

Publication Number Publication Date
CN113919936A true CN113919936A (en) 2022-01-11
CN113919936B CN113919936B (en) 2022-08-05

Family

ID=79235538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111107477.6A Active CN113919936B (en) 2021-09-22 2021-09-22 Sample data processing method and device

Country Status (1)

Country Link
CN (1) CN113919936B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114418752A (en) * 2022-03-28 2022-04-29 北京芯盾时代科技有限公司 Method and device for processing user data without type label, electronic equipment and medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109615089A (en) * 2018-10-12 2019-04-12 国网浙江省电力有限公司衢州供电公司 Power information acquires the generation method of label and the work order distributing method based on this
CN110070183A (en) * 2019-03-11 2019-07-30 中国科学院信息工程研究所 A kind of the neural network model training method and device of weak labeled data
US20190279297A1 (en) * 2017-02-08 2019-09-12 Tencent Technology (Shenzhen) Company Limited Credit scoring method and server
CN111046952A (en) * 2019-12-12 2020-04-21 深圳市随手金服信息科技有限公司 Method and device for establishing label mining model, storage medium and terminal
CN111143865A (en) * 2019-12-26 2020-05-12 国网湖北省电力有限公司 User behavior analysis system and method for automatically generating label on ciphertext data
CN111209929A (en) * 2019-12-19 2020-05-29 平安信托有限责任公司 Access data processing method and device, computer equipment and storage medium
US20200382536A1 (en) * 2019-05-31 2020-12-03 Gurucul Solutions, Llc Anomaly detection in cybersecurity and fraud applications
CN112288453A (en) * 2019-07-23 2021-01-29 北京京东尚科信息技术有限公司 Label selection method and device
US20210150631A1 (en) * 2019-11-19 2021-05-20 Intuit Inc. Machine learning approach to automatically disambiguate ambiguous electronic transaction labels
US20210158195A1 (en) * 2019-11-26 2021-05-27 International Business Machines Corporation Data label verification
CN113128536A (en) * 2019-12-31 2021-07-16 奇安信科技集团股份有限公司 Unsupervised learning method, system, computer device and readable storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Tang Xiaobo et al., "Research on Doctor Profile Construction Based on Feature Analysis and Label Extraction", Information Science *
Yang Wenhao et al., "News Label Classification Based on BERT and Deep Equal-Length Convolution", Computer and Modernization *


Also Published As

Publication number Publication date
CN113919936B (en) 2022-08-05

Similar Documents

Publication Publication Date Title
CN110348580B (en) Method and device for constructing GBDT model, and prediction method and device
CN108389125B (en) Overdue risk prediction method and device for credit application
US20210271809A1 (en) Machine learning process implementation method and apparatus, device, and storage medium
CN108418825B (en) Risk model training and junk account detection methods, devices and equipment
CN110533018B (en) Image classification method and device
CN111783993A (en) Intelligent labeling method and device, intelligent platform and storage medium
CN112766619B (en) Commodity time sequence data prediction method and system
CN109063743B (en) Construction method of medical data classification model based on semi-supervised multitask learning
CN113435998B (en) Loan overdue prediction method and device, electronic equipment and storage medium
CN113177700B (en) Risk assessment method, system, electronic equipment and storage medium
Wu et al. Optimized deep learning framework for water distribution data-driven modeling
WO2019000293A1 (en) Techniques for dense video descriptions
CN113919936B (en) Sample data processing method and device
CN109597982B (en) Abstract text recognition method and device
US20200257984A1 (en) Systems and methods for domain adaptation
CN111126038B (en) Information acquisition model generation method and device and information acquisition method and device
Samsani et al. A real-time automatic human facial expression recognition system using deep neural networks
CN117523218A (en) Label generation, training of image classification model and image classification method and device
CN113870005A (en) Method and device for determining hyper-parameters
CN115115920A (en) Data training method and device
CN113987170A (en) Multi-label text classification method based on convolutional neural network
CN113377960A (en) Analysis method, processor and device for platform commodity comments
CN113487453A (en) Legal judgment prediction method and system based on criminal elements
KR20220075964A (en) Method and apparatus for performing labeling for training data
CN109558582B (en) Visual angle-based sentence emotion analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100000 floors 1-3, block a, global creative Plaza, No. 10, Furong street, Chaoyang District, Beijing

Patentee after: Bairong Zhixin (Beijing) Technology Co.,Ltd.

Address before: 100000 floors 1-3, block a, global creative Plaza, No. 10, Furong street, Chaoyang District, Beijing

Patentee before: Bairong Zhixin (Beijing) credit investigation Co.,Ltd.