CN113919936A - Sample data processing method and device - Google Patents

Sample data processing method and device

Info

Publication number
CN113919936A
Authority
CN
China
Prior art keywords
sample data
preset
unlabeled
labeling
unmarked
Prior art date
Legal status
Granted
Application number
CN202111107477.6A
Other languages
Chinese (zh)
Other versions
CN113919936B (en)
Inventor
王珍
孙祥坤
陈昶汝
杨丽娟
Current Assignee
Bairong Zhixin Beijing Technology Co ltd
Original Assignee
Bairong Zhixin Beijing Credit Investigation Co Ltd
Priority date
Filing date
Publication date
Application filed by Bairong Zhixin Beijing Credit Investigation Co., Ltd.
Priority to CN202111107477.6A
Publication of CN113919936A
Application granted
Publication of CN113919936B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The application discloses a sample data processing method and device, and relates to the technical field of data processing. The method of the present application comprises: acquiring a sample data set, wherein the sample data set comprises a plurality of unlabeled sample data; respectively inputting each unlabeled sample data into a preset scoring model to obtain a scoring result corresponding to each unlabeled sample data; performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data to obtain a plurality of first positive sample data and a plurality of first negative sample data; performing multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data to obtain a first label prediction model and a second label prediction model; and performing multiple rounds of labeling processing on a plurality of remaining unlabeled sample data according to the first label prediction model and the second label prediction model.

Description

Sample data processing method and device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for processing sample data.
Background
With the continuous development of society and the gradual upgrading of consumption habits, more and more people improve their living standards by taking out loans. When a loan applicant applies for a loan service on a financial institution platform, in order to reduce loan risk, the financial institution platform determines, according to the personal information data and behavioral data of the borrower, whether the borrower is likely to default on repayment, that is, whether the borrower is a bad customer.
At present, a financial institution platform usually trains a prediction model based on a large amount of labeled sample data (personal information data and/or behavioral performance data of historical borrowers, each labeled as a good customer or a bad customer), and then inputs the personal information data and/or behavioral performance data of a borrower to be evaluated into the prediction model, so that the model outputs the probability that the borrower to be evaluated is a bad customer. However, in the credit domain, unlabeled sample data is relatively easy to acquire, while labeling a large amount of unlabeled sample data consumes considerable manpower and material resources. Therefore, how a financial institution platform can efficiently label a large amount of unlabeled sample data is a problem that urgently needs to be solved.
Disclosure of Invention
The embodiments of the present application provide a sample data processing method and device, with the main aim of efficiently labeling a large amount of unlabeled sample data.
In order to solve the above technical problem, an embodiment of the present application provides the following technical solutions:
in a first aspect, the present application provides a method for processing sample data, including:
acquiring a sample data set, wherein the sample data set comprises a plurality of unlabeled sample data;
inputting each unlabeled sample data into a preset scoring model respectively to obtain a scoring result corresponding to each unlabeled sample data;
performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data;
performing multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the first positive sample data and the first negative sample data until a second preset stop condition is reached to obtain a first label prediction model and a second label prediction model;
and performing multi-round labeling processing on a plurality of residual unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
Optionally, the performing, according to a scoring result corresponding to each unlabeled sample data, multiple rounds of labeling processing on the unlabeled sample data until a first preset stop condition is reached to obtain multiple first positive sample data and multiple first negative sample data includes:
for each round of labeling process:
sorting the plurality of unlabeled sample data in descending order according to the scoring result corresponding to each unlabeled sample data, so as to obtain a first sequence;
acquiring the X unlabeled sample data ranked first in the first sequence; clustering the X unlabeled sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the highest average score with a positive label, so as to obtain a plurality of first positive sample data;
acquiring the Y unlabeled sample data ranked last in the first sequence; clustering the Y unlabeled sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the lowest average score with a negative label, so as to obtain a plurality of first negative sample data.
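The first-stage labeling round described above (sort by score, then cluster the two ends of the ranking and label only the most confident cluster at each end) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the sample structure, the scores, and the tiny 1-D k-means used as a stand-in for the "preset unsupervised clustering algorithm" are all assumptions.

```python
import random

def kmeans_1d(values, k, iters=20, seed=0):
    """Tiny 1-D k-means, standing in for the unspecified preset
    unsupervised clustering algorithm. Returns the non-empty clusters."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return [c for c in clusters if c]

def label_round(samples, x, y, k=3):
    """One labeling round: sort samples by score in descending order,
    cluster the top-X and bottom-Y ends, and label only the cluster with
    the highest (resp. lowest) average score."""
    ordered = sorted(samples, key=lambda s: s["score"], reverse=True)
    top, bottom = ordered[:x], ordered[-y:]

    def pick(group, want_high):
        clusters = kmeans_1d([s["score"] for s in group], k)
        averages = [sum(c) / len(c) for c in clusters]
        best = averages.index(max(averages) if want_high else min(averages))
        chosen = set(clusters[best])
        return [s for s in group if s["score"] in chosen]

    positives = pick(top, want_high=True)      # first positive sample data
    negatives = pick(bottom, want_high=False)  # first negative sample data
    return positives, negatives

# Hypothetical credit scores for ten unlabeled samples.
samples = [{"id": i, "score": s} for i, s in enumerate(
    [920, 900, 880, 610, 590, 550, 320, 300, 280, 260])]
pos, neg = label_round(samples, x=4, y=4)
```

Labeling only the extreme cluster at each end, rather than all top-X/bottom-Y samples, keeps borderline samples unlabeled for a later round.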
Optionally, the performing, according to the first label prediction model and the second label prediction model, multiple rounds of labeling processing on multiple remaining unlabeled sample data until a third preset stop condition is reached includes:
for each round of labeling process:
respectively inputting each remaining unlabeled sample data into the first label prediction model to obtain a first prediction probability corresponding to each remaining unlabeled sample data, and sorting the plurality of remaining unlabeled sample data in descending order according to the first prediction probability corresponding to each remaining unlabeled sample data to obtain a second sequence, wherein the first prediction probability indicates the probability that the remaining unlabeled sample data should be marked with a negative label;
respectively inputting each remaining unlabeled sample data into the second label prediction model to obtain a second prediction probability corresponding to each remaining unlabeled sample data, and sorting the plurality of remaining unlabeled sample data in descending order according to the second prediction probability corresponding to each remaining unlabeled sample data to obtain a third sequence, wherein the second prediction probability indicates the probability that the remaining unlabeled sample data should be marked with a negative label;
determining the intersection of the A remaining unlabeled sample data ranked first in the second sequence and the A remaining unlabeled sample data ranked first in the third sequence as first target sample data, and determining the intersection of the B remaining unlabeled sample data ranked last in the second sequence and the B remaining unlabeled sample data ranked last in the third sequence as second target sample data;
clustering the first target sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each first target sample data; and labeling each first target sample data contained in the cluster with the lowest average score with a negative label, so as to obtain a plurality of second negative sample data;
clustering the second target sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each second target sample data; and labeling each second target sample data contained in the cluster with the highest average score with a positive label, so as to obtain a plurality of second positive sample data;
training the first label prediction model and the second label prediction model using a plurality of the first positive sample data, a plurality of the second positive sample data, a plurality of the first negative sample data, and a plurality of the second negative sample data.
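The co-labeling selection step above can be sketched as below. The two trained label prediction models are represented here by stand-in probability functions, and the sample and threshold values are illustrative assumptions; in the patent's flow, each target set is additionally clustered before labels are assigned.

```python
def co_label_round(samples, prob_model_1, prob_model_2, a, b):
    """Select candidates both models agree on: each model ranks all
    remaining unlabeled samples by predicted negative-label probability
    (descending). The intersection of the two top-`a` sets becomes the
    first target (likely negative), and the intersection of the two
    bottom-`b` sets becomes the second target (likely positive)."""
    seq2 = sorted(samples, key=prob_model_1, reverse=True)  # second sequence
    seq3 = sorted(samples, key=prob_model_2, reverse=True)  # third sequence
    ids = lambda part: {s["id"] for s in part}
    first_target = ids(seq2[:a]) & ids(seq3[:a])     # agreed likely-negative
    second_target = ids(seq2[-b:]) & ids(seq3[-b:])  # agreed likely-positive
    return first_target, second_target

# Stand-ins for the first (machine learning) and second (deep learning)
# label prediction models: both roughly agree on the sample ordering.
samples = [{"id": i} for i in range(10)]
model_1 = lambda s: s["id"] / 10
model_2 = lambda s: (s["id"] + (0.5 if s["id"] % 2 else -0.5) * 0.1) / 10
first_target, second_target = co_label_round(samples, model_1, model_2, a=3, b=3)
```

Taking the intersection means a sample is only labeled when both independently trained models rank it the same way, which reduces the risk of propagating one model's mistakes.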
Optionally, after the step of respectively inputting each unlabeled sample data into a preset scoring model to obtain a scoring result corresponding to each unlabeled sample data, the method further includes:
performing dimensionality reduction processing on each unlabeled sample data by using a preset dimensionality reduction algorithm;
the performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data, includes:
performing multiple rounds of labeling processing on the unlabeled sample data subjected to the dimensionality reduction processing according to the scoring result corresponding to each unlabeled sample data subjected to the dimensionality reduction processing until the first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data.
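The patent does not name the preset dimensionality reduction algorithm; the sketch below uses simple variance-based feature selection as one plausible stand-in (PCA or an autoencoder would fit the same slot). The sample rows are hypothetical.

```python
def reduce_dims(rows, keep):
    """Keep the `keep` features with the highest variance across samples;
    a stand-in for the unspecified preset dimensionality reduction
    algorithm."""
    n, d = len(rows), len(rows[0])
    variances = []
    for j in range(d):
        column = [row[j] for row in rows]
        mean = sum(column) / n
        variances.append(sum((v - mean) ** 2 for v in column) / n)
    kept = sorted(range(d), key=lambda j: variances[j], reverse=True)[:keep]
    kept.sort()  # preserve the original feature order
    return [[row[j] for j in kept] for row in rows]

# Three hypothetical samples; the constant first feature carries no
# information and is dropped.
rows = [[1.0, 50.0, 3.0], [1.0, 10.0, 4.0], [1.0, 90.0, 5.0]]
reduced = reduce_dims(rows, keep=2)
```

Reducing the feature dimensionality before clustering matters because distance-based clustering degrades as uninformative dimensions accumulate.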
Optionally, the first preset stop condition is any one of the following: the current number of labeling rounds reaches a first preset round-count threshold; the current labeling duration reaches a first preset duration threshold; or the ratio of labeled sample data to unlabeled sample data is greater than a preset threshold. The second preset stop condition is any one of the following: the current number of iterative training rounds reaches a second preset round-count threshold; or the current iterative training duration reaches a second preset duration threshold. The third preset stop condition is: every remaining unlabeled sample data has been labeled.
In a second aspect, the present application further provides a device for processing sample data, where the device includes:
the device comprises an acquisition unit, a storage unit and a processing unit, wherein the acquisition unit is used for acquiring a sample data set, and the sample data set comprises a plurality of unlabeled sample data;
the input unit is used for respectively inputting each unlabeled sample data into a preset scoring model so as to obtain a scoring result corresponding to each unlabeled sample data;
the first labeling unit is used for performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data;
the training unit is used for performing multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data until a second preset stop condition is reached, so as to obtain a first label prediction model and a second label prediction model;
and the second labeling unit is used for performing multi-round labeling processing on a plurality of residual unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
Optionally, the first labeling unit is specifically configured to, for each round of labeling processing:
sorting the plurality of unlabeled sample data in descending order according to the scoring result corresponding to each unlabeled sample data, so as to obtain a first sequence;
acquiring the X unlabeled sample data ranked first in the first sequence; clustering the X unlabeled sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the highest average score with a positive label, so as to obtain a plurality of first positive sample data;
acquiring the Y unlabeled sample data ranked last in the first sequence; clustering the Y unlabeled sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the lowest average score with a negative label, so as to obtain a plurality of first negative sample data.
Optionally, the second labeling unit is specifically configured to, for each round of labeling processing:
respectively inputting each remaining unlabeled sample data into the first label prediction model to obtain a first prediction probability corresponding to each remaining unlabeled sample data, and sorting the plurality of remaining unlabeled sample data in descending order according to the first prediction probability corresponding to each remaining unlabeled sample data to obtain a second sequence, wherein the first prediction probability indicates the probability that the remaining unlabeled sample data should be marked with a negative label;
respectively inputting each remaining unlabeled sample data into the second label prediction model to obtain a second prediction probability corresponding to each remaining unlabeled sample data, and sorting the plurality of remaining unlabeled sample data in descending order according to the second prediction probability corresponding to each remaining unlabeled sample data to obtain a third sequence, wherein the second prediction probability indicates the probability that the remaining unlabeled sample data should be marked with a negative label;
determining the intersection of the A remaining unlabeled sample data ranked first in the second sequence and the A remaining unlabeled sample data ranked first in the third sequence as first target sample data, and determining the intersection of the B remaining unlabeled sample data ranked last in the second sequence and the B remaining unlabeled sample data ranked last in the third sequence as second target sample data;
clustering the first target sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each first target sample data; and labeling each first target sample data contained in the cluster with the lowest average score with a negative label, so as to obtain a plurality of second negative sample data;
clustering the second target sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each second target sample data; and labeling each second target sample data contained in the cluster with the highest average score with a positive label, so as to obtain a plurality of second positive sample data;
training the first label prediction model and the second label prediction model using a plurality of the first positive sample data, a plurality of the second positive sample data, a plurality of the first negative sample data, and a plurality of the second negative sample data.
Optionally, the apparatus further comprises:
the dimensionality reduction unit is used for performing, after the input unit respectively inputs each unlabeled sample data into the preset scoring model to obtain the scoring result corresponding to each unlabeled sample data, dimensionality reduction processing on each unlabeled sample data by using a preset dimensionality reduction algorithm;
the first labeling unit is specifically configured to perform multiple rounds of labeling processing on the unlabeled sample data subjected to the dimensionality reduction processing according to the scoring result corresponding to each unlabeled sample data subjected to the dimensionality reduction processing until the first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data.
Optionally, the first preset stop condition is any one of the following: the current number of labeling rounds reaches a first preset round-count threshold; the current labeling duration reaches a first preset duration threshold; or the ratio of labeled sample data to unlabeled sample data is greater than a preset threshold. The second preset stop condition is any one of the following: the current number of iterative training rounds reaches a second preset round-count threshold; or the current iterative training duration reaches a second preset duration threshold. The third preset stop condition is: every remaining unlabeled sample data has been labeled.
In a third aspect, an embodiment of the present application provides a storage medium, where the storage medium includes a stored program, and when the program runs, a device in which the storage medium is located is controlled to execute the method for processing sample data in the first aspect.
In a fourth aspect, an embodiment of the present application provides an apparatus for processing sample data, where the apparatus includes a storage medium; and one or more processors, the storage medium coupled with the processors, the processors configured to execute program instructions stored in the storage medium; the program instructions, when executed, implement the method for processing sample data of the first aspect.
By means of the technical scheme, the technical scheme provided by the application at least has the following advantages:
the application provides a sample data processing method and a device, which can respectively input each unmarked sample data into a preset scoring model by a target financial institution platform after the target financial institution platform acquires the sample data set containing a plurality of unmarked sample data to obtain a scoring result corresponding to each unmarked sample data, perform multi-round labeling processing on the plurality of unmarked sample data according to the scoring result corresponding to each unmarked sample data until a first preset stopping condition is reached to obtain a plurality of first positive sample data and a plurality of first negative sample data, perform multi-round iterative training on a preset machine learning model and a preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data until a second preset stopping condition is reached to obtain a first label prediction model and a second label prediction model, and finally, performing multi-round labeling processing on a plurality of residual unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached, thereby completing the operation of labeling all unlabeled sample data, namely, the target financial institution platform can efficiently label a large amount of unlabeled sample data through two multi-round labeling processing before and after.
The foregoing description is only an overview of the technical solutions of the present application. In order that the technical means of the present application may be more clearly understood and implemented according to the content of the specification, and in order to make the above and other objects, features and advantages of the present application more comprehensible, detailed embodiments of the present application are described below.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present application will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present application are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts:
fig. 1 is a flowchart illustrating a sample data processing method according to an embodiment of the present application;
fig. 2 is a flowchart illustrating another sample data processing method provided in an embodiment of the present application;
fig. 3 is a block diagram illustrating a sample data processing apparatus according to an embodiment of the present disclosure;
fig. 4 is a block diagram illustrating another sample data processing apparatus according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.
An embodiment of the present application provides a method for processing sample data, as shown in fig. 1, the method includes:
101. and acquiring a sample data set.
The acquired sample data set includes a plurality of unlabeled sample data, that is, sample data that carries no label. The unlabeled sample data may be, but is not limited to: personal information data and/or behavioral data corresponding to historical borrowers, where a historical borrower is a borrower who has applied for a loan on the target financial institution platform (or another financial institution platform).
In the embodiment of the present application, the target financial institution platform first needs to acquire a sample data set containing a plurality of unlabeled sample data.
102. And respectively inputting each unlabeled sample data into a preset scoring model to obtain a scoring result corresponding to each unlabeled sample data.
The preset scoring model is a model which is trained in advance by a target financial institution platform and is used for determining a scoring result corresponding to a borrower, and after personal information data and/or behavior performance data corresponding to a certain borrower are input into the preset scoring model, the preset scoring model can output the scoring result corresponding to the borrower.
In the embodiment of the application, after obtaining a sample data set including a plurality of unlabeled sample data, the target financial institution platform needs to input each unlabeled sample data into the preset scoring model, so that the preset scoring model outputs a scoring result corresponding to each unlabeled sample data, thereby obtaining a scoring result corresponding to each unlabeled sample data, where the scoring result corresponding to any one unlabeled sample data is specifically a credit score corresponding to the unlabeled sample data.
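As one illustration of what such a preset scoring model could look like, the sketch below uses a logistic scorecard mapped onto a credit-score scale. The weights, bias, feature meanings, and scaling constants are hypothetical, since the patent leaves the model family unspecified.

```python
import math

def credit_score(features, weights, bias, base=600.0, scale=50.0):
    """Score one unlabeled sample: a linear model squashed through a
    sigmoid gives the probability of a good customer, and the log-odds
    are mapped onto a conventional credit-score scale."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p_good = 1.0 / (1.0 + math.exp(-z))
    return base + scale * math.log(p_good / (1.0 - p_good))

# Hypothetical two-feature samples (e.g. normalized income, debt ratio);
# the weights and bias are assumed to come from prior training.
scores = [credit_score(f, weights=[2.0, -1.0], bias=0.5)
          for f in ([1.0, 0.2], [0.1, 0.9])]
```

A higher score would then correspond to a more likely good customer, which is the ordering the subsequent labeling rounds rely on.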
103. And performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the grading result corresponding to each unlabeled sample data until a first preset stopping condition is reached so as to obtain a plurality of first positive sample data and a plurality of first negative sample data.
The first positive sample data is sample data with a positive label (namely a high-quality customer label), and the first negative sample data is sample data with a negative label (namely a low-quality customer label). The first preset stop condition may be, but is not limited to, any one of the following: the current number of labeling rounds reaches a first preset round-count threshold; the current labeling duration reaches a first preset duration threshold; or the ratio of labeled sample data to unlabeled sample data is greater than a preset threshold. The first preset round-count threshold may be, but is not limited to: 10, 20, 30, etc.; the first preset duration threshold may be, but is not limited to: 1 hour, 5 hours, 10 hours, etc.; and the preset threshold may be, but is not limited to: 0.2, 0.3, 0.4, etc.
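Because the first preset stop condition is a disjunction (any one of the thresholds suffices), a check like the following sketch would end the first labeling stage. The default thresholds mirror the example values in the text (10 rounds, 1 hour, ratio 0.2) and are otherwise arbitrary.

```python
import time

def first_stop_reached(round_no, start_time, n_labeled, n_unlabeled,
                       max_rounds=10, max_seconds=3600.0, max_ratio=0.2):
    """True when ANY of the three first-stage thresholds is hit:
    round count, elapsed labeling time, or labeled/unlabeled ratio."""
    elapsed = time.monotonic() - start_time
    ratio = n_labeled / n_unlabeled if n_unlabeled else float("inf")
    return (round_no >= max_rounds
            or elapsed >= max_seconds
            or ratio > max_ratio)

# After 3 rounds, 30 of 100 samples labeled: the ratio 0.3 > 0.2 stops
# the stage even though the round and time thresholds are not reached.
now = time.monotonic()
stage_over = first_stop_reached(round_no=3, start_time=now,
                                n_labeled=30, n_unlabeled=100)
```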
In the embodiment of the application, after obtaining the scoring result corresponding to each unlabeled sample data, the target financial institution platform may perform multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until the first preset stop condition is reached, thereby obtaining a plurality of first positive sample data and a plurality of first negative sample data.
104. And performing multiple rounds of iterative training on the preset machine learning model and the preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data until a second preset stop condition is reached, so as to obtain a first label prediction model and a second label prediction model.
The preset machine learning model is a model created in advance according to a target machine learning algorithm, and the target machine learning algorithm may be, but is not limited to: a decision tree algorithm, a light gradient boosting machine (LightGBM) algorithm, and the like. The preset deep learning model is a model created in advance according to a target deep learning algorithm, and the target deep learning algorithm may be, but is not limited to: a convolutional neural network algorithm, a recurrent neural network algorithm, a recursive neural network algorithm, and the like. The second preset stop condition may be, but is not limited to, any one of the following: the current number of iterative training rounds reaches a second preset round-count threshold; or the current iterative training duration reaches a second preset duration threshold. The second preset round-count threshold may be, but is not limited to: 10, 20, 30, etc., and the second preset duration threshold may be, but is not limited to: 1 hour, 5 hours, 10 hours, etc.
In the embodiment of the application, after obtaining a plurality of first positive sample data and a plurality of first negative sample data, the target financial institution platform may perform a plurality of rounds of iterative training on the preset machine learning model according to the plurality of first positive sample data and the plurality of first negative sample data until reaching a second preset stop condition, thereby obtaining a first label prediction model, and perform a plurality of rounds of iterative training on the preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data until reaching the second preset stop condition, thereby obtaining a second label prediction model.
When performing the multiple rounds of iterative training on the preset machine learning model and on the preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data, the target financial institution platform may use existing machine learning and deep learning model training methods; the details are not repeated in the embodiments of the present application.
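The two parallel training loops can be sketched as below. The single-round fit functions are caller-supplied stubs (real choices could be a LightGBM classifier and a neural network, per the algorithm examples above), and only the round-count form of the second preset stop condition is shown.

```python
def train_two_models(positives, negatives, fit_ml, fit_dl, max_rounds=10):
    """Train the preset machine learning model and the preset deep
    learning model on the same labeled pool for multiple rounds, stopping
    at the round-count threshold (second preset stop condition)."""
    data = [(s, 1) for s in positives] + [(s, 0) for s in negatives]
    ml_model = dl_model = None
    for _ in range(max_rounds):
        ml_model = fit_ml(ml_model, data)  # one training round, ML model
        dl_model = fit_dl(dl_model, data)  # one training round, DL model
    return ml_model, dl_model  # first and second label prediction models

# Counting stubs standing in for real single-round training steps.
def fit_stub(name, counter):
    def fit(model, data):
        counter[name] += 1
        return (name, counter[name], len(data))
    return fit

counts = {"ml": 0, "dl": 0}
first_model, second_model = train_two_models(
    positives=["a", "b"], negatives=["c"],
    fit_ml=fit_stub("ml", counts), fit_dl=fit_stub("dl", counts),
    max_rounds=5)
```

Training two models of different families on the same pool is what later lets the method cross-check their rankings when labeling the remaining samples.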
105. And performing multi-round labeling processing on the plurality of residual unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
The plurality of remaining unlabeled sample data are the sample data that were not labeled in step 103; the third preset stop condition is that each remaining unlabeled sample data has been labeled.
In the embodiment of the application, after the target financial institution platform trains and obtains the first label prediction model and the second label prediction model based on the first positive sample data and the negative sample data, the target financial institution platform can perform multiple rounds of labeling processing on the remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached, so that the labeling work on all the unlabeled sample data is completed.
The embodiment of the application provides a sample data processing method. After the target financial institution platform acquires a sample data set containing a plurality of unlabeled sample data, it inputs each unlabeled sample data into a preset scoring model to obtain a scoring result corresponding to each unlabeled sample data, and performs multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data. It then performs multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data until a second preset stop condition is reached, so as to obtain a first label prediction model and a second label prediction model. Finally, it performs multiple rounds of labeling processing on the plurality of remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached, thereby completing the labeling of all the unlabeled sample data. That is, through these two successive multi-round labeling processes, the target financial institution platform can efficiently label a large amount of unlabeled sample data.
To explain in more detail, an embodiment of the present application provides another sample data processing method, specifically as shown in fig. 2, and the method includes:
201. and acquiring a sample data set.
For step 201, the acquisition of the sample data set may refer to the description of the corresponding part in fig. 1 and will not be repeated here in this embodiment of the present application.
202. And respectively inputting each unlabeled sample data into a preset scoring model to obtain a scoring result corresponding to each unlabeled sample data.
For step 202, inputting each unlabeled sample data into the preset scoring model to obtain the corresponding scoring result may refer to the description of the corresponding part in fig. 1 and will not be repeated here in this embodiment of the present application.
203. And performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the grading result corresponding to each unlabeled sample data until a first preset stopping condition is reached so as to obtain a plurality of first positive sample data and a plurality of first negative sample data.
In the embodiment of the application, after obtaining the scoring result corresponding to each unmarked sample data, the target financial institution platform may perform multiple rounds of labeling processing on the multiple unmarked sample data according to the scoring result corresponding to each unmarked sample data until reaching a first preset stop condition, thereby obtaining multiple first positive sample data and multiple first negative sample data.
Specifically, in this step, the target financial institution platform may perform multiple rounds of tagging on the multiple unlabeled sample data according to the scoring result corresponding to each unlabeled sample data in the following manner until a first preset stop condition is reached, so as to obtain multiple first positive sample data and multiple first negative sample data:
for each round of labeling process:
(1) forward sorting the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data to obtain a first sequence;
(2a) firstly, acquiring the X top-ranked unlabeled sample data in the first sequence; secondly, clustering the X unlabeled sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; thirdly, calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data, that is, for any cluster, summing the scoring results corresponding to the unlabeled sample data in the cluster and dividing the sum by the number of unlabeled sample data in the cluster, so as to obtain the average score corresponding to the cluster; finally, labeling each unlabeled sample data contained in the cluster with the highest average score, namely labeling each such unlabeled sample data with a positive label (namely a high-quality client label), so as to obtain a plurality of first positive sample data;
(2b) firstly, acquiring the Y last-ranked unlabeled sample data in the first sequence; secondly, clustering the Y unlabeled sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; thirdly, calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data, that is, for any cluster, summing the scoring results corresponding to the unlabeled sample data in the cluster and dividing the sum by the number of unlabeled sample data in the cluster, so as to obtain the average score corresponding to the cluster; finally, labeling each unlabeled sample data contained in the cluster with the lowest average score, namely labeling each such unlabeled sample data with a negative label (namely an inferior client label), so as to obtain a plurality of first negative sample data.
The preset unsupervised clustering algorithm may be, but is not limited to: a K-means clustering algorithm, a hierarchical clustering algorithm, and the like; X may be, but is not limited to: 50, 100, 200, etc., and Y may be, but is not limited to: 50, 100, 200, etc.
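One labeling round of steps (1)-(2b) can be sketched as follows, assuming scikit-learn is available and the scoring results have already been computed. `KMeans` stands in for the preset unsupervised clustering algorithm; the feature matrix, scoring function, and the values of X, Y, and the cluster count are all invented for the sketch.

```python
# Hypothetical sketch of one round of the first labeling phase: sort by
# score, cluster the head and the tail of the ranking, and label the
# extreme-average clusters. Data and parameters are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n = 300
features = rng.normal(size=(n, 4))     # unlabeled sample data
scores = features.sum(axis=1)          # stand-in for the preset scoring model's results

# (1) forward (descending) sort by score to obtain the first sequence.
order = np.argsort(-scores)
X_cnt, Y_cnt = 100, 100                # the patent's X and Y (e.g. 50/100/200)
top_idx, bottom_idx = order[:X_cnt], order[-Y_cnt:]

def label_extreme_cluster(idx, pick_highest, k=3):
    """Cluster the selected samples, average the scores per cluster, and
    return the indices of the cluster with the highest (or lowest) mean."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features[idx])
    means = [scores[idx[labels == c]].mean() for c in range(k)]
    target = int(np.argmax(means)) if pick_highest else int(np.argmin(means))
    return idx[labels == target]

first_positive = label_extreme_cluster(top_idx, pick_highest=True)      # positive labels (high-quality clients)
first_negative = label_extreme_cluster(bottom_idx, pick_highest=False)  # negative labels (inferior clients)
```

Only the cluster with the most extreme average score is labeled in each round, which is why multiple rounds are needed before the first preset stop condition is reached.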
Further, in this embodiment of the present application, after the target financial institution platform inputs each unlabeled sample data into the preset scoring model to obtain the corresponding scoring result, it may additionally apply a preset dimensionality reduction algorithm to each unlabeled sample data, so as to merge or delete part of the features of each unlabeled sample data. This allows the platform to cluster the dimension-reduced unlabeled sample data more accurately during the multiple rounds of labeling performed according to their scoring results. The preset dimensionality reduction algorithm may be, but is not limited to: a principal component analysis (PCA) algorithm, a Lasso algorithm, a linear discriminant analysis (LDA) algorithm, and the like.
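The optional dimensionality-reduction step can be sketched with PCA, one of the algorithms named above; the feature count and component count are arbitrary illustrative choices.

```python
# Minimal sketch of the optional dimensionality reduction before clustering;
# the raw feature matrix and component count are illustrative only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
raw = rng.normal(size=(200, 10))                  # unlabeled samples with 10 raw features

reduced = PCA(n_components=4).fit_transform(raw)  # merges correlated features into 4 components
```

The reduced matrix would then replace the raw features in the clustering of steps (2a) and (2b).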
204. And performing multiple rounds of iterative training on the preset machine learning model and the preset deep learning model according to the first positive sample data and the negative sample data until a second preset stop condition is reached to obtain a first label prediction model and a second label prediction model.
For step 204, performing multiple rounds of iterative training on the preset machine learning model and the preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data until the second preset stop condition is reached, so as to obtain the first label prediction model and the second label prediction model, may refer to the description of the corresponding part in fig. 1 and will not be repeated here in this embodiment of the present application.
205. And performing multi-round labeling processing on the plurality of residual unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
In the embodiment of the application, after the target financial institution platform trains and obtains the first label prediction model and the second label prediction model based on the first positive sample data and the negative sample data, the target financial institution platform can perform multiple rounds of labeling processing on the remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached, so that the labeling work on all the unlabeled sample data is completed.
Specifically, in this step, the target financial institution platform may perform multiple rounds of tagging on a plurality of remaining unlabeled sample data according to the first tag prediction model and the second tag prediction model in the following manner until a third preset stop condition is reached:
for each round of labeling process:
(1) firstly, inputting each residual unmarked sample data into a first label prediction model respectively so as to obtain a first prediction probability corresponding to each residual unmarked sample data, and then carrying out forward sequencing on a plurality of residual unmarked sample data according to the first prediction probability corresponding to each residual unmarked sample data so as to obtain a second sequence, wherein the first prediction probability corresponding to any residual unmarked sample data is used for indicating the probability that the residual unmarked sample data should be marked with a negative label (namely, an inferior client label);
(2) inputting each residual unmarked sample data into a second label prediction model respectively to obtain a second prediction probability corresponding to each residual unmarked sample data, and then carrying out forward sequencing on a plurality of residual unmarked sample data according to the second prediction probability corresponding to each residual unmarked sample data to obtain a third sequence, wherein the second prediction probability corresponding to any residual unmarked sample data is used for indicating the probability that the residual unmarked sample data should be marked with a negative label (namely an inferior client label);
(3) firstly, determining the intersection of the A remaining unlabeled sample data ranked first in the second sequence and the A remaining unlabeled sample data ranked first in the third sequence as first target sample data, and then determining the intersection of the B remaining unlabeled sample data ranked last in the second sequence and the B remaining unlabeled sample data ranked last in the third sequence as second target sample data, where A and B may each be, but are not limited to, positive integers such as 50, 100, 200;
(4) firstly, clustering the plurality of first target sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; secondly, calculating the average score corresponding to each cluster according to the scoring result corresponding to each first target sample data, that is, for any cluster, summing the scoring results corresponding to the first target sample data in the cluster and dividing the sum by the number of first target sample data in the cluster, so as to obtain the average score corresponding to the cluster; finally, labeling each first target sample data contained in the cluster with the lowest average score, that is, labeling each such first target sample data with a negative label (i.e., an inferior customer label), thereby obtaining a plurality of second negative sample data, where the preset unsupervised clustering algorithm may be, but is not limited to: a K-means clustering algorithm, a hierarchical clustering algorithm, and the like;
(5) firstly, clustering the plurality of second target sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; secondly, calculating the average score corresponding to each cluster according to the scoring result corresponding to each second target sample data, that is, for any cluster, summing the scoring results corresponding to the second target sample data in the cluster and dividing the sum by the number of second target sample data in the cluster, so as to obtain the average score corresponding to the cluster; finally, labeling each second target sample data contained in the cluster with the highest average score, namely labeling each such second target sample data with a positive label (namely a high-quality client label), so as to obtain a plurality of second positive sample data;
(6) the first label prediction model is retrained using the plurality of first positive sample data, the plurality of second positive sample data, the plurality of first negative sample data, and the plurality of second negative sample data, and the second label prediction model is retrained using the plurality of first positive sample data, the plurality of second positive sample data, the plurality of first negative sample data, and the plurality of second negative sample data.
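Steps (1)-(3) of this second labeling phase can be sketched as follows. The two trained label prediction models are simulated here by two noisy scoring functions over a synthetic risk variable; A, B, the sample count, and the noise level are all illustrative assumptions.

```python
# Hedged sketch of one round of step 205: rank the remaining unlabeled
# samples by each model's predicted negative-label probability, then
# intersect the heads and tails of the two rankings to pick the target
# sample data. All data and parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(3)
n = 500
true_risk = rng.uniform(size=n)                               # hidden "inferior client" tendency
p_neg_1 = np.clip(true_risk + rng.normal(0, 0.05, n), 0, 1)   # first model's P(negative)
p_neg_2 = np.clip(true_risk + rng.normal(0, 0.05, n), 0, 1)   # second model's P(negative)

A = B = 100
seq_2 = np.argsort(-p_neg_1)   # second sequence: most-likely-negative first
seq_3 = np.argsort(-p_neg_2)   # third sequence

# Head intersection -> candidates for negative labels (first target sample data);
# tail intersection -> candidates for positive labels (second target sample data).
first_target = np.intersect1d(seq_2[:A], seq_3[:A])
second_target = np.intersect1d(seq_2[-B:], seq_3[-B:])
```

Steps (4)-(5) would then cluster `first_target` and `second_target` exactly as in the first labeling phase, and step (6) would retrain both models on the enlarged labeled set before the next round.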
In order to achieve the above object, according to another aspect of the present application, an embodiment of the present application further provides a storage medium, where the storage medium includes a stored program, and when the program runs, a device in which the storage medium is located is controlled to execute the above sample data processing method.
In order to achieve the above object, according to another aspect of the present application, an embodiment of the present application further provides a device for processing sample data, where the device includes a storage medium; and one or more processors, the storage medium coupled with the processors, the processors configured to execute program instructions stored in the storage medium; and when the program instruction runs, executing the sample data processing method.
Further, as an implementation of the method shown in fig. 1 and fig. 2, another embodiment of the present application further provides a device for processing sample data. The embodiment of the apparatus corresponds to the embodiment of the method, and for convenience of reading, details in the embodiment of the apparatus are not repeated one by one, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the embodiment of the method. The device is applied to efficiently labeling a large amount of unlabeled sample data, and specifically as shown in fig. 3, the device comprises:
an obtaining unit 31, configured to obtain a sample data set, where the sample data set includes a plurality of unlabeled sample data without labels;
the input unit 32 is configured to input each unlabeled sample data into a preset scoring model, so as to obtain a scoring result corresponding to each unlabeled sample data;
a first labeling unit 33, configured to perform multiple rounds of labeling processing on the multiple unlabeled sample data according to a scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain multiple first positive sample data and multiple first negative sample data;
a training unit 34, configured to perform multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to a plurality of first positive sample data and a plurality of first negative sample data until a second preset stop condition is reached, so as to obtain a first label prediction model and a second label prediction model;
and the second labeling unit 35 is configured to perform multiple rounds of labeling processing on multiple remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
Further, as shown in fig. 4, the first labeling unit 33 is specifically configured to, for each round of labeling processing:
according to a grading result corresponding to each unmarked sample data, carrying out forward sequencing on the unmarked sample data to obtain a first sequence;
acquiring the X top-ranked unlabeled sample data in the first sequence; clustering the X unlabeled sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the highest average score to obtain a plurality of first positive sample data;
acquiring the Y last-ranked unlabeled sample data in the first sequence; clustering the Y unlabeled sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the lowest average score to obtain a plurality of first negative sample data.
Further, as shown in fig. 4, the second labeling unit 35 is specifically configured to, for each round of labeling processing:
respectively inputting each residual unmarked sample data into the first label prediction model to obtain a first prediction probability corresponding to each residual unmarked sample data, and performing forward sequencing on the residual unmarked sample data according to the first prediction probability corresponding to each residual unmarked sample data to obtain a second sequence, wherein the first prediction probability is used for indicating the probability that the residual unmarked sample data should be marked with a negative label;
inputting each residual unmarked sample data into the second label prediction model respectively to obtain a second prediction probability corresponding to each residual unmarked sample data, and performing forward sequencing on the residual unmarked sample data according to the second prediction probability corresponding to each residual unmarked sample data to obtain a third sequence, wherein the second prediction probability is used for indicating the probability that the residual unmarked sample data should be marked with a negative label;
determining the intersection of the A remaining unlabeled sample data ranked first in the second sequence and the A remaining unlabeled sample data ranked first in the third sequence as first target sample data, and determining the intersection of the B remaining unlabeled sample data ranked last in the second sequence and the B remaining unlabeled sample data ranked last in the third sequence as second target sample data;
clustering the first target sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each first target sample data; labeling each first target sample data contained in the cluster with the lowest average score to obtain a plurality of second negative sample data;
clustering a plurality of second target sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each second target sample data; labeling each second target sample data contained in the cluster with the highest average score to obtain a plurality of second positive sample data;
training the first label prediction model and the second label prediction model using a plurality of the first positive sample data, a plurality of the second positive sample data, a plurality of the first negative sample data, and a plurality of the second negative sample data.
Further, as shown in fig. 4, the apparatus further includes:
the dimension reduction unit 36 is configured to, after the input unit 32 respectively inputs each unlabeled sample data into a preset scoring model to obtain a scoring result corresponding to each unlabeled sample data, perform dimension reduction processing on each unlabeled sample data by using a preset dimension reduction algorithm;
the first labeling unit 33 is specifically configured to perform multiple rounds of labeling processing on the unmarked sample data subjected to the dimension reduction processing according to a scoring result corresponding to each unmarked sample data subjected to the dimension reduction processing until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data.
Further, as shown in fig. 4, the first preset stop condition is any one of: the current labeling round number reaching a first preset round number threshold, the current labeling duration reaching a first preset duration threshold, and the ratio of labeled sample data to unlabeled sample data being greater than a preset threshold; the second preset stop condition is any one of: the current iteration training round number reaching a second preset round number threshold, and the current iteration training duration reaching a second preset duration threshold; the third preset stop condition is: each remaining unlabeled sample data has been labeled.
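The three stop conditions can be made concrete with a small helper; the threshold values below simply mirror the patent's example figures (10 rounds, 5 hours, etc.) and are hypothetical, not normative.

```python
# Illustrative check of the first and third preset stop conditions; the
# threshold values are the patent's example figures, not fixed choices.
def first_stop(round_no, elapsed_hours, labeled, unlabeled,
               max_rounds=10, max_hours=5, ratio_threshold=1.0):
    """First preset stop condition: any one of round count, duration,
    or labeled/unlabeled ratio exceeding its preset threshold."""
    return (round_no >= max_rounds
            or elapsed_hours >= max_hours
            or (unlabeled > 0 and labeled / unlabeled > ratio_threshold))

def third_stop(remaining_unlabeled):
    """Third preset stop condition: every remaining unlabeled sample
    data has been labeled."""
    return remaining_unlabeled == 0
```

The second preset stop condition has the same shape as `first_stop` with the ratio clause removed, applied to the iterative training loop instead of the labeling loop.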
The embodiment of the application provides a sample data processing method and device. After the target financial institution platform acquires a sample data set containing a plurality of unlabeled sample data, it inputs each unlabeled sample data into a preset scoring model to obtain a scoring result corresponding to each unlabeled sample data, and performs multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data. It then performs multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the plurality of first positive sample data and the plurality of first negative sample data until a second preset stop condition is reached, so as to obtain a first label prediction model and a second label prediction model. Finally, it performs multiple rounds of labeling processing on the plurality of remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached, thereby completing the labeling of all the unlabeled sample data. That is, through these two successive multi-round labeling processes, the target financial institution platform can efficiently label a large amount of unlabeled sample data.
The sample data processing device comprises a processor and a memory, wherein the acquisition unit, the input unit, the first labeling unit, the training unit, the second labeling unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. The kernel can be set to be one or more, and a large amount of sample data without labels can be effectively labeled by adjusting kernel parameters.
The embodiment of the application provides a storage medium, which includes a stored program, wherein when the program runs, a device where the storage medium is located is controlled to execute the sample data processing method.
The storage medium may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The embodiment of the application also provides a sample data processing device, which comprises a storage medium; and one or more processors, the storage medium coupled with the processors, the processors configured to execute program instructions stored in the storage medium; and when the program instruction runs, executing the sample data processing method.
The embodiment of the application provides equipment, the equipment comprises a processor, a memory and a program which is stored on the memory and can run on the processor, and the following steps are realized when the processor executes the program:
acquiring a sample data set, wherein the sample data set comprises a plurality of unlabeled sample data without labels;
inputting each unlabeled sample data into a preset scoring model respectively to obtain a scoring result corresponding to each unlabeled sample data;
performing multiple rounds of labeling processing on the multiple unlabeled sample data according to the grading result corresponding to each unlabeled sample data until a first preset stopping condition is reached so as to obtain multiple first positive sample data and multiple first negative sample data;
performing multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the first positive sample data and the first negative sample data until a second preset stop condition is reached to obtain a first label prediction model and a second label prediction model;
and performing multi-round labeling processing on a plurality of residual unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
Further, the performing multiple rounds of labeling processing on the multiple unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached to obtain multiple first positive sample data and multiple first negative sample data includes:
for each round of labeling process:
according to a grading result corresponding to each unmarked sample data, carrying out forward sequencing on the unmarked sample data to obtain a first sequence;
acquiring the X top-ranked unlabeled sample data in the first sequence; clustering the X unlabeled sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the highest average score to obtain a plurality of first positive sample data;
acquiring the Y last-ranked unlabeled sample data in the first sequence; clustering the Y unlabeled sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the lowest average score to obtain a plurality of first negative sample data.
Further, the performing multiple rounds of labeling processing on multiple remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached includes:
for each round of labeling process:
respectively inputting each residual unmarked sample data into the first label prediction model to obtain a first prediction probability corresponding to each residual unmarked sample data, and performing forward sequencing on the residual unmarked sample data according to the first prediction probability corresponding to each residual unmarked sample data to obtain a second sequence, wherein the first prediction probability is used for indicating the probability that the residual unmarked sample data should be marked with a negative label;
inputting each residual unmarked sample data into the second label prediction model respectively to obtain a second prediction probability corresponding to each residual unmarked sample data, and performing forward sequencing on the residual unmarked sample data according to the second prediction probability corresponding to each residual unmarked sample data to obtain a third sequence, wherein the second prediction probability is used for indicating the probability that the residual unmarked sample data should be marked with a negative label;
determining the intersection of the A remaining unlabeled sample data ranked first in the second sequence and the A remaining unlabeled sample data ranked first in the third sequence as first target sample data, and determining the intersection of the B remaining unlabeled sample data ranked last in the second sequence and the B remaining unlabeled sample data ranked last in the third sequence as second target sample data;
clustering the first target sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each first target sample data; labeling each first target sample data contained in the cluster with the lowest average score to obtain a plurality of second negative sample data;
clustering a plurality of second target sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each second target sample data; labeling each second target sample data contained in the cluster with the highest average score to obtain a plurality of second positive sample data;
training the first label prediction model and the second label prediction model using a plurality of the first positive sample data, a plurality of the second positive sample data, a plurality of the first negative sample data, and a plurality of the second negative sample data.
Further, after the step of inputting each unlabeled sample data into a preset scoring model respectively to obtain a scoring result corresponding to each unlabeled sample data, the method further includes:
performing dimensionality reduction processing on each unmarked sample data by using a preset dimensionality reduction algorithm;
performing multiple rounds of labeling processing on the multiple unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached to obtain multiple first positive sample data and multiple first negative sample data, including:
performing multiple rounds of labeling processing on the unlabeled sample data subjected to the dimensionality reduction processing according to the scoring result corresponding to each such unlabeled sample data until the first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data.
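A minimal, dependency-free sketch of this dimensionality-reduction step: the application names only a "preset dimensionality reduction algorithm" (PCA would be a typical choice), so the example below simply keeps the k highest-variance features instead; the function name and the parameter k are illustrative assumptions.

```python
# Hedged stand-in for the "preset dimensionality reduction algorithm":
# keep the k features with the largest variance across the sample set.

def reduce_dimensions(samples, k):
    """samples: list of equal-length feature lists; returns the same rows
    restricted to the k highest-variance feature columns."""
    n = len(samples)
    dims = len(samples[0])
    variances = []
    for d in range(dims):
        col = [s[d] for s in samples]
        mean = sum(col) / n
        variances.append(sum((v - mean) ** 2 for v in col) / n)
    # Indices of the k largest-variance features, restored to column order.
    keep = sorted(sorted(range(dims), key=lambda d: -variances[d])[:k])
    return [[s[d] for d in keep] for s in samples]
```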
Further, the first preset stop condition is any one of the following: the current number of labeling rounds reaches a first preset round-number threshold; the current labeling duration reaches a first preset duration threshold; or the ratio of labeled sample data to unlabeled sample data is greater than a preset ratio threshold. The second preset stop condition is either of the following: the current number of iterative training rounds reaches a second preset round-number threshold, or the current iterative training duration reaches a second preset duration threshold. The third preset stop condition is: each remaining unlabeled sample data has been labeled.
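The three stop conditions above can be expressed as simple predicates. The threshold names and the injectable `now` parameter below are illustrative assumptions, not values given in the application:

```python
import time

# Hedged sketch of the three preset stop conditions described above.

def first_stop(round_no, start_ts, labeled_n, unlabeled_n,
               max_rounds, max_seconds, max_ratio, now=None):
    # Any one of: round count, elapsed time, or labeled/unlabeled ratio.
    now = time.time() if now is None else now
    return (round_no >= max_rounds
            or now - start_ts >= max_seconds
            or (unlabeled_n > 0 and labeled_n / unlabeled_n > max_ratio))

def second_stop(iter_no, start_ts, max_iters, max_seconds, now=None):
    # Either of: iterative-training round count or elapsed training time.
    now = time.time() if now is None else now
    return iter_no >= max_iters or now - start_ts >= max_seconds

def third_stop(remaining_unlabeled):
    # Stop once every remaining unlabeled sample has been labeled.
    return len(remaining_unlabeled) == 0
```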
The present application further provides a computer program product adapted, when executed on a data processing device, to run program code that initializes the following method steps: acquiring a sample data set, wherein the sample data set comprises a plurality of unlabeled sample data without labels; inputting each unlabeled sample data into a preset scoring model respectively to obtain a scoring result corresponding to each unlabeled sample data; performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data; performing multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the first positive sample data and the first negative sample data until a second preset stop condition is reached, so as to obtain a first label prediction model and a second label prediction model; and performing multiple rounds of labeling processing on a plurality of remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for processing sample data, the method comprising:
acquiring a sample data set, wherein the sample data set comprises a plurality of unlabeled sample data without labels;
inputting each unlabeled sample data into a preset scoring model respectively to obtain a scoring result corresponding to each unlabeled sample data;
performing multiple rounds of labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data;
performing multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the first positive sample data and the first negative sample data until a second preset stop condition is reached to obtain a first label prediction model and a second label prediction model;
and performing multiple rounds of labeling processing on a plurality of remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
2. The method according to claim 1, wherein said performing multiple rounds of labeling on a plurality of unlabeled sample data according to a scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached to obtain a plurality of first positive sample data and a plurality of first negative sample data comprises:
for each round of labeling process:
performing forward sequencing on the unlabeled sample data according to the scoring result corresponding to each unlabeled sample data to obtain a first sequence;
acquiring the X top-ranked unlabeled sample data in the first sequence; clustering the X unlabeled sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; labeling each unlabeled sample data contained in the cluster with the highest average score to obtain a plurality of first positive sample data;
acquiring the Y bottom-ranked unlabeled sample data in the first sequence; clustering the Y unlabeled sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the lowest average score to obtain a plurality of first negative sample data.
3. The method of claim 1, wherein the performing multiple rounds of labeling on the plurality of remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached comprises:
for each round of labeling process:
respectively inputting each remaining unlabeled sample data into the first label prediction model to obtain a first prediction probability corresponding to each remaining unlabeled sample data, and performing forward sequencing on the remaining unlabeled sample data according to the first prediction probability corresponding to each remaining unlabeled sample data to obtain a second sequence, wherein the first prediction probability is used for indicating the probability that the remaining unlabeled sample data should be marked with a negative label;
respectively inputting each remaining unlabeled sample data into the second label prediction model to obtain a second prediction probability corresponding to each remaining unlabeled sample data, and performing forward sequencing on the remaining unlabeled sample data according to the second prediction probability corresponding to each remaining unlabeled sample data to obtain a third sequence, wherein the second prediction probability is used for indicating the probability that the remaining unlabeled sample data should be marked with a negative label;
determining the intersection of the A top-ranked remaining unlabeled sample data in the second sequence and the A top-ranked remaining unlabeled sample data in the third sequence as first target sample data, and determining the intersection of the B bottom-ranked remaining unlabeled sample data in the second sequence and the B bottom-ranked remaining unlabeled sample data in the third sequence as second target sample data;
clustering the first target sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each first target sample data; labeling each first target sample data contained in the cluster with the lowest average score to obtain a plurality of second negative sample data;
clustering a plurality of second target sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each second target sample data; labeling each second target sample data contained in the cluster with the highest average score to obtain a plurality of second positive sample data;
training the first label prediction model and the second label prediction model using a plurality of the first positive sample data, a plurality of the second positive sample data, a plurality of the first negative sample data, and a plurality of the second negative sample data.
4. The method according to claim 1, wherein after said inputting each said unlabeled sample data into a preset scoring model respectively to obtain a scoring result corresponding to each said unlabeled sample data, the method further comprises:
performing dimensionality reduction processing on each unlabeled sample data by using a preset dimensionality reduction algorithm;
performing multiple rounds of labeling processing on the multiple unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stop condition is reached to obtain multiple first positive sample data and multiple first negative sample data, including:
performing multiple rounds of labeling processing on the unlabeled sample data subjected to the dimensionality reduction processing according to the scoring result corresponding to each such unlabeled sample data until the first preset stop condition is reached, so as to obtain a plurality of first positive sample data and a plurality of first negative sample data.
5. The method according to claim 1, wherein the first preset stop condition is any one of the following: the current number of labeling rounds reaches a first preset round-number threshold; the current labeling duration reaches a first preset duration threshold; or the ratio of labeled sample data to unlabeled sample data is greater than a preset ratio threshold; the second preset stop condition is either of the following: the current number of iterative training rounds reaches a second preset round-number threshold, or the current iterative training duration reaches a second preset duration threshold; and the third preset stop condition is: each remaining unlabeled sample data has been labeled.
6. An apparatus for processing sample data, the apparatus comprising:
an acquisition unit, used for acquiring a sample data set, wherein the sample data set comprises a plurality of unlabeled sample data without labels;
the input unit is used for respectively inputting each unlabeled sample data into a preset scoring model so as to obtain a scoring result corresponding to each unlabeled sample data;
the first labeling unit is used for performing multi-round labeling processing on the plurality of unlabeled sample data according to the scoring result corresponding to each unlabeled sample data until a first preset stopping condition is reached so as to obtain a plurality of first positive sample data and a plurality of first negative sample data;
the training unit is used for performing multiple rounds of iterative training on a preset machine learning model and a preset deep learning model according to the first positive sample data and the first negative sample data until a second preset stop condition is reached, so as to obtain a first label prediction model and a second label prediction model;
and the second labeling unit is used for performing multiple rounds of labeling processing on a plurality of remaining unlabeled sample data according to the first label prediction model and the second label prediction model until a third preset stop condition is reached.
7. The apparatus of claim 6,
the first labeling unit is specifically configured to, for each round of labeling processing:
performing forward sequencing on the unlabeled sample data according to the scoring result corresponding to each unlabeled sample data to obtain a first sequence;
acquiring the X top-ranked unlabeled sample data in the first sequence; clustering the X unlabeled sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; labeling each unlabeled sample data contained in the cluster with the highest average score to obtain a plurality of first positive sample data;
acquiring the Y bottom-ranked unlabeled sample data in the first sequence; clustering the Y unlabeled sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the scoring result corresponding to each unlabeled sample data; and labeling each unlabeled sample data contained in the cluster with the lowest average score to obtain a plurality of first negative sample data.
8. The apparatus of claim 6,
the second labeling unit is specifically configured to, for each round of labeling processing:
respectively inputting each remaining unlabeled sample data into the first label prediction model to obtain a first prediction probability corresponding to each remaining unlabeled sample data, and performing forward sequencing on the remaining unlabeled sample data according to the first prediction probability corresponding to each remaining unlabeled sample data to obtain a second sequence, wherein the first prediction probability is used for indicating the probability that the remaining unlabeled sample data should be marked with a negative label;
respectively inputting each remaining unlabeled sample data into the second label prediction model to obtain a second prediction probability corresponding to each remaining unlabeled sample data, and performing forward sequencing on the remaining unlabeled sample data according to the second prediction probability corresponding to each remaining unlabeled sample data to obtain a third sequence, wherein the second prediction probability is used for indicating the probability that the remaining unlabeled sample data should be marked with a negative label;
determining the intersection of the A top-ranked remaining unlabeled sample data in the second sequence and the A top-ranked remaining unlabeled sample data in the third sequence as first target sample data, and determining the intersection of the B bottom-ranked remaining unlabeled sample data in the second sequence and the B bottom-ranked remaining unlabeled sample data in the third sequence as second target sample data;
clustering the first target sample data according to a preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each first target sample data; labeling each first target sample data contained in the cluster with the lowest average score to obtain a plurality of second negative sample data;
clustering a plurality of second target sample data according to the preset unsupervised clustering algorithm to obtain a plurality of clusters; calculating the average score corresponding to each cluster according to the score result corresponding to each second target sample data; labeling each second target sample data contained in the cluster with the highest average score to obtain a plurality of second positive sample data;
training the first label prediction model and the second label prediction model using a plurality of the first positive sample data, a plurality of the second positive sample data, a plurality of the first negative sample data, and a plurality of the second negative sample data.
9. A storage medium, comprising a stored program, wherein when the program is run, a device on which the storage medium is located is controlled to execute the method for processing sample data according to any one of claims 1 to 5.
10. An apparatus for processing sample data, the apparatus comprising a storage medium; and one or more processors, the storage medium coupled with the processors, the processors configured to execute program instructions stored in the storage medium; the program instructions when executed perform a method of processing sample data as claimed in any one of claims 1 to 5.
CN202111107477.6A 2021-09-22 2021-09-22 Sample data processing method and device Active CN113919936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111107477.6A CN113919936B (en) 2021-09-22 2021-09-22 Sample data processing method and device

Publications (2)

Publication Number Publication Date
CN113919936A true CN113919936A (en) 2022-01-11
CN113919936B CN113919936B (en) 2022-08-05

Family

ID=79235538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111107477.6A Active CN113919936B (en) 2021-09-22 2021-09-22 Sample data processing method and device

Country Status (1)

Country Link
CN (1) CN113919936B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114418752A (en) * 2022-03-28 2022-04-29 北京芯盾时代科技有限公司 Method and device for processing user data without type label, electronic equipment and medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109615089A (en) * 2018-10-12 2019-04-12 国网浙江省电力有限公司衢州供电公司 Power information acquires the generation method of label and the work order distributing method based on this
CN110070183A (en) * 2019-03-11 2019-07-30 中国科学院信息工程研究所 A kind of the neural network model training method and device of weak labeled data
US20190279297A1 (en) * 2017-02-08 2019-09-12 Tencent Technology (Shenzhen) Company Limited Credit scoring method and server
CN111046952A (en) * 2019-12-12 2020-04-21 深圳市随手金服信息科技有限公司 Method and device for establishing label mining model, storage medium and terminal
CN111143865A (en) * 2019-12-26 2020-05-12 国网湖北省电力有限公司 User behavior analysis system and method for automatically generating label on ciphertext data
CN111209929A (en) * 2019-12-19 2020-05-29 平安信托有限责任公司 Access data processing method and device, computer equipment and storage medium
US20200382536A1 (en) * 2019-05-31 2020-12-03 Gurucul Solutions, Llc Anomaly detection in cybersecurity and fraud applications
CN112288453A (en) * 2019-07-23 2021-01-29 北京京东尚科信息技术有限公司 Label selection method and device
US20210150631A1 (en) * 2019-11-19 2021-05-20 Intuit Inc. Machine learning approach to automatically disambiguate ambiguous electronic transaction labels
US20210158195A1 (en) * 2019-11-26 2021-05-27 International Business Machines Corporation Data label verification
CN113128536A (en) * 2019-12-31 2021-07-16 奇安信科技集团股份有限公司 Unsupervised learning method, system, computer device and readable storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Tang Xiaobo et al., "Research on Doctor Profile Construction Based on Feature Analysis and Label Extraction", Information Science *
Yang Wenhao et al., "News Label Classification Based on BERT and Deep Equal-Length Convolution", Computer and Modernization *


Also Published As

Publication number Publication date
CN113919936B (en) 2022-08-05

Similar Documents

Publication Publication Date Title
CN110348580B (en) Method and device for constructing GBDT model, and prediction method and device
CN108389125B (en) Overdue risk prediction method and device for credit application
US20210271809A1 (en) Machine learning process implementation method and apparatus, device, and storage medium
CN108418825B (en) Risk model training and junk account detection methods, devices and equipment
CN110533018B (en) Image classification method and device
CN111783993A (en) Intelligent labeling method and device, intelligent platform and storage medium
CN112766619B (en) Commodity time sequence data prediction method and system
CN109063743B (en) Construction method of medical data classification model based on semi-supervised multitask learning
CN113435998B (en) Loan overdue prediction method and device, electronic equipment and storage medium
CN113177700B (en) Risk assessment method, system, electronic equipment and storage medium
Wu et al. Optimized deep learning framework for water distribution data-driven modeling
WO2019000293A1 (en) Techniques for dense video descriptions
CN113919936B (en) Sample data processing method and device
CN109597982B (en) Abstract text recognition method and device
US20200257984A1 (en) Systems and methods for domain adaptation
CN111126038B (en) Information acquisition model generation method and device and information acquisition method and device
Samsani et al. A real-time automatic human facial expression recognition system using deep neural networks
CN117523218A (en) Label generation, training of image classification model and image classification method and device
CN113870005A (en) Method and device for determining hyper-parameters
CN115115920A (en) Data training method and device
CN113987170A (en) Multi-label text classification method based on convolutional neural network
CN113377960A (en) Analysis method, processor and device for platform commodity comments
CN113487453A (en) Legal judgment prediction method and system based on criminal elements
KR20220075964A (en) Method and apparatus for performing labeling for training data
CN109558582B (en) Visual angle-based sentence emotion analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100000 floors 1-3, block a, global creative Plaza, No. 10, Furong street, Chaoyang District, Beijing

Patentee after: Bairong Zhixin (Beijing) Technology Co.,Ltd.

Address before: 100000 floors 1-3, block a, global creative Plaza, No. 10, Furong street, Chaoyang District, Beijing

Patentee before: Bairong Zhixin (Beijing) credit investigation Co.,Ltd.