CN112488228A

CN112488228A - Bidirectional clustering method for risk control system data completion

Info

Publication number: CN112488228A
Application number: CN202011439471.4A
Authority: CN
Inventors: 郑小禄; 诸葛天心; 刘羽中; 胡亮; 仵伟强; 尹昌
Original assignee: Jingke Internet Technology Shandong Co ltd
Current assignee: Jingke Internet Technology Shandong Co ltd
Priority date: 2020-12-07
Filing date: 2020-12-07
Publication date: 2021-03-12

Abstract

The invention relates to the technical field of cluster analysis, and in particular to a bidirectional clustering method for data completion in risk control systems.

Description

Bidirectional clustering method for risk control system data completion

Technical Field

The invention relates to the technical field of cluster analysis, and in particular to a bidirectional clustering method for data completion in risk control systems.

Background

With the development of information technology and the internet, more and more machine learning algorithms are being applied in the traditional financial field, where much attention is paid to financial risk control using big data combined with machine learning. Most traditional risk control models are built on supervised learning tasks with labels. However, because of growing data volumes, storage errors, unreliable acquisition equipment, unstable network conditions, malicious fraud by users and other causes, most of the acquired data are incomplete; such data may be redundant, noisy or missing. Missing data is common in risk control systems, and the volume of missing data grows exponentially with the user base and the scale of the business. Missing data undermines the accuracy and reliability of risk control decisions: for example, mature risk control models that rely on complete structured data cannot be applied, and some decisions cannot be made at all. Missing data therefore has many adverse effects on a risk control system, degrading the user experience and increasing decision risk.

Latent factor models based on matrix factorization are widely used for data completion in risk control systems. However, a conventional latent factor model completes data along a single dimension only, which costs accuracy. Making full use of information from multiple dimensions has become an important research direction in data completion.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a bidirectional clustering method for risk control system data completion, which addresses the insufficient speed and efficiency of existing missing-data completion.

The invention is realized by the following technical scheme: a bidirectional clustering method for risk control system data completion comprises five steps, namely instance clustering, attribute clustering, local matrix construction, local matrix filling and matrix filling, wherein:

the instance clustering assigns the sample points to different first clusters, each with a different centroid; the centroid of each first cluster is obtained by the first update formula, and similarity in the instance clustering is calculated by the first distance formula, which is

[formula image missing from source]

where D is the number of attributes of a data object; the subset c assigned to a cluster in the instance clustering is given by

[formula image missing from source]

the attribute clustering clusters the data obtained by instance clustering along the attribute dimension and assigns them to different second clusters, each with a different centroid; the centroid of each second cluster is obtained by the second update formula, similarity in the attribute clustering is calculated by the first distance formula, and the subset d assigned to a cluster in the attribute clustering is given by

[formula image missing from source]

the local matrix construction co-clusters the results of the instance clustering and the attribute clustering to obtain local matrices;

the local matrix filling fills each local matrix with a latent factor model, according to the correlation between users and items, to obtain a complete local matrix, wherein the latent factor model is A = UV^T, A is the local matrix, and U and V are the latent factor matrices of the users and the feature items respectively;

and the matrix filling writes the filled local matrices back into the original matrix to obtain a complete matrix.
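The local matrix filling step can be illustrated with a small sketch. The text does not state how U and V are learned, so the example below assumes stochastic gradient descent with L2 regularization over the observed cells. The function name `fill_with_latent_factors`, the hyperparameters and the toy matrix are hypothetical and serve only as an illustration.

```python
import random

def fill_with_latent_factors(a, rank=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Fit A ~ U V^T on the observed cells (assumed SGD), then fill only the missing cells."""
    rng = random.Random(seed)
    n, m = len(a), len(a[0])
    # One latent row vector per user (U) and per attribute (V).
    u = [[rng.random() for _ in range(rank)] for _ in range(n)]
    v = [[rng.random() for _ in range(rank)] for _ in range(m)]
    observed = [(i, j) for i in range(n) for j in range(m) if a[i][j] is not None]
    for _ in range(steps):
        for i, j in observed:
            err = a[i][j] - sum(u[i][k] * v[j][k] for k in range(rank))
            for k in range(rank):
                ui, vj = u[i][k], v[j][k]
                u[i][k] += lr * (err * vj - reg * ui)
                v[j][k] += lr * (err * ui - reg * vj)

    def predict(i, j):
        return sum(u[i][k] * v[j][k] for k in range(rank))

    # Observed values are kept as they are; only gaps receive predictions.
    return [[a[i][j] if a[i][j] is not None else predict(i, j) for j in range(m)]
            for i in range(n)]

# Toy local matrix (hypothetical values); None marks a missing cell.
local = [
    [5.0, 3.0, None],
    [4.0, None, 1.0],
    [1.0, 1.0, 5.0],
    [None, 1.0, 4.0],
]
filled = fill_with_latent_factors(local)
for row in filled:
    print([round(x, 2) for x in row])
```

Keeping the observed cells untouched and predicting only the gaps mirrors the idea of completing a local matrix rather than replacing it.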

Further, the first update formula and the second update formula are both

$$\mathrm{Center}_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$

where Center_k is the centroid of the k-th cluster, C_k is the k-th cluster, |C_k| is the number of data objects in the k-th cluster, and x_i denotes a data object in C_k.

Further, after Center_k is calculated, the sample point closest to this centroid is selected and becomes the updated centroid.
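As a minimal sketch of this update, the snippet below averages a cluster and then snaps the centroid to the nearest actual sample. It assumes Euclidean distance and restricts the candidate points to the members of the cluster; both are assumptions, and `update_centroid` is an illustrative name.

```python
import math

def update_centroid(cluster):
    """Average the cluster, then return the member closest to that average."""
    dims = len(cluster[0])
    mean = [sum(p[j] for p in cluster) / len(cluster) for j in range(dims)]
    # Snap to a real sample so the centroid is always an observed data object.
    return min(cluster, key=lambda p: math.dist(p, mean))

cluster = [[1.0, 1.0], [2.0, 2.0], [6.0, 6.0]]
print(update_centroid(cluster))  # [2.0, 2.0]
```

For this toy cluster the mean is (3, 3), and the closest member is (2, 2).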

The invention has the following beneficial effects: instance clustering assigns sample points to different clusters with the goal of high similarity within clusters and low similarity between clusters; attribute clustering then clusters the centroids obtained by instance clustering along the attribute dimension, so that information from both the instance dimension and the attribute dimension is taken into account. The co-clustering effectively captures latent regularities among rows and columns, and local matrices are constructed accordingly, so that the users and items within a local matrix are strongly correlated; each local matrix is then filled by a latent factor model. Compared with existing latent-factor-based filling methods such as matrix factorization and multi-clustering, the method captures local information from two dimensions through instance clustering and attribute clustering, mines and uses that information more fully, and thereby achieves a better completion effect.

Drawings

FIG. 1 is a schematic flow chart of the main steps of the present invention;

FIG. 2 is a flow chart of the overall algorithm process of the present invention;

FIG. 3 is a data diagram of the present invention;

FIG. 4 is a visual comparison diagram of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the following examples, the samples are shown in attached Table 1,

attached table 1

Attached table 1 is part of the public data set "Lendingclub", which provides customers' personal information and repayment performance and is often used to test how accurately an algorithm predicts, from personal information, whether a customer will perform (repay) or default. Each row of attached table 1 holds the information of one customer and each column is one attribute; the last column is the customer's repayment status, the label usually used for predicting whether a customer will perform.

In the instance clustering, the u-th subset c_u is given by the formula

[formula image missing from source]

where R is the data of the whole table, R_{u,:} is the local instance matrix composed of all rows of the whole table that belong to the u-th subset, and v_c is the centroid vector of the u-th subset, which serves as the representative feature vector of this local matrix.

Attached table 2 is an example of an instance subset matrix.

Attached table 2

The attribute clustering clusters the data obtained by instance clustering along the attribute dimension and assigns them to different second clusters, each with a different centroid; the centroid of each second cluster is obtained by the second update formula, and similarity in the attribute clustering is calculated by the first distance formula. In the attribute clustering, the m-th subset d_m is given by the formula

[formula image missing from source]

where d_m is the local matrix data obtained by attribute clustering, as shown in attached table 3, the local attribute matrix is composed of all columns of the whole table that belong to the m-th subset, and V_{:,d} is the centroid vector of the m-th subset, which serves as the representative feature vector of this local matrix. Attached table 3 is an example of an attribute subset matrix. Note that "default or not" is usually regarded as a label rather than an attribute, so this column is usually removed before the attributes are clustered; that is, the data on which attribute clustering is performed do not include the "default or not" column.

Attached table 3

Example 1

As shown in FIG. 1 to FIG. 3, a bidirectional clustering method for risk control system data completion comprises five steps, namely instance clustering, attribute clustering, local matrix construction, local matrix filling and matrix filling, wherein:

the instance clustering assigns the sample points to different first clusters, each with a different centroid; the centroid of each first cluster is obtained by the first update formula, and similarity in the instance clustering is calculated by the first distance formula, which is

[formula image missing from source]

where D is the number of attributes of a data object; the subset c assigned to a cluster in the instance clustering is given by

[formula image missing from source]

the attribute clustering clusters the data obtained by instance clustering along the attribute dimension and assigns them to different second clusters, each with a different centroid; the centroid of each second cluster is obtained by the second update formula, similarity in the attribute clustering is calculated by the first distance formula, and the m-th subset d_m in the attribute clustering is given by the formula

[formula image missing from source]

where d_m is the local matrix data obtained by attribute clustering, the local attribute matrix is composed of all columns of the whole table that belong to the m-th subset, and V_{:,d} is the centroid vector of the m-th subset, which serves as the representative feature vector of this local matrix;

the local matrix construction co-clusters the results of the instance clustering and the attribute clustering to obtain local matrices;

the local matrix filling fills each local matrix with a latent factor model, according to the correlation between users and items, to obtain a complete local matrix, wherein the latent factor model is A = UV^T, A is the local matrix, and U and V are the latent factor matrices of the users and the feature items respectively; the first update formula and the second update formula are both

$$\mathrm{Center}_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$

where Center_k is the centroid of the k-th cluster, C_k is the k-th cluster, |C_k| is the number of data objects in the k-th cluster, and x_i denotes a data object in C_k; after Center_k is calculated, the sample point closest to this centroid is selected and becomes the updated centroid;

the matrix filling writes the filled local matrices back into the original matrix to obtain a complete matrix, thereby filling in the missing data.

Taking the processing of the samples in attached table 1 as an example, the bidirectional clustering method for risk control system data completion comprises the following operation steps:

step 1, inputting the risk control data with missing values, see attached table 1;

step 2, constructing the model and setting the parameters kn, km, I and J, wherein kn is the number of centroids for row vector clustering, whose value depends on the number of users (kn = 3 for the sample data in attached table 1); km is the number of centroids for column vector clustering, whose value depends on the number of attributes (km = 2 for the sample data in attached table 1); I and J are the maximum numbers of iterations, whose values depend on the row dimension of the matrix (I = 5 for the sample data in attached table 1); and the iteration counters are initialized to i = 0 and j = 0;

step 3, randomly selecting kn user vectors from the risk control data in attached table 1 as representative user vectors to obtain the first centroids, that is, kn centroid vectors as shown in attached table 4, where each row is one centroid vector;

attached table 4

Step 4, calculating the distance from each user vector to the kn centroid vectors according to the first distance formula, which is

[formula image missing from source]

where D is the number of attributes of a data object, and assigning each user vector to the class of its nearest centroid vector to obtain kn first clusters; the three first clusters are shown in attached table 2, attached table 5 and attached table 6 respectively;

Attached table 5

User ID	Loan amount	Credit score	Debt-to-income ratio	Province of employment	Length of employment	Revolving credit utilization	Open accounts	Number of payments	Interest rate	Default status
16	5000	704	0.11	9	6	0.47	8	36	0.12	Performing
3	4000	689	0.22	NA	NA	0.58	16	36	0.16	Performing
20	10225	689	0.33	30	NA	0.7	52	NA	0.16	Performing
18	6000	679	NA	11	10	0.3	38	36	0.08	Performing
19	24000	679	0.25	20	NA	NA	29	36	0.12	Performing
7	3000	674	0.15	32	10	0.34	25	36	0.16	Performing
2	6000	669	0.08	37	1	NA	8	36	0.12	Performing
6	3000	669	0.29	NA	4	NA	NA	36	0.16	Performing
13	5000	669	0.19	10	10	0.51	41	36	0.09	Performing

Attached table 6

User ID	Loan amount	Credit score	Debt-to-income ratio	Province of employment	Length of employment	Revolving credit utilization	Open accounts	Number of payments	Interest rate	Default status
14	35000	669	0.17	23	NA	0.87	53	60	0.19	Performing
24	14400	669	0.27	37	10	0.74	29	60	0.19	Default
1	19150	NA	0.13	11	1	0.39	41	36	0.19	Performing
5	12000	NA	0.06	33	10	0.8	5	60	0.14	Performing
11	5700	NA	0.15	16	6	0.34	NA	36	0.07	Performing
17	9600	NA	0.15	10	6	0.86	NA	36	0.11	Performing
23	14000	NA	0.13	32	9	NA	22	36	0.16	Default
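The assignment in step 4 can be sketched as follows. The distance formula is not reproduced in the text, so this example assumes Euclidean distance, and it further assumes that attributes missing from either vector are simply skipped. The function names and the small three-attribute vectors are hypothetical.

```python
import math

def partial_distance(x, c):
    """Euclidean distance over the attributes observed in both vectors (an assumption)."""
    pairs = [(a, b) for a, b in zip(x, c) if a is not None and b is not None]
    if not pairs:
        return math.inf  # nothing to compare on
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs))

def assign_rows(rows, centroids):
    """For each user vector, return the index of its nearest centroid vector."""
    return [
        min(range(len(centroids)), key=lambda k: partial_distance(r, centroids[k]))
        for r in rows
    ]

# Illustrative vectors: (loan amount, credit score, debt-to-income ratio); None = missing.
rows = [[5000, 704, 0.11], [6000, 679, None], [35000, 669, 0.17]]
centroids = [[3000, 674, 0.15], [14400, 669, 0.27]]
labels = assign_rows(rows, centroids)
print(labels)  # [0, 0, 1]
```

On unscaled data the loan amount dominates the distance, so a practical implementation would probably normalize the attributes first; the text does not say whether this is done.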

Step 5, averaging each first cluster with the centroid update formula, which is

$$\mathrm{Center}_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$

where Center_k is the centroid of the k-th first cluster, C_k is the k-th first cluster, |C_k| is the number of data objects in the k-th first cluster, and x_i denotes a data object in C_k, to obtain the second centroids;

step 6, judging whether the iteration counter i equals I; if not, incrementing i and returning to step 4; if so, executing step 7;

step 7, transposing the matrix of second centroids obtained above, as shown in attached table 7, and randomly selecting km centroid vectors from the transposed matrix as the third centroids, as shown in attached table 8, where each row is one centroid vector.

Attached table 7

User ID	7	21	24
Loan amount	3000	6500	14400
Credit score	674	714	669
Debt-to-income ratio	0.15	0.21	0.27
Province of employment	32	37	37
Length of employment	10	10	10
Revolving credit utilization	0.34	0.75	0.74
Open accounts	25	12	29
Number of payments	36	36	60
Interest rate	0.16	0.12	0.19
Default status	Performing	Performing	Default

Attached table 8

Credit score	674	714	669
Open accounts	25	12	29

Step 8, calculating the distance from each column to the third centroids with the first distance formula, and assigning each column to the class of its nearest third centroid to form km second clusters; the two second clusters are shown in attached table 9 and attached table 10 respectively;

Attached table 9

Loan amount	3000	6500	14400
Credit score	674	714	669
Province of employment	32	37	37
Open accounts	25	12	29
Number of payments	36	36	60

Attached table 10

Debt-to-income ratio	0.15	0.21	0.27
Length of employment	10	10	10
Revolving credit utilization	0.34	0.75	0.74
Interest rate	0.16	0.12	0.19
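Steps 7 and 8 reuse the row procedure on the transposed centroid matrix. The sketch below assumes Euclidean distance and uses four of the attributes from attached table 7. The two starting centroids (credit score and interest rate) are chosen here purely for illustration and differ from the random pick shown in attached table 8, and only a single assignment pass is shown.

```python
import math

def transpose(matrix):
    """Turn user rows into attribute rows, so that columns can be clustered like rows."""
    return [list(col) for col in zip(*matrix)]

def nearest(vec, centroids):
    """Index of the centroid closest to vec (Euclidean distance assumed)."""
    return min(range(len(centroids)), key=lambda k: math.dist(vec, centroids[k]))

# Rows: users 7, 21, 24; columns: loan amount, credit score,
# debt-to-income ratio, interest rate (values from attached table 7).
row_centroids = [
    [3000, 674, 0.15, 0.16],
    [6500, 714, 0.21, 0.12],
    [14400, 669, 0.27, 0.19],
]
attribute_vectors = transpose(row_centroids)

# Hypothetical starting centroids: the credit score and interest rate vectors.
column_centroids = [attribute_vectors[1], attribute_vectors[3]]
column_labels = [nearest(v, column_centroids) for v in attribute_vectors]
print(column_labels)  # [0, 0, 1, 1]
```

For these four attributes the grouping agrees with attached tables 9 and 10: loan amount sits with credit score, and debt-to-income ratio sits with interest rate.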

Step 9, averaging each second cluster to obtain the fourth centroids;

step 10, judging whether the iteration counter j equals J; if not, setting j = j + 1 and returning to step 8; if so, executing step 11;

step 11, constructing local matrices from the row vector clustering result (the first clusters) and the column vector clustering result (the second clusters), wherein the local matrix constructed from the row vector cluster of attached table 4 and the column vector cluster of attached table 9 is shown in attached table 11, and the local matrix constructed from the row vector cluster of attached table 4 and the column vector cluster of attached table 10 is shown in attached table 12;

Attached table 11

Attached table 12
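The construction in step 11 amounts to taking, for every pair of a row cluster and a column cluster, the cells where the two intersect. A minimal sketch, with a hypothetical function name and toy data (`None` marks a missing cell):

```python
def local_matrices(matrix, row_labels, col_labels):
    """Group cells into blocks keyed by (row cluster, column cluster)."""
    blocks = {}
    for u in sorted(set(row_labels)):
        rows = [i for i, r in enumerate(row_labels) if r == u]
        for m in sorted(set(col_labels)):
            cols = [j for j, c in enumerate(col_labels) if c == m]
            blocks[(u, m)] = {
                "rows": rows,   # original row indices, kept for writing back later
                "cols": cols,   # original column indices
                "values": [[matrix[i][j] for j in cols] for i in rows],
            }
    return blocks

matrix = [
    [1, None, 3, 4],
    [5, 6, None, 8],
    [9, 10, 11, None],
]
blocks = local_matrices(matrix, row_labels=[0, 1, 0], col_labels=[0, 1, 1, 0])
print(blocks[(0, 0)]["values"])  # [[1, 4], [9, None]]
```

With kn row clusters and km column clusters this yields kn × km local matrices, each of which keeps its original indices so that filled values can be placed back in the full matrix.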

Step 12, filling the local matrices with the latent factor model A = UV^T, where A is the local matrix and U and V are the latent factor matrices of the users and the attributes respectively; their numbers of rows are the number of users and the number of attributes respectively, and their number of columns is the latent factor dimension, which is 3 in this example. Taking attached table 11 as an example, solving A = UV^T gives the latent vector U8 = [32.94, 48.43, 10.14] for user 8 and the latent vector V3 = [0.22, 0.04, 3.24] for the attribute "province". The missing value of user 8 on the attribute "province" is therefore U8 V3^T ≈ 42. In this way a local matrix A' without missing values is obtained, and all local matrices obtained in step 11 are filled;

step 13, writing the filled local matrices obtained in step 12 back into the data matrix;

and step 14, outputting the data matrix, as shown in attached table 13.

Attached table 13
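The arithmetic of step 12 and the write-back of step 13 can be checked with a short sketch. The latent vectors are the ones quoted in step 12; the `write_back` helper and the small matrix are hypothetical.

```python
# Latent vectors quoted in step 12 for user 8 and the attribute "province".
u8 = [32.94, 48.43, 10.14]
v3 = [0.22, 0.04, 3.24]
missing_value = sum(a * b for a, b in zip(u8, v3))
print(round(missing_value, 2))  # 42.04, i.e. about 42

def write_back(matrix, rows, cols, filled_block):
    """Copy filled values into the full matrix, touching only the missing cells."""
    for bi, i in enumerate(rows):
        for bj, j in enumerate(cols):
            if matrix[i][j] is None:
                matrix[i][j] = filled_block[bi][bj]
    return matrix

# Toy data matrix and one filled local block covering columns 1 and 2.
full = [[1.0, None, 3.0], [4.0, 5.0, None]]
write_back(full, rows=[0, 1], cols=[1, 2], filled_block=[[2.5, 3.0], [5.0, 6.5]])
print(full)  # [[1.0, 2.5, 3.0], [4.0, 5.0, 6.5]]
```

Writing only into cells that were missing keeps every originally observed value unchanged in the output matrix.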

As attached table 13 shows, the bidirectional clustering method for risk control system data completion provided by the invention fills in missing data stably, and plays an important role in completing the massive amounts of missing data encountered at present.

Comparison of experimental results on a public data set:

The public data set consists of 656,724 loan records published by "Lendingclub" between 2013 and 2015, with 115 attributes describing each loan application. The "loan status" attribute, which describes the current status of the loan, takes the following values: "Issued", "Current", "Fully Paid", "Default", "Charged Off", "Late (16-30 days)", "Late (31-120 days)" and "In Grace Period". These statuses are reduced to a binary classification problem: loan applications with status "Charged Off", "Default", "Late (31-120 days)" or "Late (16-30 days)" are considered "bad" or "default" loans, while "Current", "Fully Paid" and "In Grace Period" are classified as "good" loans; the rest are ignored. A value of 0 indicates good credit and a value of 1 indicates bad credit or default. Loan amounts range from $1,000 to $35,000, and each loan has an associated grade from A to G. The grades correspond to interest rates in increasing order, ranging from 5.32% to 29%. The data indicate that loans with higher interest rates carry a higher risk of default: 31% of G-grade loans are bad loans, whereas only 3% of A-grade loans are. On this data set, the algorithms are compared by AUC; a higher AUC means higher accuracy.
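The reduction to a binary label described above can be written as a small lookup. The status strings are spelled here as they commonly appear in LendingClub exports, which is an assumption about the exact spelling in the data; `binary_label` is an illustrative name.

```python
# Statuses treated as bad (label 1) and good (label 0); anything else is ignored.
BAD = {"Charged Off", "Default", "Late (31-120 days)", "Late (16-30 days)"}
GOOD = {"Current", "Fully Paid", "In Grace Period"}

def binary_label(status):
    """Return 1 for a bad loan, 0 for a good loan, None if the record is ignored."""
    if status in BAD:
        return 1
    if status in GOOD:
        return 0
    return None

statuses = ["Fully Paid", "Charged Off", "Issued", "In Grace Period"]
print([binary_label(s) for s in statuses])  # [0, 1, None, 0]
```

Records that map to `None`, such as "Issued", would be dropped before evaluation.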

For comparison, the applicant considered the following methods as comparative references:

Offset: Offset is widely used as a baseline for prediction accuracy; it uses the average of all users' data for an item as the predicted value.

ItemKNN: ItemKNN clusters the users' attributes into several subsets and uses the average of each subset as the predicted value.

MF: Matrix Factorization is a latent factor model that is widely applied in risk control systems.

ADFT: Alternative Distance Function Transformation learns a distance function from must-link and cannot-link constraints between instances, and uses the distance function to compute a transformation matrix, thereby generating an alternative clustering from a set of features.

MSC: Multiple Stable Clustering uses simplex constraints to generate different sparse weights assigned to the features, and then uses spectral clustering to produce multiple stable clusterings.

MetaClustering: meta clustering is a well-known unsupervised multi-clustering method. It first assigns different weights to the features according to a Zipf distribution, and then obtains multiple clusterings by applying k-means to the weighted features.

The method of this scheme is denoted DCM.
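To make the simplest baseline concrete, the Offset method as described above fills each missing cell with the mean of the observed values in the same column. A minimal sketch with a hypothetical function name and toy data:

```python
def offset_fill(matrix):
    """Fill each missing cell (None) with the mean of the observed values in its column."""
    n_cols = len(matrix[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in matrix if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else None)
    return [
        [row[j] if row[j] is not None else means[j] for j in range(n_cols)]
        for row in matrix
    ]

data = [[1.0, None], [3.0, 4.0], [None, 8.0]]
print(offset_fill(data))  # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```

This baseline ignores every relationship between users and between attributes, which is the information that the clustering-based methods try to exploit.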

The experimental results are shown in the table below, which reports the AUC of the proposed method and of the baseline methods. The results show that the proposed DCM achieves better performance.

TABLE 3

Method	Offset	ItemKNN	MF	ADFT	MSC	MetaClustering	DCM
AUC	66.80%	77.79%	79.69%	84.55%	87.97%	88.22%	92.09%
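For reference, AUC can be computed directly from its rank interpretation: the probability that a randomly chosen bad loan receives a higher score than a randomly chosen good loan, with ties counted as one half. The sketch below is a generic illustration with made-up scores, not the evaluation code behind the table.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise form is quadratic in the number of records, so a sort-based computation would be preferable on a data set of the size described here.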

Visual comparison of experimental results: to further illustrate the performance of the proposed method, a visual comparison is given in FIG. 4, which compares the images obtained by filling with the clusters produced by ItemKNN and by DCM. With the same number of clusters, ItemKNN expresses the features less well than DCM, because DCM clusters using information from two dimensions.

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.

Claims

1. A bidirectional clustering method for risk control system data completion, characterized by comprising five steps, namely instance clustering, attribute clustering, local matrix construction, local matrix filling and matrix filling, wherein:

the attribute clustering clusters the data obtained by instance clustering along the attribute dimension and assigns them to different second clusters, each with a different centroid; the centroid of each second cluster is obtained by the second update formula, similarity in the attribute clustering is calculated by the first distance formula, and the subset d assigned to a cluster in the attribute clustering is given by

[formula image missing from source]

2. The bidirectional clustering method for risk control system data completion of claim 1, characterized in that the first update formula and the second update formula are both

$$\mathrm{Center}_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$

where Center_k is the centroid of the k-th cluster, C_k is the k-th cluster, |C_k| is the number of data objects in the k-th cluster, and x_i denotes a data object in C_k.

3. The bidirectional clustering method for risk control system data completion of claim 2, characterized in that after Center_k is calculated, the sample point closest to this centroid is selected and becomes the new centroid.