CN114511399B - Abnormal data identification and elimination method

Info

Publication number: CN114511399B
Application number: CN202210138272.2A
Authority: CN
Inventors: 李开恒; 岳钧; 王子凡; 赵文宇; 赵灿阳; 涂俊; 王正宁
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2022-02-15
Filing date: 2022-02-15
Publication date: 2023-12-15
Anticipated expiration: 2042-02-15
Also published as: CN114511399A

Abstract

The invention discloses a method for identifying and eliminating abnormal data, relating to the field of internet finance. The method comprises the following steps: S1, acquiring a training data set and determining the attribute weights of the training data set; S2, establishing binary trees and dividing the training data set into normal data and abnormal data; S3, establishing a basic model, importing the training data set for optimization training, and obtaining an optimal abnormal data set; S4, screening the data set to be screened against the optimal abnormal data set and removing the abnormal data. A set of binary trees is constructed based on the attribute weights and used to identify abnormal data, so that abnormal data is divided more accurately, and weights can be assigned per attribute when the anomaly score is calculated, so that abnormal data is identified more accurately. The basic model updates the abnormal data in real time during training: once the penalty weight of a single sample reaches a set value, the sample is deleted from the data set, so that abnormal samples are removed and the model fully mines the information in the data.

Description

Abnormal data identification and elimination method

Technical Field

The invention relates to the technical field of computers, in particular to a method for identifying and eliminating abnormal data.

Background

In recent years, the internet finance industry has developed very rapidly in China, while the corresponding supervision measures and technologies have not kept pace, leaving regulatory blind spots. The entry threshold of the lending industry is low and the quality of participants varies greatly, so the risk level of lending business in the internet finance industry is high. For internet finance platforms whose main business is lending, imperfect information technology means that the platform cannot establish an effective risk monitoring system, and defaults by borrowers occur continually. To obtain a loan from an enterprise, a user may conceal relevant personal information; such low-quality user information increases the credit risk of the enterprise, and the resulting loss is hard to measure. By deeply mining and analyzing the credit and behavior data of borrowers on online shopping, transaction, social and other platforms, and by classifying and summarizing the data they generate on each platform and in each scenario, a default prediction model built with machine learning methods against the background of the big-data era can convert the effective information of a user into the user's default probability, so that the risk of lending transactions between users and the platform is better controlled.

Data determines the upper limit of a model, and high-quality data is the guarantee that the model correctly identifies defaulting users. In internet finance risk control scenarios, some users report false information to evade supervision, and under the influence of historical factors, data from different periods differ in distribution and are uneven in quality, so the model learns poorly.

Disclosure of Invention

The invention aims to solve the above problems and provides a method for identifying and eliminating abnormal data.

The invention realizes the above purpose through the following technical scheme:

The method for identifying and eliminating abnormal data comprises the following steps:

S1, acquiring a training data set and determining the attribute weights of the training data set, wherein the training data set is the credit and behavior data of users who wish to borrow;

S2, establishing binary trees and dividing the training data set into normal data and abnormal data;

S3, establishing a basic model, importing the training data set for optimization training, and obtaining an optimal abnormal data set;

S4, screening the data set to be screened against the optimal abnormal data set and removing the abnormal data.

The beneficial effects of the invention are as follows: a set of binary trees is constructed based on the attribute weights and used to identify abnormal data, so that abnormal data is divided more accurately, and weights can be assigned per attribute when the anomaly score is calculated, so that abnormal data is identified more accurately;

and the basic model updates the abnormal data in real time during training: once the penalty weight of a single sample reaches a set value, the sample is deleted from the data set, so that abnormal samples are effectively removed and valid data is retained, the model fully mines the information in the data, and the default probability of a user is analyzed accurately.

Drawings

Fig. 1 is a flow chart of the method for recognizing and rejecting abnormal data according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.

In the description of the present invention, it should be understood that the directions or positional relationships indicated by the terms "upper", "lower", "inner", "outer", "left", "right", etc. are based on the directions or positional relationships shown in the drawings, or the directions or positional relationships conventionally put in place when the inventive product is used, or the directions or positional relationships conventionally understood by those skilled in the art are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific direction, be configured and operated in a specific direction, and therefore should not be construed as limiting the present invention.

Furthermore, the terms "first," "second," and the like, are used merely to distinguish between descriptions and should not be construed as indicating or implying relative importance.

In the description of the present invention, it should also be noted that, unless explicitly specified and limited otherwise, terms such as "disposed," "connected," and the like are to be construed broadly, and for example, "connected" may be either fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.

The following describes specific embodiments of the present invention in detail with reference to the drawings.

S1, acquiring a training data set and determining the attribute weights of the training data set, wherein the training data set is the credit and behavior data of users who wish to borrow, specifically comprising:

S11, the training data set D is expressed as D = {(x₁, y₁), (x₂, y₂), ..., (x_n, y_n)}, and the attribute set of D is denoted X = {z₁, z₂, ..., z_k}, where n is the number of data items, k is the number of attributes, x_i is each piece of data, and y_i is the label corresponding to x_i;

S12, calculating the attribute weights W_x = {w_x¹, w_x², ..., w_x^k} of X, where each w_x^i is the weight of attribute z_i in X, expressed as [formula not reproduced in this text], where V is the number of possible values of attribute z_i; dividing D by z_i produces V branch nodes, and the v-th branch node contains all samples in D whose attribute z_i takes the v-th value, denoted D_v.
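
The weight formula for S12 does not appear in this text. The definitions of V and D_v match the usual information-gain setup, so the following minimal sketch assumes, without confirmation from the source, that the weight of an attribute is its information gain with respect to the labels. The function names `entropy` and `attribute_weights` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def attribute_weights(rows, labels):
    """Return one weight per attribute column.

    ASSUMPTION: the weight is the information gain of the attribute,
    Ent(D) - sum_v |D_v| / |D| * Ent(D_v); the source does not show its formula.
    """
    n = len(rows)
    k = len(rows[0])
    base = entropy(labels)
    weights = []
    for i in range(k):
        # Partition the labels of D into branches D_v by the value of attribute z_i.
        branches = defaultdict(list)
        for row, y in zip(rows, labels):
            branches[row[i]].append(y)
        remainder = sum(len(b) / n * entropy(b) for b in branches.values())
        weights.append(base - remainder)
    return weights


# Toy data: attribute 0 determines the label, attribute 1 carries no information.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]
weights = attribute_weights(rows, labels)
```

On this toy data the informative attribute receives the larger weight, which is the property S21 relies on when it ranks attributes. Continuous attributes would need binning before this partition makes sense; the source does not say how that is handled.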

S2, establishing a binary tree, and dividing a training data set into normal data and abnormal data, wherein the method specifically comprises the following steps of:

S21, establishing a set of binary trees based on the attribute weights: sort the attribute weights of X from largest to smallest and select the first J attributes, where J is 50. For each attribute z_i among the J attributes, establish a binary tree Tree_t whose attribute weight corresponds to w_x^i in W_x. Randomly select a value z_i^j of z_i as the root Root_t of the binary tree; the left and right subtrees of Tree_t are generated according to [formula not reproduced in this text], where MIN(z_i) is the minimum value of attribute z_i, MAX(z_i) is the maximum value of attribute z_i, Filter(x) is the filter function, and q is the value of x_i in data set D on attribute z_i. Values q less than or equal to z_i^j are placed in the left subtree, and the others in the right subtree. The binary tree is built recursively in this way; J binary trees are generated, forming the attribute-weight-based binary tree set F = {Tree₁, Tree₂, ..., Tree_J};

S22, each piece of data x_i in D traverses the binary trees in F, and the depth h of x_i in each binary tree is calculated; the isolation score of x_i in binary tree Tree_j is h × w_x^k, denoted s_j;

S23, calculating the average isolation score [formula not reproduced in this text]; the set of isolation scores of all data in D is expressed as Is = {γ₁, γ₂, ..., γ_n};

S24, dividing data set D into abnormal and normal data sets according to the isolation score set Is, the division being expressed as [formula not reproduced in this text], where κ is the dividing threshold: data whose isolation score is greater than the threshold are classified as abnormal data and recorded as the abnormal data set D_s = {(x₁, y₁), (x₂, y₂), ..., (x_m, y_m)}; the rest are normal data, recorded as the normal data set D_t = {(x_{m+1}, y_{m+1}), (x_{m+2}, y_{m+2}), ..., (x_n, y_n)}.
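
Steps S21 to S24 can be sketched as follows. The subtree-generation and averaging formulas do not appear in this text, so the sketch makes two labeled assumptions: the split value at each node is drawn uniformly between the minimum and maximum of the values at that node, and the average isolation score is a plain mean over the trees. All function names, the depth limit, and the toy data are hypothetical.

```python
import random


def build_tree(values, rng, depth=0, max_depth=8):
    """Recursively split a list of attribute values at a random pivot.

    ASSUMPTION: the pivot is drawn uniformly from [MIN, MAX] of the values
    at the node; the source's generation formula is not shown.
    """
    lo, hi = min(values), max(values)
    if lo == hi or depth >= max_depth:
        return None  # leaf: nothing left to separate
    pivot = rng.uniform(lo, hi)
    left = [q for q in values if q <= pivot]   # q <= pivot goes left (S21)
    right = [q for q in values if q > pivot]   # everything else goes right
    if not right:
        return None  # pivot landed on MAX, no split possible
    return {
        "pivot": pivot,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }


def depth_of(tree, q):
    """Number of splits value q passes before reaching a leaf."""
    h = 0
    while tree is not None:
        tree = tree["left"] if q <= tree["pivot"] else tree["right"]
        h += 1
    return h


def isolation_scores(rows, weights, top_j, seed=0):
    """One score per row: mean over trees of depth * attribute weight."""
    rng = random.Random(seed)
    # S21: keep the J attributes with the largest weights, one tree per attribute.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    trees = [(i, build_tree([r[i] for r in rows], rng)) for i in order[:top_j]]
    scores = []
    for r in rows:
        per_tree = [depth_of(t, r[i]) * weights[i] for i, t in trees]  # s_j (S22)
        scores.append(sum(per_tree) / len(per_tree))  # ASSUMED plain mean (S23)
    return scores


def split_by_threshold(rows, labels, scores, kappa):
    """S24: scores above kappa form the abnormal set, the rest the normal set."""
    abnormal = [(x, y) for x, y, g in zip(rows, labels, scores) if g > kappa]
    normal = [(x, y) for x, y, g in zip(rows, labels, scores) if g <= kappa]
    return abnormal, normal


# Toy data with two attributes; the weights are made up for illustration.
rows = [(0.1, 5.0), (0.2, 5.1), (0.15, 4.9), (9.0, 5.05)]
labels = [0, 0, 0, 1]
scores = isolation_scores(rows, weights=[0.8, 0.2], top_j=2)
# Illustrative threshold only; the text uses a fixed kappa on its own score scale.
abnormal, normal = split_by_threshold(rows, labels, scores, sum(scores) / len(scores))
```

Because the normalisation of the score is not shown in this text, the scale of `scores` here need not match the scale on which the threshold κ is defined, which is why the usage lines pick a data-dependent threshold instead.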

S3, establishing a basic model, importing a training data set for optimization training, and acquiring an optimal abnormal data set, wherein the method specifically comprises the following steps of:

S31, the basic model Φ is an XGBoost model; initialize the penalty weight set of the abnormal data W_s = {ε₁, ε₂, ..., ε_m}, where ε_i is the penalty weight of each piece of abnormal data and is initially 0;

S32, importing the normal data set and the abnormal data set into the basic model and obtaining a prediction result for each piece of data; the set of prediction results is expressed as Y = Φ(X), Y = {(μ₁, ν₁), (μ₂, ν₂), ..., (μ_n, ν_n)}, where μ and ν are the predicted probability value and the predicted label, respectively;

S33, updating each weight ε_i in the penalty weight set W_s of the abnormal data set according to the prediction result of the basic model Φ; the update rule is ε_i + (|ν_i − y_i| − |1 − ν_i + y_i|) μ_i → ε_i (i ≤ m), where y_i is the true label value;

S34, updating the abnormal data set and calculating the mean r and standard deviation σ of W_s, expressed as [formula not reproduced in this text];

S35, traversing the abnormal data set D_s, setting the penalty weight threshold z to 0.9, deleting the abnormal data samples whose penalty weight ε_i > z, and updating D_s and the total number of abnormal samples m after the deletion;

S36, setting the number of iterations N of the basic model Φ to 50, and judging whether the abnormal data set D_s is the optimal abnormal data set according to whether the number of iterations in which the basic model has updated the abnormal data set is greater than 50; if so, the current abnormal data set D_s is taken as the optimal abnormal data set and the process proceeds to S4; otherwise, the process returns to S34.
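
The loop formed by S31 to S36 can be sketched as below. This is only an illustration under stated assumptions: `predict` is a hypothetical stand-in for fitting the basic model (XGBoost in the text) and returning a pair (μ, ν) per abnormal sample, each iteration is assumed to re-run the prediction, and the mean and standard deviation of S34 are computed but not used because the text does not show how they enter the procedure. The update rule is copied term by term from S33.

```python
import statistics


def update_penalty(eps, mu, nu, y):
    """S33 as written in the text: eps + (|nu - y| - |1 - nu + y|) * mu."""
    return eps + (abs(nu - y) - abs(1 - nu + y)) * mu


def refine_abnormal_set(abnormal, normal, predict, n_iter=50, z=0.9):
    """Iterate S32 to S36 and return the abnormal samples that survive.

    `predict(train, samples)` is a HYPOTHETICAL callable standing in for the
    basic model; it must return one (mu, nu) pair per element of `samples`.
    """
    abnormal = list(abnormal)
    penalties = [0.0] * len(abnormal)  # S31: every penalty weight starts at 0
    for _ in range(n_iter):            # S36: stop after N iterations
        if not abnormal:
            break
        preds = predict(normal + abnormal, abnormal)  # S32
        penalties = [
            update_penalty(e, mu, nu, y)
            for e, (mu, nu), (_x, y) in zip(penalties, preds, abnormal)
        ]  # S33
        # S34: mean and standard deviation of W_s (population form assumed).
        r = statistics.fmean(penalties)
        sigma = statistics.pstdev(penalties)
        # S35: delete samples whose penalty weight exceeds z, then update D_s.
        kept = [(s, e) for s, e in zip(abnormal, penalties) if e <= z]
        abnormal = [s for s, _e in kept]
        penalties = [e for _s, e in kept]
    return abnormal


def toy_predict(train, samples):
    """Stand-in model: predicts label 1 with probability 0.6 for every sample."""
    return [(0.6, 1) for _ in samples]


normal = [((0.1,), 0)]
abnormal = [((9.0,), 0), ((8.0,), 1)]
survivors = refine_abnormal_set(abnormal, normal, toy_predict, n_iter=3)
```

With this stand-in, the first abnormal sample is mispredicted in every round, so its penalty grows by 0.6 per iteration and it is deleted once it exceeds 0.9; the second is predicted correctly, its penalty only decreases, and it remains in D_s.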

The technical scheme of the invention is not limited to the specific embodiment, and all technical modifications made according to the technical scheme of the invention fall within the protection scope of the invention.

Claims

1. A method for identifying and eliminating abnormal data, characterized by comprising the following steps:

S3, establishing a basic model, importing a training data set for optimization training, and obtaining an optimal abnormal data set, specifically comprising:

S32, importing the normal data set and the abnormal data set into the basic model and obtaining a prediction result for each piece of data, the set of prediction results being expressed as Y = Φ(X), Y = {(μ₁, ν₁), (μ₂, ν₂), ..., (μ_n, ν_n)}, where μ and ν are the predicted probability value and the predicted label, respectively;

S35, traversing the abnormal data set D_s, setting a penalty weight threshold z, deleting the abnormal data samples whose penalty weight ε_i > z, and updating D_s and the total number of abnormal samples m after the deletion;

S36, judging whether the abnormal data set D_s is the optimal abnormal data set; if so, taking the current abnormal data set D_s as the optimal abnormal data set and proceeding to S4; otherwise, returning to S34;

2. The method for identifying and eliminating abnormal data according to claim 1, wherein S1 comprises:

S11, the training data set D is expressed as D = {(x₁, y₁), (x₂, y₂), ..., (x_n, y_n)}, and the attribute set of D is denoted X = {z₁, z₂, ..., z_k}, where n is the number of data items, k is the number of attributes, x_i is each piece of data, and y_i is the label corresponding to x_i;

S12, calculating the attribute weights W_x = {w_x¹, w_x², ..., w_x^k} of X, where each w_x^i is the weight of attribute z_i in X, expressed as [formula not reproduced in this text], where V is the number of possible values of attribute z_i; dividing D by z_i produces V branch nodes, and the v-th branch node contains all samples in D whose attribute z_i takes the v-th value, denoted D^v.

3. The method for identifying and eliminating abnormal data according to claim 2, wherein S2 comprises:

S21, establishing a set of binary trees based on the attribute weights: select J attributes according to the attribute weights of X, and for each attribute z_i among the J attributes establish a binary tree Tree_t whose attribute weight corresponds to w_x^i in W_x. Randomly select a value z_i^j of z_i as the root Root_t of the binary tree; the left and right subtrees of Tree_t are generated according to [formula not reproduced in this text], where MIN(z_i) is the minimum value of attribute z_i, MAX(z_i) is the maximum value of attribute z_i, Filter(x) is the filter function, and q is the value of x_i in data set D on attribute z_i. Values q less than or equal to z_i^j are placed in the left subtree, and the others in the right subtree. The binary tree is built recursively in this way; J binary trees are generated, forming the attribute-weight-based binary tree set F = {Tree₁, Tree₂, ..., Tree_J};

S22, each piece of data x_i in D traverses the binary trees in F, and the depth h of x_i in each binary tree is calculated; the isolation score of x_i in binary tree Tree_j is h × w_x^k, denoted s_j;

4. The method for identifying and eliminating abnormal data according to claim 1, wherein in S36 the number of iterations N of the basic model Φ is set to 50, and whether the abnormal data set D_s is the optimal abnormal data set is judged according to whether the number of iterations in which the basic model has updated the abnormal data set is greater than 50.

5. The method for identifying and eliminating abnormal data according to claim 1, wherein the penalty weight threshold z is 0.9, the threshold κ is 0.8, and the number T is 50.

6. The method for identifying and eliminating abnormal data according to claim 3, wherein in S21 the first J attributes are selected after sorting the attribute weights of X from largest to smallest.