CN114511399B - Abnormal data identification and elimination method - Google Patents


Publication number
CN114511399B
CN114511399B CN202210138272.2A CN202210138272A CN114511399B CN 114511399 B CN114511399 B CN 114511399B CN 202210138272 A CN202210138272 A CN 202210138272A CN 114511399 B CN114511399 B CN 114511399B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210138272.2A
Other languages
Chinese (zh)
Other versions
CN114511399A (en)
Inventor
李开恒
岳钧
王子凡
赵文宇
赵灿阳
涂俊
王正宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202210138272.2A
Publication of CN114511399A
Application granted
Publication of CN114511399B
Current legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00: Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03: Credit; Loans; Processing thereof
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method for identifying and eliminating abnormal data, relating to the field of internet finance. The method comprises: S1, acquiring a training data set and determining its attribute weights; S2, building binary trees and dividing the training data set into normal data and abnormal data; S3, building a base model, importing the training data set for optimization training, and obtaining an optimal abnormal data set; S4, screening the data set to be screened against the optimal abnormal data set and removing its abnormal data. A set of binary trees is constructed from the attribute weights and used to identify abnormal data, so that abnormal data is divided more accurately; because the anomaly score is weighted per attribute, abnormal data is identified more accurately. The base model updates the abnormal data during training: once the penalty weight of a single sample reaches a set value, the sample is deleted from the data set, so abnormal samples are removed and the model can fully mine the information in the data.

Description

Abnormal data identification and elimination method
Technical Field
The invention relates to the technical field of computers, in particular to a method for identifying and eliminating abnormal data.
Background
In recent years the internet finance industry has developed very rapidly in China, but the corresponding supervisory measures and technologies have not kept pace, leaving regulatory blind spots. The lending industry has a low barrier to entry and the quality of participants varies greatly, so the risk level of lending business in internet finance is high. Internet finance platforms whose main business is lending often lack mature information technology and cannot establish an effective risk monitoring system, so borrower defaults keep occurring. To obtain a loan, a user may conceal relevant personal information; such low-quality user information increases the platform's credit risk, and the resulting losses are hard to measure. By deeply mining and analyzing borrowers' credit and behavior data on online shopping, transaction, social and other platforms, and by classifying and summarizing the data they generate across platforms and scenarios, a default prediction model built with machine learning methods in the big data era can convert useful user information into a default probability, so that the risk of lending transactions between users and the platform is better controlled.
Data determines the upper limit of a model, and high-quality data is what allows the model to identify defaulting users correctly. In internet finance risk control, some users report false information to evade supervision, and, owing to historical factors, data from different periods differ in distribution. Data quality is therefore uneven and the model learns poorly.
Disclosure of Invention
The invention aims to solve the above problems by providing an abnormal data identification and elimination method.
The invention achieves this purpose through the following technical scheme:
The method for identifying and eliminating abnormal data comprises the following steps:
S1. Acquire a training data set and determine its attribute weights, where the training data set consists of credit and behavior data of users who wish to borrow;
S2. Build binary trees and divide the training data set into normal data and abnormal data;
S3. Build a base model, import the training data set for optimization training, and obtain an optimal abnormal data set;
S4. Screen the data set to be screened against the optimal abnormal data set and remove its abnormal data.
The beneficial effects of the invention are as follows. A set of binary trees is constructed from the attribute weights and used to identify abnormal data, so that abnormal data is divided more accurately; because the anomaly score is weighted per attribute, abnormal data is identified more accurately.
In addition, the base model updates the abnormal data during training: once the penalty weight of a single sample reaches a set value, the sample is deleted from the data set. Abnormal samples are thus removed effectively while valid data is retained, so the model can fully mine the information in the data and analyze a user's default probability accurately.
Drawings
Fig. 1 is a flow chart of the method for recognizing and rejecting abnormal data according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present invention, it should be understood that directions or positional relationships indicated by terms such as "upper", "lower", "inner", "outer", "left" and "right" are based on those shown in the drawings, on those in which the inventive product is conventionally placed in use, or on those conventionally understood by those skilled in the art. They are used merely for convenience and simplicity of description and do not indicate or imply that the apparatus or elements referred to must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and the like, are used merely to distinguish between descriptions and should not be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless explicitly specified and limited otherwise, terms such as "disposed" and "connected" are to be construed broadly. For example, "connected" may mean fixedly connected, detachably connected, or integrally connected; mechanically or electrically connected; directly connected or indirectly connected through an intermediate medium; or communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
The following describes specific embodiments of the present invention in detail with reference to the drawings.
The method for identifying and rejecting the abnormal data comprises the following steps:
S1. Acquire a training data set and determine its attribute weights, where the training data set consists of credit and behavior data of users who wish to borrow. Specifically:
S11. The training data set $D$ is represented as $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, and the attribute set of $D$ is denoted $X = \{z_1, z_2, \ldots, z_k\}$, where $n$ is the number of records, $k$ is the number of attributes, $x_i$ is a record, and $y_i$ is its corresponding label;
S12. Calculate the attribute weights $W_x = \{w_x^1, w_x^2, \ldots, w_x^k\}$ of $X$, where $w_x^i$ is the weight of attribute $z_i$ in $X$, expressed as [expression not reproduced in the source text], where $V$ is the number of possible values of attribute $z_i$. Dividing $D$ by $z_i$ produces $V$ branch nodes, and the $v$-th branch node contains all samples in $D$ that take the $v$-th value of attribute $z_i$, denoted $D^v$.
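The weight expression in S12 is not legible in this text. Its description (partitioning $D$ into $V$ branch nodes $D^v$ by the values of one attribute) resembles the information gain used in decision-tree learning, so the following is a minimal sketch under that assumption only. The function names `entropy` and `attribute_weights` are hypothetical, and the patent may use a different measure (for example a gain ratio).

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def attribute_weights(rows, labels):
    """Assumed weight of attribute i: information gain of splitting D on z_i.

    rows   -- list of equal-length tuples of discrete attribute values (x_i)
    labels -- list of class labels (y_i), aligned with rows
    """
    base = entropy(labels)
    n = len(rows)
    weights = []
    for i in range(len(rows[0])):
        # D^v: labels of the samples whose value on attribute i equals v
        branches = defaultdict(list)
        for row, y in zip(rows, labels):
            branches[row[i]].append(y)
        remainder = sum(len(b) / n * entropy(b) for b in branches.values())
        weights.append(base - remainder)
    return weights

# Toy data: attribute 0 determines the label, attribute 1 is uninformative.
rows = [("a", "p"), ("a", "q"), ("b", "p"), ("b", "q")]
labels = [0, 0, 1, 1]
weights = attribute_weights(rows, labels)
```

Under this assumption an attribute that separates the labels well receives a large weight, which is what lets S21 rank attributes and keep the first J.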
S2. Build binary trees and divide the training data set into normal data and abnormal data. Specifically:
S21. Build a set of binary trees based on the attribute weights. Sort the attribute weights of $X$ in descending order and select the first $J$ attributes, where $J$ is 50. For each attribute $z_i$ among the $J$ attributes, build a binary tree $Tree_t$ whose attribute weight corresponds to $w_x^i$ in $W_x$. Randomly select a value $z_i^j$ of $z_i$ as the root $Root_t$ of the tree. The left and right subtrees of $Tree_t$ are generated according to [expression not reproduced in the source text], where $MIN(z_i)$ is the minimum value of attribute $z_i$, $MAX(z_i)$ is its maximum value, $Filter(x)$ is the filter function, and $q$ is the value of $x_i$ in data set $D$ on attribute $z_i$: records with $q \le z_i^j$ are placed in the left subtree and all others in the right subtree. The tree is completed recursively, producing $J$ binary trees that form the attribute-weighted binary tree set $F = \{Tree_1, Tree_2, \ldots, Tree_J\}$;
S22. Each record $x_i$ in $D$ traverses the binary trees in $F$, and the depth $h$ of $x_i$ in each binary tree is calculated. The isolation score of $x_i$ in $Tree_j$ is $h \times w_x^k$, where $w_x^k$ is the attribute weight of that tree, and is denoted $s_j$;
S23. Calculate the average isolation score $\gamma$ [expression not reproduced in the source text]. The set of isolation scores of all data in $D$ is denoted $Is = \{\gamma_1, \gamma_2, \ldots, \gamma_n\}$;
S24. Divide the data set $D$ into abnormal and normal data sets according to the isolation score set $Is$, where the division is expressed as [expression not reproduced in the source text] and $\kappa$ is the dividing threshold. Data whose isolation score is greater than the threshold is classified as abnormal and recorded as the abnormal data set $D_s = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$; the remaining data is normal and recorded as the normal data set $D_t = \{(x_{m+1}, y_{m+1}), (x_{m+2}, y_{m+2}), \ldots, (x_n, y_n)\}$.
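Steps S21 to S24 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the subtree-generation and averaging expressions are missing from this text, so the sketch assumes that recursion stops when a node holds one record or only identical values, that the random root value is drawn below the node maximum so both subtrees are non-empty, and that the average score is the arithmetic mean over the selected trees. The names `tree_depths`, `isolation_scores` and `split_by_threshold` are hypothetical.

```python
import random

def tree_depths(values, rng):
    """Depth at which each record ends up in one single-attribute tree.

    values -- the value of every record on one attribute z_i (non-empty)
    Returns a list of depths aligned with `values` (root level = 0).
    """
    depths = [0] * len(values)

    def split(indices, depth):
        lo = min(values[i] for i in indices)
        hi = max(values[i] for i in indices)
        if len(indices) <= 1 or lo == hi:      # assumed stopping rule
            for i in indices:
                depths[i] = depth
            return
        # Random root value; drawing it below the maximum keeps both
        # subtrees non-empty (an assumption, the source does not say).
        pivot = rng.choice([values[i] for i in indices if values[i] < hi])
        split([i for i in indices if values[i] <= pivot], depth + 1)  # left
        split([i for i in indices if values[i] > pivot], depth + 1)   # right

    split(list(range(len(values))), 0)
    return depths

def isolation_scores(rows, weights, top_j, seed=0):
    """Per-record mean of depth * attribute weight over the top_j attributes."""
    rng = random.Random(seed)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    chosen = order[:top_j]                     # S21: keep the first J attributes
    totals = [0.0] * len(rows)
    for i in chosen:
        depths = tree_depths([row[i] for row in rows], rng)
        for n, h in enumerate(depths):
            totals[n] += h * weights[i]        # S22: s_j = h * w
    return [t / len(chosen) for t in totals]   # S23: assumed arithmetic mean

def split_by_threshold(scores, kappa):
    """S24: indices with score > kappa are abnormal, the rest are normal."""
    abnormal = [i for i, g in enumerate(scores) if g > kappa]
    normal = [i for i, g in enumerate(scores) if g <= kappa]
    return abnormal, normal

rows = [(1.0, 5.0), (2.0, 5.0), (3.0, 5.0), (40.0, 5.0)]
weights = [0.9, 0.1]
scores = isolation_scores(rows, weights, top_j=2)
abnormal, normal = split_by_threshold(scores, kappa=0.8)
```

The toy call uses `top_j=2` only because the example has two attributes; the text sets J to 50. The recursion is written for clarity and would need an iterative form for very deep trees.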
S3. Build a base model, import the training data set for optimization training, and obtain an optimal abnormal data set. Specifically:
S31. The base model $\Phi$ is an XGBoost model. Initialize the penalty weight set of the abnormal data, $W_s = \{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m\}$, where $\varepsilon_i$ is the penalty weight of each abnormal record and is initially 0;
S32. Import the normal data set and the abnormal data set into the base model and obtain a prediction for each record. The set of predictions is expressed as $Y = \Phi(X)$, $Y = \{(\mu_1, \nu_1), (\mu_2, \nu_2), \ldots, (\mu_n, \nu_n)\}$, where $\mu$ and $\nu$ are the predicted probability and the predicted label, respectively;
S33. Update each weight $\varepsilon_i$ in the penalty weight set $W_s$ of the abnormal data set according to the predictions of the base model $\Phi$. The update rule is $\varepsilon_i + (|\nu_i - y_i| - |1 - \nu_i + y_i|)\,\mu_i \rightarrow \varepsilon_i$ $(i \le m)$, where $y_i$ is the true label;
S34. Update the abnormal data set and calculate the mean $r$ and standard deviation $\sigma$ of $W_s$, expressed as [expression not reproduced in the source text];
S35. Traverse the abnormal data set $D_s$, set the penalty weight threshold $z$ to 0.9, delete the abnormal samples whose penalty weight satisfies $\varepsilon_i > z$, and then update $D_s$ and the total number of abnormal samples $m$;
S36. Set the number of iterations $N$ of the base model $\Phi$ to 50, and judge whether $D_s$ is the optimal abnormal data set by whether the number of times the base model has updated the abnormal data set exceeds 50. If so, take the current $D_s$ as the optimal abnormal data set and proceed to S4; otherwise return to S34.
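The penalty-weight loop of S31 to S36 can be sketched as below. To stay self-contained the sketch does not train XGBoost: `predict` is a hypothetical stand-in for the base model that returns one `(mu, v)` pair per record, and the model is not retrained between iterations. The mean and standard deviation of S34 are left out because their expression, and how they feed back into the loop, is not legible in this text. `update_penalties` and `prune_abnormal` are illustrative names.

```python
def update_penalties(penalties, predictions, labels):
    """One S33 step: eps_i <- eps_i + (|v_i - y_i| - |1 - v_i + y_i|) * mu_i."""
    return [
        eps + (abs(v - y) - abs(1 - v + y)) * mu
        for eps, (mu, v), y in zip(penalties, predictions, labels)
    ]

def prune_abnormal(abnormal, labels, predict, threshold=0.9, iterations=50):
    """Repeat S32, S33 and S35, dropping records whose penalty exceeds threshold.

    abnormal -- list of feature vectors currently in D_s
    labels   -- true labels y_i aligned with `abnormal`
    predict  -- hypothetical stand-in for the base model: maps a list of
                feature vectors to a list of (mu, v) pairs
    """
    penalties = [0.0] * len(abnormal)          # S31: all weights start at 0
    for _ in range(iterations):                # S36: fixed iteration budget
        if not abnormal:
            break
        predictions = predict(abnormal)        # S32
        penalties = update_penalties(penalties, predictions, labels)  # S33
        keep = [i for i, eps in enumerate(penalties) if eps <= threshold]  # S35
        abnormal = [abnormal[i] for i in keep]
        labels = [labels[i] for i in keep]
        penalties = [penalties[i] for i in keep]
    return abnormal, labels, penalties

# Stand-in model: always predicts label 1 with probability 0.5.
def constant_model(batch):
    return [(0.5, 1) for _ in batch]

remaining, remaining_labels, final_penalties = prune_abnormal(
    [[1], [2]], [0, 1], constant_model, iterations=3
)
```

Evaluating the rule as written for binary labels: a prediction of 1 against a true label of 0 adds $\mu_i$ to the penalty, while the other three combinations subtract $\mu_i$. In the toy run the first record is mispredicted twice, crosses 0.9 and is dropped; the second keeps a negative penalty and stays.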
S4. Screen the data set to be screened against the optimal abnormal data set and remove its abnormal data.
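The text does not say how records in the data set to be screened are matched against the optimal abnormal data set. The sketch below assumes the simplest reading, exact matching on feature values; `screen` is a hypothetical helper, and a real system might instead match on a record identifier or reuse the scoring from S2.

```python
def screen(candidates, optimal_abnormal):
    """Return the candidate records that do not appear in the abnormal set.

    Matching is by exact equality of feature tuples, which is an assumption.
    """
    flagged = {tuple(x) for x in optimal_abnormal}
    return [x for x in candidates if tuple(x) not in flagged]

cleaned = screen([(1, 2), (3, 4), (5, 6)], [(3, 4)])
```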
The technical scheme of the invention is not limited to the specific embodiment, and all technical modifications made according to the technical scheme of the invention fall within the protection scope of the invention.

Claims (6)

1. A method for identifying and eliminating abnormal data, characterized by comprising the following steps:
S1. Acquire a training data set and determine its attribute weights, where the training data set consists of credit and behavior data of users who wish to borrow;
S2. Build binary trees and divide the training data set into normal data and abnormal data;
S3. Build a base model, import the training data set for optimization training, and obtain an optimal abnormal data set; specifically:
S31. The base model $\Phi$ is an XGBoost model. Initialize the penalty weight set of the abnormal data, $W_s = \{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m\}$, where $\varepsilon_i$ is the penalty weight of each abnormal record and is initially 0;
S32. Import the normal data set and the abnormal data set into the base model and obtain a prediction for each record. The set of predictions is expressed as $Y = \Phi(X)$, $Y = \{(\mu_1, \nu_1), (\mu_2, \nu_2), \ldots, (\mu_n, \nu_n)\}$, where $\mu$ and $\nu$ are the predicted probability and the predicted label, respectively;
S33. Update each weight $\varepsilon_i$ in the penalty weight set $W_s$ of the abnormal data set according to the predictions of the base model $\Phi$. The update rule is $\varepsilon_i + (|\nu_i - y_i| - |1 - \nu_i + y_i|)\,\mu_i \rightarrow \varepsilon_i$ $(i \le m)$, where $y_i$ is the true label;
S34. Update the abnormal data set and calculate the mean $r$ and standard deviation $\sigma$ of $W_s$, expressed as [expression not reproduced in the source text];
S35. Traverse the abnormal data set $D_s$, set a penalty weight threshold $z$, delete the abnormal samples whose penalty weight satisfies $\varepsilon_i > z$, and then update $D_s$ and the total number of abnormal samples $m$;
S36. Judge whether the abnormal data set $D_s$ is the optimal abnormal data set. If so, take the current $D_s$ as the optimal abnormal data set and proceed to S4; otherwise return to S34;
S4. Screen the data set to be screened against the optimal abnormal data set and remove its abnormal data.
2. The method for identifying and eliminating abnormal data according to claim 1, wherein S1 comprises:
S11. The training data set $D$ is represented as $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, and the attribute set of $D$ is denoted $X = \{z_1, z_2, \ldots, z_k\}$, where $n$ is the number of records, $k$ is the number of attributes, $x_i$ is a record, and $y_i$ is its corresponding label;
S12. Calculate the attribute weights $W_x = \{w_x^1, w_x^2, \ldots, w_x^k\}$ of $X$, where $w_x^i$ is the weight of attribute $z_i$ in $X$, expressed as [expression not reproduced in the source text], where $V$ is the number of possible values of attribute $z_i$. Dividing $D$ by $z_i$ produces $V$ branch nodes, and the $v$-th branch node contains all samples in $D$ that take the $v$-th value of attribute $z_i$, denoted $D^v$.
3. The method for identifying and eliminating abnormal data according to claim 2, wherein S2 comprises:
S21. Build a set of binary trees based on the attribute weights. Select $J$ attributes according to the attribute weights of $X$. For each attribute $z_i$ among the $J$ attributes, build a binary tree $Tree_t$ whose attribute weight corresponds to $w_x^i$ in $W_x$. Randomly select a value $z_i^j$ of $z_i$ as the root $Root_t$ of the tree. The left and right subtrees of $Tree_t$ are generated according to [expression not reproduced in the source text], where $MIN(z_i)$ is the minimum value of attribute $z_i$, $MAX(z_i)$ is its maximum value, $Filter(x)$ is the filter function, and $q$ is the value of $x_i$ in data set $D$ on attribute $z_i$: records with $q \le z_i^j$ are placed in the left subtree and all others in the right subtree. The tree is completed recursively, producing $J$ binary trees that form the attribute-weighted binary tree set $F = \{Tree_1, Tree_2, \ldots, Tree_J\}$;
S22. Each record $x_i$ in $D$ traverses the binary trees in $F$, and the depth $h$ of $x_i$ in each binary tree is calculated. The isolation score of $x_i$ in $Tree_j$ is $h \times w_x^k$, denoted $s_j$;
S23. Calculate the average isolation score $\gamma$ [expression not reproduced in the source text]. The set of isolation scores of all data in $D$ is denoted $Is = \{\gamma_1, \gamma_2, \ldots, \gamma_n\}$;
S24. Divide the data set $D$ into abnormal and normal data sets according to the isolation score set $Is$, where the division is expressed as [expression not reproduced in the source text] and $\kappa$ is the dividing threshold. Data whose isolation score is greater than the threshold is classified as abnormal and recorded as the abnormal data set $D_s = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$; the remaining data is normal and recorded as the normal data set $D_t = \{(x_{m+1}, y_{m+1}), (x_{m+2}, y_{m+2}), \ldots, (x_n, y_n)\}$.
4. The method for identifying and eliminating abnormal data according to claim 1, wherein in S36 the number of iterations $N$ of the base model $\Phi$ is set to 50, and whether the abnormal data set $D_s$ is the optimal abnormal data set is judged by whether the number of times the base model has updated the abnormal data set exceeds 50.
5. The method for recognizing and eliminating abnormal data according to claim 1, wherein the penalty weight threshold z is 0.9, the threshold κ is 0.8, and the number T is 50.
6. The method of recognizing and eliminating abnormal data according to claim 3, wherein in S21, the first J attributes are selected by sorting from large to small attribute weight values of X.
CN202210138272.2A 2022-02-15 2022-02-15 Abnormal data identification and elimination method Active CN114511399B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210138272.2A CN114511399B (en) 2022-02-15 2022-02-15 Abnormal data identification and elimination method


Publications (2)

Publication Number Publication Date
CN114511399A (en) 2022-05-17
CN114511399B (en) 2023-12-15

Family

ID=81551271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210138272.2A Active CN114511399B (en) 2022-02-15 2022-02-15 Abnormal data identification and elimination method

Country Status (1)

Country Link
CN (1) CN114511399B (en)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant