CN112288561A - Internet financial fraud behavior detection method based on DBSCAN algorithm - Google Patents

Info

Publication number
CN112288561A
Authority
CN
China
Prior art keywords
data
dbscan
eps
minpts
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010446194.3A
Other languages
Chinese (zh)
Inventor
江远强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baiweijinke Shanghai Information Technology Co ltd
Original Assignee
Baiweijinke Shanghai Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baiweijinke Shanghai Information Technology Co ltd filed Critical Baiweijinke Shanghai Information Technology Co ltd
Priority to CN202010446194.3A priority Critical patent/CN112288561A/en
Publication of CN112288561A publication Critical patent/CN112288561A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds

Abstract

The invention provides an internet financial fraud behavior detection method based on the DBSCAN algorithm. The method comprises: collecting data and dividing it into a training set and a validation set; preprocessing the collected data to obtain a correlation coefficient matrix data set; determining the neighborhood parameters Eps and MinPts, using the correlation coefficient as the distance measure; training the DBSCAN clustering algorithm on the training set and traversing the maximum sample set to obtain the optimal combination of neighborhood parameters and the abnormal-state boundary value; classifying the points of the validation set as cluster points or outliers; and optimizing the DBSCAN model by comparing the outlier degree determined on the validation set with actual post-loan performance. The DBSCAN-based fraud detection method is better suited to internet finance application scenarios and improves the generality and accuracy of fraud detection and identification.

Description

Internet financial fraud behavior detection method based on DBSCAN algorithm
Technical Field
The invention relates to the technical field of risk control in the internet finance industry, and in particular to an internet financial fraud behavior detection method based on the DBSCAN algorithm.
Background
Traditional anti-fraud detection relies mainly on prior knowledge and predefined anti-fraud rules. Faced with increasingly varied fraud patterns, it cannot detect fraud that falls outside the rules in time, and the resulting losses reach hundreds of billions. To address this problem, anomaly detection techniques that build a model of normal behavior with a clustering algorithm are widely used.
In the prior art, different clustering algorithms have different characteristics. The partition-based k-means algorithm requires the number of classes formed by the normal samples to be known, with the number of clusters and the initial centroids specified manually in advance. The hierarchy-based Birch algorithm does not require the number of classes of the normal samples to be specified, but it identifies only spherical clusters. Both k-means and Birch are generally suitable only for convex sample sets.
In contrast to the requirements that k-means and Birch place on the number and shape of clusters, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a density-based clustering method that tolerates noise, does not require the number of clusters to be specified and can find any number of clusters of arbitrary shape. It therefore addresses the uncertainty, in anomaly detection, about the number of classes in the normal behavior model and the shapes of the clusters they form, and it is suitable for non-convex sample sets. Because normal behavior data in real internet finance applications is diverse (the number of classes is hard to determine, behavior patterns differ, and the clusters they form have different shapes), DBSCAN is the preferred algorithm for anomaly detection in internet finance applications.
Disclosure of Invention
In order to solve the technical problems, the invention discloses an internet financial fraud behavior detection method based on DBSCAN algorithm, and the technical scheme of the invention is implemented as follows:
The internet financial fraud behavior detection method based on the DBSCAN algorithm is characterized by comprising the following steps:
step one: data acquisition and partitioning: acquiring application information and repayment behavior data of users whose loans have been successfully disbursed, including event-tracking data for the application and withdrawal operations in the client and post-loan performance data, and randomly splitting the data into a training set and a validation set;
step two: data preprocessing: preprocessing the acquired data, including cleaning invalid data and abnormal values, normalizing, and generating a correlation coefficient matrix data set;
step three: determining neighborhood parameters: determining the clustering radius Eps and the minimum number MinPts of samples in each cluster;
step four: DBSCAN training and validation-set discrimination: feeding the preprocessed training set into the DBSCAN clustering algorithm for training; traversing the maximum sample set and counting the data volume k and the number of classes m of the normal-operation data clusters; determining the selectable range of Eps and MinPts by observing how sensitive k and m are to MinPts; refining the search grid over that range; and finally determining the optimal combination of the neighborhood parameters Eps and MinPts; then, with the optimal neighborhood parameters Eps and MinPts, taking the maximum relative error of the normal data samples as the abnormal-state boundary value and using that boundary value to classify the points of the validation set as cluster points or outliers;
step five: model optimization: comparing the outlier degree determined on the validation set with actual post-loan performance and, based on the analysis result, repeating steps three and four iteratively to obtain the DBSCAN model.
Preferably, the normalization in step two uses dispersion (min-max) normalization, so that all data fall within the interval [0, 1].
Preferably, determining the clustering radius Eps in step three comprises calculating the k-distance of each data point, plotting the k-distances as a curve, and taking the distance at which the curve changes markedly as the value of the clustering radius Eps; the k-distance refers to the distance from each coordinate point in the data to all other points in the data.
Preferably, in step three, the minimum number MinPts of samples in each cluster is determined by:
Figure RE-GDA0002855666920000021
where P_i is the number of points in the Eps neighborhood of point i, n is the number of points in the data set, MinPts ≥ dim + 1, and dim is the dimensionality of the data to be clustered.
Preferably, the DBSCAN clustering algorithm in step four comprises: using the correlation coefficient as the distance measure, searching the Eps neighborhood of each point P in the sample data and forming clusters; when the number of sample points contained in the Eps neighborhood of P satisfies |N_ε(x_p)| ≥ MinPts, a cluster is created with P as a core object; otherwise P is treated as an outlier or a border point; the DBSCAN algorithm then iteratively aggregates all points that are directly density-reachable from the core objects, and the clustering process ends when no new point is added to any cluster.
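For reference, the standard DBSCAN notions that this step relies on can be written compactly. This is a sketch in conventional notation rather than the patent's own: the data set symbol D and the distance function d are assumptions introduced here, and d stands for whatever distance is derived from the correlation coefficient.

```latex
% Eps-neighborhood of a point x_p in the data set D
N_{\varepsilon}(x_p) = \{\, x_q \in D : d(x_p, x_q) \le \varepsilon \,\}

% Core-point condition
x_p \text{ is a core point} \iff \lvert N_{\varepsilon}(x_p) \rvert \ge \mathit{MinPts}

% Direct density-reachability
x_q \text{ is directly density-reachable from } x_p
  \iff x_q \in N_{\varepsilon}(x_p) \ \text{and}\ x_p \text{ is a core point}
```

A point that is not a core point but lies in the neighborhood of one is a border point; a point that is neither is an outlier.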
The implementation of the technical scheme of the invention has the following beneficial effects:
(1) The invention uses the DBSCAN algorithm to detect internet financial fraud. DBSCAN can distinguish the high-density regions of a data set from its low-density regions and identifies abnormal points from the clustering result, which improves the generality and accuracy of detecting and identifying fraud in internet finance applications.
(2) The invention carries out standardized processing on the acquired data, so that the clustering result is more accurate, and the accuracy of detecting and identifying the Internet financial application fraud behavior is improved.
(3) The invention replaces the Euclidean distance with the correlation coefficient as the distance measure in the DBSCAN algorithm. Because the correlation coefficient is bounded between -1 and 1, the sensitivity of the cluster density to the clustering radius Eps is greatly reduced and normal and suspicious behaviors can be better distinguished. This addresses the problem that the traditional DBSCAN algorithm, which uses the Euclidean distance, cannot effectively separate normal from suspicious loan application behavior.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only one embodiment of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With reference to FIG. 1, the internet financial fraud behavior detection method based on the DBSCAN algorithm comprises the following steps:
step one: data acquisition and partitioning: acquiring application information and repayment behavior data of users whose loans have been successfully disbursed, including event-tracking data for the application and withdrawal operations in the client and post-loan performance data, and randomly splitting the data into a training set and a validation set;
step two: data preprocessing: preprocessing the acquired data, including cleaning invalid data and abnormal values, normalizing, and generating a correlation coefficient matrix data set;
step three: determining neighborhood parameters: determining the clustering radius Eps and the minimum number MinPts of samples in each cluster;
step four: DBSCAN training and validation-set discrimination: feeding the preprocessed training set into the DBSCAN clustering algorithm for training; traversing the maximum sample set and counting the data volume k and the number of classes m of the normal-operation data clusters; determining the selectable range of Eps and MinPts by observing how sensitive k and m are to MinPts; refining the search grid over that range; and finally determining the optimal combination of the neighborhood parameters Eps and MinPts; then, with the optimal neighborhood parameters Eps and MinPts, taking the maximum relative error of the normal data samples as the abnormal-state boundary value and using that boundary value to classify the points of the validation set as cluster points or outliers;
step five: model optimization: comparing the outlier degree determined on the validation set with actual post-loan performance and, based on the analysis result, repeating steps three and four iteratively to obtain the DBSCAN model.
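The random split in step one can be sketched in a few lines of Python. The function name, the 30% validation fraction, and the fixed seed are assumptions made for illustration; the source does not state a split ratio.

```python
import random


def split_train_validation(records, validation_fraction=0.3, seed=0):
    """Randomly partition records into a training set and a validation set.

    validation_fraction and seed are illustrative assumptions, not values
    taken from the patent.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Ten placeholder record ids stand in for real application records.
train, validation = split_train_validation(range(10), validation_fraction=0.3, seed=42)
```

Seeding the shuffle keeps the partition stable across the iterations of step five, so that changes in the result can be attributed to the parameters rather than to a different split.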
In this embodiment, the normalization formula is as follows. Let the data set be $X_i = \{x_{i1}, x_{i2}, \ldots, x_{in}\}$, with n attributes in total; the normalized value is:
$$x'_{ij} = \frac{x_{ij} - X_i^{\min}}{X_i^{\max} - X_i^{\min}}$$
where $X_i^{\max}$ and $X_i^{\min}$ are the maximum value and the minimum value in $X_i$, respectively.
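A minimal Python sketch of dispersion (min-max) normalization follows. The function names are hypothetical, and two choices are assumptions of this sketch rather than statements in the source: a constant set of values is mapped to 0.0, and the scaling is applied to each attribute column separately.

```python
def min_max_normalize(values):
    """Scale a list of numbers into [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: the formula is undefined (0/0) for constant data,
        # so every value is mapped to 0.0.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


def normalize_columns(rows):
    """Normalize each attribute (column) of a row-major table independently."""
    columns = [min_max_normalize(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


# Illustrative records: [age, monthly income]; the values are made up.
applicants = [[20, 3000.0], [30, 9000.0], [40, 6000.0]]
scaled = normalize_columns(applicants)
# scaled == [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```

After scaling, the income column no longer dominates the age column merely because its raw values are two orders of magnitude larger.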
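The correlation coefficient matrix data set is generated as follows. Let $X_1, X_2, \ldots, X_n$ be an n-dimensional random variable; the correlation coefficient of any $X_i$ and $X_j$ is:
$$\rho_{ij} = \frac{\operatorname{Cov}(X_i, X_j)}{\sqrt{D(X_i)}\,\sqrt{D(X_j)}}$$
where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, n$. The n-th order matrix with elements $\rho_{ij}$ is called the correlation coefficient matrix R of the n-dimensional random vector:
$$R = \begin{pmatrix} \rho_{11} & \rho_{12} & \cdots & \rho_{1n} \\ \rho_{21} & \rho_{22} & \cdots & \rho_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n1} & \rho_{n2} & \cdots & \rho_{nn} \end{pmatrix}$$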
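The correlation coefficient matrix can be computed directly from its definition. The sketch below uses only the Python standard library; the function names are hypothetical. The source does not say how a correlation coefficient is turned into a distance, so `correlation_distance` (1 minus the coefficient) is an assumption labeled as such.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    # The 1/n factors of covariance and variance cancel in the ratio.
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(variables):
    """Matrix R whose element (i, j) is the correlation of variables i and j."""
    return [[pearson(xi, xj) for xj in variables] for xi in variables]


def correlation_distance(rho):
    """Assumed conversion: 0 when rho = 1, rising to 2 when rho = -1."""
    return 1.0 - rho


# Three made-up variables: the second rises with the first, the third falls.
X = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
R = correlation_matrix(X)
```

Here `R[0][1]` is approximately 1 and `R[0][2]` is approximately -1, and R is symmetric with ones on the diagonal.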
In this embodiment, the data features collected in step one reflect the user's repayment capability and repayment willingness. They include traditional data filled in on the application page, such as personal and family status, employment, and income level, and, where policy and conditions permit, third-party data authorized by the customer, such as identity verification, APP behavior features, and third-party payment data.
In this embodiment, the DBSCAN clustering algorithm in step four comprises: using the correlation coefficient as the distance measure, searching the Eps neighborhood of each point P in the sample data and forming clusters; when the number of sample points contained in the Eps neighborhood of P satisfies |N_ε(x_p)| ≥ MinPts, a cluster is created with P as a core object; otherwise P is treated as an outlier or a border point; the DBSCAN algorithm then iteratively aggregates all points that are directly density-reachable from the core objects, and the clustering process ends when no new point is added to any cluster.
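The clustering procedure can be sketched as a small pure-Python DBSCAN with a pluggable distance function. This is an illustrative sketch, not the patent's implementation: the distance 1 minus the Pearson correlation, the handling of constant vectors, and the values `eps=0.1` and `min_pts=3` are all assumptions chosen for the toy data.

```python
import math

NOISE = -1


def correlation_distance(x, y):
    """Assumed distance: 1 - Pearson correlation between two feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # assumption: treat a constant vector as uncorrelated
    return 1.0 - cov / (sx * sy)


def dbscan(points, eps, min_pts, distance=correlation_distance):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    n = len(points)
    # Eps-neighborhood of every point (the point itself is included).
    neighbors = [
        [j for j in range(n) if distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE        # may be relabelled later as a border point
            continue
        labels[i] = cluster          # i is a core point: start a new cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is also core: keep expanding
        cluster += 1
    return labels


# Made-up behavior profiles: three rising, three falling, one oscillating.
profiles = [
    [1, 2, 3, 4], [2, 4, 6, 8.5], [1, 2, 3.2, 4.1],
    [4, 3, 2, 1], [8, 6, 4, 2.2], [4.1, 3, 2, 1.1],
    [1, 3, 1, 3],
]
labels = dbscan(profiles, eps=0.1, min_pts=3)
# labels == [0, 0, 0, 1, 1, 1, -1]
```

The oscillating profile correlates only weakly with either group, so its Eps neighborhood contains just itself and it is reported as an outlier. The neighborhood search is quadratic in the number of points, which is acceptable for a sketch but not for production-sized data.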
In a preferred embodiment, referring to FIG. 1, the normalization in step two uses dispersion (min-max) normalization so that all data fall within the interval [0, 1]; this prevents values of larger magnitude from having an amplified influence and makes the clustering result more accurate.
In a preferred embodiment, referring to FIG. 1, determining the clustering radius Eps in step three comprises calculating the k-distance of each data point, plotting the k-distances as a curve, and taking the distance at which the curve changes markedly as the value of the clustering radius Eps; the k-distance refers to the distance from each coordinate point in the data to all other points in the data. The minimum number MinPts of samples in each cluster in step three is determined by:
Figure RE-GDA0002855666920000051
where P_i is the number of points in the Eps neighborhood of point i, n is the number of points in the data set, MinPts ≥ dim + 1, and dim is the dimensionality of the data to be clustered.
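The parameter selection in step three can be sketched as follows, under several stated assumptions. The k-distance is taken in its conventional sense (distance to the k-th nearest neighbor), the "marked change" of the curve is approximated by its largest jump, and MinPts is taken as the mean of the neighborhood counts P_i, which the symbol definitions suggest but the formula image does not confirm. Euclidean distance is used only to keep the toy example readable; any distance function can be passed in.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def sorted_k_distances(points, k, distance=euclidean):
    """Distance from each point to its k-th nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        others = sorted(distance(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result)


def eps_at_largest_jump(k_distances):
    """Crude knee heuristic: the value just before the largest jump in the curve."""
    jumps = [b - a for a, b in zip(k_distances, k_distances[1:])]
    return k_distances[jumps.index(max(jumps))]


def mean_neighborhood_size(points, eps, distance=euclidean):
    """Mean of P_i, the number of points within eps of point i (itself included)."""
    counts = [sum(1 for q in points if distance(p, q) <= eps) for p in points]
    return sum(counts) / len(points)


def choose_min_pts(points, eps, distance=euclidean):
    """Assumed rule: rounded mean neighborhood size, floored at dim + 1."""
    dim = len(points[0])
    return max(dim + 1, round(mean_neighborhood_size(points, eps, distance)))


# Four points on a unit square plus one distant point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
curve = sorted_k_distances(points, k=2)
eps = eps_at_largest_jump(curve)
min_pts = choose_min_pts(points, eps)
```

For this toy data the curve is flat at 1.0 for the four square corners and then jumps to about 13.45 for the distant point, so `eps` is 1.0; the mean neighborhood size is 2.6, which rounds to 3 and equals dim + 1. In practice the curve is usually inspected visually rather than cut at the single largest jump.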
It should be understood that the above-described embodiments are merely exemplary of the present invention, and are not intended to limit the present invention, and that any modification, equivalent replacement, or improvement made without departing from the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (5)

1. The Internet financial fraud behavior detection method based on the DBSCAN algorithm is characterized by comprising the following steps:
step one: data acquisition and partitioning: acquiring application information and repayment behavior data of users whose loans have been successfully disbursed, including event-tracking data for the application and withdrawal operations in the client and post-loan performance data, and randomly splitting the data into a training set and a validation set;
step two: data preprocessing: preprocessing the acquired data, including cleaning invalid data and abnormal values, normalizing, and generating a data set of a correlation coefficient matrix;
step three, determining neighborhood parameters: determining a clustering radius Eps and a minimum number MinPts of samples in each cluster;
step four: DBSCAN training and validation-set discrimination: feeding the preprocessed training set into the DBSCAN clustering algorithm for training; traversing the maximum sample set and counting the data volume k and the number of classes m of the normal-operation data clusters; determining the selectable range of Eps and MinPts by observing how sensitive k and m are to MinPts; performing a grid search over that range; and finally determining the optimal combination of the neighborhood parameters Eps and MinPts; then, with the optimal neighborhood parameters Eps and MinPts, taking the maximum relative error of the normal data samples as the abnormal-state boundary value and using that boundary value to classify the points of the validation set as cluster points or outliers;
step five: model optimization: comparing the outlier degree determined on the validation set with actual post-loan performance and, based on the analysis result, repeating steps three and four iteratively to obtain the DBSCAN model.
2. The internet financial fraud detection method based on DBSCAN algorithm of claim 1, wherein the normalization in step two is implemented by dispersion normalization, so that the data all fall within the [0, 1] interval.
3. The internet financial fraud behavior detection method based on the DBSCAN algorithm of claim 1, wherein determining the clustering radius Eps in step three comprises calculating the k-distance of each data point, plotting the k-distances as a curve, and taking the distance at which the curve changes markedly as the value of the clustering radius Eps; the k-distance refers to the distance from each coordinate point in the data to all other points in the data.
4. The internet financial fraud behavior detection method according to claim 1, wherein in step three, the determining of the minimum number MinPts of samples in each cluster comprises:
Figure FDA0002505903320000021
where P_i is the number of points in the Eps neighborhood of point i, n is the number of points in the data set, MinPts ≥ dim + 1, and dim is the dimensionality of the data to be clustered.
5. The internet financial fraud behavior detection method based on the DBSCAN algorithm of claim 1, wherein the DBSCAN clustering algorithm in step four comprises: using the correlation coefficient as the distance measure, searching the Eps neighborhood of each point P in the sample data and forming clusters; when the number of sample points contained in the Eps neighborhood of P satisfies |N_ε(x_p)| ≥ MinPts, a cluster is created with P as a core object; otherwise P is treated as an outlier or a border point; the DBSCAN algorithm then iteratively aggregates all points that are directly density-reachable from the core objects, and the clustering process ends when no new point is added to any cluster.
CN202010446194.3A 2020-05-25 2020-05-25 Internet financial fraud behavior detection method based on DBSCAN algorithm Pending CN112288561A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010446194.3A CN112288561A (en) 2020-05-25 2020-05-25 Internet financial fraud behavior detection method based on DBSCAN algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010446194.3A CN112288561A (en) 2020-05-25 2020-05-25 Internet financial fraud behavior detection method based on DBSCAN algorithm

Publications (1)

Publication Number Publication Date
CN112288561A true CN112288561A (en) 2021-01-29

Family

ID=74420624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010446194.3A Pending CN112288561A (en) 2020-05-25 2020-05-25 Internet financial fraud behavior detection method based on DBSCAN algorithm

Country Status (1)

Country Link
CN (1) CN112288561A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination