CN111105303B - Network lending fraud detection method based on incremental network characterization learning - Google Patents


Publication number
CN111105303B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911101580.2A
Other languages
Chinese (zh)
Other versions
CN111105303A (en
Inventor
王成
朱航宇
胡瑞鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Application filed by Tongji University
Priority to CN201911101580.2A
Publication of CN111105303A
Application granted
Publication of CN111105303B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Abstract

A network lending fraud detection method based on incremental network characterization learning. The principle of the invention is as follows: real-world lending data is analyzed in the form of a heterogeneous information network, which has strong representational power, and a relational lending network is built from the lending data in that form. A specific relation is then extracted from the multi-type heterogeneous relational lending network to form a homogeneous lending network that retains only one node type. As each batch of lending data arrives, the relational lending network and the homogeneous lending network are updated in turn, and an incremental network characterization learning algorithm promptly updates the vector characterizations of the nodes in the homogeneous lending network so as to capture the latest associations among the data. Based on the learned vector characterizations, new time-series features are constructed (for example, the relation between a loan order number and the n preceding order numbers) and combined with a classifier to obtain a classification model that detects and identifies fraudulent lending data.

Description

Network lending fraud detection method based on incremental network characterization learning
Technical Field
The invention relates to fraud detection for network lending in internet finance.
Background
With the rapid development of the internet, many conventional services have moved online, and network lending in internet finance has grown quickly. Network lending produces large volumes of electronic transaction data, and the amount lost to network lending fraud has risen sharply at the same time [1]. In recent years B2C network lending has expanded rapidly worldwide, especially in China, where B2C lending institutions have suffered large numbers of bad debts and bad loans and, as a result, heavy economic losses [2]. Fraudsters commit network lending fraud in bulk by forging borrower information and even by creating gangs of fake borrowers. To protect the business of investment institutions and ordinary users in network lending, a practical and effective network lending fraud detection system is needed.
In a B2C lending scenario, an individual may obtain credit through forged applications, false data, fake contacts, multi-platform borrowing and similar means, or may obtain credit lines and funds with the help of the black and gray market, for example through agency packaging and group fraud. Such fraudulent lending data often carries latent associations, and network characterization learning has proved powerful at mining latent links between data [3]. However, most current fraud detection systems rely on a static lending data network that is updated only periodically, which cannot keep up with how quickly fraud changes in the internet age. For example, when the black and gray market generates large amounts of associated lending data within a short period, the fraud cannot be stopped effectively because the static lending network has not yet learned the associations. In addition, B2C network lending produces large volumes of lending data in a very short time; the data keeps growing and the fraud techniques keep changing, so new data must be added and old data deleted dynamically. Fraud detection methods based on static network characterization learning therefore cannot adapt to changes in the structure of the lending network.
So far, research on network lending has focused mainly on building effective fraud detection models on static data [4], and little of it addresses updating the model dynamically. Talavera et al. [5] train a radial basis function network to distinguish whether a customer commits lending fraud, and build a fuzzy c-means clustering that groups data points so that customer profiles can be created from the data within each cluster. Babaev et al. [6] apply neural networks to fine-grained transactional data to process loan data and propose a new method, ET-RNN, that relies on transactional data alone and enables automated decisions on loan applications.
These studies show that a major problem in B2C network lending fraud detection is the lack of a method that can cope with new fraud techniques in the short term. Traditional detection methods have long update cycles, while many fraud techniques change over time, so traditional methods generalize poorly.
Disclosure of Invention
Fraudulent loan applications often pass audit systems by means of forged applications, false data, multi-platform borrowing and the like, and the latent associations among such fraudulent information tend to be pronounced, particularly in the agency packaging and group fraud of the black and gray market. The invention discloses a network lending fraud detection method that analyzes the rich lending data generated by current network lending and uses it as the basis for detection, protecting the safety of users and enterprises.
The principle of the invention is as follows: real-world lending data is analyzed in the form of a heterogeneous information network, which has strong representational power, and the lending data is formed into a relational lending network in that form (it contains several types of nodes and edges, such as loan order number, license plate number, telephone number and address). A specific relation is extracted from the multi-type heterogeneous relational lending network to form a homogeneous lending network that retains only one node type (the generation of the homogeneous network from lending data is shown in Fig. 1). As each batch of lending data arrives, the relational lending network and the homogeneous lending network are updated in turn, and an incremental network characterization learning algorithm promptly updates the vector characterizations of the nodes in the homogeneous lending network so as to capture the latest associations among the data. Based on the learned vector characterizations, new time-series features are constructed (for example, the relation between a loan order number and the n preceding order numbers) and combined with a classifier to obtain a classification model that detects and identifies fraudulent lending data.
The technical scheme of the method is as follows:
A network lending fraud detection method based on incremental network characterization learning, characterized by comprising the following steps:
Step 1: Establish the relational lending network and homogenize it
Collect the rich lending data generated by historical network lending and establish a heterogeneous relational lending network. Taking loan order numbers as nodes and the attribute relations shared by different lending records as edges, derive a homogeneous lending network. The result is passed to step 2.
Step 2: Construct the training sample set
Collect the original static data and establish an initial static data set. Transform the network structure with a network characterization learning algorithm and vectorize it to obtain the vector characterizations of the nodes of the initial network lending data set; the learned vector data forms the training sample set. The result is passed to step 3.
Step 3: Feature construction
Construct features from the vector data in the training sample set in preparation for input to the fraud detection model. The result is passed to step 4.
Step 4: Train the fraud detection model
An XGBoost classifier in Python, used through its scikit-learn-compatible interface, is adopted as the fraud detection model. The features constructed in step 3 are input to the classifier to train the fraud detection model. The result is passed to step 7.
Step 5: Update the relational lending network and the homogeneous lending network
Collect the lending data currently being generated by network lending. For the incremental streaming lending data that arrives in chronological order, update the relational lending network and the homogeneous lending network. The result is passed to step 6.
Step 6: Update the current test data set
Construct the current test data set from the training sample set built in step 2 and the streaming lending data arriving in chronological order: add the k new lending records and delete the k earliest records from the initial data set, so that the current test data set is updated in real time.
As in step 2, transform the network structure with the network characterization learning algorithm and vectorize it to obtain the vector characterizations of the nodes of the current test data set, and update the current test data set with the learned vector data. The result is passed to step 7.
Step 7: Feature construction
As in step 3, construct features from the vector data in the test data set in preparation for input to the fraud detection model. The result is passed to step 8.
Step 8: Test the fraud detection model
Input the current test data set from step 7 into the fraud detection model from step 4 to obtain the judgment of the fraud detection model.
Further, determine whether the moment corresponding to the current test data set exceeds the model update period. If it does not, repeat from step 5; if it does, repeat from step 1. The algorithm ends when fraud detection has been completed for all test data sets.
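The streaming loop of steps 5 to 8 can be pictured with a small Python sketch. It is an illustration under assumptions only: the helper names `slide_window` and `needs_rebuild`, the `(timestamp, order_number)` record layout, and the rule that elapsed time is measured from the last full rebuild are hypothetical and are not taken from the patent.

```python
from collections import deque


def slide_window(window, new_records):
    """Add the k new records and drop the k oldest ones (step 6)."""
    k = len(new_records)
    for _ in range(min(k, len(window))):
        window.popleft()          # delete the earliest record
    window.extend(new_records)    # add the newly arrived records
    return window


def needs_rebuild(now, last_rebuild, update_period):
    """True when the model update period is exceeded (go back to step 1)."""
    return now - last_rebuild > update_period


# Assumed layout: (timestamp, order_number) pairs sorted by time.
window = deque([(0, "A1"), (1, "A2"), (2, "A3"), (3, "A4")])
batches = [[(4, "A5"), (5, "A6")], [(6, "A7"), (7, "A8")]]

last_rebuild, update_period = 0, 6
actions = []
for batch in batches:
    slide_window(window, batch)
    now = batch[-1][0]
    if needs_rebuild(now, last_rebuild, update_period):
        actions.append("rebuild")      # repeat from step 1
        last_rebuild = now
    else:
        actions.append("incremental")  # repeat from step 5
```

With these toy values the first batch triggers an incremental update and the second a full rebuild, and the window always holds the four most recent records.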
The invention aims to overcome the weakness of static fraud detection methods against rapidly changing network lending fraud, to make the fraud detection system more adaptable to a changing environment, and to better support detecting and intercepting fraudulent loans and protecting the funds of users and enterprises.
The invention discloses a network lending fraud detection method based on incremental network characterization that updates the lending data network dynamically. By means of incremental network characterization learning, it mines from the lending data network characterizations with strong generalization ability, improving the timeliness, accuracy and robustness with which the model intercepts fraudulent loans.
Drawings
Fig. 1: example of the generation of a homogeneous network from lending data in a network lending scenario;
Fig. 2: flowchart of the network lending fraud detection method based on incremental network characterization learning;
Fig. 3: diagram of the transformation of the lending data of the embodiment into vector representations;
Fig. 4: example of how the incremental lending data set is partitioned at a given time.
Detailed Description
The technical scheme of the invention is further described below by combining the embodiment and the attached drawings.
The flowchart of the network lending fraud detection method based on incremental network characterization learning is shown in Fig. 2. The method proceeds through steps 1 to 8 as set out in the technical scheme above, including the check against the model update period after step 8.
A detailed example is given below.
Example 1
The example is divided into four parts.
Part 1: generating the initial network characterization. The procedure is as follows:
Input:
the users' network lending data $B$,
the parameter set $W_e$ of the network characterization learning method.
Output:
the mapping $\gamma = F_t(v)$ between each node $v$ and its corresponding vector $\gamma$ at the initial time $t$.
The initial network characterization is generated as follows:
Step 1.1: Select the usable original fields (shown in Table 1) from the original lending data and preprocess them, for example by converting field types and removing or filling null values. Define a discretization rule for each field and discretize its values to reduce the data precision. For example, in the embodiment the amount is divided into a limited number of categories according to the interval it falls in, and the address is discretized into coarse-grained values by street.
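The preprocessing of step 1.1 might look like the following Python sketch. The interval boundaries in `AMOUNT_EDGES`, the `city/street/number` address layout, the field names and the `UNKNOWN` fill value are all assumptions made for illustration; Table 1 and the actual discretization rules are not reproduced in the text.

```python
from bisect import bisect_right

# Hypothetical interval boundaries for the loan amount.
AMOUNT_EDGES = [1000, 5000, 20000]


def discretize_amount(amount, edges=AMOUNT_EDGES):
    """Map a numeric amount to the index of the interval it falls in."""
    return bisect_right(edges, amount)


def discretize_address(address):
    """Keep only the coarse part of an assumed 'city/street/number' address."""
    return "/".join(address.split("/")[:2])


def preprocess(record, fill_value="UNKNOWN"):
    """Fill null values, then discretize the fields of one raw record."""
    amount = record.get("amount")
    address = record.get("address")
    if amount is None:
        amount_bin = fill_value
    else:
        amount_bin = f"amount_bin_{discretize_amount(float(amount))}"
    if address is None:
        street = fill_value
    else:
        street = discretize_address(address)
    return {"amount": amount_bin, "address": street}


example = preprocess({"amount": "3500", "address": "Shanghai/Siping Road/1239"})
missing = preprocess({"amount": None, "address": None})
```

Coarser values make it more likely that two records share an attribute node, which is what later creates edges between order numbers.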
The original lending data is divided into the order number (applync) type and the attribute (ATTRIBUTE) type, where the attributes are all data in a lending record other than the order number. A lending record is written as $(b_i, ATT(b_i))$, where $b_i$ is the order number of the record, $ATT(b_i)$ is the attribute set of the record, and $att_k(b_i)$ is the $k$-th element of $ATT(b_i)$.
A relational lending network $N_r = (V, E)$ is established from the original lending data, where $V$ is the node set and $E$ is the edge set. An edge is $e = (u, v)$, with $u$ and $v$ belonging to the node set $V$, which contains several types of nodes. For each record $b_i$ in the lending data $B$, first add $b_i$ to the node set $V$, then add each element of $ATT(b_i)$ to $V$ in turn, and add each edge $(b_i, att_k(b_i))$ to the edge set $E$. Step 1.2 is performed. The left part of Fig. 1 shows an example of the relational lending network $N_r$.
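The construction of $N_r$ described above can be sketched in a few lines of Python. The function name `build_relational_network`, the dictionary input format and the `(type, value)` node encoding are hypothetical choices for this illustration.

```python
def build_relational_network(loans):
    """Build N_r = (V, E) from {order_number: {field: value}} records."""
    nodes, edges = set(), set()
    for order_no, attributes in loans.items():
        order_node = ("order", order_no)
        nodes.add(order_node)
        for field, value in attributes.items():
            attr_node = (field, value)        # typed attribute node
            nodes.add(attr_node)
            edges.add((order_node, attr_node))
    return nodes, edges


# Toy lending records with already discretized attribute values.
loans = {
    "b1": {"phone": "555-0101", "address": "street-7"},
    "b2": {"phone": "555-0101", "address": "street-9"},
    "b3": {"phone": "555-0199", "address": "street-9"},
}
V, E = build_relational_network(loans)
```

Because attribute nodes are keyed by type and value, two records with the same phone number attach to the same node, so the toy network has 7 nodes and 6 edges.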
Step 1.2: A homogeneous lending network $N_h = (V_h, E_h)$ is established from the relational lending network, where $V_h$ is the node set and $E_h$ is the edge set. An edge is $e = (u, v, w)$, with $u$ and $v$ belonging to the node set $V_h$, which contains only nodes of the lending order number type. When $att_k(b_i) = att_k(b_j)$, the pair of edges $(b_i, att_k(b_i))$ and $(b_j, att_k(b_j))$ is regarded as an edge $(b_i, b_j)$ of the edge set $E_h$, and $w$, the number of times the edge $(b_i, b_j)$ occurs, is its weight in the homogeneous lending network $N_h$. Starting from the relational lending network $N_r$, all nodes of the lending order number type in the node set $V$ are added to the node set $V_h$. For each pair of edges $(b_i, att_k(b_i))$ and $(b_j, att_k(b_j))$ with $att_k(b_i) = att_k(b_j)$, the edge $(b_i, b_j)$ is added to the edge set $E_h$. This yields the homogeneous lending network $N_h = (V_h, E_h)$. Step 1.3 is performed.
The right part of Fig. 1 shows an example of the homogeneous lending network $N_h$ generated from the relational lending network $N_r$ in the left part.
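A minimal Python sketch of this homogenization step follows. The function name `derive_homogeneous_network` and the input format, a list of `(order_number, attribute_node)` pairs, are assumptions for illustration; the weight of an order-order edge is the number of attribute values the two records share, as in the definition of $w$ above.

```python
from collections import Counter, defaultdict
from itertools import combinations


def derive_homogeneous_network(relational_edges):
    """Collapse order-attribute edges into weighted order-order edges."""
    orders_by_attribute = defaultdict(set)
    order_nodes = set()
    for order_no, attribute in relational_edges:
        order_nodes.add(order_no)
        orders_by_attribute[attribute].add(order_no)

    weights = Counter()
    for orders in orders_by_attribute.values():
        # Each pair of order numbers sharing this attribute value gets +1.
        for u, v in combinations(sorted(orders), 2):
            weights[(u, v)] += 1
    return order_nodes, dict(weights)


relational_edges = [
    ("b1", ("phone", "555-0101")), ("b1", ("address", "street-7")),
    ("b2", ("phone", "555-0101")), ("b2", ("address", "street-7")),
    ("b3", ("phone", "555-0199")), ("b3", ("address", "street-7")),
]
V_h, E_h = derive_homogeneous_network(relational_edges)
```

Here b1 and b2 share both a phone number and a street, so their edge has weight 2, while the other two pairs share only the street and have weight 1.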
Step 1.3: Based on the constructed homogeneous lending network $N_h$, this embodiment uses the existing network characterization learning method NetWalk [7] to learn vector representations of all nodes in $N_h$. Feature information is extracted automatically, which avoids the trouble of extracting features by hand. The main parameters NetWalk uses to learn the vector representations are listed in Table 2, and their settings depend on the network structure. The parameters walk-length and number_walks generally grow with the number of nodes and edges in the network: the more nodes and edges, the larger both parameters. The parameter learning_rate affects how well NetWalk learns the network characterization; according to the embodiment, too large a value may cause over-fitting and too small a value under-fitting, and the embodiment sets it to 0.01. The parameter dim is the dimension of the output vector representation; a larger dimension often captures more latent associations but brings higher computational complexity, and the embodiment sets it to 128. In the network characterization learning method of this embodiment, init is the edge set of the homogeneous lending network generated from the initial lending data, and snap is the set of edges added to or deleted from the homogeneous lending network as streaming lending data arrives. Step 1.4 is performed.
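To make the role of the walk parameters concrete, the sketch below samples random walks over a small weighted network in Python. It is not NetWalk itself: it covers only a walk-sampling stage of the kind NetWalk builds on and omits the autoencoder that turns walks into vectors. The function name `sample_walks`, the adjacency format, and the choice to weight transitions by edge weight are assumptions for illustration.

```python
import random


def sample_walks(adjacency, walk_length, number_walks, seed=0):
    """Sample weighted random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(number_walks):
        for start in sorted(adjacency):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency.get(walk[-1], {})
                if not neighbours:
                    break  # isolated node: the walk stops early
                nodes = sorted(neighbours)
                weights = [neighbours[n] for n in nodes]
                walk.append(rng.choices(nodes, weights=weights, k=1)[0])
            walks.append(walk)
    return walks


# Weighted homogeneous network as {node: {neighbour: weight}} (toy data).
adjacency = {
    "b1": {"b2": 2, "b3": 1},
    "b2": {"b1": 2, "b3": 1},
    "b3": {"b1": 1, "b2": 1},
    "b4": {},
}
walks = sample_walks(adjacency, walk_length=4, number_walks=3)
```

Larger networks need more and longer walks to cover their structure, which is consistent with the remark that both parameters grow with the number of nodes and edges.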
Step 1.4: Applying the network characterization learning method NetWalk of step 1.3 to the homogeneous lending network $N_h$ yields, at the initial time $t$, the vector representation $\gamma$ of each node $v$ in the network, which establishes the mapping $\gamma = F_t(v)$. Using this mapping, the initial lending data is expressed as vector representations: as shown in Fig. 3, a set of lending data consisting of specific field values is converted into a set of vector representations of fixed dimension (the vector dimension dim in Fig. 3 is determined by the parameter dim of the network characterization learning method NetWalk).
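Table 1. Available raw fields (provided as an image in the original publication).
Table 2. NetWalk main parameters (provided as an image in the original publication). The parameters discussed in step 1.3 are walk-length, number_walks, learning_rate (0.01 in the embodiment), dim (128 in the embodiment), init and snap.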
Part 2: building the fraud detection model. The procedure is as follows:
Classifier environment: Python, XGBoost classifier
Input:
the mapping $\gamma = F_{t_k}(v)$ between each node $v$ and its corresponding vector $\gamma$ at time $t_k$,
the classifier parameter set $W_c$,
the number $h$ of features input to the classifier,
the lending data set $B_{train}(t_k)$ used for model training.
Output:
the fraud detection model.
The fraud detection model is built as follows:
Step 2.1: The lending data $B_{train}(t_k)$, containing n available original fields, may be associated with n corresponding nodes in the homogeneous lending network. As step 1.4 shows, based on the nodes at time $t_k$ and the mapping $\gamma = F_{t_k}(v)$, the lending data is converted into a vector of dimension dim for each lending order number. Once obtained, the vectors can be input directly into a classification model for the subsequent node classification task (this is "method one").
This embodiment goes further and discloses "method two". Based on the obtained vector characterizations, for each lending record the Euclidean distance (a measure of vector similarity) is computed in turn between its order number and each of the h preceding order numbers in the data set (order numbers are sorted by generation time). The h results are then sorted in ascending order of distance and used as the constructed time-series features of that order number. In this way the similarity between the vector of the order number under test and the vectors of the h preceding order numbers is introduced as the input of the fraud detection model.
Comparison:
Method one considers only the absolute spatial position of the vectors and performs poorly on lending data.
Compared with method one, method two is better suited to detecting bulk fraud in lending. It uses vector similarity instead of absolute spatial position, which strengthens the generalization ability of the subsequent fraud detection model. For vectors $X = (x_1, \ldots, x_{dim})$ and $Y = (y_1, \ldots, y_{dim})$, the Euclidean distance is calculated as follows:
$$d(X, Y) = \sqrt{\sum_{i=1}^{dim} (x_i - y_i)^2}$$
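The feature construction of method two can be sketched in Python as follows. The function name `time_series_features` and the zero padding for records with fewer than h predecessors are assumptions; the text does not say how the earliest records are handled.

```python
import math


def time_series_features(vectors, h, pad=0.0):
    """For each record (in time order) return the ascending distances to
    the h preceding records, padded to length h for early records."""
    features = []
    for i, current in enumerate(vectors):
        previous = vectors[max(0, i - h):i]
        distances = sorted(math.dist(current, p) for p in previous)
        distances += [pad] * (h - len(distances))  # assumed padding rule
        features.append(distances)
    return features


# Toy 2-dimensional embeddings, already sorted by generation time.
vectors = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0), (3.0, 0.0)]
features = time_series_features(vectors, h=2)
```

Each row has exactly h entries regardless of the embedding dimension, so the classifier input depends on relative distances between neighbouring records in time and not on absolute coordinates.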
Step 2.2: Using the time-series features constructed in step 2.1, configure the classifier according to the classifier parameter set $W_c$. The time-series features of the lending data at time $t_k$ serve as the data, and whether each lending record is a fraudulent transaction serves as the label. Both are fed into the classifier for training, and the trained binary classification model is taken as the fraud detection model.
Part 3: generating the incremental network characterization. The procedure is as follows:
Input:
the mapping $\gamma = F_{t_k}(v)$ between each node $v$ and its corresponding vector $\gamma$ at time $t_k$,
the data set $B_{train}(t_k)$ used for network characterization learning at time $t_k$,
the network lending data set $B_{test}(t_{k+1})$ arriving as a stream at time $t_{k+1}$.
Output:
the mapping $\gamma = F_{t_{k+1}}(v)$ between each node $v$ and its corresponding vector $\gamma$ at time $t_{k+1}$.
The incremental network characterization is generated as follows:
Step 3.1: Following the timestamp order of data set $B_{train}(t_k)$, select the earliest records, equal in number to data set $B_{test}(t_{k+1})$, and place them in data set $B'_{test}(t_{k+1})$. Apply the same preprocessing as in step 1.1 to $B_{test}(t_{k+1})$ and $B'_{test}(t_{k+1})$, and use the processed data sets to update the relational lending network built from data set $B_{train}(t_k)$. Following the definitions of step 1.1, process the network lending data $B_{test}(t_{k+1})$ and $B'_{test}(t_{k+1})$ separately to obtain the node sets $V_{test}(t_{k+1})$ and $V'_{test}(t_{k+1})$ and the edge sets $E_{test}(t_{k+1})$ and $E'_{test}(t_{k+1})$ in the relational lending network. $E_{test}(t_{k+1})$ is the set of edges between the order numbers in the newly arrived streaming lending data and the nodes already present in the relational lending network $N_r$ at the previous time; $E'_{test}(t_{k+1})$ is the set of expired edges to be deleted from the relational lending network $N_r$. Let $V = V \cup V_{test}(t_{k+1}) - V'_{test}(t_{k+1})$ and $E = E \cup E_{test}(t_{k+1}) - E'_{test}(t_{k+1})$, and update the relational lending network $N_r = (V, E)$. Step 3.2 is performed.
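The set update of step 3.1 maps directly onto Python set operations, as in the sketch below. The function name `update_relational_network` and the toy node labels are hypothetical.

```python
def update_relational_network(nodes, edges, new_nodes, new_edges,
                              expired_nodes, expired_edges):
    """Return (V | V_test) - V'_test and (E | E_test) - E'_test."""
    updated_nodes = (nodes | new_nodes) - expired_nodes
    updated_edges = (edges | new_edges) - expired_edges
    return updated_nodes, updated_edges


# Network at time t_k: two orders, each attached to one phone node.
V = {"b1", "b2", "p1", "p2"}
E = {("b1", "p1"), ("b2", "p2")}

# Newly arrived record b3 shares phone p2; the oldest record b1 expires.
V_test, E_test = {"b3", "p2"}, {("b3", "p2")}
V_expired, E_expired = {"b1", "p1"}, {("b1", "p1")}

V_new, E_new = update_relational_network(
    V, E, V_test, E_test, V_expired, E_expired
)
```

One design point the formula leaves open: if an expiring record shares an attribute node with a record that stays in the window, a literal set difference would remove that shared node as well. An implementation would probably keep attribute nodes that surviving edges still reference; the toy data avoids this case.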
Step 3.2: From the updated relational lending network $N_r = (V, E)$, the updated homogeneous lending network $N_h = (V_h, E_h)$ is obtained as in step 1.2. Step 3.3 is performed.
Step 3.3: Starting from the mapping $\gamma = F_{t_k}(v)$ between each node $v$ and its corresponding vector $\gamma$ at time $t_k$, take the edge sets $E_{test}(t_{k+1})$ and $E'_{test}(t_{k+1})$ as the newly arrived edge set and the edge set to be deleted, respectively. The network characterization learning method NetWalk is applied to the nodes and edges involved in $E_{test}(t_{k+1})$ and $E'_{test}(t_{k+1})$ to perform incremental network characterization learning, which yields the mapping $\gamma = F_{t_{k+1}}(v)$ between each node $v$ and its corresponding vector $\gamma$ at time $t_{k+1}$. Step 3.4 is performed.
Step 3.4: Step 3.3 gives, for the homogeneous lending network $N_h$, the mapping between each node $v$ in the network and its vector representation $\gamma$. Using this mapping, the streaming lending data is re-expressed as vector representations: as shown in Fig. 3, a set of lending data consisting of specific field values is converted into a set of vector representations of fixed dimension.
Part 4: testing the fraud detection model. The procedure is as follows:
Classifier environment: Python, XGBoost classifier
Input:
the model update period $T$,
the fraud detection model,
the mapping $\gamma = F_{t_k}(v)$ between each node $v$ and its corresponding vector $\gamma$ at time $t_k$,
the lending data set $B_{test}(t_k)$ used for model testing at time $t_k$.
Output:
the set $P$ of probabilities that the test data is fraudulent.
The fraud detection model is tested as follows:
Step 4.1: The lending data $B_{test}(t_k)$, containing n available original fields, may be associated with n corresponding nodes in the homogeneous lending network. As step 3.4 shows, based on the nodes at time $t_k$ and the mapping $\gamma = F_{t_k}(v)$, the lending data is converted into a vector of dimension dim for each lending order number. Based on the obtained vector characterizations, for each lending record the Euclidean distance is computed in turn between its order number and each of the h preceding order numbers in the data set (order numbers are sorted by generation time), and the h results are sorted in ascending order of distance and used as the time-series features of that order number. Step 4.2 is performed.
Step 4.2: Load the fraud detection model obtained in step 2.2 and input the time-series features of the test data at time $t_k$ into it, obtaining the fraud probability $p(b_i)$ of each lending record in the test lending data set $B_{test}(t_k)$. Output the set $P$ of probabilities that the test data is fraudulent, where $p(b_i) \in P$. Then determine whether $t_{k+1} + t_0$ is greater than the period $T$. If it is, the lending data set $B_{train}(t_k)$ at time $t_k$ is regarded as the initial lending data set, and step 1.1 of Part 1 is performed to rebuild the relational lending network. If it is not, the process moves on to time $t_{k+1}$, and step 3.1 of Part 3 is performed to update the network characterization incrementally from the incoming streaming lending data.
The invention was validated on lending data sets from a real internet finance platform: the recall rate (interception rate, True Positive Rate) was obtained at different disturbance rates (false interception rate, False Positive Rate), and the KS value, which is the maximum of recall rate minus disturbance rate over the different operating points, was computed to evaluate the performance of the system.
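The KS value defined above can be computed as in the following Python sketch, which scans every observed score as a threshold. The function name `ks_value`, the convention that label 1 means fraud, and the toy labels and scores are assumptions for illustration and are not results from the patent.

```python
def ks_value(labels, scores):
    """Maximum of TPR - FPR over all score thresholds (label 1 = fraud)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are required")
    best = 0.0
    for threshold in sorted(set(scores)):
        flagged = [s >= threshold for s in scores]
        tpr = sum(f and y == 1 for f, y in zip(flagged, labels)) / positives
        fpr = sum(f and y == 0 for f, y in zip(flagged, labels)) / negatives
        best = max(best, tpr - fpr)
    return best


# Toy fraud labels and model scores (illustrative values only).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
ks = ks_value(labels, scores)
```

For this toy data the largest gap between interception rate and disturbance rate is 2/3, reached for example at threshold 0.8, where two of the three fraudulent records are flagged and no legitimate record is.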
Innovation points of the invention
1. A relational lending network is built from the recorded lending data, and a homogeneous lending network is derived from it to express the relationships among lending data in network form. Latent association features are extracted automatically from the data through network characterization learning on the homogeneous information network, which reduces the system's dependence on business knowledge.
2. For streaming lending data, the structures of the relational lending network and the homogeneous lending network are updated dynamically, and the characterizations of the continuously changing lending network are updated accurately by an incremental network characterization learning method. New features of the lending data are constructed from the node vector characterizations and fed into the already trained model, which returns the fraud probability of the lending data. Compared with traditional methods, the characterizations in the model are updated closer to real time, which suits the need for fast data auditing in network lending, and the accuracy and robustness are higher.
Note: the relevant terms used in the present invention can be found in the following prior-art references.
[1] Chen Y Q, Zhang J, Ng W W Y. Loan Default Prediction Using Diversified Sensitivity Undersampling[C]//2018 International Conference on Machine Learning and Cybernetics (ICMLC). IEEE, 2018, 1: 240-245.
[2] Shi Y F, Song P P. Improvement Research on the Project Loan Evaluation of Commercial Bank Based on the Risk Analysis[C]//2017 10th International Symposium on Computational Intelligence and Design (ISCID). IEEE, 2017, 1: 3-6.
[3] Cui P, Wang X, Pei J, et al. A survey on network embedding[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 31(5): 833-852.
[4] Saha P, Bose I, Mahanti A. A knowledge based scheme for risk assessment in loan processing by banks[J]. Decision Support Systems, 2016, 84: 78-88.
[5] Talavera A, Cano L, Paredes D, et al. Data Mining Algorithms for Risk Detection in Bank Loans[C]//Annual International Symposium on Information Management and Big Data. Springer, Cham, 2018: 151-159.
[6] Babaev D, Savchenko M, Tuzhilin A, et al. ET-RNN: Applying Deep Learning to Credit Loan Applications[C]//Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2019: 2183-2190.
[7] Yu W, Cheng W, Aggarwal C C, et al. NetWalk: A flexible deep embedding approach for anomaly detection in dynamic networks[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018: 2672-2681.
[8] Chen T, Guestrin C. XGBoost: A scalable tree boosting system[C]//Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016: 785-794.

Claims (4)

1. A network lending fraud detection method based on incremental network characterization learning is characterized by comprising the following steps:
Step 1: establishing a relational lending network and homogenizing it
Collecting the rich lending data generated by historical online loans and establishing a heterogeneous relational lending network; taking single numbers as nodes and the attribute relations shared by different loan records as edges, deriving a homogeneous lending network; the result is provided to step 2;
Step 2: constructing a training sample set
Collecting the original static data and establishing an initial static data set; transforming the network structure with a network characterization learning algorithm and vectorizing it to obtain the vector characterizations of the nodes based on the initial network lending data set; the learned vector data form the training sample set, which is provided to step 3;
Step 3: feature construction
Performing feature construction on the vector data in the training sample set in preparation for input to the fraud detection model; the result is provided to step 4;
Step 4: training the fraud detection model
Adopting the XGBoost classifier in the Python machine learning ensemble library scikit-learn as the fraud detection model, and inputting the features constructed in step 3 into the classifier to train the fraud detection model; the result is provided to step 7;
Step 5: updating the relational lending network and the homogeneous lending network
Collecting the currently generated online lending data as they are updated and, for the incremental streaming lending data arriving in chronological order, updating the relational lending network and the homogeneous lending network; the result is provided to step 6;
Step 6: updating the current test data set
Constructing the current test data set from the training sample set constructed in step 2 and the streaming lending data arriving in chronological order, namely: adding the k new loan records and deleting the k earliest loan records in the initial data set, so that the current test data set is updated in real time;
with reference to step 2, transforming the network structure with the network characterization learning algorithm and vectorizing it to obtain the vector characterizations of the nodes of the current test data set, and updating the current test data set with the learned vector data; the result is provided to step 7;
Step 7: feature construction
With reference to step 3, performing feature construction on the vector data in the test data set in preparation for input to the fraud detection model; the result is provided to step 8;
Step 8: testing the fraud detection model
Inputting the current test data set of step 7 into the fraud detection model of step 4 to obtain the judgment result of the fraud detection model.
2. The method of claim 1, characterized by comprising the following steps:
Step 1.1: screening the original fields from the raw lending data, and preprocessing them by field type conversion and by removing or filling null values;
dividing the raw lending data into two types, single number (applync) and attribute (ATTRIBUTE), wherein the attributes are all data in a loan record other than its single number; a loan record is denoted (b_i, ATT(b_i)), where b_i is the single number of the loan record b, ATT(b_i) is the attribute set of the loan record b, and att_k(b_i) is the k-th element of ATT(b_i);
establishing a relational lending network N_r = (V, E) from the raw lending data, where V is the node set and E is the edge set, each edge e = (u, v) has u and v in the node set V, and V contains nodes of multiple types; for each loan record b_i in the lending data, first adding b_i to the node set V, then adding each element of ATT(b_i) to the node set V in turn and adding the edge (b_i, att_k(b_i)) to the edge set E, where att_k(b_i) is the k-th element of ATT(b_i); executing step 1.2;
Step 1.2: establishing a homogeneous lending network N_h = (V_h, E_h) based on the relational lending network, where V_h is the node set and E_h is the edge set, each edge e = (u, v, w) has u and v in the node set V_h, and V_h contains only nodes whose type is lending single number; when att_k(b_i) = att_k(b_j), the pair of edges (b_i, att_k(b_i)) and (b_j, att_k(b_j)) is regarded as the edge (b_i, b_j) of the edge set E_h, and w, the number of occurrences of the edge (b_i, b_j), is its weight in the homogeneous lending network N_h; based on the relational lending network N_r, adding all nodes of the node set V whose type is lending single number to the node set V_h; for each pair of edges (b_i, att_k(b_i)) and (b_j, att_k(b_j)), when att_k(b_i) = att_k(b_j), adding the edge (b_i, b_j) to the edge set E_h; thereby obtaining the homogeneous lending network N_h = (V_h, E_h); executing step 1.3;
Step 1.3: based on the constructed homogeneous lending network N_h, adopting the network characterization learning method NetWalk to learn the vector characterizations of all network nodes in the homogeneous lending network N_h; executing step 1.4;
Step 1.4: the network characterization learning method NetWalk of step 1.3 yields, for the homogeneous lending network N_h at the initial time t, each node v in the network and its corresponding vector characterization γ, establishing the mapping relation γ = F_t(v); according to the mapping relation γ = F_t(v), the initial lending data are expressed in vector form, so that lending data composed of a number of specific field values are converted into a set of vector characterizations of fixed dimension;
Step 2.1: based on the nodes at time t_k and the mapping relation
Figure FDA0004119424370000031
the lending data are transformed into vectors of dimension dim, one for each lending single number;
based on the obtained vector characterizations, for each loan record, calculating in turn the Euclidean distance between its single number and each of the h preceding single numbers in the data set, the single numbers being sorted by generation time; sorting the h distances in ascending order and using them as the constructed time-series feature of the corresponding single number; the similarity between the vector of the single number under test and the vectors of the h preceding single numbers is thereby introduced as the input of the fraud detection model; for vectors X = (x_1, ..., x_dim) and Y = (y_1, ..., y_dim), the Euclidean distance is calculated as follows:
d(X, Y) = \sqrt{\sum_{i=1}^{dim} (x_i - y_i)^2}
Step 2.2: based on the time-series features constructed in step 2.1, configuring the classifier according to the classifier parameter set W_c; taking the time-series features corresponding to the lending data at time t_k as the data and whether the corresponding loan record is a fraudulent transaction as the label, and importing them into the classifier for training; the trained binary classification model is regarded as the fraud detection model
Figure FDA0004119424370000033
Step 3.1: according to data set B train (t k ) Time stamp sequence, selection and data set B test (t k+1 ) The same amount of earliest data is placed into data set B' test (t k+1 ) The method comprises the steps of carrying out a first treatment on the surface of the Data set B test (t k+1 ) And B' test (t k+1 ) The same preprocessing operation as in step 1.1 is adopted to process the data set B after processing test (t k+1 ) And B' test (t k+1 ) Based on data set B train (t k ) Updating the relationship lending network; based on the definition of step 1.1, the network lending data B are processed separately test (t k+1 ) And B' test (t k+1 ) Obtaining node set V in relational lending network test (t k+1 ) And V' test (t k+1 ) And edge set E test (t k+1 ) And E' test (t k+1 ),E test (t k+1 ) Is the single number in the lending data of the stream arrival and the lending network N related to the last moment r A set of edges of existing relationships between existing nodes,
Figure FDA0004119424370000034
is a relational lending network N r A set of expiring edges to be deleted; let v=v & &v test (t k+1 )-V′ test (t k+1 ) And e=e% test (t k+1 )-E′ test (t k+1 ) Updating a relational lending network N r = (V, E); executing the step 3.2;
step 3.2: based on updated relationship lending network N r = (V, E), the updated homogeneous lending network N is obtained using step 1.2 h =(V h ,E h ) The method comprises the steps of carrying out a first treatment on the surface of the Executing the step 3.3;
step 3.3: based on time t k Mapping relation between corresponding node v and corresponding vector gamma
Figure FDA0004119424370000035
Respectively set edge sets E test (t k+1 ) And E' test (t k+1 ) For the newly arrived edge set and the edge set to be deleted, a network characterization learning method NetWalk is applied to the related edge set E test (t k+1 ) And E' test (t k+1 ) Incremental network characterization learning is carried out on the nodes and the edges in the network to obtain a time t k+1 Mapping relation between corresponding node v and corresponding vector gamma>
Figure FDA0004119424370000041
Executing the step 3.4;
step 3.4: the step 3.3 is directed to the homogeneous lending network N h At time t k Mapping relation between node v and its corresponding vector representation gamma in time network
Figure FDA0004119424370000042
According to the mapping relation gamma=f t (v) Re-representing the streaming lending data into a vector representation form, wherein a set of vector representations consisting of a plurality of specific field values are converted into a set of vector representations with fixed dimensions;
Step 4.1: the lending data B_train(t_k), containing n available original fields, correspond to n nodes in the homogeneous lending network; from step 3.4, based on the nodes at time t_k and the mapping relation
Figure FDA0004119424370000043
the lending data are transformed into vectors of dimension dim, one for each lending single number; based on the obtained vector characterizations, for each loan record, calculating in turn the Euclidean distance between its single number and each of the h preceding single numbers in the data set, the single numbers being sorted by generation time; sorting the h distances in ascending order and using them as the time-series feature of the corresponding single number; executing step 4.2;
Step 4.2: importing the fraud detection model obtained in step 2.2
Figure FDA0004119424370000044
inputting the time-series features corresponding to the test data at time t_k into the fraud detection model
Figure FDA0004119424370000045
to obtain the fraud probability p(b_i) of each loan record b_i in the test lending data set B_test(t_k), and outputting the set P of fraud probabilities of the test data, where p(b_i) ∈ P.
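Steps 1.1 and 1.2 of this claim can be sketched as follows. This is a hedged illustration only: the function names, the ("loan", single number) node encoding and the sample fields `phone` and `device` are hypothetical, and representing the attribute att_k as a (field, value) pair is an assumption made so that only equal values of the same field link two loan records.

```python
from collections import defaultdict
from itertools import combinations


def build_relational_network(records):
    """Step 1.1 sketch: build N_r = (V, E) from {single_number: {field: value}}."""
    nodes, edges = set(), set()
    for number, attrs in records.items():
        loan_node = ("loan", number)          # node of type "lending single number"
        nodes.add(loan_node)
        for field, value in attrs.items():
            attr_node = (field, value)        # one node per distinct attribute value
            nodes.add(attr_node)
            edges.add((loan_node, attr_node))  # edge (b_i, att_k(b_i))
    return nodes, edges


def derive_homogeneous_network(nodes, edges):
    """Step 1.2 sketch: derive N_h = (V_h, E_h) with weighted loan-to-loan edges."""
    loan_nodes = {n for n in nodes if n[0] == "loan"}
    loans_by_attr = defaultdict(set)
    for loan_node, attr_node in edges:
        loans_by_attr[attr_node].add(loan_node)
    weights = defaultdict(int)
    for loans in loans_by_attr.values():
        # Every pair of loans sharing this attribute value adds 1 to the weight w.
        for u, v in combinations(sorted(loans), 2):
            weights[(u, v)] += 1
    return loan_nodes, dict(weights)


# Hypothetical preprocessed loan records.
records = {
    "A001": {"phone": "555-0101", "device": "D1"},
    "A002": {"phone": "555-0101", "device": "D1"},
    "A003": {"phone": "555-0199", "device": "D1"},
}
V, E = build_relational_network(records)
V_h, E_h = derive_homogeneous_network(V, E)
```

Here A001 and A002 share both a phone number and a device, so their edge has weight 2, while each of them is linked to A003 with weight 1 through the shared device only.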
3. The method of claim 1, characterized in that it is determined whether the time corresponding to the current test data set exceeds the model update period; if not, step 5 is repeated, and if so, step 1 is repeated; the algorithm ends once fraud detection has been completed for all test data sets.
4. The method of claim 3, characterized in that it is determined whether the time t_{k+1} + t_0 is greater than the period T; if it is greater, the lending data set B_train(t_k) at time t_k is regarded as the initial lending data set, and step 1.1 of the first part is executed to rebuild the relational lending network; if it is smaller, let
Figure FDA0004119424370000046
B_train(t_{k+1}) = B_train(t_k) ∪ B_test(t_{k+1}) - B'_test(t_{k+1}); at time t_{k+1}, step 3.1 is executed to incrementally update the network characterization based on the incoming streaming lending data.
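The update rule and the period check of this claim can be sketched as follows. The function names and sample data are hypothetical, and treating the case t_{k+1} + t_0 = T as an incremental update is an assumption, since the claim only distinguishes "greater" from "smaller".

```python
def advance_training_set(train, new_batch):
    """Apply B_train(t_{k+1}) = B_train(t_k) ∪ B_test(t_{k+1}) - B'_test(t_{k+1}).

    train, new_batch: lists of (timestamp, single_number), oldest first.
    Returns (updated_train, expired), where expired plays the role of
    B'_test(t_{k+1}): as many of the earliest records as there are new ones.
    """
    k = len(new_batch)
    expired = train[:k]
    updated = train[k:] + new_batch
    return updated, expired


def next_action(t_next, t0, period):
    """Rebuild the network when t_{k+1} + t_0 exceeds the period T."""
    return "rebuild" if t_next + t0 > period else "incremental_update"


# Hypothetical data: three training records, one newly arrived record.
train = [(1, "A"), (2, "B"), (3, "C")]
new_batch = [(4, "D")]
updated, expired = advance_training_set(train, new_batch)
```

The expired records are what step 3.1 would turn into the edge set to be deleted, while the new batch yields the edge set to be added.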
CN201911101580.2A 2019-11-12 2019-11-12 Network lending fraud detection method based on incremental network characterization learning Active CN111105303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911101580.2A CN111105303B (en) 2019-11-12 2019-11-12 Network lending fraud detection method based on incremental network characterization learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911101580.2A CN111105303B (en) 2019-11-12 2019-11-12 Network lending fraud detection method based on incremental network characterization learning

Publications (2)

Publication Number Publication Date
CN111105303A CN111105303A (en) 2020-05-05
CN111105303B true CN111105303B (en) 2023-05-12

Family

ID=70420478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911101580.2A Active CN111105303B (en) 2019-11-12 2019-11-12 Network lending fraud detection method based on incremental network characterization learning

Country Status (1)

Country Link
CN (1) CN111105303B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270548B (en) * 2020-11-17 2022-09-20 National University of Defense Technology Credit card fraud detection method based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1309147A1 (en) * 2001-10-30 2003-05-07 Hewlett-Packard Company, A Delaware Corporation Method and apparatus for managing profile information in a heterogeneous or homogeneous network environment
EP2640108A1 (en) * 2012-03-16 2013-09-18 Deutsche Telekom AG Method and device for allocating wireless resources in a heterogeneous mobile network
CN110276679A (en) * 2019-05-23 2019-09-24 Wuhan University A kind of network individual credit fraud detection method towards deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Yan. Credit card fraud detection based on intrinsic features and network features. Microcomputer Applications, 2016, 32(12): 72-77. *

Also Published As

Publication number Publication date
CN111105303A (en) 2020-05-05

Similar Documents

Publication Publication Date Title
US20170103203A1 (en) Applying Multi-Level Clustering at Scale to Unlabeled Data For Anomaly Detection and Security
CN109034194B (en) Transaction fraud behavior deep detection method based on feature differentiation
US11263644B2 (en) Systems and methods for detecting unauthorized or suspicious financial activity
CN107292744A (en) Investment Trend analysis method and its system based on machine learning
CN110930038A (en) Loan demand identification method, loan demand identification device, loan demand identification terminal and loan demand identification storage medium
Savage et al. Detection of money laundering groups: Supervised learning on small networks
Jonnalagadda et al. Credit card fraud detection using Random Forest Algorithm
CN110634060A (en) User credit risk assessment method, system, device and storage medium
CN112862585A (en) Personal loan type bad asset risk rating method based on LightGBM decision tree algorithm
Kolodiziev et al. Automatic machine learning algorithms for fraud detection in digital payment systems
CN111105303B (en) Network lending fraud detection method based on incremental network characterization learning
CN111178902A (en) Network payment fraud detection method based on automatic characteristic engineering
CN110956543A (en) Method for detecting abnormal transaction
Yahaya et al. An enhanced bank customers churn prediction model using a hybrid genetic algorithm and k-means filter and artificial neural network
US20220108133A1 (en) Sharing financial crime knowledge
CN111245815B (en) Data processing method and device, storage medium and electronic equipment
CN111028073B (en) Internet financial platform network lending fraud detection system
Guan et al. Grasped: A gru-ae network based multi-perspective business process anomaly detection model
CN111275447A (en) Online network payment fraud detection system based on automatic feature engineering
Casalino et al. Balancing data within incremental semi-supervised fuzzy clustering for credit card fraud detection
Kawahara et al. Cash flow prediction of a bank deposit using scalable graph analysis and machine learning
Muranda et al. Deep learning method for detecting fraudulent motor insurance claims using unbalanced data
Lee et al. Application of machine learning in credit risk scorecard
Eria et al. Decision support credit scoring model to improve loan default prediction in financial institutions
Pasquadibisceglie et al. JARVIS: Joining Adversarial Training With Vision Transformers in Next-Activity Prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant