CN109992578A

CN109992578A - Anti-fraud method, apparatus, computer device and storage medium based on unsupervised learning

Info

Publication number: CN109992578A
Application number: CN201910011758.8A
Authority: CN
Inventors: 金晓辉; 阮晓雯; 徐亮
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-01-07
Filing date: 2019-01-07
Publication date: 2019-07-09
Anticipated expiration: 2039-01-07
Also published as: CN109992578B

Abstract

The embodiments of the present application provide an anti-fraud method, apparatus, computer device and storage medium based on unsupervised learning. The method includes: screening out data with a high fraud risk from business data according to a preset rule engine; constructing multidimensional features from the data with a high fraud risk; visualizing the constructed multidimensional features in a low-dimensional space using a multidimensional scaling (MDS) model to obtain a plurality of data points; determining normal points and potential fraud points among the visualized data points; clustering the normal points and the potential fraud points according to a clustering algorithm to obtain clusters; calculating the proportion of potential fraud points in each cluster; and determining as fraud data the business data corresponding to every point in each cluster whose proportion of potential fraud points is higher than a preset ratio. By identifying fraud data with a method based on unsupervised learning, the embodiments of the present application make the judgment more precise and improve the accuracy of identifying fraud data.

Description

Anti-fraud method, apparatus, computer device and storage medium based on unsupervised learning

Technical field

The present application relates to the technical field of data processing, and in particular to an anti-fraud method, apparatus, computer device and storage medium based on unsupervised learning.

Background

In the era of big data, data are widely used in many fields. Determining accurately, from a large volume of data, which records are normal and which involve fraud has become increasingly important. In the insurance field, for example, some claim settlement cases involve fraud. The algorithms commonly used in the industry to identify abnormal data, that is, anti-fraud algorithms, are binary classification methods. As the data volume grows, the processing capability of binary classification algorithms declines, and it becomes difficult to reliably identify abnormal data, such as claim cases involving fraudulent behavior, among the mass of data. If the recognition capability of an anti-fraud algorithm is too poor, more fraud cases go undetected, which directly causes losses to the enterprise.

Summary of the invention

The embodiments of the present application provide an anti-fraud method, apparatus, computer device and storage medium based on unsupervised learning, which can improve the accuracy of identifying fraud data.

In a first aspect, an embodiment of the present application provides an anti-fraud method based on unsupervised learning, the method including:

screening out data with a high fraud risk from business data according to a preset rule engine; constructing multidimensional features from the data with a high fraud risk and the historical behavior data of the users corresponding to that data; visualizing the constructed multidimensional features in a low-dimensional space, according to the data with a high fraud risk and using a multidimensional scaling model, to obtain a plurality of data points; determining normal points and abnormal points among the visualized data points, and taking the abnormal points as potential fraud points; clustering the normal points and the potential fraud points according to a density clustering algorithm to obtain clusters; calculating the proportion of potential fraud points in each cluster; taking each cluster whose proportion of potential fraud points is higher than a preset ratio as a target cluster; and determining the business data corresponding to every point in the target cluster as fraud data.

In a second aspect, an embodiment of the present application provides an anti-fraud apparatus based on unsupervised learning, the apparatus including units for executing the method of the first aspect.

In a third aspect, an embodiment of the present application provides a computer device, the computer device including a memory and a processor connected to the memory;

the memory is configured to store a computer program, and the processor is configured to run the computer program stored in the memory so as to execute the method of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of the first aspect.

The embodiments of the present application screen out data with a high fraud risk from business data according to a preset rule engine, construct multidimensional features from the data with a high fraud risk, visualize the multidimensional features in a low-dimensional space, determine normal points and potential fraud points among the visualized data points, cluster the normal points and the potential fraud points, and determine fraud data from the clustering result. Because fraud data are judged by a method based on unsupervised learning, the data no longer need to be labeled, which prevents mislabeled data from affecting model learning. Specifically, judging fraud data by combining the preset rule engine with a method based on unsupervised learning makes the judgment more precise and improves the accuracy of identifying fraud data.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of the anti-fraud method based on unsupervised learning provided by an embodiment of the present application;

Fig. 2 is a schematic sub-flowchart of the anti-fraud method based on unsupervised learning provided by an embodiment of the present application;

Fig. 3 is a schematic sub-flowchart of Fig. 2 provided by an embodiment of the present application;

Fig. 4 is a schematic sub-flowchart of the anti-fraud method based on unsupervised learning provided by an embodiment of the present application;

Fig. 5 is a schematic sub-flowchart of the anti-fraud method based on unsupervised learning provided by an embodiment of the present application;

Fig. 6 is a schematic block diagram of the anti-fraud apparatus based on unsupervised learning provided by an embodiment of the present application;

Fig. 7 is a schematic block diagram of the low-dimensional visualization unit provided by an embodiment of the present application;

Fig. 8 is a schematic block diagram of the point determination unit provided by an embodiment of the present application;

Fig. 9 is a schematic block diagram of the clustering unit provided by an embodiment of the present application;

Fig. 10 is a schematic block diagram of the computer device provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort fall within the protection scope of the present application.

For ease of understanding, the data involved in the following method steps are illustrated using claim form data in the insurance field. It can be understood that the data in the embodiments of the present application are not limited to claim form data in the insurance field and may be other data in other fields.

Fig. 1 is a schematic flowchart of the anti-fraud method based on unsupervised learning provided by an embodiment of the present application. As shown in Fig. 1, the method includes S101-S108.

S101: screen out data with a high fraud risk from business data according to a preset rule engine.

In the insurance field, for example, the business data may be the claim form data involved in a particular type of insurance, such as the claim form data for medical insurance or critical illness insurance. The business data are stored in a database, such as a Hive database.

Specifically, step S101 includes: determining different screening rules for different business data; and using the preset rule engine to screen out the data with a high fraud risk from the corresponding business data according to the different screening rules. Different screening rules are formulated for different business data so that data with a high fraud risk can be screened out of each; for example, different types of insurance have different screening rules, and the high-fraud-risk data screened out differ accordingly. The screening rules determined for the different business data are implemented in SQL script code, which is also used to screen the data with a high fraud risk out of the different business data; the scripts run and update automatically on a schedule and periodically output the data with a high fraud risk. The preset rule engine implements the functions carried out by this SQL script code.

For example, the screening rules for chronic-disease data with a high fraud risk include:

(1) the disease in the claim is a chronic disease (a corresponding chronic disease table lists the disease type, disease name and disease code for a total of 681 chronic diseases);

(2) the insurance type under adjustment is a medical insurance type;

(3) the reason for the claim is medical treatment due to disease;

(4) the interval from the latest effective date of the medical insurance type adjusted in the case to the date of the first accident is 30 to 180 days; and so on.

It can be understood that, according to these screening rules for chronic-disease data with a high fraud risk, SQL script code automates data screening and cleaning in the Hive database, runs and updates automatically on a schedule, and periodically outputs the target cases, that is, it screens the chronic-disease claim form data with a high fraud risk out of the medical insurance claim form data.
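The four example rules can be illustrated with a minimal Python sketch. The embodiment implements its rule engine as SQL scripts on a Hive database; Python is used here only for illustration. The field names (`disease_code`, `insurance_type`, `claim_reason`, `effective_date`, `accident_date`), the two disease codes and the sample records are hypothetical and do not come from the application.

```python
from datetime import date

# Hypothetical stand-in for the chronic disease table (the application's table has 681 entries).
CHRONIC_DISEASE_CODES = {"E11", "I10"}


def is_high_risk(claim):
    """Return True when a claim record satisfies all four example rules."""
    days = (claim["accident_date"] - claim["effective_date"]).days
    return (
        claim["disease_code"] in CHRONIC_DISEASE_CODES  # rule (1): chronic disease
        and claim["insurance_type"] == "medical"        # rule (2): medical insurance type
        and claim["claim_reason"] == "disease"          # rule (3): treatment due to disease
        and 30 <= days <= 180                           # rule (4): 30 to 180 days
    )


def screen(claims):
    """Keep only the claim records flagged as high fraud risk."""
    return [c for c in claims if is_high_risk(c)]


claims = [
    {"id": 1, "disease_code": "E11", "insurance_type": "medical", "claim_reason": "disease",
     "effective_date": date(2018, 1, 1), "accident_date": date(2018, 3, 1)},
    {"id": 2, "disease_code": "E11", "insurance_type": "medical", "claim_reason": "disease",
     "effective_date": date(2018, 1, 1), "accident_date": date(2019, 1, 1)},
    {"id": 3, "disease_code": "J00", "insurance_type": "medical", "claim_reason": "disease",
     "effective_date": date(2018, 1, 1), "accident_date": date(2018, 3, 1)},
]
high_risk = screen(claims)
```

In this sample only the first record passes: the second falls outside the 30 to 180 day window and the third is not in the hypothetical chronic disease table.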

S102: construct multidimensional features from the data with a high fraud risk and the historical behavior data of the users corresponding to the data with a high fraud risk.

First, the data with a high fraud risk and the historical behavior data of the corresponding users are obtained. Obtaining the data with a high fraud risk includes obtaining the main information of each high-fraud-risk claim form together with the policyholder information and insured-person information in the claim form. This information includes the static attributes (features) of the main and rider contracts of the claim form and the basic attributes (features) of the corresponding policyholder and insured person. Obtaining the historical behavior data of the users corresponding to the data with a high fraud risk includes obtaining the data on the policyholder's historical insurance purchasing behavior and the data on the policyholder's historical claim behavior.

Then, multidimensional feature data are constructed from the obtained data with a high fraud risk and the historical behavior data of the corresponding users.

For example, the constructed multidimensional features include:

(1) insurance customer information, including customer ID, name, date of birth, gender, identity document, employer, marital status, etc., 37 dimensions in total;

(2) policy information, including division code, policy number, main/rider contract number, insurance type category, insurance type code, effective date, etc., 84 dimensions in total;

(3) customer physical examination information, including customer ID, policy number, examining doctor, examination type, examination item, examination result, medical history, etc., 186 dimensions in total;

(4) insurance type attribute information, including 3 fields: insurance type code, insurance type attribute and insurance type series;

(5) underwriting result information, including policy number, control number, underwriting sequence number, main/rider contract number, customer ID, insured amount, grade, underwriting reason, etc., 68 dimensions in total;

(6) claim case information, including case number, processing type, case category, case status, report number, accident date, etc., 101 fields in total;

(7) claim bill information, including case number, policy number, admission date, discharge date, medical expenses incurred, remaining amount, insurance benefit payable, etc., 69 dimensions in total;

(8) disease claim information, including case number, disease serial number, disease code, disease diagnosis result, operation code, disease recovery status, etc., 8 dimensions in total;

(9) insurance and claim behavior information, including number of times insured, number of policies taken out, type of insurance taken out, insured amount, number of claims, claim amount, claimed disease, interval between claims, and other dimensions.

Static features are features whose values do not change from one insurance purchase or claim to the next. For example, the static features in the policy information include the policy number, main/rider contract number, number of main/rider contracts, insurance type category and insurance type code; the static features of the policyholder and insured person include date of birth, gender, occupation, marital status, etc. Insurance and claim behavior features are features that change with each insurance purchase or claim. For example, insurance behavior features include the number of times insured, the number of policies taken out, the type of insurance taken out, the insured amount, etc.; claim behavior features include the number of claims, claim amount, claimed disease, interval between claims, etc.

S103: according to the data with a high fraud risk, visualize the constructed multidimensional features in a low-dimensional space using a multidimensional scaling model to obtain a plurality of data points.

The multidimensional scaling model refers to multidimensional scaling (MDS), a visualization method that displays high-dimensional multivariate data in a low-dimensional space. The basic goal of multidimensional scaling is to "fit" the original data into a low-dimensional coordinate system so that the distortion caused by dimensionality reduction is minimized.

Specifically, as shown in Fig. 2, step S103 includes S201-S202.

S201: convert the non-metric variable features among the multidimensional features into metric variable features by dummy-variable conversion.

Multidimensional scaling can be divided into metric multidimensional scaling (metric MDS) and non-metric multidimensional scaling (non-metric MDS). The embodiments of the present application use metric multidimensional scaling. A dummy variable, also called an indicator variable or nominal variable, is a quantified independent variable that usually takes the value 0 or 1. Introducing dummy variables makes the description of a problem more concise and closer to reality. For example, BMI is divided by clinical criteria into the categories underweight, normal weight, overweight and obese, which might naively be assigned the values 1, 2, 3 and 4. Numerically, the values 1, 2, 3 and 4 carry an ordering from small to large, whereas in fact no such magnitude relation exists among the four weight categories; they should be treated as mutually independent and equal. Feeding the values 1, 2, 3, 4 directly into a model is therefore unreasonable, and the variable needs to be converted into dummy variables. For example, a dummy variable can be set with "normal weight" as the reference: its value is 1 for "normal weight" and 0 for the other categories ("underweight", "overweight", "obese"), which are treated together as "non-normal weight". The other weight categories are thus compared against normal weight, which has a clearer practical meaning. The constructed multidimensional features include features such as occupation and whether the person suffers from a certain disease, which are non-metric variables; they can be converted into 0-1 metric variables by dummy-variable conversion.
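Dummy-variable conversion can be sketched as follows. This is a minimal illustration that generalizes the single "normal weight" indicator to one 0/1 column per remaining category; the function name `to_dummies` and the optional `reference` argument (a baseline category that receives no column of its own) are assumptions made for this sketch, not part of the application.

```python
def to_dummies(values, reference=None):
    """Convert one categorical column into 0/1 indicator columns.

    Returns the list of column categories and one row of indicators per value.
    A value equal to `reference` is encoded as all zeros.
    """
    categories = sorted(set(values))
    if reference is not None:
        categories = [c for c in categories if c != reference]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


weights = ["normal", "obese", "underweight", "normal"]
columns, encoded = to_dummies(weights, reference="normal")
# columns is ["obese", "underweight"]; each "normal" row is encoded as [0, 0]
```

Because every indicator is either 0 or 1, the encoded columns can enter a Euclidean distance calculation without implying any ordering among the original categories.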

S202: according to the data with a high fraud risk and the converted multidimensional metric variable features, apply multidimensional scaling so as to visualize the features in a low-dimensional space and obtain a plurality of data points.

The multidimensional scaling may use the classical multidimensional scaling (classical MDS) method within metric multidimensional scaling (metric MDS), in which the measure is the Euclidean distance. The low-dimensional space may be, for example, a three-dimensional space.

In one embodiment, as shown in Fig. 3, step S202 includes the following steps S301-S308.

S301: obtain the number of records with a high fraud risk, denoted n, and the dimension of the multidimensional metric variable features, denoted q; take the n feature vectors of dimension q as sample data to obtain the $n \times q$ matrix $X$.

S302: calculate the Euclidean distance matrix $D = (d_{ij})$ from the matrix $X$, where $d_{ij} = \sqrt{\sum_{m=1}^{q} (x_{im} - x_{jm})^2}$.

S303: construct the matrix $A = (a_{ij})$ from the Euclidean distance matrix $D$, where $a_{ij} = -\tfrac{1}{2} d_{ij}^2$.

S304: calculate the inner product matrix $B = (b_{ij})$ from the matrix $A$, where $b_{ij} = a_{ij} - \bar{a}_{i\cdot} - \bar{a}_{\cdot j} + \bar{a}_{\cdot\cdot}$.

Here $\bar{a}_{i\cdot}$ is the mean of all values in row i of matrix A, $\bar{a}_{\cdot j}$ is the mean of all values in column j of matrix A, and $\bar{a}_{\cdot\cdot}$ is the mean of all values in matrix A.

S305: calculate the eigenvalues and eigenvectors of the inner product matrix B, with the eigenvalues sorted in descending order, for example $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \dots$

S306: determine the dimension k of the visualization space. For example, k=3 means the visualization is carried out in three-dimensional space, and k=4 means it is carried out in four-dimensional space.

S307: reconstruct $\hat{X} = E_k \Lambda_k^{1/2}$, where $E_k$ is the matrix formed by the first k eigenvectors of the inner product matrix B and $\Lambda_k$ is the diagonal matrix formed by the first k eigenvalues of the inner product matrix B.

The k eigenvalues are taken from the calculated eigenvalues in descending order, and the k corresponding eigenvectors are the eigenvectors belonging to those k eigenvalues.

S308: take the rows of the reconstructed $\hat{X}$ as the points in the k-dimensional space.

It should be noted that, if the visualization is carried out in three-dimensional space, the coordinate values of each reconstructed point are not three of the original multidimensional features; rather, each coordinate value of each point is derived from several of the original features.

In this way, multidimensional scaling "fits" the original data into a low-dimensional coordinate system so that the distortion caused by dimensionality reduction is minimized, and the points in the low-dimensional coordinate system represent the original data as faithfully as possible.
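Steps S301-S308 can be sketched in dependency-free Python as follows. This is an illustrative sketch, not the application's implementation: it uses the textbook classical MDS formulas ($a_{ij} = -\tfrac{1}{2} d_{ij}^2$, double centering, $\hat{X} = E_k \Lambda_k^{1/2}$), and it obtains the leading eigenpairs by power iteration with deflation, which is an assumption chosen only to avoid external libraries; a numerical library's symmetric eigen-solver would normally be used instead. The function name and the `iters` parameter are hypothetical.

```python
import math


def classical_mds(X, k, iters=500):
    """Embed the rows of X (n samples, q features) into k dimensions."""
    n = len(X)
    # S302-S303: a_ij = -1/2 * d_ij^2, with d_ij the Euclidean distance
    A = [[-0.5 * sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for j in range(n)]
         for i in range(n)]
    # S304: double centering, b_ij = a_ij - rowmean_i - colmean_j + overall mean
    row_mean = [sum(A[i]) / n for i in range(n)]
    col_mean = [sum(A[i][j] for i in range(n)) / n for j in range(n)]
    all_mean = sum(row_mean) / n
    B = [[A[i][j] - row_mean[i] - col_mean[j] + all_mean for j in range(n)]
         for i in range(n)]
    coords = [[0.0] * k for _ in range(n)]
    for d in range(k):
        # S305: power iteration finds the largest remaining eigenpair of B
        # (B is positive semidefinite when the distances are Euclidean)
        v = [math.sin(i + d + 1.0) for i in range(n)]  # deterministic start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        Bv = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(sum(v[i] * Bv[i] for i in range(n)), 0.0)  # Rayleigh quotient
        # S307-S308: coordinate d of every point is eigenvector * sqrt(eigenvalue)
        for i in range(n):
            coords[i][d] = v[i] * math.sqrt(lam)
        # Deflation removes the eigenpair just found so the next pass finds the next one
        B = [[B[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


# Four points that lie in a plane inside 3-D space; k=2 keeps all pairwise distances.
X = [[0, 0, 5], [4, 0, 5], [0, 1, 5], [4, 1, 5]]
Y = classical_mds(X, 2)
```

The signs of the embedded coordinates are arbitrary, so the meaningful check is that pairwise distances between rows of `Y` match those between rows of `X`.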

S104: determine normal points and abnormal points among the visualized data points, and take the abnormal points as potential fraud points.

In one embodiment, as shown in Fig. 4, step S104 includes the following steps S401-S404.

S401: obtain the x coordinate value, y coordinate value, ..., k-th coordinate value of every visualized point, where k denotes the dimension of the visualization space.

For example, if k=3, the x, y and z coordinate values of every visualized point are obtained.

S402: determine an x value range from the x coordinate values of all points such that the share of points whose x coordinate value falls within the x value range reaches a first preset ratio; determine a y value range from the y coordinate values of all points such that the share of points whose y coordinate value falls within the y value range reaches a second preset ratio; ...; determine a k-th value range from the k-th coordinate values of all points such that the share of points whose k-th coordinate value falls within the k-th value range reaches a k-th preset ratio.

The first preset ratio, second preset ratio, ..., k-th preset ratio may all be the same value, such as 90%, or may differ, either all being different from one another or only some of them being different.

S403: determine a space from the determined x value range, y value range, ..., k-th value range.

The dimension of this space depends on k. For example, if k=3, a three-dimensional space is determined from the x, y and z value ranges.

S404: determine the points that fall within the space as normal points and the points that do not fall within the space as abnormal points, and take the abnormal points as potential fraud points.

This embodiment takes a new angle: it determines the normal points and the potential fraud points from the space bounded by the value range of each coordinate dimension.
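Steps S401-S404 can be sketched as follows. The application does not say how each value range is positioned, so this sketch assumes a central range that trims the same number of points from both tails of each coordinate; that choice, like the function names, is an assumption for illustration only.

```python
def central_range(values, ratio):
    """Return (low, high) so that roughly `ratio` of the values fall inside.

    Assumption: the same number of points is trimmed from each tail.
    """
    ordered = sorted(values)
    n = len(ordered)
    cut = int(round(n * (1 - ratio) / 2))
    return ordered[cut], ordered[n - 1 - cut]


def split_points(points, ratios):
    """Label each k-dimensional point as normal (inside every range) or abnormal."""
    k = len(points[0])
    # S402: one value range per coordinate dimension
    ranges = [central_range([p[d] for p in points], ratios[d]) for d in range(k)]
    normal, abnormal = [], []
    for p in points:
        # S403-S404: the ranges together bound a box; points outside it are abnormal
        inside = all(lo <= p[d] <= hi for d, (lo, hi) in enumerate(ranges))
        (normal if inside else abnormal).append(p)
    return ranges, normal, abnormal


points = [(float(i), 0.0) for i in range(10)]
ranges, normal, abnormal = split_points(points, [0.8, 0.8])
```

With ten points spread along the x axis and a preset ratio of 80% per coordinate, the two extreme points fall outside the x range and become the potential fraud points.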

In one embodiment, step S104 includes: obtaining all the visualized points and fitting a threshold function to them; and determining the normal points and abnormal points according to the threshold function, the abnormal points being taken as potential fraud points. For example, a visualized point whose distance from the threshold function is greater than a preset distance is determined to be an abnormal point, and a visualized point whose distance from the threshold function is not greater than the preset distance is determined to be a normal point. The threshold functions fitted to different business data differ, and accordingly the preset distance may also differ. In this embodiment, the points in the low-dimensional space are fitted to a threshold function, and the normal points and potential fraud points are determined according to that function.

S105: cluster the normal points and the potential fraud points according to a clustering algorithm to obtain clusters.

The clustering algorithm may be a density clustering algorithm, for example the DBSCAN algorithm. Density clustering algorithms of this kind generally assume that categories can be determined by how tightly the samples are distributed. Samples of the same category are closely connected; that is, around any sample of a category there must be other samples of the same category nearby. Grouping closely connected samples together yields one cluster category, and grouping every set of closely connected samples into its own category yields the final clustering result. The distance between samples in this clustering algorithm is the Euclidean distance.

The clustering algorithm divides the normal points and the potential fraud points into three classes:

Core point: a point whose neighborhood of radius r contains more than MinPts points;

Boundary point: a point whose neighborhood of radius r contains fewer than MinPts points, but which falls within the neighborhood of a core point;

Noise point: a point that is neither a core point nor a boundary point.

In one embodiment, as shown in Fig. 5, step S105 includes the following steps S501-S503.

S501: set a radius value and a distance value, where the radius value is r and the distance value is MinPts. The distance may be the Euclidean distance, etc. For example, r=3 and MinPts=3.

S502: according to the set radius value and distance value, label the normal points and the potential fraud points as core points, boundary points or noise points, and delete the noise points. Specifically, for each point, compute the set of points within its neighborhood of radius r; take each point whose set contains more than MinPts points as a core point; check whether each remaining point lies in the neighborhood of a core point; if it does, determine it to be a boundary point, and otherwise determine it to be a noise point. Once the noise points have been determined, they are deleted. For example, with r=3 and MinPts=3, the set of points within the neighborhood r=3 of each point is computed, each point whose set contains more than MinPts=3 points is taken as a core point, and each remaining point that lies in the neighborhood of a core point is determined to be a boundary point.

S503: connect to one another the points whose distance does not exceed the set distance value to form a cluster; the cluster contains some of the core points and the boundary points within the preset-distance neighborhood of those core points. Multiple clusters are obtained in this way and are taken as the clusters after clustering.

For example, the points whose distance does not exceed MinPts=3 are connected to one another to form a cluster, and the points in the neighborhood of a core point are also added to the cluster, so that the cluster contains some of the core points and the boundary points within the preset-distance neighborhood of those core points. Multiple clusters are obtained in this way and are taken as the clusters after clustering.
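For comparison, the textbook form of DBSCAN can be sketched as follows. In the textbook algorithm MinPts is a minimum neighbor count rather than a distance, and clusters are grown by linking core points that lie within radius r of one another; the sketch follows that form, which is an assumption about how S501-S503 would be realized. It also counts a point as its own neighbor and treats a neighborhood of at least `min_pts` points as dense, which is a common convention and slightly different from the "more than MinPts" wording above. Noise points receive the label -1 instead of being deleted.

```python
import math


def dbscan(points, r, min_pts):
    """Textbook DBSCAN: return one cluster id per point, -1 for noise."""
    n = len(points)
    # Neighborhood of radius r for every point (each point is its own neighbor)
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= r]
                 for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outward from an unvisited core point
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:
                        stack.append(q)  # only core points keep expanding the cluster
        cluster += 1
    return labels


pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, r=1.5, min_pts=3)
```

In this sample the first four points form one cluster, the next three form a second cluster, and the isolated point at (50, 50) is noise.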

S106: calculate the proportion of potential fraud points in each cluster.

The proportion of potential fraud points in a cluster is the number of potential fraud points divided by the total number of points in that cluster. For example, if a cluster contains 10 points in total and 4 of them are potential fraud points, the proportion of potential fraud points is 4/10 × 100% = 40%.

S107: take each cluster whose proportion of potential fraud points is higher than a preset ratio as a target cluster.

The preset ratio may be set to 80%, for example, or to another value.

S108: determine the business data corresponding to every point in the target cluster as fraud data.

It can be understood that the density clustering algorithm already takes the spatial distribution of the points into account when forming clusters. Therefore, when the share of potential fraud points in a cluster exceeds the preset ratio, the remaining points in that cluster are also very likely to be potential fraud points, and the data corresponding to all points in the cluster can be determined as fraud data.

The above method embodiment uses a method based on unsupervised learning, so the data need not be labeled, which prevents labeling errors from affecting model learning; the method based on unsupervised learning makes the judgment of fraud data more precise and improves the accuracy of identifying fraud data. If the scheme of the above method embodiment is applied in the insurance field, intelligent claim verification can be realized.

Fig. 6 is a schematic block diagram of the anti-fraud apparatus based on unsupervised learning provided by an embodiment of the present application. As shown in Fig. 6, the apparatus includes units for executing the above anti-fraud method based on unsupervised learning. Specifically, as shown in Fig. 6, the apparatus 60 includes a screening unit 601, a feature construction unit 602, a visualization unit 603, a point determination unit 604, a clustering unit 605, a ratio calculation unit 606, a cluster determination unit 607 and a fraud data determination unit 608.

The screening unit 601 is configured to screen out data with a high fraud risk from business data according to a preset rule engine. The screening unit 601 includes a condition determination unit and a data screening unit. The condition determination unit is configured to determine different screening conditions for different business data. The data screening unit is configured to use the preset rule engine to screen out the data with a high fraud risk from the corresponding business data according to the different screening conditions.

The feature construction unit 602 is configured to construct multidimensional features from the data with a high fraud risk and the historical behavior data of the users corresponding to the data with a high fraud risk.

The visualization unit 603 is configured to visualize the constructed multidimensional features in a low-dimensional space, according to the data with a high fraud risk and using a multidimensional scaling model, to obtain a plurality of data points.

In one embodiment, the visualization unit 603 includes a variable conversion unit and a low-dimensional visualization unit. The variable conversion unit is configured to convert the non-metric variable features among the multidimensional features into metric variable features by dummy-variable conversion. The low-dimensional visualization unit is configured to apply multidimensional scaling, according to the data with a high fraud risk and the converted multidimensional metric variable features, so as to visualize the features in a low-dimensional space and obtain a plurality of data points.

In one embodiment, as shown in Fig. 7, the low-dimensional visualization unit 70 includes a sample data acquisition unit 701, a distance matrix calculation unit 702, a matrix construction unit 703, an inner product matrix calculation unit 704, an eigenvalue calculation unit 705, a space dimension determination unit 706, a reconstruction unit 707 and a low-dimensional space point determination unit 708. The sample data acquisition unit 701 is configured to obtain the number n of data records having a high fraud risk and the dimension q of the multidimensional metric variable features, and to take the n q-dimensional feature records as sample data to obtain a matrix X. The distance matrix calculation unit 702 is configured to calculate the Euclidean distance matrix D from the matrix X, where $D = (d_{ij})_{n \times n}$ and $d_{ij} = \lVert x_i - x_j \rVert_2$, with $x_i$ denoting the i-th row of X. The matrix construction unit 703 is configured to construct a matrix A from the Euclidean distance matrix, where $A = (a_{ij})_{n \times n}$ and $a_{ij} = -\tfrac{1}{2} d_{ij}^{2}$. The inner product matrix calculation unit 704 is configured to calculate the inner product matrix B from the matrix A, where $B = HAH$ and $H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\mathsf{T}}$. The eigenvalue calculation unit 705 is configured to calculate the eigenvalues and eigenvectors of the inner product matrix B, with the eigenvalues sorted in descending order. The space dimension determination unit 706 is configured to determine the dimension k of the visualization space. The reconstruction unit 707 is configured to reconstruct $\hat{X} = E_k \Lambda_k^{1/2}$, where $E_k$ is the matrix formed by the first k eigenvectors of the inner product matrix B and $\Lambda_k$ is the diagonal matrix formed by the first k eigenvalues. The low-dimensional space point determination unit 708 is configured to take the reconstructed values, that is, the rows of $\hat{X}$, as the points in the k-dimensional space.
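The sequence X, D, A, B, eigen-decomposition, reconstruction can be sketched in plain Python. The sketch assumes the standard classical MDS definitions (a_ij = -d_ij^2 / 2 and B = HAH with H the centering matrix), which are not spelled out in the text above, and it swaps in power iteration with deflation for the eigen-decomposition so that no external library is needed. The function name `classical_mds` and the parameter `iters` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def classical_mds(X: Matrix, k: int, iters: int = 500) -> Matrix:
    """Embed the n rows of X into k dimensions (classical MDS sketch)."""
    n = len(X)
    # a_ij = -1/2 * d_ij^2, with d_ij the Euclidean distance between rows i and j.
    A = [[-0.5 * sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for j in range(n)]
         for i in range(n)]
    # B = H A H with H = I - (1/n) 1 1^T, written out as double centering.
    row_mean = [sum(A[i]) / n for i in range(n)]
    grand_mean = sum(row_mean) / n
    B = [[A[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
         for i in range(n)]
    tol = 1e-9 * max(1.0, max(abs(B[i][i]) for i in range(n)))

    coords = [[0.0] * k for _ in range(n)]
    start_norm = math.sqrt(sum((i + 1) ** 2 for i in range(n)))
    for dim in range(k):
        # Power iteration: converges to the largest remaining eigenpair of B.
        v = [(i + 1) / start_norm for i in range(n)]
        lam = 0.0
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < tol:  # no variance left to extract
                lam = 0.0
                break
            v = [x / norm for x in w]
            lam = norm
        # Column `dim` of E_k * Lambda_k^(1/2).
        for i in range(n):
            coords[i][dim] = math.sqrt(lam) * v[i]
        # Deflate B so the next pass finds the next eigenpair.
        for i in range(n):
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return coords


# Three 3-dimensional feature records embedded in a 2-dimensional space.
points_2d = classical_mds([[0.0, 0.0, 1.0], [2.0, 0.0, 1.0], [0.0, 3.0, 1.0]], k=2)
```

Because the input distances are Euclidean, B is positive semi-definite and the embedding preserves pairwise distances whenever k is at least the rank of the centered data. For larger n, a library eigen-solver would be the more practical choice.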

The point determination unit 604 is configured to determine normal points and abnormal points from the multiple visualized data points, and to determine the abnormal points as potential fraud points.

In one embodiment, as shown in Fig. 8, the point determination unit 604 includes a coordinate value acquisition unit 801, a range determination unit 802, a space determination unit 803 and a first determination unit 804. The coordinate value acquisition unit 801 is configured to obtain the x coordinate value, the y coordinate value, ..., and the k-th coordinate value of every visualized point, where k denotes the dimension of the visualization space. The range determination unit 802 is configured to determine an x value range from the x coordinate values of all points, such that the proportion of points whose x coordinate value falls within the x value range reaches a first preset ratio; to determine a y value range from the y coordinate values of all points, such that the proportion of points whose y coordinate value falls within the y value range reaches a second preset ratio; and so on, up to determining a k-th value range from the k-th coordinate values of all points, such that the proportion of points whose k-th coordinate value falls within the k-th value range reaches a k-th preset ratio. The space determination unit 803 is configured to determine a space from the determined x value range, y value range, ..., and k-th value range. The first determination unit 804 is configured to determine the points falling within the space as normal points and the points not falling within the space as abnormal points, and to determine the abnormal points as potential fraud points.
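The range-based determination can be sketched as below. The text does not say how each value range is chosen, so this sketch assumes the narrowest interval that covers the required share of points on each axis; `central_range` and `split_points` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def central_range(values: Sequence[float], ratio: float) -> Tuple[float, float]:
    """Narrowest interval that covers at least `ratio` of the values (assumption)."""
    ordered = sorted(values)
    n = len(ordered)
    keep = max(1, math.ceil(ratio * n - 1e-9))  # number of values to cover
    best = min(range(n - keep + 1),
               key=lambda s: ordered[s + keep - 1] - ordered[s])
    return ordered[best], ordered[best + keep - 1]


def split_points(points: Sequence[Point],
                 ratios: Sequence[float]) -> Tuple[List[int], List[int]]:
    """Return (normal, abnormal) index lists using one value range per axis."""
    ranges = [central_range([p[d] for p in points], r)
              for d, r in enumerate(ratios)]
    normal, abnormal = [], []
    for i, p in enumerate(points):
        # A point is normal only if it lies inside the box on every axis.
        inside = all(lo <= p[d] <= hi for d, (lo, hi) in enumerate(ranges))
        (normal if inside else abnormal).append(i)
    return normal, abnormal


points = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2), (5.0, 4.0)]
normal, abnormal = split_points(points, ratios=[0.8, 0.8])
```

The ranges on the individual axes together define an axis-aligned box; points outside the box on any axis are treated as abnormal and therefore as potential fraud points.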

In one embodiment, the point determination unit 604 includes a fitting unit and a second determination unit. The fitting unit is configured to obtain all visualized points and to fit a threshold function from them. The second determination unit is configured to determine normal points and abnormal points according to the threshold function, and to determine the abnormal points as potential fraud points.
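The text does not specify the form of the threshold function. One possible reading, given purely as an assumption, is a distance boundary around the centroid of the visualized points: a point is abnormal when its distance to the centroid exceeds the mean distance plus c standard deviations. The name `fit_threshold` and the factor `c` are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Point = Sequence[float]


def fit_threshold(points: Sequence[Point], c: float = 2.0) -> Callable[[Point], bool]:
    """Fit a distance threshold around the centroid; returns an is_abnormal test.

    Assumption: the threshold is mean + c * std of the centroid distances.
    """
    dims = len(points[0])
    center = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    dist = [math.dist(p, center) for p in points]
    mean = sum(dist) / len(dist)
    std = math.sqrt(sum((x - mean) ** 2 for x in dist) / len(dist))
    limit = mean + c * std
    return lambda p: math.dist(p, center) > limit


points = [(1, 0), (-1, 0), (0, 1), (0, -1),
          (1, 1), (-1, -1), (1, -1), (-1, 1), (10, 10)]
is_abnormal = fit_threshold(points)
flags: List[bool] = [is_abnormal(p) for p in points]
```

Other boundaries, such as a fitted curve or a density estimate, would fit the description equally well; the mean-plus-deviation rule is used here only because it is short and easy to check.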

The clustering unit 605 is configured to cluster the normal points and the potential fraud points according to a clustering algorithm, to obtain the clusters after clustering.

In one embodiment, as shown in Fig. 9, the clustering unit 605 includes a setting unit 901, a point marking unit 902 and a cluster clustering unit 903. The setting unit 901 is configured to set a radius value and a distance value. The point marking unit 902 is configured to classify and mark the normal points and the potential fraud points as core points, boundary points or noise points according to the set radius value and distance value, and to delete the noise points. The cluster clustering unit 903 is configured to connect points whose mutual distance does not exceed the set distance value to form a cluster, the cluster containing some of the core points and the boundary points within the set-distance-value neighborhood of those core points; multiple clusters are obtained in this way and are taken as the clusters after clustering.
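The marking and connecting steps resemble density-based clustering, and can be sketched as follows. The sketch assumes a DBSCAN-style definition of core points (at least `min_pts` points within the radius, where `min_pts` is a hypothetical parameter that the text does not name) and keeps the radius value and the distance value as two separate parameters.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def density_cluster(points: Sequence[Point], radius: float,
                    link_dist: float, min_pts: int = 3) -> List[List[int]]:
    """Return clusters as lists of point indices; noise points are dropped."""
    n = len(points)

    def near(i: int, limit: float) -> List[int]:
        return [j for j in range(n)
                if j != i and math.dist(points[i], points[j]) <= limit]

    # Mark points: core, boundary (non-core but near a core point), else noise.
    core = {i for i in range(n) if len(near(i, radius)) + 1 >= min_pts}
    border = {i for i in range(n)
              if i not in core and any(j in core for j in near(i, radius))}

    # Connect core points whose distance does not exceed link_dist.
    label: Dict[int, int] = {}
    cluster_count = 0
    for seed in sorted(core):
        if seed in label:
            continue
        label[seed] = cluster_count
        stack = [seed]
        while stack:
            cur = stack.pop()
            for j in near(cur, link_dist):
                if j in core and j not in label:
                    label[j] = cluster_count
                    stack.append(j)
        cluster_count += 1

    # Attach each boundary point to the cluster of its nearest core point.
    for b in sorted(border):
        candidates = [j for j in near(b, link_dist) if j in core]
        if candidates:
            nearest = min(candidates, key=lambda j: math.dist(points[b], points[j]))
            label[b] = label[nearest]

    clusters: List[List[int]] = [[] for _ in range(cluster_count)]
    for i in sorted(label):
        clusters[label[i]].append(i)
    return clusters


pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1.0),
       (10, 10), (10, 11), (11, 10), (50, 50)]
clusters = density_cluster(pts, radius=1.5, link_dist=1.5, min_pts=3)
```

In this example the point at index 4 is a boundary point attached to the first cluster, and the isolated point at index 8 is noise and is removed before clustering.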

The ratio calculation unit 606 is configured to calculate the proportion of potential fraud points in each cluster.

The cluster determination unit 607 is configured to take each cluster in which the proportion of potential fraud points is higher than a preset ratio as a target cluster.

The fraud data determination unit 608 is configured to determine the business data corresponding to every point in the target cluster as fraud data.
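The last three units can be sketched together. The sketch assumes clusters are given as lists of record indices and potential fraud points as a set of indices; the function name `fraud_records` and the sample values are hypothetical.

```python
from typing import List, Sequence, Set


def fraud_records(clusters: Sequence[Sequence[int]],
                  potential_fraud: Set[int],
                  preset_ratio: float) -> List[int]:
    """Return indices of all records in clusters dominated by potential fraud points."""
    flagged: List[int] = []
    for members in clusters:
        if not members:
            continue
        # Proportion of potential fraud points in this cluster.
        share = sum(1 for i in members if i in potential_fraud) / len(members)
        if share > preset_ratio:
            # Target cluster: every point in it is treated as fraud data.
            flagged.extend(members)
    return sorted(flagged)


clusters = [[0, 1, 2, 3, 4], [5, 6, 7]]
potential_fraud = {4, 5, 6}
flagged = fraud_records(clusters, potential_fraud, preset_ratio=0.5)
```

Note that record 7 is flagged even though it was originally a normal point, because it sits in a target cluster, while potential fraud point 4 is not flagged because its cluster is mostly normal. This is the effect of judging whole clusters rather than individual points.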

It should be noted that, as those skilled in the art will clearly understand, the specific implementation of the above apparatus and of each unit may refer to the corresponding description in the foregoing method embodiments and, for convenience and brevity, is not repeated here.

The above apparatus may be implemented in the form of a computer program, and the computer program can run on a computer device as shown in Fig. 10.

Fig. 10 is a schematic block diagram of a computer device provided by an embodiment of the present application. The device is a terminal or similar device, such as a mobile terminal, a PC terminal or an iPad. The device 100 includes a processor 102, a memory and a network interface 103 connected via a system bus 101, where the memory may include a non-volatile storage medium 104 and an internal memory 105.

The non-volatile storage medium 104 can store an operating system 1041 and a computer program 1042. When the computer program 1042 stored in the non-volatile storage medium is executed by the processor 102, the anti-fraud method based on unsupervised learning described above can be implemented. The processor 102 provides computing and control capabilities and supports the operation of the whole device. The internal memory 105 provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor 102, it causes the processor 102 to execute the anti-fraud method based on unsupervised learning described above. The network interface 103 is used for network communication. Those skilled in the art will understand that the structure shown in Fig. 10 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the device to which the solution is applied; a specific device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

The processor 102 is configured to run the computer program stored in the memory to implement the following steps:

screening out data having a high fraud risk from business data according to a preset rule engine; constructing multidimensional features according to the data having a high fraud risk and the historical behavior data of the users corresponding to that data; visualizing the constructed multidimensional features in a low-dimensional space, according to the data having a high fraud risk and using a multidimensional scaling model, to obtain multiple data points; determining normal points and abnormal points from the multiple visualized data points, and determining the abnormal points as potential fraud points; clustering the normal points and the potential fraud points according to a clustering algorithm to obtain the clusters after clustering; calculating the proportion of potential fraud points in each cluster; taking each cluster in which the proportion of potential fraud points is higher than a preset ratio as a target cluster; and determining the business data corresponding to every point in the target cluster as fraud data.

In one embodiment, when executing the step of determining normal points and abnormal points from the multiple visualized data points and determining the abnormal points as potential fraud points, the processor 102 specifically implements the following steps:

obtaining the x coordinate value, the y coordinate value, ..., and the k-th coordinate value of every visualized point, where k denotes the dimension of the visualization space; determining an x value range from the x coordinate values of all points, such that the proportion of points whose x coordinate value falls within the x value range reaches a first preset ratio; determining a y value range from the y coordinate values of all points, such that the proportion of points whose y coordinate value falls within the y value range reaches a second preset ratio; and so on, up to determining a k-th value range from the k-th coordinate values of all points, such that the proportion of points whose k-th coordinate value falls within the k-th value range reaches a k-th preset ratio; determining a space from the determined x value range, y value range, ..., and k-th value range; and determining the points falling within the space as normal points and the points not falling within the space as abnormal points, and determining the abnormal points as potential fraud points.

Alternatively, the processor 102 implements the following steps: obtaining all visualized points and fitting a threshold function from them; and determining normal points and abnormal points according to the threshold function, and determining the abnormal points as potential fraud points.

In one embodiment, when executing the step of visualizing the constructed multidimensional features in a low-dimensional space, according to the data having a high fraud risk and using a multidimensional scaling model, to obtain multiple data points, the processor 102 specifically implements the following steps:

converting the non-metric variable features among the multidimensional features into metric variable features by dummy variable conversion; and processing the data having a high fraud risk and the converted multidimensional metric variable features using multidimensional scaling, so as to visualize them in a low-dimensional space and obtain multiple data points.

In one embodiment, when executing the step of processing the data having a high fraud risk and the converted multidimensional metric variable features using multidimensional scaling, so as to visualize them in a low-dimensional space and obtain multiple data points, the processor 102 specifically implements the following steps:

obtaining the number n of data records having a high fraud risk and the dimension q of the multidimensional metric variable features, and taking the n q-dimensional feature records as sample data to obtain a matrix X; calculating the Euclidean distance matrix D from the matrix X, where $D = (d_{ij})_{n \times n}$ and $d_{ij} = \lVert x_i - x_j \rVert_2$, with $x_i$ denoting the i-th row of X; constructing a matrix A from the Euclidean distance matrix, where $A = (a_{ij})_{n \times n}$ and $a_{ij} = -\tfrac{1}{2} d_{ij}^{2}$; calculating the inner product matrix B from the matrix A, where $B = HAH$ and $H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\mathsf{T}}$; calculating the eigenvalues and eigenvectors of the inner product matrix B, with the eigenvalues sorted in descending order; determining the dimension k of the visualization space; reconstructing $\hat{X} = E_k \Lambda_k^{1/2}$, where $E_k$ is the matrix formed by the first k eigenvectors of the inner product matrix B and $\Lambda_k$ is the diagonal matrix formed by the first k eigenvalues; and taking the reconstructed values as the points in the k-dimensional space.

In one embodiment, when executing the step of clustering the normal points and the potential fraud points according to a clustering algorithm to obtain the clusters after clustering, the processor 102 specifically implements the following steps:

setting a radius value and a distance value; classifying and marking the normal points and the potential fraud points as core points, boundary points or noise points according to the set radius value and distance value, and deleting the noise points; and connecting points whose mutual distance does not exceed the set distance value to form a cluster, the cluster containing some of the core points and the boundary points within the set-distance-value neighborhood of those core points, multiple clusters being obtained in this way and taken as the clusters after clustering.

In one embodiment, when executing the step of screening out data having a high fraud risk from business data according to a preset rule engine, the processor 102 specifically implements the following steps:

determining different screening rules for different business data; and using the preset rule engine to screen out the data having a high fraud risk from the corresponding business data according to the different screening rules.

It should be understood that, in the embodiments of the present application, the processor 102 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.

Those of ordinary skill in the art will understand that all or part of the flow of the methods in the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program may be stored in a storage medium, which may be a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the flow steps of the above method embodiments.

Accordingly, the present application also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, implements the steps of the anti-fraud method based on unsupervised learning described above.

In one embodiment, when executing the step of determining normal points and abnormal points from the multiple visualized data points and determining the abnormal points as potential fraud points, the processor specifically implements the same steps as described above for the processor 102.

In one embodiment, when executing the step of visualizing the constructed multidimensional features in a low-dimensional space, according to the data having a high fraud risk and using a multidimensional scaling model, to obtain multiple data points, the processor specifically implements the same steps as described above for the processor 102.

In one embodiment, when executing the step of processing the data having a high fraud risk and the converted multidimensional metric variable features using multidimensional scaling, so as to visualize them in a low-dimensional space and obtain multiple data points, the processor specifically implements the following steps:

obtaining the number n of data records having a high fraud risk and the dimension q of the multidimensional metric variable features, and taking the n q-dimensional feature records as sample data to obtain a matrix X; calculating the Euclidean distance matrix D from the matrix X, where $D = (d_{ij})_{n \times n}$ and $d_{ij} = \lVert x_i - x_j \rVert_2$, with $x_i$ denoting the i-th row of X; constructing a matrix A from the Euclidean distance matrix, where $A = (a_{ij})_{n \times n}$ and $a_{ij} = -\tfrac{1}{2} d_{ij}^{2}$; calculating the inner product matrix B from the matrix A, where $B = HAH$ and $H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\mathsf{T}}$; calculating the eigenvalues and eigenvectors of the inner product matrix B, with the eigenvalues sorted in descending order; determining the dimension k of the visualization space; reconstructing $\hat{X} = E_k \Lambda_k^{1/2}$, where $E_k$ is the matrix formed by the first k eigenvectors of the inner product matrix B and $\Lambda_k$ is the diagonal matrix formed by the first k eigenvalues; and taking the reconstructed values as the points in the k-dimensional space.

In one embodiment, when executing the step of clustering the normal points and the potential fraud points according to a clustering algorithm to obtain the clusters after clustering, the processor specifically implements the same steps as described above for the processor 102.

In one embodiment, when executing the step of screening out data having a high fraud risk from business data according to a preset rule engine, the processor specifically implements the same steps as described above for the processor 102.

The storage medium may be any computer-readable storage medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk or an optical disc.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in actual implementation. Those skilled in the art will clearly understand that, for convenience and brevity, the specific working processes of the apparatus, device and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.

The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and these modifications or substitutions shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. An anti-fraud method based on unsupervised learning, characterized in that the method comprises:

screening out data having a high fraud risk from business data according to a preset rule engine;

constructing multidimensional features according to the data having a high fraud risk and the historical behavior data of the users corresponding to the data having a high fraud risk;

visualizing the constructed multidimensional features in a low-dimensional space, according to the data having a high fraud risk and using a multidimensional scaling model, to obtain multiple data points;

determining normal points and abnormal points from the multiple visualized data points, and determining the abnormal points as potential fraud points;

clustering the normal points and the potential fraud points according to a clustering algorithm to obtain clusters after clustering;

calculating the proportion of potential fraud points in each cluster;

taking a cluster in which the proportion of potential fraud points is higher than a preset ratio as a target cluster;

determining the business data corresponding to every point in the target cluster as fraud data.

2. The method according to claim 1, characterized in that determining normal points and abnormal points from the multiple visualized data points and determining the abnormal points as potential fraud points comprises:

obtaining the x coordinate value, the y coordinate value, ..., and the k-th coordinate value of every visualized point, where k denotes the dimension of the visualization space;

determining an x value range from the x coordinate values of all points, such that the proportion of points whose x coordinate value falls within the x value range reaches a first preset ratio; determining a y value range from the y coordinate values of all points, such that the proportion of points whose y coordinate value falls within the y value range reaches a second preset ratio; ...; determining a k-th value range from the k-th coordinate values of all points, such that the proportion of points whose k-th coordinate value falls within the k-th value range reaches a k-th preset ratio;

determining a space from the determined x value range, y value range, ..., and k-th value range;

determining the points falling within the space as normal points and the points not falling within the space as abnormal points, and determining the abnormal points as potential fraud points.

3. The method according to claim 1, characterized in that determining normal points and abnormal points from the multiple visualized data points and determining the abnormal points as potential fraud points comprises:

obtaining all visualized points, and fitting a threshold function from all the visualized points;

determining normal points and abnormal points according to the threshold function, and determining the abnormal points as potential fraud points.

4. The method according to claim 1, characterized in that visualizing the constructed multidimensional features in a low-dimensional space, according to the data having a high fraud risk and using a multidimensional scaling model, to obtain multiple data points comprises:

converting the non-metric variable features among the multidimensional features into metric variable features by dummy variable conversion;

processing the data having a high fraud risk and the converted multidimensional metric variable features using multidimensional scaling, so as to visualize them in a low-dimensional space and obtain multiple data points.

5. The method according to claim 4, characterized in that processing the data having a high fraud risk and the converted multidimensional metric variable features using multidimensional scaling, so as to visualize them in a low-dimensional space and obtain multiple data points, comprises:

obtaining the number n of data records having a high fraud risk and the dimension q of the multidimensional metric variable features, and taking the n q-dimensional feature records as sample data to obtain a matrix X;

calculating the Euclidean distance matrix D from the matrix X, where $D = (d_{ij})_{n \times n}$ and $d_{ij} = \lVert x_i - x_j \rVert_2$, with $x_i$ denoting the i-th row of X;

constructing a matrix A from the Euclidean distance matrix, where $A = (a_{ij})_{n \times n}$ and $a_{ij} = -\tfrac{1}{2} d_{ij}^{2}$;

calculating the inner product matrix B from the matrix A, where $B = HAH$ and $H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\mathsf{T}}$;

calculating the eigenvalues and eigenvectors of the inner product matrix B, with the eigenvalues sorted in descending order;

determining the dimension k of the visualization space;

reconstructing $\hat{X} = E_k \Lambda_k^{1/2}$, where $E_k$ is the matrix formed by the first k eigenvectors of the inner product matrix B and $\Lambda_k$ is the diagonal matrix formed by the first k eigenvalues;

taking the reconstructed values as the points in the k-dimensional space.

6. The method according to claim 1, characterized in that clustering the normal points and the potential fraud points according to a clustering algorithm to obtain clusters after clustering comprises:

setting a radius value and a distance value;

classifying and marking the normal points and the potential fraud points as core points, boundary points or noise points according to the set radius value and distance value, and deleting the noise points;

connecting points whose mutual distance does not exceed the set distance value to form a cluster, the cluster containing some of the core points and the boundary points within the set-distance-value neighborhood of those core points, multiple clusters being obtained in this way and taken as the clusters after clustering.

7. The method according to claim 1, characterized in that screening out data having a high fraud risk from business data according to a preset rule engine comprises:

determining different screening rules for different business data;

using the preset rule engine to screen out the data having a high fraud risk from the corresponding business data according to the different screening rules.

8. An anti-fraud apparatus based on unsupervised learning, characterized in that the anti-fraud apparatus based on unsupervised learning comprises:

a screening unit, configured to screen out data having a high fraud risk from business data according to a preset rule engine;

a feature construction unit, configured to construct multidimensional features according to the data having a high fraud risk and the historical behavior data of the users corresponding to the data having a high fraud risk;

a visualization unit, configured to visualize the constructed multidimensional features in a low-dimensional space, according to the data having a high fraud risk and using a multidimensional scaling model, to obtain multiple data points;

a point determination unit, configured to determine normal points and abnormal points from the multiple visualized data points, and to determine the abnormal points as potential fraud points;

a clustering unit, configured to cluster the normal points and the potential fraud points according to a clustering algorithm to obtain clusters after clustering;

a ratio calculation unit, configured to calculate the proportion of potential fraud points in each cluster;

a cluster determination unit, configured to take a cluster in which the proportion of potential fraud points is higher than a preset ratio as a target cluster;

a fraud data determination unit, configured to determine the business data corresponding to every point in the target cluster as fraud data.

9. A computer device, characterized in that the computer device comprises a memory and a processor connected to the memory;

the memory is configured to store a computer program; the processor is configured to run the computer program stored in the memory to execute the method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.