CN103793484A - Fraudulent conduct identification system based on machine learning in classified information website

Info

Publication number: CN103793484A
Application number: CN201410022138.1A
Authority: CN
Inventors: 张鹏; 张爱华; 张美琦; 张朝阳; 孙亚健
Original assignee: Beijing 58 Information Technology Co Ltd
Current assignee: Beijing 58 Information Technology Co Ltd
Priority date: 2014-01-17
Filing date: 2014-01-17
Publication date: 2014-05-14
Anticipated expiration: 2034-01-17
Also published as: CN103793484B

Abstract

The invention provides a method for a machine-learning-based fraud identification system in a classified information website. The method comprises the following steps: (a) sample data are extracted from existing user behavior data and used to generate an initial model; (b) multiple user behavior features are selected and extracted according to the training data of different service types; (c) the sample training data are vectorized based on the extracted user behavior features; (d) the vectorized sample training data are used to generate a prediction model; (e) online data are checked with the generated model based on classification and clustering rules; (f) the abnormal user data detected are processed. The method identifies user behavior along multiple dimensions and efficiently reduces the amount of false transaction information. Moreover, low-quality user behavior can still be identified well even when the training data contain noise.

Description

Machine-learning-based fraud identification system for a classified information website

Technical field

The present invention relates to Internet technology, and in particular to a machine-learning-based fraud identification system for a classified information website.

Background technology

Classified information websites are a newly emerging type of website on the Internet that carry information touching every aspect of daily life. On these websites users can obtain free and convenient information publishing services covering second-hand goods trading, used-car trading, housing, pets, recruitment, part-time work, job hunting, social activities, local life services and so on. Classified information is also called classified advertising. The advertisements that people see every day on television and in newspapers and periodicals are imposed on the viewer whether or not the viewer is willing; such advertisements are passive advertisements. By contrast, information that people actively look up, for example about recruitment, rentals or travel, is called active advertising. As the information society develops, passive advertising increasingly arouses people's dislike, while active advertising is widely favored. Almost every local evening paper, daily paper and lifestyle or entertainment paper carries classified information, and the better the newspaper, the more space it tends to give to it. Classified information websites arose from this.

Among the users who post information on a classified information website there are often some low-quality users, who seek gain by defrauding other users, for example by posting false information. Classified information websites therefore set up processing rules and filter logic for low-quality information.

Existing means of identifying false information are mainly rule-based, supplemented by some manual intervention. For example, rules such as the number of posts from a single IP within a period of time, whether the content contains prohibited words, or whether the price range of the goods or services posted is unreasonable are used to judge whether a user is a low-quality user posting false information, after which measures such as deleting the information, issuing a warning or cancelling the account are taken. However, common processing rules and filter logic usually identify low-quality behavior along a single dimension, so low-quality users can probe the thresholds of the rules and thereby find ways around the system's processing and filter logic.

In addition, as more and more rules go live, fewer and fewer usable rules remain, because rules can only capture obvious features. Rule-based identification in existing methods can also only separate users with a linear decision surface, so most low-quality information is neither identified nor processed by the system.

Therefore, a machine-learning-based fraud identification system for classified information websites is needed that identifies user behavior along multiple dimensions, thereby efficiently reducing the amount of false transaction information and improving the authenticity of transaction information.

Summary of the invention

The object of the present invention is to provide a machine-learning-based fraud identification system for a classified information website.

According to one aspect of the present invention, a method is provided for a machine-learning-based fraud identification system in a classified information website, the method comprising the following steps: a) extracting sample data from existing user behavior data, for generating an initial model; b) selecting and extracting multiple user behavior features for the training data of different service types; c) vectorizing the sample training data based on the extracted user behavior features; d) producing a prediction model from the vectorized sample training data; e) checking online data with the produced model based on classification and clustering rules; f) processing the abnormal user data detected.

Preferably, the sample data in step a comprise positive sample data and negative sample data, corresponding respectively to users with high-quality behavior and users with low-quality behavior.

Preferably, in step b, the user behavior features comprise user behavior data for the same cookie and statistical counts in each user dimension.

Preferably, in step b, the user features extracted for different service types are selected by computing information entropy and by cross-validating the model.

Preferably, in step d, a probabilistic classifier is used to make the decision.

Preferably, in step e, the model is used to calculate a probability point representing the probability that the user behavior data are abnormal.

Preferably, the probability point is calculated as follows: multiple models each check one of several feature groups of the user behavior data and each yield a sub probability point; a sum-of-products conversion operation is then applied to the sub probability points to obtain the probability point of the user behavior data.

Preferably, in step e, the classification-rule-based method of detecting abnormal user behavior comprises setting a probability line for judging whether user behavior data are bad data.

Preferably, in step e, the clustering-rule-based method of detecting abnormal user behavior comprises: e1) monitoring the probability points for clustering; e2) when clustering of probability points is detected, checking a certain number of the user behaviors to judge whether the user behavior clustered at the same probability point is low-quality user behavior; e3) updating, in the abnormal user behavior identification model, the probability point of that class of user behavior according to the check result; e4) adding the newly found bad data to the sample library as training data; e5) training the model with the new training data.

Preferably, in step e5, user behavior data whose probability points are inaccurate are analyzed offline to discover new user behavior features and select suitable ones.

With the machine-learning-based fraud identification system for a classified information website according to the present invention, user behavior can be identified along multiple dimensions, efficiently reducing the amount of false transaction information and improving the authenticity of transaction information. Moreover, low-quality user behavior can still be identified well even when the training data contain noise.

Brief description of the drawings

Further objects, functions and advantages of the present invention will become clear from the following description of embodiments of the invention with reference to the accompanying drawing, in which:

Fig. 1 schematically shows a flow chart of the method of the machine-learning-based fraud identification system for a classified information website according to the present invention.

Detailed description of the embodiments

The objects and functions of the present invention, and the methods for achieving them, will be explained with reference to exemplary embodiments. However, the invention is not limited to the exemplary embodiments disclosed below and can be implemented in different forms. The description is intended only to help those skilled in the relevant art gain a comprehensive understanding of the specific details of the invention.

Hereinafter, embodiments of the invention are described with reference to the drawing. In the drawing, identical reference numerals denote identical or similar parts, or identical or similar steps.

The fraud information identification method of the present invention uses data generated from user behavior and can identify information data immediately as users post them. The machine learning model adopted by the invention identifies user behavior along multiple dimensions; in particular, users who post low-quality information find it hard to know which dimensions are being used for identification, and so cannot evade detection by working around rules. The invention can make predictions on data with high accuracy even with few samples and high noise. Abnormal users are identified by collecting and modeling the various behaviors of users.

Fig. 1 schematically shows a flow chart of the method of the machine-learning-based fraud identification system for a classified information website according to the present invention. As shown in Fig. 1:

Step 110: extract sample data from existing user behavior data, for generating an initial model. The sample data can be extracted from an existing database of reviewed user behavior and are mainly used to divide users into high-quality users and low-quality users, corresponding respectively to positive sample data and negative sample data. A positive sample is the behavior data of a user whose behavior has passed review as high-quality; a negative sample is the behavior data of a user whose behavior has been identified in review as low-quality, for example a user who has broken some fairly serious rule (such as posting false transaction information). The existing review database is a user behavior database built by classifying users with some conventional user behavior identification methods, for example checking whether the text a user posts contains prohibited words, or whether the pictures a user posts contain prohibited content.

In model iterations after the initial model has been generated, the method according to the invention can use the extracted positive and negative sample library directly, without going back to the original review database.

Step 120: select and extract multiple user behavior features for the training data of different service types. Which user behavior features to use for each business line is decided by experimental results.

Users typically exhibit a very large number of behavioral features. To balance computational accuracy against computational efficiency, the user behavior features summarized according to the invention are those that discriminate between good and bad users, and a large number of features is not required. The smaller each model the better, the aim being that multiple models can be used to check the data at final identification.

User behavior features are behavioral features discovered through offline analysis. They generally do not duplicate other online rules and can be understood as features that back-office reviewers would not notice. Examples are user behavior data for the same cookie, including the number of cross-city posts, the mobile phone numbers used, time intervals and mouse click behavior, the interval between registration and posting, and data in dimensions such as login behavior and browsing behavior, together with statistical counts in each user dimension, such as the N-day cross-city count for an IP and the cookie count.
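As an illustration of per-cookie features of the kind listed above, the following minimal Python sketch aggregates posting events by cookie. The event fields (`cookie`, `city`, `phone`, `registered_at`, `posted_at`) and the feature names are assumptions made for this example; the source does not specify a data layout.

```python
from collections import defaultdict


def cookie_features(events):
    """Aggregate posting events into per-cookie behavior features.

    All field and feature names here are hypothetical.
    """
    cities = defaultdict(set)
    phones = defaultdict(set)
    gaps = defaultdict(list)
    for e in events:
        cities[e["cookie"]].add(e["city"])
        phones[e["cookie"]].add(e["phone"])
        # Interval between registration and posting, in seconds.
        gaps[e["cookie"]].append(e["posted_at"] - e["registered_at"])
    return {
        cookie: {
            "cross_city_posts": len(cities[cookie]),
            "phone_numbers": len(phones[cookie]),
            "min_register_to_post_s": min(gaps[cookie]),
        }
        for cookie in cities
    }


events = [
    {"cookie": "c1", "city": "Beijing", "phone": "p1", "registered_at": 0, "posted_at": 40},
    {"cookie": "c1", "city": "Shanghai", "phone": "p2", "registered_at": 0, "posted_at": 95},
    {"cookie": "c2", "city": "Beijing", "phone": "p3", "registered_at": 0, "posted_at": 86400},
]
features = cookie_features(events)
```

In this toy data, cookie `c1` posts from two cities with two phone numbers within seconds of registering, which is the sort of pattern such features are meant to expose.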

Preferably, the system extracts different features for the training data of different service types. For example, the features used by the used-car business line may include only: the time from registration to posting, the time the posting user takes to fill in the data, the user's mouse track, and 30-day IP-related counts. The second-hand goods business line may include only: the user's registration and login times, and the number of pages browsed from the user's IP in the preceding N days.

More preferably, the features extracted for each business line are selected by computing information entropy and by cross-validating the model.
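The entropy-based part of this selection could be sketched as ranking candidate features by information gain, that is, by how much knowing a feature reduces the entropy of the good/bad label. The source only says that information entropy is computed, so the use of information gain, the feature names and the toy data below are all assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy sample: 1 = low-quality user, 0 = high-quality user.
labels = [1, 1, 1, 0, 0, 0]
candidates = {
    "cross_city_posts": ["high", "high", "high", "low", "low", "low"],
    "browser": ["a", "b", "a", "b", "a", "b"],
}
ranked = sorted(
    candidates,
    key=lambda name: information_gain(candidates[name], labels),
    reverse=True,
)
```

A feature that separates the two classes perfectly gains the full one bit of label entropy, while a feature nearly unrelated to the label gains close to zero. The cross-validation step mentioned in the text would then confirm the choice on held-out data.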

The model built by the method according to the invention selects only a small number of feature dimensions. Each dimension used by the model has been found by data analysis to classify well, so the data in each dimension are not sparse. The method thus overcomes a shortcoming of the prior art, in which too many dimensions make the computation too complex: the prior art usually distinguishes users by segmenting text into words and treating each word as a dimension, which produces a very large number of dimensions, requires too many training samples, makes the computation complex, and places excessive demands on the accuracy of the sample data.

Step 130: vectorize the sample training data according to the user behavior features selected in step 120. The result of vectorizing the training data can be saved to a file. Each component of a training vector is one selected feature. Taking the dimension "time the user takes to fill in a post" as an example, the vectorization of the training data proceeds as follows:

1. Obtain the fill-in time of each post from the sample data.

2. Clean these data so that outliers are brought back into range.

3. Discretize the continuous-valued attribute: run K-means clustering 100 times and take the cluster centers with the smallest error as the discretization cut points.

4. Apply a final correction to the data.

5. Vectorization is complete.
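Steps 2 and 3 above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: outliers are handled by clipping to a hypothetical cap, "error" is taken to be the within-cluster sum of squared distances, and a value is discretized to the index of its nearest cluster center. The source fixes none of these details, and the final correction in step 4 is not shown.

```python
import random


def clip(values, low, high):
    """Data cleaning (assumed form): pull outliers back into [low, high]."""
    return [min(max(v, low), high) for v in values]


def kmeans_1d(values, k, rng, iterations=50):
    """One run of 1-D K-means; returns (sorted centers, sum of squared errors)."""
    centers = rng.sample(values, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the previous center when a cluster ends up empty.
        new_centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    error = sum(min((v - c) ** 2 for c in centers) for v in values)
    return sorted(centers), error


def best_centers(values, k, runs=100, seed=0):
    """Repeat K-means `runs` times and keep the centers with the smallest error."""
    rng = random.Random(seed)
    return min((kmeans_1d(values, k, rng) for _ in range(runs)), key=lambda r: r[1])[0]


def discretize(value, centers):
    """Discrete feature value: index of the nearest cluster center."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))


# Hypothetical fill-in times in seconds; 9000 is an outlier.
raw_times = [3, 4, 5, 4, 60, 62, 58, 61, 300, 310, 295, 9000]
fill_times = clip(raw_times, low=0, high=320)
centers = best_centers(fill_times, k=3)
bins = [discretize(t, centers) for t in fill_times]
```

Repeating the clustering matters because K-means depends on its random starting centers; keeping the lowest-error run makes the resulting cut points more stable from one training round to the next.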

Step 140: produce a prediction model from the vectorized sample training data. When training the model, a probabilistic classifier is preferably used to make the decision. The probabilistic classifier is used to calculate the probability point of user behavior data.

A probabilistic classifier is used because the final model is meant to identify and delete abnormal behavior information and therefore needs very high accuracy. Classifiers such as neural networks or decision trees can wrongly flag legitimate users (false positives), so a probabilistic classifier is used to make the decision.

The model used is preferably a Bayesian network model. A Bayesian network is a mathematical model based on probabilistic inference. It generalizes well and provides both logical layering and probabilistic output, so it is well suited to behavior identification.

Preferably, the open-source program WEKA is used to train the model. WEKA (Waikato Environment for Knowledge Analysis) is a publicly available data mining workbench that brings together a large number of machine learning algorithms for data mining tasks, including data preprocessing, classification, regression, clustering, association rules, and visualization in a new interactive interface.
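The source trains a Bayesian network in WEKA. As a rough stand-in only, the Python sketch below implements categorical naive Bayes, the simplest Bayesian network structure, to show how a probabilistic classifier turns a discretized feature vector into a probability of being bad data. The class name, the Laplace smoothing and the toy vectors are assumptions and do not reproduce the patented model or WEKA's API.

```python
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical naive Bayes with Laplace smoothing (illustrative stand-in)."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
        self.values = defaultdict(set)            # feature index -> values seen in training
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(i, label)][value] += 1
                self.values[i].add(value)
        return self

    def probability(self, row, label):
        """Posterior P(label | row) under the naive independence assumption."""
        total = sum(self.class_counts.values())
        joint = {}
        for c, n in self.class_counts.items():
            p = n / total
            for i, value in enumerate(row):
                count = self.value_counts[(i, c)][value]
                p *= (count + 1) / (n + len(self.values[i]))
            joint[c] = p
        return joint[label] / sum(joint.values())


# Hypothetical discretized vectors; label 1 = low-quality, 0 = high-quality.
rows = [(0, 2), (0, 2), (0, 1), (2, 0), (2, 0), (1, 0)]
labels = [1, 1, 1, 0, 0, 0]
model = NaiveBayes().fit(rows, labels)
suspicious = model.probability((0, 2), 1)
ordinary = model.probability((2, 0), 1)
```

Because the output is a probability rather than a hard label, the decision to delete information can be deferred to a separately chosen threshold, which is the property the text relies on.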

Step 150: check online data with the multiple models produced, based on classification and clustering rules, and process low-quality users. When the abnormal user behavior identification model checks online user behavior data, it computes a probability point for every user behavior record. The probability point is the probability that the record is bad data; the higher the probability point, the more likely the record is bad data. The probability point of a user behavior record is generated as follows: each model checks one of several feature groups of the user behavior data (each group containing features of different dimensions) and yields a probability value representing the data's tendency toward good or bad quality (hereinafter a "sub probability point"); a sum-of-products conversion operation is then applied to the sub probability points to obtain the probability point of the user behavior data.
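The source does not define the sum-of-products conversion operation. One possible reading, shown here purely as an assumption, is a weighted sum in which each sub probability point is multiplied by a per-model weight and the products are added and normalized. The weights and values below are hypothetical.

```python
def combine(sub_points, weights):
    """Assumed combination: normalized sum of the products weight * sub probability point."""
    if len(sub_points) != len(weights):
        raise ValueError("one weight per model is required")
    return sum(w * p for w, p in zip(weights, sub_points)) / sum(weights)


# Three models, the first trusted twice as much as the others (hypothetical).
probability_point = combine([0.9, 0.6, 0.8], weights=[2.0, 1.0, 1.0])
```

Normalizing by the total weight keeps the combined value in the range 0 to 1, so it can still be compared against a probability threshold.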

Abnormal user behavior is detected in the following two ways:

Detection of abnormal user behavior based on classification rules. A probability line (that is, a probability threshold) is set for judging whether user behavior data are bad data. If the probability point of a user behavior record exceeds the probability threshold, the record is judged to be bad data and the user is judged to be a low-quality user. Otherwise the record is judged to be normal data and the user is judged to be a high-quality user. The probability line is obtained by manual verification.
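A minimal sketch of how a probability line might be derived from manually verified records and then applied: it picks the lowest candidate threshold at which the records flagged as bad reach a target precision. The target precision, the selection rule and the sample values are assumptions; the source states only that the line is obtained by manual verification.

```python
def pick_threshold(verified, target_precision=0.99):
    """Lowest threshold whose flagged records meet the target precision.

    `verified` is a list of (probability point, manually confirmed bad?) pairs.
    """
    for threshold in sorted({point for point, _ in verified}):
        flagged = [is_bad for point, is_bad in verified if point > threshold]
        if flagged and sum(flagged) / len(flagged) >= target_precision:
            return threshold
    return None


def classify(point, threshold):
    """Bad data when the probability point exceeds the probability line."""
    return "bad" if point > threshold else "normal"


# Hypothetical manually verified sample.
verified = [
    (0.20, False), (0.55, False), (0.70, True), (0.80, False),
    (0.93, True), (0.97, True), (0.99, True),
]
threshold = pick_threshold(verified)
```

Choosing the threshold for high precision rather than high recall matches the stated goal of avoiding wrongly deleting legitimate users' information.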

Detection of online data based on clustering rules. The specific steps are as follows:

Step a: monitor the probability points for clustering.

Step b: when clustering of probability points is detected, hand a certain number of the user behavior records over to operations staff, who judge whether the user behavior clustered at the same probability point is low-quality user behavior and check whether all other online user behavior data having this probability point are bad data. User behavior data clustered at the same probability point belong to the same or similar behavior and may differ in only a few feature dimensions. Preferably, the check is whether the user behavior data having this probability point have all already been handled by other rules, where other rules are ways of identifying bad data outside the present invention, for example the text posted by a user containing prohibited words, or the pictures posted by a user containing prohibited content.

Step c: according to the operations staff's check result, update the probability point of that class of user behavior in the abnormal user behavior identification model. That is, if the behavior is found to be low-quality user behavior, the user behavior data are judged to be bad data and the probability point of the behavior is raised to some higher value, for example 0.999.

Step d: add the newly found bad data to the sample library as training data. That is, user behavior data judged to be bad data are added to the sample library as training data for the next model, providing new training data for updating the model.

Step e: train the model with the new training data.

Preferably, in step e, user behavior data whose probability points are inaccurate are analyzed offline to discover new user behavior features and select suitable ones, and the newly produced model is cross-validated to judge whether it performs better.
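Steps a to d of the clustering-rule loop can be sketched as one feedback round. This is a simplified illustration: records count as clustered when their probability points agree after rounding, the manual review is represented by a hypothetical `review` callback, and a whole cluster is promoted to bad data when every sampled record is confirmed bad. The parameters `min_cluster`, `sample_size` and `precision` are assumptions, and retraining (step e) is left out.

```python
from collections import defaultdict


def feedback_round(records, review, min_cluster=3, sample_size=2, precision=3):
    """Run one round of cluster monitoring and return the new negative samples.

    records: list of dicts with "id" and "point" (probability point).
    review:  callable(record) -> bool, True when staff confirm low-quality behavior.
    """
    # Step a: monitor probability points for clustering.
    clusters = defaultdict(list)
    for record in records:
        clusters[round(record["point"], precision)].append(record)

    new_negatives = []
    for members in clusters.values():
        if len(members) < min_cluster:
            continue
        # Step b: hand a sample of the cluster to operations staff.
        if all(review(r) for r in members[:sample_size]):
            # Step c: raise the probability point of this class of behavior.
            for r in members:
                r["point"] = 0.999
            # Step d: queue the records as training data for the next model.
            new_negatives.extend(members)
    return new_negatives


records = [
    {"id": 1, "point": 0.4211},
    {"id": 2, "point": 0.4211},
    {"id": 3, "point": 0.4211},
    {"id": 4, "point": 0.1000},
]
added = feedback_round(records, review=lambda r: True)
```

The records returned here would be appended to the negative sample library before the model is retrained, so that the next model scores this behavior above the probability line without manual help.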

With the above clustering-rule-based detection of online data, user behavior data can be checked even when the sample data are not accurate enough. Moreover, steps a to e above update the model in a semi-supervised machine learning manner. This mechanism avoids the problem of the model computing inaccurate probability points because the sample data used for training contain noise or similar defects, so low-quality user behavior can still be identified well even when the sample data are inaccurate.

Step 160: process the abnormal user data. Once a given user behavior has been determined to be abnormal, the system handles the low-quality user, for example by deleting the information the user has posted on the website.

Other embodiments of the invention will be readily apparent to those skilled in the art from the description and practice of the invention disclosed here. The description and embodiments are to be regarded as exemplary only; the true scope and spirit of the invention are defined by the claims.

Claims

1. A method for a machine-learning-based fraud identification system in a classified information website, the method comprising the following steps:

a) extracting sample data from existing user behavior data, for generating an initial model;

b) selecting and extracting multiple user behavior features for the training data of different service types;

c) vectorizing the sample training data based on the extracted user behavior features;

d) producing a prediction model from the vectorized sample training data;

e) checking online data with the produced model based on classification and clustering rules;

f) processing the abnormal user data detected.

2. The method of claim 1, wherein the sample data in step a comprise positive sample data and negative sample data, corresponding respectively to users with high-quality behavior and users with low-quality behavior.

3. The method of claim 1, wherein in step b the user behavior features comprise user behavior data for the same cookie and statistical counts in each user dimension.

4. The method of claim 1, wherein in step b the user features extracted for different service types are selected by computing information entropy and by cross-validating the model.

5. The method of claim 1, wherein in step d a probabilistic classifier is used to make the decision.

6. The method of claim 1, wherein in step e the model is used to calculate a probability point representing the probability that the user behavior data are abnormal.

7. The method of claim 6, wherein the probability point is calculated as follows: multiple models each check one of several feature groups of the user behavior data and each yield a sub probability point; a sum-of-products conversion operation is then applied to the sub probability points to obtain the probability point of the user behavior data.

8. The method of claim 1, wherein the classification-rule-based method of detecting abnormal user behavior in step e comprises setting a probability line for judging whether user behavior data are bad data.

9. The method of claim 1, wherein the clustering-rule-based method of detecting abnormal user behavior in step e comprises:

e1) monitoring the probability points for clustering;

e2) when clustering of probability points is detected, checking a certain number of the user behaviors to judge whether the user behavior clustered at the same probability point is low-quality user behavior;

e3) updating, in the abnormal user behavior identification model, the probability point of that class of user behavior according to the check result;

e4) adding the newly found bad data to the sample library as training data;

e5) training the model with the new training data.

10. The method of claim 9, wherein in step e5 user behavior data whose probability points are inaccurate are analyzed offline to discover new user behavior features and select suitable ones.