CN103793484B

CN103793484B - Machine-learning-based fraud identification system for classified information websites

Info

Publication number: CN103793484B
Application number: CN201410022138.1A
Authority: CN
Inventors: 张鹏; 张爱华; 张美琦; 张朝阳; 孙亚健
Original assignee: Beijing 58 Information Technology Co Ltd
Current assignee: Beijing 58 Information Technology Co Ltd
Priority date: 2014-01-17
Filing date: 2014-01-17
Publication date: 2017-03-15
Anticipated expiration: 2034-01-17
Also published as: CN103793484A

Abstract

The invention provides a method for a machine-learning-based fraud identification system in a classified information website, the method comprising the following steps: a) extracting sample data from existing user behavior data for generating an initial model; b) selecting and extracting multiple user behavior features from the training data for different business types; c) vectorizing the sample training data based on the extracted user behavior features; d) generating a prediction model from the vectorized sample training data; e) using the generated model to detect online data based on classification and clustering rules; f) processing the abnormal user data obtained by the detection. The invention can identify user behavior along multiple dimensions and effectively reduces the amount of false transaction information. Moreover, low-quality user behavior can still be identified well even when the training data contain noisy data.

Description

Machine-learning-based fraud identification system for classified information websites

Technical field

The present invention relates to Internet technology, and in particular to a machine-learning-based fraud identification system for classified information websites.

Background art

A classified information website is a newly emerging type of Internet website that carries information on every aspect of daily life. On these websites users can obtain free and convenient information publishing services covering second-hand goods trading, used car sales, housing, pets, recruitment, part-time work, job hunting, social and dating activities, local life services, and so on. Classified information is also known as classified advertising. The advertisements people see every day on television and in newspapers and periodicals are usually imposed on the audience whether or not the audience wants to see them; this kind of advertising is passive advertising. Information that people actively look up, such as recruitment, rental, and travel information, is called active advertising. As the information society develops, passive advertising increasingly arouses people's dislike, whereas active advertising is widely favored. Almost every local evening paper, daily paper, and lifestyle and entertainment paper carries a classified information section, and the better the newspaper, the larger its classified section tends to be. Classified information websites arose from this.

Among the users who publish information on classified information websites, some low-quality users often appear who seek to profit by, for example, publishing false information to defraud other users. Classified information websites therefore set up processing rules, filtering logic, and the like for low-quality information.

Existing means of identifying false information are mainly rule-based, supplemented by some manual intervention. For example, rules such as counting the number of posts published from one IP within a period of time, checking whether the information content contains illegal words, and checking whether the price range of the published goods or services is unreasonable are used to judge whether a user is a low-quality user who publishes false information, so that measures such as deleting the information, issuing a warning, or canceling the user's account can be taken. However, common processing rules and filtering logic generally identify low-quality behavior along a single dimension, so low-quality users try by every means to probe the critical point of a rule and thereby bypass the system's processing and filtering logic for low-quality information.

In addition, as more and more rules go online, fewer and fewer usable rules remain, because every rule relies on an obvious feature. Rule-based identification in existing methods can only separate data with a linear decision surface, so most low-quality information is neither identified nor processed by the system.

Therefore, a machine-learning-based fraud identification system for classified information websites is needed that identifies user behavior along multiple dimensions, thereby effectively reducing the amount of false transaction information and improving the authenticity of transaction information.

Summary of the invention

An object of the present invention is to provide a machine-learning-based fraud identification system for classified information websites.

According to one aspect of the invention, a method for a machine-learning-based fraud identification system in a classified information website is provided, the method comprising the following steps: a) extracting sample data from existing user behavior data for generating an initial model; b) selecting and extracting multiple user behavior features from the training data for different business types; c) vectorizing the sample training data based on the extracted user behavior features; d) generating a prediction model from the vectorized sample training data; e) using the generated model to detect online data based on classification and clustering rules; f) processing the abnormal user data obtained by the detection.

Preferably, the sample data in step a include positive sample data and negative sample data, corresponding respectively to users with high-quality behavior and users with low-quality behavior.

Preferably, the user behavior features in step b include user behavior data for the same cookie and statistical counts for each dimension of the user.

Preferably, in step b the user features extracted for different business types are selected by calculating information entropy and by cross-validating the model on data.

Preferably, a probabilistic classifier is used for decision-making in step d.

Preferably, in step e the model is used to calculate a probability score representing the probability that a piece of user behavior data is abnormal.

Preferably, the probability score is calculated as follows: multiple models respectively detect multiple groups of features of the user behavior data and each produce a sub-probability score; a sum-of-products conversion operation is then applied to the sub-probability scores to obtain the probability score of the user behavior data.

Preferably, the user abnormal behavior detection method based on classification rules in step e includes setting a probability line used to judge whether user behavior data are low-quality data.

Preferably, the user abnormal behavior detection method based on clustering rules in step e includes the following: e1) monitoring the probability scores for clustering phenomena; e2) inspecting a certain number of user behaviors whose probability scores cluster, to judge whether the user behaviors clustered at the same probability score are low-quality user behaviors; e3) according to the inspection result, the abnormal user behavior discrimination model updates the probability score of that class of user behavior; e4) adding the low-quality data newly found by the inspection to the sample library as training data; e5) training the model with the new training data.

Preferably, in step e5, user behavior data whose probability scores are inaccurate are analyzed offline to discover new user behavior features and to select suitable features.

With the machine-learning-based fraud identification system for classified information websites of the present invention, user behavior can be identified along multiple dimensions, thereby effectively reducing the amount of false transaction information and improving the authenticity of transaction information. Moreover, low-quality user behavior can still be identified well even when the training data contain noisy data.

Description of the drawings

Further objects, functions, and advantages of the present invention will be illustrated by the following description of embodiments of the invention with reference to the accompanying drawing, in which:

Fig. 1 schematically shows a flow chart of the method of the machine-learning-based fraud identification system for classified information websites of the present invention.

Detailed description of the embodiments

The objects and functions of the present invention, and the methods for achieving them, will be explained with reference to exemplary embodiments. However, the present invention is not limited to the exemplary embodiments disclosed below; it can be implemented in different forms. The description is in essence only intended to help those skilled in the relevant art gain a comprehensive understanding of the specific details of the invention.

Hereinafter, embodiments of the present invention are described with reference to the drawing. In the drawing, identical reference numerals denote identical or similar parts, or identical or similar steps.

The fraud information identification method of the present invention uses data generated from user behavior and can identify information published by a user in real time. The machine learning model adopted by the invention identifies user behavior along multiple dimensions, so that users who publish information, and in particular those who publish low-quality information, can hardly know which dimensions are used for identification and therefore cannot evade detection by working around rules. The invention can make predictions on data in an environment with few samples and high noise, with high accuracy. By collecting and modeling the various behaviors of users, abnormal users are identified.

Fig. 1 schematically shows a flow chart of the method of the machine-learning-based fraud identification system for classified information websites of the present invention. As shown in Fig. 1:

Step 110: extract sample data from existing user behavior data for generating the initial model. The sample data can be extracted from an existing database of audited user behavior data and are mainly used to divide users into high-quality users and low-quality users, corresponding respectively to positive sample data and negative sample data. Positive samples are the behavior data of users whose behavior passed the audit as high-quality; negative samples are the behavior data of users identified by the audit as low-quality, for example users who violated more serious rules (such as publishing false transaction information). The sample data in the existing audit library come from a database built by classifying user behavior with conventional user behavior identification methods, for example detecting whether the text a user publishes contains illegal words or whether the pictures a user publishes contain illegal content.

In the next model iteration after the initial model has been generated, the method according to the invention can directly use the extracted positive and negative sample library and no longer needs the information in the original audit library.

Step 120: select and extract multiple user behavior features from the training data for different business types. Experimental results are used to decide which user behavior features are used in each business line.

Users generally exhibit a very large number of behavioral features. To meet the requirements of both computational accuracy and computational efficiency, the user behavior features according to the invention are reduced to those that discriminate between good and bad users. Many features are therefore not required, and the smaller the model the better; the aim is to be able to detect data with multiple models in the final identification.

User behavior features are behavioral features found through offline analysis. They generally do not duplicate the other rules running online and can be understood as features that back-office auditors cannot find. Examples are user behavior data for the same cookie, including: the number of cross-city posts, the number of mobile phone numbers used, time intervals and click behavior, and the time interval from user registration to posting; as well as data in dimensions such as user login behavior and user browsing behavior, and statistical counts for each dimension of the user, such as the IP's cross-city count over N days and the cookie count.

Preferably, the system extracts different features for training data of different business types. For example, the features used by the used-car business line may include only: the time from registration to posting, the time the posting user takes to fill in data, the user's mouse track, and the 30-day IP correlation count. The second-hand goods business line may include only the following features: the user's registration and login times, and the number of page views from the user's IP during the preceding N days.

More preferably, the features extracted for different business lines are selected by calculating information entropy and by cross-validating the model on data.
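The entropy-based part of this selection can be illustrated with a minimal sketch. It assumes that "calculating information entropy" means ranking discrete candidate features by information gain against the positive/negative sample labels; the patent does not spell out the exact criterion, and the feature names and toy data below are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy sample: 1 = low-quality user, 0 = high-quality user.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "cross_city_posts": ["high", "high", "high", "low", "low", "low"],
    "browser": ["a", "b", "a", "b", "a", "b"],
}

# Rank candidate features from most to least informative.
ranked = sorted(
    features,
    key=lambda name: information_gain(features[name], labels),
    reverse=True,
)
print(ranked)
```

A feature that separates the two classes cleanly receives a high gain and is kept; a feature that is nearly independent of the label receives a gain close to zero and can be dropped, which keeps each model small.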

The model built by the method according to the invention selects features of only a few dimensions; that is, every dimension the model uses was found through data analysis to classify well, so the data produced in each dimension are not sparse. The method of the invention overcomes the drawback of the prior art that too many dimensions make computation overly complex. The prior art usually distinguishes users by word-segmenting text, which produces a large number of dimensions with each word as one dimension; this leads to too many training samples, complex computation, and excessive demands on the accuracy of the sample data.

Step 130: vectorize the sample training data according to the user behavior features selected in step 120. The vectorized training data can be saved to a file. Each dimension of a training data vector is one selected feature. The vectorization process is described below using one dimension, the time a user takes to fill in a post, as an example:

1. Obtain the fill-in time of each post from the sample data.

2. Clean these data in order to correct outliers.

3. Discretize the continuous-valued attribute: run K-means clustering 100 times and use the cluster center points of the run with the smallest error to define the discretization intervals.

4. Then apply a final correction to the data.

5. Vectorization is complete.
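Step 3 of the procedure above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: it assumes plain Lloyd-style K-means on one-dimensional data with random restarts, and it assumes that a value is discretized to the index of its nearest cluster center. The number of clusters `k`, the function names, and the sample fill-in times are hypothetical.

```python
import random


def kmeans_1d(values, k, rng, iterations=50):
    """One run of Lloyd's algorithm on 1-D data; returns (sorted centers, squared error)."""
    centers = rng.sample(values, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center when a cluster ends up empty.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    error = sum(min((v - c) ** 2 for c in centers) for v in values)
    return sorted(centers), error


def fit_discretizer(values, k=3, restarts=100, seed=0):
    """Run K-means `restarts` times and keep the centers of the lowest-error run."""
    rng = random.Random(seed)
    runs = (kmeans_1d(values, k, rng) for _ in range(restarts))
    best_centers, _ = min(runs, key=lambda run: run[1])
    return best_centers


def discretize(value, centers):
    """Map a continuous value to the index of its nearest center."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))


# Hypothetical post fill-in times in seconds, after cleaning.
fill_times = [2, 3, 4, 58, 60, 62, 600, 610, 620]
centers = fit_discretizer(fill_times)
bins = [discretize(t, centers) for t in fill_times]
print(centers, bins)
```

Repeating the clustering and keeping the lowest-error run reduces the dependence on the random initial centers, which is presumably why the procedure calls for 100 runs.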

Step 140: generate a prediction model from the vectorized sample training data. When training the model, a probabilistic classifier is preferably used for decision-making. The probabilistic classifier is used to calculate the probability score of user behavior data.

The reason for using a probabilistic classifier is that the purpose of the final model is to identify and delete abnormal behavior information, so the model must be highly accurate. Because classifiers such as neural networks or decision trees can wrongly flag legitimate users, a probabilistic classifier is used for decision-making.

The model used is preferably a Bayesian network model. A Bayesian network is a mathematical model based on probabilistic inference. It has strong generalization ability and provides solid logical layering and probabilistic output, so it is well suited to behavior identification scenarios.

Preferably, the model is trained with the open-source program WEKA (Waikato Environment for Knowledge Analysis). As an open data mining workbench, WEKA brings together a large number of machine learning algorithms for data mining tasks, including data preprocessing, classification, regression, clustering, association rules, and visualization in a new interactive interface.
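To make the idea of a probabilistic classifier concrete, the sketch below trains a categorical naive Bayes model, which is the simplest Bayesian network structure (every feature depends only on the class). It is a stand-in written in plain Python for illustration; it is not WEKA and not the network structure used by the patent, and the class name, feature layout, and toy data are assumptions.

```python
from collections import defaultdict


class NaiveBayes:
    """Categorical naive Bayes with Laplace smoothing (illustrative stand-in)."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.class_counts = {c: labels.count(c) for c in self.classes}
        # counts[(feature index, class)][value] = number of training rows
        self.counts = defaultdict(lambda: defaultdict(int))
        # values[feature index] = set of values seen for that feature
        self.values = defaultdict(set)
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.counts[(i, label)][value] += 1
                self.values[i].add(value)
        return self

    def predict_proba(self, row):
        """Return {class: posterior probability} for one discretized feature vector."""
        total = sum(self.class_counts.values())
        joint = {}
        for c in self.classes:
            p = self.class_counts[c] / total
            for i, value in enumerate(row):
                seen = self.counts[(i, c)].get(value, 0)
                p *= (seen + 1) / (self.class_counts[c] + len(self.values[i]))
            joint[c] = p
        norm = sum(joint.values())
        return {c: p / norm for c, p in joint.items()}


# Hypothetical vectors: (fill-in time bin, posted across cities?).
rows = [(0, "yes"), (0, "yes"), (0, "no"), (2, "no"), (2, "no"), (1, "no")]
labels = [1, 1, 1, 0, 0, 0]  # 1 = low-quality (negative sample), 0 = high-quality

model = NaiveBayes().fit(rows, labels)
suspicious = model.predict_proba((0, "yes"))
ordinary = model.predict_proba((2, "no"))
print(suspicious[1], ordinary[1])
```

The point of the probabilistic output is that the downstream steps work with a score between 0 and 1 for each record instead of a hard yes/no decision.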

Step 150: use the generated models to detect online data based on classification and clustering rules, and process low-quality users. When the abnormal user behavior identification model is used to detect online user behavior data, each piece of user behavior data is evaluated and a probability score is generated. The probability score represents the probability that the data are low-quality data; the higher the score, the more likely the data are low-quality. The probability score of user behavior data is generated as follows: the models respectively detect multiple groups of features of the user behavior data (groups of features in different dimensions), and each model produces a probability value representing the quality tendency of the data (hereinafter the sub-probability score); finally, a sum-of-products conversion operation is applied to the sub-probability scores to obtain the probability score of the user behavior data.
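The text does not define the "sum-of-products conversion operation". One possible reading, shown here purely as an assumption, is a weighted sum in which each sub-probability score is multiplied by a per-model weight and the products are added and normalized. The function name, the weights, and the scores below are hypothetical.

```python
def combine_scores(sub_scores, weights):
    """Weighted average of per-model sub-probability scores, kept in [0, 1]."""
    if len(sub_scores) != len(weights) or not weights:
        raise ValueError("need one weight per sub-score")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Sum of products (weight * sub-score), normalized by the total weight.
    return sum(w * s for w, s in zip(weights, sub_scores)) / total_weight


# Hypothetical sub-scores from three models, each looking at one feature group.
sub_scores = [0.9, 0.6, 0.8]
weights = [0.5, 0.3, 0.2]
score = combine_scores(sub_scores, weights)
print(score)
```

Normalizing by the total weight keeps the combined value interpretable as a probability score even if the weights do not sum to one.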

There are two methods of detecting abnormal user behavior:

Detecting abnormal user behavior based on classification rules. A probability line (that is, a probability threshold) is set for judging whether user behavior data are low-quality data. If the probability score of a piece of user behavior data exceeds the probability threshold, the user behavior data are judged to be low-quality data and the user is judged to be a low-quality user. Otherwise the user behavior data are judged to be normal data and the user is judged to be a high-quality user. The probability line is obtained through manual verification.

Detecting online data based on clustering rules. The specific steps are as follows:
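A minimal sketch of the classification rule follows. The text only says that the probability line is obtained through manual verification; the selection rule used here (the highest score among manually verified normal records, so that no verified normal record is flagged) is an assumption made for illustration, as are the function names and the sample scores.

```python
def pick_probability_line(verified):
    """Pick a threshold from manually verified (score, is_low_quality) pairs.

    Assumed rule: use the highest score seen on a verified normal record,
    so that no verified normal record lies above the line.
    """
    return max((score for score, bad in verified if not bad), default=0.0)


def is_low_quality(score, probability_line):
    """Classification rule: data are low-quality when the score exceeds the line."""
    return score > probability_line


# Hypothetical manually verified records: (probability score, is low-quality?).
verified = [(0.98, True), (0.95, True), (0.70, False), (0.85, True), (0.40, False)]
line = pick_probability_line(verified)
flags = [is_low_quality(score, line) for score, _ in verified]
print(line, flags)
```

Choosing the line conservatively reflects the concern stated for step 140 that legitimate users should not be wrongly flagged; a real deployment might instead trade some precision for recall.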

Step a: monitor the probability scores for clustering phenomena.

Step b: hand a certain number of user behavior data whose probability scores cluster over to operations staff for inspection, to judge whether the user behaviors clustered at the same probability score are low-quality user behaviors, that is, to check whether the other online user behavior data with that probability score are all low-quality data. User behavior data clustered at the same probability score belong to the same or a similar class of behavior and may differ in only a few feature dimensions. Preferably, the inspection checks whether the user behavior data with that probability score have all been handled by other rules. Here, other rules are ways of identifying low-quality data outside the invention, for example: the text a user publishes contains illegal words, or the pictures a user publishes contain illegal content.

Step c: according to the operators' inspection result, the abnormal user behavior discrimination model updates the probability score of that class of user behavior. That is, if the behavior is found to be that of low-quality users, the user behavior data are judged to be low-quality data and the probability score of the user behavior is raised to a higher probability value, for example to 0.999.

Step d: add the low-quality data newly found by the inspection to the sample library as training data. That is, user behavior data judged to be low-quality data are added to the sample library as training data for the next model, thereby providing new training data for updating the model.

Step e: train the model with the new training data.

Preferably, in step e, user behavior data whose probability scores are inaccurate are analyzed offline to discover new user behavior features and select suitable ones. The newly generated model is cross-validated to judge whether it performs better.
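Steps a through d of the clustering rule can be sketched as below. This is an illustrative sketch under stated assumptions: a "cluster" is taken to be a group of records sharing the same probability score after rounding, the operators' verdict is passed in as a set of confirmed scores, and the names `find_score_clusters`, `apply_review`, and `min_size` are hypothetical. Retraining (step e) is left out because it depends on the chosen classifier.

```python
from collections import defaultdict


def find_score_clusters(scored, min_size=3, digits=3):
    """Step a: group records by rounded probability score; keep groups of at least min_size."""
    groups = defaultdict(list)
    for record_id, score in scored.items():
        groups[round(score, digits)].append(record_id)
    return {s: ids for s, ids in groups.items() if len(ids) >= min_size}


def apply_review(scored, clusters, confirmed_bad, sample_library, raised_score=0.999):
    """Steps c and d: raise scores of confirmed clusters and add them to the sample library."""
    for score, ids in clusters.items():
        if score in confirmed_bad:
            for record_id in ids:
                scored[record_id] = raised_score
                sample_library.append((record_id, "low_quality"))
    return scored, sample_library


# Hypothetical online scores keyed by record id.
scored = {"u1": 0.62, "u2": 0.62, "u3": 0.62, "u4": 0.10, "u5": 0.35}
clusters = find_score_clusters(scored)

# Step b is manual: operators inspect each cluster and confirm the bad ones.
confirmed_bad = {0.62}

sample_library = []
scored, sample_library = apply_review(scored, clusters, confirmed_bad, sample_library)
print(clusters, scored, sample_library)
```

Records that were scored below the probability line but share a score with confirmed low-quality behavior are thereby relabeled and fed back into the sample library, which is what makes the loop semi-supervised.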

With the above clustering-rule-based detection of online data, user behavior data can be detected even when the sample data are not accurate enough. Steps a through e also update the model by semi-supervised machine learning. This mechanism avoids the problem of the model calculating inaccurate probability scores because the sample data used as training data contain noisy data or for similar reasons, so low-quality user behavior can still be identified well even when the sample data are inaccurate.

Step 160: process the abnormal user data. After a user behavior is determined to be abnormal, the system processes the low-quality user, for example by deleting the information the user published on the website.

Other embodiments of the invention will be readily apparent to and understood by those skilled in the art from the description and practice of the invention disclosed herein. The description and embodiments are to be regarded as exemplary only; the true scope and spirit of the invention are defined by the claims.

Claims

1. A method for a machine-learning-based fraud identification system in a classified information website, the method comprising the following steps:

a) extracting sample data from existing user behavior data for generating an initial model;

b) selecting and extracting multiple user behavior features from the training data for different business types;

c) vectorizing the sample training data based on the extracted user behavior features;

d) generating a prediction model from the vectorized sample training data;

e) using the generated model to detect online data based on classification and clustering rules, wherein

the user abnormal behavior detection method based on clustering rules includes the following:

e1) monitoring the probability scores for clustering phenomena;

e2) inspecting a certain number of user behaviors whose probability scores cluster, to judge whether the user behaviors clustered at the same probability score are low-quality user behaviors;

e3) according to the inspection result, updating, by the abnormal user behavior discrimination model, the probability score of that class of user behavior;

e4) adding the low-quality data newly found by the inspection to the sample library as training data;

e5) training the model with the new training data;

f) processing the abnormal user data obtained by the detection.

2. The method of claim 1, wherein the sample data in step a include positive sample data and negative sample data, corresponding respectively to users with high-quality behavior and users with low-quality behavior.

3. The method of claim 1, wherein the user behavior features in step b include user behavior data for the same cookie and statistical counts for each dimension of the user.

4. The method of claim 1, wherein in step b the user features extracted for different business types are selected by calculating information entropy and by cross-validating the model on data.

5. The method of claim 1, wherein a probabilistic classifier is used for decision-making in step d.

6. The method of claim 1, wherein in step e the model is used to calculate a probability score representing the probability that user behavior data are abnormal.

7. The method of claim 6, wherein the probability score is calculated as follows: multiple models respectively detect multiple groups of features of the user behavior data and each produce a sub-probability score, and a sum-of-products conversion operation is then applied to the sub-probability scores to obtain the probability score of the user behavior data.

8. The method of claim 1, wherein the user abnormal behavior detection method based on classification rules in step e includes setting a probability line used to judge whether user behavior data are low-quality data.

9. The method of claim 1, wherein in step e5 user behavior data whose probability scores are inaccurate are analyzed offline to discover new user behavior features and to select suitable features.