CN108415931A

CN108415931A - Method and system for building a model for identifying cheating traffic

Info

Publication number: CN108415931A
Application number: CN201810059065.1A
Authority: CN
Inventors: 郭昊; 欧阳辰
Original assignee: Beijing Friends Of Interactive Information Technology Co Ltd
Current assignee: Beijing Friends Of Interactive Information Technology Co Ltd
Priority date: 2018-01-22
Filing date: 2018-01-22
Publication date: 2018-08-17
Anticipated expiration: 2038-01-22
Also published as: CN108415931B

Abstract

The invention discloses a method and system for building a model for identifying cheating traffic. The method includes: obtaining a plurality of traffic records; extracting cheating features of the traffic; building a ranked list of ad request counts for different network addresses, a ranked list of ad request counts for different top-level domains, and a ranked list of request counts for different advertisement types; extracting the network addresses ranked within the top first preset proportion; marking first cheating traffic; extracting the top-level domains ranked within the top second preset proportion; marking second cheating traffic; extracting the advertisement types ranked within the top third preset proportion; marking third cheating traffic; judging whether the first cheating traffic, the second cheating traffic and the third cheating traffic are the same traffic; if so, determining it to be cheating traffic; if not, determining it to be normal traffic; and training with the cheating traffic and the normal traffic to obtain a trained traffic classification model. The invention suits the DSP environment and improves the robustness of cheating traffic identification.

Description

Method and system for building a model for identifying cheating traffic

Technical field

The present invention relates to the field of Internet advertising technology, and in particular to a method and system for building a model for identifying cheating traffic.

Background art

Anti-cheating has always been a key issue in the Internet advertising industry. For each piece of traffic, a demand-side platform (DSP) needs to determine in real time whether it is cheating traffic in order to decide whether to bid. Because a DSP connects to one or more ad exchanges, it must be able to discriminate among complex traffic with high robustness. At present, the common anti-cheating approach in advertising is to build a classification model, train it on positive and negative samples to obtain a trained model, and use the trained model to identify cheating traffic. However, since a DSP cannot directly obtain samples of cheating traffic, classification models built by existing methods do not suit the DSP environment, and their robustness is therefore low.

Summary of the invention

In view of this, it is necessary to provide a method and system for building a model for identifying cheating traffic, so as to suit the DSP environment and improve the robustness of cheating traffic identification.

To achieve the above object, the present invention provides the following solutions:

A method for building a model for identifying cheating traffic, including:

Obtaining a plurality of traffic records;

Extracting cheating features of the traffic, the cheating features including the numbers of ad requests corresponding to different network addresses, the numbers of ad requests corresponding to different top-level domains, and the numbers of requests corresponding to different advertisement types;

Building, according to the cheating features of the traffic, a ranked list of ad request counts for the different network addresses, a ranked list of ad request counts for the different top-level domains, and a ranked list of request counts for the different advertisement types;

Extracting the network addresses ranked within the top first preset proportion of the ranked list of ad request counts for the different network addresses;

Marking the traffic corresponding to the network addresses ranked within the top first preset proportion as first cheating traffic;

Extracting the top-level domains ranked within the top second preset proportion of the ranked list of ad request counts for the different top-level domains;

Marking the traffic corresponding to the top-level domains ranked within the top second preset proportion as second cheating traffic;

Extracting the advertisement types ranked within the top third preset proportion of the ranked list of request counts for the different advertisement types;

Marking the traffic corresponding to the advertisement types ranked within the top third preset proportion as third cheating traffic;

Judging whether the first cheating traffic, the second cheating traffic and the third cheating traffic are the same traffic;

If so, determining the same traffic to be cheating traffic;

If not, determining the first cheating traffic, the second cheating traffic and the third cheating traffic to be normal traffic;

Training a traffic classification model using the cheating traffic and the normal traffic to obtain a trained traffic classification model, the trained traffic classification model being used to identify traffic to be tested.

Optionally, building, according to the cheating features of the traffic, the ranked list of ad request counts for the different network addresses, the ranked list of ad request counts for the different top-level domains, and the ranked list of request counts for the different advertisement types specifically includes:

Counting the number of requests corresponding to each cheating feature within a preset time period;

Sorting the numbers of requests corresponding to each cheating feature from high to low to obtain the ranked list of ad request counts for the different network addresses, the ranked list of ad request counts for the different top-level domains, and the ranked list of request counts for the different advertisement types.
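
The counting and sorting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the record fields `ip`, `tld` and `ad_type`, the function name `build_ranked_lists`, and the sample data are all assumptions, and the records are assumed to have already been restricted to the preset time period.

```python
from collections import Counter


def build_ranked_lists(records):
    """Count requests per network address, top-level domain and ad type,
    then sort each tally from the highest count to the lowest."""
    keys = ("ip", "tld", "ad_type")  # hypothetical field names
    tallies = {key: Counter(record[key] for record in records) for key in keys}
    # most_common() returns (value, count) pairs ordered by count, descending
    return {key: tallies[key].most_common() for key in keys}


# Hypothetical traffic records from one statistics window.
records = [
    {"ip": "10.0.0.1", "tld": "com", "ad_type": "banner"},
    {"ip": "10.0.0.1", "tld": "com", "ad_type": "video"},
    {"ip": "10.0.0.2", "tld": "net", "ad_type": "banner"},
]
ranked = build_ranked_lists(records)
print(ranked["ip"])  # address 10.0.0.1 comes first with two requests
```

Values with equal counts keep their first-seen order in this sketch; the source does not say how ties should be broken.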

Optionally, training the traffic classification model using the cheating traffic and the normal traffic to obtain the trained traffic classification model, the trained traffic classification model being used to identify traffic to be tested, specifically includes:

Building a traffic classification model using a decision tree algorithm;

Extracting the cheating features of the cheating traffic and the cheating features of the normal traffic;

Inputting the cheating features of the cheating traffic and the cheating features of the normal traffic into the traffic classification model, and judging whether the traffic classification model classifies them correctly;

If not, adjusting the parameters of the traffic classification model and returning to the step of inputting the cheating features of the cheating traffic and the cheating features of the normal traffic into the traffic classification model and judging whether the traffic classification model classifies them correctly;

If so, determining the traffic classification model to be the trained traffic classification model.
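
The train, check, adjust loop above can be sketched with a small pure-Python decision tree. The source does not say which parameter is adjusted or what counts as classifying correctly, so this sketch assumes the adjusted parameter is the maximum tree depth and the criterion is training accuracy; the Gini split rule, all function names and the sample feature rows are likewise illustrative assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, max_depth):
    """Grow a binary tree on numeric features; leaves are class labels."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority
    best = None
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [i for i, row in enumerate(rows) if row[f] <= t]
            right = [i for i, row in enumerate(rows) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return majority
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left], max_depth - 1),
            build_tree([rows[i] for i in right], [labels[i] for i in right], max_depth - 1))


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        f, t, low_branch, high_branch = tree
        tree = low_branch if row[f] <= t else high_branch
    return tree


def train(rows, labels, target_accuracy=1.0, depth_limit=8):
    """Increase the depth (the assumed adjustable parameter) until the
    tree classifies the training data well enough."""
    for depth in range(1, depth_limit + 1):
        tree = build_tree(rows, labels, depth)
        acc = sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)
        if acc >= target_accuracy:
            break
    return tree, depth, acc


# Hypothetical features: request counts for the record's address, domain, ad type.
rows = [[120, 300, 500], [110, 280, 450], [3, 40, 500], [2, 35, 100], [4, 300, 90]]
labels = [1, 1, 0, 0, 0]  # 1 = cheating traffic, 0 = normal traffic
tree, depth, acc = train(rows, labels)
print(depth, acc, predict(tree, [150, 10, 10]))
```

In practice a library decision tree would normally replace the hand-written one; it is spelled out here only to keep the example self-contained.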

Optionally, the method of identifying traffic to be tested using the trained traffic classification model is:

Extracting the cheating features of the traffic to be tested;

Inputting the cheating features of the traffic to be tested into the trained traffic classification model to obtain an output result;

Judging, according to the output result, whether the traffic to be tested is cheating traffic.

The present invention also provides a system for building a model for identifying cheating traffic, including:

An acquisition module, configured to obtain a plurality of traffic records;

A cheating feature extraction module, configured to extract cheating features of the traffic, the cheating features including the numbers of ad requests corresponding to different network addresses, the numbers of ad requests corresponding to different top-level domains, and the numbers of requests corresponding to different advertisement types;

A ranked-list building module, configured to build, according to the cheating features of the traffic, a ranked list of ad request counts for the different network addresses, a ranked list of ad request counts for the different top-level domains, and a ranked list of request counts for the different advertisement types;

A first extraction module, configured to extract the network addresses ranked within the top first preset proportion of the ranked list of ad request counts for the different network addresses;

A first marking module, configured to mark the traffic corresponding to the network addresses ranked within the top first preset proportion as first cheating traffic;

A second extraction module, configured to extract the top-level domains ranked within the top second preset proportion of the ranked list of ad request counts for the different top-level domains;

A second marking module, configured to mark the traffic corresponding to the top-level domains ranked within the top second preset proportion as second cheating traffic;

A third extraction module, configured to extract the advertisement types ranked within the top third preset proportion of the ranked list of request counts for the different advertisement types;

A third marking module, configured to mark the traffic corresponding to the advertisement types ranked within the top third preset proportion as third cheating traffic;

A judging module, configured to judge whether the first cheating traffic, the second cheating traffic and the third cheating traffic are the same traffic; if so, to determine the same traffic to be cheating traffic; if not, to determine the first cheating traffic, the second cheating traffic and the third cheating traffic to be normal traffic;

A training module, configured to train a traffic classification model using the cheating traffic and the normal traffic to obtain a trained traffic classification model, the trained traffic classification model being used to identify traffic to be tested.

Optionally, the ranked-list building module specifically includes:

A counting unit, configured to count the number of requests corresponding to each cheating feature within a preset time period;

A sorting unit, configured to sort the numbers of requests corresponding to each cheating feature from high to low to obtain the ranked list of ad request counts for the different network addresses, the ranked list of ad request counts for the different top-level domains, and the ranked list of request counts for the different advertisement types.

Optionally, the training module specifically includes:

A classification model building unit, configured to build a traffic classification model using a decision tree algorithm;

A cheating feature extraction unit, configured to extract the cheating features of the cheating traffic and the cheating features of the normal traffic;

A first judging unit, configured to input the cheating features of the cheating traffic and the cheating features of the normal traffic into the traffic classification model and to judge whether the traffic classification model classifies them correctly;

An adjustment unit, configured to, if not, adjust the parameters of the traffic classification model and return to the step of inputting the cheating features of the cheating traffic and the cheating features of the normal traffic into the traffic classification model and judging whether the traffic classification model classifies them correctly;

A classification model determination unit, configured to, if so, determine the traffic classification model to be the trained traffic classification model.

Optionally, the system further includes an identification module, configured to identify traffic to be tested using the trained traffic classification model, the identification module specifically including:

An extraction unit, configured to extract the cheating features of the traffic to be tested;

A result acquisition unit, configured to input the cheating features of the traffic to be tested into the trained traffic classification model to obtain an output result;

A second judging unit, configured to judge, according to the output result, whether the traffic to be tested is cheating traffic.

Compared with the prior art, the beneficial effects of the present invention are as follows:

The present invention proposes a method and system for building a model for identifying cheating traffic. The method includes: obtaining a plurality of traffic records; extracting cheating features of the traffic; building, according to the cheating features of the traffic, a ranked list of ad request counts for different network addresses, a ranked list of ad request counts for different top-level domains, and a ranked list of request counts for different advertisement types; extracting the network addresses ranked within the top first preset proportion; marking the traffic corresponding to those network addresses as first cheating traffic; extracting the top-level domains ranked within the top second preset proportion; marking the traffic corresponding to those top-level domains as second cheating traffic; extracting the advertisement types ranked within the top third preset proportion; marking the traffic corresponding to those advertisement types as third cheating traffic; judging whether the first cheating traffic, the second cheating traffic and the third cheating traffic are the same traffic; if so, determining the same traffic to be cheating traffic; if not, determining the first cheating traffic, the second cheating traffic and the third cheating traffic to be normal traffic; and training a traffic classification model using the cheating traffic and the normal traffic to obtain a trained traffic classification model. The method or system of the present invention suits the DSP environment and improves the robustness of cheating traffic identification.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of a method for building a model for identifying cheating traffic according to an embodiment of the present invention;

Fig. 2 is a structural diagram of a system for building a model for identifying cheating traffic according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

To make the above objects, features and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.

The present invention draws on hypothesis testing from statistics to provide a method for building a model for identifying cheating traffic, and further uses the built model to identify and label traffic to be tested online in real time.

Hypothesis testing is one of the classical methods of statistical inference. Its main idea is as follows: when two causes, A and B, could both produce the same result and need to be told apart, first fix the error rate (usually 5%); then, within the sample distribution under cause A, select the 5% probability region that is most likely to arise from cause B. If a sample falls into that region, it is considered to be caused by B; otherwise it is considered to be caused by A.
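
As an illustration of this idea, the following minimal sketch picks the 5% region empirically from samples observed under cause A. It assumes that larger values are more typical of cause B, so the region is the upper tail; the function names and the sample data are hypothetical and not taken from the source.

```python
def rejection_cutoff(samples_under_a, alpha=0.05):
    """Return the smallest value inside the top `alpha` share of the samples.

    Assumption: larger values point to cause B, so the rejection region
    is the upper tail of the distribution observed under cause A.
    """
    ordered = sorted(samples_under_a, reverse=True)
    k = int(len(ordered) * alpha)  # how many samples the region may hold
    if k == 0:
        raise ValueError("too few samples for the requested alpha")
    return ordered[k - 1]


def attribute_cause(value, cutoff):
    """Attribute a sample to cause B if it falls in the rejection region."""
    return "B" if value >= cutoff else "A"


# Hypothetical request counts observed under cause A: 1, 2, ..., 100.
counts = list(range(1, 101))
cutoff = rejection_cutoff(counts)  # the top 5% of 100 samples is 96..100
print(cutoff, attribute_cause(98, cutoff), attribute_cause(50, cutoff))
```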

Applying the idea of hypothesis testing, this embodiment provides a method for building a model for identifying cheating traffic. Fig. 1 is a flow chart of the method according to an embodiment of the present invention.

Referring to Fig. 1, the method for building a model for identifying cheating traffic of this embodiment includes:

Step S1: Obtain a plurality of traffic records.

Step S2: Extract cheating features of the traffic, the cheating features including the numbers of ad requests corresponding to different network addresses, the numbers of ad requests corresponding to different top-level domains, and the numbers of requests corresponding to different advertisement types.

Step S3: According to the cheating features of the traffic, build a ranked list of ad request counts for the different network addresses, a ranked list of ad request counts for the different top-level domains, and a ranked list of request counts for the different advertisement types.

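Specifically, this includes: counting the number of requests corresponding to each cheating feature within a preset time period; and sorting the numbers of requests corresponding to each cheating feature from high to low to obtain the three ranked lists.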

Step S4: Extract the network addresses ranked within the top first preset proportion of the ranked list of ad request counts for the different network addresses.

In this embodiment, the top 5% of network addresses in the ranked list of ad request counts for the different network addresses are extracted.

Step S5: Mark the traffic corresponding to the network addresses ranked within the top first preset proportion as first cheating traffic.

In this embodiment, the traffic corresponding to the top 5% of network addresses in the ranked list of ad request counts for the different network addresses is marked as first cheating traffic.

Step S6: Extract the top-level domains ranked within the top second preset proportion of the ranked list of ad request counts for the different top-level domains.

In this embodiment, the top 3% of top-level domains in the ranked list of ad request counts for the different top-level domains are extracted.

Step S7: Mark the traffic corresponding to the top-level domains ranked within the top second preset proportion as second cheating traffic.

In this embodiment, the traffic corresponding to the top 3% of top-level domains in the ranked list of ad request counts for the different top-level domains is marked as second cheating traffic.

Step S8: Extract the advertisement types ranked within the top third preset proportion of the ranked list of request counts for the different advertisement types.

In this embodiment, the top 8% of advertisement types in the ranked list of request counts for the different advertisement types are extracted.

Step S9: Mark the traffic corresponding to the advertisement types ranked within the top third preset proportion as third cheating traffic.

In this embodiment, the traffic corresponding to the top 8% of advertisement types in the ranked list of request counts for the different advertisement types is marked as third cheating traffic.

Step S10: Judge whether the first cheating traffic, the second cheating traffic and the third cheating traffic are the same traffic.

If so, execute step S11.

Step S11: Determine the same traffic to be cheating traffic.

If not, execute step S12.

Step S12: Determine the first cheating traffic, the second cheating traffic and the third cheating traffic to be normal traffic.
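
Steps S4 to S12 can be sketched as follows, using the 5%, 3% and 8% proportions of this embodiment. This is an illustrative sketch under assumptions rather than the patented implementation: the field names, the function names, rounding the cut-off count up, and labeling records flagged by none of the three lists as normal are choices the source does not specify.

```python
import math
from collections import Counter


def top_fraction(values, fraction):
    """Return the distinct values whose request counts rank in the top `fraction`."""
    ranked = Counter(values).most_common()  # sorted by count, descending
    k = math.ceil(len(ranked) * fraction)   # assumption: round the cut-off up
    return {value for value, _ in ranked[:k]}


def label_traffic(records, p_ip=0.05, p_tld=0.03, p_ad=0.08):
    """Label each record 1 (cheating) or 0 (normal)."""
    top_ips = top_fraction((r["ip"] for r in records), p_ip)
    top_tlds = top_fraction((r["tld"] for r in records), p_tld)
    top_ads = top_fraction((r["ad_type"] for r in records), p_ad)
    labels = []
    for r in records:
        first = r["ip"] in top_ips        # first cheating traffic (S5)
        second = r["tld"] in top_tlds     # second cheating traffic (S7)
        third = r["ad_type"] in top_ads   # third cheating traffic (S9)
        # S10-S12: cheating only when all three marks fall on the same record
        labels.append(1 if (first and second and third) else 0)
    return labels


# Hypothetical sample: one address sends six requests, nine others send one each.
records = [{"ip": "10.0.0.1", "tld": "xyz", "ad_type": "popup"} for _ in range(6)]
tlds = ["com", "net", "org", "io"]
ads = ["banner", "video"]
for i in range(9):
    records.append({"ip": f"10.0.0.{i + 2}", "tld": tlds[i % 4], "ad_type": ads[i % 2]})

labels = label_traffic(records)
print(sum(labels), "of", len(records), "records labeled as cheating")
```

Requiring all three marks to coincide makes the labeling conservative: a busy but legitimate address is not labeled as cheating unless its domain and advertisement type are also among the heaviest requesters.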

Step S13: Train a traffic classification model using the cheating traffic and the normal traffic to obtain a trained traffic classification model.

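Specifically, this includes:

Building a traffic classification model using a decision tree algorithm;

Extracting the cheating features of the cheating traffic and the cheating features of the normal traffic;

Inputting the cheating features of the cheating traffic and the cheating features of the normal traffic into the traffic classification model, and judging whether the traffic classification model classifies them correctly;

If not, adjusting the parameters of the traffic classification model and returning to the step of inputting the features and judging whether the traffic classification model classifies them correctly;

If so, determining the traffic classification model to be the trained traffic classification model, the trained traffic classification model being used to identify traffic to be tested.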

In this embodiment, traffic to be tested is identified using the above trained traffic classification model. The specific method is:

Deploy the trained traffic classification model online, or update the online model with it; extract the cheating features of the traffic to be tested; input the cheating features of the traffic to be tested into the trained traffic classification model to obtain an output result; and judge, according to the output result, whether the traffic to be tested is cheating traffic.
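
The identification procedure can be sketched as below. The sketch assumes that the cheating features of a record are the request counts looked up for its network address, top-level domain and advertisement type, and that the deployed model is any callable returning 1 for cheating traffic and 0 for normal traffic; the names `extract_features`, `identify` and `stand_in_model`, and all data, are hypothetical.

```python
def extract_features(record, counts):
    """Look up the request counts for the record's address, domain and ad type."""
    return [counts[key].get(record[key], 0) for key in ("ip", "tld", "ad_type")]


def identify(record, counts, model):
    """Return 'cheating' or 'normal' for one traffic record."""
    output = model(extract_features(record, counts))
    return "cheating" if output == 1 else "normal"


# Hypothetical request counts gathered over the statistics window.
counts = {
    "ip": {"10.0.0.1": 120, "10.0.0.2": 3},
    "tld": {"xyz": 300, "com": 40},
    "ad_type": {"popup": 500, "banner": 90},
}


def stand_in_model(features):
    """Placeholder for the trained classifier: flags very busy addresses."""
    return 1 if features[0] > 100 else 0


print(identify({"ip": "10.0.0.1", "tld": "xyz", "ad_type": "popup"}, counts, stand_in_model))
print(identify({"ip": "10.0.0.9", "tld": "com", "ad_type": "banner"}, counts, stand_in_model))
```

An address never seen in the statistics window gets a count of zero here, which is one possible way to handle unseen values.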

Using the above method to identify cheating traffic, each piece of traffic to be tested is checked for cheating and labeled, for use by downstream algorithms.

The method for building a model for identifying cheating traffic in this embodiment does not require positive and negative samples to be known in advance. It is an unsupervised method that suits the DSP environment well and thereby improves the robustness of cheating traffic identification.

The present invention also provides a system for building a model for identifying cheating traffic. Fig. 2 is a structural diagram of the system according to an embodiment of the present invention.

The system 20 for building a model for identifying cheating traffic of this embodiment includes:

An acquisition module 201, configured to obtain a plurality of traffic records.

A cheating feature extraction module 202, configured to extract cheating features of the traffic, the cheating features including the numbers of ad requests corresponding to different network addresses, the numbers of ad requests corresponding to different top-level domains, and the numbers of requests corresponding to different advertisement types.

A ranked-list building module 203, configured to build, according to the cheating features of the traffic, a ranked list of ad request counts for the different network addresses, a ranked list of ad request counts for the different top-level domains, and a ranked list of request counts for the different advertisement types.

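The ranked-list building module 203 specifically includes: a counting unit, configured to count the number of requests corresponding to each cheating feature within a preset time period; and a sorting unit, configured to sort the numbers of requests corresponding to each cheating feature from high to low to obtain the three ranked lists.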

A first extraction module 204, configured to extract the network addresses ranked within the top first preset proportion of the ranked list of ad request counts for the different network addresses.

A first marking module 205, configured to mark the traffic corresponding to the network addresses ranked within the top first preset proportion as first cheating traffic.

A second extraction module 206, configured to extract the top-level domains ranked within the top second preset proportion of the ranked list of ad request counts for the different top-level domains.

A second marking module 207, configured to mark the traffic corresponding to the top-level domains ranked within the top second preset proportion as second cheating traffic.

A third extraction module 208, configured to extract the advertisement types ranked within the top third preset proportion of the ranked list of request counts for the different advertisement types.

A third marking module 209, configured to mark the traffic corresponding to the advertisement types ranked within the top third preset proportion as third cheating traffic.

A judging module 210, configured to judge whether the first cheating traffic, the second cheating traffic and the third cheating traffic are the same traffic; if so, to determine the same traffic to be cheating traffic; if not, to determine the first cheating traffic, the second cheating traffic and the third cheating traffic to be normal traffic.

A training module 211, configured to train a traffic classification model using the cheating traffic and the normal traffic to obtain a trained traffic classification model, the trained traffic classification model being used to identify traffic to be tested.

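The training module 211 specifically includes: a classification model building unit, configured to build a traffic classification model using a decision tree algorithm; a cheating feature extraction unit, configured to extract the cheating features of the cheating traffic and of the normal traffic; a first judging unit, configured to input those features into the traffic classification model and judge whether it classifies them correctly; an adjustment unit, configured to, if not, adjust the parameters of the traffic classification model and return to the judging step; and a classification model determination unit, configured to, if so, determine the traffic classification model to be the trained traffic classification model.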

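An identification module 212, configured to identify traffic to be tested using the classification model.

The identification module 212 specifically includes:

An extraction unit, configured to extract the cheating features of the traffic to be tested;

A result acquisition unit, configured to input the cheating features of the traffic to be tested into the trained traffic classification model to obtain an output result;

A second judging unit, configured to judge, according to the output result, whether the traffic to be tested is cheating traffic.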

The system for building a model for identifying cheating traffic in this embodiment does not require positive and negative samples to be known in advance. It is an unsupervised approach that suits the DSP environment well and thereby improves the robustness of cheating traffic identification.

Specific examples are used herein to explain the principle and implementation of the present invention; the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, the specific implementation and the scope of application may vary according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for building a model for identifying cheating traffic, characterized by comprising:

Obtaining a plurality of traffic records;

Extracting cheating features of the traffic, the cheating features including the numbers of ad requests corresponding to different network addresses, the numbers of ad requests corresponding to different top-level domains, and the numbers of requests corresponding to different advertisement types;

Building, according to the cheating features of the traffic, a ranked list of ad request counts for the different network addresses, a ranked list of ad request counts for the different top-level domains, and a ranked list of request counts for the different advertisement types;

Extracting the network addresses ranked within the top first preset proportion of the ranked list of ad request counts for the different network addresses;

Extracting the top-level domains ranked within the top second preset proportion of the ranked list of ad request counts for the different top-level domains;

Extracting the advertisement types ranked within the top third preset proportion of the ranked list of request counts for the different advertisement types;

Judging whether the first cheating traffic, the second cheating traffic and the third cheating traffic are the same traffic;

If so, determining the same traffic to be cheating traffic;

If not, determining the first cheating traffic, the second cheating traffic and the third cheating traffic to be normal traffic;

Training a traffic classification model using the cheating traffic and the normal traffic to obtain a trained traffic classification model, the trained traffic classification model being used to identify traffic to be tested.

2. The method for building a model for identifying cheating traffic according to claim 1, characterized in that building, according to the cheating features of the traffic, the ranked list of ad request counts for the different network addresses, the ranked list of ad request counts for the different top-level domains, and the ranked list of request counts for the different advertisement types specifically comprises:

Sorting the numbers of requests corresponding to each cheating feature from high to low to obtain the ranked list of ad request counts for the different network addresses, the ranked list of ad request counts for the different top-level domains, and the ranked list of request counts for the different advertisement types.

3. The method for building a model for identifying cheating traffic according to claim 1, characterized in that training the traffic classification model using the cheating traffic and the normal traffic to obtain the trained traffic classification model, the trained traffic classification model being used to identify traffic to be tested, specifically comprises:

Building a traffic classification model using a decision tree algorithm;

Inputting the cheating features of the cheating traffic and the cheating features of the normal traffic into the traffic classification model, and judging whether the traffic classification model classifies them correctly;

If not, adjusting the parameters of the traffic classification model and returning to the step of inputting the cheating features of the cheating traffic and the cheating features of the normal traffic into the traffic classification model and judging whether the traffic classification model classifies them correctly;

4. The method for building a model for identifying cheating traffic according to claim 1, characterized in that the method of identifying traffic to be tested using the trained traffic classification model is:

Extracting the cheating features of the traffic to be tested;

Inputting the cheating features of the traffic to be tested into the trained traffic classification model to obtain an output result;

5. A system for building a model for identifying cheating traffic, characterized by comprising:

An acquisition module, configured to obtain a plurality of traffic records;

A cheating feature extraction module, configured to extract cheating features of the traffic, the cheating features including the numbers of ad requests corresponding to different network addresses, the numbers of ad requests corresponding to different top-level domains, and the numbers of requests corresponding to different advertisement types;

A ranked-list building module, configured to build, according to the cheating features of the traffic, a ranked list of ad request counts for the different network addresses, a ranked list of ad request counts for the different top-level domains, and a ranked list of request counts for the different advertisement types;

A first extraction module, configured to extract the network addresses ranked within the top first preset proportion of the ranked list of ad request counts for the different network addresses;

A first marking module, configured to mark the traffic corresponding to the network addresses ranked within the top first preset proportion as first cheating traffic;

A second extraction module, configured to extract the top-level domains ranked within the top second preset proportion of the ranked list of ad request counts for the different top-level domains;

A second marking module, configured to mark the traffic corresponding to the top-level domains ranked within the top second preset proportion as second cheating traffic;

A third extraction module, configured to extract the advertisement types ranked within the top third preset proportion of the ranked list of request counts for the different advertisement types;

A third marking module, configured to mark the traffic corresponding to the advertisement types ranked within the top third preset proportion as third cheating traffic;

A judging module, configured to judge whether the first cheating traffic, the second cheating traffic and the third cheating traffic are the same traffic; if so, to determine the same traffic to be cheating traffic; if not, to determine the first cheating traffic, the second cheating traffic and the third cheating traffic to be normal traffic;

A training module, configured to train a traffic classification model using the cheating traffic and the normal traffic to obtain a trained traffic classification model, the trained traffic classification model being used to identify traffic to be tested.

6. The system for building a model for identifying cheating traffic according to claim 5, characterized in that the ranked-list building module specifically comprises:

A sorting unit, configured to sort the numbers of requests corresponding to each cheating feature from high to low to obtain the ranked list of ad request counts for the different network addresses, the ranked list of ad request counts for the different top-level domains, and the ranked list of request counts for the different advertisement types.

7. The system for building a model for identifying cheating traffic according to claim 5, characterized in that the training module specifically comprises:

A cheating feature extraction unit, configured to extract the cheating features of the cheating traffic and the cheating features of the normal traffic;

A first judging unit, configured to input the cheating features of the cheating traffic and the cheating features of the normal traffic into the traffic classification model and to judge whether the traffic classification model classifies them correctly;

An adjustment unit, configured to, if not, adjust the parameters of the traffic classification model and return to the step of inputting the cheating features of the cheating traffic and the cheating features of the normal traffic into the traffic classification model and judging whether the traffic classification model classifies them correctly;

A classification model determination unit, configured to, if so, determine the traffic classification model to be the trained traffic classification model.

8. The system for building a model for identifying cheating traffic according to claim 5, characterized by further comprising an identification module, the identification module being configured to identify traffic to be tested using the trained traffic classification model, the identification module specifically comprising:

An extraction unit, configured to extract the cheating features of the traffic to be tested;

A result acquisition unit, configured to input the cheating features of the traffic to be tested into the trained traffic classification model to obtain an output result;