CN106961444A

CN106961444A - A malicious web crawler detection method based on a hidden Markov model

Info

Publication number: CN106961444A
Application number: CN201710281763.1A
Authority: CN
Inventors: 罗日红; 蔡君
Original assignee: Guangdong Yi Rong Agel Ecommerce Ltd; Guangdong Polytechnic Normal University
Current assignee: Guangdong Yi Rong Agel Ecommerce Ltd; Guangdong Polytechnic Normal University
Priority date: 2017-04-26
Filing date: 2017-04-26
Publication date: 2017-07-18

Abstract

The present invention proposes a malicious web crawler detection method based on a hidden Markov model (HMM) and belongs to the technical field of computer software. Malicious web crawlers scrape a website's sensitive information and private data without permission, and their aggressive traffic patterns also degrade the website's quality of service. Existing web crawler detection methods cannot accurately identify malicious web crawlers, and their misclassification rate is high. The present invention therefore proposes a new malicious web crawler detection method based on a hidden Markov model, comprising: (1) HMM-based modeling of user HTTP traffic, and (2) HTTP-based modeling of web crawler behavior.

Description

A malicious web crawler detection method based on a hidden Markov model

Technical field

The invention belongs to the technical field of computer software.

Background

Benign web crawlers are an indispensable part of search engines. A benign web crawler generally takes its impact on the website's quality of service into account and strictly observes the website's data-crawling rules. A malicious web crawler, by contrast, aims only at capturing valuable information from the website: it disregards the harm its crawling does to the website, and may even violate the website's data protection statement and forcibly scrape sensitive information, leading to adverse consequences such as leakage of user privacy and exposure of trade secrets. Existing web crawler detection methods can only distinguish crawler traffic from ordinary user traffic; they have difficulty distinguishing benign web crawlers from malicious ones.

Summary of the invention

The purpose of the present invention is to propose a malicious web crawler detection method based on a hidden Markov model. Malicious web crawlers scrape a website's sensitive information and private data without permission, and their aggressive traffic patterns also degrade the website's quality of service. Existing web crawler detection methods cannot accurately identify malicious web crawlers, and their misclassification rate is high. The present invention therefore proposes a new malicious web crawler detection method based on a hidden Markov model, comprising: (1) HMM-based modeling of user HTTP traffic, and (2) HTTP-based modeling of web crawler behavior.

The technical scheme is as follows:

1. Construction of the HTTP traffic behavior model

1.1 Basic definitions:

Observation: the resource type of each HTTP request is taken as the observation, so the observation sequence of HTTP traffic is expressed as $o = o_1, o_2, \ldots, o_T$, where $o_t$ denotes the resource type requested by the $c$-th connection at time $t$. The observation space is $V = \{1, 2, \ldots, N\}$.

State: the state is the page requested by connection $c$ at time $t$; the state sequence is expressed as $y = y_1, y_2, \ldots, y_T$, and the state space is $S = \{1, 2, \ldots, M\}$.

The parameter model of HTTP traffic behavior is expressed as $\theta = \{\pi, A, B\}$, where $\pi$ is the initial state probability distribution of the model, $A$ is the state transition probability matrix, and $B$ is the observation probability matrix.
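As a minimal sketch of these definitions, the Python below encodes requests as observation symbols and builds a placeholder θ = {π, A, B}. The resource-type vocabulary, the extension mapping, the function names, and the 0-based indices (the text uses 1..N and 1..M) are illustrative assumptions; the patent does not specify them.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical resource-type vocabulary (the observation space V).
RESOURCE_TYPES = ["page", "image", "script", "style", "other"]

# Hypothetical mapping from file extension to resource type.
EXTENSION_TO_TYPE = {
    ".html": "page", ".htm": "page", ".php": "page", "": "page",
    ".png": "image", ".jpg": "image", ".gif": "image",
    ".js": "script",
    ".css": "style",
}


def resource_type_index(url: str) -> int:
    """Map a request URL to a 0-based observation symbol index."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return RESOURCE_TYPES.index(EXTENSION_TO_TYPE.get(extension, "other"))


def uniform_model(num_states: int, num_symbols: int):
    """Return theta = (pi, A, B) with uniform rows.

    This is only a placeholder: identical rows never break symmetry
    during re-estimation, so a real run would perturb them randomly.
    """
    pi = [1.0 / num_states] * num_states
    A = [[1.0 / num_states] * num_states for _ in range(num_states)]
    B = [[1.0 / num_symbols] * num_symbols for _ in range(num_states)]
    return pi, A, B


# Observation sequence o_1..o_T for one connection's requests.
requests = ["/index.html", "/static/app.js", "/img/logo.png", "/data.bin"]
observations = [resource_type_index(u) for u in requests]

# M = 3 hidden states (pages) is an arbitrary choice for illustration.
pi, A, B = uniform_model(num_states=3, num_symbols=len(RESOURCE_TYPES))
```

Deriving the resource type from the URL extension is only one option; the `Content-Type` response header would serve equally well where it is logged.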

1.2 Parameter estimation for the HTTP traffic behavior model based on the forward-backward algorithm

The parameter estimation task for the HTTP traffic behavior model is to estimate the parameters of the corresponding hidden semi-Markov model from the collected observation sequences. The present invention solves this parameter estimation problem with the well-known forward-backward algorithm, described in detail below.

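1) Define the forward and backward variables, where $S_t$ denotes the state at time $t$:

$$\alpha_t(j) = P[S_t = j,\ o_{1:t} \mid \theta]$$

$$\beta_t(j) = P[o_{t+1:T} \mid S_t = j,\ \theta]$$

2) Initialization of the forward-backward algorithm:

$$\alpha_1(j) = \pi_j\, b_j(o_1),$$

$$\beta_T(j) = 1.$$

3) Recursion:

$$\alpha_{t+1}(j) = \Big[\sum_{i=1}^{M} \alpha_t(i)\, a_{ij}\Big]\, b_j(o_{t+1})$$

$$\beta_t(i) = \sum_{j=1}^{M} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$

4) Compute the intermediate variable:

$$\xi_t(i, j) = P[S_t = i,\ S_{t+1} = j,\ o_{1:T} \mid \theta] = \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$

5) Parameter update formulas:

$$\hat{\pi}_i = \frac{\alpha_1(i)\,\beta_1(i)}{\sum_{j=1}^{M} \alpha_1(j)\,\beta_1(j)}$$

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'=1}^{M} \xi_t(i, j')}$$

$$\hat{b}_j(k) = \frac{\sum_{t=1}^{T} \alpha_t(j)\,\beta_t(j)\, I(o_t = v_k)}{\sum_{t=1}^{T} \alpha_t(j)\,\beta_t(j)}$$

where $I(o_t = v_k) = 1$ when $o_t = v_k$, and $I(o_t = v_k) = 0$ otherwise.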
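The estimation procedure of steps 1) to 5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: it uses the standard forward-backward recursions and Baum-Welch re-estimation for a plain HMM (no state durations, so it is not a hidden semi-Markov estimator), 0-based indices, and an auxiliary variable `gamma` that the text does not define. The function names and toy parameters are hypothetical.

```python
def forward_backward(obs, pi, A, B):
    """Return unscaled (alpha, beta) for a sequence of symbol indices."""
    T, M = len(obs), len(pi)
    alpha = [[0.0] * M for _ in range(T)]
    beta = [[1.0] * M for _ in range(T)]  # beta_T(j) = 1
    for j in range(M):  # alpha_1(j) = pi_j * b_j(o_1)
        alpha[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):  # forward recursion
        for j in range(M):
            total = sum(alpha[t - 1][i] * A[i][j] for i in range(M))
            alpha[t][j] = total * B[j][obs[t]]
    for t in range(T - 2, -1, -1):  # backward recursion
        for i in range(M):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(M)
            )
    return alpha, beta


def baum_welch_step(obs, pi, A, B):
    """One re-estimation step; returns (new_pi, new_A, new_B, likelihood)."""
    T, M, N = len(obs), len(pi), len(B[0])
    alpha, beta = forward_backward(obs, pi, A, B)
    likelihood = sum(alpha[T - 1])  # P(o_1:T | theta)
    # gamma_t(j) = P(S_t = j | o_1:T, theta); assumed helper variable.
    gamma = [
        [alpha[t][j] * beta[t][j] / likelihood for j in range(M)]
        for t in range(T)
    ]
    # xi_t(i, j) as in step 4), here divided by the likelihood so that it
    # is a posterior; the constant cancels in the update ratios.
    xi = [
        [
            [
                alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                / likelihood
                for j in range(M)
            ]
            for i in range(M)
        ]
        for t in range(T - 1)
    ]
    new_pi = gamma[0][:]
    new_A = [
        [
            sum(xi[t][i][j] for t in range(T - 1))
            / sum(gamma[t][i] for t in range(T - 1))
            for j in range(M)
        ]
        for i in range(M)
    ]
    # The condition obs[t] == k plays the role of the indicator I(o_t = v_k).
    new_B = [
        [
            sum(gamma[t][j] for t in range(T) if obs[t] == k)
            / sum(gamma[t][j] for t in range(T))
            for k in range(N)
        ]
        for j in range(M)
    ]
    return new_pi, new_A, new_B, likelihood


# Toy example: 2 hidden states, 3 resource types (all values hypothetical).
obs = [0, 1, 0, 2, 1, 0]
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
new_pi, new_A, new_B, likelihood = baum_welch_step(obs, pi, A, B)
```

The unscaled recursions underflow on long sequences; a practical version would rescale `alpha` and `beta` at each step or work in log space, and would iterate `baum_welch_step` until the likelihood stops improving.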

1.3 Detection of web crawlers

HTTP traffic comprises the HTTP traffic of ordinary users and the HTTP traffic of web crawlers. To detect malicious web crawler traffic, web crawler traffic must first be separated from ordinary user traffic; to this end, the present invention identifies web crawler traffic by means of anomaly detection.

Calculate the entropy of the observation sequences under the HTTP traffic behavior model of ordinary users:

Calculate the standard deviation σ₀ and the mean μ₀ of the entropy of the observation sequences of ordinary users' HTTP traffic.

When detecting web crawlers, first compute the mean μ of the entropy of the monitored data sequence, then use |μ-μ₀| as the anomaly detection statistic. If |μ-μ₀|≥3σ₀, the traffic is judged abnormal.
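The three-sigma test above can be sketched as follows. The entropy formula itself is not given in this text, so the sketch assumes, purely for illustration, that the "entropy" of a sequence is its average log-likelihood per observation, (1/T) ln P(o_1:T | θ), under the trained model. That reading, the function names, and the toy parameters are all assumptions.

```python
import math
from statistics import mean, pstdev


def avg_log_likelihood(obs, pi, A, B):
    """Assumed 'entropy': (1/T) * ln P(o_1:T | theta), scaled forward pass.

    Raises ValueError if the model assigns zero probability to the sequence.
    """
    M = len(pi)
    alpha = [pi[j] * B[j][obs[0]] for j in range(M)]
    log_likelihood = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(M)) * B[j][obs[t]]
                for j in range(M)
            ]
        scale = sum(alpha)  # P(o_t | o_1:t-1, theta)
        log_likelihood += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_likelihood / len(obs)


def fit_baseline(entropies):
    """Return (mu0, sigma0) from entropies of normal training sequences."""
    return mean(entropies), pstdev(entropies)


def is_abnormal(monitored_entropies, mu0, sigma0):
    """Three-sigma rule from the text: |mu - mu0| >= 3 * sigma0."""
    mu = mean(monitored_entropies)
    return abs(mu - mu0) >= 3 * sigma0


# Toy model and sequences (all values hypothetical).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
normal_sequences = [[0, 1, 0, 1], [0, 0, 1, 0], [1, 0, 1, 1]]
mu0, sigma0 = fit_baseline(
    [avg_log_likelihood(s, pi, A, B) for s in normal_sequences]
)
monitored_sequences = [[2, 2, 2, 2], [2, 2, 1, 2]]
flag = is_abnormal(
    [avg_log_likelihood(s, pi, A, B) for s in monitored_sequences],
    mu0,
    sigma0,
)
```

The same two functions would serve section 1.4 unchanged, with μ₀ and σ₀ estimated from benign crawler sequences instead of ordinary user sequences.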

1.4 Detection of malicious web crawlers

Calculate the entropy of the observation sequences under the benign web crawler behavior model:

Calculate the standard deviation σ₀ and the mean μ₀ of the entropy of the observation sequences of benign web crawlers.

When detecting malicious web crawlers, first compute the mean μ of the entropy of the monitored data sequence, then use |μ-μ₀| as the anomaly detection statistic. If |μ-μ₀|≥3σ₀, the traffic is judged abnormal.

Brief description of the drawings

Fig. 1 is a schematic diagram of the malicious web crawler detection model based on a hidden Markov model.

Embodiment

Implementation procedure

Step 1: Preprocess the training data to generate the training dataset for web crawler detection;

Step 2: Estimate the model parameters using the forward-backward algorithm to obtain the HMM-based HTTP traffic model;

Step 3: Compute the entropy of the monitored data sequence using the trained model;

Step 4: Compute the traffic anomaly detection statistic |μ-μ₀|;

Step 5: Identify web crawler traffic by testing whether |μ-μ₀|≥3σ₀ holds;

Step 6: Extract the training dataset for benign crawler detection;

Step 7: Estimate the parameters of the benign web crawler model using the forward-backward algorithm;

Step 8: Compute the entropy of the web crawler sequence using the trained benign web crawler model;

Step 9: Compute the anomaly detection statistic |μ-μ₀|;

Step 10: Identify malicious web crawler traffic by testing whether |μ-μ₀|≥3σ₀ holds.
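The ten steps amount to a two-stage cascade: traffic that deviates from the ordinary-user baseline is treated as crawler traffic, and crawler traffic that deviates from the benign-crawler baseline is treated as malicious. A minimal Python sketch of that decision logic follows; it takes precomputed entropy values as input, and the function names, labels, and numbers are hypothetical.

```python
from statistics import mean, pstdev


def make_detector(training_entropies):
    """Build a three-sigma detector from baseline entropies (steps 4-5, 9-10)."""
    mu0 = mean(training_entropies)
    sigma0 = pstdev(training_entropies)

    def detect(monitored_entropies):
        # Abnormal when |mu - mu0| >= 3 * sigma0.
        return abs(mean(monitored_entropies) - mu0) >= 3 * sigma0

    return detect


def classify(user_model_entropies, benign_model_entropies,
             is_crawler, is_malicious):
    """Label monitored traffic using the two detectors in sequence.

    user_model_entropies: entropies under the ordinary-user model (step 3).
    benign_model_entropies: entropies under the benign-crawler model (step 8).
    """
    if not is_crawler(user_model_entropies):
        return "user"
    if not is_malicious(benign_model_entropies):
        return "benign crawler"
    return "malicious crawler"


# Hypothetical baseline entropies from the two training datasets.
is_crawler = make_detector([-0.80, -0.75, -0.86])
is_malicious = make_detector([-1.50, -1.45, -1.55])

labels = [
    classify([-0.79, -0.82], [-2.00, -2.10], is_crawler, is_malicious),
    classify([-2.00, -2.10], [-1.48, -1.52], is_crawler, is_malicious),
    classify([-2.40, -2.60], [-3.00, -3.20], is_crawler, is_malicious),
]
```

Keeping the two detectors as separate closures mirrors the text, where each stage has its own baseline μ₀ and σ₀.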

Claims

1. A malicious web crawler detection method based on a hidden Markov model, characterized by:

Construction of the HTTP traffic behavior model

1.1 Basic definitions:

Observation: the resource type of each HTTP request is taken as the observation, so the observation sequence of HTTP traffic is expressed as $o = o_1, o_2, \ldots, o_T$, where $o_t$ denotes the resource type requested by the $c$-th connection at time $t$; the observation space is $V = \{1, 2, \ldots, N\}$;

State: the state is the page requested by connection $c$ at time $t$; the state sequence is expressed as $y = y_1, y_2, \ldots, y_T$, and the state space is $S = \{1, 2, \ldots, M\}$;

The parameter model of HTTP traffic behavior is expressed as $\theta = \{\pi, A, B\}$, where $\pi$ is the initial state probability distribution of the model, $A$ is the state transition probability matrix, and $B$ is the observation probability matrix;

The parameter estimation task for the HTTP traffic behavior model is to estimate the parameters of the corresponding hidden semi-Markov model from the collected observation sequences; the present invention solves this parameter estimation problem with the well-known forward-backward algorithm, described in detail below;

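1) Define the forward and backward variables, where $S_t$ denotes the state at time $t$:

$$\alpha_t(j) = P[S_t = j,\ o_{1:t} \mid \theta]$$

$$\beta_t(j) = P[o_{t+1:T} \mid S_t = j,\ \theta]$$

2) Initialization of the forward-backward algorithm:

$$\alpha_1(j) = \pi_j\, b_j(o_1),$$

$$\beta_T(j) = 1;$$

3) Recursion:

$$\alpha_{t+1}(j) = \Big[\sum_{i=1}^{M} \alpha_t(i)\, a_{ij}\Big]\, b_j(o_{t+1})$$

$$\beta_t(i) = \sum_{j=1}^{M} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$

4) Compute the intermediate variable:

$$\xi_t(i, j) = P[S_t = i,\ S_{t+1} = j,\ o_{1:T} \mid \theta] = \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$

5) Parameter update formulas:

$$\hat{\pi}_i = \frac{\alpha_1(i)\,\beta_1(i)}{\sum_{j=1}^{M} \alpha_1(j)\,\beta_1(j)}$$

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'=1}^{M} \xi_t(i, j')}$$

$$\hat{b}_j(k) = \frac{\sum_{t=1}^{T} \alpha_t(j)\,\beta_t(j)\, I(o_t = v_k)}{\sum_{t=1}^{T} \alpha_t(j)\,\beta_t(j)}$$

where $I(o_t = v_k) = 1$ when $o_t = v_k$, and $I(o_t = v_k) = 0$ otherwise;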

1.3 Detection of web crawlers

HTTP traffic comprises the HTTP traffic of ordinary users and the HTTP traffic of web crawlers. To detect malicious web crawler traffic, web crawler traffic must first be separated from ordinary user traffic; to this end, the present invention identifies web crawler traffic by means of anomaly detection.

2. The malicious web crawler detection method based on a hidden Markov model according to claim 1, characterized by:

Construction of the HTTP traffic behavior model

1.4 Detection of malicious web crawlers

When detecting malicious web crawlers, first compute the mean μ of the entropy of the monitored data sequence, then use |μ-μ₀| as the anomaly detection statistic; if |μ-μ₀|≥3σ₀, the traffic is judged abnormal.

3. The malicious web crawler detection method based on a hidden Markov model according to claim 1, characterized by: construction of the HTTP traffic behavior model

Implementation procedure

Step 4: Compute the traffic anomaly detection statistic |μ-μ₀|;

Step 6: Extract the training dataset for benign crawler detection;

Step 9: Compute the anomaly detection statistic |μ-μ₀|;