CN107992469A

CN107992469A - Phishing URL detection method and system based on word sequences

Info

Publication number: CN107992469A
Application number: CN201710952360.5A
Authority: CN
Inventors: 亚静; 柳厅文; 时金桥; 张盼盼; 张振宇; 王玉斌; 李全刚
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2017-10-13
Filing date: 2017-10-13
Publication date: 2018-05-04

Abstract

The present invention provides a phishing URL detection method and system based on word sequences. The URL string is segmented into words and the resulting word sequence is converted into a vector representation; a deep learning model then automatically learns the contextual information and features in the word sequence, so the word-related text features of the URL do not have to be extracted manually. The trained model is used to detect phishing URLs. This solves the problems encountered by existing phishing URL detection methods based on word features.

Description

Phishing URL detection method and system based on word sequences

Technical field

The present invention relates to the field of information security, and more particularly to a phishing URL detection method and system based on word sequences.

Background art

A phishing URL is a form of online fraud that impersonates a legitimate, reputable website in order to obtain users' sensitive information, such as usernames, passwords, and credit card details. Phishing URLs usually claim to come from popular social networking sites (such as YouTube, Facebook, and Twitter), auction sites (eBay), online payment and e-commerce sites (PayPal, Alibaba, etc.), or network administrators (Google, Yahoo, ISPs) in order to win the victim's trust. A trick attackers frequently use is to embed keywords in the URL that confuse the user; for example, an attacker may use a URL of the form "login.mydomain.tld/paypal" to lure PayPal users.

At present, both in research and in commercial products, there are many phishing URL detection methods and security products. Most of them work by manually extracting features from URL-related data, building a classification model, and classifying URLs in order to detect phishing URLs. Depending on the data analyzed, existing detection methods fall into two broad categories: methods based on multi-source information and methods based on the URL itself.

Detection methods based on multi-source information need to collect several kinds of URL-related data, including Alexa rank, WHOIS information, and web page content, and train a complex model on the labeled data to detect whether an unknown URL is a phishing URL. Such methods usually achieve high accuracy, but because collecting these data requires considerable resources and extra time, they are not suitable for real-time detection in high-speed networks.

Detection methods based on the URL itself analyze only the text features of the URL string to build a classification model. They are lightweight and suitable for real-time detection.

Specifically, phishing detection methods based on the URL itself extract text features of the URL string and train a classification model to detect phishing URLs. The text features of a URL string can be further divided into character features and word features. Character features describe the characters that make up the URL string, including character length, vowel-to-consonant ratio, number of digits, number of special characters, and the entropy of the character distribution. Word features mainly analyze the semantically meaningful words contained in the URL and their frequency of occurrence, such as the common words login and update and popular brand names such as paypal and alibaba.

Lightweight phishing detection based on the URL itself better meets the need for real-time response in high-speed networks. However, character-based features ignore the semantic information contained in the URL. URLs are meant to be easy for people to remember, so they are usually readable and memorable and contain several meaningful common words. Moreover, in phishing attacks, a strategy attackers frequently adopt is to confuse users with keywords.

Existing phishing URL detection methods based on word features mostly use words and their frequency of occurrence as features and do not consider the word sequence features contained in the URL. These features are all designed manually, which has certain limitations. First, manual feature extraction requires substantial manpower and resources for statistical analysis and for verifying the validity of the features. Second, manually extracted features are usually effective only for a particular class of data and are not robust. Moreover, the keywords attackers use in phishing URLs are usually similar to those in normal URLs, precisely in order to confuse users, which reduces the detection performance of the classification model.

Summary of the invention

In view of the deficiencies of the prior art, an object of the present invention is to provide a phishing URL detection method and system based on word sequences. The URL string is segmented into words and the resulting word sequence is converted into a vector representation; a deep learning model then automatically learns the contextual information and features in the word sequence, so the word-related text features of the URL do not have to be extracted manually. The trained model is used to detect phishing URLs. This solves the problems, described above, encountered by existing phishing URL detection methods based on word features.

To achieve the above object, the present invention adopts the following technical solution:

A phishing URL detection method based on word sequences, comprising the following steps:

converting labeled URLs into word sequence vectors to serve as training data;

training a classification model using the training data;

converting an unknown URL into a word sequence vector and inputting it into the trained classification model to be labeled.

Further, converting a labeled URL or an unknown URL into a word sequence vector comprises:

filtering out the protocol and the generic top-level domain from the labeled URL or unknown URL;

splitting the part remaining after filtering, and segmenting each resulting substring into words by forward maximum matching against a dictionary to obtain a word sequence;

numbering all words in the dictionary starting from 1 so that each word has a unique number, and converting the word sequence of each labeled URL or unknown URL into a fixed-length numeric vector.

Further, the protocol includes http, https, ftp, ftps, and gopher; the generic top-level domain includes com, org, net, edu, and gov.
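
The filtering and splitting described above might be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the function name `split_url`, the rule that only the last host label is treated as a generic top-level domain, and the choice of every non-alphanumeric character as a delimiter are all hypothetical, since the text does not specify them.

```python
import re

# Protocols and generic top-level domains named in the text.
PROTOCOLS = ("http", "https", "ftp", "ftps", "gopher")
GTLDS = ("com", "org", "net", "edu", "gov")


def split_url(url: str) -> list[str]:
    """Drop the protocol and the gTLD, then split the rest on symbols."""
    # Remove a leading "<protocol>://" when the protocol is a known one.
    match = re.match(r"^([a-z][a-z0-9+.-]*)://", url, flags=re.IGNORECASE)
    if match and match.group(1).lower() in PROTOCOLS:
        url = url[match.end():]

    host, _, path = url.partition("/")
    host_labels = [label for label in host.split(".") if label]
    # Assumption: only a trailing host label that is a gTLD is removed.
    if host_labels and host_labels[-1].lower() in GTLDS:
        host_labels.pop()

    # Assumption: any run of non-alphanumeric characters is a delimiter.
    remainder = ".".join(host_labels) + "/" + path
    return [part.lower() for part in re.split(r"[^A-Za-z0-9]+", remainder) if part]


print(split_url("http://shen.mansell.tripod.com/games/gameboy.html"))
```

Each returned substring would then be passed to the dictionary-based word segmentation step.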

Further, segmenting by forward maximum matching against the dictionary comprises:

determining whether the whole string is in the dictionary and, if so, performing no further segmentation;

if not, removing the last character and determining whether the remaining string is in the dictionary;

repeating the foregoing check until a word in the dictionary is matched, and then removing the matched word;

continuing the above steps on the remaining part of the string until the whole string has been processed;

if the string contains no word in the dictionary, dividing it into single characters.
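
The forward maximum matching steps above can be illustrated with a short sketch. The function name `forward_max_match` and the toy dictionary are hypothetical. One point is an assumption rather than something the text states: when no dictionary word starts at the current position, the sketch emits a single character and continues from the next position.

```python
def forward_max_match(text: str, dictionary: set[str]) -> list[str]:
    """Segment `text` greedily, preferring the longest dictionary word."""
    words = []
    start = 0
    while start < len(text):
        end = len(text)
        # Drop the last character until the candidate is a dictionary word.
        while end > start and text[start:end] not in dictionary:
            end -= 1
        if end == start:
            # Assumption: no word starts here, so emit one character.
            end = start + 1
        words.append(text[start:end])
        start = end  # continue with the remaining part of the string
    return words


# Toy dictionary for illustration only; the text uses a large word corpus.
toy_dictionary = {"game", "games", "boy", "pay", "pal", "paypal", "login"}
print(forward_max_match("gameboy", toy_dictionary))
```

Because the longest candidate is tried first, "paypal" is kept as one word rather than being split into "pay" and "pal".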

Further, the dictionary is the Google English word corpus published by Peter Norvig.

Further, the classification model trained with the training data is a bidirectional LSTM model based on word sequences.

Further, training the classification model using the training data comprises:

randomly dividing the training data into a training part and a validation part, and training the bidirectional LSTM model by setting the hyperparameters of the neural network model and parameters such as the activation function.

Further, the bidirectional LSTM model is a four-layer neural network comprising an embedding layer, a bidirectional LSTM layer, a dropout layer, and a sigmoid layer, and training the classification model using the training data further comprises applying dropout to the output of the bidirectional LSTM layer to prevent overfitting.

A phishing URL detection system based on word sequences, comprising:

a conversion module and a classification training model;

the conversion module is configured to convert labeled URLs into word sequence vectors to serve as training data for training the classification model, and to convert an unknown URL into a word sequence vector and input it into the trained classification model to be labeled.

As described above, the method and system provided by the present invention do not require any feature to be extracted manually. The URL only needs to be converted into a word sequence vector representation, and a deep neural network (a bidirectional LSTM model) automatically learns the contextual information and features in the word sequence in order to detect phishing URLs.

Compared with traditional techniques for detecting phishing URLs, the invention has the following advantages:

First, there is no need to collect additional URL-related data or to manually extract text features of the URL. A deep learning model automatically learns the word sequence context and features of the URL and thereby detects phishing URLs, which significantly reduces overhead.

In addition, by deeply mining the contextual information and features contained in the word sequence of the URL, the method performs better on the same data set than machine learning models based on manually extracted word features and deep learning models based on character strings.

Finally, with the method and system of the present invention, a trained model running on an ordinary server achieves a single-thread prediction speed of no less than 600 URLs per second. Real-time detection requirements can therefore be met while detection accuracy is improved.

Brief description of the drawings

Fig. 1 is a flow diagram of the phishing URL detection method based on word sequences in one embodiment of the present invention.

Fig. 2 is a structural diagram of the bidirectional LSTM model used in the phishing URL detection method based on word sequences in one embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.

One embodiment of the present invention provides a phishing URL detection method and system based on word sequences. The main steps of the method are:

(1) Word sequence vector representation: first, the keyword sequence contained in the URL is obtained by dictionary matching; the sequence is then encoded using the dictionary to obtain the vector representation of the URL word sequence.

(2) Model training: the word sequence vectors obtained in the previous step that carry labels are used as training data to train the bidirectional LSTM model based on word sequences.

(3) Phishing URL detection: the trained bidirectional LSTM model based on word sequences is used to detect whether an unknown URL is a phishing URL.

The system comprises a conversion module and a classification training model.

The conversion module is configured to convert labeled URLs into word sequence vector representations to serve as training data for training the classification model, and to convert an unknown URL into a word sequence vector representation and input it into the trained classification model to be labeled.

The word sequence vector representation step of the method obtains the vector representation of the URL word sequence and consists mainly of the following sub-steps:

i) First, the known protocol and the generic top-level domain are filtered out of the URL. Common protocols include http, https, ftp, ftps, and gopher; there are 14 generic top-level domains, including com, org, net, edu, and gov.

ii) The remaining part is first split on symbols, and each resulting substring is then segmented into words by forward maximum matching against a dictionary prepared in advance, with reference to the pseudocode of Algorithm 1. The segmentation process is as follows: first determine whether the whole string is in the dictionary; if it is, no further segmentation is needed. If it is not, remove the last character and determine whether the remaining string is in the dictionary, repeating until a word in the dictionary is matched, and then remove the matched word. Continue the above steps on the remaining part of the string until the whole string has been processed. If the string contains no word in the dictionary, it is divided into single characters.

The dictionary used in the above segmentation is the Google English word corpus published by Peter Norvig (containing 333,333 English words). Other English word dictionaries are not suitable; this dictionary consists of common words that Peter Norvig counted in web pages, which better matches the way URLs are named.

iii) Then, all words in the dictionary are numbered starting from 1 so that each word has a unique number, and the word sequence of each URL is converted into a fixed-length numeric vector.
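
Sub-step iii) can be illustrated with the following minimal sketch. The names `build_index` and `encode` are hypothetical. Padding with trailing zeros follows the form of the example vectors given in this description, whereas mapping out-of-dictionary tokens to 0 and truncating sequences longer than the fixed length are assumptions that the text does not specify.

```python
def build_index(dictionary_words):
    """Number the dictionary words from 1; 0 is reserved for padding."""
    return {word: number for number, word in enumerate(dictionary_words, start=1)}


def encode(words, index, length=13):
    """Convert a word sequence into a fixed-length vector of word numbers."""
    # Assumption: tokens missing from the dictionary are encoded as 0.
    numbers = [index.get(word, 0) for word in words]
    # Assumption: sequences longer than `length` are truncated.
    numbers = numbers[:length]
    # Shorter sequences are padded with trailing zeros.
    return numbers + [0] * (length - len(numbers))


# Toy dictionary for illustration only.
index = build_index(["game", "boy", "html"])
print(encode(["game", "boy", "html"], index, length=5))
```

Reserving 0 for padding is what makes numbering from 1 convenient: a zero entry can never be confused with a dictionary word.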

In the model training step of the method, the vectors obtained in the previous step are collected, and the labeled vectors are used as training data to train the bidirectional LSTM model based on word sequences. The training sample set is randomly divided into a training part and a validation part (80% and 20% of all labeled data, respectively), and the bidirectional LSTM model is trained by setting the hyperparameters of the neural network model (the output dimension of each layer, etc.) and parameters such as the activation function. The deep learning model used is a multilayer neural network with four layers: an embedding layer, a bidirectional LSTM layer, a dropout layer, and a sigmoid layer. Dropout is applied to the output of the bidirectional LSTM layer to prevent overfitting.
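
The random 80%/20% division of the labeled data might look like the sketch below. The function name `split_train_validation` and the fixed random seed are hypothetical; the text specifies only that the split is random and in these proportions.

```python
import random


def split_train_validation(samples, train_fraction=0.8, seed=0):
    """Shuffle the labeled samples and split them into two parts."""
    shuffled = list(samples)  # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Each sample pairs a word sequence vector with its label (1 = phishing).
labeled = [([i, 0, 0], i % 2) for i in range(1, 11)]
train_part, validation_part = split_train_validation(labeled)
print(len(train_part), len(validation_part))
```

The validation part is held out from training so that settings such as the layer output dimensions can be compared on data the model has not seen.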

The phishing URL detection step of the method is applied to unlabeled data, that is, unknown URLs, to detect whether they are phishing URLs. The word sequence vector of an unknown URL is input into the trained bidirectional LSTM model to be labeled; an output of 1 indicates a phishing URL, and any other output indicates a normal URL.

A further description is given below with an example. The overall flow of the phishing URL detection method based on word sequences is shown in Fig. 1, and the structure of the bidirectional LSTM model based on word sequences is shown in Fig. 2.

Take the phishing URL http://shen.mansell.tripod.com/games/gameboy.html as an example. Its label is 1. The URL is converted into a fixed-length word sequence vector representation and used to train the bidirectional LSTM model, and the trained model is then used to detect the unknown URL http://fly-project.net//yahoo.link/Yah/T/Y.html.

1) First, the input URL is converted into a word sequence vector representation. The URL is first segmented into words using the dictionary prepared in advance.

The words in the dictionary are then numbered, and the word sequence is expressed as a fixed-length vector of length N. The value of N can be obtained statistically: statistics show that more than 90 percent of URLs contain no more than 13 words, so N is set to 13. The two URLs then yield the vectors (1,4,5,6,7,11,13,0,0,0,0,0,0) and (2,19,3,9,12,8,14,0,0,0,0,0,0), respectively.
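
One way to derive N from such statistics is sketched below: choose the smallest length that covers a given fraction of the URLs. The function name `choose_length` and the sample word counts are invented for illustration; the text gives only the resulting value N=13.

```python
import math


def choose_length(word_counts, coverage=0.9):
    """Return the smallest N such that at least `coverage` of the URLs
    have no more than N words."""
    if not word_counts:
        raise ValueError("word_counts must not be empty")
    ordered = sorted(word_counts)
    # Rank of the URL that must still fit within N words.
    rank = math.ceil(coverage * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical per-URL word counts, for illustration only.
counts = [3, 5, 7, 8, 9, 10, 11, 12, 13, 40]
print(choose_length(counts))
```

A cutoff chosen this way keeps the vectors short while leaving only the few unusually long URLs to be shortened.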

The word sequence vector representations of all URLs in the sample set are obtained in the same way. The sample set contains labeled normal URLs and labeled phishing URLs.

2) The labeled data in the word sequence vector set is used as training data and input into the bidirectional LSTM model based on word sequences shown in Fig. 2 for training. The word sequence vector of a URL is first input into the Embedding layer for dimensionality reduction and then into the bidirectional LSTM layer for learning. The learned result is input into the dropout layer to prevent overfitting, and the sigmoid function of the last layer outputs the detection result. A label of 1 denotes a phishing URL and a label of 0 denotes a normal URL. This is in essence a binary classification problem, so the model output uses the sigmoid function for 0-1 classification.
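
The final 0-1 decision can be illustrated as follows. A sigmoid maps the last layer's raw score to a value between 0 and 1, which must then be turned into a label. The decision threshold of 0.5 and the function names are assumptions; the text states only that an output of 1 denotes a phishing URL.

```python
import math


def sigmoid(score: float) -> float:
    """Numerically stable logistic function."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_score = math.exp(score)
    return exp_score / (1.0 + exp_score)


def to_label(score: float, threshold: float = 0.5) -> int:
    """Return 1 (phishing URL) or 0 (normal URL).

    Assumption: the probability is thresholded at 0.5.
    """
    return 1 if sigmoid(score) >= threshold else 0


print(to_label(2.0), to_label(-2.0))
```

Raising the threshold would trade recall for precision, which may matter when false alarms are costly.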

All labeled data are input into the model as training data, and the trained model is output.

3) For unlabeled data, the vector is input into the trained model, which outputs the labeling result: an output of 1 denotes a phishing URL; otherwise the URL is normal.

As the above example shows, the method does not require any feature to be extracted manually. The URL only needs to be converted into a word sequence vector representation, and a deep neural network (a bidirectional LSTM model) automatically learns the contextual information and features in the word sequence in order to detect phishing URLs.

The main steps are as follows. 1) Word sequence vector representation: the URLs are first segmented into words. The URLs here include both labeled and unknown URLs; all of them are converted into vectors, and the labeled data are then used to train the model. A sequence padding method is then used to obtain a vector representation of fixed length. "Fixed length" means that the word sequence vectors obtained for all URLs have the same length; sequence padding handles vectors of different lengths by converting them to the same length.

2) Model training: using the vectors obtained in the previous step, the bidirectional LSTM model is trained with the labeled training data.

3) Phishing URL detection: for an unlabeled URL, its vector representation is input into the trained bidirectional LSTM model to be labeled; a label of 1 denotes a phishing URL.

Step 1) first obtains the fixed-length vector representation of the URL string through word sequence vector representation; the method trains on and analyzes this vector representation of the URL.

Step 2) uses the labeled portion of the preprocessed data to train the bidirectional LSTM model based on word sequences.

Step 3) inputs the vector representation of an unknown URL into the trained bidirectional LSTM model to be labeled, detecting whether it is a phishing URL.

Detecting phishing URLs with the above method makes it possible to deeply mine the contextual information and features contained in the word sequence of the URL. The method performs better than machine learning models based on manually extracted word features and deep learning models based on character strings; the detection results on the same data set are shown in Table 1.

Moreover, the method is a lightweight phishing URL detection method. A trained model running on an ordinary server achieves a single-thread prediction speed of no less than 600 URLs per second, so real-time detection requirements can be met while detection accuracy is improved.

Table 1: Comparison of the detection results of four detection models

Model	Precision	Recall	F1
Decision tree model based on word features	0.8803	0.8700	0.8751
Random forest model based on word features	0.8981	0.8965	0.8973
Bidirectional LSTM model based on character strings	0.9553	0.9474	0.9513
Bidirectional LSTM model based on word sequences	0.9808	0.9716	0.9762
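
As a consistency check on the table, the F1 column can be recomputed from the precision and recall columns. The sketch below uses the standard definition of F1 as the harmonic mean of precision and recall; the source does not state the formula, so the definition is an assumption, although the recomputed values agree with the tabulated ones to four decimal places.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from the table, in row order.
rows = [
    (0.8803, 0.8700),
    (0.8981, 0.8965),
    (0.9553, 0.9474),
    (0.9808, 0.9716),
]
for precision, recall in rows:
    print(round(f1_score(precision, recall), 4))
```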

Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Claims

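1. A phishing URL detection method based on word sequences, comprising the following steps:

converting labeled URLs into word sequence vectors to serve as training data;

training a classification model using the training data;

converting an unknown URL into a word sequence vector and inputting it into the trained classification model to be labeled.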

2. The phishing URL detection method based on word sequences according to claim 1, characterized in that converting a labeled URL or an unknown URL into a word sequence vector comprises:

filtering out the protocol and the generic top-level domain from the labeled URL or unknown URL;

splitting the part remaining after filtering, and segmenting each resulting substring into words by forward maximum matching against a dictionary to obtain a word sequence;

numbering all words in the dictionary starting from 1 so that each word has a unique number, and converting the word sequence of each labeled URL or unknown URL into a fixed-length numeric vector.

3. The phishing URL detection method based on word sequences according to claim 2, characterized in that the protocol includes http, https, ftp, ftps, and gopher, and the generic top-level domain includes com, org, net, edu, and gov.

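4. The phishing URL detection method based on word sequences according to claim 2, characterized in that segmenting by forward maximum matching against the dictionary comprises:

determining whether the whole string is in the dictionary and, if so, performing no further segmentation;

if not, removing the last character and determining whether the remaining string is in the dictionary;

repeating the foregoing check until a word in the dictionary is matched, and then removing the matched word;

continuing the above steps on the remaining part of the string until the whole string has been processed;

if the string contains no word in the dictionary, dividing it into single characters.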

5. The phishing URL detection method based on word sequences according to claim 4, characterized in that the dictionary is the Google English word corpus published by Peter Norvig.

6. The phishing URL detection method based on word sequences according to claim 2, characterized in that the classification model trained with the training data is a bidirectional LSTM model based on word sequences.

7. The phishing URL detection method based on word sequences according to claim 1, characterized in that training the classification model using the training data comprises:

randomly dividing the training data into a training part and a validation part, and training the bidirectional LSTM model by setting the hyperparameters of the neural network model and parameters such as the activation function.

8. The phishing URL detection method based on word sequences according to claim 7, characterized in that the bidirectional LSTM model is a four-layer neural network comprising an embedding layer, a bidirectional LSTM layer, a dropout layer, and a sigmoid layer, and training the classification model using the training data further comprises applying dropout to the output of the bidirectional LSTM layer to prevent overfitting.

9. A phishing URL detection system based on word sequences, characterized by comprising:

a conversion module and a classification training model;

the conversion module is configured to convert labeled URLs into word sequence vectors to serve as training data for training the classification model, and to convert an unknown URL into a word sequence vector and input it into the trained classification model to be labeled.