CN108111478A - Phishing identification method and device based on semantic understanding

Info

Publication number: CN108111478A
Application number: CN201711085356.XA
Authority: CN
Inventors: 张茜; 曾宇; 李洪涛; 延志伟; 袁晓彤; 耿光刚
Original assignee: China Internet Network Information Center
Current assignee: China Internet Network Information Center
Priority date: 2017-11-07
Filing date: 2017-11-07
Publication date: 2018-06-01

Abstract

The present invention relates to a phishing identification method and device based on semantic understanding. The method comprises: extracting the text portion of the HTML of the web pages in a website to obtain the text data of the web pages; generating text semantic features from the text data of the web pages; and inputting the text semantic features of a website to be detected into a phishing detection model to determine whether the website to be detected is a phishing website. The phishing detection model is built from the text semantic features of websites using a machine learning algorithm. The method uses the text data of legitimate web pages as a corpus to train a language model and obtain word vectors, then uses the word vectors to produce vector representations of the HTML text of the web pages in legitimate websites and phishing websites, thereby generating the text semantic features. By extracting a set of features from the perspective of semantic analysis of web page text, the invention can build a more robust phishing detection model and improve phishing identification.

Description

A phishing identification method and device based on semantic understanding

Technical field

The invention belongs to the field of network technology, and in particular relates to a phishing identification method and device based on semantic understanding.

Background art

The term phishing dates from 1996 and derives from the word fishing. In a phishing attack, the attacker sends bait (such as email or SMS messages) to a large number of users and waits for a few of them to "take the bait", thereby achieving the goal of "catching fish" (for example, stealing users' private information). The Anti-Phishing Working Group (APWG) defines phishing as a form of network attack that uses social engineering and technical means to steal consumers' personal identity data and financial account credentials. Phishing attacks that rely on social engineering typically send users deceptive emails, SMS messages, and the like that appear to come from a legitimate enterprise or institution, luring users to reply with sensitive personal information or to click an embedded link and visit a forged website, where they disclose credentials (such as user names and passwords) or download malware. Phishing seriously threatens the property and privacy of Internet users and has become one of the greatest security risks on the Internet today.

Phishing is in essence brand counterfeiting: to pass the fake off as genuine, a phishing website closely resembles the brand website both visually and semantically. Phishing detection based on machine learning is a current research focus, and the choice of statistical features determines the effectiveness of the model. However, existing statistical feature extraction centers on visual similarity, theft features, third-party features, and the like, and neglects the mining of web page semantic features.

In recent years deep learning has made major progress in image recognition and speech recognition, and has also achieved very good results on many natural language understanding tasks, in particular topic classification, sentiment analysis, question answering, and machine translation. An important task in natural language processing is to represent words and texts as vectors. By training a language model with deep learning techniques, word vectors carrying semantic and syntactic information can be obtained, and the relative similarity between vectors correlates with semantic similarity.

Summary of the invention

To better characterize the counterfeiting nature of phishing websites, the present invention proposes a phishing identification method and device based on semantic understanding. It extracts a set of features from the perspective of semantic analysis of web page text in order to mine phishing characteristics not yet covered by current research, build a more robust phishing detection model, and improve phishing identification.

The technical solution adopted by the present invention is as follows:

A phishing identification method based on semantic understanding comprises the following steps:

Extracting the text portion of the HTML of the web pages in a website to obtain the text data of the web pages;

Generating text semantic features from the text data of the web pages;

Inputting the text semantic features of a website to be detected into a phishing detection model to determine whether the website to be detected is a phishing website, the phishing detection model being built from the text semantic features of websites using a machine learning algorithm.
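
The first step above, extracting the text portion of a page's HTML, might be sketched as follows. This is only an illustration using Python's standard `html.parser`; the class name `TextExtractor`, the helper names, and the decision to skip `<script>` and `<style>` contents are assumptions, not details given by the invention.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents (an assumption)."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text portion of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize_whitespace(text):
    """Split space-delimited text (e.g. English) into lowercase words."""
    return text.lower().split()
```

For languages without spaces between words, such as Chinese, `tokenize_whitespace` would have to be replaced by a dedicated word segmenter.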

Further, the text semantic features are generated as follows: the text data of legitimate web pages is used as a corpus to train a language model and obtain word vectors; the word vectors are then used to produce vector representations of the HTML text of the web pages in legitimate websites and phishing websites, generating the text semantic features.

Further, the language model is learned with a neural network model. A word vector table is built by training word vectors; the word vectors of all words in a web page text are then obtained by querying the word vector table and used to represent the text semantic features.

Further, a word that is not in the word vector table is handled in one of two ways: a) a predefined miss vector is used as the word vector of the word; b) a high-frequency vocabulary is built, and for a word that is not in the word vector table but is in the high-frequency vocabulary, its word vector is determined from its word frequency, while for a word that is in neither the word vector table nor the high-frequency vocabulary, a predefined vector is used as its word vector.

Further, the text semantic features are generated from the word vectors by averaging or by weighting.

Further, the text semantic features may instead be generated directly using the doc2vec method.

A phishing identification device based on semantic understanding comprises:

A text data extraction module, for extracting the text portion of the HTML of the web pages in a website to obtain the text data of the web pages;

A text semantic feature generation module, for generating text semantic features from the text data of the web pages;

A phishing detection model training module, for building a phishing detection model from the text semantic features using a machine learning algorithm;

A phishing detection module, for calling the text data extraction module and the text semantic feature generation module to extract the text semantic features of the web pages in a website to be detected, and inputting them into the phishing detection model to determine whether the website to be detected is a phishing website.

Further, the text semantic feature generation module uses the text data of legitimate web pages as a corpus to train a language model and obtain word vectors, and then uses the word vectors to produce vector representations of the HTML text of the web pages in legitimate websites and phishing websites, generating the text semantic features; alternatively, the text semantic feature generation module generates the text semantic features directly using the doc2vec method.

Compared with the prior art, the present invention has the following beneficial effects:

1. It mines, from the semantic perspective, phishing characteristics not yet covered by current research, making up for the shortcomings of existing machine-learning-based phishing identification techniques and improving the robustness of the detection model.

2. It represents text semantic features with word vectors, which makes the representation of web page text fast and convenient. Once the word vector table has been trained on the corpus, the semantic features of subsequent web page texts are obtained by table lookup followed by simple calculation.

3. It can handle the reuse of one phishing template across multiple brands. Because word vectors place functionally similar words close to one another along at least some direction of the vector space, the invention is well suited to similar phishing templates that counterfeit different brands.

4. It effectively improves the precision and recall of phishing detection and is suitable for real Internet detection environments.

Brief description of the drawings

Fig. 1 is a flow chart of phishing detection model training.

Fig. 2 is a flow chart of phishing detection.

Detailed description of the embodiments

To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is described in further detail below through specific embodiments and the accompanying drawings.

To win users' trust by deception, a phishing website usually looks like a legitimate website, and this similarity shows in many visual elements such as the URL, logo, login box, and copyright notice. Existing mainstream research detects phishing by mining visual similarity, theft features, third-party features, and the like. In essence, however, a phishing website relies heavily on counterfeiting the text content of a web page to lure users into entering sensitive information; that is, semantic counterfeiting is a key property of phishing websites, and existing research lacks an analysis of it. The present invention therefore mines the semantic similarity of phishing websites to improve phishing detection performance. It introduces word vector representations into phishing detection in order to better capture the counterfeiting nature of phishing websites.

The phishing detection method based on semantic understanding proposed by the present invention uses word vectors to represent text semantic features and thereby detects phishing web pages. The training process and the detection process of the detection model are shown in Fig. 1 and Fig. 2 and mainly comprise the following stages:

1. Detection model training stage

Training the phishing detection model mainly comprises the following four steps:

a) Word segmentation: for languages such as Chinese, which have no spaces between words, word segmentation must be performed after the text portion of the web page's HTML is extracted; for languages such as English, in which words are separated by spaces, no segmentation is needed and the text portion of the HTML is extracted directly.

b) Training the language model to obtain word vectors: the text data of legitimate web pages is used as a corpus and a neural network model is selected to learn (that is, train) the language model, yielding the word vector of each word and forming the word vector table.

c) Semantic representation of the HTML text with word vectors: the word vectors in the word vector table obtained in step b) are used to produce vector representations of the HTML text of the legitimate data and the phishing data, generating the text semantic features (that is, text vectors).

d) Building the phishing detection model from the text semantic features using a machine learning algorithm.

The machine learning algorithm is not restricted here; it includes, but is not limited to, common supervised machine learning algorithms such as support vector machines, random forests, and AdaBoost.

Building the phishing detection model from the text semantic features is similar to ordinary model training with a machine learning algorithm: the obtained text semantic features serve as the sample features, and the features and labels of the training data (whether or not a sample is a phishing website) are used with a suitable machine learning algorithm to train the phishing detection model.
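
To make the feature-and-label training interface concrete, the sketch below trains and applies a nearest-centroid classifier. This is a deliberately simple stand-in and not one of the algorithms named above (support vector machine, random forest, AdaBoost); the function names and the toy data in the usage note are hypothetical.

```python
def train_centroids(features, labels):
    """Return {label: mean feature vector} computed over the training samples."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, x in enumerate(vec):
            acc[k] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], vec))
```

With made-up two-dimensional text vectors, `train_centroids([[0, 0], [0, 1], [5, 5], [6, 5]], ["legitimate", "legitimate", "phishing", "phishing"])` yields a model for which `predict(model, [5.5, 5.0])` returns `"phishing"`. In practice a stronger supervised learner would take the place of this stand-in.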

2. Phishing detection stage

The main steps in detecting a web page are to extract its text semantic features and then input them into the phishing detection model to determine whether the web page is a phishing page. Text semantic feature extraction in this stage is similar to that in the model training stage.

The above outlines the two stages of the method. In both stages the focus is the representation of web page text semantic features. The present invention does not restrict the concrete implementation: the language model is learned with a neural network model to obtain word vectors, and no specific neural network model is prescribed; the representation of text semantic features from word vectors can be realized by averaging, weighting, and similar methods. Embodiments of each are given below.

1) Obtaining word vectors

Word vectors, also known as distributed word representations, can be trained in many ways, but all of them use a neural network model (such as CBOW, Skip-gram, C&W, or LBL) to learn a language model and thereby obtain the word vectors. The word vector table in the present invention is built as follows: a data set of legitimate web page text is constructed as the corpus for training word vectors, and the word vectors are trained with an existing neural network model or a self-built neural network, producing the word vector table of the words in the corpus. Each row of the word vector table contains a word and its N-dimensional word vector (the dimension N can be set as needed), and each dimension of the vector represents a latent syntactic or semantic feature of the word. Methods such as word2vec can be used to generate the word vector table.
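
The row layout of the word vector table (one word followed by its N-dimensional vector) can be illustrated with a small loader. The whitespace-separated text format and the function name `load_word_vector_table` are assumptions made for this sketch; the invention does not prescribe a storage format.

```python
def load_word_vector_table(lines):
    """Parse rows of the form '<word> <v1> ... <vN>' into {word: [v1, ..., vN]}."""
    table = {}
    dim = None
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed rows
        word, vector = parts[0], [float(v) for v in parts[1:]]
        if dim is None:
            dim = len(vector)  # the first row fixes the dimension N
        elif len(vector) != dim:
            raise ValueError(
                f"row for {word!r} has {len(vector)} dimensions, expected {dim}"
            )
        table[word] = vector
    return table
```

Keeping the table as a dictionary makes the later lookup of each word's vector a constant-time operation.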

Word vectors place functionally similar words close to one another along at least some direction of the feature space, so the similarity between words can be measured by the distance between their word vectors (Euclidean distance, cosine similarity, and so on). The words most similar to a given word can thus be computed. The words most similar to "Construction Bank" are shown below, where the first item of each tuple is the word and the second is its similarity to "Construction Bank".

(Agricultural Bank, 0.708540976048)

(Industrial Bank, 0.65518784523)

(Construction Bank, 0.636544108391)

(Bank of Communications, 0.616162657738)

(Huaxia Bank, 0.608458161354)

(sub-branch, 0.608001768589)

(Industrial and Commercial Bank, 0.59148645401)
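
A ranking of this kind can be produced with cosine similarity over the word vector table. The sketch below is a minimal illustration; the function names and the in-memory dictionary layout of the table are assumptions, and the source does not state which similarity measure produced the scores listed above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(table, query, top_n=3):
    """Return the top_n (word, similarity) pairs closest to the query word."""
    query_vec = table[query]
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in table.items()
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```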

2) Text semantic feature representation

Text semantic features are represented as follows: the word vectors of all words in the web page text are obtained by querying the word vector table, and the text vector is then computed from them. A word that is not in the word vector table is handled in one of two ways:

1. A predefined miss vector (for example, the all-zero vector) is used as the word vector of the word.

2. A high-frequency vocabulary is built. For a word that is not in the word vector table but is in the high-frequency vocabulary, its word vector is determined from its word frequency; for a word that is in neither the word vector table nor the high-frequency vocabulary, a predefined vector is used as its word vector.
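
The two fallback rules can be sketched as a single lookup function. The text does not specify how a vector is derived from word frequency, so the sketch assumes those vectors have already been computed and are passed in as a hypothetical `high_freq_vectors` mapping; the all-zero miss vector follows the example given above.

```python
def lookup_vector(word, vector_table, dim, high_freq_vectors=None):
    """Return the vector for a word, applying the out-of-vocabulary fallbacks.

    high_freq_vectors is a hypothetical {word: vector} mapping for words in the
    high-frequency vocabulary; how it is derived from word frequency is left open.
    """
    if word in vector_table:
        return vector_table[word]
    if high_freq_vectors is not None and word in high_freq_vectors:
        return high_freq_vectors[word]  # second handling mode
    return [0.0] * dim  # predefined miss vector (all zeros)
```

Calling the function without `high_freq_vectors` gives the first handling mode; supplying it gives the second.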

A text is represented as a vector from the word vector table in one of the following ways:

a) Averaging

Averaging treats every word in the text as having the same weight. When the text vector is computed by averaging, stop words are first removed from the text to avoid the noise they introduce, and the text vector is then computed with formula (1):

$$d_i = \frac{1}{n_i} \sum_{j=1}^{n_i} w_{ij} \qquad (1)$$

where d_i is the vector representation of the i-th text, n_i is the number of words in the i-th text, and w_ij is the word vector of the j-th word in the i-th text.
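
A minimal sketch of the averaging calculation follows. It assumes that the word count n_i is taken after stop words are removed and that unknown words fall back to an all-zero vector; both choices, like the function name, are assumptions of this illustration.

```python
def average_text_vector(words, vector_table, stop_words, dim):
    """Average the word vectors of a text after dropping stop words."""
    kept = [w for w in words if w not in stop_words]
    if not kept:
        return [0.0] * dim  # nothing left to average
    total = [0.0] * dim
    for word in kept:
        vec = vector_table.get(word, [0.0] * dim)  # miss vector for unknown words
        for k in range(dim):
            total[k] += vec[k]
    return [x / len(kept) for x in total]
```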

b) Weighting

Weighting treats the words in a text as having different weights. The weights can be computed by, but are not limited to, TF-IDF (Term Frequency-Inverse Document Frequency). With TF-IDF as the word weight, the text vector is computed as follows:

$$d_i = \frac{1}{\sum_{j=1}^{n_i} \mathrm{tfidf}_{ij}} \sum_{j=1}^{n_i} \left( w_{ij} \times \mathrm{tfidf}_{ij} \right)$$

where d_i, n_i, and w_ij have the same meanings as in formula (1), and tfidf_ij is the TF-IDF value of the j-th word in the i-th text.
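
The weighted calculation can be sketched in the same style. TF-IDF has several variants; the one used here (term frequency as count divided by text length, inverse document frequency as the natural logarithm of the number of texts over the document frequency) is an assumption, as are the function names.

```python
import math
from collections import Counter


def tfidf_weights(doc, corpus):
    """Return {word: tf * idf} for one tokenized text within a tokenized corpus."""
    counts = Counter(doc)
    n_docs = len(corpus)
    weights = {}
    for word, count in counts.items():
        tf = count / len(doc)
        df = sum(1 for other in corpus if word in other)
        weights[word] = tf * math.log(n_docs / df) if df else 0.0
    return weights


def weighted_text_vector(doc, weights, vector_table, dim):
    """TF-IDF-weighted average of the word vectors of a text."""
    total = [0.0] * dim
    weight_sum = 0.0
    for word in doc:
        weight = weights.get(word, 0.0)
        vec = vector_table.get(word, [0.0] * dim)  # miss vector for unknown words
        weight_sum += weight
        for k in range(dim):
            total[k] += weight * vec[k]
    if weight_sum == 0.0:
        return [0.0] * dim
    return [x / weight_sum for x in total]
```

Under this variant a word that occurs in every text receives weight zero, so it contributes nothing to the text vector.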

A concrete application example is given below.

Suppose the text content of a web page is "Industrial and Commercial Bank of China mobile banking" and the word segmentation result is "Industrial and Commercial Bank of China / mobile phone / bank". The vectors of these three words in the word vector table are as follows (for ease of illustration, only the first five dimensions are shown):

Table 1. Word vectors (first five dimensions) of the three words obtained by segmentation

Since none of these three words is in the stop word list, the text vector obtained by averaging is the mean of the three vectors, that is:

d ≈ (-0.486120, -0.154733, -0.318332, -0.863766, 1.424367)

The text vector computed by weighting is:

d = [2.7928238 × (-0.037823, 0.361873, 0.033403, -0.252190, -0.015590)

+ 1.4973016 × (-1.876170, 0.183362, -0.304421, -0.512916, 3.008589)

+ 1.7978696 × (0.455634, -1.009433, -0.683979, -1.826192, 1.280102)]

/ (2.7928238 + 1.4973016 + 1.7978696)

≈ (-0.344227, -0.086996, -0.261535, -0.781138, 1.110822)
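
The arithmetic of this example can be re-derived with a short script. The three vectors and the three TF-IDF weights are copied from the weighted expression above; the variable and function names are chosen for this sketch only.

```python
# TF-IDF weights and five-dimensional word vectors taken from the example.
weights = [2.7928238, 1.4973016, 1.7978696]
vectors = [
    (-0.037823, 0.361873, 0.033403, -0.252190, -0.015590),
    (-1.876170, 0.183362, -0.304421, -0.512916, 3.008589),
    (0.455634, -1.009433, -0.683979, -1.826192, 1.280102),
]


def mean_vector(vecs):
    """Component-wise mean of equally weighted vectors."""
    return [sum(component) / len(vecs) for component in zip(*vecs)]


def weighted_mean_vector(vecs, ws):
    """Component-wise weighted mean, normalized by the sum of the weights."""
    total_weight = sum(ws)
    return [
        sum(w * x for w, x in zip(ws, component)) / total_weight
        for component in zip(*vecs)
    ]


average = mean_vector(vectors)
weighted = weighted_mean_vector(vectors, weights)
print([round(x, 6) for x in average])
print([round(x, 6) for x in weighted])
```

Because the weighted mean is a convex combination of the three vectors, each of its components must lie between the smallest and largest corresponding components of the inputs, which gives a quick sanity check on any reported result.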

Common statistical features were extracted for a comparative test against the features extracted by the method of the invention. Four feature selection schemes were used: average word vectors, weighted word vectors, statistical features (the linear fusion of the theft feature, copyright counterfeiting feature, license counterfeiting feature, domain name timeliness feature, and link consistency feature described in Table 2, i.e. t1 ∪ t2 ∪ t3 ∪ t4 ∪ t5), and the fusion of average word vectors with statistical features. Ten-fold cross-validation was carried out with each of four machine learning algorithms, AdaBoost, Bagging, Random Forest, and SMO; the experimental results are shown in Table 3.
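
For readers unfamiliar with ten-fold cross-validation, the splitting step can be sketched as below. This is a plain, unshuffled and unstratified split written for illustration; the source does not say how the folds were formed, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    # Sample i goes to fold i % k, so fold sizes differ by at most one.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample appears in exactly one test fold, so every sample is used once for evaluation and k - 1 times for training.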

Table 2. Statistical features extracted for comparison

Table 3. Experimental results of classification with different features under four machine learning algorithms

The indicators in Table 3 are described as follows:

For a binary classification problem, samples can be divided, according to the combination of their true class and the class predicted by the learner, into true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The resulting confusion matrix is shown in Table 4:

Table 4. Confusion matrix of classification results

The following evaluation indicators can be defined from the confusion matrix:

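P (precision): $P = \frac{TP}{TP + FP}$

R (recall): $R = \frac{TP}{TP + FN}$

F-measure: $F_\beta = \frac{(1 + \beta^2) \times P \times R}{\beta^2 \times P + R}$ (β = 1 in the present invention)

FP Rate (false positive rate): $FPR = \frac{FP}{FP + TN}$

Error Rate: $\frac{FP + FN}{TP + FP + TN + FN}$

AUC: the ROC curve is the curve formed with FPR on the x-axis and TPR on the y-axis, and the area under this curve is called the AUC, where $TPR = \frac{TP}{TP + FN}$ and $FPR = \frac{FP}{FP + TN}$.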
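
These indicators can be computed from the four confusion-matrix counts as sketched below, using the standard definitions of precision, recall, F-measure, false positive rate, and error rate. The AUC helper relies on the known equivalence between the area under the ROC curve and the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (ties counted as one half). Function names and the label encoding (1 for phishing, 0 for legitimate) are assumptions.

```python
def classification_metrics(tp, fp, tn, fn, beta=1.0):
    """Compute the evaluation indicators from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    f_measure = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    error_rate = (fp + fn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "fp_rate": fp_rate,
        "error_rate": error_rate,
    }


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch but would be replaced by a rank-based computation on large data sets.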

As Table 3 shows, overall, phishing detection using word vectors alone performs on a par with detection using statistical features, while fusing word vectors with statistical features performs best.

Another embodiment of the present invention provides a phishing identification device based on semantic understanding, comprising:

A text semantic feature generation module, which uses the text data of legitimate web pages as a corpus to train a language model and obtain word vectors, and then uses the word vectors to produce vector representations of the HTML text of the web pages in legitimate websites and phishing websites, generating the text semantic features.

The above embodiment uses methods such as word2vec to generate the word vector table and from it the text vectors. In other embodiments, the doc2vec method can be trained to generate the vector of a variable-length text directly, that is, to generate the text semantic features directly. The text semantic features are then used to build the phishing detection model with a machine learning algorithm. In the phishing detection stage, the text semantic features of the web pages in the website to be detected are extracted and input into the phishing detection model to determine whether the website is a phishing website.

The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the present invention, and the scope of protection of the present invention shall be defined by the claims.

Claims

1. A phishing identification method based on semantic understanding, characterized by comprising the following steps:

Generating text semantic features from the text data of web pages;

Inputting the text semantic features of a website to be detected into a phishing detection model to determine whether the website to be detected is a phishing website, the phishing detection model being built from the text semantic features of websites using a machine learning algorithm.

2. The method of claim 1, characterized in that the text semantic features are generated as follows: the text data of legitimate web pages is used as a corpus to train a language model and obtain word vectors; the word vectors are used to produce vector representations of the HTML text of the web pages in legitimate websites and phishing websites, generating the text semantic features.

3. The method of claim 2, characterized in that the language model is learned with a neural network model, a word vector table is built by training word vectors, the word vectors of all words in a web page text are then obtained by querying the word vector table, and the word vectors are used to represent the text semantic features.

4. The method of claim 3, characterized in that a word that is not in the word vector table is handled in one of two ways: a) a predefined miss vector is used as the word vector of the word; b) a high-frequency vocabulary is built, and for a word that is not in the word vector table but is in the high-frequency vocabulary, its word vector is determined from its word frequency, while for a word that is in neither the word vector table nor the high-frequency vocabulary, a predefined vector is used as its word vector.

5. The method of claim 2, characterized in that the text semantic features are generated from the word vectors by averaging or by weighting.

6. The method of claim 5, characterized in that the averaging first removes stop words from the text and then computes the text vector with the following formula:

$$d_i = \frac{1}{n_i} \sum_{j=1}^{n_i} w_{ij},$$

where d_i is the vector representation of the i-th text, n_i is the number of words in the i-th text, and w_ij is the word vector of the j-th word in the i-th text.

7. The method of claim 5, characterized in that the weighting computes the text vector with the following formula:

$$d_i = \frac{1}{\sum_{j=1}^{n_i} \mathrm{tfidf}_{ij}} \sum_{j=1}^{n_i} \left( w_{ij} \times \mathrm{tfidf}_{ij} \right),$$

where d_i is the vector representation of the i-th text, n_i is the number of words in the i-th text, w_ij is the word vector of the j-th word in the i-th text, and tfidf_ij is the TF-IDF value of the j-th word in the i-th text.

8. The method of claim 1, characterized in that the text semantic features are generated directly using the doc2vec method.

9. A phishing identification device based on semantic understanding, characterized by comprising:

A text data extraction module, for extracting the text portion of the HTML of the web pages in a website to obtain the text data of the web pages;

A text semantic feature generation module, for generating text semantic features from the text data of the web pages;

A phishing detection model training module, for building a phishing detection model from the text semantic features using a machine learning algorithm;

A phishing detection module, for calling the text data extraction module and the text semantic feature generation module to extract the text semantic features of the web pages in a website to be detected, and inputting them into the phishing detection model to determine whether the website to be detected is a phishing website.

10. The device of claim 9, characterized in that the text semantic feature generation module uses the text data of legitimate web pages as a corpus to train a language model and obtain word vectors, and then uses the word vectors to produce vector representations of the HTML text of the web pages in legitimate websites and phishing websites, generating the text semantic features; alternatively, the text semantic feature generation module generates the text semantic features directly using the doc2vec method.