CN108111478A - Phishing identification method and device based on semantic understanding - Google Patents

Phishing identification method and device based on semantic understanding

Info

Publication number
CN108111478A
Authority
CN
China
Prior art keywords
text
word
term vector
semantic feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711085356.XA
Other languages
Chinese (zh)
Inventor
张茜
曾宇
李洪涛
延志伟
袁晓彤
耿光刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Internet Network Information Center
Original Assignee
China Internet Network Information Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Internet Network Information Center
Priority to CN201711085356.XA
Publication of CN108111478A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection

Abstract

The present invention relates to a phishing identification method and device based on semantic understanding. The method includes: extracting the text portion of the HTML of the web pages of a website to obtain the text data of the web pages; generating text semantic features from that text data; and inputting the text semantic features of a website to be detected into a phishing detection model to judge whether the website is a phishing website. The phishing detection model is built from the text semantic features of websites using a machine learning algorithm. The method trains a language model with the text data of legitimate web pages as the corpus to obtain word vectors, then uses the word vectors to represent the HTML text of web pages from legitimate websites and phishing websites as vectors, thereby generating the text semantic features. By extracting a series of features from the perspective of semantic analysis of web page text, the invention can build a more robust phishing detection model and improve phishing identification capability.

Description

Phishing identification method and device based on semantic understanding
Technical field
The invention belongs to the field of network technology, and specifically relates to a phishing identification method and device based on semantic understanding.
Background art
The term phishing dates from 1996 and evolved from the word fishing. In a phishing attack, the attacker sends bait (such as e-mail or SMS messages) to a large number of users, expecting a few of them to "take the bait", and thereby achieves the goal of "catching fish" (for example, stealing users' private information). The Anti-Phishing Working Group (APWG) defines phishing as a form of network attack that uses social engineering and technical means to steal consumers' personal identity data and financial account credentials. Phishing attacks that use social engineering typically send users deceptive e-mails or SMS messages that appear to come from a legitimate enterprise or institution, luring them to reply with sensitive personal information or to click an embedded link and visit a forged website, where they disclose credential information (such as user names and passwords) or download malware. Phishing seriously threatens the property and privacy of Internet users and has become one of the largest security risks on the Internet today.
Phishing is essentially brand counterfeiting: to pass for the genuine site, a phishing website closely resembles the brand's website both visually and semantically. Phishing detection based on machine learning is a current research focus, and the choice of statistical features determines the effectiveness of the model. However, existing statistical features are extracted mainly around visual similarity, theft features, third-party features and the like, and neglect the mining of web page semantic features.
In recent years deep learning has made major progress in image recognition and speech recognition, and has also achieved very good results on many natural language understanding tasks, notably topic classification, sentiment analysis, question answering, and language translation. An important task in natural language processing is representing words and texts as vectors. Training a language model with deep learning techniques yields word vectors that carry semantic and syntactic information, and the relative similarity between vectors is correlated with semantic similarity.
Summary of the invention
To better characterize the counterfeit nature of phishing websites, the present invention proposes a phishing identification method and device based on semantic understanding. It extracts a series of features from the perspective of semantic analysis of web page text, so as to mine phishing characteristics not yet covered by current research, build a more robust phishing detection model, and improve phishing identification capability.
The technical solution adopted by the present invention is as follows:
A phishing identification method based on semantic understanding, comprising the following steps:
extracting the text portion of the HTML of the web pages of a website to obtain the text data of the web pages;
generating text semantic features from the text data of the web pages;
inputting the text semantic features of a website to be detected into a phishing detection model to judge whether the website to be detected is a phishing website, the phishing detection model being built from the text semantic features of websites using a machine learning algorithm.
Further, the text semantic features are generated as follows: a language model is trained with the text data of legitimate web pages as the corpus to obtain the word vectors of words; the word vectors are then used to represent the HTML text of web pages from legitimate websites and phishing websites as vectors, generating the text semantic features.
Further, the language model is learned with a neural network model; a word vector table is built by training the word vectors; the word vectors of all words in a web page text are then obtained by looking up the word vector table, and the text semantic features are represented using these word vectors.
Further, words that are not in the word vector table are handled in one of two ways: a) for a word not in the word vector table, a predefined miss vector is used as its word vector; b) a high-frequency vocabulary is built; for a word not in the word vector table but in the high-frequency vocabulary, its word vector is determined according to word frequency, and for a word in neither the word vector table nor the high-frequency vocabulary, a predefined vector is used as its word vector.
Further, the text semantic features are generated from the word vectors by averaging or by weighting.
Further, the text semantic features are alternatively generated directly using the doc2vec method.
A phishing identification device based on semantic understanding, comprising:
a text data extraction module, for extracting the text portion of the HTML of the web pages of a website to obtain the text data of the web pages;
a text semantic feature generation module, for generating text semantic features from the text data of the web pages;
a phishing detection model training module, for building a phishing detection model from the text semantic features using a machine learning algorithm;
a phishing detection module, for calling the text data extraction module and the text semantic feature generation module to extract the text semantic features of the web pages of a website to be detected, and inputting them into the phishing detection model to judge whether the website to be detected is a phishing website.
Further, the text semantic feature generation module trains a language model with the text data of legitimate web pages as the corpus to obtain the word vectors of words, and then uses the word vectors to represent the HTML text of web pages from legitimate websites and phishing websites as vectors, generating the text semantic features; alternatively, the text semantic feature generation module generates the text semantic features directly using the doc2vec method.
Compared with the prior art, the beneficial effects of the present invention are as follows:
1. It mines, from the semantic angle, phishing characteristics not yet covered by current research, making up for a shortcoming of existing machine-learning-based phishing identification techniques and improving the robustness of the detection model.
2. It represents text semantic features with word vectors, so web page text semantic features are quick and convenient to compute. Once the word vector table has been trained from the corpus, the semantic features of subsequent web page texts are obtained by table lookup plus simple calculation.
3. It can handle the reuse of one phishing template across multiple brands. Because word vectors place functionally similar words close to one another along at least some direction of the vector space, the invention is well suited to handling similar phishing templates that counterfeit different brands.
4. It can effectively improve the precision and recall of phishing detection, and is suitable for detection in a real Internet environment.
Description of the drawings
Fig. 1 is a flow chart of phishing detection model training.
Fig. 2 is a flow chart of phishing detection.
Detailed description of embodiments
To make the above objects, features and advantages of the present invention clearer and easier to understand, the invention is described in further detail below through specific embodiments and the accompanying drawings.
To win users' trust by deception, phishing websites often look like legitimate websites, and this similarity shows in many visual elements such as the URL, logo, login box, and copyright notice. Existing mainstream research detects phishing by mining visual similarity, theft features, third-party features and the like. In essence, however, a phishing website relies heavily on counterfeiting the text content of web pages to lure users into entering sensitive information; that is, semantic counterfeiting is a key property of phishing websites, yet existing research lacks the corresponding analysis. The present invention therefore explores the semantic similarity of phishing websites to improve phishing detection performance. It introduces word vector representations into phishing detection in order to better characterize the counterfeit nature of phishing websites.
The phishing detection method based on semantic understanding proposed by the present invention represents text semantic features with word vectors to detect phishing web pages. The training process and the detection process of the detection model are shown in Fig. 1 and Fig. 2, and mainly comprise the following steps:
1. Detection model training stage
The training of the phishing detection model mainly comprises the following four steps:
a) Word segmentation: for languages such as Chinese that have no spaces between words, word segmentation must be performed first after the text portion of the HTML of a web page is extracted; for languages such as English whose words are separated by spaces, no segmentation is needed and the text portion of the HTML is extracted directly.
b) Train a language model to obtain the word vector representation of words: using legitimate web page text data as the corpus, select a neural network model to learn (i.e. train) the language model, thereby obtaining the word vector representations of words and forming the word vector table.
c) Represent the HTML text semantically with word vectors: using the word vectors in the table obtained in b), represent the HTML text of the legitimate data and the phishing data as vectors, generating text semantic features (i.e. text vectors).
d) Build the phishing detection model from the text semantic features using a machine learning algorithm.
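As an illustration of the text extraction that precedes step a), the following is a minimal sketch using only the Python standard library. The class and function names (`VisibleTextExtractor`, `extract_text`, `tokenize_space_delimited`) are hypothetical and not part of the patent; skipping `<script>` and `<style>` content is an assumption about what counts as the text portion of a page, and the tokenizer only covers space-delimited languages (Chinese text would need a separate word segmenter).

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes of an HTML page, skipping <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text portion of an HTML document as one string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize_space_delimited(text):
    """Tokenize a space-delimited language such as English (assumption: lowercase)."""
    return text.lower().split()
```

The resulting token list is what the later lookup and vector-averaging steps would consume.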
No specific machine learning algorithm is prescribed here; options include, but are not limited to, common supervised machine learning algorithms such as support vector machines, random forests, and AdaBoost.
Building the phishing detection model from text semantic features resembles the usual way of training a model with a machine learning algorithm: the text semantic features serve as the sample features, and a suitable machine learning algorithm is trained on the features and labels (phishing website or not) of the training data.
2. Phishing detection stage
The main steps of detecting a web page are: extract its text semantic features, then input them into the phishing detection model to judge whether the page is a phishing page. Text semantic feature extraction in this stage follows the same process as in the model training stage.
The above outlines the two stages of the method. In both stages the emphasis is the semantic feature representation of web page text. The invention does not limit the specific implementation: the word vectors are obtained by learning a language model with a neural network model, and no specific neural network model is prescribed; text semantic features can be represented from word vectors by averaging, weighting, and similar means. Embodiments of each are given below.
1) Obtaining word vectors
Word vectors, also known as distributed word representations, can be trained in many ways, but all of them learn a language model with a neural network model (e.g. CBOW, Skip-gram, C&W, LBL) to obtain the word vectors. The word vector table in the present invention is built as follows: construct a data set of legitimate web page text as the corpus for training word vectors; train the word vectors with an existing neural network model or a self-built neural network; and build the word vector table for the words in the corpus. Each row of the word vector table contains a word and its N-dimensional word vector (the dimension N can be set as needed); each dimension of the vector represents a latent syntactic or semantic feature of the word. Methods such as word2vec can be used to generate the word vector table.
Word vectors place functionally similar words close to one another along at least some direction of the feature space, so the similarity between words can be measured by the distance between their word vectors (Euclidean distance, cosine similarity, etc.). The words most similar to a given word can thus be computed. Listed below are the words most similar to "Construction Bank"; in each tuple the first item is the word and the second is its similarity to "Construction Bank".
(agricultural bank, 0.708540976048)
(Societe Generale, 0.65518784523)
(Construction Bank, 0.636544108391)
(Bank of Communications, 0.616162657738)
(Huaxia Bank, 0.608458161354)
(subbranch, 0.608001768589)
(industrial and commercial bank, 0.59148645401)
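A similarity ranking like the one above can be sketched as follows, assuming cosine similarity is the chosen measure (the text also allows Euclidean distance). The table `toy_table` and its three-dimensional vectors are made-up illustrative values, not vectors from the patent, and `most_similar` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, vector_table, top_n=3):
    """Rank every other word in the table by cosine similarity to the query word."""
    query_vec = vector_table[query]
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in vector_table.items()
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy word vector table (illustrative values only).
toy_table = {
    "bank_a": [0.9, 0.1, 0.0],
    "bank_b": [0.8, 0.2, 0.1],
    "branch": [0.6, 0.5, 0.1],
    "weather": [0.0, 0.1, 0.9],
}

ranking = most_similar("bank_a", toy_table)
```

Each element of `ranking` is a `(word, similarity)` tuple, the same shape as the tuples listed above.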
2) Text semantic feature representation
The text semantic features are represented as follows: look up the word vector table to obtain the word vectors of all words in the web page text, then compute the text vector in some manner. Words that are not in the word vector table can be handled in two ways:
First, use a predefined miss vector (for example, the all-zero vector) as the word vector of the word.
Second, build a high-frequency vocabulary. For a word not in the word vector table but in the high-frequency vocabulary, determine its word vector according to word frequency; for a word in neither the word vector table nor the high-frequency vocabulary, use a predefined vector as its word vector.
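The lookup with both fallbacks might be sketched as below. The function name `lookup_vector` is hypothetical. The text does not say how a vector is derived from word frequency, so the sketch assumes the high-frequency vocabulary has already been turned into a mapping from word to vector (`high_freq_vectors`) by some unspecified rule, and it uses the all-zero vector as the predefined miss vector.

```python
def lookup_vector(word, vector_table, dim, high_freq_vectors=None):
    """Return the word vector for `word`, falling back for out-of-table words.

    vector_table      : dict mapping word -> list of floats (the word vector table)
    dim               : dimension N of the word vectors
    high_freq_vectors : optional dict mapping high-frequency words to vectors;
                        how these vectors are derived from word frequency is an
                        assumption left open here (not specified in the text)
    """
    if word in vector_table:
        return vector_table[word]
    if high_freq_vectors is not None and word in high_freq_vectors:
        return high_freq_vectors[word]
    # Predefined miss vector: all zeros.
    return [0.0] * dim
```

Passing `high_freq_vectors=None` gives the first handling mode; passing a mapping gives the second.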
A text is represented as a vector from the word vector table in one of the following ways:
A) Averaging
The averaging approach treats every word in the text as having the same weight. To avoid the noise introduced by stop words, stop words are first removed from the text, and the text vector is then computed by formula (1):
$$d_i = \frac{1}{n_i} \sum_{j=1}^{n_i} w_{ij} \qquad (1)$$
where $d_i$ denotes the vector representation of the i-th text, $n_i$ denotes the number of words in the i-th text, and $w_{ij}$ denotes the word vector of the j-th word in the i-th text.
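A minimal sketch of the averaging approach follows. It assumes that the word count is taken after stop-word removal and that out-of-table words contribute an all-zero miss vector; both choices, and the name `average_text_vector`, are illustrative assumptions rather than details fixed by the text.

```python
def average_text_vector(tokens, vector_table, stop_words, dim):
    """Average the word vectors of a tokenized text after removing stop words."""
    kept = [t for t in tokens if t not in stop_words]
    if not kept:
        # Empty text (or only stop words): return the zero vector.
        return [0.0] * dim
    total = [0.0] * dim
    for token in kept:
        vec = vector_table.get(token, [0.0] * dim)  # zero miss vector
        for k in range(dim):
            total[k] += vec[k]
    return [x / len(kept) for x in total]
```

For example, with a two-dimensional table containing `online` and `bank`, the text `["the", "online", "bank"]` with stop word `the` averages just the two remaining vectors.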
B) Weighting
The weighting approach treats the words in a text as having different weights. The weight can be computed by, but is not limited to, TF-IDF (Term Frequency-Inverse Document Frequency). With TF-IDF as the word weight, the text vector is computed by formula (2):
$$d_i = \frac{1}{\sum_{j=1}^{n_i} \mathrm{tfidf}_{ij}} \sum_{j=1}^{n_i} \left( w_{ij} \times \mathrm{tfidf}_{ij} \right) \qquad (2)$$
where $d_i$, $n_i$ and $w_{ij}$ have the same meaning as in formula (1), and $\mathrm{tfidf}_{ij}$ denotes the TF-IDF value of the j-th word in the i-th text.
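The weighted representation can be sketched as follows. The text does not fix a particular TF-IDF variant, so `tfidf_weights` uses one common smoothed form, tf × (ln((1 + N) / (1 + df)) + 1), purely as an assumption; the function names and the zero miss vector are likewise illustrative.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, corpus):
    """TF-IDF weight per word of one document (assumed smoothed-IDF variant)."""
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    weights = {}
    for word, count in counts.items():
        df = sum(1 for doc in corpus if word in doc)  # document frequency
        tf = count / len(doc_tokens)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[word] = tf * idf
    return weights


def weighted_text_vector(tokens, vector_table, weights, dim):
    """Weighted mean of word vectors: sum(w_ij * tfidf_ij) / sum(tfidf_ij)."""
    numerator = [0.0] * dim
    denominator = 0.0
    for token in tokens:
        weight = weights.get(token, 0.0)
        vec = vector_table.get(token, [0.0] * dim)  # zero miss vector
        for k in range(dim):
            numerator[k] += weight * vec[k]
        denominator += weight
    if denominator == 0.0:
        return [0.0] * dim
    return [x / denominator for x in numerator]
```

Words that occur in fewer documents of the corpus receive larger weights and therefore pull the text vector further toward their own word vectors.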
A concrete application example is given below.
Suppose the text content of a web page is "Mobile banking of the Industrial and Commercial Bank of China", and the word segmentation result is "Industrial and Commercial Bank of China / mobile phone / bank". The vectors of these three words in the word vector table are as follows (for ease of description, only the first 5 dimensions are taken here):
Table 1. Word vectors (first five dimensions) of the three words obtained by segmentation
Since none of these three words is in the stop-word list, the text vector obtained by averaging is the mean of the three vectors (the same vectors that appear in the weighted computation below), i.e.:
d ≈ (-0.486120, -0.154733, -0.318332, -0.863766, 1.424367)
The text vector computed by weighting is:
d = [2.7928238 × (-0.037823, 0.361873, 0.033403, -0.252190, -0.015590)
+ 1.4973016 × (-1.876170, 0.183362, -0.304421, -0.512916, 3.008589)
+ 1.7978696 × (0.455634, -1.009433, -0.683979, -1.826192, 1.280102)]
/ (2.7928238 + 1.4973016 + 1.7978696)
≈ (-0.3442, -0.0870, -0.2615, -0.7811, 1.1108)
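The arithmetic of this example can be checked with a few lines of plain Python. The three vectors and the three weights are copied from the weighted computation above; the variable names are illustrative only.

```python
# First five dimensions of the three word vectors, in the order used above.
vectors = [
    (-0.037823, 0.361873, 0.033403, -0.252190, -0.015590),
    (-1.876170, 0.183362, -0.304421, -0.512916, 3.008589),
    (0.455634, -1.009433, -0.683979, -1.826192, 1.280102),
]
# Per-word weights from the weighted computation above.
weights = [2.7928238, 1.4973016, 1.7978696]

# Averaging: element-wise mean of the three vectors.
mean_vector = [sum(column) / len(vectors) for column in zip(*vectors)]

# Weighting: element-wise weighted mean, normalized by the sum of the weights.
weighted_vector = [
    sum(w * x for w, x in zip(weights, column)) / sum(weights)
    for column in zip(*vectors)
]
```

`zip(*vectors)` transposes the list so that each `column` holds one dimension across the three words.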
Common statistical nature is extracted, contrast test is carried out with the method for invention extraction.Respectively using average term vector, Weighting term vector, statistical nature (steal the counterfeit feature of feature, copyright, the counterfeit feature of license, domain name timeliness including what table 2 described Feature and link uniformity feature linear fusion, i.e. t1 ∪ t2 ∪ t3 ∪ t4 ∪ t5) and average term vector melt with statistical nature These four Feature Selection modes are closed, have used tetra- AdaBoost, Bagging, Random Forest, SMO machine learning respectively Algorithm carries out ten folding cross validations, and experimental result is shown in Table 3.
The statistical nature for comparison that table 2. extracts
The experimental result classified under 3. 4 kinds of machine learning algorithms of table using different characteristic
The indices in Table 3 are explained as follows:
For a binary classification problem, samples can be divided, according to the combination of their true class and the class predicted by the learner, into true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). The resulting confusion matrix is shown in Table 4:
Table 4. Confusion matrix of classification results
The following evaluation indices can be defined from the confusion matrix:
P (precision): $P = \frac{TP}{TP + FP}$
R (recall): $R = \frac{TP}{TP + FN}$
F-measure: $F_\beta = \frac{(1 + \beta^2) \times P \times R}{\beta^2 \times P + R}$ (β = 1 in the present invention)
FP Rate (false positive rate): $FPR = \frac{FP}{FP + TN}$
Error Rate: $ErrorRate = \frac{FP + FN}{TP + FP + TN + FN}$
AUC: the ROC curve is the curve formed with FPR as the x-axis and TPR as the y-axis; the area under this curve is called the AUC, where $FPR = \frac{FP}{FP + TN}$ and $TPR = \frac{TP}{TP + FN}$.
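The confusion-matrix indices can be computed with a short helper such as the sketch below, which uses the standard definitions of precision, recall, F-measure, false positive rate and error rate. The function name `classification_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def classification_metrics(tp, fp, tn, fn, beta=1.0):
    """Compute evaluation indices from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    beta_sq = beta * beta
    f_measure = ratio((1 + beta_sq) * precision * recall,
                      beta_sq * precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "fp_rate": ratio(fp, fp + tn),
        "error_rate": ratio(fp + fn, tp + fp + tn + fn),
    }
```

With the default `beta=1.0` the F-measure reduces to the harmonic mean of precision and recall, matching the β = 1 setting stated above.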
As Table 3 shows, overall, phishing detection using word vectors alone performs comparably to detection using statistical features, while fusing word vectors with statistical features performs best.
Another embodiment of the present invention provides a phishing identification device based on semantic understanding, comprising:
a text data extraction module, for extracting the text portion of the HTML of the web pages of a website to obtain the text data of the web pages;
a text semantic feature generation module, for generating text semantic features from the text data of the web pages;
a phishing detection model training module, for building a phishing detection model from the text semantic features using a machine learning algorithm;
a phishing detection module, for calling the text data extraction module and the text semantic feature generation module to extract the text semantic features of the web pages of a website to be detected, and inputting them into the phishing detection model to judge whether the website to be detected is a phishing website.
The text semantic feature generation module trains a language model with the text data of legitimate web pages as the corpus to obtain the word vectors of words, and then uses the word vectors to represent the HTML text of web pages from legitimate websites and phishing websites as vectors, generating the text semantic features.
The above embodiments use methods such as word2vec to generate the word vector table and then generate text vectors. In other embodiments, the doc2vec method can instead be trained to generate the vector of a variable-length text directly, i.e. to generate the text semantic features directly. The phishing detection model is then built from the text semantic features using a machine learning algorithm. In the phishing detection stage, the text semantic features of the web pages of the website to be detected are extracted and input into the phishing detection model to judge whether the website is a phishing website.
The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the invention; the scope of protection of the invention shall be as set out in the claims.

Claims (10)

1. A phishing identification method based on semantic understanding, characterized by comprising the following steps:
extracting the text portion of the HTML of the web pages of a website to obtain the text data of the web pages;
generating text semantic features from the text data of the web pages;
inputting the text semantic features of a website to be detected into a phishing detection model to judge whether the website to be detected is a phishing website, the phishing detection model being built from the text semantic features of websites using a machine learning algorithm.
2. The method of claim 1, characterized in that the text semantic features are generated as follows: a language model is trained with the text data of legitimate web pages as the corpus to obtain the word vectors of words; the word vectors are used to represent the HTML text of web pages from legitimate websites and phishing websites as vectors, generating the text semantic features.
3. The method of claim 2, characterized in that the language model is learned with a neural network model; a word vector table is built by training the word vectors; the word vectors of all words in a web page text are then obtained by looking up the word vector table, and the text semantic features are represented using these word vectors.
4. The method of claim 3, characterized in that words not in the word vector table are handled as follows: a) for a word not in the word vector table, a predefined miss vector is used as its word vector; b) a high-frequency vocabulary is built; for a word not in the word vector table but in the high-frequency vocabulary, its word vector is determined according to word frequency, and for a word in neither the word vector table nor the high-frequency vocabulary, a predefined vector is used as its word vector.
5. The method of claim 2, characterized in that the text semantic features are generated from the word vectors by averaging or by weighting.
6. The method of claim 5, characterized in that the averaging first removes stop words from the text and then computes the text vector using the following formula:
$$d_i = \frac{1}{n_i} \sum_{j=1}^{n_i} w_{ij},$$
where $d_i$ denotes the vector representation of the i-th text, $n_i$ denotes the number of words in the i-th text, and $w_{ij}$ denotes the word vector of the j-th word in the i-th text.
7. The method of claim 5, characterized in that the weighting computes the text vector using the following formula:
$$d_i = \frac{1}{\sum_{j=1}^{n_i} \mathrm{tfidf}_{ij}} \sum_{j=1}^{n_i} \left( w_{ij} \times \mathrm{tfidf}_{ij} \right),$$
where $d_i$ denotes the vector representation of the i-th text, $n_i$ denotes the number of words in the i-th text, $w_{ij}$ denotes the word vector of the j-th word in the i-th text, and $\mathrm{tfidf}_{ij}$ denotes the TF-IDF value of the j-th word in the i-th text.
8. The method of claim 1, characterized in that the text semantic features are generated directly using the doc2vec method.
9. A phishing identification device based on semantic understanding, characterized by comprising:
a text data extraction module, for extracting the text portion of the HTML of the web pages of a website to obtain the text data of the web pages;
a text semantic feature generation module, for generating text semantic features from the text data of the web pages;
a phishing detection model training module, for building a phishing detection model from the text semantic features using a machine learning algorithm;
a phishing detection module, for calling the text data extraction module and the text semantic feature generation module to extract the text semantic features of the web pages of a website to be detected, and inputting them into the phishing detection model to judge whether the website to be detected is a phishing website.
10. The device of claim 9, characterized in that the text semantic feature generation module trains a language model with the text data of legitimate web pages as the corpus to obtain the word vectors of words, and then uses the word vectors to represent the HTML text of web pages from legitimate websites and phishing websites as vectors, generating the text semantic features; alternatively, the text semantic feature generation module generates the text semantic features directly using the doc2vec method.
CN201711085356.XA 2017-11-07 2017-11-07 A kind of phishing recognition methods and device based on semantic understanding Pending CN108111478A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711085356.XA CN108111478A (en) 2017-11-07 2017-11-07 A kind of phishing recognition methods and device based on semantic understanding

Publications (1)

Publication Number Publication Date
CN108111478A true CN108111478A (en) 2018-06-01

Family

ID=62207455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711085356.XA Pending CN108111478A (en) 2017-11-07 2017-11-07 A kind of phishing recognition methods and device based on semantic understanding

Country Status (1)

Country Link
CN (1) CN108111478A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662959A (en) * 2012-03-07 2012-09-12 南京邮电大学 Method for detecting phishing web pages with spatial mixed index mechanism
CN103020164A (en) * 2012-11-26 2013-04-03 华北电力大学 Semantic search method based on multi-semantic analysis and personalized sequencing
CN105338001A (en) * 2015-12-04 2016-02-17 北京奇虎科技有限公司 Method and device for recognizing phishing website
CN105718577A (en) * 2016-01-22 2016-06-29 中国互联网络信息中心 Method and system for automatically detecting phishing aiming at added domain name
CN105786782A (en) * 2016-03-25 2016-07-20 北京搜狗科技发展有限公司 Word vector training method and device
CN105956472A (en) * 2016-05-12 2016-09-21 宝利九章(北京)数据技术有限公司 Method and system for identifying whether webpage includes malicious content or not
US9697828B1 (en) * 2014-06-20 2017-07-04 Amazon Technologies, Inc. Keyword detection modeling using contextual and environmental information
US20170223034A1 (en) * 2016-01-29 2017-08-03 Acalvio Technologies, Inc. Classifying an email as malicious

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846097A (en) * 2018-06-15 2018-11-20 北京搜狐新媒体信息技术有限公司 The interest tags representation method of user, article recommended method and device, equipment
CN109413028A (en) * 2018-08-29 2019-03-01 集美大学 SQL injection detection method based on convolutional neural networks algorithm
CN109413028B (en) * 2018-08-29 2021-11-30 集美大学 SQL injection detection method based on convolutional neural network algorithm
CN109462582A (en) * 2018-10-30 2019-03-12 腾讯科技(深圳)有限公司 Text recognition method, device, server and storage medium
CN111324831A (en) * 2018-12-17 2020-06-23 中国移动通信集团北京有限公司 Method and device for detecting fraudulent website
CN109905359A (en) * 2018-12-24 2019-06-18 深圳市珍爱捷云信息技术有限公司 Communication message processing method, device, computer equipment and can read access medium
CN111488622A (en) * 2019-01-25 2020-08-04 深信服科技股份有限公司 Method and device for detecting webpage tampering behavior and related components
CN110191096A (en) * 2019-04-30 2019-08-30 安徽工业大学 A kind of term vector homepage invasion detection method based on semantic analysis
CN110191096B (en) * 2019-04-30 2023-05-09 安徽工业大学 Word vector webpage intrusion detection method based on semantic analysis
US11818170B2 (en) 2019-05-14 2023-11-14 Crowdstrike, Inc. Detection of phishing campaigns based on deep learning network detection of phishing exfiltration communications
US11303674B2 (en) * 2019-05-14 2022-04-12 International Business Machines Corporation Detection of phishing campaigns based on deep learning network detection of phishing exfiltration communications
CN110572359A (en) * 2019-08-01 2019-12-13 杭州安恒信息技术股份有限公司 Phishing webpage detection method based on machine learning
CN110427627B (en) * 2019-08-02 2023-04-28 北京百度网讯科技有限公司 Task processing method and device based on semantic representation model
CN110427627A (en) * 2019-08-02 2019-11-08 北京百度网讯科技有限公司 Task processing method and device based on semantic expressiveness model
CN112347244B (en) * 2019-08-08 2023-07-25 四川大学 Yellow-based and gambling-based website detection method based on mixed feature analysis
CN112347244A (en) * 2019-08-08 2021-02-09 四川大学 Method for detecting website involved in yellow and gambling based on mixed feature analysis
CN110825998A (en) * 2019-08-09 2020-02-21 国家计算机网络与信息安全管理中心 Website identification method and readable storage medium
CN110830489A (en) * 2019-11-14 2020-02-21 国网江苏省电力有限公司苏州供电分公司 Method and system for detecting counterattack type fraud website based on content abstract representation
CN111091019A (en) * 2019-12-23 2020-05-01 支付宝(杭州)信息技术有限公司 Information prompting method, device and equipment
CN111091019B (en) * 2019-12-23 2024-03-01 支付宝(杭州)信息技术有限公司 Information prompting method, device and equipment
CN112541476B (en) * 2020-12-24 2023-09-29 西安交通大学 Malicious webpage identification method based on semantic feature extraction
CN112541476A (en) * 2020-12-24 2021-03-23 西安交通大学 Malicious webpage identification method based on semantic feature extraction
CN115051817A (en) * 2022-01-05 2022-09-13 中国互联网络信息中心 Phishing detection method and system based on multi-mode fusion features
CN115051817B (en) * 2022-01-05 2023-11-24 中国互联网络信息中心 Phishing detection method and system based on multi-mode fusion characteristics
CN116962817A (en) * 2023-09-21 2023-10-27 世优(北京)科技有限公司 Video processing method, device, electronic equipment and storage medium
CN116962817B (en) * 2023-09-21 2023-12-08 世优(北京)科技有限公司 Video processing method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108111478A (en) A kind of phishing recognition methods and device based on semantic understanding
CN104077396B (en) Method and device for detecting phishing website
CN110414219B (en) Injection attack detection method based on gated cycle unit and attention mechanism
WO2019085275A1 (en) Character string classification method and system, and character string classification device
CN104899508B (en) A kind of multistage detection method for phishing site and system
CN103530367B (en) A kind of fishing website identification system and method
US11762990B2 (en) Unstructured text classification
CN113596007B (en) Vulnerability attack detection method and device based on deep learning
CN104504335B (en) Fishing APP detection methods and system based on page feature and URL features
CN107992469A (en) A kind of fishing URL detection methods and system based on word sequence
CN109005145A (en) A kind of malice URL detection system and its method extracted based on automated characterization
CN110727766A (en) Method for detecting sensitive words
CN110830489B (en) Method and system for detecting counterattack type fraud website based on content abstract representation
CN108337255A (en) A kind of detection method for phishing site learnt based on web automatic tests and width
CN115051817B (en) Phishing detection method and system based on multi-mode fusion characteristics
CN110191096A (en) A kind of term vector homepage invasion detection method based on semantic analysis
Liu et al. An efficient multistage phishing website detection model based on the CASE feature framework: Aiming at the real web environment
CN110197389A (en) A kind of user identification method and device
Barlow et al. A novel approach to detect phishing attacks using binary visualisation and machine learning
CN111614616A (en) XSS attack automatic detection method
Opara et al. Look before You leap: Detecting phishing web pages by exploiting raw URL And HTML characteristics
CN113918936A (en) SQL injection attack detection method and device
CN115001763B (en) Phishing website attack detection method and device, electronic equipment and storage medium
CN114638984B (en) Malicious website URL detection method based on capsule network
CN114124448B (en) Cross-site script attack recognition method based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180601