CN108111478A - Phishing recognition method and device based on semantic understanding - Google Patents
Phishing recognition method and device based on semantic understanding
- Publication number
- CN108111478A (application CN201711085356.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- word
- term vector
- semantic feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/1483—Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
Abstract
The present invention relates to a phishing recognition method and device based on semantic understanding. The method comprises: extracting the text portion of the HTML of the web pages in a website to obtain the text data of the web pages; generating text semantic features from the text data; and inputting the text semantic features of a website under test into a phishing detection model to judge whether that website is a phishing website. The phishing detection model is built from the text semantic features of websites using a machine learning algorithm. The method uses the text data of legitimate web pages as a corpus to train a language model and obtain word vectors (term vectors), then uses those word vectors to represent the HTML text of web pages from legitimate and phishing websites as vectors, which serve as the text semantic features. By extracting a series of features from the perspective of web page text semantics, the invention can build a more robust phishing detection model and improve phishing recognition.
Description
Technical field
The invention belongs to the field of network technology, and in particular relates to a phishing recognition method and device based on semantic understanding.
Background
The term "phishing" dates from 1996 and derives from the word "fishing". In a phishing attack, the attacker sends bait (for example, emails or SMS messages) to a large number of users, expecting a few of them to "take the bait", and thereby achieves the goal of "fishing" (for example, stealing users' private information). The Anti-Phishing Working Group (APWG) defines phishing as a form of network attack that uses social engineering and technical means to steal consumers' personal identity data and financial account credentials. Phishing attacks that rely on social engineering typically send users deceptive emails or SMS messages that appear to come from a legitimate company or institution, luring them to reply with sensitive personal information or to click an embedded link and visit a forged website, where they disclose credentials (such as user names and passwords) or download malware. Phishing seriously threatens the property and privacy of Internet users and has become one of the biggest security risks on the Internet today.
Phishing is essentially brand counterfeiting: to pass for the genuine site, a phishing website closely resembles the brand website both visually and semantically. Phishing detection based on machine learning is a current research focus, and the choice of statistical features determines how effective the model is. However, existing statistical features mainly cover visual similarity, information-stealing features and third-party features, and neglect the semantic features of web pages.
Deep learning has made major progress in image recognition and speech recognition in recent years, and has also achieved very good results on many natural language understanding tasks, notably topic classification, sentiment analysis, question answering and machine translation. An important task in natural language processing is representing words and texts as vectors. Training a language model with deep learning techniques yields word vectors that carry semantic and syntactic information, such that the similarity between vectors is related to the semantic similarity between words.
Summary of the invention
To better characterize the counterfeit nature of phishing websites, the present invention proposes a phishing recognition method and device based on semantic understanding. It extracts a series of features from the perspective of web page text semantics in order to capture phishing characteristics not yet covered by existing research, build a more robust phishing detection model, and improve phishing recognition.
The technical solution adopted by the present invention is as follows:
A phishing recognition method based on semantic understanding comprises the following steps:
extracting the text portion of the HTML of the web pages in a website to obtain the text data of the web pages;
generating text semantic features from the text data of the web pages;
inputting the text semantic features of a website under test into a phishing detection model to judge whether the website under test is a phishing website, the phishing detection model being built from the text semantic features of websites using a machine learning algorithm.
Further, the text semantic features are generated as follows: the text data of legitimate web pages is used as a corpus to train a language model and obtain the word vectors of words; the word vectors are then used to represent the HTML text of web pages from legitimate and phishing websites as vectors, generating the text semantic features.
Further, the language model is learned with a neural network model; a word vector table is built by training word vectors; the word vectors of all words in the web page text are then obtained by looking up the word vector table and are used to represent the text semantic features.
Further, words that are not in the word vector table are handled in one of two ways: a) a predefined miss vector is used as the word vector of the word; b) a high-frequency vocabulary is built; for a word that is not in the word vector table but is in the high-frequency vocabulary, its word vector is determined according to its word frequency; for a word that is in neither the word vector table nor the high-frequency vocabulary, a predefined vector is used as its word vector.
Further, the text semantic features are generated from the word vectors by averaging or by weighting.
Further, as an alternative, the text semantic features are generated directly using the doc2vec method.
A phishing recognition device based on semantic understanding comprises:
a text data extraction module, for extracting the text portion of the HTML of the web pages in a website to obtain the text data of the web pages;
a text semantic feature generation module, for generating text semantic features from the text data of the web pages;
a phishing detection model training module, for building a phishing detection model from the text semantic features using a machine learning algorithm;
a phishing detection module, for calling the text data extraction module and the text semantic feature generation module to extract the text semantic features of the web pages in a website under test, and inputting them into the phishing detection model to judge whether the website under test is a phishing website.
Further, the text semantic feature generation module uses the text data of legitimate web pages as a corpus to train a language model and obtain the word vectors of words, then uses the word vectors to represent the HTML text of web pages from legitimate and phishing websites as vectors, generating the text semantic features; alternatively, the text semantic feature generation module generates the text semantic features directly using the doc2vec method.
Compared with the prior art, the beneficial effects of the present invention are as follows:
1. It captures, from a semantic angle, phishing characteristics not yet covered by existing research, makes up for a shortcoming of existing machine-learning-based phishing recognition techniques, and improves the robustness of the detection model.
2. Representing text semantic features with word vectors is fast and convenient. Once the word vector table has been trained from the corpus, the text semantic features of any subsequent web page are obtained by table lookup followed by a simple calculation.
3. It can handle the reuse of one phishing template across multiple brands. Because word vectors place functionally similar words close to each other along at least some direction of the vector space, the invention is well suited to cases in which similar phishing templates are used to counterfeit different brands.
4. It can effectively improve the precision and recall of phishing detection, and is suitable for real-world Internet detection environments.
Description of the drawings
Fig. 1 is a flow chart of phishing detection model training.
Fig. 2 is a flow chart of phishing detection.
Detailed description
To make the above objects, features and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
To win users' trust, a phishing website usually looks like a legitimate one, and this similarity shows in visual elements such as the URL, logo, login form and copyright notice. Mainstream research detects phishing by mining visual similarity, information-stealing features, third-party features and the like. In essence, however, a phishing website depends heavily on counterfeiting the text content of the page to lure users into entering sensitive information; that is, semantic counterfeiting is a key characteristic of phishing websites, and existing research lacks analysis of it. The present invention therefore mines the semantic similarity of phishing websites to improve detection performance, and introduces word vector representations into phishing detection to better capture the counterfeit nature of phishing websites.
The phishing detection method based on semantic understanding proposed by the present invention uses word vectors to represent text semantic features and thereby detects phishing web pages. The training process and the detection process of the detection model are shown in Fig. 1 and Fig. 2 respectively, and mainly comprise the following steps:
1. Detection model training stage
The training of the phishing detection model mainly comprises the following four steps:
a) Word segmentation: for languages such as Chinese, in which words are not separated by spaces, the text portion extracted from the HTML of a web page must first be segmented into words; for languages such as English, in which words are separated by spaces, no segmentation is needed and the text portion of the HTML is extracted directly.
b) Train a language model to obtain word vectors: using the text data of legitimate web pages as the corpus, a neural network model is chosen to learn (that is, train) the language model, yielding a word vector for each word and forming the word vector table.
c) Represent the HTML text semantically using word vectors: using the word vectors in the table obtained in b), the HTML text of the legitimate data and of the phishing data is represented as vectors, generating the text semantic features (that is, text vectors).
d) Build the phishing detection model from the text semantic features using a machine learning algorithm.
The machine learning algorithm is not restricted here; it includes, but is not limited to, common supervised learning algorithms such as support vector machines, random forests and AdaBoost.
Building the phishing detection model from text semantic features follows the usual procedure for training a model with a machine learning algorithm: the text semantic features serve as the sample features, and the features and labels (whether or not the site is a phishing website) of the training data are used with a suitable machine learning algorithm to train the phishing detection model.
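Step a) of the training stage can be illustrated with a minimal sketch for space-delimited languages. It is not the patent's implementation: the class name `VisibleTextExtractor`, the function `extract_words` and the sample page are hypothetical, and skipping `<script>` and `<style>` content is an assumption about what counts as the text portion of a page. For Chinese text, a word segmenter would have to replace the whitespace split.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the content of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_words(html):
    """Return the whitespace-separated words of the page's text portion."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split()


# Hypothetical sample page, used only to exercise the sketch.
page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Mobile Banking</h1><p>Please log in</p>"
    "<script>var x = 1;</script></body></html>"
)
words = extract_words(page)
print(words)
```

The resulting word list is the input to steps b) and c): it is added to the corpus when the page is legitimate, and it is looked up in the word vector table when features are generated.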
2. Phishing detection stage
The main steps of detecting a web page under test are to extract its text semantic features and then input them into the phishing detection model to judge whether the page is a phishing page. Text semantic feature extraction in this stage is similar to that in the model training stage.
The above outlines the two stages of the method. The key element in both stages is the text semantic feature representation of web pages. The present invention does not limit the specific implementation: the word vectors are obtained by learning a language model with a neural network model, and the specific neural network model is not limited; the text semantic feature representation based on word vectors can be realized by averaging, weighting and similar means. Embodiments of each are given below.
1) Obtaining word vectors
Word vectors, also known as distributed word representations, can be trained in many ways, but all of them learn a language model with a neural network model (for example CBOW, Skip-gram, C&W or LBL) to obtain the word vectors. The word vector table in the present invention is built as follows: a data set of legitimate web page text is constructed and used as the corpus for training word vectors; the word vectors are trained with an existing neural network model or a self-built neural network, producing a word vector table for the words in the corpus. Each row of the table contains a word and its N-dimensional word vector (the dimension N can be configured as needed); each dimension of the vector represents a latent syntactic or semantic feature of the word. Methods such as word2vec can be used to generate the word vector table.
Word vectors place functionally similar words close to each other along at least some direction of the feature space, so the similarity between words can be measured by the distance between their word vectors (Euclidean distance, cosine similarity and so on). The words most similar to a given word can thus be computed. The words most similar to "Construction Bank" are listed below, where the first item of each tuple is the word and the second is its similarity to "Construction Bank".
(agricultural bank, 0.708540976048)
(Societe Generale, 0.65518784523)
(Construction Bank, 0.636544108391)
(Bank of Communications, 0.616162657738)
(Huaxia Bank, 0.608458161354)
(subbranch, 0.608001768589)
(industrial and commercial bank, 0.59148645401)
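A similarity ranking of this kind can be sketched with cosine similarity over a word vector table. The table below is a hypothetical three-dimensional toy, not the patent's trained vectors, and the function names are illustrative; a real table would come from a trained model such as word2vec.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, table, top_n=3):
    """Rank every other word in the table by similarity to `word`."""
    query = table[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in table.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy word vector table (word -> vector).
toy_table = {
    "bank_a": [0.9, 0.1, 0.0],
    "bank_b": [0.8, 0.2, 0.1],
    "branch": [0.6, 0.5, 0.1],
    "weather": [0.0, 0.1, 0.9],
}
neighbours = most_similar("bank_a", toy_table)
for name, score in neighbours:
    print(name, round(score, 3))
```

In this toy table the two bank entries point in nearly the same direction, so they rank as nearest neighbours, while the unrelated word ranks last.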
2) Text semantic feature representation
The text semantic features are represented as follows: the word vectors of all words in the web page text are obtained by looking up the word vector table, and a text vector is computed from them. Words that are not in the word vector table are handled in one of two ways:
First, a predefined miss vector (for example, the all-zero vector) is used as the word vector of the word.
Second, a high-frequency vocabulary is built. For a word that is not in the word vector table but is in the high-frequency vocabulary, its word vector is determined according to its word frequency; for a word that is in neither the word vector table nor the high-frequency vocabulary, a predefined vector is used as its word vector.
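The two ways of handling out-of-table words can be sketched as two lookup functions. The text does not say how word frequency determines the vector, so the frequency buckets below are purely an assumption made for illustration; all names, vectors and frequencies are hypothetical.

```python
DIM = 4
MISS_VECTOR = [0.0] * DIM  # predefined miss vector (all zeros, as in the text)

# Hypothetical word vector table.
word_vectors = {
    "bank": [0.4, -1.0, -0.6, 1.2],
    "mobile": [-1.8, 0.1, -0.3, 3.0],
}

# Hypothetical high-frequency vocabulary: word -> corpus frequency.
high_freq_vocab = {"login": 5200, "verify": 310}

# ASSUMPTION: one predefined vector per frequency bucket. The source only says
# the vector is "determined according to word frequency".
FREQ_BUCKETS = [
    (1000, [0.1, 0.1, 0.1, 0.1]),   # frequency >= 1000
    (0, [0.01, 0.01, 0.01, 0.01]),  # any other word in the vocabulary
]


def lookup_simple(word):
    """First option: fall back to the miss vector."""
    return word_vectors.get(word, MISS_VECTOR)


def lookup_with_high_freq(word):
    """Second option: consult the high-frequency vocabulary before falling back."""
    if word in word_vectors:
        return word_vectors[word]
    if word in high_freq_vocab:
        freq = high_freq_vocab[word]
        for threshold, vector in FREQ_BUCKETS:
            if freq >= threshold:
                return vector
    return MISS_VECTOR


print(lookup_simple("unknown"))
print(lookup_with_high_freq("login"))
```

The second option costs one extra dictionary lookup per unknown word and lets frequent but untrained words contribute something other than the miss vector.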
A text is represented as a vector from the word vectors in the table in one of the following ways:
a) Averaging
Averaging treats every word in the text as having the same weight. To avoid the noise introduced by stop words, stop words are first removed from the text, and the text vector is then computed using formula (1):
$$d_i = \frac{1}{n_i} \sum_{j=1}^{n_i} w_{ij} \qquad (1)$$
where $d_i$ denotes the vector representation of the i-th text, $n_i$ the number of words in the i-th text, and $w_{ij}$ the word vector of the j-th word in the i-th text.
b) Weighting
Weighting treats the words in a text as having different weights. The weights can be computed by, among other methods, TF-IDF (Term Frequency-Inverse Document Frequency). With TF-IDF as the word weight, the text vector is computed using formula (2):
$$d_i = \frac{1}{\sum_{j=1}^{n_i} \mathit{tfidf}_{ij}} \sum_{j=1}^{n_i} \left( w_{ij} \times \mathit{tfidf}_{ij} \right) \qquad (2)$$
where $d_i$, $n_i$ and $w_{ij}$ have the same meanings as in formula (1), and $\mathit{tfidf}_{ij}$ denotes the TF-IDF value of the j-th word in the i-th text.
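The TF-IDF weights themselves can be computed as in the sketch below. The text does not fix a TF-IDF variant, so this uses one common definition as an assumption: term frequency is the word count divided by the text length, and inverse document frequency is the natural logarithm of the number of texts divided by the number of texts containing the word. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {word: tf-idf} dictionary per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document

    weights = []
    for doc in documents:
        if not doc:
            weights.append({})
            continue
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical toy corpus of two tokenized texts.
corpus = [["bank", "login", "bank"], ["weather", "login"]]
weights = tfidf_weights(corpus)
print(weights[0])
```

Under this variant a word that occurs in every text gets weight zero, so it drops out of the weighted sum entirely; smoothed IDF variants avoid that if it is undesirable.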
A concrete application example is given below.
Suppose the text content of a web page is "Industrial and Commercial Bank of China mobile banking", and the word segmentation result is "Industrial and Commercial Bank of China / mobile phone / bank". The vectors of these three words in the word vector table (only the first five dimensions are shown, for ease of description) are given in Table 1.
Table 1. Word vectors (first five dimensions) of the three words obtained by segmentation
Since none of these three words is in the stop-word list, the text vector obtained by averaging is the mean of the three vectors, that is:
d = [(-0.037823, 0.361873, 0.033403, -0.252190, -0.015590)
+ (-1.876170, 0.183362, -0.304421, -0.512916, 3.008589)
+ (0.455634, -1.009433, -0.683979, -1.826192, 1.280102)] / 3
≈ (-0.4861, -0.1547, -0.3183, -0.8638, 1.4244)
The text vector computed by weighting is:
d = [2.7928238 × (-0.037823, 0.361873, 0.033403, -0.252190, -0.015590)
+ 1.4973016 × (-1.876170, 0.183362, -0.304421, -0.512916, 3.008589)
+ 1.7978696 × (0.455634, -1.009433, -0.683979, -1.826192, 1.280102)]
/ (2.7928238 + 1.4973016 + 1.7978696)
≈ (-0.3442, -0.0870, -0.2615, -0.7811, 1.1108)
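The arithmetic of this example can be checked with a short script. It takes the three five-dimensional vectors and the three TF-IDF weights exactly as they appear in the weighted calculation above, and it assumes they are listed in the same order as the segmented words; the function names are illustrative.

```python
# Vectors and TF-IDF weights as quoted in the worked example.
vectors = [
    (-0.037823, 0.361873, 0.033403, -0.252190, -0.015590),
    (-1.876170, 0.183362, -0.304421, -0.512916, 3.008589),
    (0.455634, -1.009433, -0.683979, -1.826192, 1.280102),
]
tfidf = [2.7928238, 1.4973016, 1.7978696]


def mean_vector(vecs):
    """Formula (1): component-wise mean of the word vectors."""
    n = len(vecs)
    return [sum(column) / n for column in zip(*vecs)]


def weighted_vector(vecs, weights):
    """Formula (2): weighted sum of word vectors divided by the sum of weights."""
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, column)) / total
        for column in zip(*vecs)
    ]


d_avg = mean_vector(vectors)
d_weighted = weighted_vector(vectors, tfidf)
print([round(x, 4) for x in d_avg])
print([round(x, 4) for x in d_weighted])
```

Because all three weights are positive, each component of the weighted result lies between the smallest and largest corresponding components of the three input vectors.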
Common statistical features were extracted for a comparison test against the features extracted by the method of the invention. Four feature selection schemes were compared: averaged word vectors, weighted word vectors, statistical features (the linear fusion of the information-stealing feature, copyright counterfeit feature, license counterfeit feature, domain name timeliness feature and link consistency feature described in Table 2, that is, t1 ∪ t2 ∪ t3 ∪ t4 ∪ t5), and averaged word vectors fused with statistical features. Each scheme was evaluated with four machine learning algorithms (AdaBoost, Bagging, Random Forest and SMO) using ten-fold cross-validation. The experimental results are shown in Table 3.
Table 2. Statistical features extracted for comparison
Table 3. Experimental results of classification with different features under the four machine learning algorithms
The indices in Table 3 are explained as follows:
For a binary classification problem, samples can be divided, according to the combination of their true class and the class predicted by the learner, into true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). The resulting confusion matrix is shown in Table 4:
Table 4. Confusion matrix of classification results
True class | Predicted positive | Predicted negative |
---|---|---|
Positive | TP | FN |
Negative | FP | TN |
The following evaluation indices are defined from the confusion matrix:
P (precision): $P = \frac{TP}{TP + FP}$
R (recall): $R = \frac{TP}{TP + FN}$
F-measure: $F_\beta = \frac{(1 + \beta^2) \times P \times R}{\beta^2 \times P + R}$ (β = 1 in the present invention)
FP Rate (false positive rate): $FP\ Rate = \frac{FP}{FP + TN}$
Error Rate: $Error\ Rate = \frac{FP + FN}{TP + FP + TN + FN}$
AUC: the ROC curve is the curve with FPR on the x-axis and TPR on the y-axis, and the area under this curve is called the AUC, where $FPR = \frac{FP}{FP + TN}$ and $TPR = \frac{TP}{TP + FN}$.
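The indices that depend only on the confusion matrix can be computed as in the sketch below, which uses the standard definitions of precision, recall, F-measure, false positive rate and error rate. The counts are hypothetical and are not the patent's experimental results. AUC is omitted because it requires the classifier's scores rather than a single confusion matrix.

```python
def classification_metrics(tp, fp, tn, fn, beta=1.0):
    """Compute the confusion-matrix indices; undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    f_measure = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    total = tp + fp + tn + fn
    error_rate = (fp + fn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "fp_rate": fp_rate,
        "error_rate": error_rate,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With β = 1 the F-measure reduces to the harmonic mean of precision and recall, which equals 2TP / (2TP + FP + FN).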
As Table 3 shows, phishing detection using word vectors alone performs on par overall with detection using statistical features, while fusing word vectors with statistical features performs best.
Another embodiment of the present invention provides a phishing recognition device based on semantic understanding, comprising:
a text data extraction module, for extracting the text portion of the HTML of the web pages in a website to obtain the text data of the web pages;
a text semantic feature generation module, for generating text semantic features from the text data of the web pages;
a phishing detection model training module, for building a phishing detection model from the text semantic features using a machine learning algorithm;
a phishing detection module, for calling the text data extraction module and the text semantic feature generation module to extract the text semantic features of the web pages in a website under test, and inputting them into the phishing detection model to judge whether the website under test is a phishing website.
The text semantic feature generation module uses the text data of legitimate web pages as a corpus to train a language model and obtain the word vectors of words, then uses the word vectors to represent the HTML text of web pages from legitimate and phishing websites as vectors, generating the text semantic features.
The above embodiment uses methods such as word2vec to generate the word vector table and then the text vectors. In other embodiments, doc2vec can be trained to directly generate a vector for a text of arbitrary length, that is, to directly generate the text semantic features. The phishing detection model is then built from these features using a machine learning algorithm. In the phishing detection stage, the text semantic features of the web pages in the website under test are extracted and input into the phishing detection model to judge whether the website is a phishing website.
The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the present invention; the scope of protection of the invention is defined by the claims.
Claims (10)
1. A phishing recognition method based on semantic understanding, characterized by comprising the following steps:
extracting the text portion of the HTML of the web pages in a website to obtain the text data of the web pages;
generating text semantic features from the text data of the web pages;
inputting the text semantic features of a website under test into a phishing detection model to judge whether the website under test is a phishing website; the phishing detection model being built from the text semantic features of websites using a machine learning algorithm.
2. The method of claim 1, characterized in that the text semantic features are generated as follows: the text data of legitimate web pages is used as a corpus to train a language model and obtain the word vectors of words; the word vectors are used to represent the HTML text of web pages from legitimate and phishing websites as vectors, generating the text semantic features.
3. The method of claim 2, characterized in that the language model is learned with a neural network model, a word vector table is built by training word vectors, the word vectors of all words in the web page text are then obtained by looking up the word vector table, and the text semantic features are represented using the word vectors.
4. The method of claim 3, characterized in that words not in the word vector table are handled as follows: a) a predefined miss vector is used as the word vector of the word; or b) a high-frequency vocabulary is built; for a word that is not in the word vector table but is in the high-frequency vocabulary, its word vector is determined according to its word frequency; for a word that is in neither the word vector table nor the high-frequency vocabulary, a predefined vector is used as its word vector.
5. The method of claim 2, characterized in that the text semantic features are generated from the word vectors by averaging or by weighting.
6. The method of claim 5, characterized in that the averaging approach first removes stop words from the text and then computes the text vector using the following formula:
$$d_i = \frac{1}{n_i} \sum_{j=1}^{n_i} w_{ij},$$
where $d_i$ denotes the vector representation of the i-th text; $n_i$ denotes the number of words in the i-th text; $w_{ij}$ denotes the word vector of the j-th word in the i-th text.
7. The method of claim 5, characterized in that the weighting approach computes the text vector using the following formula:
$$d_i = \frac{1}{\sum_{j=1}^{n_i} \mathit{tfidf}_{ij}} \sum_{j=1}^{n_i} \left( w_{ij} \times \mathit{tfidf}_{ij} \right),$$
where $d_i$ denotes the vector representation of the i-th text; $n_i$ denotes the number of words in the i-th text; $w_{ij}$ denotes the word vector of the j-th word in the i-th text; $\mathit{tfidf}_{ij}$ denotes the TF-IDF value of the j-th word in the i-th text.
8. The method of claim 1, characterized in that the text semantic features are generated directly using the doc2vec method.
9. A phishing recognition device based on semantic understanding, characterized by comprising:
a text data extraction module, for extracting the text portion of the HTML of the web pages in a website to obtain the text data of the web pages;
a text semantic feature generation module, for generating text semantic features from the text data of the web pages;
a phishing detection model training module, for building a phishing detection model from the text semantic features using a machine learning algorithm;
a phishing detection module, for calling the text data extraction module and the text semantic feature generation module to extract the text semantic features of the web pages in a website under test, and inputting them into the phishing detection model to judge whether the website under test is a phishing website.
10. The device of claim 9, characterized in that the text semantic feature generation module uses the text data of legitimate web pages as a corpus to train a language model and obtain the word vectors of words, then uses the word vectors to represent the HTML text of web pages from legitimate and phishing websites as vectors, generating the text semantic features; alternatively, the text semantic feature generation module generates the text semantic features directly using the doc2vec method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711085356.XA CN108111478A (en) | 2017-11-07 | 2017-11-07 | Phishing recognition method and device based on semantic understanding |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711085356.XA CN108111478A (en) | 2017-11-07 | 2017-11-07 | Phishing recognition method and device based on semantic understanding |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108111478A (en) | 2018-06-01 |
Family
ID=62207455
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711085356.XA Pending CN108111478A (en) | Phishing recognition method and device based on semantic understanding | 2017-11-07 | 2017-11-07 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108111478A (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108846097A (en) * | 2018-06-15 | 2018-11-20 | 北京搜狐新媒体信息技术有限公司 | The interest tags representation method of user, article recommended method and device, equipment |
CN109413028A (en) * | 2018-08-29 | 2019-03-01 | 集美大学 | SQL injection detection method based on convolutional neural networks algorithm |
CN109462582A (en) * | 2018-10-30 | 2019-03-12 | 腾讯科技(深圳)有限公司 | Text recognition method, device, server and storage medium |
CN109905359A (en) * | 2018-12-24 | 2019-06-18 | 深圳市珍爱捷云信息技术有限公司 | Communication message processing method, device, computer equipment and can read access medium |
CN110191096A (en) * | 2019-04-30 | 2019-08-30 | 安徽工业大学 | A kind of term vector homepage invasion detection method based on semantic analysis |
CN110427627A (en) * | 2019-08-02 | 2019-11-08 | 北京百度网讯科技有限公司 | Task processing method and device based on semantic expressiveness model |
CN110572359A (en) * | 2019-08-01 | 2019-12-13 | 杭州安恒信息技术股份有限公司 | Phishing webpage detection method based on machine learning |
CN110825998A (en) * | 2019-08-09 | 2020-02-21 | 国家计算机网络与信息安全管理中心 | Website identification method and readable storage medium |
CN110830489A (en) * | 2019-11-14 | 2020-02-21 | 国网江苏省电力有限公司苏州供电分公司 | Method and system for detecting counterattack type fraud website based on content abstract representation |
CN111091019A (en) * | 2019-12-23 | 2020-05-01 | 支付宝(杭州)信息技术有限公司 | Information prompting method, device and equipment |
CN111324831A (en) * | 2018-12-17 | 2020-06-23 | 中国移动通信集团北京有限公司 | Method and device for detecting fraudulent website |
CN111488622A (en) * | 2019-01-25 | 2020-08-04 | 深信服科技股份有限公司 | Method and device for detecting webpage tampering behavior and related components |
CN112347244A (en) * | 2019-08-08 | 2021-02-09 | 四川大学 | Method for detecting website involved in yellow and gambling based on mixed feature analysis |
CN112541476A (en) * | 2020-12-24 | 2021-03-23 | 西安交通大学 | Malicious webpage identification method based on semantic feature extraction |
US11303674B2 (en) * | 2019-05-14 | 2022-04-12 | International Business Machines Corporation | Detection of phishing campaigns based on deep learning network detection of phishing exfiltration communications |
CN115051817A (en) * | 2022-01-05 | 2022-09-13 | 中国互联网络信息中心 | Phishing detection method and system based on multi-mode fusion features |
CN116962817A (en) * | 2023-09-21 | 2023-10-27 | 世优(北京)科技有限公司 | Video processing method, device, electronic equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662959A (en) * | 2012-03-07 | 2012-09-12 | 南京邮电大学 | Method for detecting phishing web pages with spatial mixed index mechanism |
CN103020164A (en) * | 2012-11-26 | 2013-04-03 | 华北电力大学 | Semantic search method based on multi-semantic analysis and personalized sequencing |
CN105338001A (en) * | 2015-12-04 | 2016-02-17 | 北京奇虎科技有限公司 | Method and device for recognizing phishing website |
CN105718577A (en) * | 2016-01-22 | 2016-06-29 | 中国互联网络信息中心 | Method and system for automatically detecting phishing aiming at added domain name |
CN105786782A (en) * | 2016-03-25 | 2016-07-20 | 北京搜狗科技发展有限公司 | Word vector training method and device |
CN105956472A (en) * | 2016-05-12 | 2016-09-21 | 宝利九章(北京)数据技术有限公司 | Method and system for identifying whether webpage includes malicious content or not |
US9697828B1 (en) * | 2014-06-20 | 2017-07-04 | Amazon Technologies, Inc. | Keyword detection modeling using contextual and environmental information |
US20170223034A1 (en) * | 2016-01-29 | 2017-08-03 | Acalvio Technologies, Inc. | Classifying an email as malicious |
- 2017-11-07: Application CN201711085356.XA filed in CN (publication CN108111478A); status: Pending
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662959A (en) * | 2012-03-07 | 2012-09-12 | 南京邮电大学 | Method for detecting phishing web pages with spatial mixed index mechanism |
CN103020164A (en) * | 2012-11-26 | 2013-04-03 | 华北电力大学 | Semantic search method based on multi-semantic analysis and personalized sequencing |
US9697828B1 (en) * | 2014-06-20 | 2017-07-04 | Amazon Technologies, Inc. | Keyword detection modeling using contextual and environmental information |
CN105338001A (en) * | 2015-12-04 | 2016-02-17 | 北京奇虎科技有限公司 | Method and device for recognizing phishing website |
CN105718577A (en) * | 2016-01-22 | 2016-06-29 | 中国互联网络信息中心 | Method and system for automatically detecting phishing aiming at added domain name |
US20170223034A1 (en) * | 2016-01-29 | 2017-08-03 | Acalvio Technologies, Inc. | Classifying an email as malicious |
CN105786782A (en) * | 2016-03-25 | 2016-07-20 | 北京搜狗科技发展有限公司 | Word vector training method and device |
CN105956472A (en) * | 2016-05-12 | 2016-09-21 | 宝利九章(北京)数据技术有限公司 | Method and system for identifying whether webpage includes malicious content or not |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108846097A (en) * | 2018-06-15 | 2018-11-20 | 北京搜狐新媒体信息技术有限公司 | The interest tags representation method of user, article recommended method and device, equipment |
CN109413028A (en) * | 2018-08-29 | 2019-03-01 | 集美大学 | SQL injection detection method based on convolutional neural networks algorithm |
CN109413028B (en) * | 2018-08-29 | 2021-11-30 | 集美大学 | SQL injection detection method based on convolutional neural network algorithm |
CN109462582A (en) * | 2018-10-30 | 2019-03-12 | 腾讯科技(深圳)有限公司 | Text recognition method, device, server and storage medium |
CN111324831A (en) * | 2018-12-17 | 2020-06-23 | 中国移动通信集团北京有限公司 | Method and device for detecting fraudulent website |
CN109905359A (en) * | 2018-12-24 | 2019-06-18 | 深圳市珍爱捷云信息技术有限公司 | Communication message processing method, device, computer equipment and can read access medium |
CN111488622A (en) * | 2019-01-25 | 2020-08-04 | 深信服科技股份有限公司 | Method and device for detecting webpage tampering behavior and related components |
CN110191096A (en) * | 2019-04-30 | 2019-08-30 | 安徽工业大学 | A kind of term vector homepage invasion detection method based on semantic analysis |
CN110191096B (en) * | 2019-04-30 | 2023-05-09 | 安徽工业大学 | Word vector webpage intrusion detection method based on semantic analysis |
US11818170B2 (en) | 2019-05-14 | 2023-11-14 | Crowdstrike, Inc. | Detection of phishing campaigns based on deep learning network detection of phishing exfiltration communications |
US11303674B2 (en) * | 2019-05-14 | 2022-04-12 | International Business Machines Corporation | Detection of phishing campaigns based on deep learning network detection of phishing exfiltration communications |
CN110572359A (en) * | 2019-08-01 | 2019-12-13 | 杭州安恒信息技术股份有限公司 | Phishing webpage detection method based on machine learning |
CN110427627B (en) * | 2019-08-02 | 2023-04-28 | 北京百度网讯科技有限公司 | Task processing method and device based on semantic representation model |
CN110427627A (en) * | 2019-08-02 | 2019-11-08 | 北京百度网讯科技有限公司 | Task processing method and device based on semantic expressiveness model |
CN112347244B (en) * | 2019-08-08 | 2023-07-25 | 四川大学 | Yellow-based and gambling-based website detection method based on mixed feature analysis |
CN112347244A (en) * | 2019-08-08 | 2021-02-09 | 四川大学 | Method for detecting website involved in yellow and gambling based on mixed feature analysis |
CN110825998A (en) * | 2019-08-09 | 2020-02-21 | 国家计算机网络与信息安全管理中心 | Website identification method and readable storage medium |
CN110830489A (en) * | 2019-11-14 | 2020-02-21 | 国网江苏省电力有限公司苏州供电分公司 | Method and system for detecting counterattack type fraud website based on content abstract representation |
CN111091019A (en) * | 2019-12-23 | 2020-05-01 | 支付宝(杭州)信息技术有限公司 | Information prompting method, device and equipment |
CN111091019B (en) * | 2019-12-23 | 2024-03-01 | 支付宝(杭州)信息技术有限公司 | Information prompting method, device and equipment |
CN112541476B (en) * | 2020-12-24 | 2023-09-29 | 西安交通大学 | Malicious webpage identification method based on semantic feature extraction |
CN112541476A (en) * | 2020-12-24 | 2021-03-23 | 西安交通大学 | Malicious webpage identification method based on semantic feature extraction |
CN115051817A (en) * | 2022-01-05 | 2022-09-13 | 中国互联网络信息中心 | Phishing detection method and system based on multi-mode fusion features |
CN115051817B (en) * | 2022-01-05 | 2023-11-24 | 中国互联网络信息中心 | Phishing detection method and system based on multi-mode fusion characteristics |
CN116962817A (en) * | 2023-09-21 | 2023-10-27 | 世优(北京)科技有限公司 | Video processing method, device, electronic equipment and storage medium |
CN116962817B (en) * | 2023-09-21 | 2023-12-08 | 世优(北京)科技有限公司 | Video processing method, device, electronic equipment and storage medium |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN108111478A (en) | A kind of phishing recognition methods and device based on semantic understanding | |
CN104077396B (en) | Method and device for detecting phishing website | |
CN110414219B (en) | Injection attack detection method based on gated cycle unit and attention mechanism | |
WO2019085275A1 (en) | Character string classification method and system, and character string classification device | |
CN104899508B (en) | A kind of multistage detection method for phishing site and system | |
CN103530367B (en) | A kind of fishing website identification system and method | |
US11762990B2 (en) | Unstructured text classification | |
CN113596007B (en) | Vulnerability attack detection method and device based on deep learning | |
CN104504335B (en) | Fishing APP detection methods and system based on page feature and URL features | |
CN107992469A (en) | A kind of fishing URL detection methods and system based on word sequence | |
CN109005145A (en) | A kind of malice URL detection system and its method extracted based on automated characterization | |
CN110727766A (en) | Method for detecting sensitive words | |
CN110830489B (en) | Method and system for detecting counterattack type fraud website based on content abstract representation | |
CN108337255A (en) | A kind of detection method for phishing site learnt based on web automatic tests and width | |
CN115051817B (en) | Phishing detection method and system based on multi-mode fusion characteristics | |
CN110191096A (en) | A kind of term vector homepage invasion detection method based on semantic analysis | |
Liu et al. | An efficient multistage phishing website detection model based on the CASE feature framework: Aiming at the real web environment | |
CN110197389A (en) | A kind of user identification method and device | |
Barlow et al. | A novel approach to detect phishing attacks using binary visualisation and machine learning | |
CN111614616A (en) | XSS attack automatic detection method | |
Opara et al. | Look before You leap: Detecting phishing web pages by exploiting raw URL And HTML characteristics | |
CN113918936A (en) | SQL injection attack detection method and device | |
CN115001763B (en) | Phishing website attack detection method and device, electronic equipment and storage medium | |
CN114638984B (en) | Malicious website URL detection method based on capsule network | |
CN114124448B (en) | Cross-site script attack recognition method based on machine learning |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180601 |