CN107992469A - Phishing URL detection method and system based on word sequences - Google Patents
Phishing URL detection method and system based on word sequences
- Publication number
- CN107992469A (application CN201710952360.5A)
- Authority
- CN
- China
- Prior art keywords
- url
- word
- word sequence
- phishing
- dictionary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/02—Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
- H04L63/0227—Filtering policies
- H04L63/0236—Filtering by address, protocol, port number or service, e.g. IP-address or URL
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/1483—Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computer Security & Cryptography (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Signal Processing (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computer Networks & Wireless Communication (AREA)
- Computer Hardware Design (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a phishing URL detection method and system based on word sequences. The URL string is segmented into words and the resulting word sequence is converted to a vector representation. A deep learning model then learns the contextual information and features of the word sequence automatically, so the word-related text features of the URL need not be extracted by hand. The trained model is used to detect phishing URLs. This addresses the problems encountered by existing phishing URL detection based on word features.
Description
Technical field
The present invention relates to the field of information security, and in particular to a phishing URL detection method and system based on word sequences.
Background
Phishing URLs are a form of online fraud that obtains users' sensitive information, such as user names, passwords, and credit card details, by masquerading as a reputable legitimate website. A phishing URL typically claims to come from a popular social networking site (YouTube, Facebook, Twitter, etc.), an auction site (eBay), an online payment or e-commerce site (PayPal, Alibaba, etc.), or a network administrator (Google, Yahoo, an ISP) in order to win the victim's trust. A common trick is to embed keywords in the URL that confuse the user; for example, an attacker may use a URL of the form "login.mydomain.tld/paypal" to lure PayPal users.
At present, both in research and in commercial products, there are many phishing URL detection methods and security products. Most of them rely on manually extracting features from URL-related data, building a classification model, and classifying URLs to detect phishing URLs. Depending on the data analyzed, existing detection methods fall into two broad categories: methods based on multi-source information and methods based on the URL itself.
Methods based on multi-source information collect several kinds of data related to the URL, including Alexa rank, WHOIS information, and web page content, and train a complex model on labeled data to decide whether an unknown URL is a phishing URL. Such methods usually achieve high accuracy, but collecting this data incurs considerable resource and time overhead, so they are not suitable for real-time detection in high-speed networks.
Methods based on the URL itself analyze only the text features of the URL string to build a classification model. They are lightweight and suitable for real-time detection.
Specifically, phishing detection based on the URL itself extracts text features from the URL string and trains a classification model to detect phishing URLs. These text features can be divided into character features and word features. Character features describe the characters that make up the URL string, including its length, the vowel-to-consonant ratio, the number of digits, the number of special characters, and the entropy of the character distribution. Word features describe the semantically meaningful words contained in the URL and their frequencies, for example common words such as login and update and popular brand names such as paypal and alibaba.
Lightweight phishing detection based on the URL itself better meets the need for real-time response in high-speed networks. Character-based features, however, ignore the semantic information contained in the URL. URLs are meant to be easy for people to remember, so they are usually readable and memorable and contain several meaningful common words. Moreover, a strategy that attackers often adopt in phishing attacks is to use keywords to confuse the user.
Existing phishing URL detection methods based on word features mostly use words and their frequencies as features. They do not consider the word-sequence features contained in the URL, and their features are all hand-crafted, which has limitations. First, manual feature extraction takes considerable manpower and resources to analyze statistics and verify that the features are effective. Second, hand-crafted features are usually effective only for a certain class of data and are not robust. Third, the keywords that attackers use in phishing URLs are usually similar to those in normal URLs so as to confuse users, which reduces the detection performance of the classification model.
Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to provide a phishing URL detection method and system based on word sequences. The URL string is segmented into words and the word sequence is converted to a vector representation. A deep learning model then learns the contextual information and features of the word sequence automatically, without manual extraction of word-related text features from the URL. The trained model is used to detect phishing URLs, which solves the problems described above for existing phishing URL detection based on word features.
To achieve the above object, the present invention adopts the following technical solution:
A phishing URL detection method based on word sequences, comprising the following steps:
converting labeled URLs into word sequence vectors to serve as training data;
training a classification model with the training data;
converting an unknown URL into a word sequence vector and inputting it into the trained classification model to be labeled.
Further, converting a labeled or unknown URL into a word sequence vector comprises:
filtering out the protocol and the generic top-level domain from the labeled or unknown URL;
splitting the part that remains after filtering, and segmenting each resulting substring into words by forward maximum matching against a dictionary to obtain a word sequence;
numbering all words in the dictionary starting from 1, so that each word has a unique number, and converting the word sequence of each labeled or unknown URL into a fixed-length numeric vector.
Further, the protocols include http, https, ftp, ftps, and gopher; the generic top-level domains include com, org, net, edu, and gov.
Further, segmenting by forward maximum matching against the dictionary comprises:
checking whether the whole string is in the dictionary and, if so, performing no further segmentation;
if not, removing the last character and checking whether the remaining string is in the dictionary;
repeating this check until a word in the dictionary is matched, then removing the matched word;
continuing the above steps on the remainder of the string until the whole string has been processed;
if the string contains no word in the dictionary, splitting it into single characters.
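The forward maximum matching steps above can be sketched as follows. This is a minimal illustration rather than the patent's own pseudocode: the function name `forward_max_match` and the toy dictionary are hypothetical, and the sketch assumes that a position where no dictionary word starts is emitted as a single character before matching continues.

```python
def forward_max_match(text, dictionary):
    """Segment `text` greedily, always taking the longest dictionary prefix."""
    tokens = []
    start = 0
    while start < len(text):
        end = len(text)
        # Drop the last character until the candidate is a dictionary word
        # or only one character is left.
        while end > start + 1 and text[start:end] not in dictionary:
            end -= 1
        # Either a matched word or, as a fallback, a single character.
        tokens.append(text[start:end])
        start = end
    return tokens


# Toy dictionary for illustration only; the patent uses a large word list.
toy_dictionary = {"game", "games", "boy", "gameboy"}
print(forward_max_match("gameboy", toy_dictionary))
print(forward_max_match("gamesboy", toy_dictionary))
print(forward_max_match("xyz", toy_dictionary))
```

Storing the dictionary as a set keeps each membership check constant time on average, so the cost is dominated by the number of candidate prefixes tried.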
Further, the dictionary is the Google English word corpus published by Peter Norvig.
Further, the classification model trained with the training data is a bidirectional LSTM model based on word sequences.
Further, training the classification model with the training data comprises:
randomly dividing the training data into a training part and a validation part, and training the bidirectional LSTM model by setting parameters such as the hyperparameters and the activation function of the neural network model.
Further, the bidirectional LSTM model is a four-layer neural network comprising an embedding layer, a bidirectional LSTM layer, a dropout layer, and a sigmoid layer, and training the classification model with the training data further comprises applying dropout to the output of the bidirectional LSTM layer to prevent overfitting.
A phishing URL detection system based on word sequences, comprising:
a conversion module and a classification training model;
the conversion module is configured to convert labeled URLs into word sequence vectors to serve as training data for the classification model, and to convert an unknown URL into a word sequence vector and input it into the trained classification model to be labeled.
As described above, the method and system provided by the present invention do not require manual extraction of any features. The URL only needs to be converted to a word sequence vector representation, and a deep neural network (a bidirectional LSTM model) automatically learns the contextual information and features of the word sequence for detecting phishing URLs.
Compared with conventional phishing URL detection techniques, the invention has the following advantages:
First, there is no need to collect additional data related to the URL or to manually extract text features from it. A deep learning model learns the word-sequence context and features of the URL automatically and detects phishing URLs on that basis, which markedly reduces overhead.
In addition, by mining in depth the contextual information and features contained in the URL's word sequence, the method achieves better detection results on the same data set than machine learning models based on manually extracted word features and than a deep learning model based on character strings.
Finally, with the method and system of the present invention, a trained model running on an ordinary server reaches a single-thread prediction speed of no less than 600 URLs per second. Detection accuracy is improved while the need for real-time detection is still met.
Brief description of the drawings
Fig. 1 is a flow diagram of the phishing URL detection method based on word sequences in one embodiment of the invention.
Fig. 2 is a structural diagram of the bidirectional LSTM model used in the phishing URL detection method based on word sequences in one embodiment of the invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.
One embodiment of the present invention provides a phishing URL detection method and system based on word sequences. The main steps of the method are:
(1) Word sequence vector representation: first, the keyword sequence contained in the URL is obtained by dictionary-based matching; the word sequence is then encoded against the dictionary to obtain the vector representation of the URL's word sequence.
(2) Model training: the word sequence vectors obtained in the previous step, together with their labels, are used as training data to train a bidirectional LSTM model based on word sequences.
(3) Phishing URL detection: the trained bidirectional LSTM model based on word sequences is used to detect whether an unknown URL is a phishing URL.
The system includes a conversion module and a classification training model.
The conversion module is configured to convert labeled URLs into word sequence vector representations to serve as training data for the classification model, and to convert an unknown URL into a word sequence vector representation and input it into the trained classification model to be labeled.
The word sequence vector representation step obtains the vector representation of the URL's word sequence. It consists mainly of the following steps:
i) First, the known protocol and the generic top-level domain are filtered out of the URL. Common protocols include http, https, ftp, ftps, and gopher; there are 14 generic top-level domains, including com, org, net, edu, and gov.
ii) The remaining part is first split on symbols, and each segment is then segmented into words by forward maximum matching against a dictionary prepared in advance (see the pseudocode of Algorithm 1). The segmentation process is as follows: first check whether the whole string is in the dictionary; if it is, no further segmentation is needed. If it is not, remove the last character and check whether the remaining string is in the dictionary, repeating until a word in the dictionary is matched; then remove the matched word and continue with the remainder of the string until the whole string has been processed. If the string contains no word in the dictionary, it is split into single characters.
The dictionary used for segmentation is the Google English word corpus published by Peter Norvig (containing 333,333 English words). Other English word dictionaries are not suitable; this one was compiled by Peter Norvig from the words common in web pages and better matches the way URLs are named.
iii) Then, all words in the dictionary are numbered starting from 1, so that each word has a unique number, and the word sequence of each URL is converted into a fixed-length numeric vector.
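The filtering, splitting, and numbering in steps i) to iii) can be sketched roughly as follows. This is an illustration under stated assumptions, not the patent's implementation: the function names, the lowercasing, the rule that the top-level domain is the last label of the host, the symbol-splitting regular expression, and the reserved id for tokens missing from the dictionary are all hypothetical choices. Word segmentation by forward maximum matching is left out here and would run on each segment before encoding.

```python
import re

PROTOCOLS = ("http", "https", "ftp", "ftps", "gopher")
GENERIC_TLDS = {"com", "org", "net", "edu", "gov"}  # subset listed in the text


def strip_protocol_and_tld(url):
    """Remove a listed protocol prefix and a trailing generic TLD from the host."""
    url = url.lower()  # assumption: case is not significant for matching
    match = re.match(r"^([a-z][a-z0-9+.-]*)://", url)
    if match and match.group(1) in PROTOCOLS:
        url = url[match.end():]
    host, sep, rest = url.partition("/")
    labels = host.split(".")
    if len(labels) > 1 and labels[-1] in GENERIC_TLDS:
        labels = labels[:-1]
    return ".".join(labels) + sep + rest


def split_on_symbols(text):
    """Split on any run of non-alphanumeric characters, dropping empty parts."""
    return [segment for segment in re.split(r"[^a-z0-9]+", text) if segment]


def build_word_ids(dictionary_words):
    """Number dictionary words from 1 so that 0 stays free for padding."""
    return {word: number for number, word in enumerate(dictionary_words, start=1)}


def encode(tokens, word_ids, unknown_id):
    """Map tokens to numbers; `unknown_id` is an assumed fallback for misses."""
    return [word_ids.get(token, unknown_id) for token in tokens]


stripped = strip_protocol_and_tld("http://shen.mansell.tripod.com/games/gameboy.html")
print(stripped)
print(split_on_symbols(stripped))
```

The numbers produced by `build_word_ids` depend on the order of the word list, so they will not reproduce the example vectors in the text unless the same dictionary order is used.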
The model training step collects the vectors obtained in the previous step and uses the labeled vector set as training data to train the bidirectional LSTM model based on word sequences. The training sample set is randomly divided into a training part and a validation part (80% and 20% of all labeled data, respectively), and the bidirectional LSTM model is trained by setting the hyperparameters of the neural network model (the output dimension of each layer, etc.) and parameters such as the activation function. The deep learning model used is a four-layer neural network consisting of an embedding layer, a bidirectional LSTM layer, a dropout layer, and a sigmoid layer; dropout is applied to the output of the bidirectional LSTM layer to prevent overfitting.
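The random 80/20 division into training and validation parts could look like the sketch below. The function name `split_train_validation` and the fixed seed are assumptions made for reproducibility of the illustration; the text only specifies the proportions and that the split is random.

```python
import random


def split_train_validation(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of `samples` and cut it into training and validation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Each sample would be a (word_sequence_vector, label) pair in practice.
train_part, validation_part = split_train_validation(range(10))
print(len(train_part), len(validation_part))
```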
The phishing URL detection step handles unlabeled data, that is, unknown URLs, and detects whether they are phishing URLs. The word sequence vector of an unknown URL is input into the trained bidirectional LSTM model to be labeled: an output of 1 indicates a phishing URL, otherwise the URL is normal.
A further description follows with reference to an example. The overall flow of the phishing URL detection method based on word sequences is shown in Fig. 1, and the structure of the bidirectional LSTM model based on word sequences is shown in Fig. 2.
Take the phishing URL http://shen.mansell.tripod.com/games/gameboy.html as an example; its label is 1. A fixed-length word sequence vector representation is computed for the URL and the bidirectional LSTM model is trained; the trained model is then used to detect the unknown URL http://fly-project.net//yahoo.link/Yah/T/Y.html.
1) First, a word sequence vector representation is computed for the input URLs. The URLs are first segmented using the dictionary prepared in advance.
The words in the dictionary are then numbered, and each word sequence is represented as a fixed-length vector of length N. The value of N can be determined statistically: statistics show that more than 90 percent of URLs contain no more than 13 words, so N is set to 13. The two URLs then yield the vectors (1,4,5,6,7,11,13,0,0,0,0,0,0) and (2,19,3,9,12,8,14,0,0,0,0,0,0), respectively.
The word sequence vector representations of all URLs in the sample set are obtained in the same way. The sample set contains labeled normal URLs and labeled phishing URLs.
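Choosing N from length statistics and padding to that length can be sketched as below. The helper names are hypothetical, and two details are assumptions that the text does not state: N is taken as the smallest length covering the requested share of URLs, and sequences longer than N are truncated.

```python
import math


def choose_length(lengths, coverage_percent=90):
    """Smallest length that covers at least `coverage_percent` of the sequences."""
    ordered = sorted(lengths)
    index = math.ceil(coverage_percent * len(ordered) / 100) - 1
    return ordered[max(index, 0)]


def pad_to_length(ids, n, pad_id=0):
    """Right-pad with `pad_id` up to length n; longer inputs are cut (assumed)."""
    return (list(ids) + [pad_id] * n)[:n]


# Reproduces the zero padding of the first example vector in the text.
print(pad_to_length([1, 4, 5, 6, 7, 11, 13], 13))
```

Reserving 0 for padding is consistent with numbering dictionary words from 1.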
2) The labeled data in the word sequence vector set is input as training data into the bidirectional LSTM model based on word sequences shown in Fig. 2 for training. The word sequence vector of the URL is first input to the Embedding layer for dimensionality reduction and then to the bidirectional LSTM layer for learning; the learned result is passed to the dropout layer to prevent overfitting, and the last layer outputs the detection result through a sigmoid function. A label of 1 denotes a phishing URL and a label of 0 a normal URL. This is in effect a binary classification problem, so the model output uses a sigmoid function for 0-1 classification.
All labeled data is input into the model as training data, and the trained model is output.
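To make the data flow through the four layers concrete, the sketch below runs a forward pass only, in plain Python, with small random untrained weights. It is not the patent's model: the layer sizes, the omission of bias terms, the use of each direction's final hidden state, and the 0.5 decision threshold are all assumptions, and a practical implementation would use a deep learning framework and train the weights.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lstm_last_state(inputs, gates, hidden):
    """Run one LSTM direction over `inputs` and return the final hidden state."""
    h = [0.0] * hidden
    c = [0.0] * hidden
    for x in inputs:
        z = x + h  # concatenate current input with previous hidden state
        i = [sigmoid(v) for v in matvec(gates["i"], z)]    # input gate
        f = [sigmoid(v) for v in matvec(gates["f"], z)]    # forget gate
        o = [sigmoid(v) for v in matvec(gates["o"], z)]    # output gate
        g = [math.tanh(v) for v in matvec(gates["g"], z)]  # candidate cell
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h


def make_model(vocab_size, embed_dim=8, hidden=4, seed=0):
    """Build untrained weights; embedding row 0 is the padding id."""
    rng = random.Random(seed)

    def gate_set():
        return {name: random_matrix(rng, hidden, embed_dim + hidden) for name in "ifog"}

    return {
        "hidden": hidden,
        "embedding": random_matrix(rng, vocab_size + 1, embed_dim),
        "forward": gate_set(),
        "backward": gate_set(),
        "output": [rng.uniform(-0.1, 0.1) for _ in range(2 * hidden)],
    }


def predict_probability(model, ids):
    embedded = [model["embedding"][i] for i in ids]        # embedding layer
    fwd = lstm_last_state(embedded, model["forward"], model["hidden"])
    bwd = lstm_last_state(embedded[::-1], model["backward"], model["hidden"])
    features = fwd + bwd  # dropout acts only during training; identity here
    return sigmoid(sum(w * v for w, v in zip(model["output"], features)))


def label(model, ids, threshold=0.5):
    """Return 1 (phishing) or 0 (normal) using an assumed 0.5 threshold."""
    return 1 if predict_probability(model, ids) >= threshold else 0


model = make_model(vocab_size=20)
print(predict_probability(model, [1, 4, 5, 6, 7, 11, 13, 0, 0, 0, 0, 0, 0]))
```

Because the weights are random, the printed probability carries no meaning; the sketch only shows how a padded id vector becomes a single sigmoid output.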
3) For unlabeled data, the vector is input into the trained model, which outputs the label: an output of 1 denotes a phishing URL, otherwise the URL is normal.
Thus, as the above example shows, the method in this example does not require manual extraction of any features. The URL only needs to be converted to a word sequence vector representation, and a deep neural network (a bidirectional LSTM model) automatically learns the contextual information and features of the word sequence for detecting phishing URLs.
The main steps are:
1) Word sequence vector representation: the URLs are first segmented; the URLs here include both labeled and unknown ones. All URLs are converted to vectors, and the labeled data is then used to train the model. A sequence padding method is used to obtain a fixed-length vector representation; "fixed-length" means that the word sequence vector obtained for every URL has the same length. Sequence padding handles vectors of different lengths by converting them to the same length.
2) Model training: using the vectors obtained in the previous step, the bidirectional LSTM model is trained on the labeled training data.
3) Phishing URL detection: for an unlabeled URL, its vector representation is input into the trained bidirectional LSTM model to be labeled; a label of 1 denotes a phishing URL.
In step 1), the word sequence vector representation first yields a fixed-length vector representation of the URL string; the method trains on and analyzes this vector representation of the URL.
In step 2), the preprocessed data, with its labels, is used to train the bidirectional LSTM model based on word sequences.
In step 3), the vector representation of an unknown URL is input into the trained bidirectional LSTM model to be labeled, detecting whether it is a phishing URL.
Detecting phishing URLs with the above method makes it possible to mine in depth the contextual information and features contained in the URL's word sequence. The method performs better than machine learning models based on manually extracted word features and than a deep learning model based on character strings; the detection results on the same data set are shown in Table 1.
Moreover, the method is a lightweight phishing URL detection method. With a trained model on an ordinary server, the single-thread prediction speed is no less than 600 URLs per second, so the need for real-time detection is met while detection accuracy is improved.
Table 1. Comparison of the detection results of four detection models
Model | Precision | Recall | F1 |
---|---|---|---|
Decision tree model based on word features | 0.8803 | 0.8700 | 0.8751 |
Random forest model based on word features | 0.8981 | 0.8965 | 0.8973 |
Bidirectional LSTM model based on character strings | 0.9553 | 0.9474 | 0.9513 |
Bidirectional LSTM model based on word sequences | 0.9808 | 0.9716 | 0.9762 |
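The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the precision and recall values in Table 1; the English model labels and the function name `f1_score` are only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (model, precision, recall, reported F1) as listed in Table 1.
table_1 = [
    ("decision tree, word features", 0.8803, 0.8700, 0.8751),
    ("random forest, word features", 0.8981, 0.8965, 0.8973),
    ("bidirectional LSTM, character strings", 0.9553, 0.9474, 0.9513),
    ("bidirectional LSTM, word sequences", 0.9808, 0.9716, 0.9762),
]

for name, precision, recall, reported in table_1:
    computed = round(f1_score(precision, recall), 4)
    print(f"{name}: computed {computed}, reported {reported}")
```

Rounded to four decimal places, the recomputed values agree with the reported F1 figures.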
Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Claims (9)
1. A phishing URL detection method based on word sequences, comprising the following steps:
converting labeled URLs into word sequence vectors to serve as training data;
training a classification model with the training data;
converting an unknown URL into a word sequence vector and inputting it into the trained classification model to be labeled.
2. The phishing URL detection method based on word sequences according to claim 1, characterized in that converting a labeled or unknown URL into a word sequence vector comprises:
filtering out the protocol and the generic top-level domain from the labeled or unknown URL;
splitting the part that remains after filtering, and segmenting each resulting substring into words by forward maximum matching against a dictionary to obtain a word sequence;
numbering all words in the dictionary starting from 1, so that each word has a unique number, and converting the word sequence of each labeled or unknown URL into a fixed-length numeric vector.
3. The phishing URL detection method based on word sequences according to claim 2, characterized in that the protocols include http, https, ftp, ftps, and gopher, and the generic top-level domains include com, org, net, edu, and gov.
4. The phishing URL detection method based on word sequences according to claim 2, characterized in that segmenting by forward maximum matching against the dictionary comprises:
checking whether the whole string is in the dictionary and, if so, performing no further segmentation;
if not, removing the last character and checking whether the remaining string is in the dictionary;
repeating this check until a word in the dictionary is matched, then removing the matched word;
continuing the above steps on the remainder of the string until the whole string has been processed;
if the string contains no word in the dictionary, splitting it into single characters.
5. The phishing URL detection method based on word sequences according to claim 4, characterized in that the dictionary is the Google English word corpus published by Peter Norvig.
6. The phishing URL detection method based on word sequences according to claim 2, characterized in that the classification model trained with the training data is a bidirectional LSTM model based on word sequences.
7. The phishing URL detection method based on word sequences according to claim 1, characterized in that training the classification model with the training data comprises:
randomly dividing the training data into a training part and a validation part, and training a bidirectional LSTM model by setting parameters such as the hyperparameters and the activation function of the neural network model.
8. The phishing URL detection method based on word sequences according to claim 7, characterized in that the bidirectional LSTM model is a four-layer neural network comprising an embedding layer, a bidirectional LSTM layer, a dropout layer, and a sigmoid layer, and training the classification model with the training data further comprises applying dropout to the output of the bidirectional LSTM layer to prevent overfitting.
9. A phishing URL detection system based on word sequences, characterized by comprising:
a conversion module and a classification training model;
the conversion module is configured to convert labeled URLs into word sequence vectors to serve as training data for the classification model, and to convert an unknown URL into a word sequence vector and input it into the trained classification model to be labeled.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710952360.5A CN107992469A (en) | 2017-10-13 | 2017-10-13 | Phishing URL detection method and system based on word sequences |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710952360.5A CN107992469A (en) | 2017-10-13 | 2017-10-13 | Phishing URL detection method and system based on word sequences |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107992469A true CN107992469A (en) | 2018-05-04 |
Family
ID=62028932
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710952360.5A Pending CN107992469A (en) | 2017-10-13 | 2017-10-13 | Phishing URL detection method and system based on word sequences |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107992469A (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108920463A (en) * | 2018-06-29 | 2018-11-30 | 北京奇虎科技有限公司 | A word segmentation method and system based on network attacks |
CN109101552A (en) * | 2018-07-10 | 2018-12-28 | 东南大学 | Phishing website URL detection method based on deep learning |
CN109101552B (en) * | 2018-07-10 | 2022-01-28 | 东南大学 | Phishing website URL detection method based on deep learning |
CN109450845A (en) * | 2018-09-18 | 2019-03-08 | 浙江大学 | Detection method for algorithm-generated malicious domain names based on deep neural network |
CN109450853A (en) * | 2018-10-11 | 2019-03-08 | 深圳市腾讯计算机系统有限公司 | Malicious website determination method and device, terminal and server |
CN109450853B (en) * | 2018-10-11 | 2022-02-18 | 深圳市腾讯计算机系统有限公司 | Malicious website determination method and device, terminal and server |
CN111125563A (en) * | 2018-10-31 | 2020-05-08 | 安碁资讯股份有限公司 | Method for evaluating domain name and server thereof |
CN109391706A (en) * | 2018-11-07 | 2019-02-26 | 顺丰科技有限公司 | Domain name detection method, device, equipment and storage medium based on deep learning |
CN109522454A (en) * | 2018-11-20 | 2019-03-26 | 四川长虹电器股份有限公司 | Method for automatically generating web sample data |
CN109561084A (en) * | 2018-11-20 | 2019-04-02 | 四川长虹电器股份有限公司 | URL parameter outlier detection method based on LSTM autoencoder network |
CN111447169B (en) * | 2019-01-17 | 2021-06-08 | 中国科学院信息工程研究所 | Method and system for identifying malicious webpage in real time on gateway |
CN111447169A (en) * | 2019-01-17 | 2020-07-24 | 中国科学院信息工程研究所 | Method and system for identifying malicious webpage in real time on gateway |
CN110493088A (en) * | 2019-09-24 | 2019-11-22 | 国家计算机网络与信息安全管理中心 | A URL-based mobile Internet traffic classification method |
CN114650152A (en) * | 2020-12-17 | 2022-06-21 | 中国科学院计算机网络信息中心 | Method and system for detecting vulnerability of super computing center |
CN114650152B (en) * | 2020-12-17 | 2023-06-20 | 中国科学院计算机网络信息中心 | Super computing center vulnerability detection method and system |
CN112948725A (en) * | 2021-03-02 | 2021-06-11 | 北京六方云信息技术有限公司 | Phishing website URL detection method and system based on machine learning |
CN113051500A (en) * | 2021-03-25 | 2021-06-29 | 武汉大学 | Phishing website identification method and system fusing multi-source data |
CN113051500B (en) * | 2021-03-25 | 2022-08-16 | 武汉大学 | Phishing website identification method and system fusing multi-source data |
CN116633684A (en) * | 2023-07-19 | 2023-08-22 | 中移(苏州)软件技术有限公司 | Phishing detection method, system, electronic device and readable storage medium |
CN116633684B (en) * | 2023-07-19 | 2023-10-13 | 中移(苏州)软件技术有限公司 | Phishing detection method, system, electronic device and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107992469A (en) | A phishing URL detection method and system based on word sequence | |
CN104899508B (en) | A multi-stage phishing site detection method and system | |
CN109005145A (en) | A malicious URL detection system and method based on automated feature extraction | |
CN103559235B (en) | A method for detecting and identifying malicious web pages in online social networks | |
CN101820366B (en) | Pre-fetching-based phishing web page detection method | |
CN109450845B (en) | Detection method for algorithm-generated malicious domain names based on deep neural network | |
CN107786575A (en) | An adaptive malicious domain name detection method based on DNS traffic | |
CN106940732A (en) | A method for discovering suspected Internet water army accounts on microblogs | |
CN103500175B (en) | A method for online detection of microblog hot events based on sentiment analysis | |
CN103313248B (en) | Method and device for identifying junk information | |
CN105072214B (en) | C&C domain name recognition method based on domain name features | |
CN109413028A (en) | SQL injection detection method based on convolutional neural network algorithm | |
CN107566376A (en) | A threat intelligence generation method, apparatus and system | |
CN103136358B (en) | A method for automatically extracting forum data | |
CN103577755A (en) | Malicious script static detection method based on SVM (support vector machine) | |
CN109657470A (en) | Malicious web page detection model training method, and malicious web page detection method and system | |
CN107566391A (en) | Method for detecting hidden links (dark chains) in web pages using a machine learning model built from domain identification plus topic identification | |
CN106874253A (en) | Method and device for recognizing sensitive information | |
CN103577556A (en) | Device and method for obtaining association degree of question and answer pair | |
CN110134876A (en) | A cyberspace mass disturbance perception and detection method based on crowd-intelligence sensors | |
CN110830489B (en) | Method and system for detecting adversarial fraud websites based on abstract content representation | |
CN113422761B (en) | Malicious social user detection method based on adversarial learning | |
CN107463844A (en) | Web Trojan detection method and system | |
CN109714356A (en) | A method, device and electronic equipment for recognizing abnormal domain names | |
CN107819790A (en) | Method and device for recognizing attack messages |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180504 |