CN109284465B - URL-based web page classifier construction method and classification method thereof - Google Patents

URL-based web page classifier construction method and classification method thereof

Info

Publication number
CN109284465B
Authority
CN
China
Prior art keywords
training sample
url
word
web page
sample set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811025751.3A
Other languages
Chinese (zh)
Other versions
CN109284465A (en)
Inventor
孙玉霞
赵晶晶
仇之
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University
Priority to CN201811025751.3A
Publication of CN109284465A
Application granted
Publication of CN109284465B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a URL-based web page classifier construction method and a classification method that uses it. To build the classifier, the URLs of a plurality of web pages are first obtained, each URL is labeled with its web page attribute, and each labeled URL is used as a training sample to form a training sample set. Each training sample in the set is segmented into words at selected characters, and the words are converted into word vectors. A convolutional neural network is then trained with the word vectors of the labeled training samples as input to obtain the web page classifier. To classify a web page, its URL is first acquired as a test sample; the test sample is segmented into words at the selected characters and converted into word vectors; the word vectors of the test sample are then input into the constructed web page classifier, which outputs the classification result. The method greatly improves the classification accuracy for malicious web pages.

Description

URL-based web page classifier construction method and classification method thereof
Technical Field
The invention relates to the technical field of information security, in particular to a webpage classifier construction method based on a Uniform Resource Locator (URL) and a classification method thereof.
Background
The openness and virtual nature of the internet pose serious challenges to privacy, data and transaction security, and in recent years the use of malicious web pages for cybercrime has become rampant. According to statistics, nearly one third of web pages are potentially malicious. Malicious web pages attack users through spam, phishing and similar means, so that users with no security awareness suffer various kinds of harm, including financial loss and theft of private information, which seriously threatens their property and information security. How to identify malicious web pages promptly and effectively has therefore become an important and urgent problem.
In the prior art, a web page is generally identified as malicious by inspecting its content or its behavior. Content-based identification has to inspect the page's text and images, malicious code fragments, behavior records in server or proxy logs, and so on, and therefore cannot avoid the difficulty that page content is changeable and can be encrypted or replaced with equivalent content. Behavior-based identification has to face the problem that the dynamic behavior of a web page is difficult to trigger and track.
Disclosure of Invention
The first purpose of the present invention is to overcome the disadvantages and shortcomings of the prior art, and to provide a method for constructing a web page classifier based on a Uniform Resource Locator (URL), where the web page classifier constructed by the method greatly improves the classification accuracy of malicious web pages.
The second purpose of the present invention is to provide a URL-based web page classification method implemented by the classifier constructed as above.
The first purpose of the invention is realized by the following technical scheme: a web page classifier construction method based on URL includes the following steps:
step S1, obtaining the URLs of a plurality of web pages, labeling each URL with its web page attribute, and forming a training sample set by taking each labeled URL as a training sample;
step S2, for each training sample in the training sample set, performing word segmentation on the training sample at selected characters, and then converting the resulting words into word vectors;
step S3, training a convolutional neural network with the word vectors of the labeled training samples in the training sample set as input, to obtain the web page classifier.
Preferably, in step S1, the URLs of the web pages are obtained from a benign and malicious URL repository, and the training sample set includes a certain number of URLs whose web page attribute is malicious and a certain number of URLs whose web page attribute is benign.
Preferably, in step S2, the selected characters include "?", "&", "-" and "#".
Preferably, in step S2, Word2vec is used to convert the word segmentation result of each training sample into word vectors.
Further, in step S2, when the word vectors are obtained with Word2vec, the following parameters are set: the word embedding dimension embedding-size, the context window size window, and the minimum word frequency min_count.
Preferably, the convolutional neural network is constructed to include, from input to output, a first part, a second part, a third part, a fourth part, and a fifth part in this order; wherein:
the first part is an input layer and is used for inputting word vectors of all training samples;
the second part comprises a first convolution layer, a first pooling layer, a second convolution layer and a second pooling layer in sequence from input to output and is used for extracting context semantics of various degrees; the first convolution layer and the second convolution layer respectively comprise convolution kernels with three sizes, and the first convolution layer and the second convolution layer are the same in size;
the third part is a vector merging layer and is used for merging the convolution results of the convolution kernels of the second part into a feature vector;
the fourth part is a fully connected part comprising a first fully connected layer and a second fully connected layer; the first fully connected layer applies Dropout to the feature vector, and the second fully connected layer uses a classifier to obtain the highest-scoring category for the feature vector;
and the fifth part is an output layer and is used for outputting a classification result.
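The second and third parts can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions, not the patented network: it runs a single convolution and pooling stage per kernel size (the text describes two), and the kernel sizes (2, 3, 4), the ReLU activation, the pooling width and the all-ones weights are hypothetical choices made only so the arithmetic is easy to follow.

```python
def conv1d(seq, kernel):
    """Valid 1D convolution of a sequence of vectors with one kernel.

    seq is a list of T token vectors; kernel is a list of k weight vectors.
    Returns T - k + 1 activations.
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        total = sum(
            w * x
            for w_row, x_row in zip(kernel, seq[start:start + k])
            for w, x in zip(w_row, x_row)
        )
        out.append(max(0.0, total))  # ReLU, an assumed activation
    return out


def max_pool(values, size):
    """Non-overlapping max pooling; a trailing partial window is kept."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]


def merged_features(seq, kernels_by_size, pool_size=2):
    """One conv + pool branch per kernel size, merged into one feature vector."""
    merged = []
    for k in sorted(kernels_by_size):
        merged.extend(max_pool(conv1d(seq, kernels_by_size[k]), pool_size))
    return merged


# Toy input: 6 tokens with 2-dimensional vectors (hypothetical values).
seq = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 1], [1, 2]]
# Hypothetical kernels of three sizes, all weights set to 1.
kernels = {k: [[1, 1]] * k for k in (2, 3, 4)}
features = merged_features(seq, kernels)
print(features)
```

Each kernel size sees a different amount of context around a token, which is one way to read "context semantics of various degrees"; the merge step simply concatenates the pooled outputs of the three branches.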
Preferably, after the training sample set is obtained in step S1, deduplication is performed on the training sample set as follows: an initial value of N is selected and the first N characters of each training sample in the training sample set are obtained; among URLs in the training sample set that share the same first N characters, only one is kept after deduplication; it is then judged whether the total number of training samples in the training sample set is less than or equal to a threshold, and if not, the value of N is reduced and the same processing is repeated until the total number of training samples is less than or equal to the threshold. For the final training sample set obtained after deduplication, each training sample is segmented into words at the selected characters and then converted into word vectors.
Furthermore, N is an integer of 20-30.
The second purpose of the invention is realized by the following technical scheme: a webpage classification method based on URL includes the following steps:
step X1, aiming at the webpage needing to be classified, firstly, acquiring the URL of the webpage as a test sample; then, carrying out word segmentation processing on the test sample through the selected characters, and finally converting the test sample into word vectors;
and step X2, inputting the word vectors of the test samples into the web page classifier constructed by the first objective method of the invention, and outputting the classification result through the web page classifier.
Preferably, in step X1, each test sample is segmented into words at the selected characters "?", "&", "-" and "#";
in the step X1, Word2vec is used to convert the result after the Word segmentation processing for each test sample into a Word vector.
Compared with the prior art, the invention has the following advantages and effects:
(1) In the URL-based web page classifier construction method of the invention, the URLs of a plurality of web pages are first obtained, each URL is labeled with its web page attribute, and the labeled URLs form a training sample set; each training sample is segmented into words at selected characters and the words are converted into word vectors; a convolutional neural network model is constructed and trained with the word vectors of the labeled training samples as input to obtain the web page classifier. The method thus builds the web page classifier by training a convolutional neural network on the lexical features of web page URLs. Because the URL of a web page is static and does not change, the classification result of the classifier is not affected by the content or the dynamic behavior of the web page, and the classification accuracy for malicious web pages can be greatly improved. Compared with prior-art web page detection methods, the method has the advantages of simple operation, high recall, a low false positive rate and a low false negative rate.
(2) In the URL-based web page classifier construction method of the invention, each training sample is segmented into words at selected characters. A URL is the unique address of a piece of information on the network and consists of three parts: the resource type, the domain name of the host where the resource is located, and the file name of the resource. Its basic format runs from protocol://username:password@subdomain. through to the parameters, their values and a # flag. The three parts are separated by "/", the host name and the domain name are separated by ".", and the common separators for passing parameters are "?", "&" and "-". In general, phishing web pages manipulate the domain name and host name and carry out malicious behavior that relies on domain-name obfuscation, such as XSS cross-site attacks and SQL injection. The method of the invention uses the six separators "/", ".", "?", "&", "-" and "#" to cut the URL, so that the important information in the URL can be extracted, which further improves the classification accuracy of the constructed web page classifier.
(3) In the method for constructing the webpage classifier based on the URL, a convolutional neural network is constructed from input to output and sequentially comprises a first part, a second part, a third part, a fourth part and a fifth part; the structures of all parts are specially set based on the vocabulary characteristics, so the web page classifier obtained by constructing the convolutional neural network training is more targeted for web page classification.
(4) The URL-based web page classifier construction method of the invention includes deduplication of the training sample set: a value N is selected and the first N characters of each training sample in the training sample set are obtained; among URLs in the training sample set that share the same first N characters, only one is kept after deduplication; it is then judged whether the number of training samples in the training sample set is less than or equal to a threshold, and if not, the value of N is reduced and the same processing is repeated until the number of training samples is less than or equal to the threshold. This deduplication greatly reduces the number of repeated training samples in the training sample set and improves the quality of the training sample set. The resulting training sample set therefore reduces computational complexity and speeds up construction of the web page classifier while preserving the classification accuracy of the constructed classifier.
(5) The invention relates to a webpage classification method based on URL, aiming at the webpage needing classification, firstly acquiring the URL of the webpage as a test sample; performing word segmentation processing on each test sample through the selected characters, and finally converting the word into a word vector; and inputting the word vectors of the test samples into the web page classifier constructed by the method, and outputting a classification result through the web page classifier. Compared with other classification methods in the prior art, the webpage classification method has the advantage of higher malicious webpage detection rate.
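The URL anatomy described in advantage (2) above can be inspected with Python's standard library. This is only an illustrative sketch: the URL below is hypothetical and was chosen so that every component (protocol, user name, password, host, port, path, parameters and fragment) is present.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical URL that exercises every component of the basic format.
url = "https://user:secret@shop.example.com:8080/items/list.html?id=7&sort=asc#top"
parts = urlsplit(url)

print(parts.scheme)                      # protocol
print(parts.username, parts.password)    # credentials before "@"
print(parts.hostname)                    # sub-domain and domain, separated by "."
print(parts.port)
print(parts.path)                        # directories and file name, separated by "/"
print(parse_qsl(parts.query))            # parameters introduced by "?" and joined by "&"
print(parts.fragment)                    # the part after "#"
```

The separators that delimit these components are the same characters the method uses to cut a URL into words, which is why the resulting tokens line up with meaningful parts of the address.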
Drawings
FIG. 1 is a flowchart of a method for constructing a URL-based web page classifier according to the present invention.
FIG. 2 is a diagram of a convolutional neural network model constructed in accordance with the present invention.
Fig. 3 is a process of deduplication processing of a training sample set in embodiment 2 of the present invention.
Fig. 4 is a storage format of a training sample set in embodiment 2 of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example 1
The invention discloses a method for constructing a webpage classifier based on URL (Uniform resource locator), which comprises the following steps as shown in figure 1:
step S1, obtaining URLs of a plurality of webpages, marking the webpage attributes aiming at the URLs, and forming a training sample set by taking the URLs with marked webpage attributes as training samples; in this embodiment, URLs of a plurality of web pages are obtained from a good and malicious URL repository, and the training sample set includes URLs whose web page attributes are malicious and URLs whose web page attributes are good.
Step S2, for each training sample in the training sample set, performing word segmentation processing on each training sample by using the selected character, and then converting the training sample into a word vector.
In this embodiment, the characters selected in this step include "?", "&", "-" and "#" (the example below is also split at "/" and "."); that is, each training sample is segmented into words at these characters. For example, suppose a training sample corresponds to the URL:
tudu-free.blogspot.com/2008/02/jogos-java-aplicativos.html#footer-wrap2;
then, after word segmentation at the selected characters, the result is: 'tudu', 'free', 'blogspot', 'com', '2008', '02', 'jogos', 'java', 'aplicativos', 'html', 'footer', 'wrap2'.
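The segmentation above can be reproduced with a short regular-expression split. This is a minimal sketch, not the inventors' code: the delimiter set is an assumption that combines the four characters named in the text with "/" and ".", which the worked example also splits on, and the names `DELIMITERS` and `tokenize_url` are hypothetical.

```python
import re

# Assumed delimiter set: "?", "&", "-", "#" from the text, plus "/" and "."
# because the worked example is split at those characters as well.
DELIMITERS = "/.?&-#"
_SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + "]")


def tokenize_url(url: str) -> list[str]:
    """Split a URL at the selected characters and drop empty tokens."""
    return [token for token in _SPLIT_RE.split(url) if token]


tokens = tokenize_url(
    "tudu-free.blogspot.com/2008/02/jogos-java-aplicativos.html#footer-wrap2"
)
print(tokens)
```

Dropping empty tokens matters for URLs in which two delimiters are adjacent, such as "//" after the protocol.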
In this embodiment, Word2vec is used to convert the word segmentation result of each training sample into word vectors. When the word vectors are obtained with Word2vec, the following parameters are set: the word embedding dimension embedding-size, the context window size window, and the minimum word frequency min_count. In this embodiment, embedding-size, window and min_count are set to 128, 5 and 4, respectively.
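To make the roles of min_count and window concrete, the sketch below shows what each parameter controls on a toy corpus. It is a simplified illustration in plain Python and not Word2vec's actual training procedure: it only filters rare tokens and enumerates (center, context) pairs, and the corpus and function names are hypothetical. The third parameter, embedding-size, is simply the length of each learned vector (128 in the embodiment).

```python
from collections import Counter


def build_vocab(tokenized_urls, min_count):
    """Keep only tokens that occur at least min_count times in the corpus."""
    counts = Counter(token for tokens in tokenized_urls for token in tokens)
    return {token for token, n in counts.items() if n >= min_count}


def context_pairs(tokens, vocab, window):
    """Yield (center, context) pairs at most `window` positions apart."""
    kept = [t for t in tokens if t in vocab]  # rare tokens are dropped first
    for i, center in enumerate(kept):
        lo, hi = max(0, i - window), min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, kept[j]


# Toy corpus (hypothetical); the embodiment uses min_count=4 and window=5.
corpus = [
    ["blogspot", "com", "2008"],
    ["blogspot", "com", "login"],
    ["blogspot", "com"],
]
vocab = build_vocab(corpus, min_count=2)
pairs = list(context_pairs(corpus[0], vocab, window=1))
print(sorted(vocab), pairs)
```

A larger min_count removes one-off tokens such as random path fragments, while a larger window lets tokens that sit further apart in the URL influence each other's vectors.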
Step S3, training a convolutional neural network with the word vectors of the labeled training samples in the training sample set as input, to obtain the web page classifier.
In this embodiment, the convolutional neural network is constructed as shown in fig. 2; specifically, from input to output it comprises, in order, a first part, a second part, a third part, a fourth part and a fifth part; wherein:
the first part is an input layer and is used for inputting word vectors of all training samples;
the second part comprises a first convolution layer (convolution layer 1), a first pooling layer (pooling layer 1), a second convolution layer (convolution layer 2) and a second pooling layer (pooling layer 2) in sequence from input to output and is used for extracting context semantics of various degrees; the first convolution layer and the second convolution layer both comprise convolution kernels with three sizes, and the first convolution layer and the second convolution layer are the same in size.
And the third part is a vector merging layer and is used for merging the convolution results of the convolution kernels of the second part into a feature vector.
The fourth part is a fully connected part comprising a first fully connected layer (fully connected layer 1) and a second fully connected layer (fully connected layer 2); the first fully connected layer applies Dropout to the feature vector (in this embodiment, Dropout is set to 0.3), and the second fully connected layer uses a softmax classifier to obtain the highest-scoring category for the feature vector;
and the fifth part is an output layer and is used for outputting the classification result.
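The fourth and fifth parts can be sketched in plain Python as follows. This is an illustration under assumptions rather than the patented implementation: the value 0.3 is read here as the probability of dropping a feature (it could equally be a keep probability in some frameworks), inverted dropout is an assumed variant, and the weights, biases and feature values are made up.

```python
import math
import random


def dropout(features, rate, rng, training=True):
    """Inverted dropout: zero each feature with probability `rate` when training."""
    if not training:
        return list(features)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]


def softmax(scores):
    """Numerically stable softmax over class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, weights, biases):
    """Linear layer followed by softmax; returns (best class index, probabilities)."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__), probs


# Hypothetical merged feature vector and a two-class layer (0 = benign, 1 = malicious).
features = [0.5, 1.0, 0.0]
weights = [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0]]
biases = [0.0, 0.0]

rng = random.Random(0)
train_features = dropout(features, rate=0.3, rng=rng)                 # training time
label, probs = predict(dropout(features, 0.3, rng, training=False),  # inference time
                       weights, biases)
print(label, probs)
```

Dropout is only active during training; at classification time the features pass through unchanged and the output layer reports the class with the highest softmax score.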
The network layer parameters of the convolutional neural network model constructed in this embodiment are shown in table 1 below:
TABLE 1
[Table 1 is provided as an image in the original publication (Figure BDA0001788485730000071).]
Therefore, the method of this embodiment builds the web page classifier by training a convolutional neural network on the lexical features of web page URLs. Because the URL of a web page is static and does not change, the classification result of the classifier is not easily affected by the content or the dynamic behavior of the web page, and the classification accuracy for malicious web pages can be greatly improved. In addition, compared with prior-art web page detection methods, the method has the advantages of simple operation, high recall, a low false positive rate and a low false negative rate.
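The recall and error rates named above are conventionally computed from confusion-matrix counts. The sketch below assumes that malicious pages are the positive class and reads the two "false report" rates as the false positive rate and the false negative rate; the counts are made up for illustration and are not results from the patent.

```python
def detection_rates(tp, fp, tn, fn):
    """Return (recall, false positive rate, false negative rate).

    tp/fn count malicious URLs that were caught/missed;
    fp/tn count benign URLs that were flagged/passed.
    """
    recall = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    false_negative_rate = fn / (fn + tp)
    return recall, false_positive_rate, false_negative_rate


# Made-up counts for illustration only.
recall, fpr, fnr = detection_rates(tp=90, fp=5, tn=95, fn=10)
print(recall, fpr, fnr)
```

Recall and the false negative rate always sum to one, so a detector is better when recall is high and both error rates are low.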
The embodiment also discloses a webpage classification method based on the URL, which comprises the following steps:
step X1, aiming at the webpage needing to be classified, firstly, acquiring the URL of the webpage as a test sample; then, carrying out word segmentation processing on the test sample through the selected characters, and finally converting the test sample into word vectors;
in this embodiment, each test sample is segmented into words at the selected characters "?", "&", "-" and "#", and Word2vec is used to convert the word segmentation result of the test sample into word vectors. When the word vectors are obtained with Word2vec, the following parameters are set: the word embedding dimension embedding-size, the context window size window, and the minimum word frequency min_count. In this embodiment, embedding-size, window and min_count are set to 128, 5 and 4, respectively.
Step X2, inputting the word vectors of the test samples into the web page classifier constructed by the method of this embodiment, and outputting the classification result through the web page classifier.
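Steps X1 and X2 can be strung together as a small pipeline. The sketch below is hypothetical throughout: the embedding table, the fixed input length, the zero-padding of unknown or missing tokens, and the stand-in classifier are assumptions made for illustration, and the delimiter set again adds "/" and "." to the four characters named in the text.

```python
import re

EMBED_DIM = 4    # the embodiment uses 128; kept small here for readability
MAX_TOKENS = 6   # hypothetical fixed input length for the network


def url_to_matrix(url, embeddings, max_tokens=MAX_TOKENS, dim=EMBED_DIM):
    """Step X1: tokenize a URL and look up one vector per token, zero-padded."""
    tokens = [t for t in re.split(r"[/.?&\-#]", url) if t]
    zero = [0.0] * dim
    rows = [embeddings.get(t, zero) for t in tokens[:max_tokens]]
    rows += [zero] * (max_tokens - len(rows))
    return rows


def classify_url(url, embeddings, classifier):
    """Step X2: feed the word-vector matrix to a trained classifier."""
    return classifier(url_to_matrix(url, embeddings))


# Hypothetical embedding table and a stand-in for the trained network.
embeddings = {"login": [1.0, 0.0, 0.0, 0.0], "example": [0.0, 1.0, 0.0, 0.0]}


def stub_classifier(matrix):
    return "malicious" if any(row[0] > 0.5 for row in matrix) else "benign"


print(classify_url("example.com/login", embeddings, stub_classifier))
```

In a real deployment the stand-in would be replaced by the trained convolutional network, and the same tokenizer and embedding table used during training would have to be reused at classification time.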
Example 2
The embodiment discloses a method for constructing a web page classifier based on a URL, which is different from the method for constructing a web page classifier based on a URL in embodiment 1 only as follows:
in this embodiment, after the training sample set is obtained in step S1, the method further includes a step of deduplicating the training sample set, as shown in fig. 3, specifically as follows: an initial value of N is selected and the first N characters of each training sample in the training sample set are obtained; among URLs in the training sample set that share the same first N characters, only one is kept after deduplication; it is then judged whether the total number of training samples in the training sample set is less than or equal to a threshold, and if not, the value of N is reduced and the same processing is repeated until the total number of training samples is less than or equal to the threshold. For the final training sample set obtained after deduplication, each training sample is segmented into words at the selected characters and then converted into word vectors. In this embodiment, N is an integer of 20-30.
In this embodiment, N is first given an initial value, for example 30. URLs in the training sample set that share the same first N characters are found, and for each group of such URLs only one is kept after deduplication. After this processing, if the total number of samples in the training sample set is still greater than the threshold, the value of N is decreased by a fixed amount each time, for example 5, and the same processing is repeated until the number of samples in the training sample set is less than or equal to the threshold; in this embodiment the threshold may be set to 100,000. For example, let the total number of training samples in the training sample set be C and the initial value of N be 30. If the training sample set contains x1 URLs whose first 30 characters are a1, a2, a3, … a30 and y1 URLs whose first 30 characters are b1, b2, b3, … b30, deduplication proceeds as follows: x1-1 of the URLs whose first 30 characters are a1, a2, a3, … a30 are deleted so that only one remains, and y1-1 of the URLs whose first 30 characters are b1, b2, b3, … b30 are deleted so that only one remains. After deduplication, it is judged whether the total number C of samples in the training sample set is less than or equal to 100,000; if not, N is reduced to 25. Next, if the training sample set contains x2 URLs whose first 25 characters are c1, c2, c3, … c25 and y2 URLs whose first 25 characters are d1, d2, d3, … d25, deduplication is performed again: x2-1 of the URLs whose first 25 characters are c1, c2, c3, … c25 are deleted so that only one remains, and y2-1 of the URLs whose first 25 characters are d1, d2, d3, … d25 are deleted so that only one remains. After deduplication, it is again judged whether the total number C of samples in the training sample set is less than or equal to 100,000; if not, N is reduced to 20. These steps are repeated until the total number C of samples in the training sample set is less than or equal to 100,000.
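The deduplication loop described above can be sketched in a few lines of Python. The function names are hypothetical, and stopping once N would fall below a minimum value is an added assumption (the text only says the steps repeat until the threshold is met); the defaults mirror the embodiment's starting value of 30, step of 5 and lower bound of 20.

```python
def dedup_by_prefix(urls, n):
    """Keep the first URL seen for each distinct n-character prefix."""
    seen = set()
    kept = []
    for url in urls:
        prefix = url[:n]
        if prefix not in seen:
            seen.add(prefix)
            kept.append(url)
    return kept


def shrink_training_set(urls, threshold, n_start=30, n_min=20, step=5):
    """Repeat prefix deduplication with a shrinking N until the set is small enough.

    Stops early if N cannot be reduced further (an assumption, see above).
    Returns the remaining samples and the last N that was used.
    """
    n = n_start
    samples = dedup_by_prefix(urls, n)
    while len(samples) > threshold and n - step >= n_min:
        n -= step
        samples = dedup_by_prefix(samples, n)
    return samples, n


# Toy data with small N values so the effect is visible (hypothetical URLs).
urls = ["example.com/a/1", "example.com/a/2", "example.com/b/1", "other.org/x"]
samples, final_n = shrink_training_set(urls, threshold=2, n_start=14, n_min=4, step=5)
print(samples, final_n)
```

With N = 14 the two "example.com/a/" URLs collapse into one, leaving three samples; reducing N to 9 then merges all "example.c" URLs, which brings the set down to the threshold of two.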
In this step of this embodiment, the URLs of 200000 web pages are obtained from a benign and malicious URL repository. The first N characters of each URL are extracted with the LEFT function in Excel, deduplication is performed on those first N characters, and the value of N is adjusted continually, that is, gradually reduced. Invalid data is then deleted by manual screening, finally leaving about 50000 URLs of benign and malicious web pages, and the processed data is labeled as good (benign) or bad (malicious). In this embodiment, these are combined with valid data from existing public URL data sets, finally yielding 122331 URLs as training samples. Fig. 4 shows the 49060th to 49076th training samples; the training sample set is stored in CSV format.
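A labeled sample set of this kind can be written and read back with Python's csv module. The column layout below (a url column and a label column holding "good" or "bad") is an assumption, since the format shown in fig. 4 is not reproduced in the text, and the URLs are hypothetical.

```python
import csv
import io

# Hypothetical labeled samples: "good" for benign, "bad" for malicious.
rows = [
    ("example.com/docs/index.html", "good"),
    ("example.com/login?user=a&next=b", "bad"),
]

buffer = io.StringIO(newline="")   # stands in for a .csv file on disk
writer = csv.writer(buffer)
writer.writerow(["url", "label"])  # assumed header row
writer.writerows(rows)

buffer.seek(0)
loaded = [(record["url"], record["label"]) for record in csv.DictReader(buffer)]
print(loaded)
```

The csv module quotes fields that contain commas, so URLs with commas in their query strings survive a round trip without extra handling.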
This embodiment also discloses a URL-based web page classification method, which differs from the web page classification method of embodiment 1 only in that step X2 uses the web page classifier obtained by the construction method of this embodiment.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (7)

1. A webpage classifier construction method based on URL is characterized by comprising the following steps:
step S1, obtaining URLs of a plurality of webpages, marking the webpage attributes aiming at the URLs, and forming a training sample set by taking the URLs with marked webpage attributes as training samples;
step S2, for each training sample in the training sample set, performing word segmentation on the training sample at selected characters, and then converting the resulting words into word vectors; the selected characters include "?", "&", "-" and "#";
step S3, training a convolutional neural network with the word vectors of the labeled training samples in the training sample set as input, to obtain a web page classifier;
in step S1, acquiring URLs of multiple web pages from the benign and malicious URL repository, where the training sample set includes a certain number of URLs whose web page attributes are malicious and a certain number of URLs whose web page attributes are benign;
after the training sample set is obtained in step S1, deduplication is performed on the training sample set as follows: an initial value of N is selected and the first N characters of each training sample in the training sample set are obtained; among URLs in the training sample set that share the same first N characters, only one is kept after deduplication; it is then judged whether the total number of training samples in the training sample set is less than or equal to a threshold, and if not, the value of N is reduced and the same processing is repeated until the total number of training samples is less than or equal to the threshold; for the final training sample set obtained after deduplication, each training sample is segmented into words at the selected characters and then converted into word vectors.
2. The method for constructing a URL-based web page classifier according to claim 1, wherein in step S2, Word2vec is used to convert the training samples into Word vectors according to the result of the Word segmentation process.
3. The method for constructing a URL-based web page classifier according to claim 1, wherein in step S2, when the word vectors are obtained with Word2vec, the following parameters are set: the word embedding dimension embedding-size, the context window size window, and the minimum word frequency min_count.
4. The URL-based web page classifier construction method according to claim 1, wherein the convolutional neural network is constructed to include, in order from input to output, a first part, a second part, a third part, a fourth part, and a fifth part; wherein:
the first part is an input layer and is used for inputting word vectors of all training samples;
the second part comprises a first convolution layer, a first pooling layer, a second convolution layer and a second pooling layer in sequence from input to output and is used for extracting context semantics of various degrees; the first convolution layer and the second convolution layer respectively comprise convolution kernels with three sizes, and the first convolution layer and the second convolution layer are the same in size;
the third part is a vector merging layer and is used for merging the convolution results of the convolution kernels of the second part into a feature vector;
the fourth part is a fully connected part comprising a first fully connected layer and a second fully connected layer; the first fully connected layer applies Dropout to the feature vector, and the second fully connected layer uses a classifier to obtain the highest-scoring category for the feature vector;
and the fifth part is an output layer and is used for outputting a classification result.
5. The URL-based web page classifier building method according to claim 1, wherein N is an integer of 20-30.
6. A webpage classification method based on URL is characterized by comprising the following steps:
step X1, aiming at the webpage needing to be classified, firstly, acquiring the URL of the webpage as a test sample; then, carrying out word segmentation processing on the test sample through the selected characters, and finally converting the test sample into word vectors;
step X2, inputting the word vectors of the test samples into the web page classifier constructed by the method of any one of claims 1 to 5, and outputting the classification result through the web page classifier.
7. The URL-based web page classification method according to claim 6, wherein:
in step X1, word segmentation is performed on each test sample using the selected characters "?", "&", "-" and "#" as delimiters;
in step X1, Word2vec is used to convert the word segmentation result of each test sample into word vectors.
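Step X1 can be illustrated with a short standard-library sketch. It splits a URL on the four delimiter characters that appear in claim 7 as printed here and then looks each token up in an embedding table. The table is a hypothetical stand-in for a trained Word2vec model, and mapping unknown tokens to a zero vector is an assumption made for the example, not something the claims specify.

```python
import re

# The four delimiter characters listed in claim 7 as it appears in this text.
_DELIMITERS = re.compile(r"[?&#-]")

def segment_url(url):
    """Split a URL on the delimiter characters and drop empty pieces."""
    return [tok for tok in _DELIMITERS.split(url) if tok]

def to_vectors(tokens, embeddings, dim):
    """Look up each token; unknown tokens map to a zero vector (an assumption)."""
    return [embeddings.get(tok, [0.0] * dim) for tok in tokens]

# Hypothetical URL and embedding table (illustrative only).
tokens = segment_url("http://example.com/news-sports?id=3&page=2#top")
vectors = to_vectors(tokens, {"sports": [1.0, 0.0]}, dim=2)
```

Here `tokens` is `['http://example.com/news', 'sports', 'id=3', 'page=2', 'top']`, and the resulting list of vectors is what step X2 would pass to the classifier.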

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811025751.3A CN109284465B (en) 2018-09-04 2018-09-04 URL-based web page classifier construction method and classification method thereof

Publications (2)

Publication Number Publication Date
CN109284465A (en) 2019-01-29
CN109284465B (en) 2021-03-19

Family

ID=65184422

Country Status (1)

Country Link
CN (1) CN109284465B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110647745A (en) * 2019-07-24 2020-01-03 浙江工业大学 Detection method of malicious software assembly format based on deep learning
CN112749360A (en) * 2019-10-30 2021-05-04 北京国双科技有限公司 Webpage classification method and device
CN110830489B (en) * 2019-11-14 2022-09-13 国网江苏省电力有限公司苏州供电分公司 Method and system for detecting counterattack type fraud website based on content abstract representation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102629282A (en) * 2012-05-03 2012-08-08 湖南神州祥网科技有限公司 Website classification method, device and system
EP2729895B1 (en) * 2011-07-08 2016-07-06 The UAB Research Foundation Syntactical fingerprinting
CN106126512A (en) * 2016-04-13 2016-11-16 北京天融信网络安全技术有限公司 The Web page classification method of a kind of integrated study and device
CN107741960A (en) * 2017-09-25 2018-02-27 厦门集微科技有限公司 URL sorting technique and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102790762A (en) * 2012-06-18 2012-11-21 东南大学 Phishing website detection method based on uniform resource locator (URL) classification
CN102739679A (en) * 2012-06-29 2012-10-17 东南大学 URL(Uniform Resource Locator) classification-based phishing website detection method
CN105512143A (en) * 2014-09-26 2016-04-20 中兴通讯股份有限公司 Method and device for web page classification
CN106294815B (en) * 2016-08-16 2019-08-16 晶赞广告(上海)有限公司 A kind of clustering method and device of URL
CN107360200A (en) * 2017-09-20 2017-11-17 广东工业大学 A kind of fishing detection method based on classification confidence and web site features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Web page deduplication algorithm based on meta-search; Zhang Yulian; Journal of Yanshan University; 20110803 (No. 2); pp. 121-123 *

Also Published As

Publication number Publication date
CN109284465A (en) 2019-01-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant