CN112104765A - Illegal website detection method and device - Google Patents


Info

Publication number
CN112104765A
Authority
CN
China
Prior art keywords
website
domain name
illegal
detected
detection result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011311250.9A
Other languages
Chinese (zh)
Inventor
程波
叶志钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Greenet Information Service Co Ltd
Original Assignee
Wuhan Greenet Information Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Greenet Information Service Co Ltd

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiment of the invention provides a method and a device for detecting an illegal website. The method comprises the following steps: acquiring a target domain name of a website to be detected and a preset domain name library, wherein domain names of a plurality of legal websites and domain names of a plurality of illegal websites are stored in the domain name library; judging whether a domain name identical to the target domain name exists in the domain name library, and if not, detecting the target domain name based on a preset neural network model to obtain a first detection result of the website to be detected; if the first detection result is that the website to be detected is an illegal website, acquiring a second detection result of the website to be detected according to the page content of the website to be detected; and taking the second detection result as the final result of the detection of the website to be detected. The method and the device provided by the embodiment of the invention combine domain name library comparison, neural network model detection, and page content detection to jointly detect the website to be detected, so that the accuracy of the detection result can be improved.

Description

Illegal website detection method and device
Technical Field
The invention relates to the technical field of network security, in particular to a method and a device for detecting an illegal website.
Background
Illegal websites, such as lottery websites, persist despite repeated bans. To block an illegal website, it must first be detected. The prior art usually detects illegal websites by filtering on the website domain name or the destination IP address. However, most illegal websites now evade detection by encrypting traffic with Hypertext Transfer Protocol over Secure Socket Layer (HTTPS), changing domain names frequently, or assigning a separate domain name to each user. If the prior-art detection methods continue to be used against such websites, detection accuracy is low.
Disclosure of Invention
Therefore, it is necessary to provide a method and an apparatus for detecting an illegal website, so as to solve the technical problem of low accuracy of the illegal website detection method in the prior art.
In a first aspect, an embodiment of the present invention provides a method for detecting an illegal website, including:
acquiring a target domain name of a website to be detected and a preset domain name library, wherein domain names of a plurality of legal websites and domain names of a plurality of illegal websites are stored in the domain name library;
judging whether a domain name identical to the target domain name exists in the domain name library, and if not, detecting the target domain name based on a preset neural network model to obtain a first detection result of the website to be detected, wherein the first detection result indicates that the website to be detected is a legal website or an illegal website;
if the first detection result is that the website to be detected is an illegal website, acquiring a second detection result of the website to be detected according to the page content of the website to be detected, wherein the second detection result is that the website to be detected is a legal website or an illegal website;
and taking the second detection result as a final result of the detection of the website to be detected.
Further, before detecting the target domain name based on a preset neural network model, the method further includes:
obtaining a plurality of sample domain names;
combining each sample domain name and the name and sample label of the corresponding sample website to form a training sample to obtain a plurality of training samples; the sample label is used for marking the sample website as a legal website or an illegal website;
and combining a plurality of training samples into a training set, and training the neural network model based on the training set.
Further, obtaining a number of sample domain names includes:
acquiring user traffic within a preset time period;
parsing the user traffic to obtain a plurality of domain names carried in the user traffic and the access volume of each domain name;
ranking the plurality of domain names in descending order of access volume, and taking each domain name ranked lower than a preset rank as a domain name to be analyzed;
judging whether a domain name identical to the domain name to be analyzed exists in the domain name library, if not, acquiring a third detection result of the website according to the page content of the website corresponding to the domain name to be analyzed, wherein the third detection result is that the website is a legal website or an illegal website;
and if the third detection result is that the website is an illegal website, taking the domain name to be analyzed as the sample domain name to obtain a plurality of sample domain names.
Further, the domain name library includes a legal domain name library and an illegal domain name library, domain names of a plurality of legal websites are stored in the legal domain name library, domain names of a plurality of illegal websites are stored in the illegal domain name library, and the detection method of the illegal websites further includes:
if the third detection result is that the website is an illegal website, storing the domain name to be analyzed to the illegal domain name library;
and if the third detection result indicates that the website is a legal website, storing the domain name to be analyzed to the legal domain name library.
Further, the method for detecting the illegal website further comprises the following steps:
traversing the training set;
acquiring the categories of the names of the sample websites in the training set, and judging whether the categories are the same as those acquired in the previous traversal;
if not, retraining the neural network model based on the training set;
if so, acquiring the number of training samples in the training set and judging whether the number is the same as the number acquired in the previous traversal, and if the number is not the same, performing incremental training on the neural network model based on the training set.
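The decision between full retraining and incremental training described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `choose_training_action`, the representation of categories as a set of strings, and the returned action labels are all assumptions introduced here.

```python
from typing import Iterable, Set


def choose_training_action(
    categories: Iterable[str],
    sample_count: int,
    prev_categories: Set[str],
    prev_count: int,
) -> str:
    """Compare the current traversal of the training set with the previous one.

    Returns "retrain" when the set of site-name categories changed,
    "incremental" when only the number of samples changed, and "none"
    when nothing changed. The action labels are hypothetical.
    """
    if set(categories) != prev_categories:
        return "retrain"
    if sample_count != prev_count:
        return "incremental"
    return "none"


# Hypothetical category names, used only to exercise the three branches.
print(choose_training_action({"lottery", "news"}, 120, {"lottery"}, 100))
print(choose_training_action({"lottery"}, 120, {"lottery"}, 100))
print(choose_training_action({"lottery"}, 100, {"lottery"}, 100))
```

Checking the categories first means a change in the kinds of sample websites always triggers full retraining, even if the sample count also changed.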
Further, the method for detecting the illegal website further comprises the following steps:
if the domain name identical to the target domain name exists in the legal domain name library, taking a detection result that the website to be detected is a legal website as a final result of detection of the website to be detected;
if the domain name identical to the target domain name exists in the illegal domain name library, taking a detection result that the to-be-detected website is an illegal website as a final result of detection of the to-be-detected website;
and if the first detection result is that the website to be detected is a legal website, taking the detection result that the website to be detected is the legal website as a final result of the detection of the website to be detected.
Further, the method for detecting the illegal website further comprises the following steps:
acquiring a first domain name set, wherein the first domain name set comprises a plurality of domain names;
comparing the first domain name set with the domain name library, and removing from the first domain name set every domain name that also appears in the domain name library, to obtain a second domain name set;
detecting each domain name in the second domain name set respectively based on the neural network model so as to divide websites corresponding to each domain name in the second domain name set into an initial legal website set and an initial illegal website set;
determining a fourth detection result of each website in the initial illegal website set according to the page content of each website in the initial illegal website set, wherein the fourth detection result indicates that each website in the initial illegal website set is a legal website or an illegal website;
selecting a preset proportion of the websites in the initial legal website set to form a website set;
and distributing the domain name corresponding to each website in the website set to a plurality of predetermined terminals for detection, and acquiring a fifth detection result fed back by the terminals, wherein the fifth detection result indicates that each website in the website set is a legal website or an illegal website.
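As a rough sketch of the selection step for terminal review, the snippet below picks a preset proportion of the initial legal website set. The source does not say how the subset is chosen; uniform random sampling, the function name `sample_for_terminal_review`, and the fixed seed are assumptions made here for illustration.

```python
import random
from typing import List


def sample_for_terminal_review(
    initial_legal: List[str], proportion: float, seed: int = 0
) -> List[str]:
    """Pick a preset proportion of the initial legal set for review by terminals.

    Random selection is an assumption; the source only states that a set of
    websites "with a preset proportion" is selected.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    count = round(len(initial_legal) * proportion)
    return random.Random(seed).sample(initial_legal, count)


initial_legal_set = [f"site{i}.example" for i in range(10)]
review_set = sample_for_terminal_review(initial_legal_set, 0.2)
print(len(review_set))
```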
In a second aspect, an embodiment of the present invention provides an apparatus for detecting an illegal website, including:
the acquisition module is used for acquiring a target domain name of a website to be detected and a preset domain name library, wherein domain names of a plurality of legal websites and domain names of a plurality of illegal websites are stored in the domain name library;
the first detection module is used for judging whether a domain name identical to the target domain name exists in the domain name library, and if not, detecting the target domain name based on a preset neural network model to obtain a first detection result of the website to be detected, wherein the first detection result is that the website to be detected is a legal website or an illegal website;
the second detection module is used for acquiring a second detection result of the website to be detected according to the page content of the website to be detected if the first detection result indicates that the website to be detected is an illegal website, and the second detection result indicates that the website to be detected is a legal website or an illegal website;
and the result determining module is used for taking the second detection result as a final result of the detection of the website to be detected.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the method for detecting an illegal website according to the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for detecting an illegal website as provided in the first aspect.
The method and the device for detecting an illegal website provided by the embodiment of the invention compare the domain name of the website to be detected with a preset domain name library. If the domain name library contains no domain name identical to the target domain name, the target domain name is detected based on a preset neural network model to obtain a first detection result of the website to be detected. When the first detection result is that the website to be detected is an illegal website, a second detection result of the website to be detected is obtained according to the page content of the website to be detected, and the second detection result is used as the final result of detecting the website to be detected. The method combines domain name library comparison, neural network model detection, and page content detection to jointly detect the website to be detected, and can improve the accuracy of the detection result.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic view of an application scenario of a method for detecting an illegal website according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for detecting an illegal website according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an illegal website detection device according to an embodiment of the present invention;
Fig. 4 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to describe the method for detecting an illegal website provided by the embodiment of the present invention more clearly, an application scenario of the method is described first. Fig. 1 is a schematic view of an application scenario of the method for detecting an illegal website according to an embodiment of the present invention. As shown in Fig. 1, the method is applied to an operator access network, which includes functional units such as a plurality of Virtual Local Area Networks (VLANs), a plurality of Digital Subscriber Line Access Multiplexers (DSLAMs), a plurality of switches (SWs), a plurality of Broadband Remote Access Servers (BRASs), a plurality of Deep Packet Inspection (DPI) devices, and a convergence layer. The connections between the functional units are shown in detail in Fig. 1 and are not described again here.
The device that executes the method is referred to as an illegal website detection device. The device may be located in any functional unit of the operator access network shown in Fig. 1 that is capable of monitoring the internet traffic of a large number of users, which is not limited herein.
Fig. 2 is a flowchart of a method for detecting an illegal website according to an embodiment of the present invention, as shown in fig. 2, the method includes:
step 201, a target domain name of a website to be detected and a preset domain name library are obtained, wherein domain names of a plurality of legal websites and domain names of a plurality of illegal websites are stored in the domain name library.
Specifically, in order to obtain the target domain name of a website to be detected, real-time user traffic is acquired first, where the real-time user traffic includes any one or more of a Hypertext Transfer Protocol (HTTP) request message, an HTTPS request message, and a Domain Name System (DNS) response message. The acquired real-time user traffic is then parsed to obtain the domain name carried in it; in the embodiment of the invention, the domain name is preferably a top-level domain name. Finally, the website corresponding to the domain name parsed from the real-time user traffic is taken as the website to be detected, and the domain name is taken as the target domain name of the website to be detected. The embodiment of the invention aims to detect the website to be detected so as to judge whether it is an illegal website. It should be noted that the illegal website in the embodiment of the present invention is preferably a lottery website.
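As a minimal sketch of the parsing step for the plain-HTTP case only, the following Python snippet reads the Host header of a raw HTTP/1.x request message. The function name `extract_host` and the sample request are illustrative assumptions; HTTPS request messages and DNS response messages would need their own parsers, which are not shown here.

```python
from typing import Optional


def extract_host(raw_request: bytes) -> Optional[str]:
    """Return the lower-cased host name from an HTTP/1.x request, or None."""
    head = raw_request.split(b"\r\n\r\n", 1)[0]
    for line in head.split(b"\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"host":
            host = value.strip().decode("ascii", errors="replace")
            # Drop an optional ":port" suffix (IPv6 literals are not handled).
            if ":" in host:
                host = host.rsplit(":", 1)[0]
            return host.lower()
    return None


request = (
    b"GET /index.html HTTP/1.1\r\n"
    b"Host: Example.COM:8080\r\n"
    b"Accept: */*\r\n\r\n"
)
print(extract_host(request))
```

Lower-casing the result keeps later comparisons against the domain name library case-insensitive, since host names are not case-sensitive.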
The preset domain name library comprises a legal domain name library and an illegal domain name library, wherein domain names of a plurality of legal websites are stored in the legal domain name library, and domain names of a plurality of illegal websites are stored in the illegal domain name library.
Step 202, judging whether a domain name identical to the target domain name exists in the domain name library, and if not, detecting the target domain name based on a preset neural network model to obtain a first detection result of the website to be detected, wherein the first detection result is that the website to be detected is a legal website or an illegal website.
Specifically, each domain name in the domain name library is sequentially compared with the target domain name to judge whether the domain names are the same, if the domain names are judged to be the same, namely the domain name which is the same as the target domain name exists in the domain name library, the comparison is stopped, and the subsequent detection process is stopped.
Further, if the domain name same as the target domain name exists in the legal domain name library, taking the detection result that the website to be detected is a legal website as the final result of the detection of the website to be detected, and if the domain name same as the target domain name exists in the illegal domain name library, taking the detection result that the website to be detected is an illegal website as the final result of the detection of the website to be detected.
If the domain name same as the target domain name does not exist in the domain name library, it is indicated that whether the website to be detected is an illegal website cannot be judged based on the domain name library, at this time, the target domain name is detected based on a preset neural network model, and a first detection result of the website to be detected is obtained, wherein the first detection result is that the website to be detected is a legal website or the website to be detected is an illegal website.
It should be noted that although an illegal website changes its domain name frequently, its domain names usually share certain features: first, the domain name is generated by a program; second, the domain name is designed to be easy to recognize.
Step 203, if the first detection result is that the website to be detected is an illegal website, obtaining a second detection result of the website to be detected according to the page content of the website to be detected, wherein the second detection result is that the website to be detected is a legal website or an illegal website.
Specifically, in order to further ensure the accuracy of the detection result and to avoid blocking a legal website that the neural network model has misclassified as illegal, a secondary detection is performed if the first detection result obtained by detecting the target domain name with the neural network model is that the website to be detected is an illegal website. That is, the website to be detected is actively visited, and a second detection result of the website to be detected is obtained according to the page content it returns, wherein the second detection result is that the website to be detected is a legal website or an illegal website.
The obtaining of the second detection result of the website to be detected according to the page content returned by the website to be detected may specifically be: and judging whether the page content contains illegal words, pictures or videos, if so, determining that the website to be detected is an illegal website, and otherwise, determining that the website to be detected is a legal website.
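A minimal sketch of the text part of this check is shown below. It assumes the page has already been fetched and covers only illegal words; detecting illegal pictures or videos is not attempted. The keyword list and the function name `classify_page_text` are hypothetical.

```python
from typing import Iterable

# Hypothetical keyword list; a real deployment would maintain its own.
ILLEGAL_KEYWORDS = ("lottery", "jackpot", "place your bet")


def classify_page_text(
    page_text: str, keywords: Iterable[str] = ILLEGAL_KEYWORDS
) -> str:
    """Return "illegal" if any keyword occurs in the page text, else "legal"."""
    lowered = page_text.lower()
    if any(keyword.lower() in lowered for keyword in keywords):
        return "illegal"
    return "legal"


print(classify_page_text("<title>Daily Lottery Results</title>"))
print(classify_page_text("<title>City Library Opening Hours</title>"))
```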
It should be noted that, among all websites, illegal websites are far fewer than legal websites. The number of websites that the neural network model detects as illegal, and that therefore require secondary detection, is thus very small, so the second detection adds little to the detection resources and time consumed. Detection timeliness is therefore maintained while the accuracy of the detection result is ensured.
Step 204, taking the second detection result as the final result of the detection of the website to be detected.
Specifically, if the second detection result is that the website to be detected is a legal website, the final result of the detection is that the website to be detected is a legal website; if the second detection result is that the website to be detected is an illegal website, the final result of the detection is that the website to be detected is an illegal website.
The method provided by the embodiment of the invention compares the domain name of the website to be detected with a preset domain name library. If the domain name library contains no domain name identical to the target domain name, the target domain name is detected based on a preset neural network model to obtain a first detection result of the website to be detected. When the first detection result is that the website to be detected is an illegal website, a second detection result of the website to be detected is acquired according to the page content of the website to be detected, and the second detection result is taken as the final result of detecting the website to be detected. The method combines domain name library comparison, neural network model detection, and page content detection to jointly detect the website to be detected, and can improve the accuracy of the detection result.
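The control flow of steps 201 to 204 can be sketched as follows. This is an illustrative outline under stated assumptions: the neural network model and the page content check are passed in as callables, and `toy_model` and `toy_page_check` are placeholders invented here, not the detectors described in the embodiment.

```python
from typing import Callable, Set

LEGAL, ILLEGAL = "legal", "illegal"


def detect_website(
    target_domain: str,
    legal_domains: Set[str],
    illegal_domains: Set[str],
    model_predict: Callable[[str], str],
    page_check: Callable[[str], str],
) -> str:
    """Return the final detection result for one target domain name."""
    # Step 202, library lookup: a hit in either library is final.
    if target_domain in legal_domains:
        return LEGAL
    if target_domain in illegal_domains:
        return ILLEGAL
    # Step 202, first detection: the model judges the domain name itself.
    if model_predict(target_domain) == LEGAL:
        return LEGAL
    # Steps 203 and 204: the page content check gives the final result.
    return page_check(target_domain)


# Placeholder detectors, for illustration only.
def toy_model(domain: str) -> str:
    return ILLEGAL if any(ch.isdigit() for ch in domain) else LEGAL


def toy_page_check(domain: str) -> str:
    return ILLEGAL if domain == "win888.example" else LEGAL


legal_lib = {"library.example"}
illegal_lib = {"bet365x.example"}
for name in ("library.example", "news.example", "shop24.example", "win888.example"):
    print(name, detect_website(name, legal_lib, illegal_lib, toy_model, toy_page_check))
```

Here `shop24.example` is flagged by the placeholder model but cleared by the page check, which is the false-positive case the second detection is meant to catch.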
Based on the above embodiment, before detecting the target domain name based on a preset neural network model, the method provided by the embodiment of the present invention further includes:
Obtaining a number of sample domain names.
Specifically, the sample domain name is preferably the domain name of an illegal website. There are various ways to obtain sample domain names. As a first example, the domain names in the illegal domain name library are taken as sample domain names. As a second example, real-time user traffic is acquired and parsed to obtain a domain name carried in the traffic, whether the corresponding website is an illegal website is judged according to its page content, and if so, the domain name is taken as a sample domain name. As a third example, user traffic within a preset time period is acquired and parsed to obtain a plurality of domain names carried in the traffic and the access volume of each domain name, the page content of the websites corresponding to the domain names with smaller access volumes is acquired, whether each such website is an illegal website is judged according to its page content, and if so, the domain name is taken as a sample domain name. The user traffic includes any one or more of HTTP request messages, HTTPS request messages, DNS response messages, and the like, which is not limited herein.
Combining each sample domain name and the name and sample label of the corresponding sample website to form a training sample to obtain a plurality of training samples; the sample label is used for marking the sample website as a legal website or an illegal website.
Specifically, a website corresponding to a domain name serving as a sample domain name is referred to as a sample website corresponding to the sample domain name, the name of the sample website is preferably a website title, and the sample label is used for marking the sample website as a legal website or an illegal website. The sample label may be specifically represented as a number, for example, a sample website marked with 0 is a legal website, and a sample website marked with 1 is an illegal website. And forming a training sample by using a sample domain name, the name of the sample website corresponding to the sample domain name and the sample label, and sequentially executing the operation on each obtained sample domain name to obtain a plurality of training samples.
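One way to represent such training samples is sketched below, assuming each record arrives as a (domain name, website title, label) tuple. The class name `TrainingSample`, the function `build_training_set`, and the sample records are illustrative assumptions; the 0/1 labels follow the example in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TrainingSample:
    domain: str      # sample domain name
    site_name: str   # name of the sample website, e.g. its title
    label: int       # 0 = legal website, 1 = illegal website


def build_training_set(records: List[Tuple[str, str, int]]) -> List[TrainingSample]:
    """Combine each domain, site name, and label into one training sample."""
    samples = []
    for domain, site_name, label in records:
        if label not in (0, 1):
            raise ValueError(f"unexpected label {label!r} for {domain}")
        samples.append(TrainingSample(domain, site_name, label))
    return samples


# Hypothetical records used only to show the shape of the data.
training_set = build_training_set([
    ("library.example", "City Library", 0),
    ("win888.example", "Win Big Lottery", 1),
])
print(len(training_set), training_set[1].label)
```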
And combining a plurality of training samples into a training set, and training the neural network model based on the training set.
Based on the above embodiment, obtaining a plurality of sample domain names includes:
Acquiring user traffic within a preset time period.
Specifically, the preset time period generally spans a number of hours, for example 24 hours. Its length can be adjusted flexibly according to actual conditions and is generally greater than or equal to 24 hours, which is not limited herein. The user traffic includes any one or more of HTTP request messages, HTTPS request messages, DNS response messages, and the like, which is not limited herein.
Parsing the user traffic to obtain a plurality of domain names carried in the user traffic and the access volume of each domain name.
Ranking the domain names in descending order of access volume, and taking each domain name ranked lower than a preset rank as a domain name to be analyzed.
Specifically, the preset rank may be obtained by multiplying the number of parsed domain names by a preset proportion. For example, if there are 1000 domain names and the preset proportion is 80%, the preset rank is 800. The 1000 domain names are ranked in descending order of access volume, so the domain name with the highest access volume is rank 1 and the domain name with the lowest access volume is rank 1000; rank 1000 is considered lower than rank 1. The domain names ranked lower than 800, that is, those ranked 801 to 1000, are taken as the domain names to be analyzed.
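The ranking rule in this example can be reproduced with a short sketch. The function name `select_domains_to_analyze` and the synthetic access volumes are assumptions; rounding the product to the nearest integer is also a choice made here, since the text does not say how a fractional preset rank would be handled.

```python
from typing import Dict, List


def select_domains_to_analyze(visits: Dict[str, int], preset_ratio: float) -> List[str]:
    """Rank domains by access volume (highest first) and keep the tail.

    The preset rank is the number of domains times the preset ratio;
    domains ranked below it (larger rank number) are returned.
    """
    ranked = sorted(visits, key=lambda domain: visits[domain], reverse=True)
    preset_rank = round(len(ranked) * preset_ratio)
    return ranked[preset_rank:]


# 1000 synthetic domains: site0 has the most visits (rank 1), site999 the fewest.
visits = {f"site{i}.example": 1000 - i for i in range(1000)}
selected = select_domains_to_analyze(visits, 0.8)
print(len(selected), selected[0], selected[-1])
```

With 1000 domains and a ratio of 0.8, the slice starts at index 800, which corresponds to rank 801, matching the worked numbers in the text.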
It should be noted that, because the domain name of an illegal website is usually known to only a few users and is changed frequently, its access volume is smaller than that of a legal website, and it therefore ranks lower by access volume. To save the computing resources, time, and cost required for detection, the embodiment of the present invention selects the lower-ranked domain names for subsequent detection. The preset proportion can be determined according to the volume of user traffic and/or the network bandwidth of the device executing the method, and is preferably a percentage between 80% and 100%.
Judging whether the domain name identical to the domain name to be analyzed exists in the domain name library, if not, acquiring a third detection result of the website according to the page content of the website corresponding to the domain name to be analyzed, wherein the third detection result is that the website is a legal website or an illegal website.
Specifically, each domain name in the domain name library is compared in turn with the domain name to be analyzed. If a match is found, it is determined that a domain name identical to the domain name to be analyzed exists in the domain name library, the comparison stops, and the comparison process is executed for the next domain name to be analyzed.
If the domain name same as the domain name to be analyzed does not exist in the domain name library, actively accessing the website corresponding to the domain name to be analyzed, and obtaining a third detection result of the website according to the page content returned by the website, wherein the third detection result is that the website is a legal website or the website is an illegal website.
And if the third detection result is that the website is an illegal website, taking the domain name to be analyzed as the sample domain name to obtain a plurality of sample domain names.
Specifically, if the third detection result is that the website is an illegal website, the domain name to be analyzed is used as a sample domain name. Performing these operations on each domain name to be analyzed in turn yields a plurality of sample domain names.
Based on the above embodiment, the domain name library in the embodiment of the present invention includes a legal domain name library and an illegal domain name library, where domain names of a plurality of legal websites are stored in the legal domain name library, domain names of a plurality of illegal websites are stored in the illegal domain name library, and the method for detecting an illegal website further includes:
and if the third detection result indicates that the website is an illegal website, storing the domain name to be analyzed to the illegal domain name library.
Specifically, in the process of constructing the training set, the illegal domain name library may be updated. For example, for a domain name to be analyzed, if no domain name identical to the domain name to be analyzed exists in the domain name library, and the website is determined to be an illegal website according to the page content of the website corresponding to the domain name to be analyzed, the domain name to be analyzed is stored in the illegal domain name library to update the illegal domain name library.
And if the third detection result indicates that the website is a legal website, storing the domain name to be analyzed to the legal domain name library.
Specifically, in the process of constructing the training set, the legal domain name library may also be updated. For example, for a domain name to be analyzed, if no domain name identical to the domain name to be analyzed exists in the domain name library, and the website is determined to be a legal website according to the page content of the website corresponding to the domain name to be analyzed, the domain name to be analyzed is stored in the legal domain name library to update the legal domain name library.
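The library lookup, page-content check, and library update described above might be sketched as follows. The libraries are modeled as Python sets, and `page_is_illegal` is a hypothetical callable standing in for the page-content check; the source does not specify how that check is performed.

```python
def collect_sample_domains(domains_to_analyze, legal_lib, illegal_lib,
                           page_is_illegal):
    """Return the sample domain names and update both libraries in place.

    legal_lib / illegal_lib: sets of known domain names (assumed data model).
    page_is_illegal: hypothetical callable(domain) -> bool that stands in
    for the third detection result obtained from the page content.
    """
    samples = []
    for domain in domains_to_analyze:
        if domain in legal_lib or domain in illegal_lib:
            continue  # already in the domain name library; no page check
        if page_is_illegal(domain):
            illegal_lib.add(domain)  # update the illegal domain name library
            samples.append(domain)   # illegal domains become sample domains
        else:
            legal_lib.add(domain)    # update the legal domain name library
    return samples


legal = {"news.example"}
illegal = {"old-bad.example"}
found = collect_sample_domains(
    ["news.example", "bad-1.example", "shop.example"],
    legal, illegal,
    page_is_illegal=lambda d: d.startswith("bad-"),  # stub check
)
print(found)  # ['bad-1.example']
```

Using sets makes the membership test a constant-time lookup instead of the sequential comparison described in the text; the outcome is the same.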
Based on the above embodiment, the method for detecting an illegal website further includes:
and traversing the training set.
Specifically, a time period may be preset, the training set may be periodically traversed according to the time period, and a time when the training set is traversed may also be randomly set, which is not limited herein.
And acquiring the category of the name of the sample website in the training set, and judging whether the category is the same as the category acquired in the last traversal.
If the categories are not the same, retraining the neural network model based on the training set.
If the categories are the same, acquiring the number of training samples in the training set and judging whether the number is the same as the number acquired in the last traversal; if the number is not the same, performing incremental training on the neural network model based on the training set.
It should be noted that, in other embodiments, the neural network model may be retrained only when the training set changes in other ways. For example, the training set is traversed, the categories of the names of the sample websites in the training set are obtained, and it is judged whether the categories are the same as those obtained in the last traversal. If they are not the same, it is further judged whether the number of sample domain names corresponding to at least one category of names in the training set exceeds a preset threshold, for example 1000; if such a category exists, the neural network model is retrained based on the training set.
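The choice between retraining and incremental training can be sketched as a small decision function. This assumes each training sample is a dictionary with a hypothetical `name_category` field holding the category of the sample website's name; the source does not define how training samples or categories are represented.

```python
def choose_training_action(prev_categories, prev_count, training_set):
    """Decide what to do after traversing the training set.

    Returns (action, categories, count), where action is one of
    "retrain", "incremental", or "none". The returned categories and count
    are meant to be stored for comparison at the next traversal.
    """
    categories = {sample["name_category"] for sample in training_set}
    count = len(training_set)
    if categories != prev_categories:
        action = "retrain"      # category set changed: full retraining
    elif count != prev_count:
        action = "incremental"  # same categories, different sample count
    else:
        action = "none"         # nothing changed since the last traversal
    return action, categories, count


samples = [
    {"domain": "a.example", "name_category": "gambling", "label": "illegal"},
    {"domain": "b.example", "name_category": "news", "label": "legal"},
]
print(choose_training_action({"gambling"}, 1, samples)[0])          # retrain
print(choose_training_action({"gambling", "news"}, 1, samples)[0])  # incremental
print(choose_training_action({"gambling", "news"}, 2, samples)[0])  # none
```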
Based on the above embodiment, the method for detecting an illegal website further includes:
and if the domain name identical to the target domain name exists in the legal domain name library, taking a detection result that the website to be detected is a legal website as a final result of detection of the website to be detected.
And if the domain name identical to the target domain name exists in the illegal domain name library, taking a detection result that the website to be detected is an illegal website as a final result of detection of the website to be detected.
And if the first detection result is that the website to be detected is a legal website, taking the detection result that the website to be detected is the legal website as a final result of the detection of the website to be detected.
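Taken together with the page-content check of the earlier embodiment, the single-website decision flow can be sketched as follows. `model_predict` and `page_check` are hypothetical callables standing in for the neural network model and the page-content detection, and the libraries are modeled as sets; none of these names come from the source.

```python
LEGAL, ILLEGAL = "legal", "illegal"


def detect_website(domain, legal_lib, illegal_lib, model_predict, page_check):
    """Return the final detection result for one target domain name."""
    if domain in legal_lib:
        return LEGAL                 # found in the legal domain name library
    if domain in illegal_lib:
        return ILLEGAL               # found in the illegal domain name library
    first_result = model_predict(domain)  # first detection result
    if first_result == LEGAL:
        return LEGAL                 # model says legal: accepted as final
    return page_check(domain)        # second detection result is final


result = detect_website(
    "unknown.example",
    legal_lib={"news.example"},
    illegal_lib={"bad.example"},
    model_predict=lambda d: ILLEGAL,  # stub: model flags the domain
    page_check=lambda d: LEGAL,       # stub: page content looks legal
)
print(result)  # legal
```

In this flow the page content is fetched only for domain names that the model flags as illegal, so the more expensive check runs on a small subset of the unknown domain names.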
Different from the method for detecting only one website in the above embodiment, the method provided in the embodiment of the present invention may further detect multiple websites, that is, the method for detecting an illegal website further includes:
a first domain name set is obtained, and the first domain name set comprises a plurality of domain names.
Specifically, a plurality of domain names can be parsed from the user's real-time traffic or from the user traffic within a preset time period, and the first domain name set is the set of the parsed domain names.
And comparing the first domain name set with the domain name library, and eliminating the domain names which are the same as the domain names in the domain name library in the first domain name set to obtain a second domain name set.
Specifically, one domain name in the first domain name set is sequentially compared with domain names in the domain name library, if a domain name identical to the domain name in the first domain name set is found in the domain name library, the domain name in the first domain name set is rejected, the comparison process is performed on each domain name in the first domain name set, and a set formed by all domain names which are not rejected in the first domain name set is called a second domain name set.
And respectively detecting each domain name in the second domain name set based on the neural network model so as to divide websites corresponding to each domain name in the second domain name set into an initial legal website set and an initial illegal website set.
Specifically, each domain name in the second domain name set can be detected based on the neural network model, so as to determine whether the website corresponding to the domain name is a legal website or an illegal website, a set formed by all judged legal websites is referred to as an initial legal website set, and a set formed by all judged illegal websites is referred to as an initial illegal website set.
And determining a fourth detection result of each website in the initial illegal website set according to the page content of each website in the initial illegal website set, wherein the fourth detection result indicates that each website in the initial illegal website set is a legal website or an illegal website.
Specifically, for the initial illegal website set, the website is determined to be a legal website or an illegal website according to the page content of each website, and the determination result is used as a final detection result of the website.
And selecting, from the initial legal website set, a website set of a preset proportion.
Specifically, to prevent the neural network model from mistakenly judging an illegal website to be a legal website, and thereby to improve detection accuracy, a website set of a preset proportion is selected from the initial legal website set. It should be noted that the preset proportion can generally be adjusted according to actual conditions and is not limited here.
And distributing the domain name corresponding to each website in the website set to a plurality of predetermined terminals for detection, and acquiring a fifth detection result fed back by the terminals, wherein the fifth detection result indicates that each website in the website set is a legal website or an illegal website.
Specifically, the domain name corresponding to each website in a website set with a preset proportion in the initial legal website is sent to a plurality of terminals which are determined in advance, so that a holder of each terminal actively accesses the website corresponding to the received domain name, and a fifth detection result of the website is obtained by checking page content of the website, wherein the fifth detection result is that the website is a legal website or an illegal website.
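The multi-website flow above can be sketched end to end as follows. This is an illustration under stated assumptions: `model_predict` and `page_check` are hypothetical callables, the selection from the initial legal set is done by uniform random sampling with a fixed seed (the source does not say how the subset is chosen), and the distribution to terminals is reduced to returning the list of domain names to be reviewed.

```python
import random


def detect_batch(first_set, known_domains, model_predict, page_check,
                 proportion=0.1, seed=0):
    """Return (fourth_results, to_review) for a set of domain names.

    fourth_results: page-content verdict for each initially illegal domain.
    to_review: domains sampled from the initial legal set for manual review.
    """
    # Second domain name set: remove names already in the domain name library.
    second_set = [d for d in first_set if d not in known_domains]

    initial_legal, initial_illegal = [], []
    for domain in second_set:
        if model_predict(domain) == "illegal":
            initial_illegal.append(domain)
        else:
            initial_legal.append(domain)

    # Fourth detection result: re-check every initially illegal website.
    fourth_results = {d: page_check(d) for d in initial_illegal}

    # Sample a preset proportion of the initial legal set (assumed: rounded).
    k = round(len(initial_legal) * proportion)
    to_review = random.Random(seed).sample(initial_legal, k)
    return fourth_results, to_review


domains = [f"site{i}.example" for i in range(10)] + ["bad.example", "known.example"]
fourth, review = detect_batch(
    domains, known_domains={"known.example"},
    model_predict=lambda d: "illegal" if d.startswith("bad") else "legal",
    page_check=lambda d: "illegal",
    proportion=0.3,
)
print(fourth)       # {'bad.example': 'illegal'}
print(len(review))  # 3
```

The fifth detection results would then come back from the terminal holders for the sampled domain names; how those results are collected is outside the scope of this sketch.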
Fig. 3 is a schematic structural diagram of an illegal website detection device according to an embodiment of the present invention, and as shown in fig. 3, the device includes:
an obtaining module 301, configured to obtain a target domain name of a to-be-detected website and a preset domain name library, where domain names of a plurality of legal websites and domain names of a plurality of illegal websites are stored in the domain name library;
a first detection module 302, configured to determine whether a domain name identical to the target domain name exists in the domain name library, and if not, detect the target website based on a preset neural network model to obtain a first detection result of the to-be-detected website, where the first detection result indicates that the to-be-detected website is a legal website or an illegal website;
a second detection module 303, configured to, if the first detection result indicates that the website to be detected is an illegal website, obtain a second detection result of the website to be detected according to page content of the website to be detected, where the second detection result indicates that the website to be detected is a legal website or an illegal website;
and a result determining module 304, configured to use the second detection result as a final result of detecting the website to be detected.
It should be noted that the device provided in the embodiment of the present invention executes the steps of the illegal website detection method described above, which are not repeated here. The device compares the target domain name of the website to be detected with a preset domain name library. If no domain name identical to the target domain name exists in the domain name library, it detects the target domain name based on a preset neural network model to obtain a first detection result of the website to be detected. When the first detection result is that the website to be detected is an illegal website, it obtains a second detection result of the website to be detected according to the page content of the website to be detected, and takes the second detection result as the final result of detecting the website to be detected. By combining domain name library comparison, neural network model detection, and page content detection to detect the website to be detected, the device can improve the accuracy of the detection result.
Fig. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in fig. 4, the electronic device may include a processor 401, a communication interface 402, a memory 403, and a communication bus 404, wherein the processor 401, the communication interface 402, and the memory 403 communicate with each other through the communication bus 404. The processor 401 may invoke a computer program stored in the memory 403 and executable on the processor 401 to perform the methods provided by the above embodiments, including for example: acquiring a target domain name of a website to be detected and a preset domain name library, wherein domain names of a plurality of legal websites and domain names of a plurality of illegal websites are stored in the domain name library; judging whether a domain name identical to the target domain name exists in the domain name library or not, if not, detecting the target website based on a preset neural network model to obtain a first detection result of the website to be detected, wherein the first detection result indicates that the website to be detected is a legal website or an illegal website; if the first detection result is that the website to be detected is an illegal website, acquiring a second detection result of the website to be detected according to the page content of the website to be detected, wherein the second detection result is that the website to be detected is a legal website or an illegal website; and taking the second detection result as a final result of the detection of the website to be detected.
In addition, the logic instructions in the memory 403 may be implemented in the form of software functional units and, when sold or used as independent products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
Embodiments of the present invention further provide a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, performs the method provided in the foregoing embodiments, the method including: acquiring a target domain name of a website to be detected and a preset domain name library, wherein domain names of a plurality of legal websites and domain names of a plurality of illegal websites are stored in the domain name library; judging whether a domain name identical to the target domain name exists in the domain name library or not, if not, detecting the target website based on a preset neural network model to obtain a first detection result of the website to be detected, wherein the first detection result indicates that the website to be detected is a legal website or an illegal website; if the first detection result is that the website to be detected is an illegal website, acquiring a second detection result of the website to be detected according to the page content of the website to be detected, wherein the second detection result is that the website to be detected is a legal website or an illegal website; and taking the second detection result as a final result of the detection of the website to be detected.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for detecting an illegal website is characterized by comprising the following steps:
acquiring a target domain name of a website to be detected and a preset domain name library, wherein domain names of a plurality of legal websites and domain names of a plurality of illegal websites are stored in the domain name library;
judging whether a domain name identical to the target domain name exists in the domain name library or not, if not, detecting the target website based on a preset neural network model to obtain a first detection result of the website to be detected, wherein the first detection result indicates that the website to be detected is a legal website or an illegal website;
if the first detection result is that the website to be detected is an illegal website, acquiring a second detection result of the website to be detected according to the page content of the website to be detected, wherein the second detection result is that the website to be detected is a legal website or an illegal website;
and taking the second detection result as a final result of the detection of the website to be detected.
2. The method for detecting illegal website according to claim 1, wherein before detecting the target website based on a preset neural network model, the method further comprises:
obtaining a plurality of sample domain names;
combining each sample domain name and the name and sample label of the corresponding sample website to form a training sample to obtain a plurality of training samples; the sample label is used for marking the sample website as a legal website or an illegal website;
and combining a plurality of training samples into a training set, and training the neural network model based on the training set.
3. The method of claim 2, wherein obtaining a plurality of sample domain names comprises:
acquiring user flow in a preset time period;
analyzing the user flow to obtain a plurality of domain names carried in the user flow and the visit quantity of each domain name;
ranking the plurality of domain names in descending order of access amount, and taking each domain name whose rank is lower than a preset rank as a domain name to be analyzed;
judging whether a domain name identical to the domain name to be analyzed exists in the domain name library, if not, acquiring a third detection result of the website according to the page content of the website corresponding to the domain name to be analyzed, wherein the third detection result is that the website is a legal website or an illegal website;
and if the third detection result is that the website is an illegal website, taking the domain name to be analyzed as the sample domain name to obtain a plurality of sample domain names.
4. The method according to claim 3, wherein the domain name library includes a legal domain name library and an illegal domain name library, the legal domain name library stores domain names of a plurality of legal websites, the illegal domain name library stores domain names of a plurality of illegal websites, and the method further comprises:
if the third detection result is that the website is an illegal website, storing the domain name to be analyzed to the illegal domain name library;
and if the third detection result indicates that the website is a legal website, storing the domain name to be analyzed to the legal domain name library.
5. The method for detecting an illegal website according to claim 2, further comprising:
traversing the training set;
acquiring the category of the name of the sample website in the training set, and judging whether the category is the same as the category acquired in the last traversal;
if the categories are not the same, retraining the neural network model based on the training set;
if the categories are the same, acquiring the number of training samples in the training set, judging whether the number is the same as the number acquired in the last traversal, and if the number is not the same, performing incremental training on the neural network model based on the training set.
6. The method for detecting illegal website according to claim 4, further comprising:
if the domain name identical to the target domain name exists in the legal domain name library, taking a detection result that the website to be detected is a legal website as a final result of detection of the website to be detected;
if the domain name identical to the target domain name exists in the illegal domain name library, taking a detection result that the to-be-detected website is an illegal website as a final result of detection of the to-be-detected website;
and if the first detection result is that the website to be detected is a legal website, taking the detection result that the website to be detected is the legal website as a final result of the detection of the website to be detected.
7. The method for detecting an illegal website according to claim 1, further comprising:
acquiring a first domain name set, wherein the first domain name set comprises a plurality of domain names;
comparing the first domain name set with the domain name library, and eliminating domain names which are the same as the domain names in the domain name library in the first domain name set to obtain a second domain name set;
detecting each domain name in the second domain name set respectively based on the neural network model so as to divide websites corresponding to each domain name in the second domain name set into an initial legal website set and an initial illegal website set;
determining a fourth detection result of each website in the initial illegal website set according to the page content of each website in the initial illegal website set, wherein the fourth detection result indicates that each website in the initial illegal website set is a legal website or an illegal website;
selecting, from the initial legal website set, a website set of a preset proportion;
and distributing the domain name corresponding to each website in the website set to a plurality of predetermined terminals for detection, and acquiring a fifth detection result fed back by the terminals, wherein the fifth detection result indicates that each website in the website set is a legal website or an illegal website.
8. An illegal website detection device, comprising:
the acquisition module is used for acquiring a target domain name of a website to be detected and a preset domain name library, wherein domain names of a plurality of legal websites and domain names of a plurality of illegal websites are stored in the domain name library;
the first detection module is used for judging whether a domain name identical to the target domain name exists in the domain name library or not, if not, detecting the target website based on a preset neural network model to obtain a first detection result of the website to be detected, wherein the first detection result is that the website to be detected is a legal website or an illegal website;
the second detection module is used for acquiring a second detection result of the website to be detected according to the page content of the website to be detected if the first detection result indicates that the website to be detected is an illegal website, and the second detection result indicates that the website to be detected is a legal website or an illegal website;
and the result determining module is used for taking the second detection result as a final result of the detection of the website to be detected.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for detecting an illegal website of any one of claims 1 to 7 when executing the computer program.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the method for detecting an illegal website of any one of claims 1 to 7.
CN202011311250.9A 2020-11-20 2020-11-20 Illegal website detection method and device Pending CN112104765A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011311250.9A CN112104765A (en) 2020-11-20 2020-11-20 Illegal website detection method and device

Publications (1)

Publication Number Publication Date
CN112104765A true CN112104765A (en) 2020-12-18

Family

ID=73785500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011311250.9A Pending CN112104765A (en) 2020-11-20 2020-11-20 Illegal website detection method and device

Country Status (1)

Country Link
CN (1) CN112104765A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150358276A1 (en) * 2014-05-28 2015-12-10 International Business Machines Corporation Method, apparatus and system for resolving domain names in network
CN109510815A (en) * 2018-10-19 2019-03-22 杭州安恒信息技术股份有限公司 A kind of multistage detection method for phishing site and detection system based on supervised learning
CN110138758A (en) * 2019-05-05 2019-08-16 哈尔滨英赛克信息技术有限公司 Mistake based on domain name vocabulary plants domain name detection method
CN110247916A (en) * 2019-06-20 2019-09-17 四川长虹电器股份有限公司 Malice domain name detection method
CN110784462A (en) * 2019-10-23 2020-02-11 北京邮电大学 Three-layer phishing website detection system based on hybrid method
CN111291078A (en) * 2020-01-17 2020-06-16 武汉思普崚技术有限公司 Domain name matching detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201218