CN102663000B

CN102663000B - Establishment method of malicious website database, and identification method and device of malicious website

Info

Publication number: CN102663000B
Application number: CN201210069443.7A
Authority: CN
Inventors: 梁知音
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2012-03-15
Filing date: 2012-03-15
Publication date: 2016-08-03
Anticipated expiration: 2032-03-15
Also published as: CN102663000A

Abstract

The invention provides a method for establishing a malicious website database, and a method and a device for identifying a malicious website. The establishment method includes: S1, constructing a site information association database; S2, constructing a reverse link association database; S3, acquiring known malicious websites, adding them to a queue to be detected, and repeatedly executing step S4 until the queue to be detected is empty, then constructing the malicious website database from all the data that appeared in the queue to be detected; S4, querying the reverse link association database, determining all reverse link urls of the current url, and adding the reverse link urls whose weights exceed a preset threshold to the queue to be detected; or, parsing the site attribute information of the current url, querying the site information association database, determining the website domain names that have the same site attribute information as the current url, and adding the website domain names whose weights exceed the preset threshold to the queue to be detected. Compared with the prior art, the invention improves the timeliness and accuracy of detection and reduces missed detections.

Description

Establishment method of malicious website database, and identification method and device of malicious website

[ technical field ]

The invention relates to the technical field of computer security, in particular to a method for establishing a malicious website database and a method and a device for identifying a malicious website.

[ background of the invention ]

With the continuous development of computer and network technologies, the internet has become more and more important and has penetrated into all aspects of people's work and life. However, as malicious behavior targeting the internet increases, various security problems trouble network users. At present, there are many websites on the internet that carry out malicious behavior such as fraud, and because the profit channels of these illegal for-profit websites are concealed, they threaten the security of users. These illegal websites have short life cycles and are usually banned or shut down once discovered. To keep operating, an illegal website operator usually holds a large group of similar sites that can be swapped in at any time. These site groups are closely related and have gradually become specialized, forming a huge black industry chain that is often called the "internet underground industry chain".

Existing means of detecting malicious websites include static feature detection and simulated browser detection. Static feature detection uses pre-collected malicious code features to determine whether the HTML (hypertext markup language) code of a web page contains those feature codes and, if so, determines that the website is malicious. The recognition rate of this method is usually low, and it is easily bypassed by script encryption and encoding. Simulated browser detection uses a pre-built browser environment to simulate a user visiting a website and identifies the website as malicious if illegal behavior features occur. This method is inefficient: when a malicious website is encountered the browser environment may need to be restored, and a complete and realistic browser environment is difficult to build, which easily leads to missed detections. For the pool of websites that an illegal website operator can swap in at any time, a judgment can be made only after each website is executed one by one, so malicious websites cannot be discovered in advance and timeliness is poor.

[ summary of the invention ]

In view of this, the invention provides a method for establishing a malicious website database, and a method and a device for identifying a malicious website, so as to improve the timeliness and accuracy of detection and reduce missed detections.

The specific technical scheme is as follows:

A method for establishing a malicious website database comprises the following steps:

S1, associating each website domain name with its corresponding site attribute information in advance, and constructing a site information association database;

S2, constructing a reverse link association database in advance, which stores the link relations among urls;

S3, acquiring urls of known malicious websites, adding them to a queue to be detected, taking urls out of the queue to be detected one by one, and executing step S4 on each current url taken out until the queue to be detected is empty, then constructing a malicious website database from all the urls or website domain names added to the queue to be detected;

S4, querying the reverse link association database, determining all reverse link urls of the current url, and adding to the queue to be detected the reverse link urls whose degree of association with the urls of the known malicious websites meets a preset requirement; or

parsing the site attribute information of the current url, querying the site information association database, determining the website domain names that have the same site attribute information as the current url, and adding to the queue to be detected the website domain names whose degree of association with the urls of the known malicious websites meets the preset requirement.

According to a preferred embodiment of the present invention, the site attribute information includes at least one of the following: website name, website owner contact information, company information, IP address information and ICP information.

According to a preferred embodiment of the present invention, step S3 further includes: assigning initial weights to the urls of the malicious websites, setting a reverse link factor between urls that have a reverse link relation, and setting an influence factor for each type of site attribute information shared between website domain names, wherein the reverse link factor and the influence factors each take values in the interval (0, 1);

the calculation of the degree of association between a reverse link url and the urls of the known malicious websites includes: multiplying the weight of the current url by the reverse link factor to obtain the weight of the reverse link url;

the calculation of the degree of association between a website domain name and the urls of the known malicious websites includes: multiplying the weight of the current url by the influence factor corresponding to the type of site attribute information that the website domain name shares with the current url, to obtain the weight of the website domain name;

the degree of association meets the preset requirement when the weight of the reverse link url or of the website domain name exceeds a preset threshold.

According to a preferred embodiment of the present invention, the malicious website database further includes: the site attribute information and the weight corresponding to each url or website domain name added to the queue to be detected.

A method for identifying a malicious website comprises the following steps:

acquiring a url to be detected, inquiring whether a malicious website database contains the url to be detected, and if so, determining that the url to be detected is a malicious website;

the malicious website database is established by adopting the establishment method of the malicious website database.

A method for identifying a malicious website comprises the following steps:

s201, acquiring a url to be detected, and analyzing the site attribute information of the url;

s202, searching a malicious website with the same attribute information as the url to be detected in a malicious website database by using the analyzed site attribute information, wherein the malicious website database is established by adopting the establishment method of the malicious website database;

s203, calculating the weight of the url to be detected by using the weight of the found malicious website;

s204, judging whether the weight value of the url to be detected exceeds a preset threshold value, and if so, identifying the url to be detected as a malicious url.

According to a preferred embodiment of the present invention, the step S203 specifically includes:

combining the weights of the malicious websites found in step S202 to obtain the weight of the url to be detected.

According to a preferred embodiment of the invention, the combination is a maximum, or an average, or a sum.
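Steps S201 to S204 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record layout `MaliciousRecord`, the function name `identify`, the default threshold of 0.7, and the representation of site attribute information as (type, value) pairs are all hypothetical. The text above does not say whether an influence factor is applied in step S203, so the sketch only combines the stored weights by maximum, average, or sum.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class MaliciousRecord:
    """One entry of the malicious website database (hypothetical layout)."""
    url: str
    weight: float
    attributes: frozenset  # (attribute_type, value) pairs, e.g. ("email", "x@example.com")


# The three combinations named in the text: maximum, average, sum.
COMBINERS = {"max": max, "average": mean, "sum": sum}


def identify(attributes, database, threshold=0.7, combine="max"):
    """Return (is_malicious, weight) for a url whose parsed attributes are given (S201)."""
    # S202: find malicious records that share at least one attribute with the url.
    matched = [rec.weight for rec in database if rec.attributes & attributes]
    if not matched:
        return False, 0.0
    # S203: combine the weights of the records found.
    weight = COMBINERS[combine](matched)
    # S204: compare the combined weight with the preset threshold.
    return weight > threshold, weight
```

A url whose attributes match no record is reported as not malicious with weight 0.0; that default is a choice made for this sketch only.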

An apparatus for establishing a malicious website database, the apparatus comprising:

the website information association module is used for associating each website domain name with corresponding website attribute information in advance and constructing a website information association database;

the reverse link association module is used for constructing a reverse link association database in advance and storing the link relation among the urls;

the database establishing module is used for acquiring urls of known malicious websites, adding the urls into a queue to be detected, taking out the urls from the queue to be detected one by one, providing the taken out current urls to the reverse link detection module or the site information detection module until the queue to be detected is empty, and establishing a malicious website database by using all the urls or website domain names added into the queue to be detected;

the reverse link detection module is used for inquiring the reverse link association database, determining all reverse link urls of the current urls provided by the database establishment module, and adding the reverse link urls of which the association degree with the urls of the known malicious websites meets the preset requirement into a queue to be detected;

and the site information detection module is used for analyzing the site attribute information of the current url, inquiring the site information association database, determining the website domain name with the same site attribute information as the current url provided by the database establishment module, and adding the website domain name with the association degree between the website domain name and the url of the known malicious website meeting the preset requirement into the queue to be detected.

According to a preferred embodiment of the present invention, the apparatus further comprises:

the factor setting module is used for setting a reverse link factor for the urls with reverse link relations and setting an influence factor for the type of the common site attribute information among the website domain names, and the value ranges of the reverse link factor and the influence factor are intervals (0, 1);

the database establishing module is also used for endowing the url of the malicious website with an initial weight;

the reverse link detection module multiplies the weight of the current url by a reverse link factor respectively to obtain the weight of each reverse link url, and the correlation degree between the reverse link url and the url of the known malicious website is reflected by the weight of the reverse link url;

the site information detection module multiplies the weight of the current url by the influence factor corresponding to the type of site attribute information that each website domain name shares with the current url to obtain the weight of that website domain name, and the degree of association between the website domain name and the urls of the known malicious websites is reflected by the weight of the website domain name.

An apparatus for identifying a malicious website, the apparatus comprising: the query judging module is used for acquiring a url to be detected, querying whether the malicious website database contains the url to be detected, and if so, determining that the url to be detected is a malicious website;

the malicious website database is established by adopting the establishment device of the malicious website database.

An apparatus for identifying a malicious website, the apparatus comprising:

the analysis module is used for acquiring the url to be detected and analyzing the site attribute information of the url;

the query module is used for searching a malicious website with the same attribute information as the url to be detected in a malicious website database by using the analyzed site attribute information, wherein the malicious website database is established by adopting an establishing device of the malicious website database;

the merging module is used for calculating the weight of the url to be detected by using the weight of the found malicious website;

and the judging module is used for judging whether the weight value of the url to be detected exceeds a preset threshold value, and if so, identifying the url to be detected as a malicious url.

According to a preferred embodiment of the present invention, the merging module is specifically configured to:

combining the weights of the malicious websites found by the query module to obtain the weight of the url to be detected.

As can be seen from the above technical scheme, the establishment method of the malicious website database and the identification method and device of the malicious website provided by the invention take into account the associations across the whole underground industry chain. Known malicious urls are expanded by using the association data of site attribute information between websites on the internet and the link relations between them, and the malicious website database is established based on the degree of association between the expanded urls and the known malicious urls. The identification method built on this database does not rely on malicious code features, so its detection accuracy is higher, and it can judge websites that have not yet been put into use without simulating execution in a browser environment. Detection timeliness and accuracy are thereby improved and missed detections are reduced.

[ description of the drawings ]

Fig. 1 is a flowchart of a method for establishing a malicious website database according to a first embodiment of the present invention;

Fig. 2 is a flowchart of a malicious website identification method according to a second embodiment of the present invention;

Fig. 3 is a schematic diagram of an apparatus for establishing a malicious website database according to a third embodiment of the present invention;

Fig. 4 is a schematic diagram of an apparatus for identifying a malicious website according to a fourth embodiment of the present invention.

[ detailed description ]

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

Embodiment 1

Fig. 1 is a flowchart of a method for establishing a malicious website database according to this embodiment, and as shown in fig. 1, the method includes:

Step S101, associating each website domain name with its corresponding site attribute information in advance, and constructing a site information association database.

A website typically includes a plurality of web pages, each with a corresponding web address, which is usually represented by a url (uniform resource locator), typically in the form access protocol + domain name. For example, the Baidu website includes many web pages, the url of the Baidu home page being "http://www.baidu.com" and the domain name being "baidu. Since a website domain name is unique, a website can be represented by its website domain name.

For a domain name, the registration information of the corresponding website can be queried with tools such as whois. Typically, the registration information includes the website name, the domain name applied for, the website owner, the contact information of the website owner (including organization name, person in charge of the organization, industry of the applying organization, mailing address, postal code, e-mail, telephone number, fax number, and authentication information), the host name and IP address of the domain name server, and the like.

In the underground industry chain, the same illegal website operator usually holds a plurality of malicious websites that form a group of similar sites, and these malicious websites usually share site attribute information; for example, they may have the same website owner or the same domain name server. The site group of an illegal website operator can therefore be discovered by using the association relations among site attribute information.

A site information association database is constructed in advance from the site attribute information of websites existing on the internet, so that the association relations among websites can be queried.

Specifically, when the site information association database is constructed, website registration information, including the website name, website owner, website owner contact information, company information, IP address information, and the like, is collected for websites existing on the internet with the whois tool. ICP (Internet content provider) information of each website, including company information, the website record number, the website name, the url of the website home page, and the like, is acquired with a web crawler or similar methods. This information is associated with the website domain name to form an association relation between the website domain name and the site attribute information, and the site information association database is thereby constructed.

The site information association database may be stored as an indexed table, although it is not limited to this form, and contains the association relation between each website domain name and its corresponding site attribute information, where the site attribute information includes the website name, website owner, website owner contact information, company information, IP address information, and the like.
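One possible in-memory shape for such a site information association database is an inverted index from each attribute value to the domain names that carry it. The sketch below is an assumption made for illustration: the class name `SiteInfoIndex`, its methods, and the attribute type labels such as "email" and "ip" do not come from the patent, and a real deployment would more likely use a database table with indexes.

```python
from collections import defaultdict


class SiteInfoIndex:
    """Inverted index from (attribute_type, value) to the domain names sharing it."""

    def __init__(self):
        self._by_domain = {}                   # domain -> {attribute_type: value}
        self._by_attribute = defaultdict(set)  # (attribute_type, value) -> {domain, ...}

    def add(self, domain, attributes):
        """Register a domain; this sketch assumes each domain is added only once."""
        self._by_domain[domain] = dict(attributes)
        for attr_type, value in attributes.items():
            self._by_attribute[(attr_type, value)].add(domain)

    def attributes_of(self, domain):
        """Return the stored attributes of a domain, or an empty dict if unknown."""
        return self._by_domain.get(domain, {})

    def related(self, domain):
        """Map every other domain to the set of attribute types it shares with `domain`."""
        shared = defaultdict(set)
        for attr_type, value in self.attributes_of(domain).items():
            for other in self._by_attribute[(attr_type, value)]:
                if other != domain:
                    shared[other].add(attr_type)
        return dict(shared)
```

Returning the shared attribute types, rather than only the related domains, keeps the information needed later to select an influence factor per association type.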

Step S102, constructing a reverse link association database in advance, which stores the link relations among urls.

A web page may contain a plurality of outgoing links that point to other web pages; correspondingly, a web page may also be pointed to by a plurality of other web pages through incoming links.

A reverse link, that is, an incoming link, is a link in another web page that brings a url into that page through a piece of source text or a path. Every web address whose page contains an incoming link to a url is a reverse link url of that url.

The reverse link association database is constructed from the link relations among the urls of web pages. Web page content is crawled with existing methods such as a web crawler, the link relations among the urls are stored, and the resulting reverse link association database is used to look up the reverse links of a url in the subsequent steps.
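Given the outgoing links collected by a crawler, the reverse link relation is obtained by inverting that mapping. The following sketch assumes the crawl result is already available as a dictionary from page url to the urls it links to; the function name `build_reverse_links` and the decision to ignore self links are illustrative choices, not details from the patent.

```python
from collections import defaultdict


def build_reverse_links(outgoing):
    """Invert a crawl result.

    outgoing: dict mapping a page url to an iterable of urls that page links to.
    Returns a dict mapping each linked url to the set of urls whose pages link to it.
    """
    reverse = defaultdict(set)
    for source, targets in outgoing.items():
        for target in targets:
            if target != source:  # a page linking to itself is not a useful reverse link
                reverse[target].add(source)
    return dict(reverse)
```

For example, if the pages at `u2` and `u3` both link to `u1`, then `build_reverse_links(...)["u1"]` is `{"u2", "u3"}`, which are the reverse link urls of `u1`.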

Step S103, setting different influence factors for different association relations.

Two websites are associated when they have the same site attribute information. Different association relations mean that the type of site attribute information shared between two websites differs. Because the type of shared site attribute information differs, the degree of association between the websites also differs. For example, websites registered with the same email address can largely be assumed to have the same registrant, while the same IP address indicates that the websites share a host IP.

Different influence factors are set for different association relations according to the type of site attribute information shared between website domain names. For example, an email factor with a fixed value of 0.9 is set for websites registered with the same email address, an IP factor with a fixed value of 0.8 is set for websites with the same IP address, and a reverse link factor with a fixed value of 0.8 is set for urls that have a reverse link relation. That is, influence factors are set for the types of site attribute information shared between website domain names, and a reverse link factor is set between urls that have a reverse link relation.

The influence factors include one factor per association type, such as the reverse link factor, email factor, IP factor, registered user name factor, registered company factor, ICP factor, and the like. The influence factor alpha of each type can be set according to existing empirical data, although it is not limited to this, where 0 < alpha < 1.
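The factor settings of step S103 amount to a small configuration table plus a range check. In the sketch below the three values are the example values given in this embodiment; the dictionary keys and the helper `validate_factors` are hypothetical names chosen for illustration.

```python
# Example values from this embodiment; the key names are illustrative.
INFLUENCE_FACTORS = {
    "reverse_link": 0.8,
    "email": 0.9,
    "ip": 0.8,
}


def validate_factors(factors):
    """Raise ValueError unless every factor alpha satisfies 0 < alpha < 1."""
    for name, alpha in factors.items():
        if not 0 < alpha < 1:
            raise ValueError(
                f"influence factor {name!r} must satisfy 0 < alpha < 1, got {alpha}"
            )
    return factors
```

Rejecting a factor of exactly 1 matters because the weights must strictly decrease with each expansion step for the queue to be detected to empty.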

Step S104, acquiring the urls of known malicious websites, adding them to a queue to be detected, taking urls out of the queue to be detected one by one, and executing step S105 on each current url taken out.

The known malicious websites may be websites identified by existing antivirus software or by a malicious website monitoring technology that is updated daily. These malicious websites are taken as input, each known malicious website is assigned an initial weight, and the websites are added to the queue to be detected. At this point, the queue to be detected contains each malicious website together with its initial weight.

The websites (urls) in the queue to be detected are taken out one by one for detection, and step S105 is executed on the current url taken out.

Step S105, querying the reverse link association database, determining all reverse link urls of the current url, and adding to the queue to be detected the reverse link urls whose degree of association with the urls of the known malicious websites meets the preset requirement.

The calculation of the degree of association between a reverse link url and the urls of the known malicious websites includes: multiplying the weight of the current url by the reverse link factor to obtain the weight of each reverse link url.

In this step, each reverse link url found has a reverse link relation with the current url, so the influence factor used is the reverse link factor.

For a known (already detected) malicious website, the weight used is its initial weight, namely 1. The weight of each reverse link url is obtained from the initial weight of the malicious website and the reverse link factor. If the reverse link factor is set to 0.8, the weight of each reverse link url is 0.8 × 1 = 0.8.

The degree of association meets the preset requirement when the weight of the reverse link url exceeds a preset threshold. Reverse link urls whose weights exceed the preset threshold are added to the queue to be detected. The preset threshold may be set according to practical experience; for example, if the preset threshold is set to 0.7, the reverse link urls whose weights exceed 0.7 are added to the queue to be detected together with their weights.
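Step S105 for a single current url can be sketched as below. The function name `expand_reverse_links`, the dictionary representation of the reverse link association database, and the sorted output order are assumptions for illustration; the factor 0.8 and threshold 0.7 are the example values from the text.

```python
def expand_reverse_links(current_url, current_weight, reverse_links,
                         reverse_link_factor=0.8, threshold=0.7):
    """Return [(url, weight), ...] for the reverse link urls to add to the queue.

    reverse_links: dict mapping a url to the collection of its reverse link urls.
    """
    # Every reverse link url of the current url receives the same weight.
    weight = current_weight * reverse_link_factor
    if weight <= threshold:
        return []  # the weight does not exceed the preset threshold, nothing is added
    return [(url, weight) for url in sorted(reverse_links.get(current_url, ()))]
```

With a current weight of 1 the reverse link urls receive 0.8 and are queued; with a current weight of 0.8 they would receive 0.64 and are dropped.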

Step S106, parsing the site attribute information of the current url, querying the site information association database, determining the website domain names that have the same site attribute information as the current url, and adding to the queue to be detected the website domain names whose degree of association with the urls of the known malicious websites meets the preset requirement.

The calculation of the degree of association between a website domain name and the urls of the known malicious websites includes: multiplying the weight of the current url by the influence factor corresponding to the type of site attribute information that the website domain name shares with the current url, to obtain the weight of the website domain name.

The degree of association meets the preset requirement when the weight of the website domain name exceeds the preset threshold.

Specifically, the corresponding influence factor is determined according to the type of site attribute information shared between each website domain name and the current url. The weight of the current url is multiplied by each corresponding influence factor to obtain the weight of each website domain name, and the website domain names whose weights exceed the preset threshold are added to the queue to be detected.

The website domain name corresponding to the current url is extracted, and the whois tool is used to obtain the site attribute information corresponding to the current url, including the website name, website owner email, company name, ICP number, and the like. This site attribute information is matched against the site information association database to find website domain names with the same attributes, the types of site attribute information shared between each of these website domain names and the current url are recorded, and the influence factors are determined.

Each influence factor is the one corresponding to the type of site attribute information shared between a website domain name and the current url. For example, if website domain name A has the same email address as the current url, the weight of website domain name A is the product of the weight of the current url and the email factor. If website domain name B has the same IP address as the current url, the weight of website domain name B is the product of the weight of the current url and the IP factor. The weight of each website domain name is calculated in the same way.

If a website domain name shares several types of site attribute information with the current url, for example both the same email address and the same registered user name, the maximum of the corresponding influence factors may be selected as the influence factor between the website domain name and the current url. Alternatively, different weighting coefficients that sum to 1 may be assigned to the different types of site attribute information, and when several types of site attribute information are the same, the coefficients corresponding to those types are weighted to determine the influence factor.
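The two ways of handling several shared attribute types can be sketched as follows. The first function takes the maximum factor, as stated in the text. The second is only one possible reading of the weighted alternative, which the text does not spell out: it assumes the influence factor is the sum of the coefficients of the shared types. All function names and numeric values in the usage note are hypothetical.

```python
def combined_factor_max(shared_types, factors):
    """Influence factor when several types are shared: the largest single factor."""
    return max(factors[t] for t in shared_types)


def combined_factor_weighted(shared_types, coefficients):
    """One possible reading of the weighted variant (an assumption of this sketch).

    coefficients: dict of attribute type -> coefficient, summing to 1 over all types.
    The factor is taken to be the sum of the coefficients of the shared types.
    """
    total = sum(coefficients.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"coefficients must sum to 1, got {total}")
    return sum(coefficients[t] for t in shared_types)
```

With illustrative factors of 0.9 for email and 0.7 for registered user name, `combined_factor_max` returns 0.9 when both are shared. Under the second reading, a domain sharing every attribute type would receive a factor of exactly 1, which falls outside the open interval required of influence factors, so a real system would need to cap or rescale the result.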

The website domain names whose weights exceed the preset threshold are added to the queue to be detected. The preset threshold is the same as in step S105.

It should be noted that the sequence of step S105 and step S106 may be exchanged, or only one of them may be adopted for detection.

Step S107, taking the next url or website domain name out of the queue to be detected and repeating step S105 and step S106 until the queue to be detected is empty, then constructing a malicious website database from all the urls or website domain names that appeared in the queue to be detected and their corresponding site attribute information.

A website domain name is a special case of a url: in the url library, the website domain name points to the home page of the website. The website domain name can therefore be converted into the url of the website home page, so that entries in the malicious website database are uniformly represented as urls.
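A minimal sketch of this normalization is shown below. The function name `to_home_page_url`, the default "http" access protocol, and the trailing slash are assumptions; the text only states that a website domain name can be converted into the url of the website home page.

```python
def to_home_page_url(entry, scheme="http"):
    """Return `entry` unchanged if it is already a url, else build a home page url.

    An entry is treated as a url when it contains an access protocol separator.
    """
    if "://" in entry:
        return entry
    return f"{scheme}://{entry}/"
```

Applying this to every entry taken from the queue to be detected gives the database a single key format for both urls and website domain names.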

Because each influence factor satisfies 0 < alpha < 1, the weights of the urls obtained by calculation become smaller and smaller as the steps are repeated, so the process converges. When the weights of all newly calculated urls are below the preset threshold, nothing new is added to the queue to be detected; once the queue to be detected is empty, the closure of a batch of related suspicious websites has been collected.
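The convergence argument can be made explicit with a short bound. The symbols below are notation introduced here for illustration: $w_0$ is the initial weight, $\alpha_i$ is the factor applied at the $i$-th expansion step, $\alpha_{\max}$ is the largest configured factor, $T$ is the preset threshold (with $0 < T < w_0$), and $k$ is the number of expansion steps separating an entry from a known malicious url. The numeric values are the example values used in this embodiment.

```latex
w_k = w_0 \prod_{i=1}^{k} \alpha_i \le w_0\,\alpha_{\max}^{k},
\qquad
w_k > T \;\Longrightarrow\; k < \frac{\ln (T / w_0)}{\ln \alpha_{\max}} .

% Example: w_0 = 1, \alpha_{\max} = 0.9 (email factor), T = 0.7
\frac{\ln 0.7}{\ln 0.9} \approx 3.39,
\qquad 0.9^{3} = 0.729 > 0.7,
\qquad 0.9^{4} = 0.6561 < 0.7 .
```

Under these example values no entry more than three expansion steps away from a known malicious url can exceed the threshold, so with finitely many associations per entry the queue to be detected eventually empties.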

All the urls or website domain names that appeared in the queue to be detected, together with their corresponding site attribute information and weights, are stored in a database to construct the malicious website database, which forms an underground industry chain database. The malicious website database may be stored as an indexed table, although it is not limited to this form, and includes the collected url information, email address information, domain name (domain) information, ICP information, IP address information, and the like.

For example, suppose the known malicious website obtained is url1. It is assigned an initial weight, for example 1, and added to the queue to be detected. A url, here url1, is taken out as the current url for analysis.

All reverse link urls corresponding to the malicious website url1 are looked up in the reverse link association database; suppose they are url2 and url3. The weight (that is, the initial weight) of the malicious website url1 is multiplied by the reverse link factor to give the weight of the reverse link urls url2 and url3. For example, if the reverse link factor is set to 0.8, the weight of url2 and url3 is 0.8 × 1 = 0.8. Reverse link urls whose weights exceed the preset threshold are added to the queue to be detected; if the preset threshold is 0.7, url2 and url3 are added.

The corresponding domain name, for example www.xxx123.com, is extracted from url1, and tools such as whois are used to obtain the site attribute information corresponding to url1, including the website name, website owner email, company name, IP address, ICP number, and the like. This site attribute information is matched against the site information association database to find website domain names with the same attributes, for example domain name 1 with the same email address and domain name 2 with the same IP address. The weights of domain name 1 and domain name 2 are then calculated. If the email factor is 0.9 and the IP factor is 0.8, the weight of domain name 1 is the product of the initial weight and the email factor, 0.9 × 1 = 0.9, and the weight of domain name 2 is the product of the initial weight and the IP factor, 0.8 × 1 = 0.8. Because the weights of domain name 1 and domain name 2 also exceed the preset threshold of 0.7, both are added to the queue to be detected.

The next url or web site domain name is retrieved and a duplicate check is performed assuming url2 is retrieved.

All anti-chain urls corresponding to url2 are found in the anti-chain association database using url2, which may include url4 and url5, for example. The weight of the url2 is multiplied by the set inverse chain factor of 0.8 to be used as the weight of the inverse chain url4 and url5, so that the weight of the url4 and the url5 is 0.8 × 0.8 — 0.64. And as the weights of the url4 and the url5 are both less than the preset threshold value of 0.7, the weights are not added into the queue to be detected.

The corresponding domain name is extracted from url2, and tools such as whois are used to obtain the site attribute information for url2. This information is matched against the site information association database to find website domain names sharing the same attribute, for example domain name 3 with the same email address and domain name 4 with the same registered company. The weight of domain name 3 is 0.8 × 0.9 = 0.72, and if the configured registrant factor is 0.8, the weight of domain name 4 is 0.8 × 0.8 = 0.64. Because the weight of domain name 3 exceeds the preset threshold of 0.7, domain name 3 is added to the queue to be detected; domain name 4 is not added, because its weight is below the threshold.

Steps S105 and S106 are repeated until the queue to be detected is empty. The information on url1, url2, url3, domain name 1, domain name 2, domain name 3, and so on, together with the corresponding weights, is then used to construct the malicious website database.
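The worked example above can be reproduced with a short queue-based sketch. The factor and threshold values come from the example; the two dictionaries are hypothetical toy stand-ins for the back-link association database and the site information association database, and the duplicate check is simplified to "skip anything already recorded".

```python
from collections import deque

THRESHOLD = 0.7          # preset threshold from the worked example
BACKLINK_FACTOR = 0.8    # back-link factor from the worked example
# Influence factor per shared attribute type (values from the worked example).
INFLUENCE = {"email": 0.9, "ip": 0.8, "company": 0.8}

# Hypothetical toy stand-ins for the two association databases.
BACKLINKS = {"url1": ["url2", "url3"], "url2": ["url4", "url5"]}
SHARED_SITES = {
    "url1": [("domain1", "email"), ("domain2", "ip")],
    "url2": [("domain3", "email"), ("domain4", "company")],
}


def build_malicious_db(seeds, initial_weight=1.0):
    """Propagate weights from known malicious urls until the queue is empty."""
    weights = {seed: initial_weight for seed in seeds}
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        w = weights[current]
        # Candidates reached through back-links and through shared attributes.
        candidates = [(u, w * BACKLINK_FACTOR) for u in BACKLINKS.get(current, [])]
        candidates += [(d, w * INFLUENCE[kind])
                       for d, kind in SHARED_SITES.get(current, [])]
        for node, weight in candidates:
            # Simplified duplicate check: skip anything already recorded.
            if weight > THRESHOLD and node not in weights:
                weights[node] = weight
                queue.append(node)
    return weights


db = build_malicious_db(["url1"])
print(sorted(db))
```

Starting from url1 alone, the resulting dictionary holds url1, url2, url3 and domain names 1 to 3, while url4, url5 and domain name 4 are dropped because their weight of 0.64 is below the threshold.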

The constructed malicious website database is used to detect whether an unknown url is malicious. In one approach, the url to be detected is acquired and the malicious website database is queried directly; if the database contains the url, it is determined to be a malicious website. A url that cannot be found directly in the malicious website database can still be identified using records that contain related information. The method for identifying a malicious website provided by the present invention is described below by way of embodiments.

Example II

Fig. 2 is a flowchart of a method for identifying a malicious website provided in this embodiment, and as shown in fig. 2, the method includes:

Step S201, acquiring the url to be detected, and analyzing the site attribute information of the url to be detected.

For the url to be detected, the corresponding domain name is extracted, and tools such as whois are used to obtain its site attribute information, including the website name, website owner email, company name, IP address, ICP number, and the like.
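As a minimal sketch of this step, the domain name can be pulled out of a url with Python's standard `urllib.parse`. The attribute lookup is only stubbed: `FAKE_WHOIS` and `site_attributes` are hypothetical names, and a real system would query whois or similar services instead of a dictionary.

```python
from urllib.parse import urlsplit


def extract_domain(url):
    """Return the host part of a url, or None if there is none."""
    # urlsplit only recognises the host when a scheme or a leading // is present.
    if "://" not in url:
        url = "//" + url
    return urlsplit(url).hostname


# Hypothetical stand-in for a whois-style lookup; the values are made up.
FAKE_WHOIS = {
    "www.xxx123.com": {"email": "owner@example.com", "ip": "203.0.113.7"},
}


def site_attributes(url):
    """Look up site attribute information for the domain of a url."""
    return FAKE_WHOIS.get(extract_domain(url), {})


print(extract_domain("http://www.xxx123.com/index.html"))
print(site_attributes("www.xxx123.com/login"))
```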

Step S202, searching a malicious website having the same attribute information as the url to be detected in a malicious website database by using the analyzed site attribute information, where the malicious website database is established by using the method according to the first embodiment.

Using the site attribute information of the url to be detected, the malicious urls whose records contain the same site attribute information are extracted from the malicious website database constructed in the first embodiment, yielding a batch of malicious urls associated with the url to be detected.

Step S203, calculating the weight of the url to be detected by using the weights of the found malicious websites.

The weights of the malicious websites found in step S202 are combined to obtain the weight of the url to be detected. The combination may take the maximum, the average, or the sum. Preferably, the maximum of the weights of the found malicious urls is selected as the weight of the url to be detected.
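The three combination options can be sketched as a single helper. The function name `combine_weights`, the mode strings, and the choice to return 0.0 when nothing was found are assumptions made for illustration.

```python
def combine_weights(weights, mode="max"):
    """Combine the weights of the matched malicious websites into one value."""
    if not weights:
        return 0.0  # assumption: no match means no evidence of maliciousness
    if mode == "max":
        return max(weights)
    if mode == "average":
        return sum(weights) / len(weights)
    if mode == "sum":
        return sum(weights)
    raise ValueError(f"unknown combination mode: {mode}")


matched = [0.9, 0.72, 0.8]
print(combine_weights(matched))             # preferred option: the maximum
print(combine_weights(matched, "average"))
print(combine_weights(matched, "sum"))
```

Note that the sum can exceed 1, so a threshold chosen for the maximum or the average would probably need to be revisited if the sum is used.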

Malicious urls that appear repeatedly can be weighted during the combination by applying a preset weighting factor. The more data sources independently flag a url as suspicious, the higher the suspicion that it is a malicious website.

Step S204, judging whether the weight of the url to be detected exceeds a preset threshold, and if so, identifying the url to be detected as a malicious url.

The preset threshold may be the same as that in step S105 and step S106 in the first embodiment, or may be another fixed value.

Therefore, for the unknown url, the established malicious website database can be used for judging whether the unknown url is a malicious website.
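Steps S202 to S204 can be put together in a few lines. This is a sketch under stated assumptions: the record layout of `MALICIOUS_DB`, the attribute values, and the function names are hypothetical, the maximum is used as the combination, and the threshold of 0.7 is borrowed from the first embodiment.

```python
THRESHOLD = 0.7  # assumed to match the threshold used when building the database

# Hypothetical malicious website database: one record per url or domain name,
# holding its weight and its site attribute information.
MALICIOUS_DB = [
    {"name": "url1", "weight": 1.0,
     "attrs": {"email": "a@example.com", "ip": "203.0.113.7"}},
    {"name": "domain3", "weight": 0.72,
     "attrs": {"email": "b@example.com"}},
]


def find_related(attrs):
    """S202: records sharing at least one attribute value with the candidate."""
    return [rec for rec in MALICIOUS_DB
            if any(rec["attrs"].get(key) == value for key, value in attrs.items())]


def is_malicious(attrs):
    """S203 and S204: combine by maximum, then compare with the threshold."""
    related = find_related(attrs)
    weight = max((rec["weight"] for rec in related), default=0.0)
    return weight > THRESHOLD, weight


print(is_malicious({"ip": "203.0.113.7"}))
print(is_malicious({"email": "nobody@example.org"}))
```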

The above is a detailed description of the method provided by the present invention, and the following is a detailed description of the device for establishing the malicious website database and the device for identifying the malicious website provided by the present invention.

EXAMPLE III

Fig. 3 is a schematic diagram of an apparatus for creating a malicious website database according to this embodiment. As shown in fig. 3, the apparatus includes:

The site information association module 301 is configured to associate each website domain name with its corresponding site attribute information in advance, and to construct a site information association database.

The site information association module 301 constructs a site information association database in advance by using site attribute information of websites existing on the internet, so as to query association relationships among the websites.
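One plausible way to support such association queries is an inverted index from (attribute type, value) pairs to domain names. The sketch below assumes that layout; the record contents and the function names are made up for illustration.

```python
from collections import defaultdict


def build_site_index(records):
    """Map each (attribute type, value) pair to the domains that carry it."""
    index = defaultdict(set)
    for domain, attrs in records.items():
        for attr_type, value in attrs.items():
            index[(attr_type, value)].add(domain)
    return index


def domains_sharing(index, domain, attrs):
    """Return other domains and the attribute types they share with `domain`."""
    shared = {}
    for attr_type, value in attrs.items():
        for other in index.get((attr_type, value), ()):
            if other != domain:
                shared.setdefault(other, set()).add(attr_type)
    return shared


# Hypothetical site attribute records.
records = {
    "www.xxx123.com": {"email": "owner@example.com", "ip": "203.0.113.7"},
    "domain1.example": {"email": "owner@example.com", "ip": "198.51.100.2"},
    "domain2.example": {"email": "other@example.com", "ip": "203.0.113.7"},
}
index = build_site_index(records)
print(domains_sharing(index, "www.xxx123.com", records["www.xxx123.com"]))
```

Keeping the shared attribute types alongside each matched domain makes it straightforward to pick the right influence factor later.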

The back-link association module 302 is configured to pre-construct a back-link association database that stores the link relationships between urls.

The back-link association module 302 constructs the back-link association database from the link relationships between the urls of web pages. Web page content is crawled with existing techniques such as a web crawler, and the link relationships between urls are stored to build the back-link association database, so that the back-links of a url can be looked up in subsequent steps.
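As a sketch of how crawled pages could be turned into such a database, the snippet below inverts the forward links found in each page using Python's standard `html.parser`. It assumes that a back-link of a url is a page containing a link to it, takes already-downloaded HTML as input rather than crawling, and uses hypothetical names and sample pages throughout.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def build_backlink_db(pages):
    """pages maps a page url to its HTML; returns target url -> linking pages."""
    backlinks = defaultdict(set)
    for page_url, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links against the page that contains them.
            backlinks[urljoin(page_url, href)].add(page_url)
    return backlinks


pages = {
    "http://a.example/": '<a href="http://evil.example/x">x</a>',
    "http://b.example/p": '<a href="/q">q</a><a href="http://evil.example/x">y</a>',
}
backlink_db = build_backlink_db(pages)
print(sorted(backlink_db["http://evil.example/x"]))
```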

The factor setting module 303 is configured to set a back-link factor between urls having a back-link relationship, and to set an influence factor for each type of site attribute information shared between website domain names.

The value range of the back-link factor and the influence factor is the interval (0, 1).

The factor setting module 303 sets different influence factors for different association relationships, according to the type of site attribute information shared between website domain names. For example, an email factor with a fixed value of 0.9 is set for websites registered with the same email address, an IP factor with a fixed value of 0.8 is set for websites with the same IP address, and a back-link factor with a fixed value of 0.8 is set between urls having a back-link relationship.

The database establishing module 304 is configured to obtain the urls of known malicious websites and add them to the queue to be detected, to take urls out of the queue one by one and provide each current url to the back-link detection module 305 or the site information detection module 306 until the queue to be detected is empty, and to establish the malicious website database from all the urls or website domain names that were added to the queue to be detected.

The websites (urls) in the queue to be detected are taken out one by one and examined by the back-link detection module 305 or the site information detection module 306.

The back-link detection module 305 is configured to query the back-link association database, determine all back-link urls of the current url provided by the database establishing module 304, and add to the queue to be detected those back-link urls whose degree of association with the urls of known malicious websites meets a preset requirement.

The back-link detection module 305 multiplies the weight of the current url by the back-link factor to obtain the weight of each back-link url; this weight expresses the degree of association between the back-link url and the url of the known malicious website. Back-link urls whose weight exceeds the preset threshold are added to the queue to be detected.

The preset threshold may be set empirically. For example, if it is set to 0.7, back-link urls whose weight exceeds 0.7 are added to the queue to be detected together with their weights.

The site information detection module 306 is configured to analyze the site attribute information of the current url, query the site information association database, determine a website domain name having the same site attribute information as the current url provided by the database establishment module 304, and add the website domain name whose association degree with the url of the known malicious website meets a preset requirement to the queue to be detected.

The site information detection module 306 determines the influence factor according to the type of site attribute information shared between each website domain name and the current url. The weight of the current url is multiplied by that influence factor to obtain the weight of the website domain name, which expresses the degree of association between the website domain name and the url of the known malicious website. Website domain names whose weight exceeds the preset threshold are added to the queue to be detected.

If the website domain name and the current url share several types of site attribute information, for example the same email address and the same registered user name, several influence factors apply. In that case the maximum of these influence factors may be selected as the influence factor between the website domain name and the current url. Alternatively, a coefficient may be assigned to each type of site attribute information such that the coefficients sum to 1; when several types of site attribute information are shared, the corresponding coefficients are combined by weighting to determine the influence factor.
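Both options can be sketched briefly. The text does not spell out the weighted variant, so the second function is only one possible reading: the influence factor is taken as the sum of the coefficients of the shared attribute types. The coefficient values and all names are hypothetical.

```python
# Influence factors per attribute type (email and registrant values follow the
# examples given earlier in the description).
INFLUENCE = {"email": 0.9, "ip": 0.8, "registrant": 0.8}

# Hypothetical coefficients per attribute type; they sum to 1.
COEFFICIENTS = {"email": 0.5, "ip": 0.3, "registrant": 0.2}


def factor_by_max(shared_types):
    """Option 1: take the largest influence factor among the shared types."""
    return max(INFLUENCE[t] for t in shared_types)


def factor_by_coefficients(shared_types):
    """Option 2 (one reading): add up the coefficients of the shared types."""
    return sum(COEFFICIENTS[t] for t in set(shared_types))


shared = ["email", "registrant"]
print(factor_by_max(shared))
print(factor_by_coefficients(shared))
```

Under this reading the second option rewards domains that match on more attribute types, whereas the maximum ignores how many types are shared.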

The database establishing module 304 then takes urls out of the queue to be detected one by one and triggers the back-link detection module 305 or the site information detection module 306 for each current url until the queue to be detected is empty, and establishes the malicious website database from all the urls or website domain names that were added to the queue to be detected.

The constructed malicious website database is used to detect whether an unknown url is malicious. One identification device may include a query judging module, configured to acquire the url to be detected, query whether the malicious website database contains it, and if so, determine that the url to be detected is a malicious website. A url that cannot be found directly in the malicious website database can still be identified using records that contain related information. The malicious website identification device provided by the present invention is described below in a fourth embodiment.

Fig. 4 is a schematic diagram of an apparatus for identifying a malicious website provided in this embodiment. As shown in fig. 4, the apparatus includes:

The analyzing module 401 is configured to obtain the url to be detected and analyze its site attribute information.

For the url to be detected, the analyzing module 401 extracts the corresponding domain name, and obtains the site attribute information of the url to be detected by querying with tools such as whois, including information such as a website name, a website owner email, a company name, an IP address, and an ICP number.

The query module 402 is configured to search, by using the analyzed site attribute information, a malicious website having the same attribute information as the url to be detected in a malicious website database, where the malicious website database is established by using the apparatus in the third embodiment.

Using the site attribute information of the url to be detected, the query module 402 extracts the malicious urls whose records contain the same site attribute information, yielding a batch of malicious urls associated with the url to be detected.

The merging module 403 is configured to calculate the weight of the url to be detected by using the weights of the malicious websites found by the query module 402.

The weights of the malicious websites found by the query module 402 are combined to obtain the weight of the url to be detected. The combination may take the maximum, the average, or the sum. Preferably, the maximum of the weights of the found malicious urls is selected as the weight of the url to be detected.

The determining module 404 is configured to determine whether the weight of the url to be detected exceeds a preset threshold, and if so, to identify the url to be detected as a malicious url.

For the unknown url, whether the unknown url is a malicious url can be judged by using the established malicious url database.

According to the method for establishing a malicious website database and the method and device for identifying a malicious website provided by the present invention, the associations across the whole underground industry chain are taken into account, and the malicious website database is established from association data on site attribute information between websites on the Internet. An unknown website can thus be judged without being executed, which improves the timeliness and accuracy of detection and reduces missed detections.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method for establishing a malicious website database is characterized by comprising the following steps:

S4, querying the back-link association database, determining all back-link urls of the current url, and adding the back-link urls whose weights exceed a preset threshold to the queue to be detected, wherein the weight of a back-link url is obtained by multiplying the weight of the current url by a back-link factor; or,

analyzing the site attribute information of the current url, inquiring the site information association database, determining the website domain name with the same site attribute information as the current url, and adding the website domain name with the weight value exceeding a preset threshold value into a queue to be detected, wherein the weight value of the website domain name is obtained by multiplying the weight value of the current url by an influence factor corresponding to the types of the website domain name and the common site attribute information of the current url.

2. The method of claim 1, wherein the site attribute information comprises at least one of: website name, website owner contact information, company information, IP address information and ICP information.

3. The method according to claim 1, wherein the step S3 further comprises: assigning an initial weight to the url of each malicious website, setting a back-link factor between urls having a back-link relationship, and setting an influence factor for each type of site attribute information shared between website domain names, wherein the value range of the back-link factor and the influence factor is the interval (0, 1);

if the webpage of one url contains a link to another url, the two urls have a back-link relationship.

4. The method of claim 3, wherein the malicious website database further comprises: the site attribute information and the weight corresponding to each url or website domain name added to the queue to be detected.

5. A method for identifying a malicious website is characterized by comprising the following steps:

wherein the malicious website database is established by adopting the method as claimed in any one of claims 1 to 4.

6. A method for identifying a malicious website is characterized by comprising the following steps:

S202, searching a malicious website with the same attribute information as the url to be detected in a malicious website database by using the analyzed site attribute information, wherein the malicious website database is established by adopting the method as claimed in claim 4;

7. The method according to claim 6, wherein the step S203 is specifically:

8. The method of claim 7, wherein the combining is taking a maximum value, or taking an average value, or summing.

9. An apparatus for establishing a malicious website database, the apparatus comprising:

the back-link detection module is used for querying the back-link association database, determining all back-link urls of the current url provided by the database establishing module, and adding the back-link urls whose weights exceed a preset threshold to the queue to be detected, wherein the weight of a back-link url is obtained by multiplying the weight of the current url by a back-link factor;

and the website information detection module is used for analyzing the website attribute information of the current url, inquiring the database associated with the website information, determining the website domain name with the same website attribute information as the current url provided by the database establishment module, and adding the website domain name with the weight value exceeding a preset threshold value into a queue to be detected, wherein the weight value of the website domain name is obtained by multiplying the weight value of the current url by an influence factor corresponding to the types of the website domain name and the website attribute information shared by the current url.

10. The apparatus of claim 9, wherein the site attribute information comprises at least one of: website name, website owner contact information, company information, IP address information and ICP information.

11. The apparatus of claim 9, further comprising:

12. The apparatus of claim 11, wherein the malicious website database further comprises: the site attribute information and the weight corresponding to each url or website domain name added to the queue to be detected.

13. An apparatus for identifying a malicious website, the apparatus comprising: the query judging module is used for acquiring a url to be detected, querying whether the malicious website database contains the url to be detected, and if so, determining that the url to be detected is a malicious website;

wherein the malicious website database is established by using the apparatus according to any one of claims 9 to 12.

14. An apparatus for identifying a malicious website, the apparatus comprising:

the query module is configured to search, by using the analyzed site attribute information, a malicious website having the same attribute information as the url to be detected in a malicious website database, where the malicious website database is established by using the apparatus according to claim 12;

15. The apparatus of claim 14, wherein the merging module is specifically configured to:

16. The apparatus of claim 15, wherein the combination calculation is taking a maximum value, or taking an average value, or summing.