CN105099996B - Website verification method and device - Google Patents


Info

Publication number
CN105099996B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410182046.XA
Other languages
Chinese (zh)
Other versions
CN105099996A (en)
Inventor
何振科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qianxin Technology Co Ltd
Original Assignee
Qianxin Technology Group Co Ltd
Application filed by Qianxin Technology Group Co Ltd
Priority to CN201410182046.XA
Publication of CN105099996A
Application granted
Publication of CN105099996B

Abstract

The invention discloses a website verification method and a website verification device, wherein the method comprises the following steps: acquiring a keyword set included in a source code of a website to be verified; and querying a blacklist keyword corresponding relation library, and if at least two blacklist keywords appear in the keyword set and a corresponding relation exists between the two blacklist keywords, determining the website to be verified as a candidate blacklist website or directly determining the website to be verified as an illegal website. The method solves the problems of low efficiency and low accuracy of the existing illegal website identification method.

Description

Website verification method and device
Technical Field
The embodiment of the invention relates to the technical field of networks, in particular to a website verification method and device.
Background
An illegal website is a website intended for harmful uses such as pornography or gambling, or for other abnormal uses.
At present, illegal websites are identified mainly by network supervision and administration bodies manually checking the content of suspect websites. Manual identification, however, consumes a large amount of manpower and material resources and is very inefficient.
To improve identification efficiency, the prior art determines keywords for identifying illegal websites by analyzing the content semantics of existing illegal websites; for example, keywords such as "Liuhe lottery" usually appear on gambling websites. Websites are then sampled from a large number of websites and text mining is performed on them; if the number of occurrences of keywords such as "Liuhe lottery" reaches a preset threshold, the sampled website is judged to have a high probability of being illegal.
However, in the prior art the identifying keywords can only be determined from illegal websites that have already appeared, while some illegal websites, for example variants of existing illegal websites, are relatively well hidden among the large number of websites. Keywords determined from illegal websites that have already appeared cannot identify such variants, so the existing illegal website identification method has low identification accuracy.
Disclosure of Invention
The embodiment of the invention provides a website verification method and a website verification device, which are used for solving the problem that the existing illegal website identification method is low in identification accuracy.
In a first aspect, the present invention provides a website verification method, including:
acquiring a keyword set included in a source code of a website to be verified;
inquiring a blacklist keyword corresponding relation library, if at least two blacklist keywords appear in the keyword set and a corresponding relation exists between the two blacklist keywords, determining the website to be verified as a candidate blacklist website, wherein the candidate blacklist website represents that the website to be verified is an unknown website with a high risk probability;
the corresponding relation library of the blacklist keywords comprises a plurality of blacklist keyword groups, and each blacklist keyword group at least comprises two blacklist keywords with corresponding relations.
Optionally, the method further comprises:
forming a set by blacklist keywords included in the source code of each blacklist website in a blacklist website set to obtain a plurality of blacklist keyword sets;
analyzing the plurality of blacklist keyword sets by using a big data analysis technology, and if the times of a first blacklist keyword and a second blacklist keyword appearing in the plurality of blacklist keyword sets simultaneously exceed a preset time threshold, determining that a corresponding relation exists between the first blacklist keyword and the second blacklist keyword;
and storing the corresponding relation between the first blacklist keyword and the second blacklist keyword in the corresponding relation library of the blacklist keywords.
Optionally, after determining the website to be verified as the candidate blacklist website, the method includes:
acquiring a uniform resource locator of the website to be verified;
inquiring a white list website set, wherein the white list website set comprises verified uniform resource locators of a plurality of white list websites;
and judging whether the uniform resource locator of the website to be verified is in the white list website set, if so, determining that the website to be verified is the white list website, and otherwise, storing the website to be verified in the candidate blacklist website set.
Optionally, after the website to be verified is saved in the candidate blacklist website set, the method includes:
acquiring an access record of the candidate blacklist website set, wherein the access record comprises terminal identifications and corresponding access times of candidate blacklist websites in the candidate blacklist website set which are accessed within a preset time period;
performing clustering analysis on access records of the candidate blacklist website set according to a clustering algorithm, and dividing the candidate blacklist website set into a plurality of candidate blacklist website subsets;
and respectively determining the legality of the candidate blacklist website subsets according to a blacklist website set, wherein the blacklist website set comprises verified uniform resource locators of the blacklist websites.
Optionally, respectively determining the legitimacy of the plurality of candidate blacklist website subsets according to the set of blacklist websites, including:
respectively comparing the uniform resource locator of each candidate blacklist website in each candidate blacklist website subset with the uniform resource locators included in the blacklist website set;
and if the number of the uniform resource locators which are the same in the candidate blacklist website subset and the blacklist website set is greater than a preset threshold value, determining that the websites in the candidate blacklist website subset are illegal websites.
In a second aspect, the present invention provides a website verification apparatus, comprising:
an acquisition module, configured to acquire a keyword set included in a source code of a website to be verified, wherein the keyword set includes a plurality of keywords;
the determining module is used for inquiring a blacklist keyword corresponding relation library, and if at least two blacklist keywords appear in the keyword set and a corresponding relation exists between the two blacklist keywords, determining the website to be verified as a candidate blacklist website, wherein the candidate blacklist website represents that the website to be verified is an unknown website with a high risk probability;
the corresponding relation library of the blacklist keywords comprises a plurality of blacklist keyword groups, and each blacklist keyword group at least comprises two blacklist keywords with corresponding relations.
Optionally, the obtaining module is further configured to combine blacklist keywords included in the source code of each blacklist website in a set of blacklist websites into a set, so as to obtain a plurality of sets of blacklist keywords;
the determining module is further configured to analyze the plurality of blacklist keyword sets by using a big data analysis technology, and if the number of times that a first blacklist keyword and a second blacklist keyword appear in the plurality of blacklist keyword sets at the same time exceeds a preset number threshold, determine that a corresponding relationship exists between the first blacklist keyword and the second blacklist keyword;
the device further comprises:
and the storage module is used for storing the corresponding relation between the first blacklist keyword and the second blacklist keyword determined by the determination module in the corresponding relation library of the blacklist keywords.
Optionally, the obtaining module is further configured to obtain a uniform resource locator of the website to be verified;
the acquisition module is further configured to query a white list website set and acquire a uniform resource locator of each white list website in the white list website set;
the determining module is further configured to determine whether the uniform resource locator of the website to be verified matches the uniform resource locator of a website in the white list website set, determine that the website to be verified is a white list website if it matches, and store the website to be verified in the candidate blacklist website set if it does not match.
Optionally, the obtaining module is further configured to obtain an access record of the candidate blacklist website set, where the access record includes a terminal identifier of a candidate blacklist website in the candidate blacklist website set that has been accessed within a preset time period and a corresponding access frequency;
the device further comprises:
the analysis module is used for carrying out clustering analysis on the access records of the candidate blacklist website set according to a clustering algorithm and dividing the candidate blacklist website set into a plurality of candidate blacklist website subsets;
the determining module is further configured to determine the legitimacy of the plurality of candidate blacklist website subsets according to a blacklist website set, where the blacklist website set includes verified uniform resource locators of the plurality of blacklist websites.
Optionally, the determining module is specifically configured to:
respectively comparing the uniform resource locator of each candidate blacklist website in each candidate blacklist website subset with the uniform resource locators included in the blacklist website set;
and if the number of the uniform resource locators which are the same in the candidate blacklist website subset and the blacklist website set is greater than a preset threshold value, determining that the websites in the candidate blacklist website subset are illegal websites.
With the method of the embodiment of the invention, whether blacklist keywords having a corresponding relation exist in the keyword set of the website to be verified is determined according to the blacklist keyword corresponding relation library, and if so, the website to be verified is determined to be an unknown website with a high risk probability. The embodiment of the invention does not merely determine the keywords for identifying illegal websites from illegal websites that have already appeared (blacklist websites); it identifies unknown, high-risk websites according to the corresponding relations among the blacklist keywords, so that unknown variant websites with a high risk probability can be identified even among massive numbers of websites, which addresses the low identification accuracy of the existing illegal website identification method.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart illustrating a website verification method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a website verification apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The website verification method provided by the embodiment of the invention can be particularly applied to identification and analysis of hidden and variant illegal websites, and can be executed through a website verification device, and the website verification device can be a server (such as a server unknown to 360 websites).
Fig. 1 is a schematic flow chart of a website verification method according to an embodiment of the present invention, as shown in fig. 1, the method of the embodiment includes:
101. Acquiring a keyword set included in a source code of a website to be verified;
For example, the keyword set included in the source code of the website to be verified may be obtained by techniques such as web page extraction and text mining, where the source code is, for example, HyperText Markup Language (HTML) code.
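Step 101 can be illustrated with a minimal sketch. The patent does not specify how keywords are extracted, so everything here is an assumption: the helper name `extract_keyword_set`, the use of Python's standard `html.parser` to strip markup, and a plain regular-expression tokenizer in place of real word segmentation (Chinese-language pages would need a proper segmenter).

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects page text while skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_keyword_set(html_source, min_length=2):
    """Return the set of lowercase word tokens found in the page text.

    Hypothetical helper: a regex tokenizer stands in for the word
    segmentation and text mining mentioned in the description.
    """
    collector = _TextCollector()
    collector.feed(html_source)
    collector.close()
    text = " ".join(collector.chunks).lower()
    return {tok for tok in re.findall(r"\w+", text) if len(tok) >= min_length}


page = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Liuhe lottery odds</p><script>var x=1;</script></body></html>")
keywords = extract_keyword_set(page)
```

Returning a set rather than a list matches the later steps, which only ask whether a keyword is present, not how often it occurs.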
102. Querying a blacklist keyword corresponding relation library, and if at least two blacklist keywords appear in the keyword set and a corresponding relation exists between the two blacklist keywords, determining the website to be verified as a candidate blacklist website; the candidate blacklist website represents a website which is unknown and has a high risk probability;
the blacklist keyword corresponding relation library comprises a plurality of blacklist keyword groups, and each blacklist keyword group at least comprises two blacklist keywords with corresponding relations.
Optionally, as can be known by those skilled in the art, under the condition of a high security level, the corresponding relation library of the blacklist keywords is queried, and if at least two blacklist keywords appear in the keyword set and a corresponding relation exists between the two blacklist keywords, the website to be verified can also be directly determined as an illegal website.
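The decision in step 102 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the correspondence library is modeled as a set of two-element `frozenset` pairs, and the function names, the `high_security` flag, and the returned labels are hypothetical.

```python
from itertools import combinations


def find_corresponding_pairs(keyword_set, correspondence_library):
    """Return the blacklist keyword pairs from the library that both occur
    in the keyword set of the website to be verified."""
    blacklist_keywords = {kw for pair in correspondence_library for kw in pair}
    hits = sorted(keyword_set & blacklist_keywords)
    return [frozenset(p) for p in combinations(hits, 2)
            if frozenset(p) in correspondence_library]


def classify_website(keyword_set, correspondence_library, high_security=False):
    """Label a website according to step 102 (labels are illustrative)."""
    if not find_corresponding_pairs(keyword_set, correspondence_library):
        return "unflagged"
    # Under a high security level the site may be judged illegal directly.
    return "illegal" if high_security else "candidate_blacklist"


library = {frozenset({"jasmine", "liuhe"})}
result = classify_website({"jasmine", "liuhe", "news"}, library)
```

A single blacklist keyword is not enough to flag a site in this sketch; at least one stored pair must be fully present, which is the point of using correspondences rather than isolated keywords.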
In an alternative embodiment of the present invention, step 102 is preceded by:
forming a set by blacklist keywords included in the source code of each blacklist website in a blacklist website set to obtain a plurality of blacklist keyword sets;
analyzing the plurality of blacklist keyword sets by using a big data analysis technology, and if the times of a first blacklist keyword and a second blacklist keyword appearing in the plurality of blacklist keyword sets simultaneously exceed a preset time threshold, determining that a corresponding relation exists between the first blacklist keyword and the second blacklist keyword;
and storing the corresponding relation between the first blacklist keyword and the second blacklist keyword in the corresponding relation library of the blacklist keywords.
For example, a blacklist website set is preset in the embodiment of the present invention. The blacklist website set contains verified illegal websites, which may be obtained from public information on the network, provided by relevant departments, or obtained by existing web page extraction and text mining technologies.
For example, the server according to the embodiment of the present invention may dispatch a web crawler, also known as a web spider, to fetch the illegal websites in the blacklist website set, and the server performs word segmentation and semantic analysis on each fetched illegal website to obtain a plurality of blacklist keyword sets. It should be noted that a web crawler is a prior-art program for automatically extracting web pages and is not described in detail here.
For example, the big data analysis techniques described in the embodiments of the present invention include data mining tools such as Hadoop, High Performance Computing and Communications (HPCC), Storm, Apache Drill, and RapidMiner, which are not described in detail here. For example, suppose that after the plurality of blacklist keyword sets are analyzed, the number of times the two keywords "jasmine" and "Liuhe lottery" (a form of gambling) appear together in the blacklist keyword sets exceeds a preset threshold. It may then be determined that "jasmine" and "Liuhe lottery" are blacklist keywords having a corresponding relation, and that relation may be stored in the preset blacklist keyword corresponding relation library.
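The co-occurrence rule described above can be sketched on a single machine. The description refers to big data tools such as Hadoop; the version below is only a small in-memory stand-in using the standard library, and the function name, the sample keyword sets, and the threshold value are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_correspondence_library(blacklist_keyword_sets, count_threshold):
    """Return the keyword pairs that appear together in more than
    `count_threshold` of the blacklist keyword sets."""
    pair_counts = Counter()
    for keywords in blacklist_keyword_sets:
        # Count each unordered pair at most once per blacklist website.
        for pair in combinations(sorted(set(keywords)), 2):
            pair_counts[frozenset(pair)] += 1
    return {pair for pair, n in pair_counts.items() if n > count_threshold}


# One keyword set per verified blacklist website (sample data).
keyword_sets = [
    {"jasmine", "liuhe"},
    {"jasmine", "liuhe", "odds"},
    {"liuhe", "odds"},
]
library = build_correspondence_library(keyword_sets, count_threshold=1)
```

With this sample data, "jasmine"/"liuhe" and "liuhe"/"odds" each co-occur twice and are stored, while "jasmine"/"odds" co-occurs only once and is not.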
Thus, in the embodiment of the present invention, a blacklist keyword correspondence library is preset, which stores the corresponding relations between each group of blacklist keywords obtained by the big data analysis technology. Table 1 shows the contents of the blacklist keyword correspondence library applied in the embodiment of the present invention:
[Table 1: blacklist keyword correspondence library (image BDA0000499377540000071 in the original publication; contents not reproduced)]
In an alternative embodiment of the present invention, step 102 is followed by:
103. Judging whether the uniform resource locator of the website to be verified is in the white list website set; if so, executing step 104, and otherwise, executing step 105.
For example, a Uniform Resource Locator (URL) of the website to be verified is obtained, and a white list website set is queried, where the white list website set includes the Uniform resource locators of multiple verified white list websites; and judging whether the uniform resource locator of the website to be verified is in the white list website set.
104. And determining the website to be verified as a white list website.
If the uniform resource locator of the website to be verified is in the white list website set, the website to be verified is determined to be a white list website. For example, a news website is a website that has been verified as legal; when the news website carries news about illegal websites, blacklist keywords having a corresponding relation will also appear on it. By comparing the URL of the website to be verified with the URLs in the white list website set, the embodiment of the invention prevents a legal white list website from being mistakenly identified as an illegal website and thus improves the accuracy of website identification.
105. And storing the website to be verified into a candidate blacklist website set.
Through the steps, the confirmed websites to be verified which are not in the white list website set and have the corresponding relation of the blacklist keywords can be stored in the candidate blacklist website set so as to be convenient for further judgment in the following.
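Steps 103 to 105 amount to a set-membership test on URLs. The sketch below assumes a simple normalization (host lowercased, scheme ignored, trailing slash dropped) that the patent does not prescribe; `normalize_url`, `route_candidate`, and the example hosts are hypothetical.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to 'host/path' so that trivially different spellings
    of the same address compare equal (an assumed normalization)."""
    url = url.strip()
    if "//" not in url:
        url = "//" + url  # let urlsplit see a network location
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")


def route_candidate(url, whitelist, candidate_blacklist):
    """Step 103: check the white list; step 104: accept as white-listed;
    step 105: otherwise store the site in the candidate blacklist set."""
    key = normalize_url(url)
    if key in whitelist:
        return "whitelist"
    candidate_blacklist.add(key)
    return "candidate_blacklist"


whitelist = {normalize_url("https://news.example.com/")}
candidates = set()
first = route_candidate("http://News.Example.com", whitelist, candidates)
second = route_candidate("http://lottery.example.net/a/", whitelist, candidates)
```

Storing the white list as a set keeps each lookup constant-time on average, which matters when the check runs over a large number of flagged websites.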
In an alternative embodiment of the present invention, step 105 is followed by:
106. Acquiring access records of the candidate blacklist website set, wherein the access records comprise the identifiers of the terminals that accessed each candidate blacklist website in the candidate blacklist website set within a preset time period and the corresponding numbers of accesses;
For example, the access records of the candidate blacklist website set may be obtained from a domain name server or a recursive server, because the domain name server or recursive server stores a record of each access a terminal makes to a website.
107. Performing clustering analysis on access records of the candidate blacklist website set according to a clustering algorithm, and dividing the candidate blacklist website set into a plurality of candidate blacklist website subsets;
the clustering algorithm may specifically be a Latent Semantic Analysis (LSA) algorithm or a Probabilistic Latent Semantic Analysis (PLSA) algorithm, and the like, and performs clustering Analysis on access records of the candidate blacklist website set according to the clustering algorithm, so as to divide the candidate blacklist website set into a plurality of candidate blacklist website subsets, where each candidate blacklist website subset includes at least one website, and access behaviors of websites in the candidate blacklist website subsets have similarity.
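The grouping in step 107 can be illustrated with a deliberately simplified stand-in. The description names LSA and PLSA; the sketch below does not implement either. Instead it represents each candidate website as a vector of per-terminal access counts and merges websites whose cosine similarity reaches a threshold, which captures the same idea of grouping websites visited by similar terminals. The record layout `(terminal_id, site, count)`, the function names, and the 0.5 threshold are all assumptions.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse {terminal_id: count} vectors."""
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def cluster_sites(access_records, min_similarity=0.5):
    """Split candidate sites into subsets with similar visitor profiles."""
    vectors = defaultdict(dict)
    for terminal_id, site, count in access_records:
        vectors[site][terminal_id] = vectors[site].get(terminal_id, 0) + count

    sites = sorted(vectors)
    parent = {s: s for s in sites}

    def find(s):
        # Union-find root lookup with path halving.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for i, a in enumerate(sites):
        for b in sites[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= min_similarity:
                parent[find(a)] = find(b)

    groups = defaultdict(set)
    for s in sites:
        groups[find(s)].add(s)
    return list(groups.values())


records = [
    ("t1", "a.example", 5), ("t2", "a.example", 3),
    ("t1", "b.example", 4), ("t2", "b.example", 2),
    ("t9", "c.example", 7),
]
subsets = cluster_sites(records)
```

In the sample data, `a.example` and `b.example` are visited by the same two terminals and end up in one subset, while `c.example` has a disjoint visitor and forms its own subset. The pairwise comparison is quadratic in the number of websites, so it is suitable only for small illustrations.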
108. And respectively determining the legality of the candidate blacklist website subsets according to a blacklist website set, wherein the blacklist website set comprises verified uniform resource locators of the blacklist websites.
In an alternative embodiment of the present invention, step 108 is implemented by:
respectively comparing the uniform resource locator of each candidate blacklist website in each candidate blacklist website subset with the uniform resource locators included in the blacklist website set;
and if the number of the uniform resource locators which are the same in the candidate blacklist website subset and the blacklist website set is greater than a preset threshold value, determining that the websites in the candidate blacklist website subset are illegal websites.
For example, a blacklist website set is preset in the embodiment of the present invention; it contains verified illegal websites, which may be obtained from public information on the network, provided by relevant departments, or obtained by existing web page extraction and text mining technologies. Each candidate blacklist website subset obtained by the division is compared with the known blacklist website set; if a candidate blacklist website subset contains some or all of the illegal websites in the known blacklist website set, that subset can be determined to be a set of illegal websites, and each website in it is an illegal website.
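The comparison in step 108 can be sketched as an overlap count between each candidate subset and the known blacklist. The function name, the sample URLs, and the threshold value are assumptions; the "greater than a preset threshold" rule follows the description.

```python
def find_illegal_subsets(candidate_subsets, blacklist_urls, overlap_threshold):
    """Return the candidate subsets that share more than `overlap_threshold`
    uniform resource locators with the known blacklist website set."""
    flagged = []
    for subset in candidate_subsets:
        overlap = len(set(subset) & blacklist_urls)
        if overlap > overlap_threshold:
            # Every website in a flagged subset is treated as illegal.
            flagged.append(set(subset))
    return flagged


blacklist = {"a.example", "b.example"}
candidate_subsets = [{"a.example", "b.example", "c.example"}, {"d.example"}]
illegal = find_illegal_subsets(candidate_subsets, blacklist, overlap_threshold=1)
```

Here `c.example` is not on the known blacklist but is flagged because it clusters with two known blacklist websites, which is how previously unknown variants would be caught in this scheme.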
Owing to the particular nature of illegal websites, the internet user groups they attract tend to be relatively independent and small, and differ markedly from the user groups that legal websites attract. Internet users with a special interest in illegal websites necessarily show website access behavior different from that of user groups with other interests. That is, the potential association relationships among illegal websites are stronger, and illegal websites are largely independent of legal websites, so dividing the candidate blacklist website set according to the co-occurrence relationship between internet users and websites can effectively distinguish illegal websites from legal ones.
With the method of the embodiment of the invention, whether blacklist keywords having a corresponding relation exist in the keyword set of the website to be verified is determined according to the blacklist keyword corresponding relation library, and if so, the website to be verified is determined to be an unknown website with a high risk probability. The embodiment of the invention does not merely determine the keywords for identifying illegal websites from illegal websites that have already appeared (blacklist websites); it identifies unknown, high-risk websites according to the corresponding relations among the blacklist keywords, so that unknown variant websites with a high risk probability can be identified even among massive numbers of websites, which addresses the low identification accuracy of the existing illegal website identification method.
Further, according to the access records of the candidate blacklist website set, cluster analysis is performed with a clustering algorithm to divide the candidate blacklist website set into a plurality of subsets, and whether the websites in each subset are illegal websites is determined according to the known blacklist website set. Because the internet user groups that visit illegal websites differ markedly from those that visit legal websites, analyzing the potential association relationships between illegal websites distinguishes legal websites from illegal ones, which improves both the efficiency and the accuracy of illegal website identification.
Fig. 2 is a schematic structural diagram of a website verification apparatus according to an embodiment of the present invention, as shown in fig. 2, including:
the acquiring module 21 is configured to acquire a keyword set included in a source code of a website to be verified, where the keyword set includes multiple keywords;
a determining module 22, configured to query a blacklist keyword correspondence library, and if at least two blacklist keywords appear in the keyword set and a correspondence exists between the two blacklist keywords, determine the website to be verified as a candidate blacklist website, where the candidate blacklist website indicates that the website to be verified is an unknown website with a high risk probability;
the corresponding relation library of the blacklist keywords comprises a plurality of blacklist keyword groups, and each blacklist keyword group at least comprises two blacklist keywords with corresponding relations.
Wherein:
the obtaining module 21 is further configured to combine blacklist keywords included in the source code of each blacklist website in a set of blacklist websites into a set, so as to obtain a plurality of sets of blacklist keywords;
the determining module 22 is further configured to analyze the plurality of blacklist keyword sets by using a big data analysis technology, and if a number of times that a first blacklist keyword and a second blacklist keyword appear in the plurality of blacklist keyword sets at the same time exceeds a preset number threshold, determine that a corresponding relationship exists between the first blacklist keyword and the second blacklist keyword;
the device further comprises:
a storing module 23, configured to store the correspondence between the first blacklist keyword and the second blacklist keyword, which is determined by the determining module 22, in the blacklist keyword correspondence library.
Wherein:
the obtaining module 21 is further configured to obtain a uniform resource locator of the website to be verified;
the obtaining module 21 is further configured to query a white list website set, and obtain a uniform resource locator of each white list website in the white list website set;
the determining module 22 is further configured to determine whether the uniform resource locator of the website to be verified matches the uniform resource locator of a website in the white list website set, determine that the website to be verified is a white list website if it matches, and store the website to be verified in the candidate blacklist website set if it does not match.
Wherein:
the obtaining module 21 is further configured to obtain an access record of the candidate blacklist website set, where the access record includes a terminal identifier of a candidate blacklist website in the candidate blacklist website set that has been accessed within a preset time period and a corresponding access frequency;
the device further comprises:
the analysis module 24 is configured to perform cluster analysis on the access records of the candidate blacklist website set acquired by the acquisition module 21 according to a clustering algorithm, and divide the candidate blacklist website set into a plurality of candidate blacklist website subsets;
the determining module 22 is further configured to determine the legitimacy of the plurality of candidate blacklist website subsets according to a blacklist website set, where the blacklist website set includes verified uniform resource locators of the plurality of blacklist websites.
Wherein the determining module 22 is specifically configured to:
respectively comparing the uniform resource locator of each candidate blacklist website in each candidate blacklist website subset with the uniform resource locators included in the blacklist website set;
and if the number of the uniform resource locators which are the same in the candidate blacklist website subset and the blacklist website set is greater than a preset threshold value, determining that the websites in the candidate blacklist website subset are illegal websites.
According to the embodiment of the invention, whether blacklist keywords having a corresponding relation exist in the keyword set of the website to be verified is determined according to the blacklist keyword corresponding relation library, and if so, the website to be verified is determined to be an unknown website with a high risk probability. The embodiment of the invention does not merely determine the keywords for identifying illegal websites from illegal websites that have already appeared (blacklist websites); it identifies unknown, high-risk websites according to the corresponding relations among the blacklist keywords, so that unknown variant websites with a high risk probability can be identified even among massive numbers of websites, which addresses the low identification accuracy of the existing illegal website identification method.
Further, according to the access records of the candidate blacklist website set, a clustering algorithm is used to perform cluster analysis on the candidate blacklist website set and divide it into a plurality of subsets, and whether the websites in each subset are illegal websites is determined according to the known blacklist website set. Because of the particular nature of illegal websites, the groups of internet users who visit illegal websites differ markedly from those who visit legal websites. Analyzing this potential association among illegal websites makes it possible to distinguish legal websites from illegal websites, which improves both the efficiency and the accuracy of illegal website identification.
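The text does not name a specific clustering algorithm. The sketch below is therefore only one possible stand-in: it assumes each access record maps a terminal identifier to an access count, measures how similar two websites' visitor populations are with a weighted Jaccard score, and places websites linked by a chain of similar pairs into the same subset. The function names, the similarity measure, the threshold, and the sample records are all assumptions made for illustration.

```python
def weighted_jaccard(a, b):
    """Similarity of two {terminal_id: access_count} records, between 0.0 and 1.0."""
    terminals = set(a) | set(b)
    overlap = sum(min(a.get(t, 0), b.get(t, 0)) for t in terminals)
    total = sum(max(a.get(t, 0), b.get(t, 0)) for t in terminals)
    return overlap / total if total else 0.0


def cluster_sites(access_records, min_similarity):
    """Group websites so that websites linked by similar visitor records share a subset."""
    sites = sorted(access_records)
    parent = {s: s for s in sites}

    def find(s):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for i, first in enumerate(sites):
        for second in sites[i + 1:]:
            score = weighted_jaccard(access_records[first], access_records[second])
            if score >= min_similarity:
                parent[find(second)] = find(first)

    clusters = {}
    for s in sites:
        clusters.setdefault(find(s), set()).add(s)
    return sorted(clusters.values(), key=sorted)


# Hypothetical access records collected within a preset time period.
records = {
    "site-a": {"t1": 5, "t2": 3},
    "site-b": {"t1": 4, "t2": 3, "t3": 1},
    "site-c": {"t8": 2, "t9": 6},
    "site-d": {"t9": 5},
}
subsets = cluster_sites(records, min_similarity=0.5)
```

With these sample records, websites visited by the same terminals end up in the same subset, and each subset can then be compared against the known blacklist website set. The pairwise comparison is quadratic in the number of websites, so a production system would likely need a more scalable algorithm.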
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in an actual implementation. A plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium in the form of code. The code includes instructions for causing a processor or hardware circuitry to perform some or all of the steps of the methods described in the various embodiments of the invention. The aforementioned storage medium includes: a USB flash drive (a miniature high-capacity removable storage disk with a USB interface that requires no physical drive), a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A website verification method, comprising:
acquiring a keyword set included in a source code of a website to be verified;
querying a blacklist keyword corresponding relation library, and if at least two blacklist keywords appear in the keyword set and a corresponding relation exists between the two blacklist keywords, determining the website to be verified as a candidate blacklist website or directly determining the website to be verified as an illegal website;
the blacklist keyword corresponding relation library comprises a plurality of blacklist keyword groups, and each blacklist keyword group comprises at least two blacklist keywords with corresponding relations;
wherein the method further comprises:
forming a set by blacklist keywords included in the source code of each blacklist website in a blacklist website set to obtain a plurality of blacklist keyword sets;
analyzing the plurality of blacklist keyword sets, and if the number of times that a first blacklist keyword and a second blacklist keyword appear together in the same keyword set of the plurality of blacklist keyword sets exceeds a preset count threshold, determining that a corresponding relationship exists between the first blacklist keyword and the second blacklist keyword;
and storing the corresponding relation between the first blacklist keyword and the second blacklist keyword in the corresponding relation library of the blacklist keywords.
2. The method of claim 1, wherein the determining the website to be verified as a candidate blacklisted website comprises:
acquiring a uniform resource locator of the website to be verified;
querying a white list website set, wherein the white list website set comprises verified uniform resource locators of a plurality of white list websites;
and judging whether the uniform resource locator of the website to be verified is in the white list website set, if so, determining that the website to be verified is the white list website, and otherwise, storing the website to be verified in the candidate blacklist website set.
3. The method of claim 2, wherein, after saving the website to be verified to the set of candidate blacklisted websites, the method further comprises:
acquiring an access record of the candidate blacklist website set, wherein the access record comprises the terminal identifiers of the terminals that accessed the candidate blacklist websites in the candidate blacklist website set within a preset time period and the corresponding number of accesses;
performing clustering analysis on access records of the candidate blacklist website set according to a clustering algorithm, and dividing the candidate blacklist website set into a plurality of candidate blacklist website subsets;
and respectively determining the legality of the candidate blacklist website subsets according to a blacklist website set, wherein the blacklist website set comprises verified uniform resource locators of the blacklist websites.
4. The method of claim 3, wherein determining the legitimacy of each of the plurality of subsets of candidate blacklisted websites from the set of blacklisted websites comprises:
respectively comparing the uniform resource locator of each candidate blacklist website in each candidate blacklist website subset with the uniform resource locators included in the blacklist website set;
if the number of uniform resource locators which are the same in the candidate blacklist website subset and the blacklist website set is larger than a preset threshold value, all websites in the candidate blacklist website subset are determined to be illegal websites.
5. A website verification device, comprising:
an acquisition module, configured to acquire a keyword set included in a source code of a website to be verified, wherein the keyword set includes a plurality of keywords;
a determining module, configured to query a blacklist keyword corresponding relation library and, if at least two blacklist keywords appear in the keyword set and a corresponding relation exists between the two blacklist keywords, determine the website to be verified as a candidate blacklist website or directly determine the website to be verified as an illegal website;
the blacklist keyword corresponding relation library comprises a plurality of blacklist keyword groups, and each blacklist keyword group comprises at least two blacklist keywords with corresponding relations;
wherein:
the obtaining module is further configured to combine blacklist keywords included in the source code of each blacklist website in a set of blacklist websites into a set, so as to obtain a plurality of sets of blacklist keywords;
the determining module is further configured to analyze the plurality of blacklist keyword sets, and if the number of times that a first blacklist keyword and a second blacklist keyword appear in the same one of the plurality of blacklist keyword sets at the same time exceeds a preset number threshold, determine that a corresponding relationship exists between the first blacklist keyword and the second blacklist keyword;
the device further comprises:
a storage module, configured to store the corresponding relation between the first blacklist keyword and the second blacklist keyword determined by the determining module in the blacklist keyword corresponding relation library.
6. The apparatus of claim 5, wherein:
the acquisition module is further used for acquiring the uniform resource locator of the website to be verified;
the acquisition module is further configured to query a white list website set and acquire a uniform resource locator of each white list website in the white list website set;
the determining module is further configured to determine whether the uniform resource locator of the website to be verified matches the uniform resource locator of any website in the white list website set; if it matches, determine that the website to be verified is a white list website, and if it does not match, store the website to be verified in the candidate blacklist website set.
7. The apparatus of claim 6, wherein:
the acquisition module is further configured to acquire an access record of the candidate blacklist website set, where the access record includes the terminal identifiers of the terminals that accessed the candidate blacklist websites in the candidate blacklist website set within a preset time period and the corresponding number of accesses;
the device further comprises:
an analysis module, configured to perform cluster analysis on the access records of the candidate blacklist website set according to a clustering algorithm and divide the candidate blacklist website set into a plurality of candidate blacklist website subsets;
the determining module is further configured to determine validity of the plurality of candidate blacklist website subsets according to a blacklist website set, where the blacklist website set includes verified uniform resource locators of the plurality of blacklist websites.
8. The apparatus of claim 7, wherein the determining module is specifically configured to:
respectively comparing the uniform resource locator of each candidate blacklist website in each candidate blacklist website subset with the uniform resource locators included in the blacklist website set;
if the number of uniform resource locators which are the same in the candidate blacklist website subset and the blacklist website set is larger than a preset threshold value, all websites in the candidate blacklist website subset are determined to be illegal websites.
CN201410182046.XA 2014-04-30 2014-04-30 Website verification method and device Active CN105099996B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410182046.XA CN105099996B (en) 2014-04-30 2014-04-30 Website verification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410182046.XA CN105099996B (en) 2014-04-30 2014-04-30 Website verification method and device

Publications (2)

Publication Number Publication Date
CN105099996A CN105099996A (en) 2015-11-25
CN105099996B true CN105099996B (en) 2020-03-06

Family

ID=54579560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410182046.XA Active CN105099996B (en) 2014-04-30 2014-04-30 Website verification method and device

Country Status (1)

Country Link
CN (1) CN105099996B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105635126B (en) * 2015-12-24 2018-10-09 北京奇虎科技有限公司 Malice network address accesses means of defence, client, security server and system
CN108664584A (en) * 2018-05-07 2018-10-16 秦德玉 Infringement site search recognition methods and device
CN113127715A (en) * 2021-03-04 2021-07-16 微梦创科网络科技(中国)有限公司 Method and system for identifying gambling-related information

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079042A (en) * 2006-12-28 2007-11-28 腾讯科技(深圳)有限公司 System and method for quickly inquiring about black and white name list
CN101470731A (en) * 2007-12-26 2009-07-01 中国科学院自动化研究所 Personalized web page filtering method
CN102946390A (en) * 2012-11-02 2013-02-27 孙霁 Multimedia long-distance family education system and equipment of system
CN102999638A (en) * 2013-01-05 2013-03-27 南京邮电大学 Phishing website detection method excavated based on network group
CN103475669A (en) * 2013-09-25 2013-12-25 上海交通大学 Website credit blacklist generating method and system based on relational analysis

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008068655A2 (en) * 2006-12-08 2008-06-12 International Business Machines Corporation Privacy enhanced comparison of data sets
CN102523311B (en) * 2011-11-25 2014-08-06 中国科学院计算机网络信息中心 Illegal domain name recognition method and device
CN103049483B (en) * 2012-11-30 2016-04-20 北京奇虎科技有限公司 The recognition system of webpage danger

Also Published As

Publication number Publication date
CN105099996A (en) 2015-11-25

Similar Documents

Publication Publication Date Title
CN107251037B (en) Blacklist generation device, blacklist generation system, blacklist generation method, and recording medium
CN103559235B (en) A kind of online social networks malicious web pages detection recognition methods
CN107707545B (en) Abnormal webpage access fragment detection method, device, equipment and storage medium
US11848913B2 (en) Pattern-based malicious URL detection
CN108572990B (en) Information pushing method and device
CN111585955B (en) HTTP request abnormity detection method and system
CN102957664B (en) A kind of method and device identifying fishing website
US20150205951A1 (en) Systems and methods for sql query constraint solving
US11526586B2 (en) Copyright detection in videos based on channel context
CN108881138B (en) Webpage request identification method and device
CN108023868B (en) Malicious resource address detection method and device
CN112989348B (en) Attack detection method, model training method, device, server and storage medium
US20200272765A1 (en) Method and apparatus for detecting label data leakage channel
CN111104579A (en) Identification method and device for public network assets and storage medium
CN110572359A (en) Phishing webpage detection method based on machine learning
CN113779481B (en) Method, device, equipment and storage medium for identifying fraud websites
CN108337269A (en) A kind of WebShell detection methods
US8572073B1 (en) Spam detection for user-generated multimedia items based on appearance in popular queries
CN106372202B (en) Text similarity calculation method and device
CN112328936A (en) Website identification method, device and equipment and computer readable storage medium
CN110020161B (en) Data processing method, log processing method and terminal
CN105099996B (en) Website verification method and device
CN110619075A (en) Webpage identification method and equipment
CN108494728B (en) Method, device, equipment and medium for creating blacklist library for preventing traffic hijacking
CN108171053B (en) Rule discovery method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20161219

Address after: 100088 Jiuxianqiao Chaoyang District Beijing Road No. 10, building 15, floor 17, layer 1701-26, 3

Applicant after: BEIJING QI'ANXIN SCIENCE & TECHNOLOGY CO., LTD.

Address before: 100088 Beijing city Xicheng District xinjiekouwai Street 28, block D room 112 (Desheng Park)

Applicant before: Beijing Qihoo Technology Co., Ltd.

Applicant before: Qizhi Software (Beijing) Co., Ltd.

C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 32, Building 3, 102, 28 Xinjiekouwai Street, Xicheng District, Beijing

Applicant after: Qianxin Technology Group Co., Ltd.

Address before: Beijing Chaoyang District Jiuxianqiao Road 10, building 15, floor 17, layer 1701-26, 3

Applicant before: BEIJING QI'ANXIN SCIENCE & TECHNOLOGY CO., LTD.

GR01 Patent grant