CN110309402A - Method and system for detecting a website - Google Patents


Info

Publication number
CN110309402A
Authority
CN
China
Prior art keywords
website
detected
similarity
library
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810164312.4A
Other languages
Chinese (zh)
Inventor
庞玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201810164312.4A
Publication of CN110309402A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and system for detecting a website. The method comprises: determining the similarity between the web page structure of a website to be detected and that of a benchmark website; when the similarity is greater than a first preset value, judging whether a keyword of a specified type is present in the website to be detected; and, when a keyword of the specified type is determined to be present in the website to be detected, determining that the website to be detected is a website of the specified type. The invention solves the technical problem in the prior art that the accuracy of detecting whether a website is a violating website is low.

Description

Method and system for detecting a website
Technical field
The present invention relates to the field of network detection, and in particular to a method and system for detecting a website.
Background
With the rapid development of Internet technology, people are disturbed by a large amount of harmful information when accessing data on various websites; gambling and pornographic content are especially rampant. Identifying harmful information on websites is therefore a prerequisite for a clean ("green") network.
At present, gambling and pornographic network information is mainly detected in the following two ways:
(1) Identifying violating websites based on a lexicon of sensitive keywords. This method requires substantial manual effort to update the lexicon regularly, its recall depends on the sample lexicon of sensitive keywords, and it also produces a large number of false positives.
(2) Identifying violating websites based on image recognition. This method not only consumes a large amount of computing resources but also has a low recognition rate.
No effective solution has yet been proposed for the above problem that, in the prior art, the accuracy of detecting whether a website is a violating website is low.
Summary of the invention
Embodiments of the present invention provide a method and system for detecting a website, so as to at least solve the technical problem in the prior art that the accuracy of detecting whether a website is a violating website is low.
According to one aspect of the embodiments of the present invention, a method for detecting a website is provided, comprising: determining the similarity between the web page structure of a website to be detected and that of a benchmark website; when the similarity is greater than a first preset value, judging whether a keyword of a specified type is present in the website to be detected; and, when a keyword of the specified type is determined to be present in the website to be detected, determining that the website to be detected is a website of the specified type.
According to another aspect of the embodiments of the present invention, a method for detecting a website is further provided, comprising: obtaining data to be detected of a website to be detected; determining a first similarity between the data to be detected and the data in an abnormal website library, wherein the abnormal website library includes the web page structures of multiple abnormal websites; determining a second similarity between the data to be detected and the keywords in a sensitive-word library; and, if the first similarity is greater than a first threshold and the second similarity is greater than a second threshold, determining that the website to be detected is a website of a specified type.
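The two-threshold decision in this aspect can be illustrated with a minimal sketch. The function name and the default threshold values below are assumptions chosen for illustration; the text does not specify any concrete values or how the two similarities are computed.

```python
def is_specified_type(first_similarity: float,
                      second_similarity: float,
                      first_threshold: float = 0.8,    # assumed value, not from the text
                      second_threshold: float = 0.5    # assumed value, not from the text
                      ) -> bool:
    """Return True only when both similarities strictly exceed their thresholds.

    first_similarity:  similarity to the page structures in the abnormal website library
    second_similarity: similarity to the keywords in the sensitive-word library
    """
    return (first_similarity > first_threshold
            and second_similarity > second_threshold)
```

Because both comparisons are strict, a similarity exactly equal to its threshold does not trigger a detection, which matches the "greater than" wording of the text.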
According to another aspect of the embodiments of the present invention, a method for detecting a website is further provided, comprising: receiving data information of a website to be detected; evaluating the data information of the website to be detected based on multiple abnormality detection libraries to obtain risk values of the website to be detected, wherein different abnormality detection libraries correspond to different judgment rules, and each judgment rule is used to determine the risk value of the website to be detected under the corresponding abnormality detection library; and determining the website type of the website to be detected based on its risk values.
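As a rough sketch of this aspect, each abnormality detection library can be modelled as a judgment rule that maps the site's data to a risk value. Everything concrete here is an assumption made for illustration: the rule names, the 0/1 risk values, the field names in the data dictionary, the summation of risk values, and the final threshold. The text only states that each library has its own rule and that the website type follows from the risk values.

```python
from typing import Any, Callable, Dict

SiteData = Dict[str, Any]
Rule = Callable[[SiteData], float]

# Hypothetical judgment rules, one per abnormality detection library.
def structure_rule(data: SiteData) -> float:
    return 1.0 if data.get("structure_similarity", 0.0) > 0.8 else 0.0

def keyword_rule(data: SiteData) -> float:
    return 1.0 if data.get("sensitive_keywords") else 0.0

def domain_rule(data: SiteData) -> float:
    return 1.0 if data.get("domain_similarity", 0.0) > 0.7 else 0.0

RULES: Dict[str, Rule] = {
    "structure_library": structure_rule,
    "keyword_library": keyword_rule,
    "domain_library": domain_rule,
}

def risk_values(data: SiteData) -> Dict[str, float]:
    """Risk value of the site under each library's judgment rule."""
    return {name: rule(data) for name, rule in RULES.items()}

def classify(data: SiteData, risk_threshold: float = 2.0) -> str:
    """Assumed aggregation: sum the per-library risks and compare to a threshold."""
    total = sum(risk_values(data).values())
    return "abnormal" if total >= risk_threshold else "normal"
```

With these assumed rules, a site that matches on both page structure and keywords is classified as "abnormal", while a match on page structure alone is not.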
According to another aspect of the embodiments of the present invention, a system for detecting a website is further provided, comprising: an input device, configured to obtain a website to be detected; and a processor, configured to determine the similarity between the web page structure of the website to be detected and that of a benchmark website and, when the similarity is greater than a first preset value and a keyword of a specified type is determined to be present in the website to be detected, to determine that the website to be detected is a website of the specified type.
According to another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the method for detecting a website.
According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is configured to run a program, wherein the method for detecting a website is executed when the program runs.
According to another aspect of the embodiments of the present invention, a system for detecting a website is further provided, comprising: a processor; and a memory, connected to the processor and configured to provide the processor with instructions for the following processing steps: determining the similarity between the web page structure of a website to be detected and that of a benchmark website; when the similarity is greater than a first preset value, judging whether a keyword of a specified type is present in the website to be detected; and, when a keyword of the specified type is determined to be present in the website to be detected, determining that the website to be detected is a website of the specified type.
In the embodiments of the present invention, a detection approach based on website structure is adopted. The similarity between the web page structure of the website to be detected and that of a benchmark website is determined; when the similarity is greater than the first preset value, it is judged whether a keyword of a specified type is present in the website to be detected; finally, when a keyword of the specified type is determined to be present, the website to be detected is determined to be a website of the specified type. This achieves the purpose of improving the efficiency of detecting violating websites, realizes the technical effect of avoiding the low recall rate and high false-positive rate caused by using keyword detection alone, and thereby solves the technical problem in the prior art that the accuracy of detecting whether a website is a violating website is low.
Brief description of the drawings
The drawings described here are provided for further understanding of the present invention and constitute a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a schematic structural diagram of a system for detecting a website according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for detecting a website according to an embodiment of the present invention;
Fig. 3 is a flowchart of a method for detecting a website according to an embodiment of the present invention;
Fig. 4(a) is a flowchart of an optional method for detecting a website according to an embodiment of the present invention;
Fig. 4(b) is a flowchart of an optional method for detecting a website according to an embodiment of the present invention;
Fig. 5 is a flowchart of an optional way of building an abnormality detection library according to an embodiment of the present invention;
Fig. 6 is a flowchart of a method for detecting a website according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an apparatus for detecting a website according to an embodiment of the present invention; and
Fig. 8 is a hardware block diagram of a terminal according to an embodiment of the present invention.
Detailed description
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product or device.
First, some of the nouns and terms that appear in the description of the embodiments of the present application are explained as follows:
(1) Clustering, i.e. cluster analysis, is a statistical analysis technique and data mining algorithm for studying the classification of samples or indicators. It is, for example, the analytic process of grouping a set of physical or abstract objects into multiple classes made up of similar objects. A cluster consists of several patterns; in general, a pattern is a vector of measurements or a point in a multidimensional space.
(2) DOM (Document Object Model) is the standard programming interface recommended by the World Wide Web Consortium for processing extensible markup languages. On a web page, the objects that make up the page or document are organized in a tree structure, i.e. the DOM tree.
Embodiment 1
According to an embodiment of the present invention, a system for detecting a website is provided. It should be noted that the system proposed by the present application can be applied to network detection. For example, to prevent minors in a family from accessing unhealthy websites, parents can use the website detection system provided by the present application. When a child at home accesses a website through a computer, the system matches the web page structure of that website against those of violating websites (for example, gambling sites and pornographic sites), obtains the maximum of the similarities with the multiple violating websites, and further determines whether this maximum similarity is greater than a preset similarity. If it is, the system further detects whether the website contains sensitive keywords; if the website contains a sensitive keyword, the website is determined to be a violating website of the type corresponding to that keyword. After the website is determined to be a violating website, the system controls the computer to shut down, or generates a warning on the visited website to remind the visitor that the website is a violating website.
As can be seen from the above, the system proposed by the present application can perform multi-dimensional detection on the website to be detected. Multi-dimensional detection includes, but is not limited to, detecting the website to be detected along dimensions such as web page structure and sensitive keywords, so as to determine whether the website to be detected is a violating website (for example, a gambling site or a pornographic site) and thereby purify the network environment.
Specifically, as shown in Fig. 1, the system for detecting a website provided by the present application mainly includes an input device 10 and a processor 20. The input device 10 is configured to obtain the website to be detected. The processor 20 is configured to determine the similarity between the web page structure of the website to be detected and that of a benchmark website and, when the similarity is greater than a first preset value, to determine that the website to be detected is a website of a specified type if a keyword of the specified type is present in the website to be detected.
It should be noted that the data to be detected of the website to be detected can be obtained, and the website type of the website to be detected can be determined from that data. The data to be detected may be, but is not limited to, the web page structure of the website to be detected, the keywords contained in the website to be detected, and so on. The benchmark website is a website in the abnormal website library, which contains multiple violating websites. The keyword of the specified type is a sensitive keyword that can identify the website type; for example, "gambling house", "gambling" and "mahjong" can identify a website as a gambling site. The specified type is the type of the website, for example a normal website, a gambling site or a pornographic site.
In an optional embodiment, the user enters a URL through the input device to specify the website to be accessed, i.e. the website to be detected. The processor connected to the input device obtains the data to be detected of the website to be detected, derives the web page structure of the website to be detected from that data, and matches it against the web page structure of each benchmark website in the abnormal website library to obtain multiple similarities. The largest of these similarities is taken as the similarity between the website to be detected and the benchmark website. If this similarity is less than or equal to the first preset value, the website to be detected does not match a violating website, and no further detection is performed on it. If the similarity is greater than the first preset value, the website to be detected may be a violating website, so it needs further detection by keyword. Specifically, the keywords in the website to be detected are extracted and compared with the keywords of the specified type to see whether they match. If they match, the website to be detected is determined to contain a keyword of the specified type, and the website to be detected is therefore determined to be a website of the specified type corresponding to that keyword. For example, if matching determines that the website to be detected contains the keyword "gambling", the website to be detected is determined to be a gambling site.
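The two-stage flow just described can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the default first preset value, and the dictionary that maps sensitive keywords to website types are all assumptions, and the structure similarities are taken as already computed.

```python
from typing import Dict, Iterable, Optional

def detect(structure_similarities: Iterable[float],
           site_keywords: Iterable[str],
           typed_keywords: Dict[str, str],
           first_preset: float = 0.8) -> Optional[str]:
    """Return the detected website type, or None if no violation is detected.

    structure_similarities: similarity of the site to each benchmark website
    typed_keywords: hypothetical map from sensitive keyword to website type
    first_preset: assumed value for the first preset value
    """
    # Stage 1: the largest similarity represents the site's similarity to the library.
    best = max(structure_similarities, default=0.0)
    if best <= first_preset:
        return None  # no structural match, so keyword detection is skipped

    # Stage 2: exact keyword match against the specified-type keywords.
    for keyword in site_keywords:
        site_type = typed_keywords.get(keyword)
        if site_type is not None:
            return site_type
    return None
```

For instance, `detect([0.3, 0.95, 0.6], ["poker", "gambling"], {"gambling": "gambling site"})` passes the structure stage with 0.95 and then returns "gambling site" on the keyword match.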
As can be seen from the above, the website to be detected is obtained through the input device; the processor determines the similarity between the web page structure of the website to be detected and that of the benchmark website and, when the similarity is greater than the first preset value, determines that the website to be detected is a website of the specified type if a keyword of the specified type is present in it.
It is easy to see that the website to be detected is checked along two dimensions, namely the similarity between its web page structure and that of the benchmark website, and whether a keyword of the specified type is present in it, rather than against a sensitive-word library alone. This effectively avoids the low recall rate and high false-positive rate that result from detecting websites with keywords only.
As can be seen from the above, the system for detecting a website provided by the present application can achieve the purpose of improving the efficiency of detecting violating websites, realize the technical effect of avoiding the low recall rate and high false-positive rate caused by using keyword detection alone, and thereby solve the technical problem in the prior art that the accuracy of detecting whether a website is a violating website is low.
In an optional embodiment, the system for detecting a website provided by the present application further includes a memory, configured to store the website to be detected once it has been determined to be a website of the specified type. Specifically, after the website to be detected is determined to be a website of the specified type (for example, a gambling site), it is stored in the abnormal website library, which increases the number of benchmark websites in the library and helps ensure accurate detection results. In addition, because Internet technology develops quickly, the benchmark websites in an existing abnormal website library may no longer be suitable for detecting current violating websites, so the abnormal website library needs to be updated. By storing each newly detected violating website in the abnormal website library, the memory achieves the purpose of updating the library.
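The library update described above might look like the following in-memory sketch. The class name, method names, and the choice to key entries by URL are assumptions for illustration; the text does not say how the abnormal website library is stored.

```python
from typing import Dict, FrozenSet, Iterable, List

class AbnormalSiteLibrary:
    """Hypothetical in-memory library: site URL -> set of page-structure paths."""

    def __init__(self) -> None:
        self._structures: Dict[str, FrozenSet[str]] = {}

    def add_detected_site(self, url: str, tree_paths: Iterable[str]) -> None:
        # Keying by URL means re-detecting the same site refreshes its entry
        # instead of adding a duplicate benchmark.
        self._structures[url] = frozenset(tree_paths)

    def benchmark_structures(self) -> List[FrozenSet[str]]:
        """Page structures of all benchmark websites currently in the library."""
        return list(self._structures.values())

    def __len__(self) -> int:
        return len(self._structures)
```

Each confirmed violating site is added through `add_detected_site`, so later detections compare against a library that grows with newly observed page structures.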
In an optional embodiment, the processor is further configured to obtain the DOM tree of the website to be detected; decompose the DOM tree to obtain a tree path set; and determine the similarity between the web page structure of the website to be detected and that of the benchmark website according to the tree path set.
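One way to picture the decomposition is to record, for every element, the chain of tag names from the root down to it. The sketch below does this with Python's standard `html.parser` and compares two path sets with the Jaccard index. Both choices are assumptions: the text does not say how the DOM tree is decomposed or which similarity measure is applied to the tree path sets.

```python
from html.parser import HTMLParser
from typing import List, Set

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class PathCollector(HTMLParser):
    """Collects root-to-element tag paths such as 'html/body/div'."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []
        self.paths: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

def tree_paths(html: str) -> Set[str]:
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Assumed similarity measure: |intersection| / |union| of the two path sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Two pages that share `html`, `html/body` and `html/body/div` but differ in one leaf element each have four paths, three of them shared, giving a Jaccard similarity of 3/5.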
In an optional embodiment, the processor is further configured to extract the keywords in the tree path set; compare the keywords in the tree path set with the keywords of the specified type to obtain a similarity; and, when the similarity is greater than a similarity threshold, determine that a keyword of the specified type is present in the website to be detected.
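A minimal sketch of the keyword comparison follows, assuming the keywords have already been extracted. The use of `difflib.SequenceMatcher` as the string similarity and the default threshold are assumptions; the text does not name a similarity measure or a threshold value.

```python
from difflib import SequenceMatcher
from typing import List

def keyword_similarity(a: str, b: str) -> float:
    """Assumed measure: difflib's ratio on lower-cased strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def has_typed_keyword(site_keywords: List[str],
                      typed_keywords: List[str],
                      threshold: float = 0.8) -> bool:
    """True if any site keyword is similar enough to any specified-type keyword."""
    return any(keyword_similarity(site_kw, typed_kw) > threshold
               for site_kw in site_keywords
               for typed_kw in typed_keywords)
```

Comparing by similarity rather than exact equality lets small spelling variations of a sensitive keyword still count as a match, at the cost of tuning the threshold.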
It should be noted that the system for detecting a website provided by the present application can also detect the website to be detected based on domain name information. It mainly determines whether the website to be detected is a website of the specified type from the similarity between the domain name of the website to be detected and that of the benchmark website, and/or from how the domain name price of the website to be detected compares with a preset domain name price. Specifically, when the similarity is less than or equal to the first preset value and greater than a second preset value, the processor judges whether the similarity between the domain name of the website to be detected and the domain name of the benchmark website is greater than a third preset value, and/or whether the domain name price of the website to be detected is less than a fourth preset value. When the domain name similarity is determined to be greater than the third preset value, and/or the domain name price is less than the fourth preset value, the processor judges whether a keyword of the specified type is present in the website to be detected; when such a keyword is determined to be present, the website to be detected is determined to be a website of the specified type. For example, if the processor determines that the similarity between the domain name of the website to be detected and the domain name of a violating benchmark website is greater than the third preset value, and the domain name price of the website to be detected is less than the fourth preset value, the website to be detected is determined to be a violating website, i.e. an abnormal website.
It should also be noted that the system for detecting a website provided by the present application can detect not only along the three dimensions of the sensitive-word library, the abnormality detection library and the domain name information library, but also along other dimensions, which are not described here.
Embodiment 2
According to an embodiment of the present invention, a method embodiment for detecting a website is further provided. It should be noted that the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in a different order. The method embodiment provided by the present application can be executed in the system for detecting a website of Embodiment 1. Fig. 2 shows a flowchart of a method for detecting a website; as shown in Fig. 2, the method specifically includes the following steps:
Step S202: determine the similarity between the web page structure of the website to be detected and that of the benchmark website.
It should be noted that the benchmark website is a website in the abnormality detection library, and all websites in the abnormality detection library are violating websites, for example gambling sites and pornographic sites. The website type of the website to be detected can be determined by comparing its data to be detected with the data of the benchmark website. The data to be detected may be, but is not limited to, the web page structure of the website to be detected, the keywords it contains, its domain name information, and so on.
In an optional embodiment, a user accesses a website (i.e. the website to be detected) through a client. The website detection system, which communicates with the client, obtains the request information with which the client accesses the website, and obtains the feedback information returned by the website to be detected according to that request information. The system then parses the feedback information to obtain the data to be detected, such as the web page structure of the website to be detected. The system processes the web page of the website to be detected to obtain a tree path set, and obtains the web page structure of the benchmark website from the abnormality detection library; the web page structure of the benchmark website can likewise be expressed as a tree path set. The similarity between the two is obtained by comparing the tree path set of the website to be detected with the tree path set of the web page structure of the benchmark website.
It should also be noted that the similarity between the web page structure of the website to be detected and that of the benchmark website is the highest of the similarities between the web page structure of the website to be detected and those of the benchmark websites in the abnormality detection library. For example, suppose the benchmark websites include three web pages A, B and C, and the similarities between the web page of the website to be detected and these pages are A1, B1 and C1 respectively, with A1 < C1 < B1. Then B1 is taken as the similarity between the web page structure of the website to be detected and that of the benchmark website.
Step S204: when the similarity is greater than the first preset value, judge whether a keyword of the specified type is present in the website to be detected.
It should be noted that the keyword of the specified type is a keyword in the sensitive-word library. If the similarity between the web page structure of the website to be detected and that of the benchmark website is determined to be greater than the first preset value, the website to be detected may be a violating website. To further determine whether it is a violating website, the keywords in the website to be detected need to be checked. Specifically, whether a keyword of the specified type is present in the website to be detected can be determined from the similarity between the keywords in the website to be detected and the keywords of the specified type: if that similarity is greater than a preset similarity threshold, a keyword of the specified type is determined to be present in the website to be detected.
Step S206: when a keyword of the specified type is determined to be present in the website to be detected, determine that the website to be detected is a website of the specified type.
It should be noted that, when the website to be detected is determined to contain a keyword of the specified type, the website detection system obtains the type of that keyword and determines the type of the website to be detected from it. For example, if the type of the keyword contained in the website to be detected is gambling, the website to be detected is determined to be a gambling site.
From the scheme defined by steps S202 to S206 above, it can be seen that the similarity between the web page structure of the website to be detected and that of the benchmark website is determined; when the similarity is greater than the first preset value, it is judged whether a keyword of the specified type is present in the website to be detected; and finally, when a keyword of the specified type is determined to be present, the website to be detected is determined to be a website of the specified type.
It is easy to see that the website to be detected is checked along two dimensions, namely the similarity between its web page structure and that of the benchmark website, and whether a keyword of the specified type is present in it, rather than against a sensitive-word library alone. This effectively avoids the low recall rate and high false-positive rate that result from detecting websites with keywords only.
As can be seen from the above, the system for detecting a website provided by the present application can achieve the purpose of improving the efficiency of detecting violating websites, realize the technical effect of avoiding the low recall rate and high false-positive rate caused by using keyword detection alone, and thereby solve the technical problem in the prior art that the accuracy of detecting whether a website is a violating website is low.
It should be noted that, when the similarity between the web page structure of the website to be detected and that of the benchmark website is less than or equal to the first preset value (consistent with step S210 below), it is necessary to further determine from the domain name information of the website to be detected whether it is a website of the specified type. The domain name information of the website to be detected includes at least the domain name of the website to be detected and its domain name price. The specific steps for determining from the domain name information whether the website to be detected is a website of the specified type are as follows:
Step S210: when the similarity is less than or equal to the first preset value and greater than the second preset value, judge whether the similarity between the domain name of the website to be detected and the domain name of the benchmark website is greater than the third preset value, and/or whether the domain name price of the website to be detected is less than the fourth preset value;
Step S212: when the similarity between the domain name of the website to be detected and the domain name of the benchmark website is determined to be greater than the third preset value, and/or the domain name price of the website to be detected is less than the fourth preset value, judge whether a keyword of the specified type is present in the website to be detected;
Step S214: when a keyword of the specified type is determined to be present in the website to be detected, determine that the website to be detected is a website of the specified type.
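The gate formed by steps S210 and S212 can be sketched as a function that decides whether the keyword check of step S212 should run. All names, the default preset values, the `require_both` switch used to model the "and/or" wording, and the use of `difflib.SequenceMatcher` for domain name similarity are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import List

def domain_similarity(a: str, b: str) -> float:
    """Assumed measure of domain name similarity, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def needs_keyword_check(structure_similarity: float,
                        domain: str,
                        benchmark_domains: List[str],
                        domain_price: float,
                        first_preset: float = 0.8,    # assumed values throughout
                        second_preset: float = 0.5,
                        third_preset: float = 0.7,
                        fourth_preset: float = 10.0,
                        require_both: bool = False) -> bool:
    """Steps S210/S212: decide whether to go on to the keyword check."""
    # S210 applies only in the band (second_preset, first_preset].
    if not (second_preset < structure_similarity <= first_preset):
        return False
    best = max((domain_similarity(domain, b) for b in benchmark_domains),
               default=0.0)
    domain_hit = best > third_preset
    price_hit = domain_price < fourth_preset
    # "and/or": either condition suffices unless require_both is set.
    return (domain_hit and price_hit) if require_both else (domain_hit or price_hit)
```

When the function returns True, the keyword check of step S212 is performed and step S214 classifies the site if a keyword of the specified type is found.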
In an optional embodiment, when the similarity between the web page structure of the website to be detected and that of the benchmark website is less than or equal to the first preset value, the website to be detected may not be of the same website type as the benchmark website. To determine whether the two are of the same type, the domain name information of the website to be detected needs to be checked. Specifically, when the similarity is less than or equal to the first preset value and greater than the second preset value, the similarity between the domain name of the website to be detected and the domain name of the benchmark website is determined; if that similarity is greater than the third preset value, the website type of the website to be detected is determined to be the same as that of the benchmark website. This approach effectively avoids misjudging the website type on the basis of web page structure similarity and keywords alone, and improves the accuracy of determining the type of the website to be detected.
In an optional embodiment, when the similarity between the web page structure of the website to be detected and that of the benchmark website is less than or equal to the first preset value, it is determined whether the domain name price of the website to be detected is less than the fourth preset value; if it is, the website to be detected is determined to be a violation website. It should be noted that the domain name price of a higher-risk website is generally low. Therefore, checking whether the domain name price of the website to be detected is lower than that of lower-risk or normal websites effectively avoids misjudging the website type on the basis of web page structure similarity and keywords alone.
In another optional embodiment, after it is determined that the similarity between the web page structure of the website to be detected and that of the benchmark website is less than or equal to the first preset value, both the similarity between the domain name of the website to be detected and the domain name of the benchmark website and the domain name price of the website to be detected are checked. If the domain name similarity is greater than the third preset value and, at the same time, the domain name price is less than the fourth preset value, the website to be detected is determined to be a website of the specified type. It should be noted that, when the web page structure of the website to be detected has relatively low similarity to that of the benchmark website, further checking the domain name information of the website to be detected and determining its website type from that information achieves the purpose of accurately determining the website type of the website to be detected.
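The judgment described above, together with steps S210 to S214, can be sketched as follows. This is only an illustrative sketch: the class `Presets`, the function `is_specified_type`, the flag `require_both`, and all numeric preset values are assumptions introduced here, and the text does not state what happens when the structure similarity is at or below the second preset value, so the sketch simply does not flag the website in that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Presets:
    # All values are placeholders chosen for illustration only.
    first: float = 0.8    # upper threshold on web page structure similarity
    second: float = 0.5   # lower threshold on web page structure similarity
    third: float = 0.7    # threshold on domain name similarity
    fourth: float = 10.0  # threshold on domain name price


def is_specified_type(struct_sim, domain_sim, domain_price, has_keyword,
                      p=Presets(), require_both=False):
    """Return True if the website is judged to be of the specified type."""
    # Structure similarity above the first preset: only the keyword check remains.
    if struct_sim > p.first:
        return has_keyword
    # At or below the second preset: not covered by the text; assumed "not flagged".
    if struct_sim <= p.second:
        return False
    # Step S210: domain name similarity and/or domain name price.
    domain_hit = domain_sim > p.third
    price_hit = domain_price < p.fourth
    if require_both:
        domain_ok = domain_hit and price_hit   # the "and" reading
    else:
        domain_ok = domain_hit or price_hit    # the "or" reading
    # Steps S212 and S214: the keyword check decides the final result.
    return domain_ok and has_keyword
```

The `require_both` flag switches between the two readings of "and/or" in step S210; which reading applies is a deployment choice.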
In an optional embodiment, the similarity between the web page structure of the website to be detected and that of the benchmark website can be determined by comparing the tree path sets of their web page structures, with the following specific steps:
Step S2040, obtain the DOM tree of the website to be detected;
Step S2042, decompose the DOM tree to obtain a tree path set;
Step S2044, determine the similarity between the web page structure of the website to be detected and that of the benchmark website from the tree path set.
It should be noted that each DOM tree can be decomposed into a set of multiple tree paths; if two web pages are structurally similar, they decompose into many similar tree paths. In addition, the similarity of web page structures can be calculated through best-matching paths: the similarity of two web page structures is the average, over every tree path, of the similarity between that path and its best-matching path.
Specifically, after the DOM tree of the website to be detected is obtained, the features of the tree path set are computed from the DOM tree nodes, and the tree paths are matched on the basis of the extracted features. The similarity between the web page structure of the website to be detected and that of the benchmark website is then determined by the minimum edit distance principle: the tree paths corresponding to the page structures in the benchmark websites are compared with the tree paths of the website to be detected to obtain the difference value, or distance, between two paths, and the similarity between the website to be detected and the web page structure of the benchmark website with the smallest difference value or distance is taken as the above similarity between the web page structures of the website to be detected and the benchmark website.
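Steps S2040 to S2044 can be illustrated with the following sketch, which uses only the Python standard library. It rests on several assumptions not stated in the text: a tree path is taken to be the sequence of tag names from the root to a node, path similarity is taken to be one minus the edit distance normalized by the longer path length, and the names `PathCollector`, `tree_paths`, `edit_distance`, `path_similarity`, and `structure_similarity` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the root-to-node tag path of every element."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add(tuple(self.stack))
        if tag in VOID:
            self.stack.pop()

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag; ignore stray closes.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tree_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def edit_distance(a, b):
    """Levenshtein distance between two tag sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def path_similarity(a, b):
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def structure_similarity(paths_a, paths_b):
    """Average, over paths of A, of the similarity to the best-matching path of B."""
    if not paths_a or not paths_b:
        return 0.0
    best = [max(path_similarity(p, q) for q in paths_b) for p in paths_a]
    return sum(best) / len(best)
```

Note that this average is taken over the paths of the first page only, so the measure is not symmetric; a symmetric variant could average the two directions.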
In an optional embodiment, DOM decomposition can be used to judge whether a keyword of the specified type exists in the website to be detected, with the following specific steps:
Step S2060, extract the keywords in the tree path set;
Step S2062, compare the keywords in the tree path set with the keywords of the specified type to obtain a similarity;
Step S2064, when the similarity is greater than a similarity threshold, determine that a keyword of the specified type exists in the website to be detected.
Specifically, the system for detecting websites preprocesses the data to be tested of the website to be detected to obtain the DOM tree of its web pages, analyzes the DOM tree to obtain the tree path set, and extracts the content in the tree path set. After word segmentation and stop-word removal, the TF-IDF (Term Frequency-Inverse Document Frequency) method is applied, and the words with larger IDF (inverse document frequency) values are extracted as the keywords of the website to be detected. These keywords are then compared with the keywords of the specified type to obtain the corresponding keyword similarity. If the similarity is greater than the similarity threshold, it is determined that a keyword of the specified type exists in the website to be detected.
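The keyword extraction and comparison in steps S2060 to S2064 can be sketched as below. The sketch makes assumptions that go beyond the text: words are ranked by the full TF-IDF score, tokenization is a simple regular expression suited to English text (Chinese content would need a word segmenter), the stop-word list is a tiny illustrative set, and the keyword similarity is taken to be the Jaccard overlap of two keyword sets. The names `tokenize`, `top_keywords`, and `keyword_similarity` are hypothetical.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "and", "to", "of", "in", "is"}  # illustrative only


def tokenize(text):
    """Lowercase, split into alphanumeric words, drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def top_keywords(docs, index, k=3):
    """Return the k words of docs[index] with the highest TF-IDF score."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    tf = Counter(tokenized[index])
    total = sum(tf.values()) or 1
    scores = {w: (c / total) * math.log(n / df[w]) for w, c in tf.items()}
    # Sort by descending score, then alphabetically to make ties deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


def keyword_similarity(site_keywords, typed_keywords):
    """Jaccard overlap between the site's keywords and the specified-type keywords."""
    a, b = set(site_keywords), set(typed_keywords)
    return len(a & b) / len(a | b) if a | b else 0.0
```

A website would then be treated as containing a keyword of the specified type when `keyword_similarity` exceeds the chosen similarity threshold.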
Embodiment 3
According to an embodiment of the present invention, a further method embodiment for detecting a website is provided. Fig. 3 shows a flow chart of a method for detecting a website; as shown in Fig. 3, the method specifically includes the following steps:
Step S302, obtain the data to be tested of the website to be detected.
It should be noted that the data to be tested of the website to be detected may be, but is not limited to, the web page structure of the website to be detected, the keywords contained in the website to be detected, and so on.
Step S304, determine a first similarity between the data to be tested and the data in the abnormal website library, where the abnormal website library includes the web page structures of multiple abnormal websites.
It should be noted that at least one violation website is stored in the abnormal website library. The similarity between the data to be tested and the data in the abnormal website library is the similarity between the web page structure in the data to be tested and that of the benchmark website in the abnormal website library. The benchmark website is the website in the abnormal website library with the highest similarity to the data to be tested.
Step S306, determine a second similarity between the data to be tested and a keyword in the sensitive dictionary.
It should be noted that the sensitive dictionary contains multiple sensitive words, and the keyword in the sensitive dictionary referred to here is the keyword in the sensitive dictionary with the highest similarity to the data to be tested.
Step S308, if the first similarity is greater than a first threshold and the second similarity is greater than a second threshold, determine that the website to be detected is a website of the specified type.
It should be noted that the abnormal website denotes a website to be detected that carries risk. Specifically, once the first similarity is determined to be greater than the first threshold, the website to be detected may be a violation website; the second similarity is then judged, and if it is greater than the second threshold, the website to be detected is determined to be an abnormal website. Further, the type of the website to be detected can be determined from the category of the keyword in the sensitive dictionary that matches the data to be tested, for example, that the website to be detected is a gambling website.
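Steps S304 to S308 can be sketched as follows, assuming the per-template structure similarities and per-keyword similarities have already been computed. The function `detect_site`, its input shapes, and the sample identifiers are assumptions made for illustration; the sketch picks the benchmark website and the sensitive keyword as the entries with the highest similarity, as described above, and takes the website type from the category of the matched keyword.

```python
def detect_site(structure_sims, keyword_sims, first_threshold, second_threshold):
    """Apply the two-threshold rule of step S308.

    structure_sims: {template_id: similarity to that abnormal-site template}
    keyword_sims:   {(category, keyword): similarity to that sensitive keyword}
    Returns a dict describing the match, or None if the site is not flagged.
    """
    if not structure_sims or not keyword_sims:
        return None
    # The benchmark website is the library entry most similar to the data to be tested.
    benchmark, first = max(structure_sims.items(), key=lambda kv: kv[1])
    # The sensitive keyword is the dictionary entry most similar to the data to be tested.
    (category, keyword), second = max(keyword_sims.items(), key=lambda kv: kv[1])
    if first > first_threshold and second > second_threshold:
        return {"benchmark": benchmark, "keyword": keyword, "type": category}
    return None
```

For example, `detect_site({"t1": 0.9}, {("gambling", "jackpot"): 0.7}, 0.8, 0.6)` flags the website with type `"gambling"`; the threshold values here are placeholders.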
Based on the scheme defined by steps S302 to S308 above, it can be seen that the data to be tested of the website to be detected is obtained, the first similarity between the data to be tested and the data in the abnormal website library is determined, and the second similarity between the data to be tested and the keyword in the sensitive dictionary is then determined; if the first similarity is greater than the first threshold and the second similarity is greater than the second threshold, the website to be detected is determined to be a website of the specified type, where the abnormal website library includes the web page structures of multiple risk websites.
It is easy to notice that the website to be detected is examined along two dimensions, namely the similarity between its web page structure and that of the benchmark website, and whether a keyword of the specified type exists in it, rather than on the basis of the sensitive dictionary alone. This effectively avoids the low recall rate and high false alarm rate that arise when only keywords are used to detect the website to be detected.
As can be seen from the above, the system for detecting websites provided by this application can improve the efficiency of detecting risky websites, thereby achieving the technical effect of avoiding the low recall rate and high false alarm rate caused by keyword-only detection, and in turn solving the technical problem in the prior art that the accuracy of detecting whether a website is a violation website is low.
In an optional embodiment, before the data to be tested of the website to be detected is obtained, an abnormality detection library needs to be constructed, with the following specific steps:
Step S30, construct the abnormal website library and the sensitive dictionary;
Step S32, construct the abnormality detection library from the abnormal website library and the sensitive dictionary.
Constructing the sensitive dictionary specifically includes the following steps:
Step S3002a, obtain the data sets of multiple abnormal websites;
Step S3004a, process the data sets of the multiple abnormal websites to obtain the tree path sets of the data sets;
Step S3006a, extract the keywords in the tree path sets, where the keywords are sensitive keywords;
Step S3008a, construct the sensitive dictionary from the extracted keywords.
Specifically, the system for detecting websites preprocesses the data sets of the multiple abnormal websites and represents the DOM tree of each abnormal website's web page as a tree path set. It also extracts the full content of the abnormal website and, after word segmentation and stop-word removal, applies the TF-IDF method and extracts the words with larger IDF values as the sensitive words of the sensitive dictionary. Performing this sensitive-word extraction on multiple abnormal websites yields the sensitive dictionary.
In an optional embodiment, constructing the abnormal website library includes:
Step S3002b, obtain the DOM trees of multiple abnormal websites;
Step S3004b, decompose the DOM trees to obtain tree path sets;
Step S3006b, determine the similarity of the web page structures of the multiple abnormal websites from the tree path sets;
Step S3008b, cluster the multiple abnormal websites according to the similarity of their web page structures to obtain a clustering result;
Step S3010b, construct the abnormal website library from the clustering result.
Specifically, Fig. 4(a) shows a flow chart of an optional method for detecting a website. As shown in Fig. 4(a), after the data of the abnormal websites is obtained from the sample library, the data needs to be preprocessed to obtain DOM trees; structural clustering is then performed on the DOM trees, and the abnormal website library is formed from the clustering result.
It should be noted that determining the similarity of the web page structures of the multiple abnormal websites from the tree path sets includes:
Step S402, obtain a first similarity between each path in the tree path set and its matching path, where the matching path is the path with the highest similarity to that path;
Step S404, determine the similarity of the web page structures from the first similarity.
Specifically, when structural clustering is performed on multiple abnormal websites, the web pages are first parsed to generate tree path sets. The similarities between these tree path sets are then calculated to form a similarity matrix, the best-matching paths of the tree path sets are obtained from the similarity matrix, and the similarity between the web pages is thereby obtained. The web pages are then clustered and merged on the basis of their similarity, which completes the clustering of the abnormal websites.
It should be noted that, after the clustering result of the abnormal websites is obtained, the abnormal website library can be constructed from the clustering result, with the following specific steps:
Step S406, determine the template web page of each class of abnormal websites from the clustering result, where the template web page is the web page of the benchmark website.
Step S408, construct the abnormal website library from the template web pages.
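The clustering and template selection in steps S3008b, S406, and S408 can be illustrated with the sketch below. The text does not name a clustering algorithm or a rule for choosing the template web page, so both are assumptions here: websites are merged whenever their pairwise similarity exceeds a threshold (single-linkage grouping via union-find), and the template of a cluster is the member with the highest total similarity to the cluster. The names `cluster_sites` and `pick_template` are hypothetical.

```python
def cluster_sites(sim, threshold):
    """Group site indices whose pairwise similarity exceeds the threshold.

    sim is a symmetric matrix (list of lists) of web page structure similarities.
    """
    n = len(sim)
    parent = list(range(n))

    def find(i):
        # Find the cluster representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] > threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def pick_template(cluster, sim):
    """Choose the member most similar, in total, to the members of its cluster."""
    return max(cluster, key=lambda i: sum(sim[i][j] for j in cluster))
```

Under these assumptions, the abnormal website library would hold one template web page per cluster, that is, `pick_template(c, sim)` for each cluster `c` returned by `cluster_sites`.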
It should be noted that the abnormality detection library further includes a domain name information library, which includes a violation domain name information library and a domain name price library. Before the website to be detected is determined to be an abnormal website, the method for detecting a website further includes:
Step S502, obtain the domain name information of the website to be detected;
Step S504, determine a third similarity between the domain name of the website to be detected and the domain names in the violation domain name information library, and/or determine the domain name price of the website to be detected from the domain name price library.
It should be noted that, in order to determine the type of the website to be detected accurately, while the first similarity between the website to be detected and the data in the abnormal website library and the second similarity between the data to be tested of the website to be detected and the keyword in the sensitive dictionary are obtained, the domain name information of the website to be detected also needs to be determined from the violation domain name information library and the domain name price library.
In an optional embodiment, the domain name of the website to be detected is compared with the domain names in the violation domain name information library to determine the highest similarity between them; this similarity is the third similarity between the domain name of the website to be detected and the domain names in the violation domain name information library. The domain name in the violation domain name information library that is most similar to the domain name of the website to be detected is then identified from that highest similarity, and the domain name price is looked up in the domain name price library according to that domain name; this price is taken as the domain name price of the website to be detected.
It should be noted that, after the third similarity and/or the domain name price has been determined, whether the website to be detected is a website of the specified type can be determined from them. Specifically, if the first similarity is greater than the first threshold, the second similarity is greater than the second threshold, and the third similarity is greater than a third threshold and/or the domain name price of the website to be detected is less than a preset price, the website to be detected is determined to be a website of the specified type.
Specifically, the website to be detected can be determined to be a website of the specified type in any one of the following modes:
Mode one: the first similarity is greater than the first threshold, the second similarity is greater than the second threshold, and the third similarity is greater than the third threshold;
Mode two: the first similarity is greater than the first threshold, the second similarity is greater than the second threshold, and the domain name price of the website to be detected is less than the preset price;
Mode three: the first similarity is greater than the first threshold, the second similarity is greater than the second threshold, the third similarity is greater than the third threshold, and the domain name price of the website to be detected is less than the preset price.
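The domain name comparison and the three modes above can be sketched as follows. The text does not say how domain name similarity is measured; the sketch assumes the ratio returned by `difflib.SequenceMatcher` from the Python standard library. The functions `third_similarity` and `is_specified_type`, the `thresholds` tuple layout, and the sample domains and prices are hypothetical.

```python
from difflib import SequenceMatcher


def third_similarity(domain, violation_domains):
    """Return (most similar violation domain, its similarity to `domain`).

    Raises ValueError if violation_domains is empty.
    """
    scored = [(SequenceMatcher(None, domain, d).ratio(), d) for d in violation_domains]
    sim, best = max(scored)
    return best, sim


def is_specified_type(first, second, third, price, thresholds, mode):
    """Apply mode one, two, or three to the computed similarities and price.

    thresholds = (first threshold, second threshold, third threshold, preset price)
    """
    t1, t2, t3, preset_price = thresholds
    # All three modes require the first and second similarity checks to pass.
    if not (first > t1 and second > t2):
        return False
    domain_hit = third > t3
    price_hit = price is not None and price < preset_price
    if mode == 1:
        return domain_hit
    if mode == 2:
        return price_hit
    if mode == 3:
        return domain_hit and price_hit
    raise ValueError("mode must be 1, 2 or 3")
```

In use, the price passed to `is_specified_type` would be the domain name price library entry for the domain returned by `third_similarity`, following the lookup described above.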
In an optional embodiment, as shown in Fig. 4(a), after the abnormal website library is constructed, the website to be detected can be detected. Specifically, the data of websites on the cloud (i.e., the data to be tested of the website to be detected) is obtained, its tag content is extracted, and feature set processing is performed to determine whether the website to be detected contains a keyword of the specified type. At the same time, the web page structure of the website to be detected is checked against the constructed abnormal website library: the first similarity between each path in the tree path set and its matching path is compared with the first threshold, and the second similarity between the data to be tested and the keyword in the sensitive dictionary is compared with the second threshold. If the first similarity is greater than the first threshold and the second similarity is greater than the second threshold, a platform audit is carried out; if the audit passes, the website to be detected is determined to be an abnormal website and is stored in the abnormality detection library. It thus becomes a member of the sample library of abnormal websites and, together with data published on the network, serves as the web page training set.
In an optional embodiment, Fig. 4(b) shows a flow chart of an optional method for detecting a website. As shown in Fig. 4(b), web pages are extracted from the sample library in which abnormal websites are stored to obtain the web page training set, and the data of the abnormal websites is thereby obtained. After the data of the abnormal websites is obtained, it can be preprocessed. Specifically, the web pages of the abnormal websites are HTML (HyperText Markup Language) pages; HTML parsing of these pages yields XML (Extensible Markup Language), and DOM parsing of the XML yields a DOM object (i.e., a Document object), that is, the DOM tree of the web page of the abnormal website. After the DOM tree of the web page is obtained, structural clustering is performed on the DOM tree. Specifically, the web pages are first parsed on the basis of the DOM tree to generate tree path sets, for example the two tree path sets P1=(N11,N12,...,N1m) and P2=(N21,N22,...,N2m); the similarity of the two tree path sets is then calculated to form a similarity matrix, the best-matching paths of the two tree path sets are obtained from the similarity matrix, the similarity of the two web pages is thereby obtained, and finally the clustering result is obtained from the similarity of the web pages. After the clustering result is obtained, multiple web page clusters are generated from it, and through manual screening (the manual intervention step in Fig. 4(b)) a rule library is formed, which is the abnormal website library (i.e., the violation website template library). After the abnormal website library is constructed, the website to be detected can be detected. Specifically, the data of websites on the cloud (i.e., the data to be tested of the website to be detected) is obtained, its tag content is extracted, and feature set processing is performed. A model is then built from the constructed abnormal website library and the processed feature set, and the website to be detected is detected with this model to determine the first similarity between each path in the tree path set and its matching path and the second similarity between the data to be tested and the keyword in the sensitive dictionary. These detection results are then compared with thresholds: the first similarity is compared with the first threshold and the second similarity with the second threshold. If the first similarity is greater than the first threshold and the second similarity is greater than the second threshold, an RCP (Rich Client Platform) audit is carried out; if the audit passes, the website to be detected is determined to be an abnormal website, the webmaster of the website to be detected is penalized, and the violation website is stored in the abnormality detection library. It thus becomes a member of the sample library of abnormal websites and, together with data published on the network, serves as the web page training set.
Embodiment 4
According to an embodiment of the present invention, a further method embodiment for detecting a website is provided. Fig. 6 shows a flow chart of a method for detecting a website; as shown in Fig. 6, the method specifically includes the following steps:
Step S602, receive the data information of the website to be detected;
Step S604, evaluate the data information of the website to be detected against multiple abnormality detection libraries to obtain the risk value of the website to be detected, where different abnormality detection libraries correspond to different judgment rules, and a judgment rule is used to determine the risk value of the website to be detected under the corresponding abnormality detection library;
Step S606, determine the website type of the website to be detected from its risk value.
It should be noted that detecting the data information of the website to be detected against different abnormality detection libraries yields different risk values; a weighted sum of the risk values obtained under the different detection libraries gives the risk value of the website to be detected. The website type of the website to be detected can be determined from the interval in which the risk value falls; for example, if the risk value lies in the interval [A, B), the website type of the website to be detected is determined to be a gambling website.
It should also be noted that the multiple abnormality detection libraries include at least a sensitive dictionary, an abnormal website library, and a domain name information library, where the abnormal website library includes the web page structures of multiple abnormal websites.
Based on the scheme defined by steps S602 to S606 above, it can be seen that the data information of the website to be detected is received, the data information is evaluated against multiple abnormality detection libraries to obtain the risk value of the website to be detected, and the website type of the website to be detected is finally determined from its risk value, where different abnormality detection libraries correspond to different judgment rules, and a judgment rule is used to determine the risk value of the website to be detected under the corresponding abnormality detection library.
It is easy to notice that the website to be detected is examined along two dimensions, namely the similarity between its web page structure and that of the benchmark website, and whether a keyword of the specified type exists in it, rather than on the basis of the sensitive dictionary alone. This effectively avoids the low recall rate and high false alarm rate that arise when only keywords are used to detect the website to be detected.
As can be seen from the above, the system for detecting websites provided by this application can improve the efficiency of detecting abnormal websites, thereby achieving the technical effect of avoiding the low recall rate and high false alarm rate caused by keyword-only detection, and in turn solving the technical problem in the prior art that the accuracy of detecting whether a website is an abnormal website is low. Here, an abnormal website (or violation website) is a website with a security risk.
It should be noted that, after the data information of the website to be detected has been detected against the abnormality detection libraries and the risk value obtained, if the website to be detected is an abnormal website, it is stored in the abnormality detection library.
In an optional embodiment, before the data information of the website to be detected is evaluated against the multiple abnormality detection libraries to obtain its risk value, the sensitive dictionary and the abnormal website library need to be constructed. Constructing the sensitive dictionary includes:
Step S60, obtain the data sets of abnormal websites;
Step S62, process the data sets of the abnormal websites to obtain the tree path sets of the data sets;
Step S64, extract the keywords in the tree path sets, where the keywords are sensitive keywords;
Step S66, construct the sensitive dictionary from the extracted keywords.
Specifically, the system for detecting websites preprocesses the data sets of the multiple abnormal websites and represents the DOM tree of each abnormal website's web page as a tree path set. It also extracts the full content of the abnormal website and, after word segmentation and stop-word removal, applies the TF-IDF method and extracts the words with larger IDF values as the sensitive words of the sensitive dictionary. Performing this sensitive-word extraction on multiple abnormal websites yields the sensitive dictionary.
In an optional embodiment, constructing the abnormal website library includes:
Step S70, obtain the DOM trees of multiple abnormal websites;
Step S72, decompose the DOM trees to obtain tree path sets;
Step S74, determine the similarity of the web page structures of the multiple abnormal websites from the tree path sets;
Step S76, cluster the multiple abnormal websites according to the similarity of their web page structures to obtain a clustering result;
Step S78, construct the abnormal website library from the clustering result.
Specifically, as shown in Fig. 4(b), after the data of the abnormal websites is obtained, it needs to be preprocessed. The web pages of the abnormal websites are HTML (HyperText Markup Language) pages; HTML parsing of these pages yields XML (Extensible Markup Language), and DOM parsing of the XML yields a DOM object, that is, the DOM tree of the web page of the abnormal website. Structural clustering is then performed on the DOM trees, multiple web page clusters are generated from the clustering result, and the web page clusters are converted to form a rule library, which is the abnormal website library.
In an optional embodiment, detecting the data information of the website to be detected against the abnormality detection libraries to obtain the risk value includes:
Step S6040, obtain a first risk value by detecting the data information of the website to be detected against the sensitive dictionary;
Step S6042, obtain a second risk value by detecting the data information of the website to be detected against the abnormal website library;
Step S6044, obtain a third risk value by detecting the data information of the website to be detected against the domain name information library;
Step S6046, compute a weighted sum of the first risk value, the second risk value, and the third risk value to determine the risk value of the website to be detected.
It should be noted that the weights of the first risk value, the second risk value, and the third risk value can be set according to the actual situation, with the weight of the first risk value being the highest.
In an optional embodiment, determining the website type of the website to be detected from its risk value specifically includes:
Step S6060, judge whether the risk value of the website to be detected is greater than a preset risk value;
Step S6062, when the risk value of the website to be detected is greater than the preset risk value, determine that the website to be detected is an abnormal website.
It should be noted that, after the website to be detected is determined to be an abnormal website, its specific type can be determined from the numerical interval to which its risk value belongs. For example, if the risk value is greater than A, the website to be detected is determined to be an abnormal website; if, further, the risk value lies in the interval [A, B), the website type of the website to be detected is determined to be a gambling website.
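The weighted summation of the three risk values and the interval-based type determination can be sketched as below. The weights, the preset risk value, the interval bounds, and the names `site_risk` and `classify` are all assumptions for illustration; the only constraint taken from the text is that the weight of the first risk value (sensitive dictionary) is the highest.

```python
def site_risk(risks, weights):
    """Weighted sum of the per-library risk values."""
    return sum(weights[name] * risks[name] for name in weights)


def classify(risk, preset, intervals):
    """Map a risk value to a website type.

    intervals is a list of (low, high, label) with half-open ranges [low, high).
    """
    if risk <= preset:
        return "normal"
    for low, high, label in intervals:
        if low <= risk < high:
            return label
    # Above the preset risk value but in no labeled interval.
    return "abnormal"


# Hypothetical weights: the sensitive dictionary carries the highest weight.
WEIGHTS = {"sensitive": 0.5, "structure": 0.3, "domain": 0.2}
```

For example, risk values of 0.8, 0.6, and 0.5 under the three libraries give a combined risk of about 0.68 with these weights, which `classify` would label according to whichever interval contains it.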
In an optional embodiment, Fig. 5 shows a flow chart of an optional process for constructing the abnormality detection library. As shown in Fig. 5, the specific steps are as follows:
Step S51, after the data of the abnormal websites (for example, HTML source code) is obtained, establish multi-dimensional risk libraries, which may be, but are not limited to, a tag violation word library (i.e., the sensitive dictionary), a website structure black sample template library (i.e., the abnormal website library), and a domain name information library.
Step S53, after the multi-dimensional risk libraries are established, generate the abnormality detection model (i.e., the abnormality detection library).
Step S55, detect the data to be tested of the website to be detected with the constructed abnormality detection model. When the website to be detected is detected, each risk library yields a risk value; a weighted sum of the risk values obtained from the risk libraries gives the risk value of the website to be detected, from which it can be determined whether the website to be detected is an abnormal website.
Embodiment 5
According to embodiments of the present invention, a kind of Installation practice for detecting website is additionally provided, wherein Fig. 7 shows one kind The apparatus structure schematic diagram of website is detected, as shown in Figure 7, the device for detecting website specifically includes: the first determining module 701 is sentenced Disconnected module 703 and the second determining module 705.
Wherein, the first determining module 701, the similarity of the structure of web page for determining website to be detected and benchmark website; Judgment module 703, in the case where similarity is greater than the first preset value, judging in website to be detected with the presence or absence of specified class The keyword of type;Second determining module 705, for the case where there are the keywords of specified type in determining website to be detected Under, determine that website to be detected is the website of specified type.
Herein it should be noted that above-mentioned first determining module 701, judgment module 703 and the second determining module 705 are right Should be in the step S202 to step S206 in embodiment 2, example and application scenarios that three modules and corresponding step are realized It is identical, but it is not limited to the above embodiments two disclosure of that.
In an optional embodiment, the device for detecting a website further includes a first judgment module, a second judgment module and a fifth determining module. The first judgment module is configured to judge, when the similarity is less than or equal to the first preset value and greater than a second preset value, whether the similarity between the domain name of the website to be detected and the domain name of the benchmark website is greater than a third preset value, and/or whether the domain name price of the website to be detected is less than a fourth preset value. The second judgment module is configured to judge whether a keyword of the specified type exists in the website to be detected, when it is determined that the similarity between the two domain names is greater than the third preset value and/or that the domain name price of the website to be detected is less than the fourth preset value. The fifth determining module is configured to determine, when a keyword of the specified type exists in the website to be detected, that the website to be detected is a website of the specified type.
It should be noted here that the first judgment module, the second judgment module and the fifth determining module correspond to steps S210 to S214 in Embodiment 2. The three modules share the examples and application scenarios of the corresponding steps, but are not limited to what is disclosed in Embodiment 2.
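The two-tier decision described by these modules can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function names, the threshold values, and the use of `difflib.SequenceMatcher` as the domain name similarity measure are all hypothetical, since the text fixes none of them. The "and/or" condition is read here as a logical "or".

```python
from difflib import SequenceMatcher


def domain_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]. The metric is an assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_specified_type(
    structure_sim: float,
    domain: str,
    benchmark_domain: str,
    domain_price: float,
    has_keyword: bool,
    first_preset: float = 0.8,    # hypothetical threshold values
    second_preset: float = 0.5,
    third_preset: float = 0.7,
    fourth_preset: float = 10.0,
) -> bool:
    # Tier 1: structure similarity above the first preset value,
    # so the keyword check alone decides.
    if structure_sim > first_preset:
        return has_keyword
    # Tier 2: similarity in (second_preset, first_preset]; require a
    # suspicious domain name before running the keyword check.
    if structure_sim > second_preset:
        suspicious_domain = (
            domain_similarity(domain, benchmark_domain) > third_preset
            or domain_price < fourth_preset
        )
        return suspicious_domain and has_keyword
    # Below the second preset value the website is not flagged.
    return False
```

In this sketch a page whose structure is only moderately similar to the benchmark is flagged only if its domain name imitates the benchmark domain or was cheap to register, and it also contains a keyword of the specified type.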
In an optional embodiment, the first determining module includes a first obtaining module, a decomposing module and a third determining module. The first obtaining module is configured to obtain the DOM tree of the website to be detected. The decomposing module is configured to decompose the DOM tree to obtain a tree path set. The third determining module is configured to determine, according to the tree path set, the similarity between the web page structures of the website to be detected and the benchmark website.
It should be noted here that the first obtaining module, the decomposing module and the third determining module correspond to steps S2040 to S2044 in Embodiment 2. The three modules share the examples and application scenarios of the corresponding steps, but are not limited to what is disclosed in Embodiment 2.
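As a rough illustration of the three steps above (obtain the DOM tree, decompose it into a tree path set, compare the path sets), the following sketch uses Python's standard `html.parser`. The class and function names are hypothetical, a "path" is taken to be the root-to-element sequence of tag names, and the Jaccard index is an assumed similarity measure; the embodiment does not prescribe any of these choices.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the root-to-element tag path of every element."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching tag, which also
            # discards inner elements that were left unclosed.
            while self.stack.pop() != tag:
                pass


def tree_paths(html: str) -> set:
    """Decompose a page into its set of tag paths."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structure_similarity(html_a: str, html_b: str) -> float:
    """Jaccard index of the two tree path sets (an assumed measure)."""
    a, b = tree_paths(html_a), tree_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two pages that differ only in one leaf element share most of their paths and therefore score close to 1, while pages built from unrelated templates share few paths and score close to 0.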
In an optional embodiment, the judgment module includes an extraction module, a comparison module and a fourth determining module. The extraction module is configured to extract keywords from the tree path set. The comparison module is configured to compare the keywords in the tree path set with the keywords of the specified type to obtain a similarity. The fourth determining module is configured to determine, when that similarity is greater than a similarity threshold, that a keyword of the specified type exists in the website to be detected.
It should be noted here that the extraction module, the comparison module and the fourth determining module correspond to steps S2060 to S2064 in Embodiment 2. The three modules share the examples and application scenarios of the corresponding steps, but are not limited to what is disclosed in Embodiment 2.
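The comparison and threshold steps can be sketched as below, assuming the keywords have already been extracted from the tree path set. The function names, the default threshold of 0.8 and the use of `difflib.SequenceMatcher` as the keyword similarity measure are assumptions made for illustration; the embodiment does not specify how the similarity is computed.

```python
from difflib import SequenceMatcher


def best_keyword_similarity(page_keywords, typed_keywords) -> float:
    """Highest pairwise similarity between the two keyword collections."""
    best = 0.0
    for page_kw in page_keywords:
        for typed_kw in typed_keywords:
            score = SequenceMatcher(None, page_kw, typed_kw).ratio()
            best = max(best, score)
    return best


def has_typed_keyword(page_keywords, typed_keywords, threshold=0.8) -> bool:
    """True when some page keyword is close enough to a typed keyword."""
    return best_keyword_similarity(page_keywords, typed_keywords) > threshold
```

A fuzzy measure is used in this sketch so that lightly obfuscated spellings of a keyword can still exceed the threshold; an exact-match lookup would be a simpler alternative.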
Embodiment 6
According to an embodiment of the present invention, a system for detecting a website is further provided. The system includes a processor and a memory. The memory is connected to the processor and provides the processor with instructions for the following processing steps: determining the similarity between the web page structures of the website to be detected and the benchmark website; when the similarity is greater than a preset similarity, judging whether a keyword of a specified type exists in the website to be detected; and, when a keyword of the specified type exists in the website to be detected, determining that the website to be detected is a website of the specified type.
Embodiment 7
An embodiment of the present invention may provide a computer terminal, which may be any computer terminal in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one of a plurality of network devices of a computer network.
Fig. 8 shows a hardware block diagram of a computer terminal. As shown in Fig. 8, computer terminal A may include one or more processors 802 (shown in the figure as 802a, 802b, ..., 802n; a processor 802 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 804 for storing data, and a transmission device 806 for communication functions. In addition, it may include a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply and/or a camera. Those of ordinary skill in the art will understand that the structure shown in Fig. 8 is only illustrative and does not limit the structure of the above electronic device. For example, computer terminal A may include more or fewer components than shown in Fig. 8, or have a configuration different from that shown in Fig. 8.
It should be noted that the one or more processors 802 and/or other data processing circuits may generally be referred to herein as "data processing circuits". A data processing circuit may be embodied, in whole or in part, as software, hardware, firmware or any combination thereof. In addition, the data processing circuit may be a single independent processing module, or may be integrated, in whole or in part, into any of the other elements of computer terminal A. As referred to in the embodiments of the present application, the data processing circuit serves as a kind of processor control (for example, selection of a variable-resistance terminal path connected to an interface).
The processor 802 may call, through the transmission device, the information and application programs stored in the memory to perform the following steps: determining the similarity between the web page structures of the website to be detected and the benchmark website; when the similarity is greater than the first preset value, judging whether a keyword of a specified type exists in the website to be detected; and, when a keyword of the specified type exists in the website to be detected, determining that the website to be detected is a website of the specified type.
The memory 804 may be used to store software programs and modules of application software, such as the program instructions/data storage device corresponding to the method for detecting a website in the embodiments of the present application. By running the software programs and modules stored in the memory 804, the processor 802 performs various functional applications and data processing, that is, implements the above method for detecting a website. The memory 804 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory or other non-volatile solid-state memory. In some examples, the memory 804 may further include memory located remotely from the processor 802, and such remote memory may be connected to computer terminal A through a network. Examples of such a network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network and combinations thereof.
The transmission device 806 is used to receive or send data via a network. A specific example of the network may include a wireless network provided by the communication provider of computer terminal A. In one example, the transmission device 806 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission device 806 may be a radio frequency (Radio Frequency, RF) module, which is used to communicate with the Internet wirelessly.
The display may be, for example, a touch-screen liquid crystal display (LCD), which enables a user to interact with the user interface of computer terminal A.
It should be noted here that, in some optional embodiments, computer terminal A shown in Fig. 8 may include hardware elements (including circuits), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be pointed out that Fig. 8 is only one particular example and is intended to show the types of components that may be present in computer terminal A.
In this embodiment, computer terminal A may execute program code for the following steps of the method for detecting a website: determining the similarity between the web page structures of the website to be detected and the benchmark website; when the similarity is greater than the first preset value, judging whether a keyword of a specified type exists in the website to be detected; and, when a keyword of the specified type exists in the website to be detected, determining that the website to be detected is a website of the specified type.
The processor may call, through the transmission device, the information and application programs stored in the memory to perform the following steps: when the similarity is less than or equal to the first preset value and greater than the second preset value, judging whether the similarity between the domain name of the website to be detected and the domain name of the benchmark website is greater than the third preset value, and/or whether the domain name price of the website to be detected is less than the fourth preset value; when it is determined that the similarity between the two domain names is greater than the third preset value and/or that the domain name price of the website to be detected is less than the fourth preset value, judging whether a keyword of the specified type exists in the website to be detected; and, when a keyword of the specified type exists in the website to be detected, determining that the website to be detected is a website of the specified type.
The processor may call, through the transmission device, the information and application programs stored in the memory to perform the following steps: obtaining the DOM tree of the website to be detected; decomposing the DOM tree to obtain a tree path set; and determining, according to the tree path set, the similarity between the web page structures of the website to be detected and the benchmark website.
The processor may call, through the transmission device, the information and application programs stored in the memory to perform the following steps: extracting keywords from the tree path set; comparing the keywords in the tree path set with the keywords of the specified type to obtain a similarity; and, when that similarity is greater than the similarity threshold, determining that a keyword of the specified type exists in the website to be detected.
Those of ordinary skill in the art will understand that the structure shown in Fig. 8 is only illustrative. The computer terminal may also be a terminal device such as a smartphone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (Mobile Internet Devices, MID) or a PAD. Fig. 8 does not limit the structure of the above electronic device. For example, computer terminal A may include more or fewer components than shown in Fig. 8 (such as a network interface or a display device), or have a configuration different from that shown in Fig. 8.
Those of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be completed by a program instructing hardware related to the terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include a flash disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc or the like.
Embodiment 8
An embodiment of the present invention further provides a storage medium. Optionally, in this embodiment, the storage medium may be used to store the program code executed by the method for detecting a website provided in the above embodiments.
Optionally, in this embodiment, the storage medium may be located in any computer terminal in a computer terminal group in a computer network, or in any mobile terminal in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: determining the similarity between the web page structures of the website to be detected and the benchmark website; when the similarity is greater than the first preset value, judging whether a keyword of a specified type exists in the website to be detected; and, when a keyword of the specified type exists in the website to be detected, determining that the website to be detected is a website of the specified type.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: when the similarity is less than or equal to the first preset value and greater than the second preset value, judging whether the similarity between the domain name of the website to be detected and the domain name of the benchmark website is greater than the third preset value, and/or whether the domain name price of the website to be detected is less than the fourth preset value; when it is determined that the similarity between the two domain names is greater than the third preset value and/or that the domain name price of the website to be detected is less than the fourth preset value, judging whether a keyword of the specified type exists in the website to be detected; and, when a keyword of the specified type exists in the website to be detected, determining that the website to be detected is a website of the specified type.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining the DOM tree of the website to be detected; decomposing the DOM tree to obtain a tree path set; and determining, according to the tree path set, the similarity between the web page structures of the website to be detected and the benchmark website.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: extracting keywords from the tree path set; comparing the keywords in the tree path set with the keywords of the specified type to obtain a similarity; and, when that similarity is greater than the similarity threshold, determining that a keyword of the specified type exists in the website to be detected.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also fall within the protection scope of the present invention.

Claims (25)

1. A method for detecting a website, characterized by comprising:
Determining a similarity between the web page structure of a website to be detected and that of a benchmark website;
When the similarity is greater than a first preset value, judging whether a keyword of a specified type exists in the website to be detected;
When it is determined that a keyword of the specified type exists in the website to be detected, determining that the website to be detected is a website of the specified type.
2. The method according to claim 1, characterized in that the method further comprises:
When the similarity is less than or equal to the first preset value and greater than a second preset value, judging whether the similarity between the domain name of the website to be detected and the domain name of the benchmark website is greater than a third preset value, and/or whether the domain name price of the website to be detected is less than a fourth preset value;
When it is determined that the similarity between the domain name of the website to be detected and the domain name of the benchmark website is greater than the third preset value, and/or that the domain name price of the website to be detected is less than the fourth preset value, judging whether a keyword of the specified type exists in the website to be detected;
When it is determined that a keyword of the specified type exists in the website to be detected, determining that the website to be detected is a website of the specified type.
3. The method according to claim 1, characterized in that determining the similarity between the web page structures of the website to be detected and the benchmark website comprises:
Obtaining the DOM tree of the website to be detected;
Decomposing the DOM tree to obtain a tree path set;
Determining, according to the tree path set, the similarity between the web page structures of the website to be detected and the benchmark website.
4. The method according to claim 3, characterized in that judging whether a keyword of a specified type exists in the website to be detected comprises:
Extracting keywords from the tree path set;
Comparing the keywords in the tree path set with the keywords of the specified type to obtain a similarity;
When that similarity is greater than a similarity threshold, determining that a keyword of the specified type exists in the website to be detected.
5. A method for detecting a website, characterized by comprising:
Obtaining data to be detected of a website to be detected;
Determining a first similarity between the data to be detected and data in an abnormal website library, wherein the abnormal website library contains the web page structures of a plurality of abnormal websites;
Determining a second similarity between the data to be detected and keywords in a sensitive dictionary;
If the first similarity is greater than a first threshold and the second similarity is greater than a second threshold, determining that the website to be detected is a website of a specified type.
6. The method according to claim 5, characterized in that, before obtaining the data to be detected of the website to be detected, the method further comprises:
Constructing the abnormal website library and the sensitive dictionary;
Constructing an abnormality detection library according to the abnormal website library and the sensitive dictionary.
7. The method according to claim 6, characterized in that constructing the sensitive dictionary comprises:
Obtaining data sets of the plurality of abnormal websites;
Processing the data sets of the plurality of abnormal websites to obtain a tree path set of the data sets;
Extracting keywords from the tree path set, wherein the keywords are sensitive keywords;
Constructing the sensitive dictionary according to the extracted keywords.
8. The method according to claim 6, characterized in that constructing the abnormal website library comprises:
Obtaining the DOM trees of the plurality of abnormal websites;
Decomposing the DOM trees to obtain a tree path set;
Determining the similarity of the web page structures of the plurality of abnormal websites according to the tree path set;
Clustering the plurality of abnormal websites according to the similarity of the web page structures to obtain a clustering result;
Constructing the abnormal website library according to the clustering result.
9. The method according to claim 8, characterized in that determining the similarity of the web page structures of the plurality of abnormal websites according to the tree path set comprises:
Obtaining a first similarity between each path in the tree path set and its matching path, wherein the matching path is the path with the highest similarity to that path;
Determining the similarity of the web page structures according to the first similarity.
10. The method according to claim 8, characterized in that constructing the abnormal website library according to the clustering result comprises:
Determining a template web page for each class of abnormal websites according to the clustering result;
Constructing the abnormal website library based on the template web pages.
11. The method according to claim 6, characterized in that the abnormality detection library further includes a domain name information library, the domain name information library including a violation domain name information library and a domain name price library, wherein, before determining that the website to be detected is a website of the specified type, the method further comprises:
Obtaining domain name information of the website to be detected;
Determining a third similarity between the domain name of the website to be detected and the domain names in the violation domain name information library, and/or,
Determining the domain name price of the website to be detected according to the domain name price library.
12. The method according to claim 11, characterized in that determining that the website to be detected is a website of the specified type comprises:
If the first similarity is greater than the first threshold, the second similarity is greater than the second threshold, and the third similarity is greater than a third preset value and/or the domain name price of the website to be detected is less than a preset price, determining that the website to be detected is a website of the specified type.
13. The method according to claim 6 or 11, characterized in that, after determining that the website to be detected is a website of the specified type, the method further comprises:
Storing the website of the specified type in the abnormality detection library.
14. A method for detecting a website, characterized by comprising:
Receiving data information of a website to be detected;
Evaluating the data information of the website to be detected based on a plurality of abnormality detection libraries to obtain a risk value of the website to be detected, wherein different abnormality detection libraries correspond to different judgment rules, and the judgment rules are used to determine the risk value of the website to be detected under the respective abnormality detection libraries;
Determining the website type of the website to be detected based on the risk value of the website to be detected.
15. The method according to claim 14, characterized in that the plurality of abnormality detection libraries include at least a sensitive dictionary, an abnormal website library and a domain name information library, wherein the abnormal website library contains the web page structures of a plurality of abnormal websites.
16. The method according to claim 14, characterized in that, after evaluating the data information of the website to be detected based on the plurality of abnormality detection libraries to obtain the risk value of the website to be detected, the method further comprises:
When it is determined that the website to be detected is an abnormal website, storing the website to be detected in the corresponding abnormality detection library.
17. The method according to claim 15, characterized in that, before evaluating the data information of the website to be detected based on the plurality of abnormality detection libraries to obtain the risk value of the website to be detected, the method further comprises constructing the sensitive dictionary and the abnormal website library, wherein constructing the sensitive dictionary comprises:
Obtaining a data set of abnormal websites;
Processing the data set of the abnormal websites to obtain a tree path set of the data set;
Extracting keywords from the tree path set, wherein the keywords are sensitive keywords;
Constructing the sensitive dictionary according to the extracted keywords.
18. The method according to claim 15, characterized in that constructing the abnormal website library comprises:
Obtaining the DOM trees of the plurality of abnormal websites;
Decomposing the DOM trees to obtain a tree path set;
Determining the similarity of the web page structures of the plurality of abnormal websites according to the tree path set;
Clustering the plurality of abnormal websites according to the similarity of the web page structures to obtain a clustering result;
Constructing the abnormal website library according to the clustering result.
19. The method according to claim 15, characterized in that evaluating the data information of the website to be detected based on the plurality of abnormality detection libraries to obtain the risk value of the website to be detected comprises:
Detecting the data information of the website to be detected based on the sensitive dictionary to obtain a first risk value;
Detecting the data information of the website to be detected based on the abnormal website library to obtain a second risk value;
Detecting the data information of the website to be detected based on the domain name information library to obtain a third risk value;
Computing a weighted sum of the first risk value, the second risk value and the third risk value to determine the risk value of the website to be detected.
20. The method according to claim 19, characterized in that determining the website type of the website to be detected based on the risk value of the website to be detected comprises:
Judging whether the risk value of the website to be detected is greater than a preset risk value;
When the risk value of the website to be detected is greater than the preset risk value, determining that the website to be detected is an abnormal website.
21. A system for detecting a website, characterized by comprising:
An input device, configured to obtain a website to be detected;
A processor, configured to determine the similarity between the web page structures of the website to be detected and a benchmark website, and, when the similarity is greater than a first preset value, to determine that the website to be detected is a website of a specified type if a keyword of the specified type exists in the website to be detected.
22. The system according to claim 21, characterized in that the system further comprises:
A memory, configured to store the website to be detected when it is a website of the specified type.
23. A storage medium, characterized in that the storage medium includes a stored program, wherein, when the program runs, a device where the storage medium is located is controlled to perform the following steps:
Determining a similarity between the web page structure of a website to be detected and that of a benchmark website;
When the similarity is greater than a first preset value, judging whether a keyword of a specified type exists in the website to be detected;
When it is determined that a keyword of the specified type exists in the website to be detected, determining that the website to be detected is a website of the specified type.
24. A processor, characterized in that the processor is configured to run a program, wherein the following steps are performed when the program runs:
Determining a similarity between the web page structure of a website to be detected and that of a benchmark website;
When the similarity is greater than a first preset value, judging whether a keyword of a specified type exists in the website to be detected;
When it is determined that a keyword of the specified type exists in the website to be detected, determining that the website to be detected is a website of the specified type.
25. A system for detecting a website, characterized by comprising:
A processor; and
A memory, connected to the processor and configured to provide the processor with instructions for the following processing steps:
Determining a similarity between the web page structure of a website to be detected and that of a benchmark website;
When the similarity is greater than a first preset value, judging whether a keyword of a specified type exists in the website to be detected;
When it is determined that a keyword of the specified type exists in the website to be detected, determining that the website to be detected is a website of the specified type.
CN201810164312.4A 2018-02-27 2018-02-27 Detect the method and system of website Pending CN110309402A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810164312.4A CN110309402A (en) 2018-02-27 2018-02-27 Detect the method and system of website

Publications (1)

Publication Number Publication Date
CN110309402A true CN110309402A (en) 2019-10-08

Family

ID=68073643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810164312.4A Pending CN110309402A (en) 2018-02-27 2018-02-27 Detect the method and system of website

Country Status (1)

Country Link
CN (1) CN110309402A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102739679A (en) * 2012-06-29 2012-10-17 东南大学 URL(Uniform Resource Locator) classification-based phishing website detection method
US20130086677A1 (en) * 2010-12-31 2013-04-04 Huawei Technologies Co., Ltd. Method and device for detecting phishing web page
CN103179095A (en) * 2011-12-22 2013-06-26 阿里巴巴集团控股有限公司 Method and client device for detecting phishing websites
CN103685307A (en) * 2013-12-25 2014-03-26 北京奇虎科技有限公司 Method, system, client and server for detecting phishing fraud webpage based on feature library
CN104615760A (en) * 2015-02-13 2015-05-13 北京瑞星信息技术有限公司 Phishing website recognizing method and phishing website recognizing system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111078962A (en) * 2019-12-24 2020-04-28 北京海致星图科技有限公司 Method, system, medium and device for finding similar website sections
CN112328732A (en) * 2020-10-22 2021-02-05 上海艾融软件股份有限公司 Sensitive word detection method and device and sensitive word tree construction method and device
CN112347327A (en) * 2020-10-22 2021-02-09 杭州安恒信息技术股份有限公司 Website detection method and device, readable storage medium and computer equipment
CN112347327B (en) * 2020-10-22 2024-03-19 杭州安恒信息技术股份有限公司 Website detection method and device, readable storage medium and computer equipment
CN116680700A (en) * 2023-05-18 2023-09-01 北京天融信网络安全技术有限公司 Risk detection method, apparatus, device and storage medium
CN116680700B (en) * 2023-05-18 2024-06-14 北京天融信网络安全技术有限公司 Risk detection method, apparatus, device and storage medium

Similar Documents

Publication Publication Date Title
CN109271512B (en) Emotion analysis method, device and storage medium for public opinion comment information
CN110275958B (en) Website information identification method and device and electronic equipment
CN110309402A (en) Detect the method and system of website
CN111881983B (en) Data processing method and device based on classification model, electronic equipment and medium
CN109918560A (en) A kind of answering method and device based on search engine
CN106570513A (en) Fault diagnosis method and apparatus for big data network system
CN107220296A (en) The generation method of question and answer knowledge base, the training method of neutral net and equipment
CN105577685A (en) Intrusion detection independent analysis method and system in cloud calculation environment
CN108229170B (en) Software analysis method and apparatus using big data and neural network
CN110222171A (en) A kind of application of disaggregated model, disaggregated model training method and device
CN111866004B (en) Security assessment method, apparatus, computer system, and medium
CN110134961A (en) Processing method, device and the storage medium of text
CN110019519A (en) Data processing method, device, storage medium and electronic device
CN110414581B (en) Picture detection method and device, storage medium and electronic device
CN109657459A (en) Webpage back door detection method, equipment, storage medium and device
CN109634820A (en) A kind of fault early warning method, relevant device and the system of the collaboration of cloud mobile terminal
CN110365691A (en) Fishing website method of discrimination and device based on deep learning
CN110472866A (en) A kind of work order quality inspection analysis method and device
CN113704420A (en) Method and device for identifying role in text, electronic equipment and storage medium
CN114462040A (en) Malicious software detection model training method, malicious software detection method and malicious software detection device
CN112131354B (en) Answer screening method and device, terminal equipment and computer readable storage medium
CN113628043A (en) Complaint validity judgment method, device, equipment and medium based on data classification
CN108875374B (en) Malicious PDF detection method and device based on document node type
CN115879110A (en) System for identifying financial risk website based on fingerprint penetration technology
CN109714342A (en) The guard method of a kind of electronic equipment and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40015537

Country of ref document: HK

RJ01 Rejection of invention patent application after publication

Application publication date: 20191008
