CN107436890A - Method and device for detecting a website type - Google Patents

Method and device for detecting a website type


Publication number
CN107436890A
Authority
CN
China
Prior art keywords
website
page
detected
level
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610362232.0A
Other languages
Chinese (zh)
Inventor
赵燕雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201610362232.0A
Publication of CN107436890A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

Embodiments of the present application provide a method and device for detecting a website type. The method includes: accessing at least two levels of pages of a website to be detected according to the address of the website to be detected; obtaining the web page code corresponding to the at least two levels of pages; extracting feature information from the web page code as basic feature information; obtaining, according to the basic feature information, the matching degree between the at least two levels of pages and a preset template, as a first matching degree; and, if the first matching degree is greater than a preset threshold, determining that the website to be detected belongs to the website type corresponding to the preset template. The embodiments thus provide a way to detect the website type automatically, which reduces workload and improves detection efficiency. Moreover, because the embodiments no longer rely on the experience of inspectors, and detection is based on at least two levels of pages of the website to be detected, the accuracy of the detection result can be effectively improved.

Description

Method and device for detecting a website type
Technical field
The present application relates to the field of Internet technology, and in particular to a method and device for detecting a website type.
Background
With the development of Internet technology, the type of a website needs to be detected in many scenarios. For example, the security of a website may be judged by detecting its type. As another example, when a website is filed with the Ministry of Industry and Information Technology (MIIT), the type of the website needs to be detected to determine whether it is consistent with the type reported at filing.
At present, the type of a website is usually determined manually by an inspector according to the content the website displays. This approach clearly involves a heavy workload and results in low detection efficiency.
How to detect the website type automatically is therefore a problem that urgently needs to be solved.
Summary of the invention
The technical problem solved by the present application is to provide a method and device for detecting a website type, so that the website type can be detected automatically, thereby reducing workload and improving detection efficiency.
To this end, the technical solution of the present application is as follows:
The present application provides a method for detecting a website type, including:
Accessing at least two levels of pages of a website to be detected according to the address of the website to be detected;
Obtaining the web page code corresponding to the at least two levels of pages;
Extracting feature information from the web page code as basic feature information;
Obtaining, according to the basic feature information, the matching degree between the at least two levels of pages and a preset template, as a first matching degree;
If the first matching degree is greater than a preset threshold, determining that the website to be detected belongs to the website type corresponding to the preset template.
Optionally, the detection method further includes:
If the first matching degree is less than the preset threshold, determining that the website to be detected does not belong to the website type, or determining whether the website to be detected belongs to the website type according to the basic feature information and additional feature information.
Optionally, determining whether the website to be detected belongs to the website type according to the basic feature information and the additional feature information includes:
Accessing the next-level page of the at least two levels of pages;
Obtaining the web page code corresponding to the next-level page;
Extracting feature information from the web page code corresponding to the next-level page as the additional feature information;
Obtaining, according to the basic feature information and the additional feature information, the matching degree between at least three levels of pages and the preset template, as a second matching degree, where the at least three levels of pages include the at least two levels of pages and the next-level page of the at least two levels of pages;
Determining whether the website to be detected belongs to the website type according to the result of comparing the second matching degree with the preset threshold.
Optionally, the preset template includes at least one module, and each module has corresponding matching feature information and a weight; obtaining the matching degree between the at least two levels of pages and the preset template according to the basic feature information includes:
Determining, according to the basic feature information and the matching feature information corresponding to each module, the N modules that match the at least two levels of pages, N >= 0;
Obtaining the matching degree between the at least two levels of pages and the preset template according to the weights corresponding to the N modules.
Optionally, the detection method further includes:
Obtaining the feature information corresponding to websites that belong to the website type, as feedback feature information;
Adjusting the preset template according to the feedback feature information.
Optionally, the at least two levels of pages include a first-level page and a second-level page, and accessing at least two levels of pages of the website to be detected according to the address of the website to be detected includes:
Accessing the first-level page of the website to be detected according to the first-level page address of the website to be detected;
Obtaining the web page code of the first-level page, and obtaining the address of the second-level page from the web page code of the first-level page;
Accessing the second-level page according to the address of the second-level page.
Optionally, the basic feature information includes the identifier and/or content of page elements.
Optionally, when at least two levels of pages of the website to be detected are accessed, the detection method further includes:
Simulating a login to and/or simulating operations on the website to be detected.
Optionally, the website type is the e-commerce type, and the preset template is the template corresponding to the e-commerce type.
The present application also provides a device for detecting a website type, including:
An access unit, configured to access at least two levels of pages of a website to be detected according to the address of the website to be detected;
A first acquisition unit, configured to obtain the web page code corresponding to the at least two levels of pages;
An extraction unit, configured to extract feature information from the web page code as basic feature information;
A second acquisition unit, configured to obtain, according to the basic feature information, the matching degree between the at least two levels of pages and a preset template, as a first matching degree;
A first determining unit, configured to determine, if the first matching degree is greater than a preset threshold, that the website to be detected belongs to the website type corresponding to the preset template.
Optionally, the device further includes a second determining unit or a third determining unit;
The second determining unit is configured to determine, if the first matching degree is less than the preset threshold, that the website to be detected does not belong to the website type; the third determining unit is configured to determine whether the website to be detected belongs to the website type according to the basic feature information and additional feature information.
Optionally, the third determining unit includes:
A first access subunit, configured to access the next-level page of the at least two levels of pages;
A first acquisition subunit, configured to obtain the web page code corresponding to the next-level page;
An extraction subunit, configured to extract feature information from the web page code corresponding to the next-level page as the additional feature information;
A second acquisition subunit, configured to obtain, according to the basic feature information and the additional feature information, the matching degree between at least three levels of pages and the preset template, as a second matching degree, where the at least three levels of pages include the at least two levels of pages and the next-level page of the at least two levels of pages;
A first determining subunit, configured to determine whether the website to be detected belongs to the website type according to the result of comparing the second matching degree with the preset threshold.
Optionally, the preset template includes at least one module, and each module has corresponding matching feature information and a weight; the second acquisition unit includes:
A second determining subunit, configured to determine, according to the basic feature information and the matching feature information corresponding to each module, the N modules that match the at least two levels of pages, N >= 0;
A third acquisition subunit, configured to obtain the matching degree between the at least two levels of pages and the preset template according to the weights corresponding to the N modules.
Optionally, the device further includes:
A third acquisition unit, configured to obtain the feature information corresponding to websites that belong to the website type, as feedback feature information;
An adjustment unit, configured to adjust the preset template according to the feedback feature information.
Optionally, the at least two levels of pages include a first-level page and a second-level page; the access unit includes:
A second access subunit, configured to access the first-level page of the website to be detected according to the first-level page address of the website to be detected;
A fourth acquisition subunit, configured to obtain the web page code of the first-level page and obtain the address of the second-level page from the web page code of the first-level page;
A third access subunit, configured to access the second-level page according to the address of the second-level page.
Optionally, the basic feature information includes the identifier and/or content of page elements.
Optionally, the device further includes a simulated login unit and/or a simulated operation unit;
The simulated login unit is configured to simulate a login to the website to be detected;
The simulated operation unit is configured to simulate operations on the website to be detected.
Optionally, the website type is the e-commerce type, and the preset template is the template corresponding to the e-commerce type.
As can be seen from the above technical solution, in the embodiments of the present application, at least two levels of pages of the website to be detected are accessed automatically, so that the web page code corresponding to those pages can be obtained. According to the feature information extracted from the web page code, the matching degree between the at least two levels of pages and the preset template can be obtained. Because the preset template corresponds to one website type, a matching degree greater than the preset threshold indicates that the website to be detected belongs to that website type. The embodiments thus provide a way to detect the website type automatically, which reduces workload and improves detection efficiency. Moreover, because the embodiments no longer rely on the experience of inspectors, and detection is based on at least two levels of pages of the website to be detected, the accuracy of the detection result can be effectively improved.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a schematic flowchart of a method embodiment provided by the present application;
Fig. 2 is a schematic flowchart of another method embodiment provided by the present application;
Fig. 3 is a schematic diagram of the upper part of a first-level page provided by the present application;
Fig. 4 is a schematic diagram of the lower part of a first-level page provided by the present application;
Fig. 5 is a schematic diagram of a second-level page provided by the present application;
Fig. 6 is a schematic diagram of a third-level page provided by the present application;
Fig. 7 is a schematic structural diagram of a device embodiment provided by the present application.
Detailed description
The type of a website needs to be detected in many scenarios. For example, the website type must be reported when a website is filed with the Ministry of Industry and Information Technology (MIIT). However, many reported website types are inconsistent with the actual website type. An inspector therefore has to access the website by its address and, based on the content displayed on its first-level page, manually determine the actual website type from experience. This approach clearly involves a heavy workload and results in low detection efficiency. In addition, because the experience of inspectors is limited, the accuracy of the detection result is often low.
The embodiments of the present application provide a method and device for detecting a website type, so that the website type can be detected automatically, thereby reducing workload and improving detection efficiency.
To help those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort shall fall within the scope of protection of the present application.
Referring to Fig. 1, an embodiment of the present application provides a method embodiment of the website type detection method. The method of this embodiment includes:
S101: Access at least two levels of pages of the website to be detected according to the address of the website to be detected.
In this embodiment, when the type of a website needs to be detected, the address of the website, such as its domain name, can be obtained, and the website to be detected can be accessed automatically through that address.
If the at least two levels of pages include a first-level page and a second-level page, the address is usually the first-level page address of the website to be detected, that is, the home page address. The first-level page of the website can be accessed according to the first-level page address. By obtaining the web page code of the first-level page, the address of the second-level page can be obtained from that code, and the second-level page can then be accessed according to its address. The third-level page, the fourth-level page, and so on can be accessed in turn in the same way.
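The step of obtaining second-level page addresses from the web page code of the first-level page can be sketched as follows. This is a minimal illustration, not the implementation described in the application: the names `LinkCollector` and `next_level_addresses`, the example URLs, and the choice to keep only links on the same host are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def next_level_addresses(page_url, page_html):
    """Return absolute addresses linked from page_html on the same host.

    Restricting to the same host is an assumption of this sketch; the
    application only says that addresses are taken from the page code.
    """
    collector = LinkCollector()
    collector.feed(page_html)
    collector.close()
    site = urlparse(page_url).netloc
    result = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == site and absolute not in result:
            result.append(absolute)
    return result


# Hypothetical first-level page code with one internal and one external link.
home_html = (
    '<a href="/item/1.html">Phone</a>'
    '<a href="http://other.example/ad">Ad</a>'
)
second_level = next_level_addresses("http://shop.example/", home_html)
print(second_level)
```

Applying the same function to the code of a second-level page would yield candidate third-level addresses.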
The at least two levels of pages are usually the first M levels of pages, where M >= 2, and the specific value of M can be preset. In other words, in addition to the first-level page, this embodiment also accesses the second-level page of the website to be detected and possibly deeper pages, so as to obtain more complete information about the website and improve the accuracy of the detection result.
The inventor has found through research that accessing the first three levels of pages of a website, that is, the first-level, second-level, and third-level pages, is generally sufficient to determine the type of the website to be detected accurately.
S102: Obtain the web page code corresponding to the at least two levels of pages.
This step can be implemented with technologies such as web crawlers. The web page code can include static web page markup code and/or JavaScript dynamic script code.
S103: Extract feature information from the web page code as basic feature information.
By analyzing the web page code, feature information can be extracted from it. This feature information can reflect basic attributes of the at least two levels of pages, such as display attributes.
The feature information can include the identifier and/or content of page elements. For example, if the web page code assigns the value 'order 01' to the variable order, that variable serves as a page element whose identifier is "order" and whose content is "order 01".
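As a rough illustration of extracting identifiers and contents of page elements, the sketch below collects element `id` values, visible text, and string assignments found in script code. The class name `FeatureExtractor`, the regular expression, and the sample markup are assumptions made for the example; the application does not prescribe a particular extraction technique.

```python
import re
from html.parser import HTMLParser

# Matches simple string assignments such as: var order = "order 01";
ASSIGNMENT = re.compile(r"(?:var\s+)?(\w+)\s*=\s*['\"]([^'\"]*)['\"]")


class FeatureExtractor(HTMLParser):
    """Collects element ids, visible text, and script string assignments."""

    def __init__(self):
        super().__init__()
        self.features = set()
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        for name, value in attrs:
            if name == "id" and value:
                self.features.add(value)  # identifier of a page element

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_script:
            # Variable name is treated as identifier, its value as content.
            for identifier, content in ASSIGNMENT.findall(text):
                self.features.add(identifier)
                self.features.add(content)
        else:
            self.features.add(text)  # content of a page element


def extract_features(page_html):
    extractor = FeatureExtractor()
    extractor.feed(page_html)
    extractor.close()
    return extractor.features


sample = '<div id="cart">Add to cart</div><script>var order = "order 01";</script>'
features = extract_features(sample)
print(sorted(features))
```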
S104: Obtain, according to the basic feature information, the matching degree between the at least two levels of pages and a preset template, as a first matching degree.
The preset template corresponds to one website type and can reflect the specific attributes that websites of that type have. For example, if the preset template corresponds to the e-commerce type, it can reflect the attributes that e-commerce websites have, such as the fact that they usually have a product category area, an e-commerce certification record area, a product details page, and so on. By comparing the basic feature information with the preset template, the matching degree between the at least two levels of pages and the preset template can be obtained.
S105: If the first matching degree is greater than a preset threshold, determine that the website to be detected belongs to the website type corresponding to the preset template.
If the first matching degree is greater than the preset threshold, the at least two levels of pages match the preset template closely, which indicates that the website to be detected belongs to the website type corresponding to the preset template. For example, if the preset template is the template corresponding to the e-commerce type and the first matching degree is greater than the preset threshold, it can be determined that the website to be detected is an e-commerce website.
The embodiments of the present application can be used in any electronic device with a detection function, such as handheld devices (for example mobile phones), computers, and servers.
As can be seen from the above technical solution, in the embodiments of the present application, at least two levels of pages of the website to be detected are accessed automatically, so that the web page code corresponding to those pages can be obtained. According to the feature information extracted from the web page code, the matching degree between the at least two levels of pages and the preset template can be obtained. Because the preset template corresponds to one website type, a matching degree greater than the preset threshold indicates that the website to be detected belongs to that website type. The embodiments thus provide a way to detect the website type automatically, which reduces workload and improves detection efficiency. Moreover, because the embodiments no longer rely on the experience of inspectors, and detection is based on at least two levels of pages of the website to be detected, the accuracy of the detection result can be effectively improved.
In the embodiments of the present application, when the at least two levels of pages are accessed, a login to the website to be detected and/or operations on it can also be simulated. For example, automation technologies such as TestNG can be used to register an account automatically so as to simulate logging in to the website, or to simulate shopping behavior on a shopping page.
In the embodiments of the present application, if the matching degree between the at least two levels of pages and the preset template is less than the preset threshold, it can be determined that the website to be detected does not belong to the website type corresponding to the preset template. Alternatively, it can be considered that the current data is not sufficient to judge whether the website belongs to the website type, so the judgment can be made in combination with further data, such as the feature information corresponding to the next-level page. This is described in detail below.
The method can further include: if the first matching degree is less than the preset threshold, determining that the website to be detected does not belong to the website type, in which case the detection result can be presented to the user; or, if the first matching degree is less than the preset threshold, determining whether the website to be detected belongs to the website type according to the basic feature information and additional feature information.
The additional feature information can be obtained by accessing the next-level page. Specifically, determining whether the website to be detected belongs to the website type according to the basic feature information and the additional feature information can include: accessing the next-level page of the at least two levels of pages; obtaining the web page code corresponding to the next-level page; extracting feature information from that web page code as the additional feature information; obtaining, according to the basic feature information and the additional feature information, the matching degree between at least three levels of pages and the preset template, as a second matching degree, where the at least three levels of pages include the at least two levels of pages and their next-level page; and determining whether the website to be detected belongs to the website type according to the result of comparing the second matching degree with the preset threshold.
For example, if the at least two levels of pages include the first-level page and the second-level page, the next-level page is the third-level page. The third-level page is accessed, feature information is extracted from its web page code, and the matching degree between these three levels of pages and the preset template is obtained from the feature information corresponding to the first-level, second-level, and third-level pages. Whether the website to be detected belongs to the website type is then determined according to that matching degree. The third-level page can be accessed in the same way that the second-level page is accessed from the web page code of the first-level page, which is not repeated here.
If the second matching degree is greater than the preset threshold, the website to be detected belongs to the website type. If the second matching degree is less than the preset threshold, either the website to be detected does not belong to the website type, or the judgment can be continued with a further next-level page, such as the fourth-level page in the example above. A maximum number of levels can be set for detection in the embodiments of the present application. For example, with the maximum set to 6, if the matching degree determined from the feature information of the first six levels of pages is still below the preset threshold, it is finally determined that the website to be detected does not belong to the website type.
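The level-by-level escalation described above can be sketched as a small loop. This is only an illustration under stated assumptions: the function name `detect_with_escalation`, the representation of each level's features as a set of strings, and the example weights and threshold are all hypothetical.

```python
def detect_with_escalation(features_by_level, score, threshold,
                           first_levels=2, max_levels=6):
    """Return (belongs, levels_used).

    features_by_level[i] holds the features extracted from level i + 1.
    score maps the accumulated feature set to a matching degree.
    The first comparison is made once `first_levels` levels are collected;
    one further level is added at a time, up to `max_levels`.
    """
    collected = set()
    limit = min(max_levels, len(features_by_level))
    for level in range(1, limit + 1):
        collected |= features_by_level[level - 1]
        if level >= first_levels and score(collected) > threshold:
            return True, level
    return False, limit


# Hypothetical weights: matching degree is the sum of matched weights.
weights = {"category": 20, "product details": 30, "shopping cart": 30}


def example_score(features):
    return sum(w for key, w in weights.items() if key in features)


levels = [{"category"}, {"product details"}, {"shopping cart"}]
result = detect_with_escalation(levels, example_score, threshold=60)
print(result)
```

In this example the first two levels give a matching degree of 50, which does not exceed 60, so the third level is added and the matching degree of 80 settles the result.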
To avoid detecting an already detected page again and falling into an endless loop of links, before accessing the next-level page the embodiments of the present application can also judge whether the obtained address of the next-level page has already been accessed. If it has not, the access proceeds; if it has, the address of the next-level page is obtained again. For example, a first address is extracted from the web page code of the second-level page. If the first address is actually the address of the first-level page, it is judged to have been accessed already, and a second address is extracted from the web page code of the second-level page. If the second address is the address of a third-level page, it is judged not to have been accessed, and the second address is then accessed.
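The visited-address check can be sketched with a set of already accessed addresses. The function name `pick_next_address` and the example URLs are assumptions made for illustration.

```python
def pick_next_address(candidates, visited):
    """Return the first candidate not yet visited and record it as visited.

    Returns None when every candidate has already been accessed.
    """
    for address in candidates:
        if address not in visited:
            visited.add(address)
            return address
    return None


# The first-level page has already been accessed.
visited = {"http://shop.example/"}
# Addresses extracted from the second-level page, in document order.
candidates = ["http://shop.example/", "http://shop.example/cart"]

chosen = pick_next_address(candidates, visited)
print(chosen)
```

The first candidate points back to the first-level page and is skipped; the second candidate is returned and added to `visited`, so a later call with the same candidates returns `None`.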
In S104 of this embodiment, the matching degree between the at least two levels of pages and the preset template is obtained according to the basic feature information. One specific way of obtaining the matching degree is described below. Note that this specific way does not limit the embodiments of the present application.
The preset template can include at least one module, and each module has corresponding matching feature information, which is used to determine whether the at least two levels of pages match that module. The N modules that match the at least two levels of pages, N >= 0, are determined according to the basic feature information and the matching feature information corresponding to each module. In practice, the basic feature information is matched against the matching feature information of each module, and a successful match indicates that the at least two levels of pages match that module. Each module also has a corresponding weight, and the matching degree between the at least two levels of pages and the preset template is obtained according to the weights corresponding to the N modules. This is illustrated below with a preset template corresponding to the e-commerce type.
The preset template includes one or more of the following modules: a product category module, an e-commerce certification record module, a product details module, a shopping cart module, an order module, and a logistics module. Each module has its own matching feature information and weight. For example, the matching feature information of the product category module is "ICP certificate"; if the basic feature information matches "ICP certificate" successfully, the at least two levels of pages match the product category module. By repeating this process, the N modules that match the at least two levels of pages can be determined, and the matching degree between the at least two levels of pages and the preset template can be calculated from the weights corresponding to the N modules. For example, the weights corresponding to the N modules can be added together to obtain the matching degree.
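The weighted-module calculation can be sketched as follows. The module names follow the text, but the keywords, the weights, the threshold of 60, and the use of exact string membership as the matching rule are all hypothetical, since the application leaves these to the implementer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    name: str
    keywords: frozenset  # matching feature information (hypothetical values)
    weight: int          # hypothetical weight


# Hypothetical e-commerce template; keywords and weights are illustrative.
ECOMMERCE_TEMPLATE = [
    Module("product categories", frozenset({"Mobile & digital"}), 15),
    Module("e-commerce certification record", frozenset({"ICP certificate"}), 15),
    Module("product details", frozenset({"Ship to", "Purchase quantity"}), 20),
    Module("shopping cart", frozenset({"Add to cart"}), 20),
    Module("order", frozenset({"Submit order"}), 15),
    Module("logistics", frozenset({"Track shipment"}), 15),
]


def matched_modules(template, features):
    """Return the N modules with at least one keyword among the features."""
    return [module for module in template if module.keywords & features]


def matching_degree(template, features):
    """Matching degree as the sum of the weights of the matched modules."""
    return sum(module.weight for module in matched_modules(template, features))


basic_features = {"Mobile & digital", "ICP certificate", "Ship to", "Add to cart"}
degree = matching_degree(ECOMMERCE_TEMPLATE, basic_features)
belongs = degree > 60  # hypothetical preset threshold
print(degree, belongs)
```

Here four modules match, so N = 4 and the matching degree is 15 + 15 + 20 + 20 = 70, which exceeds the assumed threshold.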
When the basic feature information is matched against the matching feature information, the basic feature information obtained from a specific position can be matched against the matching feature information of the module that corresponds to that position. For example, the basic feature information obtained from the top of the website is matched against the matching feature information of the product category module, which corresponds to the top of the website; if the match succeeds, it is determined that the at least two levels of pages include the product category module.
The preset template can be adjusted and updated in real time, for example through self-learning. Specifically, the detection method further includes: obtaining the feature information corresponding to websites that belong to the website type, as feedback feature information; and adjusting the preset template according to the feedback feature information. For example, modules can be added to or deleted from the preset template according to the feedback feature information, or the weights of modules can be modified.
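One possible way to modify module weights from feedback feature information is sketched below. The application does not specify an adjustment rule, so the proportional rule used here, the function name `reweight`, and the example data are all assumptions.

```python
def reweight(keywords_by_module, current_weights, confirmed_sites, total=100.0):
    """Redistribute `total` across modules by how often each one matched.

    keywords_by_module: module name -> set of matching keywords.
    confirmed_sites: list of feature sets from websites known to belong
    to the website type (the feedback feature information).
    If no module matched any confirmed site, the weights are left unchanged.
    """
    hits = {
        name: sum(1 for features in confirmed_sites if keywords & features)
        for name, keywords in keywords_by_module.items()
    }
    all_hits = sum(hits.values())
    if all_hits == 0:
        return dict(current_weights)
    return {name: total * count / all_hits for name, count in hits.items()}


keywords = {"shopping cart": {"Add to cart"}, "logistics": {"Track shipment"}}
old_weights = {"shopping cart": 50.0, "logistics": 50.0}
feedback = [
    {"Add to cart"},
    {"Add to cart", "Track shipment"},
    {"Add to cart"},
]
new_weights = reweight(keywords, old_weights, feedback)
print(new_weights)
```

With this rule, a module that appears on more of the confirmed websites receives a larger share of the total weight.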
The detection method of the embodiment of the present application is illustrated below through a specific embodiment.
Referring to Fig. 2, the embodiment of the present application provides another method embodiment of the website type detection method. In this embodiment, the website type is the e-commerce class by way of example.
The method of the present embodiment includes:
S201: Obtain the address of the website to be detected, and access the first-level page of the website to be detected.
For example, entries whose industry is the e-commerce class can be identified in the filing system of the Ministry of Industry and Information Technology, and the corresponding addresses in those entries can be obtained automatically. The addresses of multiple websites to be detected can also be provided in batches, for example in an Excel file.
S202: Obtain the web page code corresponding to the first-level page, and extract the feature information in the web page code as feature information 01.
For example, for the first-level page shown in Fig. 3 and Fig. 4, the feature information 01 extracted from the corresponding web page code can include: mother-and-baby toys, mobile phones and digital products, air conditioners and TVs, "ICP license", etc.
S203: Obtain the address of the second-level page from the web page code of the first-level page, and access the second-level page according to that address.
S204: Obtain the web page code of the second-level page, and extract the feature information in the web page code as feature information 02. Feature information 01 and feature information 02 together form the basic feature information.
For example, for the second-level page shown in Fig. 5, the feature information 02 extracted from the corresponding web page code can include: deliver to, purchase quantity, add to shopping cart, buy now, etc.
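Steps S201 to S204 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the application: it uses Python's standard `html.parser` to pull link addresses (candidate next-level pages) and visible text (candidate feature information) out of page code that has already been downloaded. The names `PageScanner` and `scan_page` and the `shop.example` address are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageScanner(HTMLParser):
    """Collects link addresses and visible text from one page's code."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # addresses of candidate next-level pages
        self.features = []   # text fragments used as feature information

    def handle_starttag(self, tag, attrs):
        # An <a href="..."> on this page points at a page one level deeper.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.features.append(text)


def scan_page(html, base_url):
    """Return (links, features) for already-downloaded page code."""
    scanner = PageScanner(base_url)
    scanner.feed(html)
    scanner.close()
    return scanner.links, scanner.features


# Hypothetical first-level page code.
home = '<ul><li><a href="/item/1">Mobile phones</a></li></ul><p>ICP license</p>'
links, features = scan_page(home, "https://shop.example/")
```

Here `links` would supply the second-level page address used in S203, and `features` plays the role of feature information 01. A real crawler would also need to fetch pages, filter out scripts and navigation noise, and choose which link to follow; none of that is specified here.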
S205: Match feature information 01 and feature information 02 against the modules in the default template, and determine the N1 modules that match the first two levels of pages.
In the present embodiment, the default template can be as shown in Table 1. It should be noted that Table 1 is only an illustration; the modules included in Table 1, and the matching feature information and weights corresponding to each module, can be adjusted according to actual conditions.
Table 1
For example, feature information 01 is matched against the matching feature information of each module, which determines that the e-commerce authentication record module and the commodity classification module match the first-level page; feature information 02 is matched against the matching feature information of each module, which determines that the commodity details module matches the second-level page.
In the embodiment of the present application, when the basic feature information is matched against the matching feature information, exact matching or fuzzy matching can be used, and module matching can include synonym matching and the like, to prevent the loss of key data.
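The three matching modes mentioned above (exact, synonym and fuzzy) could be combined as in the sketch below. The application does not say how fuzzy matching is computed; the use of `difflib.SequenceMatcher`, the `cutoff` value of 0.8, the function name `feature_matches` and the sample strings are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def feature_matches(feature, keyword, synonyms=None, cutoff=0.8):
    """Return True if an extracted feature matches a template keyword.

    Tries exact matching first, then a synonym table, then a fuzzy
    similarity ratio (an assumed stand-in for "fuzzy matching").
    """
    if feature == keyword:
        return True
    if synonyms and feature in synonyms.get(keyword, ()):
        return True
    return SequenceMatcher(None, feature, keyword).ratio() >= cutoff


# Hypothetical synonym table for one template keyword.
SYNONYMS = {"add to cart": {"add to shopping cart", "add to basket"}}

exact = feature_matches("add to cart", "add to cart")
by_synonym = feature_matches("add to shopping cart", "add to cart", SYNONYMS)
by_fuzzy = feature_matches("shoping cart", "shopping cart")
unrelated = feature_matches("contact us", "shopping cart")
```

Checking the cheap exact comparison first and falling back to the fuzzy ratio last keeps the common case fast, while the synonym table covers wording differences that a character-level ratio would miss.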
S206: Obtain the matching degree between the first two levels of pages and the default template according to the weights corresponding to the N1 modules.
For example, the weights corresponding to the three modules (the e-commerce authentication record module, the commodity classification module and the commodity details module) are added, giving a matching degree of 0.15+0.2+0.2=0.55, i.e. 55%.
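The weighted sum in S206 can be written out as a short sketch. Table 1 is not reproduced in this text, so the mapping of weights to modules below is an assumption read off the order of the terms in the worked sums (0.15+0.2+0.2 and, later, 0.15+0.2+0.2+0.15); the identifiers `MODULE_WEIGHTS` and `matching_degree` are hypothetical.

```python
# Assumed weights per module, inferred from the order of the worked sums;
# the actual Table 1 may assign them differently.
MODULE_WEIGHTS = {
    "authentication_record": 0.15,
    "commodity_classification": 0.20,
    "commodity_details": 0.20,
    "shopping_cart": 0.15,
}


def matching_degree(matched_modules, weights=MODULE_WEIGHTS):
    """Sum the weights of the distinct modules that matched the pages."""
    return sum(weights[name] for name in set(matched_modules))


# First two levels of pages: three modules matched.
degree_two_levels = matching_degree(
    ["authentication_record", "commodity_classification", "commodity_details"]
)
```

Because the inputs are floating-point values, the result should be compared against the expected 0.55 with a tolerance rather than with `==`.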
S207: Compare the matching degree with the predetermined threshold; if the matching degree is greater than the predetermined threshold, perform S208; if the matching degree is less than the predetermined threshold, perform S209.
S208: Determine that the website to be detected belongs to the e-commerce class.
S209: Obtain the address of the third-level page from the web page code of the second-level page, and access the third-level page according to that address.
In the present embodiment, if it is detected that login is required when accessing the third-level page, automation technology such as TestNG can be used to register an account automatically and simulate logging in to the website.
S210: Obtain the web page code of the third-level page, and extract the feature information in the web page code as feature information 03. Feature information 03 serves as the additional feature information.
For example, for the third-level page shown in Fig. 6, the feature information 03 extracted from the corresponding web page code can include: unit price, quantity, checkout, etc.
S211: Match feature information 01, feature information 02 and feature information 03 against the modules in the default template, and determine the N2 modules that match the first three levels of pages.
For example, feature information 03 is matched against the matching feature information of each module, which determines that the shopping cart module matches the third-level page. The N2 modules therefore include: the e-commerce authentication record module, the commodity classification module, the commodity details module and the shopping cart module.
S212: Obtain the matching degree between the first three levels of pages and the default template according to the weights corresponding to the N2 modules.
For example, the weights corresponding to the four modules (the e-commerce authentication record module, the commodity classification module, the commodity details module and the shopping cart module) are added, giving a matching degree of 0.15+0.2+0.2+0.15=0.70, i.e. 70%.
S213: Compare the matching degree with the predetermined threshold; if the matching degree is greater than the predetermined threshold, perform S208; if the matching degree is less than the predetermined threshold, perform S214.
S214: It can be determined that the website to be detected does not belong to the e-commerce class; alternatively, the fourth-level page can be accessed and the judgment continued. A maximum number of detection levels can be set, for example 6: if the matching degree determined from the feature information of the first six levels of pages is still below the predetermined threshold, it is finally determined that the website to be detected does not belong to the website type.
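The overall flow of S205 to S214 (accumulate features level by level, recompute the matching degree, stop as soon as the threshold is exceeded or the maximum number of levels is reached) can be sketched as below. The keyword sets, the threshold of 0.6 and all identifiers are assumptions for illustration; the text gives neither the contents of Table 1 nor a threshold value. Matching here is plain set intersection, standing in for whatever exact, fuzzy or synonym matching is actually used.

```python
# Hypothetical template: module -> (matching keywords, weight).
TEMPLATE = {
    "authentication_record": ({"ICP license"}, 0.15),
    "commodity_classification": ({"mother-and-baby toys", "mobile phones"}, 0.20),
    "commodity_details": ({"add to shopping cart", "buy now"}, 0.20),
    "shopping_cart": ({"unit price", "checkout"}, 0.15),
}


def detect_site_type(features_by_level, template, threshold,
                     min_levels=2, max_levels=6):
    """Return (belongs, levels_used, matching_degree).

    features_by_level[i] holds the features extracted from level i+1.
    At least min_levels levels are combined before the first comparison.
    """
    seen = set()
    degree = 0.0
    levels_used = 0
    for level, features in enumerate(features_by_level[:max_levels], start=1):
        seen |= set(features)
        levels_used = level
        if level < min_levels:
            continue
        # A module matches when any of its keywords has been seen so far.
        degree = sum(w for keywords, w in template.values() if keywords & seen)
        if degree > threshold:
            return True, levels_used, degree
    return False, levels_used, degree


pages = [
    {"mother-and-baby toys", "mobile phones", "ICP license"},  # level 1
    {"deliver to", "add to shopping cart", "buy now"},         # level 2
    {"unit price", "quantity", "checkout"},                    # level 3
]
result = detect_site_type(pages, TEMPLATE, threshold=0.6)
```

With these assumed inputs the first two levels reach 0.55, which does not exceed 0.6, so the third level is consulted and the degree rises to 0.70, mirroring the worked example in the text.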
Corresponding to the above method embodiments, the present application also provides device embodiments, which are described in detail below.
Referring to Fig. 7, the present application provides a device embodiment of a website type detection device. The detection device of the present embodiment includes: an access unit 701, a first acquisition unit 702, an extraction unit 703, a second acquisition unit 704 and a first determining unit 705.
The access unit 701 is configured to access at least two levels of pages of the website to be detected according to the address of the website to be detected.
In the embodiment of the present application, when the website type of a website to be detected needs to be detected, the address of the website, such as its domain name, can be obtained, and the website to be detected can be accessed automatically through that address.
If the at least two levels of pages include a first-level page and a second-level page, the address is usually the first-level page address of the website to be detected, i.e. the home page address. The first-level page of the website can be accessed according to the first-level page address; by obtaining the web page code of the first-level page, the address of the second-level page can be obtained from that code, and the second-level page can be accessed according to its address. Similarly, the third-level page, the fourth-level page and so on can be accessed in turn.
The at least two levels of pages are usually the first M levels of pages, where M >= 2 and the specific value of M can be preset. That is, in addition to the first-level page, the embodiment of the present application can also access the second-level page or even deeper pages of the website to be detected, so as to obtain more comprehensive information about the website and improve the accuracy of the detection result.
The inventor found through research that accessing the first three levels of pages of a website, i.e. the first-level page, the second-level page and the third-level page, is generally sufficient to judge the website type of the website to be detected accurately.
The first acquisition unit 702 is configured to obtain the web page code corresponding to the at least two levels of pages.
The first acquisition unit 702 can obtain the web page code by means of technologies such as web crawlers. The web page code can include static web page markup code and/or JavaScript dynamic script code.
The extraction unit 703 is configured to extract feature information from the web page code as basic feature information.
The extraction unit 703 can extract the feature information of the web page code by analyzing the web page code. This feature information can reflect basic attributes of the at least two levels of pages, such as display attributes.
The feature information can include the identifier and/or content of page elements. For example, if the web page code assigns the value 'order 01' to the variable order, that variable serves as a page element whose identifier is "order" and whose content is "order 01".
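The identifier/content pair in this example could be pulled out of script code roughly as follows. The application does not specify the syntax of the assignment or how it is parsed; this sketch assumes a JavaScript-style `var`/`let`/`const` declaration with a quoted string value and uses a regular expression, and the names `ASSIGNMENT` and `extract_page_elements` are hypothetical.

```python
import re

# Matches e.g.  var order = 'order 01';  (assumed JavaScript-style syntax).
# Group 1: identifier, group 2: opening quote, group 3: string content.
ASSIGNMENT = re.compile(
    r"""\b(?:var|let|const)\s+([A-Za-z_$][\w$]*)\s*=\s*(['"])(.*?)\2"""
)


def extract_page_elements(script):
    """Map each string-valued variable's identifier to its content."""
    return {m.group(1): m.group(3) for m in ASSIGNMENT.finditer(script)}


elements = extract_page_elements("var order = 'order 01';")
```

A regular expression is only adequate for simple, flat assignments like this one; nested expressions or escaped quotes would need a real JavaScript parser.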
The second acquisition unit 704 is configured to obtain, according to the basic feature information, the matching degree between the at least two levels of pages and the default template, as the first matching degree.
The default template corresponds to one website type and can reflect the specific attributes that websites of that type have. For example, the default template corresponds to the e-commerce class and can reflect the attributes of e-commerce websites, such as the fact that an e-commerce website generally has a commodity classification area, an e-commerce authentication record area, a commodity details page and so on. The second acquisition unit 704 can compare the basic feature information with the default template to obtain the matching degree between the at least two levels of pages and the default template.
The first determining unit 705 is configured to determine, if the first matching degree is greater than the predetermined threshold, that the website to be detected belongs to the website type corresponding to the default template.
If the first matching degree is greater than the predetermined threshold, the at least two levels of pages match the default template to a high degree, which indicates that the website to be detected belongs to the website type corresponding to the default template. For example, if the default template is the template corresponding to the e-commerce class and the first matching degree is greater than the predetermined threshold, the first determining unit 705 can determine that the website to be detected belongs to the e-commerce class.
The detection device of the embodiment of the present application can be used in any electronic equipment with a detection function, such as handheld devices like mobile phones, computers and servers.
As can be seen from the above technical solution, in the embodiment of the present application, at least two levels of pages of the website to be detected are accessed automatically, so the web page code corresponding to those pages can be obtained; from the feature information extracted from the web page code, the matching degree between the at least two levels of pages and the default template can be obtained. Because the default template corresponds to one website type, a matching degree greater than the predetermined threshold indicates that the website to be detected belongs to that website type. The embodiment of the present application thus provides a way of detecting the website type automatically, which reduces workload and improves detection efficiency. Moreover, because the embodiment no longer relies on the experience of inspectors and detects according to at least two levels of pages of the website to be detected, it can effectively improve the accuracy of the detection result.
Optionally, the detection device further includes a second determining unit or a third determining unit.
The second determining unit is configured to determine, if the first matching degree is less than the predetermined threshold, that the website to be detected does not belong to the website type. The third determining unit is configured to determine, according to the basic feature information and additional feature information, whether the website to be detected belongs to the website type.
Optionally, the third determining unit includes:
a first access subunit, configured to access the next-level page after the at least two levels of pages;
a first acquisition subunit, configured to obtain the web page code corresponding to the next-level page;
an extraction subunit, configured to extract feature information from the web page code corresponding to the next-level page, as the additional feature information;
a second acquisition subunit, configured to obtain, according to the basic feature information and the additional feature information, the matching degree between at least three levels of pages and the default template, as the second matching degree, where the at least three levels of pages include the at least two levels of pages and the next-level page after them;
a first determining subunit, configured to determine, according to the result of comparing the second matching degree with the predetermined threshold, whether the website to be detected belongs to the website type.
Optionally, the default template includes at least one module, and each module has corresponding matching feature information and a weight; the second acquisition unit includes:
a second determining subunit, configured to determine, according to the basic feature information and the matching feature information corresponding to each module, the N modules that match the at least two levels of pages, N >= 0;
a third acquisition subunit, configured to obtain the matching degree between the at least two levels of pages and the default template according to the weights corresponding to the N modules.
Optionally, the detection device further includes:
a third acquisition unit, configured to obtain feature information corresponding to websites that belong to the website type, as feedback feature information;
an adjustment unit, configured to adjust the default template according to the feedback feature information.
Optionally, the at least two levels of pages include a first-level page and a second-level page; the access unit includes:
a second access subunit, configured to access the first-level page of the website to be detected according to the first-level page address of the website to be detected;
a fourth acquisition subunit, configured to obtain the web page code of the first-level page and obtain the address of the second-level page from that web page code;
a third access subunit, configured to access the second-level page according to the address of the second-level page.
Optionally, the detection device further includes a simulated login unit and/or a simulated operation unit.
The simulated login unit is configured to simulate logging in to the website to be detected; the simulated operation unit is configured to simulate operating the website to be detected.
Optionally, the website type is the e-commerce class, and the default template is the template corresponding to the e-commerce class.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices and methods can be implemented in other ways. For example, the device embodiments described above are only schematic. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components can be combined or integrated into another system, or some features can be ignored or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed can be indirect coupling or communication connection through some interfaces, devices or units, and can be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they can be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in each embodiment of the present application can be integrated in one processing unit, each unit can exist physically on its own, or two or more units can be integrated in one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which can be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
As described above, the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (18)

  1. A website type detection method, characterized by including:
    accessing at least two levels of pages of a website to be detected according to the address of the website to be detected;
    obtaining the web page code corresponding to the at least two levels of pages;
    extracting feature information from the web page code, as basic feature information;
    obtaining, according to the basic feature information, the matching degree between the at least two levels of pages and a default template, as a first matching degree;
    if the first matching degree is greater than a predetermined threshold, determining that the website to be detected belongs to the website type corresponding to the default template.
  2. The detection method according to claim 1, characterized in that the detection method further includes:
    if the first matching degree is less than the predetermined threshold, determining that the website to be detected does not belong to the website type, or determining, according to the basic feature information and additional feature information, whether the website to be detected belongs to the website type.
  3. The detection method according to claim 2, characterized in that determining, according to the basic feature information and additional feature information, whether the website to be detected belongs to the website type includes:
    accessing the next-level page after the at least two levels of pages;
    obtaining the web page code corresponding to the next-level page;
    extracting feature information from the web page code corresponding to the next-level page, as the additional feature information;
    obtaining, according to the basic feature information and the additional feature information, the matching degree between at least three levels of pages and the default template, as a second matching degree, where the at least three levels of pages include the at least two levels of pages and the next-level page after them;
    determining, according to the result of comparing the second matching degree with the predetermined threshold, whether the website to be detected belongs to the website type.
  4. The detection method according to claim 1, characterized in that the default template includes at least one module, and each module has corresponding matching feature information and a weight; obtaining, according to the basic feature information, the matching degree between the at least two levels of pages and the default template includes:
    determining, according to the basic feature information and the matching feature information corresponding to each module, the N modules that match the at least two levels of pages, N >= 0;
    obtaining the matching degree between the at least two levels of pages and the default template according to the weights corresponding to the N modules.
  5. The detection method according to claim 1, characterized in that the detection method further includes:
    obtaining feature information corresponding to websites that belong to the website type, as feedback feature information;
    adjusting the default template according to the feedback feature information.
  6. The detection method according to claim 1, characterized in that the at least two levels of pages include a first-level page and a second-level page, and accessing at least two levels of pages of the website to be detected according to the address of the website to be detected includes:
    accessing the first-level page of the website to be detected according to the first-level page address of the website to be detected;
    obtaining the web page code of the first-level page, and obtaining the address of the second-level page from the web page code of the first-level page;
    accessing the second-level page according to the address of the second-level page.
  7. The detection method according to claim 1, characterized in that the basic feature information includes the identifier and/or content of page elements.
  8. The detection method according to claim 1, characterized in that, when accessing at least two levels of pages of the website to be detected, the detection method further includes:
    simulating logging in to and/or simulating operating the website to be detected.
  9. The detection method according to any one of claims 1 to 8, characterized in that the website type is the e-commerce class, and the default template is the template corresponding to the e-commerce class.
  10. A website type detection device, characterized by including:
    an access unit, configured to access at least two levels of pages of a website to be detected according to the address of the website to be detected;
    a first acquisition unit, configured to obtain the web page code corresponding to the at least two levels of pages;
    an extraction unit, configured to extract feature information from the web page code, as basic feature information;
    a second acquisition unit, configured to obtain, according to the basic feature information, the matching degree between the at least two levels of pages and a default template, as a first matching degree;
    a first determining unit, configured to determine, if the first matching degree is greater than a predetermined threshold, that the website to be detected belongs to the website type corresponding to the default template.
  11. The detection device according to claim 10, characterized by further including a second determining unit or a third determining unit;
    the second determining unit is configured to determine, if the first matching degree is less than the predetermined threshold, that the website to be detected does not belong to the website type; the third determining unit is configured to determine, according to the basic feature information and additional feature information, whether the website to be detected belongs to the website type.
  12. The detection device according to claim 11, characterized in that the third determining unit includes:
    a first access subunit, configured to access the next-level page after the at least two levels of pages;
    a first acquisition subunit, configured to obtain the web page code corresponding to the next-level page;
    an extraction subunit, configured to extract feature information from the web page code corresponding to the next-level page, as the additional feature information;
    a second acquisition subunit, configured to obtain, according to the basic feature information and the additional feature information, the matching degree between at least three levels of pages and the default template, as a second matching degree, where the at least three levels of pages include the at least two levels of pages and the next-level page after them;
    a first determining subunit, configured to determine, according to the result of comparing the second matching degree with the predetermined threshold, whether the website to be detected belongs to the website type.
  13. The detection device according to claim 10, characterized in that the default template includes at least one module, and each module has corresponding matching feature information and a weight; the second acquisition unit includes:
    a second determining subunit, configured to determine, according to the basic feature information and the matching feature information corresponding to each module, the N modules that match the at least two levels of pages, N >= 0;
    a third acquisition subunit, configured to obtain the matching degree between the at least two levels of pages and the default template according to the weights corresponding to the N modules.
  14. The detection device according to claim 10, characterized by further including:
    a third acquisition unit, configured to obtain feature information corresponding to websites that belong to the website type, as feedback feature information;
    an adjustment unit, configured to adjust the default template according to the feedback feature information.
  15. The detection device according to claim 10, characterized in that the at least two levels of pages include a first-level page and a second-level page; the access unit includes:
    a second access subunit, configured to access the first-level page of the website to be detected according to the first-level page address of the website to be detected;
    a fourth acquisition subunit, configured to obtain the web page code of the first-level page and obtain the address of the second-level page from the web page code of the first-level page;
    a third access subunit, configured to access the second-level page according to the address of the second-level page.
  16. The detection device according to claim 10, characterized in that the basic feature information includes the identifier and/or content of page elements.
  17. The detection device according to claim 10, characterized by further including a simulated login unit and/or a simulated operation unit;
    the simulated login unit is configured to simulate logging in to the website to be detected;
    the simulated operation unit is configured to simulate operating the website to be detected.
  18. The detection device according to any one of claims 10 to 17, characterized in that the website type is the e-commerce class, and the default template is the template corresponding to the e-commerce class.
CN201610362232.0A 2016-05-26 2016-05-26 A kind of detection method and device of the Type of website Pending CN107436890A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610362232.0A CN107436890A (en) 2016-05-26 2016-05-26 A kind of detection method and device of the Type of website

Publications (1)

Publication Number Publication Date
CN107436890A true CN107436890A (en) 2017-12-05

Family

ID=60454521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610362232.0A Pending CN107436890A (en) 2016-05-26 2016-05-26 A kind of detection method and device of the Type of website

Country Status (1)

Country Link
CN (1) CN107436890A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819597A (en) * 2012-08-13 2012-12-12 北京星网锐捷网络技术有限公司 Web page classification method and equipment
CN103179095A (en) * 2011-12-22 2013-06-26 阿里巴巴集团控股有限公司 Method and client device for detecting phishing websites
CN103577447A (en) * 2012-07-30 2014-02-12 百度在线网络技术(北京)有限公司 Method and equipment used for determining page type information of target pages
US20140304814A1 (en) * 2011-10-19 2014-10-09 Cornell University System and methods for automatically detecting deceptive content
CN104750754A (en) * 2013-12-31 2015-07-01 北龙中网(北京)科技有限责任公司 Website industry classification method and server
CN104978423A (en) * 2015-06-30 2015-10-14 北京奇虎科技有限公司 Website type detection method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郭庚麒: ""基于Web挖掘的中文专业搜索引擎设计关键技术研究"", 《万方—中国学位论文全文数据库》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108742457A (en) * 2018-05-14 2018-11-06 佛山市顺德区美的洗涤电器制造有限公司 Dishwashing machine dispenser recognition methods, device and computer readable storage medium
CN108875060A (en) * 2018-06-29 2018-11-23 成都市映潮科技股份有限公司 A kind of website identification method and identifying system
CN108875060B (en) * 2018-06-29 2021-02-26 成都市映潮科技股份有限公司 Website identification method and identification system
CN109101657A (en) * 2018-08-30 2018-12-28 杭州安恒信息技术股份有限公司 Multiple level marketing referrer website identification method, device and equipment
CN110929129A (en) * 2018-08-31 2020-03-27 阿里巴巴集团控股有限公司 Information detection method, equipment and machine-readable storage medium
CN110929129B (en) * 2018-08-31 2023-12-26 阿里巴巴集团控股有限公司 Information detection method, equipment and machine-readable storage medium
CN109753619A (en) * 2018-12-25 2019-05-14 杭州安恒信息技术股份有限公司 Method for quickly identifying website industry type
CN111833064A (en) * 2019-04-17 2020-10-27 马上消费金融股份有限公司 Cheating detection method and device

Similar Documents

Publication Publication Date Title
CN107436890A (en) A kind of detection method and device of the Type of website
CN108416198B (en) Device and method for establishing human-machine recognition model and computer readable storage medium
CN107807987B (en) Character string classification method and system and character string classification equipment
CN108629043B (en) Webpage target information extraction method, device and storage medium
CN101694668B (en) Method and device for confirming web structure similarity
CN107168992A (en) Article classification method and device, equipment and computer-readable storage medium based on artificial intelligence
CN109299258A (en) A kind of public sentiment event detecting method, device and equipment
CN109062972A (en) Web page classification method, device and computer readable storage medium
CN107491536B (en) Test question checking method, test question checking device and electronic equipment
CN103235803B (en) A kind of method and apparatus obtaining goods attribute value from text
CN109714356A (en) A kind of recognition methods of abnormal domain name, device and electronic equipment
CN108053545A (en) Certificate verification method and apparatus, server, storage medium
CN113961473A (en) Data testing method and device, electronic equipment and computer readable storage medium
CN107895117A (en) Malicious code mask method and device
CN108804918A (en) Safety defence method, device, electronic equipment and storage medium
CN104346408A (en) Method and equipment for labeling network user
CN106168968A (en) A kind of Website classification method and device
CN108763961A (en) A kind of private data stage division and device based on big data
CN108959289B (en) Website category acquisition method and device
CN104572810A (en) Method for carrying out operation processing on massive files by using bitmap
CN105550183A (en) Identifying method of identifying information in webpage and electronic device
CN110457603A (en) Customer relationship extraction method, device, electronic equipment and readable storage medium
CN109145307A (en) User's face sketch recognition method, method for pushing, device, equipment and storage medium
CN102902820B (en) Method and device for identifying database type
CN104991920A (en) Label generation method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171205
