CN107436890A - Method and device for detecting a website type - Google Patents
Method and device for detecting a website type
- Publication number: CN107436890A
- Application number: CN201610362232.0A
- Authority: CN (China)
- Prior art keywords: website, page, detected, level, type
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
Abstract
Embodiments of the present application provide a method and device for detecting a website type. The method includes: accessing at least two levels of pages of a website to be detected according to the address of the website; obtaining the web page code corresponding to the at least two levels of pages; extracting feature information from the web page code as basic feature information; obtaining, according to the basic feature information, the matching degree between the at least two levels of pages and a preset template as a first matching degree; and, if the first matching degree is greater than a preset threshold, determining that the website to be detected belongs to the website type corresponding to the preset template. The embodiments thus provide a way to detect the website type automatically, which reduces workload and improves detection efficiency. Moreover, because the embodiments no longer rely on the experience of inspectors, and detection is based on at least two levels of pages of the website, the accuracy of the detection result can be effectively improved.
Description
Technical field
The present application relates to the field of Internet technology, and in particular to a method and device for detecting a website type.
Background
With the development of Internet technology, the type of a website needs to be detected in many scenarios. For example, the security of a website can be judged by detecting its type. As another example, when a website is filed with the Ministry of Industry and Information Technology (MIIT), the type of the website needs to be detected to determine whether it is consistent with the type reported at filing.
At present, the type of a website is usually determined manually by an inspector according to the content the website displays. This approach involves a heavy workload and results in low detection efficiency.
Therefore, how to detect the website type automatically is a problem that urgently needs to be solved.
Summary of the invention
The technical problem addressed by the present application is to provide a method and device for detecting a website type, so that the website type can be detected automatically, thereby reducing workload and improving detection efficiency.
To this end, the technical solution of the present application is as follows:
The present application provides a method for detecting a website type, including:
accessing at least two levels of pages of a website to be detected according to the address of the website to be detected;
obtaining the web page code corresponding to the at least two levels of pages;
extracting feature information from the web page code as basic feature information;
obtaining, according to the basic feature information, the matching degree between the at least two levels of pages and a preset template as a first matching degree;
if the first matching degree is greater than a preset threshold, determining that the website to be detected belongs to the website type corresponding to the preset template.
Optionally, the detection method further includes:
if the first matching degree is less than the preset threshold, determining that the website to be detected does not belong to the website type, or determining whether the website to be detected belongs to the website type according to the basic feature information and additional feature information.
Optionally, determining whether the website to be detected belongs to the website type according to the basic feature information and the additional feature information includes:
accessing the next-level page of the at least two levels of pages;
obtaining the web page code corresponding to the next-level page;
extracting feature information from the web page code corresponding to the next-level page as the additional feature information;
obtaining, according to the basic feature information and the additional feature information, the matching degree between at least three levels of pages and the preset template as a second matching degree, where the at least three levels of pages include the at least two levels of pages and the next-level page of the at least two levels of pages;
determining whether the website to be detected belongs to the website type according to the result of comparing the second matching degree with the preset threshold.
Optionally, the preset template includes at least one module, and each module has corresponding matching feature information and a weight; obtaining the matching degree between the at least two levels of pages and the preset template according to the basic feature information includes:
determining, according to the basic feature information and the matching feature information corresponding to each module, the N modules that match the at least two levels of pages, N >= 0;
obtaining the matching degree between the at least two levels of pages and the preset template according to the weights corresponding to the N modules.
Optionally, the detection method further includes:
obtaining feature information corresponding to websites that belong to the website type as feedback feature information;
adjusting the preset template according to the feedback feature information.
Optionally, the at least two levels of pages include a first-level page and a second-level page, and accessing at least two levels of pages of the website to be detected according to the address of the website to be detected includes:
accessing the first-level page of the website to be detected according to the first-level page address of the website to be detected;
obtaining the web page code of the first-level page, and obtaining the address of the second-level page from the web page code of the first-level page;
accessing the second-level page according to the address of the second-level page.
Optionally, the basic feature information includes the identifier and/or the content of a page element.
Optionally, when the at least two levels of pages of the website to be detected are accessed, the detection method further includes:
simulating login to and/or simulating operation of the website to be detected.
Optionally, the website type is e-commerce, and the preset template is the template corresponding to e-commerce.
The present application also provides a device for detecting a website type, including:
an access unit, configured to access at least two levels of pages of a website to be detected according to the address of the website to be detected;
a first obtaining unit, configured to obtain the web page code corresponding to the at least two levels of pages;
an extraction unit, configured to extract feature information from the web page code as basic feature information;
a second obtaining unit, configured to obtain, according to the basic feature information, the matching degree between the at least two levels of pages and a preset template as a first matching degree;
a first determining unit, configured to determine, if the first matching degree is greater than a preset threshold, that the website to be detected belongs to the website type corresponding to the preset template.
Optionally, the device further includes a second determining unit or a third determining unit;
the second determining unit is configured to determine, if the first matching degree is less than the preset threshold, that the website to be detected does not belong to the website type; the third determining unit is configured to determine whether the website to be detected belongs to the website type according to the basic feature information and additional feature information.
Optionally, the third determining unit includes:
a first access subunit, configured to access the next-level page of the at least two levels of pages;
a first obtaining subunit, configured to obtain the web page code corresponding to the next-level page;
an extraction subunit, configured to extract feature information from the web page code corresponding to the next-level page as the additional feature information;
a second obtaining subunit, configured to obtain, according to the basic feature information and the additional feature information, the matching degree between at least three levels of pages and the preset template as a second matching degree, where the at least three levels of pages include the at least two levels of pages and the next-level page of the at least two levels of pages;
a first determining subunit, configured to determine whether the website to be detected belongs to the website type according to the result of comparing the second matching degree with the preset threshold.
Optionally, the preset template includes at least one module, and each module has corresponding matching feature information and a weight; the second obtaining unit includes:
a second determining subunit, configured to determine, according to the basic feature information and the matching feature information corresponding to each module, the N modules that match the at least two levels of pages, N >= 0;
a third obtaining subunit, configured to obtain the matching degree between the at least two levels of pages and the preset template according to the weights corresponding to the N modules.
Optionally, the device further includes:
a third obtaining unit, configured to obtain feature information corresponding to websites that belong to the website type as feedback feature information;
an adjustment unit, configured to adjust the preset template according to the feedback feature information.
Optionally, the at least two levels of pages include a first-level page and a second-level page; the access unit includes:
a second access subunit, configured to access the first-level page of the website to be detected according to the first-level page address of the website to be detected;
a fourth obtaining subunit, configured to obtain the web page code of the first-level page and to obtain the address of the second-level page from the web page code of the first-level page;
a third access subunit, configured to access the second-level page according to the address of the second-level page.
Optionally, the basic feature information includes the identifier and/or the content of a page element.
Optionally, the device further includes a simulated login unit and/or a simulated operation unit;
the simulated login unit is configured to simulate login to the website to be detected;
the simulated operation unit is configured to simulate operation of the website to be detected.
Optionally, the website type is e-commerce, and the preset template is the template corresponding to e-commerce.
As can be seen from the above technical solution, in the embodiments of the present application, at least two levels of pages of the website to be detected are accessed automatically, so the web page code corresponding to those pages can be obtained. From the feature information extracted from the web page code, the matching degree between the at least two levels of pages and the preset template can be obtained. Because the preset template corresponds to one website type, a matching degree greater than the preset threshold indicates that the website to be detected belongs to that website type. The embodiments thus provide a way to detect the website type automatically, which reduces workload and improves detection efficiency. Moreover, because the embodiments no longer rely on the experience of inspectors, and detection is based on at least two levels of pages of the website to be detected, the accuracy of the detection result can be effectively improved.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them.
Fig. 1 is a schematic flowchart of a method embodiment provided by the present application;
Fig. 2 is a schematic flowchart of another method embodiment provided by the present application;
Fig. 3 is a schematic diagram of the upper part of a first-level page provided by the present application;
Fig. 4 is a schematic diagram of the lower part of a first-level page provided by the present application;
Fig. 5 is a schematic diagram of a second-level page provided by the present application;
Fig. 6 is a schematic diagram of a third-level page provided by the present application;
Fig. 7 is a schematic structural diagram of a device embodiment provided by the present application.
Detailed description
The type of a website needs to be detected in many scenarios. For example, the website type must be reported when a website is filed with the Ministry of Industry and Information Technology (MIIT). However, many reported website types are inconsistent with the actual website types, so an inspector has to access the website by its address and determine the actual website type manually, from experience, according to the content displayed on the first-level page of the website. This approach involves a heavy workload and results in low detection efficiency. In addition, because the experience of inspectors is limited, the accuracy of the detection result is often low.
Embodiments of the present application provide a method and device for detecting a website type, so that the website type can be detected automatically, thereby reducing workload and improving detection efficiency.
To enable those skilled in the art to better understand the technical solutions in the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application, without creative effort, shall fall within the scope of protection of the present application.
Referring to Fig. 1, an embodiment of the present application provides a method embodiment of the method for detecting a website type. The method of this embodiment includes:
S101: Access at least two levels of pages of the website to be detected according to the address of the website to be detected.
In this embodiment, when the type of a website to be detected needs to be detected, the address of the website, such as its domain name, can be obtained, and the website can be accessed automatically through that address.
If the at least two levels of pages include a first-level page and a second-level page, the address is usually the first-level page address of the website to be detected, that is, the home page address. The first-level page can be accessed according to this address. By obtaining the web page code of the first-level page, the address of the second-level page can be obtained from that code, and the second-level page is then accessed according to its address. Similarly, the third-level page, the fourth-level page, and so on can be accessed in turn.
The at least two levels of pages are usually the first M levels of pages, where M >= 2 and the specific value of M can be preset. That is, in addition to the first-level page, this embodiment also accesses the second-level page and possibly deeper pages of the website to be detected, so that more comprehensive information about the website is obtained and the accuracy of the detection result is improved.
The inventors found through research that accessing the first three levels of pages of a website, that is, the first-level page, the second-level page, and the third-level page, is generally enough to determine the type of the website to be detected accurately.
S102: Obtain the web page code corresponding to the at least two levels of pages.
This step can be implemented with technologies such as web crawlers. The web page code may include static web page markup code and/or dynamic JavaScript code.
S103: Extract feature information from the web page code as basic feature information.
By analyzing the web page code, its feature information can be extracted. This feature information reflects basic attributes of the at least two levels of pages, such as display attributes.
The feature information may include the identifier and/or the content of a page element. For example, if the web page code assigns the value 'order 01' to the variable order, the variable serves as a page element whose identifier is "order" and whose content is "order 01".
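The identifier/content example above can be sketched in Python as follows. This is a minimal illustration, not the extraction method of the application: the regular expression and the function name `extract_features` are assumptions, and the pattern only handles simple quoted assignments (it also picks up attribute-style pairs such as `id="cart"`).

```python
import re

# Matches simple quoted assignments such as:  var order = 'order 01';
ASSIGNMENT = re.compile(
    r"""(?:\b(?:var|let|const)\s+)?   # optional declaration keyword
        ([A-Za-z_$][\w$]*)            # identifier of the page element
        \s*=\s*
        (['"])(.*?)\2                 # quoted content of the page element
    """,
    re.VERBOSE,
)


def extract_features(page_code):
    """Return a list of (identifier, content) pairs found in page_code."""
    return [(m.group(1), m.group(3)) for m in ASSIGNMENT.finditer(page_code)]


features = extract_features("<script>var order = 'order 01';</script>")
```

Here `features` holds a single pair whose identifier is `order` and whose content is `order 01`, mirroring the example in the text.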
S104: Obtain, according to the basic feature information, the matching degree between the at least two levels of pages and the preset template as a first matching degree.
The preset template corresponds to one website type and reflects the specific attributes that websites of that type have. For example, if the preset template corresponds to e-commerce, it reflects the attributes that e-commerce websites have, such as the fact that e-commerce websites generally have a commodity classification area, an e-commerce certification and filing area, a commodity details page, and so on. By comparing the basic feature information with the preset template, the matching degree between the at least two levels of pages and the preset template can be obtained.
S105: If the first matching degree is greater than the preset threshold, determine that the website to be detected belongs to the website type corresponding to the preset template.
If the first matching degree is greater than the preset threshold, the at least two levels of pages match the preset template closely, which indicates that the website to be detected belongs to the website type corresponding to the preset template. For example, if the preset template is the template corresponding to e-commerce and the first matching degree is greater than the preset threshold, it can be determined that the website to be detected is an e-commerce website.
The embodiments of the present application can be used in any electronic device with a detection function, such as mobile phones and other handheld devices, computers, and servers.
As can be seen from the above technical solution, in the embodiments of the present application, at least two levels of pages of the website to be detected are accessed automatically, so the web page code corresponding to those pages can be obtained. From the feature information extracted from the web page code, the matching degree between the at least two levels of pages and the preset template can be obtained. Because the preset template corresponds to one website type, a matching degree greater than the preset threshold indicates that the website to be detected belongs to that website type. The embodiments thus provide a way to detect the website type automatically, which reduces workload and improves detection efficiency. Moreover, because the embodiments no longer rely on the experience of inspectors, and detection is based on at least two levels of pages of the website to be detected, the accuracy of the detection result can be effectively improved.
In this embodiment, when the at least two levels of pages are accessed, login to and/or operation of the website to be detected can also be simulated. For example, automation technologies such as TestNG can be used to register an account automatically and simulate logging in to the website, or to simulate shopping behavior on a shopping page.
In this embodiment, if the matching degree between the at least two levels of pages and the preset template is less than the preset threshold, it can be determined that the website to be detected does not belong to the website type corresponding to the preset template. Alternatively, it can be considered that the current data is not sufficient to judge whether the website to be detected belongs to the website type, so the judgment can be made in combination with further data, for example the feature information corresponding to the next-level page. This is described in detail below.
The method may further include: if the first matching degree is less than the preset threshold, determining that the website to be detected does not belong to the website type, in which case the detection result can be presented to the user; or, if the first matching degree is less than the preset threshold, determining whether the website to be detected belongs to the website type according to the basic feature information and additional feature information.
The additional feature information can be obtained by accessing the next-level page. Specifically, determining whether the website to be detected belongs to the website type according to the basic feature information and the additional feature information may include: accessing the next-level page of the at least two levels of pages; obtaining the web page code corresponding to the next-level page; extracting feature information from the web page code corresponding to the next-level page as the additional feature information; obtaining, according to the basic feature information and the additional feature information, the matching degree between at least three levels of pages and the preset template as a second matching degree, where the at least three levels of pages include the at least two levels of pages and the next-level page of the at least two levels of pages; and determining whether the website to be detected belongs to the website type according to the result of comparing the second matching degree with the preset threshold.
For example, if the at least two levels of pages include the first-level page and the second-level page, the next-level page is the third-level page. The third-level page is accessed, feature information is extracted from its web page code, and the matching degree between these three levels of pages and the preset template is obtained from the feature information corresponding to the first-level, second-level, and third-level pages. Whether the website to be detected belongs to the website type is then determined according to this matching degree. The third-level page can be accessed in the same way that the second-level page is accessed from the web page code of the first-level page, which is not repeated here.
If the second matching degree is greater than the preset threshold, the website to be detected belongs to the website type. If the second matching degree is less than the preset threshold, either the website to be detected does not belong to the website type, or a further judgment can be made in combination with the next-level page, such as the fourth-level page in the above example. A maximum number of levels to detect can be set in this embodiment. For example, if the maximum number of levels is set to 6 and the matching degree determined from the feature information of the first six levels of pages is still below the preset threshold, it is finally determined that the website to be detected does not belong to the website type.
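The level-by-level fallback described above (add the next level's feature information whenever the matching degree does not exceed the threshold, up to a maximum number of levels) can be sketched as follows. The function name, the dictionary representation of per-level features, and the `match_degree` callable are hypothetical choices for illustration; treating a matching degree equal to the threshold as "not matched" is also an assumption, since the text only covers the greater-than and less-than cases.

```python
def detect_with_deepening(features_by_level, match_degree, threshold,
                          start_level=2, max_level=6):
    """Return (belongs_to_type, levels_used).

    features_by_level: dict mapping level number -> list of feature strings
                       (hypothetical representation of extracted features).
    match_degree:      callable taking the collected features and returning
                       the matching degree against the preset template.
    """
    # Basic feature information: features of the first `start_level` levels.
    collected = []
    for level in range(1, start_level + 1):
        collected.extend(features_by_level.get(level, []))

    level = start_level
    while True:
        if match_degree(collected) > threshold:
            return True, level
        # Stop at the maximum level, or when no deeper page is available.
        if level >= max_level or (level + 1) not in features_by_level:
            return False, level
        level += 1
        # Additional feature information from the next-level page.
        collected.extend(features_by_level[level])
```

The default `max_level=6` follows the example in the text; any other preset value works the same way.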
To avoid detecting pages that have already been detected, and to avoid endless link loops, this embodiment can also check, before accessing the next-level page, whether the obtained address of the next-level page has already been accessed. If not, the access continues; if so, the address of the next-level page is obtained again. For example, a first address is extracted from the web page code of the second-level page. If the first address is actually the address of the first-level page, it is judged to have been accessed already, and a second address is extracted from the web page code of the second-level page. If the second address is the address of the third-level page, it is judged not to have been accessed, and the second address is then accessed.
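The visited-address check can be sketched with a set of addresses that have already been accessed. The function name `pick_next_address` and the example addresses are hypothetical; real addresses would likely also need normalization (for example trailing slashes or fragments), which this sketch omits.

```python
def pick_next_address(candidate_addresses, visited):
    """Return the first candidate that has not been accessed yet, or None.

    `visited` is a set of already-accessed addresses; the chosen address is
    added to it so that later calls will not return it again.
    """
    for address in candidate_addresses:
        if address not in visited:
            visited.add(address)
            return address
    return None


# Addresses already accessed: the first-level and second-level pages.
visited = {"http://shop.example/", "http://shop.example/item/1"}
# Addresses extracted from the second-level page's code, in order.
candidates = ["http://shop.example/", "http://shop.example/cart"]
third_level = pick_next_address(candidates, visited)
```

In this example the first candidate is skipped because it is the first-level page, and the second candidate is selected as the third-level page.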
In S104 of this embodiment, the matching degree between the at least two levels of pages and the preset template is obtained according to the basic feature information. One specific way of obtaining the matching degree is given below. Note that this specific way does not limit the embodiments of the present application.
The preset template may include at least one module, and each module has corresponding matching feature information, which is used to determine whether the at least two levels of pages match the module. Determining, according to the basic feature information and the matching feature information corresponding to each module, the N modules that match the at least two levels of pages, N >= 0, means matching the basic feature information against the matching feature information of each module; if the match succeeds, the at least two levels of pages match that module. Each module also has a corresponding weight, and the matching degree between the at least two levels of pages and the preset template is obtained according to the weights corresponding to the N modules. The following takes a preset template corresponding to e-commerce as an example.
The preset template includes one or more of the following modules: a commodity classification module, an e-commerce certification and filing module, a commodity details module, a shopping cart module, an order module, and a logistics module. Each module has its own matching feature information and weight. For example, the matching feature information of the commodity classification module is "ICP license"; if the basic feature information matches "ICP license" successfully, the at least two levels of pages match the commodity classification module. By repeating this process, the N modules that match the at least two levels of pages can be determined, and the matching degree between the at least two levels of pages and the preset template can be calculated from the weights corresponding to the N modules, for example by adding up those weights.
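The module-and-weight scheme described above can be sketched in Python as follows. The class and function names are hypothetical, and the modules, matching features, and weights in `TEMPLATE` are illustrative assumptions rather than values taken from the application. A module is treated as matched when any of its matching features appears in the basic feature information, and the matching degree is the sum of the matched modules' weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    name: str
    matching_features: frozenset  # feature strings that indicate this module
    weight: float


def matched_modules(basic_features, template):
    """Return the N modules (N >= 0) whose matching features occur in the input."""
    features = set(basic_features)
    return [m for m in template if m.matching_features & features]


def matching_degree(basic_features, template):
    """Sum the weights of the matched modules."""
    return sum(m.weight for m in matched_modules(basic_features, template))


def belongs_to_type(basic_features, template, threshold):
    """True if the matching degree is greater than the preset threshold."""
    return matching_degree(basic_features, template) > threshold


# Illustrative template only: names, features and weights are assumptions.
TEMPLATE = [
    Module("commodity_classification", frozenset({"mobile phones", "toys"}), 0.2),
    Module("commodity_details", frozenset({"add to cart", "buy now"}), 0.2),
    Module("shopping_cart", frozenset({"checkout"}), 0.15),
]

degree = matching_degree(["mobile phones", "buy now"], TEMPLATE)
```

With these assumed values, the two matched modules contribute 0.2 each, so `degree` is approximately 0.4 (up to floating-point rounding).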
When the basic feature information is matched against the matching feature information, the basic feature information corresponding to a specific position can be matched against the matching feature information of the module corresponding to that position. For example, the basic feature information obtained from the upper part of the website is matched against the matching feature information of the commodity classification module, which corresponds to the upper part of the website; if the match succeeds, it is determined that the at least two levels of pages include the commodity classification module.
The preset template can be adjusted and updated in real time, for example through self-learning. Specifically, the detection method further includes: obtaining feature information corresponding to websites that belong to the website type as feedback feature information; and adjusting the preset template according to the feedback feature information. For example, modules can be added to or deleted from the preset template, or the weights of modules can be modified, according to the feedback feature information.
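The application does not specify how weights are modified from the feedback feature information. Purely as an assumed illustration, the sketch below moves each module's weight toward the fraction of known same-type websites in which that module was matched; the update rule, the `rate` parameter, and all names are hypothetical.

```python
def adjust_weights(weights, matching_features, feedback_sites, rate=0.1):
    """Return adjusted weights (assumed rule, not from the application).

    weights:           dict mapping module name -> current weight
    matching_features: dict mapping module name -> set of feature strings
    feedback_sites:    list of feature collections, one per website that is
                       known to belong to the website type
    rate:              how strongly the feedback pulls each weight
    """
    if not feedback_sites:
        return dict(weights)
    adjusted = {}
    for module, weight in weights.items():
        # Number of feedback websites in which this module was matched.
        hits = sum(1 for features in feedback_sites
                   if matching_features[module] & set(features))
        frequency = hits / len(feedback_sites)
        adjusted[module] = (1 - rate) * weight + rate * frequency
    return adjusted
```

Under this rule, a module that shows up in most same-type websites gradually gains weight, while one that rarely shows up loses weight; other update rules would be equally compatible with the text.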
The detection method of the embodiments of the present application is illustrated below through a specific embodiment.
Referring to Fig. 2, an embodiment of the present application provides another method embodiment of the method for detecting a website type. This embodiment takes e-commerce as the website type by way of example.
The method of this embodiment includes:
S201: Obtain the address of the website to be detected and access the first-level page of the website to be detected.
For example, entries whose industry is e-commerce can be identified in the filing system of the Ministry of Industry and Information Technology (MIIT), and the corresponding addresses in those entries can be obtained automatically. The addresses of multiple websites to be detected can also be provided in batches, in a format such as Excel.
S202: Obtain the web page code corresponding to the first-level page and extract the feature information in that web page code as feature information 01.
For example, for the first-level page shown in Fig. 3 and Fig. 4, feature information 01 extracted from the corresponding web page code may include: mother and baby toys, mobile phones and digital products, air conditioners and televisions, "ICP license", and so on.
S203: Obtain the address of the second-level page from the web page code of the first-level page, and access the second-level page according to that address.
S204: Obtain the web page code of the second-level page and extract the feature information in that web page code as feature information 02. Feature information 01 and feature information 02 together form the basic feature information.
For example, for the second-level page shown in Fig. 5, feature information 02 extracted from the corresponding web page code may include: deliver to, purchase quantity, add to cart, buy now, and so on.
S205:By characteristic information 01 and characteristic information 02, matched with the modules in default template, determine with
N1 module of preceding two-stage page matching.
In the present embodiment, default template can be as shown in table 1., wherein it is desired to explanation, table 1 is only that one kind is illustrated
It is bright, included modules in table 1, and matching characteristic information and weights corresponding to modules, can be according to actual feelings
Condition is adjusted.
Table 1
For example, feature information 01 is matched against the matching feature information of each module, which determines that the e-commerce certification record module and the product category module match the first-level page; feature information 02 is matched against the matching feature information of each module, which determines that the product details module matches the second-level page.
In the embodiments of the present application, when the basic feature information is matched against the matching feature information, exact matching or fuzzy matching can be used. Fuzzy matching includes synonym matching and the like, which prevents key data from being missed.
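Exact and fuzzy matching of extracted feature information against template modules might look like the sketch below. Table 1 is not reproduced here, so the module names, feature strings and synonym table are hypothetical placeholders, and substring containment is just one possible form of fuzzy matching.

```python
# Hypothetical synonym table: template term -> alternative wordings.
SYNONYMS = {"add to cart": {"add to shopping cart", "add to basket"}}

def matches(feature, pattern, synonyms=SYNONYMS):
    """Return True if an extracted feature matches a template pattern."""
    feature = feature.strip().lower()
    pattern = pattern.strip().lower()
    if feature == pattern:   # exact matching
        return True
    if pattern in feature:   # fuzzy matching: pattern contained in the feature
        return True
    return feature in synonyms.get(pattern, set())  # fuzzy: synonym matching

def matched_modules(features, template):
    """Return the names of the template modules matched by any feature."""
    return [
        name
        for name, module in template.items()
        if any(matches(f, p) for f in features for p in module["features"])
    ]
```

Keeping the synonym table separate from the template means new wordings can be added without touching module weights.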
S206: Obtain the matching degree between the first two levels of pages and the preset template according to the weights corresponding to the N1 modules.
For example, the weights corresponding to the three modules (the e-commerce certification record module, the product category module and the product details module) are added, giving a matching degree of 0.15 + 0.2 + 0.2 = 0.55, i.e. 55%.
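The weighted sum can be written out as a short sketch. The module keys below are hypothetical names, the assignment of each weight to a module follows the order in which the worked sums mention them (an assumption, since Table 1 is not reproduced), and the threshold value 0.6 is purely illustrative.

```python
from math import isclose

# Weights taken from the worked sums; key names are hypothetical.
WEIGHTS = {
    "certification_record": 0.15,
    "product_category": 0.20,
    "product_details": 0.20,
    "shopping_cart": 0.15,
}

def matching_degree(matched, weights=WEIGHTS):
    """Sum the weights of the matched modules, counting each module once."""
    return sum(weights[name] for name in set(matched))

def belongs_to_type(matched, threshold, weights=WEIGHTS):
    """Apply the rule that the matching degree must exceed the threshold."""
    return matching_degree(matched, weights) > threshold

two_levels = ["certification_record", "product_category", "product_details"]
print(isclose(matching_degree(two_levels), 0.55))  # True
print(belongs_to_type(two_levels, threshold=0.6))  # False with this threshold
```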
S207: Compare the matching degree with the preset threshold. If the matching degree is greater than the preset threshold, perform S208; if the matching degree is less than the preset threshold, perform S209.
S208: Determine that the website to be detected belongs to the e-commerce type.
S209: Obtain the address of the third-level page from the web page code of the second-level page, and access the third-level page according to that address.
In this embodiment, if it is detected that a login is required when accessing the third-level page, automation technology such as TestNG can be used to register an account automatically and simulate logging in to the website.
S210: Obtain the web page code of the third-level page, and extract the feature information in the web page code, as feature information 03. Feature information 03 serves as the additional feature information.
For example, for the third-level page shown in Fig. 6, feature information 03 extracted from the web page code corresponding to the third-level page may include: "unit price", "quantity", "checkout", etc.
S211: Match feature information 01, feature information 02 and feature information 03 against the modules in the preset template, and determine the N2 modules that match the first three levels of pages.
For example, feature information 03 is matched against the matching feature information of each module, which determines that the shopping cart module matches the third-level page. The N2 modules therefore include: the e-commerce certification record module, the product category module, the product details module and the shopping cart module.
S212: Obtain the matching degree between the first three levels of pages and the preset template according to the weights corresponding to the N2 modules.
For example, the weights corresponding to the four modules (the e-commerce certification record module, the product category module, the product details module and the shopping cart module) are added, giving a matching degree of 0.15 + 0.2 + 0.2 + 0.15 = 0.70, i.e. 70%.
S213: Compare the matching degree with the preset threshold. If the matching degree is greater than the preset threshold, perform S208; if the matching degree is less than the preset threshold, perform S214.
S214: Either determine that the website to be detected does not belong to the e-commerce type, or continue by accessing the fourth-level page for further judgment. A maximum number of detection levels can be set; for example, if the maximum is set to 6 and the matching degree determined from the feature information of the first six levels of pages is still below the preset threshold, it is finally determined that the website to be detected does not belong to the website type.
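The overall flow of S201 to S214 can be summarized as a level-by-level loop with a maximum depth. The sketch below is a simplified illustration, not the application's implementation: page fetching, feature extraction, link selection and scoring are passed in as functions, and all names (`detect_site_type`, `fetch`, `extract`, `next_address`, `degree`) are hypothetical.

```python
def detect_site_type(start_url, fetch, extract, next_address, degree,
                     threshold, min_levels=2, max_levels=6):
    """Return (belongs, levels_visited) for one website.

    fetch(url) -> page code; extract(code) -> set of features;
    next_address(code, url) -> address of the next-level page or None;
    degree(features) -> matching degree against the preset template.
    """
    features = set()
    url = start_url
    level = 0
    for level in range(1, max_levels + 1):
        code = fetch(url)
        features |= extract(code)  # accumulate features across levels
        # Judge only once at least min_levels pages have been examined.
        if level >= min_levels and degree(features) > threshold:
            return True, level
        url = next_address(code, url)
        if url is None:  # no deeper page to visit
            break
    return False, level
```

Injecting the four functions keeps the loop independent of any particular crawler or template format, and makes it possible to exercise the control flow with an in-memory stand-in for a website.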
Corresponding to the above method embodiments, the present application also provides device embodiments, which are described in detail below.
Referring to Fig. 7, the present application provides a device embodiment of a website type detection device. The detection device of this embodiment includes: an access unit 701, a first acquisition unit 702, an extraction unit 703, a second acquisition unit 704 and a first determining unit 705.
The access unit 701 is configured to access at least two levels of pages of the website to be detected according to the address of the website to be detected.
In the embodiments of the present application, when the website type of a website to be detected needs to be detected, the address of the website, such as its domain name, can be obtained, and the website to be detected can be accessed automatically through that address.
If the at least two levels of pages include a first-level page and a second-level page, the address is usually the first-level page address of the website to be detected, i.e. the home page address. The first-level page of the website can be accessed according to the first-level page address; by obtaining the web page code of the first-level page, the address of the second-level page can be obtained from that code, and the second-level page is then accessed according to its address. Similarly, the third-level page, the fourth-level page and so on can be accessed in turn.
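Obtaining next-level page addresses from a page's code could be sketched with the Python standard library as below. This is an assumption-laden illustration: it treats same-host `<a href>` links as candidate next-level pages, and the names `LinkCollector` and `next_level_addresses` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect absolute, same-host link addresses from page code."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)  # resolve relative addresses
        same_host = urlparse(absolute).netloc == urlparse(self.base_url).netloc
        if same_host and absolute not in self.links:
            self.links.append(absolute)

def next_level_addresses(page_code, page_url):
    """Return candidate next-level page addresses found in page_code."""
    collector = LinkCollector(page_url)
    collector.feed(page_code)
    return collector.links
```

Which of the returned addresses is treated as "the" second-level page is a policy decision that the text leaves open; a detector might visit the first one or sample several.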
The at least two levels of pages are usually the first M levels of pages, where M ≥ 2 and the specific value of M can be preset. In other words, in addition to the first-level page, the embodiments of the present application also access the second-level page of the website to be detected, and possibly deeper pages, so that more comprehensive information about the website is obtained and the accuracy of the detection result is improved.
The inventors found through research that accessing the first three levels of pages of a website, i.e. the first-level, second-level and third-level pages, is usually enough to determine the website type of the website to be detected accurately.
The first acquisition unit 702 is configured to obtain the web page code corresponding to the at least two levels of pages.
Specifically, the first acquisition unit 702 can obtain the web page code through technology such as a web crawler. The web page code can include static web page markup code and/or JavaScript dynamic script code.
The extraction unit 703 is configured to extract feature information from the web page code, as basic feature information.
The extraction unit 703 can extract the feature information by analyzing the web page code. The feature information reflects basic attributes of the at least two levels of pages, such as display attributes.
The feature information can include the identifier and/or the content of a page element. For example, if the web page code assigns the value 'order 01' to a variable order, that variable serves as a page element whose identifier is "order" and whose content is "order 01".
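The variable example above can be illustrated with a small regular-expression sketch that pulls identifier and content pairs out of script code. It only handles simple string assignments and is not a JavaScript parser; the pattern and the name `page_elements` are assumptions made for illustration.

```python
import re

# Matches simple assignments such as: var order = 'order 01';
ASSIGNMENT = re.compile(
    r"""\b(?:var|let|const)?\s*([A-Za-z_$][\w$]*)\s*=\s*(['"])(.*?)\2"""
)

def page_elements(script):
    """Map each assigned identifier to its string content."""
    return {m.group(1): m.group(3) for m in ASSIGNMENT.finditer(script)}

print(page_elements("var order = 'order 01';"))  # {'order': 'order 01'}
```

A production extractor would more likely work on the parsed DOM and script syntax tree, since a regular expression misses concatenated strings, template literals and values assigned at run time.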
The second acquisition unit 704 is configured to obtain, according to the basic feature information, the matching degree between the at least two levels of pages and a preset template, as a first matching degree.
The preset template corresponds to one website type and reflects the specific attributes that websites of that type have. For example, a preset template corresponding to the e-commerce type reflects the attributes of e-commerce websites, such as the fact that they generally have a product category area, an e-commerce certification record area, a product details page, and so on. The second acquisition unit 704 can compare the basic feature information with the preset template to obtain the matching degree between the at least two levels of pages and the preset template.
The first determining unit 705 is configured to determine, if the first matching degree is greater than a preset threshold, that the website to be detected belongs to the website type corresponding to the preset template.
If the first matching degree is greater than the preset threshold, the at least two levels of pages match the preset template closely, which indicates that the website to be detected belongs to the website type corresponding to the preset template. For example, if the preset template is the template corresponding to the e-commerce type and the first matching degree is greater than the preset threshold, the first determining unit 705 can determine that the website to be detected belongs to the e-commerce type.
The detection device of the embodiments of the present application can be used in any electronic device with a detection function, such as handheld devices like mobile phones, computers, servers, and so on.
As can be seen from the above technical solution, in the embodiments of the present application, at least two levels of pages of the website to be detected are accessed automatically, so the web page code corresponding to those pages can be obtained. From the feature information extracted from the web page code, the matching degree between the at least two levels of pages and the preset template can be obtained. Because the preset template corresponds to one website type, a matching degree greater than the preset threshold indicates that the website to be detected belongs to that website type. The embodiments of the present application thus provide a way of detecting the website type automatically, which reduces workload and improves detection efficiency. Moreover, because the embodiments no longer depend on the experience of inspectors, and detection is based on at least two levels of pages of the website to be detected, the accuracy of the detection result can be improved effectively.
Optionally, the detection device further includes a second determining unit or a third determining unit.
The second determining unit is configured to determine, if the first matching degree is less than the preset threshold, that the website to be detected does not belong to the website type. The third determining unit is configured to determine, according to the basic feature information and additional feature information, whether the website to be detected belongs to the website type.
Optionally, the third determining unit includes:
a first access subunit, configured to access the next-level page after the at least two levels of pages;
a first acquisition subunit, configured to obtain the web page code corresponding to the next-level page;
an extraction subunit, configured to extract feature information from the web page code corresponding to the next-level page, as the additional feature information;
a second acquisition subunit, configured to obtain, according to the basic feature information and the additional feature information, the matching degree between at least three levels of pages and the preset template, as a second matching degree, where the at least three levels of pages include the at least two levels of pages and the next-level page after them;
a first determination subunit, configured to determine, according to the result of comparing the second matching degree with the preset threshold, whether the website to be detected belongs to the website type.
Optionally, the preset template includes at least one module, and each module has corresponding matching feature information and a corresponding weight; the second acquisition unit includes:
a second determination subunit, configured to determine, according to the basic feature information and the matching feature information corresponding to each module, the N modules that match the at least two levels of pages, N ≥ 0;
a third acquisition subunit, configured to obtain, according to the weights corresponding to the N modules, the matching degree between the at least two levels of pages and the preset template.
Optionally, the detection device further includes:
a third acquisition unit, configured to obtain feature information corresponding to websites that belong to the website type, as feedback feature information;
an adjustment unit, configured to adjust the preset template according to the feedback feature information.
Optionally, the at least two levels of pages include a first-level page and a second-level page; the access unit includes:
a second access subunit, configured to access the first-level page of the website to be detected according to the first-level page address of the website to be detected;
a fourth acquisition subunit, configured to obtain the web page code of the first-level page and to obtain the address of the second-level page from that code;
a third access subunit, configured to access the second-level page according to the address of the second-level page.
Optionally, the detection device further includes a simulated login unit and/or a simulated operation unit.
The simulated login unit is configured to simulate logging in to the website to be detected; the simulated operation unit is configured to simulate operating the website to be detected.
Optionally, the website type is the e-commerce type, and the preset template is the template corresponding to the e-commerce type.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices and methods can be implemented in other ways. For example, the device embodiments described above are only schematic. The division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed can be indirect coupling or communication connection through some interfaces, devices or units, and can be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present application can be integrated into one processing unit, each unit can exist physically on its own, or two or more units can be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which can be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
As stated above, the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (18)
- 1. A method for detecting a website type, characterized by comprising: accessing at least two levels of pages of a website to be detected according to an address of the website to be detected; obtaining web page code corresponding to the at least two levels of pages; extracting feature information from the web page code, as basic feature information; obtaining, according to the basic feature information, a matching degree between the at least two levels of pages and a preset template, as a first matching degree; and, if the first matching degree is greater than a preset threshold, determining that the website to be detected belongs to the website type corresponding to the preset template.
- 2. The detection method according to claim 1, characterized in that the detection method further comprises: if the first matching degree is less than the preset threshold, determining that the website to be detected does not belong to the website type, or determining, according to the basic feature information and additional feature information, whether the website to be detected belongs to the website type.
- 3. The detection method according to claim 2, characterized in that determining, according to the basic feature information and the additional feature information, whether the website to be detected belongs to the website type comprises: accessing the next-level page after the at least two levels of pages; obtaining web page code corresponding to the next-level page; extracting feature information from the web page code corresponding to the next-level page, as the additional feature information; obtaining, according to the basic feature information and the additional feature information, a matching degree between at least three levels of pages and the preset template, as a second matching degree, the at least three levels of pages comprising the at least two levels of pages and the next-level page after the at least two levels of pages; and determining, according to a result of comparing the second matching degree with the preset threshold, whether the website to be detected belongs to the website type.
- 4. The detection method according to claim 1, characterized in that the preset template comprises at least one module, each module having corresponding matching feature information and a corresponding weight; and obtaining, according to the basic feature information, the matching degree between the at least two levels of pages and the preset template comprises: determining, according to the basic feature information and the matching feature information corresponding to each module, N modules that match the at least two levels of pages, N ≥ 0; and obtaining, according to the weights corresponding to the N modules, the matching degree between the at least two levels of pages and the preset template.
- 5. The detection method according to claim 1, characterized in that the detection method further comprises: obtaining feature information corresponding to websites that belong to the website type, as feedback feature information; and adjusting the preset template according to the feedback feature information.
- 6. The detection method according to claim 1, characterized in that the at least two levels of pages comprise a first-level page and a second-level page, and accessing the at least two levels of pages of the website to be detected according to the address of the website to be detected comprises: accessing the first-level page of the website to be detected according to a first-level page address of the website to be detected; obtaining web page code of the first-level page, and obtaining an address of the second-level page from the web page code of the first-level page; and accessing the second-level page according to the address of the second-level page.
- 7. The detection method according to claim 1, characterized in that the basic feature information comprises an identifier and/or content of a page element.
- 8. The detection method according to claim 1, characterized in that, when the at least two levels of pages of the website to be detected are accessed, the detection method further comprises: simulating logging in to and/or simulating operating the website to be detected.
- 9. The detection method according to any one of claims 1 to 8, characterized in that the website type is an e-commerce type, and the preset template is a template corresponding to the e-commerce type.
- 10. A device for detecting a website type, characterized by comprising: an access unit, configured to access at least two levels of pages of a website to be detected according to an address of the website to be detected; a first acquisition unit, configured to obtain web page code corresponding to the at least two levels of pages; an extraction unit, configured to extract feature information from the web page code, as basic feature information; a second acquisition unit, configured to obtain, according to the basic feature information, a matching degree between the at least two levels of pages and a preset template, as a first matching degree; and a first determining unit, configured to determine, if the first matching degree is greater than a preset threshold, that the website to be detected belongs to the website type corresponding to the preset template.
- 11. The detection device according to claim 10, characterized by further comprising a second determining unit or a third determining unit; the second determining unit being configured to determine, if the first matching degree is less than the preset threshold, that the website to be detected does not belong to the website type; and the third determining unit being configured to determine, according to the basic feature information and additional feature information, whether the website to be detected belongs to the website type.
- 12. The detection device according to claim 11, characterized in that the third determining unit comprises: a first access subunit, configured to access the next-level page after the at least two levels of pages; a first acquisition subunit, configured to obtain web page code corresponding to the next-level page; an extraction subunit, configured to extract feature information from the web page code corresponding to the next-level page, as the additional feature information; a second acquisition subunit, configured to obtain, according to the basic feature information and the additional feature information, a matching degree between at least three levels of pages and the preset template, as a second matching degree, the at least three levels of pages comprising the at least two levels of pages and the next-level page after the at least two levels of pages; and a first determination subunit, configured to determine, according to a result of comparing the second matching degree with the preset threshold, whether the website to be detected belongs to the website type.
- 13. The detection device according to claim 10, characterized in that the preset template comprises at least one module, each module having corresponding matching feature information and a corresponding weight; and the second acquisition unit comprises: a second determination subunit, configured to determine, according to the basic feature information and the matching feature information corresponding to each module, N modules that match the at least two levels of pages, N ≥ 0; and a third acquisition subunit, configured to obtain, according to the weights corresponding to the N modules, the matching degree between the at least two levels of pages and the preset template.
- 14. The detection device according to claim 10, characterized by further comprising: a third acquisition unit, configured to obtain feature information corresponding to websites that belong to the website type, as feedback feature information; and an adjustment unit, configured to adjust the preset template according to the feedback feature information.
- 15. The detection device according to claim 10, characterized in that the at least two levels of pages comprise a first-level page and a second-level page; and the access unit comprises: a second access subunit, configured to access the first-level page of the website to be detected according to a first-level page address of the website to be detected; a fourth acquisition subunit, configured to obtain web page code of the first-level page and to obtain an address of the second-level page from the web page code of the first-level page; and a third access subunit, configured to access the second-level page according to the address of the second-level page.
- 16. The detection device according to claim 10, characterized in that the basic feature information comprises an identifier and/or content of a page element.
- 17. The detection device according to claim 10, characterized by further comprising a simulated login unit and/or a simulated operation unit; the simulated login unit being configured to simulate logging in to the website to be detected; and the simulated operation unit being configured to simulate operating the website to be detected.
- 18. The detection device according to any one of claims 10 to 17, characterized in that the website type is an e-commerce type, and the preset template is a template corresponding to the e-commerce type.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610362232.0A CN107436890A (en) | 2016-05-26 | 2016-05-26 | A kind of detection method and device of the Type of website |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107436890A true CN107436890A (en) | 2017-12-05 |
Family
ID=60454521
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107436890A (en) | A kind of detection method and device of the Type of website | |
CN108416198B (en) | Device and method for establishing human-machine recognition model and computer readable storage medium | |
CN107807987B (en) | Character string classification method and system and character string classification equipment | |
CN108629043B (en) | Webpage target information extraction method, device and storage medium | |
CN101694668B (en) | Method and device for confirming web structure similarity | |
CN107168992A (en) | Article classification method and device, equipment and computer-readable storage medium based on artificial intelligence | |
CN109299258A (en) | A kind of public sentiment event detecting method, device and equipment | |
CN109062972A (en) | Web page classification method, device and computer readable storage medium | |
CN107491536B (en) | Test question checking method, test question checking device and electronic equipment | |
CN103235803B (en) | A kind of method and apparatus obtaining goods attribute value from text | |
CN109714356A (en) | A kind of recognition methods of abnormal domain name, device and electronic equipment | |
CN108053545A (en) | Certificate verification method and apparatus, server, storage medium | |
CN113961473A (en) | Data testing method and device, electronic equipment and computer readable storage medium | |
CN107895117A (en) | Malicious code mask method and device | |
CN108804918A (en) | Safety defence method, device, electronic equipment and storage medium | |
CN104346408A (en) | Method and equipment for labeling network user | |
CN106168968A (en) | A kind of Website classification method and device | |
CN108763961A (en) | A kind of private data stage division and device based on big data | |
CN108959289B (en) | Website category acquisition method and device | |
CN104572810A (en) | Method for carrying out operation processing on massive files by using bitmap | |
CN105550183A (en) | Identifying method of identifying information in webpage and electronic device | |
CN110457603A (en) | Customer relationship extraction method, device, electronic equipment and readable storage medium | |
CN109145307A (en) | User's face sketch recognition method, method for pushing, device, equipment and storage medium | |
CN102902820B (en) | Method and device for identifying database type | |
CN104991920A (en) | Label generation method and apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20171205 |