CN103617228A - Method and device for calculating relevant webpage URL pattern - Google Patents

Publication number
CN103617228A
Authority
CN
China
Prior art keywords
url
page
characteristic
pattern
page turning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310607851.8A
Other languages
Chinese (zh)
Inventor
王智广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201310607851.8A priority Critical patent/CN103617228A/en
Publication of CN103617228A publication Critical patent/CN103617228A/en
Priority to PCT/CN2014/086522 priority patent/WO2015074455A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

The invention discloses a method and device for calculating a relevant webpage URL pattern. The method comprises: judging whether a page-turning feature anchor exists in the page elements of a specified webpage; if so, extracting the relevant URL to which the page-turning feature anchor links; and calculating, from the URL of the specified webpage and the relevant URL to which the page-turning feature anchor links, the relevant webpage URL pattern corresponding to the specified webpage. Because relevant webpages are identified through the page-turning feature anchor, identification accuracy is high; because the relevant webpage URL pattern is calculated from the URL of the specified webpage and the relevant URL, calculation efficiency is high.

Description

Method and device for calculating an associated (relevant) webpage URL pattern
Technical field
The present invention relates to the technical field of data processing, and in particular to a method for calculating an associated webpage URL pattern and a device for calculating an associated webpage URL pattern.
Background
With the development of the Internet, more and more information is presented on the Internet as webpages for users to query, and querying data on the Internet through a search engine has become the most commonly used data search method.
When a search engine crawls webpages, it needs to apply different scheduling strategies to different kinds of webpages. Identifying the kind of a webpage is therefore basic work, and identifying page-turning webpages is one of its more critical parts. A page-turning webpage is a page of a paginated document: the previous page, the next page, or any existing page other than the current one. Page turning changes the content shown, as in a physical book or a mobile web form, so that different content can be viewed. When used on the Internet, this mechanism also presents user interface elements for browsing to the other pages.
The existing method for identifying page-turning webpages judges whether a webpage is an index page according to the keywords contained in its URL (Uniform Resource Locator). For example, when the URL contains a keyword such as page, pn or p and the keyword is followed by digits, the webpage corresponding to that URL is judged to be a page-turning webpage.
However, this method has a low recall rate. The paging URLs of many websites do not contain these keywords, for example "http://cq.ABC.com/lvshi/o12/", "http://bbs.BCA.com/t661_10" and "http://china.BCD.com/product/20110617/2647", yet these webpages are still page-turning webpages. Such identification methods therefore easily lead to misjudgment and have low practicality.
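The low recall of the keyword approach can be illustrated with a minimal Python sketch. The regular expression below is an assumption made for illustration (the text only says that a keyword such as page, pn or p is followed by digits), and the names `KEYWORD_RULE` and `looks_like_paging_url` are hypothetical.

```python
import re

# Hypothetical keyword rule: "page", "pn" or "p" at the start of a URL token,
# optionally followed by one of "=", "_", "/", "-", then digits.
KEYWORD_RULE = re.compile(r"(?:^|[/?&._-])(?:page|pn|p)[=_/-]?\d+", re.IGNORECASE)

def looks_like_paging_url(url: str) -> bool:
    """Return True when the keyword rule fires on the URL."""
    return KEYWORD_RULE.search(url) is not None

# The three paging URLs quoted in the background section.
urls = [
    "http://cq.ABC.com/lvshi/o12/",
    "http://bbs.BCA.com/t661_10",
    "http://china.BCD.com/product/20110617/2647",
]

# URLs the keyword rule fails to recognise as page-turning webpages.
missed = [u for u in urls if not looks_like_paging_url(u)]
```

Under this assumed rule, all three example URLs end up in `missed`, while a URL such as `http://example.com/list?page=3` is recognised.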
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method for calculating an associated webpage URL pattern, and a corresponding device, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for calculating an associated webpage URL pattern is provided, comprising:
Judging whether a page-turning feature anchor exists in the page elements of a specified webpage; if so, extracting the associated URL to which the page-turning feature anchor links;
Calculating, from the URL of the specified webpage and the associated URL to which the page-turning feature anchor links, the associated webpage URL pattern corresponding to the specified webpage.
Optionally, the step of judging whether a page-turning feature anchor exists in the page elements of the specified webpage comprises:
Matching page-turning feature anchors against the nodes of the DOM tree of the current webpage;
When the match succeeds, judging that the current webpage has a page-turning feature anchor.
Optionally, the page-turning feature anchor links to one or more associated URLs.
Optionally, the step of calculating the associated webpage URL pattern from the URL of the specified webpage and the associated URL further comprises:
Replacing the digit blocks in the URL of the specified webpage with a wildcard to obtain a first feature URL prefix, wherein a digit block is a single digit or a run of digits delimited by separator symbols;
Replacing the digit blocks in the associated URL with a wildcard to obtain a second feature URL prefix;
When the first feature URL prefix is identical to the second feature URL prefix, taking the first feature URL prefix or the second feature URL prefix as the associated webpage URL pattern.
Optionally, the step of replacing the digit blocks in the URL of the specified webpage with a wildcard to obtain the first feature URL prefix is:
Replacing the digit blocks at different positions in the URL of the specified webpage with the same wildcard to obtain the first feature URL prefix;
The step of replacing the digit blocks in the associated URL with a wildcard to obtain the second feature URL prefix is:
Replacing the digit blocks at different positions in the associated URL with the same wildcard to obtain the second feature URL prefix.
Optionally, the step of replacing the digit blocks in the URL of the specified webpage with a wildcard to obtain the first feature URL prefix is:
Replacing the digit blocks at different positions in the URL of the specified webpage with different wildcards respectively to obtain the first feature URL prefix;
The step of replacing the digit blocks in the associated URL with a wildcard to obtain the second feature URL prefix is:
Replacing the digit block at each position in the associated URL with the same wildcard that was used at that position for the first feature URL prefix, to obtain the second feature URL prefix.
Optionally, the method further comprises:
Performing structural analysis on the common part of the associated webpage URL patterns, extracting the page-turning block in the associated webpage URL pattern, and replacing the page-turning block with a first-page identifier to obtain the URL of the first associated webpage, wherein the page-turning block is the digit block that occupies the same position in multiple associated webpage URL patterns but holds different digits.
Optionally, the first-page identifier comprises 0, 1 and/or the largest value among the current associated webpages.
According to another aspect of the present invention, a device for calculating an associated webpage URL pattern is provided, comprising:
A page-turning feature anchor judging module, adapted to judge whether a page-turning feature anchor exists in the page elements of a specified webpage and, if so, to call the associated URL extraction module;
An associated URL extraction module, adapted to extract the associated URL to which the page-turning feature anchor links;
An associated webpage URL pattern calculation module, adapted to calculate, from the URL of the specified webpage and the associated URL to which the page-turning feature anchor links, the associated webpage URL pattern corresponding to the specified webpage.
Optionally, the page-turning feature anchor judging module is further adapted to:
Match page-turning feature anchors against the nodes of the DOM tree of the current webpage;
When the match succeeds, judge that the current webpage has a page-turning feature anchor.
Optionally, the page-turning feature anchor links to one or more associated URLs.
Optionally, the associated webpage URL pattern calculation module comprises:
A first feature URL prefix obtaining submodule, adapted to replace the digit blocks in the URL of the specified webpage with a wildcard to obtain a first feature URL prefix, wherein a digit block is a single digit or a run of digits delimited by separator symbols;
A second feature URL prefix obtaining submodule, adapted to replace the digit blocks in the associated URL with a wildcard to obtain a second feature URL prefix;
An associated webpage URL pattern obtaining module, adapted to take the first feature URL prefix or the second feature URL prefix as the associated webpage URL pattern when the first feature URL prefix is identical to the second feature URL prefix.
Optionally, the first feature URL prefix obtaining submodule is further adapted to:
Replace the digit blocks at different positions in the URL of the specified webpage with the same wildcard to obtain the first feature URL prefix;
The second feature URL prefix obtaining submodule is further adapted to:
Replace the digit blocks at different positions in the associated URL with the same wildcard to obtain the second feature URL prefix.
Optionally, the first feature URL prefix obtaining submodule is further adapted to:
Replace the digit blocks at different positions in the URL of the specified webpage with different wildcards respectively to obtain the first feature URL prefix;
The second feature URL prefix obtaining submodule is further adapted to:
Replace the digit block at each position in the associated URL with the same wildcard that was used at that position for the first feature URL prefix, to obtain the second feature URL prefix.
Optionally, the device further comprises:
A first associated webpage URL obtaining module, adapted to perform structural analysis on the common part of the associated webpage URL patterns, extract the page-turning block in the associated webpage URL pattern, and replace the page-turning block with a first-page identifier to obtain the URL of the first associated webpage, wherein the page-turning block is the digit block that occupies the same position in multiple associated webpage URL patterns but holds different digits.
Optionally, the first-page identifier comprises 0, 1 and/or the largest value among the current associated webpages.
The present invention identifies associated webpages through page-turning feature anchors, so identification accuracy is high; it calculates the associated webpage URL pattern from the URL of the specified webpage and the associated URL, so calculation efficiency is high.
The present invention replaces digit blocks with wildcards to obtain the first feature URL prefix and the second feature URL prefix and, when the two are identical, takes either one as the associated webpage URL pattern. Matching on the common part of the URLs further improves the identification accuracy of associated webpages and significantly improves the recall rate; in practical applications, more than 90% of associated webpages can be identified.
The present invention replaces the page-turning block of the associated webpage URL pattern with a first-page identifier to obtain the URL of the first associated webpage. In the same way, the page-turning block can be replaced with other link identifiers to obtain the URLs of other associated webpages. This increases the coverage of associated webpages, yields a more complete set of associated webpages, and in turn enables fine-grained operations.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the present invention become more apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flow chart of the steps of method embodiment 1 for calculating an associated webpage URL pattern according to an embodiment of the present invention;
Fig. 2 shows an example of a webpage structure according to an embodiment of the present invention;
Fig. 3 shows an example of a page-turning block according to an embodiment of the present invention;
Fig. 4 shows a flow chart of the steps of method embodiment 2 for calculating an associated webpage URL pattern according to an embodiment of the present invention;
Fig. 5 shows a structural block diagram of device embodiment 1 for calculating an associated webpage URL pattern according to an embodiment of the present invention; and
Fig. 6 shows a structural block diagram of device embodiment 2 for calculating an associated webpage URL pattern according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Referring to Fig. 1, a flow chart of the steps of method embodiment 1 for calculating an associated webpage URL pattern according to an embodiment of the present invention is shown, which may specifically comprise the following steps:
Step 101: judging whether a page-turning feature anchor exists in the page elements of a specified webpage; if so, performing step 102;
A webpage can be divided into multiple regions according to function. Taking a page of a forum (Bulletin Board System, BBS) as an example, as shown in Fig. 2, the page can be divided into a navigation block (1), junk blocks (2, 4), a page-turning block (3), a title block (5), an author information block (6), a publication date block (7) and a body text block (8). The navigation block may be located at the top of the page header or below the banner of the webpage, and points to the information columns of the site. A junk block is a region containing page elements with very low relevance to the topic of the webpage, for example function buttons such as "post" and "reply". The page-turning block is the region that indicates page turning. The title block is the region containing the title of the webpage topic (for example, "secure browser assemble black Thursday" as shown in Fig. 2). The author information block is the region that records the author information of the webpage topic. The body text block is the region that records the body text of the webpage topic.
Referring to Fig. 3, an example of a page-turning block according to an embodiment of the present invention is shown.
As shown in Fig. 3, a page-turning block mainly consists of page-turning feature anchors. A page-turning feature anchor is a page-turning feature string and can serve as a page element that identifies page turning.
In a specific implementation, the page-turning feature anchors may include one or more of the following:
[<<], [>>], [>], [<], [next page], [previous page], [previous], [next], [next], [last page], [endpage], [front page], [rear page], [< previous page], [< previous], [next >], [next page >], [1...].
Of course, the above page-turning feature anchors are only examples. When implementing the embodiment of the present invention, other page-turning feature anchors can be set according to actual conditions; the embodiment of the present invention places no limitation on this.
In a preferred embodiment of the present invention, step 101 may specifically comprise the following sub-steps:
Sub-step S11: matching page-turning feature anchors against the nodes of the DOM tree of the current webpage;
Sub-step S12: when the match succeeds, judging that the current webpage has a page-turning feature anchor.
DOM (Document Object Model) is the standard programming interface for processing extensible markup language. The DOM allows the content and structure of a document to be accessed and modified in a platform- and language-independent way, and is the common way of representing and processing an HTML or XML document.
The DOM is in fact a document model described in an object-oriented way. It defines the objects needed to represent and modify a document, the behaviors and attributes of these objects, and the relations between them. The DOM can be regarded as a tree representation of the data and structure of a page, although the page need not actually be implemented as such a tree.
With JavaScript, the whole HTML document can be restructured: items on the page can be added, removed, changed or rearranged.
To change something on the page, JavaScript needs an entry point for accessing all elements in the HTML document. This entry point, together with the methods and attributes for adding, moving, changing or removing HTML elements, is obtained through the Document Object Model (DOM).
An HTML document can be regarded as a tree structure, called the node tree (HTML DOM). Through the HTML DOM, all nodes in the tree can be accessed by JavaScript. All HTML elements (nodes) can be modified, and nodes can also be created or deleted.
The nodes in the node tree have hierarchical relations to one another, described with terms such as parent, child and sibling. A parent node has child nodes. Child nodes at the same level are called siblings. The top node of the node tree is called the root. Every node has a parent node except the root, which has none. A node can have any number of children, and siblings are nodes that share the same parent node.
Specifically, the webpage element to be operated on can be found in the node tree in several ways:
For example, by using the getElementById() and getElementsByTagName() methods.
As another example, by using the parentNode, firstChild and lastChild attributes of an element node.
The getElementById() and getElementsByTagName() methods can find any HTML element in the whole HTML document, regardless of the structure of the document. To find all <p> elements in the document, getElementsByTagName() finds them all, whatever level of the document they are at. Likewise, getElementById() returns the correct element wherever it is hidden in the document structure. Both methods provide any HTML element that is needed, whatever its position in the document.
In addition, getElementById() returns the webpage element with the specified ID.
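The node relations and lookups described above can be tried outside a browser. The following is a minimal sketch using Python's standard `xml.dom.minidom` module, which exposes the same `getElementsByTagName`, `parentNode`, `firstChild` and `lastChild` names. The markup string is a made-up, well-formed fragment; real-world HTML is often not well-formed XML and would need a more tolerant parser.

```python
from xml.dom.minidom import parseString

# Made-up, well-formed fragment (no whitespace between tags, so every child
# of <div> is an element node rather than a text node).
doc = parseString('<div><p>first</p><a href="forum-99-3.html">next</a><p>last</p></div>')

root = doc.documentElement                    # the <div> element
paragraphs = doc.getElementsByTagName("p")    # all <p> elements, at any depth
link = doc.getElementsByTagName("a")[0]       # the single <a> element

parent_tag = link.parentNode.tagName          # parent of the anchor
first_child_tag = root.firstChild.tagName     # first child of the <div>
last_child_tag = root.lastChild.tagName       # last child of the <div>
# Siblings of the anchor: nodes that share its parent.
sibling_tags = [n.tagName for n in link.parentNode.childNodes if n is not link]
```

`getElementById()` is left out of the sketch because, in `xml.dom.minidom`, it only resolves attributes that have been declared as IDs.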
In a specific implementation, it can be checked whether a hyperlink <a> (anchor) tag in the DOM tree of the HTML text of the webpage contains one or more of [<<], [>>], [>], [<], [next page], [previous page], [previous], [next], [next], [last page], [endpage], [front page], [rear page], [< previous page], [< previous], [next >], [next page >] and [1...]; if so, the current webpage is judged to have a page-turning feature anchor.
The <a> tag can link text or a picture at the current position to another page, text, image and so on.
The basic syntax structure of the <a> tag can be as follows:
<a
class=type
id=value
href=reference
name=value
rel=same|next|parent|previous
rev=value
target=window
style=value
title=title
onclick=function
onmouseout=function
onmouseover=function>displayed text or picture</a>
For example, the content of the <a> tags in the following HTML text is:
<div id="pgt" class="bm bw0 pgs cl">
<span id="fd_page_top">
<div class="pg">
<a href="forum-99-1.html" class="prev"></a>
<a href="forum-99-1.html">1</a><strong>2</strong>
<a href="forum-99-3.html">3</a>
<a href="forum-99-4.html">4</a>
<a href="forum-99-5.html">5</a>
<a href="forum-99-6.html">6</a>
<a href="forum-99-7.html">7</a>
<a href="forum-99-8.html">8</a>
<a href="forum-99-9.html">9</a>
<a href="forum-99-10.html">10</a>
<a href="forum-99-1000.html" class="last">...2107</a>
<label>
<input type="text" name="custompage" class="px" size="2" title="Enter the page number and press Enter to jump" value="2" onkeydown="if(event.keyCode==13) {window.location='forum.php?mod=forumdisplay&fid=99&page='+this.value; doane(event);}" />
<span title="1000 pages in total"> / 1000 pages</span>
</label>
<a href="forum-99-3.html" class="nxt">next page</a>
</div>
</span>
By matching the <a> tags in the HTML text, it can be judged that this webpage has one or more page-turning feature anchors.
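The matching of <a> tags can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: `FEATURE_TEXTS` is a small English subset of the feature anchors, matching on the CSS classes `prev`, `nxt` and `last` is an assumption borrowed from the sample markup rather than something the method prescribes, and `PagingAnchorFinder` is a hypothetical name.

```python
from html.parser import HTMLParser

# Assumed, illustrative subset of page-turning feature anchors.
FEATURE_TEXTS = {"<<", ">>", "<", ">", "next", "next page", "previous page", "last page"}
# Assumed class names, taken from the sample markup only.
FEATURE_CLASSES = {"prev", "nxt", "last"}

class PagingAnchorFinder(HTMLParser):
    """Collects the href of every <a> whose text or class marks page turning."""

    def __init__(self):
        super().__init__()
        self.matches = []       # hrefs of recognised page-turning anchors
        self._href = None       # href of the <a> currently being read
        self._classes = set()
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attributes = dict(attrs)
            self._href = attributes.get("href")
            self._classes = set((attributes.get("class") or "").split())
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip().lower()
            if text in FEATURE_TEXTS or self._classes & FEATURE_CLASSES:
                self.matches.append(self._href)
            self._href = None

html = ('<div class="pg"><a href="forum-99-1.html" class="prev"></a>'
        '<a href="forum-99-1.html">1</a><strong>2</strong>'
        '<a href="forum-99-3.html">3</a>'
        '<a href="forum-99-3.html" class="nxt">next page</a></div>')

finder = PagingAnchorFinder()
finder.feed(html)
finder.close()
has_paging_anchor = bool(finder.matches)
```

In this sketch the plain page-number links "1" and "3" are not treated as feature anchors; only the `prev` and `nxt` links are collected.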
Step 102: extracting the associated URL to which the page-turning feature anchor links;
In practical applications, the page-turning feature anchor may link to one or more associated URLs.
Specifically, after the one or more page-turning feature anchors have been identified, the one or more associated URLs that they link to are extracted; these associated URLs point to other page-turning webpages associated with the current webpage.
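Anchor hrefs are often relative, as in the sample markup, so an implementation would probably resolve them against the URL of the current webpage before comparing URLs. A minimal sketch with the standard `urllib.parse.urljoin` follows; the deduplication and the exclusion of the current page are assumptions added for illustration.

```python
from urllib.parse import urljoin

page_url = "http://bbs.XXX.com/forum-99-2.html"
# hrefs as they might be collected from the page-turning feature anchors
hrefs = ["forum-99-1.html", "forum-99-3.html", "forum-99-3.html"]

associated_urls = []
for href in hrefs:
    absolute = urljoin(page_url, href)   # resolve relative to the current page
    # Assumption: skip the current page itself and drop duplicates.
    if absolute != page_url and absolute not in associated_urls:
        associated_urls.append(absolute)
```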
Step 103: calculating, from the URL of the specified webpage and the associated URL to which the page-turning feature anchor links, the associated webpage URL pattern corresponding to the specified webpage.
An associated webpage URL pattern may be a set formed by grouping together URLs/webpages that are similar in appearance or function.
In a preferred embodiment of the present invention, step 103 may specifically comprise the following sub-steps:
Sub-step S21: replacing the digit blocks in the URL of the specified webpage with a wildcard to obtain a first feature URL prefix, wherein a digit block is a single digit or a run of digits delimited by separator symbols;
Sub-step S31: replacing the digit blocks in the associated URL with a wildcard to obtain a second feature URL prefix;
It should be noted that the wildcard can be any character; the embodiment of the present invention places no limitation on this. A separator symbol is a symbol used as a delimiter in a URL, for example "/", ".", "-", "?", ":" and so on. A digit block must consist only of consecutive digits between separator symbols; for example, "123ABC" is not a digit block.
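The digit-block definition can be expressed as a regular expression. The sketch below is one possible reading: a run of digits counts as a digit block only when it is bounded on both sides by a separator symbol or by the start or end of the string. The separators "/", ".", "-", "?" and ":" come from the text; treating "=", "&" and "_" as separators too is an assumption.

```python
import re

# Separators from the text, plus "=", "&" and "_" as an assumption.
SEPARATORS = r"/.\-?:=&_"

# Digits not preceded and not followed by a non-separator character.
DIGIT_BLOCK = re.compile(rf"(?<![^{SEPARATORS}])\d+(?![^{SEPARATORS}])")

def find_digit_blocks(text: str) -> list:
    """Return the digit blocks of a URL (or URL fragment) in order."""
    return DIGIT_BLOCK.findall(text)

blocks = find_digit_blocks("http://bbs.XXX.com/forum-99-2.html")
mixed = find_digit_blocks("123ABC")   # digits glued to letters
```

Here `blocks` holds "99" and "2", while `mixed` is empty because "123" is followed directly by letters.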
In a preferred example of the embodiment of the present invention, sub-step S21 may further comprise the following sub-step:
Sub-step S211: replacing the digit blocks at different positions in the URL of the specified webpage with the same wildcard to obtain the first feature URL prefix;
Corresponding to sub-step S211, sub-step S31 may further comprise the following sub-step:
Sub-step S311: replacing the digit blocks at different positions in the associated URL with the same wildcard to obtain the second feature URL prefix.
In a specific implementation, the URL of the specified webpage and the associated URL may each contain one or more digit blocks. To reduce the number of replacement operations and the system resources they occupy, the digit blocks can all be replaced with the same wildcard.
For example, suppose the URL of the specified webpage is http://bbs.XXX.com/forum-99-2.html and the associated URL is http://bbs.XXX.com/forum-99-3.html. In the URL of the specified webpage, "99" and "2" are identified as digit blocks. Using "(d+)" as an example wildcard, the first feature URL prefix is http://bbs.XXX.com/forum-(d+)-(d+).html, and the second feature URL prefix is likewise http://bbs.XXX.com/forum-(d+)-(d+).html.
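The same-wildcard replacement of this example can be sketched as follows. The function name `feature_prefix` is hypothetical, and the separator set in `DIGIT_BLOCK` includes "=", "&" and "_" as an assumption beyond the separators listed in the text.

```python
import re

# Digit runs bounded by separators or by the string boundaries.
DIGIT_BLOCK = re.compile(r"(?<![^/.\-?:=&_])\d+(?![^/.\-?:=&_])")

def feature_prefix(url: str, wildcard: str = "(d+)") -> str:
    """Replace every digit block in the URL with the same wildcard."""
    return DIGIT_BLOCK.sub(wildcard, url)

first = feature_prefix("http://bbs.XXX.com/forum-99-2.html")
second = feature_prefix("http://bbs.XXX.com/forum-99-3.html")
```

Both calls give `http://bbs.XXX.com/forum-(d+)-(d+).html`, matching the prefixes in the example.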
In an embodiment of the present invention, sub-step S21 may further comprise the following sub-step:
Sub-step S212: replacing the digit blocks at different positions in the URL of the specified webpage with different wildcards respectively to obtain the first feature URL prefix;
Corresponding to sub-step S212, step 103 may specifically comprise the following sub-step:
Sub-step S312: replacing the digit block at each position in the associated URL with the same wildcard that was used at that position for the first feature URL prefix, to obtain the second feature URL prefix.
In a specific implementation, the URL of the specified webpage and the associated URL may each contain one or more digit blocks. To make the subsequent comparison of the first and second feature URL prefixes, and the identification of individual digit blocks, more efficient, different wildcards can be used to replace the digit blocks.
For example, suppose the URL of the specified webpage is http://bbs.XXX.com/forum-99-2.html and the associated URL is http://bbs.XXX.com/forum-99-3.html. In the URL of the specified webpage, "99" and "2" are identified as digit blocks. Using "(d+)" and "(e+)" as example wildcards, the first feature URL prefix is http://bbs.XXX.com/forum-(d+)-(e+).html, and the second feature URL prefix is likewise http://bbs.XXX.com/forum-(d+)-(e+).html.
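The position-dependent variant can be sketched by generating one wildcard per digit-block position. Deriving the wildcards "(d+)", "(e+)", "(f+)" and so on from consecutive letters is an assumption made to reproduce the example; the text only requires that different positions use different wildcards and that both URLs use the same wildcard at the same position. The separator set again includes "=", "&" and "_" as an assumption.

```python
import re
from itertools import count

# Digit runs bounded by separators or by the string boundaries.
DIGIT_BLOCK = re.compile(r"(?<![^/.\-?:=&_])\d+(?![^/.\-?:=&_])")

def positional_feature_prefix(url: str) -> str:
    """Replace the n-th digit block with the n-th wildcard: (d+), (e+), ..."""
    position = count()

    def wildcard_for(match):
        # Assumed naming scheme: consecutive letters starting at "d".
        return f"({chr(ord('d') + next(position))}+)"

    return DIGIT_BLOCK.sub(wildcard_for, url)

first = positional_feature_prefix("http://bbs.XXX.com/forum-99-2.html")
second = positional_feature_prefix("http://bbs.XXX.com/forum-99-3.html")
```

Because each position keeps its own wildcard, a later step can refer to a particular digit block, for example the one marked `(e+)`, by name.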
Sub-step S41: when the first feature URL prefix is identical to the second feature URL prefix, taking the first feature URL prefix or the second feature URL prefix as the associated webpage URL pattern.
In practical applications, when the first feature URL prefix is identical to the second feature URL prefix, it can be judged that the specified webpage and the webpage corresponding to the associated URL are associated page-turning webpages.
Because the first feature URL prefix is identical to the second feature URL prefix, either can be used as the associated webpage URL pattern.
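Sub-steps S21, S31 and S41 can be put together in a short sketch. The function names are hypothetical; returning `None` when no associated URL shares the prefix is an assumption, since the text does not say what happens in that case.

```python
import re
from typing import Iterable, Optional

# Digit runs bounded by separators or by the string boundaries
# ("=", "&" and "_" are assumed separators beyond those in the text).
DIGIT_BLOCK = re.compile(r"(?<![^/.\-?:=&_])\d+(?![^/.\-?:=&_])")

def feature_prefix(url: str, wildcard: str = "(d+)") -> str:
    """Replace every digit block in the URL with the wildcard."""
    return DIGIT_BLOCK.sub(wildcard, url)

def associated_url_pattern(page_url: str, associated_urls: Iterable[str]) -> Optional[str]:
    """Return the shared feature URL prefix, or None if no associated URL matches."""
    first = feature_prefix(page_url)                 # sub-step S21
    for url in associated_urls:
        if feature_prefix(url) == first:             # sub-steps S31 and S41
            return first
    return None

pattern = associated_url_pattern(
    "http://bbs.XXX.com/forum-99-2.html",
    ["http://bbs.XXX.com/forum-99-3.html"],
)
unrelated = associated_url_pattern(
    "http://bbs.XXX.com/forum-99-2.html",
    ["http://bbs.XXX.com/thread-5.html"],
)
```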
The present invention identifies associated webpages through page-turning feature anchors, so identification accuracy is high; it calculates the associated webpage URL pattern from the URL of the specified webpage and the associated URL, so calculation efficiency is high.
The present invention replaces digit blocks with wildcards to obtain the first feature URL prefix and the second feature URL prefix and, when the two are identical, takes either one as the associated webpage URL pattern. Matching on the common part of the URLs further improves the identification accuracy of associated webpages and significantly improves the recall rate; in practical applications, more than 90% of associated webpages can be identified.
Referring to Fig. 4, a flow chart of the steps of method embodiment 2 for calculating an associated webpage URL pattern according to an embodiment of the present invention is shown, which may specifically comprise the following steps:
Step 401: judging whether a page-turning feature anchor exists in the page elements of a specified webpage; if so, performing step 402;
Step 402: extracting the associated URL to which the page-turning feature anchor links;
Step 403: calculating, from the URL of the specified webpage and the associated URL to which the page-turning feature anchor links, the associated webpage URL pattern corresponding to the specified webpage;
Step 404: performing structural analysis on the common part of the associated webpage URL patterns, extracting the page-turning block in the associated webpage URL pattern, and replacing the page-turning block with a first-page identifier to obtain the URL of the first associated webpage;
The page-turning block is the digit block that occupies the same position in multiple associated webpage URL patterns but holds different digits.
In practical applications, a URL may comprise one or more of the following components:
1. protocol: specifies the transport protocol used. The most common is the HTTP protocol, which is also the most widely used protocol on the World Wide Web today. Specifically, transport protocols include the file protocol (the resource is a file on the local computer; format file:///), the ftp protocol (the resource is accessed via FTP; format ftp://), the gopher protocol (the resource is accessed via the Gopher protocol), the http protocol (the resource is accessed via HTTP; format http://), the https protocol (the resource is accessed via secure HTTPS; format https://), and so on.
2. hostname: the Domain Name System (DNS) host name or IP address of the server that stores the resource. Sometimes the host name is preceded by the username and password required to connect to the server (format username:password).
3. port: optional. Each transport protocol has a default port number (for example, the default port of http is 80), and the default is used when the port is omitted from the URL. Sometimes, for security or other reasons, the port is redefined on the server so that a non-standard port number is used; in that case the port number cannot be omitted from the URL.
4. path: a string of zero or more segments separated by "/" symbols, generally used to represent a directory or file address on the host.
5. parameters: an option used to specify special parameters.
6. query: used to pass parameters to dynamic webpages (such as pages built with CGI, ISAPI, PHP/JSP/ASP/ASP.NET, and similar technologies). There may be several parameters, separated by "&" symbols; the name and value of each parameter are separated by an "=" symbol.
7. fragment: used to specify a segment within a network resource. For example, if a webpage contains several term definitions, a fragment can be used to jump directly to a particular definition.
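The seven components listed above map closely onto what Python's standard `urllib.parse.urlparse` returns. The following sketch is illustrative only; the helper name `describe_url`, the `DEFAULT_PORTS` table, and the sample URL are assumptions introduced here and do not appear in the original text.

```python
from urllib.parse import urlparse, parse_qs

# Well-known default ports, used when the URL omits the port.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def describe_url(url: str) -> dict:
    """Split a URL into the components listed above."""
    parts = urlparse(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "protocol": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "hostname": parts.hostname,
        "port": port,
        "path": parts.path,
        "parameters": parts.params,       # text after ";" in the last path segment
        "query": parse_qs(parts.query),   # "&"-separated name=value pairs
        "fragment": parts.fragment,
    }


if __name__ == "__main__":
    sample = "http://user:pw@bbs.example.com:8080/forum/list.php;type=a?fid=99&page=2#top"
    for name, value in describe_url(sample).items():
        print(f"{name}: {value}")
```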
In a specific implementation, structural analysis is performed on the common part of a plurality of associated webpage URL patterns to extract the page-turning block in the associated webpage URL pattern, and the page-turning block is then replaced with a first-page identifier to obtain the URL of the first-page associated webpage.
For example, for the associated webpage URL pattern http://bbs.XXX.com/forum-(d+)-(e+).html from the example above, (e+) is identified as the page-turning block; replacing the page-turning block with a first-page identifier then yields the URL of the first-page associated webpage, http://bbs.XXX.com/forum-99-1.html.
In a preferred example of the embodiment of the present invention, the first-page identifier may include 0, 1, and/or the maximum value among the current associated webpages.
In a specific implementation, the first-page associated webpage generally records important content, such as the text block shown in Fig. 3, so it carries relatively high importance, and knowing which page it is matters. Different websites may adopt different page-turning structures, which leads to different first-page associated webpages. For example, some websites use page 0 as the first-page associated webpage, some use page 1, and some use the maximum page (for example, 2100 as shown in Fig. 3).
Of course, the first-page associated webpage above is only an example. When implementing the embodiment of the present invention, the digit block may be replaced with the identifier of any associated webpage according to actual conditions to obtain the corresponding associated webpage; the embodiment of the present invention does not describe each case in detail.
The present invention replaces the page-turning block of the associated webpage URL pattern with a first-page identifier to obtain the URL of the first-page associated webpage. In the same way, the page-turning block may be replaced with other page identifiers to obtain the URLs of other associated webpages. This increases the coverage of associated webpages, makes it possible to obtain a more complete set of associated webpages, and in turn enables fine-grained operations.
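The replacement of the page-turning block described above can be sketched as follows. This is a hypothetical illustration: the function names, the zero-based `block_index` argument, and the idea of generating one candidate URL for each of the identifiers 0, 1, and the maximum page are assumptions for this example, and the text does not specify how a site's actual first page is chosen among them.

```python
import re
from typing import List


def replace_digit_block(url: str, block_index: int, new_value: int) -> str:
    """Replace the digit block at block_index (0-based) with new_value."""
    counter = -1

    def swap(match: "re.Match[str]") -> str:
        nonlocal counter
        counter += 1
        # Only the selected block changes; all other blocks are kept as-is.
        return str(new_value) if counter == block_index else match.group(0)

    return re.sub(r"\d+", swap, url)


def first_page_candidates(url: str, block_index: int, max_page: int) -> List[str]:
    """Build candidate first-page URLs using the identifiers 0, 1, and max_page."""
    return [replace_digit_block(url, block_index, value) for value in (0, 1, max_page)]


if __name__ == "__main__":
    current = "http://bbs.example.com/forum-99-7.html"
    for candidate in first_page_candidates(current, block_index=1, max_page=2100):
        print(candidate)
```

The same `replace_digit_block` helper could be called with any other page number to reach other associated webpages.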
For simplicity of description, the method embodiments are expressed as a series of combined actions. However, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. In addition, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Referring to Fig. 5, a structural block diagram of device embodiment 1 for calculating an associated webpage URL pattern according to one embodiment of the present invention is shown, which may specifically include the following modules:
A page-turning feature anchor judgment module 501, adapted to determine whether the page elements of a specified webpage contain a page-turning feature anchor, and if so, to call the associated URL extraction module 502;
An associated URL extraction module 502, adapted to extract the associated URL to which the page-turning feature anchor links;
An associated webpage URL pattern calculation module 503, adapted to calculate the associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page-turning feature anchor links.
In a preferred embodiment of the present invention, the page-turning feature anchor judgment module 501 may further be adapted to:
match the page-turning feature anchor against the DOM tree nodes of the current webpage;
when the match succeeds, determine that the current webpage has a page-turning feature anchor.
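One way to picture the anchor matching described above is the following Python sketch. It is only an approximation under stated assumptions: it walks the `<a>` elements with the standard-library `html.parser` instead of building a full DOM tree, and the anchor texts in `PAGE_TURNING_ANCHORS`, the class name, and the function name are hypothetical examples that are not taken from the original text.

```python
from html.parser import HTMLParser
from typing import Iterable, List, Optional
from urllib.parse import urljoin

# Hypothetical anchor texts; a real system would supply its own list.
PAGE_TURNING_ANCHORS = ("next page", "previous page", "下一页", "上一页")


class PageTurningAnchorFinder(HTMLParser):
    """Collect the href of every <a> element whose text is a page-turning anchor."""

    def __init__(self, anchor_texts: Iterable[str]) -> None:
        super().__init__()
        self.anchor_texts = {text.lower() for text in anchor_texts}
        self.links: List[str] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip().lower() in self.anchor_texts:
                self.links.append(self._href)
            self._href = None


def find_associated_urls(html: str, base_url: str,
                         anchor_texts: Iterable[str] = PAGE_TURNING_ANCHORS) -> List[str]:
    """Return absolute URLs linked from page-turning anchors; empty if none match."""
    finder = PageTurningAnchorFinder(anchor_texts)
    finder.feed(html)
    finder.close()
    return [urljoin(base_url, href) for href in finder.links]
```

An empty result corresponds to the case where the current webpage has no page-turning feature anchor; a non-empty result supplies the associated URLs for the later pattern calculation.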
In a preferred embodiment of the present invention, the page-turning feature anchor may link to one or more associated URLs.
In a preferred embodiment of the present invention, the associated webpage URL pattern calculation module 503 may specifically include the following submodules:
A first characteristic URL prefix obtaining submodule, adapted to replace the digit blocks in the URL of the specified webpage with wildcard characters to obtain a first characteristic URL prefix, wherein a digit block is a single digit or a run of digits delimited by separator marks;
A second characteristic URL prefix obtaining submodule, adapted to replace the digit blocks in the associated URL with wildcard characters to obtain a second characteristic URL prefix;
An associated webpage URL pattern obtaining submodule, adapted to take the first characteristic URL prefix or the second characteristic URL prefix as the associated webpage URL pattern when the first characteristic URL prefix is identical to the second characteristic URL prefix.
In a preferred embodiment of the present invention, the first characteristic URL prefix obtaining submodule may further be adapted to:
replace the digit blocks at different positions in the URL of the specified webpage with the same wildcard character to obtain the first characteristic URL prefix;
and the second characteristic URL prefix obtaining submodule may further be adapted to:
replace the digit blocks at different positions in the associated URL with the same wildcard character to obtain the second characteristic URL prefix.
In a preferred embodiment of the present invention, the first characteristic URL prefix obtaining submodule may further be adapted to:
replace the digit blocks at different positions in the URL of the specified webpage with different wildcard characters, respectively, to obtain the first characteristic URL prefix;
and the second characteristic URL prefix obtaining submodule may further be adapted to:
replace the digit block at each position in the associated URL with the same wildcard character used at that position in the first characteristic URL prefix to obtain the second characteristic URL prefix.
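The second variant above, in which each position receives its own wildcard character, might look like the following Python sketch. The naming scheme `(d+)`, `(e+)`, `(f+)` is inferred from the pattern example given elsewhere in this description; the function name and the choice to start at the letter "d" are assumptions made for illustration.

```python
import re
from string import ascii_lowercase


def prefix_with_distinct_wildcards(url: str, first_letter: str = "d") -> str:
    """Replace the n-th digit block with a wildcard named by the n-th letter.

    Starting from "d", the blocks become (d+), (e+), (f+), and so on.
    The sketch assumes the URL has fewer digit blocks than remaining letters.
    """
    letters = ascii_lowercase[ascii_lowercase.index(first_letter):]
    position = -1

    def swap(_match: "re.Match[str]") -> str:
        nonlocal position
        position += 1
        return f"({letters[position]}+)"

    return re.sub(r"\d+", swap, url)


if __name__ == "__main__":
    first = prefix_with_distinct_wildcards("http://bbs.example.com/forum-99-1.html")
    second = prefix_with_distinct_wildcards("http://bbs.example.com/forum-99-2.html")
    # Both URLs use the same wildcard at the same position, so the prefixes match.
    print(first, first == second)
```

Keeping a distinct wildcard per position preserves which digit block is which, which is convenient when a later step needs to single out the page-turning block.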
Because the device embodiment of Fig. 5 is substantially similar to the method embodiment of Fig. 1, its description is relatively brief; for related details, refer to the corresponding description of the method embodiment.
Referring to Fig. 6, a structural block diagram of device embodiment 2 for calculating an associated webpage URL pattern according to one embodiment of the present invention is shown, which may specifically include the following modules:
A page-turning feature anchor judgment module 601, adapted to determine whether the page elements of a specified webpage contain a page-turning feature anchor, and if so, to call the associated URL extraction module 602;
An associated URL extraction module 602, adapted to extract the associated URL to which the page-turning feature anchor links;
An associated webpage URL pattern calculation module 603, adapted to calculate the associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page-turning feature anchor links;
A first-page associated webpage URL obtaining module 604, adapted to perform structural analysis on the common part of the associated webpage URL patterns, extract the page-turning block in the associated webpage URL pattern, and replace the page-turning block with a first-page identifier to obtain the URL of the first-page associated webpage; wherein the page-turning block is a digit block that occupies the same position in a plurality of associated webpage URL patterns but holds different numbers.
In a preferred example of the embodiment of the present invention, the first-page identifier may include 0, 1, and/or the maximum value among the current associated webpages.
Because the device embodiment of Fig. 6 is substantially similar to the method embodiment of Fig. 4, its description is relatively brief; for related details, refer to the corresponding description of the method embodiment.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the present invention described herein may be implemented in a variety of programming languages, and that the description of a specific language above is given to disclose the best mode of the present invention.
Numerous specific details are described in the specification provided herein. It is understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the present invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may additionally be divided into a plurality of submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to fall within the scope of the present invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the device for calculating an associated webpage URL pattern according to the embodiment of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A method for calculating an associated webpage URL pattern, comprising:
determining whether the page elements of a specified webpage contain a page-turning feature anchor; if so, extracting the associated URL to which the page-turning feature anchor links;
calculating the associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page-turning feature anchor links.
2. The method of claim 1, characterized in that the step of determining whether the page elements of the specified webpage contain a page-turning feature anchor comprises:
matching the page-turning feature anchor against the DOM tree nodes of the current webpage;
when the match succeeds, determining that the current webpage has a page-turning feature anchor.
3. The method of claim 1, characterized in that the page-turning feature anchor links to one or more associated URLs.
4. The method of claim 1, 2, or 3, characterized in that the step of calculating the associated webpage URL pattern according to the URL of the specified webpage and the associated URL further comprises:
replacing the digit blocks in the URL of the specified webpage with wildcard characters to obtain a first characteristic URL prefix, wherein a digit block is a single digit or a run of digits delimited by separator marks;
replacing the digit blocks in the associated URL with wildcard characters to obtain a second characteristic URL prefix;
when the first characteristic URL prefix is identical to the second characteristic URL prefix, taking the first characteristic URL prefix or the second characteristic URL prefix as the associated webpage URL pattern.
5. The method of claim 4, characterized in that the step of replacing the digit blocks in the URL of the specified webpage with wildcard characters to obtain the first characteristic URL prefix is:
replacing the digit blocks at different positions in the URL of the specified webpage with the same wildcard character to obtain the first characteristic URL prefix;
and the step of replacing the digit blocks in the associated URL with wildcard characters to obtain the second characteristic URL prefix is:
replacing the digit blocks at different positions in the associated URL with the same wildcard character to obtain the second characteristic URL prefix.
6. The method of claim 5, characterized in that the step of replacing the digit blocks in the URL of the specified webpage with wildcard characters to obtain the first characteristic URL prefix is:
replacing the digit blocks at different positions in the URL of the specified webpage with different wildcard characters, respectively, to obtain the first characteristic URL prefix;
and the step of replacing the digit blocks in the associated URL with wildcard characters to obtain the second characteristic URL prefix is:
replacing the digit block at each position in the associated URL with the same wildcard character used at that position in the first characteristic URL prefix to obtain the second characteristic URL prefix.
7. The method of claim 1, 2, 3, 5, or 6, characterized by further comprising:
performing structural analysis on the common part of the associated webpage URL patterns, extracting the page-turning block in the associated webpage URL pattern, and replacing the page-turning block with a first-page identifier to obtain the URL of the first-page associated webpage; wherein the page-turning block is a digit block that occupies the same position in a plurality of associated webpage URL patterns but holds different numbers.
8. The method of claim 7, characterized in that the first-page identifier comprises 0, 1, and/or the maximum value among the current associated webpages.
9. A device for calculating an associated webpage URL pattern, comprising:
a page-turning feature anchor judgment module, adapted to determine whether the page elements of a specified webpage contain a page-turning feature anchor, and if so, to call an associated URL extraction module;
the associated URL extraction module, adapted to extract the associated URL to which the page-turning feature anchor links;
an associated webpage URL pattern calculation module, adapted to calculate the associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page-turning feature anchor links.
10. The device of claim 9, characterized in that the page-turning feature anchor judgment module is further adapted to:
match the page-turning feature anchor against the DOM tree nodes of the current webpage;
when the match succeeds, determine that the current webpage has a page-turning feature anchor.
CN201310607851.8A 2013-11-25 2013-11-25 Method and device for calculating relevant webpage URL pattern Pending CN103617228A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201310607851.8A CN103617228A (en) 2013-11-25 2013-11-25 Method and device for calculating relevant webpage URL pattern
PCT/CN2014/086522 WO2015074455A1 (en) 2013-11-25 2014-09-15 Method and apparatus for computing url pattern of associated webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310607851.8A CN103617228A (en) 2013-11-25 2013-11-25 Method and device for calculating relevant webpage URL pattern

Publications (1)

Publication Number Publication Date
CN103617228A true CN103617228A (en) 2014-03-05

Family

ID=50167931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310607851.8A Pending CN103617228A (en) 2013-11-25 2013-11-25 Method and device for calculating relevant webpage URL pattern

Country Status (1)

Country Link
CN (1) CN103617228A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015074455A1 (en) * 2013-11-25 2015-05-28 北京奇虎科技有限公司 Method and apparatus for computing url pattern of associated webpage

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102123168A (en) * 2011-01-14 2011-07-13 广州市动景计算机科技有限公司 Web page pre-reading and integration method and system based on relay server
CN103150358A (en) * 2013-02-27 2013-06-12 三星半导体(中国)研究开发有限公司 Device and method capable of performing continuous web browsing in mobile equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20140305