WO2015074455A1 - Method and apparatus for computing URL pattern of associated webpage

Publication number
WO2015074455A1
WO2015074455A1 PCT/CN2014/086522 CN2014086522W WO2015074455A1 WO 2015074455 A1 WO2015074455 A1 WO 2015074455A1 CN 2014086522 W CN2014086522 W CN 2014086522W WO 2015074455 A1 WO2015074455 A1 WO 2015074455A1
Authority
WO
WIPO (PCT)
Prior art keywords
url
webpage
feature
page
pattern
Prior art date
Application number
PCT/CN2014/086522
Other languages
French (fr)
Chinese (zh)
Inventor
王智广
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201310607851.8A (CN103617228A)
Priority claimed from CN201310603918.0A (CN103617225B)
Priority claimed from CN201310607854.1A (CN103617229A)
Priority claimed from CN201310606990.9A (CN103631906A)
Application filed by 北京奇虎科技有限公司, 奇智软件(北京)有限公司
Publication of WO2015074455A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566: URL specific, e.g. using aliases, detecting broken or misspelled links

Definitions

  • The present invention relates to the field of data processing technologies, and in particular to a method and an apparatus for calculating an associated webpage URL pattern.
  • Search engines need to adopt different scheduling strategies for different types of webpages.
  • Identifying webpage types is therefore basic work.
  • Among these types, identifying page turning pages is a relatively important task.
  • A page turning page lets the reader view the previous page, the next page, or any other non-current page of a paginated document. Turning pages changes what is displayed, in a physical book or in a mobile or web form, so that different content can be viewed.
  • When used on the Internet, this mechanism also presents user interface elements for browsing to other pages.
  • The existing method identifies a page turning page (index page) by keywords in the URL (Uniform Resource Locator) of the webpage. For example, when the URL includes a keyword such as page, pn, or p, followed by a number, the webpage corresponding to the URL is determined to be a page turning page.
  • The present invention has been made in order to provide a method for calculating an associated webpage URL pattern, and a corresponding apparatus, that overcome the above problems or at least partially solve them.
  • A method for calculating an associated webpage URL pattern, including:
  • A method for identifying a page number identifier in a webpage URL, including:
  • A method for establishing an associated webpage database, including:
  • The associated webpage database is established by using the associated webpages corresponding to the associated webpage URL pattern.
  • An associated webpage search method, including:
  • Receiving a search request, the request including a search keyword;
  • Determining whether the webpage is an associated webpage; if yes, returning the webpage and the homepage information associated with the webpage.
  • An apparatus for calculating an associated webpage URL pattern, including:
  • A page turning feature anchor determining module, adapted to determine whether the page elements of the specified webpage have a page turning feature anchor and, if yes, to call the associated URL extracting module;
  • An associated URL extracting module, configured to extract the associated URL to which the page turning feature anchor is linked;
  • An associated webpage URL pattern calculation module, adapted to calculate the associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked.
  • A computer program comprising computer-readable code that, when run on a computing device, causes the computing device to perform the method of calculating an associated webpage URL pattern according to any of claims 1-8.
  • A computer-readable medium storing the computer program according to claim 23 is provided.
  • The invention uses the page turning feature anchor to identify associated webpages, so the recognition accuracy is high.
  • The associated webpage URL pattern is calculated from the URL of the specified webpage and the associated URL, so the calculation efficiency is high.
  • The invention replaces digital blocks with wildcard characters to obtain a first feature URL prefix and a second feature URL prefix. When the first feature URL prefix is the same as the second feature URL prefix, the first or second feature URL prefix is used as the associated webpage URL pattern.
  • By matching on the common part of the URLs, the invention further improves the recognition accuracy of associated webpages and greatly improves the recall rate; more than 90% of associated webpages can be identified in practical applications.
  • The invention replaces the page turning block of the associated webpage URL pattern with the homepage identifier to obtain the URL of the homepage associated webpage.
  • The page turning block can also be replaced with the identifiers of other associated webpages to obtain their URLs. This increases the coverage of associated webpages, yields a more comprehensive set of associated webpages, and enables fine-grained operations.
  • The invention extracts the associated webpage URL pattern from the currently captured webpage and establishes the associated webpage database from the associated webpages corresponding to that pattern, thereby avoiding repeated crawling of webpages, reducing the occupation of system resources, and greatly improving the efficiency of building the database.
  • When a webpage matching the keyword is determined to be an associated webpage, the invention returns the webpage together with the homepage information associated with it. The user does not need to repeat the search or look for the homepage, which further reduces system operations and resource usage and improves search efficiency.
  • FIG. 1 is a flow chart showing the steps of Embodiment 1 of a method for calculating an associated webpage URL pattern according to an embodiment of the present invention;
  • FIG. 2 schematically shows an example of a webpage structure according to an embodiment of the present invention;
  • FIG. 3 schematically shows an example of a page turning block according to an embodiment of the present invention;
  • FIG. 4 is a flow chart showing the steps of Embodiment 2 of a method for calculating an associated webpage URL pattern according to an embodiment of the present invention;
  • FIG. 5 is a flow chart showing the steps of an embodiment of a method for identifying a page number identifier in a webpage URL according to an embodiment of the present invention;
  • FIG. 6 is a flow chart showing the steps of an embodiment of a method for establishing an associated webpage database according to an embodiment of the present invention;
  • FIG. 7 is a flow chart showing the steps of an embodiment of an associated webpage search method according to an embodiment of the present invention;
  • FIG. 8 is a structural block diagram of Embodiment 1 of an apparatus for calculating an associated webpage URL pattern according to an embodiment of the present invention;
  • FIG. 9 is a structural block diagram of Embodiment 2 of an apparatus for calculating an associated webpage URL pattern according to an embodiment of the present invention;
  • FIG. 10 schematically shows a block diagram of a computing device for performing the method according to the invention;
  • FIG. 11 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.
  • Referring to FIG. 1, a flow chart of the steps of Embodiment 1 of the method for calculating an associated webpage URL pattern is shown; the method may include the following steps.
  • Step 101: determine whether the page elements of the specified webpage have a page turning feature anchor; if yes, perform step 102;
  • A webpage can be divided into multiple areas according to function. Take a page of a forum (BBS) as an example. As shown in FIG. 2, the page can be divided into a navigation block (1), garbage blocks (2, 4), a page turning block (3), a title block (5), an author information block (6), a publication date block (7), and a text block (8). The navigation block can be located at the top of the webpage header or below the banner of the webpage, and is used to point to the information sections of the webpage.
  • A garbage block can be an area containing page elements that have low relevance to the topic of the webpage, such as "post", "reply", and the like.
  • The page turning block can be an area that indicates page turning.
  • The title block can be the area in which the title of the webpage (such as "Secure Browser Gather Black Thursday" shown in FIG. 2) is located.
  • The author information block is the area that records the author information of the webpage.
  • The text block is the area in which the body text on the subject of the webpage is recorded.
  • Referring to FIG. 3, an example of a page turning block according to one embodiment of the present invention is shown.
  • The page turning block may consist mainly of page turning feature anchors. A page turning feature anchor is a page turning feature string, which may be a page element that identifies page turning.
  • The page turning feature anchor may include one or more of the following:
  • The above page turning feature anchors are only examples.
  • Other page turning feature anchors may be set according to actual conditions, which is not limited by the embodiment of the present invention.
  • Step 101 may specifically include the following sub-steps:
  • Sub-step S11: use a page turning feature anchor to perform matching in the DOM tree nodes of the current webpage;
  • Sub-step S12: when the matching is successful, determine that the current webpage has a page turning feature anchor.
  • the DOM (Document Object Model) is a standard programming interface for handling extensible markup languages.
  • The DOM can access and modify the content and structure of a document in a platform- and language-independent manner; it is a common way of representing and processing an HTML (Hypertext Markup Language) or XML (eXtensible Markup Language) document.
  • the DOM is actually a document model that is described in an object-oriented manner.
  • the DOM defines the objects needed to represent and modify documents, the behavior and properties of those objects, and the relationships between these objects.
  • the DOM can be thought of as a tree representation of the data and structure on the page, but of course the page may not be implemented in this way.
  • An HTML document can be restructured via JavaScript: items on the page can be added, removed, changed, or rearranged.
  • HTML documents can be thought of as a tree structure, and this structure is called a node tree (HTML DOM). With the HTML DOM, all nodes in the tree are accessible via JavaScript. All HTML elements (nodes) can be modified, and nodes can be created or deleted.
  • HTML DOM node tree
  • the nodes in the node tree have a hierarchical relationship with each other. Terms such as parent, child, and sibling can be used to describe these relationships. Among them, the parent node has child nodes. The child nodes of the same level are called siblings (brothers or sisters). In the node tree, the top node is called the root. Each node has a parent node, except for the root (it has no parent). A node can have any number of children, and a sibling is a node that has the same parent.
  • getElementById() and getElementsByTagName() can find any HTML element in the entire HTML document. Both methods ignore the structure of the document. If you look up all the <p> elements in the document, getElementsByTagName() will find them all, no matter which level of the document a <p> element is in. Likewise, getElementById() returns the correct element no matter where it is hidden in the document structure. These two methods provide whatever HTML elements are needed, regardless of where they are in the document.
  • getElementById() returns the page element with the specified ID.
  • The hyperlinks <a> (anchors) in the DOM tree of the webpage's HTML text may be examined to identify whether they include one or more of the following: [ ⁇ ], [>>], [ ⁇ ⁇ ], [> >], [ "], ["], [>], [ ⁇ ], [Previous], [Previous], [Next], [next], [Last], [Last], [Previous Page], [Next Page], [ ⁇ Previous Page], [ ⁇ Previous], [Next], [Next Page], [1...]. If yes, it is determined that the current webpage has a page turning feature anchor.
  • <a> can be used to link the text or picture at the current position to other pages, text, or images.
  • The basic syntax structure of the <a> tag can be as follows:
  • The content of the <a> tag in the following HTML text is:
  • Step 102: extract the associated URL (Uniform Resource Locator) to which the page turning feature anchor is linked;
  • The page turning feature anchor may be linked to one or more associated URLs.
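  • Steps 101 and 102 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patented implementation. The feature set `PAGE_TURN_FEATURES`, the class `PageTurnAnchorParser`, and the function `extract_associated_urls` are hypothetical names, and exact matching on the lower-cased anchor text is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical, illustrative feature strings (lower-cased for comparison).
PAGE_TURN_FEATURES = {"<<", ">>", "<", ">", "previous", "next", "last",
                      "previous page", "next page"}


class PageTurnAnchorParser(HTMLParser):
    """Collects the link targets of <a> elements whose text is a page turning feature."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.associated_urls = []
        self._in_anchor = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or not self._in_anchor:
            return
        self._in_anchor = False
        label = "".join(self._text).strip().lower()
        if self._href and label in PAGE_TURN_FEATURES:
            # Resolve relative links against the URL of the specified webpage.
            self.associated_urls.append(urljoin(self.base_url, self._href))


def extract_associated_urls(html, base_url):
    """Return the associated URLs; an empty list means no page turning anchor was found."""
    parser = PageTurnAnchorParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.associated_urls
```

  • A non-empty result corresponds to the "yes" branch of step 101, and the returned list is the output of step 102.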
  • Step 103: calculate the associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked.
  • The associated webpage URL pattern can be regarded as a collection of URLs/webpages that are similar in appearance or function.
  • Step 103 may specifically include the following sub-steps:
  • Sub-step S21: replace the digital blocks in the URL of the specified webpage with wildcard characters to obtain a first feature URL prefix, where a digital block is a single digit or a run of digits delimited by interval identifiers;
  • Sub-step S31: replace the digital blocks in the associated URL with wildcard characters to obtain a second feature URL prefix;
  • The wildcard character may be any character, which is not limited in this embodiment of the present invention.
  • The interval identifier may be a symbol used as a separator in the URL, such as "/", ".", "-", "?", ":", and the like.
  • A digital block must consist only of consecutive digits between interval identifiers; for example, "123ABC" is not a digital block.
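  • The digital block definition above can be sketched with a regular expression. This is an illustration under stated assumptions: only the five interval identifiers named in the text are treated as separators, the start and end of the URL are also treated as boundaries, and the names `DIGITAL_BLOCK` and `find_digital_blocks` are hypothetical.

```python
import re

# A run of digits that is bounded on both sides by an interval identifier
# ("/", ".", "-", "?", ":") or by the start/end of the string.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def find_digital_blocks(url):
    """Return the digital blocks of a URL in order of appearance."""
    return DIGITAL_BLOCK.findall(url)
```

  • With this definition, "99" and "2" in http://bbs.XXX.com/forum-99-2.html are digital blocks, while "123ABC" is not, because its digits are followed by letters rather than an interval identifier.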
  • Sub-step S21 may further include the following sub-step:
  • Sub-step S211: replace the digital blocks at different positions in the URL of the specified webpage with the same wildcard character to obtain the first feature URL prefix;
  • Sub-step S31 may further include the following sub-step:
  • Sub-step S311: replace the digital blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
  • The URL of the specified webpage and the associated URL may each have one or more digital blocks.
  • In this case, every digital block may be replaced with the same wildcard character.
  • For example, the URL of the specified webpage is http://bbs.XXX.com/forum-99-2.html
  • and the associated URL is http://bbs.XXX.com/forum-99-3.html, where "99" and "2" are recognized as digital blocks, and "(\d+)" is an example of a wildcard character.
  • The first feature URL prefix can then be http://bbs.XXX.com/forum-(\d+)-(\d+).html
  • and the second feature URL prefix can be http://bbs.XXX.com/forum-(\d+)-(\d+).html.
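  • The same-wildcard replacement of sub-steps S211 and S311 can be sketched as below. The regular expression and the function name `feature_prefix` are assumptions made for illustration; the wildcard "(\d+)" is taken from the example above.

```python
import re

# Digits bounded by interval identifiers ("/", ".", "-", "?", ":") or string ends.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def feature_prefix(url, wildcard=r"(\d+)"):
    """Replace every digital block in the URL with the same wildcard."""
    # A function is used as the replacement so the backslash in the
    # wildcard is inserted literally instead of being read as an escape.
    return DIGITAL_BLOCK.sub(lambda _match: wildcard, url)
```

  • Both example URLs map to the same string, which is what the later comparison of the two prefixes relies on.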
  • Alternatively, sub-step S21 may further include the following sub-step:
  • Sub-step S212: replace the digital blocks at different positions in the URL of the specified webpage with different wildcard characters to obtain the first feature URL prefix;
  • Sub-step S31 may further include the following sub-step:
  • Sub-step S312: replace the digital block at each position of the associated URL with the same wildcard character that was used at that position in the first feature URL prefix, to obtain the second feature URL prefix.
  • The URL of the specified webpage and the associated URL may each have one or more digital blocks. To make the later comparison of the first and second feature URL prefixes, and the identification of the digital blocks, more efficient, different wildcard characters may be used to replace the digital blocks at different positions.
  • For example, the URL of the specified webpage is http://bbs.XXX.com/forum-99-2.html
  • and the associated URL is http://bbs.XXX.com/forum-99-3.html,
  • where "99" and "2" are recognized as digital blocks, and "(\d+)" and "(\e+)" are examples of wildcard characters.
  • The first feature URL prefix can then be http://bbs.XXX.com/forum-(\d+)-(\e+).html
  • and the second feature URL prefix can be http://bbs.XXX.com/forum-(\d+)-(\e+).html.
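  • The position-dependent variant (sub-steps S212 and S312) can be sketched as follows. The function name `positional_prefix` is hypothetical, and the default wildcard tuple contains only the two wildcards given in the example, so a URL with more digital blocks needs a longer tuple.

```python
import itertools
import re

# Digits bounded by interval identifiers ("/", ".", "-", "?", ":") or string ends.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def positional_prefix(url, wildcards=(r"(\d+)", r"(\e+)")):
    """Replace the i-th digital block of the URL with the i-th wildcard."""
    counter = itertools.count()

    def replace(_match):
        index = next(counter)
        if index >= len(wildcards):
            raise ValueError("more digital blocks than wildcards")
        return wildcards[index]

    return DIGITAL_BLOCK.sub(replace, url)
```

  • Because each position keeps its own wildcard, the position of a digital block can be read back from the prefix without re-parsing the original URL.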
  • Sub-step S41: when the first feature URL prefix is the same as the second feature URL prefix, use the first feature URL prefix or the second feature URL prefix as the associated webpage URL pattern.
  • When the two prefixes are the same, the webpage corresponding to the associated URL is an associated page turning webpage of the specified webpage.
  • In that case, either the first feature URL prefix or the second feature URL prefix may be used as the associated webpage URL pattern.
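  • Sub-steps S21 through S41 can be combined into one small function. This is a sketch under the same assumptions as the text's example (a single "(\d+)" wildcard and the five named interval identifiers); the names `feature_prefix` and `url_pattern` are hypothetical.

```python
import re

# Digits bounded by interval identifiers ("/", ".", "-", "?", ":") or string ends.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def feature_prefix(url, wildcard=r"(\d+)"):
    """Replace every digital block in the URL with the wildcard."""
    return DIGITAL_BLOCK.sub(lambda _match: wildcard, url)


def url_pattern(page_url, associated_url):
    """Return the associated webpage URL pattern, or None if the prefixes differ."""
    first = feature_prefix(page_url)
    second = feature_prefix(associated_url)
    return first if first == second else None
```

  • Returning None when the prefixes differ is a design choice of this sketch; the text only specifies what happens when they are the same.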
  • The invention uses the page turning feature anchor to identify associated webpages, so the recognition accuracy is high, and it calculates the associated webpage URL pattern from the URL of the specified webpage and the associated URL, so the calculation efficiency is high.
  • The invention replaces digital blocks with wildcard characters to obtain a first feature URL prefix and a second feature URL prefix. When the first feature URL prefix is the same as the second feature URL prefix, the first or second feature URL prefix is used as the associated webpage URL pattern.
  • By matching on the common part of the URLs, the invention further improves the recognition accuracy of associated webpages and greatly improves the recall rate; more than 90% of associated webpages can be identified in practical applications.
  • Referring to FIG. 4, a flow chart of the steps of Embodiment 2 of the method for calculating an associated webpage URL pattern is shown; the method may include the following steps.
  • Step 401: determine whether the page elements of the specified webpage have a page turning feature anchor; if yes, perform step 402;
  • Step 402: extract the associated URL to which the page turning feature anchor is linked;
  • Step 403: calculate the associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked;
  • Step 404: perform structural analysis on the common part of the associated webpage URL pattern, extract the page turning block in the associated webpage URL pattern, and replace the page turning block with the homepage identifier to obtain the URL of the homepage associated webpage;
  • The page turning block is the digital block that has the same position but different numbers in a plurality of associated webpage URL patterns.
  • The URL may include one or more of the following structures:
  • protocol: specifies the transport protocol used. The most commonly used is the HTTP protocol, which is also the most widely used protocol on the current WWW.
  • The transport protocols include the file protocol (the resource is a file on the local computer; the format is file:///), the ftp protocol (accessing the resource through FTP; the format is FTP://), and gopher (accessing the resource through the Gopher protocol).
  • http protocol: accessing resources via HTTP; the format is http://
  • https protocol: accessing resources through secure HTTPS; the format is HTTPS://
  • hostname: the domain name system (DNS) host name or IP address of the server hosting the resource. Sometimes the username and password required to connect to the server (in the format username:password) can also be included before the host name.
  • port (port number): optional; when omitted, the default port of the scheme is used. Each transport protocol has a default port number; for example, the default port of http is 80. Sometimes, for security or other considerations, the port is redefined on the server, that is, a non-standard port number is used. In this case, the port number cannot be omitted from the URL.
  • path: a string of zero or more segments separated by "/" symbols, generally used to represent a directory or file address on the host.
  • parameters: an optional component used to specify special parameters.
  • query: optional; used to pass parameters to dynamic webpages (such as webpages created using CGI, ISAPI, PHP/JSP/ASP/ASP.NET technology).
  • fragment: can be used to specify a fragment in a network resource. For example, if a webpage contains multiple term explanations, the fragment can be used to jump directly to one of them.
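  • The URL structure described above can be inspected with Python's standard `urllib.parse` module. The sketch below is only an illustration; the URL in the usage note, including the host bbs.example.com and the credentials, is made up, and `describe_url` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def describe_url(url):
    """Split a URL into the structural parts listed in the text."""
    parts = urlparse(url)
    return {
        "protocol": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "hostname": parts.hostname,
        "port": parts.port,          # None when the default port is implied
        "path": parts.path,
        "parameters": parts.params,  # text after ";" in the last path segment
        "query": parts.query,
        "fragment": parts.fragment,
    }
```

  • For example, describe_url("http://user:pw@bbs.example.com:8080/forum/list.php;v=1?page=2#top") separates the port 8080, the path /forum/list.php, the parameters v=1, the query page=2, and the fragment top.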
  • The page turning block in the associated webpage URL pattern is extracted, and the page turning block is then replaced with the homepage identifier to obtain the URL of the homepage associated webpage.
  • The homepage identifier may include 0, 1, and/or the maximum value among the current associated webpages.
  • The homepage associated webpage generally records important content, such as the text block shown in FIG. 3. Its importance is therefore relatively high, and identifying the homepage associated webpage is correspondingly significant.
  • Different websites adopt different page turning structures, so the homepage associated webpage differs from site to site. For example, some websites use page 0 as the homepage associated webpage, some use page 1, and some use the largest page (such as 2100 shown in FIG. 3), and so on.
  • The foregoing homepage associated webpage is only an example.
  • In practice, the digital block can be replaced with the identifier of any associated webpage to obtain the corresponding associated webpage, which is not limited by the embodiment of the present invention.
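  • Step 404 can be sketched as follows. This is a simplified illustration that assumes exactly one digital block differs between the two URLs; the function names `page_turn_index`, `replace_block`, and `homepage_candidates` are hypothetical, and which candidate is the real homepage depends on the website.

```python
import itertools
import re

# Digits bounded by interval identifiers ("/", ".", "-", "?", ":") or string ends.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def page_turn_index(page_url, associated_url):
    """Index of the digital block with the same position but a different number."""
    a = DIGITAL_BLOCK.findall(page_url)
    b = DIGITAL_BLOCK.findall(associated_url)
    if len(a) != len(b):
        return None
    differing = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return differing[0] if len(differing) == 1 else None


def replace_block(url, index, value):
    """Replace the digital block at the given position with a new identifier."""
    counter = itertools.count()
    return DIGITAL_BLOCK.sub(
        lambda m: str(value) if next(counter) == index else m.group(), url)


def homepage_candidates(page_url, associated_url, max_page=None):
    """Candidate homepage URLs using the identifiers 0, 1 and, if known, the maximum."""
    index = page_turn_index(page_url, associated_url)
    if index is None:
        return []
    identifiers = [0, 1] + ([max_page] if max_page is not None else [])
    return [replace_block(page_url, index, value) for value in identifiers]
```

  • Passing any other identifier to `replace_block` yields the URL of the corresponding associated webpage instead of the homepage.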
  • The invention replaces the page turning block of the associated webpage URL pattern with the homepage identifier to obtain the URL of the homepage associated webpage.
  • The page turning block can also be replaced with the identifiers of other associated webpages to obtain their URLs. This increases the coverage of associated webpages, yields a more comprehensive set of associated webpages, and enables fine-grained operations.
  • the method may include the following steps:
  • Step 501: acquire the associated URL to which the page turning feature anchor corresponding to a page element of the specified webpage is linked;
  • Step 501 may specifically include the following sub-steps:
  • Sub-step S51: use a page turning feature anchor to perform matching in the DOM tree nodes of the specified webpage;
  • Sub-step S52: when the matching is successful, obtain the associated URL from the matched page turning feature anchor.
  • Step 502: calculate the associated webpage URL pattern according to the URL of the specified webpage and the associated URL;
  • Step 502 may specifically include the following sub-steps:
  • Sub-step S61: replace the digital blocks in the URL of the specified webpage with wildcard characters to obtain a first feature URL prefix, where a digital block is a single digit or a run of digits delimited by interval identifiers;
  • Sub-step S71: replace the digital blocks in the associated URL with wildcard characters to obtain a second feature URL prefix;
  • Sub-step S61 may further include the following sub-step:
  • Sub-step S611: replace the digital blocks at different positions in the URL of the specified webpage with the same wildcard character to obtain the first feature URL prefix;
  • Sub-step S71 may further include the following sub-step:
  • Sub-step S711: replace the digital blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
  • Alternatively, sub-step S61 may further include the following sub-step:
  • Sub-step S612: replace the digital blocks at different positions in the URL of the specified webpage with different wildcard characters to obtain the first feature URL prefix;
  • Sub-step S71 may further include the following sub-step:
  • Sub-step S712: replace the digital block at each position of the associated URL with the same wildcard character that was used at that position in the first feature URL prefix, to obtain the second feature URL prefix.
  • Sub-step S81: when the first feature URL prefix is the same as the second feature URL prefix, use the first feature URL prefix or the second feature URL prefix as the associated webpage URL pattern.
  • Step 503: determine, according to the associated webpage URL pattern corresponding to the specified webpage, the page number feature part of the specified webpage URL and the page number feature part of the associated URL, respectively;
  • The page turning block in the associated webpage URL pattern is extracted, and the page turning block is then replaced with the homepage identifier to obtain the URL of the homepage associated webpage.
  • The page number feature part in the associated webpage URL pattern may be determined; it may be the digital block that has the same position but different numbers in multiple associated webpage URL patterns.
  • Step 504: compare the page number feature part of the specified webpage URL with that of the associated URL, and extract the digital block that differs as the page number identifier of the specified webpage URL.
  • The page number identifier may include a homepage identifier.
  • The homepage identifier may include 0, 1, and/or the maximum value among the current associated webpages.
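  • Steps 503 and 504 can be sketched as a comparison of digital blocks position by position. The function name `page_number_identifiers` is hypothetical, and the sketch assumes that exactly one digital block differs between the two URLs.

```python
import re

# Digits bounded by interval identifiers ("/", ".", "-", "?", ":") or string ends.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def page_number_identifiers(page_url, associated_url):
    """Return (page number of page_url, page number of associated_url), or None."""
    a = DIGITAL_BLOCK.findall(page_url)
    b = DIGITAL_BLOCK.findall(associated_url)
    if len(a) != len(b):
        return None
    # Keep only the positions whose numbers differ between the two URLs.
    differing = [(x, y) for x, y in zip(a, b) if x != y]
    if len(differing) != 1:
        return None
    own, other = differing[0]
    return int(own), int(other)
```

  • For the forum example, the block "99" is shared by both URLs and is ignored, while the differing blocks "2" and "3" are returned as the page number identifiers.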
  • The page turning block may be replaced with the homepage identifier to obtain the URL of the homepage associated webpage.
  • The homepage associated webpage generally records important content, such as the text block shown in FIG. 3. Its importance is therefore relatively high, and identifying the homepage associated webpage is correspondingly significant.
  • Different websites adopt different page turning structures, so the homepage associated webpage differs from site to site. For example, some websites use page 0 as the homepage associated webpage, some use page 1, and some use the largest page (such as 2100 shown in FIG. 3), and so on.
  • The foregoing homepage associated webpage is only an example.
  • In practice, the digital block can be replaced with the identifier of any associated webpage to obtain the corresponding associated webpage, which is not limited by the embodiment of the present invention.
  • The invention uses the page turning feature anchor to identify associated webpages, so the recognition accuracy is high; it calculates the associated webpage URL pattern from the URL of the specified webpage and the associated URL, so the calculation efficiency is high; and it compares the common parts of the URLs, which greatly improves the recall rate. More than 90% of associated webpages can be identified in practical applications.
  • The invention replaces digital blocks with wildcard characters to obtain a first feature URL prefix and a second feature URL prefix. When the first feature URL prefix is the same as the second feature URL prefix, the first or second feature URL prefix is used as the associated webpage URL pattern. Matching on the common part of the URLs further improves the recognition accuracy of associated webpages.
  • The invention replaces the page turning block of the associated webpage URL pattern with the homepage identifier to obtain the URL of the homepage associated webpage.
  • The page turning block can also be replaced with the identifiers of other associated webpages to obtain their URLs. This increases the coverage of associated webpages, yields a more comprehensive set of associated webpages, and enables fine-grained operations.
  • Referring to FIG. 6, a flow chart of the steps of an embodiment of a method for establishing an associated webpage database according to an embodiment of the present invention is shown; the method may specifically include the following steps:
  • Step 601: determine whether the captured webpage includes the associated webpage URL pattern; if yes, perform step 602;
  • A web crawler is also known as a web spider (Web Spider).
  • A web spider finds webpages through the link addresses in webpages. Starting from one page of a website (usually the home page), it reads the content of the page, finds the other link addresses in it, and then follows these link addresses to the next pages, looping until all the pages of the site have been crawled. If the entire Internet is treated as one website, a web spider can use this principle to capture all the webpages on the Internet.
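  • The crawling loop described above can be sketched as a breadth-first traversal. This is a generic illustration, not the crawler of the invention; `crawl` and its `fetch_links` callback (which returns the link addresses found on a page) are hypothetical names, and no network access is performed by the sketch itself.

```python
from collections import deque


def crawl(start_url, fetch_links):
    """Visit every page reachable from start_url exactly once, in breadth-first order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # avoid crawling the same page twice
                seen.add(link)
                queue.append(link)
    return visited
```

  • The `seen` set is what stops the loop: once every reachable page has been queued, no new links are added and the queue drains.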
  • The associated webpage URL pattern may be the common part of the page turning webpages, that is, a collection of URLs/webpages that are similar in appearance or function.
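  • One way to picture the pattern as a collection of URLs is to group captured URLs by their feature URL prefix. The sketch below is a deliberate simplification: it skips the page turning feature check of step 601 and groups purely by prefix, so unrelated pages with the same URL shape end up in the same group. The name `group_by_pattern` is hypothetical.

```python
import re
from collections import defaultdict

# Digits bounded by interval identifiers ("/", ".", "-", "?", ":") or string ends.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")
WILDCARD = r"(\d+)"


def group_by_pattern(urls):
    """Map each URL pattern to the captured URLs that share it (groups of 2 or more)."""
    groups = defaultdict(list)
    for url in urls:
        pattern = DIGITAL_BLOCK.sub(lambda _match: WILDCARD, url)
        groups[pattern].append(url)
    return {pattern: members for pattern, members in groups.items() if len(members) > 1}
```

  • In a database built this way, the pattern would act as the key and the associated webpages as its records.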
  • Step 601 may specifically include the following sub-steps:
  • Sub-step S91: determine whether there is a page turning feature string in the page elements of the current webpage; if yes, extract the URL to which the page turning feature string is linked;
  • Referring to FIG. 3, an example of a page turning block according to one embodiment of the present invention is shown.
  • The page turning block may consist mainly of page turning feature strings (that is, page turning feature anchors), and a page turning feature string may be a page element that identifies page turning.
  • the page turning feature string may include one or more of the following:
  • These page turning feature strings are only examples. Other page turning feature strings may be set according to actual conditions, which is not limited by the embodiment of the present invention.
  • the current webpage may be the webpage that is captured.
  • the sub-step S91 may further include the following sub-steps:
  • Sub-step S911 using a page turning feature string to perform matching in the DOM tree node of the current webpage;
  • Sub-step S912 when the matching is successful, it is determined that the current webpage has a page turning feature string.
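  • The matching in sub-steps S911 and S912 can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: it scans anchor elements with Python's standard-library `html.parser` instead of building a full DOM tree, and the set `PAGE_TURNING_STRINGS` and the function name `find_page_turning_links` are hypothetical, since the source does not fix a concrete list of feature strings.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical feature strings; the source does not give a concrete list.
PAGE_TURNING_STRINGS = {"next", "next page", "previous", "previous page", "下一页", "上一页"}


class PageTurningAnchorFinder(HTMLParser):
    """Collects the link targets of <a> elements whose text is a feature string."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # absolute URLs linked from page turning anchors
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).strip().lower()
            if label in PAGE_TURNING_STRINGS:
                # Resolve relative links against the URL of the current webpage.
                self.links.append(urljoin(self.base_url, self._href))
            self._href = None


def find_page_turning_links(html, base_url):
    """Return the URLs linked by page turning feature strings (empty if none match)."""
    finder = PageTurningAnchorFinder(base_url)
    finder.feed(html)
    return finder.links
```

  • An empty result corresponds to the case where the current webpage has no page turning feature string, so no associated URL is extracted.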
  • Sub-step S92, replacing the digital block in the URL of the current webpage with a preset replacement character to obtain a first feature URL prefix; wherein the digital block is a single digit, or a plurality of digits delimited by interval identifiers;
  • Sub-step S93, replacing the digital block in the URL linked by the page turning feature string with a preset replacement character to obtain a second feature URL prefix;
  • the sub-step S92 may further include the following sub-step:
  • Sub-step S921, replacing the digital blocks at different positions in the URL of the current webpage with the same replacement character to obtain the first feature URL prefix;
  • the sub-step S93 may further include the following sub-step:
  • Sub-step S931, replacing the digital blocks at different positions in the URL linked by the page turning feature string with the same replacement character to obtain the second feature URL prefix.
  • Alternatively, the sub-step S92 may include the following sub-step:
  • Sub-step S922, replacing the digital blocks at different positions in the URL of the current webpage with different replacement characters to obtain the first feature URL prefix;
  • the sub-step S93 may then include the following sub-step:
  • Sub-step S932, replacing each digital block in the URL linked by the page turning feature string with the replacement character used at the same position for the first feature URL prefix, to obtain the second feature URL prefix.
  • Sub-step S94, when the first feature URL prefix is the same as the second feature URL prefix, it is determined that the crawled webpage includes an associated webpage URL pattern.
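  • Sub-steps S92 to S94 with a single replacement character can be sketched as follows. This is an illustrative sketch under stated assumptions: a digital block is read here as a maximal run of digits, `*` is a hypothetical choice of preset replacement character, and the function names are not from the source.

```python
import re

# Assumption: a digital block is a maximal run of digits in the URL.
DIGITAL_BLOCK = re.compile(r"\d+")
REPLACEMENT = "*"  # hypothetical preset replacement character


def feature_url_prefix(url):
    """Replace every digital block in the URL with the same replacement character."""
    return DIGITAL_BLOCK.sub(REPLACEMENT, url)


def has_associated_pattern(current_url, linked_url):
    """Sub-step S94: the page has a pattern when both feature URL prefixes are equal."""
    return feature_url_prefix(current_url) == feature_url_prefix(linked_url)
```

  • For example, under these assumptions `http://bbs.example.com/t661_10` and `http://bbs.example.com/t661_11` both reduce to `http://bbs.example.com/t*_*`, so the comparison succeeds even though neither URL contains a keyword such as `page`.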
  • Step 602 Acquire the associated webpage URL pattern.
  • the step 602 may specifically include the following sub-steps:
  • Sub-step S101 the first feature URL prefix or the second feature URL prefix is used as a corresponding associated webpage URL pattern of the current webpage.
  • The present invention replaces the digital block in the URL of the current webpage with a preset replacement character to obtain the first feature URL prefix, and replaces the digital block in the URL linked by the page turning feature string with the preset replacement character to obtain the second feature URL prefix. When the first feature URL prefix is the same as the second feature URL prefix, the first feature URL prefix or the second feature URL prefix is used as the associated webpage URL pattern corresponding to the current webpage.
  • The present invention uses the page turning feature string to identify associated webpages, so recognition accuracy is high, and it matches on the common part of the URLs, which further improves the recognition accuracy for associated webpages.
  • The recall rate is greatly improved, and more than 90% of associated webpages can be identified in practical applications.
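  • The variant of sub-steps S922 and S932, in which digital blocks at different positions receive different replacement characters, might look like the sketch below. The numbered placeholder syntax `{0}`, `{1}`, ... and the function names are assumptions made for illustration; the source only requires that the same position gets the same replacement in both URLs.

```python
import re
from itertools import count

# Assumption: a digital block is a maximal run of digits in the URL.
DIGITAL_BLOCK = re.compile(r"\d+")


def positional_prefix(url):
    """Replace the n-th digital block with the placeholder "{n}"."""
    counter = count()
    return DIGITAL_BLOCK.sub(lambda match: "{%d}" % next(counter), url)


def associated_pattern(current_url, linked_url):
    """Return the shared pattern when both prefixes agree, otherwise None."""
    first = positional_prefix(current_url)
    second = positional_prefix(linked_url)
    return first if first == second else None
```

  • Numbered placeholders keep track of which position each digital block came from, which is convenient when a single block has to be swapped out later.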
  • Step 603 Acquire a corresponding associated webpage based on the associated webpage URL pattern.
  • the associated webpages may include a homepage associated webpage and other associated webpages. The homepage associated webpage generally records important content, such as the text block shown in FIG. 3, so its importance is relatively high and obtaining it matters.
  • the step 603 may specifically include the following sub-steps:
  • Sub-step S111, extracting the page turning block in the associated webpage URL pattern by performing structural analysis on the common part in the associated webpage URL pattern, and replacing the page turning block with the homepage identifier to obtain the URL of the homepage associated webpage;
  • the page turning block is a digital block that has the same position but different numbers in a plurality of associated webpage URL patterns;
  • Sub-step S112 accessing the URL of the homepage associated webpage to obtain the homepage associated webpage.
  • the homepage identifier may include 0, 1, and/or a maximum value in a current associated webpage.
  • The invention replaces the page turning block of the associated webpage URL pattern with the homepage identifier to obtain the URL of the homepage associated webpage. Similarly, the page turning block can be replaced with other linked webpage identifiers to obtain the URLs of other associated webpages. This increases the coverage of associated webpages, so that a more comprehensive set of associated webpages can be obtained and fine-grained operations become possible.
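  • Sub-step S111 can be illustrated with the sketch below, which locates the page turning block as the one digital block that differs between two URLs sharing a pattern, and then rewrites it to a homepage identifier. Treating digital blocks as maximal digit runs, choosing `"1"` as the homepage identifier, and the function names are all assumptions for illustration.

```python
import re

# Assumption: a digital block is a maximal run of digits in the URL.
DIGITAL_BLOCK = re.compile(r"\d+")
HOMEPAGE_IDENTIFIER = "1"  # the source also allows 0 or a maximum value


def page_turning_position(url_a, url_b):
    """Index of the single digital block that differs between the URLs, else None."""
    if DIGITAL_BLOCK.sub("*", url_a) != DIGITAL_BLOCK.sub("*", url_b):
        return None  # the URLs do not share a pattern
    blocks_a = DIGITAL_BLOCK.findall(url_a)
    blocks_b = DIGITAL_BLOCK.findall(url_b)
    differing = [i for i, (a, b) in enumerate(zip(blocks_a, blocks_b)) if a != b]
    return differing[0] if len(differing) == 1 else None


def homepage_url(url, position, identifier=HOMEPAGE_IDENTIFIER):
    """Rewrite the digital block at `position` to the homepage identifier."""
    match = list(DIGITAL_BLOCK.finditer(url))[position]
    return url[:match.start()] + identifier + url[match.end():]
```

  • Passing a different identifier to `homepage_url` yields the URLs of other associated webpages in the same way.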
  • Step 604 Establish an associated webpage database by using an associated webpage corresponding to the associated webpage URL pattern.
  • the associated webpage corresponding to the webpage URL pattern may include a homepage associated webpage and other related webpages, which may be all of the associated webpages, or may be a part of all associated webpages, which is not limited by the embodiment of the present invention.
  • the data processing of the webpage file captured by the spider may be performed, which may specifically include:
  • Webpage structuring: the HTML code of the associated webpage is stripped and the web content is extracted.
  • Link analysis: the backlinks of the page, the number of outbound links and the internal links are queried, and the weight to give the page is then determined, and so on.
  • the processed data can be stored in the associated web page database.
  • The invention extracts the associated webpage URL pattern from the currently captured webpage and establishes the associated webpage database from the associated webpages corresponding to that pattern, thereby avoiding repeated crawling of webpages, reducing the occupation of system resources, and greatly improving the efficiency of building the database.
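  • As a rough sketch of step 604, the associated webpages could be structured and stored as shown below. The table layout, the crude tag-stripping regular expression and the function names are hypothetical; the source does not specify a storage engine, and an in-memory SQLite database is used here only to keep the example self-contained.

```python
import re
import sqlite3

TAG = re.compile(r"<[^>]+>")


def strip_html(html):
    """Very rough structuring step: drop tags and collapse whitespace."""
    return " ".join(TAG.sub(" ", html).split())


def build_database(pages):
    """pages: iterable of (url_pattern, url, html) for already fetched associated webpages."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE associated_page ("
        "url TEXT PRIMARY KEY, pattern TEXT NOT NULL, content TEXT NOT NULL)"
    )
    for pattern, url, html in pages:
        # INSERT OR IGNORE skips a URL that has already been stored.
        db.execute(
            "INSERT OR IGNORE INTO associated_page VALUES (?, ?, ?)",
            (url, pattern, strip_html(html)),
        )
    db.commit()
    return db
```

  • Keying the table on the URL means a page reached twice through different page turning links is stored only once.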
  • Referring to FIG. 7, a flow chart of the steps of an embodiment of an associated webpage search method according to an embodiment of the present invention is shown. Specifically, the method may include the following steps:
  • Step 701, receiving a search request, where the request includes a search keyword;
  • the search request may refer to a request by the user to perform an associated information search for a certain search keyword.
  • the user can input a search keyword in the browser address bar, the search bar, the search keyword input box in the search engine, and press the enter key or click the search button, which is equivalent to receiving the user's search request.
  • Step 702 Perform a search in the preset related webpage database according to the search keyword, and obtain a webpage that matches the keyword;
  • the collected information is generally a keyword or phrase that indicates the content of the associated webpage (including the webpage itself, the URL address of the webpage, the code that makes up the webpage, and the links to and from the webpage).
  • the search query q is segmented into keywords; the ranked URLs corresponding to each keyword in q are held in the index library; the importance of each keyword is also calculated according to the user's query mode and the keyword's part of speech; a comprehensive ranking algorithm is then all that is needed to produce the search results.
  • the associated web page database can be established in the following manner:
  • Sub-step S101, it is determined whether the captured webpage includes an associated webpage URL pattern; if so, sub-step S102 is performed;
  • the sub-step S101 may specifically include the following sub-steps:
  • Sub-step S121 determining whether the page element of the current webpage has a page turning feature string; if yes, extracting the URL of the page turning feature string link;
  • the sub-step S121 may further include the following sub-steps:
  • Sub-step S1211 using a page turning feature string to perform matching in a DOM tree node of the current webpage
  • Sub-step S1212 when the matching is successful, it is determined that the current webpage has a page turning feature string.
  • Sub-step S122 replacing the digital block in the URL of the current webpage with a preset replacement character to obtain a first feature URL prefix; wherein the digital block is a single number or multiple digits separated by the interval identifier;
  • Sub-step S123 replacing the digital block in the URL of the page-turning feature string link with a preset replacement character to obtain a second feature URL prefix
  • the sub-step S122 may further include the following sub-steps:
  • sub-step S123 may further comprise the following sub-steps:
  • Sub-step S1231 replacing the digital block at different positions in the URL of the feature string link with the same replacement character to obtain a second feature URL prefix.
  • the sub-step S122 may further include the following sub-steps:
  • Sub-step S1222, replacing the digital blocks at different positions in the URL of the current webpage with different replacement characters to obtain the first feature URL prefix;
  • sub-step S123 may further comprise the following sub-steps:
  • Sub-step S1232, replacing each digital block in the URL linked by the page turning feature string with the replacement character used at the same position for the first feature URL prefix, to obtain the second feature URL prefix.
  • Sub-step S124, when the first feature URL prefix is the same as the second feature URL prefix, it is determined that the crawled webpage includes an associated webpage URL pattern.
  • Sub-step S102 acquiring the associated webpage URL pattern
  • the sub-step S102 may specifically include the following sub-steps:
  • Sub-step S131 the first feature URL prefix or the second feature URL prefix is used as a corresponding associated webpage URL pattern of the current webpage.
  • The present invention replaces the digital block in the URL of the current webpage with a preset replacement character to obtain the first feature URL prefix, and replaces the digital block in the URL linked by the page turning feature string with the preset replacement character to obtain the second feature URL prefix. When the first feature URL prefix is the same as the second feature URL prefix, the first feature URL prefix or the second feature URL prefix is used as the associated webpage URL pattern corresponding to the current webpage.
  • the present invention uses the page turning feature string to identify the associated webpage, and the recognition accuracy is high, and the common part of the URL is used for matching, thereby further improving the recognition accuracy of the associated webpage.
  • the recall rate is greatly improved, and more than 90% of related web pages can be identified in practical applications.
  • Sub-step S103 acquiring the corresponding associated webpage by using the associated webpage URL pattern
  • the sub-step S103 may specifically include the following sub-steps:
  • Sub-step S141, extracting the page turning block in the associated webpage URL pattern by performing structural analysis on the common part in the associated webpage URL pattern, and replacing the page turning block with the homepage identifier to obtain the URL of the homepage associated webpage; wherein the page turning block is a digital block that has the same position but different numbers in a plurality of associated webpage URL patterns;
  • Sub-step S142 accessing the URL of the homepage associated webpage to obtain the homepage associated webpage.
  • The invention replaces the page turning block of the associated webpage URL pattern with the homepage identifier to obtain the URL of the homepage associated webpage. Similarly, the page turning block can be replaced with other linked webpage identifiers to obtain the URLs of other associated webpages. This increases the coverage of associated webpages, so that a more comprehensive set of associated webpages can be obtained and fine-grained operations become possible.
  • Sub-step S104 the associated webpage database is established by using the associated webpage corresponding to the associated webpage URL pattern.
  • Step 703, it is determined whether the webpage is an associated webpage; if yes, step 704 is performed;
  • determining whether the webpage includes an associated webpage URL pattern can determine whether the webpage is an associated webpage. That is, when the webpage includes an associated webpage URL pattern, the webpage is determined to be an associated webpage.
  • Step 704 returning the webpage and the homepage information associated with the webpage.
  • the embodiment of the present invention may store the correspondence between associated webpage URL patterns and their associated webpages. The homepage associated with a webpage can then be obtained by looking up the webpage's associated webpage URL pattern in this correspondence.
  • the search engine can display the search results on the user's viewing interface for the user to use.
  • When the invention determines that a webpage matching the keyword is an associated webpage, it returns the webpage together with the homepage information associated with it. This spares the user from repeating the search or looking for the homepage, further reduces system operations and the occupation of system resources, and improves search efficiency.
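  • Steps 702 to 704 can be sketched with plain dictionaries standing in for the associated webpage database. All names and sample data below are hypothetical, and returning `None` as the homepage for a webpage that is not an associated webpage is an assumption, since the source only describes the associated case.

```python
def search_associated(keyword, keyword_index, pattern_of, homepage_of):
    """Return (url, homepage_url_or_None) for every webpage matching the keyword."""
    results = []
    for url in keyword_index.get(keyword, []):   # step 702: find matching webpages
        pattern = pattern_of.get(url)            # step 703: is it an associated webpage?
        homepage = homepage_of.get(pattern) if pattern is not None else None
        results.append((url, homepage))          # step 704: webpage plus homepage info
    return results


# Illustrative data only.
keyword_index = {"lawyer": ["http://example.com/list/p3", "http://example.com/about"]}
pattern_of = {"http://example.com/list/p3": "http://example.com/list/p*"}
homepage_of = {"http://example.com/list/p*": "http://example.com/list/p1"}
```

  • Storing the pattern-to-homepage correspondence once per pattern lets every page of a paginated series resolve to the same homepage with a single lookup.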
  • Referring to FIG. 8, a structural block diagram of device embodiment 1 for calculating an associated webpage URL pattern according to an embodiment of the present invention is shown, which may specifically include the following modules:
  • the page turning feature anchor determining module 801 is adapted to determine whether the page element of the specified web page has a page turning feature anchor; if so, the associated URL extracting module 802 is invoked;
  • the URL extraction module 802 is adapted to extract an associated URL to which the page turning feature anchor is linked;
  • the associated webpage URL pattern calculation module 803 is adapted to calculate an associated webpage URL pattern pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked.
  • the page turning feature anchor determining module 801 is further adapted to:
  • Matching is performed in the DOM tree node of the current webpage by using a page turning feature anchor;
  • the page flip feature anchor may be linked to one or more associated URLs.
  • the associated webpage URL pattern calculation module 803 may specifically include the following modules:
  • a first feature URL prefix obtaining module, adapted to replace the digital block in the URL of the specified webpage with a wildcard character to obtain a first feature URL prefix; wherein the digital block is a single digit, or a plurality of digits delimited by interval identifiers;
  • a second feature URL prefix obtaining module, adapted to replace the digital block in the associated URL with a wildcard character to obtain a second feature URL prefix;
  • an associated webpage URL pattern obtaining module, adapted to use the first feature URL prefix or the second feature URL prefix as the associated webpage URL pattern when the first feature URL prefix is the same as the second feature URL prefix.
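  • The division of labour between modules 801 to 803 could be mirrored in code roughly as follows. This is a sketch under stated assumptions: the class name `PatternDevice`, the injected `extract_links` callable (standing in for modules 801 and 802) and the `*` wildcard are illustrative choices, not the device described in the source.

```python
import re
from typing import Callable, Iterable, Optional

# Assumption: a digital block is a maximal run of digits in the URL.
DIGITAL_BLOCK = re.compile(r"\d+")


def wildcard_prefix(url: str) -> str:
    """Feature URL prefix: every digital block replaced by the wildcard "*"."""
    return DIGITAL_BLOCK.sub("*", url)


class PatternDevice:
    """Pattern calculation (module 803) on top of a pluggable link extractor."""

    def __init__(self, extract_links: Callable[[str], Iterable[str]]):
        # extract_links returns the associated URLs linked by page turning
        # feature anchors; an empty result means no such anchor was found.
        self.extract_links = extract_links

    def compute(self, page_url: str, html: str) -> Optional[str]:
        first = wildcard_prefix(page_url)
        # A page turning feature anchor may link to one or more associated URLs.
        for associated_url in self.extract_links(html):
            if wildcard_prefix(associated_url) == first:
                return first
        return None
```

  • Injecting the extractor keeps the pattern calculation independent of how anchors are detected in the page.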
  • the first feature URL prefix obtaining module may further be adapted to:
  • the second feature URL prefix obtaining module may further be adapted to:
  • the second feature URL prefix is obtained by replacing the digital blocks at different positions in the associated URL with the same wildcard characters.
  • the first feature URL prefix obtaining module may further be adapted to:
  • the first feature URL prefix is obtained by using different wildcard characters to replace the digital blocks in different positions in the URL of the specified webpage.
  • the second feature URL prefix obtaining module may also be adapted to:
  • Referring to FIG. 9, a structural block diagram of device embodiment 2 for calculating an associated webpage URL pattern according to an embodiment of the present invention is shown, which may specifically include the following modules:
  • the page turning feature anchor determining module 901 is adapted to determine whether the page element of the specified webpage has a page turning feature anchor; if so, the associated URL extracting module 902 is invoked;
  • the URL extraction module 902 is adapted to extract the associated URL to which the page turning feature anchor is linked;
  • the associated webpage URL pattern calculation module 903 is adapted to calculate the associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked;
  • the homepage associated webpage URL obtaining module 904 is adapted to extract the page turning block in the associated webpage URL pattern by performing structural analysis on the common part in the associated webpage URL pattern, and to replace the page turning block with the homepage identifier to obtain the URL of the homepage associated webpage; wherein the page turning block is a digital block that has the same position but different numbers in a plurality of associated webpage URL patterns.
  • the homepage identifier may include 0, 1, and/or a maximum value in a current associated webpage.
  • the various component embodiments of the present invention may be implemented in hardware, or in a software module running on one or more processors, or in a combination thereof.
  • a microprocessor or digital signal processor may be used in practice to implement some or all of the functions of some or all of the components of the device for calculating the associated webpage URL pattern in accordance with an embodiment of the present invention.
  • the invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.
  • a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • FIG. 10 illustrates a computing device, such as a user terminal device or an application server, that can implement the calculation of an associated web page URL pattern pattern in accordance with the present invention.
  • the computing device conventionally includes a processor 1010 and a computer program product or computer readable medium in the form of a memory 1020.
  • the memory 1020 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM.
  • the memory 1020 has a storage space 1030 for program code 1031 for performing any of the above method steps.
  • storage space 1030 for program code may include various program code 1031 for implementing various steps in the above methods, respectively.
  • the program code can be read from or written to one or more computer program products.
  • These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
  • Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG.
  • the storage unit may have storage segments, storage spaces, and the like that are similarly arranged to memory 1020 in the computing device of FIG.
  • the program code can be compressed, for example, in an appropriate form.
  • the storage unit includes computer readable code 1031', i.e., code that can be read by a processor such as the processor 1010, and which, when executed by a computing device, causes the computing device to perform each step of the methods described above.

Abstract

A method and an apparatus for computing a URL pattern of an associated webpage. The method comprises: determining whether a page turning feature anchor exists in page elements of a specified webpage; if yes, retrieving an associated URL to which the page turning feature anchor is correspondingly linked; and computing, according to a URL of the specified webpage and the associated URL to which the page turning feature anchor is correspondingly linked, an associated webpage pattern corresponding to the specified webpage. A page turning feature anchor is used to recognize an associated webpage, so that an accuracy rate of recognition is high, and an associated webpage URL pattern is obtained through computation on the basis of a URL of a specified webpage and an associated URL, so that computing efficiency is high.

Description

Method and apparatus for computing a URL pattern of an associated webpage
Technical Field
The present invention relates to the field of data processing technologies, and in particular to a method for computing a URL pattern of an associated webpage and an apparatus for computing a URL pattern of an associated webpage.
Background
With the development of the Internet, more and more information is presented on the Internet as webpages for users to query, and querying data on the Internet through a search engine has likewise become the most commonly used data search method.
When a search engine indexes webpages, it needs to adopt different scheduling strategies for different kinds of webpages. Identifying the type of a webpage is a basic task, and identifying page turning webpages is a fairly critical part of it. A page turning webpage is one used to view the previous page, the next page, or any other existing non-current page of a paginated document. Page turning changes the content shown in a physical book or a mobile Web form so that different content can be viewed. When used on the Internet, the mechanism also presents user interface elements that can be used to browse to other pages.
The existing method for identifying a page turning webpage decides whether a page is an index page according to the keywords contained in the URL (Uniform Resource Locator) of the webpage. For example, when the URL contains a keyword such as page, pn or p, and the keyword is followed by a number, the webpage corresponding to the URL is judged to be a page turning webpage.
However, this identification method has a low recall rate, because the page turning URLs of many websites do not contain these keywords, for example "http://cq.ABC.com/lvshi/o12/", "http://bbs.BCA.com/t661_10" and "http://china.BCD.com/product/20110617/2647". These webpages are nevertheless page turning webpages, so such identification methods easily cause misoperation and have low practicality.
Summary of the Invention
In view of the above problems, the present invention is proposed to provide a method for computing a URL pattern of an associated webpage, and a corresponding apparatus, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for computing a URL pattern of an associated webpage is provided, including:
determining whether a page turning feature anchor exists in the page elements of a specified webpage; if yes, extracting the associated URL to which the page turning feature anchor links;
computing, according to the URL of the specified webpage and the associated URL to which the page turning feature anchor links, the associated webpage URL pattern corresponding to the specified webpage.
According to another aspect of the present invention, a method for identifying a page number identifier in a webpage URL is provided, including:
obtaining the associated URL to which a page turning feature anchor in the page elements of a specified webpage links;
computing an associated webpage URL pattern according to the URL of the specified webpage and the associated URL;
determining, based on the associated webpage URL pattern corresponding to the specified webpage, the page number feature part of the URL of the specified webpage and the page number feature part of the associated URL, respectively;
comparing the page number feature parts of the URL of the specified webpage and of the associated URL, and extracting the differing numeric part as the page number identifier of the URL of the specified webpage.
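The comparison in the last step above can be sketched as follows. This is a minimal illustration, assuming that the page number feature parts are the maximal digit runs of each URL and that exactly one of them differs; the function name `page_number_identifier` is hypothetical.

```python
import re

# Assumption: page number feature parts are the maximal digit runs in the URL.
DIGITAL_BLOCK = re.compile(r"\d+")


def page_number_identifier(page_url, associated_url):
    """Return the differing digit run of page_url, or None if it cannot be determined."""
    if DIGITAL_BLOCK.sub("*", page_url) != DIGITAL_BLOCK.sub("*", associated_url):
        return None  # the two URLs do not share an associated webpage URL pattern
    pairs = zip(DIGITAL_BLOCK.findall(page_url), DIGITAL_BLOCK.findall(associated_url))
    differing = [own for own, other in pairs if own != other]
    return differing[0] if len(differing) == 1 else None
```

The result is returned as a string so that leading zeros in a page number are preserved.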
According to another aspect of the present invention, a method for establishing an associated webpage database is provided, including:
determining whether a captured webpage includes an associated webpage URL pattern; if yes, acquiring the associated webpage URL pattern;
acquiring the corresponding associated webpages based on the associated webpage URL pattern;
establishing the associated webpage database from the associated webpages corresponding to the associated webpage URL pattern.
According to another aspect of the present invention, an associated webpage search method is provided, including:
receiving a search request, where the request includes a search keyword;
searching a preset associated webpage database according to the search keyword to obtain a webpage matching the keyword;
determining whether the webpage is an associated webpage; if yes, returning the webpage and the homepage information associated with the webpage.
According to another aspect of the present invention, an apparatus for computing a URL pattern of an associated webpage is provided, including:
a page turning feature anchor determining module, adapted to determine whether a page turning feature anchor exists in the page elements of a specified webpage and, if yes, to invoke the associated URL extraction module;
a URL extraction module, adapted to extract the associated URL to which the page turning feature anchor links;
an associated webpage URL pattern calculation module, adapted to compute the associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor links.
According to yet another aspect of the present invention, a computer program is provided, comprising computer readable code which, when run on a computing device, causes the computing device to perform the method for computing a URL pattern of an associated webpage according to any one of claims 1-8.
According to a further aspect of the present invention, a computer readable medium is provided, in which the computer program according to claim 23 is stored.
The beneficial effects of the invention are as follows:
The invention uses the page turning feature anchor to identify associated webpages, so recognition accuracy is high; the associated webpage URL pattern is computed from the URL of the specified webpage and the associated URL, so computation is efficient.
The invention replaces digital blocks with wildcard characters to obtain a first feature URL prefix and a second feature URL prefix. When the first feature URL prefix is the same as the second feature URL prefix, the first feature URL prefix or the second feature URL prefix is used as the associated webpage URL pattern. Because the invention matches on the common part of the URLs, the recognition accuracy for associated webpages is further improved and the recall rate is greatly increased; in practical applications more than 90% of associated webpages can be identified.
The invention replaces the page turning block of the associated webpage URL pattern with the homepage identifier to obtain the URL of the homepage associated webpage. Similarly, the page turning block can be replaced with other linked webpage identifiers to obtain the URLs of other associated webpages. This increases the coverage of associated webpages, so that a more comprehensive set of associated webpages can be obtained and fine-grained operations become possible.
The invention extracts the associated webpage URL pattern from the currently captured webpage and establishes the associated webpage database from the associated webpages corresponding to that pattern, thereby avoiding repeated crawling of webpages, reducing the occupation of system resources, and greatly improving the efficiency of building the database.
When the invention determines that a webpage matching the keyword is an associated webpage, it returns the webpage together with the homepage information associated with it. This spares the user from repeating the search or looking for the homepage, further reduces system operations and the occupation of system resources, and improves search efficiency.
上述说明仅是本发明技术方案的概述,为了能够更清楚了解本发明的技术手段,而可依照说明书的内容予以实施,并且为了让本发明的上述和其它目的、特征和优点能够更明显易懂,以下特举本发明的具体实施方式。The above description is only an overview of the technical solutions of the present invention, and the above-described and other objects, features and advantages of the present invention can be more clearly understood. Specific embodiments of the invention are set forth below.
Brief Description of the Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be construed as limiting the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 schematically shows a flow chart of the steps of Embodiment 1 of a method for calculating an associated web page URL pattern according to an embodiment of the present invention;
FIG. 2 schematically shows an example of a web page structure according to an embodiment of the present invention;
FIG. 3 schematically shows an example of a page-turning block according to an embodiment of the present invention;
FIG. 4 schematically shows a flow chart of the steps of Embodiment 2 of a method for calculating an associated web page URL pattern according to an embodiment of the present invention;
FIG. 5 schematically shows a flow chart of the steps of an embodiment of a method for identifying a page number identifier in a web page URL according to an embodiment of the present invention;
FIG. 6 schematically shows a flow chart of the steps of an embodiment of a method for building an associated web page database according to an embodiment of the present invention;
FIG. 7 schematically shows a flow chart of the steps of an embodiment of an associated web page search method according to an embodiment of the present invention;
FIG. 8 schematically shows a structural block diagram of Embodiment 1 of an apparatus for calculating an associated web page URL pattern according to an embodiment of the present invention;
FIG. 9 schematically shows a structural block diagram of Embodiment 2 of an apparatus for calculating an associated web page URL pattern according to an embodiment of the present invention;
FIG. 10 schematically shows a block diagram of a computing device for performing the method according to the present invention; and
FIG. 11 schematically shows a storage unit for holding or carrying program code that implements the method according to the present invention.
Detailed Description
The present invention is further described below with reference to the drawings and specific embodiments.
Referring to FIG. 1, a flow chart of the steps of Embodiment 1 of a method for calculating an associated web page URL pattern according to an embodiment of the present invention is shown. The method may specifically include the following steps:
Step 101: determine whether the page elements of a specified web page include a page-turning feature anchor; if so, perform step 102;
A web page can be divided into several regions by function. Taking a page of a forum (Bulletin Board System, BBS) as an example, as shown in FIG. 2, the page can be divided into a navigation block (1), garbage blocks (2, 4), a page-turning block (3), a title block (5), an author information block (6), a publication date block (7), and a body block (8). The navigation block may be located at the top of the page header or below the banner (the banner advertisement of the web page) and points to the information sections of the web page. A garbage block is a region containing page elements with very low relevance to the topic of the web page, such as the "post" and "reply" function buttons. The page-turning block is the region that indicates page turning. The title block is the region containing the title of the web page topic (for example, "安全浏览器聚集黑色星期四" shown in FIG. 2). The author information block is the region recording information about the author of the topic. The body block is the region recording the body text of the topic.
Referring to FIG. 3, an example of a page-turning block according to an embodiment of the present invention is shown.
As shown in FIG. 3, a page-turning block consists mainly of page-turning feature anchors. A page-turning feature anchor is a page-turning feature string, that is, a page element used to indicate page turning.
In a specific implementation, the page-turning feature anchors may include one or more of the following:
[<<], [>>], [< <], [> >], [《], [》], [>], [<], [下一页] (next page), [上一页] (previous page), [上一] (previous), [下一] (next), [next], [末页] (last page), [尾页] (last page), [前页] (previous page), [后页] (next page), [<上一页], [<上一], [下一>], [下一页>], [1...].
Of course, the page-turning feature anchors above are only examples. When implementing the embodiments of the present invention, other page-turning feature anchors may be configured according to actual conditions; the embodiments of the present invention place no limitation on this.
In a preferred embodiment of the present invention, step 101 may specifically include the following sub-steps:
Sub-step S11: match the page-turning feature anchors against the DOM tree nodes of the current web page;
Sub-step S12: when the match succeeds, determine that the current web page has a page-turning feature anchor.
DOM (Document Object Model) is the standard programming interface for processing extensible markup languages. It provides a platform- and language-independent way to access and modify the content and structure of a document, and is a common way of representing and processing an HTML (Hypertext Markup Language) or XML (eXtensible Markup Language) document.
The DOM is, in effect, a document model described in an object-oriented manner. It defines the objects needed to represent and modify a document, the behavior and properties of those objects, and the relationships among them. The DOM can be regarded as a tree representation of the data and structure of a page, although the page itself need not actually be implemented as such a tree.
With JavaScript, the entire HTML document can be restructured: items on the page can be added, removed, changed, or rearranged.
To change something on the page, JavaScript needs an entry point that gives access to all elements in the HTML document. This entry point, together with the methods and properties for adding, moving, changing, or removing HTML elements, is provided by the Document Object Model (DOM).
An HTML document can be viewed as a tree structure, called the node tree (HTML DOM). Through the HTML DOM, every node in the tree can be accessed with JavaScript. All HTML elements (nodes) can be modified, and nodes can be created or deleted.
The nodes of the node tree are related hierarchically. The terms parent, child, and sibling describe these relationships: a parent node has child nodes, and child nodes at the same level are called siblings. The top node of the tree is called the root. Every node except the root has a parent. A node can have any number of children, and siblings are nodes that share the same parent.
The web page elements to be operated on can be located in the node tree in several ways:
For example, the getElementById() and getElementsByTagName() methods can be used.
As another example, the parentNode, firstChild, and lastChild properties of an element node can be used.
The getElementById() and getElementsByTagName() methods can find any HTML element in the entire HTML document, and both ignore the document's structure. To find all <p> elements in a document, getElementsByTagName() returns every one of them, regardless of the level at which each <p> element sits. Likewise, getElementById() returns the correct element wherever it is located in the document structure. Both methods therefore deliver any required HTML element regardless of its position in the document.
In addition, getElementById() returns the web page element with the specified ID.
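As a rough illustration of the two lookup styles just described, the following Python sketch uses the standard library's `xml.dom.minidom`, which exposes the same DOM property and method names. The sample markup is hypothetical and deliberately well-formed; real HTML pages usually need a more tolerant parser.

```python
from xml.dom import minidom

# Hypothetical, well-formed sample document (not taken from the patent).
SAMPLE = "<html><body><div><p>one</p><p>two</p></div><p>three</p></body></html>"

doc = minidom.parseString(SAMPLE)

# getElementsByTagName ignores document structure: it finds every <p>,
# however deeply it is nested, in document order.
paragraphs = doc.getElementsByTagName("p")
texts = [p.firstChild.data for p in paragraphs]

# parentNode / firstChild / lastChild navigate relative to a single node.
div = doc.getElementsByTagName("div")[0]
first_text = div.firstChild.firstChild.data   # text of the first <p> in the <div>
last_text = div.lastChild.firstChild.data     # text of the last <p> in the <div>
parent_tag = div.parentNode.tagName           # element that contains the <div>
```

The first lookup is global, while the second walks the hierarchy from a known node; both are ways of reaching the `<a>` elements that the method inspects.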
In a specific implementation, the hyperlink <a> (anchor) tags in the DOM tree of the web page's HTML text can be examined to see whether they include one or more of [<<], [>>], [< <], [> >], [《], [》], [>], [<], [下一页], [上一页], [上一], [下一], [next], [末页], [尾页], [前页], [后页], [<上一页], [<上一], [下一>], [下一页>], and [1...]. If so, the current web page is determined to have a page-turning feature anchor.
The <a> tag is used to link the text or image at the current position to another page, text, image, and so on.
The basic syntax of the <a> tag may be as follows:
<a
  class=type
  id=value
  href=reference
  name=value
  rel=same|next|parent|previous
  rev=value
  target=window
  style=value
  title=title
  onclick=function
  onmouseout=function
  onMouseOver=function>code for the displayed text or image</a>
For example, the <a> tags in one piece of HTML text read as follows:
<div id="pgt" class="bm bw0 pgs cl">
  <span id="fd_page_top">
    <div class="pg">
      <a href="forum-99-1.html" class="prev"></a>
      <a href="forum-99-1.html">1</a><strong>2</strong>
      <a href="forum-99-3.html">3</a>
      <a href="forum-99-4.html">4</a>
      <a href="forum-99-5.html">5</a>
      <a href="forum-99-6.html">6</a>
      <a href="forum-99-7.html">7</a>
      <a href="forum-99-8.html">8</a>
      <a href="forum-99-9.html">9</a>
      <a href="forum-99-10.html">10</a>
      <a href="forum-99-1000.html" class="last">...2107</a>
      <label>
        <input type="text" name="custompage" class="px" size="2" title="输入页码,按回车快速跳转" value="2" onkeydown="if(event.keyCode==13){window.location='forum.php?mod=forumdisplay&fid=99&page='+this.value;doane(event);}"/>
        <span title="共1000页">/1000页</span>
      </label>
      <a href="forum-99-3.html" class="nxt">下一页</a>
    </div>
  </span>
By matching against the <a> tags in the HTML text, it can be determined that the web page has one or more page-turning feature anchors.
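The anchor matching of step 101 might be sketched as follows. This is only a minimal illustration, assuming Python's standard `html.parser`; the class name `PagingAnchorFinder`, the function `has_paging_anchor`, and the reduced anchor set are hypothetical and are not part of the patent.

```python
from html.parser import HTMLParser

# A subset of the page-turning feature anchors listed in the description.
PAGING_ANCHORS = {"<<", ">>", "<", ">", "下一页", "上一页", "next", "末页", "尾页"}


class PagingAnchorFinder(HTMLParser):
    """Collects (anchor text, href) pairs for <a> elements whose text is a paging anchor."""

    def __init__(self):
        super().__init__()
        self.matches = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            if text.lower() in PAGING_ANCHORS:
                self.matches.append((text, self._href))
            self._href = None


def has_paging_anchor(html):
    """Return True when at least one <a> element carries a paging anchor text."""
    finder = PagingAnchorFinder()
    finder.feed(html)
    finder.close()
    return bool(finder.matches)
```

Applied to the markup above, the `下一页` link would be the element that triggers a positive result; the numbered links 1 to 10 do not match any anchor in this reduced set.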
Step 102: extract the associated URL (Uniform Resource Locator) to which the page-turning feature anchor links;
In practical applications, a page-turning feature anchor may link to one or more associated URLs.
Specifically, after the one or more page-turning feature anchors have been identified, the one or more associated URLs they link to are extracted. These associated URLs point to other page-turning web pages associated with the current web page.
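Because the `href` values in a page are often relative, step 102 in practice usually involves resolving them against the URL of the current page. The sketch below shows one possible way to do that with `urllib.parse.urljoin`; the function name `resolve_associated_urls` and the example host are assumptions made for illustration.

```python
from urllib.parse import urljoin


def resolve_associated_urls(page_url, hrefs):
    """Turn the href values of matched paging anchors into absolute URLs.

    Duplicates and links back to the page itself are dropped; order is kept.
    """
    resolved = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if absolute != page_url and absolute not in resolved:
            resolved.append(absolute)
    return resolved


# Hypothetical page URL and hrefs in the style of the example markup.
associated = resolve_associated_urls(
    "http://bbs.example.com/forum-99-2.html",
    ["forum-99-1.html", "forum-99-3.html", "forum-99-3.html"],
)
```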
Step 103: calculate the associated web page URL pattern corresponding to the specified web page from the URL of the specified web page and the associated URL to which the page-turning feature anchor links.
An associated web page URL pattern can be regarded as the set formed by grouping together URLs or web pages that look alike or serve similar functions.
In a preferred embodiment of the present invention, step 103 may specifically include the following sub-steps:
Sub-step S21: replace the digit blocks in the URL of the specified web page with wildcard characters to obtain a first feature URL prefix, where a digit block is a single digit or a run of digits delimited by separator marks;
Sub-step S31: replace the digit blocks in the associated URL with wildcard characters to obtain a second feature URL prefix;
It should be noted that the wildcard character may be any character; the embodiments of the present invention place no limitation on this. A separator mark may be any symbol used as a separator in a URL, such as "/", ".", "-", "?", or ":". A digit block must consist solely of consecutive digits between separator marks; "123ABC", for example, is not a digit block.
In a preferred example of the embodiments of the present invention, sub-step S21 may further include the following sub-step:
Sub-step S211: replace the digit blocks at different positions in the URL of the specified web page with the same wildcard character to obtain the first feature URL prefix;
Corresponding to sub-step S211, sub-step S31 may further include the following sub-step:
Sub-step S311: replace the digit blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
In a specific implementation, the URL of the specified web page and the associated URL may each contain one or more digit blocks. To reduce the number of replacement operations and the system resources consumed, all digit blocks can be replaced with the same wildcard character.
For example, suppose the URL of the specified web page is http://bbs.XXX.com/forum-99-2.html and the associated URL is http://bbs.XXX.com/forum-99-3.html, in which "99" and "2" (and, in the associated URL, "99" and "3") are identified as digit blocks. Taking "(\d+)" as an example wildcard, the first feature URL prefix is http://bbs.XXX.com/forum-(\d+)-(\d+).html and the second feature URL prefix is likewise http://bbs.XXX.com/forum-(\d+)-(\d+).html.
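The replacement in sub-steps S211 and S311 can be approximated with a regular expression. The sketch below is one possible reading, not the patent's implementation: it treats a digit block as a run of digits that is not adjacent to a letter or another digit, so that "123ABC" is left untouched. The names `DIGIT_BLOCK` and `feature_prefix` are hypothetical.

```python
import re

# A digit block: a run of digits bounded by separator marks ("/", ".", "-",
# "?", ":" ...) or by the ends of the string, never by a letter or digit.
DIGIT_BLOCK = re.compile(r"(?<![A-Za-z0-9])\d+(?![A-Za-z0-9])")

WILDCARD = r"(\d+)"  # the example wildcard used in the description


def feature_prefix(url):
    """Replace every digit block in the URL with the same wildcard."""
    # A function is used as the replacement so the backslash is inserted literally.
    return DIGIT_BLOCK.sub(lambda match: WILDCARD, url)


first = feature_prefix("http://bbs.XXX.com/forum-99-2.html")
second = feature_prefix("http://bbs.XXX.com/forum-99-3.html")
```

Both calls yield the same string, which is what allows the two prefixes to be compared directly in the later sub-step.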
In an embodiment of the present invention, sub-step S21 may further include the following sub-step:
Sub-step S212: replace the digit blocks at different positions in the URL of the specified web page with different wildcard characters to obtain the first feature URL prefix;
Corresponding to sub-step S212, sub-step S31 may further include the following sub-step:
Sub-step S312: replace each digit block in the associated URL with the same wildcard character that was used at the same position in the first feature URL prefix, to obtain the second feature URL prefix.
In a specific implementation, the URL of the specified web page and the associated URL may each contain one or more digit blocks. To make the subsequent comparison of the first and second feature URL prefixes, and the identification of individual digit blocks, more efficient, different wildcard characters may be used for digit blocks at different positions.
For example, suppose the URL of the specified web page is http://bbs.XXX.com/forum-99-2.html and the associated URL is http://bbs.XXX.com/forum-99-3.html, in which "99" and "2" (and, in the associated URL, "99" and "3") are identified as digit blocks. Taking "(\d+)" and "(\e+)" as example wildcards, the first feature URL prefix is http://bbs.XXX.com/forum-(\d+)-(\e+).html and the second feature URL prefix is likewise http://bbs.XXX.com/forum-(\d+)-(\e+).html.
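A position-dependent variant of the replacement could look like the sketch below. The placeholders `(\d+)`, `(\e+)`, ... are treated purely as labels that record the position of each digit block; apart from `\d` they are not valid regular-expression classes. The naming scheme beyond `(\e+)` and the function name `positional_prefix` are assumptions for illustration.

```python
import re

DIGIT_BLOCK = re.compile(r"(?<![A-Za-z0-9])\d+(?![A-Za-z0-9])")


def positional_prefix(url):
    """Replace the n-th digit block with the n-th label: (\\d+), (\\e+), (\\f+), ..."""
    index = 0

    def label(match):
        nonlocal index
        # Labels continue alphabetically from "d"; this continuation is an
        # assumption, the description only shows (\d+) and (\e+).
        placeholder = "(\\" + chr(ord("d") + index) + "+)"
        index += 1
        return placeholder

    return DIGIT_BLOCK.sub(label, url)


first = positional_prefix("http://bbs.XXX.com/forum-99-2.html")
second = positional_prefix("http://bbs.XXX.com/forum-99-3.html")
```

Because each position keeps its own label, a later step can refer to one specific digit block, for instance the one labelled `(\e+)`, without re-scanning the URL.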
Sub-step S41: when the first feature URL prefix is identical to the second feature URL prefix, use the first feature URL prefix or the second feature URL prefix as the associated web page URL pattern.
In practical applications, when the first feature URL prefix is identical to the second feature URL prefix, the specified web page and the web page at the associated URL can be determined to be associated page-turning web pages.
Because the first feature URL prefix and the second feature URL prefix are identical, either one may serve as the associated web page URL pattern.
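Putting the sub-steps together, the comparison in sub-step S41 might be sketched as follows. The helper `feature_prefix` repeats the single-wildcard replacement so that the block is self-contained; `compute_pattern` and its return convention (`None` when no associated URL shares the prefix) are hypothetical choices.

```python
import re

DIGIT_BLOCK = re.compile(r"(?<![A-Za-z0-9])\d+(?![A-Za-z0-9])")


def feature_prefix(url):
    """Replace every digit block with the wildcard (\\d+)."""
    return DIGIT_BLOCK.sub(lambda match: r"(\d+)", url)


def compute_pattern(page_url, associated_urls):
    """Return the shared feature prefix as the URL pattern, or None if nothing matches."""
    first = feature_prefix(page_url)
    for url in associated_urls:
        if feature_prefix(url) == first:
            return first  # both prefixes are equal, so either can be returned
    return None


pattern = compute_pattern(
    "http://bbs.XXX.com/forum-99-2.html",
    ["http://bbs.XXX.com/forum-99-3.html"],
)
no_pattern = compute_pattern(
    "http://bbs.XXX.com/forum-99-2.html",
    ["http://bbs.XXX.com/help.html"],
)
```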
The present invention uses page-turning feature anchors to identify associated web pages, which gives high recognition accuracy, and calculates the associated web page URL pattern from the URL of the specified web page and the associated URL, which is computationally efficient.
The present invention replaces digit blocks with wildcard characters to obtain a first feature URL prefix and a second feature URL prefix. When the first feature URL prefix is identical to the second feature URL prefix, either one is used as the associated web page URL pattern. Because matching is performed on the common part of the URLs, the recognition accuracy for associated web pages is further improved and the recall rate rises substantially; in practical applications, more than 90% of associated web pages can be identified.
Referring to FIG. 4, a flow chart of the steps of Embodiment 2 of a method for calculating an associated web page URL pattern according to an embodiment of the present invention is shown. The method may specifically include the following steps:
Step 401: determine whether the page elements of a specified web page include a page-turning feature anchor; if so, perform step 402;
Step 402: extract the associated URL to which the page-turning feature anchor links;
Step 403: calculate the associated web page URL pattern corresponding to the specified web page from the URL of the specified web page and the associated URL to which the page-turning feature anchor links;
Step 404: perform structural analysis on the common part of the associated web page URL pattern, extract the page-turning block from the pattern, and replace the page-turning block with a home page identifier to obtain the URL of the home associated web page;
Here, the page-turning block is a digit block that occupies the same position in multiple associated web page URL patterns but whose number differs.
In practical applications, a URL may include one or more of the following components:
1. protocol: specifies the transfer protocol used. The most common is HTTP, which is also the most widely used protocol on the WWW. Transfer protocols include the file protocol (the resource is a file on the local computer; format file:///), the ftp protocol (the resource is accessed via FTP; format ftp://), gopher (the resource is accessed via the Gopher protocol), the http protocol (the resource is accessed via HTTP; format http://), the https protocol (the resource is accessed via secure HTTPS; format https://), and so on.
2. hostname: the Domain Name System (DNS) host name or IP address of the server that holds the resource. The host name may sometimes be preceded by the user name and password needed to connect to the server (in the form username:password@).
3. port: each transfer protocol has a default port number; for example, the default port for http is 80. If the port is omitted, the default port number of the scheme is used. For security or other reasons, the port may be redefined on the server, that is, a non-standard port number may be used; in that case the port number cannot be omitted from the URL.
4. path: a string made up of zero or more segments separated by "/", generally used to denote a directory or file address on the host.
5. parameters: an optional component used to specify special parameters.
6. query: used to pass parameters to dynamic web pages (such as pages built with CGI, ISAPI, PHP/JSP/ASP/ASP.NET, and similar technologies). There may be several parameters separated by "&", with each parameter's name and value separated by "=".
7. fragment: used to designate a fragment within a network resource. For example, if a web page contains several glossary entries, a fragment can be used to jump directly to one of them.
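For readers who want to see these components on a concrete URL, Python's standard `urllib.parse.urlparse` splits a URL along the same lines. The URL below is invented for illustration and exercises every component in the list.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL that contains every component described above.
parts = urlparse(
    "http://user:secret@bbs.example.com:8080/forum/list.php;v=1?fid=99&page=2#top"
)

scheme = parts.scheme          # protocol
hostname = parts.hostname      # host name (credentials and port stripped)
credentials = (parts.username, parts.password)
port = parts.port              # explicit port, or None when omitted
path = parts.path
parameters = parts.params      # the ";v=1" part of the last path segment
query = parse_qs(parts.query)  # name/value pairs split on "&" and "="
fragment = parts.fragment
```

In the forum example of this description the page number sits in the path, but on sites that paginate through the query string (such as `page=2` here) the digit block of interest would appear in the query component instead.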
In a specific implementation, structural analysis is performed on the common part of multiple associated web page URL patterns, the page-turning block in the associated web page URL pattern is extracted, and the page-turning block is then replaced with a home page identifier to obtain the URL of the home associated web page.
For example, for the associated web page URL pattern from the example above, http://bbs.XXX.com/forum-(\d+)-(\e+).html, once (\e+) has been identified as the page-turning block and replaced with the home page identifier, the URL of the home associated web page is obtained: http://bbs.XXX.com/forum-99-1.html.
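One way to picture step 404 is to compare the digit blocks of several URLs that share a pattern: the position whose value changes from URL to URL is taken as the page-turning block, and only that position is overwritten with the home page identifier. The sketch below follows that reading; `home_page_url` and its default identifier "1" are assumptions, and the function expects URLs that already share one pattern.

```python
import re

DIGIT_BLOCK = re.compile(r"(?<![A-Za-z0-9])\d+(?![A-Za-z0-9])")


def home_page_url(urls, home_id="1"):
    """Replace the varying digit block(s) of the first URL with the home page identifier.

    `urls` must share one pattern, i.e. contain the same number of digit blocks.
    """
    blocks = [DIGIT_BLOCK.findall(url) for url in urls]
    # Positions whose value differs between the URLs are page-turning blocks.
    varying = {
        i for i in range(len(blocks[0]))
        if len({per_url[i] for per_url in blocks}) > 1
    }
    position = -1

    def replace(match):
        nonlocal position
        position += 1
        return home_id if position in varying else match.group(0)

    return DIGIT_BLOCK.sub(replace, urls[0])


home = home_page_url([
    "http://bbs.XXX.com/forum-99-2.html",
    "http://bbs.XXX.com/forum-99-3.html",
])
```

Here "99" is constant across both URLs and is kept, while the second block varies and is replaced, which reproduces the home URL given in the example.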
In a preferred example of the embodiments of the present invention, the home page identifier may include 0, 1, and/or the maximum value among the current associated web pages.
In a specific implementation, the home associated web page among a set of associated web pages usually carries the important content, such as the body block shown in FIG. 3, so it is comparatively important and identifying it is of considerable value. Different websites use different page-turning structures, however, so the home associated web page differs from site to site. For example, some websites use page 0 as the home associated web page, some use page 1, some use the highest-numbered page (for example, 2100 as shown in FIG. 3), and so on.
Of course, the home associated web pages above are only examples. When implementing the embodiments of the present invention, the page-turning block may be replaced with the identifier of any associated web page, according to actual conditions, to obtain the corresponding associated web page; the embodiments of the present invention do not describe each case in detail.
The present invention replaces the page-turning block of the associated web page URL pattern with a home page identifier to obtain the URL of the home associated web page. In the same way, the page-turning block can be replaced with the identifier of any other associated web page to obtain that page's URL. This increases the coverage of associated web pages, allows a more complete set of associated web pages to be obtained, and thereby enables fine-grained operations.
Referring to FIG. 5, a flow chart of the steps of an embodiment of a method for identifying a page number identifier in a web page URL according to an embodiment of the present invention is shown. The method may specifically include the following steps:
Step 501: obtain the associated URL to which a page-turning feature anchor among the page elements of a specified web page links;
In a preferred embodiment of the present invention, step 501 may specifically include the following sub-steps:
Sub-step S51: match the page-turning feature anchors against the DOM tree nodes of the specified web page;
Sub-step S52: when the match succeeds, obtain the associated URL from the matched page-turning feature anchor.
Step 502: compute the associated web page URL pattern from the URL of the specified web page and the associated URL;
In a preferred embodiment of the present invention, step 502 may specifically include the following sub-steps:
Sub-step S61: replace the digit blocks in the URL of the specified web page with a wildcard character to obtain a first feature URL prefix, where a digit block is a single digit or a run of digits delimited by separator marks;
Sub-step S71: replace the digit blocks in the associated URL with a wildcard character to obtain a second feature URL prefix;
In a preferred example of this embodiment, sub-step S61 may further include the following sub-step:
Sub-step S611: replace the digit blocks at different positions in the URL of the specified web page with the same wildcard character to obtain the first feature URL prefix;
Corresponding to sub-step S611, sub-step S71 may further include the following sub-step:
Sub-step S711: replace the digit blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
In one embodiment of the present invention, sub-step S61 may further include the following sub-step:
Sub-step S612: replace the digit blocks at different positions in the URL of the specified web page with different replacement characters, one per position, to obtain the first feature URL prefix;
Corresponding to sub-step S612, sub-step S71 may further include the following sub-step:
Sub-step S712: replace the digit block at each position in the associated URL with the same wildcard character that was used at that position for the first feature URL prefix, to obtain the second feature URL prefix.
Sub-step S81: when the first feature URL prefix is identical to the second feature URL prefix, use the first feature URL prefix (or, equivalently, the second) as the associated web page URL pattern.
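Sub-steps S61, S71 and S81 can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes a digit block is any maximal run of digits, and the names `feature_prefix` and `associated_url_pattern` are hypothetical.

```python
import re
from typing import Optional

# Assumption: a digit block is a maximal run of ASCII digits in the URL.
DIGIT_BLOCK = re.compile(r"[0-9]+")
WILDCARD = r"(\d+)"  # the same wildcard is used at every position (S611/S711)


def feature_prefix(url: str) -> str:
    """Replace every digit block in the URL with the wildcard (S61 / S71)."""
    # A function is used as the replacement so the backslash is kept literally.
    return DIGIT_BLOCK.sub(lambda _match: WILDCARD, url)


def associated_url_pattern(page_url: str, linked_url: str) -> Optional[str]:
    """Return the shared pattern when both feature prefixes agree (S81)."""
    first = feature_prefix(page_url)
    second = feature_prefix(linked_url)
    return first if first == second else None
```

Two pages of the same thread list collapse to one pattern, while a URL with a different structure yields `None`.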
Step 503: based on the associated web page URL pattern corresponding to the specified web page, determine the page number feature part of the specified web page URL and the page number feature part of the associated URL;
In a specific implementation, the common parts of multiple associated web page URL patterns are analyzed structurally to extract the page-turning block of the pattern, and the page-turning block is then replaced with a first-page identifier to obtain the URL of the first associated web page.
Structural analysis of the common parts of the associated web page URL patterns identifies the page number feature part of the pattern, that is, the page-turning block. Specifically, this may be the digit block that occupies the same position in multiple associated web page URL patterns but holds different digits.
Step 504: compare the page number feature parts of the specified web page URL and the associated URL, and take the digit part that differs as the page number identifier of the specified web page URL.
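One way to read steps 503 and 504 is as a position-by-position comparison of the digit blocks of the two URLs. The sketch below assumes that exactly one digit block differs between a page and the page it links to; `digit_blocks` and `page_number_block` are hypothetical names.

```python
import re
from typing import List, Optional, Tuple


def digit_blocks(url: str) -> List[str]:
    """Return the digit blocks of a URL in order of appearance."""
    return re.findall(r"[0-9]+", url)


def page_number_block(page_url: str, linked_url: str) -> Optional[Tuple[int, str]]:
    """Return (block index, page number of page_url), or None if ambiguous."""
    own, other = digit_blocks(page_url), digit_blocks(linked_url)
    if len(own) != len(other):
        return None  # the URLs do not share a pattern
    differing = [i for i, (a, b) in enumerate(zip(own, other)) if a != b]
    if len(differing) != 1:
        return None  # no difference, or more than one candidate block
    index = differing[0]
    return index, own[index]
```

For `forum-99-2.html` linking to `forum-99-3.html`, the second digit block (index 1) is the page-turning block and `2` is the page number of the specified page.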
In a specific implementation, the page number identifier may include a first-page identifier.
After the page-turning block has been extracted from the associated web page URL pattern, it may be replaced with the first-page identifier to obtain the URL of the first associated web page.
For example, for the associated web page URL pattern http://bbs.XXX.com/forum-(\d+)-(\e+).html from the example above, once (\e+) is identified as the page-turning block and replaced with the first-page identifier, the URL of the first associated web page is obtained: http://bbs.XXX.com/forum-99-1.html.
In a preferred example of this embodiment, the first-page identifier may be 0, 1, and/or the largest value among the current associated web pages.
In a specific implementation, the first page of a set of associated web pages generally carries the important content, such as the body block shown in FIG. 3, so the first associated web page is comparatively important and identifying it is worthwhile. Different websites use different pagination structures, which leads to different first pages. For example, some websites use page 0 as the first associated web page, some use page 1, and some use the largest page (for example, 2100 as shown in FIG. 3).
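Replacing the page-turning block with a first-page identifier can be sketched as below. The candidate identifiers 0, 1 and the largest page number come from the text; the function names, and the idea of returning all three candidates for the caller to choose from, are assumptions made for illustration.

```python
import itertools
import re
from typing import List


def replace_block(url: str, index: int, value: str) -> str:
    """Replace the digit block at position `index` (0-based) with `value`."""
    position = itertools.count()
    return re.sub(
        r"[0-9]+",
        lambda m: value if next(position) == index else m.group(0),
        url,
    )


def first_page_candidates(url: str, index: int, max_page: int) -> List[str]:
    """Build candidate first-page URLs for identifiers 0, 1 and max_page."""
    return [replace_block(url, index, str(v)) for v in (0, 1, max_page)]
```

Which candidate is the real first page depends on the site, so a caller would still need a site-specific rule or a fetch to decide.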
Of course, the first associated web page described above is only an example. When implementing the embodiments of the present invention, the digit block may be replaced with the identifier of any associated web page to obtain that page, as the situation requires; the embodiments do not enumerate these cases.
The present invention identifies associated web pages by means of the page-turning feature anchor, which gives high recognition accuracy. It computes the associated web page URL pattern from the URL of the specified web page and the associated URL, which is computationally efficient, and it compares the common parts of URLs, which greatly improves recall. In practical applications, more than 90% of associated web pages can be identified.
The present invention replaces digit blocks with wildcard characters to obtain a first feature URL prefix and a second feature URL prefix; when the two prefixes are identical, either one is used as the associated web page URL pattern. Matching on the common parts of URLs in this way further improves the accuracy with which associated web pages are identified.
The present invention replaces the page-turning block of the associated web page URL pattern with the first-page identifier to obtain the URL of the first associated web page. In the same way, the page-turning block can be replaced with the identifier of another linked page to obtain the URL of that associated web page. This increases the coverage of associated web pages, makes a more complete set of associated web pages obtainable, and enables fine-grained operations.
Referring to FIG. 6, a flow chart of the steps of an embodiment of a method for building an associated web page database according to one embodiment of the present invention is shown. The method may specifically include the following steps:
Step 601: determine whether a crawled web page includes an associated web page URL pattern; if so, perform step 602;
It should be noted that a search engine may retrieve web pages from the World Wide Web automatically by means of a web crawler, also called a web spider. A web spider finds pages by following link addresses: starting from one page of a website (usually the home page), it reads the page content, finds the other link addresses in the page, and follows them to the next pages, repeating the process until every page of the site has been crawled. If the whole Internet is regarded as one website, a web spider can use the same principle to crawl every page on the Internet.
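The crawl loop described above amounts to a breadth-first traversal of the link graph. The sketch below is an illustration only: `fetch_links` is a hypothetical callback that returns the links found on a page, so the example needs no network access.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(start_url: str, fetch_links: Callable[[str], Iterable[str]]) -> List[str]:
    """Visit every page reachable from start_url exactly once, breadth first."""
    seen = {start_url}
    queue = deque([start_url])
    visited: List[str] = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # skip pages that are already scheduled
                seen.add(link)
                queue.append(link)
    return visited
```

With an in-memory link table, `crawl("/", lambda u: table.get(u, []))` returns the pages in the order they are discovered.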
The associated web page URL pattern may be the common part (pattern) of paginated web pages, that is, the set formed by grouping together URLs or web pages that are similar in appearance or function.
In a preferred embodiment of the present invention, step 601 may specifically include the following sub-steps:
Sub-step S91: determine whether the page elements of the current web page contain a page-turning feature string; if so, extract the URL to which the page-turning feature string links;
Referring to FIG. 3, an example of a page-turning block according to one embodiment of the present invention is shown.
As shown in FIG. 3, a page-turning block may consist mainly of page-turning feature strings (that is, page-turning feature anchors), and a page-turning feature string may be a page element that marks pagination.
In a specific implementation, the page-turning feature strings may include one or more of the following:
[<<], [>>], [<  <], [>  >], [《], [》], [>], [<], [下一页], [上一页], [上一], [下一], [next], [末页], [尾页], [前页], [后页], [<上一页], [<上一], [下一>], [下一页>], [1...]. The Chinese entries are literal anchor texts: 下一页 and 下一 mean "next page", 上一页 and 上一 mean "previous page", 末页 and 尾页 mean "last page", 前页 means "previous page", and 后页 means "next page".
Of course, the page-turning feature strings above are only examples. When implementing the embodiments of the present invention, other page-turning feature strings may be configured as the situation requires; the embodiments of the present invention place no limitation on this.
It should be noted that the current web page may be a web page that has been crawled.
In a preferred embodiment of the present invention, sub-step S91 may further include the following sub-steps:
Sub-step S911: match the page-turning feature strings against the DOM tree nodes of the current web page;
Sub-step S912: when a match succeeds, determine that the current web page has a page-turning feature string.
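Sub-steps S911 and S912 can be approximated with the standard-library HTML parser by comparing the text of each link against a set of page-turning feature strings. This is a sketch under assumptions: only `<a>` elements are inspected, only a subset of the feature strings is listed, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional

# Assumption: a subset of the page-turning feature strings, matched exactly.
PAGE_TURN_STRINGS = {"<<", ">>", "<", ">", "《", "》", "next",
                     "下一页", "上一页", "末页", "尾页"}


class PageTurnAnchorFinder(HTMLParser):
    """Collect the href of every <a> whose text is a page-turning string."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._href: Optional[str] = None
        self._text: List[str] = []
        self.linked_urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).strip().lower()
            if label in PAGE_TURN_STRINGS:
                self.linked_urls.append(self._href)
            self._href = None


def page_turn_links(html: str) -> List[str]:
    """Return the URLs linked from page-turning anchors, in document order."""
    finder = PageTurnAnchorFinder()
    finder.feed(html)
    finder.close()
    return finder.linked_urls
```

A non-empty result corresponds to a successful match in S912, and the returned URLs are the ones that sub-step S91 extracts.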
Sub-step S92: replace the digit blocks in the URL of the current web page with a preset replacement character to obtain a first feature URL prefix, where a digit block is a single digit or a run of digits delimited by separator marks;
Sub-step S93: replace the digit blocks in the URL linked by the page-turning feature string with a preset replacement character to obtain a second feature URL prefix;
In one embodiment of the present invention, sub-step S92 may further include the following sub-step:
Sub-step S921: replace the digit blocks at different positions in the URL of the current web page with the same replacement character to obtain the first feature URL prefix;
Corresponding to sub-step S921, sub-step S93 may further include the following sub-step:
Sub-step S931: replace the digit blocks at different positions in the URL linked by the page-turning feature string with the same replacement character to obtain the second feature URL prefix.
In one embodiment of the present invention, sub-step S92 may further include the following sub-step:
Sub-step S922: replace the digit blocks at different positions in the URL of the current web page with different replacement characters, one per position, to obtain the first feature URL prefix;
Corresponding to sub-step S922, sub-step S93 may further include the following sub-step:
Sub-step S932: replace the digit block at each position in the URL linked by the page-turning feature string with the same replacement character that was used at that position for the first feature URL prefix, to obtain the second feature URL prefix.
Sub-step S94: when the first feature URL prefix is identical to the second feature URL prefix, determine that the crawled web page includes an associated web page URL pattern.
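The variant with a different replacement character per position (S922 and S932) can be sketched as follows. The placeholders `(\d+)`, `(\e+)` and so on mirror the notation used in the document's example pattern and serve only as positional markers, not as real regular-expression classes; `positional_prefix` is a hypothetical name.

```python
import itertools
import re
import string


def positional_prefix(url: str) -> str:
    """Replace the n-th digit block with the n-th placeholder: d, e, f, ..."""
    position = itertools.count()

    def placeholder(_match: "re.Match[str]") -> str:
        # Letters start at 'd'; a URL with more than 23 digit blocks
        # would raise IndexError in this simplified sketch.
        letter = string.ascii_lowercase[3 + next(position)]
        return "(\\" + letter + "+)"

    return re.sub(r"[0-9]+", placeholder, url)


def includes_associated_pattern(page_url: str, linked_url: str) -> bool:
    """S94: the page has an associated pattern when both prefixes agree."""
    return positional_prefix(page_url) == positional_prefix(linked_url)
```

Because each position has its own marker, a later step can refer to a specific block, such as the page-turning block `(\e+)`, without counting positions again.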
Step 602: obtain the associated web page URL pattern;
In one embodiment of the present invention, step 602 may specifically include the following sub-step:
Sub-step S101: use the first feature URL prefix (or, equivalently, the second) as the associated web page URL pattern corresponding to the current web page.
When the page elements of the current web page contain a page-turning feature string, the present invention replaces the digit blocks in the URL of the current web page with a preset replacement character to obtain a first feature URL prefix, and replaces the digit blocks in the URL linked by the page-turning feature string with a preset replacement character to obtain a second feature URL prefix. When the two prefixes are identical, either one is used as the associated web page URL pattern corresponding to the current web page. Identifying associated web pages by the page-turning feature string gives high recognition accuracy, and matching on the common parts of URLs improves accuracy further and greatly improves recall. In practical applications, more than 90% of associated web pages can be identified.
Step 603: obtain the corresponding associated web pages based on the associated web page URL pattern;
In a specific implementation, the associated web pages may include a first associated web page and other associated web pages. The first associated web page generally carries the important content, such as the body block shown in FIG. 3, so it is comparatively important and identifying it is worthwhile.
In a preferred embodiment of the present invention, step 603 may specifically include the following sub-steps:
Sub-step S111: analyze the common parts of the associated web page URL patterns structurally, extract the page-turning block of the pattern, and replace the page-turning block with the first-page identifier to obtain the URL of the first associated web page, where the page-turning block is the digit block that occupies the same position in multiple associated web page URL patterns but holds different digits;
Sub-step S112: access the URL of the first associated web page to obtain the first associated web page.
In a preferred example of this embodiment, the first-page identifier may be 0, 1, and/or the largest value among the current associated web pages.
The present invention replaces the page-turning block of the associated web page URL pattern with the first-page identifier to obtain the URL of the first associated web page. In the same way, the page-turning block can be replaced with the identifier of another linked page to obtain the URL of that associated web page. This increases the coverage of associated web pages, makes a more complete set of associated web pages obtainable, and enables fine-grained operations.
Step 604: build the associated web page database from the associated web pages corresponding to the associated web page URL pattern.
In a specific implementation, the associated web pages corresponding to the associated web page URL pattern may include the first associated web page and other associated web pages, and may be all of the associated web pages or only some of them; the embodiments of the present invention place no limitation on this.
As a preferred example, the web page files crawled by the spider may be processed as follows:
1. Page structuring: the HTML code of the associated web page is stripped and the page content is extracted.
2. Noise removal: with the HTML code already stripped and only the page content left, noise removal keeps the main content of the page and deletes content of no use, such as copyright notices.
3. Deduplication: duplicate pages and content are looked up, and any duplicate page found is deleted.
4. Word segmentation: the extracted page content is split into N words, which are listed and stored in the index library together with a count of how many times each word occurs on the page.
5. Link analysis: the backlinks of the page, the number of outbound links, and the internal links are examined, and the page is assigned a weight accordingly.
After the data processing above, the processed data can be stored in the associated web page database.
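The structuring, deduplication and word-counting steps above can be sketched with the standard library. Several simplifications are assumed: noise removal is reduced to dropping `<script>` and `<style>` content, duplicates are detected by hashing the extracted text, words are split with a plain `\w+` match (Chinese text would need a real segmenter), and link analysis is omitted. All names are hypothetical.

```python
import hashlib
import re
from collections import Counter
from html.parser import HTMLParser
from typing import Dict, List


class _TextExtractor(HTMLParser):
    """Collect text content, ignoring script and style elements."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html: str) -> str:
    """Step 1: strip the HTML and normalise whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def build_index(pages: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Steps 3 and 4: skip duplicate pages, then count words per page."""
    seen_digests = set()
    index: Dict[str, Dict[str, int]] = {}
    for url, html in pages.items():
        text = extract_text(html)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue  # identical content was already indexed
        seen_digests.add(digest)
        for term, count in Counter(re.findall(r"\w+", text.lower())).items():
            index.setdefault(term, {})[url] = count
    return index
```

The result maps each word to the pages that contain it and the number of occurrences, which is the shape an index library lookup would need.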
The present invention extracts the associated web page URL pattern from the web page currently crawled and builds the associated web page database from the associated web pages corresponding to that pattern. This avoids crawling the same pages repeatedly, reduces the use of system resources, and greatly improves the efficiency of building the database.
Referring to FIG. 7, a flow chart of the steps of an embodiment of an associated web page search method according to one embodiment of the present invention is shown. The method may specifically include the following steps:
Step 701: receive a search request, the request including a search keyword;
A search request may be a request issued by a user to search for information associated with a search keyword. For example, when the user enters a search keyword in the browser address bar, the search bar, or the keyword input box of a search engine and presses Enter or clicks the search button, the user's search request has been received.
Step 702: look up the search keyword in the preset associated web page database to obtain the web pages that match the keyword;
An associated web page database is preset in the back end of the search engine to store information about the associated web pages that have been collected. The collected information generally consists of keywords or phrases that indicate the content of the associated web page, including the page itself, its URL, the code that makes up the page, and the links into and out of the page.
As a preferred example, the search input entered by the user may first be segmented into a keyword sequence denoted q, so that the query is split into q = {q1, q2, q3, ..., qn}. Then the importance of each word in the presentation of the query results is determined from the form of the query (for example, whether all the words are run together or separated by spaces) and from the part of speech of each keyword in q. Once the search term set q has been segmented, the ranked URLs for each keyword in q are available from the index library and the importance of each keyword has been computed, so a combined ranking algorithm over these inputs yields the search results.
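A minimal sketch of the lookup described above, assuming an index that maps each word to per-page occurrence counts. Splitting the query on whitespace and ranking by summed counts are stand-ins chosen for illustration; the text does not specify the segmentation or the ranking algorithm, and part-of-speech weighting is left out.

```python
from collections import defaultdict
from typing import Dict, List


def search(query: str, index: Dict[str, Dict[str, int]]) -> List[str]:
    """Return matching URLs, best first, for a whitespace-separated query."""
    scores: Dict[str, int] = defaultdict(int)
    for keyword in query.lower().split():  # q = {q1, q2, ..., qn}
        for url, count in index.get(keyword, {}).items():
            scores[url] += count
    # Higher score first; ties are broken by URL for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))
```

A page that matches several keywords accumulates their counts and therefore ranks above a page that matches only one.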
In a preferred embodiment of the present invention, the associated web page database may be built as follows:
Sub-step S101: determine whether a crawled web page includes an associated web page URL pattern; if so, perform sub-step S102;
In a preferred embodiment of the present invention, sub-step S101 may specifically include the following sub-steps:
Sub-step S121: determine whether the page elements of the current web page contain a page-turning feature string; if so, extract the URL to which the page-turning feature string links;
In a preferred embodiment of the present invention, sub-step S121 may further include the following sub-steps:
Sub-step S1211: match the page-turning feature strings against the DOM tree nodes of the current web page;
Sub-step S1212: when a match succeeds, determine that the current web page has a page-turning feature string.
Sub-step S122: replace the digit blocks in the URL of the current web page with a preset replacement character to obtain a first feature URL prefix, where a digit block is a single digit or a run of digits delimited by separator marks;
Sub-step S123: replace the digit blocks in the URL linked by the page-turning feature string with a preset replacement character to obtain a second feature URL prefix;
In one embodiment of the present invention, sub-step S122 may further include the following sub-step:
Sub-step S1221: replace the digit blocks at different positions in the URL of the current web page with the same replacement character to obtain the first feature URL prefix;
Corresponding to sub-step S1221, sub-step S123 may further include the following sub-step:
Sub-step S1231: replace the digit blocks at different positions in the URL linked by the page-turning feature string with the same replacement character to obtain the second feature URL prefix.
In one embodiment of the present invention, sub-step S122 may further include the following sub-step:
Sub-step S1222: replace the digit blocks at different positions in the URL of the current web page with different replacement characters, one per position, to obtain the first feature URL prefix;
Corresponding to sub-step S1222, sub-step S123 may further include the following sub-step:
Sub-step S1232: replace the digit block at each position in the URL linked by the page-turning feature string with the same replacement character that was used at that position for the first feature URL prefix, to obtain the second feature URL prefix.
Sub-step S124: when the first feature URL prefix is identical to the second feature URL prefix, determine that the crawled web page includes an associated web page URL pattern.
Sub-step S102: obtain the associated web page URL pattern;
In one embodiment of the present invention, sub-step S102 may specifically include the following sub-step:
Sub-step S131: use the first feature URL prefix (or, equivalently, the second) as the associated web page URL pattern corresponding to the current web page.
When the page elements of the current web page contain a page-turning feature string, the present invention replaces the digit blocks in the URL of the current web page with a preset replacement character to obtain a first feature URL prefix, and replaces the digit blocks in the URL linked by the page-turning feature string with a preset replacement character to obtain a second feature URL prefix. When the two prefixes are identical, either one is used as the associated web page URL pattern corresponding to the current web page. Identifying associated web pages by the page-turning feature string gives high recognition accuracy, and matching on the common parts of URLs improves accuracy further and greatly improves recall. In practical applications, more than 90% of associated web pages can be identified.
Sub-step S103: obtain the corresponding associated web pages using the associated web page URL pattern;
In a preferred embodiment of the present invention, sub-step S103 may specifically include the following sub-steps:
Sub-step S141: analyze the common parts of the associated web page URL patterns structurally, extract the page-turning block of the pattern, and replace the page-turning block with the first-page identifier to obtain the URL of the first associated web page, where the page-turning block is the digit block that occupies the same position in multiple associated web page URL patterns but holds different digits;
Sub-step S142: access the URL of the first associated web page to obtain the first associated web page.
The present invention replaces the page-turning block of the associated web page URL pattern with the first-page identifier to obtain the URL of the first associated web page. In the same way, the page-turning block can be replaced with the identifier of another linked page to obtain the URL of that associated web page. This increases the coverage of associated web pages, makes a more complete set of associated web pages obtainable, and enables fine-grained operations.
Sub-step S104: build the associated web page database from the associated web pages corresponding to the associated web page URL pattern.
Step 703: determine whether the web page is an associated web page; if so, perform step 704;
In a specific implementation, whether the web page is an associated web page can be determined by checking whether it includes an associated web page URL pattern: when the web page includes an associated web page URL pattern, it is determined to be an associated web page.
Step 704: return the web page together with the first-page information associated with it.
The embodiments of the present invention may store the correspondence between associated web page URL patterns and their web pages, so that the first page associated with a web page can be obtained simply by looking up the pattern of that web page in the stored correspondence.
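The stored correspondence can be pictured as a table keyed by URL pattern. In the sketch below each entry records which digit block is the page-turning block and which first-page identifier the site uses; this table layout and the function names are assumptions made for illustration, since the text only says that a correspondence is stored.

```python
import itertools
import re
from typing import Dict, Optional, Tuple

# Hypothetical rule table: pattern -> (index of page-turning block, first-page id)
Rules = Dict[str, Tuple[int, str]]


def feature_prefix(url: str) -> str:
    """Replace every digit block with the same wildcard."""
    return re.sub(r"[0-9]+", lambda _m: r"(\d+)", url)


def first_page_for(url: str, rules: Rules) -> Optional[str]:
    """Return the first-page URL for a search hit, or None if not paginated."""
    rule = rules.get(feature_prefix(url))
    if rule is None:
        return None  # the hit is not an associated web page (step 703)
    index, first_id = rule
    position = itertools.count()
    return re.sub(
        r"[0-9]+",
        lambda m: first_id if next(position) == index else m.group(0),
        url,
    )
```

A search hit on page 7 of a thread list can then be returned together with the first page of that list, as in step 704.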
Once the search results have been obtained, the search engine can display them in the interface viewed by the user.
When the present invention determines that a web page matching the keyword is an associated web page, it returns that web page together with the first-page information associated with it. This spares the user from searching again or looking for the first page, which further reduces the number of system operations, reduces the use of system resources, and improves search efficiency.
For simplicity of description, the method embodiments are each presented as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the order of actions described, because according to the present invention some steps may be performed in a different order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
参照图8,示出了本发明一个实施例的一种计算关联网页URL模式pattern的装置实施例1的结构框图,具体可以包括如下模块:Referring to FIG. 8 , a block diagram of a device embodiment 1 for calculating an associated web page URL pattern pattern according to an embodiment of the present invention is shown, which may specifically include the following modules:
翻页特征anchor判断模块801,适于判断指定网页的页面元素中是否具有翻页特征anchor;若是,则调用关联URL提取模块802;The page turning feature anchor determining module 801 is adapted to determine whether the page element of the specified web page has a page turning feature anchor; if so, the associated URL extracting module 802 is invoked;
URL提取模块802,适于提取所述翻页特征anchor对应链接到的关联URL;The URL extraction module 802 is adapted to extract an associated URL to which the page turning feature anchor is linked;
关联网页URL模式pattern计算模块803,适于根据所述指定网页的URL以及所述翻页特征anchor对应链接到的关联URL计算与所述指定网页对应的关联网页URL模式pattern。The associated webpage URL pattern calculation module 803 is adapted to calculate an associated webpage URL pattern pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked.
In a preferred embodiment of the present invention, the page-turning feature anchor determination module 801 may be further adapted to:
match the page-turning feature anchor against the DOM tree nodes of the current web page;
when the match succeeds, determine that the current web page has a page-turning feature anchor.
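The anchor matching performed by module 801, together with the URL extraction of module 802, can be sketched roughly as follows. This is a minimal illustration only, using Python's standard `html.parser` rather than a full DOM tree. The set `PAGE_TURN_ANCHORS` and the function name `find_associated_urls` are hypothetical; this section of the specification does not enumerate the actual anchor texts.

```python
from html.parser import HTMLParser

# Hypothetical feature list (assumption): the anchor texts treated as
# page-turning features are not enumerated in this section.
PAGE_TURN_ANCHORS = {"next", "previous", "next page", "previous page", "下一页", "上一页"}


class PageTurnAnchorFinder(HTMLParser):
    """Collects href values of <a> elements whose text is a page-turning anchor."""

    def __init__(self):
        super().__init__()
        self._href = None           # href of the <a> element currently open, if any
        self._text = []             # text fragments seen inside that element
        self.associated_urls = []   # hrefs of matching anchors, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor_text = "".join(self._text).strip().lower()
            if anchor_text in PAGE_TURN_ANCHORS:
                self.associated_urls.append(self._href)
            self._href = None


def find_associated_urls(html):
    """Return the URLs linked from page-turning anchors; empty if there are none."""
    finder = PageTurnAnchorFinder()
    finder.feed(html)
    finder.close()
    return finder.associated_urls
```

An empty result corresponds to the case in which the specified web page has no page-turning feature anchor, so no pattern is calculated for it.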
In a preferred embodiment of the present invention, the page-turning feature anchor may link to one or more associated URLs.
In a preferred embodiment of the present invention, the associated web page URL pattern calculation module 803 may specifically include the following modules:
a first feature URL prefix obtaining module, adapted to replace the digit blocks in the URL of the specified web page with a wildcard character to obtain a first feature URL prefix, wherein a digit block is a single digit or a plurality of digits delimited by separators;
a second feature URL prefix obtaining module, adapted to replace the digit blocks in the associated URL with a wildcard character to obtain a second feature URL prefix;
an associated web page URL pattern obtaining module, adapted to use the first feature URL prefix or the second feature URL prefix as the associated web page URL pattern when the first feature URL prefix is identical to the second feature URL prefix.
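The three sub-modules above can be sketched as follows. This is an illustrative sketch under stated assumptions: the regular expression `DIGIT_BLOCK` treats any run of digits that is not adjacent to a letter or another digit as a digit block (so `/`, `_`, `-`, `.`, `=` and similar characters act as separators), and `*` is used as the wildcard character. The function names and example URLs are hypothetical.

```python
import re

# Assumption: a digit block is a run of digits bounded on both sides by a
# non-alphanumeric separator or by the start/end of the URL.
DIGIT_BLOCK = re.compile(r"(?<![A-Za-z0-9])\d+(?![A-Za-z0-9])")


def feature_url_prefix(url, wildcard="*"):
    """Replace every digit block in the URL with the same wildcard character."""
    return DIGIT_BLOCK.sub(wildcard, url)


def associated_url_pattern(page_url, associated_url):
    """Return the shared pattern if both feature URL prefixes match, else None."""
    first = feature_url_prefix(page_url)         # first feature URL prefix
    second = feature_url_prefix(associated_url)  # second feature URL prefix
    return first if first == second else None
```

For example, under these assumptions `http://example.com/news/2013/list_2.html` and `http://example.com/news/2013/list_3.html` both reduce to `http://example.com/news/*/list_*.html`, which is then taken as the associated web page URL pattern.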
In a preferred embodiment of the present invention, the first feature URL prefix obtaining module may be further adapted to:
replace the digit blocks at different positions in the URL of the specified web page with the same wildcard character to obtain the first feature URL prefix;
and the second feature URL prefix obtaining module may be further adapted to:
replace the digit blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
In a preferred embodiment of the present invention, the first feature URL prefix obtaining module may be further adapted to:
replace the digit blocks at different positions in the URL of the specified web page with different wildcard characters, respectively, to obtain the first feature URL prefix;
and the second feature URL prefix obtaining module may be further adapted to:
replace the digit block at each position in the associated URL with the same wildcard character that was used at that position for the first feature URL prefix, to obtain the second feature URL prefix.
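The position-specific variant can be sketched in the same way. In this illustration, the placeholders `<d1>`, `<d2>`, ... stand in for the distinct wildcard characters; they are an assumption made for readability, as are the digit block definition and the function name.

```python
import re
from itertools import count

# Assumption: a digit block is a run of digits not adjacent to a letter or digit.
DIGIT_BLOCK = re.compile(r"(?<![A-Za-z0-9])\d+(?![A-Za-z0-9])")


def positional_feature_prefix(url):
    """Replace the i-th digit block with a placeholder specific to position i."""
    position = count(1)
    # Each match consumes the next position number, so block 1 becomes <d1>, etc.
    return DIGIT_BLOCK.sub(lambda _match: f"<d{next(position)}>", url)
```

Because the same function is applied to the specified web page URL and to the associated URL, the digit block at a given position receives the same placeholder in both, and the two prefixes can still be compared for equality.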
Because the apparatus embodiment of FIG. 8 is substantially similar to the method embodiment of FIG. 1, it is described only briefly; for related details, refer to the corresponding description of the method embodiment.
Referring to FIG. 9, a structural block diagram of apparatus embodiment 2 for calculating an associated web page URL pattern according to an embodiment of the present invention is shown. The apparatus may specifically include the following modules:
a page-turning feature anchor determination module 901, adapted to determine whether the page elements of a specified web page include a page-turning feature anchor and, if so, to invoke the associated URL extraction module 902;
a URL extraction module 902, adapted to extract the associated URL to which the page-turning feature anchor links;
an associated web page URL pattern calculation module 903, adapted to calculate the associated web page URL pattern corresponding to the specified web page from the URL of the specified web page and the associated URL to which the page-turning feature anchor links;
a first-page associated web page URL obtaining module 904, adapted to perform structural analysis on the common part of the associated web page URL patterns, extract the page-turning block from the associated web page URL pattern, and replace the page-turning block with a first-page identifier to obtain the URL of the first-page associated web page; wherein the page-turning block is a digit block that occupies the same position but holds different digits across a plurality of associated web page URL patterns.
In a preferred example of an embodiment of the present invention, the first-page identifier may include 0, 1, and/or the maximum value among the current associated web pages.
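The behaviour of module 904 can be sketched as follows. This is an illustrative sketch, not the specification's implementation: it assumes the page-turning block is the first digit block whose value differs across the associated URLs, and it defaults the first-page identifier to `"1"`. The function names, the digit block definition, and the example URLs are assumptions.

```python
import re

# Assumption: a digit block is a run of digits not adjacent to a letter or digit.
DIGIT_BLOCK = re.compile(r"(?<![A-Za-z0-9])\d+(?![A-Za-z0-9])")


def page_turn_block_index(urls):
    """Index of the first digit block whose value differs across urls, or None."""
    blocks = [DIGIT_BLOCK.findall(url) for url in urls]
    # All URLs must contain the same number of digit blocks to be comparable.
    if not blocks or len({len(b) for b in blocks}) != 1:
        return None
    for index, values in enumerate(zip(*blocks)):
        if len(set(values)) > 1:
            return index
    return None


def first_page_url(url, block_index, first_page_id="1"):
    """Replace the digit block at block_index with the first-page identifier."""
    match = list(DIGIT_BLOCK.finditer(url))[block_index]
    return url[:match.start()] + first_page_id + url[match.end():]
```

With `http://example.com/2013/list_2.html` and `http://example.com/2013/list_3.html`, the block `2013` is identical in both URLs and is left alone, while the second block differs and is replaced to give `http://example.com/2013/list_1.html`. Passing `"0"` or the largest page number as `first_page_id` covers the other identifiers mentioned above.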
Because the apparatus embodiment of FIG. 9 is substantially similar to the method embodiment of FIG. 4, it is described only briefly; for related details, refer to the corresponding description of the method embodiment.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for calculating an associated web page URL pattern according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, FIG. 10 shows a computing device, such as a user terminal device or an application server, that can implement the calculation of an associated web page URL pattern according to the present invention. The computing device conventionally includes a processor 1010 and a computer program product or computer-readable medium in the form of a memory 1020. The memory 1020 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. The memory 1020 has a storage space 1030 for program code 1031 for performing any of the method steps described above. For example, the storage space 1030 for program code may include individual program codes 1031, each implementing one of the steps of the above method. The program code can be read from, or written to, one or more computer program products. These computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 11. The storage unit may have storage segments, storage spaces, and the like arranged similarly to the memory 1020 in the computing device of FIG. 10. The program code may, for example, be compressed in a suitable form. Typically, the storage unit includes computer-readable code 1031', that is, code that can be read by a processor such as the processor 1010, and which, when run by a computing device, causes the computing device to perform the steps of the method described above.
"One embodiment", "an embodiment", or "one or more embodiments" as used herein means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Note also that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.
Numerous specific details are set forth in the description provided herein. It will be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
In addition, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the present invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the present invention, the disclosure made herein is illustrative and not restrictive, and the scope of the present invention is defined by the appended claims.

Claims (24)

  1. A method for calculating an associated web page URL pattern, comprising:
    determining whether the page elements of a specified web page include a page-turning feature anchor; if so, extracting the associated URL to which the page-turning feature anchor links;
    calculating the associated web page URL pattern corresponding to the specified web page from the URL of the specified web page and the associated URL to which the page-turning feature anchor links.
  2. The method of claim 1, wherein the step of determining whether the page elements of the specified web page include a page-turning feature anchor comprises:
    matching the page-turning feature anchor against the DOM tree nodes of the current web page;
    when the match succeeds, determining that the current web page has a page-turning feature anchor.
  3. The method of claim 1, wherein the page-turning feature anchor links to one or more associated URLs.
  4. The method of claim 1, 2, or 3, wherein the step of calculating the associated web page URL pattern from the URL of the specified web page and the associated URL further comprises:
    replacing the digit blocks in the URL of the specified web page with a wildcard character to obtain a first feature URL prefix, wherein a digit block is a single digit or a plurality of digits delimited by separators;
    replacing the digit blocks in the associated URL with a wildcard character to obtain a second feature URL prefix;
    when the first feature URL prefix is identical to the second feature URL prefix, using the first feature URL prefix or the second feature URL prefix as the associated web page URL pattern.
  5. The method of claim 4, wherein the step of replacing the digit blocks in the URL of the specified web page with a wildcard character to obtain the first feature URL prefix is:
    replacing the digit blocks at different positions in the URL of the specified web page with the same wildcard character to obtain the first feature URL prefix;
    and the step of replacing the digit blocks in the associated URL with a wildcard character to obtain the second feature URL prefix is:
    replacing the digit blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
  6. The method of claim 5, wherein the step of replacing the digit blocks in the URL of the specified web page with a wildcard character to obtain the first feature URL prefix is:
    replacing the digit blocks at different positions in the URL of the specified web page with different wildcard characters, respectively, to obtain the first feature URL prefix;
    and the step of replacing the digit blocks in the associated URL with a wildcard character to obtain the second feature URL prefix is:
    replacing the digit block at each position in the associated URL with the same wildcard character that was used at that position for the first feature URL prefix, to obtain the second feature URL prefix.
  7. The method of claim 1, 2, 3, 5, or 6, further comprising:
    performing structural analysis on the common part of the associated web page URL patterns, extracting the page-turning block from the associated web page URL pattern, and replacing the page-turning block with a first-page identifier to obtain the URL of the first-page associated web page; wherein the page-turning block is a digit block that occupies the same position but holds different digits across a plurality of associated web page URL patterns.
  8. The method of claim 7, wherein the first-page identifier comprises 0, 1, and/or the maximum value among the current associated web pages.
  9. A method for identifying a page number identifier in a web page URL, comprising:
    obtaining the associated URL to which a page-turning feature anchor in the page elements of a specified web page links;
    calculating an associated web page URL pattern from the URL of the specified web page and the associated URL;
    determining, based on the associated web page URL pattern corresponding to the specified web page, the page number feature portion of the URL of the specified web page and the page number feature portion of the associated URL, respectively;
    comparing the page number feature portions of the URL of the specified web page and of the associated URL, and extracting the differing digit portion as the page number identifier of the URL of the specified web page.
  10. A method for establishing an associated web page database, comprising:
    determining whether a crawled web page has an associated web page URL pattern; if so, obtaining the associated web page URL pattern;
    obtaining the corresponding associated web pages based on the associated web page URL pattern;
    establishing the associated web page database from the associated web pages corresponding to the associated web page URL pattern.
  11. An associated web page search method, comprising:
    receiving a search request, the request including a search keyword;
    searching a preset associated web page database according to the search keyword to obtain a web page matching the keyword;
    determining whether the web page is an associated web page; if so, returning the web page and the first-page information associated with the web page.
  12. An apparatus for calculating an associated web page URL pattern, comprising:
    a page-turning feature anchor determination module, adapted to determine whether the page elements of a specified web page include a page-turning feature anchor and, if so, to invoke an associated URL extraction module;
    a URL extraction module, adapted to extract the associated URL to which the page-turning feature anchor links;
    an associated web page URL pattern calculation module, adapted to calculate the associated web page URL pattern corresponding to the specified web page from the URL of the specified web page and the associated URL to which the page-turning feature anchor links.
  13. The apparatus of claim 12, wherein the page-turning feature anchor determination module is further adapted to:
    match the page-turning feature anchor against the DOM tree nodes of the current web page;
    when the match succeeds, determine that the current web page has a page-turning feature anchor.
  14. The apparatus of claim 12, wherein the page-turning feature anchor links to one or more associated URLs.
  15. The apparatus of claim 12, 13, or 14, wherein the associated web page URL pattern calculation module comprises:
    a first feature URL prefix obtaining module, adapted to replace the digit blocks in the URL of the specified web page with a wildcard character to obtain a first feature URL prefix, wherein a digit block is a single digit or a plurality of digits delimited by separators;
    a second feature URL prefix obtaining module, adapted to replace the digit blocks in the associated URL with a wildcard character to obtain a second feature URL prefix;
    an associated web page URL pattern obtaining module, adapted to use the first feature URL prefix or the second feature URL prefix as the associated web page URL pattern when the first feature URL prefix is identical to the second feature URL prefix.
  16. The apparatus of claim 15, wherein the first feature URL prefix obtaining module is further adapted to:
    replace the digit blocks at different positions in the URL of the specified web page with the same wildcard character to obtain the first feature URL prefix;
    and the second feature URL prefix obtaining module is further adapted to:
    replace the digit blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
  17. The apparatus of claim 16, wherein the first feature URL prefix obtaining module is further adapted to:
    replace the digit blocks at different positions in the URL of the specified web page with different wildcard characters, respectively, to obtain the first feature URL prefix;
    and the second feature URL prefix obtaining module is further adapted to:
    replace the digit block at each position in the associated URL with the same wildcard character that was used at that position for the first feature URL prefix, to obtain the second feature URL prefix.
  18. The apparatus of claim 12, 13, 14, 16, or 17, further comprising:
    a first-page associated web page URL obtaining module, adapted to perform structural analysis on the common part of the associated web page URL patterns, extract the page-turning block from the associated web page URL pattern, and replace the page-turning block with a first-page identifier to obtain the URL of the first-page associated web page; wherein the page-turning block is a digit block that occupies the same position but holds different digits across a plurality of associated web page URL patterns.
  19. The apparatus of claim 18, wherein the first-page identifier comprises 0, 1, and/or the maximum value among the current associated web pages.
  20. The apparatus of claim 12, further comprising:
    a page number feature portion determination module, adapted to determine, based on the associated web page URL pattern corresponding to the specified web page, the page number feature portion of the URL of the specified web page and the page number feature portion of the associated URL, respectively;
    a page number identifier determination module, adapted to compare the page number feature portions of the URL of the specified web page and of the associated URL, and to extract the differing digit portion as the page number identifier of the URL of the specified web page.
  21. The apparatus of claim 12, further comprising:
    an associated web page database establishing module, adapted to establish an associated web page database from the associated web pages corresponding to the associated web page URL pattern.
  22. The apparatus of claim 21, further comprising:
    a search request receiving module, adapted to receive a search request, the request including a search keyword;
    a matching web page obtaining module, adapted to search the preset associated web page database according to the search keyword to obtain a web page matching the keyword;
    a multi-page associated web page determination module, adapted to determine whether the web page is an associated web page and, if so, to invoke an information returning module;
    an information returning module, adapted to return the web page and the first-page information associated with the web page.
  23. A computer program comprising computer-readable code which, when run on a computing device, causes the computing device to perform the method for calculating an associated web page URL pattern according to any one of claims 1-8.
  24. A computer-readable medium storing the computer program of claim 23.
PCT/CN2014/086522 2013-11-25 2014-09-15 Method and apparatus for computing url pattern of associated webpage WO2015074455A1 (en)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
CN201310607851.8A CN103617228A (en) 2013-11-25 2013-11-25 Method and device for calculating relevant webpage URL pattern
CN201310603918.0A CN103617225B (en) 2013-11-25 2013-11-25 A kind of associating web pages searching method and system
CN201310607854.1A CN103617229A (en) 2013-11-25 2013-11-25 Method and device for establishing relevant-webpage data base
CN201310603918.0 2013-11-25
CN201310606990.9A CN103631906A (en) 2013-11-25 2013-11-25 Method and device for recognizing page number identification in webpage URL
CN201310606990.9 2013-11-25
CN201310607851.8 2013-11-25
CN201310607854.1 2013-11-25

Publications (1)

Publication Number Publication Date
WO2015074455A1 true WO2015074455A1 (en) 2015-05-28

Family

ID=53178902

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/086522 WO2015074455A1 (en) 2013-11-25 2014-09-15 Method and apparatus for computing url pattern of associated webpage

Country Status (1)

Country Link
WO (1) WO2015074455A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101133415A (en) * 2005-03-04 2008-02-27 Chutnoon公司 Server, method and system for providing information search service by using sheaf of pages
CN102053979A (en) * 2009-10-27 2011-05-11 华为技术有限公司 Information acquisition method and system
CN103049557A (en) * 2012-12-31 2013-04-17 百度在线网络技术(北京)有限公司 Website resource management method and website resource management device
CN103258032A (en) * 2013-05-10 2013-08-21 清华大学 Parallel webpage obtaining method and parallel webpage obtaining device
CN103617228A (en) * 2013-11-25 2014-03-05 北京奇虎科技有限公司 Method and device for calculating relevant webpage URL pattern
CN103617225A (en) * 2013-11-25 2014-03-05 北京奇虎科技有限公司 Associated webpage searching method and system
CN103617229A (en) * 2013-11-25 2014-03-05 北京奇虎科技有限公司 Method and device for establishing relevant-webpage data base
CN103631906A (en) * 2013-11-25 2014-03-12 北京奇虎科技有限公司 Method and device for recognizing page number identification in webpage URL

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110874443A (en) * 2018-08-31 2020-03-10 北京搜狗科技发展有限公司 URL mode obtaining method and device, electronic equipment and readable storage medium
CN111177522A (en) * 2018-11-09 2020-05-19 百度在线网络技术(北京)有限公司 Page aggregation method and device, computer equipment and storage medium
CN111177522B (en) * 2018-11-09 2023-08-18 百度在线网络技术(北京)有限公司 Page aggregation method, device, computer equipment and storage medium
CN111723378A (en) * 2020-06-17 2020-09-29 浙江网新恒天软件有限公司 Website directory blasting method based on website map
CN114117181A (en) * 2022-01-25 2022-03-01 北京金堤科技有限公司 Website page turning logic acquisition method and device and website page turning control method and device

Similar Documents

Publication Publication Date Title
JP4857075B2 (en) Method and computer program for efficiently retrieving dates in a collection of web documents
CN106095979B (en) URL merging processing method and device
US20090089278A1 (en) Techniques for keyword extraction from urls using statistical analysis
US20130074148A1 (en) Method and system for compiling a unique sample code for specific web content
CN102446255B (en) Method and device for detecting page tamper
WO2015196906A1 (en) Search-based method and device for obtaining disease advisory information
US7962523B2 (en) System and method for detecting templates of a website using hyperlink analysis
CN102436563A (en) Method and device for detecting page tampering
WO2021068681A1 (en) Tag analysis method and device, and computer readable storage medium
CN102591965A (en) Method and device for detecting black chain
WO2015074455A1 (en) Method and apparatus for computing url pattern of associated webpage
CN104239582A (en) Method and device for identifying phishing webpage based on feature vector model
CN104036190A (en) Method and device for detecting page tampering
CN102314494A (en) Method and equipment for processing webpage contents
CN103617225B (en) A kind of associating web pages searching method and system
CN102567521A (en) Webpage data capturing and filtering method
CN106446123A (en) Webpage verification code element identification method
WO2017000659A1 (en) Enriched uniform resource locator (url) identification method and apparatus
CN110532784A (en) A kind of dark chain detection method, device, equipment and computer readable storage medium
CN103631906A (en) Method and device for recognizing page number identification in webpage URL
CN104036189A (en) Page distortion detecting method and black link database generating method
CN104778232B (en) Searching result optimizing method and device based on long query
CN108363711B (en) Method and device for detecting dark chain in webpage
CN104881453A (en) Method and device for indentifying type of webpage
CN103617229A (en) Method and device for establishing relevant-webpage data base

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14864611

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14864611

Country of ref document: EP

Kind code of ref document: A1