WO2015074455A1

WO2015074455A1 - Method and apparatus for computing url pattern of associated webpage

Info

Publication number: WO2015074455A1
Application number: PCT/CN2014/086522
Authority: WO
Inventors: 王智广
Original assignee: 北京奇虎科技有限公司; 奇智软件（北京）有限公司
Priority date: 2013-11-25
Filing date: 2014-09-15
Publication date: 2015-05-28

Abstract

A method and an apparatus for computing a URL pattern of an associated webpage. The method comprises: determining whether a page turning feature anchor exists in page elements of a specified webpage; if yes, retrieving an associated URL to which the page turning feature anchor is correspondingly linked; and computing, according to a URL of the specified webpage and the associated URL to which the page turning feature anchor is correspondingly linked, an associated webpage pattern corresponding to the specified webpage. A page turning feature anchor is used to recognize an associated webpage, so that an accuracy rate of recognition is high, and an associated webpage URL pattern is obtained through computation on the basis of a URL of a specified webpage and an associated URL, so that computing efficiency is high.

Description

Method and device for calculating an associated webpage URL pattern

Technical field

The present invention relates to the field of data processing technologies, and in particular to a method for calculating an associated webpage URL pattern and an apparatus for calculating an associated webpage URL pattern.

Background technique

With the development of the Internet, more and more information is presented on the Internet for users to query through webpages. Similarly, querying data on the Internet through search engines has become the most commonly used data search method.

Search engines need to adopt different scheduling strategies for different types of webpages, so identifying the type of a webpage is a basic task, and identifying page turning pages is a relatively important part of it. Page turning means viewing the previous page, the next page, or any other non-current page of a paginated document. Turning pages changes which content of a physical book or a mobile web form is in view. When used on the Internet, this mechanism also presents user interface elements that can be used to browse to other pages.

The existing method for identifying a page turning page is to identify whether it is an index page according to a keyword included in a URL (Uniform Resource Locator) of the web page. For example, when the URL includes keywords such as page, pn, and p, and a number after the keyword, the web page corresponding to the URL is determined to be a page turning page.

However, this identification method has a low recall rate. Many websites do not use these keywords in their URLs, for example "http://cq.ABC.com/lvshi/o12/", "http://bbs.BCA.com/t661_10", and "http://china.BCD.com/product/20110617/2647", yet these pages are still page turning pages. Such identification methods are therefore prone to misjudgment and are of limited practical use.
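The limitation described above can be illustrated with a minimal sketch. The regular expression below is only an assumed approximation of a keyword rule (the keywords page, pn, or p followed by a number); the function name and the two example.com URLs are hypothetical and are not taken from the prior art being discussed.

```python
import re

# Assumed keyword rule: "page", "pn" or "p" right after a separator,
# optionally followed by one separator character, then digits.
KEYWORD_RULE = re.compile(r"(?:^|[/?&_\-.])(?:page|pn|p)[=_\-/]?\d+", re.IGNORECASE)


def looks_like_paging_url(url: str) -> bool:
    """Return True when the keyword rule finds a paging keyword in the URL."""
    return KEYWORD_RULE.search(url) is not None


# URLs quoted in the text: all are page turning pages without the keywords.
urls = [
    "http://cq.ABC.com/lvshi/o12/",
    "http://bbs.BCA.com/t661_10",
    "http://china.BCD.com/product/20110617/2647",
]
missed = [u for u in urls if not looks_like_paging_url(u)]

# Hypothetical URLs that do carry a keyword, for contrast.
hits = [
    looks_like_paging_url("http://example.com/list?page=3"),
    looks_like_paging_url("http://example.com/news/p2.html"),
]
```

Under this assumed rule, every URL quoted in the text ends up in `missed`, which is the low-recall behaviour the background section describes.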

Summary of the invention

In view of the above problems, the present invention has been made in order to provide a method for calculating an associated webpage URL pattern, and a corresponding apparatus for calculating an associated webpage URL pattern, that overcome the above problems or at least partially solve them.

According to an aspect of the present invention, a method for calculating an associated webpage URL pattern is provided, including:

Determining whether there is a page turning feature anchor in the page element of the specified webpage; if yes, extracting the associated URL to which the page turning feature anchor is linked;

And calculating, according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked, an associated webpage URL pattern corresponding to the specified webpage.

According to another aspect of the present invention, a method for identifying a page number identifier in a webpage URL is provided, including:

Obtaining the associated URL to which the page turning feature anchor is linked in the page element of the specified webpage;

Calculating an associated webpage URL pattern according to the URL of the specified webpage and the associated URL;

Determining, respectively, a page number feature portion of the specified webpage URL and a page number feature portion of the associated URL based on the associated webpage URL pattern corresponding to the specified webpage;

Comparing the page number feature portion of the specified webpage URL with that of the associated URL, and extracting the differing digit portion as the page number identifier of the specified webpage URL.

According to another aspect of the present invention, a method for establishing an associated web page database is provided, including:

Determining whether the crawled webpage includes an associated webpage URL pattern; if yes, acquiring the associated webpage URL pattern;

Obtaining a corresponding associated webpage based on the associated webpage URL pattern;

The associated webpage database is established by using the associated webpage corresponding to the associated webpage URL pattern.

According to another aspect of the present invention, an associated web page search method is provided, including:

Receiving a search request; the request includes a search keyword;

Performing a search in the preset associated webpage database according to the search keyword to obtain a webpage matching the keyword;

Determining whether the webpage is an associated webpage; if yes, returning the webpage and the homepage information associated with the webpage.

According to another aspect of the present invention, an apparatus for calculating an associated webpage URL pattern is provided, including:

A page turning feature anchor determining module, adapted to determine whether the page elements of the specified webpage include a page turning feature anchor and, if yes, to call the associated URL extraction module;

An associated URL extraction module, adapted to extract the associated URL to which the page turning feature anchor is linked;

An associated webpage URL pattern calculation module, adapted to calculate an associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked.

According to still another aspect of the present invention, a computer program is provided, comprising computer readable code which, when run on a computing device, causes the computing device to perform the method for calculating an associated webpage URL pattern according to any of claims 1-8.

According to still another aspect of the present invention, a computer readable medium storing the computer program according to claim 23 is provided.

The beneficial effects of the invention are:

The invention uses the page turning feature anchor to identify associated webpages, so the recognition accuracy is high. The associated webpage URL pattern is calculated from the URL of the specified webpage and the associated URL, so the calculation efficiency is high.

The present invention replaces digital blocks with wildcard characters to obtain a first feature URL prefix and a second feature URL prefix. When the first feature URL prefix is the same as the second feature URL prefix, the first feature URL prefix or the second feature URL prefix is used as the associated webpage URL pattern. Because the common part of the URLs is used for matching, the recognition accuracy for associated webpages is further improved and the recall rate is greatly improved; more than 90% of associated webpages can be identified in practical applications.

The invention replaces the page turning block of the associated webpage URL pattern with the first page identifier to obtain the URL of the homepage associated webpage. Similarly, the page turning block can be replaced with the identifiers of other associated webpages to obtain their URLs, thereby increasing the coverage of associated webpages, so that a more comprehensive set of associated webpages can be obtained and fine-grained operations can be performed.

The invention extracts the associated webpage URL pattern from the currently crawled webpage and establishes the associated webpage database by using the associated webpages corresponding to that pattern, thereby avoiding repeated crawling of webpages, reducing the occupation of system resources, and greatly improving the efficiency of establishing the database.

When the webpage matching the keyword is determined to be an associated webpage, the invention returns the webpage together with the homepage information associated with it. This spares the user from repeating the search or looking up the homepage, further reduces system operations and the occupation of system resources, and improves search efficiency.

The above description is only an overview of the technical solutions of the present invention, and the above-described and other objects, features and advantages of the present invention can be more clearly understood. Specific embodiments of the invention are set forth below.

Drawings

Various other advantages and benefits will become apparent to those skilled in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be construed as limiting. Throughout the drawings, the same reference numerals are used to refer to the same parts. In the drawings:

FIG. 1 is a flow chart showing the steps of Embodiment 1 of a method for calculating an associated webpage URL pattern according to an embodiment of the present invention;

FIG. 2 schematically shows an example of a webpage structure according to an embodiment of the present invention;

FIG. 3 schematically shows an example of a page turning block according to an embodiment of the present invention;

FIG. 4 is a flow chart showing the steps of Embodiment 2 of a method for calculating an associated webpage URL pattern according to an embodiment of the present invention;

FIG. 5 is a flow chart showing the steps of an embodiment of a method for identifying a page number identifier in a webpage URL according to an embodiment of the present invention;

FIG. 6 is a flow chart showing the steps of an embodiment of a method for establishing an associated webpage database according to an embodiment of the present invention;

FIG. 7 is a flow chart showing the steps of an embodiment of an associated webpage search method according to an embodiment of the present invention;

FIG. 8 is a structural block diagram of Embodiment 1 of an apparatus for calculating an associated webpage URL pattern according to an embodiment of the present invention;

FIG. 9 is a structural block diagram of Embodiment 2 of an apparatus for calculating an associated webpage URL pattern according to an embodiment of the present invention;

FIG. 10 schematically shows a block diagram of a computing device for performing the method according to the invention;

FIG. 11 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.

Detailed description

The invention is further described below in conjunction with the drawings and specific embodiments.

Referring to FIG. 1, a flow chart of the steps of Embodiment 1 of the method for calculating an associated webpage URL pattern according to an embodiment of the present invention is shown. The method may include the following steps:

Step 101, it is determined whether the page element of the specified web page has a page turning feature anchor; if yes, step 102 is performed;

A webpage can be divided into multiple areas according to function. Take a page of a forum (BBS) as an example. As shown in FIG. 2, the page can be divided into a navigation block (1), garbage blocks (2, 4), a page turning block (3), a title block (5), an author information block (6), a publication date block (7), and a text block (8). The navigation block can be located at the top of the webpage header or below the banner of the webpage, and is used to point to the information sections of the website. A garbage block can be an area containing page elements that have low relevance to the topic of the webpage, such as "post" and "reply". The page turning block can be an area that indicates page turning. The title block can be the area in which the title of the webpage (such as "Secure Browser Gather Black Thursday" shown in FIG. 2) is located. The author information block is an area that records the author information of the webpage. The text block is the area in which the body text of the webpage is recorded.

Referring to Figure 3, there is shown an exemplary diagram showing a page turning block in accordance with one embodiment of the present invention.

As shown in FIG. 3, the page turning block may mainly be composed of a page turning feature anchor, and the page turning feature anchor is a page turning feature string, which may be a page element for identifying a page turning.

In a specific implementation, the page turning feature anchor may include one or more of the following:

[<<], [>>], [< <], [> >], ["], ["], [>], [<], [Next], [Previous], [Previous], [Next], [next], [Last Page], [Last Page], [Previous Page], [Next Page], [<Previous Page], [> Previous], [Next], [Next Page], [1...].

Of course, the above-mentioned page turning feature anchor is only used as an example. When the embodiment of the present invention is implemented, other page turning feature anchors may be set according to actual conditions, which is not limited by the embodiment of the present invention.

In a preferred embodiment of the present invention, the step 101 may specifically include the following sub-steps:

Sub-step S11, using a page turning feature anchor to perform matching in the DOM tree node of the current webpage;

Sub-step S12, when the matching is successful, it is determined that the current webpage has a page turning feature anchor.

The DOM (Document Object Model) is a standard programming interface for processing extensible markup languages. It allows the content and structure of a document to be accessed and modified in a platform- and language-independent manner, and is a common way of representing and processing HTML (Hypertext Markup Language) or XML (eXtensible Markup Language) documents.

The DOM is actually a document model that is described in an object-oriented manner. The DOM defines the objects needed to represent and modify documents, the behavior and properties of those objects, and the relationships between these objects. The DOM can be thought of as a tree representation of the data and structure on the page, but of course the page may not be implemented in this way.

The entire HTML document can be refactored via JavaScript, and items on the page can be added, removed, changed, or rearranged.

To change something on the page, JavaScript needs to get access to all the elements in the HTML document. This entry, along with methods and properties for adding, moving, changing, or removing HTML elements, is obtained through the Document Object Model (DOM).

HTML documents can be thought of as a tree structure, and this structure is called a node tree (HTML DOM). With the HTML DOM, all nodes in the tree are accessible via JavaScript. All HTML elements (nodes) can be modified, and nodes can be created or deleted.

The nodes in the node tree have a hierarchical relationship with each other. Terms such as parent, child, and sibling can be used to describe these relationships. Among them, the parent node has child nodes. The child nodes of the same level are called siblings (brothers or sisters). In the node tree, the top node is called the root. Each node has a parent node, except for the root (it has no parent). A node can have any number of children, and a sibling is a node that has the same parent.

Specifically, there are several ways to find the webpage elements you want to operate in the node tree:

For example, you can do this by using the getElementById() and getElementsByTagName() methods.

As another example, you can use the parentNode, firstChild, and lastChild properties of an element node.

Among them, getElementById() and getElementsByTagName() can find any HTML element in the entire HTML document. Both methods ignore the structure of the document. If you look up all the <p> elements in the document, getElementsByTagName() will find them all, no matter which level in the document the <p> element is in. At the same time, the getElementById() method will also return the correct element, no matter where it is hidden in the document structure. These two methods provide whatever HTML elements are needed, regardless of where they are in the document.

In addition, getElementById() returns the page element with the specified ID.

In a specific implementation, it may be identified whether a hyperlink <a> (anchor) in the DOM tree of the webpage's HTML text includes one or more of [<<], [>>], [< <], [> >], ["], ["], [>], [<], [Next], [Previous], [Previous], [Next], [next], [Last], [Last], [Previous Page], [Next Page], [<Previous Page], [<Previous], [Next], [Next Page], [1...]. If yes, it is determined that the current webpage has a page turning feature anchor.

Among them, <a> can be used to connect the text or picture at the current position to other pages, texts or images.

The basic syntax structure of the <a> tag can be as follows:

<a
    class=type
    id=value
    href=reference
    name=value
    rel=same|next|parent|previous
    rev=value
    target=window
    style=value
    title=title
    onclick=function
    onmouseout=function
    onmouseover=function>text or image to display</a>

For example, the content of the <a> identifier in the following HTML text is:

<a href="forum-99-1.html" class="prev"></a>
<a href="forum-99-1.html">1</a><strong>2</strong>
<a href="forum-99-3.html">3</a>
<a href="forum-99-4.html">4</a>
<a href="forum-99-5.html">5</a>
<a href="forum-99-6.html">6</a>
<a href="forum-99-7.html">7</a>
<a href="forum-99-8.html">8</a>
<a href="forum-99-9.html">9</a>
<a href="forum-99-10.html">10</a>
<a href="forum-99-1000.html" class="last">...2107</a>
<label>
<input type="text" name="custompage" class="px" size="2" title="Enter page number, press Enter to quickly jump" value="2" onkeydown="if(event.keyCode==13){window.location='forum.php?mod=forumdisplay&fid=99&page='+this.value;doane(event);}" />
<span title="Total 1000 pages">/1000 pages</span>
</label>
<a href="forum-99-3.html" class="nxt">Next Page</a>

</div>

</span>

By matching the <a> identifier in the HTML text, it can be judged that the web page has one or more page turning feature anchors.
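The matching of step 101 can be sketched as follows, assuming Python's standard html.parser is an acceptable stand-in for a DOM tree traversal. The FEATURE_ANCHORS set is only an illustrative subset of the anchors listed earlier, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Illustrative subset of page turning feature anchors (an assumption).
FEATURE_ANCHORS = {"<<", ">>", "<", ">", "Next", "Previous",
                   "Next Page", "Previous Page", "Last Page"}


class AnchorCollector(HTMLParser):
    """Collects (text, href) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def find_page_turning_anchors(html):
    """Return the (text, href) pairs whose text is a known feature anchor."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return [(text, href) for text, href in parser.links
            if text in FEATURE_ANCHORS]


sample = """
<a href="forum-99-1.html" class="prev"></a>
<a href="forum-99-1.html">1</a><strong>2</strong>
<a href="forum-99-3.html">3</a>
<a href="forum-99-3.html" class="nxt">Next Page</a>
"""
matches = find_page_turning_anchors(sample)
```

A non-empty `matches` list corresponds to the "matching is successful" case of sub-step S12. Exact string comparison is a simplification; a real system might normalise whitespace or match on substrings.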

Step 102: Extract an associated URL (Uniform Resource Locator) to which the page turning feature anchor is linked;

In a practical application, the page turning feature anchor may be linked to one or more associated URLs.

Specifically, after the one or more page turning feature anchors are identified, the one or more associated URLs to which they link are extracted. These associated URLs point to other page turning pages associated with the current webpage.
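Because the href values in the earlier HTML sample are relative, an implementation would probably need to resolve them against the URL of the specified webpage before comparing URLs. The sketch below assumes the standard urllib.parse.urljoin is used for that; the function name is hypothetical, and dropping duplicates and self-links is a design choice of this sketch rather than something the text requires.

```python
from urllib.parse import urljoin


def resolve_associated_urls(page_url, hrefs):
    """Turn anchor hrefs into absolute associated URLs, without duplicates."""
    resolved = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        # Skip links back to the page itself and URLs already collected.
        if absolute != page_url and absolute not in resolved:
            resolved.append(absolute)
    return resolved


associated = resolve_associated_urls(
    "http://bbs.XXX.com/forum-99-2.html",
    ["forum-99-1.html", "forum-99-3.html", "forum-99-3.html"],
)
```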

Step 103: Calculate an associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked.

The associated webpage URL pattern may be a set of URLs/webpages that are similar in appearance or function.

In a preferred embodiment of the present invention, the step 103 may specifically include the following sub-steps:

Sub-step S21, replacing a digital block in a URL of a specified webpage with a wildcard character to obtain a first feature URL prefix; wherein the digital block is a single number or a plurality of numbers segmented by the interval identifier;

Sub-step S31, replacing the digital block in the associated URL with a wildcard character to obtain a second feature URL prefix;

It should be noted that the wildcard character may be any character, which is not limited in this embodiment of the present invention. The interval identifier may be a symbol used as a separator in the URL, such as "/", ".", "-", "?", ":", and the like. A digital block must consist solely of consecutive digits between interval identifiers; for example, "123ABC" is not a digital block.
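The definition of a digital block can be made concrete with a small sketch. It assumes that the interval identifiers are exactly the five symbols named above (the text allows others) and that the start and end of the URL also count as boundaries; the regular expression and the function name are illustrative, not part of the described method.

```python
import re

# A run of digits whose left and right neighbours are interval identifiers
# ("/", ".", "-", "?", ":") or the start/end of the string.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def find_digital_blocks(url):
    """Return the digital blocks of a URL in left-to-right order."""
    return [m.group() for m in DIGITAL_BLOCK.finditer(url)]


blocks = find_digital_blocks("http://bbs.XXX.com/forum-99-2.html")
mixed = find_digital_blocks("http://bbs.XXX.com/123ABC/7")
```

In the second call, "123ABC" is skipped because the digits are followed by letters rather than by an interval identifier.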

In a preferred example of the embodiment of the present invention, the sub-step S21 may further include the following sub-steps:

Sub-step S211, replacing the digital block at different positions in the URL of the specified webpage with the same wildcard character to obtain the first feature URL prefix;

Corresponding to the sub-step S211, the sub-step S31 may further comprise the following sub-steps:

Sub-step S311, replacing the digital block at different positions in the associated URL with the same wildcard character to obtain a second feature URL prefix.

In a specific implementation, the URL of the specified webpage and the associated URL may have one or more digital blocks. To reduce the operational steps of the replacement and the resource usage of the system, the digital block may be replaced with the same wildcard character.

For example, the URL of the specified webpage is http://bbs.XXX.com/forum-99-2.html and the associated URL is http://bbs.XXX.com/forum-99-3.html. In the URL of the specified webpage, "99" and "2" are recognized as digital blocks. Taking "(\d+)" as an example of a wildcard character, the first feature URL prefix can be http://bbs.XXX.com/forum-(\d+)-(\d+).html, and the second feature URL prefix can be http://bbs.XXX.com/forum-(\d+)-(\d+).html.
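Sub-steps S211 and S311 can be sketched as a single substitution. The digital block expression assumes the five interval identifiers named in the text, and the function name is hypothetical. A callable is passed as the replacement so that the backslash in "(\d+)" is inserted literally.

```python
import re

# Digits bounded by interval identifiers or by the ends of the string.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")
WILDCARD = r"(\d+)"


def to_feature_prefix(url):
    """Replace every digital block with the same wildcard character."""
    return DIGITAL_BLOCK.sub(lambda match: WILDCARD, url)


first_prefix = to_feature_prefix("http://bbs.XXX.com/forum-99-2.html")
second_prefix = to_feature_prefix("http://bbs.XXX.com/forum-99-3.html")
```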

In an embodiment of the present invention, the sub-step S21 may further include the following sub-steps:

Sub-step S212, using different replacement characters to replace the digital blocks in different positions in the URL of the specified webpage, to obtain the first feature URL prefix;

Corresponding to the sub-step S212, the sub-step S31 may further include the following sub-steps:

Sub-step S312, replacing each digital block of the associated URL with the same wildcard character that was used at the same position in the first feature URL prefix, to obtain a second feature URL prefix.

In a specific implementation, the URL of the specified webpage and the associated URL may have one or more digital blocks. To improve the efficiency of subsequently determining whether the first feature URL prefix is the same as the second feature URL prefix, and of identifying the digital blocks, the digital blocks may be replaced with different wildcard characters.

For example, the URL of the specified webpage is http://bbs.XXX.com/forum-99-2.html and the associated URL is http://bbs.XXX.com/forum-99-3.html. In the URL of the specified webpage, "99" and "2" are recognized as digital blocks. Taking "(\d+)" and "(\e+)" as examples of wildcard characters, the first feature URL prefix can be http://bbs.XXX.com/forum-(\d+)-(\e+).html, and the second feature URL prefix can be http://bbs.XXX.com/forum-(\d+)-(\e+).html.
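The position-dependent variant of sub-steps S212 and S312 can be sketched by drawing a different label for each digital block. The labels d, e, f, ... follow the "(\d+)" and "(\e+)" example; they are placeholders only, since "\e" is not a regular expression character class. The function name is hypothetical, and the sketch assumes a URL has no more than 23 digital blocks.

```python
import re
import string

# Digits bounded by interval identifiers or by the ends of the string.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def to_positional_prefix(url):
    """Replace the n-th digital block with the n-th wildcard label."""
    labels = iter(string.ascii_lowercase[3:])  # "d", "e", "f", ...
    return DIGITAL_BLOCK.sub(lambda match: "(\\%s+)" % next(labels), url)


first_prefix = to_positional_prefix("http://bbs.XXX.com/forum-99-2.html")
second_prefix = to_positional_prefix("http://bbs.XXX.com/forum-99-3.html")
```

Because the label encodes the position, a later step can refer to one specific block, for example "(\e+)", without counting occurrences.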

Sub-step S41, when the first feature URL prefix is the same as the second feature URL prefix, the first feature URL prefix or the second feature URL prefix is used as an associated webpage URL pattern.

In an actual application, when the first feature URL prefix is the same as the second feature URL prefix, it may be determined that the webpage corresponding to the associated URL is an associated page turning webpage of the specified webpage.

Because the first feature URL prefix and the second feature URL prefix are the same, either of them may be used as the associated webpage URL pattern.
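Sub-step S41 then reduces to an equality test on the two prefixes. The sketch below is self-contained and uses the same-wildcard variant; returning None when the prefixes differ is an assumption of this sketch, since the text does not say what happens in that case. The about.html URL is a hypothetical counter-example.

```python
import re

# Digits bounded by interval identifiers or by the ends of the string.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def to_feature_prefix(url):
    """Replace every digital block with the wildcard (\\d+)."""
    return DIGITAL_BLOCK.sub(lambda match: r"(\d+)", url)


def compute_url_pattern(page_url, associated_url):
    """Return the shared feature URL prefix, or None if the prefixes differ."""
    first = to_feature_prefix(page_url)
    second = to_feature_prefix(associated_url)
    return first if first == second else None


pattern = compute_url_pattern(
    "http://bbs.XXX.com/forum-99-2.html",
    "http://bbs.XXX.com/forum-99-3.html",
)
unrelated = compute_url_pattern(
    "http://bbs.XXX.com/forum-99-2.html",
    "http://bbs.XXX.com/about.html",
)
```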

The invention uses the page turning feature anchor to identify associated webpages, so the recognition accuracy is high. The associated webpage URL pattern is calculated from the URL of the specified webpage and the associated URL, so the calculation efficiency is high.

The present invention replaces digital blocks with wildcard characters to obtain a first feature URL prefix and a second feature URL prefix. When the first feature URL prefix is the same as the second feature URL prefix, the first feature URL prefix or the second feature URL prefix is used as the associated webpage URL pattern. Because the common part of the URLs is used for matching, the recognition accuracy for associated webpages is further improved and the recall rate is greatly improved; more than 90% of associated webpages can be identified in practical applications.

Referring to FIG. 4, a flow chart of the steps of Embodiment 2 of the method for calculating an associated webpage URL pattern according to an embodiment of the present invention is shown. The method may include the following steps:

Step 401, it is determined whether the page element of the specified web page has a page turning feature anchor; if yes, step 402 is performed;

Step 402: Extract an associated URL to which the page turning feature anchor is linked;

Step 403: Calculate an associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked;

Step 404: Perform structural analysis on the common part in the associated webpage URL pattern, extract the page turning block in the associated webpage URL pattern, and replace the page turning block with the first page identifier to obtain the URL of the homepage associated webpage;

The page turning block is a digital block that has the same position but different numbers across the URLs covered by the associated webpage URL pattern.

In practical applications, the URL may include one or more of the following structures:

1. protocol: specifies the transport protocol used. The most commonly used is the HTTP protocol, which is also the most widely used protocol on the current WWW. Specifically, the transport protocols include the file protocol (the resource is a file on the local computer, the format is file:///), the ftp protocol (accessing the resource through FTP, the format is ftp://), the gopher protocol (accessing the resource through the Gopher protocol), the http protocol (accessing the resource through HTTP, the format is http://), the https protocol (accessing the resource through secure HTTPS, the format is https://), and so on.

2. hostname: the Domain Name System (DNS) host name or IP address of the server hosting the resource. Sometimes the username and password required to connect to the server (in the format username:password) are included before the host name.

3. port (port number): optional; when omitted, the default port of the scheme is used. Each transport protocol has a default port number; for example, the default port of http is 80. Sometimes, for security or other considerations, the port is redefined on the server, that is, a non-standard port number is used. In this case, the port number cannot be omitted from the URL.

4. path: a string consisting of zero or more segments separated by "/" symbols, generally used to represent a directory or file address on the host.

5. parameters: an optional item used to specify special parameters.

6. query: can be used to send parameters to dynamic webpages (such as webpages created using CGI, ISAPI, PHP/JSP/ASP/ASP.NET technology). There can be multiple parameters, separated by the "&" symbol, and the name and value of each parameter are separated by the "=" sign.

7. fragment: can be used to specify a fragment within a network resource. For example, if a webpage contains explanations of several terms, the fragment can be used to jump directly to one of them.
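The seven components listed above map closely onto what Python's standard urllib.parse.urlparse returns, as the following sketch shows. The URL, including its credentials and port, is entirely made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL that exercises every component described above.
url = "http://user:secret@www.example.com:8080/forum/list.php;v=1?fid=99&page=2#top"
parts = urlparse(url)

components = {
    "protocol": parts.scheme,        # transport protocol
    "username": parts.username,      # optional credentials before the host
    "password": parts.password,
    "hostname": parts.hostname,      # DNS host name or IP address
    "port": parts.port,              # None when omitted from the URL
    "path": parts.path,              # directory or file address on the host
    "parameters": parts.params,      # optional ";"-delimited parameters
    "query": parse_qs(parts.query),  # "&"-separated name=value pairs
    "fragment": parts.fragment,      # position inside the resource
}
```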

In a specific implementation, structural analysis is performed on the common parts of the plurality of associated webpage URL patterns to extract the page turning block in the associated webpage URL pattern, and the page turning block is then replaced with the homepage identifier to obtain the URL of the homepage associated webpage.

For example, for the associated webpage URL pattern of the above example, http://bbs.XXX.com/forum-(\d+)-(\e+).html, after (\e+) is identified as the page turning block and replaced with the homepage identifier, the URL of the homepage associated webpage is obtained: http://bbs.XXX.com/forum-99-1.html.
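Step 404 can be sketched by comparing the digital blocks of the two URLs position by position. The sketch assumes that exactly one block differs and treats that block as the page turning block; the function name, the default identifier "1", and the choice to return None in other cases are assumptions of this sketch.

```python
import re

# Digits bounded by interval identifiers or by the ends of the string.
DIGITAL_BLOCK = re.compile(r"(?<![^/.\-?:])\d+(?![^/.\-?:])")


def first_page_url(page_url, associated_url, first_page_id="1"):
    """Return the first-page URL, or None if no single page turning block exists."""
    # Both URLs must reduce to the same pattern once digital blocks are masked.
    if DIGITAL_BLOCK.sub("*", page_url) != DIGITAL_BLOCK.sub("*", associated_url):
        return None
    own = list(DIGITAL_BLOCK.finditer(page_url))
    other = list(DIGITAL_BLOCK.finditer(associated_url))
    # Same position, different number: candidate page turning blocks.
    differing = [i for i, (a, b) in enumerate(zip(own, other))
                 if a.group() != b.group()]
    if len(differing) != 1:
        return None
    block = own[differing[0]]
    return page_url[:block.start()] + first_page_id + page_url[block.end():]


home = first_page_url(
    "http://bbs.XXX.com/forum-99-2.html",
    "http://bbs.XXX.com/forum-99-3.html",
)
```

Passing a different `first_page_id`, such as "0" or the largest page number, covers the other homepage conventions the text mentions.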

In a preferred example of an embodiment of the present invention, the homepage identifier may include 0, 1, and/or a maximum value in a current associated webpage.

In a specific implementation, the homepage associated webpage among the associated webpages generally records important content, such as the text block shown in FIG. 3. The homepage associated webpage is therefore relatively important, and identifying it is of considerable significance. Different websites adopt different page turning structures, so the homepage associated webpage differs from site to site. For example, some websites use page 0 as the homepage associated webpage, some use page 1, and some use the largest page number (such as 2100 shown in FIG. 3), and so on.

Of course, the foregoing homepage associated webpage is only an example. When the embodiment of the present invention is implemented, the digital block can be replaced with the identifier of any associated webpage according to the actual situation to obtain the corresponding associated webpage, which is not limited by the embodiment of the present invention.

Referring to FIG. 5, a flow chart of a method for identifying a page number identifier in a webpage URL according to an embodiment of the present invention is shown. The method may include the following steps:

Step 501: Acquire an associated URL to which the page turning feature anchor corresponding to the page element of the specified webpage is linked;

In a preferred embodiment of the present invention, the step 501 may specifically include the following sub-steps:

Sub-step S51, using a page turning feature anchor to perform matching in a DOM tree node of a specified webpage;

Sub-step S52, when the matching is successful, the associated URL is obtained from the matching paged feature anchor.

Step 502: Calculate an associated webpage URL pattern according to the URL of the specified webpage and the associated URL;

In a preferred embodiment of the present invention, the step 502 may specifically include the following sub-steps:

Sub-step S61, replacing the digital block in the URL of the specified webpage with the wildcard character to obtain the first feature URL prefix; wherein the digital block is a single digit or a plurality of digits separated by the interval identifier;

Sub-step S71, replacing the digital block in the associated URL with a wildcard character to obtain a second feature URL prefix;

In a preferred example of the embodiment of the present invention, the sub-step S61 may further include the following sub-steps:

Sub-step S611, replacing the digital block at different positions in the URL of the specified webpage with the same wildcard character to obtain the first feature URL prefix;

Corresponding to the sub-step S611, the sub-step S71 may further comprise the following sub-steps:

Sub-step S711, replacing the digital block at different positions in the associated URL with the same wildcard character to obtain a second feature URL prefix.

In an embodiment of the present invention, the sub-step S61 may further include the following sub-steps:

Sub-step S612, replacing the digital blocks at different positions in the URL of the specified webpage with different replacement characters to obtain the first feature URL prefix;

Corresponding to sub-step S612, the sub-step S71 may further comprise the following sub-steps:

Sub-step S712, replacing each digital block of the associated URL with the same wildcard character used at the same position in the first feature URL prefix, to obtain a second feature URL prefix.

Sub-step S81, when the first feature URL prefix is the same as the second feature URL prefix, the first feature URL prefix or the second feature URL prefix is used as an associated webpage URL pattern.
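The following is a minimal sketch of sub-steps S61, S71 and S81 in the variant that uses the same wildcard character at every position (S611). It assumes that a digital block is any maximal run of decimal digits and that `(\d+)` is the wildcard notation; the function names are illustrative only.

```python
import re

WILDCARD = r"(\d+)"  # assumed wildcard notation


def feature_prefix(url):
    """Replace every run of digits in the URL with the same wildcard."""
    # A function is used as the replacement so that the backslash in
    # WILDCARD is inserted literally rather than parsed as an escape.
    return re.sub(r"\d+", lambda m: WILDCARD, url)


def associated_pattern(page_url, associated_url):
    """Return the shared URL pattern, or None when the two prefixes differ."""
    first = feature_prefix(page_url)         # S61
    second = feature_prefix(associated_url)  # S71
    return first if first == second else None  # S81


pattern = associated_pattern("http://bbs.XXX.com/forum-99-2.html",
                             "http://bbs.XXX.com/forum-99-3.html")
# pattern is "http://bbs.XXX.com/forum-(\d+)-(\d+).html"
```

Under this assumption digits in the host name would also be replaced; an implementation might restrict the replacement to the path of the URL.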

Step 503: Determine, according to the associated webpage URL pattern corresponding to the specified webpage, the page number feature part of the specified webpage URL and the page number feature part of the associated URL, respectively;

By performing structural analysis on the common part of the associated webpage URL pattern, the page number feature part of the pattern, that is, the page turning block, may be determined. The page turning block may be a digital block that has the same position but different numbers in multiple associated webpage URL patterns.

Step 504: Compare the page number feature part of the specified webpage URL with that of the associated URL, and extract the differing digital part as the page number identifier of the specified webpage URL.

In a specific implementation, the page number identifier may include a homepage identifier (that is, a first page identifier), and the homepage identifier may include 0, 1, and/or the maximum page number among the current associated webpages.
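Steps 503 and 504 can be sketched as follows, assuming that digital blocks are runs of decimal digits and that exactly one block differs between the two URLs. The function name `page_numbers` and the decision to return `None` in ambiguous cases are assumptions made for illustration.

```python
import re


def page_numbers(page_url, associated_url):
    """Return (page number of page_url, page number of associated_url).

    Returns None when the URLs do not share a pattern or when the
    differing digital block cannot be identified unambiguously.
    """
    skeleton_a = re.sub(r"\d+", "#", page_url)
    skeleton_b = re.sub(r"\d+", "#", associated_url)
    if skeleton_a != skeleton_b:
        return None  # the two URLs do not share a pattern
    blocks_a = re.findall(r"\d+", page_url)
    blocks_b = re.findall(r"\d+", associated_url)
    # Same position, different number: candidate page turning block.
    differing = [(a, b) for a, b in zip(blocks_a, blocks_b) if a != b]
    if len(differing) != 1:
        return None  # zero or several differing blocks
    a, b = differing[0]
    return int(a), int(b)
```

For http://bbs.XXX.com/forum-99-2.html and http://bbs.XXX.com/forum-99-3.html, only the second digital block differs, so the sketch reports page numbers 2 and 3.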

After extracting the page turning block in the associated web page URL pattern, the page turning block may be replaced with the first page identifier to obtain the URL of the first page associated web page.

For example, for the associated webpage URL pattern of the above example, http://bbs.XXX.com/forum-(\d+)-(\e+).html, after (\e+) is identified as the page turning block, the page turning block is replaced with the first page identifier to obtain the URL of the first page associated webpage: http://bbs.XXX.com/forum-99-1.html.
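The replacement of the page turning block with the first page identifier can be sketched in the same way. The sketch assumes that the page turning block is the single digital block that differs between the specified webpage URL and the associated URL, and that the first page identifier is supplied by the caller; `first_page_url` is an illustrative name.

```python
import re


def first_page_url(page_url, associated_url, first_page_id="1"):
    """Replace the page turning block of page_url with the first page identifier."""
    if re.sub(r"\d+", "#", page_url) != re.sub(r"\d+", "#", associated_url):
        return None  # the two URLs do not share a pattern
    blocks_a = list(re.finditer(r"\d+", page_url))
    blocks_b = list(re.finditer(r"\d+", associated_url))
    differing = [a for a, b in zip(blocks_a, blocks_b) if a.group() != b.group()]
    if len(differing) != 1:
        return None  # page turning block is ambiguous
    block = differing[0]
    return page_url[:block.start()] + first_page_id + page_url[block.end():]
```

Passing another identifier, for example `first_page_id="0"` for sites that number pages from zero, yields the corresponding URL.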

The invention uses the page turning feature anchor to identify associated webpages, so recognition accuracy is high. It calculates the associated webpage URL pattern from the URL of the specified webpage and the associated URL, so calculation efficiency is high. Comparing the common parts of the URLs also greatly improves the recall rate; in practical applications, more than 90% of associated webpages can be identified.

The present invention replaces the digital blocks with wildcard characters to obtain a first feature URL prefix and a second feature URL prefix. When the first feature URL prefix is the same as the second feature URL prefix, the first feature URL prefix or the second feature URL prefix is used as the associated webpage URL pattern. Because the common part of the URLs is used for matching, the recognition accuracy of associated webpages is further improved.

Referring to FIG. 6 , a flow chart of steps of an embodiment of a method for establishing an associated webpage database according to an embodiment of the present invention is shown, which may specifically include the following steps:

Step 601: Determine whether the captured webpage includes an associated webpage URL pattern; if yes, perform step 602;

It should be noted that the search engine's function of automatically extracting webpages from the World Wide Web can be realized by a web crawler, also known as a web spider. A web spider finds webpages by following the links within webpages. It starts from one page of a website (usually the home page), reads the content of that page, and finds the other link addresses in it. It then uses these link addresses to find the next pages, and keeps looping in this way until all the pages of the site have been crawled. If the entire Internet is treated as one website, a web spider can use this principle to capture all the webpages on the Internet.
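The crawling loop described above can be sketched as a breadth-first traversal. To keep the sketch self-contained, pages are not fetched over the network; the caller supplies a hypothetical `get_links` function, and `SITE` is an invented in-memory link graph.

```python
from collections import deque


def crawl(start, get_links):
    """Visit every page reachable from start; return pages in visiting order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:  # never enqueue a page twice
                seen.add(link)
                queue.append(link)
    return order


# Invented link graph standing in for a website.
SITE = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/"], "/c": []}
visited = crawl("/", lambda page: SITE.get(page, []))
# visited == ["/", "/a", "/b", "/c"]
```

The `seen` set is what allows the loop to terminate on sites whose pages link back to each other.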

The associated webpage URL pattern may be the common part of the URLs of the page turning webpages, that is, it describes a set formed by URLs/webpages that are similar in appearance or function.

In a preferred embodiment of the present invention, the step 601 may specifically include the following sub-steps:

Sub-step S91, determining whether there is a page turning feature string in the page element of the current webpage; if yes, extracting the URL of the page turning feature string link;

As shown in FIG. 3, the page turning block may be mainly composed of page turning feature strings (i.e., page turning feature anchors), and a page turning feature string may be a page element for identifying page turning.

In a specific implementation, the page turning feature string may include one or more of the following:

[<<], [>>], [< <], [> >], ["], ["], [>], [<], [Next], [Previous], [previous one], [next], [next], [last page], [last page], [previous page], [next page], [<previous page], [<previous one], [next>], [Next Page], [1...].

Of course, the above-mentioned page turning feature string is only used as an example. When the embodiment of the present invention is implemented, other page turning feature strings may be set according to actual conditions, which is not limited by the embodiment of the present invention.

It should be noted that the current webpage may be the webpage that is captured.

In a preferred embodiment of the present invention, the sub-step S91 may further include the following sub-steps:

Sub-step S911, using a page turning feature string to perform matching in the DOM tree node of the current webpage;

Sub-step S912, when the matching is successful, it is determined that the current webpage has a page turning feature string.

Sub-step S92, replacing the digital block in the URL of the current webpage with a preset replacement character to obtain a first feature URL prefix; wherein the digital block is a single digit or a plurality of digits separated by the interval identifier;

Sub-step S93, replacing the digital block in the URL of the page-turning feature string link with a preset replacement character to obtain a second feature URL prefix;

In an embodiment of the present invention, the sub-step S92 may further include the following sub-steps:

Sub-step S921, replacing the digital block at different positions in the URL of the current webpage with the same replacement character, to obtain the first feature URL prefix;

Corresponding to sub-step S921, the sub-step S93 may further comprise the following sub-steps:

Sub-step S931, replacing the digital blocks at different positions in the URL of the feature string link with the same replacement character to obtain a second feature URL prefix.

Sub-step S922, replacing the digital blocks at different positions in the URL of the current webpage with different replacement characters to obtain the first feature URL prefix;

Corresponding to sub-step S922, the sub-step S93 may further comprise the following sub-steps:

Sub-step S932, replacing each digital block in the URL of the feature string link with the same replacement character used at the same position in the first feature URL prefix, to obtain the second feature URL prefix.

Sub-step S94, when the first feature URL prefix is the same as the second feature URL prefix, determining that the crawled webpage includes an associated webpage URL pattern.
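The variant with different replacement characters (sub-steps S922 and S932) can be sketched by giving each digital block a placeholder that encodes its position. The `{0}`, `{1}` placeholder notation and the function names are assumptions made for illustration.

```python
import re
from itertools import count


def positional_prefix(url):
    """Replace the i-th digital block of the URL with the placeholder {i}."""
    counter = count()
    return re.sub(r"\d+", lambda m: "{%d}" % next(counter), url)


def has_associated_pattern(current_url, linked_url):
    """S94: the prefixes of the current URL and the linked URL must be equal."""
    return positional_prefix(current_url) == positional_prefix(linked_url)


prefix = positional_prefix("http://bbs.XXX.com/forum-99-2.html")
# prefix == "http://bbs.XXX.com/forum-{0}-{1}.html"
```

Because each placeholder names a position, a later step can refer to one specific block, for example the page turning block, without ambiguity.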

Step 602: Acquire the associated webpage URL pattern.

In an embodiment of the present invention, the step 602 may specifically include the following sub-steps:

Sub-step S101, the first feature URL prefix or the second feature URL prefix is used as a corresponding associated webpage URL pattern of the current webpage.

When the page elements of the current webpage contain a page turning feature string, the present invention replaces the digital blocks in the URL of the current webpage with a preset replacement character to obtain the first feature URL prefix, and replaces the digital blocks in the URL to which the page turning feature string links with the preset replacement character to obtain the second feature URL prefix. When the first feature URL prefix is the same as the second feature URL prefix, the first feature URL prefix or the second feature URL prefix is used as the associated webpage URL pattern of the current webpage. The present invention uses the page turning feature string to identify associated webpages, so recognition accuracy is high; using the common part of the URLs for matching further improves the recognition accuracy and greatly improves the recall rate. In practical applications, more than 90% of associated webpages can be identified.

Step 603: Acquire a corresponding associated webpage based on the associated webpage URL pattern.

In a specific implementation, the associated webpages may include a homepage associated webpage and other associated webpages. The homepage associated webpage generally records important content, such as the text block shown in FIG. 3, so its importance is relatively high, and it is therefore important to obtain the homepage associated webpage.

In a preferred embodiment of the present invention, the step 603 may specifically include the following sub-steps:

Sub-step S111, by performing structural analysis on the common part of the associated webpage URL pattern, extracting the page turning block in the associated webpage URL pattern, and replacing the page turning block with the first page identifier to obtain the URL of the homepage associated webpage; wherein the page turning block is a digital block having the same position but different numbers in a plurality of associated webpage URL patterns;

Sub-step S112, accessing the URL of the homepage associated webpage to obtain the homepage associated webpage.

The invention replaces the page turning block of the associated webpage URL pattern with the homepage identifier to obtain the URL of the homepage associated webpage. Similarly, the page turning block can be replaced with the identifiers of other associated webpages to obtain their URLs, which increases the coverage of associated webpages, allows the associated webpages to be obtained more completely, and enables fine-grained operations.

Step 604: Establish an associated webpage database by using an associated webpage corresponding to the associated webpage URL pattern.

In a specific implementation, the associated webpage corresponding to the webpage URL pattern may include a homepage associated webpage and other related webpages, which may be all of the associated webpages, or may be a part of all associated webpages, which is not limited by the embodiment of the present invention.

As a preferred example, the data processing of the webpage file captured by the spider may be performed, which may specifically include:

1. Webpage structuring. That is, the HTML code of the associated webpage is deleted, and the webpage content is extracted.

2. Denoising. After structuring, the HTML code has been deleted and only the content of the webpage is left. Denoising then keeps the subject content of the webpage and deletes useless content, such as copyright notices.

3. Deduplication. Find duplicate pages and content, and delete any duplicates that are found.

4. Word segmentation. Extract the content of the webpage, divide it into N words, arrange them, store them in the index library, and count how many times each word appears on the page.

5. Link analysis. Query the backlinks of the page and count its outbound links and internal links, then decide how much weight to give the page, and so on.

After the above data processing, the processed data can be stored in the associated web page database.
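A reduced sketch of items 1, 3 and 4 of the above data processing is given below. It assumes that pages arrive as a mapping from URL to HTML, that duplicates are pages whose extracted text is identical, and that words can be split on non-word characters (Chinese text would need a real word segmenter). Denoising and link analysis are left out, and all names are illustrative.

```python
import hashlib
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Item 1: drops the HTML code and keeps the visible text."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())  # normalise whitespace


def build_index(pages):
    """Return word -> {url: occurrence count}, skipping duplicate pages."""
    seen = set()
    index = {}
    for url, html in pages.items():
        text = extract_text(html)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:  # item 3: duplicate content is dropped
            continue
        seen.add(digest)
        # Item 4: split into words and count occurrences per page.
        for word, n in Counter(re.findall(r"\w+", text.lower())).items():
            index.setdefault(word, {})[url] = n
    return index
```

The resulting mapping plays the role of the index library mentioned in item 4.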

Referring to FIG. 7, a flow chart of steps of an embodiment of an associated webpage search method according to an embodiment of the present invention is shown. Specifically, the method may include the following steps:

Step 701: Receive a search request, where the request includes a search keyword;

The search request may refer to a request by the user to perform an associated information search for a certain search keyword. For example, the user can input a search keyword in the browser address bar, the search bar, the search keyword input box in the search engine, and press the enter key or click the search button, which is equivalent to receiving the user's search request.

Step 702: Perform a search in the preset related webpage database according to the search keyword, and obtain a webpage that matches the keyword;

In the background of the search engine, there is an associated webpage database for storing information about the collected associated webpages. The collected information is generally keywords or phrases that indicate the content of an associated webpage (including the webpage itself, the URL address of the webpage, the code that makes up the webpage, and the links to and from the webpage).

As a preferred example, the search keyword input by the user may first be segmented into a keyword sequence denoted by q, that is, q = {q1, q2, q3, ..., qn}. Then, according to the user's query mode (for example, whether all the words are joined together or separated by spaces) and the part of speech of each keyword in q, the importance of each word in the query for the display of the query results is determined. Once q has been segmented, the ranking of the URLs corresponding to each keyword in q is available from the index library, and the importance of each keyword has been calculated from the user's query mode and the part of speech, so only a comprehensive ranking algorithm is needed to obtain the search results.
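The query processing described above can be sketched with a simple weighted sum. The sketch assumes that the query is split on whitespace, that the index library maps each word to per-URL occurrence counts, and that keyword importance is supplied as an optional weight table; the scoring rule is a stand-in for the comprehensive ranking algorithm, which the embodiment does not specify.

```python
def search(query, index, weights=None):
    """Return the matching URLs, highest weighted score first.

    index: word -> {url: occurrence count}
    weights: optional word -> importance (defaults to 1.0 per word)
    """
    keywords = query.lower().split()  # q = {q1, q2, ..., qn}
    weights = weights or {}
    scores = {}
    for word in keywords:
        for url, n in index.get(word, {}).items():
            scores[url] = scores.get(url, 0.0) + weights.get(word, 1.0) * n
    # Ties are broken by URL so that the ordering is deterministic.
    return sorted(scores, key=lambda u: (-scores[u], u))
```

Raising the weight of one keyword shifts the ranking toward pages in which that keyword occurs more often.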

In a preferred embodiment of the invention, the associated web page database can be established in the following manner:

Sub-step S101, determining whether the captured webpage includes an associated webpage URL pattern; if so, performing sub-step S102;

In a preferred embodiment of the present invention, the sub-step S101 may specifically include the following sub-steps:

Sub-step S121, determining whether the page element of the current webpage has a page turning feature string; if yes, extracting the URL of the page turning feature string link;

In a preferred embodiment of the present invention, the sub-step S121 may further include the following sub-steps:

Sub-step S1211, using a page turning feature string to perform matching in a DOM tree node of the current webpage;

Sub-step S1212, when the matching is successful, it is determined that the current webpage has a page turning feature string.

Sub-step S122, replacing the digital block in the URL of the current webpage with a preset replacement character to obtain a first feature URL prefix; wherein the digital block is a single number or multiple digits separated by the interval identifier;

Sub-step S123, replacing the digital block in the URL of the page-turning feature string link with a preset replacement character to obtain a second feature URL prefix;

In an embodiment of the present invention, the sub-step S122 may further include the following sub-steps:

Sub-step S1221, replacing the digital block at different positions in the URL of the current webpage with the same replacement character, to obtain the first feature URL prefix;

Corresponding to sub-step S1221, the sub-step S123 may further comprise the following sub-steps:

Sub-step S1231, replacing the digital block at different positions in the URL of the feature string link with the same replacement character to obtain a second feature URL prefix.

Sub-step S1222, replacing the digital blocks at different positions in the URL of the current webpage with different replacement characters to obtain the first feature URL prefix;

Corresponding to sub-step S1222, the sub-step S123 may further comprise the following sub-steps:

Sub-step S1232, replacing each digital block in the URL of the feature string link with the same replacement character used at the same position in the first feature URL prefix, to obtain a second feature URL prefix.

Sub-step S124, when the first feature URL prefix is the same as the second feature URL prefix, determining that the crawled webpage includes an associated webpage URL pattern.

Sub-step S102, acquiring the associated webpage URL pattern;

In an embodiment of the present invention, the sub-step S102 may specifically include the following sub-steps:

Sub-step S131, the first feature URL prefix or the second feature URL prefix is used as a corresponding associated webpage URL pattern of the current webpage.

Sub-step S103, acquiring the corresponding associated webpage by using the associated webpage URL pattern;

In a preferred embodiment of the present invention, the sub-step S103 may specifically include the following sub-steps:

Sub-step S141, by performing structural analysis on the common part of the associated webpage URL pattern, extracting the page turning block in the associated webpage URL pattern, and replacing the page turning block with the first page identifier to obtain the URL of the homepage associated webpage; wherein the page turning block is a digital block having the same position but different numbers in a plurality of associated webpage URL patterns;

Sub-step S142, accessing the URL of the homepage associated webpage to obtain the homepage associated webpage.

Sub-step S104, the associated webpage database is established by using the associated webpage corresponding to the associated webpage URL pattern.

Step 703: Determine whether the webpage is an associated webpage; if yes, perform step 704;

In a specific implementation, determining whether the webpage includes an associated webpage URL pattern can determine whether the webpage is an associated webpage. That is, when the webpage includes an associated webpage URL pattern, the webpage is determined to be an associated webpage.

Step 704, returning the webpage and the homepage information associated with the webpage.

The embodiment of the present invention may store the correspondence between an associated webpage URL pattern and its webpages. The homepage associated with a webpage can then be obtained by determining the associated webpage URL pattern of that webpage and querying the stored correspondence.
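One possible way to realise such a lookup is sketched below; it is not necessarily the correspondence the embodiment stores. The sketch assumes that the store maps each associated webpage URL pattern to the position of its page turning block, and derives the homepage URL by writing the first page identifier into that block. `FirstPageIndex` and its methods are hypothetical names.

```python
import re


def url_pattern(url):
    """Associated webpage URL pattern: every digital block becomes (\\d+)."""
    return re.sub(r"\d+", lambda m: r"(\d+)", url)


class FirstPageIndex:
    def __init__(self, first_page_id="1"):
        self.first_page_id = first_page_id
        self._page_block = {}  # URL pattern -> index of its page turning block

    def register(self, pattern, page_block_index):
        self._page_block[pattern] = page_block_index

    def first_page_of(self, result_url):
        """Return the homepage URL for result_url, or None if not associated."""
        index = self._page_block.get(url_pattern(result_url))
        if index is None:
            return None  # Step 703: no known pattern, not an associated webpage
        block = list(re.finditer(r"\d+", result_url))[index]
        return (result_url[:block.start()] + self.first_page_id
                + result_url[block.end():])
```

With the pattern `http://bbs.XXX.com/forum-(\d+)-(\d+).html` registered with block index 1, a search result at http://bbs.XXX.com/forum-100-7.html would be returned together with http://bbs.XXX.com/forum-100-1.html.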

When the search results are obtained, the search engine can display the search results on the user's viewing interface for the user to use.

For the sake of brevity, the method embodiments are all described as a series of combinations of actions, but those skilled in the art will appreciate that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or at the same time. In addition, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

Referring to FIG. 8, a structural block diagram of Embodiment 1 of a device for calculating an associated webpage URL pattern according to an embodiment of the present invention is shown, which may specifically include the following modules:

The page turning feature anchor determining module 801 is adapted to determine whether the page element of the specified web page has a page turning feature anchor; if so, the associated URL extracting module 802 is invoked;

The associated URL extracting module 802 is adapted to extract the associated URL to which the page turning feature anchor is linked;

The associated webpage URL pattern calculation module 803 is adapted to calculate an associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked.

In a preferred embodiment of the present invention, the page turning feature anchor determining module 801 is further adapted to:

Matching is performed in the DOM tree node of the current webpage by using a page turning feature anchor;

When the matching is successful, it is determined that the current webpage has a page turning feature anchor.

In a preferred embodiment of the invention, the page flip feature anchor may be linked to one or more associated URLs.

In a preferred embodiment of the present invention, the associated webpage URL pattern calculation module 803 may specifically include the following modules:

a first feature URL prefix obtaining module, adapted to replace a digital block in a URL of a specified webpage with a wildcard character to obtain a first feature URL prefix; wherein the digital block is a single number or a plurality of numbers segmented by the interval identifier;

a second feature URL prefix obtaining module, configured to replace the digital block in the associated URL with a wildcard character to obtain a second feature URL prefix;

The associated webpage URL pattern obtaining module is configured to use the first feature URL prefix or the second feature URL prefix as the associated webpage URL pattern when the first feature URL prefix is the same as the second feature URL prefix.

In a preferred embodiment of the present invention, the first feature URL prefix obtaining module may further be adapted to:

Replacing the digital block at different positions in the URL of the specified webpage with the same wildcard character to obtain the first feature URL prefix;

The second feature URL prefix obtaining module may further be adapted to:

The second feature URL prefix is obtained by replacing the digital blocks at different positions in the associated URL with the same wildcard characters.

The first feature URL prefix obtaining module may also be adapted to:

Replacing the digital blocks at different positions in the URL of the specified webpage with different wildcard characters to obtain the first feature URL prefix;

The second feature URL prefix obtaining module may also be adapted to:

Replacing each digital block of the associated URL with the same wildcard character used at the same position in the first feature URL prefix, to obtain the second feature URL prefix.

For the device embodiment of FIG. 8, since it is basically similar to the method embodiment of FIG. 1, the description is relatively simple, and the relevant parts can be referred to the description of the method embodiment.

Referring to FIG. 9, a structural block diagram of Embodiment 2 of a device for calculating an associated webpage URL pattern according to an embodiment of the present invention is shown, which may specifically include the following modules:

The page turning feature anchor determining module 901 is adapted to determine whether the page element of the specified web page has a page turning feature anchor; if so, the associated URL extracting module 902 is invoked;

The associated URL extracting module 902 is adapted to extract the associated URL to which the page turning feature anchor is linked;

The associated webpage URL pattern calculation module 903 is adapted to calculate an associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked;

The homepage associated webpage URL obtaining module 904 is adapted to extract the page turning block in the associated webpage URL pattern by performing structural analysis on the common part of the associated webpage URL pattern, and to replace the page turning block with the first page identifier to obtain the URL of the homepage associated webpage; wherein the page turning block is a digital block having the same position but different numbers in a plurality of associated webpage URL patterns.

For the device embodiment of FIG. 9, since it is basically similar to the method embodiment of FIG. 4, the description is relatively simple, and the relevant parts can be referred to the description of the method embodiment.

The various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for calculating the associated webpage URL pattern in accordance with an embodiment of the present invention. The invention can also be implemented as a device or apparatus program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein. Such a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

For example, FIG. 10 illustrates a computing device, such as a user terminal device or an application server, that can implement the calculation of an associated webpage URL pattern in accordance with the present invention. The computing device conventionally includes a processor 1010 and a computer program product or computer readable medium in the form of a memory 1020. The memory 1020 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM. The memory 1020 has a storage space 1030 for program code 1031 for performing any of the above method steps. For example, the storage space 1030 for program code may include individual program codes 1031 for implementing the various steps of the above methods, respectively. The program code can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. The storage unit may have storage segments, storage spaces, and the like that are arranged similarly to the memory 1020 in the computing device of FIG. The program code can be compressed, for example, in an appropriate form. Typically, the storage unit includes computer readable code 1031', i.e., code that can be read by a processor such as 1010, which, when executed by a computing device, causes the computing device to perform each step of the methods described above.

Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. In addition, it is noted that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.

In the description provided herein, numerous specific details are set forth. However, it is understood that the embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of the description.

It is to be noted that the above-described embodiments illustrate rather than limit the invention, and that those skilled in the art may devise alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps that are not recited in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order. These words can be interpreted as names.

In addition, it should be noted that the language used in the specification has been selected principally for readability and teaching purposes, and is not intended to explain or limit the subject matter of the invention. Therefore, many modifications and changes will be apparent to those skilled in the art without departing from the scope of the invention. The disclosure of the present invention is intended to be illustrative, and not restrictive, and the scope of the invention is defined by the appended claims.

Claims

A method for calculating a URL pattern of an associated webpage, comprising:

Determining whether there is a page turning feature anchor in the page element of the specified webpage; if yes, extracting the associated URL to which the page turning feature anchor is linked;

And calculating, according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked, an associated webpage URL pattern corresponding to the specified webpage.
The method of claim 1, wherein the step of determining whether the page element of the specified web page has a page turning feature anchor comprises:

Matching is performed in the DOM tree node of the current webpage by using a page turning feature anchor;

When the matching is successful, it is determined that the current webpage has a page turning feature anchor.
The method of claim 1 wherein said page flip feature anchor is linked to one or more associated URLs.
The method of claim 1 or 2 or 3, wherein the step of calculating the associated webpage URL pattern according to the URL of the specified webpage and the associated URL further comprises:

Replacing a digital block in a URL of a specified webpage with a wildcard character to obtain a first feature URL prefix; wherein the digital block is a single number or a plurality of numbers segmented by the interval identifier;

Replacing the digital block in the associated URL with a wildcard character to obtain a second feature URL prefix;

When the first feature URL prefix is the same as the second feature URL prefix, the first feature URL prefix or the second feature URL prefix is used as an associated web page URL pattern.
The method of claim 4, wherein the step of replacing the digital block in the URL of the specified web page with the wildcard character to obtain the first feature URL prefix is:

Replacing the digital block at different positions in the URL of the specified webpage with the same wildcard character to obtain the first feature URL prefix;

The step of replacing the digital block in the associated URL with a wildcard character to obtain the second feature URL prefix is:

The second feature URL prefix is obtained by replacing the digital blocks at different positions in the associated URL with the same wildcard characters.
The method of claim 5, wherein the step of replacing the digital block in the URL of the specified web page with the wildcard character to obtain the first feature URL prefix is:

The first feature URL prefix is obtained by using different wildcard characters to replace the digital blocks in different positions in the URL of the specified webpage.

The step of replacing the digital block in the associated URL with a wildcard character to obtain the second feature URL prefix is:

The second feature URL prefix is obtained by replacing the digital block of the associated URL at the same location with the same wildcard character as the first feature URL.
The method of claim 1 or 2 or 3 or 5 or 6, further comprising:

Performing structural analysis on the common part of the associated webpage URL patterns, extracting the page turning block from the associated webpage URL pattern, and replacing the page turning block with a first page identifier to obtain the URL of the homepage of the associated webpages; wherein the page turning block is a digital block that has the same position but different numbers in multiple associated webpage URL patterns.
The method of claim 7, wherein the first page identifier comprises 0, 1, and/or a maximum value in the current associated webpage.
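By way of non-limiting illustration, locating the page turning block and substituting a first page identifier might be sketched as follows. This sketch simplifies the structural analysis to a comparison of two URLs, treats exactly one differing digital block as the page turning block, and defaults the first page identifier to "1"; the function name and these simplifications are assumptions of the example.

```python
import re
from itertools import count

# Digital block as in the claims; "-" and "_" as interval identifiers
# are an assumption of this sketch.
DIGITAL_BLOCK = re.compile(r"\d+(?:[-_]\d+)*")


def first_page_url(page_url, associated_url, first_page_id="1"):
    """Replace the single differing digital block with the first page identifier."""
    # The two URLs must share the same structure once digits are masked.
    if DIGITAL_BLOCK.sub("*", page_url) != DIGITAL_BLOCK.sub("*", associated_url):
        return None
    own = DIGITAL_BLOCK.findall(page_url)
    other = DIGITAL_BLOCK.findall(associated_url)
    differing = [i for i, (a, b) in enumerate(zip(own, other)) if a != b]
    if len(differing) != 1:
        return None  # no block, or more than one block, changes between the URLs
    target = differing[0]
    position = count()
    return DIGITAL_BLOCK.sub(
        lambda m: first_page_id if next(position) == target else m.group(0),
        page_url,
    )


print(first_page_url(
    "http://example.com/board/12/page/3.html",
    "http://example.com/board/12/page/4.html",
))  # http://example.com/board/12/page/1.html
```

Passing `first_page_id="0"` covers sites whose numbering starts at zero; choosing the maximum value would require knowledge of the page count, which this sketch does not attempt to obtain.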
A method for identifying a page number identifier in a webpage URL, comprising:

Obtaining the associated URL to which a page turning feature anchor among the page elements of the specified webpage is linked;

Calculating an associated webpage URL pattern according to the URL of the specified webpage and the associated URL;

Determining, respectively, a page number feature portion of the specified webpage URL and a page number feature portion of the associated URL based on the associated webpage URL pattern corresponding to the specified webpage;

Comparing the page number feature portion of the specified webpage URL with that of the associated URL, and extracting the differing digital part as the page number identifier of the specified webpage URL.
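As a non-limiting illustration of the comparison step, the sketch below returns the digital parts that differ between the two URLs. It assumes that the page number feature portion is the single digital block that changes between the specified webpage URL and the associated URL; the function name and the interval identifiers `-` and `_` are likewise assumptions of this example.

```python
import re

# Digital block as in the claims; "-" and "_" as interval identifiers
# are an assumption of this sketch.
DIGITAL_BLOCK = re.compile(r"\d+(?:[-_]\d+)*")


def page_number_identifiers(page_url, associated_url):
    """Return (identifier of page_url, identifier of associated_url), or None."""
    # Both URLs must match the same associated webpage URL pattern.
    if DIGITAL_BLOCK.sub("*", page_url) != DIGITAL_BLOCK.sub("*", associated_url):
        return None
    pairs = zip(DIGITAL_BLOCK.findall(page_url), DIGITAL_BLOCK.findall(associated_url))
    differing = [(own, other) for own, other in pairs if own != other]
    return differing[0] if len(differing) == 1 else None


print(page_number_identifiers(
    "http://example.com/thread/88/p2.html",
    "http://example.com/thread/88/p3.html",
))  # ('2', '3')
```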
A method for establishing an associated web page database, comprising:

Determining whether the crawled webpage includes an associated webpage URL pattern; if yes, acquiring the associated webpage URL pattern;

Obtaining a corresponding associated webpage based on the associated webpage URL pattern;

Establishing the associated webpage database by using the associated webpages corresponding to the associated webpage URL pattern.
An associated webpage search method, comprising:

Receiving a search request, the request including a search keyword;

Performing a search in a preset associated webpage database according to the search keyword to obtain a webpage matching the keyword;

Determining whether the webpage is an associated webpage; if yes, returning the webpage and the homepage information associated with the webpage.
An apparatus for calculating an associated webpage URL pattern, comprising:

a page turning feature anchor determining module, adapted to determine whether a page turning feature anchor exists in the page elements of a specified webpage, and if yes, to call the associated URL extraction module;

an associated URL extraction module, adapted to extract the associated URL to which the page turning feature anchor is linked;

an associated webpage URL pattern calculation module, adapted to calculate the associated webpage URL pattern corresponding to the specified webpage according to the URL of the specified webpage and the associated URL to which the page turning feature anchor is linked.
The apparatus according to claim 12, wherein the page turning feature anchor determining module is further adapted to:

Perform matching on the DOM tree nodes of the current webpage by using the page turning feature anchor;

When the matching is successful, determine that the current webpage has a page turning feature anchor.
The apparatus of claim 12, wherein the page turning feature anchor is linked to one or more associated URLs.
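By way of non-limiting illustration, matching page turning feature anchors against the link elements of a page might be sketched as follows. The set of anchor texts is an assumption of this example, since the claims do not enumerate them, and the sketch visits link elements in document order with the standard-library `html.parser` event parser rather than building a full DOM tree.

```python
from html.parser import HTMLParser

# Assumed page turning feature anchors; not enumerated in the claims.
PAGE_TURNING_ANCHORS = {
    "next", "next page", "previous", "previous page", "下一页", "上一页",
}


class PageTurningAnchorFinder(HTMLParser):
    """Collect the href of every <a> whose text is a page turning anchor."""

    def __init__(self):
        super().__init__()
        self._href = None       # href of the <a> currently being read
        self._text = []         # text fragments seen inside that <a>
        self.associated_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip().lower()
            if text in PAGE_TURNING_ANCHORS:
                self.associated_urls.append(self._href)
            self._href = None


def find_associated_urls(html):
    """Return the associated URLs linked from page turning anchors."""
    finder = PageTurningAnchorFinder()
    finder.feed(html)
    finder.close()
    return finder.associated_urls


page = ('<div><a href="/list_1.html">Previous page</a> '
        '<a href="/about.html">About</a> '
        '<a href="/list_3.html">Next page</a></div>')
print(find_associated_urls(page))  # ['/list_1.html', '/list_3.html']
```

An empty result corresponds to a page without a page turning feature anchor; a result with several entries corresponds to an anchor set linked to more than one associated URL.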
The apparatus according to claim 12 or 13 or 14, wherein the associated webpage URL pattern calculation module comprises:

a first feature URL prefix obtaining module, adapted to replace a digital block in the URL of the specified webpage with a wildcard character to obtain a first feature URL prefix, wherein the digital block is a single number or a plurality of numbers separated by an interval identifier;

a second feature URL prefix obtaining module, adapted to replace the digital block in the associated URL with a wildcard character to obtain a second feature URL prefix;

an associated webpage URL pattern obtaining module, adapted to use the first feature URL prefix or the second feature URL prefix as the associated webpage URL pattern when the first feature URL prefix is the same as the second feature URL prefix.
The apparatus according to claim 15, wherein the first feature URL prefix obtaining module is further adapted to:

Replace the digital blocks at different positions in the URL of the specified webpage with the same wildcard character to obtain the first feature URL prefix;

The second feature URL prefix obtaining module is further adapted to:

Replace the digital blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
The apparatus according to claim 16, wherein the first feature URL prefix obtaining module is further adapted to:

Replace the digital blocks at different positions in the URL of the specified webpage with different wildcard characters to obtain the first feature URL prefix;

The second feature URL prefix obtaining module is further adapted to:

Replace the digital block at each position in the associated URL with the same wildcard character as was used at that position in the first feature URL prefix, to obtain the second feature URL prefix.
The apparatus of claim 12 or 13 or 14 or 16 or 17, further comprising:

a homepage associated webpage URL obtaining module, adapted to perform structural analysis on the common part of the associated webpage URL patterns, extract the page turning block from the associated webpage URL pattern, and replace the page turning block with a first page identifier to obtain the URL of the homepage of the associated webpages; wherein the page turning block is a digital block that has the same position but different numbers in multiple associated webpage URL patterns.
The apparatus of claim 18, wherein the first page identifier comprises 0, 1, and/or a maximum value in the current associated webpage.
The apparatus of claim 12, further comprising:

a page number feature portion determining module, adapted to determine, respectively, a page number feature portion of the specified webpage URL and a page number feature portion of the associated URL based on the associated webpage URL pattern corresponding to the specified webpage;

a page number identifier determining module, adapted to compare the page number feature portion of the specified webpage URL with that of the associated URL, and to extract the differing digital part as the page number identifier of the specified webpage URL.
The apparatus of claim 12, further comprising:

an associated webpage database establishing module, adapted to establish an associated webpage database by using the associated webpages corresponding to the associated webpage URL pattern.
The apparatus of claim 21, further comprising:

a search request receiving module, adapted to receive a search request, the request including a search keyword;

a matching webpage obtaining module, adapted to perform a search in the preset associated webpage database according to the search keyword to obtain a webpage matching the keyword;

a multi-page associated webpage judging module, adapted to determine whether the webpage is an associated webpage, and if yes, to invoke the information returning module;

an information returning module, adapted to return the webpage and the homepage information associated with the webpage.
A computer program comprising computer readable code which, when run on a computing device, causes the computing device to perform the method for calculating an associated webpage URL pattern according to any one of claims 1-8.
A computer readable medium storing the computer program of claim 23.