CN109948013B

CN109948013B - Webpage processing method and device

Info

Publication number: CN109948013B
Application number: CN201710705406.3A
Authority: CN
Inventors: 曹志明
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2017-08-16
Filing date: 2017-08-16
Publication date: 2021-11-05
Anticipated expiration: 2037-08-16
Also published as: CN109948013A

Abstract

The invention discloses a webpage processing method and device. The method comprises: crawling hyperlink information on all pages of a target website, wherein the pages of the target website comprise a home page, column list pages, and content pages; establishing a navigation relationship between pages according to the URLs of the crawled pages and the hyperlink information to obtain a first navigation relationship; and determining a content page associated with a target column list page according to the first navigation relationship. The invention solves the technical problem that the content page associated with a column list page cannot be quickly and accurately determined in the prior art.

Description

Webpage processing method and device

Technical Field

The invention relates to the field of internet, in particular to a webpage processing method and device.

Background

Websites generally contain two types of pages: a content page and a list page. The content page is a page containing specific article information; the list page plays a page navigation role, and the arranged hyperlink list is used for navigating to the content page. The website column is a category of website contents, for example, a general portal may have columns of "news", "sports", "entertainment", and the like. The web site column page is typically a list page used to navigate through its various content pages.

In the business of crawling website data, accurately acquiring the data of all content pages under a website column list page is of practical significance. For example, whether a website has been updated can be determined by observing the update status of its content pages and of its column list pages.

Web pages are files in HTML format, and the list part of a list page is generally composed of a plurality of li tags.

XPath is a language for locating information in HTML documents; for example, /li/h4 finds all h4 elements nested under li elements.
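As a rough illustration of the XPath lookup described above, the following minimal sketch uses Python's standard xml.etree.ElementTree module, which supports only a limited XPath subset. The markup in LIST_FRAGMENT and the function name list_item_titles are hypothetical and are not taken from the patent; real HTML pages are often not well-formed XML and would need a more tolerant parser.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, well-formed list fragment standing in for the list part of a list page.
LIST_FRAGMENT = """
<ul>
  <li><h4>First article</h4></li>
  <li><h4>Second article</h4></li>
  <li><p>No heading here</p></li>
</ul>
"""

def list_item_titles(markup: str) -> List[str]:
    """Return the text of every h4 element that is a direct child of an li element."""
    root = ET.fromstring(markup)
    # ".//li/h4" selects h4 children of any li element below the root.
    return [h4.text for h4 in root.findall(".//li/h4")]
```

The path expression is tied to the page layout, which is why a change of layout can invalidate it.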

The existing method for acquiring all content pages of a website column page is to crawl the list items of the column page, which requires inspecting the HTML source file of the column page to find the XPath of the list items. After the crawler downloads the column page, the XPath is used to parse the text and obtain the list items.

The prior-art method for acquiring the content pages associated with a website column list page mainly has the following defects: obtaining the XPath of the list items of a column page requires considerable manual work; moreover, when a website is redesigned, the XPath may change, which can make the parsing inaccurate.

In view of the above problems, no effective solution has been proposed.

Disclosure of Invention

The embodiment of the invention provides a webpage processing method and a webpage processing device, which are used for at least solving the technical problem that a content page related to a column list page cannot be quickly and accurately determined in the prior art.

According to an aspect of an embodiment of the present invention, there is provided a web page processing method, including: crawling hyperlink information on all pages of a target website, wherein the pages of the target website comprise: a home page, a column list page, and a content page; establishing a navigation relationship between pages according to the crawled website of the page and the hyperlink information to obtain a first navigation relationship; and determining a content page associated with the target column list page according to the first navigation relation.

Further, a navigation relationship between the pages is established according to the crawled website of the page and the hyperlink information, so as to obtain a first navigation relationship, which comprises: crawling hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of the target website; establishing a navigation relationship between the first page and other pages according to the website of the first page and the first hyperlink information to obtain a second navigation relationship; and summarizing all the second navigation relations to obtain the first navigation relation.

Further, establishing a navigation relationship between the first page and other pages according to the website of the first page and the first hyperlink information to obtain a second navigation relationship, including: taking a page linked by the first hyperlink information in the first page as a second page; and drawing the navigation relation between the first page and all the second pages to obtain the second navigation relation.

Further, determining a content page associated with the target column list page according to the first navigation relationship includes: screening out third pages according to the first navigation relationship, wherein all the screened-out third pages form a third page set, and a third page is a page having a bidirectional navigation relationship with the target column list page; screening out fourth pages from the third page set, wherein all the screened-out fourth pages form a fourth page set, and a fourth page is a page in the third page set other than the column list pages and the home page; and taking the pages in the fourth page set as the content pages associated with the target column list page.

Further, screening out a fourth page from the third page set includes: acquiring the URLs of the home page and all column list pages of the target website; matching the URL of each third page in the third page set in turn against the URL of the home page of the target website and the URLs of all the column list pages; and if the URL of a third page matches neither the URL of the home page of the target website nor the URL of any column list page, determining that the third page is a fourth page.

According to another aspect of the embodiments of the present invention, there is also provided a web page processing apparatus, including: the crawling unit is used for crawling hyperlink information on all pages of the target website, wherein the pages of the target website comprise: a home page, a column list page, and a content page; the establishing unit is used for establishing a navigation relation between pages according to the crawled website of the page and the hyperlink information to obtain a first navigation relation; and the determining unit is used for determining the content page associated with the target column list page according to the first navigation relation.

Further, the crawled pages are multiple, and the establishing unit comprises: the crawling sub-unit is used for crawling hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of the target website; the establishing subunit is used for establishing a navigation relationship between the first page and other pages according to the website of the first page and the first hyperlink information to obtain a second navigation relationship; and the summarizing subunit is used for summarizing all the second navigation relationships to obtain the first navigation relationship.

Further, the establishing subunit includes: a first determining module, configured to use a page linked with the first hyperlink information in the first page as a second page; and the drawing module is used for drawing the navigation relationship between the first page and all the second pages to obtain the second navigation relationship.

Further, the determining unit includes: a searching subunit, configured to screen out third pages according to the first navigation relationship, where all the screened-out third pages form a third page set, and a third page is a page having a bidirectional navigation relationship with the target column list page; a screening subunit, configured to screen out fourth pages from the third page set, where all the screened-out fourth pages form a fourth page set, and a fourth page is a page in the third page set other than the column list pages and the home page; and a determining subunit, configured to take the pages in the fourth page set as the content pages associated with the target column list page.

Further, the screening subunit includes: an acquisition module, configured to acquire the URLs of the home page and all the column list pages of the target website; a matching module, configured to match the URL of each third page in the third page set in turn against the URL of the home page of the target website and the URLs of all the column list pages; and a second determining module, configured to determine that a third page is a fourth page when the URL of the third page matches neither the URL of the home page of the target website nor the URL of any column list page.

According to still another aspect of the embodiments of the present invention, there is also provided a storage medium having a program stored thereon, the program implementing the web page processing method when executed by a processor.

According to still another aspect of the embodiments of the present invention, there is also provided a processor, configured to execute a program, where the program executes the web page processing method when running.

In the embodiment of the invention, the pages having a bidirectional navigation relationship with the target column list page, other than the home page and other column list pages, are the content pages associated with the target column list page. The navigation relationship between the pages, namely the first navigation relationship, is established according to the URLs of the crawled pages and the hyperlink information in the crawled pages, and the home page and other column list pages are then removed according to the first navigation relationship, so that the content pages associated with the target column list page can be obtained.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow diagram of a method of web page processing according to an embodiment of the invention;

FIG. 2 is a schematic diagram of a content page according to an embodiment of the invention;

FIG. 3 is a schematic diagram of a column list page according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating a second navigational relationship, in accordance with embodiments of the present invention;

FIG. 5 is a schematic diagram of yet another second navigational relationship, in accordance with embodiments of the present invention;

FIG. 6 is a schematic diagram of yet another second navigational relationship, in accordance with embodiments of the present invention;

FIG. 7 is a schematic diagram of a first navigational relationship, in accordance with embodiments of the present invention;

FIG. 8 is a schematic diagram of a web page processing apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

According to an embodiment of the present invention, there is provided an embodiment of a web page processing method. It should be noted that the steps shown in the flowchart of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from that shown here.

Fig. 1 is a flowchart of a web page processing method according to an embodiment of the present invention. As shown in fig. 1, the method comprises the steps of:

Step S102, crawling hyperlink information on all pages of a target website, wherein the pages of the target website comprise: a home page, column list pages, and content pages.

Step S104, establishing a navigation relationship between the pages according to the URLs of the crawled pages and the hyperlink information to obtain a first navigation relationship.

Step S106, determining a content page associated with the target column list page according to the first navigation relationship.
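The hyperlink collection of step S102 could be sketched roughly as follows for a single, already downloaded page. This is only an illustration using Python's standard html.parser and urllib.parse modules; the class name LinkCollector and the function extract_links are hypothetical, downloading the pages is left out, and dropping URL fragments is an added assumption so that links to different anchors of the same page count as one page.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_links(page_url, html):
    """Return the set of absolute URLs that the page at page_url links to."""
    collector = LinkCollector()
    collector.feed(html)
    links = set()
    for href in collector.hrefs:
        # Resolve relative links against the page URL, then drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        links.add(absolute)
    return links
```

Keeping the page URL together with its set of outgoing links is all that the later steps need; no layout-specific path expression is involved.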

The content page is a page containing information of a specific article, and fig. 2 shows one content page.

The column list page serves a page navigation function: its arranged list of hyperlinks is used to navigate to the content pages. Fig. 3 shows a column list page.

A column is a category of website content; for example, a general portal may have columns such as "news", "sports", and "entertainment". A column list page is typically presented in list form and is used to navigate to its various content pages.

An A page is considered to be able to navigate to a B page if the A page contains a hyperlink that can link to the B page.

If the B page contains a hyperlink that can link to the A page, then the B page is considered to be able to navigate to the A page.

An A page is considered to have a bi-directional navigational relationship with a B page if the A page contains a hyperlink that can link to the B page and the B page contains a hyperlink that can link to the A page.
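The three definitions above can be written down directly. In this minimal sketch the hyperlinks are assumed to be held in a dictionary that maps each page to the set of pages it links to; the names can_navigate and is_bidirectional are illustrative only.

```python
def can_navigate(links, a, b):
    """True if page a contains a hyperlink that links to page b."""
    return b in links.get(a, set())

def is_bidirectional(links, a, b):
    """True if a can navigate to b and b can navigate back to a."""
    return can_navigate(links, a, b) and can_navigate(links, b, a)

# Hypothetical example: A and B link to each other, B also links to C.
EXAMPLE_LINKS = {"A": {"B"}, "B": {"A", "C"}, "C": set()}
```

With EXAMPLE_LINKS, pages A and B have a bidirectional navigation relationship, while B and C do not, because C has no hyperlink back to B.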

The pages of the target website include: home page, column list page, content page. The inventor finds out through a great deal of research that: the home page and the plurality of column list pages have a bidirectional navigation relation; one column list page and other column list pages have a bidirectional navigation relationship; one column list page has a two-way navigation relationship with its own content page and no two-way navigation relationship with the content pages of the other column list pages.

TABLE 1

For example, as shown in table 1, there are 3 column list pages in total, wherein there are 6 content pages of the column list page L1, which are respectively content page P (1, 1), content page P (1, 2), content page P (1, 3), content page P (1, 4), content page P (1, 5), and content page P (1, 6), and the column list page L1 has a two-way navigation relationship with these 6 content pages.

The column list page L2 has 11 content pages, namely content pages P (2, 1), P (2, 2), P (2, 3), P (2, 4), P (2, 5), P (2, 6), P (2, 7), P (2, 8), P (2, 9), P (2, 10), and P (2, 11), and the column list page L2 has a two-way navigation relationship with these 11 content pages.

The column list page L3 has 8 content pages, namely content pages P (3, 1), P (3, 2), P (3, 3), P (3, 4), P (3, 5), P (3, 6), P (3, 7), and P (3, 8), and the column list page L3 has a two-way navigation relationship with these 8 content pages.

That is, each column list page has a two-way navigation relationship with its own content pages and no two-way navigation relationship with the content pages of other column list pages.

In the embodiment of the invention, the pages having a bidirectional navigation relationship with the target column list page, other than the home page and other column list pages, are the content pages associated with the target column list page. The navigation relationship between the pages of the target website, namely the first navigation relationship, is established according to the URLs of the crawled pages of the target website and the hyperlink information in the crawled pages. The home page and other column list pages are then removed according to the first navigation relationship, which yields the content pages associated with the target column list page. The process requires no manual participation, and the result is not affected even if the website is redesigned. This solves the technical problem that the content page associated with a column list page cannot be determined quickly and accurately in the prior art, and achieves the technical effect of quickly and accurately determining the content page associated with a column list page.

Optionally, the establishing a navigation relationship between the first page and other pages according to the website of the first page and the first hyperlink information to obtain a second navigation relationship includes: taking a page linked by the first hyperlink information in the first page as a second page; and drawing the navigation relation between the first page and all the second pages to obtain a second navigation relation.

Optionally, the step of establishing a navigation relationship between the pages according to the crawled website and hyperlink information of the pages to obtain a first navigation relationship includes: crawling hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of a target website; establishing a navigation relation between the first page and other pages according to the website of the first page and the first hyperlink information to obtain a second navigation relation; and summarizing all the second navigation relations to obtain the first navigation relation.

The first page is any page of the target website, the hyperlink information in the first page is crawled to obtain first hyperlink information, and the navigation relation between the first page and other pages is established according to the website of the first page and the first hyperlink information to obtain a second navigation relation. The specific process of establishing the second navigation relationship may be as follows:

Assuming that the first hyperlink information points (links) to a page, the first page can navigate to that page; if that page in turn contains a hyperlink that links to the first page, the first page has a bidirectional navigation relationship with that page.

Assuming that a target website has M pages, each of which is treated in turn as the first page, a second navigation relationship is established for each page according to its URL and its first hyperlink information. M second navigation relationships are thus obtained in total, and they are summarized to obtain the first navigation relationship, which is a comprehensive navigation relationship.
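One way to picture this summarizing step is to represent each second navigation relationship as a set of directed edges and to take their union. This is a sketch under that assumption; the patent does not prescribe a data structure, and the names second_navigation, first_navigation, and CRAWLED are hypothetical.

```python
def second_navigation(first_page, first_hyperlinks):
    """Directed edges from one page to every second page its hyperlinks point to."""
    return {(first_page, second_page) for second_page in first_hyperlinks}

def first_navigation(second_navigations):
    """Summarize all second navigation relationships into one edge set."""
    merged = set()
    for edges in second_navigations:
        merged |= edges
    return merged

# Hypothetical crawl result for a site with M = 3 pages: page -> linked pages.
CRAWLED = {
    "home": ["list1"],
    "list1": ["home", "content1"],
    "content1": ["list1"],
}
RELATIONS = [second_navigation(page, links) for page, links in CRAWLED.items()]
FIRST_NAVIGATION = first_navigation(RELATIONS)
```

Here RELATIONS holds M = 3 second navigation relationships and FIRST_NAVIGATION is the comprehensive relationship built from them.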

A second navigation relation of the website home page obtained by crawling the website home page is shown in fig. 4; a second navigation relationship of the column list page 1 is obtained by crawling the column list page 1 and is shown in fig. 5; the second navigation relationship between the column list page 1 and the content page 1 obtained by crawling the column list page 1 and the content page 1 is shown in fig. 6. The 3 second navigation relationships are integrated into a graph to obtain the first navigation relationship shown in fig. 7.

As can be seen from fig. 7, the column list page 1 has a two-way navigation relationship with the home page, the column list page 2, the column list page 1-content page 1, the column list page 1-content page 2, and the column list page 1-content page 3, whereas the column list page 2 has no two-way navigation relationship with the column list page 1-content page 1, the column list page 1-content page 2, or the column list page 1-content page 3. That is, the pages having a two-way navigation relationship with a specific column list page (the target column list page), other than the home page and other column list pages, are the content pages associated with that column list page.
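The observation drawn from fig. 7 can be checked on a small edge set. In the sketch below the bidirectional pairs follow the description of fig. 7 in the text; the one-way links from the content pages to column list page 2 are an added assumption (for example a navigation bar) included only to show that one-way links do not count. The page labels and the function bidirectional_partners are illustrative.

```python
# Directed hyperlink edges (source page, linked page).
EDGES = set()

def link_both_ways(a, b):
    """Record a hyperlink from a to b and one from b to a."""
    EDGES.add((a, b))
    EDGES.add((b, a))

link_both_ways("home", "list1")
link_both_ways("home", "list2")
link_both_ways("list1", "list2")
for content in ("list1-content1", "list1-content2", "list1-content3"):
    link_both_ways("list1", content)
    EDGES.add((content, "list2"))  # assumed one-way link, not stated in the text

def bidirectional_partners(edges, target):
    """Pages that the target links to and that link back to the target."""
    return {dst for (src, dst) in edges if src == target and (dst, target) in edges}
```

For "list1" the result contains the home page, "list2", and the three content pages of column list page 1; for "list2" it contains only the home page and "list1".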

Optionally, determining the content page associated with the target column list page according to the first navigation relationship comprises: screening out third pages according to the first navigation relationship, wherein all the screened-out third pages form a third page set, and a third page is a page having a bidirectional navigation relationship with the target column list page; screening out fourth pages from the third page set, wherein all the screened-out fourth pages form a fourth page set, and a fourth page is a page in the third page set other than the column list pages and the home page; and taking the pages in the fourth page set as the content pages associated with the target column list page.

Optionally, the step of screening out a fourth page from the third page set includes: acquiring the URLs of the home page and all column list pages of the target website; matching the URL of each third page in the third page set in turn against the URL of the home page of the target website and the URLs of all the column list pages; and if the URL of a third page matches neither the URL of the home page of the target website nor the URL of any column list page, determining that the third page is a fourth page.

The pages that have a two-way navigation relationship with the target column list page, other than the home page and other column list pages, are the content pages associated with the target column list page.

According to the first navigation relationship, all pages having a bidirectional navigation relationship with the target column list page can be determined, that is, a plurality of third pages are determined. Eliminating the home page and other column list pages from the third pages yields the content pages associated with the target column list page.

If the URL of the third page matches the URL of the home page of the target website, the third page is determined to be the home page of the website.

If the URL of the third page matches the URL of a column list page, the third page is determined to be that column list page.

If the URL of the third page matches neither the URL of the home page of the target website nor the URL of any column list page, the third page is determined to be neither the home page nor a column list page, but a content page associated with the target column list page, namely a fourth page. There may be a plurality of fourth pages.
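The matching rules above amount to a filter over the third page set. In this minimal sketch "matching" is assumed to mean exact string equality of URLs, which the text does not specify; a real crawler might first normalize the URLs. The function name screen_fourth_pages and the example.com addresses are hypothetical.

```python
def screen_fourth_pages(third_pages, home_url, column_list_urls):
    """Keep the third pages whose URL matches neither the home page nor any column list page."""
    known = {home_url, *column_list_urls}
    return [url for url in third_pages if url not in known]

# Hypothetical URLs for illustration.
HOME = "http://example.com/"
COLUMN_LISTS = ["http://example.com/news/", "http://example.com/sports/"]
THIRD_PAGES = [
    "http://example.com/",
    "http://example.com/sports/",
    "http://example.com/news/a1.html",
    "http://example.com/news/a2.html",
]
FOURTH_PAGES = screen_fourth_pages(THIRD_PAGES, HOME, COLUMN_LISTS)
```

With the target column list page taken to be the "news" page, the two article URLs remain as the fourth page set.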

In the embodiment of the invention, the URLs of all column list pages and of the home page are collected; the hyperlink information in the pages of the website is crawled; and a navigation relationship graph is established for each page according to the URL of the current page and the hyperlink information in the page. A comprehensive navigation relationship graph is then established from all the pages, and the pages having a bidirectional navigation relationship with the specified column list page are searched for; the other column list pages and the home page are excluded from these pages, and the remaining pages are all content pages associated with the column list page.

In the existing method, the list items under a website column are generally crawled, which requires inspecting the HTML source file of the column page to find the XPath of the list items. After the crawler downloads the column page, the text is parsed with the XPath to obtain the target URLs. XPath is a query language that carries learning and operating costs, which increases labor costs. The webpage processing method provided by the embodiment of the invention avoids using XPath to acquire the content page data of the column list page, and reduces the labor cost.

The webpage processing method provided by the embodiment of the invention exploits the inherent regularity of hyperlink navigation between list pages and content pages to accurately acquire the pages associated with a column list page, namely the content pages of the column list page. Because this regularity does not depend on the version of the website, the method remains applicable when the website is redesigned, and incurs no subsequent maintenance cost.

The embodiment of the invention also provides a webpage processing device, which can execute the webpage processing method, and the webpage processing method can also be executed by the webpage processing device. Fig. 8 is a schematic diagram of a web page processing apparatus according to an embodiment of the present invention, as shown in fig. 8, the apparatus including: the crawling unit 10, the establishing unit 20 and the determining unit 30.

A crawling unit 10, configured to crawl hyperlink information on all pages of a target website, where the pages of the target website include: home page, column list page, content page.

The establishing unit 20 is configured to establish a navigation relationship between the pages according to the crawled website of the page and the hyperlink information, so as to obtain a first navigation relationship.

A determining unit 30, configured to determine a content page associated with the target column list page according to the first navigation relationship.

Optionally, a plurality of pages are crawled, and the establishing unit 20 includes: a crawling subunit, an establishing subunit, and a summarizing subunit. The crawling subunit is used for crawling the hyperlink information on the first page to obtain the first hyperlink information, wherein the first page is any page of the target website. The establishing subunit is used for establishing a navigation relationship between the first page and other pages according to the URL of the first page and the first hyperlink information to obtain a second navigation relationship. The summarizing subunit is used for summarizing all the second navigation relationships to obtain the first navigation relationship.

Optionally, the establishing subunit comprises: a first determining module and a drawing module. The first determining module is used for taking the pages linked by the first hyperlink information in the first page as second pages. The drawing module is used for drawing the navigation relationship between the first page and all the second pages to obtain a second navigation relationship.

Optionally, the determining unit 30 includes: a searching subunit, a screening subunit, and a determining subunit. The searching subunit is used for screening out third pages according to the first navigation relationship, wherein all the screened-out third pages form a third page set, and a third page is a page having a bidirectional navigation relationship with the target column list page. The screening subunit is used for screening out fourth pages from the third page set, wherein all the screened-out fourth pages form a fourth page set, and a fourth page is a page in the third page set other than the column list pages and the home page. The determining subunit is used for taking the pages in the fourth page set as the content pages associated with the target column list page.

Optionally, the screening subunit comprises: an acquisition module, a matching module, and a second determining module. The acquisition module is used for acquiring the URLs of the home page and all the column list pages of the target website. The matching module is used for matching the URL of each third page in the third page set in turn against the URL of the home page of the target website and the URLs of all the column list pages. The second determining module is used for determining that a third page is a fourth page when the URL of the third page matches neither the URL of the home page of the target website nor the URL of any column list page.

The web page processing device comprises a processor and a memory, wherein the crawling unit 10, the establishing unit 20, the determining unit 30 and the like are stored in the memory as program units, and the program units stored in the memory are executed by the processor to realize corresponding functions.

The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the content page associated with the target column list page is determined by adjusting kernel parameters.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.

An embodiment of the present invention provides a storage medium having a program stored thereon, which when executed by a processor implements the web page processing method.

The embodiment of the invention provides a processor, which is used for running a program, wherein the webpage processing method is executed when the program runs.

The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein the processor executes the program and realizes the following steps:

Crawling hyperlink information on all pages of a target website, wherein the pages of the target website comprise: a home page, a column list page, and a content page; establishing a navigation relationship between pages according to the URLs of the crawled pages and the hyperlink information to obtain a first navigation relationship; and determining a content page associated with the target column list page according to the first navigation relationship.

Crawling hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of a target website; establishing a navigation relation between the first page and other pages according to the website of the first page and the first hyperlink information to obtain a second navigation relation; and summarizing all the second navigation relations to obtain the first navigation relation.

Taking a page linked by the first hyperlink information in the first page as a second page; and drawing the navigation relation between the first page and all the second pages to obtain a second navigation relation.
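As an illustration of how the second pages of a single first page might be collected, the sketch below parses the page's HTML with Python's standard `html.parser` module and resolves each `href` against the first page's web address. The class and function names (`LinkCollector`, `second_pages`, `draw_second_navigation`) are hypothetical, and dropping URL fragments is an assumption made here, not something the text specifies.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects absolute addresses of all <a href="..."> targets on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links; drop "#fragment" (an assumption).
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.add(absolute)


def second_pages(first_page_url, html):
    """Pages linked by the hyperlink information found on the first page."""
    collector = LinkCollector(first_page_url)
    collector.feed(html)
    return collector.links


def draw_second_navigation(first_page_url, html):
    """Edges from the first page to each of its second pages."""
    return {(first_page_url, url) for url in second_pages(first_page_url, html)}


sample_html = '<a href="1.html">Article</a><a href="/">Home</a><a name="x">no link</a>'
edges = draw_second_navigation("http://example.com/news/", sample_html)
```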

Screening out third pages according to the first navigation relationship, wherein all the screened out third pages form a third page set, and a third page is a page having a bidirectional navigation relationship with the target column list page; screening out fourth pages from the third page set, wherein all the screened out fourth pages form a fourth page set, and a fourth page is a page in the third page set other than the column list pages and the home page; and taking the pages in the fourth page set as content pages associated with the target column list page.
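The bidirectional screening can be sketched as a lookup in both directions of the navigation relationship. The sketch assumes the first navigation relationship is held as a mapping from each page address to the set of addresses it links to; the function name `third_page_set` and the sample addresses are hypothetical, and excluding the target page itself is an assumption.

```python
def third_page_set(graph, target_list_page):
    """Pages linked from the target column list page that also link back to it."""
    linked_from_target = graph.get(target_list_page, set())
    return {
        page
        for page in linked_from_target
        # Skip self-links (an assumption) and require the reverse edge.
        if page != target_list_page and target_list_page in graph.get(page, set())
    }


# Hypothetical first navigation relationship.
news = "http://example.com/news/"
graph = {
    "http://example.com/": {news, "http://example.com/sports/"},
    news: {
        "http://example.com/",
        "http://example.com/sports/",
        "http://example.com/news/1.html",
        "http://example.com/news/2.html",
    },
    "http://example.com/sports/": {"http://example.com/", news},
    "http://example.com/news/1.html": {news},
    "http://example.com/news/2.html": set(),  # does not link back
}
third = third_page_set(graph, news)
```

In this example the third page set still contains the home page and the other column list page, which is why the subsequent screening into a fourth page set is needed.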

Acquiring the web addresses of the home page and all column list pages of the target website; sequentially matching the web address of each third page in the third page set against the web address of the home page of the target website and the web addresses of all the column list pages; and if the web address of a third page matches neither the web address of the home page nor the web address of any column list page, determining that the third page is a fourth page.
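A minimal sketch of this matching step is given below. The text does not say how two web addresses are compared, so the `normalize` helper (lowercasing scheme and host, ignoring a trailing slash and any fragment) is purely an assumption, as are the function names and sample addresses.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Hypothetical canonical form: lowercase scheme/host, no fragment, no trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def fourth_page_set(third_pages, home_url, column_list_urls):
    """Keep third pages whose address matches neither the home page nor any column list page."""
    excluded = {normalize(home_url)} | {normalize(u) for u in column_list_urls}
    return {page for page in third_pages if normalize(page) not in excluded}


third_pages = {
    "http://example.com/",
    "http://example.com/sports/",
    "http://example.com/news/1.html",
}
content_pages = fourth_page_set(
    third_pages,
    home_url="http://Example.com",
    column_list_urls=["http://example.com/news/", "http://example.com/sports"],
)
```

Collecting the excluded addresses into a set first replaces the sequential pairwise comparison with a single membership test per page; the outcome is the same as matching one address at a time.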

The device herein may be a server, a PC, a PAD, a mobile phone, etc.

The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the method steps of the web page processing method described above.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method for processing a web page, comprising:

crawling hyperlink information on all pages of a target website, wherein the pages of the target website comprise: a home page, a column list page, and a content page;

establishing a navigation relationship between pages according to the web addresses of the crawled pages and the hyperlink information to obtain a first navigation relationship;

determining a content page associated with a target column list page according to the first navigation relationship;

wherein determining the content page associated with the target column list page according to the first navigation relationship comprises:

screening out a third page according to the first navigation relationship, wherein all the screened out third pages form a third page set, and the third page is a page having a bidirectional navigation relationship with the target column list page;

screening out a fourth page from the third page set, wherein all the screened out fourth pages form a fourth page set, and the fourth page is a page in the third page set other than the column list pages and the home page;

taking the page in the fourth page set as a content page associated with the target column list page;

wherein screening out the fourth page from the third page set comprises:

acquiring the web addresses of the home page and all column list pages of the target website;

sequentially matching the web addresses of all the third pages in the third page set against the web address of the home page of the target website and the web addresses of all the column list pages;

and if the web address of the third page fails to match the web address of the home page of the target website and the web addresses of all the column list pages, determining that the third page is the fourth page.

2. The method of claim 1, wherein establishing a navigation relationship between pages according to the crawled web address of the page and the hyperlink information to obtain a first navigation relationship comprises:

crawling hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of the target website;

establishing a navigation relationship between the first page and other pages according to the web address of the first page and the first hyperlink information to obtain a second navigation relationship;

and summarizing all the second navigation relationships to obtain the first navigation relationship.

3. The method of claim 2, wherein establishing a navigation relationship between the first page and other pages according to the web address of the first page and the first hyperlink information to obtain a second navigation relationship comprises:

taking a page linked by the first hyperlink information in the first page as a second page;

and drawing the navigation relationship between the first page and all the second pages to obtain the second navigation relationship.

4. A web page processing apparatus, comprising:

the crawling unit is used for crawling hyperlink information on all pages of the target website, wherein the pages of the target website comprise: a home page, a column list page, and a content page;

the establishing unit is used for establishing a navigation relationship between pages according to the web addresses of the crawled pages and the hyperlink information to obtain a first navigation relationship;

the determining unit is used for determining a content page associated with the target column list page according to the first navigation relationship;

wherein the determining unit comprises:

the searching subunit is configured to screen out a third page according to the first navigation relationship, where all the screened out third pages form a third page set, and the third page is a page having a bidirectional navigation relationship with the target column list page;

a screening subunit, configured to screen out a fourth page from the third page set, wherein all the screened out fourth pages form a fourth page set, and the fourth page is a page in the third page set other than the column list pages and the home page;

a determining subunit, configured to use a page in the fourth page set as a content page associated with the target column list page;

wherein the screening subunit comprises:

the acquisition module is used for acquiring the web addresses of the home page and all the column list pages of the target website;

the matching module is used for sequentially matching the web addresses of all the third pages in the third page set against the web address of the home page of the target website and the web addresses of all the column list pages;

and the second determining module is used for determining that the third page is the fourth page if the web address of the third page fails to match the web address of the home page of the target website and the web addresses of all the column list pages.

5. The apparatus of claim 4, wherein there are multiple crawled pages, and the establishing unit comprises:

the crawling sub-unit is used for crawling hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of the target website;

the establishing subunit is used for establishing a navigation relationship between the first page and other pages according to the web address of the first page and the first hyperlink information to obtain a second navigation relationship;

and the summarizing subunit is used for summarizing all the second navigation relationships to obtain the first navigation relationship.

6. The apparatus of claim 5, wherein the establishing subunit comprises:

a first determining module, configured to use a page linked by the first hyperlink information in the first page as a second page;

and the drawing module is used for drawing the navigation relationship between the first page and all the second pages to obtain the second navigation relationship.

7. A storage medium on which a program is stored, the program implementing the web page processing method according to any one of claims 1 to 3 when executed by a processor.