CN109948013B - Webpage processing method and device - Google Patents


Info

Publication number
CN109948013B
Authority
CN
China
Prior art keywords
page
pages
navigation
website
column list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710705406.3A
Other languages
Chinese (zh)
Other versions
CN109948013A (en)
Inventor
曹志明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201710705406.3A priority Critical patent/CN109948013B/en
Publication of CN109948013A publication Critical patent/CN109948013A/en
Application granted granted Critical
Publication of CN109948013B publication Critical patent/CN109948013B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a webpage processing method and device. The method comprises the following steps: crawling hyperlink information on all pages of a target website, wherein the pages of the target website comprise: a home page, a column list page, and a content page; establishing a navigation relationship between the pages according to the web addresses (URLs) of the crawled pages and the hyperlink information to obtain a first navigation relationship; and determining a content page associated with a target column list page according to the first navigation relationship. The invention solves the technical problem that the content page associated with a column list page cannot be quickly and accurately determined in the prior art.

Description

Webpage processing method and device
Technical Field
The invention relates to the field of internet, in particular to a webpage processing method and device.
Background
Websites generally contain two types of pages: content pages and list pages. A content page is a page containing the information of a specific article; a list page serves a navigation role, presenting an ordered list of hyperlinks for navigating to content pages. A website column is a category of website content; for example, a general portal may have columns such as "news", "sports", and "entertainment". A website column page is typically a list page used to navigate to the content pages of that column.
In the business of crawling website data, accurately acquiring the data of all content pages of a website column list page is of practical significance. For example, whether a website has been updated can be determined by observing the update status of the content pages and of the column list page.
Web pages are files in HTML format, and the list part of a list page is generally composed of a plurality of li tags.
XPath is a language for locating information in HTML documents; for example, /li/h4 represents finding all h4 elements nested under li elements.
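As an illustration of the kind of XPath lookup described above, the following is a minimal sketch using Python's standard library. The HTML fragment and its addresses are hypothetical, and `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup, so the path is written as `.//li/h4` relative to the root element rather than as `/li/h4`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment of a column list page.
LIST_HTML = """
<ul>
  <li><h4><a href="/news/1.html">First article</a></h4></li>
  <li><h4><a href="/news/2.html">Second article</a></h4></li>
  <li><span>Not a heading</span></li>
</ul>
"""

root = ET.fromstring(LIST_HTML)

# ".//li/h4" selects every h4 element that is a direct child of an li element.
headings = root.findall(".//li/h4")

titles = [h.find("a").text for h in headings]
links = [h.find("a").get("href") for h in headings]

print(titles)  # ['First article', 'Second article']
print(links)   # ['/news/1.html', '/news/2.html']
```

The path expression is tied to the markup of one particular page layout, which is why a change in the layout can break it.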
The existing method for acquiring all content pages of a website column page is to crawl the list items of the column page, which requires inspecting the HTML source file of the column page to find the XPath of the list items. After the crawler downloads the column page, the XPath is used to parse the text and obtain the list items.
The prior-art method for acquiring the content pages associated with a website column list page mainly has the following defects: obtaining the XPath of the list items of a column page requires considerable manual work; moreover, some websites are redesigned from time to time, at which point the XPath may change, resulting in inaccurate parsing.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a webpage processing method and a webpage processing device, which are used for at least solving the technical problem that a content page related to a column list page cannot be quickly and accurately determined in the prior art.
According to an aspect of an embodiment of the present invention, there is provided a web page processing method, including: crawling hyperlink information on all pages of a target website, wherein the pages of the target website comprise: a home page, a column list page, and a content page; establishing a navigation relationship between the pages according to the web addresses (URLs) of the crawled pages and the hyperlink information to obtain a first navigation relationship; and determining a content page associated with a target column list page according to the first navigation relationship.
Further, establishing the navigation relationship between the pages according to the web addresses of the crawled pages and the hyperlink information to obtain the first navigation relationship comprises: crawling hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of the target website; establishing a navigation relationship between the first page and other pages according to the web address of the first page and the first hyperlink information to obtain a second navigation relationship; and summarizing all the second navigation relationships to obtain the first navigation relationship.
Further, establishing the navigation relationship between the first page and other pages according to the web address of the first page and the first hyperlink information to obtain the second navigation relationship includes: taking each page linked by the first hyperlink information in the first page as a second page; and drawing the navigation relationship between the first page and all the second pages to obtain the second navigation relationship.
Further, determining the content page associated with the target column list page according to the first navigation relationship includes: screening out third pages according to the first navigation relationship, wherein all the screened-out third pages form a third page set, and a third page is a page having a bidirectional navigation relationship with the target column list page; screening out fourth pages from the third page set, wherein all the screened-out fourth pages form a fourth page set, and a fourth page is a page in the third page set other than the column list pages and the home page; and taking the pages in the fourth page set as the content pages associated with the target column list page.
Further, screening out the fourth pages from the third page set includes: acquiring the web addresses of the home page and of all column list pages of the target website; sequentially matching the web address of each third page in the third page set against the web address of the home page of the target website and the web addresses of all the column list pages; and if the web address of a third page matches neither the web address of the home page of the target website nor the web address of any column list page, determining that the third page is a fourth page.
According to another aspect of the embodiments of the present invention, there is also provided a web page processing apparatus, including: a crawling unit, used for crawling hyperlink information on all pages of a target website, wherein the pages of the target website comprise: a home page, a column list page, and a content page; an establishing unit, used for establishing a navigation relationship between the pages according to the web addresses of the crawled pages and the hyperlink information to obtain a first navigation relationship; and a determining unit, used for determining a content page associated with a target column list page according to the first navigation relationship.
Further, a plurality of pages are crawled, and the establishing unit comprises: a crawling subunit, used for crawling hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of the target website; an establishing subunit, used for establishing a navigation relationship between the first page and other pages according to the web address of the first page and the first hyperlink information to obtain a second navigation relationship; and a summarizing subunit, used for summarizing all the second navigation relationships to obtain the first navigation relationship.
Further, the establishing subunit includes: a first determining module, configured to take each page linked by the first hyperlink information in the first page as a second page; and a drawing module, configured to draw the navigation relationship between the first page and all the second pages to obtain the second navigation relationship.
Further, the determining unit includes: a searching subunit, configured to screen out third pages according to the first navigation relationship, wherein all the screened-out third pages form a third page set, and a third page is a page having a bidirectional navigation relationship with the target column list page; a screening subunit, configured to screen out fourth pages from the third page set, wherein all the screened-out fourth pages form a fourth page set, and a fourth page is a page in the third page set other than the column list pages and the home page; and a determining subunit, configured to take the pages in the fourth page set as the content pages associated with the target column list page.
Further, the screening subunit includes: an acquisition module, used for acquiring the web addresses of the home page and of all the column list pages of the target website; a matching module, used for sequentially matching the web address of each third page in the third page set against the web address of the home page of the target website and the web addresses of all the column list pages; and a second determining module, used for determining that a third page is a fourth page when the web address of the third page matches neither the web address of the home page of the target website nor the web address of any column list page.
According to still another aspect of the embodiments of the present invention, there is also provided a storage medium having a program stored thereon, the program implementing the web page processing method when executed by a processor.
According to still another aspect of the embodiments of the present invention, there is also provided a processor, configured to execute a program, where the program executes the web page processing method when running.
In the embodiment of the invention, the pages that have a bidirectional navigation relationship with the target column list page, other than the home page and the other column list pages, are the content pages associated with the target column list page. The navigation relationship between the pages, namely the first navigation relationship, is established according to the web addresses of the crawled pages and the hyperlink information in the crawled pages, and the home page and the other column list pages are removed from the pages identified by the first navigation relationship, so that the content pages associated with the target column list page are obtained.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow diagram of a method of web page processing according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a content page according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a column list page according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a second navigational relationship, in accordance with embodiments of the present invention;
FIG. 5 is a schematic diagram of yet another second navigational relationship, in accordance with embodiments of the present invention;
FIG. 6 is a schematic diagram of yet another second navigational relationship, in accordance with embodiments of the present invention;
FIG. 7 is a schematic diagram of a first navigational relationship, in accordance with embodiments of the present invention;
FIG. 8 is a schematic diagram of a web page processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an embodiment of the present invention, an embodiment of a web page processing method is provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
Fig. 1 is a flowchart of a web page processing method according to an embodiment of the present invention. As shown in fig. 1, the method comprises the steps of:
Step S102: crawling hyperlink information on all pages of a target website, wherein the pages of the target website comprise: a home page, a column list page, and a content page.
Step S104: establishing a navigation relationship between the pages according to the web addresses (URLs) of the crawled pages and the hyperlink information to obtain a first navigation relationship.
Step S106: determining a content page associated with a target column list page according to the first navigation relationship.
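The crawling of hyperlink information in step S102 can be sketched as follows for a single page whose HTML has already been downloaded. This is only an illustrative sketch: the class name `LinkExtractor`, the function `extract_links`, and the example.com addresses are hypothetical, and resolving relative links against the page address and dropping fragments are assumptions rather than requirements stated in the description.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the targets of all <a href="..."> hyperlinks on one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page address and drop
                # any "#fragment" so each page has a single address.
                absolute, _fragment = urldefrag(urljoin(self.page_url, value))
                self.links.add(absolute)


def extract_links(page_url, html):
    """Return the set of absolute addresses linked from the given HTML."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    return parser.links


sample = '<a href="/news/">News</a> <a href="a1.html#top">Article</a>'
found = extract_links("http://example.com/news/index.html", sample)
print(sorted(found))
```

Repeating this for every page of the target website yields, for each page address, the set of addresses that page links to, which is the input needed for step S104.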
The content page is a page containing information of a specific article, and fig. 2 shows one content page.
The column list page serves a page navigation function, using an ordered list of hyperlinks to navigate to content pages; fig. 3 shows a column list page.
A column is a category of website content; for example, a general website has a plurality of columns, such as "news", "sports", and "entertainment". A column list page is typically presented in list form and is used to navigate to the content pages of that column.
An A page is considered to be able to navigate to a B page if the A page contains a hyperlink that can link to the B page.
If the B page contains a hyperlink that can link to the A page, then the B page is considered to be able to navigate to the A page.
An A page is considered to have a bi-directional navigational relationship with a B page if the A page contains a hyperlink that can link to the B page and the B page contains a hyperlink that can link to the A page.
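These three definitions can be expressed directly in code. The sketch below assumes that the hyperlinks of each page are stored as a set of target pages keyed by the source page; the page names and the function names `can_navigate` and `is_bidirectional` are hypothetical.

```python
# links[p] is the set of pages that page p contains hyperlinks to.
links = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": set(),
}


def can_navigate(links, src, dst):
    """True if page src contains a hyperlink that links to page dst."""
    return dst in links.get(src, set())


def is_bidirectional(links, a, b):
    """True if a can navigate to b and b can navigate to a."""
    return can_navigate(links, a, b) and can_navigate(links, b, a)


print(is_bidirectional(links, "A", "B"))  # True
print(is_bidirectional(links, "A", "C"))  # False: C does not link back to A
```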
The pages of the target website include: home page, column list page, content page. The inventor finds out through a great deal of research that: the home page and the plurality of column list pages have a bidirectional navigation relation; one column list page and other column list pages have a bidirectional navigation relationship; one column list page has a two-way navigation relationship with its own content page and no two-way navigation relationship with the content pages of the other column list pages.
TABLE 1
[Table image not reproduced: column list pages L1, L2, and L3 and their content pages P(i, j)]
For example, as shown in Table 1, there are 3 column list pages in total. Column list page L1 has 6 content pages, namely content pages P(1, 1), P(1, 2), P(1, 3), P(1, 4), P(1, 5), and P(1, 6), and column list page L1 has a bidirectional navigation relationship with these 6 content pages.
Column list page L2 has 11 content pages, namely content pages P(2, 1), P(2, 2), P(2, 3), P(2, 4), P(2, 5), P(2, 6), P(2, 7), P(2, 8), P(2, 9), P(2, 10), and P(2, 11), and column list page L2 has a bidirectional navigation relationship with these 11 content pages.
Column list page L3 has 8 content pages, namely content pages P(3, 1), P(3, 2), P(3, 3), P(3, 4), P(3, 5), P(3, 6), P(3, 7), and P(3, 8), and column list page L3 has a bidirectional navigation relationship with these 8 content pages.
The column list page has a two-way navigation relationship with the content page of the column list page and has no two-way navigation relationship with the content pages of other column list pages.
In the embodiment of the invention, the pages that have a bidirectional navigation relationship with the target column list page, other than the home page and the other column list pages, are the content pages associated with the target column list page. The navigation relationship between the pages of the target website, namely the first navigation relationship, is established according to the web addresses of the crawled pages of the target website and the hyperlink information in the crawled pages. The home page and the other column list pages are then removed from the pages identified by the first navigation relationship to obtain the content pages associated with the target column list page. No manual participation is needed in this process, and the result is not affected even if the website is redesigned. This solves the technical problem that the content page associated with a column list page cannot be determined quickly and accurately in the prior art, and achieves the technical effect of quickly and accurately determining the content page associated with a column list page.
Optionally, establishing the navigation relationship between the first page and other pages according to the web address of the first page and the first hyperlink information to obtain the second navigation relationship includes: taking each page linked by the first hyperlink information in the first page as a second page; and drawing the navigation relationship between the first page and all the second pages to obtain the second navigation relationship.
Optionally, establishing the navigation relationship between the pages according to the web addresses of the crawled pages and the hyperlink information to obtain the first navigation relationship includes: crawling hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of the target website; establishing a navigation relationship between the first page and other pages according to the web address of the first page and the first hyperlink information to obtain a second navigation relationship; and summarizing all the second navigation relationships to obtain the first navigation relationship.
The first page is any page of the target website. The hyperlink information in the first page is crawled to obtain the first hyperlink information, and the navigation relationship between the first page and other pages is established according to the web address of the first page and the first hyperlink information to obtain the second navigation relationship. The specific process of establishing the second navigation relationship may be as follows:
assuming that the first hyperlink information points (links) to a page, the first page can navigate to the page, and assuming that there is a hyperlink in the page that links to the first page, the first page has a bi-directional navigation relationship with the page.
Assuming that the target website has M pages, namely M first pages, a navigation relationship between each first page and other pages is established according to the web address of that first page and its first hyperlink information, yielding one second navigation relationship per page and M second navigation relationships in total. The M second navigation relationships are summarized to obtain the first navigation relationship, which is a comprehensive navigation relationship.
A second navigation relation of the website home page obtained by crawling the website home page is shown in fig. 4; a second navigation relationship of the column list page 1 is obtained by crawling the column list page 1 and is shown in fig. 5; the second navigation relationship between the column list page 1 and the content page 1 obtained by crawling the column list page 1 and the content page 1 is shown in fig. 6. The 3 second navigation relationships are integrated into a graph to obtain the first navigation relationship shown in fig. 7.
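The summarizing of per-page second navigation relationships into one first navigation relationship can be sketched as a union of directed edges. The representation (edges as pairs, the result as adjacency sets), the function names, the page names, and the choice to ignore links from a page to itself are all assumptions made for illustration.

```python
from collections import defaultdict


def second_relation(page_url, hyperlinks):
    """Edges from one first page to each second page it links to."""
    # Assumption: a link from a page to itself carries no navigation
    # information and is dropped.
    return {(page_url, target) for target in hyperlinks if target != page_url}


def first_relation(crawled):
    """Union of all second navigation relationships, as adjacency sets."""
    graph = defaultdict(set)
    for page_url, hyperlinks in crawled.items():
        for src, dst in second_relation(page_url, hyperlinks):
            graph[src].add(dst)
    return dict(graph)


# Hypothetical crawl result: page address -> hyperlinks found on that page.
crawled = {
    "home": ["list1", "list2"],
    "list1": ["home", "list2", "c1", "list1"],
    "c1": ["list1", "home"],
}

graph = first_relation(crawled)
print(sorted(graph["list1"]))  # ['c1', 'home', 'list2']
```

With M crawled pages, `first_relation` merges M second navigation relationships, one per key of `crawled`.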
As can be seen from fig. 7, column list page 1 has a bidirectional navigation relationship with the home page, with column list page 2, and with content page 1, content page 2, and content page 3 of column list page 1, whereas column list page 2 has no bidirectional navigation relationship with content page 1, content page 2, or content page 3 of column list page 1. That is, the pages that have a bidirectional navigation relationship with a specific column list page (the target column list page), other than the home page and the other column list pages, are the content pages associated with that column list page.
Optionally, determining the content page associated with the target column list page according to the first navigation relationship comprises: screening out third pages according to the first navigation relationship, wherein all the screened-out third pages form a third page set, and a third page is a page having a bidirectional navigation relationship with the target column list page; screening out fourth pages from the third page set, wherein all the screened-out fourth pages form a fourth page set, and a fourth page is a page in the third page set other than the column list pages and the home page; and taking the pages in the fourth page set as the content pages associated with the target column list page.
Optionally, screening out the fourth pages from the third page set includes: acquiring the web addresses of the home page and of all column list pages of the target website; sequentially matching the web address of each third page in the third page set against the web address of the home page of the target website and the web addresses of all the column list pages; and if the web address of a third page matches neither the web address of the home page of the target website nor the web address of any column list page, determining that the third page is a fourth page.
The pages that have a bidirectional navigation relationship with the target column list page, other than the home page and the other column list pages, are the content pages associated with the target column list page.
According to the first navigation relationship, all pages having a bidirectional navigation relationship with the target column list page can be determined, that is, a plurality of third pages are determined. The home page and the other column list pages are then eliminated from the third pages to obtain the content pages associated with the target column list page.
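Determining the third page set from the first navigation relationship can be sketched as follows. The graph below is hypothetical data shaped like the situation described for fig. 7 (one target column list page, one other column list page, and three content pages), and the function name `third_pages` is an assumption.

```python
# Hypothetical first navigation relationship: page -> pages it links to.
graph = {
    "home":  {"list1", "list2"},
    "list1": {"home", "list2", "c1", "c2", "c3"},
    "list2": {"home", "list1"},
    "c1":    {"home", "list1", "list2"},
    "c2":    {"home", "list1", "list2"},
    "c3":    {"home", "list1", "list2"},
}


def third_pages(graph, target):
    """Pages that the target links to and that also link back to the target."""
    return {page for page in graph.get(target, set())
            if target in graph.get(page, set())}


third_set = third_pages(graph, "list1")
print(sorted(third_set))                      # ['c1', 'c2', 'c3', 'home', 'list2']
print(sorted(third_pages(graph, "list2")))    # ['home', 'list1']
```

In this data the content pages link to `list2`, but `list2` does not link back to them, so they appear only in the third page set of `list1`.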
If the web address of a third page matches the web address of the home page of the target website, the third page is determined to be the home page of the website.
If the web address of a third page matches the web address of one of the column list pages, the third page is determined to be that column list page.
If the web address of a third page matches neither the web address of the home page of the target website nor the web address of any column list page, the third page is determined to be neither the home page nor a column list page, but a content page associated with the target column list page, namely a fourth page. There may be a plurality of fourth pages.
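The three matching outcomes above can be sketched as a small classification followed by a filter. Treating "matching" as exact string equality of addresses is an assumption (a real crawler might normalize addresses first), and the function names and example.com addresses are hypothetical.

```python
def classify_third_page(url, home_url, column_list_urls):
    """Classify one third page by matching its address against known addresses."""
    if url == home_url:
        return "home page"
    if url in column_list_urls:
        return "column list page"
    return "content page"


def fourth_pages(third_set, home_url, column_list_urls):
    """Third pages whose address matches neither the home page nor any column list page."""
    return {url for url in third_set
            if classify_third_page(url, home_url, column_list_urls) == "content page"}


HOME = "http://example.com/"
COLUMN_LISTS = {"http://example.com/news/", "http://example.com/sports/"}

# Hypothetical third page set of the target column list page /news/.
third_set = {
    "http://example.com/",
    "http://example.com/sports/",
    "http://example.com/news/a1.html",
    "http://example.com/news/a2.html",
}

print(sorted(fourth_pages(third_set, HOME, COLUMN_LISTS)))
```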
In the embodiment of the invention, the web addresses of all column list pages and of the home page are collected; hyperlink information in the pages of the website is crawled; and a navigation relationship graph is established according to the web address of the current page and the hyperlink information in that page. A comprehensive navigation relationship graph is established from all the pages, and the pages having a bidirectional navigation relationship with the specified column list page are searched for; the other column list pages and the home page are excluded from these pages, and the remaining pages are all content pages associated with the column list page.
In the existing method, the list items under a website column are generally crawled, which requires inspecting the HTML source file of the column page to find the XPath of the list items. After the crawler downloads the column page, the text is parsed using the XPath to obtain the target web addresses. XPath is a query language that carries learning and operating costs, which increases labor costs. The webpage processing method provided by the embodiment of the invention avoids using XPath to acquire the content page data of a column list page, and thus reduces labor costs.
The webpage processing method provided by the embodiment of the invention utilizes the internal law of hyperlink navigation in the webpage list page and the content page to accurately acquire the related page of the column list page, namely the content page of the column list page. Because the internal rule does not depend on the version of the website, the method is also applicable to the condition of website version change, and has no subsequent maintenance cost.
The embodiment of the invention also provides a webpage processing device, which can execute the webpage processing method, and the webpage processing method can also be executed by the webpage processing device. Fig. 8 is a schematic diagram of a web page processing apparatus according to an embodiment of the present invention, as shown in fig. 8, the apparatus including: the crawling unit 10, the establishing unit 20 and the determining unit 30.
A crawling unit 10, configured to crawl hyperlink information on all pages of a target website, where the pages of the target website include: home page, column list page, content page.
The establishing unit 20 is configured to establish a navigation relationship between the pages according to the web addresses of the crawled pages and the hyperlink information, so as to obtain a first navigation relationship.
The determining unit 30 is configured to determine a content page associated with the target column list page according to the first navigation relationship.
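One way to picture how the three units cooperate is as three methods of a single object. The sketch below is an assumption-laden illustration, not the claimed apparatus: the class name `WebPageProcessor`, the injected `fetch_links` callable (standing in for downloading a page and extracting its hyperlinks), and the in-memory `SITE` data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Set


@dataclass
class WebPageProcessor:
    # Hypothetical hook: given a page address, return the hyperlinks on it.
    fetch_links: Callable[[str], Iterable[str]]

    def crawl(self, page_urls: Iterable[str]) -> Dict[str, Set[str]]:
        """Role of crawling unit 10: hyperlink information for every page."""
        return {url: set(self.fetch_links(url)) for url in page_urls}

    def establish(self, crawled: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
        """Role of establishing unit 20: the first navigation relationship."""
        return {url: {t for t in targets if t != url}
                for url, targets in crawled.items()}

    def determine(self, graph, target, home_url, column_list_urls) -> Set[str]:
        """Role of determining unit 30: content pages of the target column list page."""
        excluded = {home_url, *column_list_urls}
        return {page for page in graph.get(target, set())
                if target in graph.get(page, set()) and page not in excluded}


# Hypothetical site held in memory instead of being fetched over the network.
SITE = {
    "home": ["list1", "list2"],
    "list1": ["home", "list2", "c1"],
    "list2": ["home", "list1"],
    "c1": ["home", "list1"],
}

processor = WebPageProcessor(fetch_links=lambda url: SITE[url])
graph = processor.establish(processor.crawl(SITE))
result = processor.determine(graph, "list1", "home", {"list1", "list2"})
print(result)  # {'c1'}
```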
Optionally, a plurality of pages are crawled, and the establishing unit 20 includes: a crawling subunit, an establishing subunit, and a summarizing subunit. The crawling subunit is used for crawling the hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of the target website. The establishing subunit is used for establishing a navigation relationship between the first page and other pages according to the web address of the first page and the first hyperlink information to obtain a second navigation relationship. The summarizing subunit is used for summarizing all the second navigation relationships to obtain the first navigation relationship.
Optionally, the establishing subunit comprises: a first determining module and a drawing module. The first determining module is used for taking each page linked by the first hyperlink information in the first page as a second page. The drawing module is used for drawing the navigation relationship between the first page and all the second pages to obtain the second navigation relationship.
Optionally, the determining unit 30 includes: a searching subunit, a screening subunit, and a determining subunit. The searching subunit is used for screening out third pages according to the first navigation relationship, wherein all the screened-out third pages form a third page set, and a third page is a page having a bidirectional navigation relationship with the target column list page. The screening subunit is used for screening out fourth pages from the third page set, wherein all the screened-out fourth pages form a fourth page set, and a fourth page is a page in the third page set other than the column list pages and the home page. The determining subunit is used for taking the pages in the fourth page set as the content pages associated with the target column list page.
Optionally, the screening subunit comprises: an acquisition module, a matching module, and a second determining module. The acquisition module is used for acquiring the web addresses of the home page and of all the column list pages of the target website. The matching module is used for sequentially matching the web address of each third page in the third page set against the web address of the home page of the target website and the web addresses of all the column list pages. The second determining module is used for determining that a third page is a fourth page when the web address of the third page matches neither the web address of the home page of the target website nor the web address of any column list page.
The web page processing device comprises a processor and a memory, wherein the crawling unit 10, the establishing unit 20, the determining unit 30 and the like are stored in the memory as program units, and the program units stored in the memory are executed by the processor to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the content page associated with the target column list page is determined by adjusting kernel parameters.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium having a program stored thereon, which when executed by a processor implements the web page processing method.
The embodiment of the invention provides a processor, which is used for running a program, wherein the webpage processing method is executed when the program runs.
The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein the processor executes the program and realizes the following steps:
crawling hyperlink information on all pages of a target website, wherein the pages of the target website comprise: a home page, a column list page, and a content page; establishing a navigation relationship between pages according to the web addresses of the crawled pages and the hyperlink information to obtain a first navigation relationship; and determining a content page associated with the target column list page according to the first navigation relationship.
Crawling hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of a target website; establishing a navigation relation between the first page and other pages according to the website of the first page and the first hyperlink information to obtain a second navigation relation; and summarizing all the second navigation relations to obtain the first navigation relation.
Taking a page linked by the first hyperlink information in the first page as a second page; and drawing the navigation relation between the first page and all the second pages to obtain a second navigation relation.
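The per-page step and the aggregation step described above can be sketched with the Python standard library. This is a minimal sketch under stated assumptions: the HTML of each page has already been fetched, hyperlink information means the `href` values of `<a>` tags, only links that stay on the same host are kept, and URL fragments and self-links are ignored. None of these choices is specified by the source, and the function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List, Set
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href values of <a> tags found on one page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def second_navigation_relation(page_url: str, html: str) -> Set[str]:
    """Return the second pages: same-host pages linked from page_url."""
    collector = _LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    second_pages = set()
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment (assumed convention).
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc == host and absolute != page_url:
            second_pages.add(absolute)
    return second_pages


def first_navigation_relation(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Aggregate every per-page relation into one site-wide relation."""
    return {url: second_navigation_relation(url, html)
            for url, html in pages.items()}
```

The result is a directed graph stored as an adjacency mapping, which makes the later bidirectional check a pair of set-membership tests.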
Screening out third pages according to the first navigation relationship, wherein all the screened-out third pages form a third page set, and a third page is a page having a bidirectional navigation relationship with the target column list page; screening out fourth pages from the third page set, wherein all the screened-out fourth pages form a fourth page set, and a fourth page is a page in the third page set other than the column list pages and the home page; and taking the pages in the fourth page set as the content pages associated with the target column list page.
Acquiring the web addresses of the home page and of all column list pages of the target website; matching the web address of each third page in the third page set, in turn, against the web address of the home page of the target website and the web addresses of all column list pages; and if the web address of a third page matches neither the web address of the home page of the target website nor the web address of any column list page, determining that the third page is a fourth page.
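The matching step leaves open what counts as a match between two web addresses. The sketch below shows one possible convention, which is an assumption and not something the source specifies: scheme and host are compared case-insensitively, an empty path is treated as `/`, and fragments are ignored. The helper names are hypothetical.

```python
from typing import Iterable
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonicalize a web address for comparison (assumed convention)."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    # Lowercase scheme and host, keep path and query, drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def is_fourth_page(third_page_url: str, home_page_url: str,
                   column_page_urls: Iterable[str]) -> bool:
    """True when the third page matches neither the home page nor any
    column list page, i.e. every match attempt fails."""
    candidate = normalize_url(third_page_url)
    excluded = [home_page_url, *column_page_urls]
    return all(candidate != normalize_url(url) for url in excluded)
```

Without some normalization, `https://example.com` and `https://example.com/` would fail to match and the home page could be misclassified as a content page.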
The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
crawling hyperlink information on all pages of a target website, wherein the pages of the target website comprise: a home page, a column list page, and a content page; establishing a navigation relationship between pages according to the web addresses of the crawled pages and the hyperlink information to obtain a first navigation relationship; and determining a content page associated with the target column list page according to the first navigation relationship.
Crawling hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of a target website; establishing a navigation relation between the first page and other pages according to the website of the first page and the first hyperlink information to obtain a second navigation relation; and summarizing all the second navigation relations to obtain the first navigation relation.
Taking a page linked by the first hyperlink information in the first page as a second page; and drawing the navigation relation between the first page and all the second pages to obtain a second navigation relation.
Screening out third pages according to the first navigation relationship, wherein all the screened-out third pages form a third page set, and a third page is a page having a bidirectional navigation relationship with the target column list page; screening out fourth pages from the third page set, wherein all the screened-out fourth pages form a fourth page set, and a fourth page is a page in the third page set other than the column list pages and the home page; and taking the pages in the fourth page set as the content pages associated with the target column list page.
Acquiring the web addresses of the home page and of all column list pages of the target website; matching the web address of each third page in the third page set, in turn, against the web address of the home page of the target website and the web addresses of all column list pages; and if the web address of a third page matches neither the web address of the home page of the target website nor the web address of any column list page, determining that the third page is a fourth page.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include computer-readable media in the form of volatile memory, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include both persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (7)

1. A method for processing a web page, comprising:
crawling hyperlink information on all pages of a target website, wherein the pages of the target website comprise: a home page, a column list page, and a content page;
establishing a navigation relationship between pages according to the web address of the crawled page and the hyperlink information to obtain a first navigation relationship;
determining a content page associated with a target column list page according to the first navigation relationship;
wherein determining the content page associated with the target column list page according to the first navigation relationship comprises:
screening out a third page according to the first navigation relationship, wherein all the screened out third pages form a third page set, and the third page is a page having a bidirectional navigation relationship with the target column list page;
screening out a fourth page from the third page set, wherein all the screened out fourth pages form a fourth page set, and the fourth page is a page in the third page set except the column list page and the home page;
taking the page in the fourth page set as a content page associated with the target column list page;
wherein, sifting out the fourth page from the third page set comprises:
acquiring the websites of the home page and all column list pages of the target website;
sequentially matching the web addresses of all the third pages in the third page set with the web address of the home page of the target website and the web addresses of all the column list pages respectively;
and if the matching of the web address of the third page with the web address of the home page of the target website and the web addresses of all the column list pages fails, determining that the third page is the fourth page.
2. The method of claim 1, wherein establishing a navigation relationship between pages according to the crawled web address of the page and the hyperlink information to obtain a first navigation relationship comprises:
crawling hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of the target website;
establishing a navigation relationship between the first page and other pages according to the website of the first page and the first hyperlink information to obtain a second navigation relationship;
and summarizing all the second navigation relations to obtain the first navigation relation.
3. The method of claim 2, wherein establishing a navigation relationship between the first page and other pages according to the website of the first page and the first hyperlink information to obtain a second navigation relationship comprises:
taking a page linked by the first hyperlink information in the first page as a second page;
and drawing the navigation relation between the first page and all the second pages to obtain the second navigation relation.
4. A web page processing apparatus, comprising:
the crawling unit is used for crawling hyperlink information on all pages of the target website, wherein the pages of the target website comprise: a home page, a column list page, and a content page;
the establishing unit is used for establishing a navigation relationship between pages according to the web address of the crawled page and the hyperlink information to obtain a first navigation relationship;
the determining unit is used for determining a content page associated with the target column list page according to the first navigation relation;
wherein the determination unit includes:
the searching subunit is configured to screen out a third page according to the first navigation relationship, where all the screened out third pages form a third page set, and the third page is a page having a bidirectional navigation relationship with the target column list page;
a screening subunit, configured to screen a fourth page from the third page set, where all the screened fourth pages form a fourth page set, and the fourth page is a page in the third page set except the column list page and the home page;
a determining subunit, configured to use a page in the fourth page set as a content page associated with the target column list page;
wherein the screening subunit comprises:
the acquisition module is used for acquiring the websites of the home page and all the column list pages of the target website;
the matching module is used for sequentially matching the web addresses of all the third pages in the third page set with the web address of the home page of the target website and the web addresses of all the column list pages respectively;
and the second determining module is used for determining that the third page is the fourth page under the condition that the matching of the web address of the third page with the web address of the home page of the target website and the web addresses of all the column list pages fails.
5. The apparatus of claim 4, wherein the crawled pages are multiple, and the establishing unit comprises:
the crawling sub-unit is used for crawling hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of the target website;
the establishing subunit is used for establishing a navigation relationship between the first page and other pages according to the website of the first page and the first hyperlink information to obtain a second navigation relationship;
and the summarizing subunit is used for summarizing all the second navigation relationships to obtain the first navigation relationship.
6. The apparatus of claim 5, wherein the establishing subunit comprises:
a first determining module, configured to use a page linked with the first hyperlink information in the first page as a second page;
and the drawing module is used for drawing the navigation relationship between the first page and all the second pages to obtain the second navigation relationship.
7. A storage medium on which a program is stored, the program implementing the web page processing method according to any one of claims 1 to 3 when executed by a processor.
CN201710705406.3A 2017-08-16 2017-08-16 Webpage processing method and device Active CN109948013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710705406.3A CN109948013B (en) 2017-08-16 2017-08-16 Webpage processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710705406.3A CN109948013B (en) 2017-08-16 2017-08-16 Webpage processing method and device

Publications (2)

Publication Number Publication Date
CN109948013A CN109948013A (en) 2019-06-28
CN109948013B true CN109948013B (en) 2021-11-05

Family

ID=67003895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710705406.3A Active CN109948013B (en) 2017-08-16 2017-08-16 Webpage processing method and device

Country Status (1)

Country Link
CN (1) CN109948013B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294815A (en) * 2013-06-08 2013-09-11 北京邮电大学 Search engine device with various presentation modes based on classification of key words and searching method
CN106547803A (en) * 2015-09-23 2017-03-29 北京国双科技有限公司 The method and apparatus for crawling website incremental resource

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8856100B2 (en) * 2012-07-31 2014-10-07 International Business Machines Corporation Displaying browse sequence with search results

Also Published As

Publication number Publication date
CN109948013A (en) 2019-06-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant