CN109948013A - Web page processing method and device - Google Patents

Web page processing method and device

Info

Publication number
CN109948013A
Authority
CN
China
Prior art keywords
page
pages
navigation
network address
navigation relation
Prior art date
Legal status
Granted
Application number
CN201710705406.3A
Other languages
Chinese (zh)
Other versions
CN109948013B (en)
Inventor
曹志明
Current Assignee
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201710705406.3A
Publication of CN109948013A
Application granted
Publication of CN109948013B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a web page processing method and device. The method comprises: crawling the hyperlink information on all pages of a target website, wherein the pages of the target website include a home page, column list pages, and content pages; establishing the navigation relations between the pages according to the URLs of the crawled pages and the hyperlink information, to obtain a first navigation relation; and determining the content pages associated with a target column list page according to the first navigation relation. The invention solves the technical problem that the prior art cannot quickly and accurately determine the content pages associated with a column list page.

Description

Web page processing method and device
Technical field
The present invention relates to the Internet field, and in particular to a web page processing method and device.
Background
A website generally contains two kinds of pages: content pages and list pages. A content page is a page that contains a specific article. A list page serves as page navigation: it navigates to content pages through an ordered list of hyperlinks. A website column is a category of website content; for example, a typical portal has columns such as "News", "Sports", and "Entertainment". A column page is usually a list page that navigates to each of the column's content pages.
In the business of crawling website data, accurately obtaining all the content pages of a website's column list page is of great practical significance. For example, whether a website has been updated can be determined by observing the update status of its content pages and of its column list pages.
Website pages are files in HTML format, and the list section of a list page is typically made up of multiple li tags.
XPath is a language for locating information in an HTML document. For example, /li/h4 selects the h4 elements nested under li tags.
The existing method for obtaining all the content pages of a column list page is to crawl the list items of the column page. This requires inspecting the HTML source of the column page to find the XPath of the list items. After the crawler has downloaded the column page, the page text is parsed with that XPath to obtain the list items.
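The prior-art approach described above can be illustrated with a minimal sketch. The page fragment, the function name list_items, and the expression .//li/h4 are assumptions made for illustration; Python's xml.etree.ElementTree supports only a subset of XPath and needs well-formed markup, so a real crawler would normally use a full HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment of a column page (illustration only).
COLUMN_PAGE = """
<html><body>
  <ul>
    <li><h4><a href="/news/1.html">First article</a></h4></li>
    <li><h4><a href="/news/2.html">Second article</a></h4></li>
  </ul>
</body></html>
""".strip()

def list_items(page_source, xpath):
    """Return the href of every link found under the nodes selected by xpath."""
    root = ET.fromstring(page_source)
    return [a.get("href") for node in root.findall(xpath) for a in node.iter("a")]

# The expression has to be worked out by hand from the page source.
links = list_items(COLUMN_PAGE, ".//li/h4")

# A hypothetical redesign that replaces h4 with h3 leaves the old expression with no matches.
REDESIGNED = COLUMN_PAGE.replace("h4", "h3")
links_after_redesign = list_items(REDESIGNED, ".//li/h4")
```

The second call returns an empty list, which is the kind of silent parsing failure after a redesign that the background section describes.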
The prior-art method for obtaining the content pages associated with a column list page has two main disadvantages: obtaining the XPath of the list items on a column page requires a great deal of manual work; and when a website is redesigned, the XPath may change, which makes the parsing inaccurate.
No effective solution to the above problem has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a web page processing method and device, to at least solve the technical problem that the prior art cannot quickly and accurately determine the content pages associated with a column list page.
According to one aspect of the embodiments of the present invention, a web page processing method is provided, comprising: crawling the hyperlink information on all pages of a target website, wherein the pages of the target website include a home page, column list pages, and content pages; establishing the navigation relations between the pages according to the URLs of the crawled pages and the hyperlink information, to obtain a first navigation relation; and determining the content pages associated with a target column list page according to the first navigation relation.
Further, establishing the navigation relations between the pages according to the URLs of the crawled pages and the hyperlink information to obtain the first navigation relation comprises: crawling the hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of the target website; establishing the navigation relations between the first page and other pages according to the URL of the first page and the first hyperlink information, to obtain a second navigation relation; and aggregating all the second navigation relations to obtain the first navigation relation.
Further, establishing the navigation relations between the first page and other pages according to the URL of the first page and the first hyperlink information to obtain the second navigation relation comprises: taking each page linked to by the first hyperlink information on the first page as a second page; and drawing the navigation relations between the first page and all the second pages, to obtain the second navigation relation.
Further, determining the content pages associated with the target column list page according to the first navigation relation comprises: screening out third pages according to the first navigation relation, all the screened-out third pages forming a third page set, wherein a third page is a page that has a two-way navigation relation with the target column list page; screening out fourth pages from the third page set, all the screened-out fourth pages forming a fourth page set, wherein a fourth page is a page in the third page set other than the column list pages and the home page; and taking the pages in the fourth page set as the content pages associated with the target column list page.
Further, screening out the fourth pages from the third page set comprises: obtaining the URLs of the home page and of all the column list pages of the target website; matching the URL of each third page in the third page set, in turn, against the URL of the home page and the URLs of all the column list pages; and if the URL of a third page matches neither the URL of the home page nor the URL of any column list page, determining that the third page is a fourth page.
According to another aspect of the embodiments of the present invention, a web page processing device is further provided, comprising: a crawling unit, configured to crawl the hyperlink information on all pages of a target website, wherein the pages of the target website include a home page, column list pages, and content pages; an establishing unit, configured to establish the navigation relations between the pages according to the URLs of the crawled pages and the hyperlink information, to obtain a first navigation relation; and a determination unit, configured to determine the content pages associated with a target column list page according to the first navigation relation.
Further, there are multiple crawled pages, and the establishing unit comprises: a crawling subunit, configured to crawl the hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of the target website; an establishing subunit, configured to establish the navigation relations between the first page and other pages according to the URL of the first page and the first hyperlink information, to obtain a second navigation relation; and an aggregating subunit, configured to aggregate all the second navigation relations to obtain the first navigation relation.
Further, the establishing subunit comprises: a first determining module, configured to take each page linked to by the first hyperlink information on the first page as a second page; and a drawing module, configured to draw the navigation relations between the first page and all the second pages, to obtain the second navigation relation.
Further, the determination unit comprises: a lookup subunit, configured to screen out third pages according to the first navigation relation, all the screened-out third pages forming a third page set, wherein a third page is a page that has a two-way navigation relation with the target column list page; a screening subunit, configured to screen out fourth pages from the third page set, all the screened-out fourth pages forming a fourth page set, wherein a fourth page is a page in the third page set other than the column list pages and the home page; and a determining subunit, configured to take the pages in the fourth page set as the content pages associated with the target column list page.
Further, the screening subunit comprises: an obtaining module, configured to obtain the URLs of the home page and of all the column list pages of the target website; a matching module, configured to match the URL of each third page in the third page set, in turn, against the URL of the home page and the URLs of all the column list pages; and a second determining module, configured to determine that a third page is a fourth page when its URL matches neither the URL of the home page nor the URL of any column list page.
According to yet another aspect of the embodiments of the present invention, a storage medium is further provided, on which a program is stored, the program implementing the web page processing method when executed by a processor.
According to yet another aspect of the embodiments of the present invention, a processor is further provided, the processor being configured to run a program, wherein the program performs the web page processing method when run.
In the embodiments of the present invention, the pages that have a two-way navigation relation with the target column list page, other than the home page and the other column list pages, are exactly the content pages associated with the target column list page. The navigation relations between the pages, that is, the first navigation relation, are established according to the URLs of the crawled pages and the hyperlink information on those pages. Excluding the home page and the other column list pages according to the first navigation relation yields the content pages associated with the target column list page. No manual participation is needed in this process, and a website redesign does not affect the result. This achieves the technical effect of quickly and accurately determining the content pages associated with a column list page, and thereby solves the technical problem that the prior art cannot do so.
Brief description of the drawings
The drawings described here are provided for a further understanding of the present invention and form a part of it. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flow chart of a web page processing method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a content page according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a column list page according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a second navigation relation according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of another second navigation relation according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of yet another second navigation relation according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of a first navigation relation according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of a web page processing device according to an embodiment of the present invention.
Detailed description of the embodiments
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
According to an embodiment of the present invention, an embodiment of a web page processing method is provided. It should be noted that the steps shown in the flow chart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and, although a logical order is shown in the flow chart, in some cases the steps shown or described can be executed in a different order.
Fig. 1 is a flow chart of a web page processing method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: crawl the hyperlink information on all pages of the target website, wherein the pages of the target website include a home page, column list pages, and content pages.
Step S104: establish the navigation relations between the pages according to the URLs of the crawled pages and the hyperlink information, to obtain a first navigation relation.
Step S106: determine the content pages associated with a target column list page according to the first navigation relation.
A content page is a page that contains a specific article; Fig. 2 shows a content page.
A column list page serves as page navigation and navigates to content pages through an ordered list of hyperlinks; Fig. 3 shows a column list page.
A column is a category of website content. A typical website has multiple columns, for example "News", "Sports", and "Entertainment". A column list page is usually presented as a list that navigates to each of its content pages.
If page A contains a hyperlink to page B, page A is considered able to navigate to page B.
If page B contains a hyperlink to page A, page B is considered able to navigate to page A.
If page A contains a hyperlink to page B and page B contains a hyperlink to page A, page A and page B are considered to have a two-way navigation relation.
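The three definitions above can be written down as predicates over a set of directed links. This is a minimal sketch; the page names and the function names can_navigate and has_two_way_navigation are hypothetical and do not come from the patent text.

```python
# Each pair (source, target) means "source contains a hyperlink to target".
# A, B, and C are placeholder page names.
links = {("A", "B"), ("B", "A"), ("A", "C")}

def can_navigate(links, src, dst):
    """True if page src contains a hyperlink to page dst."""
    return (src, dst) in links

def has_two_way_navigation(links, a, b):
    """True if each of the two pages can navigate to the other."""
    return can_navigate(links, a, b) and can_navigate(links, b, a)
```

Here A and B have a two-way navigation relation, while A and C do not, because C has no hyperlink back to A.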
The pages of the target website include a home page, column list pages, and content pages. Through extensive research, the inventor found that: the home page has a two-way navigation relation with the column list pages; each column list page has a two-way navigation relation with the other column list pages; and each column list page has a two-way navigation relation with its own content pages but not with the content pages of other column list pages.
Table 1
For example, as shown in Table 1, there are three column list pages in total. Column list page L1 has 6 content pages: P(1,1), P(1,2), P(1,3), P(1,4), P(1,5), and P(1,6). Column list page L1 has a two-way navigation relation with each of these 6 content pages.
Column list page L2 has 11 content pages: P(2,1), P(2,2), P(2,3), P(2,4), P(2,5), P(2,6), P(2,7), P(2,8), P(2,9), P(2,10), and P(2,11). Column list page L2 has a two-way navigation relation with each of these 11 content pages.
Column list page L3 has 8 content pages: P(3,1), P(3,2), P(3,3), P(3,4), P(3,5), P(3,6), P(3,7), and P(3,8). Column list page L3 has a two-way navigation relation with each of these 8 content pages.
A column list page has a two-way navigation relation with its own content pages and no two-way navigation relation with the content pages of other column list pages.
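The Table 1 example can be checked with a small model. The sketch below assumes, purely for illustration, that every page carries a navigation bar linking to the home page and to all three column list pages, and that each column list page links to its own content pages; the patent text does not specify this link layout, and the names out_links and two_way_partners are hypothetical.

```python
from collections import defaultdict

HOME = "home"
SIZES = {"L1": 6, "L2": 11, "L3": 8}  # content pages per column list page, from Table 1

# Content page names such as "P(1,4)" for the fourth content page of L1.
content = {col: [f"P({col[1:]},{k})" for k in range(1, n + 1)] for col, n in SIZES.items()}

# Assumption: every page has a navigation bar to the home page and all column list pages.
nav_bar = {HOME, *SIZES}
all_pages = nav_bar | {p for pages in content.values() for p in pages}

out_links = defaultdict(set)
for page in all_pages:
    out_links[page] |= nav_bar - {page}
for col, pages in content.items():
    out_links[col] |= set(pages)  # each column list page lists its own content pages

def two_way_partners(page):
    """Pages that link to `page` and are linked from it."""
    return {q for q in out_links[page] if page in out_links[q]}

own_content = two_way_partners("L1") - nav_bar
```

Under these assumptions, P(2,1) links to L1 through the navigation bar, but L1 does not link back, so the relation is one-way and P(2,1) is not attributed to L1.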
In the embodiments of the present invention, the pages that have a two-way navigation relation with the target column list page, other than the home page and the other column list pages, are exactly the content pages associated with the target column list page. The navigation relations between the pages of the target website, that is, the first navigation relation, are established according to the URLs of the crawled pages and the hyperlink information on those pages. Excluding the home page and the other column list pages according to the first navigation relation yields the content pages associated with the target column list page. No manual participation is needed in this process, and a website redesign does not affect the result. This solves the technical problem that the prior art cannot quickly and accurately determine the content pages associated with a column list page, and achieves the technical effect of determining them quickly and accurately.
Optionally, establishing the navigation relations between the first page and other pages according to the URL of the first page and the first hyperlink information to obtain the second navigation relation comprises: taking each page linked to by the first hyperlink information on the first page as a second page; and drawing the navigation relations between the first page and all the second pages, to obtain the second navigation relation.
Optionally, establishing the navigation relations between the pages according to the URLs of the crawled pages and the hyperlink information to obtain the first navigation relation comprises: crawling the hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of the target website; establishing the navigation relations between the first page and other pages according to the URL of the first page and the first hyperlink information, to obtain a second navigation relation; and aggregating all the second navigation relations to obtain the first navigation relation.
The first page is any page of the target website. The hyperlink information on the first page is crawled to obtain the first hyperlink information, and the navigation relations between the first page and other pages are established according to the URL of the first page and the first hyperlink information, to obtain the second navigation relation. The second navigation relation can be established as follows:
Suppose the first hyperlink information points (links) to some page; then the first page can navigate to that page. If that page in turn contains a hyperlink to the first page, the first page and that page have a two-way navigation relation.
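As a rough illustration of how the second navigation relation of a single page might be computed, the sketch below extracts the hyperlinks from a page's HTML and resolves them against the page's URL. The class name LinkCollector, the function name second_navigation_relation, and the example URLs are assumptions; the patent does not prescribe a parser or a URL normalization rule.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def second_navigation_relation(page_url, html_text):
    """Return the set of (page_url, linked_url) pairs for one crawled page."""
    collector = LinkCollector()
    collector.feed(html_text)
    # Resolve relative links and drop fragments so the same page has one URL.
    targets = {urldefrag(urljoin(page_url, h)).url for h in collector.hrefs}
    targets.discard(page_url)  # a page linking to itself carries no navigation
    return {(page_url, t) for t in targets}

page = '<a href="/">Home</a> <a href="a1.html#top">Article 1</a>'
edges = second_navigation_relation("http://example.com/news/", page)
```

Whether a linked page links back, which is what makes the relation two-way, only becomes known once that page has been crawled too.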
Suppose the target website has M pages in total, so that there are M first pages. Establishing the navigation relations between each first page and the other pages according to its URL and its first hyperlink information yields one second navigation relation per page, M second navigation relations in total. Aggregating these M second navigation relations yields the first navigation relation, which is a comprehensive navigation relation.
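The aggregation step can be sketched as a union of per-page edge sets into one adjacency map. The function name merge_relations and the page names are hypothetical, and representing each relation as a set of (source, target) pairs is an assumption.

```python
def merge_relations(second_relations):
    """Union per-page edge sets into one map: page -> set of pages it links to."""
    first = {}
    for edges in second_relations:
        for src, dst in edges:
            first.setdefault(src, set()).add(dst)
            first.setdefault(dst, set())  # keep pages that only appear as targets
    return first

# Three second navigation relations, one per crawled page.
second_relations = [
    {("home", "list1"), ("home", "list2")},
    {("list1", "home"), ("list1", "c1")},
    {("c1", "list1")},
]
first_relation = merge_relations(second_relations)
```

With M crawled pages, second_relations would hold M edge sets; the merged map is what the later screening steps query.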
Crawling the website home page yields the second navigation relation of the home page, shown in Fig. 4. Crawling column list page 1 yields the second navigation relation of column list page 1, shown in Fig. 5. Crawling content page 1 of column list page 1 yields the second navigation relation of that content page, shown in Fig. 6. Combining these three second navigation relations into one graph yields the first navigation relation shown in Fig. 7.
As can be seen from Fig. 7, column list page 1 has a two-way navigation relation with the home page, with column list page 2, and with content pages 1, 2, and 3 of column list page 1, whereas column list page 2 has no two-way navigation relation with content pages 1, 2, and 3 of column list page 1. That is, the pages that have a two-way navigation relation with a specific column list page (the target column list page), other than the home page and the other column list pages, are the content pages associated with that column list page.
Optionally, determining the content pages associated with the target column list page according to the first navigation relation comprises: screening out third pages according to the first navigation relation, all the screened-out third pages forming a third page set, wherein a third page is a page that has a two-way navigation relation with the target column list page; screening out fourth pages from the third page set, all the screened-out fourth pages forming a fourth page set, wherein a fourth page is a page in the third page set other than the column list pages and the home page; and taking the pages in the fourth page set as the content pages associated with the target column list page.
Optionally, screening out the fourth pages from the third page set comprises: obtaining the URLs of the home page and of all the column list pages of the target website; matching the URL of each third page in the third page set, in turn, against the URL of the home page and the URLs of all the column list pages; and if the URL of a third page matches neither the URL of the home page nor the URL of any column list page, determining that the third page is a fourth page.
The pages that have a two-way navigation relation with the target column list page, other than the home page and the other column list pages, are the content pages associated with the target column list page.
According to the first navigation relation, all the pages that have a two-way navigation relation with the target column list page, that is, the third pages, can be determined. There are multiple third pages. Excluding the home page and the other column list pages from the third pages yields the content pages associated with the target column list page.
If the URL of a third page matches the URL of the home page of the target website, the third page is determined to be the website home page.
If the URL of a third page matches the URL of some column list page, the third page is determined to be a column list page.
If the URL of a third page matches neither the URL of the home page of the target website nor the URL of any column list page, the third page is determined to be neither the website home page nor a column list page, but a content page associated with the target column list page, that is, a fourth page. There may be multiple fourth pages.
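The three matching outcomes above can be sketched as a small classifier. Treating "matching" as exact string equality of URLs is an assumption, as are the function name classify and the example URLs; the patent does not say how URLs are compared.

```python
def classify(url, home_url, column_list_urls):
    """Classify a third page by matching its URL against the known URLs."""
    if url == home_url:
        return "home page"
    if url in column_list_urls:
        return "column list page"
    return "content page"  # no match at all: this is a fourth page

home_url = "http://example.com/"
column_list_urls = {"http://example.com/news/", "http://example.com/sport/"}

# Hypothetical third pages of the target column list page http://example.com/news/.
third_pages = [
    "http://example.com/",
    "http://example.com/sport/",
    "http://example.com/news/1.html",
    "http://example.com/news/2.html",
]
fourth_pages = [
    url for url in third_pages
    if classify(url, home_url, column_list_urls) == "content page"
]
```

A real site may serve the same page under several URL spellings (trailing slash, query strings), so exact equality would likely need a normalization step in front of it.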
In the embodiments of the present invention, the URLs of all the column list pages and of the home page are collected; the hyperlink information in the website's pages is crawled; a navigation relation graph is established according to the URL of the current page and the hyperlink information on it; a comprehensive navigation relation graph is established from all the pages; the pages that have a two-way navigation relation with the specified column list page are looked up; and the other column list pages and the home page are excluded from these pages. The remaining pages are all the content pages associated with the column list page.
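The whole procedure summarized above can be sketched end to end. To keep the example runnable without network access, pages are fetched from a hypothetical in-memory dictionary SITE through a caller-supplied fetch_links function; the breadth-first crawl order and all names are assumptions rather than details taken from the patent.

```python
from collections import deque

# Hypothetical site: URL -> URLs linked from that page.
SITE = {
    "/": ["/news/", "/sport/"],
    "/news/": ["/", "/sport/", "/news/1.html", "/news/2.html"],
    "/sport/": ["/", "/news/", "/sport/1.html"],
    "/news/1.html": ["/", "/news/", "/sport/"],
    "/news/2.html": ["/", "/news/", "/sport/"],
    "/sport/1.html": ["/", "/news/", "/sport/"],
}

def crawl(start_url, fetch_links):
    """Breadth-first crawl that returns the map: page -> set of pages it links to."""
    relation, queue = {}, deque([start_url])
    while queue:
        url = queue.popleft()
        if url in relation:
            continue  # already crawled
        relation[url] = set(fetch_links(url)) - {url}
        queue.extend(u for u in relation[url] if u not in relation)
    return relation

def content_pages_of(relation, target, home_url, column_urls):
    """Two-way partners of the target, minus the home page and column list pages."""
    two_way = {p for p in relation[target] if target in relation.get(p, set())}
    return two_way - {home_url} - set(column_urls)

relation = crawl("/", lambda url: SITE.get(url, []))
columns = ["/news/", "/sport/"]
news_content = content_pages_of(relation, "/news/", "/", columns)
sport_content = content_pages_of(relation, "/sport/", "/", columns)
```

In this model /news/1.html links to /sport/ but is not linked from it, so it is attributed only to /news/.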
Existing methods generally crawl the list items under a website column, which requires inspecting the HTML source of the column page to find the XPath of the list items. After the crawler has downloaded the column page, the page text is parsed with that XPath to obtain the list items. XPath is a query language that carries a learning cost and an operating cost, which increases labor costs. The web page processing method provided by the embodiments of the present invention obtains the content pages of a column list page without using XPath, and thus reduces labor costs.
The web page processing method provided by the embodiments of the present invention uses the inherent regularity of hyperlink navigation between list pages and content pages to accurately obtain the pages associated with a column list page, that is, its content pages. Because this regularity does not depend on the version of the website, the present invention remains applicable after a website redesign, with no subsequent maintenance cost.
The embodiments of the present invention also provide a web page processing device. The device can execute the above web page processing method, and the above web page processing method can also be executed by the device. Fig. 8 is a schematic diagram of a web page processing device according to an embodiment of the present invention. As shown in Fig. 8, the device includes a crawling unit 10, an establishing unit 20, and a determination unit 30.
The crawling unit 10 is configured to crawl the hyperlink information on all pages of the target website, wherein the pages of the target website include a home page, column list pages, and content pages.
The establishing unit 20 is configured to establish the navigation relations between the pages according to the URLs of the crawled pages and the hyperlink information, to obtain a first navigation relation.
The determination unit 30 is configured to determine the content pages associated with the target column list page according to the first navigation relation.
Optionally, there are multiple crawled pages, and the establishing unit 20 includes a crawling subunit, an establishing subunit, and an aggregating subunit. The crawling subunit is configured to crawl the hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of the target website. The establishing subunit is configured to establish the navigation relations between the first page and other pages according to the URL of the first page and the first hyperlink information, to obtain a second navigation relation. The aggregating subunit is configured to aggregate all the second navigation relations to obtain the first navigation relation.
Optionally, the establishing subunit includes a first determining module and a drawing module. The first determining module is configured to take each page linked to by the first hyperlink information on the first page as a second page. The drawing module is configured to draw the navigation relations between the first page and all the second pages, to obtain the second navigation relation.
Optionally, the determination unit 30 includes a lookup subunit, a screening subunit, and a determining subunit. The lookup subunit is configured to screen out third pages according to the first navigation relation, all the screened-out third pages forming a third page set, wherein a third page is a page that has a two-way navigation relation with the target column list page. The screening subunit is configured to screen out fourth pages from the third page set, all the screened-out fourth pages forming a fourth page set, wherein a fourth page is a page in the third page set other than the column list pages and the home page. The determining subunit is configured to take the pages in the fourth page set as the content pages associated with the target column list page.
Optionally, the screening subunit includes an obtaining module, a matching module, and a second determining module. The obtaining module is configured to obtain the URLs of the home page and of all the column list pages of the target website. The matching module is configured to match the URL of each third page in the third page set, in turn, against the URL of the home page and the URLs of all the column list pages. The second determining module is configured to determine that a third page is a fourth page when its URL matches neither the URL of the home page nor the URL of any column list page.
The web page processing device includes a processor and a memory. The crawling unit 10, the establishing unit 20, the determination unit 30, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
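One way to picture the three program units is as small classes that pass data from one to the next. This is only a sketch: the class names, the run methods, and the in-memory SITE dictionary are hypothetical, and the patent does not specify how the units are implemented.

```python
class CrawlingUnit:
    """Unit 10: collects the hyperlinks on every page of the target website."""

    def __init__(self, fetch_links):
        self.fetch_links = fetch_links  # hypothetical callable: URL -> linked URLs

    def run(self, page_urls):
        return {url: set(self.fetch_links(url)) for url in page_urls}

class EstablishingUnit:
    """Unit 20: turns the crawled links into the first navigation relation."""

    def run(self, crawled):
        return {(src, dst) for src, targets in crawled.items() for dst in targets}

class DeterminationUnit:
    """Unit 30: finds the content pages associated with a target column list page."""

    def __init__(self, home_url, column_list_urls):
        self.excluded = {home_url, *column_list_urls}

    def run(self, edges, target):
        return {
            dst for src, dst in edges
            if src == target and (dst, src) in edges and dst not in self.excluded
        }

# Hypothetical site: page -> pages it links to.
SITE = {"home": ["list1"], "list1": ["home", "c1", "c2"], "c1": ["list1"], "c2": ["home"]}

crawled = CrawlingUnit(SITE.__getitem__).run(SITE)
edges = EstablishingUnit().run(crawled)
result = DeterminationUnit("home", ["list1"]).run(edges, "list1")
```

Page c2 is linked from list1 but does not link back, so only c1 is returned.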
Include kernel in processor, is gone in memory to transfer corresponding program unit by kernel.Kernel can be set one Or more, content pages associated with target column list page are determined by adjusting kernel parameter.
Memory may include the non-volatile memory in computer-readable medium, random access memory (RAM) and/ Or the forms such as Nonvolatile memory, if read-only memory (ROM) or flash memory (flash RAM), memory include that at least one is deposited Store up chip.
The embodiment of the invention provides a kind of storage mediums, are stored thereon with program, real when which is executed by processor The existing web page processing method.
The embodiment of the invention provides a kind of processor, the processor is for running program, wherein described program operation Web page processing method described in Shi Zhihang.
An embodiment of the present invention provides a device that includes a processor, a memory, and a program stored in the memory and executable on the processor. When executing the program, the processor performs the following steps:
Crawling the hyperlink information on all pages of a target website, where the pages of the target website include a homepage, column list pages, and content pages; establishing the navigation relation between pages according to the network addresses of the crawled pages and the hyperlink information, to obtain a first navigation relation; and determining the content pages associated with a target column list page according to the first navigation relation.
Crawling the hyperlink information in a first page to obtain first hyperlink information, where the first page is any page of the target website; establishing the navigation relation between the first page and other pages according to the network address of the first page and the first hyperlink information, to obtain a second navigation relation; and aggregating all the second navigation relations to obtain the first navigation relation.
Taking the pages linked by the first hyperlink information in the first page as second pages; and drawing the navigation relation between the first page and all the second pages to obtain the second navigation relation.
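The construction of the second navigation relations and their aggregation into the first navigation relation can be sketched as below. The sketch assumes the page HTML has already been fetched and is supplied as strings, represents a navigation relation as a set of (source, destination) pairs, and uses Python's standard `html.parser`; none of these choices comes from the text, and all names and addresses are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags, standing in for the hyperlink information."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def second_navigation_relation(page_url, html):
    """Edges from one first page to every second page that it links to."""
    collector = LinkCollector()
    collector.feed(html)
    edges = set()
    for href in collector.hrefs:
        # Resolve relative links against the page address and ignore fragments.
        target, _fragment = urldefrag(urljoin(page_url, href))
        if target != page_url:
            edges.add((page_url, target))
    return edges


def first_navigation_relation(pages):
    """Aggregate the second navigation relations of all crawled pages."""
    nav = set()
    for url, html in pages.items():
        nav |= second_navigation_relation(url, html)
    return nav


# Hypothetical crawl result: page address -> HTML source.
pages = {
    "http://example.com/": '<a href="/news/">News</a>',
    "http://example.com/news/": '<a href="/">Home</a><a href="1.html">A</a><a href="#top">Top</a>',
    "http://example.com/news/1.html": '<a href="./">Back</a>',
}
nav = first_navigation_relation(pages)
```

Storing directed pairs keeps the direction of each hyperlink, so a two-way navigation relation between pages A and B is simply the presence of both (A, B) and (B, A) in the aggregated set.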
Selecting third pages according to the first navigation relation, where all the selected third pages constitute a third page set and a third page is a page that has a two-way navigation relation with the target column list page; selecting fourth pages from the third page set, where all the selected fourth pages constitute a fourth page set and a fourth page is a page in the third page set other than the column list pages and the homepage; and taking the pages in the fourth page set as the content pages associated with the target column list page.
Obtaining the network address of the homepage and the network addresses of all column list pages of the target website; matching, in turn, the network address of each third page in the third page set against the network address of the homepage and the network addresses of all column list pages; and, if the network address of a third page fails to match both the network address of the homepage and the network addresses of all column list pages, determining that the third page is a fourth page.
Equipment herein can be server, PC, PAD, mobile phone etc..
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
Crawling the hyperlink information on all pages of a target website, where the pages of the target website include a homepage, column list pages, and content pages; establishing the navigation relation between pages according to the network addresses of the crawled pages and the hyperlink information, to obtain a first navigation relation; and determining the content pages associated with a target column list page according to the first navigation relation.
Crawling the hyperlink information in a first page to obtain first hyperlink information, where the first page is any page of the target website; establishing the navigation relation between the first page and other pages according to the network address of the first page and the first hyperlink information, to obtain a second navigation relation; and aggregating all the second navigation relations to obtain the first navigation relation.
Taking the pages linked by the first hyperlink information in the first page as second pages; and drawing the navigation relation between the first page and all the second pages to obtain the second navigation relation.
Selecting third pages according to the first navigation relation, where all the selected third pages constitute a third page set and a third page is a page that has a two-way navigation relation with the target column list page; selecting fourth pages from the third page set, where all the selected fourth pages constitute a fourth page set and a fourth page is a page in the third page set other than the column list pages and the homepage; and taking the pages in the fourth page set as the content pages associated with the target column list page.
Obtaining the network address of the homepage and the network addresses of all column list pages of the target website; matching, in turn, the network address of each third page in the third page set against the network address of the homepage and the network addresses of all column list pages; and, if the network address of a third page fails to match both the network address of the homepage and the network addresses of all column list pages, determining that the third page is a fourth page.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and a memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, commodity, or device that includes the element.
The above are merely embodiments of the present application and are not intended to limit it. For those skilled in the art, various modifications and changes may be made to the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of its claims.

Claims (10)

1. A web page processing method, comprising:
crawling the hyperlink information on all pages of a target website, wherein the pages of the target website include a homepage, column list pages, and content pages;
establishing the navigation relation between pages according to the network addresses of the crawled pages and the hyperlink information, to obtain a first navigation relation;
determining the content pages associated with a target column list page according to the first navigation relation.
2. The method according to claim 1, wherein establishing the navigation relation between pages according to the network addresses of the crawled pages and the hyperlink information, to obtain the first navigation relation, comprises:
crawling the hyperlink information in a first page to obtain first hyperlink information, wherein the first page is any page of the target website;
establishing the navigation relation between the first page and other pages according to the network address of the first page and the first hyperlink information, to obtain a second navigation relation;
aggregating all the second navigation relations to obtain the first navigation relation.
3. The method according to claim 2, wherein establishing the navigation relation between the first page and other pages according to the network address of the first page and the first hyperlink information, to obtain the second navigation relation, comprises:
taking the pages linked by the first hyperlink information in the first page as second pages;
drawing the navigation relation between the first page and all the second pages to obtain the second navigation relation.
4. The method according to claim 1, wherein determining the content pages associated with the target column list page according to the first navigation relation comprises:
selecting third pages according to the first navigation relation, wherein all the selected third pages constitute a third page set, and a third page is a page that has a two-way navigation relation with the target column list page;
selecting fourth pages from the third page set, wherein all the selected fourth pages constitute a fourth page set, and a fourth page is a page in the third page set other than the column list pages and the homepage;
taking the pages in the fourth page set as the content pages associated with the target column list page.
5. The method according to claim 4, wherein selecting the fourth pages from the third page set comprises:
obtaining the network address of the homepage and the network addresses of all column list pages of the target website;
matching, in turn, the network address of each third page in the third page set against the network address of the homepage of the target website and the network addresses of all the column list pages;
if the network address of a third page fails to match both the network address of the homepage of the target website and the network addresses of all the column list pages, determining that the third page is a fourth page.
6. A page processing device, comprising:
a crawling unit, configured to crawl the hyperlink information on all pages of a target website, wherein the pages of the target website include a homepage, column list pages, and content pages;
an establishing unit, configured to establish the navigation relation between pages according to the network addresses of the crawled pages and the hyperlink information, to obtain a first navigation relation;
a determination unit, configured to determine the content pages associated with a target column list page according to the first navigation relation.
7. The device according to claim 6, wherein there are multiple crawled pages, and the establishing unit comprises:
a crawling subunit, configured to crawl the hyperlink information in a first page to obtain first hyperlink information, wherein the first page is any page of the target website;
an establishing subunit, configured to establish the navigation relation between the first page and other pages according to the network address of the first page and the first hyperlink information, to obtain a second navigation relation;
an aggregation subunit, configured to aggregate all the second navigation relations to obtain the first navigation relation.
8. The device according to claim 7, wherein the establishing subunit comprises:
a first determination module, configured to take the pages linked by the first hyperlink information in the first page as second pages;
a drawing module, configured to draw the navigation relation between the first page and all the second pages to obtain the second navigation relation.
9. A storage medium on which a program is stored, wherein the program, when executed by a processor, implements the web page processing method according to any one of claims 1 to 5.
10. A processor configured to run a program, wherein the web page processing method according to any one of claims 1 to 5 is performed when the program runs.
CN201710705406.3A 2017-08-16 2017-08-16 Webpage processing method and device Active CN109948013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710705406.3A CN109948013B (en) 2017-08-16 2017-08-16 Webpage processing method and device

Publications (2)

Publication Number Publication Date
CN109948013A true CN109948013A (en) 2019-06-28
CN109948013B CN109948013B (en) 2021-11-05

Family

ID=67003895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710705406.3A Active CN109948013B (en) 2017-08-16 2017-08-16 Webpage processing method and device

Country Status (1)

Country Link
CN (1) CN109948013B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294815A (en) * 2013-06-08 2013-09-11 北京邮电大学 Search engine device with various presentation modes based on classification of key words and searching method
US20140040225A1 (en) * 2012-07-31 2014-02-06 International Business Machines Corporation Displaying browse sequence with search results
CN106547803A (en) * 2015-09-23 2017-03-29 北京国双科技有限公司 The method and apparatus for crawling website incremental resource

Also Published As

Publication number Publication date
CN109948013B (en) 2021-11-05

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant
GR01 Patent grant