CN109948013A - Web page processing method and device - Google Patents
- Publication number
- CN109948013A (application number CN201710705406.3A)
- Authority
- CN
- China
- Prior art keywords
- page
- pages
- navigation
- network address
- navigation relation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a web page processing method and device. The method comprises: crawling the hyperlink information on all pages of a target website, wherein the pages of the target website include a homepage, column list pages, and content pages; establishing the navigation relations between the pages according to the network addresses (URLs) of the crawled pages and the hyperlink information, to obtain a first navigation relation; and determining the content pages associated with a target column list page according to the first navigation relation. The invention solves the technical problem that the prior art cannot quickly and accurately determine the content pages associated with a column list page.
Description
Technical field
The present invention relates to the Internet field, and in particular to a web page processing method and device.
Background art
A website generally comprises two kinds of pages: content pages and list pages. A content page is a page that contains the information of a specific article, whereas a list page serves for page navigation: it navigates to content pages through an ordered list of hyperlinks. A website column is a category of the website's content; for example, a typical portal website has columns such as "news", "sports", and "entertainment". The page of a website column is usually a list page that navigates to each of its content pages.
In the business of crawling website data, accurately obtaining the data of all content pages of a website's column list page is of great practical significance. For example, by observing the update status of the content pages and of the column list page, it is possible to know whether the website has been updated.
Website pages are files in HTML format, and the list portion of a list page is typically composed of multiple li tags.
XPath is a language for locating information in an HTML document; for example, //li/h4 selects every h4 element that is a direct child of an li element.
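The XPath-based lookup described above can be sketched as follows. This is only a minimal illustration, not the prior-art crawler itself: the markup in LIST_PAGE and the function name list_item_links are hypothetical, and the sketch assumes a well-formed fragment so that Python's standard xml.etree.ElementTree module, which supports a limited XPath subset, can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment of a column list page (an assumption
# for illustration; real HTML is often not valid XML).
LIST_PAGE = """
<html><body>
  <ul>
    <li><h4><a href="/news/1.html">First article</a></h4></li>
    <li><h4><a href="/news/2.html">Second article</a></h4></li>
  </ul>
</body></html>
"""

def list_item_links(markup):
    """Return the href of every <a> inside an <h4> that is a child of an <li>."""
    root = ET.fromstring(markup)
    # ".//li/h4/a" is the ElementTree spelling of the XPath //li/h4/a.
    return [a.get("href") for a in root.findall(".//li/h4/a")]

print(list_item_links(LIST_PAGE))
```

The path expression is tied to the page layout: if the site wraps its list items in different tags after a redesign, the expression has to be found again by hand.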
The existing method of obtaining all content pages of a website column list page crawls the list items of a website column page: the HTML source file of the column page must be inspected to find the XPath of the list items, and after the crawler has fetched the column page, the text is parsed with that XPath to obtain the list items.
The prior-art method of obtaining the content pages associated with a website column list page mainly has the following shortcomings: obtaining the XPath of the list items of a column page requires a great deal of manual work; moreover, some websites are redesigned from time to time, in which case the XPath may change and parsing becomes inaccurate.
No effective solution to the above problem has yet been proposed.
Summary of the invention
The embodiments of the present invention provide a web page processing method and device, to at least solve the technical problem that the prior art cannot quickly and accurately determine the content pages associated with a column list page.
According to one aspect of the embodiments of the present invention, a web page processing method is provided, comprising: crawling the hyperlink information on all pages of a target website, wherein the pages of the target website include a homepage, column list pages, and content pages; establishing the navigation relations between the pages according to the network addresses of the crawled pages and the hyperlink information, to obtain a first navigation relation; and determining the content pages associated with a target column list page according to the first navigation relation.
Further, establishing the navigation relations between the pages according to the network addresses of the crawled pages and the hyperlink information to obtain the first navigation relation comprises: crawling the hyperlink information on a first page to obtain first hyperlink information, wherein the first page is any page of the target website; establishing the navigation relation between the first page and other pages according to the network address of the first page and the first hyperlink information, to obtain a second navigation relation; and aggregating all second navigation relations to obtain the first navigation relation.
Further, establishing the navigation relation between the first page and other pages according to the network address of the first page and the first hyperlink information to obtain the second navigation relation comprises: taking each page linked to by the first hyperlink information in the first page as a second page; and drawing the navigation relation between the first page and all second pages, to obtain the second navigation relation.
Further, content pages associated with target column list page are determined according to first navigation relation, comprising:
The third page is filtered out according to first navigation relation, the whole filtered out the third page constitutes third page set,
The third page is the page for having two-way navigation relation with the target column list page;From the third page set
The 4th page is filtered out, the whole filtered out the 4th page constitutes the 4th page set, wherein the 4th page is institute
State the page in third page set in addition to the column list page and the homepage;By the page in the 4th page set
Face is as the associated content pages of target column list page.
Further, the 4th page is filtered out from the third page set, comprising: obtain the head of the targeted website
The network address of page and all column list pages;Successively by the network address of the third pages all in the third page set with it is described
The network address of the homepage of targeted website, the network address of all column list pages are matched respectively;If the third page
It fails to match for the network address of network address and the homepage of the targeted website, the network address of all column list pages, it is determined that described
The third page is the 4th page.
According to another aspect of an embodiment of the present invention, a kind of page processor is additionally provided, comprising: crawl unit, use
Hyperlinked information on all pages for crawling targeted website, wherein the page of the targeted website includes: homepage, column
List page, content pages;Unit is established, for establishing between the page according to the network address and the hyperlinked information of the page crawled
Navigation relation obtains the first navigation relation;Determination unit, for according to first navigation relation determination and the list of target column
The associated content pages of page.
Further, the page crawled is multiple, and the unit of establishing includes: to crawl subelement, for crawling the
Hyperlinked information on one page obtains the first hyperlinked information, wherein the first page is any of the targeted website
One page;Subelement is established, establishes described for the network address and first hyperlinked information according to the first page
Navigation relation between one page and other pages obtains the second navigation relation;Summarize subelement, for summarizing all second
Navigation relation obtains first navigation relation.
Further, the subelement of establishing includes: the first determining module, and being used for will be first described in the first page
The page that hyperlinked information is linked is as second page;Drafting module, for drawing the first page and all described the
Navigation relation between two pages obtains second navigation relation.
Further, the determination unit includes: lookup subelement, for filtering out according to first navigation relation
Three pages, the whole filtered out the third page constitute third page set, and the third page is and the target column
List page has the page of two-way navigation relation;Subelement is screened, for filtering out page four from the third page set
Face, the whole filtered out the 4th page constitute the 4th page set, wherein the 4th page is the third page set
The page in conjunction in addition to the column list page and the homepage;Determine subelement, being used for will be in the 4th page set
The page as the associated content pages of target column list page.
Further, the screening subelement includes: acquisition module, for obtaining the homepage of the targeted website and owning
The network address of column list page;Matching module, for successively by the network address of the third pages all in the third page set
It is matched respectively with the network address of the network address of the homepage of the targeted website, all column list pages;Second determining module,
Network address, the network address of all column list pages for network address and the homepage of the targeted website in the third page is equal
In the case that it fails to match, it is determined that the third page is the 4th page.
Another aspect according to an embodiment of the present invention, additionally provides a kind of storage medium, is stored thereon with program, the program
The web page processing method is realized when being executed by processor.
It is according to an embodiment of the present invention that in another aspect, additionally providing a kind of processor, the processor is used to run program,
Wherein, the web page processing method is executed when described program is run.
In the embodiments of the present invention, the pages that have a two-way navigation relation with the target column list page, other than the homepage and the other column list pages, are exactly the content pages associated with the target column list page. The navigation relations between the pages, that is, the first navigation relation, are established according to the network addresses of the crawled pages and the hyperlink information in those pages. Excluding the homepage and the other column list pages according to the first navigation relation yields the content pages associated with the target column list page. This process requires no manual participation, and a redesign of the website does not affect the result. The technical effect of quickly and accurately determining the content pages associated with a column list page is thereby achieved, which solves the technical problem that the prior art cannot quickly and accurately determine them.
Brief description of the drawings
The drawings described herein provide a further understanding of the present invention and form a part of the invention. The illustrative embodiments of the present invention and their descriptions serve to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is the flow chart of web page processing method according to an embodiment of the present invention;
Fig. 2 is the schematic diagram of content pages according to an embodiment of the present invention;
Fig. 3 is the schematic diagram of column list page according to an embodiment of the present invention;
Fig. 4 is a kind of schematic diagram of second navigation relation according to an embodiment of the present invention;
Fig. 5 is the schematic diagram of another the second navigation relation according to an embodiment of the present invention;
Fig. 6 is the schematic diagram of another the second navigation relation according to an embodiment of the present invention;
Fig. 7 is a kind of schematic diagram of first navigation relation according to an embodiment of the present invention;
Fig. 8 is the schematic diagram of page processor according to an embodiment of the present invention.
Detailed description of the embodiments
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and do not describe a particular order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the present invention described herein can be implemented in an order other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations of them are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
According to an embodiment of the present invention, an embodiment of a web page processing method is provided. It should be noted that the steps illustrated in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in an order different from the one given here.
Fig. 1 is a kind of flow chart of web page processing method according to an embodiment of the present invention.As shown in Figure 1, this method includes
Following steps:
Step S102 crawls the hyperlinked information on all pages of targeted website, wherein the page packet of targeted website
It includes: homepage, column list page, content pages.
Step S104 establishes the navigation relation between the page according to the network address of the page crawled and hyperlinked information, obtains
First navigation relation.
Step S106 determines content pages associated with target column list page according to the first navigation relation.
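The hyperlink information crawled in step S102 could, for instance, be collected as sketched below. This is a hedged illustration under stated assumptions, not the implementation of the embodiment: the names LinkCollector and extract_links are hypothetical, fetching the page over the network is omitted, and resolving relative links and dropping #fragments are design choices of this sketch that the text does not prescribe.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkCollector(HTMLParser):
    """Collects the absolute target of every <a href=...> on one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url   # network address of the page being parsed
        self.links = set()         # network addresses the page links to

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own address and
                # drop any "#fragment" so each page has a single address.
                absolute, _fragment = urldefrag(urljoin(self.page_url, value))
                self.links.add(absolute)

def extract_links(page_url, html):
    """Return the set of network addresses hyperlinked from the given page."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    return collector.links
```

Calling extract_links once per crawled page gives, for each network address, the set of addresses that page can navigate to, which is the input the later steps work on.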
A content page is a page that contains the information of a specific article; Fig. 2 shows a content page.
A column list page serves for page navigation: it navigates to content pages through an ordered list of hyperlinks. Fig. 3 shows a column list page.
A column is a category of the website's content. A typical website has several columns, for example "news", "sports", and "entertainment". A column list page is usually presented as a list and navigates to each of its content pages.
If page A contains a hyperlink that links to page B, page A is considered able to navigate to page B.
If page B contains a hyperlink that links to page A, page B is considered able to navigate to page A.
If page A contains a hyperlink that links to page B and page B contains a hyperlink that links to page A, page A and page B are considered to have a two-way navigation relation.
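The three definitions above can be expressed as a small check. This is a minimal sketch; the dictionary layout (page mapped to the set of pages it links to) and the names can_navigate and is_two_way are assumptions made for illustration.

```python
def can_navigate(links, a, b):
    """True if page a contains a hyperlink that links to page b."""
    return b in links.get(a, set())

def is_two_way(links, a, b):
    """True if a can navigate to b and b can navigate to a."""
    return can_navigate(links, a, b) and can_navigate(links, b, a)

# Hypothetical link data: A links to B and C, B links back to A, C links nowhere.
links = {"A": {"B", "C"}, "B": {"A"}, "C": set()}

print(is_two_way(links, "A", "B"))
print(is_two_way(links, "A", "C"))
```

Here pages A and B have a two-way navigation relation, while A can navigate to C only in one direction.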
The pages of the target website include a homepage, column list pages, and content pages. Through extensive study the inventors found that: the homepage has a two-way navigation relation with multiple column list pages; a column list page has a two-way navigation relation with the other column list pages; and a column list page has a two-way navigation relation with its own content pages but not with the content pages of other column list pages.
Table 1
For example, as shown in Table 1, there are 3 column list pages in total. Column list page L1 has 6 content pages: P(1,1), P(1,2), P(1,3), P(1,4), P(1,5), and P(1,6). Column list page L1 has a two-way navigation relation with these 6 content pages.
Column list page L2 has 11 content pages: P(2,1), P(2,2), P(2,3), P(2,4), P(2,5), P(2,6), P(2,7), P(2,8), P(2,9), P(2,10), and P(2,11). Column list page L2 has a two-way navigation relation with these 11 content pages.
Column list page L3 has 8 content pages: P(3,1), P(3,2), P(3,3), P(3,4), P(3,5), P(3,6), P(3,7), and P(3,8). Column list page L3 has a two-way navigation relation with these 8 content pages.
A column list page has a two-way navigation relation with its own content pages and has none with the content pages of other column list pages.
In the embodiments of the present invention, the pages that have a two-way navigation relation with the target column list page, other than the homepage and the other column list pages, are exactly the content pages associated with the target column list page. The navigation relations between the pages of the target website, that is, the first navigation relation, are established according to the network addresses of the crawled pages of the target website and the hyperlink information in those pages. Excluding the homepage and the other column list pages according to the first navigation relation yields the content pages associated with the target column list page. This process requires no manual participation, and a redesign of the website does not affect the result. This solves the technical problem that the prior art cannot quickly and accurately determine the content pages associated with a column list page, and achieves the technical effect of determining them quickly and accurately.
Optionally, it is established between first page and other pages according to the network address of first page and the first hyperlinked information
Navigation relation obtains the second navigation relation, comprising: using the page that the first hyperlinked information is linked in first page as second
The page;The navigation relation between first page and whole second pages is drawn, the second navigation relation is obtained.
Optionally, the navigation relation between the page is established according to the network address of the page crawled and hyperlinked information, obtains
One navigation relation, comprising: crawl the hyperlinked information in first page, obtain the first hyperlinked information, wherein first page is
Any one page of targeted website;First page and other pages are established according to the network address of first page and the first hyperlinked information
Navigation relation between face obtains the second navigation relation;Summarize the second all navigation relations, obtains the first navigation relation.
First page is any one page of targeted website, crawls the hyperlinked information in first page, obtains first
Hyperlinked information establishes the navigation between first page and other pages according to the network address of first page and the first hyperlinked information
Relationship obtains the second navigation relation.The detailed process for establishing the second navigation relation can be such that
Suppose the first hyperlink information points (links) to some page; then the first page can navigate to that page. If that page in turn contains a hyperlink that links to the first page, the first page and that page have a two-way navigation relation.
Suppose the target website has M pages in total, so that there are M first pages. Establishing the navigation relation between a first page and other pages according to its network address and first hyperlink information yields one second navigation relation per page, and therefore M second navigation relations in all. Aggregating these M second navigation relations yields the first navigation relation, which is a comprehensive navigation relation.
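The aggregation of the M second navigation relations into the first navigation relation can be sketched as a union of directed edges. This is an illustration only: representing a navigation relation as a set of (source, target) pairs, ignoring links from a page to itself, and the names second_relation and first_relation are all assumptions of this sketch rather than details given in the text.

```python
def second_relation(page_url, outgoing_links):
    """Second navigation relation of one page: an edge to every page it links to."""
    # Self-links are skipped (an assumption; they carry no navigation information).
    return {(page_url, target) for target in outgoing_links if target != page_url}

def first_relation(crawled):
    """First navigation relation: the union of all second navigation relations.

    `crawled` maps each crawled page's network address to the set of
    network addresses its hyperlinks point to.
    """
    edges = set()
    for page_url, outgoing_links in crawled.items():
        edges |= second_relation(page_url, outgoing_links)
    return edges
```

With M crawled pages the loop runs M times, once per second navigation relation, and the resulting edge set plays the role of the comprehensive navigation relation.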
Crawling the website homepage gives the second navigation relation of the homepage, shown in Fig. 4; crawling column list page 1 gives the second navigation relation of column list page 1, shown in Fig. 5; crawling content page 1 of column list page 1 gives the second navigation relation of that content page, shown in Fig. 6. Combining these 3 second navigation relations into one graph gives the first navigation relation shown in Fig. 7.
As can be seen from Fig. 7, column list page 1 has a two-way navigation relation with the homepage, with column list page 2, and with content pages 1, 2, and 3 of column list page 1, whereas column list page 2 has no two-way navigation relation with content pages 1, 2, and 3 of column list page 1. That is, the pages that have a two-way navigation relation with a specific column list page (the target column list page), other than the homepage and the other column list pages, are the content pages associated with that column list page.
Optionally, associated with target column list page content pages are determined according to the first navigation relation, comprising: according to the
One navigation relation filters out the third page, and whole third pages for filtering out constitute third page set, and the third page is and mesh
Mark the page that column list page has two-way navigation relation;The 4th page is filtered out from third page set, what is filtered out is complete
The 4th page of portion constitutes the 4th page set, wherein the 4th page be in third page set except column list page and homepage with
The outer page;Using the page in the 4th page set as the associated content pages of target column list page.
Optionally, the 4th page is filtered out from third page set, comprising: obtain targeted website homepage and all columns
The network address of mesh list page;Successively by the network address of the network address of the third pages all in third page set and the homepage of targeted website,
The network address of all column list pages is matched respectively;If the network address of the homepage of the network address and targeted website of the third page, institute
Have column list page network address it fails to match, it is determined that the third page be the 4th page.
The pages that have a two-way navigation relation with the target column list page, other than the homepage and the other column list pages, are exactly the content pages associated with the target column list page.
According to the first navigation relation, all pages that have a two-way navigation relation with the target column list page, that is, the third pages, can be determined; there are multiple third pages. Excluding the homepage and the other column list pages from the third pages yields the content pages associated with the target column list page.
If the network address of a third page matches the network address of the homepage of the target website, that third page is determined to be the website homepage.
If the network address of a third page matches the network address of some column list page, that third page is determined to be a column list page.
If the network address of a third page matches neither the network address of the homepage of the target website nor the network address of any column list page, that third page is determined to be neither the website homepage nor a column list page but a content page associated with the target column list page, that is, a fourth page. There can be multiple fourth pages.
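The screening of third pages and fourth pages can be sketched as below. This is a hedged illustration: the edge-set representation, the helper two_way, the function associated_content_pages, and the example addresses are all hypothetical, and "matching" network addresses is assumed here to mean exact string equality, which the text does not specify.

```python
def two_way(a, b):
    """Helper for the example: the two edges of a two-way navigation relation."""
    return {(a, b), (b, a)}

def associated_content_pages(edges, target, homepage, list_pages):
    """Return the fourth pages for the target column list page."""
    # Third pages: pages with a two-way navigation relation to the target.
    third_pages = {b for (a, b) in edges if a == target and (b, a) in edges}
    # Fourth pages: third pages whose address matches neither the homepage
    # nor any column list page (exact string match is an assumption).
    known = {homepage} | set(list_pages)
    return {page for page in third_pages if page not in known}

# Hypothetical addresses loosely following the situation described for Fig. 7.
HOME, LIST1, LIST2 = "/", "/list1/", "/list2/"
C1, C2, C3 = "/list1/c1.html", "/list1/c2.html", "/list1/c3.html"

edges = set()
for other in (HOME, LIST2, C1, C2, C3):
    edges |= two_way(LIST1, other)
edges |= two_way(HOME, LIST2)
edges.add((C1, HOME))  # a one-way link; it does not create a third page

print(sorted(associated_content_pages(edges, LIST1, HOME, [LIST1, LIST2])))
```

In this example the third pages of column list page 1 are the homepage, column list page 2, and the three content pages; removing the homepage and the column list pages leaves only the three content pages.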
In the embodiments of the present invention, the network addresses of all column list pages and of the homepage are collected; the hyperlink information in the website's pages is crawled; and a navigation relation graph is established according to the network address of the current page and the hyperlink information in that page. A comprehensive navigation relation graph is established from all pages; the pages that have a two-way navigation relation with the specified column list page are looked up; and the other column list pages and the homepage are excluded from these pages, leaving all content pages associated with that column list page.
The existing method generally crawls the list items under a website column: the HTML source file of the website column must be inspected to find the XPath of the list items, and after the crawler has fetched the column page, the text is parsed with that XPath. XPath is a query language with a learning cost and an operating cost, which increases labor cost. The web page processing method provided by the embodiments of the present invention avoids using XPath to obtain the content page data of a column list page and thus reduces labor cost.
The web page processing method provided by the embodiments of the present invention exploits the inherent regularity of hyperlink navigation between list pages and content pages to accurately obtain the pages associated with a column list page, namely its content pages. Because this regularity does not depend on the layout version of the website, the invention remains applicable after a website redesign, with no subsequent maintenance cost.
The embodiment of the invention also provides a kind of page processor, which is able to carry out at above-mentioned webpage
Reason method, above-mentioned web page processing method can also be executed by the page processor.Fig. 8 is net according to an embodiment of the present invention
The schematic diagram of sheet processing apparatus, as shown in figure 8, the device includes: to crawl unit 10, establish unit 20, determination unit 30.
Unit 10 is crawled, the hyperlinked information on all pages for crawling targeted website, wherein the page of targeted website
Face includes: homepage, column list page, content pages.
Unit 20 is established, the navigation between the page is established for the network address and hyperlinked information according to the page crawled and closes
System, obtains the first navigation relation.
Determination unit 30, for determining content pages associated with target column list page according to the first navigation relation.
Optionally, the page crawled be it is multiple, establish unit 20 include: crawl subelement, establish subelement, summarize it is sub single
Member.Subelement is crawled, for crawling the hyperlinked information in first page, obtains the first hyperlinked information, wherein first page
For any one page of targeted website.Subelement is established, for building according to the network address and the first hyperlinked information of first page
Navigation relation between vertical first page and other pages, obtains the second navigation relation.Summarize subelement, it is all for summarizing
Second navigation relation obtains the first navigation relation.
Optionally, establishing subelement includes: the first determining module, drafting module.First determining module is used for first page
The page that the first hyperlinked information is linked in face is as second page.Drafting module, for drawing first page and all the
Navigation relation between two pages obtains the second navigation relation.
Optionally it is determined that unit 30 includes: to search subelement, screen subelement, determine subelement.Subelement is searched, is used
In filtering out the third page according to the first navigation relation, the whole third pages filtered out constitute third page set, third page
Face is the page for having two-way navigation relation with target column list page.Subelement is screened, for sieving from third page set
The 4th page is selected, all the 4th pages filtered out constitute the 4th page set, wherein the 4th page is third page set
In the page in addition to column list page and homepage.Subelement is determined, for using the page in the 4th page set as target
The associated content pages of column list page.
Optionally, screening subelement includes: to obtain module, matching module, the second determining module.Module is obtained, for obtaining
Take the homepage of targeted website and the network address of all column list pages.Matching module, for will successively own in third page set
The network address of the homepage of the network address and targeted website of the third page, the network address of all column list pages are matched respectively.Second really
Cover half block, the network address of the homepage for the network address and targeted website in the third page, the network address of all column list pages match
In the case where failure, it is determined that the third page is the 4th page.
The web page processing device includes a processor and a memory. The crawling unit 10, the establishing unit 20, the determination unit 30, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to realize the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels can be provided, and the content pages associated with the target column list page are determined by adjusting kernel parameters.
The memory may include non-persistent (volatile) memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The embodiment of the invention provides a kind of storage mediums, are stored thereon with program, real when which is executed by processor
The existing web page processing method.
The embodiment of the invention provides a kind of processor, the processor is for running program, wherein described program operation
Web page processing method described in Shi Zhihang.
The embodiment of the invention provides a kind of equipment, equipment include processor, memory and storage on a memory and can
The program run on a processor, processor perform the steps of when executing program
Crawl the hyperlinked information on all pages of targeted website, wherein the page of targeted website includes: homepage, column
Mesh list page, content pages;The navigation relation between the page is established according to the network address of the page crawled and hyperlinked information, obtains
One navigation relation;Content pages associated with target column list page are determined according to the first navigation relation.
Crawling the hyperlink information in a first page to obtain first hyperlink information, where the first page is any page of the target website; establishing the navigation relation between the first page and other pages according to the URL of the first page and the first hyperlink information, to obtain a second navigation relation; and summarizing all second navigation relations to obtain the first navigation relation.
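One way to picture the summarizing step is to treat each second navigation relation as a set of directed edges and the first navigation relation as their union. The following is a sketch under that assumption; the names `second_relation` and `first_relation` and the edge-set representation are illustrative choices, not taken from the text.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source URL, target URL)


def second_relation(page_url: str, linked_urls: Iterable[str]) -> Set[Edge]:
    """Navigation relation of one page: an edge to every other page it links to."""
    return {(page_url, target) for target in linked_urls if target != page_url}


def first_relation(hyperlinks: Dict[str, Iterable[str]]) -> Set[Edge]:
    """Summarize all per-page relations into the site-wide navigation relation."""
    edges: Set[Edge] = set()
    for page_url, linked_urls in hyperlinks.items():
        edges |= second_relation(page_url, linked_urls)
    return edges


# Hypothetical crawl result: page URL -> URLs found in its hyperlinks.
hyperlinks = {
    "/": ["/news/"],
    "/news/": ["/", "/news/a1.html"],
    "/news/a1.html": ["/news/", "/"],
}
edges = first_relation(hyperlinks)
```

Storing edges in a set removes duplicates when the same link appears several times on a page.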
Taking the pages linked by the first hyperlink information in the first page as second pages; and drawing the navigation relation between the first page and all of the second pages, to obtain the second navigation relation.
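As an illustration of how the second pages might be derived from a page's HTML, the sketch below collects the `href` of every `<a>` tag with the standard-library `html.parser` module and resolves it against the first page's URL. The class `LinkCollector`, the function `draw_second_relation`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def draw_second_relation(first_page_url, html):
    """Return directed edges from the first page to each second page it links to."""
    collector = LinkCollector()
    collector.feed(html)
    # Second pages: absolute URLs of the pages the first page links to.
    second_pages = {urljoin(first_page_url, href) for href in collector.hrefs}
    return {(first_page_url, page) for page in second_pages}


html = '<ul><li><a href="a1.html">A1</a></li><li><a href="/sport/">Sport</a></li></ul>'
edges = draw_second_relation("http://example.com/news/", html)
```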
Selecting third pages according to the first navigation relation, where all of the selected third pages constitute a third page set and a third page is a page that has a two-way navigation relation with the target column list page; selecting fourth pages from the third page set, where all of the selected fourth pages constitute a fourth page set and a fourth page is a page in the third page set other than the column list pages and the homepage; and taking the pages in the fourth page set as the content pages associated with the target column list page.
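The two selection steps can be sketched as set operations over a directed edge set. This is an illustrative reading of the paragraph above, assuming the first navigation relation is available as `(source, target)` pairs; the function names and the toy edges are hypothetical.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source URL, target URL)


def associated_content_pages(edges: Set[Edge], target_list_page: str,
                             home_page: str, list_pages: Iterable[str]) -> Set[str]:
    """Pages linked both to and from the target list page, minus homepage and list pages."""
    linked_from_target = {dst for src, dst in edges if src == target_list_page}
    linking_to_target = {src for src, dst in edges if dst == target_list_page}
    third_pages = linked_from_target & linking_to_target  # two-way navigation
    fourth_pages = third_pages - set(list_pages) - {home_page}
    return fourth_pages


def both_ways(a: str, b: str) -> Set[Edge]:
    """Helper for the toy data: a pair of pages that link to each other."""
    return {(a, b), (b, a)}


edges = (both_ways("/", "/news/") | both_ways("/", "/sport/")
         | both_ways("/news/", "/sport/") | both_ways("/news/", "/news/a1.html")
         | both_ways("/sport/", "/sport/s1.html") | {("/news/a1.html", "/")})
content = associated_content_pages(edges, "/news/", "/", ["/news/", "/sport/"])
```

In the toy data the homepage and the other list page also link to and from `/news/`, which is why the second step has to remove them from the third page set.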
Obtaining the URL of the homepage of the target website and the URLs of all column list pages; matching, in turn, the URL of each third page in the third page set against the URL of the homepage of the target website and the URLs of all column list pages; and, if the URL of a third page fails to match the URL of the homepage and the URLs of all column list pages, determining that the third page is a fourth page.
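The URL matching can be sketched as follows. The text does not say how two URLs are compared, so the `normalize` helper here (case-insensitive scheme and host, trailing slash ignored) is purely an assumption; a plain string comparison would be an equally valid reading.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Hypothetical normalization: lower-case scheme and host, ignore a trailing slash."""
    parts = urlsplit(url)
    return (parts.scheme.lower(), parts.netloc.lower(),
            parts.path.rstrip("/"), parts.query)


def filter_fourth_pages(third_pages, home_url, list_page_urls):
    """Keep the third pages whose URL matches neither the homepage nor any list page."""
    known = {normalize(home_url)} | {normalize(u) for u in list_page_urls}
    # A failed match against every known URL marks the page as a fourth page.
    return [url for url in third_pages if normalize(url) not in known]


third_pages = [
    "http://example.com",
    "http://Example.com/sport/",
    "http://example.com/news/a1.html",
]
fourth_pages = filter_fourth_pages(
    third_pages,
    home_url="http://example.com/",
    list_page_urls=["http://example.com/news/", "http://example.com/sport"],
)
```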
The device herein may be a server, a PC, a tablet (PAD), a mobile phone, or the like.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
Crawling the hyperlink information on all pages of a target website, where the pages of the target website include a homepage, column list pages, and content pages; establishing the navigation relations between the pages according to the URLs of the crawled pages and the hyperlink information, to obtain a first navigation relation; and determining the content pages associated with a target column list page according to the first navigation relation.
Crawling the hyperlink information in a first page to obtain first hyperlink information, where the first page is any page of the target website; establishing the navigation relation between the first page and other pages according to the URL of the first page and the first hyperlink information, to obtain a second navigation relation; and summarizing all second navigation relations to obtain the first navigation relation.
Taking the pages linked by the first hyperlink information in the first page as second pages; and drawing the navigation relation between the first page and all of the second pages, to obtain the second navigation relation.
Selecting third pages according to the first navigation relation, where all of the selected third pages constitute a third page set and a third page is a page that has a two-way navigation relation with the target column list page; selecting fourth pages from the third page set, where all of the selected fourth pages constitute a fourth page set and a fourth page is a page in the third page set other than the column list pages and the homepage; and taking the pages in the fourth page set as the content pages associated with the target column list page.
Obtaining the URL of the homepage of the target website and the URLs of all column list pages; matching, in turn, the URL of each third page in the third page set against the URL of the homepage of the target website and the URLs of all column list pages; and, if the URL of a third page fails to match the URL of the homepage and the URLs of all column list pages, determining that the third page is a fourth page.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, where the instruction means implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-persistent memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity, or device that includes the element.
The above are merely embodiments of the present application and are not intended to limit it. For those skilled in the art, various modifications and changes may be made to the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of its claims.
Claims (10)
1. A web page processing method, characterized by comprising:
crawling the hyperlink information on all pages of a target website, wherein the pages of the target website include a homepage, column list pages, and content pages;
establishing the navigation relations between the pages according to the URLs of the crawled pages and the hyperlink information, to obtain a first navigation relation; and
determining the content pages associated with a target column list page according to the first navigation relation.
2. The method according to claim 1, characterized in that establishing the navigation relations between the pages according to the URLs of the crawled pages and the hyperlink information, to obtain the first navigation relation, comprises:
crawling the hyperlink information in a first page to obtain first hyperlink information, wherein the first page is any page of the target website;
establishing the navigation relation between the first page and other pages according to the URL of the first page and the first hyperlink information, to obtain a second navigation relation; and
summarizing all second navigation relations to obtain the first navigation relation.
3. The method according to claim 2, characterized in that establishing the navigation relation between the first page and other pages according to the URL of the first page and the first hyperlink information, to obtain the second navigation relation, comprises:
taking the pages linked by the first hyperlink information in the first page as second pages; and
drawing the navigation relation between the first page and all of the second pages, to obtain the second navigation relation.
4. The method according to claim 1, characterized in that determining the content pages associated with the target column list page according to the first navigation relation comprises:
selecting third pages according to the first navigation relation, wherein all of the selected third pages constitute a third page set, and a third page is a page that has a two-way navigation relation with the target column list page;
selecting fourth pages from the third page set, wherein all of the selected fourth pages constitute a fourth page set, and a fourth page is a page in the third page set other than the column list pages and the homepage; and
taking the pages in the fourth page set as the content pages associated with the target column list page.
5. The method according to claim 4, characterized in that selecting the fourth pages from the third page set comprises:
obtaining the URL of the homepage of the target website and the URLs of all column list pages;
matching, in turn, the URL of each third page in the third page set against the URL of the homepage of the target website and the URLs of all column list pages; and
if the URL of a third page fails to match the URL of the homepage of the target website and the URLs of all column list pages, determining that the third page is a fourth page.
6. A web page processing device, characterized by comprising:
a crawling unit, configured to crawl the hyperlink information on all pages of a target website, wherein the pages of the target website include a homepage, column list pages, and content pages;
an establishing unit, configured to establish the navigation relations between the pages according to the URLs of the crawled pages and the hyperlink information, to obtain a first navigation relation; and
a determining unit, configured to determine the content pages associated with a target column list page according to the first navigation relation.
7. The device according to claim 6, characterized in that there are multiple crawled pages, and the establishing unit comprises:
a crawling subunit, configured to crawl the hyperlink information in a first page to obtain first hyperlink information, wherein the first page is any page of the target website;
an establishing subunit, configured to establish the navigation relation between the first page and other pages according to the URL of the first page and the first hyperlink information, to obtain a second navigation relation; and
a summarizing subunit, configured to summarize all second navigation relations to obtain the first navigation relation.
8. The device according to claim 7, characterized in that the establishing subunit comprises:
a first determining module, configured to take the pages linked by the first hyperlink information in the first page as second pages; and
a drawing module, configured to draw the navigation relation between the first page and all of the second pages, to obtain the second navigation relation.
9. A storage medium, characterized in that a program is stored thereon, and the program, when executed by a processor, implements the web page processing method according to any one of claims 1 to 5.
10. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, executes the web page processing method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710705406.3A CN109948013B (en) | 2017-08-16 | 2017-08-16 | Webpage processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109948013A true CN109948013A (en) | 2019-06-28 |
CN109948013B CN109948013B (en) | 2021-11-05 |
Family
ID=67003895
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710705406.3A Active CN109948013B (en) | 2017-08-16 | 2017-08-16 | Webpage processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109948013B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140040225A1 (en) * | 2012-07-31 | 2014-02-06 | International Business Machines Corporation | Displaying browse sequence with search results |
CN103294815A (en) * | 2013-06-08 | 2013-09-11 | 北京邮电大学 | Search engine device with various presentation modes based on classification of key words and searching method |
CN106547803A (en) * | 2015-09-23 | 2017-03-29 | 北京国双科技有限公司 | The method and apparatus for crawling website incremental resource |
Also Published As
Publication number | Publication date |
---|---|
CN109948013B (en) | 2021-11-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing; Applicant after: Beijing Guoshuang Technology Co.,Ltd.; Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A; Applicant before: Beijing Guoshuang Technology Co.,Ltd. | |
GR01 | Patent grant | ||