CN102087648A

CN102087648A - Method and system for fetching news comment page

Info

Publication number: CN102087648A
Application number: CN2009102420552A
Authority: CN
Inventors: 严华梁; 刘伟; 杨建武; 万小军; 肖建国
Original assignee: BEIJING FOUNDER E-GOVERNMENT INFORMATION TECHNOLOGY Co Ltd; Peking University; Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: Beijing Founder Electronics Chief Information Technology Co ltd; New Founder Holdings Development Co ltd; Peking University; Beijing Founder Electronics Co Ltd
Priority date: 2009-12-03
Filing date: 2009-12-03
Publication date: 2011-06-08
Anticipated expiration: 2029-12-03
Also published as: CN102087648B

Abstract

The invention discloses a method and a system for fetching a news comment page, and belongs to the technical field of information retrieval and data integration. The method comprises the following steps of: performing breadth traversal on the pages from an initial page of a news website, and acquiring page information meeting depth limitation in the traversal process; then calculating the characteristic values of the pages, and identifying the news comment page from the pages according to the size relationship between the characteristic values and a preset threshold value; and finally, acquiring a page turning link of the news comment page, and acquiring other news comment pages according to the page turning link. The method and the system can automatically fetch the news comment page from the web pages of the news website, the fetching speed is high, and the fetched news comment page is comprehensive.

Description

Method and system for crawling news comment pages

Technical field

The invention belongs to the technical field of information retrieval and data integration, and specifically relates to a method and system for crawling news comment pages.

Background art

Since its birth in the early 1990s, the Web has grown at an astonishing rate. It is now the largest information repository in the world, covers every field of the real world, and has become a main channel through which people obtain information for work and daily life. Web information is published mainly in the form of web pages; according to recent estimates, the number of pages on the Web has exceeded 550 billion (1 billion equals 1,000,000,000). Manual browsing clearly cannot meet people's need for information. To let people access and use this mass of information more effectively, researchers began studying Web information search and integration in the mid-1990s, and industry has produced a range of related applications, such as vertical search engines and public opinion analysis. A necessary step in all of these applications is first to obtain the pages that hold the required information and then to extract that information accurately from poorly structured web pages.

News comments on the Web are comments that ordinary readers post, on news websites that grant commenting permission, about a particular news event or person. They are currently a very important source of information on the Internet and account for a large share of Web information. Many important applications and research topics build on news comment information, chiefly the following two:

1. Comment search engines: vertical search engines oriented toward comments. They collect and integrate comments from the large number of websites on which users are permitted to post comments, and give people timely and comprehensive search over comments on a particular news event or person. To keep news comment information timely and comprehensive, a large number of comment pages must be processed promptly, and the prerequisite for processing them is to obtain the news comment pages first.

2. Public opinion analysis: a hot research topic in natural language processing and information retrieval over the past decade. Its goal is to identify, from a continuous stream of records, topics unknown to the system and the reports related to those topics. One of its main information sources is the news comment information published on the Web.

As the applications above show, news comment information is a very important data source for them, and the prerequisite for obtaining it is to obtain the news comment pages. However, the Web contains a very large number of news websites, and each news website contains many kinds of pages, which inevitably and seriously affects the efficiency of information processing and the quality of retrieval. Automatic identification of news comment pages is therefore one of the key technical problems that many important applications urgently need solved, and it has great practical significance and broad application prospects. At present, the prior art offers no method or system for crawling news comment pages.

Summary of the invention

In view of the defects in the prior art, the object of the invention is to provide a method and system for crawling news comment pages. The method and system can effectively and automatically crawl news comment pages from news websites.

To achieve the above object, the invention adopts the following technical solution:

A method for crawling news comment pages, comprising the following steps:

(A) obtaining pages from a news website;

(B) identifying news comment pages among the obtained pages;

(C) obtaining the page-turning links in the news comment pages, and obtaining other news comment pages according to the page-turning links.

A system for crawling news comment pages, comprising: a page acquisition device for obtaining pages from a news website; a news comment page identification device for identifying news comment pages among the pages obtained by the page acquisition device; and a page-turning link acquisition device for obtaining the page-turning links of the news comment pages identified by the news comment page identification device and obtaining other news comment pages according to those links.

The method and system of the invention can automatically crawl news comment pages from the web pages of a news website; crawling is fast, and the news comment pages crawled are comprehensive.

Description of the drawings

Fig. 1 is a structural block diagram of a preferred embodiment of the news comment page crawling system of the invention;

Fig. 2 is a flowchart of the method for crawling news comment pages using the system shown in Fig. 1;

Fig. 3 is a flowchart of a specific implementation of obtaining pages in the embodiment;

Fig. 4 is a flowchart of a specific implementation of identifying news comment pages among the pages in the embodiment;

Fig. 5 is a flowchart of a specific implementation of obtaining the page-turning links of news comment pages, and obtaining other news comment pages according to those links, in the embodiment.

Detailed description of the embodiment

The invention is described below with reference to an embodiment and the accompanying drawings.

Fig. 1 shows the structure of a preferred embodiment of the news comment page crawling system of the invention. The system comprises a page acquisition device 11, a news comment page identification device 12 connected to the page acquisition device 11, and a page-turning link acquisition device 13 connected to the news comment page identification device 12.

The page acquisition device 11 obtains pages from a news website. The news comment page identification device 12 identifies news comment pages among the pages obtained by the page acquisition device 11. The page-turning link acquisition device 13 obtains the page-turning links of the news comment pages identified by the news comment page identification device 12, and obtains other news comment pages according to those links.

The page acquisition device 11 further comprises an HTML text acquisition component 101 and a URL acquisition component 102. The HTML text acquisition component 101 obtains the HTML text of a page. The URL acquisition component 102 extracts, from the HTML text of the current page, the URLs (Uniform Resource Locators, also called web addresses) of the other pages that the current page links to.
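The role of the URL acquisition component can be illustrated with a minimal sketch built on Python's standard `html.parser` module. The patent does not specify how links are parsed, so the class name `LinkExtractor`, the function `extract_links`, and the choice to resolve relative links with `urljoin` are all assumptions made for illustration. The sketch returns each link's URL together with its link text, because later steps use both.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects (absolute URL, link text) pairs from <a href=...> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the current page.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(base_url, html_text):
    """Hypothetical helper: return every (url, link_text) pair in html_text."""
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links
```

A caller would pass the URL of the fetched page as `base_url` so that site-relative links such as `/c/1.html` become absolute URLs before they are queued.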

The news comment page identification device 12 further comprises a recognition rule setting component 120 and a page feature value calculation component 121. The recognition rule setting component 120 sets the feature value calculation rules for news comment pages. The page feature value calculation component 121 calculates the feature value of a page according to the rules set by the recognition rule setting component 120.

The page-turning link acquisition device 13 further comprises a link acquisition component 130 and a page-turning link identification component 131. The link acquisition component 130 obtains the link information in a news comment page, including the HTML text of each link and the URL of each link. The page-turning link identification component 131 identifies the page-turning links among the link information obtained by the link acquisition component 130.

Fig. 2 shows the flow of the method for crawling news comment pages using the system shown in Fig. 1. The method comprises the following steps:

(A) The page acquisition device 11 obtains pages from the news website.

Starting from the start page of the news website, the pages are traversed breadth-first, and the information of every page that satisfies the preset depth limit is obtained during the traversal. The HTML text acquisition component 101 obtains the HTML text of each page, and the URL acquisition component 102 extracts from the HTML text of the current page the URLs of the other pages it links to.

(B) The news comment page identification device 12 identifies news comment pages among the pages obtained by the page acquisition device 11.

The page feature value calculation component 121 calculates the feature value of each page according to the rules set by the recognition rule setting component 120. The page feature value is the sum of the news-comment-page features that the page exhibits; it is obtained mainly as a weighted combination of the feature value of the page's URL and the feature value of the text of the links pointing to the page.

(C) The page-turning link acquisition device 13 obtains the page-turning links of the news comment pages and obtains other news comment pages according to those links.

The link acquisition component 130 obtains all link information in a news comment page. The page-turning link identification component 131 identifies the URL of the page-turning link, that is, the link whose text is "next page", and decides whether the page at that URL is a news comment page by checking whether that URL shares a common prefix with the URL of the news comment page.

Fig. 3 shows a specific implementation of how the page acquisition device 11 obtains pages in step (A). It comprises the following steps:

(A1) Specify the URL of the start page and the depth limit deep, and initialize the URL_Page queue and the URL_Unique queue, that is, empty both queues. For ease of description, the URL_Page queue is called the first queue and the URL_Unique queue the second queue. Both queues store page URLs; two queues are used in order to guarantee that the URLs stored in the second queue are unique. The value of deep can be set according to the depth of the news comment pages relative to the home page of the news website. In general, starting from the home page of a news website, the news comment pages can be reached when deep is 4 or 5.

(A2) Add the start page URL to the tail of the first queue and to the tail of the second queue.

(A3) Take the URL at the head of the first queue and determine the depth level of the corresponding page relative to the start page. If level > deep, output the second queue and go to step (B). Otherwise, obtain the HTML text of the page at the current URL, and extract from it the set S of link URLs together with the HTML text of the links that point to those pages.

(A4) Take a URL from S that has not yet been taken, and determine whether it already exists in the second queue. If it does, take the next URL that has not yet been taken and continue checking; once all URLs in S have been checked, go to step (A3). If it does not, add the URL to the tail of the first queue and go to step (A3).
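Steps (A1) to (A4) amount to a depth-limited breadth-first traversal, which can be sketched as follows. This is one reading of the steps, not the patent's implementation. Assumptions: the function name `crawl_breadth_first`, the caller-supplied `get_links` callback (which stands in for fetching a page and extracting its links), and the auxiliary `seen` set are all hypothetical; each newly discovered URL is appended to both queues, which the uniqueness check in (A4) presupposes but the step text does not spell out; and the sketch records only URLs whose depth does not exceed the limit, so it stops expanding pages at depth `max_depth` rather than testing `level > deep` on dequeue.

```python
from collections import deque


def crawl_breadth_first(start_url, max_depth, get_links):
    """Return the distinct URLs found within max_depth hops of start_url.

    get_links(url) is a hypothetical callback that fetches the page at url
    and returns the URLs it links to.
    """
    url_page = deque([(start_url, 0)])   # first queue: pages still to expand
    url_unique = [start_url]             # second queue: distinct URLs, in order
    seen = {start_url}                   # set mirror of url_unique for fast lookup

    while url_page:
        url, level = url_page.popleft()          # (A3) take the head of the queue
        if level >= max_depth:
            continue   # links from this page would exceed the depth limit
        for link in get_links(url):              # (A4) check each extracted URL
            if link not in seen:
                seen.add(link)
                url_unique.append(link)
                url_page.append((link, level + 1))
    return url_unique
```

Passing the link source as a callback keeps the traversal independent of any particular HTTP client; with the depth values suggested in the embodiment, a caller would use `max_depth=4` or `max_depth=5`.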

Fig. 4 shows a specific implementation of how the news comment page identification device 12 identifies news comment pages among the pages in step (B). It comprises the following steps:

(B1) Obtain the second queue and determine whether it is empty; if so, go to step (C);

(B2) Take the URL at the head of the second queue, fetch the corresponding page, and calculate the feature value T of the page according to the preset feature value calculation rules;

(B3) Determine whether the feature value T of the page is greater than the preset threshold Limit. If T > Limit, output the page as a news comment page and determine whether its URL already exists in the Comment_URL_Unique queue (hereinafter the third queue); if it does not, add it to the tail of the third queue, and go to step (B1). The third queue stores the URLs of news comment pages. In this embodiment Limit is 26; the value can be adjusted to suit the actual conditions of the news website.
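The loop formed by steps (B1) to (B3) can be sketched as below. The function name `identify_comment_pages` and the `feature_value` callback are hypothetical; the callback stands in for the feature value calculation of step (B2), and the sketch assumes the loop returns to (B1) whether or not a page passes the threshold. The default of 26 is the Limit value given in the embodiment.

```python
from collections import deque

LIMIT = 26  # threshold used in the embodiment; adjust per news website


def identify_comment_pages(candidate_urls, feature_value, limit=LIMIT):
    """Return the URLs whose feature value is strictly greater than limit.

    feature_value(url) is a hypothetical callback that returns the feature
    value T of the page at url.
    """
    pending = deque(candidate_urls)   # second queue
    comment_urls = []                 # third queue (Comment_URL_Unique)
    while pending:                    # (B1) stop when the queue is empty
        url = pending.popleft()       # (B2) take the head and score it
        t = feature_value(url)
        # (B3) keep the page only if T > Limit and the URL is not yet stored.
        if t > limit and url not in comment_urls:
            comment_urls.append(url)
    return comment_urls
```

Note that the comparison is strict: a page whose feature value equals the threshold is not treated as a news comment page.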

In step (B2), calculating the feature value T of a page according to the preset feature value calculation rules comprises the following steps:

(B2_1) Determine whether the page URL contains "comment" or "liuyan"; if so, the feature value Score_URL of the page URL is 1, otherwise it is 0;

(B2_2) Calculate the feature value Score_Keyword of the text of the links pointing to the page, according to the rules set in advance by the recognition rule setting component 120. In this embodiment, the rules set by the recognition rule setting component 120 are as follows:

Rule number	Rule content
1	The link text contains "comment": add 24.5 to the feature value
2	The link text contains "follow-up" or "message" or "comment": add 22.5 to the feature value
3	The link text contains "saying" and "sentence": add 4 to the feature value
4	The link text contains "saying" and "I": add 4 to the feature value
5	The link text contains "online friend": add 4 to the feature value
6	The link text contains "issue" or "checking" or "click": add 10 to the feature value
7	The link text contains "checking" and "click": add 110 to the feature value
8	The link text contains "having" or "all" or "owning" or "other": add 10 to the feature value

(B2_3) The feature value of the page is T = Score_URL × 8 + Score_Keyword.
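The calculation in steps (B2_1) to (B2_3) can be sketched as a small rule table plus a weighted sum. This is an illustration under stated assumptions: the names `RULES`, `score_url`, `score_keyword`, and `feature_value` are hypothetical, matching is by plain substring test, and the keyword strings are simply the English renderings that appear in the rule table above, used as stand-ins for the keywords a real deployment would match against link text. Because the English word "comment" appears in both rule 1 and rule 2, both rules fire on it in this sketch.

```python
# Each rule: (combiner, keywords, weight). `any` means "contains at least one
# keyword"; `all` means "contains every keyword". Keywords are placeholders
# taken from the English rule table, not the patent's original strings.
RULES = [
    (any, ("comment",), 24.5),
    (any, ("follow-up", "message", "comment"), 22.5),
    (all, ("saying", "sentence"), 4),
    (all, ("saying", "I"), 4),
    (any, ("online friend",), 4),
    (any, ("issue", "checking", "click"), 10),
    (all, ("checking", "click"), 110),
    (any, ("having", "all", "owning", "other"), 10),
]


def score_url(url):
    """(B2_1) 1 if the URL contains "comment" or "liuyan", else 0."""
    return 1 if ("comment" in url or "liuyan" in url) else 0


def score_keyword(link_text):
    """(B2_2) Sum the weights of every rule that the link text satisfies."""
    return sum(
        weight
        for combine, words, weight in RULES
        if combine(word in link_text for word in words)
    )


def feature_value(url, link_text):
    """(B2_3) T = Score_URL * 8 + Score_Keyword."""
    return score_url(url) * 8 + score_keyword(link_text)
```

As a worked example under these placeholder keywords, a page at `http://example.com/comment/1.html` reached through a link whose text is "message board" scores 1 × 8 + 22.5 = 30.5, which exceeds the embodiment's threshold of 26; the same link text pointing at a URL without "comment" or "liuyan" scores 22.5 and does not.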

Fig. 5 shows a specific implementation of how the page-turning link acquisition device 13 obtains the page-turning links of news comment pages, and obtains other news comment pages according to those links, in step (C). It comprises the following steps:

(C1) Obtain the third queue and determine whether it is empty; if so, end;

(C2) Take the URL at the head of the third queue and obtain the HTML text of the corresponding page;

(C3) Extract from the HTML text the link whose text is "next page", together with the URL of that link;

(C4) Determine whether the URL of that link shares a common prefix with the URL of the current page. If so, output the page at the link's URL as a news comment page, add the link's URL to the tail of the third queue, and go to step (C1); otherwise, go directly to step (C1).
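Steps (C1) to (C4) describe a queue-driven loop that keeps following "next page" links. The sketch below is a hedged illustration: `follow_pagination`, the `get_links` callback (returning `(url, link_text)` pairs for a page), the `same_site` predicate (standing in for the common-prefix test of step (C4)), and the constant `NEXT_PAGE_TEXT` are all hypothetical names. The `seen` set is an addition not stated in the steps; it prevents the loop from revisiting a page if pagination links form a cycle.

```python
from collections import deque

NEXT_PAGE_TEXT = "next page"  # stand-in for the site's page-turning label


def follow_pagination(comment_urls, get_links, same_site):
    """Return comment_urls extended with pages reached via page-turning links.

    get_links(url) is a hypothetical callback returning (link_url, link_text)
    pairs; same_site(a, b) is a hypothetical common-prefix predicate.
    """
    queue = deque(comment_urls)   # third queue
    found = list(comment_urls)
    seen = set(comment_urls)      # guard against pagination cycles (assumption)
    while queue:                                      # (C1)
        url = queue.popleft()                         # (C2)
        for link_url, link_text in get_links(url):
            if link_text.strip() != NEXT_PAGE_TEXT:   # (C3)
                continue
            if same_site(link_url, url) and link_url not in seen:   # (C4)
                seen.add(link_url)
                found.append(link_url)
                queue.append(link_url)
    return found
```

Because each accepted URL is appended to the same queue, the second, third, and later pages of a comment thread are discovered one hop at a time until no further page-turning link passes the prefix test.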

In step (C4), determining whether the URL of the link shares a common prefix with the URL of the current page comprises the following steps:

(C4_1) Set the string sequences S1 and S2 to empty;

(C4_2) Split the URL of the link using "/" as the separator and store the resulting parts in S1 in order; split the URL of the current page using "/" as the separator and store the resulting parts in S2 in order;

(C4_3) Determine whether the first elements of S1 and S2 are identical; if so, the two URLs share a common prefix; otherwise they do not.

For example, consider two URLs with the following structure:

URL1: http://comment2.news.sohu.com/viewcomments.action?id=267280310

URL2: http://comment2.news.sohu.com/default/comments.shtml?t=267280310

Splitting URL1 and URL2 with "/" as the separator and storing the parts in S1 and S2 respectively gives:

S1 = {(http://comment2.news.sohu.com), (viewcomments.action?id=267280310)}

S2 = {(http://comment2.news.sohu.com), (default), (comments.shtml?t=267280310)}

Comparing the first element of S1 with the first element of S2 shows that both are http://comment2.news.sohu.com, so URL1 and URL2 share a common prefix.
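The comparison in this example can be reproduced with a short sketch. One detail needs an assumption: a literal split of a URL on "/" would yield "http:" as its first part, whereas the example treats the scheme and host together as the first element. The hypothetical helper `split_url` below therefore keeps the `scheme://host` portion intact and splits only the remainder; `has_common_prefix` is likewise an illustrative name.

```python
def split_url(url):
    """Split a URL on "/" while keeping "scheme://host" as the first element."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        # No scheme present: fall back to a plain split.
        return url.split("/")
    parts = rest.split("/")
    return [scheme + "://" + parts[0]] + parts[1:]


def has_common_prefix(link_url, page_url):
    """(C4_1)-(C4_3): the URLs share a prefix if their first elements match."""
    s1 = split_url(link_url)
    s2 = split_url(page_url)
    return s1[0] == s2[0]
```

With this reading, the test reduces to comparing scheme and host, so a page-turning link that leads to a different host is rejected even if the rest of its path looks similar.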

Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If such changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to cover them as well.

Claims

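1. A method for crawling news comment pages, comprising the following steps:

(A) obtaining pages from a news website;

(B) identifying news comment pages among the obtained pages;

(C) obtaining the page-turning links in the news comment pages, and obtaining other news comment pages according to the page-turning links.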

2. The method for crawling news comment pages according to claim 1, characterized in that obtaining pages from the news website in step (A) comprises the following steps:

(A1) specifying the URL of the start page and the depth limit deep, and initializing a first queue and a second queue, the URL being a Uniform Resource Locator;

(A2) adding the start page URL to the tail of the first queue and to the tail of the second queue;

(A3) taking the URL at the head of the first queue and determining the depth level of the corresponding page relative to the start page; if level > deep, outputting the second queue and going to step (B); otherwise, obtaining the HTML text of the page at the current URL and extracting from it the set S of link URLs together with the HTML text of the links that point to those pages;

(A4) taking a URL from S that has not yet been taken and determining whether it already exists in the second queue; if it does, taking the next URL that has not yet been taken and continuing to check, and going to step (A3) once all URLs in S have been checked; otherwise, adding the URL to the tail of the first queue and going to step (A3).

3. The method for crawling news comment pages according to claim 2, characterized in that identifying news comment pages in step (B) comprises the following steps:

(B1) obtaining the second queue and determining whether it is empty; if so, going to step (C);

(B2) taking the URL at the head of the second queue, fetching the corresponding page, and calculating the feature value T of the page according to preset feature value calculation rules;

(B3) determining whether the feature value T of the page is greater than a preset threshold Limit; if T > Limit, outputting the page as a news comment page and determining whether its URL already exists in a third queue; if it does not, adding it to the tail of the third queue, and going to step (B1).

4. The method for crawling news comment pages according to claim 3, characterized in that, in step (B2), calculating the feature value T of the page according to the preset feature value calculation rules comprises the following steps:

(B2_1) determining whether the page URL contains "comment" or "liuyan"; if so, the feature value Score_URL of the page URL is 1; otherwise, it is 0;

(B2_2) calculating the feature value Score_Keyword of the text of the links pointing to the page according to the following rules:

the link text contains "comment": add 24.5 to the feature value;

the link text contains "follow-up" or "message" or "comment": add 22.5 to the feature value;

the link text contains "saying" and "sentence": add 4 to the feature value;

the link text contains "saying" and "I": add 4 to the feature value;

the link text contains "online friend": add 4 to the feature value;

the link text contains "issue" or "checking" or "click": add 10 to the feature value;

the link text contains "checking" and "click": add 110 to the feature value;

the link text contains "having" or "all" or "owning" or "other": add 10 to the feature value;

(B2_3) the feature value of the page is T = Score_URL × 8 + Score_Keyword.

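5. The method for crawling news comment pages according to claim 3, characterized in that obtaining the page-turning links of the news comment pages, and obtaining other news comment pages according to the page-turning links, in step (C) comprises the following steps:

(C1) obtaining the third queue and determining whether it is empty; if so, ending;

(C2) taking the URL at the head of the third queue and obtaining the HTML text of the corresponding page;

(C3) extracting from the HTML text the link whose text is "next page", together with the URL of that link;

(C4) determining whether the URL of that link shares a common prefix with the URL of the current page; if so, outputting the page at the link's URL as a news comment page, adding the link's URL to the tail of the third queue, and going to step (C1); otherwise, going directly to step (C1).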

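6. The method for crawling news comment pages according to claim 5, characterized in that, in step (C4), determining whether the URL of the link shares a common prefix with the URL of the current page comprises the following steps:

(C4_1) setting the string sequences S1 and S2 to empty;

(C4_2) splitting the URL of the link using "/" as the separator and storing the resulting parts in S1 in order; splitting the URL of the current page using "/" as the separator and storing the resulting parts in S2 in order;

(C4_3) determining whether the first elements of S1 and S2 are identical; if so, the two URLs share a common prefix; otherwise they do not.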

7. A system for crawling news comment pages, comprising: a page acquisition device (11) for obtaining pages from a news website; a news comment page identification device (12) for identifying news comment pages among the pages obtained by the page acquisition device (11); and a page-turning link acquisition device (13) for obtaining the page-turning links of the news comment pages identified by the news comment page identification device (12) and obtaining other news comment pages according to the page-turning links.

8. The system for crawling news comment pages according to claim 7, characterized in that the page acquisition device (11) further comprises an HTML text acquisition component (101) for obtaining the HTML text of a page, and a URL acquisition component (102) for extracting from the HTML text of the page the URLs of the other pages that the page links to.

9. The system for crawling news comment pages according to claim 8, characterized in that the news comment page identification device (12) further comprises a recognition rule setting component (120) for setting the feature value calculation rules for news comment pages, and a page feature value calculation component (121) for calculating the page feature value according to the rules set by the recognition rule setting component (120).

10. The system for crawling news comment pages according to any one of claims 7 to 9, characterized in that the page-turning link acquisition device (13) further comprises a link acquisition component (130) for obtaining the link information in a news comment page, and a page-turning link identification component (131) for identifying the page-turning links among the link information obtained by the link acquisition component (130).