CN104182482A - Method for judging news list page and method for screening news list page

Info

Publication number: CN104182482A
Application number: CN201410382359.XA
Authority: CN
Inventors: 刘晓娜; 张凯; 程学旗; 刘悦; 张瑾; 余智华
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2014-08-06
Filing date: 2014-08-06
Publication date: 2014-12-03
Anticipated expiration: 2034-08-06
Also published as: CN104182482B

Abstract

The invention provides a method for judging a news list page and a method for screening news list pages. The method for judging a news list page comprises the following steps: acquiring a web page and judging whether it is a news web page; if the web page is not a news web page, collecting the child pages linked from it and repeating the judging process on those child pages; if the web page is a news web page and is judged to be a news web page within a channel, judging whether the parent page of the web page is a news web page; if the parent page is not a news web page, recording the association information between the web page and its parent page; and judging the news list page according to the association information, among other steps. Once news list pages have been found with this method, an existing news collector can take a news list page directly as its start page for collecting news content, which improves the efficiency of news data collection.

Description

Method for judging a news list page and method for screening news list pages

Technical field

The present invention relates to the field of computer data processing, and in particular to a method for judging a news list page and a method for screening news list pages.

Background art

The Internet is an important channel for news. Both the public and businesses rely on it to obtain the news they care about. However, websites on the Internet are many and varied. A comprehensive online media site, for example, contains a large number of pages with other content in addition to news pages, so users usually spend considerable effort when searching for news.

Some news collection tools already exist. They can automatically search for news pages within a website specified by the user, collect all of those pages, and then find the news content the user cares about. Because their collection target is vague, these tools usually end up judging a large number of non-news pages and, by following links, may even extend the collection range into websites the user did not specify, so data collection is very inefficient. If a news collection tool could narrow the range of data it collects, the efficiency of data collection would improve.

Summary of the invention

The object of the invention is to provide a news list page screening method that screens out news list pages, so that a news collection tool can take news list pages as its collection target and thereby collect data more efficiently.

The invention provides a method for judging a news list page, comprising:

Step (1): obtaining a web page and judging whether the web page is a news web page;

if the web page is not a news web page, collecting child pages from the web page and re-executing step (1) on each child page;

if the web page is a news web page, judging whether the parent page of the web page is a news web page, and, if the parent page is not a news web page, recording the association information between the web page and the parent page;

Step (2): judging the news list page according to the association information.

In step (1), collecting child pages from the web page comprises:

recording the URL information of the child pages already collected;

if the URL information of a child page differs from the recorded URL information, collecting the child page;

obtaining the link information in the web page, and, if the domain name of the child page corresponding to the link information is a subdomain of the web page's domain name or is identical to the web page's domain name, collecting the child page;

if the depth value represented by the URL information of the child page is less than a preset depth threshold, collecting the child page.

In step (1), judging whether the web page or the parent page is a news web page comprises:

judging the web page or the parent page to be a news web page or a non-news web page according to a first regular expression, a second regular expression, an anchor text length threshold, and whether the page content contains time information, wherein the first regular expression is a regular expression for the web page URL and the second regular expression is a regular expression for page content in the channel of interest.

In step (1), recording the association information between the web page and the parent page comprises:

writing to a file the depth information of the parent page, its URL information, and the URL information of the child pages that the parent page links to.

The invention also provides a method for screening news list pages, comprising:

Step (1): obtaining a plurality of URLs and putting them into a pending collection queue;

Step (2): taking a URL out of the pending collection queue and using it as a start page to collect the web page it points to;

Step (3): obtaining the collected web page and judging whether the web page is a news web page;

if the web page is not a news web page, adding the URLs of the child pages in the web page to the pending collection queue;

Step (4): judging whether the pending collection queue is empty; if the queue is empty and all collected web pages have been judged, executing step (5); otherwise executing step (2);

Step (5): mining the recorded association information to screen out news list pages.

In step (5), mining the recorded association information comprises:

reading the recorded association information and putting it into a key-value structure, in which the key is the URL information of a parent page and the value is a statistics structure that includes the total number of news web pages linked from that parent page;

screening out news list pages according to the key-value structure.

Further, the method also comprises:

Step (6): judging a news list page that meets either of the following conditions to be a high-priority news list page:

Condition 1: the number of linked news items published within the last N days is greater than a set threshold;

Condition 2: the proportion of linked news items published within the last N days among all linked news items is not less than a preset ratio;

all other news list pages are judged to be low-priority news list pages.

Further, the method also comprises:

Step (7): sorting news list pages of the same priority level:

first by the number of linked news items published within the last N days;

then by the depth information of the news list page;

and then by the total number of news web pages linked from the news list page.

The news list page judging method and the news list page screening method provided by the invention screen out the news list pages of a website by judging page content and recording the relationships between pages. This gives news collection an accurate collection range and improves its efficiency.

Brief description of the drawings

Fig. 1 is a flowchart of the method for screening news list pages according to one embodiment of the invention;

Fig. 2 is a flowchart of the news list page judging method according to one embodiment of the invention;

Fig. 3 is a flowchart of the news list page mining algorithm according to one embodiment of the invention;

Fig. 4 is a schematic diagram of the channel list of a certain website;

Fig. 5 is a schematic diagram of the features of a news web page under the Society channel of a certain website.

Detailed description of the embodiments

The invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention rather than the full content.

Before the embodiments are set out, the related concepts are explained:

"News list page" refers to a page whose main content consists of linked news headlines (which may have a brief summary below them) or news pictures.

"Channel" refers to each item in the navigation bar of a news website. For example, each item in the navigation bar of the website shown in Fig. 4 is regarded as a channel.

Parent page and child page: two web pages with a link relationship. For example, if web page A links to web page a, then A is called the parent page and a is called the child page.

According to one embodiment of the invention, a method for screening news list pages is provided. As shown in Fig. 1, the method comprises the following steps:

Step 1, manually configure the information for the news website on which news list page discovery is to be performed, including the entry address of the news website, the regular expression for news URLs (the first regular expression), and the regular expression for judging the channel of interest (the second regular expression).

Take a certain news website as an example. The entry address of this website is http://news.sina.com.cn/, and the website contains news web pages such as:

http://news.sina.com.cn/s/2014-04-28/112330024988.shtml,

http://news.sina.com.cn/s/2014-04-28/050630022608.shtml, and so on.

By observing and summarizing the news web pages of a single website, those skilled in the art can find that the URL information of any news web page in that website can be expressed by one unified regular expression. For example, from the URL information of multiple news web pages such as the two above, it can be seen that the URLs of news web pages in this website match the following regular expression: http://news\.sina\.com\.cn[/\w+/]+/\d{4}-\d{2}-\d{2}/\d+\.shtml. This regular expression can therefore serve as one condition for judging whether a web page is a news web page: a page is judged by whether its URL matches the predefined regular expression for page URLs (for convenience, hereinafter called the first regular expression).
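As an illustration of the URL condition, the following minimal Python sketch checks URLs against the first regular expression. The backslash escapes in the pattern are a reading of the expression given in the text, and the function name matches_first_regex is hypothetical rather than part of the patent.

```python
import re

# First regular expression for the example site, with backslash escapes
# restored (an assumption about the expression printed in the text).
NEWS_URL_RE = re.compile(
    r"http://news\.sina\.com\.cn[/\w+/]+/\d{4}-\d{2}-\d{2}/\d+\.shtml"
)


def matches_first_regex(url: str) -> bool:
    """Return True if the whole URL matches the configured news-URL pattern."""
    return NEWS_URL_RE.fullmatch(url) is not None
```

Both example URLs above match this pattern, while a channel address such as http://news.sina.com.cn/society/ does not, because it lacks the date segment and the .shtml suffix.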

The news website in the above example contains multiple channels. Fig. 4 is a schematic diagram of the channel list of a certain website; as shown in Fig. 4, the website has channels such as "Society", "Sports", and "Economy".

Fig. 5 is a schematic diagram of a news web page in the Society channel of this website. As shown in Fig. 5,

a line of text appears above the headline of this page:

"News Center > Society Panorama >".

The regular expression for the page source code corresponding to this text is:

<div class="blkBreadcrumbLink" data-sudaclick="blkChannel_path"><a href="http://news.sina.com.cn/">News Center</a>&nbsp;&gt;&nbsp;<a href="http://news.sina.com.cn/society/">Society Panorama</a>&nbsp;&gt;&nbsp;

From this analysis it follows that only a web page whose source code matches a regular expression of this kind, one that expresses specific page content (for convenience, hereinafter called the second regular expression), is a news web page under the specific channel. The condition on news page content builds on the news URL judgment and further identifies the channel a news web page belongs to. This judgment screens out pages with specific content so that subsequent operations are performed only on those pages, which makes the processing scope explicit.

The above configuration provides the conditions used to judge web pages in subsequent steps. The configuration is stored in a database table designed in advance, and one table can hold multiple configurations. Moreover, because the news list pages of large news websites are updated dynamically, in one embodiment the program implementing this method reads these configurations periodically to carry out the news list page mining operation.

Step 2, initialize the crawl. According to one embodiment, this comprises:

Initializing the pending URL queue. In one embodiment, the pending URL queue uses a long-queue management scheme: each URL added to the queue is first placed in a buffer 8046 bytes in size, and when the buffer is full its contents are written to disk. A macro controls whether a log entry is also written when a URL enters or leaves the queue.

Initializing the task deduplication object. In one embodiment, the deduplication object is built with a Bloom filter, and each task has its own independent deduplication object. Whenever a URL is added to the pending queue, it is also added to the deduplication object of the task it belongs to. The URLs here include the entry addresses configured manually by the user as well as the addresses of child pages that the system collects in subsequent steps according to the decision conditions. If a URL is already present in the deduplication object, that is, the URL has already been recorded, it has either been collected or already been put into the pending queue, so it does not need to be queued again. This mechanism prevents URLs from being collected repeatedly, reduces the scale of the crawl, and improves its efficiency.
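The deduplication mechanism can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the class BloomFilter, the helper enqueue_if_new, the bit-array size, and the number of hash functions are all assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; sizes are illustrative, not taken from the text."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def enqueue_if_new(url: str, pending: list, seen: BloomFilter) -> bool:
    """Add url to the pending queue unless the task's filter has already seen it."""
    if url in seen:
        return False
    seen.add(url)
    pending.append(url)
    return True
```

A Bloom filter never reports a recorded URL as new, but it can occasionally report a new URL as already seen, so a small number of pages may be skipped in exchange for a compact memory footprint.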

Step 3, read the task configuration, obtain the entry URLs of at most N tasks, and put them into the pending queue. The value of N depends on the hardware configuration of the crawling machine, such as the number and frequency of CPUs and the memory size; N usually does not exceed 20.

Step 4, take at most M URLs out of the pending queue and put them into the concurrent collector. The value of M depends on the number of concurrent socket connections of the concurrent collector. To make full use of the concurrent collector while keeping the cost of failure recovery in check, M is set to twice the number of concurrent socket connections.

Step 5, poll the collection results returned by the concurrent collector and process each result. In this step, the target web page is judged, and the association relationships between web pages that meet the set conditions are recorded, providing the basis for data mining in later steps. The details of this step are described below with reference to Fig. 2.

Step 6, each time a collection result has been processed, judge whether the collection process should end. When the pending queue is empty and there are no unprocessed URLs in the concurrent collector, collection ends and the method proceeds to step 7 for post-processing. If, during processing, the concurrent collector has collected new web pages according to the decision conditions of step 5, the method returns to step 5 and continues processing the results returned by the concurrent collector.

Step 7, after the crawl is complete, process the results obtained by the crawl and mine the news list pages of each task from them. This step is described in detail below with reference to Fig. 3.

The processing of step 5 above is described below with reference to Fig. 2. It comprises:

Step 51, after obtaining a collected web page from the concurrent collector, first judge whether the page was collected successfully. If so, go to step 52; otherwise, go to step 53.

Step 52, if the page was collected successfully, add its URL to the deduplication object of the task it belongs to.

Step 53, if collection of the page failed, judge whether the number of failed attempts has exceeded the configured maximum number of failures (for example, 3). If not, go to step 54; otherwise, go to step 55.

Step 54, if collection failed because of a timeout, put the page URL back into the pending queue. Page collection can be affected by network conditions, which may cause the access to time out and the collection to fail; in this case the page is requested and collected again.

Step 55, write the URL that failed to be collected to a failure file, together with the reason for the failure. If collection still fails after repeated attempts, the cause is probably not network conditions but some other specific reason, so the failure reason is recorded for the relevant personnel to analyze.
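Steps 53 to 55 can be sketched as a small retry routine. The sketch below is illustrative only: the function handle_failure, the reason string "timeout", and the use of an in-memory list in place of the failure file are assumptions, and a non-timeout failure that has not yet exceeded the limit is logged directly, a case the text does not spell out.

```python
MAX_FAILURES = 3  # example value given in step 53


def handle_failure(url, reason, fail_counts, pending, failed_log):
    """Requeue a timed-out URL, or log it once it has failed too often."""
    fail_counts[url] = fail_counts.get(url, 0) + 1
    exceeded = fail_counts[url] > MAX_FAILURES
    if not exceeded and reason == "timeout":
        pending.append(url)                   # step 54: try again later
        return "requeued"
    failed_log.append(f"{url}\t{reason}")     # step 55: record URL and cause
    return "logged"
```

With the example limit of 3, a URL that keeps timing out is requeued three times and written to the failure log on the fourth failure.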

Step 56, judge whether the successfully collected web page is a news web page. In one embodiment, the decision conditions are as follows:

Condition 1, the URL information of the web page. Judge whether the URL information of the page matches the first regular expression.

Condition 2, the anchor text of the web page. Anchor text, also called an anchor text link, is a form of link. Like a hyperlink, it turns a keyword into a link that points to another page; a link of this form is called anchor text. As described above, a news web page is usually a child page of the website, so it usually has anchor text, and that anchor text is typically the headline or the core content of the news item. Whether a page is a news web page can therefore be judged from the character length of its anchor text. For example, an anchor text length threshold is set. A headline or core content is generally not a single word or phrase, so the threshold can be set to 5: if the anchor text length of a page is greater than 5, the page is treated as a news web page under this condition.

Condition 3, the time information in the page content. A news web page usually carries the publication time of the news item, which may appear below the headline or below the body text, so the source code of a news page contains time information. Whether time information can be extracted from a page's source code can therefore be used as a condition for judging whether the page is a news web page: if the content of a page contains time information, the page is treated as a news web page under this condition.

A web page that satisfies all three conditions above is judged to be a news web page; otherwise it is judged to be a non-news web page. If the page is a news web page, go to step 57; otherwise, go to step 511.
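The three conditions of step 56 can be combined as in the following sketch. It is a simplified illustration: the function is_news_page and the pattern TIME_RE are hypothetical, and the text does not specify how time information is recognized, so the time pattern here is only one plausible choice.

```python
import re

ANCHOR_LENGTH_THRESHOLD = 5  # example threshold from condition 2

# Hypothetical pattern for a publication time such as "2014-04-28 11:23"
# or "2014年04月28日11:23"; the text does not define the actual extraction rule.
TIME_RE = re.compile(r"\d{4}(?:-|年)\d{1,2}(?:-|月)\d{1,2}日?\s*\d{1,2}:\d{2}")


def is_news_page(url, anchor_text, html, url_re):
    """Return True only if all three conditions of step 56 hold."""
    cond1 = url_re.fullmatch(url) is not None                    # URL pattern
    cond2 = len(anchor_text.strip()) > ANCHOR_LENGTH_THRESHOLD   # anchor length
    cond3 = TIME_RE.search(html) is not None                     # time in source
    return cond1 and cond2 and cond3
```

Here url_re is the compiled first regular expression of the task, and anchor_text is the text of the link through which the page was reached.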

Steps 57 to 510 are the operations performed on a news web page.

Step 57, further judge whether the news web page is a news web page in the channel of interest. The decision condition is as follows:

Condition 4, the HTML source code of the web page. Judge whether the source code matches the second regular expression of the task the page belongs to; if it does, the page is a news web page in the channel of interest.

A web page that also satisfies the above condition is judged to be a news page in the channel; otherwise it is judged to be a news page outside the channel. If the page is a news page in the channel, go to step 58; otherwise, no processing is performed. Because condition 4 is a strong rule, in one embodiment, provided the second regular expression is configured correctly, using condition 4 together with conditions 1, 2, and 3 is enough to determine that the page is a news page in the channel of interest.

Step 58, write the information about the news page to the text file news.txt. The content is written in the following format:

collection completion time \t page URL information \t depth of the news page \t news publication time \t anchor text (that is, the headline).

Step 59, judge whether the parent page of the news web page from step 58 is a news page. If not, go to step 510; if so, no processing is performed. If a web page is a news web page, its parent page is quite likely a news list page, so the content of the parent page is judged further. The news web page is called a child page of its parent page.

Step 510, write the parent-child relationship between the news page and its parent page to the text file link.txt. The content is written in the following format:

depth of the parent page \t news page URL information \t parent page URL information.
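The two record formats can be illustrated with a short sketch that builds one tab-separated line for news.txt and one for link.txt. The function names and the timestamp format are assumptions, and replacing tabs and line breaks inside a field with spaces is an added safeguard not mentioned in the text.

```python
def _clean(field) -> str:
    # Keep each record on one line with a fixed number of tab-separated fields.
    return str(field).replace("\t", " ").replace("\n", " ").strip()


def format_news_record(collected_at, url, depth, published_at, anchor_text):
    """One line of news.txt: time, URL, depth, publication time, anchor text."""
    fields = [collected_at, url, depth, published_at, anchor_text]
    return "\t".join(_clean(f) for f in fields) + "\n"


def format_link_record(parent_depth, news_url, parent_url):
    """One line of link.txt: parent depth, news page URL, parent page URL."""
    return "\t".join(_clean(f) for f in (parent_depth, news_url, parent_url)) + "\n"
```

Each line can then be appended to the corresponding file, for example with open("link.txt", "a", encoding="utf-8").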

This completes the processing flow of this branch. The steps above are the processing steps for a news web page: once a page has been confirmed to be a news web page, the content of its parent page is judged. If the parent page is not a news web page, it is quite likely a news list page, so the association information between the page and its parent page is recorded here in preparation for further judgment later.

Steps 511 to 513 are the operations performed on a non-news web page.

Step 511, judge whether the domain name of the page's link is identical to the domain name of the entry address of the task it belongs to, or is a subdomain of it. If so, go to step 512; otherwise, no processing is performed.

Step 512, extract all links on the page together with their anchor text. The purpose of this step is to collect the child pages of this page.

In one embodiment, the algorithm for extracting links and anchor text is as follows: the href attribute of every <a> tag outside commented-out sections of the HTML source code is extracted as the link, and the text between the opening and closing <a> tags is taken as the anchor text.
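The extraction rule can be sketched with Python's standard html.parser module, which reports HTML comments separately, so <a> tags inside comments are not seen as tags. The class LinkExtractor and the function extract_links are hypothetical names, and the sketch does not resolve relative URLs.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, anchor_text) pairs from <a> tags outside comments."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Text inside nested tags such as <b> is included in the anchor text, and <a> tags without an href attribute are ignored.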

Step 513, process the links obtained in step 512 one by one and add only the links that satisfy the conditions to the pending queue. This comprises the following steps:

Step 513 (1), judge whether the depth of the extracted link exceeds the maximum crawl depth; if not, go to step 513 (2). Large websites have many levels of child pages, and the links on some pages may form an endless cycle. To avoid being trapped in such a loop, in one embodiment the depth value represented by the URL information of the child page is checked: if it is less than a preset depth threshold, the subsequent steps are carried out; otherwise the link is not processed. The depth value represented by the URL information is the link depth of the child page relative to the start page. Limiting the collection depth controls the collection range and improves collection efficiency.

Step 513 (2), judge whether the domain name of the extracted link is identical to the domain name of the entry address of the task it belongs to, or is a subdomain of it; if so, go to step 513 (3). A web page may contain links to other websites, and extending child page collection to those websites would greatly enlarge the collection range. This condition is added to control the range of child page collection; it likewise bounds the collection range and improves collection efficiency.

Step 513 (3), judge whether the extracted link is already in the deduplication object of the task it belongs to; if not, go to step 513 (4). This condition prevents a URL from being added to the pending queue more than once and thus being collected repeatedly.

The three conditions above are screening conditions for the collection work. They have no data dependency on one another, so their execution order can be changed, and it is also possible to apply at least one of them rather than all three.

Step 513 (4), add the extracted link to the pending queue and, at the same time, to the deduplication object of the task it belongs to. This completes the processing flow for a non-news web page: if a page is not a news web page, its child pages are collected according to the conditions above, and each collected child page is then judged again as a new page.
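The filtering in steps 513 (1) to 513 (4) can be sketched as below. The function names are hypothetical, a plain Python set stands in for the per-task deduplication object, and the depth test uses "less than the threshold", following the more specific wording in step 513 (1).

```python
from urllib.parse import urlsplit


def same_site(link: str, entry_url: str) -> bool:
    """True if link's host equals the entry host or is a subdomain of it."""
    host = (urlsplit(link).hostname or "").lower()
    entry_host = (urlsplit(entry_url).hostname or "").lower()
    if not host or not entry_host:
        return False
    return host == entry_host or host.endswith("." + entry_host)


def try_enqueue(link, depth, entry_url, seen, pending, max_depth):
    """Apply the three screening conditions, then queue and record the link."""
    if depth >= max_depth:                  # 513 (1): depth limit
        return False
    if not same_site(link, entry_url):      # 513 (2): stay within the site
        return False
    if link in seen:                        # 513 (3): already recorded
        return False
    seen.add(link)                          # 513 (4): record and enqueue
    pending.append(link)
    return True
```

Because the three checks are independent, they can be reordered freely, for instance to run the cheapest check first.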

With the news list page judging method provided by this embodiment, the above steps can be carried out for given target web pages, and whether a target page is a news list page can then be judged from the association records generated for it.

Further, according to one embodiment of the invention, a news list page mining method is provided. The processing of step 7 above is described below with reference to Fig. 3. The method comprises the following steps:

Step 71, read from news.txt all the news URLs found by the crawl for each task and put them into a first key-value structure, in which the key is the unique ID of the task and the value is a structure storing the news information, including the URL, depth, publication time, and anchor text of each news item.

Step 72, read from link.txt the parent-child relationships of all the news URLs found by the crawl for each task and put them into a second key-value structure, in which the key is the parent URL and the value is a statistics structure that includes the total number of news pages the parent URL links to and the number of linked news pages published within the last N days (that is, within N days of the current date).

Step 73, process the structure objects of each task in turn. Steps 71 to 73 process the contents of the text files news.txt and link.txt and find the connections between the two sets of data in order to screen out the news list pages of each task.

Because a web page usually contains multiple child pages, the records may show multiple child pages linked to a given page. For example, suppose web page A is not a news web page, while its child pages a and b are both news web pages. The link relationship records then contain two entries: 1) A links to a, and 2) A links to b. Given the decision and recording conditions above, the practical meaning of these records is that a non-news page A has two child news pages, a and b. A news list page generally links to a large number of child news pages, so whether page A is a news list page can be judged from the number of its entries in the link relationship records. In one embodiment, a link relationship record threshold is set (for example, 10); if the number of child pages linked from a first web page A is greater than 10, the first web page A is judged to be a news list page.

Because the two files and the relationship between them in fact reflect the number of news child pages linked from each non-news page, when the total number of news items linked from a parent URL is greater than a set threshold (for example, 10), the parent page can be judged to be a news list page.
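The mining step can be sketched as follows, assuming the tab-separated record layouts of news.txt and link.txt described in steps 58 and 510. The names ParentStats and mine_list_pages are hypothetical, the publication time is assumed to begin with a YYYY-MM-DD date, and "within N days" is read here as a difference of at most N days from the current date.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class ParentStats:
    total: int = 0    # all news pages linked from this parent page
    recent: int = 0   # linked news pages published within n_days


def mine_list_pages(news_lines, link_lines, today, n_days=1, min_total=10):
    """Return {parent_url: ParentStats} for parents judged to be list pages."""
    published = {}
    for line in news_lines:
        _collected, url, _depth, pub_time, _anchor = line.rstrip("\n").split("\t", 4)
        published[url] = date.fromisoformat(pub_time[:10])

    stats = defaultdict(ParentStats)
    for line in link_lines:
        _parent_depth, news_url, parent_url = line.rstrip("\n").split("\t")
        entry = stats[parent_url]
        entry.total += 1
        pub = published.get(news_url)
        if pub is not None and 0 <= (today - pub).days <= n_days:
            entry.recent += 1

    # A parent page is a news list page when it links to more than min_total news pages.
    return {url: s for url, s in stats.items() if s.total > min_total}
```

The returned statistics also carry the recent-news count, which the later prioritization and sorting steps can reuse.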

Further, in one embodiment, when the number of news items linked from a parent URL and published within the last N days (a suggested value of N is 1) is greater than a set threshold (for example, 5), or when the proportion of news items published within the last N days among all linked news items is not less than a preset ratio (for example, 0.5), the page is judged to be a high-priority news list page; other list pages are judged to be low-priority news list pages. This gives a news collector a reference for its collection frequency: a high-priority news list page indicates that the news under that page is updated quickly, while news under a low-priority news list page is updated more slowly. Users can configure the news collector differently according to this result, for example collecting from high-frequency news list pages every day and from low-frequency news list pages at longer intervals.
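The priority rule can be written as a small function. The sketch below uses the example values from the text (threshold 5, ratio 0.5); the function name priority and the handling of a page with no linked news are assumptions.

```python
def priority(recent: int, total: int, min_recent: int = 5, min_ratio: float = 0.5) -> str:
    """Classify a news list page as "high" or "low" priority.

    recent: number of linked news items published within the last N days.
    total:  total number of linked news items.
    """
    if total <= 0:
        return "low"  # assumption: a page with no linked news is low priority
    if recent > min_recent or recent / total >= min_ratio:
        return "high"
    return "low"
```

Either condition alone is sufficient, so a small list page whose links are mostly fresh is ranked high even when its absolute count of recent news is low.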

Further, in one embodiment, if news list page A has a lower priority than news list page B, their ordering is written A < B; the relations A > B and A = B are defined in the same way. For news list pages with the same priority (the A = B case), this embodiment makes the following further judgment:

Here, depth refers to the depth of a web page within a website, and the total number of linked news items refers to the number of child news pages of a news list page. The algorithm makes a further judgment on news list pages with the same priority in order to rank them more finely. In effect, it first considers the number of linked news items published within the last N days for news list pages of the same level. If that does not separate them, it considers the depth of the news list pages. If that still does not separate them, it considers the total number of news pages linked from each news list page. If the pages still cannot be separated, their priorities are finally deemed equal.
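The tie-breaking order can be expressed as a sort with a three-part key. The text gives the order of the criteria but not their direction, so the sketch below assumes that more recent news, a smaller depth, and a larger total each rank a page earlier; the function name and the dictionary field names are hypothetical.

```python
def sort_same_priority(pages):
    """Sort list pages of equal priority.

    Each page is a dict with "url", "recent", "depth", and "total" keys.
    Assumed directions: more recent news first, then shallower pages,
    then pages with more linked news in total.
    """
    return sorted(pages, key=lambda p: (-p["recent"], p["depth"], -p["total"]))
```

Pages whose three values are all equal keep their original relative order, since Python's sort is stable, which corresponds to treating their priorities as equal.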

The embodiments above provide a method for finding news list pages. Once news list pages have been found in this way, an existing news collector can take a news list page directly as its start page and collect the links on it, and thereby collect the actual news content. Because the child pages of a news list page are all news pages, the collector's target is very clear, the collection range is effectively narrowed, and the efficiency of news collection is significantly improved.

Finally, the embodiments above serve only to illustrate the invention and should not be construed as limiting its scope of protection in any way. Moreover, it will be apparent to those skilled in the art that equivalent changes and modifications to the embodiments above, as well as improvements not described in the text, fall within the scope of protection of this patent provided they do not depart from the spirit and principle of the embodiments.

Claims

1. A method for judging a news list page, characterized by comprising:

Step (1): obtaining a web page and judging whether the web page is a news web page;

Step (2): judging the news list page according to the association information.

2. The method for judging a news list page according to claim 1, characterized in that, in step (1), collecting child pages from the web page comprises:

recording the URL information of the child pages already collected;

3. The method for judging a news list page according to claim 1, characterized in that, in step (1), collecting child pages from the web page comprises:

4. The method for judging a news list page according to claim 1, characterized in that, in step (1), collecting child pages from the web page comprises:

5. The method for judging a news list page according to claim 1, characterized in that, in step (1), judging whether the web page or the parent page is a news web page comprises:

6. The method for judging a news list page according to claim 1, characterized in that, in step (1), recording the association information between the web page and the parent page comprises:

7. A method for screening news list pages, characterized by comprising:

Step (1): obtaining a plurality of URLs and putting them into a pending collection queue;

8. The method for screening news list pages according to claim 7, characterized in that, in step (5), mining the recorded association information comprises:

screening out news list pages according to the key-value structure.

9. The method for screening news list pages according to claim 7, characterized in that the method also comprises:

all other news list pages are judged to be low-priority news list pages.

10. The method for screening news list pages according to claim 9, characterized in that the method also comprises:

Step (7): sorting news list pages of the same priority level: