CN104182482A - Method for judging news list page and method for screening news list page - Google Patents


Info

Publication number
CN104182482A
Authority
CN
China
Prior art keywords
webpage
news
page
list page
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410382359.XA
Other languages
Chinese (zh)
Other versions
CN104182482B (en)
Inventor
刘晓娜
张凯
程学旗
刘悦
张瑾
余智华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN201410382359.XA
Publication of CN104182482A
Application granted
Publication of CN104182482B
Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F16/954 Navigation, e.g. using categorised browsing

Abstract

The invention provides a method for judging a news list page and a method for screening news list pages. The method for judging a news list page comprises the following steps: acquiring a webpage and judging whether it is a news webpage; if the webpage is not a news webpage, collecting the sub-pages in the webpage and repeating the judging process on them; if the webpage is a news webpage and is judged to be a news webpage in a channel, judging whether the parent webpage of the webpage is a news webpage; if the parent webpage is not a news webpage, recording the association information of the webpage and the parent webpage; and judging the news list page according to the association information. Once news list pages have been found with the method of the invention, an existing news collector can take a news list page directly as its start page for collecting news content, which improves the efficiency of news data collection.

Description

Method for judging a news list page and method for screening news list pages
Technical field
The present invention relates to the field of computer data processing, and in particular to a method for judging a news list page and a method for screening news list pages.
Background
The Internet is an important channel for news. Both the public and businesses rely on it to obtain the news they care about. Websites on the Internet are of many different types, however. A comprehensive online media site, for example, contains a large number of pages with other content in addition to its news pages, so users usually incur a high cost when searching for news.
Some news collection tools already exist. They automatically search for news pages within a website specified by the user, collect all of those pages, and then find the news content the user cares about. Because their collection target is vague, such tools usually have to examine a large number of non-news pages while collecting data. Misled by links on the site, they may even extend the collection range to websites the user did not specify, so data collection is very inefficient. If a news collection tool could narrow the range of data it collects, the efficiency of data collection would improve.
Summary of the invention
The object of the present invention is to provide a method for screening news list pages. A news collection tool can then take the news list pages as its collection targets, which improves its data collection efficiency.
The invention provides a method for judging a news list page, comprising:
Step (1): obtain a webpage and judge whether the webpage is a news webpage;
if the webpage is not a news webpage, collect the sub-pages in the webpage and re-execute step (1) on the sub-pages;
if the webpage is a news webpage, judge whether the parent webpage of the webpage is a news webpage, and if the parent webpage is not a news webpage, record the association information of the webpage and the parent webpage;
Step (2): judge the news list page according to the association information.
Wherein, in step (1), collecting the sub-pages in the webpage comprises:
recording the URL information of the sub-pages already collected;
collecting a sub-page if its URL information differs from the recorded URL information.
Wherein, in step (1), collecting the sub-pages in the webpage comprises:
obtaining the link information in the webpage, and collecting the sub-page corresponding to the link information if the domain name of that sub-page is a subdomain of the domain name of the webpage or is identical to the domain name of the webpage.
Wherein, in step (1), collecting the sub-pages in the webpage comprises:
collecting a sub-page if the depth value represented by its URL information is less than a predetermined depth threshold.
Wherein, in step (1), judging whether the webpage or the parent webpage is a news webpage comprises:
judging the webpage or the parent webpage to be a news webpage or a non-news webpage according to a first regular expression, a second regular expression, an anchor text length threshold, and whether the webpage content contains time information, wherein the first regular expression is a regular expression for the webpage URL and the second regular expression is a regular expression for the webpage content of the channel of interest.
Wherein, recording the association information of the webpage and the parent webpage in step (1) comprises:
writing to a file the depth information and URL information of the parent webpage and the URL information of the sub-pages linked from the parent webpage.
The invention also provides a method for screening news list pages, comprising:
Step (1): obtain a plurality of URLs and put them into a to-be-collected queue;
Step (2): take a URL out of the to-be-collected queue and use it as the start page for collecting the webpages it leads to;
Step (3): obtain a collected webpage and judge whether the webpage is a news webpage;
if the webpage is not a news webpage, add the URLs of the sub-pages in the webpage to the to-be-collected queue;
if the webpage is a news webpage, judge whether the parent webpage of the webpage is a news webpage, and if the parent webpage is not a news webpage, record the association information of the webpage and the parent webpage;
Step (4): judge whether the to-be-collected queue is empty; if the queue is empty and all collected webpages have been judged, perform step (5); otherwise, perform step (2);
Step (5): mine the recorded association information and screen out the news list pages.
Wherein, in step (5), mining the recorded association information comprises:
reading the recorded association information and putting it into a key-value structure, in which the key is the URL information of a parent webpage and the value is a statistics structure that includes the total number of news webpages linked from that parent webpage;
screening out the news list pages according to the key-value structure.
Further, the method also comprises:
Step (6): judge a news list page that meets either of the following conditions to be a high-priority news list page:
Condition 1: the number of linked news items published within the last N days is greater than a set threshold;
Condition 2: the proportion of linked news items published within the last N days among all linked news items is not less than a preset ratio.
All other news list pages are judged to be low-priority news list pages.
Further, the method also comprises:
Step (7): sort the news list pages of the same priority level:
first, by the number of linked news items published within the last N days;
second, by the depth information of the news list page;
third, by the total number of news webpages linked from the news list page.
The method for judging a news list page and the method for screening news list pages provided by the invention judge webpage content and record the relations between webpages, and thereby screen out the news list pages of a website. This gives news collection an accurate collection range and improves its efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of the method for screening news list pages according to one embodiment of the present invention;
Fig. 2 is a flowchart of the method for judging a news list page according to one embodiment of the present invention;
Fig. 3 is a flowchart of the news list page mining algorithm according to one embodiment of the present invention;
Fig. 4 is a schematic diagram of the channel list of a website;
Fig. 5 is a schematic diagram of the features of a news webpage under the society channel of a website.
Detailed description
The present invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here only explain the invention and do not limit it. Note also that, for convenience of description, the drawings show only the parts relevant to the invention rather than the full content.
Before the embodiments are set forth, the concepts involved are explained:
"News list page": a page whose main content consists of linked news headlines (possibly with a short summary below each) or news pictures.
"Channel": an item in the navigation bar of a news website. For example, each item in the navigation bar of the website shown in Fig. 4 is regarded as a channel.
Parent page (also called parent webpage) and sub-page: two webpages connected by a link. For example, if webpage A links out to webpage a, then of the two, A is called the parent webpage (or parent page) and a is called the sub-page.
According to one embodiment of the present invention, a method for screening news list pages is provided. As shown in Fig. 1, the method comprises the following steps:
Step 1: manually configure the information of the news website on which news list page discovery is to be carried out, including the entry address of the website, the regular expression for news URLs (the first regular expression), and the regular expression for judging the news channel of interest (the second regular expression).
Take a certain news website as an example. Its entry address is http://news.sina.com.cn/, and it contains news webpages such as:
http://news.sina.com.cn/s/2014-04-28/112330024988.shtml
http://news.sina.com.cn/s/2014-04-28/050630022608.shtml, and so on.
By observing and summarizing the news webpages of a single website, those skilled in the art will find that the URL of any news webpage on that website can be expressed with one unified regular expression. For the website of the two news webpages above, for example, the URLs of multiple news webpages show that news URLs on this site match the following regular expression: http://news\.sina\.com\.cn[/\w+/]+/\d{4}-\d{2}-\d{2}/\d+\.shtml. This regular expression can therefore serve as one condition for judging whether a webpage is a news webpage: whether the URL of the webpage matches the predefined regular expression for webpage URLs (for convenience of description, this is hereinafter called the first regular expression).
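The URL check described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the pattern is the example expression with its backslash escapes restored, which is an assumption about the original text, and the function name `matches_first_regex` is hypothetical.

```python
import re

# Example first regular expression for the sample site. The backslash
# escapes are an assumed reconstruction of the expression in the text.
NEWS_URL_RE = re.compile(
    r"http://news\.sina\.com\.cn[/\w+/]+/\d{4}-\d{2}-\d{2}/\d+\.shtml"
)


def matches_first_regex(url: str) -> bool:
    """Return True if the whole URL matches the configured news-URL pattern."""
    return NEWS_URL_RE.fullmatch(url) is not None
```

Using `fullmatch` rather than `search` is a design choice of this sketch: it rejects URLs that merely contain a news-like fragment somewhere in a longer address.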
The news website in the example above comprises multiple channels. Fig. 4 is a schematic diagram of the channel list of the website; as shown there, the site has channels such as "Society", "Sports", and "Economy".
Fig. 5 is a schematic diagram of a news webpage in the society channel of this website. As shown in Fig. 5,
a line of text appears above the news headline of the webpage:
"News Center > Society >".
The webpage source code corresponding to this text, on which the regular expression is based, is
<div class="blkBreadcrumbLink" data-sudaclick="blkChannel_path"><a href="http://news.sina.com.cn/">News Center</a>&nbsp;&gt;&nbsp;<a href="http://news.sina.com.cn/society/">Society</a>&nbsp;&gt;&nbsp;.
From this analysis, only a webpage whose source code matches a regular expression describing such specific webpage content (for convenience of description, hereinafter called the second regular expression) is a news webpage under the specific channel. The condition on news webpage content builds on the news URL judgment and further determines the channel to which a news webpage belongs. This judgment screens out webpages with specific content, so that subsequent operations are carried out only on those webpages and the scope of processing is clearly defined.
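A channel check of this kind might look like the sketch below. The pattern `CHANNEL_RE` is a hypothetical second regular expression written for the breadcrumb markup quoted above; the patent does not give the actual expression, and the function name `in_followed_channel` is likewise an assumption.

```python
import re

# Hypothetical second regular expression: a breadcrumb block whose second
# link points at the society channel of the example site.
CHANNEL_RE = re.compile(
    r'<div class="blkBreadcrumbLink"[^>]*>\s*'
    r'<a href="http://news\.sina\.com\.cn/">[^<]*</a>'
    r'(?:&nbsp;|\s)*&gt;(?:&nbsp;|\s)*'
    r'<a href="http://news\.sina\.com\.cn/society/">[^<]*</a>'
)


def in_followed_channel(html: str) -> bool:
    """Return True if the page source contains the channel breadcrumb."""
    return CHANNEL_RE.search(html) is not None
```

Matching on the link targets rather than on the visible link text keeps the sketch independent of how the channel name is worded or translated.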
The configuration above supplies the conditions used to judge webpages in the subsequent steps. It is stored in a database table designed in advance, and one table can hold multiple configurations. Because the news list pages of a large news website are updated dynamically, in one embodiment the program implementing this method reads these configurations periodically and carries out the news list page mining operation.
Step 2: initialize the spreading collection. According to one embodiment, this comprises:
Initializing the to-be-collected URL queue. In one embodiment, the to-be-collected URL queue uses a long-queue management mode. Specifically, each URL added to the queue is first placed in a buffer of 8046 bytes; when the buffer is full, its contents are written to disk. A macro controls whether URL enqueue and dequeue operations are also written to a log.
Initializing the task deduplication object. In one embodiment, the deduplication object is built with a Bloom filter, and each task has its own independent deduplication object. Whenever a URL is added to the to-be-collected queue, it is also added to the deduplication object of the task it belongs to. The URLs in question include the entry addresses configured manually by the user as well as the addresses of the sub-pages that the system collects in subsequent steps according to the decision conditions. If a URL is already present in the deduplication object, it has been recorded, which means it has already been collected or already placed in the to-be-collected queue, so it does not need to be queued again. This mechanism prevents URLs from being collected repeatedly, reduces the scale of the spreading collection, and improves its efficiency.
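The Bloom-filter-based duplicate check can be illustrated with the following sketch. The class, its sizing parameters (`num_bits`, `num_hashes`), the SHA-256-based hashing, and the helper `enqueue_if_new` are all assumptions chosen for illustration; the patent does not specify how its filter is built.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter; sizes and hashing scheme are illustrative only."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def enqueue_if_new(url: str, queue: list, seen: BloomFilter) -> bool:
    """Queue the URL only if the task's filter has not recorded it yet."""
    if url in seen:
        return False
    seen.add(url)
    queue.append(url)
    return True
```

A Bloom filter never misses a URL it has recorded, but it can occasionally report an unseen URL as already present, so a small fraction of pages may be skipped in exchange for a compact, fixed-size memory footprint.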
Step 3: read the task configuration, obtain the entry URLs of at most N tasks, and put them into the to-be-collected queue. The value of N depends on the hardware configuration of the machine running the spreading collection, for example the number and frequency of its CPUs and its memory size; N usually does not exceed 20.
Step 4: take at most M URLs out of the to-be-collected queue and put them into the concurrent collector. The value of M depends on the number of concurrent socket connections of the concurrent collector. To make full use of the concurrent collector while keeping the cost of failure recovery in check, M is set to twice the number of concurrent socket connections.
Step 5: poll the collection results returned by the concurrent collector and process each result. In this step the target webpages are judged, and the association relations between webpages that meet the set conditions are recorded, providing the basis for data mining in subsequent steps. The details of this step are described below with reference to Fig. 2.
Step 6: each time a collection result has been processed, judge whether the collection process should end. When the to-be-collected queue is empty and the concurrent collector holds no unprocessed URLs, collection ends and the method proceeds to step 7 for post-processing. If, during processing, the concurrent collector has collected new webpages according to the decision conditions in step 5, the method returns to step 5 and continues to process the collection results returned by the concurrent collector.
Step 7: after the spreading collection has finished, process its results and mine from them the news list pages of each task. This step is described in detail below with reference to Fig. 3.
The processing of step 5 above is described below with reference to Fig. 2. It comprises:
Step 51: after obtaining a collected webpage from the concurrent collector, first judge whether the webpage was collected successfully. If so, go to step 52; otherwise, go to step 53.
Step 52: if the webpage was collected successfully, add its URL to the deduplication object of the task it belongs to.
Step 53: if collection of the webpage failed, judge whether the number of failed attempts has exceeded the configured maximum number of failures (for example, 3). If not, go to step 54; otherwise, go to step 55.
Step 54: if collection failed because of a timeout, put the webpage URL back into the to-be-collected queue. Collection can be affected by network conditions, which may cause an access timeout and ultimately a failed collection; in this case the webpage is requested and collected again.
Step 55: write the URL whose collection failed to a failure file, together with the reason for the failure. If collection still fails after repeated attempts, the cause is probably not network conditions but some other specific reason, so the failure reason is recorded for the relevant personnel to analyze.
Step 56: judge whether the successfully collected webpage is a news webpage. In one embodiment, the decision conditions are as follows:
Condition 1: the URL information of the webpage. Judge whether the URL information of the webpage matches the first regular expression.
Condition 2: the anchor text of the webpage. Anchor text, also called an anchor text link, is a form of link. Like a hyperlink, it attaches a link to a keyword, pointing to another webpage; a link of this form is called anchor text. As described above, a news webpage is usually a sub-page of a website, so it usually has anchor text, and that anchor text is usually the news headline or the core content of the news. Whether a webpage is a news webpage can therefore be judged from the character length of its anchor text. For example, an anchor text length threshold is set. A headline or core content is normally not a single word or short phrase, so the threshold can be set to 5: if the anchor text of a webpage is longer than 5 characters, the webpage is a news webpage.
Condition 3: the time information in the webpage content. A news webpage usually carries the publication time of the news, either below the headline or below the body text, so the source code of a news webpage contains time information. Whether the source code of a webpage contains time information, that is, whether time information can be extracted from it, can therefore serve as a condition for judging whether the webpage is a news webpage: if the content of a webpage contains time information, the webpage is a news webpage.
A webpage is judged to be a news webpage when it satisfies all three conditions above at the same time; otherwise it is judged to be a non-news webpage. If the webpage is a news webpage, go to step 57; otherwise, go to step 511.
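The three conditions can be combined as in the sketch below. It assumes the example first regular expression with restored escapes, uses the threshold of 5 from the text, and introduces a hypothetical `TIME_RE` pattern for publication times, since the patent does not say how time information is detected.

```python
import re

# Condition 1: example first regular expression (escapes are assumed).
NEWS_URL_RE = re.compile(
    r"http://news\.sina\.com\.cn[/\w+/]+/\d{4}-\d{2}-\d{2}/\d+\.shtml"
)
# Condition 3: hypothetical pattern for a date followed by hh:mm.
TIME_RE = re.compile(r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}日?\s+\d{1,2}:\d{2}")
# Condition 2: anchor text length threshold taken from the text.
ANCHOR_LEN_THRESHOLD = 5


def is_news_page(url: str, anchor_text: str, html: str) -> bool:
    """A page counts as news only if all three conditions hold."""
    url_ok = NEWS_URL_RE.fullmatch(url) is not None
    anchor_ok = len(anchor_text.strip()) > ANCHOR_LEN_THRESHOLD
    time_ok = TIME_RE.search(html) is not None
    return url_ok and anchor_ok and time_ok
```

Here `anchor_text` is the text of the link through which the page was reached, which the caller is assumed to have kept alongside the URL when the link was extracted.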
Steps 57 to 510 are the operations for a news webpage.
Step 57: further judge whether the news webpage is a news webpage in the channel of interest. The decision condition is as follows:
Condition 4: the HTML source code of the webpage. Judge whether the source code matches the second regular expression of the task the webpage belongs to; if it does, the webpage is a news webpage in the channel of interest.
A webpage that satisfies the above condition is judged to be a news page in the channel; otherwise it is judged to be a news page outside the channel. If the webpage is a news page in the channel, go to step 58; otherwise, do nothing. Because condition 4 is a strong rule, in one embodiment, provided the second regular expression is configured correctly, condition 4 used together with conditions 1, 2, and 3 can determine that the webpage is a news page in the channel of interest.
Step 58: write the relevant information of the news page to the text file news.txt. The format of the written content is as follows:
collection completion time \t webpage URL information \t depth of the news page \t news publication time \t anchor text (i.e. the news headline).
Step 59: judge whether the parent page of the news webpage from step 58 is a news page. If not, go to step 510; if so, do nothing. If a webpage is a news webpage, its parent page is probably a news list page, so the content of the parent page is judged further. The news webpage is called a sub-page of its parent page.
Step 510: write the parent-child relation between the news page and its parent page to the text file link.txt. The format of the written content is as follows:
depth of the parent page \t news page URL information \t parent page URL information.
This ends the processing flow of this branch. The steps above are the processing steps for a news webpage: once a webpage has been confirmed to be a news webpage, the content of its parent page is judged. If the parent page is not a news webpage, it is probably a news list page, so the association information of the webpage and its parent page is recorded here in preparation for the further judgment that follows.
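The two record formats can be sketched as tab-separated lines. The function names and the dictionary keys are hypothetical, and the field order of the link.txt record is an assumption, since the translated text leaves it ambiguous.

```python
def format_news_record(collected_at, url, depth, published_at, anchor_text):
    """One news.txt line: five tab-separated fields."""
    return "\t".join(
        [collected_at, url, str(depth), published_at, anchor_text]) + "\n"


def format_link_record(parent_depth, news_url, parent_url):
    """One link.txt line. The field order used here is an assumption."""
    return "\t".join([str(parent_depth), news_url, parent_url]) + "\n"


def record_news_page(news_out, link_out, page, parent_is_news):
    """Steps 58 to 510 for one in-channel news page (page is a plain dict)."""
    news_out.write(format_news_record(
        page["collected_at"], page["url"], page["depth"],
        page["published_at"], page["anchor_text"]))
    # The parent-child relation is kept only when the parent is not news,
    # because such a parent is a candidate news list page.
    if not parent_is_news:
        link_out.write(format_link_record(
            page["parent_depth"], page["url"], page["parent_url"]))
```

`news_out` and `link_out` can be any objects with a `write` method, for example files opened in append mode on news.txt and link.txt.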
Steps 511 to 513 are the operations for a non-news webpage.
Step 511: judge whether the domain name of the webpage link is identical to the domain name of the entry address of the task it belongs to, or is a subdomain of it. If so, go to step 512; otherwise, do nothing.
Step 512: extract all the links on the webpage together with their anchor text. The purpose of this step is to collect the sub-pages of this page.
In one embodiment, the link and anchor text extraction algorithm is as follows: the href attribute of every <a> tag outside the commented-out parts of the HTML source is extracted as the link, and the text between the opening and closing <a> tags is taken as the anchor text.
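One way to realize this extraction is with the `html.parser` module of the Python standard library, as sketched below. The class and function names are hypothetical. The parser hands HTML comments to a separate callback, so links inside commented-out markup are skipped without extra work.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags outside comments."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.links = []      # list of (href, anchor_text) tuples
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str):
    """Return all (href, anchor_text) pairs found in the HTML source."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Text inside nested inline tags such as `<b>` is included in the anchor text, so a headline with emphasized words is kept whole.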
Step 513: process one by one all the links obtained in step 512, and add only the links that satisfy the conditions to the to-be-collected queue. This comprises the following steps:
Step 513 (1): judge whether the depth of the extracted link exceeds the maximum spreading depth; if it does not, go to step 513 (2). A large website has many levels of sub-pages, and the links on some webpages may form an endless loop. To avoid getting trapped in such a loop, in one embodiment the depth value represented by the URL information of the sub-page is judged: if it is less than the predetermined depth threshold, the subsequent steps are carried out; otherwise the link is not processed. The depth value represented by the URL information is the link depth of the sub-page relative to the start page. Limiting the collection depth controls the collection range and improves collection efficiency.
Step 513 (2): judge whether the domain name of the extracted link is identical to the domain name of the entry address of the task it belongs to, or is a subdomain of it; if so, go to step 513 (3). A webpage may contain links to other websites, and if the collection of sub-pages spread to those websites, the collection range would grow greatly. This condition is added to control the range of sub-page collection; it likewise bounds the collection range and improves collection efficiency.
Step 513 (3): judge whether the extracted link is already in the deduplication object of the task it belongs to; if it is not, go to step 513 (4). This condition prevents a URL from being added to the to-be-collected queue more than once and thus being collected repeatedly.
The three conditions above are screening conditions for the collection work. They have no data dependency on one another, so their order of execution can be changed, and it is also possible to apply only one or some of them.
Step 513 (4): add the extracted link to the to-be-collected queue, and at the same time add it to the deduplication object of the task it belongs to. This ends the processing flow for a non-news webpage: if a webpage is not a news webpage, its sub-pages are collected according to the conditions above, and each collected sub-page is then judged again as a new webpage.
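The three screening conditions and the final enqueue can be sketched as follows. The function names, the default `max_depth`, and the use of a plain `set` in place of the task's Bloom filter are assumptions made to keep the example short.

```python
from urllib.parse import urljoin, urlsplit


def same_site(url: str, entry_url: str) -> bool:
    """True if url is on the entry address's domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    entry_host = (urlsplit(entry_url).hostname or "").lower()
    return host == entry_host or host.endswith("." + entry_host)


def should_enqueue(link, depth, entry_url, seen, max_depth) -> bool:
    """Apply the depth, domain, and duplicate conditions in turn."""
    if depth >= max_depth:          # depth must be below the threshold
        return False
    if not same_site(link, entry_url):
        return False
    return link not in seen


def enqueue_links(page_url, page_depth, hrefs, entry_url, queue, seen,
                  max_depth=5):
    """Queue every extracted link that passes all three conditions."""
    child_depth = page_depth + 1
    for href in hrefs:
        link = urljoin(page_url, href)   # resolve relative links
        if should_enqueue(link, child_depth, entry_url, seen, max_depth):
            seen.add(link)               # record in the duplicate check too
            queue.append((link, child_depth))
```

Because the three checks are independent, their order only affects cost; this sketch puts the cheapest comparison first.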
With the news list page judging method provided by this embodiment, the steps above can be carried out for given target webpages, and whether a target webpage is a news list page can then be judged from the association records generated for it.
Further, according to one embodiment of the present invention, a news list page mining method is provided. The processing of step 7 above is described below with reference to Fig. 3. The method comprises the following steps:
Step 71: read from news.txt all the news URLs found by the spreading collection of each task and put them into a first key-value structure, in which the key is the unique ID number of the task and the value is a structure storing news information, including the URL, depth, publication time, and anchor text of the news.
Step 72: read from link.txt the parent-child relations of all the news URLs found by the spreading collection of each task and put them into a second key-value structure, in which the key is the parent URL and the value is a statistics structure that includes the total number of news pages linked from this parent URL and the number of linked news pages published within the last N days (that is, within a given number of days before the current date).
Step 73: process the structure object of each task in turn. Steps 71 to 73 process the contents of the text files news.txt and link.txt, find the relations between the data in the two, and screen out the news list pages of each task.
A webpage usually contains multiple sub-pages, so multiple sub-pages may be recorded as having a link relation with a given webpage. For example, if webpage A is not a news webpage while its sub-pages a and b are both news webpages, the link relation records should contain two entries: 1) A links to a; 2) A links to b. Given the decision and recording conditions above, the meaning of these records is that the non-news webpage A has two news sub-pages, a and b. A news list page generally links to a large number of news sub-pages, so whether webpage A is a news list page can be judged from the number of entries in the link relation records. In one embodiment, a link relation record threshold is set (for example, 10); if more than 10 sub-pages have a link relation with a first webpage A, the first webpage A is judged to be a news list page.
The two files and the relation between them in effect give the number of news sub-pages linked from each non-news page. Therefore, when the total number of news pages linked from a parent URL is greater than a set threshold (for example, 10), the parent page can be judged to be a news list page.
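The counting step can be sketched as below. The function name is hypothetical, the threshold of 10 is the example value from the text, and the three-field layout of each link.txt line (parent depth, news URL, parent URL) is an assumption about the record format.

```python
from collections import Counter


def find_list_pages(link_lines, min_news=10):
    """Count linked news pages per parent URL and keep those above the threshold.

    Each line is assumed to hold three tab-separated fields:
    parent depth, news page URL, parent page URL.
    """
    counts = Counter()
    for line in link_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed records
        _parent_depth, _news_url, parent_url = fields
        counts[parent_url] += 1
    return {url: n for url, n in counts.items() if n > min_news}
```

The result maps each candidate news list page to its number of linked news pages, which is the statistic the later priority and sorting steps build on.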
Further, in one embodiment, a parent URL is judged to be a high-priority news list page when the number of linked news items published within the last N days (suggested value of N: 1) is greater than a set threshold (for example, 5), or when the proportion of news published within the last N days among all linked news is not less than a preset ratio (for example, 0.5). All other list pages are judged to be low-priority news list pages. This gives a news collector a reference for its collection frequency: a high-priority news list page indicates that the news under the page is updated quickly, while the news under a low-priority news list page is updated more slowly. The user can configure the news collector accordingly, for example collecting news from high-priority news list pages every day and from low-priority news list pages at longer intervals.
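The priority rule can be sketched as follows, using the example values from the text (N = 1, threshold 5, ratio 0.5). The function name is hypothetical, and treating "within N days" as "published on or after today minus N days" is an assumption.

```python
from datetime import date, timedelta


def classify_priority(publish_dates, today, n_days=1, min_recent=5,
                      min_ratio=0.5):
    """Return "high" or "low" for one news list page.

    publish_dates holds the publication date of every linked news page.
    """
    cutoff = today - timedelta(days=n_days)
    recent = sum(1 for d in publish_dates if d >= cutoff)
    total = len(publish_dates)
    ratio = recent / total if total else 0.0
    # Either condition alone is enough for high priority.
    if recent > min_recent or ratio >= min_ratio:
        return "high"
    return "low"
```

Passing `today` explicitly instead of calling `date.today()` inside the function keeps the classification reproducible when the recorded data is mined later than it was collected.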
Further, in one embodiment, if the priority of news list page A ranks after that of news list page B, their ordering is written as A < B; the relations A > B and A = B are defined likewise. The present embodiment also makes the following further judgment for news list pages of identical priority (the A = B case):
Here, layer depth refers to the depth of a webpage within a website, and the total linked-out news count refers to the number of news sub-pages of a news list page. The algorithm makes a further judgment for news list pages of identical priority in order to rank them. In effect, it first considers the number of news items, linked out from list pages of the same level, whose publication time is within N days. If that still cannot separate the pages, it considers the depth of the news list page. If the pages are still not separated, it considers the total number of news webpages linked out from the news list page. If the pages still cannot be separated, their priorities are finally deemed equal.
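The three-stage tie-break can be expressed as a composite sort key. The sketch below is illustrative only: the text gives the order in which the criteria are consulted but not the direction of each comparison, so ranking pages with more recent news first, shallower depth first, and more linked-out news first are all assumptions of this example, as are the names `ListPageStats` and `order_same_priority`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListPageStats:
    url: str
    recent_news: int  # linked-out news published within N days
    depth: int        # layer depth of the list page within the site
    total_news: int   # total number of linked-out news pages


def tie_break_key(page):
    # Criteria are compared in the order given in the text. The directions
    # (more recent news, shallower depth, more total news rank first) are
    # assumptions, not taken from the source.
    return (-page.recent_news, page.depth, -page.total_news)


def order_same_priority(pages):
    """Order list pages of one priority level; equal keys mean equal priority."""
    return sorted(pages, key=tie_break_key)


pages = [
    ListPageStats("a", recent_news=3, depth=2, total_news=10),
    ListPageStats("b", recent_news=5, depth=3, total_news=8),
    ListPageStats("c", recent_news=3, depth=1, total_news=10),
    ListPageStats("d", recent_news=3, depth=1, total_news=20),
]
print([p.url for p in order_same_priority(pages)])  # ['b', 'd', 'c', 'a']
```

Two pages whose keys are identical on all three criteria compare as equal, which corresponds to the final case in which their priorities are deemed equal.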
The above embodiments provide a method for finding news list pages. Once news list pages have been found in this way, an existing news collector can use a news list page directly as a start page and collect the links it contains, thereby gathering the actual news content. Because the sub-pages of a news list page are all news pages, the collector's target is clearly defined, the scope of collection is effectively narrowed, and the efficiency of news collection is significantly improved.
Finally, the above embodiments serve only to illustrate the present invention and should not be construed as limiting its scope of protection in any way. Moreover, it will be apparent to those skilled in the art that equivalent variations and modifications of the above embodiments, as well as improvements not described herein, that do not depart from the spirit and principle of the above embodiments all fall within the scope of protection of this patent.

Claims (10)

1. A method for judging a news list page, characterized by comprising:
Step (1): obtaining a webpage and judging whether the webpage is a news webpage;
if the webpage is not a news webpage, collecting the sub-pages in the webpage and re-executing step (1) on each sub-page;
if the webpage is a news webpage, judging whether the parent webpage of the webpage is a news webpage, and, if the parent webpage is not a news webpage, recording association information between the webpage and the parent webpage;
Step (2): judging news list pages according to the association information.
2. The method for judging a news list page according to claim 1, characterized in that, in step (1), collecting the sub-pages in the webpage comprises:
recording the URL information of the sub-pages already collected;
if the URL information of a sub-page differs from the recorded URL information, collecting that sub-page.
3. The method for judging a news list page according to claim 1, characterized in that, in step (1), collecting the sub-pages in the webpage comprises:
obtaining the link information in the webpage, and collecting a sub-page if the domain name of the sub-page corresponding to the link information is a subdomain of the webpage's domain name, or is identical to the webpage's domain name.
4. The method for judging a news list page according to claim 1, characterized in that, in step (1), collecting the sub-pages in the webpage comprises:
collecting a sub-page if the depth value represented by the URL information of the sub-page is less than a predetermined depth threshold.
5. The method for judging a news list page according to claim 1, characterized in that, in step (1), judging whether the webpage or the parent webpage is a news webpage comprises:
judging the webpage or the parent webpage to be a news webpage or a non-news webpage according to a first regular expression, a second regular expression, an anchor-text length threshold, and whether the webpage content contains time information, wherein the first regular expression is a regular expression for webpage URLs and the second regular expression is a regular expression for webpage content in the channels of interest.
6. The method for judging a news list page according to claim 1, characterized in that recording the association information between the webpage and the parent webpage in step (1) comprises:
writing to a file the depth information and URL information of the parent webpage and the URL information of the sub-pages linked out from the parent webpage.
7. A method for screening news list pages, characterized by comprising:
Step (1): obtaining a plurality of URLs and placing them in a queue to be collected;
Step (2): taking a URL out of the queue to be collected and, using it as a start page, collecting the webpage it identifies;
Step (3): obtaining the collected webpage and judging whether the webpage is a news webpage;
if the webpage is not a news webpage, adding the URLs of the sub-pages in the webpage to the queue to be collected;
if the webpage is a news webpage, judging whether the parent webpage of the webpage is a news webpage, and, if the parent webpage is not a news webpage, recording association information between the webpage and the parent webpage;
Step (4): judging whether the queue to be collected is empty; if the queue is empty and all collected webpages have been judged, executing step (5); otherwise executing step (2);
Step (5): mining the recorded association information and screening out news list pages.
8. The method for screening news list pages according to claim 7, characterized in that, in step (5), mining the recorded association information comprises:
reading the recorded association information and placing it in a key-value structure, in which the key is the URL information of a parent webpage and the value is a statistics structure that includes the total number of news webpages linked out from each parent webpage;
screening out news list pages according to the key-value structure.
9. The method for screening news list pages according to claim 7, characterized in that the method further comprises:
Step (6): judging a news list page that meets either of the following conditions to be a high-priority news list page:
Condition 1: the number of linked-out news items whose publication time is within N days is greater than a set threshold;
Condition 2: the proportion of linked-out news within N days among all linked-out news is not less than a preset ratio;
and judging all other news list pages to be low-priority news list pages.
10. The method for screening news list pages according to claim 9, characterized in that the method further comprises:
Step (7): sorting the news list pages of the same level:
first, according to the number of news items, linked out from the news list pages of the same level, whose publication time is within N days;
second, according to the depth information of the news list pages of the same level;
third, according to the total number of news webpages linked out from the news list pages of the same level.
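The collection loop of claim 7, together with the domain restriction of claim 3 and the depth restriction of claim 4, can be sketched as follows. This is a hedged illustration under stated assumptions rather than the claimed implementation: the names `same_site` and `collect_link_records`, the callbacks `get_links` and `is_news`, the in-memory `site` dictionary that stands in for real fetching, and the `.html` suffix test that stands in for the news judgment of claim 5 are all hypothetical.

```python
from collections import deque
from urllib.parse import urlparse


def same_site(parent_url, child_url):
    """True if the child's host equals the parent's host or is a subdomain of it."""
    parent_host = urlparse(parent_url).hostname
    child_host = urlparse(child_url).hostname
    if not parent_host or not child_host:
        return False
    return child_host == parent_host or child_host.endswith("." + parent_host)


def collect_link_records(seed_urls, get_links, is_news, max_depth=3):
    """Crawl from the seeds and return (non-news parent, news child) URL pairs."""
    queue = deque((url, None, 0) for url in seed_urls)  # (url, parent, depth)
    status = {}      # url -> True if the page was judged to be a news page
    records = set()  # association information to be mined afterwards
    while queue:
        url, parent, depth = queue.popleft()
        first_visit = url not in status
        if first_visit:
            status[url] = is_news(url)  # each page is judged only once
        if status[url]:
            # News page: record the link if its parent is not a news page.
            if parent is not None and not status[parent]:
                records.add((parent, url))
        elif first_visit:
            # Non-news page: queue its sub-pages, subject to depth and domain.
            for child in get_links(url):
                if depth + 1 < max_depth and same_site(url, child):
                    queue.append((child, url, depth + 1))
    return records


# Hypothetical in-memory site used instead of real HTTP requests.
site = {
    "http://news.example.com/": [
        "http://news.example.com/world/",
        "http://other.example.org/x",
    ],
    "http://news.example.com/world/": [
        "http://news.example.com/world/1.html",
        "http://news.example.com/world/2.html",
    ],
}
records = collect_link_records(
    ["http://news.example.com/"],
    get_links=lambda url: site.get(url, []),
    is_news=lambda url: url.endswith(".html"),  # stand-in for the real judgment
)
print(sorted(child for _, child in records))
```

In this sketch a page is judged and expanded only on its first visit, but every queued (parent, child) pair is still checked, so a news page reachable from several non-news parents is recorded once per parent. The resulting records can then be grouped by parent URL, in the manner of the key-value structure of claim 8.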
CN201410382359.XA 2014-08-06 2014-08-06 A kind of news list page determination methods and the method for screening news list page Active CN104182482B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410382359.XA CN104182482B (en) 2014-08-06 2014-08-06 A kind of news list page determination methods and the method for screening news list page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410382359.XA CN104182482B (en) 2014-08-06 2014-08-06 A kind of news list page determination methods and the method for screening news list page

Publications (2)

Publication Number Publication Date
CN104182482A true CN104182482A (en) 2014-12-03
CN104182482B CN104182482B (en) 2018-05-22

Family

ID=51963522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410382359.XA Active CN104182482B (en) 2014-08-06 2014-08-06 A kind of news list page determination methods and the method for screening news list page

Country Status (1)

Country Link
CN (1) CN104182482B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant