CN102087648A - Method and system for fetching news comment page - Google Patents

Method and system for fetching news comment page

Info

Publication number
CN102087648A
Authority
CN
China
Prior art keywords
page
url
link
news
news analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2009102420552A
Other languages
Chinese (zh)
Other versions
CN102087648B (en)
Inventor
严华梁
刘伟
杨建武
万小军
肖建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Founder Electronics Chief Information Technology Co ltd
New Founder Holdings Development Co ltd
Peking University
Beijing Founder Electronics Co Ltd
Original Assignee
BEIJING FOUNDER E-GOVERNMENT INFORMATION TECHNOLOGY Co Ltd
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING FOUNDER E-GOVERNMENT INFORMATION TECHNOLOGY Co Ltd, Peking University, Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd filed Critical BEIJING FOUNDER E-GOVERNMENT INFORMATION TECHNOLOGY Co Ltd
Priority to CN200910242055.2A priority Critical patent/CN102087648B/en
Publication of CN102087648A publication Critical patent/CN102087648A/en
Application granted granted Critical
Publication of CN102087648B publication Critical patent/CN102087648B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a method and a system for fetching a news comment page, and belongs to the technical field of information retrieval and data integration. The method comprises the following steps of: performing breadth traversal on the pages from an initial page of a news website, and acquiring page information meeting depth limitation in the traversal process; then calculating the characteristic values of the pages, and identifying the news comment page from the pages according to the size relationship between the characteristic values and a preset threshold value; and finally, acquiring a page turning link of the news comment page, and acquiring other news comment pages according to the page turning link. The method and the system can automatically fetch the news comment page from the web pages of the news website, the fetching speed is high, and the fetched news comment page is comprehensive.

Description

Method and system for crawling news comment pages
Technical field
The invention belongs to the technical field of information retrieval and data integration, and specifically relates to a method and a system for crawling news comment pages.
Background art
The Web has developed at an astonishing pace since its birth in the early 1990s. It is now the largest information repository in the world, covers every field of the real world, and has become a main channel through which people obtain information for work and daily life. Web information is published mainly in the form of web pages, and according to the latest estimate the number of pages on the Web exceeds 550 billion (1 billion equals 1,000,000,000). Manual browsing clearly cannot meet people's information needs. To let people access and use the massive information on the Web more effectively, researchers began studying Web information search and integration in the mid-1990s, and related applications such as vertical search engines and public opinion analysis have appeared in industry. A necessary step in implementing these applications is to first obtain the pages that contain the required information and then accurately extract that information from poorly structured web pages.
News comments on the Web are comments that ordinary readers post about a news event, a person, or a similar topic on news websites that allow users to post comments. They are currently a very important source of information on the Internet and account for a considerable share of Web information. Many important applications and research topics are based on news comment information, mainly in the following two areas:
1. Comment search engines: vertical search engines for comments obtain and integrate comments from the large number of websites that allow users to post comments, and can provide timely and comprehensive search of comments on a particular news event or person. To keep news comment information timely and comprehensive, a large number of comment pages must be processed promptly, and the prerequisite for processing them is to obtain the news comment pages first.
2. Public opinion analysis: this has been a hot research problem in natural language processing and information retrieval over the past decade. Its goal is to identify, from a continuous stream of records, topics unknown to the system and the reports related to those topics. One of its main information sources is the news comment information published on the Web.
As the applications above show, news comment information is a very important data source for them, and the prerequisite for obtaining it is to obtain the news comment pages. However, the Web contains a very large number of news websites, and each news website contains many kinds of web pages, which inevitably and seriously affects the efficiency of information processing and the quality of retrieval. Automatic identification of news comment pages is therefore one of the key technical problems that many important applications urgently need to solve, and it has great practical significance and broad application prospects. At present, the prior art offers no method or system for crawling news comment pages.
Summary of the invention
In view of the defects in the prior art, the object of the invention is to provide a method and a system for crawling news comment pages. The method and system can automatically and effectively crawl news comment pages from a news website.
To achieve this object, the invention adopts the following technical solution:
A method for crawling news comment pages comprises the following steps:
(A) obtaining pages of a news website;
(B) identifying news comment pages among the obtained pages;
(C) obtaining the page-turning links in the news comment pages, and obtaining other news comment pages according to the page-turning links.
A system for crawling news comment pages comprises a page acquisition device for obtaining pages from a news website; a news comment page recognition device for identifying news comment pages among the pages obtained by the page acquisition device; and a page-turning link acquisition device for obtaining the page-turning links of the news comment pages identified by the news comment page recognition device and obtaining other news comment pages according to the page-turning links.
The method and system of the invention can automatically crawl news comment pages from the web pages of a news website; crawling is fast, and the crawled news comment pages are comprehensive.
Description of the drawings
Fig. 1 is a structural block diagram of a preferred embodiment of the news comment page crawling system of the invention;
Fig. 2 is a flowchart of the method for crawling news comment pages with the system shown in Fig. 1;
Fig. 3 is a flowchart of a specific method for obtaining pages in the embodiment;
Fig. 4 is a flowchart of a specific method for identifying news comment pages among the pages in the embodiment;
Fig. 5 is a flowchart of a specific method for obtaining the page-turning links of news comment pages and obtaining other news comment pages according to the page-turning links in the embodiment.
Detailed description of the embodiments
The invention is described below with reference to an embodiment and the accompanying drawings.
Fig. 1 shows the structure of a preferred embodiment of the news comment page crawling system of the invention. The system comprises a page acquisition device 11, a news comment page recognition device 12 connected to the page acquisition device 11, and a page-turning link acquisition device 13 connected to the news comment page recognition device 12.
The page acquisition device 11 obtains pages from a news website. The news comment page recognition device 12 identifies news comment pages among the pages obtained by the page acquisition device 11. The page-turning link acquisition device 13 obtains the page-turning links of the news comment pages identified by the news comment page recognition device 12, and obtains other news comment pages according to the page-turning links.
The page acquisition device 11 further comprises an Html text acquisition component 101 and a URL acquisition component 102. The Html text acquisition component 101 obtains the Html text of a page. The URL acquisition component 102 extracts, from the Html text of the current page, the URLs (Uniform Resource Locators, also called web page addresses) of the other pages that the current page links to.
The news comment page recognition device 12 further comprises a recognition rule setting component 120 and a page feature value calculation component 121. The recognition rule setting component 120 sets the feature value calculation rules for news comment pages. The page feature value calculation component 121 calculates the feature value of a page according to the rules set by the recognition rule setting component 120.
The page-turning link acquisition device 13 further comprises a link acquisition component 130 and a page-turning link identification component 131. The link acquisition component 130 obtains the link information in a news comment page, including the Html text of each link and the URL of each link. The page-turning link identification component 131 identifies the page-turning links in the link information obtained by the link acquisition component 130.
Fig. 2 shows the flow of the method for crawling news comment pages with the system shown in Fig. 1. The method comprises the following steps:
(A) The page acquisition device 11 obtains pages of the news website.
Starting from the start page of the news website, the pages are traversed breadth-first. During the traversal, information is collected for all pages that satisfy the preset depth limit. The Html text acquisition component 101 obtains the Html text of each page, and the URL acquisition component 102 extracts from the Html text of the current page the URLs of the other pages that the current page links to.
(B) The news comment page recognition device 12 identifies news comment pages among the pages obtained by the page acquisition device 11.
The page feature value calculation component 121 calculates the feature value of each page according to the feature value calculation rules set by the recognition rule setting component 120. The feature value of a page is the sum of the news-comment-page features that the page exhibits; it is obtained mainly as a weighted combination of the feature value of the page's URL and the feature value of the text of the links that point to the page.
(C) The page-turning link acquisition device 13 obtains the page-turning links of the news comment pages, and obtains other news comment pages according to the page-turning links.
The link acquisition component 130 obtains all link information in a news comment page. The page-turning link identification component 131 takes the URL of the link whose link text is "next page" as the page-turning link URL, and determines whether the page at that URL is a news comment page by checking whether that URL and the URL of the news comment page share a common prefix.
Fig. 3 shows the flow of a specific method by which the page acquisition device 11 obtains pages in step (A). It comprises the following steps:
(A1) Specify the URL of the start page and the depth limit deep, and initialize the URL_Page queue and the URL_Unique queue, that is, empty both queues. For convenience, the URL_Page queue is called the first queue and the URL_Unique queue is called the second queue. Both queues store page URLs; two queues are used in order to guarantee that the URLs stored in the second queue are unique. The value of deep can be set according to the depth of the news comment pages relative to the home page of the news website. In general, starting from the home page of a news website, news comment pages can be reached when deep is 4 or 5.
(A2) Add the start page URL to the tail of the first queue and to the tail of the second queue.
(A3) Take the URL at the head of the first queue and determine the depth level of the corresponding page relative to the start page. If level > deep, output the second queue and go to step (B). Otherwise, obtain the Html text of the page at the current URL, and extract from it the set S of link URLs together with the Html text of the links that point to the corresponding pages.
(A4) Take from the set S a URL that has not yet been taken, and determine whether it already exists in the second queue. If it does, take the next untaken URL and continue checking; once all URLs in S have been checked, go to step (A3). If it does not, add the URL to the tail of the first queue and go to step (A3).
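Steps (A1) to (A4) can be sketched as a depth-limited breadth-first traversal. The sketch below is only an illustration under stated assumptions: `crawl_site` and the injected `fetch_links` callback (which would download a page and return the URLs it links to) are hypothetical names, and each new URL is recorded in the second queue as well as the first, which step (A4) does not state explicitly but which keeping the second queue unique seems to require.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_site(start_url: str, deep: int,
               fetch_links: Callable[[str], Iterable[str]]) -> list[str]:
    """Return the distinct URLs found by a breadth-first walk from start_url."""
    url_page = deque([(start_url, 0)])  # first queue: URLs to fetch, with depth
    url_unique = [start_url]            # second queue: distinct URLs, in order
    seen = {start_url}                  # set mirror of url_unique for fast lookup
    while url_page:
        url, level = url_page.popleft()           # (A3) head of the first queue
        if level > deep:
            break  # BFS order: every URL still queued is at least this deep
        for link in fetch_links(url):             # (A4) walk the link set S
            if link not in seen:
                seen.add(link)
                url_unique.append(link)           # assumption: keep queue 2 in sync
                url_page.append((link, level + 1))
    return url_unique


# Tiny in-memory "site" used instead of real HTTP requests.
site = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": ["e"]}
pages = crawl_site("a", 1, lambda u: site.get(u, []))
```

Because pages at depth `deep` are still fetched, the returned list also contains URLs one level below the limit; those pages themselves are never downloaded.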
Fig. 4 shows the flow of a specific method by which the news comment page recognition device 12 identifies news comment pages among the pages in step (B). It comprises the following steps:
(B1) Obtain the second queue and determine whether it is empty; if so, go to step (C);
(B2) Take the URL at the head of the second queue, fetch the page at this URL, and calculate the feature value T of the page according to the preset feature value calculation rules;
(B3) Determine whether the feature value T of the page is greater than the preset threshold Limit. If T > Limit, output the page as a news comment page and determine whether its URL already exists in the Comment_URL_Unique queue (hereinafter called the third queue); if it does not, add it to the tail of the third queue. Then go to step (B1). The third queue stores the URLs of news comment pages. In this embodiment Limit is 26; this value can be adjusted to suit the news website.
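The loop formed by steps (B1) to (B3) can be sketched as follows. This is a minimal illustration, not the patented implementation: `identify_comment_pages` is a hypothetical name, and the feature value calculation is passed in as a callback so that the sketch stays independent of any particular rule set.

```python
from typing import Callable, Iterable

LIMIT = 26  # threshold value given for the embodiment


def identify_comment_pages(urls: Iterable[str],
                           feature_value: Callable[[str], float],
                           limit: float = LIMIT) -> list[str]:
    """Keep the URLs whose feature value T is strictly greater than limit."""
    comment_urls: list[str] = []      # third queue (Comment_URL_Unique)
    for url in urls:                  # (B1)/(B2): walk the second queue
        t = feature_value(url)
        # (B3): T > Limit, and the URL is not yet in the third queue
        if t > limit and url not in comment_urls:
            comment_urls.append(url)
    return comment_urls


# Made-up feature values for four hypothetical pages.
scores = {"a": 30.5, "b": 26, "c": 8, "d": 32.5}
found = identify_comment_pages(["a", "b", "c", "d", "a"], scores.get)
```

Note that the comparison is strict, so a page whose feature value equals the threshold is not reported.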
In step (B2), the process of calculating the feature value T of a page according to the preset feature value calculation rules comprises the following steps:
(B2_1) Determine whether the page URL contains "comment" or "liuyan"; if so, the feature value Score_URL of the page URL is 1, otherwise it is 0;
(B2_2) Calculate the feature value Score_Keyword of the text of the link that points to the page, according to the rules set in advance by the recognition rule setting component 120. In this embodiment, the recognition rule setting component 120 sets the following rules:
Rule No.   Rule content
1          The link text contains "comment": add 24.5 to the feature value
2          The link text contains "follow-up" or "message" or "comment": add 22.5 to the feature value
3          The link text contains "saying" and "sentence": add 4 to the feature value
4          The link text contains "saying" and "I": add 4 to the feature value
5          The link text contains "online friend": add 4 to the feature value
6          The link text contains "issue" or "checking" or "click": add 10 to the feature value
7          The link text contains "checking" and "click": add 110 to the feature value
8          The link text contains "having" or "all" or "owning" or "other": add 10 to the feature value
(B2_3) The feature value of the page is T = Score_URL × 8 + Score_Keyword.
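The calculation in steps (B2_1) to (B2_3) can be sketched as below. Several points are assumptions made for illustration only: the function names are hypothetical, the keywords are the English renderings that appear in this text rather than the keywords of the original rule table, matching is plain substring matching, every matching rule is assumed to contribute to the sum, and the example.com URLs are invented.

```python
URL_KEYWORDS = ("comment", "liuyan")
URL_WEIGHT = 8  # multiplier applied to Score_URL in step (B2_3)

# Each rule is (combinator, keywords, points):
# `any` models "or" between keywords, `all` models "and".
RULES = [
    (any, ("comment",), 24.5),
    (any, ("follow-up", "message", "comment"), 22.5),
    (all, ("saying", "sentence"), 4),
    (all, ("saying", "I"), 4),
    (any, ("online friend",), 4),
    (any, ("issue", "checking", "click"), 10),
    (all, ("checking", "click"), 110),
    (any, ("having", "all", "owning", "other"), 10),
]


def score_url(url: str) -> int:
    """(B2_1): 1 if the URL contains one of the URL keywords, else 0."""
    return 1 if any(k in url for k in URL_KEYWORDS) else 0


def score_keyword(link_text: str) -> float:
    """(B2_2): sum the points of every rule that the link text satisfies."""
    return sum(points for combine, keywords, points in RULES
               if combine(k in link_text for k in keywords))


def page_feature_value(url: str, link_text: str) -> float:
    """(B2_3): T = Score_URL * 8 + Score_Keyword."""
    return score_url(url) * URL_WEIGHT + score_keyword(link_text)


t = page_feature_value("http://example.com/liuyan/1.html", "message")
```

With the threshold of 26 from the embodiment, a URL match alone (T = 8) is not enough, while a URL match combined with a link text that satisfies rule 2 gives T = 8 + 22.5 = 30.5 and passes.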
Fig. 5 shows the flow of a specific method by which the page-turning link acquisition device 13 obtains the page-turning links of news comment pages in step (C) and obtains other news comment pages according to the page-turning links. It comprises the following steps:
(C1) Obtain the third queue and determine whether it is empty; if so, finish;
(C2) Take the URL at the head of the third queue and obtain the Html text of the page at this URL;
(C3) From the Html text, extract the link whose link text is "next page" and the URL of that link;
(C4) Determine whether the URL of this link shares a common prefix with the URL of the current page. If so, output the page at the link's URL as a news comment page, add the link's URL to the tail of the third queue, and go to step (C1); otherwise, go directly to step (C1).
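Steps (C1) to (C4) amount to draining the third queue while following "next page" links. The sketch below rests on a few assumptions: `follow_page_turning_links`, the `fetch_links` callback (returning `(link_text, url)` pairs for a page) and the `has_common_prefix` callback are hypothetical names, the link text is compared with the English rendering "next page", and a `seen` set is added so that circular pagination cannot loop forever, which the steps themselves do not mention.

```python
from collections import deque
from typing import Callable, Iterable

NEXT_PAGE_TEXT = "next page"  # English rendering of the page-turning link text


def follow_page_turning_links(
        comment_urls: Iterable[str],
        fetch_links: Callable[[str], Iterable[tuple[str, str]]],
        has_common_prefix: Callable[[str, str], bool]) -> list[str]:
    """Return the initial comment page URLs plus those reached by pagination."""
    found = list(comment_urls)
    queue = deque(found)          # third queue
    seen = set(found)             # assumption: guard against pagination cycles
    while queue:                                    # (C1) stop when empty
        current = queue.popleft()                   # (C2) head of the queue
        for text, url in fetch_links(current):      # (C3) scan the links
            if text.strip() != NEXT_PAGE_TEXT:
                continue
            # (C4) accept the link only if it shares a prefix with this page
            if has_common_prefix(url, current) and url not in seen:
                seen.add(url)
                found.append(url)
                queue.append(url)
    return found


# In-memory stand-in for three paginated comment pages on a made-up host.
p1, p2, p3 = ("http://h.example/c?p=1", "http://h.example/c?p=2",
              "http://h.example/c?p=3")
links = {
    p1: [("next page", p2), ("home", "http://h.example/")],
    p2: [("next page", p3)],
    p3: [("next page", p1), ("next page", "http://other.example/x")],
}
same_host = lambda a, b: a.split("/")[2] == b.split("/")[2]
result = follow_page_turning_links([p1], lambda u: links.get(u, []), same_host)
```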
In step (C4), the process of determining whether the URL of the link shares a common prefix with the URL of the current page comprises the following steps:
(C4_1) Empty the string sequences S1 and S2;
(C4_2) Split the URL of the link using "/" as the separator, and store the resulting parts in S1 in order; split the URL of the current page using "/" as the separator, and store the resulting parts in S2 in order;
(C4_3) Determine whether the first elements of S1 and S2 are identical. If so, the two URLs share a common prefix; otherwise, they do not.
For example, consider two URLs with the following structure:
URL1: http://comment2.news.sohu.com/viewcomments.action?id=267280310;
URL2: http://comment2.news.sohu.com/default/comments.shtml?t=267280310.
After URL1 and URL2 are split using "/" as the separator, the parts are stored in S1 and S2 respectively:
S1={(http://comment2.news.sohu.com),(viewcomments.action?id=267280310)};
S2={(http://comment2.news.sohu.com),(default),(comments.shtml?t=267280310)}.
Comparing the first element of S1 with the first element of S2 shows that they are identical, both being http://comment2.news.sohu.com, so URL1 and URL2 share a common prefix.
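The comparison in steps (C4_1) to (C4_3) can be sketched as follows. One detail is an assumption: a literal split on "/" would break "http://" apart and yield "http:" as the first element, so to reproduce S1 and S2 as listed in the example the sketch keeps the scheme attached to the host. The names `split_url` and `has_common_prefix` are hypothetical.

```python
def split_url(url: str) -> list[str]:
    """Split a URL on "/" while keeping "scheme://host" as the first element."""
    scheme, sep, rest = url.partition("://")
    if not sep:                      # no scheme present: plain split
        return url.split("/")
    parts = rest.split("/")
    parts[0] = scheme + sep + parts[0]
    return parts


def has_common_prefix(link_url: str, page_url: str) -> bool:
    """(C4_3): the URLs share a prefix if their first elements are identical."""
    return split_url(link_url)[0] == split_url(page_url)[0]


URL1 = "http://comment2.news.sohu.com/viewcomments.action?id=267280310"
URL2 = "http://comment2.news.sohu.com/default/comments.shtml?t=267280310"
s1 = split_url(URL1)
s2 = split_url(URL2)
```

In effect the test compares scheme and host, so a "next page" link that leads to a different host is rejected.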
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims (10)

1. A method for crawling news comment pages, comprising the following steps:
(A) obtaining pages of a news website;
(B) identifying news comment pages among the obtained pages;
(C) obtaining the page-turning links in the news comment pages, and obtaining other news comment pages according to the page-turning links.
2. The method for crawling news comment pages according to claim 1, characterized in that the process of obtaining pages of the news website in step (A) comprises the following steps:
(A1) specifying the URL of the start page and the depth limit deep, and initializing a first queue and a second queue, the URL being a Uniform Resource Locator;
(A2) adding the start page URL to the tail of the first queue and to the tail of the second queue;
(A3) taking the URL at the head of the first queue and determining the depth level of the corresponding page relative to the start page; if level > deep, outputting the second queue and going to step (B); otherwise, obtaining the Html text of the page at the current URL, and extracting from it the set S of link URLs together with the Html text of the links that point to the corresponding pages;
(A4) taking from the set S a URL that has not yet been taken, and determining whether it already exists in the second queue; if it does, taking the next untaken URL and continuing to check, and going to step (A3) once all URLs in S have been checked; otherwise, adding the URL to the tail of the first queue and going to step (A3).
3. The method for crawling news comment pages according to claim 2, characterized in that the process of identifying news comment pages in step (B) comprises the following steps:
(B1) obtaining the second queue and determining whether it is empty; if so, going to step (C);
(B2) taking the URL at the head of the second queue, fetching the page at this URL, and calculating the feature value T of the page according to preset feature value calculation rules;
(B3) determining whether the feature value T of the page is greater than a preset threshold Limit; if T > Limit, outputting the page as a news comment page and determining whether its URL already exists in a third queue, adding it to the tail of the third queue if it does not, and going to step (B1).
4. The method for crawling news comment pages according to claim 3, characterized in that, in step (B2), the process of calculating the feature value T of the page according to the preset feature value calculation rules comprises the following steps:
(B2_1) determining whether the page URL contains "comment" or "liuyan"; if so, the feature value Score_URL of the page URL is 1; otherwise, it is 0;
(B2_2) calculating the feature value Score_Keyword of the text of the link that points to the page according to the following rules:
the link text contains "comment": add 24.5 to the feature value;
the link text contains "follow-up" or "message" or "comment": add 22.5 to the feature value;
the link text contains "saying" and "sentence": add 4 to the feature value;
the link text contains "saying" and "I": add 4 to the feature value;
the link text contains "online friend": add 4 to the feature value;
the link text contains "issue" or "checking" or "click": add 10 to the feature value;
the link text contains "checking" and "click": add 110 to the feature value;
the link text contains "having" or "all" or "owning" or "other": add 10 to the feature value;
(B2_3) the feature value of the page is T = Score_URL × 8 + Score_Keyword.
5. The method for crawling news comment pages according to claim 3, characterized in that the process in step (C) of obtaining the page-turning links of the news comment pages and obtaining other news comment pages according to the page-turning links comprises the following steps:
(C1) obtaining the third queue and determining whether it is empty; if so, finishing;
(C2) taking the URL at the head of the third queue and obtaining the Html text of the page at this URL;
(C3) extracting from the Html text the link whose link text is "next page" and the URL of that link;
(C4) determining whether the URL of this link shares a common prefix with the URL of the current page; if so, outputting the page at the link's URL as a news comment page, adding the link's URL to the tail of the third queue, and going to step (C1); otherwise, going directly to step (C1).
6. The method for crawling news comment pages according to claim 5, characterized in that, in step (C4), the process of determining whether the URL of the link shares a common prefix with the URL of the current page comprises the following steps:
(C4_1) emptying the string sequences S1 and S2;
(C4_2) splitting the URL of the link using "/" as the separator and storing the resulting parts in S1 in order; splitting the URL of the current page using "/" as the separator and storing the resulting parts in S2 in order;
(C4_3) determining whether the first elements of S1 and S2 are identical; if so, the two URLs share a common prefix; otherwise, they do not.
7. A system for crawling news comment pages, comprising a page acquisition device (11) for obtaining pages from a news website; a news comment page recognition device (12) for identifying news comment pages among the pages obtained by the page acquisition device (11); and a page-turning link acquisition device (13) for obtaining the page-turning links of the news comment pages identified by the news comment page recognition device (12) and obtaining other news comment pages according to the page-turning links.
8. The system for crawling news comment pages according to claim 7, characterized in that the page acquisition device (11) further comprises an Html text acquisition component (101) for obtaining the Html text of a page, and a URL acquisition component (102) for extracting, from the Html text of the page, the URLs of the other pages that the page links to.
9. The system for crawling news comment pages according to claim 8, characterized in that the news comment page recognition device (12) further comprises a recognition rule setting component (120) for setting the feature value calculation rules for news comment pages, and a page feature value calculation component (121) for calculating the feature value of a page according to the feature value calculation rules set by the recognition rule setting component (120).
10. The system for crawling news comment pages according to any one of claims 7 to 9, characterized in that the page-turning link acquisition device (13) further comprises a link acquisition component (130) for obtaining the link information in a news comment page, and a page-turning link identification component (131) for identifying the page-turning links in the link information obtained by the link acquisition component (130).
CN200910242055.2A 2009-12-03 2009-12-03 Method and system for fetching news comment page Expired - Fee Related CN102087648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910242055.2A CN102087648B (en) 2009-12-03 2009-12-03 Method and system for fetching news comment page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910242055.2A CN102087648B (en) 2009-12-03 2009-12-03 Method and system for fetching news comment page

Publications (2)

Publication Number Publication Date
CN102087648A true CN102087648A (en) 2011-06-08
CN102087648B CN102087648B (en) 2013-06-19

Family

ID=44099461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910242055.2A Expired - Fee Related CN102087648B (en) 2009-12-03 2009-12-03 Method and system for fetching news comment page

Country Status (1)

Country Link
CN (1) CN102087648B (en)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100401301C (en) * 2006-05-30 2008-07-09 南京大学 Body learning based intelligent subject-type network reptile system configuration method
CN100461184C (en) * 2007-07-10 2009-02-11 北京大学 Subject crawling method based on link hierarchical classification in network search
CN101441662B (en) * 2008-11-28 2010-12-22 北京交通大学 Topic information acquisition method based on network topology
CN101561814B (en) * 2009-05-08 2012-05-09 华中科技大学 Topic crawler system based on social labels

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102279894B (en) * 2011-09-19 2013-01-09 嘉兴亿言堂信息科技有限公司 Method for searching, integrating and providing comment information based on semantics and searching system
CN102279894A (en) * 2011-09-19 2011-12-14 嘉兴亿言堂信息科技有限公司 Method for searching, integrating and providing comment information based on semantics and searching system
CN102810110B (en) * 2012-05-07 2015-08-05 北京京东世纪贸易有限公司 Obtain the method and system of network text data
CN102810110A (en) * 2012-05-07 2012-12-05 北京京东世纪贸易有限公司 Method and system for acquiring web text data
CN102821088A (en) * 2012-05-07 2012-12-12 北京京东世纪贸易有限公司 System and method for acquiring network data
CN102821088B (en) * 2012-05-07 2015-12-16 北京京东世纪贸易有限公司 Obtain the system and method for network data
CN102722580A (en) * 2012-06-07 2012-10-10 杭州电子科技大学 Method for downloading video comments dynamically generated in video websites
CN103593344A (en) * 2012-08-13 2014-02-19 北大方正集团有限公司 Information acquisition method and device
CN103593344B (en) * 2012-08-13 2016-09-21 北大方正集团有限公司 A kind of information collecting method and device
CN103488675A (en) * 2013-07-11 2014-01-01 哈尔滨工程大学 Automatic precise extraction device for multi-webpage news comment contents
CN103617229A (en) * 2013-11-25 2014-03-05 北京奇虎科技有限公司 Method and device for establishing relevant-webpage data base
CN104504016A (en) * 2014-12-10 2015-04-08 河海大学 User-oriented automatic WEB information extracting method
CN104408198A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Method and device for acquiring webpage contents
CN104408198B (en) * 2014-12-15 2018-07-17 北京国双科技有限公司 The acquisition methods and device of Webpage content
CN105138357A (en) * 2015-08-11 2015-12-09 中山大学 Method and device for implementing mobile application operation assistant
CN105138357B (en) * 2015-08-11 2018-05-01 中山大学 Method and device for implementing mobile application operation assistant
CN107045507A (en) * 2016-02-05 2017-08-15 北京国双科技有限公司 Webpage crawling method and device
CN107045507B (en) * 2016-02-05 2020-08-21 北京国双科技有限公司 Webpage crawling method and device
CN109241402A (en) * 2018-07-31 2019-01-18 成都华栖云科技有限公司 Virtual comment machine introduction method based on news content
CN111339242A (en) * 2020-02-26 2020-06-26 广东小天才科技有限公司 Comment data processing method, comment data display method, server and client

Also Published As

Publication number Publication date
CN102087648B (en) 2013-06-19

Similar Documents

Publication Publication Date Title
CN102087648B (en) Method and system for fetching news comment page
CN107229668B (en) Text extraction method based on keyword matching
CN102622445B (en) User interest perception based webpage push system and webpage push method
Bellaachia et al. Ne-rank: A novel graph-based keyphrase extraction in twitter
CN103365924B (en) Method, device and terminal for internet information search
CN106202294B (en) Related news computing method and device based on keyword and topic model fusion
WO2016000555A1 (en) Methods and systems for recommending social network-based content and news
TWI695277B (en) Automatic website data collection method
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
CN107544988B (en) Method and device for acquiring public opinion data
CN103365839A (en) Recommendation search method and device for search engines
CN103294681B (en) Method and device for generating search result
CN103853834B (en) Text structure analysis-based Web document abstract generation method
CN102929928A (en) Multidimensional-similarity-based personalized news recommendation method
CN101894102A (en) Method and device for analyzing emotion tendentiousness of subjective text
CN103336766A (en) Short text garbage identification and modeling method and device
CN106126502A (en) Sentiment classification system and method based on support vector machine
CN105138558A (en) User access content-based real-time personalized information collection method
CN103186574A (en) Method and device for generating searching result
Han et al. HIT at TREC 2012 Microblog Track.
CN111104801B (en) Text word segmentation method, system, equipment and medium based on website domain name
CN110457579B (en) Webpage denoising method and system based on cooperative work of template and classifier
CN110134788B (en) Microblog release optimization method and system based on text mining
CN104268230A (en) Method for detecting objective points of Chinese micro-blogs based on heterogeneous graph random walk
CN105512333A (en) Product comment theme searching method based on emotional tendency

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220921

Address after: No. 5 Summer Palace Road, Haidian District, Beijing 100871

Patentee after: Peking University

Patentee after: New Founder Holdings Development Co., Ltd.

Patentee after: BEIJING FOUNDER ELECTRONICS CHIEF INFORMATION TECHNOLOGY Co.,Ltd.

Patentee after: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

Address before: No. 5 Summer Palace Road, Haidian District, Beijing 100871

Patentee before: Peking University

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: BEIJING FOUNDER ELECTRONICS CHIEF INFORMATION TECHNOLOGY Co.,Ltd.

Patentee before: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130619
