WO2013087005A1 - Method and system for collecting network comments - Google Patents
Method and system for collecting network comments
- Publication number
- WO2013087005A1 PCT/CN2012/086575 CN2012086575W
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- network
- comments
- network comments
- webpage
- link address
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/02—Standardisation; Integration
- H04L41/0246—Exchanging or transporting network management information using the Internet; Embedding network management web servers in network elements; Web-services-based protocols
- H04L41/0253—Exchanging or transporting network management information using the Internet; Embedding network management web servers in network elements; Web-services-based protocols using browsers or web-pages for accessing management information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Definitions
- The present invention relates to the field of information retrieval and data integration technologies, and in particular to a method and system for collecting network comments.
- Embodiments of the present invention provide a method and system for collecting network comments, which are used for efficiently and comprehensively collecting network comments.
- An embodiment of the present invention provides a method for collecting network comments, including: obtaining a webpage entry link address; determining whether there are N network comments on the webpage corresponding to the webpage entry link address, where N is a positive integer; when the N network comments exist, determining whether M of the N network comments satisfy the collection condition, where M is a positive integer less than or equal to N; and when M network comments satisfy the collection condition, collecting the M network comments.
- Obtaining the webpage entry link address specifically includes: acquiring the topic webpage on which the topic commented on by the N network comments is located; acquiring a feature code of the topic webpage; acquiring a feature code of the channel in which the topic is located; and splicing the feature code of the topic webpage and the feature code of the channel to obtain the webpage entry link address.
- The topic webpage entry link address is periodically refreshed.
- The webpage entry link address is deleted.
- Determining whether M of the N network comments satisfy the collection condition includes: calculating the difference between N and P, where P is the number of network comments when the page was last accessed; if N is greater than P, there are new network comments, and the number of new network comments is the difference M between N and P.
- The number L of network comments contained on the current page is calculated, where L is a positive integer; if L is less than M, the number of pages to turn is calculated and the page-turn links corresponding to that number of pages are extracted.
- Each of the N network comments is compared with each of the P network comments, and if the comparison results differ, the M network comments with different comparison results are extracted.
- Determining whether M of the N network comments satisfy the collection condition includes: comparing each of the N network comments with each of the P network comments, and if the comparison results differ, determining that the M network comments with different comparison results are the network comments that satisfy the collection condition.
- The content of the extracted M network comments is saved to a storage unit separate from the webpage.
- An embodiment of the present invention provides a network comment collection system, including: an entry link obtaining component, configured to obtain a webpage entry link address; a first determining component, configured to determine whether there are N network comments on the webpage corresponding to the webpage entry link address, where N is a positive integer; a second determining component, configured to determine, when the N network comments exist, whether M of the N network comments satisfy the collection condition, where M is a positive integer less than or equal to N; and a content collection component, configured to collect the M network comments when the M network comments satisfy the collection condition.
- The network comment collection system is used to collect network comments; by obtaining the entry link address of the network comments and setting a collection condition, it achieves the technical effect of collecting network comments comprehensively.
- FIG. 1 is a flowchart of a collection method according to an embodiment of the present invention.
- FIG. 2 is a detailed flowchart of the collection method of FIG. 1.
- FIG. 3 is a detailed flowchart of the collection method of FIG. 1.
- FIG. 4 is a structural diagram of a collection system according to a first embodiment of the present invention.
- FIG. 5 is a structural diagram of a collection system according to a second embodiment of the present invention.
- FIG. 6 is a structural diagram of a collection system according to a third embodiment of the present invention.
- FIG. 7 is a structural diagram of a collection system according to a fourth embodiment of the present invention.
- FIG. 8 is a structural diagram of a collection system according to another embodiment of the present invention.
- An embodiment of the present invention provides a method for collecting network comments. As shown in FIG. 1, the collection method includes:
- Step 11: Obtain a webpage entry link address.
- Step 12: Determine whether there are N network comments on the webpage corresponding to the webpage entry link address, where N is a positive integer.
- Step 13: When there are N network comments, determine whether M of the N network comments satisfy the collection condition, where M is a positive integer less than or equal to N.
- Step 14: When M network comments satisfy the collection condition, collect the M network comments.
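Steps 12 to 14 can be sketched as a small pipeline. This is only an illustrative sketch: the function name `collect_comments`, the injected callables, and the example URL are assumptions made for the example and do not come from the patent text.

```python
from typing import Callable, List


def collect_comments(
    entry_url: str,
    fetch_comments: Callable[[str], List[str]],
    meets_condition: Callable[[str], bool],
) -> List[str]:
    """Hypothetical outline of steps 12-14 for one entry link."""
    comments = fetch_comments(entry_url)  # step 12: the N comments on the page
    if not comments:                      # no comments, so nothing to collect
        return []
    # Step 13: keep the M comments that satisfy the collection condition.
    selected = [c for c in comments if meets_condition(c)]
    # Step 14: the selected comments are what gets collected (empty if M is 0).
    return selected


# Usage with stand-in data instead of a real web page.
seen = {"first!"}
page = {"https://example.com/comments?id=1": ["first!", "nice article", "agreed"]}
new = collect_comments(
    "https://example.com/comments?id=1",
    fetch_comments=lambda url: page.get(url, []),
    meets_condition=lambda c: c not in seen,
)
```

Passing the fetch and condition logic in as callables keeps the flow independent of any particular website or collection condition.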
- As shown in FIG. 2, step 11 specifically includes:
- Step 111: Obtain the topic webpage on which the topic commented on by the N network comments is located.
- Step 112: Obtain the feature code of the topic webpage.
- Step 113: Obtain the feature code of the channel in which the topic is located.
- Step 114: Splice the feature code of the topic webpage and the feature code of the channel to obtain the webpage entry link address.
- The topic webpage may be the page where a news item is located or the page where product information is located.
- This embodiment is described in detail using a news webpage as an example.
- The topic webpage may also be a page where other information is located; the invention is not limited in this respect.
- The entry link address of the comment page for a news item is obtained by splicing feature codes found in the script of the news page according to a specific rule.
- Specifically, the script of the news page splices together the feature code identifying the news item, the feature code identifying the channel of the news item, the domain name, and some other elements (such as the current time) to form the entry link address of the comment page. Once these feature codes are obtained, a site-specific rule is configured so that the entry link address of the network comment page can be assembled in the specified pattern.
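The splicing rule can be illustrated with a short sketch. Everything site-specific here is an assumption for the example: the variable names `newsid` and `channel` in the page script, the `/comment/list` path, and the query parameter names are hypothetical, since the patent leaves the rule to per-site configuration.

```python
import re
from typing import Optional
from urllib.parse import urlencode

# Hypothetical patterns: real sites embed these codes under site-specific names.
NEWS_ID_RE = re.compile(r'newsid\s*[:=]\s*"([^"]+)"')
CHANNEL_RE = re.compile(r'channel\s*[:=]\s*"([^"]+)"')


def build_entry_link(script_text: str, domain: str, timestamp: int) -> Optional[str]:
    """Splice the news and channel feature codes into a comment entry link."""
    news = NEWS_ID_RE.search(script_text)
    channel = CHANNEL_RE.search(script_text)
    if news is None or channel is None:
        return None  # the configured rule does not match this page
    query = urlencode(
        {"channel": channel.group(1), "newsid": news.group(1), "t": timestamp}
    )
    return f"https://{domain}/comment/list?{query}"


script = 'var cfg = {channel: "gn", newsid: "1-1-23456"};'
link = build_entry_link(script, "comment.example.com", 1355356800)
```

A different site would need its own patterns and URL template, which is what the configurable rule in the text provides for.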
- Step 11 further includes:
- Step 115: Periodically refresh the webpage entry link address.
- The back end of the news website may re-edit a news item, and the link of the news webpage for the same content may change.
- New comment content is then loaded through the changed comment entry link, and the page indicated by the previously extracted comment entry link address no longer receives new comments. If the originally recorded comment entry link were used for access, the newly added comments could not be obtained, so the currently recorded news page link is periodically refreshed.
- On refresh, the site automatically jumps to the changed news page, so the comment entry link can be re-extracted from the newly obtained news page and collection can continue. That is, when the news webpage link address has been updated, the flow jumps to step 111; otherwise, the flow ends.
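A minimal sketch of the refresh in step 115 follows. The function `refresh_entry_link` and its two injected helpers are hypothetical: `resolve` stands for whatever returns the final URL after the site's automatic jump, and `extract_entry_link` stands for steps 111 to 114.

```python
from typing import Callable, Optional, Tuple


def refresh_entry_link(
    news_url: str,
    entry_link: str,
    resolve: Callable[[str], str],
    extract_entry_link: Callable[[str], Optional[str]],
) -> Tuple[str, str]:
    """Return the (possibly updated) news URL and comment entry link."""
    final_url = resolve(news_url)  # URL after any automatic jump
    if final_url == news_url:
        return news_url, entry_link  # link unchanged: nothing to do
    # The news page moved: re-run the entry-link extraction on the new page.
    new_link = extract_entry_link(final_url)
    return final_url, (new_link if new_link is not None else entry_link)


# Stand-ins for a real HTTP client and a real extraction rule.
redirects = {"https://news.example.com/a1": "https://news.example.com/a1-v2"}
resolve = lambda url: redirects.get(url, url)
extract = lambda url: url.replace("news.", "comment.")
```

In a scheduler, this function would be called for every recorded news link at each refresh interval.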
- FIG. 3 shows the specific steps of step 13, including:
- Step 131: Extract the number N of current network comments from the webpage, and calculate the difference M between N and P, where P is the number of network comments extracted the last time the link was accessed.
- Step 132: Determine whether M is greater than zero.
- Step 133: When the result of step 132 is yes, extract the M network comments.
- The number N of current network comments in step 131 may be extracted from the webpage by using a regular expression or by other methods; the invention is not limited in this respect.
- On the first access to the link, P is equal to zero.
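Steps 131 and 132 amount to a regular-expression match followed by a subtraction. The sketch below assumes a hypothetical page that shows its total as text such as "45 comments"; the pattern and the function name are illustrative only.

```python
import re

# Hypothetical format: the page shows its total as e.g. "45 comments".
COUNT_RE = re.compile(r"(\d+)\s+comments")


def count_new_comments(page_html: str, previous_count: int = 0) -> int:
    """Return M = N - P when positive, otherwise 0 (nothing new to collect)."""
    match = COUNT_RE.search(page_html)
    if match is None:
        return 0  # no comment counter found on the page
    n = int(match.group(1))  # N: current number of comments
    m = n - previous_count   # M: difference from the last visit (P)
    return m if m > 0 else 0
```

The default `previous_count=0` covers a link that has never been accessed before.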
- Step 133 specifically includes:
- Step 1331: Calculate the number L of network comments contained on the current page, where L is a positive integer less than or equal to M.
- Step 1332: Determine whether L is less than M.
- Step 1333: When the result of step 1332 is yes, calculate the number of pages to turn and extract the page-turn links corresponding to that number of pages.
- In step 1333, the number of pages to turn is calculated by a formula (not legible in this text).
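The page-turn formula itself is not legible in this text. One plausible reading, stated here purely as an assumption, is that every page holds L comments, so covering M new comments takes the ceiling of M / L pages. The `&page=k` link format below is likewise hypothetical.

```python
from typing import List


def pages_needed(new_count: int, per_page: int) -> int:
    """Assumed rule: ceil(M / L) pages are needed to cover M new comments."""
    if new_count <= 0:
        return 0
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return -(-new_count // per_page)  # integer ceiling division


def page_links(entry_link: str, new_count: int, per_page: int) -> List[str]:
    """Build hypothetical page-turn links of the form '<entry>&page=k'."""
    total = pages_needed(new_count, per_page)
    return [f"{entry_link}&page={k}" for k in range(1, total + 1)]
```

With M = 30 and L = 15, this assumed rule yields two pages, which matches the two comment pages in the 30-comment example later in the description.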
- Step 133 further includes:
- Step 1334: Determine whether each of the N network comments is the same as each of the P network comments.
- Step 1335: When the result of step 1334 is no, extract the M network comments with different comparison results.
- The content of the M network comments extracted in step 1335 is saved to a storage unit separate from the comment webpage. Comments saved to the storage unit are convenient to browse in one place, so the user can make use of the collected network comments.
- News is time-sensitive: once a news item is older than a certain period, it is considered to have lost its value.
- News comments, as an accessory to the news, lose their value along with it. For this reason, if the network comments have not been updated after a predetermined time has elapsed, the news comment link is deleted and is no longer refreshed, which saves system resources and improves efficiency.
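The expiry rule can be sketched as a filter over the recorded links. The data layout (a mapping from comment link to the time of its last observed update) and the name `prune_stale_links` are assumptions for illustration; the patent does not specify how the predetermined time is stored.

```python
from datetime import datetime, timedelta
from typing import Dict


def prune_stale_links(
    last_update: Dict[str, datetime],
    now: datetime,
    max_age: timedelta,
) -> Dict[str, datetime]:
    """Keep only the links whose comments were updated within max_age."""
    return {
        link: updated
        for link, updated in last_update.items()
        if now - updated <= max_age
    }


now = datetime(2012, 12, 13, 12, 0, 0)
links = {
    "https://comment.example.com/a": datetime(2012, 12, 13, 9, 0, 0),  # fresh
    "https://comment.example.com/b": datetime(2012, 11, 1, 9, 0, 0),   # stale
}
kept = prune_stale_links(links, now, max_age=timedelta(days=7))
```

Links dropped here are simply no longer scheduled for refresh, which is where the saving in system resources comes from.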
- Alternatively, the step of calculating the difference M between N and P in the foregoing embodiment may be omitted; instead, each of the N network comments is compared directly with each of the P network comments, and if the comparison results differ, the M network comments with different comparison results are extracted.
- This method is used because the back end of the news website deletes network comments from time to time.
- For example, the system collects 15 network comments on its first visit. In the interval between two collections, for some reason, the website back end deletes all 15 comments while 30 new comments are added, and a single page can display only 15 comments. The comments on both the first and second comment pages can therefore be considered new. Counting alone would give M = 30 - 15 = 15 and would miss half of the new comments.
- The 30 comments collected this time are compared with the 15 comments collected last time. The comparison shows that all 30 comments differ from the previous 15, so all 30 new comments should be collected this time.
- The content of the 30 network comments collected this time is saved to a storage unit separate from the comment page; comments saved to the storage unit are convenient to browse, so the user can make use of the collected network comments.
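The comparison approach can be sketched with a set lookup. Treating two comments as "the same" when their text is identical is an assumption of this example; a real collector might compare comment IDs or author and timestamp instead.

```python
from typing import Iterable, List


def new_comments(current: Iterable[str], previous: Iterable[str]) -> List[str]:
    """Return the current comments that do not appear among the previous ones."""
    seen = set(previous)  # set lookup avoids an explicit N x P comparison loop
    return [comment for comment in current if comment not in seen]


# The scenario from the text: 15 old comments deleted, 30 new ones added.
previous = [f"old comment {i}" for i in range(15)]
current = [f"new comment {i}" for i in range(30)]
collected = new_comments(current, previous)
by_count_only = len(current) - len(previous)
```

Here `collected` contains all 30 comments, whereas the count difference `by_count_only` is only 15, which is the gap the comparison method is meant to close.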
- The first embodiment of the present invention provides a network data collection system.
- FIG. 4 is a system architecture diagram of this embodiment.
- The system includes an entry link obtaining component 10, a first determining component 20, a second determining component 30, and a content collection component 40.
- The entry link obtaining component 10 is configured to obtain a webpage entry link address.
- The first determining component 20 is configured to determine whether there are N network comments on the webpage corresponding to the webpage entry link address.
- The second determining component 30 is configured to determine whether there are M network comments that satisfy the collection condition.
- The content collection component 40 is configured to collect network comments.
- The entry link obtaining component 10 includes a first acquiring unit 101, a second acquiring unit 102, a third acquiring unit 103, and a splicing unit 104.
- The first acquiring unit 101 is configured to acquire the topic webpage on which the topic commented on by the N network comments is located;
- the second acquiring unit 102 is configured to acquire the feature code of the topic webpage;
- the third acquiring unit 103 is configured to acquire the feature code of the channel in which the topic is located;
- the splicing unit 104 is configured to splice the feature code of the topic webpage and the feature code of the channel to obtain a webpage entry link address.
- The second determining component 30 determines whether there are M network comments that satisfy the collection condition. Specifically, it extracts the number N of network comments from the webpage and calculates the difference M between N and P, where P is the number of network comments extracted the last time the link was accessed. It then determines whether M is greater than zero; if so, the M network comments are the comments that satisfy the collection condition.
- The system further includes an entry link address refreshing component 50 for periodically refreshing the webpage entry link address.
- The entry link address refreshing component 50 can be used together with the entry link obtaining component 10 so that updated network comments are obtained in a timely manner.
- The system further includes a network comment page refreshing component 60.
- The network comment page refreshing component 60 can be used together with the first determining component 20 to improve collection efficiency by discarding network comments that have not been updated for a long time.
- The content collection component 40 further includes a page-turn extracting unit 401, a comparing unit 402, a content extracting unit 403, and a disk I/O unit 404.
- The page-turn extracting unit 401 is configured to calculate the number of pages to turn and extract the page-turn links corresponding to that number of pages; the comparing unit 402 is configured to compare each of the N network comments with each of the P network comments; the content extracting unit 403 is configured to extract the network comments with different comparison results when the comparison results differ.
- The disk I/O unit 404 is configured to save the extracted network comment content to a storage unit separate from the webpage. Please refer to FIG. 7 for this embodiment.
- FIG. 8 is a system architecture diagram of this embodiment.
- This embodiment differs from the first embodiment in that it does not include the comparing unit 402 and the content extracting unit 403.
- The system of this embodiment includes an entry link obtaining component 80, a first determining component 81, a second determining component 82, and a content collection component 83.
- The entry link obtaining component 80 is configured to obtain a webpage entry link address.
- The first determining component 81 is configured to determine whether there are network comments on the webpage corresponding to the webpage entry link address.
- The second determining component 82 is configured to determine whether there are network comments that satisfy the collection condition.
- The content collection component 83 is configured to collect network comments.
- The entry link obtaining component 80 includes a first acquiring unit 801, a second acquiring unit 802, a third acquiring unit 803, and a splicing unit 804.
- The first acquiring unit 801 is configured to acquire the topic webpage on which the topic commented on by the N network comments is located;
- the second acquiring unit 802 is configured to acquire the feature code of the topic webpage;
- the third acquiring unit 803 is configured to acquire the feature code of the channel in which the topic is located;
- the splicing unit 804 is configured to splice the feature code of the topic webpage and the feature code of the channel to obtain a webpage entry link address.
- The second determining component 82 is configured to compare each of the N network comments with each of the P network comments and, if the comparison results differ, to determine that the M network comments with different comparison results are the network comments that satisfy the collection condition.
- The content collection component 83 includes a page-turn extracting unit 831 and a disk I/O unit 832.
- The page-turn extracting unit 831 is configured to calculate the number of pages to turn and extract the page-turn links corresponding to that number of pages;
- the disk I/O unit 832 is configured to save the extracted network comment content to a storage unit separate from the webpage.
- The entry link obtaining component 80 can cooperate with the entry link address refreshing component 84 of the second embodiment to collect network comments more comprehensively.
- The first determining component 81 can be used together with the network comment page refreshing component 85 of the third embodiment to achieve comprehensive and efficient collection of network comments.
- A network comment collection system is used to collect network comments; by obtaining the entry link address of the network comments and setting a collection condition, it achieves the technical effect of collecting network comments comprehensively. Further, the comparing unit compares each comment from the current extraction with each comment from the previous extraction, and the content extracting unit then extracts only the comments whose comparison results differ, so efficient collection is achieved on top of comprehensive collection.
- The computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction apparatus. The instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
- These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Transfer Between Computers (AREA)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2014532240A JP2014532220A (ja) | 2011-12-13 | 2012-12-13 | Method and system for collecting net comments |
EP12857642.8A EP2713287A4 (en) | 2011-12-13 | 2012-12-13 | METHOD AND SYSTEM FOR COLLECTING NETWORK COMMENTS |
US14/127,400 US20140289395A1 (en) | 2011-12-13 | 2012-12-13 | Network comment collection method and system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110415749.9A CN103164438B (zh) | 2011-12-13 | 2011-12-13 | Method and system for collecting network comments |
CN201110415749.9 | 2011-12-13 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2013087005A1 (zh) | 2013-06-20 |
Family
ID=48587532
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2012/086575 WO2013087005A1 (zh) | Method and system for collecting network comments | 2011-12-13 | 2012-12-13 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20140289395A1 (zh) |
EP (1) | EP2713287A4 (zh) |
JP (1) | JP2014532220A (zh) |
CN (1) | CN103164438B (zh) |
WO (1) | WO2013087005A1 (zh) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103902674B (zh) * | 2014-03-19 | 2017-10-27 | 百度在线网络技术(北京)有限公司 | Method and device for collecting comment data on a specific topic |
CN106250417A (zh) * | 2016-07-22 | 2016-12-21 | 乐视控股(北京)有限公司 | Method and device for acquiring comment data |
CN108520441A (zh) * | 2018-04-04 | 2018-09-11 | 网易无尾熊(杭州)科技有限公司 | Data processing method, medium, system and computing device |
KR20220112755A (ko) | 2019-12-06 | 2022-08-11 | 소니그룹주식회사 | Information processing system, information processing method, and storage medium |
JP7284336B1 (ja) | 2022-12-27 | 2023-05-30 | ヤフー株式会社 | Content providing device, content providing method, and program |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6707470B1 (en) * | 1999-05-21 | 2004-03-16 | Nec Corporation | Apparatus for and method of gathering information, which can automatically obtain HTML file of URL even if user does not specify URL |
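CN101178713A (zh) * | 2006-11-29 | 2008-05-14 | 腾讯科技(深圳)有限公司 | Method and system for collecting webpages |
CN101436196A (zh) * | 2008-11-25 | 2009-05-20 | 北京邮电大学 | Method for constructing an automatically and dynamically updating forum crawler system |
CN101515269A (zh) * | 2008-02-20 | 2009-08-26 | 中国科学院自动化研究所 | Method for implementing ranking in an opinion search engine |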
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3918230B2 (ja) * | 1996-04-30 | 2007-05-23 | セイコーエプソン株式会社 | Data update monitoring server |
US7464097B2 (en) * | 2002-08-16 | 2008-12-09 | Sap Ag | Managing data integrity using a filter condition |
JP2006309515A (ja) * | 2005-04-28 | 2006-11-09 | Dainippon Printing Co Ltd | Information distribution method and information distribution server |
US20100205168A1 (en) * | 2009-02-10 | 2010-08-12 | Microsoft Corporation | Thread-Based Incremental Web Forum Crawling |
US20130041901A1 (en) * | 2011-08-12 | 2013-02-14 | Rawllin International Inc. | News feed by filter |
- 2011-12-13: CN CN201110415749.9A, patent CN103164438B (zh), not active (Expired - Fee Related)
- 2012-12-13: EP EP12857642.8A, patent EP2713287A4 (en), not active (Withdrawn)
- 2012-12-13: JP JP2014532240A, patent JP2014532220A (ja), active (Pending)
- 2012-12-13: US US14/127,400, patent US20140289395A1 (en), not active (Abandoned)
- 2012-12-13: WO PCT/CN2012/086575, patent WO2013087005A1 (zh), active (Application Filing)
Non-Patent Citations (1)
Title |
---|
See also references of EP2713287A4 * |
Also Published As
Publication number | Publication date |
---|---|
EP2713287A4 (en) | 2015-06-03 |
CN103164438B (zh) | 2016-07-06 |
CN103164438A (zh) | 2013-06-19 |
JP2014532220A (ja) | 2014-12-04 |
US20140289395A1 (en) | 2014-09-25 |
EP2713287A1 (en) | 2014-04-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
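JP7356206B2 (ja) | Content recommendation and display | |
CN103885987B (zh) | Music recommendation method and system | |
CN105095211B (zh) | Method and device for acquiring multimedia data | |
JP5823620B2 (ja) | Method and system for collecting net data | |
WO2015196907A1 (zh) | Search push method and device for mining user needs | |
CN102708174B (zh) | Method and device for displaying rich media information in a browser | |
WO2018000557A1 (zh) | Method and device for displaying search results | |
CN106415540B (zh) | Federated search | |
WO2013087005A1 (zh) | Method and system for collecting network comments | |
CN102446225A (zh) | Real-time search method, device and system | |
WO2015196906A1 (zh) | Method and device for obtaining disease consultation information based on search | |
JP2009532774A5 (zh) | ||
CN107784059A (zh) | Method and system for searching for and selecting images, and machine-readable medium | |
JP2017535860A (ja) | Method and device for providing multimedia content | |
US20140006418A1 (en) | Method and apparatus for ranking apps in the wide-open internet | |
CN107766399A (zh) | Method and system for matching images to content items, and machine-readable medium | |
US10275472B2 (en) | Method for categorizing images to be associated with content items based on keywords of search queries | |
EP3260968A1 (en) | Method and apparatus for displaying electronic picture, and mobile device | |
JP2015502608A5 (zh) | ||
WO2021129122A1 (zh) | Method for displaying a book query page, electronic device, and computer storage medium | |
CN104035972A (zh) | Microblog-based knowledge recommendation method and system | |
CN104090923A (zh) | Method and device for displaying rich media information in a browser | |
CN106682206A (zh) | Big data processing method and system | |
CN109947935A (zh) | Method and device for generating news events | |
WO2014059851A1 (zh) | Search server and search method |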
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 12857642 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2014532240 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2012857642 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 14127400 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |