WO2013087005A1 - Method and system for collecting network comments - Google Patents

Method and system for collecting network comments

Info

Publication number
WO2013087005A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
comments
network comments
webpage
link address
Prior art date
Application number
PCT/CN2012/086575
Other languages
English (en)
French (fr)
Inventor
张涛
于晓明
杨建武
Original Assignee
北大方正集团有限公司
北京大学
北京北大方正电子有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北大方正集团有限公司, 北京大学, 北京北大方正电子有限公司
Priority to JP2014532240A, published as JP2014532220A (ja)
Priority to EP12857642.8A, published as EP2713287A4 (en)
Priority to US14/127,400, published as US20140289395A1 (en)
Publication of WO2013087005A1 (zh)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/02 Standardisation; Integration
    • H04L41/0246 Exchanging or transporting network management information using the Internet; Embedding network management web servers in network elements; Web-services-based protocols
    • H04L41/0253 Exchanging or transporting network management information using the Internet; Embedding network management web servers in network elements; Web-services-based protocols using browsers or web-pages for accessing management information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Definitions

  • The present invention relates to the field of information retrieval and data integration technologies, and in particular to a method and system for collecting network comments.
  • Embodiments of the present invention provide a method and system for collecting network comments, which are used for efficiently and comprehensively collecting network comments.
  • An embodiment of the present invention provides a method for collecting network comments, including: obtaining a webpage entry link address; determining whether there are N network comments on the webpage corresponding to the webpage entry link address, where N is a positive integer; when there are N network comments, determining whether M of the N network comments satisfy a collection condition, where M is a positive integer less than or equal to N; and when M network comments satisfy the collection condition, collecting the M network comments.
  • Obtaining the webpage entry link address specifically includes: obtaining the topic webpage on which the topic commented on by the N network comments is located; obtaining a feature code of the topic webpage; obtaining a feature code of the channel to which the topic belongs; and splicing the feature code of the topic webpage and the feature code of the channel to obtain the webpage entry link address.
  • Preferably, the topic webpage entry link address is periodically refreshed.
  • Preferably, when the network comments on the webpage have not been updated for longer than a predetermined time, the webpage entry link address is deleted.
  • Determining whether M of the N network comments satisfy the collection condition specifically includes: calculating the difference between N and P; if N is greater than P, there are newly added network comments, and their number is the difference M between N and P, where P is the number of network comments when the page was last accessed.
  • The number L of network comments contained on the current page is calculated; if L is less than M, the number of pages to turn is calculated and the page-turning links corresponding to that number of pages are extracted, where L is a positive integer.
  • Each of the N network comments is compared with each of the P network comments, and if the comparison results differ, the M network comments whose comparison results differ are extracted.
  • Determining whether M of the N network comments satisfy the collection condition may instead specifically include: comparing each of the N network comments with each of the P network comments, and if the comparison results differ, determining that the M network comments whose comparison results differ are the network comments satisfying the collection condition.
  • The content of the extracted M network comments is saved to a storage unit different from the webpage.
  • Another aspect of the embodiments of the present invention provides a system for collecting network comments, the system including: an entry link obtaining component, configured to obtain a webpage entry link address; a first determining component, configured to determine whether there are N network comments on the webpage corresponding to the webpage entry link address, where N is a positive integer; a second determining component, configured to determine, when there are N network comments, whether M of the N network comments satisfy the collection condition, where M is a positive integer less than or equal to N; and a content collection component, configured to collect the M network comments when M network comments satisfy the collection condition.
  • In the embodiments of the present invention, a network comment collection system is used to collect network comments, and the technical effect of comprehensively collecting network comments is achieved by obtaining the entry link address of the network comments and setting a collection condition.
  • FIG. 1 is a flowchart of a collection method according to an embodiment of the present invention.
  • FIG. 2 is a detailed flowchart of the collection method of FIG. 1.
  • FIG. 3 is a detailed flowchart of the collection method of FIG. 1.
  • FIG. 4 is an architecture diagram of a collection system according to a first embodiment of the present invention.
  • FIG. 5 is an architecture diagram of a collection system according to a second embodiment of the present invention.
  • FIG. 6 is an architecture diagram of a collection system according to a third embodiment of the present invention.
  • FIG. 7 is an architecture diagram of a collection system according to a fourth embodiment of the present invention.
  • FIG. 8 is an architecture diagram of a collection system according to another embodiment of the present invention.
  • An embodiment of the present invention provides a method for collecting network comments. As shown in FIG. 1, the collection method includes:
  • Step 11: Obtain a webpage entry link address;
  • Step 12: Determine whether there are N network comments on the webpage corresponding to the webpage entry link address, where N is a positive integer;
  • Step 13: When there are N network comments, determine whether M of the N network comments satisfy the collection condition, where M is a positive integer less than or equal to N;
  • Step 14: When M network comments satisfy the collection condition, collect the M network comments.
  • Referring to FIG. 2, step 11 specifically includes:
  • Step 111: Obtain the topic webpage on which the topic commented on by the N network comments is located;
  • Step 112: Obtain the feature code of the topic webpage;
  • Step 113: Obtain the feature code of the channel to which the topic belongs;
  • Step 114: Splice the feature code of the topic webpage and the feature code of the channel to obtain the webpage entry link address.
  • In the present invention, the topic webpage may be the page on which a news item is located or the page on which product information is located.
  • This embodiment is described in detail using a news webpage as an example.
  • In practice, the topic webpage may also be a page on which other information is located; the present invention is not limited in this respect.
  • In this embodiment, the entry link address of the comment page for a news item is obtained by splicing, according to a specific rule, feature codes found in the script program of the news page.
  • For example, the entry link address of the comment page for a news item is formed by the script program of the news page, which splices together the feature code identifying the news item, the feature code identifying the channel of the news item, the domain name, and some other elements (such as the current time). These feature codes are obtained, and personalized rules are configured so that the entry link address of the network comment page is matched according to a specified pattern.
  • Referring again to FIG. 2, step 11 further includes:
  • Step 115: Periodically refresh the webpage entry link address.
  • In step 115, the website back end of the news webpage may re-edit the news, so the link of a news webpage with the same content may change.
  • This means that the feature codes identifying the news item and its channel change, and the network comment entry link changes with them. New comment content is then loaded through the changed entry link, and the page indicated by the previously extracted entry link address receives no further comment updates.
  • If the originally recorded entry link were still used for access, the newly updated comment content could not be obtained. For this case, the currently recorded news page link is therefore refreshed periodically.
  • If the link address has changed, the site automatically redirects to the changed news webpage, so the network comment entry link can be re-extracted from the newly obtained news webpage and collection can continue. That is, when the news webpage entry link address has been updated, the flow jumps to step 111; otherwise, the flow ends.
  • FIG. 3 shows the specific steps of step 13, which include:
  • Step 131: Extract the current number N of network comments from the webpage and calculate the difference M between N and P, where P is the number of network comments extracted the last time the link was accessed;
  • Step 132: Determine whether M is greater than zero;
  • Step 133: When the result of step 132 is yes, extract the M network comments.
  • In step 131, the current number N of network comments may be extracted from the webpage with a regular expression or by other methods; the present invention is not limited in this respect.
  • When network comments are collected for the first time, P is equal to zero.
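  • Referring again to FIG. 3, step 133 specifically includes:
  • Step 1331: Calculate the number L of network comments contained on the current page, where L is a positive integer less than or equal to M;
  • Step 1332: Determine whether L is less than M;
  • Step 1333: When the result of step 1332 is yes, calculate the number of pages to turn and extract the page-turning links corresponding to that number of pages.
  • In step 1333, the number of pages to turn is given by $P_{count} = \frac{M - L}{Perpage}$, where $P_{count}$ denotes the number of pages to turn, $M$ denotes the number of updated comments, $L$ denotes the number of comments on the current webpage, and $Perpage$ denotes the number of comments on a single webpage.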
  • Referring again to FIG. 3, step 133 further includes:
  • Step 1334: Determine whether each of the N network comments is the same as each of the P network comments;
  • Step 1335: When the result of step 1334 is no, extract the M network comments whose comparison results differ.
  • The content of the M network comments extracted in step 1335 is saved to a storage unit different from the comment webpage. Comments saved to the storage unit can be browsed in one place, which makes it convenient for users to apply the collected network comments.
  • In this embodiment, news is time-sensitive: a news item older than a certain period is considered meaningless.
  • News comments, as an accessory to the news, lose their value along with it. For this reason, if the network comments have not been updated after a predetermined time, the news comment link is deleted and is no longer refreshed, which saves system resources and improves efficiency.
  • In another embodiment, when determining whether M of the N network comments satisfy the collection condition, the method of calculating the difference M between N and P in the foregoing embodiment need not be used. Instead, each of the N network comments is compared directly with each of the P network comments, and if the comparison results differ, the M network comments whose comparison results differ are extracted.
  • This collection method is used because the website back end of the news page deletes network comments from time to time.
  • For example, the system finds 15 network comments in its first collection. In the interval between two collections, the website back end for some reason deletes all 15 comments, and at the same time 30 new comments are added. Since a single webpage can display only 15 comments, the comments on both the first and the second comment page can be regarded as new.
  • When the collection cycle arrives, the 30 comments collected this time are compared with the previous 15 comments. The comparison shows that all 30 comments differ from the previous 15, so 30 new comments should be collected this time.
  • The content of the 30 network comments collected this time is saved to a storage unit different from the comment webpage. Comments saved to the storage unit can be browsed in one place, which makes it convenient for users to apply the collected network comments.
  • The first embodiment of the present invention provides a network data collection system.
  • FIG. 4 is a system architecture diagram of this embodiment.
  • As shown in FIG. 4, the system includes an entry link obtaining component 10, a first determining component 20, a second determining component 30, and a content collection component 40.
  • The entry link obtaining component 10 is configured to obtain a webpage entry link address.
  • The first determining component 20 is configured to determine whether there are N network comments on the webpage corresponding to the webpage entry link address.
  • The second determining component 30 is configured to determine whether there are M network comments satisfying the collection condition.
  • The content collection component 40 is configured to collect network comments.
  • The entry link obtaining component 10 includes a first obtaining unit 101, a second obtaining unit 102, a third obtaining unit 103, and a splicing unit 104.
  • The first obtaining unit 101 is configured to obtain the topic webpage on which the topic commented on by the N network comments is located;
  • the second obtaining unit 102 is configured to obtain the feature code of the topic webpage;
  • the third obtaining unit 103 is configured to obtain the feature code of the channel to which the topic belongs;
  • the splicing unit 104 is configured to splice the feature code of the topic webpage and the feature code of the channel to obtain the webpage entry link address.
  • To determine whether there are M network comments satisfying the collection condition, the second determining component 30 extracts the N network comments from the webpage and calculates the difference M between N and P, where P is the number of network comments extracted the last time the link was accessed. It then determines whether M is greater than zero; if so, the M network comments are comments satisfying the collection condition.
  • In the second embodiment, unlike the first embodiment, the system further includes an entry link address refreshing component 50 for periodically refreshing the webpage entry link address.
  • In this embodiment, the entry link address refreshing component 50 can be used together with the entry link obtaining component 10 so that updated network comments are collected in time.
  • In the third embodiment, unlike the first and second embodiments, the system further includes a network comment page refreshing component 60, which determines whether the network comments on the webpage have gone without an update for longer than a predetermined time and, if so, deletes the webpage entry link address.
  • In this embodiment, the network comment page refreshing component 60 can be used together with the first determining component 20 to improve collection efficiency: network comments that have not been updated for a long time are no longer collected.
  • In the fourth embodiment, unlike the first, second, and third embodiments, the content collection component 40 further includes a page-turning extraction component 401, a comparing component 402, a content extracting component 403, and a disk I/O component 404.
  • The page-turning extraction component 401 is configured to calculate the number of pages to turn and extract the page-turning links corresponding to that number of pages; the comparing component 402 is configured to compare each of the N network comments with each of the P network comments; the content extracting component 403 is configured to extract, when the comparison results differ, the network comments whose comparison results differ.
  • The disk I/O component 404 is configured to save the extracted network comment content to a storage unit different from the webpage. For this embodiment, refer to FIG. 7.
  • Another embodiment of the present invention provides a network data collection system. FIG. 8 is a system architecture diagram of this embodiment.
  • This embodiment differs from the first embodiment in that it does not include the comparing component 402 or the content extracting component 403.
  • As shown in FIG. 8, the system of this embodiment includes an entry link obtaining component 80, a first determining component 81, a second determining component 82, and a content collection component 83.
  • The entry link obtaining component 80 is configured to obtain a webpage entry link address.
  • The first determining component 81 is configured to determine whether there are network comments on the webpage corresponding to the webpage entry link address.
  • The second determining component 82 is configured to determine whether there are network comments satisfying the collection condition.
  • The content collection component 83 is configured to collect network comments.
  • The entry link obtaining component 80 includes a first obtaining unit 801, a second obtaining unit 802, a third obtaining unit 803, and a splicing unit 804.
  • The first obtaining unit 801 is configured to obtain the topic webpage on which the topic commented on by the N network comments is located;
  • the second obtaining unit 802 is configured to obtain the feature code of the topic webpage;
  • the third obtaining unit 803 is configured to obtain the feature code of the channel to which the topic belongs;
  • the splicing unit 804 is configured to splice the feature code of the topic webpage and the feature code of the channel to obtain the webpage entry link address.
  • The second determining component 82 is configured to compare each of the N network comments with each of the P network comments and, if the comparison results differ, to determine that the M network comments whose comparison results differ are the network comments satisfying the collection condition.
  • The content collection component 83 includes a page-turning extraction component 831 and a disk I/O component 832.
  • The page-turning extraction component 831 is configured to calculate the number of pages to turn and extract the page-turning links corresponding to that number of pages;
  • the disk I/O component 832 is configured to save the extracted network comment content to a storage unit different from the webpage.
  • In this embodiment, the entry link obtaining component 80 can be used together with the entry link address refreshing component 84 of the second embodiment to collect network comments more comprehensively.
  • The first determining component 81 can be used together with the network comment page refreshing component 85 of the third embodiment to collect network comments comprehensively and efficiently.
  • An embodiment of the present invention uses a network comment collection system to collect network comments, and achieves the technical effect of comprehensively collecting network comments by obtaining the entry link address of the network comments and setting a collection condition. Further, a comparing component is used to compare every comment extracted this time with every comment extracted last time, and a content extracting component then extracts only the comments whose comparison results differ, so efficient collection is achieved on top of comprehensive collection.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction device.
  • The instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing.
  • The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

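The present application discloses a method and system for collecting network comments. The method includes: obtaining a webpage entry link address; determining whether there are N network comments on the webpage corresponding to the webpage entry link address, where N is a positive integer; when there are N network comments, determining whether M of the N network comments satisfy a collection condition, where M is a positive integer less than or equal to N; and when M network comments satisfy the collection condition, collecting the M network comments.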

Description

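Method and system for collecting network comments
This application claims priority to Chinese patent application No. 201110415749.9, entitled "Method and system for collecting network comments" and filed with the Chinese Patent Office on December 13, 2011, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of information retrieval and data integration technologies, and in particular to a method and system for collecting network comments.
Background
With the rapid development of Internet technology, the Internet has become the largest information repository in the world. It covers almost every field of human activity and has become an important platform for obtaining and exchanging information. To make information easier to look up, Internet-based information retrieval has been studied in depth and has advanced considerably, and applications built on it, such as online public opinion analysis and vertical search for reviews, have emerged. These applications all first download webpages locally, then strip out noise to extract the content needed for analysis, and finally carry out further analysis on that basis.
For information published on the Internet, users can post their own thoughts after reading it, forming comments on that information. Given the popularity, breadth, and immediacy of today's Internet, network comments can be said to represent, to some extent, the public's view of an event, which is of great significance for public opinion analysis and offers wide room for application.
Network comments have therefore become one of the important data sources for many applications, and collecting them is the most basic prerequisite. In the prior art, however, research on collecting network comments is almost nonexistent, and there is no technique for collecting network comments efficiently and comprehensively.
Summary of the Invention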
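Embodiments of the present invention provide a method and system for collecting network comments efficiently and comprehensively.
One aspect of the embodiments of the present invention provides a method for collecting network comments, including: obtaining a webpage entry link address; determining whether there are N network comments on the webpage corresponding to the webpage entry link address, where N is a positive integer; when there are N network comments, determining whether M of the N network comments satisfy a collection condition, where M is a positive integer less than or equal to N; and when M network comments satisfy the collection condition, collecting the M network comments.
Preferably, obtaining the webpage entry link address specifically includes: obtaining the topic webpage on which the topic commented on by the N network comments is located; obtaining a feature code of the topic webpage; obtaining a feature code of the channel to which the topic belongs; and splicing the feature code of the topic webpage and the feature code of the channel to obtain the webpage entry link address.
Preferably, the topic webpage entry link address is periodically refreshed.
Preferably, when the network comments on the webpage have not been updated for longer than a predetermined time, the webpage entry link address is deleted.
Preferably, determining whether M of the N network comments satisfy the collection condition specifically includes: calculating the difference between N and P; if N is greater than P, there are newly added network comments, and their number is the difference M between N and P, where P is the number of network comments when the page was last accessed.
Preferably, the number L of network comments contained on the current page is calculated; if L is less than M, the number of pages to turn is calculated and the page-turning links corresponding to that number of pages are extracted, where L is a positive integer.
Preferably, each of the N network comments is compared with each of the P network comments, and if the comparison results differ, the M network comments whose comparison results differ are extracted.
Preferably, determining whether M of the N network comments satisfy the collection condition specifically includes: comparing each of the N network comments with each of the P network comments, and if the comparison results differ, determining that the M network comments whose comparison results differ are the network comments satisfying the collection condition.
Preferably, the content of the extracted M network comments is saved to a storage unit different from the webpage.
Another aspect of the embodiments of the present invention provides a system for collecting network comments, the system including: an entry link obtaining component, configured to obtain a webpage entry link address; a first determining component, configured to determine whether there are N network comments on the webpage corresponding to the webpage entry link address, where N is a positive integer; a second determining component, configured to determine, when there are N network comments, whether M of the N network comments satisfy the collection condition, where M is a positive integer less than or equal to N; and a content collection component, configured to collect the M network comments when M network comments satisfy the collection condition.
The beneficial effects of the present invention are as follows:
The embodiments of the present invention use a network comment collection system to collect network comments, and achieve the technical effect of comprehensively collecting network comments by obtaining the entry link address of the network comments and setting a collection condition.
Further, a comparing component is used to compare every comment extracted this time with every comment extracted last time, and a content extracting component then extracts only the comments whose comparison results differ, so efficient collection is achieved on top of comprehensive collection.
Brief Description of the Drawings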
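FIG. 1 is a flowchart of a collection method according to an embodiment of the present invention;
FIG. 2 is a detailed flowchart of the collection method of FIG. 1;
FIG. 3 is a detailed flowchart of the collection method of FIG. 1;
FIG. 4 is an architecture diagram of a collection system according to a first embodiment of the present invention;
FIG. 5 is an architecture diagram of a collection system according to a second embodiment of the present invention;
FIG. 6 is an architecture diagram of a collection system according to a third embodiment of the present invention;
FIG. 7 is an architecture diagram of a collection system according to a fourth embodiment of the present invention;
FIG. 8 is an architecture diagram of a collection system according to another embodiment of the present invention.
Detailed Description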
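An embodiment of the present invention provides a method for collecting network comments. As shown in FIG. 1, the collection method includes:
Step 11: Obtain a webpage entry link address;
Step 12: Determine whether there are N network comments on the webpage corresponding to the webpage entry link address, where N is a positive integer;
Step 13: When there are N network comments, determine whether M of the N network comments satisfy the collection condition, where M is a positive integer less than or equal to N;
Step 14: When M network comments satisfy the collection condition, collect the M network comments.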
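Steps 11 to 14 can be read as a single collection pass. The following is a minimal Python sketch of that control flow only; the function name `collect_comments` and its three callable parameters are hypothetical and do not appear in the patent, which leaves the concrete implementation of each step open.

```python
from typing import Callable, List


def collect_comments(
    get_entry_url: Callable[[], str],
    fetch_comments: Callable[[str], List[str]],
    select_to_collect: Callable[[List[str]], List[str]],
) -> List[str]:
    """Run one pass of steps 11-14 and return the collected comments."""
    entry_url = get_entry_url()               # step 11: obtain the entry link address
    comments = fetch_comments(entry_url)      # step 12: are there N comments on the page?
    if not comments:                          # N == 0: nothing to examine
        return []
    selected = select_to_collect(comments)    # step 13: which M comments meet the condition?
    return list(selected)                     # step 14: collect them (empty when M == 0)
```

Passing the three steps in as callables keeps the sketch independent of any particular site or fetch mechanism, so the count-based and comparison-based collection conditions described later can both be plugged in as `select_to_collect`.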
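Referring to FIG. 2, step 11 specifically includes:
Step 111: Obtain the topic webpage on which the topic commented on by the N network comments is located;
Step 112: Obtain the feature code of the topic webpage;
Step 113: Obtain the feature code of the channel to which the topic belongs;
Step 114: Splice the feature code of the topic webpage and the feature code of the channel to obtain the webpage entry link address.
In the present invention, the topic webpage may be the page on which a news item is located or the page on which product information is located. This embodiment is described in detail using a news webpage as an example; in practice, the topic webpage may also be a page on which other information is located, and the present invention is not limited in this respect.
In this embodiment, the entry link address of the comment page for a news item is obtained by splicing, according to a specific rule, feature codes found in the script program of the news page. For example, the entry link address of the comment page for a news item is formed by the script program of the news page, which splices together the feature code identifying the news item, the feature code identifying the channel of the news item, the domain name, and some other elements (such as the current time). These feature codes are obtained, and personalized rules are configured so that the entry link address of the network comment page is matched according to a specified pattern.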
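As an illustration of steps 112 to 114, the sketch below pulls two feature codes out of a page's script text and splices them into an entry URL. Everything site-specific here is an assumption made for the example: the variable names `newsid` and `channel`, the `/comments` path, the query parameter names, and the `example.invalid` domain. A real site would need its own configured rule, as the text above notes.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlencode

# Hypothetical patterns: real news pages name these script variables differently.
NEWS_CODE_RE = re.compile(r'newsid\s*[:=]\s*"([\w-]+)"')
CHANNEL_CODE_RE = re.compile(r'channel\s*[:=]\s*"([\w-]+)"')


def extract_feature_codes(page_html: str) -> Optional[Tuple[str, str]]:
    """Steps 112-113: return (news_code, channel_code), or None if either is missing."""
    news = NEWS_CODE_RE.search(page_html)
    channel = CHANNEL_CODE_RE.search(page_html)
    if news is None or channel is None:
        return None
    return news.group(1), channel.group(1)


def build_entry_url(domain: str, news_code: str, channel_code: str, now: int) -> str:
    """Step 114: splice domain, feature codes and current time into one link."""
    query = urlencode({"channel": channel_code, "newsid": news_code, "t": now})
    return f"http://{domain}/comments?{query}"
```

For a page containing `var newsid = "1-1-821"; var channel = "gn";`, `extract_feature_codes` yields `("1-1-821", "gn")`, which `build_entry_url` turns into a single query-string URL on the assumed comment domain.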
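Referring again to FIG. 2, step 11 further includes:
Step 115: Periodically refresh the webpage entry link address.
In step 115, the website back end of the news webpage may re-edit the news, so the link of a news webpage with the same content may change. This means that the feature codes identifying the news item and its channel change, and the network comment entry link changes with them. New comment content is then loaded through the changed entry link, and the page indicated by the previously extracted entry link address receives no further comment updates. If the originally recorded entry link were still used for access, the newly updated comment content could not be obtained. For this case, the currently recorded news page link is therefore refreshed periodically. If the link address has changed, the site automatically redirects to the changed news webpage, so the network comment entry link can be re-extracted from the newly obtained news webpage and collection can continue. That is, when the news webpage entry link address has been updated, the flow jumps to step 111; otherwise, the flow ends.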
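One refresh cycle of step 115 can be sketched as follows. This is an illustrative sketch, not the patent's implementation: `resolve` stands in for whatever HTTP client follows the site's redirect and reports the final URL, and `extract_entry` stands in for steps 111 to 114; both names are hypothetical.

```python
from typing import Callable, Tuple


def refresh_entry(
    news_url: str,
    entry_url: str,
    resolve: Callable[[str], str],
    extract_entry: Callable[[str], str],
) -> Tuple[str, str]:
    """Return the (news_url, entry_url) pair to record after one refresh."""
    final_url = resolve(news_url)        # follows any redirect issued by the site
    if final_url == news_url:
        return news_url, entry_url       # link unchanged: keep the recorded entry link
    # Link changed: re-run steps 111-114 on the new news page.
    return final_url, extract_entry(final_url)
```

A scheduler would call `refresh_entry` once per refresh period for every recorded news page and store the returned pair in place of the old one.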
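Referring to FIG. 3, which shows the specific steps of step 13:
Step 131: Extract the current number N of network comments from the webpage and calculate the difference M between N and P, where P is the number of network comments extracted the last time the link was accessed;
Step 132: Determine whether M is greater than zero;
Step 133: When the result of step 132 is yes, extract the M network comments.
In step 131, the current number N of network comments may be extracted from the webpage with a regular expression or by other methods; the present invention is not limited in this respect. When network comments are collected for the first time, P is equal to zero.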
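Steps 131 and 132 amount to reading a counter from the page and subtracting the previous count. The sketch below assumes, purely for illustration, that the page shows the total as text such as `45 comments`; the pattern `COUNT_RE` and the function name are hypothetical, and a real page would need its own regular expression.

```python
import re
from typing import Tuple

# Hypothetical marker: assumes the page prints the total as "<number> comments".
COUNT_RE = re.compile(r"(\d+)\s+comments")


def count_new_comments(page_html: str, previous_count: int = 0) -> Tuple[int, int]:
    """Return (N, M): the current total and the number of new comments.

    previous_count is P; it defaults to 0 for the first collection.
    M is clamped at 0, so step 132 reduces to checking M > 0.
    """
    match = COUNT_RE.search(page_html)
    current = int(match.group(1)) if match else 0   # N
    new = max(current - previous_count, 0)          # M = N - P when N > P
    return current, new
```

Note that this count-based test reports no new comments when the total shrinks or stays the same, which is the limitation that motivates the comparison-based variant described later in the text.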
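Referring again to FIG. 3, step 133 specifically includes:
Step 1331: Calculate the number L of network comments contained on the current page, where L is a positive integer less than or equal to M;
Step 1332: Determine whether L is less than M;
Step 1333: When the result of step 1332 is yes, calculate the number of pages to turn and extract the page-turning links corresponding to that number of pages.
In step 1333, the formula for the number of pages to turn is:
$$P_{count} = \frac{M - L}{Perpage}$$
where $P_{count}$ denotes the number of pages to turn, $M$ denotes the number of updated comments, $L$ denotes the number of comments on the current webpage, and $Perpage$ denotes the number of comments on a single webpage.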
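The page-turning calculation of step 1333 can be sketched in a few lines. The source text does not say how a fractional result is handled; rounding up is an assumption made here so that a partially filled last page is still visited. The function and parameter names are hypothetical.

```python
def pages_to_turn(updated: int, on_current_page: int, per_page: int) -> int:
    """Number of further pages to visit after the current one.

    updated         -- M, the number of updated comments
    on_current_page -- L, the number of comments on the current page
    per_page        -- Perpage, the number of comments shown on one page
    Rounding up is an assumption; the patent text gives only the quotient.
    """
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    remaining = updated - on_current_page
    if remaining <= 0:
        return 0                          # L >= M: everything is on this page
    return -(-remaining // per_page)      # integer ceiling of remaining / per_page
```

With 30 updated comments, 15 on the current page, and 15 per page, one further page has to be turned; with 40 updated comments the remaining 25 spill over onto two further pages.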
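Referring again to FIG. 3, step 133 further includes:
Step 1334: Determine whether each of the N network comments is the same as each of the P network comments;
Step 1335: When the result of step 1334 is no, extract the M network comments whose comparison results differ.
The content of the M network comments extracted in step 1335 is saved to a storage unit different from the comment webpage. Comments saved to the storage unit can be browsed in one place, which makes it convenient for users to apply the collected network comments.
In this embodiment, news is time-sensitive: a news item older than a certain period is considered meaningless, and news comments, as an accessory to the news, lose their value along with it. For this reason, if the network comments have not been updated after a predetermined time, the news comment link is deleted and is no longer refreshed, which saves system resources and improves efficiency.
In another embodiment, when determining whether M of the N network comments satisfy the collection condition, the method of calculating the difference M between N and P in the foregoing embodiment need not be used. Instead, each of the N network comments is compared directly with each of the P network comments, and if the comparison results differ, the M network comments whose comparison results differ are extracted. This collection method is used because the website back end of the news page deletes network comments from time to time. For example, the system finds 15 network comments in its first collection. In the interval between two collections, the website back end for some reason deletes all 15 comments, and at the same time 30 new comments are added. Since a single webpage can display only 15 comments, the comments on both the first and the second comment page can be regarded as new. When the collection cycle arrives, the 30 comments collected this time are compared with the previous 15 comments. The comparison shows that all 30 comments differ from the previous 15, so 30 new comments should be collected this time. Further, the content of the 30 network comments collected this time is saved to a storage unit different from the comment webpage. Comments saved to the storage unit can be browsed in one place, which makes it convenient for users to apply the collected network comments.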
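The comparison-based condition can be sketched as below, replaying the 15-deleted, 30-added example from the text. The sketch assumes comments are compared by exact text equality, which the patent does not specify; it also swaps the literal each-against-each loop for a set lookup, which gives the same result for exact matching while avoiding N times P comparisons. The function name is hypothetical.

```python
from typing import List


def select_new_comments(current: List[str], previous: List[str]) -> List[str]:
    """Return the comments in `current` that match no comment in `previous`.

    Assumes exact text equality; order of `current` is preserved.
    """
    seen = set(previous)                           # previous P comments
    return [c for c in current if c not in seen]   # the M differing comments


# The example from the text: 15 old comments deleted, 30 new ones added.
old_comments = [f"old comment {i}" for i in range(15)]
new_comments = [f"new comment {i}" for i in range(30)]
to_collect = select_new_comments(new_comments, old_comments)
```

Here `to_collect` holds all 30 new comments, as the text concludes, and the result does not depend on how the page's running total changed between visits.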
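The first embodiment of the present invention provides a network data collection system. FIG. 4 is a system architecture diagram of this embodiment. As shown in FIG. 4, the system includes an entry link obtaining component 10, a first determining component 20, a second determining component 30, and a content collection component 40. The entry link obtaining component 10 is configured to obtain a webpage entry link address. The first determining component 20 is configured to determine whether there are N network comments on the webpage corresponding to the webpage entry link address. The second determining component 30 is configured to determine whether there are M network comments satisfying the collection condition. The content collection component 40 is configured to collect network comments.
The entry link obtaining component 10 includes a first obtaining unit 101, a second obtaining unit 102, a third obtaining unit 103, and a splicing unit 104. The first obtaining unit 101 is configured to obtain the topic webpage on which the topic commented on by the N network comments is located; the second obtaining unit 102 is configured to obtain the feature code of the topic webpage; the third obtaining unit 103 is configured to obtain the feature code of the channel to which the topic belongs; the splicing unit 104 is configured to splice the feature code of the topic webpage and the feature code of the channel to obtain the webpage entry link address.
To determine whether there are M network comments satisfying the collection condition, the second determining component 30 extracts the N network comments from the webpage and calculates the difference M between N and P, where P is the number of network comments extracted the last time the link was accessed. It then determines whether M is greater than zero; if so, the M network comments are comments satisfying the collection condition.
In the second embodiment, unlike the first embodiment, the system further includes an entry link address refreshing component 50 for periodically refreshing the webpage entry link address. In this embodiment, the entry link address refreshing component 50 can be used together with the entry link obtaining component 10 so that updated network comments are collected in time.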
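In the third embodiment, unlike the first and second embodiments, the system further includes a network comment page refreshing component 60, which determines whether the network comments on the webpage have gone without an update for longer than a predetermined time and, if so, deletes the webpage entry link address. In this embodiment, the network comment page refreshing component 60 can be used together with the first determining component 20 to improve collection efficiency: network comments that have not been updated for a long time are no longer collected.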
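The expiry rule handled by component 60 can be sketched as a filter over recorded entry links. The data layout is an assumption made for illustration: a dictionary mapping each entry link to the timestamp (in seconds) of its last observed comment update. The function name and parameters are hypothetical.

```python
from typing import Dict


def prune_stale_entries(
    last_update: Dict[str, float], now: float, max_idle_seconds: float
) -> Dict[str, float]:
    """Keep only entry links whose comments were updated within the allowed window.

    last_update maps entry link -> time of the last observed comment update.
    Links idle for longer than max_idle_seconds are dropped and thus
    no longer refreshed or collected.
    """
    return {
        url: updated_at
        for url, updated_at in last_update.items()
        if now - updated_at <= max_idle_seconds
    }
```

Returning a new dictionary rather than deleting in place keeps the function free of side effects, so the caller decides when the pruned set replaces the recorded one.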
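For the second and third embodiments, refer to FIG. 5 and FIG. 6 respectively. In practice, the two embodiments can be combined so that network comments are collected comprehensively while the collection efficiency of the system is improved.
In the fourth embodiment, unlike the first, second, and third embodiments, the content collection component 40 further includes a page-turning extraction component 401, a comparing component 402, a content extracting component 403, and a disk I/O component 404. The page-turning extraction component 401 is configured to calculate the number of pages to turn and extract the page-turning links corresponding to that number of pages; the comparing component 402 is configured to compare each of the N network comments with each of the P network comments; the content extracting component 403 is configured to extract, when the comparison results differ, the network comments whose comparison results differ. The disk I/O component 404 is configured to save the extracted network comment content to a storage unit different from the webpage. For this embodiment, refer to FIG. 7.
Another embodiment of the present invention provides a network data collection system. FIG. 8 is a system architecture diagram of this embodiment.
This embodiment differs from the first embodiment in that it does not include the comparing component 402 or the content extracting component 403. As shown in FIG. 8, the system of this embodiment includes an entry link obtaining component 80, a first determining component 81, a second determining component 82, and a content collection component 83. The entry link obtaining component 80 is configured to obtain a webpage entry link address. The first determining component 81 is configured to determine whether there are network comments on the webpage corresponding to the webpage entry link address. The second determining component 82 is configured to determine whether there are network comments satisfying the collection condition. The content collection component 83 is configured to collect network comments.
The entry link obtaining component 80 includes a first obtaining unit 801, a second obtaining unit 802, a third obtaining unit 803, and a splicing unit 804. The first obtaining unit 801 is configured to obtain the topic webpage on which the topic commented on by the N network comments is located; the second obtaining unit 802 is configured to obtain the feature code of the topic webpage; the third obtaining unit 803 is configured to obtain the feature code of the channel to which the topic belongs; the splicing unit 804 is configured to splice the feature code of the topic webpage and the feature code of the channel to obtain the webpage entry link address.
The second determining component 82 is configured to compare each of the N network comments with each of the P network comments and, if the comparison results differ, to determine that the M network comments whose comparison results differ are the network comments satisfying the collection condition.
The content collection component 83 includes a page-turning extraction component 831 and a disk I/O component 832. The page-turning extraction component 831 is configured to calculate the number of pages to turn and extract the page-turning links corresponding to that number of pages; the disk I/O component 832 is configured to save the extracted network comment content to a storage unit different from the webpage.
In this embodiment, the entry link obtaining component 80 can be used together with the entry link address refreshing component 84 of the second embodiment to collect network comments more comprehensively. The first determining component 81 can be used together with the network comment page refreshing component 85 of the third embodiment to collect network comments comprehensively and efficiently.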
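The systems of the first, second, third, fourth, and other embodiments above can be implemented according to the description of the method and its variations given in the embodiments of the network comment collection method provided by the present invention. For brevity, the details are not repeated here.
An embodiment of the present invention uses a network comment collection system to collect network comments, and achieves the technical effect of comprehensively collecting network comments by obtaining the entry link address of the network comments and setting a collection condition.
Further, a comparing component is used to compare every comment extracted this time with every comment extracted last time, and a content extracting component then extracts only the comments whose comparison results differ, so efficient collection is achieved on top of comprehensive collection.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction device, which implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Although preferred embodiments of the present invention have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to cover them as well.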

Claims

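1. A method for collecting network comments, characterized by comprising:
obtaining a webpage entry link address;
determining whether there are N network comments on the webpage corresponding to the webpage entry link address, where N is a positive integer;
when there are N network comments, determining whether M of the N network comments satisfy a collection condition, where M is a positive integer less than or equal to N; and
when M network comments satisfy the collection condition, collecting the M network comments.
2. The method according to claim 1, characterized in that obtaining the webpage entry link address specifically comprises:
obtaining the topic webpage on which the topic commented on by the N network comments is located;
obtaining a feature code of the topic webpage;
obtaining a feature code of the channel to which the topic belongs; and
splicing the feature code of the topic webpage and the feature code of the channel to obtain the webpage entry link address.
3. The method according to claim 2, characterized in that the method further comprises: periodically refreshing the webpage entry link address.
4. The method according to claim 1, characterized in that the method further comprises: when the network comments on the webpage have not been updated for longer than a predetermined time, deleting the webpage entry link address.
5. The method according to claim 1, characterized in that determining whether M of the N network comments satisfy the collection condition specifically comprises: calculating the difference between N and P; if N is greater than P, there are newly added network comments, and their number is the difference M between N and P, where P is the number of network comments when the page was last accessed.
6. The method according to claim 5, characterized in that the method further comprises: calculating the number L of network comments contained on the current page; if L is less than M, calculating the number of pages to turn and extracting the page-turning links corresponding to that number of pages, where L is a positive integer.
7. The method according to claim 5, characterized in that the method further comprises: comparing each of the N network comments with each of the P network comments, and if the comparison results differ, extracting the M network comments whose comparison results differ.
8. The method according to claim 1, characterized in that determining whether M of the N network comments satisfy the collection condition specifically comprises: comparing each of the N network comments with each of the P network comments, and if the comparison results differ, determining that the M network comments whose comparison results differ are the network comments satisfying the collection condition.
9. The method according to claim 1, characterized in that the method further comprises: saving the content of the extracted M network comments to a storage unit different from the webpage.
10. A system for collecting network comments, characterized by comprising:
an entry link obtaining component, configured to obtain a webpage entry link address;
a first determining component, configured to determine whether there are N network comments on the webpage corresponding to the webpage entry link address, where N is a positive integer;
a second determining component, configured to determine, when there are N network comments, whether M of the N network comments satisfy a collection condition, where M is a positive integer less than or equal to N; and
a content collection component, configured to collect the M network comments when M network comments satisfy the collection condition.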
PCT/CN2012/086575 2011-12-13 2012-12-13 Method and system for collecting network comments WO2013087005A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2014532240A JP2014532220A (ja) 2011-12-13 2012-12-13 Method and system for collecting net comments
EP12857642.8A EP2713287A4 (en) 2011-12-13 2012-12-13 METHOD AND SYSTEM FOR COLLECTING NETWORK COMMENTS
US14/127,400 US20140289395A1 (en) 2011-12-13 2012-12-13 Network comment collection method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201110415749.9A CN103164438B (zh) 2011-12-13 2011-12-13 Method and system for collecting network comments
CN201110415749.9 2011-12-13

Publications (1)

Publication Number Publication Date
WO2013087005A1 (zh) 2013-06-20

Family

ID=48587532

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/086575 WO2013087005A1 (zh) 2011-12-13 2012-12-13 Method and system for collecting network comments

Country Status (5)

Country Link
US (1) US20140289395A1 (zh)
EP (1) EP2713287A4 (zh)
JP (1) JP2014532220A (zh)
CN (1) CN103164438B (zh)
WO (1) WO2013087005A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
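CN103902674B (zh) * 2014-03-19 2017-10-27 百度在线网络技术(北京)有限公司 Method and device for collecting comment data on a specific topic
CN106250417A (zh) * 2016-07-22 2016-12-21 乐视控股(北京)有限公司 Method and device for obtaining comment data
CN108520441A (zh) * 2018-04-04 2018-09-11 网易无尾熊(杭州)科技有限公司 Data processing method, medium, system, and computing device
KR20220112755A (ko) 2019-12-06 2022-08-11 소니그룹주식회사 Information processing system, information processing method, and storage medium
JP7284336B1 (ja) 2022-12-27 2023-05-30 ヤフー株式会社 Content providing device, content providing method, and program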

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6707470B1 (en) * 1999-05-21 2004-03-16 Nec Corporation Apparatus for and method of gathering information, which can automatically obtain HTML file of URL even if user does not specify URL
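CN101178713A (zh) * 2006-11-29 2008-05-14 腾讯科技(深圳)有限公司 Method and system for collecting web pages
CN101436196A (zh) * 2008-11-25 2009-05-20 北京邮电大学 Method for constructing an automatically and dynamically updating forum crawler system
CN101515269A (zh) * 2008-02-20 2009-08-26 中国科学院自动化研究所 Method for implementing ranking in an opinion search engine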

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3918230B2 (ja) * 1996-04-30 2007-05-23 セイコーエプソン株式会社 Data update monitoring server
US7464097B2 (en) * 2002-08-16 2008-12-09 Sap Ag Managing data integrity using a filter condition
JP2006309515A (ja) * 2005-04-28 2006-11-09 Dainippon Printing Co Ltd Information distribution method and information distribution server
US20100205168A1 (en) * 2009-02-10 2010-08-12 Microsoft Corporation Thread-Based Incremental Web Forum Crawling
US20130041901A1 (en) * 2011-08-12 2013-02-14 Rawllin International Inc. News feed by filter

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6707470B1 (en) * 1999-05-21 2004-03-16 Nec Corporation Apparatus for and method of gathering information, which can automatically obtain HTML file of URL even if user does not specify URL
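CN101178713A (zh) * 2006-11-29 2008-05-14 腾讯科技(深圳)有限公司 Method and system for collecting web pages
CN101515269A (zh) * 2008-02-20 2009-08-26 中国科学院自动化研究所 Method for implementing ranking in an opinion search engine
CN101436196A (zh) * 2008-11-25 2009-05-20 北京邮电大学 Method for constructing an automatically and dynamically updating forum crawler system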

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2713287A4 *

Also Published As

Publication number Publication date
EP2713287A4 (en) 2015-06-03
CN103164438B (zh) 2016-07-06
CN103164438A (zh) 2013-06-19
JP2014532220A (ja) 2014-12-04
US20140289395A1 (en) 2014-09-25
EP2713287A1 (en) 2014-04-02

Similar Documents

Publication Publication Date Title
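JP7356206B2 (ja) Content recommendation and display
CN103885987B (zh) Music recommendation method and system
CN105095211B (zh) Method and device for acquiring multimedia data
JP5823620B2 (ja) Method and system for collecting network data
WO2015196907A1 (zh) Search push method and device for mining user needs
CN102708174B (zh) Method and device for displaying rich media information in a browser
WO2018000557A1 (zh) Method and device for displaying search results
CN106415540B (zh) Federated search
WO2013087005A1 (zh) Network comment collection method and system
CN102446225A (zh) Real-time search method, device, and system
WO2015196906A1 (zh) Method and device for obtaining disease consultation information based on search
JP2009532774A5 (zh)
CN107784059A (zh) Method and system for searching and selecting images, and machine-readable medium
JP2017535860A (ja) Method and device for providing multimedia content
US20140006418A1 (en) Method and apparatus for ranking apps in the wide-open internet
CN107766399A (zh) Method and system for matching images with content items, and machine-readable medium
US10275472B2 (en) Method for categorizing images to be associated with content items based on keywords of search queries
EP3260968A1 (en) Method and apparatus for displaying electronic picture, and mobile device
JP2015502608A5 (zh)
WO2021129122A1 (zh) Method for displaying a book query page, electronic device, and computer storage medium
CN104035972A (zh) Microblog-based knowledge recommendation method and system
CN104090923A (zh) Method and device for displaying rich media information in a browser
CN106682206A (zh) Big data processing method and system
CN109947935A (zh) Method and device for generating news events
WO2014059851A1 (zh) Search server and search method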

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12857642

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2014532240

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2012857642

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 14127400

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE