WO2015172567A1 - Internet information search aggregation presentation method - Google Patents
Internet information search aggregation presentation method
- Publication number
- WO2015172567A1 (PCT/CN2014/095164)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- content
- webpage
- page
- node
- dom tree
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Definitions
- The invention relates to an Internet information search, aggregation, and presentation method, belonging to the technical field of computer networks.
- A search engine is a system that automatically collects information from the Internet, organizes it, and makes it available for users to query.
- Information on the Internet is vast and unordered. Each piece of information is like a small island in an ocean, web links are the bridges between these islands, and the search engine draws a clear information map that users can consult at any time.
- This patent designs a method for aggregating search results that are highly homogeneous or similar, so that information from different sources is combined (that is, part of the analysis is done for the user) and users receive a valuable information service rather than an "information transfer station."
- The invention provides a new Internet search information integration and presentation method, which integrates and aggregates the core information of homologous or similar webpages to provide valuable information services for users.
- The goal of the invention is to give users valuable aggregated information, unlike existing search engines, which merely return a list of links to pages that contain the information.
- An Internet information search aggregation presentation method, the steps of which are:
- The method for extracting the body text content of the crawled webpages is:
- Step 23): according to the target webpage DOM tree and the reference webpage DOM tree processed in step 22), determine the core content paths of the target webpage and the reference webpage, and extract the webpage body text.
- The method for deleting the nodes that are identical in the target webpage DOM tree and the reference webpage DOM tree is:
- The method for determining the core content path is: calculate the text count of each node in the target webpage DOM tree and the reference webpage DOM tree, and delete a node if its text count is less than the set text count threshold. The remaining text-containing nodes in the target webpage DOM tree and the reference webpage DOM tree are extracted as the core content path of the webpage corresponding to each DOM tree.
- For each node that contains a link element <a>, compute the link text density of the node; if it is greater than the set density threshold, delete the node.
- Homogeneous content and differentiated content are extracted from all the webpages in each similar page group.
- The method for generating the page Pi is: fuse the homogeneous content and the differentiated content into a new document, in which the homogeneous content is set in bold and the homogeneous and differing content are presented in different colors. The original addresses of all the webpages in the similar page group are then attached to the document, and a new URL, URLi, is dynamically created for it, generating the page Pi.
- The method for generating the similar page groups is: traverse the webpages in the candidate result set pairwise and calculate the string matching degree T of the titles, the matching degree L of the effective content lengths, and the overlap F of the N most frequent keywords of the pages. The similarity of two pages is then S = alpha*T + beta*L + gamma*F, where alpha, beta, and gamma each lie in [0, 1] and alpha + beta + gamma = 1; pages whose similarity S is greater than a set threshold form one group.
- Homogenized information is first sought from the candidate result set, the webpages in the candidate result set are clustered according to their degree of homogenization, and the webpages in each cluster are then traversed pairwise to calculate page similarity.
- The query word and the finally formed aggregation result are saved in a database and indexed; when a new query word is input, the corresponding aggregation result is retrieved according to the index.
- The "intrinsic template based web page body content extraction" algorithm does not involve sample annotation or the convergence and cycle time of learning algorithms, and it makes no assumptions about the language of the web content, the design style of the page, or the type of page template. This greatly improves the efficiency of the algorithm, reduces labor cost, and generalizes well to core content extraction for modern website pages.
- The impurity content deletion and core content path extraction algorithms used within the "intrinsic template based web page body content extraction" algorithm may vary with requirements and scalability concerns; the description gives only a reference version. In practical applications, suitable algorithms (including statistical algorithms, machine learning algorithms, and so on) can be chosen for the situation at hand, or these steps can be omitted entirely.
- The method for obtaining the reference webpage in the intrinsic template based extraction algorithm can likewise be designed flexibly for the actual application and is not limited to the strategy proposed here.
- With this invention, the query results a user obtains are more targeted, their content is less redundant than what the user would collect by searching alone, and they are more accurate and cleaner because useless information such as advertisements is removed. The invention offers a more diversified presentation of content for reading and a more convenient way to extend and supplement purposeful reading.
- The present invention designs a method for aggregating homogeneous or similar search results, with the aim of aggregating information from different sources (that is, helping users analyze it) and directly providing users with valuable information services.
- The invention remedies the shortcoming of existing search engines, which act as mere "information transfer stations."
- Figure 1 is a flow chart of the method of the present invention.
- For a user's query, the system first checks whether the aggregated content library already holds a cached result. If so, it responds directly with the aggregated content in the form of hierarchical information and renders it on the user's page. If the aggregated content library has no relevant content, the system retrieves the related pages from the page library index using the query, performs the similarity comparison and aggregation operations to form the response data source, organizes the result data hierarchically, and finally displays the result.
- The web index library is built by web crawlers that crawl the Internet, extract web content, and establish the related indexes.
- Core extraction is carried out by the "intrinsic template based web page body content extraction" algorithm, which removes irrelevant advertisement links, website navigation bars, copyright notices, and similar information from the webpage so that its content is more accurate and concise.
- The webpage template summarized by the algorithm can be used to quickly extract the content of webpages on the same topic and to speed up the processing of basic resource data. The "intrinsic template based web page body content extraction" algorithm is as follows:
- The method for calculating URL similarity is:
- The URL similarity between a target webpage and a reference webpage that is truly useful for body text extraction is 1. That is, by setting a threshold, a page whose URL is sufficiently similar to that of the page to be extracted is found, and the two pages then serve as a "target page, reference page" pair for body text extraction.
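The text above does not preserve the URL similarity formula itself, so the following is only a minimal sketch under stated assumptions: similarity is taken as the share of leading directory segments two URLs have in common, and pages on different hosts score 0. The names `url_similarity` and `pick_reference` are hypothetical and not part of the patent.

```python
from urllib.parse import urlsplit


def url_similarity(a: str, b: str) -> float:
    """Assumed measure: fraction of shared leading directory segments (same host only)."""
    ua, ub = urlsplit(a), urlsplit(b)
    if ua.netloc.lower() != ub.netloc.lower():
        return 0.0
    # Keep only the directory part of each path (drop the final file segment).
    dirs_a = [s for s in ua.path.split("/") if s][:-1]
    dirs_b = [s for s in ub.path.split("/") if s][:-1]
    common = 0
    for x, y in zip(dirs_a, dirs_b):
        if x != y:
            break
        common += 1
    longest = max(len(dirs_a), len(dirs_b))
    return 1.0 if longest == 0 else common / longest


def pick_reference(target: str, crawled: list[str]):
    """Return the crawled URL most similar to the target, or None if there is no other page."""
    others = [u for u in crawled if u != target]
    return max(others, key=lambda u: url_similarity(target, u), default=None)
```

Under this assumed measure, two articles in the same channel directory of one site score 1, which matches the idea that such pairs are the useful "target page, reference page" pairs.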
- Auxiliary nodes unrelated to the core content of the web page, such as <style>, <script>, <noscript>, <link>, and <meta>, are removed from the target web page and the reference web page.
- Template node processing: the identical-node deletion algorithm (template node deletion) is applied by treating the content of each node in the target webpage DOM tree and the reference webpage DOM tree as text. The deletion algorithm is as follows:
- The link text density of a node containing the link element <a> is (the text count contained in the link element) / (the total text count contained in the parent node of the link element).
- The value of this indicator lies in the interval [0, 1]. If it is greater than a certain threshold, the node (the parent of the link element) can be considered to have little relevance to the core content of the web page, so the node can be deleted.
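The link text density rule can be sketched as follows. This is an illustration, not the patent's implementation: the `Node` class is a hypothetical stand-in for a parsed DOM node, text is counted in characters, links anywhere below a node are counted (not only direct children), and the 0.5 threshold is an arbitrary assumption.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""                      # text directly inside this node
    children: list[Node] = field(default_factory=list)


def text_length(node: Node) -> int:
    """Characters of text in a node and all of its descendants."""
    return len(node.text.strip()) + sum(text_length(c) for c in node.children)


def link_text_length(node: Node) -> int:
    """Characters of text that sit inside <a> elements at or below this node."""
    if node.tag == "a":
        return text_length(node)
    return sum(link_text_length(c) for c in node.children)


def link_text_density(node: Node) -> float:
    """Link text divided by total text; always within [0, 1]."""
    total = text_length(node)
    return link_text_length(node) / total if total else 0.0


def prune_link_heavy(node: Node, threshold: float = 0.5) -> None:
    """Delete, in place, link-containing children whose density exceeds the threshold."""
    kept = []
    for child in node.children:
        has_links = child.tag != "a" and link_text_length(child) > 0
        if has_links and link_text_density(child) > threshold:
            continue  # mostly link text: treated as navigation or similar impurity
        prune_link_heavy(child, threshold)
        kept.append(child)
    node.children = kept
```

A navigation bar made only of links has density 1 and is dropped, while an article paragraph with one short inline link stays well below the threshold.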
- The threshold is inferred from the text-count properties of the nodes across the entire DOM tree. The text features of a node include the text count of the node, the text count of nodes containing the link element <a>, and so on; statistical methods applied to some sample web pages can summarize the text features of the smallest node containing the body content and yield a threshold. The threshold serves to distinguish core content nodes from other nodes with obvious non-body-text features, which are then deleted according to the threshold.
- The body of the web page is then extracted. After the smallest node containing the core content is located, all the nodes on the path from that node up to the <body> node are recorded in turn; these nodes constitute the path from the root node <body> to the core content node.
- Extracting this path makes body text extraction easier for web pages that share the same intrinsic template, because it narrows the scope of the "intrinsic template based web page body content extraction" algorithm from the DOM nodes of the whole page to the smallest node that contains only the core content.
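One way to picture the core content path is the sketch below. It is an assumption-laden illustration: the hypothetical `Node` class carries a `label` such as `div#main`, text is counted in characters, the threshold `min_chars` is arbitrary, and the "smallest node containing the core content" is approximated by descending from `<body>` while exactly one child still meets the threshold.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                          # e.g. "body" or "div#main" (illustrative only)
    text: str = ""
    children: list[Node] = field(default_factory=list)


def text_count(node: Node) -> int:
    """Characters of text in a node and all of its descendants."""
    return len(node.text.strip()) + sum(text_count(c) for c in node.children)


def core_content_path(body: Node, min_chars: int = 50) -> list[str]:
    """Record the labels from <body> down to the smallest node holding the surviving text."""
    path = [body.label]
    node = body
    while True:
        # Children below the text-count threshold are treated as deleted.
        survivors = [c for c in node.children if text_count(c) >= min_chars]
        if len(survivors) != 1:
            return path  # the text splits here (or ends), so this node is the smallest container
        node = survivors[0]
        path.append(node.label)
```

Once such a path is known for one page, it could be reused to jump straight to the body text of other pages built from the same template.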
- The present invention recursively traverses the target webpage DOM tree, streamlines the DOM tree structure, and removes elements that interfere with template node processing, thereby improving the precision and computational efficiency of the algorithm.
- Pages of the same channel or the same topic on the same website often use the same template, as well as the same styles and component scripts, for reasons of design style and development efficiency.
- The layout of their content within the DOM tree therefore tends to follow a traceable pattern (such web pages generally have a very high similarity when they belong to the same website).
- Hot content recommendations, site navigation, site copyright information, and the like are almost identical across the DOM nodes of pages that share a template, whereas the core content of topic pages differs at some level and node of the DOM tree because the content itself differs. Elements unrelated to the core content of the web page can therefore be removed by aligning the DOM trees.
- Nodes whose tags, attributes, and content are exactly the same are more likely to be unrelated to the core content of the page, so they can be deleted.
- What remains is the unique content related to the subject of the webpage, plus a small amount of differing information such as timestamps and user statistics. This greatly improves the accuracy of core content extraction.
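The DOM tree alignment described above (delete nodes that match in tag, attributes, and content; recurse into nodes that match only in tag and attributes) can be sketched like this. The `Node` class and the `serialize` helper are hypothetical simplifications of a real DOM, and matching each target node against the first unmatched candidate is an assumption made to keep the example short.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    attrs: dict[str, str] = field(default_factory=dict)
    text: str = ""
    children: list[Node] = field(default_factory=list)


def serialize(node: Node) -> str:
    """Flatten a subtree to text so two nodes can be compared as strings."""
    attrs = "".join(f' {k}="{v}"' for k, v in sorted(node.attrs.items()))
    inner = "".join(serialize(c) for c in node.children)
    return f"<{node.tag}{attrs}>{node.text.strip()}{inner}</{node.tag}>"


def remove_shared_nodes(target: Node, reference: Node) -> None:
    """Delete from both trees the nodes that are identical; recurse where only tag and attributes match."""
    unmatched = list(reference.children)
    kept = []
    for t in target.children:
        # Quasi-identical node: same tag and same attribute key-value pairs.
        match = next((r for r in unmatched if r.tag == t.tag and r.attrs == t.attrs), None)
        if match is None:
            kept.append(t)
            continue
        unmatched = [r for r in unmatched if r is not match]
        if serialize(t) == serialize(match):
            # Fully identical: template content, removed from both trees.
            reference.children = [r for r in reference.children if r is not match]
        else:
            remove_shared_nodes(t, match)
            kept.append(t)
    target.children = kept
```

Applied to two articles from the same channel, the shared navigation and footer nodes disappear from both trees and only the differing article nodes remain.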
- The page content processed by the "intrinsic template based web page content extraction" algorithm is stored and indexed so that a result page set (ResultSet) can be built for the user's query.
- ResultSet: result page set
- The user's query is searched in the established index to retrieve the corresponding web pages.
- These pages constitute the query result page set, and the body content of the pages in the set is fed to the aggregation algorithm to form the final processing results.
- The aggregation results are cached in the aggregation library so that the next identical query can be answered quickly.
- After receiving the query word submitted by the user, the system first checks whether response content for the query word exists in the aggregated content library (ContentDB). If it exists, the aggregated result set is returned directly as the search result and the process ends; if it does not exist, go to step 3.
- ContentDB: aggregated content library
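The cache-first flow can be sketched as below. This is a simplified illustration: a plain dictionary stands in for ContentDB, and `search_index` and `aggregate` are hypothetical callables representing index retrieval and the aggregation algorithm.

```python
from typing import Any, Callable


def answer_query(
    query: str,
    content_db: dict[str, Any],
    search_index: Callable[[str], list],
    aggregate: Callable[[list], Any],
) -> Any:
    """Return the cached aggregate if present; otherwise search, aggregate, and cache."""
    cached = content_db.get(query)
    if cached is not None:
        return cached                      # response content already exists: done
    candidates = search_index(query)       # candidate result set (ResultSet)
    result = aggregate(candidates)         # similarity comparison and fusion
    content_db[query] = result             # saved for the next identical query
    return result
```

A second call with the same query word never reaches the index, which is the point of caching the aggregation result.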
- The candidate result set (ResultSet) is obtained by searching the latest index libraries (the index library for ordinary webpages and the index library for webpages with high real-time requirements, such as news), whose update frequencies differ.
- The choice of values takes into account the resource type of the pages in the candidate result set: for example, whether a page is mainly text, images, or video. The alpha value is larger for image-set and video pages.
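The weighted similarity S = alpha*T + beta*L + gamma*F stated in claim 8 can be sketched as follows. The source does not say how T, L, and F are computed, so the choices here are assumptions: `difflib.SequenceMatcher` for the title match T, a min/max length ratio for L, and the Jaccard overlap of the N most frequent words for F. The default weights, the threshold, and the greedy grouping are likewise illustrative.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def top_keywords(text: str, n: int) -> set[str]:
    """The n most frequent words of a text (a crude stand-in for keyword extraction)."""
    words = re.findall(r"\w+", text.lower())
    return {w for w, _ in Counter(words).most_common(n)}


def page_similarity(a: dict, b: dict, alpha=0.4, beta=0.2, gamma=0.4, n=10) -> float:
    """S = alpha*T + beta*L + gamma*F, with alpha + beta + gamma = 1."""
    t = SequenceMatcher(None, a["title"], b["title"]).ratio()       # title match T
    la, lb = len(a["body"]), len(b["body"])
    l = min(la, lb) / max(la, lb) if max(la, lb) else 1.0           # length match L
    ka, kb = top_keywords(a["body"], n), top_keywords(b["body"], n)
    f = len(ka & kb) / len(ka | kb) if ka | kb else 1.0             # keyword overlap F
    return alpha * t + beta * l + gamma * f


def group_similar(pages: list[dict], threshold: float = 0.6) -> list[list[dict]]:
    """Greedy grouping: a page joins the first group containing a sufficiently similar page."""
    groups: list[list[dict]] = []
    for page in pages:
        for group in groups:
            if any(page_similarity(page, other) > threshold for other in group):
                group.append(page)
                break
        else:
            groups.append([page])
    return groups
```

For image-set or video pages, where body text is sparse, a caller would raise `alpha` (and lower the other weights accordingly) so that the title carries more of the decision.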
- Homogenized information is sought from the ResultSet first. Popular content tends to attract more searches and reprints, and it is more common on news, encyclopedia, blog, and similar websites, so aggregating in advance the page content indexed under such keywords speeds up responses to user queries; within the classified candidate result set, giving priority to aggregating news, encyclopedia, blog, and similar pages also improves aggregation efficiency.
- Such popular information can be obtained from data like the Baidu Index, and the homogenization decision can still use the algorithm mentioned in step 4. Aggregation then follows, which drastically reduces comparison time and makes the process more real-time and more efficient.
- This stage first extracts the content with high similarity and then the portions with high difference. The extraction can use the page-similarity algorithm of step 4, applied to the paragraphs of the body text rather than to the entire document, so that similar and differing content within the body are distinguished at a finer granularity.
- The homogeneous content and the differentiated content are fused into a new document Pi, in which the homogeneous content is set in bold (annotations can also be added next to the homogeneous content, for example a note that it comes from K different pages), and the homogeneous and differing content are presented in different colors. In addition, the original addresses (URLs) of all the webpages in Si are attached to the Pi document, and a new URL, URLi, is dynamically created for Pi so that users can access it.
- URL: original address
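The fusion of a similar page group into a document Pi might look like the sketch below. It makes simplifying assumptions: paragraphs count as homogeneous only when they match exactly after whitespace and case normalization (the text instead applies the step 4 similarity measure per paragraph), the colors are arbitrary, and the function name `fuse_pages` is hypothetical.

```python
from html import escape


def fuse_pages(pages: list[dict]) -> str:
    """pages: [{"url": str, "paragraphs": [str, ...]}, ...] -> HTML for the fused page."""
    seen: dict[str, dict] = {}
    order: list[str] = []
    for page in pages:
        for para in page["paragraphs"]:
            key = " ".join(para.split()).lower()   # normalized paragraph as the match key
            if key not in seen:
                seen[key] = {"text": para, "urls": []}
                order.append(key)
            if page["url"] not in seen[key]["urls"]:
                seen[key]["urls"].append(page["url"])
    parts = []
    for key in order:
        item = seen[key]
        if len(item["urls"]) > 1:
            # Homogeneous content: bold, one color, annotated with the number of source pages.
            parts.append(
                f'<p style="color:#003366"><b>{escape(item["text"])}</b>'
                f' (from {len(item["urls"])} pages)</p>'
            )
        else:
            # Differentiated content: a second color, not bold.
            parts.append(f'<p style="color:#555555">{escape(item["text"])}</p>')
    links = "".join(
        f'<li><a href="{escape(p["url"], quote=True)}">{escape(p["url"])}</a></li>'
        for p in pages
    )
    return "\n".join(parts) + f"\n<ul>{links}</ul>"
```

The trailing list carries the original addresses of every page in the group, so a reader can always get back to the sources behind the fused text.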
- A result page is returned to the user based on the pages and URLs generated in step 4. Taking the page Pi (the new page generated in step 4 to represent each page group) as an example: the first M characters of Pi are selected and associated with URLi to form the i-th result.
- The N results obtained in this way are displayed in turn on the returned page.
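Building the result entries is simple enough to show directly. In this sketch the field names `text` and `url` and the default of 120 characters for M are assumptions made for illustration.

```python
def build_results(fused_pages: list[dict], m: int = 120) -> list[dict]:
    """Pair the first m characters of each fused page Pi with its URLi, in order."""
    return [
        {"rank": i, "snippet": page["text"][:m], "url": page["url"]}
        for i, page in enumerate(fused_pages, start=1)
    ]
```

Each entry corresponds to one similar page group, so the list has as many items as there are fused pages.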
- The Pi document items can be distinguished from ordinary search engine results, for example by giving them a special background color.
- The above content is stored in the aggregated content library (ContentDB) as the result for the query word and is updated regularly to handle repeated queries from users.
- Ordinary search engine results are presented as separate summaries with hyperlinks to the original webpages, that is, the search engine functions as an information relay station. The search results presented by the system of the present invention are aggregated information produced by integrating web content, followed by a list of the source links behind that information.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Claims (10)
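- An Internet information search aggregation presentation method, comprising the steps of: 1) crawling pages on the Internet using a search engine, extracting the body content of the crawled webpages, and building an index for each webpage according to its body content; 2) searching the aggregated content library according to the input query word; if response content corresponding to the query word exists, returning it as the search result; if not, proceeding to step 3); 3) performing webpage retrieval with the built index according to the query word to obtain a candidate result set; 4) comparing the content similarity of the webpage bodies in the candidate result set, and grouping pages that are homogeneous or whose content similarity is greater than a set threshold, to obtain a series of similar page groups {S1, S2, …, Sk}; 5) for each similar page group Si, extracting the homogeneous content and the differentiated content of all webpages in the group and fusing them to generate a new page Pi; 6) returning each similar page group Si and its corresponding page Pi as the response content for the query word, and saving the query word and its corresponding response content in the aggregated content library.
- The method according to claim 1, characterized in that the body content of the crawled webpages is extracted by: 21) taking a webpage from the crawled webpage set as the target webpage, searching the set for the webpage whose URL is most similar to that of the target webpage as the reference webpage, and converting the two pages into corresponding DOM trees; 22) deleting the nodes that are identical in the target webpage DOM tree and the reference webpage DOM tree; 23) according to the target webpage DOM tree and the reference webpage DOM tree processed in step 22), determining the core content paths of the target webpage and the reference webpage and extracting the webpage body text.
- The method according to claim 2, characterized in that the nodes that are identical in the target webpage DOM tree and the reference webpage DOM tree are deleted by: 31) starting from the first-level nodes of the target webpage DOM tree, for the nodes of each level, searching the reference webpage DOM tree for quasi-identical nodes, that is, nodes with the same tag and the same attribute key-value pairs; 32) treating quasi-identical nodes as lines of text and comparing them as strings line by line; if the corresponding text lines of two nodes are exactly the same, the two nodes are identical and the node is deleted from each of the two DOM trees; if the corresponding text lines differ, the child nodes of the node are compared recursively level by level, and identical nodes are found and deleted from each of the two DOM trees, until the target webpage DOM tree no longer contains any node identical to one in the reference webpage DOM tree.
- The method according to claim 2 or 3, characterized in that the core content path is determined by: calculating the text count of each node in the target webpage DOM tree and the reference webpage DOM tree, and deleting a node if its text count is less than the set text count threshold; and extracting the remaining text-containing nodes in the target webpage DOM tree and the reference webpage DOM tree as the core content path of the webpage corresponding to each DOM tree.
- The method according to claim 2 or 3, characterized in that, before the core content path is determined, impurity content is deleted from the target webpage DOM tree and the reference webpage DOM tree by: calculating the link text density of each node that contains a link element <a> in the target webpage DOM tree and the reference webpage DOM tree, and deleting the node if the density is greater than the set density threshold.
- The method according to claim 1, characterized in that the homogeneous content and the differentiated content of all webpages in each similar page group are extracted with paragraphs of the webpage body text as the unit.
- The method according to claim 6, characterized in that the page Pi is generated by: fusing the homogeneous content and the differentiated content into a new document, in which the homogeneous content is set in bold and the homogeneous and differing content are presented in different colors; then attaching the original addresses of all webpages in the corresponding similar page group to the document and dynamically creating a new URL, URLi, for it, thereby generating the page Pi.
- The method according to claim 1, characterized in that the similar page groups are generated by: traversing the webpages in the candidate result set pairwise; calculating the string matching degree T of the titles, the matching degree L of the effective content lengths of the webpages, and the overlap F of the N most frequent keywords of the pages; obtaining the similarity S of two pages according to S = alpha*T + beta*L + gamma*F; and grouping pages whose similarity S is greater than a set threshold; wherein alpha, beta, and gamma each take values in the interval [0, 1] and alpha + beta + gamma = 1.
- The method according to claim 8, characterized in that homogenized information of the webpages is first sought from the candidate result set, the webpages in the candidate result set are clustered according to their degree of homogenization, and the webpages in each cluster are then traversed pairwise to calculate page similarity.
- The method according to claim 1, characterized in that the query word and the finally formed aggregation result are saved in a database and indexed; when a new query word is input, the corresponding aggregation result is retrieved according to the index.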
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410198228.6A CN103955529B (zh) | 2014-05-12 | 2014-05-12 | Internet information search aggregation presentation method |
CN201410198228.6 | 2014-05-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015172567A1 (zh) | 2015-11-19 |
Family
ID=51332804
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2014/095164 WO2015172567A1 (zh) | Internet information search aggregation presentation method | 2014-05-12 | 2014-12-26 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN103955529B (zh) |
WO (1) | WO2015172567A1 (zh) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
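CN103955529B (zh) * | 2014-05-12 | 2018-05-01 | 中国科学院计算机网络信息中心 | Internet information search aggregation presentation method |
CN104834703A (zh) * | 2015-04-29 | 2015-08-12 | 深圳市梦网科技股份有限公司 | Retrieval method and system |
CN106802899B (zh) * | 2015-11-26 | 2020-11-24 | 北京搜狗科技发展有限公司 | Webpage body text extraction method and device |
CN106855859B (zh) * | 2015-12-08 | 2020-11-10 | 北京搜狗科技发展有限公司 | Webpage body text extraction method and device |
CN106326447B (zh) * | 2016-08-26 | 2019-06-21 | 北京量科邦信息技术有限公司 | Method and system for detecting data captured by crowdsourced web crawlers |
CN106372214A (zh) * | 2016-09-05 | 2017-02-01 | 青岛海信宽带多媒体技术有限公司 | Webpage display control method and intelligent terminal |
CN106777206A (zh) * | 2016-12-23 | 2017-05-31 | 北京奇虎科技有限公司 | Method and device for searching and presenting film and television drama keywords |
CN106844540B (zh) * | 2016-12-30 | 2021-02-05 | 腾讯科技(深圳)有限公司 | Information processing method and device |
CN107656985B (zh) * | 2017-09-11 | 2020-11-27 | 北京京东尚科信息技术有限公司 | Webpage query method and system |
CN107748802A (zh) * | 2017-11-17 | 2018-03-02 | 北京百度网讯科技有限公司 | Article aggregation method and device |
CN110162356B (zh) * | 2018-05-14 | 2021-09-28 | 腾讯科技(深圳)有限公司 | Page fusion method and device, storage medium, and electronic device |
CN110633407B (zh) * | 2018-06-20 | 2022-05-24 | 百度在线网络技术(北京)有限公司 | Information retrieval method, device, equipment, and computer-readable medium |
CN110162607B (zh) * | 2019-02-20 | 2021-08-31 | 北京捷风数据技术有限公司 | Method and device for tracing official document information of government organizations based on a convolutional neural network |
CN110134853A (zh) * | 2019-05-13 | 2019-08-16 | 重庆八戒传媒有限公司 | Data crawling method and system |
CN110175288B (zh) * | 2019-05-23 | 2020-05-19 | 中国搜索信息科技股份有限公司 | Method and system for filtering text and image data for teenage users |
CN111966940B (zh) * | 2020-07-30 | 2021-06-18 | 北京大学 | Target data positioning method and device based on user request sequences |
CN113836449A (zh) * | 2021-09-28 | 2021-12-24 | 北京字节跳动网络技术有限公司 | Information display method and device, and computer storage medium |
CN116881595B (zh) * | 2023-09-06 | 2023-12-15 | 江西顶易科技发展有限公司 | Customizable webpage data crawling method |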
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010014954A2 (en) * | 2008-08-01 | 2010-02-04 | Google Inc. | Providing posts to discussion threads in response to a search query |
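CN103294781A (zh) * | 2013-05-14 | 2013-09-11 | 百度在线网络技术(北京)有限公司 | Method and device for processing page data |
CN103544176A (zh) * | 2012-07-13 | 2014-01-29 | 百度在线网络技术(北京)有限公司 | Method and device for generating a page structure template corresponding to multiple pages |
CN103955529A (zh) * | 2014-05-12 | 2014-07-30 | 中国科学院计算机网络信息中心 | Internet information search aggregation presentation method |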
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
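JP4431744B2 (ja) * | 2004-06-07 | 2010-03-17 | 独立行政法人情報通信研究機構 | Web page information fusion display device, web page information fusion display method, web page information fusion display program, and computer-readable recording medium on which the program is recorded |
KR20080059713A (ko) * | 2006-12-26 | 2008-07-01 | 한국과학기술정보연구원 | Convergence information retrieval system and method for science and technology information |
CN100476830C (zh) * | 2007-06-07 | 2009-04-08 | 北京金山软件有限公司 | Network resource retrieval method and system |
CN103559259A (zh) * | 2013-11-04 | 2014-02-05 | 同济大学 | Method for eliminating near-duplicate webpages based on a cloud platform |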
-
2014
- 2014-05-12 CN CN201410198228.6A patent/CN103955529B/zh active Active
- 2014-12-26 WO PCT/CN2014/095164 patent/WO2015172567A1/zh active Application Filing
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
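CN111274467A (zh) * | 2019-12-31 | 2020-06-12 | 中国电子科技集团公司第二十八研究所 | Three-layer distributed deduplication architecture and method for large-scale data collection |
CN112862536A (zh) * | 2021-02-25 | 2021-05-28 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment, and storage medium |
CN112862536B (zh) * | 2021-02-25 | 2023-07-11 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment, and storage medium |
CN114372267A (zh) * | 2021-11-12 | 2022-04-19 | 哈尔滨工业大学 | Static-domain-based malicious webpage identification and detection method, computer, and storage medium |
CN114372267B (zh) * | 2021-11-12 | 2024-05-28 | 哈尔滨工业大学 | Static-domain-based malicious webpage identification and detection method, computer, and storage medium |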
Also Published As
Publication number | Publication date |
---|---|
CN103955529A (zh) | 2014-07-30 |
CN103955529B (zh) | 2018-05-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 14892066 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 14892066 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 20.03.2017) |