CN102298638A - Method and system for extracting news webpage contents by clustering webpage labels - Google Patents


Info

Publication number
CN102298638A
CN102298638A CN2011102704180A CN201110270418A CN102298638A CN 102298638 A CN102298638 A CN 102298638A CN 2011102704180 A CN2011102704180 A CN 2011102704180A CN 201110270418 A CN201110270418 A CN 201110270418A CN 102298638 A CN102298638 A CN 102298638A
Authority
CN
China
Prior art keywords
node
dom tree
anchor text
web page
deletion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011102704180A
Other languages
Chinese (zh)
Inventor
高勇
王放
许欢庆
郭永福
陈沛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongsou Network Technology Co ltd
Original Assignee
Beijing Zhongsou Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongsou Network Technology Co ltd
Priority to CN2011102704180A
Publication of CN102298638A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method and a system for extracting news webpage content by clustering webpage tags (labels). The method comprises the following steps: preprocessing the webpage content, namely parsing the webpage content into a document object model (DOM) tree and collecting statistics for each node of the DOM tree; heuristically deleting nodes of the DOM tree; deleting nodes of the DOM tree by rule; and deleting nodes of the DOM tree on the basis of tag-structure clustering to generate and output the final DOM tree.

Description

Method and system for extracting news webpage content using webpage tag (label) clustering
Technical field
The present invention relates generally to the field of news webpage content extraction and, more particularly, to a method and system for extracting news webpage content using webpage tag clustering.
Background
In the news (or information) search field, body-text extraction is an indispensable step, and the quality of the extracted body text determines the quality of the news search and of the user experience.
Many body-text extraction methods exist at present. Depending on whether a template is used, they fall into two broad classes: template-based (or wrapper-based) extraction and template-free extraction.
In template-based extraction, a template is defined first, and code then parses and executes the template to obtain the data. By how the template is generated, the approach divides further into manual template extraction and automatic template extraction. In manual template extraction, a template is written by hand for each target site; the template may use regular-expression matching or simple string matching. In automatic template extraction, a machine learning algorithm is first trained on a portion of the pages obtained from the target website to produce a template, and a program then uses that template to extract the data.
Template-free extraction is mostly based on statistics and learning. The main algorithms at present are rule-based, block-based, vision-based, and so on. A representative example is Microsoft's vision-based page segmentation algorithm, which determines the main semantic blocks of a webpage through three steps: page block extraction, separator extraction, and semantic block reconstruction.
The drawback of hand-written templates is that writing them consumes enormous human effort, and as target websites change, maintaining the templates is also very costly. The drawback of automatic templates is that the algorithm is complex and target websites must be monitored periodically to keep the templates up to date. Whether templates are produced manually or automatically, the approach assumes that a site's data is generated from templates. For some large websites this is largely unproblematic, apart from different entry points possibly using different templates. Many small and medium-sized websites, however, are not well templated, so template-based extraction recovers only most of the information and is more likely to include junk information.
Because its rules are complex and its performance is low, the vision-based page segmentation algorithm is not well suited to news search engines.
Therefore, an effective method for extracting news webpage content is needed.
Summary of the invention
The present invention is proposed to address the above shortcomings and problems of the prior art. Addressing the deficiencies of existing news webpage content extraction techniques, the invention extracts news content with an algorithm based on tag clustering, avoiding the drawbacks of manual rules and templates.
According to one aspect of the present invention, a method for extracting news webpage content using webpage tag clustering is provided, comprising: preprocessing webpage content, including parsing the webpage content into a DOM tree and collecting statistics for each node of the DOM tree; heuristically deleting nodes of the DOM tree; deleting nodes of the DOM tree by rule; and deleting nodes of the DOM tree based on tag-structure clustering, thereby generating a final DOM tree for output.
In the method, deleting nodes of the DOM tree based on tag-structure clustering may comprise: collecting tag-structure information for all nodes of the DOM tree that remain after heuristic deletion and rule-based deletion; performing similarity clustering on the collected tag-structure information to obtain a plurality of classes; selecting the class with the most content among the plurality of classes, and taking the common parent node of the nodes in the selected class as the content node; and processing all other nodes according to the content node to form the final DOM tree.
The method may further comprise performing fine processing on the webpage content.
The method may comprise taking an MD5 hash of the collected tag-structure information and performing strict clustering, in which items with identical MD5 values fall into the same class.
The information comprises: punctuation and symbol counts, character count, link count, and image count. The symbol and character counts are further split by link into: the number of Chinese symbols in anchor text, the number of English symbols in anchor text, the number of Chinese characters in anchor text, and the number of English characters in anchor text; and the number of Chinese symbols in non-anchor text, the number of English symbols in non-anchor text, the number of Chinese characters in non-anchor text, and the number of English characters in non-anchor text.
In the method, deleting nodes of the DOM tree by rule may comprise: computing the ratio of a node's link count to its non-anchor-text character count and, if the ratio is greater than a threshold, marking the node as deletable. The threshold is based on the ratio of the page-wide link count to the page-wide non-anchor-text character count, or on an empirical value.
According to another aspect of the present invention, a system for extracting news webpage content using webpage tag clustering is provided, comprising: a preprocessing module for preprocessing webpage content, so as to parse the webpage content into a DOM tree and collect statistics for each node of the DOM tree; a heuristic deletion module for heuristically deleting the nodes corresponding to specified tag objects in the DOM tree; a rule-based deletion module for deleting, by rule, nodes whose ratio of link count to non-anchor-text character count is greater than a specified threshold; and a tag-structure clustering deletion module for deleting nodes based on tag-structure clustering.
The system may further comprise a fine processing module for performing fine processing on the webpage content.
Description of the drawings
The above and other aspects, features, and advantages of certain exemplary embodiments of the present invention will become apparent to those skilled in the art from the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart showing a method of extracting news webpage content according to an exemplary embodiment of the present invention;
Fig. 2 is a flowchart showing a process of deleting nodes based on tag-structure clustering according to an exemplary embodiment of the present invention; and
Fig. 3 is a block diagram showing a system according to an exemplary embodiment of the present invention.
Embodiments
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the present invention. It includes various specific details to assist in that understanding, but these are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, descriptions of well-known functions and structures are omitted for clarity and conciseness.
The present invention is described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart showing a method 100 of extracting news webpage content according to an exemplary embodiment of the present invention.
As shown in Fig. 1, method 100 begins at step 110. In step 110, the webpage content is preprocessed: it is parsed into a DOM (Document Object Model) tree, and statistics are collected for each node of the DOM tree. How to parse webpage content into a DOM tree is well known to those skilled in the art and is therefore not described in detail here.
Preprocessing the webpage content may include: tidying the tags of the webpage content, so that unmatched or unclosed tags are turned into closed tags according to rules; then parsing the tidied webpage content into a DOM tree; and collecting statistics for each node.
The node information may include: punctuation and symbol counts, character counts (for example, Chinese, English, digits, and others), link count, image count, and so on. Symbols and characters may further be split by link into: the number of Chinese symbols in anchor text, the number of English symbols in anchor text, the number of Chinese characters in anchor text, and the number of English characters in anchor text; and the number of Chinese symbols in non-anchor text, the number of English symbols in non-anchor text, the number of Chinese characters in non-anchor text, and the number of English characters in non-anchor text.
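The per-node statistics described above could be gathered roughly as follows. This is only a minimal sketch: the `Node` class, the counter names, the CJK range check, and the punctuation set are all assumptions made for illustration and do not come from the patent.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical, deliberately small set of Chinese punctuation marks.
ZH_PUNCT = set("，。！？；：、（）《》")

@dataclass
class Node:
    tag: str
    text: str = ""                      # text directly inside this element
    children: list = field(default_factory=list)

def is_chinese(ch: str) -> bool:
    # Assumption: treat the basic CJK Unified Ideographs block as "Chinese".
    return "\u4e00" <= ch <= "\u9fff"

def count_stats(node: Node, in_anchor: bool = False) -> Counter:
    """Return counters for the whole subtree rooted at `node`,
    split into anchor text and non-anchor ("plain") text."""
    in_anchor = in_anchor or node.tag == "a"
    prefix = "anchor" if in_anchor else "plain"
    stats = Counter()
    if node.tag == "a":
        stats["links"] += 1
    if node.tag == "img":
        stats["images"] += 1
    for ch in node.text:
        if is_chinese(ch):
            stats[f"{prefix}_zh_chars"] += 1
        elif ch in ZH_PUNCT:
            stats[f"{prefix}_zh_symbols"] += 1
        elif ch.isascii() and ch.isalnum():
            stats[f"{prefix}_en_chars"] += 1
        elif ch.isascii() and not ch.isspace():
            stats[f"{prefix}_en_symbols"] += 1
    for child in node.children:
        stats.update(count_stats(child, in_anchor))
    return stats

page = Node("div", "新闻正文。", [Node("a", "More")])
print(count_stats(page))
```

Keeping anchor and non-anchor counters separate is what later allows link-heavy blocks to be told apart from body text.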
In step 120, nodes are deleted heuristically. This step operates on specified tag objects in the DOM tree. For example, the HTML tags involved in heuristic removal are mainly META, HR, IMG, STYLE, SCRIPT, NOSCRIPT, INPUT, SELECT, EMBED, BUTTON, OPTION, and OPTGROUP.
These HTML tag objects are deleted because they do not affect the content itself. For example, a SCRIPT section contains a lot of punctuation and text, but it is not content that the page itself displays; STYLE describes how content is displayed and likewise does not affect the content itself. These two kinds of nodes are therefore marked as deletable in the DOM tree. Deletion proceeds bottom-up, and when a node is deleted, its statistics are also removed from its parent node.
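A bottom-up pass of this kind might look like the sketch below. The `Node` class and its single `chars` counter are hypothetical, and the sketch assumes that each node's statistics are cumulative over its subtree, so a deleted node's counts are subtracted from every ancestor rather than only from the direct parent.

```python
from collections import Counter

# Tag list taken from the description; lower-cased here for comparison.
NOISE_TAGS = {
    "meta", "hr", "img", "style", "script", "noscript",
    "input", "select", "embed", "button", "option", "optgroup",
}

class Node:
    def __init__(self, tag, stats=None, children=()):
        self.tag = tag
        self.stats = Counter(stats or {})   # cumulative over the subtree (assumption)
        self.children = list(children)
        self.parent = None
        self.deletable = False
        for child in self.children:
            child.parent = self

def mark_noise(node):
    """Post-order walk: children first, so deletion proceeds bottom-up."""
    for child in node.children:
        mark_noise(child)
    if node.tag in NOISE_TAGS:
        node.deletable = True
        ancestor = node.parent
        while ancestor is not None:
            # Remove the deleted node's contribution from each ancestor.
            ancestor.stats.subtract(node.stats)
            ancestor = ancestor.parent

script = Node("script", {"chars": 50})
para = Node("p", {"chars": 20})
body = Node("body", {"chars": 70}, [script, para])
mark_noise(body)
print(body.stats["chars"])
```

Because children are visited first, a nested noise node (for example an OPTION inside a SELECT) has already been subtracted from its parent by the time the parent itself is removed, so nothing is counted twice.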
In step 130, nodes are deleted by rule. This step operates on content tags such as table and div. In this step, the ratio of a node's link count to its non-anchor-text character count can be computed; if the ratio is greater than a threshold (for example, the threshold may be 0.05), the node can be marked as deletable.
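The ratio test in this step is simple enough to sketch directly. The function names are hypothetical, and the handling of a node with links but no non-anchor text (treated here as infinitely link-dense) is an assumption the description does not spell out.

```python
def link_density(link_count: int, plain_chars: int) -> float:
    """Ratio of link count to non-anchor-text character count."""
    if plain_chars == 0:
        # Assumption: links with no surrounding text count as pure navigation.
        return float("inf") if link_count else 0.0
    return link_count / plain_chars

def is_link_heavy(link_count: int, plain_chars: int, threshold: float = 0.05) -> bool:
    """True when the node should be marked deletable under the ratio rule."""
    return link_density(link_count, plain_chars) > threshold

print(is_link_heavy(10, 40))    # navigation-like block
print(is_link_heavy(2, 400))    # article-like block
```

With the example threshold of 0.05, a block needs more than one link per 20 characters of ordinary text before it is flagged.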
In step 140, nodes are deleted based on tag-structure clustering, thereby generating the final DOM tree for output. The process of deleting nodes based on tag-structure clustering is described below with reference to Fig. 2.
Fig. 2 is a flowchart showing a process 200 of deleting nodes based on tag-structure clustering according to an exemplary embodiment of the present invention.
In step 210, tag-structure information is collected for all nodes of the DOM tree that remain after heuristic deletion and rule-based deletion.
In step 220, similarity clustering is performed on the collected tag-structure information to obtain a plurality of classes. For each node there is a path leading from its parent nodes down to the node itself; this path is called the node's structure path, and the string formed from all tags on the path (for example, joined with a separator such as "-" or "/") is the path value. Performing similarity clustering on the collected tag-structure information therefore means performing similarity clustering on the path values. For example, all path values can be clustered with any of the K-means algorithm, the C-means algorithm, the EM algorithm, and so on. Each tag structure corresponds to one (or more) classes, and each class includes: an ancestor node pointer, which points to the ancestor of the content node that is finally found; a word count, used to determine the class with the most content; a junk-word count and a deletion ratio, which can be used to remove noise in the content region; and a node set, through which the common ancestor node can be found.
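As an illustration of the path value, the sketch below joins the tags on the way down to a node. The `Node` class is hypothetical, and the sketch assumes the path starts at the document root, which the description does not state explicitly.

```python
class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent

def structure_path(node, sep="/"):
    """Join the tags from the root (assumed starting point) down to `node`."""
    tags = []
    current = node
    while current is not None:
        tags.append(current.tag)
        current = current.parent
    return sep.join(reversed(tags))

html = Node("html")
body = Node("body", html)
div = Node("div", body)
para = Node("p", div)
print(structure_path(para))          # html/body/div/p
print(structure_path(para, "-"))     # html-body-div-p
```

Sibling paragraphs of one article share the same path value, which is what makes the path a useful clustering key.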
In step 230, the class with the most content (measured by the number of valid punctuation marks and the number of non-anchor-text characters) is selected, and the common parent node of the nodes in that class is taken as the content node of the whole news webpage.
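Selecting the largest class and locating the common parent of its members could be sketched as follows. The `Node` class, the single `text_score` field (standing in for valid punctuation plus non-anchor characters), and the function names are assumptions for illustration only.

```python
class Node:
    def __init__(self, tag, parent=None, text_score=0):
        self.tag = tag
        self.parent = parent
        self.text_score = text_score   # hypothetical content measure

def ancestor_chain(node):
    """Ancestors of `node` from the root down to its parent (node excluded)."""
    chain = []
    current = node.parent
    while current is not None:
        chain.append(current)
        current = current.parent
    return chain[::-1]

def common_parent(nodes):
    """Deepest node that is an ancestor of every node in `nodes`."""
    chains = [ancestor_chain(n) for n in nodes]
    deepest = None
    for level in zip(*chains):
        if any(n is not level[0] for n in level):
            break
        deepest = level[0]
    return deepest

def pick_content_node(classes):
    """`classes` maps a path value to the list of nodes sharing it."""
    best = max(classes.values(), key=lambda ns: sum(n.text_score for n in ns))
    return common_parent(best)

html = Node("html")
body = Node("body", html)
article = Node("div", body)
nav = Node("ul", body)
p1, p2 = Node("p", article, 120), Node("p", article, 80)
li = Node("li", nav, 10)
classes = {"html/body/div/p": [p1, p2], "html/body/ul/li": [li]}
print(pick_content_node(classes).tag)
```

In this toy page the paragraph class carries the most text, so the enclosing `div` is chosen as the content node.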
In step 240, all nodes under the content node are processed to further remove noise, thereby forming the final DOM tree.
In addition, to further improve program efficiency in step 220, an MD5 hash can be taken of the collected tag-structure information (that is, the path values), and strict clustering can be performed in which items with identical MD5 values fall into the same class.
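Strict clustering by MD5 value amounts to grouping nodes under the digest of their path value. The sketch below uses Python's standard `hashlib`; the function name and the use of plain string identifiers for nodes are assumptions.

```python
import hashlib
from collections import defaultdict

def cluster_by_path_hash(paths):
    """Group node identifiers whose path values have the same MD5 digest.

    `paths` maps a node identifier to its path value string.
    """
    clusters = defaultdict(list)
    for node_id, path in paths.items():
        digest = hashlib.md5(path.encode("utf-8")).hexdigest()
        clusters[digest].append(node_id)
    return dict(clusters)

paths = {
    "p1": "html/body/div/p",
    "p2": "html/body/div/p",
    "li1": "html/body/ul/li",
}
for digest, members in cluster_by_path_hash(paths).items():
    print(digest[:8], members)
```

Comparing fixed-length digests rather than full path strings is one plausible reading of the efficiency gain the description mentions; the trade-off is that near-identical paths no longer fall into the same class.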
The method of the present invention has been described above; however, it can be refined by fine processing of the news content. Disclaimers are common on some finance websites, for example: "Sohu Securities statement: the information in this channel is quoted from cooperating media and institutions and does not represent the views or position of Sohu Securities; investors are advised to judge this information carefully, and any trading based on it is at their own risk." If such a statement is kept as part of the news content while the actual news content is only one sentence or shorter, the statement can distort computations on the news content, such as fingerprint computation and news similarity computation. To eliminate this adverse effect, the following two fine-processing approaches can be adopted: (1) for every node inside the news content node, if its deletion rate (same-path node encumbrance count / total number of same-path nodes) is high (above 90%), the node is deleted; (2) a vocabulary is built whose entries consist of a word string and a word attribute, where the word string is a non-word string of 3-4 Chinese characters and the attribute marks it as statement, navigation, copyright, advertisement, and so on; the content of a tag is segmented by forward maximum matching to obtain the set of word attributes for that tag, and whether to keep the node is decided by the ratio (size of the word-attribute set / text length).
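The second approach relies on forward maximum matching against the vocabulary. The sketch below shows one way this could work; the vocabulary entries, attribute names, and function names are hypothetical examples, not entries from the patent.

```python
# Hypothetical vocabulary: string -> attribute. Illustrative entries only.
VOCAB = {
    "免责声明": "statement",
    "版权所有": "copyright",
    "联系我们": "navigation",
}

def forward_max_match(text, vocab, max_len=4):
    """Scan left to right, always trying the longest candidate first."""
    hits = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                hits.append((piece, vocab[piece]))
                i += size
                break
        else:
            i += 1          # no vocabulary entry starts here; move on
    return hits

def noise_ratio(text, vocab):
    """Number of attribute hits divided by the text length."""
    if not text:
        return 0.0
    return len(forward_max_match(text, vocab)) / len(text)

sample = "本站免责声明及版权所有"
print(forward_max_match(sample, VOCAB))
print(noise_ratio(sample, VOCAB))
```

A node whose ratio exceeds some chosen cut-off would then be dropped; the description leaves that cut-off unspecified, so any value used in practice would have to be tuned.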
Fig. 3 is a block diagram showing a system 300 according to an exemplary embodiment of the present invention.
As shown in Fig. 3, system 300 may include a preprocessing module 310, a heuristic deletion module 320, a rule-based deletion module 330, and a tag-structure clustering deletion module 340. Optionally, system 300 may also include a fine processing module (not shown in Fig. 3).
Specifically, the preprocessing module 310 preprocesses the webpage content, parsing it into a DOM tree and collecting statistics for each node of the DOM tree.
The heuristic deletion module 320 heuristically deletes the nodes corresponding to specified tag objects in the DOM tree.
The rule-based deletion module 330 deletes, by rule, nodes whose ratio of link count to non-anchor-text character count is greater than a specified threshold.
The tag-structure clustering deletion module 340 deletes nodes based on tag-structure clustering. Since deleting nodes based on tag-structure clustering has been described in detail above in connection with Fig. 2, it is not described again here.
According to the present invention, a method and system for extracting news webpage content using webpage tag clustering are provided. Those of ordinary skill in the art will recognize that the method of the present invention can achieve the following advantages: (1) it is based on analysis of a single webpage and needs no template, saving a great deal of manual work; (2) the algorithm is simple and analysis is efficient; (3) it can provide high-quality data for subsequent fingerprint computation, content clustering, and news event clustering.
It should be noted that the system and method embodiments of the present invention have been described separately above, but details described for one embodiment also apply to the other.
The basic principles of the present invention have been described above in connection with specific embodiments. However, it should be noted that those of ordinary skill in the art will understand that all or any of the steps or components of the method and system of the present invention can be implemented in software, hardware, firmware, or a combination thereof, which those of ordinary skill in the art can accomplish using their basic programming skills after reading the description of the present invention.
Therefore, the object of the present invention can also be achieved by running a software module or a set of software modules on any computing device. The computing device may be a well-known general-purpose device. The object of the present invention can thus also be achieved merely by providing a program product containing program code that implements the method or system. That is, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. Obviously, the storage medium may be any known storage medium or any storage medium developed in the future.
Although this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations, and even initially claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together into a single software product or packaged into multiple software products.
A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages and declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (for example, one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (for example, files that store one or more modules, subprograms, or portions of code).
The above embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, subcombinations, and substitutions may occur. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A method for extracting news webpage content using webpage tag clustering, comprising:
preprocessing webpage content, including parsing the webpage content into a DOM tree and collecting statistics for each node of the DOM tree;
heuristically deleting nodes of the DOM tree;
deleting nodes of the DOM tree by rule; and
deleting nodes of the DOM tree based on tag-structure clustering, thereby generating a final DOM tree for output.
2. The method according to claim 1, wherein deleting nodes of the DOM tree based on tag-structure clustering comprises:
collecting tag-structure information for all nodes of the DOM tree that remain after heuristic deletion and rule-based deletion;
performing similarity clustering on the collected tag-structure information to obtain a plurality of classes;
selecting the class with the most content among the plurality of classes, and taking the common parent node of the nodes in the selected class as the content node; and
processing all nodes under the content node to form the final DOM tree.
3. The method according to claim 1, wherein the method further comprises performing fine processing on the webpage content.
4. The method according to claim 2, wherein an MD5 hash is taken of the collected tag-structure information, and strict clustering is performed in which items with identical MD5 values fall into the same class.
5. The method according to claim 1, wherein the information comprises: punctuation and symbol counts, character count, link count, and image count.
6. The method according to claim 1, wherein the symbol and character counts are split by link into: the number of Chinese symbols in anchor text, the number of English symbols in anchor text, the number of Chinese characters in anchor text, and the number of English characters in anchor text; and the number of Chinese symbols in non-anchor text, the number of English symbols in non-anchor text, the number of Chinese characters in non-anchor text, and the number of English characters in non-anchor text.
7. The method according to claim 1, wherein deleting nodes of the DOM tree by rule comprises: computing the ratio of a node's link count to its non-anchor-text character count and, if the ratio is greater than a threshold, marking the node as deletable.
8. The method according to claim 1, wherein the threshold is based on the ratio of the page-wide link count to the page-wide non-anchor-text character count, or on an empirical value.
9. A system for extracting news webpage content using webpage tag clustering, comprising:
a preprocessing module for preprocessing webpage content, so as to parse the webpage content into a DOM tree and collect statistics for each node of the DOM tree;
a heuristic deletion module for heuristically deleting the nodes corresponding to specified tag objects in the DOM tree;
a rule-based deletion module for deleting, by rule, nodes whose ratio of link count to non-anchor-text character count is greater than a specified threshold; and
a tag-structure clustering deletion module for deleting nodes based on tag-structure clustering.
10. The system according to claim 9, wherein the system further comprises:
a fine processing module for performing fine processing on the webpage content.
CN2011102704180A 2011-08-31 2011-08-31 Method and system for extracting news webpage contents by clustering webpage labels Pending CN102298638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102704180A CN102298638A (en) 2011-08-31 2011-08-31 Method and system for extracting news webpage contents by clustering webpage labels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011102704180A CN102298638A (en) 2011-08-31 2011-08-31 Method and system for extracting news webpage contents by clustering webpage labels

Publications (1)

Publication Number Publication Date
CN102298638A true CN102298638A (en) 2011-12-28

Family

ID=45359052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102704180A Pending CN102298638A (en) 2011-08-31 2011-08-31 Method and system for extracting news webpage contents by clustering webpage labels

Country Status (1)

Country Link
CN (1) CN102298638A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514174A (en) * 2012-06-18 2014-01-15 北京百度网讯科技有限公司 Text categorization method and device
CN103530429A (en) * 2013-11-04 2014-01-22 北京中搜网络技术股份有限公司 Webpage content extracting method
CN103902581A (en) * 2012-12-28 2014-07-02 腾讯科技(深圳)有限公司 Method and device for removing DOM (document object model) nodes of pages
WO2014201873A1 (en) * 2013-06-18 2014-12-24 Tencent Technology (Shenzhen) Company Limited Method and device for processing web page content
CN104657347A (en) * 2015-02-06 2015-05-27 北京中搜网络技术股份有限公司 News optimized reading mobile application-oriented automatic summarization method
CN104699797A (en) * 2015-03-18 2015-06-10 浪潮集团有限公司 Webpage data structured analytic method and device
CN105183843A (en) * 2012-09-29 2015-12-23 北京奇虎科技有限公司 List page recognition system and method
CN105653668A (en) * 2015-12-29 2016-06-08 武汉理工大学 Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment
CN106339455A (en) * 2016-08-26 2017-01-18 电子科技大学 Webpage text extracting method based on text tag feature mining
CN107436931A (en) * 2017-07-17 2017-12-05 广州特道信息科技有限公司 web page text extracting method and device
CN107451215A (en) * 2017-07-17 2017-12-08 广州特道信息科技有限公司 Feature text abstracting method and device
CN107463634A (en) * 2017-07-17 2017-12-12 广州特道信息科技有限公司 web page text extracting method and device
CN107590288A (en) * 2017-10-11 2018-01-16 百度在线网络技术(北京)有限公司 Method and apparatus for extracting webpage picture and text block
CN108021600A (en) * 2016-11-03 2018-05-11 财团法人资讯工业策进会 Webpage data capturing equipment and webpage data capturing method thereof
CN108804458A (en) * 2017-05-02 2018-11-13 阿里巴巴集团控股有限公司 A kind of reptile web retrieval method and apparatus
CN109635219A (en) * 2018-12-05 2019-04-16 云孚科技(北京)有限公司 A kind of webpage content extracting method
CN110020247A (en) * 2017-12-22 2019-07-16 中移(苏州)软件技术有限公司 A kind of webpage key modules extracting method and device
CN110209906A (en) * 2018-02-07 2019-09-06 北京京东尚科信息技术有限公司 Method and apparatus for extracting webpage information
CN111104624A (en) * 2018-10-25 2020-05-05 富士通株式会社 Content extraction method and apparatus, and storage medium
CN112887381A (en) * 2021-01-15 2021-06-01 中国地质大学(武汉) Method and device for detecting and converging new content facing specific network entrance
CN114218515A (en) * 2021-12-21 2022-03-22 北京大学 Web digital object extraction method and system based on content segmentation
CN114528811A (en) * 2022-01-21 2022-05-24 北京麦克斯泰科技有限公司 Article content extraction method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101464905A (en) * 2009-01-08 2009-06-24 中国科学院计算技术研究所 Web page information extraction system and method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
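刘云峰: "A text information extraction algorithm based on tag path clustering", 《计算机应用与软件》 *
Anonymous: "Webpage denoising based on the linkNum/textNum ratio", 《HTTP://CODE.GOOGLE.COM/P/HTMLXTRACTOR/》 *
彭同坠: "Research on Web news body-text extraction techniques", 《科教文汇》 *
赵文 et al.: "Research on statistics-based body-text extraction from Chinese webpages", 《电脑知识与技术》 *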

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514174A (en) * 2012-06-18 2014-01-15 北京百度网讯科技有限公司 Text categorization method and device
CN103514174B (en) * 2012-06-18 2019-01-15 北京百度网讯科技有限公司 A kind of file classification method and device
CN105183843B (en) * 2012-09-29 2018-09-14 北京奇虎科技有限公司 list page identification system and method
CN105183843A (en) * 2012-09-29 2015-12-23 北京奇虎科技有限公司 List page recognition system and method
CN103902581B (en) * 2012-12-28 2017-12-08 腾讯科技(深圳)有限公司 A kind of method and apparatus for the DOM node for removing the page
CN103902581A (en) * 2012-12-28 2014-07-02 腾讯科技(深圳)有限公司 Method and device for removing DOM (document object model) nodes of pages
WO2014201873A1 (en) * 2013-06-18 2014-12-24 Tencent Technology (Shenzhen) Company Limited Method and device for processing web page content
CN103530429B (en) * 2013-11-04 2017-01-18 北京中搜网络技术股份有限公司 Webpage content extracting method
CN103530429A (en) * 2013-11-04 2014-01-22 北京中搜网络技术股份有限公司 Webpage content extracting method
CN104657347A (en) * 2015-02-06 2015-05-27 北京中搜网络技术股份有限公司 News optimized reading mobile application-oriented automatic summarization method
CN104699797A (en) * 2015-03-18 2015-06-10 浪潮集团有限公司 Webpage data structured analytic method and device
CN104699797B (en) * 2015-03-18 2018-02-23 浪潮集团有限公司 Web page data structured analysis method and device
CN105653668A (en) * 2015-12-29 2016-06-08 武汉理工大学 Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment
CN106339455A (en) * 2016-08-26 2017-01-18 电子科技大学 Webpage text extracting method based on text tag feature mining
CN106339455B (en) * 2016-08-26 2019-06-04 电子科技大学 Webpage context extraction method based on text label feature mining
CN108021600A (en) * 2016-11-03 2018-05-11 财团法人资讯工业策进会 Webpage data capturing equipment and webpage data capturing method thereof
CN108804458A (en) * 2017-05-02 2018-11-13 阿里巴巴集团控股有限公司 Crawler web page retrieval method and apparatus
CN107463634A (en) * 2017-07-17 2017-12-12 广州特道信息科技有限公司 Web page text extracting method and device
CN107451215A (en) * 2017-07-17 2017-12-08 广州特道信息科技有限公司 Feature text extracting method and device
CN107451215B (en) * 2017-07-17 2021-01-01 云润大数据服务有限公司 Feature text extraction method and device
CN107436931A (en) * 2017-07-17 2017-12-05 广州特道信息科技有限公司 Web page text extracting method and device
CN107436931B (en) * 2017-07-17 2020-12-22 云润大数据服务有限公司 Webpage text extraction method and device
US10755091B2 (en) 2017-10-11 2020-08-25 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for retrieving image-text block from web page
CN107590288A (en) * 2017-10-11 2018-01-16 百度在线网络技术(北京)有限公司 Method and apparatus for extracting webpage picture and text block
CN107590288B (en) * 2017-10-11 2020-09-18 百度在线网络技术(北京)有限公司 Method and device for extracting webpage image-text blocks
CN110020247A (en) * 2017-12-22 2019-07-16 中移(苏州)软件技术有限公司 Webpage key module extracting method and device
CN110020247B (en) * 2017-12-22 2021-05-14 中移(苏州)软件技术有限公司 Webpage key module extraction method and device
CN110209906A (en) * 2018-02-07 2019-09-06 北京京东尚科信息技术有限公司 Method and apparatus for extracting webpage information
CN111104624A (en) * 2018-10-25 2020-05-05 富士通株式会社 Content extraction method and apparatus, and storage medium
CN111104624B (en) * 2018-10-25 2023-08-22 富士通株式会社 Content extraction method and apparatus, and storage medium
CN109635219A (en) * 2018-12-05 2019-04-16 云孚科技(北京)有限公司 Webpage content extracting method
CN112887381A (en) * 2021-01-15 2021-06-01 中国地质大学(武汉) Method and device for detecting and converging new content facing specific network entrance
CN112887381B (en) * 2021-01-15 2022-07-19 中国地质大学(武汉) Method and device for detecting and converging new content facing specific network entrance
CN114218515A (en) * 2021-12-21 2022-03-22 北京大学 Web digital object extraction method and system based on content segmentation
CN114218515B (en) * 2021-12-21 2022-09-06 北京大学 Web digital object extraction method and system based on content segmentation
CN114528811A (en) * 2022-01-21 2022-05-24 北京麦克斯泰科技有限公司 Article content extraction method, device, equipment and storage medium
CN114528811B (en) * 2022-01-21 2022-09-02 北京麦克斯泰科技有限公司 Article content extraction method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN102298638A (en) Method and system for extracting news webpage contents by clustering webpage labels
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
US11501082B2 (en) Sentence generation method, sentence generation apparatus, and smart device
Peters et al. Content extraction using diverse feature sets
CN104598577B (en) Web page text extraction method
WO2020000717A1 (en) Web page classification method and device, and computer-readable storage medium
CN102110140A (en) Network-based method for analyzing opinion information in discrete text
CN100552673C (en) Open type document isomorphism engines system
CN102591612B (en) General webpage text extraction method based on punctuation continuity and system thereof
CN103914494A (en) Method and system for identifying identity of microblog user
CN103699525A (en) Method and device for automatically generating abstract on basis of multi-dimensional characteristics of text
CN102609427A (en) Public opinion vertical search analysis system and method
Giannakopoulos et al. Representation models for text classification: a comparative analysis over three web document types
CN112667940B (en) Webpage text extraction method based on deep learning
Zvonarev et al. A Comparison of Machine Learning Methods of Sentiment Analysis Based on Russian Language Twitter Data.
CN103530429A (en) Webpage content extracting method
CN104978332A (en) UGC label data generating method, UGC label data generating device, relevant method and relevant device
CN103744837B (en) Multi-text comparison method based on keyword extraction
CN103150331A (en) Method and device for providing search engine tags
Plu et al. A hybrid approach for entity recognition and linking
Papadakis et al. Graph vs. bag representation models for the topic classification of web documents
CN107145591A (en) Title-based method for extracting effective content metadata of a web page
KR20130099327A (en) Apparatus for extracting information from open domains and method for the same
CN103927176A (en) Method for generating program feature tree on basis of hierarchical topic model
CN104281695B (en) Natural language semantic information extraction method and system based on combinatorial theory

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 2011-12-28