CN102193944A - Method for extracting webpage subject contents - Google Patents

Publication number
CN102193944A
Authority
CN
China
Prior art keywords
web page
template
rss
subject content
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010101251916A
Other languages
Chinese (zh)
Inventor
沈文南
酆晓杰
王艳丽
王进
玄东俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for extracting the subject content of web pages. The method comprises the following steps: selecting the latest RSS (Really Simple Syndication) information and its corresponding web page from an RSS file; searching for the position of the RSS information in the DOM tree of the corresponding web page and taking the information at that position as a web page template; and extracting the subject content of a plurality of target web pages by using the web page template. The method further comprises the following step: each time the subject content of a predetermined number of the target web pages has been extracted, regenerating the web page template and continuing to extract the subject content of the target web pages.

Description

Web page subject content extraction method
Technical field
The present invention relates to the extraction of web page information, and in particular to the extraction of the subject content of web pages.
Background art
Besides its subject content, a web page contains noise unrelated to that content, such as navigation links, scripts, related articles, advertisement links and copyright notices. Removing this noise and extracting the subject content of the page is useful in many applications, for example improving web page classification and duplicate removal in search engines, and giving mobile terminals direct access to the subject content of a page.
Current techniques for extracting web page subject content fall into two classes. The first applies mainly to structured web pages: it analyses the features of such pages to find a template for the data to be extracted, and then extracts the page data in batches. The second first builds a training set of web pages, obtains a template by machine learning, and then extracts data from each page. For general web pages, the second class is mainly used.
However, a training set can hardly cover all cases, so the generated template may fail to extract the subject content accurately. Existing methods also extract only a single block of subject content per page and cannot extract the body text (description), title (title), category (category) and so on separately. Moreover, training a template by machine learning cannot be carried out on resource-constrained devices such as mobile terminals.
Summary of the invention
In view of the above problems of existing web page subject content extraction methods, the present invention provides a web page subject content extraction method that extracts the subject content accurately and can run on resource-constrained devices.
To achieve this object, the web page subject content extraction method according to the present invention comprises the steps of: selecting the latest RSS information and its corresponding web page from a Really Simple Syndication (RSS) file; searching for the position of the RSS information in the tree structure (DOM tree) of the corresponding web page and taking the information of that position as a web page template; and extracting the subject content of a plurality of target web pages by using the web page template.
In the above method, the position of the RSS information in the tree structure may be found by traversing the tree structure of the web page level by level to find a first node, the XPath of the first node being taken as the information of the position. Among all nodes of the tree structure, the first node is the one whose text has the greatest similarity to the text information of the RSS information, that similarity also being greater than a predetermined threshold.
Alternatively, in the above method, the tree structure of the web page is traversed level by level to find a first node, which among all nodes of the tree structure is the one whose text has the greatest similarity to the RSS text information, that similarity also being greater than a predetermined threshold. The descendant nodes of the first node are then traversed to find a second node as the position of the RSS information in the tree structure, the XPath of the second node being taken as the information of the position. The similarity between the text of the second node and the text information of the RSS information is greater than the threshold, and the similarity between each child node of the second node and the RSS text information is less than the threshold.
The above method may further comprise the step of: each time the subject content of a predetermined number of the target web pages has been extracted, regenerating the web page template and continuing the extraction of the subject content of the target web pages.
In the above method, the web page template may be used to extract the subject content of the most recent of the target web pages; the similarity between the RSS information corresponding to that most recent target web page and its extracted subject content is calculated; and if the similarity is not greater than a predetermined threshold, the web page template needs to be regenerated.
In the above method, the RSS information is title RSS information, body text RSS information or category RSS information; the web page template is correspondingly a title template, a body text template or a category template; and the subject content is title content, body text content or category content.
The web page subject content extraction method according to the present invention extracts the subject content of web pages accurately and can run on resource-constrained devices.
Description of drawings
The above and other objects and features of the present invention will become apparent from the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic diagram illustrating RSS information;
Fig. 2 is a flowchart illustrating the web page subject content extraction method according to an embodiment of the present invention;
Fig. 3 is a flowchart illustrating the web page template generation method according to an embodiment of the present invention;
Fig. 4A is a schematic diagram illustrating a web page file, the DOM tree and XPath;
Fig. 4B is a schematic diagram illustrating a web page file, the DOM tree and XPath;
Fig. 5 is a diagram illustrating the template generation result obtained by the template generation method shown in Fig. 3;
Fig. 6 is a flowchart illustrating the web page template verification method according to an embodiment of the present invention.
Description of main reference signs: 101 is the template generation module; 102 is the content extraction module; 103 is the template verification module; S1010-S1100, S2010-S2110 and S3010-S3050 are steps.
Embodiment
Embodiments of the present invention are described in detail below with reference to the accompanying drawings.
(Embodiment)
The web page subject content extraction method of this embodiment relies on RSS information, so RSS information is described first.
RSS is a format for describing and syndicating website content and a relatively new means of publishing information. Many web pages today, such as those of blogs and news sites, are published together with RSS information. RSS information can be called directly by other websites, and because the data is in standard Extensible Markup Language (XML) format, it can also be used in other terminals and services.
RSS is currently the most popular XML application. A sub-channel of a portal site (for example its technology channel), or the full set of posts written by a blogger, has an RSS file that maintains the RSS information of its most recently published web pages. In general, an RSS file contains the RSS information of only the few most recently updated web pages and changes as new information is published.
For example, Fig. 1 shows a fragment of the RSS file of the Information Technology (IT) channel of a website. In the figure, box 10 marks the title RSS information of the HTML web page file, box 20 marks the body text RSS information, and box 30 marks the category RSS information. RSS text information means the text that remains after the HTML tags are removed from the RSS information. For instance, the title RSS text information of this page is "China Mobile starts construction and deployment of the fourth phase of its TD network", and, as the figure shows, the body text RSS text information in this example is a fragment of the actual body text.
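The definition of RSS text information (RSS information with the HTML tags removed) can be illustrated with a short Python sketch. The function name `rss_text` and the sample string are hypothetical and not part of the patent; only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the character data between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def rss_text(rss_html: str) -> str:
    """Return the RSS text information: the input with HTML tags removed
    and runs of whitespace collapsed (the collapsing is an assumption)."""
    collector = _TextCollector()
    collector.feed(rss_html)
    collector.close()
    return " ".join("".join(collector.parts).split())


# Hypothetical description field of an RSS item.
example = "<p>China Mobile starts <b>phase four</b> of its TD network.</p>"
print(rss_text(example))
```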
The web page subject content extraction method according to this embodiment is described below.
Fig. 2 is a flowchart illustrating the web page subject content extraction method according to this embodiment.
As shown in Fig. 2, the method can be divided into three modules, namely a template generation module 101, a content extraction module 102 and a template verification module 103, and more specifically into ten steps, S1010 to S1100. Steps S1010 to S1040 constitute the template generation module 101, steps S1050 to S1070 constitute the content extraction module 102, and steps S1090 to S1100 constitute the template verification module 103.
At step S1010, the most recent web pages and their corresponding RSS information are selected. Specifically, all RSS information contained in the RSS file is first sorted by update time; a number of the most recently updated RSS items are then selected, the number being manually configurable; and the web page corresponding to each selected RSS item is then located.
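Step S1010 might look roughly like the following Python sketch. It assumes an RSS 2.0 file whose items carry a `pubDate` element as the update time; the function name `latest_items`, the sample feed and the example URLs are hypothetical, and fetching the page behind each link is left out.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 fragment with three items.
SAMPLE_RSS = """<rss version="2.0"><channel>
<item><title>Old post</title><link>http://example.com/1</link>
<pubDate>Mon, 01 Mar 2010 08:00:00 +0000</pubDate></item>
<item><title>Newest post</title><link>http://example.com/3</link>
<pubDate>Wed, 03 Mar 2010 08:00:00 +0000</pubDate></item>
<item><title>Middle post</title><link>http://example.com/2</link>
<pubDate>Tue, 02 Mar 2010 08:00:00 +0000</pubDate></item>
</channel></rss>"""


def latest_items(rss_xml: str, count: int) -> list:
    """Sort all RSS items by update time and keep the `count` newest."""
    channel = ET.fromstring(rss_xml).find("channel")
    items = []
    for item in channel.findall("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
            "category": item.findtext("category", default=""),
            # Assumption: pubDate is present and in RFC 822 format.
            "updated": parsedate_to_datetime(item.findtext("pubDate")),
        })
    items.sort(key=lambda entry: entry["updated"], reverse=True)
    return items[:count]


selected = latest_items(SAMPLE_RSS, 2)
print([entry["link"] for entry in selected])
```

The `link` of each selected item identifies the corresponding web page from which the templates are then generated.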
At step S1020, the title template is obtained by locating the position of the title RSS text information in the web page. The title templates obtained from the RSS items selected at step S1010 and their corresponding web pages should all be identical. If a few of them differ, those are discarded and the title template shared by the majority is kept.
Then, at step S1030, the body text template is obtained by locating the position of the body text RSS text information in the web page. As at step S1020, the body text templates obtained from the RSS items selected at step S1010 and their corresponding web pages should all be identical. If a few of them differ, those are discarded and the body text template shared by the majority is kept.
Then, at step S1040, the category template is obtained by locating the position of the category RSS text information in the web page. Here too, the category template shared by the majority is kept.
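The rule of keeping the template shared by the majority of the sampled pages can be sketched as a simple vote. This is only an illustration: the function name `majority_template` is hypothetical, and treating a failed generation as `None` is an assumption.

```python
from collections import Counter


def majority_template(candidates):
    """Return the XPath produced by most of the sampled pages.

    `candidates` holds one XPath string per sampled page, or None where
    template generation failed for that page.
    """
    counts = Counter(c for c in candidates if c is not None)
    if not counts:
        return None
    template, _ = counts.most_common(1)[0]
    return template


# Two pages agree and one differs, so the outlier is discarded.
print(majority_template([
    "/html/body/div[2]/h3",
    "/html/body/div[2]/h3",
    "/html/body/div[1]/h1",
]))
```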
Thus, the template generation module 101 selects the most recently updated web pages and their corresponding RSS information, and obtains a web page template consisting of the title template, the body text template and the category template by locating the position of the RSS text information in the web page. The method for locating the RSS text information in the web page is described in detail later.
Then, at step S1050, the title template in the web page template, which corresponds to the title RSS information, is used to extract the title content from a target web page.
Then, at step S1060, the body text template in the web page template, which corresponds to the body text RSS information, is used to extract the body text content from the target web page.
Then, at step S1070, the category template in the web page template, which corresponds to the category RSS information, is used to extract the category content from the target web page.
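Steps S1050 to S1070 amount to evaluating three XPath templates against a target page. The sketch below uses Python's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup; a real page would need an HTML parser, which is outside this sketch. The helper names, the sample page and the three template paths are all hypothetical.

```python
import xml.etree.ElementTree as ET


def node_text(node):
    """Text of a node: all text in the subtree rooted at the node."""
    return " ".join(t.strip() for t in node.itertext() if t.strip())


def extract(root, template):
    """Apply an absolute XPath template such as /html/body/div[2]/h3."""
    prefix = "/" + root.tag
    if template == prefix:
        return node_text(root)
    if not template.startswith(prefix + "/"):
        return None
    # ElementTree paths are relative to the root element.
    node = root.find("." + template[len(prefix):])
    return None if node is None else node_text(node)


PAGE = (
    "<html><body>"
    "<div><a>Home</a><a>Technology</a></div>"
    "<div><h3>China Mobile starts TD network phase four</h3>"
    "<div><p>The operator announced the fourth phase today.</p></div></div>"
    "</body></html>"
)

web_page_template = {
    "title": "/html/body/div[2]/h3",
    "body": "/html/body/div[2]/div/p",
    "category": "/html/body/div[1]/a[2]",
}

root = ET.fromstring(PAGE)
content = {field: extract(root, xpath) for field, xpath in web_page_template.items()}
print(content)
```

Returning `None` when the path matches nothing gives the later verification step a simple signal that the page layout may have changed.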
Then, at step S1080, it is determined whether the subject content has been extracted from all target web pages corresponding to the RSS file.
If not all of these web pages have been processed and extraction must continue for the remaining pages (step S1080: "No"), then at step S1090 the web page template is verified using the most recently updated RSS information and the subject content extracted for it. Specifically, this comprises verification of the title template, of the body text template and of the category template.
Then, at step S1100, based on the verification result of step S1090, it is determined whether any of the title template, the body text template and the category template is no longer suitable for extracting the subject content of the latest web pages, that is, whether the web page template needs to be regenerated.
If it is determined at step S1100 that the web page template needs to be regenerated (step S1100: "Yes"), processing returns to step S1020. Otherwise (step S1100: "No"), processing returns to step S1050 and extraction continues for the remaining target web pages.
Thus, the template verification module 103 uses the most recently updated RSS information and the subject content extracted for it to monitor whether the current web page template needs to change. If so, the web page template is regenerated before extraction continues; otherwise the current web page template continues to be used for the remaining target web pages. The specific method of verifying whether the current web page template needs to change is described in detail later.
When it is determined at step S1080 that the subject content has been extracted from all target web pages corresponding to the RSS file (step S1080: "Yes"), processing ends.
Thus, the content extraction module 102 uses the web page template to extract the subject content, comprising title content, body text content and category content, from all target web pages corresponding to the RSS file.
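The interplay of the three modules in Fig. 2 can be summarised as a loop. The sketch below is a simplified reading of the flow, not the patent's implementation: the three callables are hypothetical stand-ins for the modules, verification runs after every page (including the last), and a page whose extraction failed is not re-extracted after regeneration.

```python
from typing import Callable, Iterable, List, Optional


def extract_with_verification(
    pages: Iterable[str],
    generate_template: Callable[[], str],
    extract: Callable[[str, str], Optional[str]],
    is_still_valid: Callable[[str, Optional[str]], bool],
) -> List[Optional[str]]:
    template = generate_template()              # S1010-S1040
    results = []
    for page in pages:                          # loop until S1080 says "Yes"
        content = extract(template, page)       # S1050-S1070
        results.append(content)
        if not is_still_valid(page, content):   # S1090-S1100
            template = generate_template()      # regenerate the template
    return results


# Stub modules: template "t1" stops working on page "b".
calls = []

def gen():
    calls.append(1)
    return f"t{len(calls)}"

def ext(template, page):
    return None if (page == "b" and template == "t1") else f"{template}:{page}"

def valid(page, content):
    return content is not None

out = extract_with_verification(["a", "b", "c"], gen, ext, valid)
print(out)
```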
Fig. 3 is a flowchart illustrating the web page template generation method according to this embodiment. Specifically, Fig. 3 shows the generation of the body text template, that is, it corresponds to step S1030 in Fig. 2.
First, at step S2010, a DOM tree is built for the web page.
Fig. 4A and Fig. 4B are schematic diagrams illustrating a web page file, the DOM tree and XPath.
Specifically, Fig. 4A shows an example of a HyperText Markup Language (HTML) web page file. Fig. 4B shows the DOM tree corresponding to the HTML web page file of Fig. 4A; each node of the DOM tree corresponds to an HTML tag in the HTML web page. The text information of a node of the DOM tree is the text contained in the subtree rooted at that node. In Fig. 4B, "/html/body/ul" is shown as an example of the XPath of a node.
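The two notions introduced here, the text information of a node and the XPath of a node, can be made concrete with a small Python sketch. The page below is a hypothetical stand-in for Fig. 4A (which is not reproduced in this text), it must be well-formed for `xml.etree.ElementTree` to parse it, and the helper names are assumptions.

```python
import xml.etree.ElementTree as ET

PAGE = "<html><body><h1>News</h1><ul><li>First</li><li>Second</li></ul></body></html>"


def node_text(node):
    """Text information of a node: all text in the subtree under it."""
    return " ".join(t.strip() for t in node.itertext() if t.strip())


def xpath_of(root, target):
    """Build an absolute XPath for `target` by walking up to `root`."""
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parents[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        # A position index is needed only when siblings share the tag.
        if len(same_tag) == 1:
            steps.append(node.tag)
        else:
            steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


root = ET.fromstring(PAGE)
ul = root.find("./body/ul")
second_li = root.find("./body/ul/li[2]")
print(xpath_of(root, ul), "->", node_text(ul))
print(xpath_of(root, second_li), "->", node_text(second_li))
```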
Then, at step S2020, useless information such as JavaScript is removed.
Then, at step S2030, the similarity between the RSS text information and the text information of each child node of the root node is calculated.
The similarity is calculated as follows. The RSS text information is usually a fragment or the whole of the actual body text, and in a few cases a short summary of it, so a simple algorithm is used. The RSS text information is first segmented into words, giving a word array a and a word count n. It is then checked whether each word in array a occurs in the text information of the node, giving the number m of words that occur. The similarity is s = m/n.
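The similarity s = m/n translates almost directly into code. In the Python sketch below, the tokenizer is an assumption: it splits on runs of word characters, which suits space-delimited text, whereas Chinese text would need a proper word segmenter. Matching whole words of the node text rather than substrings is also an assumption, and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Assumed word segmentation: lower-cased runs of word characters."""
    return re.findall(r"\w+", text.lower())


def similarity(rss_text, node_text):
    """s = m / n, where n is the number of words in the RSS text and
    m is how many of those words occur in the node text."""
    words = tokenize(rss_text)           # word array a, with n = len(words)
    if not words:
        return 0.0
    node_words = set(tokenize(node_text))
    matched = sum(1 for word in words if word in node_words)   # m
    return matched / len(words)


print(similarity("alpha beta gamma delta", "beta delta"))
```

Because the measure counts only how much of the RSS text is covered, a node with a lot of extra text still scores 1.0 as long as it contains every RSS word, which is why the generation procedure keeps descending to the smallest node that still matches.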
Then, at step S2040, the maximum similarity s and the corresponding child node a are taken from the results of step S2030.
Then, at step S2050, it is determined whether the maximum similarity s is greater than a predetermined threshold sv.
As for the similarity threshold: because the RSS text information does not always match the actual text completely, an empirical value is obtained by experiment. When the similarity exceeds this value, the RSS text and the node text are considered to match. This empirical value is the similarity threshold.
When it is determined at step S2050 that the maximum similarity s is less than or equal to the threshold sv (step S2050: "No"), failure information indicating that generation of the body text template has failed is returned at step S2060, and template generation ends.
When it is determined at step S2050 that the maximum similarity s is greater than the threshold sv (step S2050: "Yes"), then at step S2070 the descendant nodes of node a are traversed recursively and the similarity between the text information of each descendant node and the RSS text information is calculated.
Then, at step S2080, the maximum similarity s1 and the corresponding descendant node b are taken from the results of step S2070.
Then, at step S2090, it is determined whether the maximum similarity s1 is greater than the predetermined threshold sv.
When it is determined at step S2090 that the maximum similarity s1 is greater than the threshold sv (step S2090: "Yes"), then at step S2100 node b is taken as the new node a, and processing returns to step S2070 to repeat the subsequent steps.
When it is determined at step S2090 that the maximum similarity s1 is less than or equal to the threshold sv (step S2090: "No"), then at step S2110 the XPath of node a is returned; this XPath is the body text template.
When the body text template is obtained by the method shown in Fig. 3, the node finally selected (the last node b, which has been taken as node a) satisfies two conditions: the similarity between the body text RSS text information and the text of the node is greater than the threshold sv, and the similarity between the body text RSS text information and the text of each child node of the node is less than the threshold sv. This node is found by traversing the tree structure of the web page level by level from the top down and comparing text similarity, and is the node most similar to the body text RSS text information. The position of this node in the tree, that is, its XPath, is the body text template.
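The procedure of Fig. 3 can be sketched in Python as follows. This is a simplified illustration under several assumptions: the page is well-formed so that `xml.etree.ElementTree` can parse it, words are split on word characters, and the search descends one child level at a time instead of scoring all descendants at once. It stops at a node whose children all fall at or below the threshold, which matches the termination condition described above. All names, the sample page and the threshold value 0.8 are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def similarity(rss_text, text):
    words = tokenize(rss_text)
    if not words:
        return 0.0
    present = set(tokenize(text))
    return sum(w in present for w in words) / len(words)


def node_text(node):
    return " ".join(node.itertext())


def child_step(parent, child):
    """XPath step for `child`, indexed only if siblings share its tag."""
    same_tag = [c for c in parent if c.tag == child.tag]
    if len(same_tag) == 1:
        return child.tag
    return f"{child.tag}[{same_tag.index(child) + 1}]"


def generate_template(root, rss_text, threshold):
    # S2020: remove useless information such as scripts.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "script":
                parent.remove(child)
    path, current, first_level = "/" + root.tag, root, True
    while True:
        # S2030/S2070: score the children of the current node.
        scored = [(similarity(rss_text, node_text(c)), c) for c in current]
        if not scored:
            return None if first_level else path
        best_score, best = max(scored, key=lambda pair: pair[0])  # S2040/S2080
        if best_score <= threshold:                               # S2050/S2090
            return None if first_level else path                  # S2060/S2110
        path += "/" + child_step(current, best)                   # S2100
        current, first_level = best, False


PAGE = (
    "<html><head><script>var x = 1;</script></head><body>"
    "<div><a>Home</a><a>Sports</a></div>"
    "<div><h3>China Mobile starts TD network phase four</h3>"
    "<div><p>The operator announced the fourth phase of construction today.</p></div></div>"
    "</body></html>"
)

title_template = generate_template(
    ET.fromstring(PAGE), "China Mobile starts TD network phase four", 0.8)
body_template = generate_template(
    ET.fromstring(PAGE), "The operator announced the fourth phase", 0.8)
print(title_template)
print(body_template)
```

Note that `generate_template` modifies the tree it is given (scripts are removed), so the sketch parses the page afresh for each call.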
The title template (step S1020 in Fig. 2) and the category template (step S1040 in Fig. 2) of this embodiment are generated in the same way as the body text template, and both methods follow by analogy from the method shown in Fig. 3. Specifically, if the processing of Fig. 3 is carried out with the title RSS information, the resulting XPath is the title template; if it is carried out with the category RSS information, the resulting XPath is the category template.
Fig. 5 is a diagram illustrating the template generation result obtained by the template generation method shown in Fig. 3.
The main part of Fig. 5 is a screenshot of the web page corresponding to the RSS file shown in Fig. 1. Title 40 in Fig. 5 corresponds to the title RSS information in Fig. 1, and body text 50 corresponds to the body text RSS information in Fig. 1. With the template generation method shown in Fig. 3, locating the title RSS text information in the HTML web page yields the title template /html/body/div[4]/div/div[5]/div/div[1]/div/div[3]/div[2]/div[1]/h3, and locating the body text RSS text information in the HTML web page yields the body text template /html/body/div[4]/div/div[5]/div/div[1]/div/div[3]/div[2]/div[2]/div[1]/div[1].
Fig. 6 is a flowchart illustrating the web page template verification method according to this embodiment. Specifically, Fig. 6 shows the verification of the title template; the body text template and the category template of this embodiment are verified in the same way, by analogy with the flow shown in Fig. 6. This implements step S1090 in Fig. 2.
First, at step S3010, the title RSS text information of the latest web page is obtained.
Then, at step S3020, the title content extracted with the title template is obtained.
Then, at step S3030, the similarity s2 between the title RSS text information obtained at step S3010 and the title content obtained at step S3020 is calculated.
Then, at step S3040, it is determined whether the similarity s2 is greater than the predetermined threshold sv. If it is (step S3040: "Yes"), verification of the title template ends. Otherwise (step S3040: "No"), information indicating that the template needs to be regenerated is returned at step S3050.
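The verification flow of Fig. 6 reduces to one similarity comparison. The Python sketch below restates the m/n similarity so that it is self-contained; the tokenizer, the function names, the handling of a failed extraction (`None`) and the threshold value are assumptions.

```python
import re


def similarity(rss_text, extracted):
    """s = m / n over the words of the RSS text (assumed tokenizer)."""
    words = re.findall(r"\w+", rss_text.lower())
    if not words:
        return 0.0
    present = set(re.findall(r"\w+", extracted.lower()))
    return sum(w in present for w in words) / len(words)


def template_needs_regeneration(rss_text, extracted, threshold):
    """S3010-S3050: compare the latest page's RSS text with the content
    the current template extracted from that page."""
    if extracted is None:            # the template matched nothing
        return True
    s2 = similarity(rss_text, extracted)       # S3030
    return s2 <= threshold                     # S3040 "No" leads to S3050


# The template still finds the title: no regeneration needed.
print(template_needs_regeneration(
    "China Mobile starts TD network", "China Mobile starts TD network", 0.8))
# The page layout changed and the template now picks up navigation text.
print(template_needs_regeneration(
    "China Mobile starts TD network", "Home Sports Technology", 0.8))
```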
As described above, because the web page subject content extraction method according to this embodiment generates the web page template from RSS information, it can improve the accuracy of template generation.
Furthermore, because the method generates the web page template from a small number of web pages and their corresponding RSS information and then uses this template to extract the subject content of all target web pages, it can improve the efficiency of subject content extraction. The extracted subject content can also be broken down into title, body text, category and so on; this fine extraction granularity allows the subject content of a web page to be extracted more precisely.
Furthermore, because the method according to this embodiment needs only a small number of web pages (in the return for "blog") and RSS information, it can be implemented on resource-constrained devices such as mobile terminals.
Furthermore, the method according to this embodiment obtains not only the meaningful text but also related multimedia content such as pictures and video; conversely, multimedia files unrelated to the body text, such as advertisement pictures, can be filtered out.
Furthermore, the method according to this embodiment can detect template changes in real time and correct the template promptly when it changes, so it has an adaptive mechanism.
In addition, various changes in form and detail may be made to the web page subject content extraction method of this embodiment without departing from the spirit and scope of the present invention as defined by the claims.
For example, although in this embodiment the web page template is verified after the content of each target web page is extracted, the present invention is not limited to this; the web page template may instead be verified each time a predetermined number of target web pages have been processed.
As a further example, although in this embodiment the web page template is a title template, a body text template or a category template, the present invention is not limited to these; the web page template may also be another template, such as an author (author) template.
Industrial applicability
The method for extracting web page subject content of the present invention is applicable to extracting the subject content of web pages that use RSS information.

Claims (6)

1. A web page subject content extraction method, comprising the steps of:
selecting the latest RSS information and its corresponding web page from a Really Simple Syndication (RSS) file;
searching for the position of the RSS information in the tree structure (DOM tree) of the corresponding web page, and taking the information of the position as a web page template; and
extracting the subject content of a plurality of target web pages by using the web page template.
2. The web page subject content extraction method as claimed in claim 1, characterized in that the tree structure of the web page is traversed level by level to find a first node as the position of the RSS information in the tree structure, the XPath of the first node being taken as the information of the position;
among all nodes of the tree structure, the similarity between the text of the first node and the text information of the RSS information is the greatest and is greater than a predetermined threshold.
3. The web page subject content extraction method as claimed in claim 1, characterized in that the tree structure of the web page is traversed level by level to find a first node, wherein among all nodes of the tree structure the similarity between the text of the first node and the RSS text information is the greatest and is greater than a predetermined threshold;
the descendant nodes of the first node are then traversed to find a second node as the position of the RSS information in the tree structure, the XPath of the second node being taken as the information of the position;
the similarity between the text of the second node and the text information of the RSS information is greater than the threshold, and the similarity between each child node of the second node and the RSS text information is less than the threshold.
4. The web page subject content extraction method as claimed in claim 1, characterized by further comprising the step of:
each time the subject content of a predetermined number of the plurality of target web pages has been extracted, regenerating the web page template and continuing the extraction of the subject content of the plurality of target web pages.
5. The web page subject content extraction method as claimed in claim 1, characterized in that the web page template is used to extract the subject content of the most recent target web page among the target web pages;
the similarity between the RSS information corresponding to the most recent target web page and the subject content of the most recent target web page is calculated;
if the similarity is not greater than a predetermined threshold, the web page template needs to be regenerated.
6. The web page subject content extraction method as claimed in claim 1, characterized in that the RSS information is title (Title) RSS information, body text (Description) RSS information or category (Category) RSS information;
the web page template is a title template, a body text template or a category template;
the subject content is title content, body text content or category content.
CN2010101251916A 2010-03-12 2010-03-12 Method for extracting webpage subject contents Pending CN102193944A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010101251916A CN102193944A (en) 2010-03-12 2010-03-12 Method for extracting webpage subject contents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010101251916A CN102193944A (en) 2010-03-12 2010-03-12 Method for extracting webpage subject contents

Publications (1)

Publication Number Publication Date
CN102193944A true CN102193944A (en) 2011-09-21

Family

ID=44602023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010101251916A Pending CN102193944A (en) 2010-03-12 2010-03-12 Method for extracting webpage subject contents

Country Status (1)

Country Link
CN (1) CN102193944A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591971A (en) * 2011-12-31 2012-07-18 北京百度网讯科技有限公司 Method and device for extracting webpage information
CN103218420A (en) * 2013-04-01 2013-07-24 北京鹏宇成软件技术有限公司 Method and device for extracting page titles
CN103279495A (en) * 2013-05-06 2013-09-04 百度在线网络技术(北京)有限公司 Method and device for confirming site information template corresponding to target object
CN103389972A (en) * 2013-07-26 2013-11-13 Tcl集团股份有限公司 Method and device for obtaining text based on really simple syndication (RSS)
CN105183730A (en) * 2014-05-30 2015-12-23 北大方正集团有限公司 Method and device for processing webpage information
CN106126711A (en) * 2016-06-30 2016-11-16 北京奇虎科技有限公司 Encyclopaedia entry sorting technique and device
CN106802899A (en) * 2015-11-26 2017-06-06 北京搜狗科技发展有限公司 web page text extracting method and device
CN108009171A (en) * 2016-10-27 2018-05-08 腾讯科技(北京)有限公司 A kind of method and apparatus for extracting content-data
CN108255975A (en) * 2017-12-27 2018-07-06 东软集团股份有限公司 Template construction method, content of pages grasping means and device, medium and equipment
CN109657180A (en) * 2018-12-11 2019-04-19 中科国力(镇江)智能技术有限公司 It is a kind of intelligence web page contents automatically obscure extraction system
CN110968761A (en) * 2019-11-29 2020-04-07 福州大学 Self-adaptive extraction method for webpage structured data
CN111488511A (en) * 2019-01-25 2020-08-04 深信服科技股份有限公司 Website theme extraction method and system, electronic equipment and storage medium
CN112632421A (en) * 2020-12-25 2021-04-09 杭州电子科技大学 Self-adaptive structured document extraction method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101094135A (en) * 2006-06-23 2007-12-26 腾讯科技(深圳)有限公司 Method and system for extracting information of content in Internet

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
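Zhang Yan (张艳): "An RSS-level method and system for extracting webpage subject content" (一个RSS级别的网页主题内容抽取方法与系统), 《图书情报工作》 (Library and Information Service) *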

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591971B (en) * 2011-12-31 2015-03-18 北京百度网讯科技有限公司 Method and device for extracting webpage information
CN102591971A (en) * 2011-12-31 2012-07-18 北京百度网讯科技有限公司 Method and device for extracting webpage information
CN103218420A (en) * 2013-04-01 2013-07-24 北京鹏宇成软件技术有限公司 Method and device for extracting page titles
CN103218420B (en) * 2013-04-01 2016-12-28 北京创世泰克科技股份有限公司 Method and device for extracting webpage titles
CN103279495A (en) * 2013-05-06 2013-09-04 百度在线网络技术(北京)有限公司 Method and device for confirming site information template corresponding to target object
CN103279495B (en) * 2013-05-06 2016-05-25 百度在线网络技术(北京)有限公司 Method and device for determining the site information template corresponding to a target object
CN103389972B (en) * 2013-07-26 2017-12-26 Tcl集团股份有限公司 Method and device for obtaining text based on really simple syndication (RSS)
CN103389972A (en) * 2013-07-26 2013-11-13 Tcl集团股份有限公司 Method and device for obtaining text based on really simple syndication (RSS)
CN105183730A (en) * 2014-05-30 2015-12-23 北大方正集团有限公司 Method and device for processing webpage information
CN105183730B (en) * 2014-05-30 2018-07-24 北大方正集团有限公司 Method and device for processing webpage information
CN106802899B (en) * 2015-11-26 2020-11-24 北京搜狗科技发展有限公司 Webpage text extraction method and device
CN106802899A (en) * 2015-11-26 2017-06-06 北京搜狗科技发展有限公司 Webpage text extraction method and device
CN106126711A (en) * 2016-06-30 2016-11-16 北京奇虎科技有限公司 Encyclopaedia entry sorting technique and device
CN108009171A (en) * 2016-10-27 2018-05-08 腾讯科技(北京)有限公司 Method and device for extracting content data
CN108009171B (en) * 2016-10-27 2020-06-30 腾讯科技(北京)有限公司 Method and device for extracting content data
CN108255975A (en) * 2017-12-27 2018-07-06 东软集团股份有限公司 Template construction method, page content capture method and device, medium and equipment
CN108255975B (en) * 2017-12-27 2021-05-07 东软集团股份有限公司 Template construction method, page content capture method and device, medium and equipment
CN109657180A (en) * 2018-12-11 2019-04-19 中科国力(镇江)智能技术有限公司 Intelligent automatic fuzzy extraction system for webpage content
CN109657180B (en) * 2018-12-11 2021-11-26 中科国力(镇江)智能技术有限公司 Intelligent automatic fuzzy extraction system for webpage content
CN111488511A (en) * 2019-01-25 2020-08-04 深信服科技股份有限公司 Website theme extraction method and system, electronic equipment and storage medium
CN111488511B (en) * 2019-01-25 2024-04-09 深信服科技股份有限公司 Website theme extraction method and system, electronic equipment and storage medium
CN110968761A (en) * 2019-11-29 2020-04-07 福州大学 Self-adaptive extraction method for webpage structured data
CN110968761B (en) * 2019-11-29 2022-07-08 福州大学 Webpage structured data self-adaptive extraction method
CN112632421A (en) * 2020-12-25 2021-04-09 杭州电子科技大学 Self-adaptive structured document extraction method
CN112632421B (en) * 2020-12-25 2022-05-10 杭州电子科技大学 Self-adaptive structured document extraction method

Similar Documents

Publication Publication Date Title
CN102193944A (en) Method for extracting webpage subject contents
CN101727461B (en) Method for extracting content of web page
US7941420B2 (en) Method for organizing structurally similar web pages from a web site
Chen et al. Function-based object model towards website adaptation
CN102541874B (en) Webpage text content extracting method and device
CN104598577B (en) Method for extracting webpage text
CN101853300B (en) Method and system for identifying and evaluating video downloading service website
CN102270206A (en) Method and device for capturing valid web page contents
CN100514323C (en) System and method for automatically extracting by-line information
CN102819591B (en) Content-based webpage classification method and system
CN101551800B (en) Marked information generation device, inquiry unit and sharing system
CN107590219A (en) Webpage personage subject correlation message extracting method
CN106503211B (en) Method for automatically generating mobile version facing information publishing website
CN103166981B (en) Wireless webpage transcoding method and device
JP2010501096A (en) Cooperative optimization of wrapper generation and template detection
WO2004083989A2 (en) Web server for adapted web content
CN110457579B (en) Webpage denoising method and system based on cooperative work of template and classifier
CN109857956A (en) The automatic abstracting method of news web page key message based on label and blocking characteristic
EP1604305A2 (en) Web content adaption process and system
CN103064845B (en) Web information processing device and Web information processing method
CN103870486A (en) Webpage type confirming method and device
CN102446255A (en) Method and device for detecting page tamper
Uzun et al. An effective and efficient Web content extractor for optimizing the crawling process
CN104361059A (en) Harmful information identification and web page classification method based on multi-instance learning
CN107145591B (en) Title-based webpage effective metadata content extraction method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20110921