CN102193944A

CN102193944A - Method for extracting webpage subject contents

Info

Publication number: CN102193944A
Application number: CN2010101251916A
Authority: CN
Inventors: 沈文南; 酆晓杰; 王艳丽; 王进; 玄东俊
Original assignee: Samsung Electronics China R&D Center; Samsung Electronics Co Ltd
Current assignee: Samsung Electronics China R&D Center; Samsung Electronics Co Ltd
Priority date: 2010-03-12
Filing date: 2010-03-12
Publication date: 2011-09-21

Abstract

The invention discloses a method for extracting webpage subject contents. The method comprises the following steps of: selecting latest RSS (Really Simple Syndication) information and a corresponding webpage from an RSS file; searching the position of the RSS information in a Dom Tree of the corresponding webpage, wherein the information at the position is taken as a webpage template; and extracting the subject contents of a plurality of target webpages by using the webpage template. The method further comprises the following step of: after the subject contents of a predetermined quantity of target webpages in the plurality of target webpages are extracted, regenerating the webpage template and continually extracting the subject contents of the plurality of target webpages.

Description

Web page subject content extraction method

Technical field

The present invention relates to the extraction of web page subject information, and in particular to the extraction of web page subject content.

Background technology

Web pages contain noise that is unrelated to their subject content, such as navigation links, scripts, related articles, advertisement links and copyright information. Removing this noise and extracting the subject content of a web page is useful in many applications, for example to improve web page classification and duplicate removal in search engines, or to let mobile terminals access the subject content of a page directly.

Current techniques for extracting web page subject content fall into two main classes. The first applies mainly to structured web pages: the features of the structured pages are analyzed to find a template for the data to be extracted, and the page data is then extracted in batches. The second first builds a training set of web pages, obtains a template by machine-learning training, and then extracts data from each web page. For generic web pages, the second class of methods is mainly used.

However, such a training set can hardly cover all cases, so the generated template may fail to extract the subject content accurately. Moreover, existing methods extract only a single block of subject content per web page and cannot extract the body text (description), title, category and so on separately. In addition, training a template by machine learning cannot be carried out on resource-constrained devices such as mobile terminals.

Summary of the invention

In view of the above problems of existing web page subject content extraction methods, the present invention provides a web page subject content extraction method that extracts web page subject content accurately and can run on resource-constrained devices.

To achieve this object, the web page subject content extraction method according to the present invention comprises the steps of: selecting the latest RSS information and its corresponding web page from a Really Simple Syndication (RSS) file; searching for the position of the RSS information in the tree structure (DOM Tree) of its corresponding web page, and taking the information of that position as a web page template; and using the web page template to extract the subject content of a plurality of target web pages.

Further, the above web page subject content extraction method is characterized in that a first node is searched for by traversing the tree structure of the web page level by level and is taken as the position of the RSS information in the tree structure, the XPath of the first node being the information of that position; among all nodes of the tree structure, the similarity between the text of the first node and the text information of the RSS information is the largest and is greater than a predetermined threshold.

Further, the above web page subject content extraction method is characterized in that a first node is searched for by traversing the tree structure of the web page level by level, wherein, among all nodes of the tree structure, the similarity between the text of the first node and the RSS text information is the largest and is greater than a predetermined threshold; the descendant nodes of the first node are then traversed to search for a second node, which is taken as the position of the RSS information in the tree structure, the XPath of the second node being the information of that position; the similarity between the text of the second node and the text information of the RSS information is greater than the threshold, and the similarities between the child nodes of the second node and the RSS text information are all less than the threshold.

Further, the above web page subject content extraction method is characterized by further comprising the step of: each time subject content has been extracted from a predetermined number of target web pages among the plurality of target web pages, regenerating the web page template and continuing the extraction of the subject content of the plurality of target web pages.

Further, the above web page subject content extraction method is characterized in that the web page template is used to extract the subject content of the latest target web page among the target web pages; the similarity between the RSS information corresponding to the latest target web page and the subject content of the latest target web page is calculated; and if the similarity is greater than a predetermined threshold, the web page template needs to be regenerated.

Further, the above web page subject content extraction method is characterized in that the RSS information is title RSS information, body-text RSS information or category RSS information; the web page template is a title template, a body-text template or a category template; and the subject content is title content, body-text content or category content.

The web page subject content extraction method according to the present invention extracts web page subject content accurately and can run on resource-constrained devices.

Description of drawings

The above and other objects and features of the present invention will become apparent from the following description taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a schematic diagram illustrating RSS information;

Fig. 2 is a flowchart illustrating the web page subject content extraction method according to an embodiment of the present invention;

Fig. 3 is a flowchart illustrating the web page template generation method according to an embodiment of the present invention;

Fig. 4A is a schematic diagram illustrating a web page file, a DOM Tree and XPath;

Fig. 4B is a schematic diagram illustrating a web page file, a DOM Tree and XPath;

Fig. 5 is a diagram illustrating the template generation result obtained with the template generation method shown in Fig. 3;

Fig. 6 is a flowchart illustrating the web page template verification method according to an embodiment of the present invention.

Description of main reference symbols: 101 is the template generation module; 102 is the content extraction module; 103 is the template verification module; S1010-S1100, S2010-S2110 and S3010-S3050 are steps.

Embodiment

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

(embodiment)

The web page subject content extraction method of this embodiment relies on RSS information, so RSS information is described first.

RSS is a format for describing and syndicating website content and a new means of publishing information. Many web pages, such as blogs and news sites, now attach RSS information when they are published. RSS information can be called directly by other websites, and because the data is in the standard Extensible Markup Language (XML) format, it can also be used in other terminals and services.

RSS is currently the most widely used XML application. A sub-channel of a portal website (for example, a technology channel), or the set of all blog posts written by one blogger, has an RSS file that maintains the RSS information of the most recently published web pages. In general, an RSS file contains only the RSS information of the few most recently updated web pages and changes as new information is published.

For example, to illustrate RSS information, Fig. 1 shows a fragment of the RSS file of the information technology (IT) channel of one website. In the figure, box 10 marks the title RSS information of an HTML web page file, box 20 marks the body-text RSS information, and box 30 marks the category RSS information. The RSS text information is the text that remains after the HTML tags are removed from the RSS information. For example, the title RSS text information of this web page is "China Mobile deploys the start of phase-four TD network construction", and, as shown in the figure, the body-text RSS text information in this example is a fragment of the actual body text.
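The derivation of RSS text information from RSS information described above can be sketched as follows. This is only a minimal illustration using Python's standard `html.parser` module; the function name `rss_text` and the sample fragment are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the character data of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def rss_text(rss_fragment):
    """Return the fragment with HTML tags removed and whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(rss_fragment)
    collector.close()
    return " ".join("".join(collector.parts).split())


# Hypothetical RSS description field containing HTML markup.
print(rss_text("<p>Hello <b>RSS</b> world</p>"))
```

Only tags are dropped here; the text between them is kept in order, which is what the later similarity comparison operates on.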

The web page subject content extraction method according to this embodiment is described below.

Fig. 2 is a flowchart illustrating the web page subject content extraction method according to this embodiment.

As shown in Fig. 2, the web page subject content extraction method according to this embodiment can be divided into three modules, namely a template generation module 101, a content extraction module 102 and a template verification module 103, and more specifically into ten steps S1010 to S1100. Steps S1010 to S1040 constitute the template generation module 101, steps S1050 to S1070 constitute the content extraction module 102, and steps S1090 to S1100 constitute the template verification module 103.

At step S1010, the most recently updated web pages and their corresponding RSS information are selected. Specifically, all RSS information contained in the RSS file is first sorted by update time; a number of the most recently updated RSS items are then selected, the number being manually configurable; and the corresponding web pages are then found from the selected RSS information.
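The selection in step S1010 could look roughly like the sketch below. It assumes an RSS 2.0 file whose items carry `title`, `link` and `pubDate` elements; the patent does not name the field that holds the update time, so using `pubDate` is an assumption, and the sample feed and the function name `select_latest_items` are hypothetical.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed with three items in arbitrary order.
SAMPLE_RSS = """<rss version="2.0"><channel>
<item><title>A</title><link>http://example.com/a</link>
<pubDate>Mon, 01 Mar 2010 08:00:00 +0000</pubDate></item>
<item><title>B</title><link>http://example.com/b</link>
<pubDate>Wed, 03 Mar 2010 08:00:00 +0000</pubDate></item>
<item><title>C</title><link>http://example.com/c</link>
<pubDate>Tue, 02 Mar 2010 08:00:00 +0000</pubDate></item>
</channel></rss>"""


def select_latest_items(rss_xml, count):
    """Sort all RSS items by update time and keep the `count` newest."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # Assumption: pubDate (RFC 822 date) is the update time.
            "updated": parsedate_to_datetime(item.findtext("pubDate")),
        })
    items.sort(key=lambda entry: entry["updated"], reverse=True)
    return items[:count]


for entry in select_latest_items(SAMPLE_RSS, 2):
    print(entry["updated"].date(), entry["link"])
```

The `link` of each selected item is what would be fetched to obtain the corresponding web page; the `count` argument plays the role of the manually configurable number.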

At step S1020, the title template is obtained by locating the position of the title RSS text information in the web page. The title templates obtained from the RSS items selected at step S1010 and their corresponding web pages should all be identical; if a few of them differ, those are discarded and the title template shared by the majority is kept.

Then, at step S1030, the body-text template is obtained by locating the position of the body-text RSS text information in the web page. As in step S1020, the body-text templates obtained from the RSS items selected at step S1010 and their corresponding web pages should all be identical; if a few of them differ, those are discarded and the body-text template shared by the majority is kept.

Then, at step S1040, the category template is obtained by locating the position of the category RSS text information in the web page. In this step, the category template shared by the majority is likewise kept.
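The rule used in steps S1020 to S1040, keeping the template on which most sample pages agree and discarding the few that differ, can be sketched as a simple majority vote. The function name `majority_template` and the convention that a failed generation is represented by `None` are assumptions made for this illustration.

```python
from collections import Counter


def majority_template(candidates):
    """Return the XPath produced by most sample pages, or None if none exists.

    `candidates` holds one XPath string per sample page; None marks a page
    for which no template could be generated (an assumed convention).
    """
    found = [xpath for xpath in candidates if xpath is not None]
    if not found:
        return None
    xpath, _count = Counter(found).most_common(1)[0]
    return xpath


# Hypothetical templates obtained from five sample pages.
samples = [
    "/html/body/div[2]/h3",
    "/html/body/div[2]/h3",
    "/html/body/div[3]/h3",
    "/html/body/div[2]/h3",
    None,
]
print(majority_template(samples))
```

The patent does not say what happens when no clear majority exists; this sketch simply returns the most frequent candidate.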

Thus, the template generation module 101 selects some of the most recently updated web pages and their corresponding RSS information, and obtains a web page template consisting of a title template, a body-text template and a category template by locating the positions of the RSS text information in the web pages. The method for locating RSS text information in a web page is described in detail later.

Then, at step S1050, the title template in the web page template, which corresponds to the title RSS information, is used to extract the title content of a target web page.

Then, at step S1060, the body-text template in the web page template, which corresponds to the body-text RSS information, is used to extract the body-text content of the target web page.

Then, at step S1070, the category template in the web page template, which corresponds to the category RSS information, is used to extract the category content of the target web page.
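Steps S1050 to S1070 amount to evaluating each stored XPath against a target page and reading the text under the node it addresses. The sketch below assumes the page is well-formed XML so that the standard library's `xml.etree.ElementTree` can parse it; real HTML would need an HTML parser. The page, the template values and the function name `extract` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed target page.
PAGE = """<html><body>
<div><a href="/">Home</a></div>
<div><h3>Sample headline</h3>
<div><p>First paragraph.</p><p>Second paragraph.</p></div></div>
</body></html>"""


def extract(page_xml, template):
    """Return the text under the node addressed by an absolute XPath template."""
    root = ET.fromstring(page_xml)
    steps = template.strip("/").split("/")
    if steps[0] != root.tag:
        return None
    # ElementTree only accepts relative paths on an element, so the leading
    # root step is dropped and the remainder is evaluated from the root.
    node = root if len(steps) == 1 else root.find("./" + "/".join(steps[1:]))
    if node is None:
        return None
    return " ".join(" ".join(node.itertext()).split())


# Assumed web page template: one XPath per kind of subject content.
templates = {"title": "/html/body/div[2]/h3", "body": "/html/body/div[2]/div"}
content = {name: extract(PAGE, xpath) for name, xpath in templates.items()}
print(content)
```

A template whose path no longer exists in the page yields `None`, which a caller could treat as a sign that the template needs to be regenerated.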

Then, at step S1080, it is judged whether subject content has been extracted from all target web pages corresponding to the RSS information.

If subject content has not yet been extracted from all web pages corresponding to the RSS information and extraction must continue for the remaining web pages (step S1080: "No"), then at step S1090 the web page template is verified using the most recently updated RSS information and the subject content extracted for it. Specifically, this comprises verification of the title template, verification of the body-text template and verification of the category template.

Then, at step S1100, it is judged from the verification result of step S1090 whether any one of the title template, the body-text template and the category template is no longer suitable for extracting the subject content of the latest web pages and needs to be regenerated, that is, whether the web page template is no longer suitable for extracting the subject content of the latest web pages and needs to be regenerated.

If it is judged at step S1100 that the web page template needs to be regenerated (step S1100: "Yes"), processing returns to step S1020. Otherwise, if it is judged at step S1100 that the web page template does not need to be regenerated (step S1100: "No"), processing returns to step S1050 and extraction of subject content continues for the remaining target web pages.

Thus, the template verification module 103 uses the most recently updated RSS information and the subject content extracted for it to monitor whether the current web page template needs to change. If it does, the web page template is regenerated before extraction of web page subject content continues; otherwise the current web page template continues to be used for the remaining target web pages. The specific method for verifying whether the current web page template needs to change is described in detail later.

When it is judged at step S1080 that subject content has been extracted from all target web pages corresponding to the RSS file (step S1080: "Yes"), processing ends.

Thus, the content extraction module 102 uses the web page template to extract, from all target web pages corresponding to the RSS file, web page subject content comprising title content, body-text content and category content.
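The control flow of Fig. 2 as described above might be sketched as the loop below. The three callables stand in for modules 101, 102 and 103 and must be supplied by the caller; their names and signatures are assumptions. The sketch also simplifies in two ways that the patent does not specify: verification uses the page that was just extracted, and a page extracted with a stale template is not extracted again.

```python
def extract_all(pages, generate_template, extract, verify):
    """Extract every page, regenerating the template when verification fails.

    pages: list of (rss_item, page) pairs.
    generate_template(): returns a template (stands in for module 101).
    extract(page, template): returns extracted content (module 102).
    verify(rss_item, content): True if the template is still valid (module 103).
    """
    template = generate_template()              # S1010-S1040
    results = []
    for index, (rss_item, page) in enumerate(pages):
        content = extract(page, template)       # S1050-S1070
        results.append(content)
        if index == len(pages) - 1:             # S1080 "Yes": all pages done
            break
        if not verify(rss_item, content):       # S1090-S1100
            template = generate_template()      # S1100 "Yes": back to S1020
    return results


# Toy run with stand-in callables: templates are just version numbers and
# verification fails on page "B", forcing one regeneration.
versions = iter(range(1, 10))
demo = extract_all(
    [("a", "A"), ("b", "B"), ("c", "C")],
    lambda: next(versions),
    lambda page, template: (page, template),
    lambda rss_item, content: content[0] != "B",
)
print(demo)
```

Because the template is rebuilt only when verification fails, the comparatively costly generation step runs rarely while extraction itself is a cheap XPath lookup per page.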

Fig. 3 is a flowchart illustrating the web page template generation method according to this embodiment. Specifically, Fig. 3 is a flowchart of the method for generating the body-text template, which corresponds to step S1030 in Fig. 2.

First, at step S2010, a DOM Tree is built for the web page.

Fig. 4A and Fig. 4B are schematic diagrams illustrating a web page file, a DOM Tree and XPath.

Specifically, Fig. 4A shows an example of a HyperText Markup Language (HTML) web page file. Fig. 4B shows the DOM Tree corresponding to the HTML web page file of Fig. 4A; each node in the DOM Tree corresponds to an HTML tag in the HTML web page. The text information of a node in the DOM Tree is the text contained in the subtree rooted at that node. As an example of the XPath of a node, Fig. 4B shows "/html/body/ul".
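The relationship between a page, its DOM Tree, node text and XPath can be illustrated with a small sketch. The page below is hypothetical (it is not the page of Fig. 4A), and it is assumed to be well-formed so that `xml.etree.ElementTree` can build the tree; the helper names `xpaths` and `node_text` are also assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical page; real HTML would need an HTML parser instead.
PAGE = ("<html><head><title>Demo</title></head>"
        "<body><h1>News</h1><ul><li>One</li><li>Two</li></ul></body></html>")


def xpaths(root):
    """Map every element to an absolute XPath with 1-based sibling indexes."""
    paths = {root: "/" + root.tag}
    for parent in root.iter():                 # parents come before children
        totals = {}
        for child in parent:
            totals[child.tag] = totals.get(child.tag, 0) + 1
        seen = {}
        for child in parent:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            step = child.tag
            if totals[child.tag] > 1:          # index only when ambiguous
                step += "[%d]" % seen[child.tag]
            paths[child] = paths[parent] + "/" + step
    return paths


def node_text(node):
    """Text of a node = all text contained in the subtree rooted at it."""
    return " ".join(" ".join(node.itertext()).split())


root = ET.fromstring(PAGE)
for node, path in xpaths(root).items():
    print(path, "->", node_text(node))
```

Note how the text of `/html/body/ul` is the concatenation of the text of its `li` children, which is why a parent node is always at least as similar to a given text as any of its children under a word-containment measure.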

Then, at step S2020, useless information such as JavaScript is removed.
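One possible way to carry out step S2020 on a parsed tree is to drop the elements that never contribute visible text. The patent mentions only JavaScript; removing `style` elements as well, and the function name `remove_noise`, are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET


def remove_noise(root, tags=("script", "style")):
    """Remove elements such as scripts so they do not pollute node text.

    Note: ElementTree stores text that follows an element in its `tail`,
    so removing an element also removes that trailing text.
    """
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in tags:
                parent.remove(child)
    return root


# Hypothetical well-formed page containing a script block.
page = ET.fromstring(
    "<html><body><script>var x = 1;</script><p>Hello</p></body></html>")
remove_noise(page)
print("".join(page.itertext()))
```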

Then, at step S2030, the similarity between the RSS text information and the text information of each child node of the root node is calculated.

The similarity is calculated as follows. Because the RSS text information is usually a fragment or the whole of the actual body text, and only in a few cases a short summary of it, a simple algorithm is used. First, the RSS text information is segmented into words, giving a word array a and a word count n. It is then checked whether each word in array a occurs in the node's text information, giving the number m of words that occur, and s = m/n is calculated as the similarity.
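The similarity s = m/n described above translates almost directly into code. In this sketch the word segmentation is a plain whitespace split, which is an assumption suited to the English sample; Chinese text would need a proper word segmenter, which can be passed in through the hypothetical `tokenize` parameter.

```python
def similarity(rss_text, node_text, tokenize=str.split):
    """Return s = m / n for the RSS text against a node's text.

    n: number of words in the RSS text (word array a).
    m: number of those words that occur in the node text.
    """
    words = tokenize(rss_text)
    if not words:
        return 0.0
    matched = sum(1 for word in words if word in node_text)
    return matched / len(words)


# Two of the four RSS words occur in the node text.
print(similarity("alpha beta gamma delta", "x alpha y gamma"))
```

Because the measure only asks whether each RSS word is contained in the node text, extra text in the node does not lower the score, so the measure by itself cannot distinguish a tight match from a large enclosing node. That is why the generation method goes on to descend through the descendant nodes.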

Then, at step S2040, the maximum similarity s and the corresponding child node a are taken from the results of step S2030.

Then, at step S2050, it is judged whether the maximum similarity s is greater than a predetermined threshold sv.

Regarding the similarity threshold: because in some cases the RSS text information cannot match the actual body text completely, an empirical value is obtained by experiment. When the similarity exceeds this empirical value, the RSS information and the node text are considered to match; this empirical value is the similarity threshold.

When it is judged at step S2050 that the maximum similarity s is less than or equal to the threshold sv (step S2050: "No"), failure information indicating that generation of the body-text template has failed is returned at step S2060, and template generation ends.

When it is judged at step S2050 that the maximum similarity s is greater than the threshold sv (step S2050: "Yes"), then at step S2070 all descendant nodes of node a are traversed recursively and the similarity between the text information of each descendant node and the RSS text information is calculated.

Then, at step S2080, the maximum similarity s1 and the corresponding descendant node b are taken from the results of step S2070.

Then, at step S2090, it is judged whether the maximum similarity s1 is greater than the predetermined threshold sv.

When it is judged at step S2090 that the maximum similarity s1 is greater than the threshold sv (step S2090: "Yes"), then at step S2100 node b is taken as node a, and processing returns to step S2070 to repeat the subsequent steps.

When it is judged at step S2090 that the maximum similarity s1 is less than or equal to the threshold sv (step S2090: "No"), the XPath of node a is returned at step S2110; this XPath is the body-text template.

When the body-text template is obtained by the method shown in Fig. 2, node b satisfies the following two conditions: the similarity between the body-text RSS text information and the text of node b is greater than the threshold sv, and the similarities between the body-text RSS text information and the text of each child node of node b are all less than the threshold sv. The node b obtained in this way is the node that is found, by traversing the web page tree structure level by level from top to bottom and comparing text similarity, to be most similar to the body-text RSS text information. The position of this node in the tree, that is, its XPath, is the body-text template.
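Steps S2030 to S2110 can be put together in a compact sketch. It assumes a well-formed page parsed with `xml.etree.ElementTree`, whitespace word segmentation, and an illustrative threshold of 0.8; the patent gives no threshold value, and all function names and the sample page are hypothetical. Step S2020 (noise removal) is omitted here.

```python
import xml.etree.ElementTree as ET


def text_of(node):
    return " ".join(" ".join(node.itertext()).split())


def similarity(rss_text, node_text):
    words = rss_text.split()
    return sum(w in node_text for w in words) / len(words) if words else 0.0


def best(nodes, rss_text):
    """Return (similarity, node) for the most similar node, or (0.0, None)."""
    scored = [(similarity(rss_text, text_of(n)), n) for n in nodes]
    return max(scored, key=lambda pair: pair[0], default=(0.0, None))


def xpath_of(root, target):
    """Absolute XPath of target, indexing same-tag siblings from 1."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps, node = [], target
    while node is not root:
        parent = parent_of[node]
        twins = [c for c in parent if c.tag == node.tag]
        index = "[%d]" % (twins.index(node) + 1) if len(twins) > 1 else ""
        steps.append(node.tag + index)
        node = parent
    return "/" + "/".join([root.tag] + steps[::-1])


def generate_template(root, rss_text, threshold):
    s, a = best(list(root), rss_text)                      # S2030-S2040
    if a is None or s <= threshold:                        # S2050 "No"
        return None                                        # S2060: failure
    while True:
        descendants = [n for n in a.iter() if n is not a]  # S2070
        s1, b = best(descendants, rss_text)                # S2080
        if b is None or s1 <= threshold:                   # S2090 "No"
            return xpath_of(root, a)                       # S2110
        a = b                                              # S2100


PAGE = ET.fromstring(
    "<html><head><title>Site</title></head><body>"
    "<div><a>Home</a><a>Sports</a></div>"
    "<div><h3>Big match tonight</h3><div>"
    "<p>The home team plays the final tonight.</p>"
    "<p>Kickoff is at eight.</p></div></div></body></html>")
BODY_RSS = "The home team plays the final tonight. Kickoff is at eight."

print(generate_template(PAGE, "Big match tonight", 0.8))
print(generate_template(PAGE, BODY_RSS, 0.8))
```

In the sample, the descent for the body text stops at the inner `div` because neither `p` child alone exceeds the threshold, which matches the two conditions stated for the returned node.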

The method for generating the title template (step S1020 in Fig. 2) and the method for generating the category template (step S1040 in Fig. 2) according to this embodiment are the same as the method for generating the body-text template, and both can be derived by analogy from the method shown in Fig. 3. Specifically, if the processing of Fig. 3 is carried out with title RSS information, the resulting XPath is the title template; if it is carried out with category RSS information, the resulting XPath is the category template.

Fig. 5 is a diagram illustrating the template generation result obtained with the template generation method shown in Fig. 3.

The main part of Fig. 5 is the web page screen corresponding to the RSS file shown in Fig. 1. Title 40 in Fig. 5 corresponds to the title RSS information in Fig. 1, and body text 50 corresponds to the body-text RSS information in Fig. 1. With the template generation method shown in Fig. 3, locating the title RSS text information in the HTML web page yields the title template /html/body/div[4]/div/div[5]/div/div[1]/div/div[3]/div[2]/div[1]/h3, and locating the body-text RSS text information in the HTML web page yields the body-text template /html/body/div[4]/div/div[5]/div/div[1]/div/div[3]/div[2]/div[2]/div[1]/div[1].

Fig. 6 is a flowchart illustrating the web page template verification method according to this embodiment. Specifically, Fig. 6 is a flowchart of the method for verifying the title template; the methods for verifying the body-text template and the category template according to this embodiment are the same and can be derived by analogy from the flow shown in Fig. 6. Step S1090 in Fig. 2 is implemented in this way.

First, at step S3010, the title RSS text information of the latest web page is obtained.

Then, at step S3020, the title content extracted with the title template is obtained.

Then, at step S3030, the similarity s2 between the title RSS text information obtained at step S3010 and the title content obtained at step S3020 is calculated.

Then, at step S3040, it is judged whether the similarity s2 is greater than the predetermined threshold sv. If it is greater than the threshold sv (step S3040: "Yes"), verification of the title template ends. Otherwise, if it is not greater than the threshold (step S3040: "No"), information indicating that the template needs to be regenerated is returned at step S3050.
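The verification flow of Fig. 6 reduces to one similarity comparison, sketched below. Whitespace word segmentation, the threshold value 0.8 used in the example, the function name `needs_regeneration`, and the treatment of a missing extraction result as a failed verification are all assumptions of this sketch.

```python
def similarity(rss_text, content):
    words = rss_text.split()
    return sum(w in content for w in words) / len(words) if words else 0.0


def needs_regeneration(rss_title_text, extracted_title, threshold):
    """Return True when the title template should be regenerated (S3050)."""
    if extracted_title is None:
        # Assumption: nothing was extracted, so the template no longer fits.
        return True
    s2 = similarity(rss_title_text, extracted_title)    # S3030
    return not s2 > threshold                           # S3040


# The template still points at the headline: no regeneration needed.
print(needs_regeneration("Big match tonight", "Big match tonight", 0.8))
# After a site redesign the template hits the navigation bar instead.
print(needs_regeneration("Big match tonight", "Home Sports", 0.8))
```

The same check applies unchanged to the body-text and category templates by passing the corresponding RSS text and extracted content.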

As described above, because the web page subject content extraction method according to this embodiment uses RSS information to generate the web page template, the accuracy of template generation is improved.

Moreover, because the web page subject content extraction method according to this embodiment uses a small number of web pages and the corresponding RSS information to generate the web page template and then uses this template to extract the subject content of all target web pages, the efficiency of web page subject content extraction is improved. Furthermore, the extracted subject content can be broken down into title, body text, category and so on; this fine extraction granularity allows the subject content of a web page to be extracted more precisely.

Moreover, because the web page subject content extraction method according to this embodiment needs only a small number of web pages (in the return for "blog") and RSS information, it can be implemented on resource-constrained devices such as mobile terminals.

Moreover, the web page subject content extraction method according to this embodiment obtains not only the meaningful text information but also related multimedia content such as pictures and video; conversely, multimedia files unrelated to the body text, such as advertisement pictures, can be filtered out.

Moreover, the web page subject content extraction method according to this embodiment can detect template changes in real time and correct them promptly when the template changes, and therefore has an adaptive mechanism.

In addition, various changes in form and detail may be made to the web page subject content extraction method of this embodiment without departing from the spirit and scope of the present invention as defined by the claims.

For example, although in the web page subject content extraction method of this embodiment the web page template is verified after the content of each target web page is extracted, the present invention is not limited to this; the web page template may instead be verified after every predetermined number of target web pages have been extracted.

As another example, although the web page template in this embodiment is a title template, a body-text template or a category template, the present invention is not limited to these; the web page template may also be another template, such as an author template.

Industrial applicability

The method for extracting web page subject content of the present invention is applicable to extracting the subject content of web pages that use RSS information.

Claims

1. A web page subject content extraction method, comprising the steps of:

selecting the latest RSS information and its corresponding web page from a Really Simple Syndication (RSS) file;

searching for the position of the RSS information in the tree structure (DOM Tree) of its corresponding web page, and taking the information of that position as a web page template; and

using the web page template to extract the subject content of a plurality of target web pages.

2. The web page subject content extraction method according to claim 1, characterized in that a first node is searched for by traversing the tree structure of the web page level by level and is taken as the position of the RSS information in the tree structure, the XPath of the first node being the information of that position;

among all nodes of the tree structure, the similarity between the text of the first node and the text information of the RSS information is the largest and is greater than a predetermined threshold.

3. The web page subject content extraction method according to claim 1, characterized in that a first node is searched for by traversing the tree structure of the web page level by level, wherein, among all nodes of the tree structure, the similarity between the text of the first node and the RSS text information is the largest and is greater than a predetermined threshold;

the descendant nodes of the first node are then traversed to search for a second node, which is taken as the position of the RSS information in the tree structure, the XPath of the second node being the information of that position;

the similarity between the text of the second node and the text information of the RSS information is greater than the threshold, and the similarities between the child nodes of the second node and the RSS text information are all less than the threshold.

4. The web page subject content extraction method according to claim 1, characterized by further comprising the step of:

each time subject content has been extracted from a predetermined number of target web pages among the plurality of target web pages, regenerating the web page template and continuing the extraction of the subject content of the plurality of target web pages.

5. The web page subject content extraction method according to claim 1, characterized in that the web page template is used to extract the subject content of the latest target web page among the target web pages;

the similarity between the RSS information corresponding to the latest target web page and the subject content of the latest target web page is calculated;

if the similarity is greater than a predetermined threshold, the web page template needs to be regenerated.

6. The web page subject content extraction method according to claim 1, characterized in that the RSS information is title (Title) RSS information, body-text (Description) RSS information or category (Category) RSS information;

the web page template is a title template, a body-text template or a category template;

the subject content is title content, body-text content or category content.