CN102236713A

CN102236713A - Digital television interaction service page information extraction method and device

Info

Publication number: CN102236713A
Application number: CN2011101868253A
Authority: CN
Inventors: 林格; 张洁; 颜权
Original assignee: GUANGDONG XINGHAI DIGITAL HOME INDUSTRY TECHNOLOGY RESEARCH INSTITUTE Co Ltd; National Sun Yat Sen University
Current assignee: GUANGDONG XINGHAI DIGITAL HOME INDUSTRY TECHNOLOGY RESEARCH INSTITUTE Co Ltd; Sun Yat Sen University; National Sun Yat Sen University
Priority date: 2011-07-05
Filing date: 2011-07-05
Publication date: 2011-11-09

Abstract

The embodiments of the invention disclose an information extraction method and device for digital television interactive service pages. The method comprises: acquiring web pages and rewriting them to obtain Extensible HyperText Markup Language (XHTML) documents; building a Document Object Model (DOM) tree from each XHTML document; clustering the acquired web pages according to their DOM trees; obtaining the page template corresponding to the web pages of each cluster; and performing information extraction according to the page template to obtain the extracted detailed information. The method and device can speed up the acquisition of key information from digital television interactive service pages and reduce the amount of page data that has to be processed.

Description

Information extraction method and device for digital television interactive service pages

Technical field

The present invention relates to the field of digital television technology, and in particular to an information extraction method and device for digital television interactive service pages.

Background art

With the rapid development of the Internet and digital television, interactive service pages have become a huge and complex information warehouse. How to extract information quickly from a massive number of interactive service pages, and thereby improve the efficiency with which people obtain information, has become increasingly important. At present, most interactive service pages are dynamic web pages. They are usually generated from a website's back-end database through a common template and therefore have very similar page structures; the result pages returned by a search engine and the product pages of an online store are typical examples. Such pages tend to be numerous and rich in content, so extracting information from them is valuable. At the same time, they contain little free text, are highly structured, and contain a large amount of fixed text.

In the prior art, interactive service pages are non-standardized and numerous, and each contains a large amount of data. Retrieval therefore has to process a large amount of data, which wastes resources and makes it impossible to retrieve the key data of an interactive service page quickly.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art by providing an information extraction method and device for digital television interactive service pages that can quickly retrieve the key data of such pages.

To solve the above problem, the present invention proposes an information extraction method for digital television interactive service pages, the method comprising:

Acquiring web pages and rewriting them to obtain Extensible HyperText Markup Language (XHTML) documents;

Building a Document Object Model (DOM) tree from the XHTML document;

Clustering the collected web pages according to the DOM tree;

Obtaining the page template corresponding to the web pages of the same class after clustering;

Performing information extraction according to the page template and obtaining the extracted detailed information.

Preferably, the step of building the DOM tree from the XHTML document comprises:

Finding all start tags in the XHTML document and storing the names of the found start tags in a tag table;

Determining, one by one, whether an end tag exists that corresponds to a given start tag in the tag table;

If so, storing the content between that end tag and its corresponding start tag in the tag table;

If not, deleting the start tag;

Building the DOM tree from the tag table, which contains the start tags and the content between each start tag and its corresponding end tag.

Preferably, the step of performing information extraction according to the page template and obtaining the extracted detailed information comprises:

Performing information extraction according to the page template by traversing the DOM tree;

Obtaining the extracted detailed information;

Storing the detailed information.

Preferably, the step of storing the detailed information comprises:

Storing the detailed information in structured form.

Preferably, the step of storing the detailed information in structured form comprises:

Storing the detailed information as an Extensible Markup Language (XML) document.

Correspondingly, the embodiments of the invention also disclose an information extraction device for digital television interactive service pages, the information extraction device comprising:

A document acquisition module, configured to acquire web pages and rewrite them to obtain Extensible HyperText Markup Language (XHTML) documents;

A building module, configured to build a Document Object Model (DOM) tree from the XHTML document obtained by the document acquisition module;

A clustering module, configured to cluster the collected web pages according to the DOM tree built by the building module;

A template acquisition module, configured to obtain the page template corresponding to the web pages of the same class after clustering by the clustering module;

An extraction module, configured to perform information extraction according to the page template obtained by the template acquisition module and to obtain the extracted detailed information.

Preferably, the building module comprises:

A search unit, configured to find all start tags in the XHTML document and store the names of the found start tags in a tag table;

A judging unit, configured to determine, one by one, whether an end tag exists that corresponds to a given start tag in the tag table;

A first storage unit, configured to store the content between the end tag and its corresponding start tag in the tag table when the judging unit's result is yes;

A deletion unit, configured to delete the start tag when the judging unit's result is no;

A building unit, configured to build the DOM tree from the tag table, which contains the start tags and the content between each start tag and its corresponding end tag.

Preferably, the extraction module comprises:

An extraction unit, configured to perform information extraction according to the page template by traversing the DOM tree and to obtain the extracted detailed information;

A second storage unit, configured to store the detailed information extracted by the extraction unit.

Implementing the information extraction method and device of the embodiments of the invention can speed up the acquisition of key information from digital television interactive service pages and can also reduce the amount of page data that has to be processed.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow diagram of the information extraction method for digital television interactive service pages according to an embodiment of the invention;

Fig. 2 is a schematic diagram of the principle of the information extraction method for digital television interactive service pages according to an embodiment of the invention;

Fig. 3 is a detailed flow diagram of the process of building the DOM tree from the XHTML document in the method embodiment of the invention;

Fig. 4 is a detailed flow diagram of the process of obtaining the page template corresponding to the web pages of the same class after clustering in the method embodiment of the invention;

Fig. 5 is a structural diagram of the information extraction device for digital television interactive service pages according to an embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.

In view of the characteristics of interactive service pages, the present invention proposes an information extraction method and device for digital television interactive service pages based on the Document Object Model (DOM). DOM is a standard specification, established by the W3C, for the in-memory tree structure of Extensible Markup Language (XML) documents; the elements of an XML document can be represented by nodes in the DOM tree structure. It is a cross-platform document object model that can be adapted to different programming languages, and HyperText Markup Language (HTML) documents can also be described with DOM. Processing with the DOM model has several advantages: (1) because the tree persists in memory, any of its nodes can be modified, so an application can change both data and structure; (2) the tree can be navigated up and down at any time, which makes it simple to use, and documents can easily be created and their structure navigated; (3) the DOM standard greatly simplifies the processing of structured documents in a programming environment.

The principle of the present invention is to tidy the insufficiently standardized HTML documents of interactive service pages into well-formed Extensible HyperText Markup Language (XHTML) documents, parse each XHTML document into a DOM tree, and then extract information and search for web pages of similar structure according to the DOM tree. The extraction results are represented as XML documents and stored in structured form.

Fig. 1 is a flow diagram of the information extraction method for digital television interactive service pages according to an embodiment of the invention. As shown in Fig. 1, the method comprises:

S101: acquire web pages and rewrite them to obtain XHTML documents;

S102: build a DOM tree from the XHTML document;

S103: cluster the collected web pages according to the DOM tree;

S104: obtain the page template corresponding to the web pages of the same class after clustering;

S105: perform information extraction according to the page template and obtain the extracted detailed information.

Fig. 2 is a schematic diagram of the principle of the information extraction method. The method is described further below with reference to Fig. 1 and Fig. 2.

In a specific implementation, S101 acquires and tidies the web pages. The web pages found by following site links are of two kinds: pages that contain the required data, and pages of hyperlinks to target pages that contain the required data. The navigation rules for a target website are written after analyzing the site, taking the characteristics of the required data into account. Tidying maps the data source to XHTML, and can be achieved in three ways: (1) adding the end mark "/" to unpaired tags, for example changing <br> to <br/>; (2) adding quotation marks to all attribute values, for example changing <a href=http://www.w3c.org> to <a href="http://www.w3c.org">; (3) changing every "\" in a Uniform Resource Locator (URL) to "/".
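The three tidying rules above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `to_xhtml_like`, the list of void tags and the choice of `href`/`src` as URL-bearing attributes are assumptions, and a regular-expression pass like this does not handle every malformed page.

```python
import re

# Assumed, partial list of tags that have no end tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
# Assumed set of attributes whose values are URLs.
URL_ATTRS = {"href", "src"}

# A start tag: name, raw attribute text, optional trailing "/".
TAG_RE = re.compile(r"<([A-Za-z][\w:-]*)([^<>]*?)\s*(/?)>")
# One attribute: leading space, name, then a quoted or unquoted value.
ATTR_RE = re.compile(r"""(\s+)([\w:-]+)\s*=\s*("[^"]*"|'[^']*'|[^\s"'<>]+)""")


def _fix_attr(match):
    space, name, value = match.group(1), match.group(2), match.group(3)
    if value[0] in "\"'":
        value = value[1:-1]                    # strip the existing quotes
    if name.lower() in URL_ATTRS:
        value = value.replace("\\", "/")       # rule 3: "\" becomes "/" in URLs
    value = value.replace('"', "&quot;")       # keep the re-quoted value valid
    return '%s%s="%s"' % (space, name, value)  # rule 2: always quote the value


def _fix_tag(match):
    name, attrs, slash = match.group(1), match.group(2), match.group(3)
    attrs = ATTR_RE.sub(_fix_attr, attrs)
    if name.lower() in VOID_TAGS:
        slash = "/"                            # rule 1: close unpaired tags
    return "<%s%s%s>" % (name, attrs, slash)


def to_xhtml_like(html):
    """Apply the three tidying rules to every start tag in the page."""
    return TAG_RE.sub(_fix_tag, html)


print(to_xhtml_like("<a href=http://www.w3c.org>W3C</a><br>"))
```

End tags and comments are left untouched because the tag pattern only matches a "<" followed directly by a letter.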

In S102, the data source is parsed: a DOM tree is built from the converted XHTML document, with the elements of the XHTML document mapped to nodes in the DOM tree.

In S103, the collected web pages are clustered by similarity according to their DOM trees. The DOM tree is used to judge whether a collected web page is similar in structure to a sample, and hence whether an existing pattern can be used to extract the information in that page. Within a given set of web pages, pages with the same degree of similarity can be regarded as produced by the same template; that is, such pages have similar DOM tree structures. The clustering in S103 therefore divides the set of web pages into k classes, and S104 then extracts the template of each class in turn.
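The text does not specify the similarity measure or the clustering algorithm. As one possible sketch, the Python code below describes each page by the set of root-to-node tag paths in its DOM tree, compares pages with the Jaccard coefficient, and groups them greedily against a threshold. The function names, the Jaccard measure and the threshold of 0.8 are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def tag_paths(xhtml):
    """Return the set of root-to-node tag paths of a well-formed page."""
    root = ET.fromstring(xhtml)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def similarity(a, b):
    """Jaccard coefficient of two path sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages, threshold=0.8):
    """Greedy clustering: a page joins the first cluster whose
    representative (its first member) is similar enough."""
    clusters, representatives = [], []
    for index, page in enumerate(pages):
        paths = tag_paths(page)
        for cluster, rep in zip(clusters, representatives):
            if similarity(paths, rep) >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
            representatives.append(paths)
    return clusters


pages = [
    "<html><body><h1>A</h1><p>x</p></body></html>",
    "<html><body><h1>B</h1><p>y</p></body></html>",
    "<html><body><table><tr><td>1</td></tr></table></body></html>",
]
print(cluster_pages(pages))
```

The number of clusters k is not fixed in advance here; it falls out of the threshold, which is one of several reasonable design choices.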

In S104, the page template corresponding to the web pages of the same class is extracted. Here the page template is the DOM tree common to all pages of a class, that is, the intersection of all their DOM trees. The template is obtained by comparing the DOM trees parsed from the HTML of two web pages with similar structure.
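One way to picture "the intersection of all DOM trees" is to encode each tree as a set of positional paths and intersect the sets. The Python sketch below does this under stated assumptions: the path encoding (tag name plus its index among same-named siblings) and the function names are hypothetical, and the text itself does not say how nodes of different trees are matched.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def indexed_paths(xhtml):
    """Encode a DOM tree as paths such as /html[1]/body[1]/p[2]."""
    root = ET.fromstring(xhtml)
    paths = set()

    def walk(node, path):
        paths.add(path)
        seen = Counter()  # counts same-named siblings to build the index
        for child in node:
            seen[child.tag] += 1
            walk(child, "%s/%s[%d]" % (path, child.tag, seen[child.tag]))

    walk(root, "/%s[1]" % root.tag)
    return paths


def page_template(pages):
    """Template of a non-empty cluster: paths present in every page."""
    return set.intersection(*[indexed_paths(page) for page in pages])


same_class = [
    "<html><body><h1>A</h1><p>x</p><p>extra</p></body></html>",
    "<html><body><h1>B</h1><p>y</p></body></html>",
]
for path in sorted(page_template(same_class)):
    print(path)
```

The second paragraph of the first page does not appear in the template because the other page has no node at that position.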

In S105, information is extracted according to the page template. An XSLT document is written using the XPath expressions obtained by inductive learning. The nodes of the DOM are transformed according to this XSLT document to generate an XML document that keeps only the nodes specified by the XPath expressions, which completes the information extraction.
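The effect of this step, keeping only the nodes that the XPath expressions select and emitting them as XML, can be illustrated without an XSLT processor. The Python sketch below swaps in the limited XPath support of the standard library's `xml.etree.ElementTree` in place of XSLT; the field names, the paths and the `record` root element are hypothetical, and in the described method the paths would come from inductive learning rather than being written by hand.

```python
import xml.etree.ElementTree as ET


def extract_to_xml(xhtml, field_paths):
    """Build an XML record holding only the nodes selected by each path."""
    page = ET.fromstring(xhtml)
    record = ET.Element("record")
    for field, path in field_paths.items():
        for node in page.findall(path):
            item = ET.SubElement(record, field)
            # Keep the text of the selected node and its descendants.
            item.text = "".join(node.itertext()).strip()
    return ET.tostring(record, encoding="unicode")


page = ('<html><body><h1>Evening News</h1>'
        '<div class="time">19:00</div><p>advert</p></body></html>')
paths = {
    "title": "./body/h1",                # hypothetical learned path
    "time": "./body/div[@class='time']", # hypothetical learned path
}
print(extract_to_xml(page, paths))
```

The advert paragraph is dropped because no path selects it, which mirrors the idea that the output document contains only the nodes named by the XPath expressions.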

Implementing the information extraction method of the embodiments of the invention can speed up the acquisition of key information from digital television interactive service pages and can also reduce the amount of page data that has to be processed.

Further, S102 may also comprise:

Finding all start tags in the XHTML document and storing the names of the found start tags in a tag table;

Determining, one by one, whether an end tag exists that corresponds to a given start tag in the tag table;

If so, storing the content between that end tag and its corresponding start tag in the tag table;

If not, deleting the start tag;

Building the DOM tree from the tag table, which contains the start tags and the content between each start tag and its corresponding end tag.

S102 of the method is described in further detail below with reference to Fig. 3.

As shown in Fig. 3, the process of building the DOM tree in the method embodiment comprises:

S1021: find all start tags in the web page and store their names in the tag table;

S1022: examine each tag in the web page one by one, and determine whether there is an end tag or comment tag corresponding to a start tag that has been found; if not, go to S1024; if so, go to S1023;

S1023: store the content between this end tag and its start tag in the tag table; this content is a leaf node;

S1024: delete this tag;

S1025: determine whether all start tags have been processed; if so, finish; if not, return to S1022.

In this way, once every tag in the web page has been processed, a tag table has been built from the content between each start tag and its corresponding end tag, and the whole DOM tree can be decomposed into n subtrees stored in this tag table.
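The tag-table construction in S1021 to S1025 can be approximated with a single pass and a stack of open start tags. The Python sketch below is a simplification under assumptions: the function name, the tuple layout of the table and the use of a stack are not from the text, comment tags are ignored, and an unmatched start tag is simply left out of the table rather than removed from the page text.

```python
import re

# Matches start tags, end tags and self-closing tags.
TOKEN_RE = re.compile(r"<(/?)([A-Za-z][\w:-]*)[^<>]*?(/?)>")


def build_tag_table(xhtml):
    """Return (tag name, inner content) for every start tag that has a
    matching end tag; start tags without one are dropped."""
    table = []
    open_tags = []  # stack of (name, offset where the content begins)
    for match in TOKEN_RE.finditer(xhtml):
        is_end = bool(match.group(1))
        name = match.group(2).lower()
        self_closing = bool(match.group(3))
        if not is_end and self_closing:
            table.append((name, ""))            # e.g. <br/>: empty content
        elif not is_end:
            open_tags.append((name, match.end()))
        else:
            # Find the nearest open start tag with this name; tags opened
            # after it never received an end tag and are discarded.
            for i in range(len(open_tags) - 1, -1, -1):
                if open_tags[i][0] == name:
                    start = open_tags[i][1]
                    del open_tags[i:]
                    table.append((name, xhtml[start:match.start()]))
                    break
    return table


table = build_tag_table("<div><p>hi</p><span>lost<b>x</b></div>")
for name, content in table:
    print(name, repr(content))
```

In this example the `span` start tag has no end tag, so it never enters the table, while `p`, `b` and `div` each contribute one entry that corresponds to a subtree.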

Further, S105 may comprise:

Performing information extraction according to the page template by traversing the DOM tree;

Obtaining the extracted detailed information;

Storing the detailed information.

In the embodiments of the invention, the detailed information may be stored in structured form; further, it may be stored as an XML document.

Information extraction is a depth-first traversal of the DOM tree, top to bottom and left to right. During the traversal, the extraction rule is tested against the current node, and each semantic item that satisfies the rule is held temporarily. Once all the semantic items of an object have been collected, they are assembled and then stored in the database.

As shown in Fig. 4, S105 may comprise:

S1051: set the root node of the DOM tree as the current node;

S1052: obtain the DOM path of the current node and compare it with the path rule in the current rule to check whether they match; if not, go to S1057; if so, go to S1053;

S1053: determine whether the tags of the nodes immediately before and after the current node match the left and right adjacent tags of the current rule; if not, go to S1057; if so, go to S1054;

S1054: determine whether the features of the current node match the features specified in the rule; if not, go to S1057; if so, go to S1055;

S1055: take the information out of the current node and store it in the cache;

S1056: fetch the extraction rule of the next semantic item from the knowledge base; if this succeeds, use it as the current rule; otherwise the last semantic item has been extracted, so the semantic items are assembled into an object and stored in the database, and the extraction rule of the first semantic item is then taken from the knowledge base as the current rule;

S1057: determine whether the whole DOM tree has been traversed; if not, return to S1052; if so, end the extraction flow.
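The flow S1051 to S1057 can be sketched as a pre-order traversal that tests one rule at a time. The Python code below is an illustration under assumptions: the `Rule` fields, the use of the `class` attribute as the node "feature", and a plain list standing in for both the knowledge base and the database are all hypothetical, and the rule list is assumed to be non-empty.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    field: str                       # name of the semantic item
    path: str                        # DOM path, e.g. /html/body/ul/li/span
    prev_tag: Optional[str] = None   # required left neighbour, None = any
    next_tag: Optional[str] = None   # required right neighbour, None = any
    css_class: Optional[str] = None  # required feature, None = any


def extract_objects(xhtml, rules):
    """Traverse the tree depth-first and assemble one object each time
    every rule in the (non-empty) list has matched in order."""
    root = ET.fromstring(xhtml)
    objects, cache, current = [], {}, 0

    def visit(node, path, prev_tag, next_tag):
        nonlocal cache, current
        rule = rules[current]
        if (path == rule.path                                      # S1052
                and rule.prev_tag in (None, prev_tag)              # S1053
                and rule.next_tag in (None, next_tag)
                and rule.css_class in (None, node.get("class"))):  # S1054
            cache[rule.field] = (node.text or "").strip()          # S1055
            current += 1                                           # S1056
            if current == len(rules):   # last semantic item extracted
                objects.append(cache)
                cache, current = {}, 0
        children = list(node)
        for i, child in enumerate(children):                       # S1057
            visit(child, path + "/" + child.tag,
                  children[i - 1].tag if i > 0 else None,
                  children[i + 1].tag if i + 1 < len(children) else None)

    visit(root, "/" + root.tag, None, None)                        # S1051
    return objects


page = ('<html><body><ul>'
        '<li><span class="name">News</span><span class="time">19:00</span></li>'
        '<li><span class="name">Film</span><span class="time">21:00</span></li>'
        '</ul></body></html>')
rules = [
    Rule("name", "/html/body/ul/li/span", next_tag="span", css_class="name"),
    Rule("time", "/html/body/ul/li/span", prev_tag="span", css_class="time"),
]
print(extract_objects(page, rules))
```

Each list item yields one assembled object, because the rule pointer wraps back to the first rule after the last semantic item of an object has been cached.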

Correspondingly, the embodiments of the invention also provide an information extraction device for digital television interactive service pages. As shown in Fig. 5, the information extraction device comprises:

A document acquisition module 50, configured to acquire web pages and rewrite them to obtain XHTML documents;

A building module 51, configured to build a DOM tree from the XHTML document obtained by the document acquisition module 50;

A clustering module 52, configured to cluster the collected web pages according to the DOM tree built by the building module 51;

A template acquisition module 53, configured to obtain the page template corresponding to the web pages of the same class after clustering by the clustering module 52;

An extraction module 54, configured to perform information extraction according to the page template obtained by the template acquisition module 53 and to obtain the extracted detailed information.

For the implementation principle and process of the information extraction device, refer to the related description in the method embodiment; they are not repeated here.

Further, the building module 51 may comprise:

A search unit, configured to find all start tags in the XHTML document and store the names of the found start tags in a tag table;

A judging unit, configured to determine, one by one, whether an end tag exists that corresponds to a given start tag in the tag table;

A first storage unit, configured to store the content between the end tag and its corresponding start tag in the tag table when the judging unit's result is yes;

A deletion unit, configured to delete the start tag when the judging unit's result is no;

A building unit, configured to build the DOM tree from the tag table, which contains the start tags and the content between each start tag and its corresponding end tag.

Further, the extraction module 54 comprises:

An extraction unit, configured to perform information extraction according to the page template by traversing the DOM tree and to obtain the extracted detailed information;

A second storage unit, configured to store the detailed information extracted by the extraction unit.

In a specific implementation, the first storage unit and the second storage unit may be merged, so that a single storage unit provides the storage functions of both.

Implementing the information extraction device of the embodiments of the invention can speed up the acquisition of key information from digital television interactive service pages and can also reduce the amount of page data that has to be processed.

It should be noted that the information exchange between the modules and units of the above device, and their execution processes, are based on the same concept as the method embodiment; for details, refer to the description of the method embodiment, which is not repeated here.

A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.

The information extraction method and device for digital television interactive service pages provided by the embodiments of the invention have been described in detail above. Specific examples have been used to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and scope of application. In summary, the content of this description should not be construed as limiting the invention.

Claims

1. An information extraction method for digital television interactive service pages, characterized in that the method comprises:

Clustering the collected web pages according to the DOM tree;

Obtaining the page template corresponding to the web pages of the same class after clustering;

2. The information extraction method for digital television interactive service pages according to claim 1, characterized in that the step of building the Document Object Model (DOM) tree from the XHTML document comprises:

If not, deleting the start tag;

3. The information extraction method for digital television interactive service pages according to claim 1 or 2, characterized in that the step of performing information extraction according to the page template and obtaining the extracted detailed information comprises:

Obtaining the extracted detailed information;

Storing the detailed information.

4. The information extraction method for digital television interactive service pages according to claim 3, characterized in that the step of storing the detailed information comprises:

Storing the detailed information in structured form.

5. The information extraction method for digital television interactive service pages according to claim 4, characterized in that the step of storing the detailed information in structured form comprises:

Storing the detailed information as an Extensible Markup Language (XML) document.

6. An information extraction device for digital television interactive service pages, characterized in that the information extraction device comprises:

7. The information extraction device for digital television interactive service pages according to claim 6, characterized in that the building module comprises:

A judging unit, configured to determine, one by one, whether an end tag exists that corresponds to a given start tag in the tag table;

8. The information extraction device for digital television interactive service pages according to claim 6 or 7, characterized in that the extraction module comprises: