CN102236713A - Digital television interaction service page information extraction method and device - Google Patents

Digital television interaction service page information extraction method and device

Info

Publication number
CN102236713A
Authority
CN
China
Prior art keywords
digital television
label
service page
details
dom tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011101868253A
Other languages
Chinese (zh)
Inventor
林格
张洁
颜权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGDONG XINGHAI DIGITAL HOME INDUSTRY TECHNOLOGY RESEARCH INSTITUTE Co Ltd
Sun Yat Sen University
National Sun Yat Sen University
Original Assignee
GUANGDONG XINGHAI DIGITAL HOME INDUSTRY TECHNOLOGY RESEARCH INSTITUTE Co Ltd
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGDONG XINGHAI DIGITAL HOME INDUSTRY TECHNOLOGY RESEARCH INSTITUTE Co Ltd and National Sun Yat Sen University
Priority to CN2011101868253A
Publication of CN102236713A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

An embodiment of the invention discloses a method and a device for extracting information from digital television interactive service pages. The method comprises: acquiring web pages and rewriting them to obtain Extensible HyperText Markup Language (XHTML) documents; building a Document Object Model (DOM) tree from each XHTML document; clustering the acquired web pages according to their DOM trees; obtaining the page template corresponding to the web pages of each cluster; and extracting information according to the page template to obtain the extracted detailed information. The method and device can increase the speed at which key information is obtained from digital television interactive service pages and can reduce the amount of page data that has to be processed.

Description

Information extraction method and device for digital television interactive service pages
Technical field
The present invention relates to the field of digital television technology, and in particular to an information extraction method and device for digital television interactive service pages.
Background art
With the rapid development of the Internet and digital television, interactive service pages have become a huge and complex information repository. Extracting information quickly from this mass of pages, and thereby making it more efficient for people to obtain information, is becoming increasingly important. At present, most interactive service pages are dynamic web pages. They are usually generated from a website's back-end database through a common template and therefore have very similar page structures; the results returned by a search engine and the product pages of an online store are typical examples. Such pages are usually numerous and rich in content, so extracting information from them is valuable. They also contain little free text, are highly structured, and include a large amount of fixed text.
In the prior art, interactive service pages are non-standard and numerous and contain a large amount of data. A large amount of data must be processed during retrieval, which wastes resources, and the key data in the pages cannot be retrieved quickly.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing an information extraction method and device for digital television interactive service pages that can quickly retrieve the key data of such pages.
To solve the above problem, the present invention proposes an information extraction method for digital television interactive service pages, the method comprising:
Acquiring web pages and rewriting the web pages to obtain Extensible HyperText Markup Language (XHTML) documents;
Building a Document Object Model (DOM) tree from each XHTML document;
Clustering the acquired web pages according to the DOM trees;
Obtaining the page template corresponding to the web pages of the same cluster;
Extracting information according to the page template and obtaining the extracted detailed information.
Preferably, the step of building the DOM tree from the XHTML document comprises:
Finding all start tags in the XHTML document and storing the names of the start tags found in a tag table;
Determining, one by one, whether an end tag exists that corresponds to each start tag in the tag table;
If so, storing the content between the end tag and its corresponding start tag in the tag table;
If not, deleting the start tag;
Building the DOM tree from the tag table, which contains the start tags and the content between each start tag and its corresponding end tag.
Preferably, the step of extracting information according to the page template and obtaining the extracted detailed information comprises:
Extracting information according to the page template by traversing the DOM tree;
Obtaining the extracted detailed information;
Storing the detailed information.
Preferably, the step of storing the detailed information comprises:
Storing the detailed information in a structured form.
Preferably, the step of storing the detailed information in a structured form comprises:
Storing the detailed information as an Extensible Markup Language (XML) document.
Correspondingly, an embodiment of the invention also discloses an information extraction device for digital television interactive service pages, the device comprising:
A document acquisition module, configured to acquire web pages and rewrite them to obtain Extensible HyperText Markup Language (XHTML) documents;
A building module, configured to build a Document Object Model (DOM) tree from each XHTML document obtained by the document acquisition module;
A clustering module, configured to cluster the acquired web pages according to the DOM trees built by the building module;
A template acquisition module, configured to obtain the page template corresponding to the web pages of the same cluster produced by the clustering module;
An extraction module, configured to extract information according to the page template obtained by the template acquisition module and to obtain the extracted detailed information.
Preferably, the building module comprises:
A search unit, configured to find all start tags in the XHTML document and store the names of the start tags found in a tag table;
A judging unit, configured to determine, one by one, whether an end tag exists that corresponds to each start tag in the tag table;
A first storage unit, configured to store the content between the end tag and its corresponding start tag in the tag table when the result of the judging unit is yes;
A deletion unit, configured to delete the start tag when the result of the judging unit is no;
A building unit, configured to build the DOM tree from the tag table, which contains the start tags and the content between each start tag and its corresponding end tag.
Preferably, the extraction module comprises:
An extraction unit, configured to extract information according to the page template by traversing the DOM tree and to obtain the extracted detailed information;
A second storage unit, configured to store the detailed information extracted by the extraction unit.
Implementing the information extraction method and device of the embodiments of the invention can increase the speed at which key information is obtained from digital television interactive service pages and can also reduce the amount of page data that has to be processed.
Description of drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the information extraction method for digital television interactive service pages according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the principle of the information extraction method for digital television interactive service pages according to an embodiment of the invention;
Fig. 3 is a detailed flowchart of building the DOM tree from the XHTML document in a method embodiment of the invention;
Fig. 4 is a detailed flowchart of obtaining the page template corresponding to the web pages of the same cluster in a method embodiment of the invention;
Fig. 5 is a schematic structural diagram of the information extraction device for digital television interactive service pages according to an embodiment of the invention.
Embodiment
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
In view of the characteristics of interactive service pages, the present invention proposes a method and device for extracting information from digital television interactive service pages based on the Document Object Model (DOM). The DOM is a standard specification provided by the W3C for building an in-memory tree structure of an Extensible Markup Language (XML) document; the elements of an XML document can be represented by nodes in the DOM tree. It is a cross-platform document object model that can be used from different programming languages, and HyperText Markup Language (HTML) documents can also be described with the DOM. Processing with the DOM model has several advantages: (1) because the tree persists in memory, any node can be modified, so an application can change both the data and the structure; (2) the tree can be navigated up and down at any time, which makes it simple to use, and documents can be created and their structure navigated easily; (3) the DOM standard greatly simplifies the handling of structured documents in a programming environment.
The principle of the invention is to tidy the insufficiently standardized HTML documents of interactive service pages into well-formed Extensible HyperText Markup Language (XHTML) documents, parse each XHTML document into a DOM tree, and then extract information and search for web pages of similar structure on the basis of the DOM tree. The extraction results are represented as XML documents and stored in a structured form.
Fig. 1 is a schematic flowchart of the information extraction method for digital television interactive service pages according to an embodiment of the invention. As shown in Fig. 1, the method comprises:
S101: acquire web pages and rewrite them to obtain XHTML documents;
S102: build a DOM tree from each XHTML document;
S103: cluster the acquired web pages according to the DOM trees;
S104: obtain the page template corresponding to the web pages of the same cluster;
S105: extract information according to the page template and obtain the extracted detailed information.
Fig. 2 is a schematic diagram of the principle of the information extraction method according to an embodiment of the invention. The method is described further below with reference to Fig. 1 and Fig. 2.
In a specific implementation, S101 acquires and tidies the web pages. The web pages found by following site links are of two kinds: pages that contain the required data, and pages that contain hyperlinks to target pages containing the required data. The navigation rules for the target site are written after analyzing the website, taking the characteristics of the required data into account. Tidying means mapping the data source to XHTML, which can be done in three ways: (1) add the closing "/" to unpaired tags, for example, <br> becomes <br/>; (2) put quotation marks around all attribute values, for example, <a href=http://www.w3c.org> becomes <a href="http://www.w3c.org">; (3) replace every "\" in a uniform resource locator (URL) with "/".
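The three tidying rules of S101 can be illustrated with a minimal Python sketch. It is only an approximation under stated assumptions: the function names, the list of unpaired tags in `VOID_TAGS`, and the restriction of rule (3) to `href` and `src` attributes are all hypothetical choices made for the example and are not taken from the patent. Regular expressions are used for brevity; they will mishandle edge cases such as an "=" inside a quoted attribute value.

```python
import re

# Hypothetical list of tags that have no end tag (an assumption for this sketch).
VOID_TAGS = ("br", "hr", "img", "input", "meta", "link")


def quote_attributes(html: str) -> str:
    """Rule 2: wrap unquoted attribute values in double quotes."""
    def fix_tag(match):
        # Only look at attribute=value pairs whose value is not already quoted.
        return re.sub(r'(\s[\w:-]+)=([^\s"\'<>]+)', r'\1="\2"', match.group(0))
    return re.sub(r"<[A-Za-z][^<>]*>", fix_tag, html)


def fix_url_slashes(html: str) -> str:
    """Rule 3: replace backslashes with forward slashes in href/src values."""
    def fix(match):
        return match.group(1) + match.group(2).replace("\\", "/") + '"'
    return re.sub(r'((?:href|src)=")([^"]*)"', fix, html, flags=re.IGNORECASE)


def close_void_tags(html: str) -> str:
    """Rule 1: turn an unpaired tag such as <br> into <br/>."""
    pattern = re.compile(
        r"<(%s)\b([^<>]*?)\s*/?>" % "|".join(VOID_TAGS), re.IGNORECASE
    )
    return pattern.sub(lambda m: "<%s%s/>" % (m.group(1), m.group(2)), html)


def to_xhtml(html: str) -> str:
    """Apply the three rules; attributes are quoted before tags are closed."""
    return close_void_tags(fix_url_slashes(quote_attributes(html)))


print(to_xhtml("<a href=http://www.w3c.org>W3C</a><br>"))
# prints <a href="http://www.w3c.org">W3C</a><br/>
```

Quoting attributes before closing the unpaired tags keeps the added "/" from being absorbed into an unquoted attribute value. A production system would more likely rely on a dedicated HTML tidying tool than on regular expressions.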
In S102, the data source is parsed: a DOM tree is built from the converted XHTML document, and the elements of the XHTML document are mapped to nodes of the DOM tree.
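As an illustration of the element-to-node mapping in S102, the following sketch parses a small XHTML string with Python's standard `xml.dom.minidom` module and lists the tag path of every element node. The sample document and the helper name `element_paths` are assumptions made for the example; the patent does not prescribe a particular parser.

```python
from xml.dom import minidom


def element_paths(node, prefix=""):
    """Yield the tag path of every element node in document order."""
    if node.nodeType == node.ELEMENT_NODE:
        prefix = prefix + "/" + node.tagName
        yield prefix
    for child in node.childNodes:
        yield from element_paths(child, prefix)


# Hypothetical well-formed XHTML, as produced by the tidying step.
xhtml = "<html><body><h1>News</h1><p>Text<br/></p></body></html>"
dom = minidom.parseString(xhtml)
paths = list(element_paths(dom))
for path in paths:
    print(path)
```

Because the parser only accepts well-formed input, this step depends on the tidying of S101 having succeeded.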
In S103, the acquired web pages are clustered by similarity according to their DOM trees. The DOM tree is used to judge whether an acquired web page is structurally similar to a sample, and hence whether an existing pattern can be used to extract the information in that page. Within a given collection of web pages, pages with the same similarity can be regarded as pages produced by the same template, that is, such pages have similar DOM tree structures. The collection of web pages can therefore be divided into k clusters in S103, and the template of each cluster is then extracted in turn in S104.
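The patent does not state how DOM tree similarity is measured or how clusters are formed. The sketch below shows one possible reading under explicit assumptions: each page is summarized by the set of its root-to-element tag paths, similarity is the Jaccard index of two such sets, and pages are grouped greedily against the first member of each cluster using a hypothetical threshold of 0.6.

```python
import xml.etree.ElementTree as ET


def tag_paths(xhtml: str) -> frozenset:
    """Summarize a page by the set of tag paths in its element tree."""
    paths = set()

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        paths.add(path)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xhtml), "")
    return frozenset(paths)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Assumed similarity measure: shared paths divided by all paths."""
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.6):
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    signatures = [tag_paths(p) for p in pages]
    clusters = []  # each cluster is a list of page indices
    for i, sig in enumerate(signatures):
        for cluster in clusters:
            if jaccard(sig, signatures[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found, start a new one
    return clusters


pages = [
    "<html><body><h1>A</h1><p>x</p></body></html>",
    "<html><body><h1>B</h1><p>y</p><p>z</p></body></html>",
    "<html><body><table><tr><td>1</td></tr></table></body></html>",
]
print(cluster_pages(pages))  # prints [[0, 1], [2]]
```

The number of clusters k is not fixed in advance here; it falls out of the threshold. Any other tree similarity measure or clustering algorithm could be substituted.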
In S104, the page template corresponding to the web pages of the same cluster is extracted. Here, the page template is the DOM tree common to the pages of a cluster, i.e. the intersection of all their DOM trees. The template is obtained by comparing the DOM trees parsed from the HTML of two web pages with similar structure.
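The idea of a template as the intersection of DOM trees can be sketched as follows. As a simplifying assumption, each DOM tree is approximated by the set of its tag paths, so the "intersection" is a set intersection; a faithful tree intersection would also have to account for sibling order and repeated elements. The function names and sample pages are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xhtml: str) -> set:
    """Approximate a DOM tree by the set of its root-to-element tag paths."""
    paths = set()

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        paths.add(path)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xhtml), "")
    return paths


def page_template(pages) -> set:
    """Template of a cluster: the tag paths shared by every page in it."""
    return set.intersection(*(tag_paths(p) for p in pages))


cluster = [
    "<html><body><h1>A</h1><p>x</p></body></html>",
    "<html><body><h1>B</h1><p>y</p><div>ad</div></body></html>",
]
template = page_template(cluster)
print(sorted(template))
```

In this example the `div` that appears in only one page is excluded from the template, while the shared `h1` and `p` positions remain.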
In S105, information is extracted according to the page template. An XSLT document is written using the XPath expressions obtained by inductive learning. The nodes in the DOM can be transformed according to this document to generate an XML document that retains only the nodes specified by the XPath expressions, which completes the information extraction.
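The following sketch illustrates the effect of S105: selecting nodes by path expressions and emitting an XML document that contains only those nodes. It deliberately swaps in a different technique from the one the text names: instead of an XSLT document, it uses the limited XPath subset of Python's standard `xml.etree.ElementTree`. The rule names, the paths, and the `record` element are hypothetical, and the paths are written by hand here rather than learned.

```python
import xml.etree.ElementTree as ET


def extract(xhtml: str, rules: dict) -> str:
    """Build an XML record holding only the nodes selected by the rules."""
    root = ET.fromstring(xhtml)
    record = ET.Element("record")
    for name, xpath in rules.items():
        node = root.find(xpath)  # first node matching the path, or None
        if node is not None:
            item = ET.SubElement(record, name)
            item.text = (node.text or "").strip()
    return ET.tostring(record, encoding="unicode")


# Hypothetical page and hand-written extraction paths.
page = "<html><body><h1>Evening News</h1><p class='time'>20:00</p></body></html>"
rules = {"title": "./body/h1", "time": "./body/p[@class='time']"}
result = extract(page, rules)
print(result)
# prints <record><title>Evening News</title><time>20:00</time></record>
```

The output is itself an XML document, which matches the structured storage format the description proposes for the extracted detailed information.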
Implementing the information extraction method of the embodiments of the invention can increase the speed at which key information is obtained from digital television interactive service pages and can also reduce the amount of page data that has to be processed.
Further, S102 may also comprise:
Finding all start tags in the XHTML document and storing the names of the start tags found in a tag table;
Determining, one by one, whether an end tag exists that corresponds to each start tag in the tag table;
If so, storing the content between the end tag and its corresponding start tag in the tag table;
If not, deleting the start tag;
Building the DOM tree from the tag table, which contains the start tags and the content between each start tag and its corresponding end tag.
S102 of the method is described in further detail below with reference to Fig. 3.
As shown in Fig. 3, the process of building the DOM tree in a method embodiment of the invention comprises:
S1021: find all start tags in the web page and store their names in the tag table;
S1022: examine the tags in the web page one by one, and determine whether there is an end tag or comment tag corresponding to a start tag that was found; if not, go to S1024; if so, go to S1023;
S1023: store the content between this end tag and its start tag in the tag table; this content is a leaf node;
S1024: delete this tag;
S1025: determine whether all start tags have been processed; if so, end; if not, return to S1022.
In this way, once every tag in the web page has been processed, a tag table has been built from the content between each start tag and its corresponding end tag; the whole DOM tree can be decomposed into n subtrees and stored in this tag table.
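The tag-table procedure of S1021 to S1025 can be approximated with a single pass over the tags and a stack of open start tags. This is a sketch under assumptions, not the patented procedure itself: the regular expression, the function name `build_tag_table`, and the representation of the tag table as a list of (name, content) pairs are all hypothetical, and comment tags are simply skipped.

```python
import re

# Matches start tags, end tags, and self-closing tags (comments do not match).
TAG = re.compile(r"<(/?)([A-Za-z][\w:-]*)[^<>]*?(/?)>")


def build_tag_table(xhtml: str):
    """Return (tag_table, dropped).

    tag_table holds (tag name, content between start tag and end tag);
    dropped holds the names of start tags that have no end tag.
    """
    table, dropped, stack = [], [], []
    for m in TAG.finditer(xhtml):
        closing, name, self_closing = m.group(1), m.group(2), m.group(3)
        if self_closing:
            table.append((name, ""))          # <br/> has no content
        elif not closing:
            stack.append((name, m.end()))     # remember where the content starts
        elif any(open_name == name for open_name, _ in stack):
            # Pop until the matching start tag; skipped tags have no end tag.
            while True:
                open_name, start = stack.pop()
                if open_name == name:
                    table.append((name, xhtml[start:m.start()]))
                    break
                dropped.append(open_name)
        # An end tag without any open start tag is ignored.
    dropped.extend(open_name for open_name, _ in stack)
    return table, dropped


table, dropped = build_tag_table("<div><p>Hello</p><span>lost<br/></div>")
print(table)
print(dropped)
```

In the sample input the `span` start tag has no end tag, so it is reported in `dropped` rather than stored, which corresponds to the deletion branch S1024.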
Further, S105 may comprise:
Extracting information according to the page template by traversing the DOM tree;
Obtaining the extracted detailed information;
Storing the detailed information.
In embodiments of the invention, the detailed information may be stored in a structured form; further, it may be stored as an XML document.
Information extraction is a depth-first traversal of the DOM tree, from top to bottom and from left to right. During the traversal, the extraction rules are tested against the current node, and qualifying semantic items are held temporarily; once all semantic items of an object have been collected, they are assembled and then stored in the database.
As shown in Fig. 4, S105 may comprise:
S1051: set the root node of the DOM tree as the current node;
S1052: obtain the DOM path of the current node and compare it with the path rule in the current rule to check whether they match; if not, go to S1057; if so, go to S1053;
S1053: determine whether the tags of the nodes adjacent to the current node (before and after it) match the left and right adjacent tags of the current rule; if not, go to S1057; if so, go to S1054;
S1054: determine whether the features of the current node match the features specified in the rule; if not, go to S1057; if so, go to S1055;
S1055: take the information out of the current node and store it in the cache;
S1056: fetch the extraction rule of the next semantic item from the knowledge base; if this succeeds, make it the current rule; otherwise the last semantic item has been extracted, so the semantic items are assembled into an object and stored in the database, and the extraction rule of the first semantic item is then fetched from the knowledge base as the current rule;
S1057: determine whether the whole DOM tree has been traversed; if not, return to S1052; if so, end the extraction process.
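The traversal of S1051 to S1057 can be sketched as a recursive depth-first walk that tests one rule at a time. Everything concrete here is an assumption made for illustration: the `Rule` fields, the use of the `class` attribute as the node "feature", a plain list standing in for the knowledge base, and a list of dictionaries standing in for the database.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    name: str                        # name of the semantic item
    path: str                        # expected tag path from the root
    left: Optional[str] = None       # expected tag of the previous sibling, None = any
    right: Optional[str] = None      # expected tag of the next sibling, None = any
    css_class: Optional[str] = None  # expected class attribute, None = any


def extract_objects(xhtml: str, rules: list) -> list:
    """Depth-first, left-to-right traversal that fills one object per rule cycle."""
    objects, cache, rule_index = [], {}, 0

    def visit(node, prefix, left, right):
        nonlocal cache, rule_index
        path = prefix + "/" + node.tag
        rule = rules[rule_index]
        if (path == rule.path                                   # S1052: path rule
                and rule.left in (None, left)                   # S1053: neighbors
                and rule.right in (None, right)
                and rule.css_class in (None, node.get("class"))):  # S1054: feature
            cache[rule.name] = (node.text or "").strip()        # S1055: cache item
            rule_index += 1                                     # S1056: next rule
            if rule_index == len(rules):
                objects.append(cache)   # all items found: assemble and store
                cache, rule_index = {}, 0
        children = list(node)
        for i, child in enumerate(children):
            prev_tag = children[i - 1].tag if i > 0 else None
            next_tag = children[i + 1].tag if i + 1 < len(children) else None
            visit(child, path, prev_tag, next_tag)

    visit(ET.fromstring(xhtml), "", None, None)  # S1051: start at the root
    return objects


page = ("<html><body><ul>"
        "<li><b>News</b><i>20:00</i></li>"
        "<li><b>Film</b><i>21:30</i></li>"
        "</ul></body></html>")
rules = [
    Rule("title", "/html/body/ul/li/b", right="i"),
    Rule("time", "/html/body/ul/li/i", left="b"),
]
print(extract_objects(page, rules))
# prints [{'title': 'News', 'time': '20:00'}, {'title': 'Film', 'time': '21:30'}]
```

As in the described flow, only the current rule is tested at each node, so the rules must be listed in the order in which their nodes appear in the page.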
Correspondingly, an embodiment of the invention also provides an information extraction device for digital television interactive service pages. As shown in Fig. 5, the device comprises:
A document acquisition module 50, configured to acquire web pages and rewrite them to obtain XHTML documents;
A building module 51, configured to build a DOM tree from each XHTML document obtained by the document acquisition module 50;
A clustering module 52, configured to cluster the acquired web pages according to the DOM trees built by the building module 51;
A template acquisition module 53, configured to obtain the page template corresponding to the web pages of the same cluster produced by the clustering module 52;
An extraction module 54, configured to extract information according to the page template obtained by the template acquisition module 53 and to obtain the extracted detailed information.
For the implementation principle and process of the information extraction device, refer to the related description in the method embodiments of the invention; they are not repeated here.
Further, the building module 51 may comprise:
A search unit, configured to find all start tags in the XHTML document and store the names of the start tags found in a tag table;
A judging unit, configured to determine, one by one, whether an end tag exists that corresponds to each start tag in the tag table;
A first storage unit, configured to store the content between the end tag and its corresponding start tag in the tag table when the result of the judging unit is yes;
A deletion unit, configured to delete the start tag when the result of the judging unit is no;
A building unit, configured to build the DOM tree from the tag table, which contains the start tags and the content between each start tag and its corresponding end tag.
Further, the extraction module 54 comprises:
An extraction unit, configured to extract information according to the page template by traversing the DOM tree and to obtain the extracted detailed information;
A second storage unit, configured to store the detailed information extracted by the extraction unit.
In a specific implementation, the first storage unit and the second storage unit may be merged, so that a single storage unit provides the storage functions of both.
Implementing the information extraction device of the embodiments of the invention can increase the speed at which key information is obtained from digital television interactive service pages and can also reduce the amount of page data that has to be processed.
It should be noted that the information exchange between, and the operation of, the modules and units of the above device are based on the same concept as the method embodiments of the invention; for details, refer to the description of the method embodiments, which is not repeated here.
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.
The information extraction method and device for digital television interactive service pages provided by the embodiments of the invention have been described in detail above. Specific examples have been used to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method of the invention and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and the scope of application. In summary, this description should not be construed as limiting the invention.

Claims (8)

1. An information extraction method for digital television interactive service pages, characterized in that the method comprises:
Acquiring web pages and rewriting the web pages to obtain Extensible HyperText Markup Language (XHTML) documents;
Building a Document Object Model (DOM) tree from each XHTML document;
Clustering the acquired web pages according to the DOM trees;
Obtaining the page template corresponding to the web pages of the same cluster;
Extracting information according to the page template and obtaining the extracted detailed information.
2. The information extraction method for digital television interactive service pages according to claim 1, characterized in that the step of building the DOM tree from the XHTML document comprises:
Finding all start tags in the XHTML document and storing the names of the start tags found in a tag table;
Determining, one by one, whether an end tag exists that corresponds to each start tag in the tag table;
If so, storing the content between the end tag and its corresponding start tag in the tag table;
If not, deleting the start tag;
Building the DOM tree from the tag table, which contains the start tags and the content between each start tag and its corresponding end tag.
3. The information extraction method for digital television interactive service pages according to claim 1 or 2, characterized in that the step of extracting information according to the page template and obtaining the extracted detailed information comprises:
Extracting information according to the page template by traversing the DOM tree;
Obtaining the extracted detailed information;
Storing the detailed information.
4. The information extraction method for digital television interactive service pages according to claim 3, characterized in that the step of storing the detailed information comprises:
Storing the detailed information in a structured form.
5. The information extraction method for digital television interactive service pages according to claim 4, characterized in that the step of storing the detailed information in a structured form comprises:
Storing the detailed information as an Extensible Markup Language (XML) document.
6. An information extraction device for digital television interactive service pages, characterized in that the device comprises:
A document acquisition module, configured to acquire web pages and rewrite them to obtain Extensible HyperText Markup Language (XHTML) documents;
A building module, configured to build a Document Object Model (DOM) tree from each XHTML document obtained by the document acquisition module;
A clustering module, configured to cluster the acquired web pages according to the DOM trees built by the building module;
A template acquisition module, configured to obtain the page template corresponding to the web pages of the same cluster produced by the clustering module;
An extraction module, configured to extract information according to the page template obtained by the template acquisition module and to obtain the extracted detailed information.
7. The information extraction device for digital television interactive service pages according to claim 6, characterized in that the building module comprises:
A search unit, configured to find all start tags in the XHTML document and store the names of the start tags found in a tag table;
A judging unit, configured to determine, one by one, whether an end tag exists that corresponds to each start tag in the tag table;
A first storage unit, configured to store the content between the end tag and its corresponding start tag in the tag table when the result of the judging unit is yes;
A deletion unit, configured to delete the start tag when the result of the judging unit is no;
A building unit, configured to build the DOM tree from the tag table, which contains the start tags and the content between each start tag and its corresponding end tag.
8. The information extraction device for digital television interactive service pages according to claim 6 or 7, characterized in that the extraction module comprises:
An extraction unit, configured to extract information according to the page template by traversing the DOM tree and to obtain the extracted detailed information;
A second storage unit, configured to store the detailed information extracted by the extraction unit.
CN2011101868253A 2011-07-05 2011-07-05 Digital television interaction service page information extraction method and device Pending CN102236713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101868253A CN102236713A (en) 2011-07-05 2011-07-05 Digital television interaction service page information extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011101868253A CN102236713A (en) 2011-07-05 2011-07-05 Digital television interaction service page information extraction method and device

Publications (1)

Publication Number Publication Date
CN102236713A true CN102236713A (en) 2011-11-09

Family

ID=44887359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101868253A Pending CN102236713A (en) 2011-07-05 2011-07-05 Digital television interaction service page information extraction method and device

Country Status (1)

Country Link
CN (1) CN102236713A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982161A (en) * 2012-12-05 2013-03-20 北京奇虎科技有限公司 Method and device for acquiring webpage information
CN103092973A (en) * 2013-01-24 2013-05-08 浪潮(北京)电子信息产业有限公司 Information extraction method and device
CN103092973B (en) * 2013-01-24 2015-12-02 浪潮(北京)电子信息产业有限公司 Information extraction method and device
WO2019024755A1 (en) * 2017-08-01 2019-02-07 阿里巴巴集团控股有限公司 Webpage information extraction method, apparatus and system, and electronic device
CN107562600A (en) * 2017-08-23 2018-01-09 广州阿里巴巴文学信息技术有限公司 Page detection method, apparatus, computing device and storage medium
CN107562600B (en) * 2017-08-23 2021-12-10 阿里巴巴(中国)有限公司 Page detection method and device, computing equipment and storage medium
WO2019090738A1 (en) * 2017-11-10 2019-05-16 深圳市华阅文化传媒有限公司 Method and device for purifying web fiction page
CN108427664A (en) * 2018-02-22 2018-08-21 阿里巴巴集团控股有限公司 Document analysis method and device

Similar Documents

Publication Publication Date Title
Liu et al. Vide: A vision-based approach for deep web data extraction
US7941420B2 (en) Method for organizing structurally similar web pages from a web site
Madhavan et al. Harnessing the deep web: Present and future
US7913163B1 (en) Determining semantically distinct regions of a document
CN101647020B (en) Searching structured geographical data
US7055094B2 (en) Virtual tags and the process of virtual tagging utilizing user feedback in transformation rules
Zheng et al. Template-independent news extraction based on visual consistency
CN102890713B (en) Music recommendation method based on user's current geographic position and physical environment
KR100930455B1 (en) Method and system for generating search collection by query
TWI695277B (en) Automatic website data collection method
CN101630330A (en) Method for webpage classification
CN101520798A (en) Webpage classification technology based on vertical search and focused crawler
CN103955529A (en) Internet information searching and aggregating presentation method
Pol et al. A survey on web content mining and extraction of structured and semistructured data
CN102236713A (en) Digital television interaction service page information extraction method and device
CN103294781A (en) Method and equipment used for processing page data
CN110457579B (en) Webpage denoising method and system based on cooperative work of template and classifier
CN103617174A (en) Distributed searching method based on cloud computing
CN103530429A (en) Webpage content extracting method
CN103853770B (en) Method and system for extracting model content from forum Web pages
CN101639840A (en) Method and device for identifying semantic structure of network information
WO2008038416A1 (en) Document searching device and document searching method
Sabri et al. Improving performance of DOM in semi-structured data extraction using WEIDJ model
CN106934036A (en) Method and system for aggregated query of network learning resources
CN104281693A (en) Semantic search method and semantic search system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent of invention or patent application
CB03 Change of inventor or designer information

Inventor after: Luo Xiaonan
Inventor after: Lin Ge
Inventor after: Zhang Jie
Inventor before: Lin Ge
Inventor before: Zhang Jie
Inventor before: Yan Quan

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: LIN GE, ZHANG JIE, YAN QUAN; TO: LUO XIAONAN, LIN GE, ZHANG JIE

C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20111109