CN101464905A - Web page information extraction system and method - Google Patents

Web page information extraction system and method

Info

Publication number
CN101464905A
Authority
CN
China
Prior art keywords
webpage
web page
node
mark
automatically
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2009100765483A
Other languages
Chinese (zh)
Other versions
CN101464905B (en)
Inventor
吴博
王宇
张刚
丁国栋
程学旗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN2009100765483A
Publication of CN101464905A
Application granted
Publication of CN101464905B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a system and method for extracting web page information. The system comprises a template generation module, a web page homogenization module, an automatic annotation module, a wrapper file generation module and an online extraction module. The template generation module selects web pages to be automatically annotated from a web page collection, classifies them according to training web pages annotated by a user, and generates a web page template for each category. The web page homogenization module masks the differences between the web pages to be automatically annotated and the web page template of the category they belong to. The automatic annotation module parses the training web pages of a category to generate a first wrapper file, and automatically annotates the web pages to be automatically annotated according to the first wrapper file, thereby generating new training web pages. The wrapper file generation module parses all the training web pages and generates a second wrapper file. The online extraction module applies the second wrapper file to extract information from the web pages in the collection that were not selected. The invention can generate multiple templates corresponding to web pages of different categories, and can extract multiple records from a web page and multiple attributes of each record.

Description

A system and method for web page information extraction
Technical field
The invention belongs to the field of network information processing, and in particular relates to a system and method for web page information extraction.
Background art
Existing web page extraction techniques can be divided, by field of application, into domain-specific techniques and general-purpose techniques.
Domain-specific web page extraction techniques usually make prior assumptions about the content to be extracted, for example extracting the body text of news pages, or extracting particular attributes of a page such as a product price. These methods typically extract pages using statistical methods or hand-summarized heuristic rules based on the features of the target objects. Because they target specific objects, their generality and the kinds and quantity of information they can extract are limited.
General-purpose web page extraction techniques are divided, according to the degree of automation of the extraction tool, into systems with manually constructed rules, semi-supervised systems, unsupervised systems and supervised systems.
In a system with manually constructed rules, the user hand-writes a wrapper for each website to be extracted. The wrapper may be written in a general-purpose programming language or in a language designed specifically for extraction. Such tools require the user to have some knowledge of computers and programming, so the cost of this approach is high, and is often unacceptable when extracting from a large number of websites and a massive number of pages.
Compared with a supervised system, a semi-supervised extraction system usually does not need the user to annotate the data in the web pages precisely in order to generate extraction rules, hence the name. Although such a system does not require the user to annotate the page data, it often requires post-processing by the user, such as selecting the target pattern and the data to be extracted, and all such systems are designed to extract record-level data. Their extraction precision therefore usually cannot meet the requirement of accurately extracting attribute information from web pages.
An unsupervised extraction system does not require the user to annotate any training data, so no user interface is needed while the wrapper is generated. Whereas in a supervised system the data to be extracted is annotated by the user, in an unsupervised system it is determined by the data itself: an unsupervised system generally assumes that a web page is produced by a program that fills a web page template with data from a back-end database, and its task is to extract the data that came from that database. However, this fully automatic approach tends to extract much information the user does not need while possibly missing information the user does need, and because the extracted data carries no labels, integrating and interpreting it is also difficult.
A supervised extraction system typically takes as input a set of web pages annotated by the user, uses these training web pages to generate a wrapper file, and then uses the generated wrapper file to extract information from similar web pages. Such a system usually does not require specialist programmers: ordinary users with some simple training can annotate the data to be extracted in a graphical user interface. Its extraction precision is relatively high, and because the extracted data carries labels, it is easy to interpret and integrate. The system described in the present invention is a supervised extraction system.
Information on the Internet is growing explosively, and since web pages are an important information carrier on the network, how to extract the needed information from them has become an increasingly important research topic. However, web pages on the Internet are designed for users to browse, so the information in a page is surrounded by many markup tags and formatting information, which makes extraction difficult.
A currently popular semi-automatic supervised extraction method with relatively high precision works as follows: crawl from a website the web pages generated by the same web page template; select several of them as training web pages; have the user annotate the information to be extracted in those pages; learn the context features of the data fields to be extracted from the training pages by machine learning; and finally generate a wrapper file for extraction. The wrapper is then used to extract automatically from the other pages of the website. This method has the following problems.
First, a crawler currently judges whether pages on a website are of the same kind by checking whether they lie under the same URL path. But present-day websites contain a large number of dynamic URLs, and it even happens that pages under the same URL path have very dissimilar structures. As a result, a wrapper file generated from the training web pages cannot extract pages in the collection that were generated by a different web page template.
Second, even when the pages are generated by the same web page template, they contain many non-template nodes, and the non-template nodes of different pages differ in various ways. A wrapper file generated from only part of the pages as training data often cannot cover all these differences, so it cannot handle extraction from some of the pages. Traditionally, pages that cannot be extracted correctly are submitted to the user, who annotates their data fields; the pages are then given back to the extraction program as training web pages to regenerate the wrapper.
Third, existing extraction systems face a trade-off among accuracy, degree of automation and the amount of manual intervention required. Systems that achieve high accuracy with few training examples often need a long running time at the extraction stage and cannot satisfy online real-time extraction, while systems that are efficient at the extraction stage often need more training web pages and manual intervention to generate a wrapper file with good precision and recall.
Fourth, websites are updated quickly. After a correct wrapper file has been generated, once the website is redesigned, the wrapper file generated from the old version of the pages can no longer extract pages from the redesigned site.
Fifth, many existing extraction techniques target a particular type of website, such as news pages, or can only extract certain attributes of a particular kind of object, such as the price and name of a product.
There is therefore an urgent need for a general-purpose, cross-domain tool that can extract any specified information.
Summary of the invention
To solve the above technical problems, the invention provides a system and method for web page information extraction that can generate multiple templates corresponding to web pages of different categories and can extract multiple attributes from a web page.
The invention discloses a system for web page information extraction, comprising:
A template generation module, configured to select web pages to be automatically annotated from a web page collection, classify the web pages to be automatically annotated according to training web pages annotated by a user, and at the same time generate a web page template for the category corresponding to each training web page;
A web page homogenization module, configured to mask, according to the web page template of a category, the differences between the web pages to be automatically annotated that belong to that category and the web page template of that category;
An automatic annotation module, configured to parse the training web pages of the category to generate a first wrapper file, and to automatically annotate the web pages to be automatically annotated in the category using the first wrapper file, so as to generate new training web pages;
A wrapper file generation module, configured to parse all the training web pages and generate a second wrapper file;
An online extraction module, configured to apply the second wrapper file to extract information from the web pages in the web page collection that were not selected.
The template generation module is further configured to select the web pages to be automatically annotated from the web page collection, and to build the DOM tree of the user-annotated training web page and the DOM tree of each web page to be automatically annotated, the web page template of the category corresponding to the training web page being the DOM tree of the training web page. It computes the similarity between the DOM tree of the training web page and the DOM tree of the web page to be automatically annotated. If the similarity is greater than a preset threshold, the web page to be automatically annotated belongs to the category corresponding to the training web page. Otherwise, the user is prompted to annotate that web page so as to generate a new training web page; the DOM tree of the new training web page is built and serves as the web page template of the category corresponding to the new training web page, and the same similarity comparison is performed between the DOM tree of the new training web page and the DOM trees of the web pages to be automatically annotated, so as to complete the classification.
The template generation module is also configured to simplify the web page template using the DOM trees of the web pages to be automatically annotated in the category.
When simplifying the web page template, the template generation module is further configured to identify the repeated-data nodes in the web page template, match the web page template against the DOM trees of the web pages to be automatically annotated in the category, and delete from the web page template the nodes that are not matched and are not repeated-data nodes.
When identifying the repeated-data nodes in the web page template, the template generation module is further configured to determine, in the web page template, the common ancestor node of all nodes in the annotated record, and to find all same-named neighbor nodes of that ancestor node. It then determines whether the similarity between the subtree rooted at a same-named neighbor node and the subtree rooted at the node in the annotated record is greater than a preset threshold. If so, the same-named neighbor node and the node in the annotated record are repeated-data nodes; otherwise, the search continues among the nodes one level above the ancestor node, until repeated-data nodes are found or the root node of the web page template is reached.
After simplifying the web page template, the template generation module is also configured to match, in the web page template, each subtree rooted at a repeated-data node other than the node in the annotated record against the subtree rooted at the node in the annotated record, and to mask the unmatched nodes in the subtrees rooted at the repeated-data nodes other than the node in the annotated record.
The web page homogenization module is further configured to match the DOM tree of each web page to be automatically annotated in a category against the web page template of that category, and to mask the unmatched nodes in the DOM tree of the web page to be automatically annotated.
When parsing the training web pages of the category to generate the first wrapper file, the automatic annotation module is further configured to convert each training web page into a token sequence, locate the separators before and after the annotated data, and determine the left and right rules of each separator.
When parsing the training web pages of the category to generate the first wrapper file, the automatic annotation module is also configured to treat each separator as a state of a non-deterministic state machine, and to use the left and right rules of each separator as the transition rules for moving from the current state to the next state.
The wrapper file generation module is further configured to convert each training web page into a token sequence, locate the separators before and after the annotated data, and determine the left and right rules of each separator; each separator is a state of a non-deterministic state machine, and the left and right rules of each separator are the transition rules for moving from the current state to the next state.
The online extraction module is further configured to convert each web page in the collection that was not selected into a token sequence, traverse the tokens in the sequence, and determine whether a token satisfies the transition rule of the current state; if so, it moves to the next state. When the current state and the next state correspond respectively to the separators before and after an attribute, the web page text between the separator of the current state and the separator of the next state is saved as the value of that attribute.
The invention also discloses a method for web page information extraction, comprising:
Step 1: select web pages to be automatically annotated from a web page collection, classify them according to training web pages annotated by a user, and at the same time generate a web page template for the category corresponding to each training web page;
Step 2: according to the web page template of a category, mask the differences between the web pages to be automatically annotated that belong to that category and the web page template of that category;
Step 3: parse the training web pages of the category to generate a first wrapper file, and automatically annotate the web pages to be automatically annotated in the category using the first wrapper file, so as to generate new training web pages;
Step 4: parse all the training web pages and generate a second wrapper file;
Step 5: apply the second wrapper file to extract information from the web pages in the web page collection that were not selected.
Step 1 further comprises:
Step 131: select the web pages to be automatically annotated from the web page collection;
Step 132: build the DOM tree of the user-annotated training web page and the DOM tree of the web page to be automatically annotated, the DOM tree of the training web page serving as the web page template of the category corresponding to the training web page; compute the similarity between the two DOM trees; if the similarity is greater than a preset threshold, determine that the web page to be automatically annotated belongs to the category corresponding to the training web page; otherwise, execute step 133;
Step 133: prompt the user to annotate the web page to be automatically annotated so as to generate a new training web page, then execute step 132.
Step 132 further comprises:
Step 141: simplify the web page template using the DOM trees of the web pages to be automatically annotated in the category.
Step 141 further comprises:
Step 151: identify the repeated-data nodes in the web page template;
Step 152: after determining that the web page to be automatically annotated belongs to the category corresponding to the training web page, match the web page template against the DOM tree of that web page, and delete from the web page template the nodes that are not matched and are not repeated-data nodes.
Step 151 further comprises:
Step 161: determine, in the web page template, the common ancestor node of all nodes in the annotated record, and find all same-named neighbor nodes of that ancestor node;
Step 162: determine whether the similarity between the subtree rooted at a same-named neighbor node and the subtree rooted at the node in the annotated record is greater than a preset threshold; if so, determine that the same-named neighbor node and the node in the annotated record are repeated-data nodes; otherwise, continue the search among the nodes one level above the ancestor node, until repeated-data nodes are found or the root node of the web page template is reached.
After step 141, the method further comprises:
Step 171: in the web page template, match each subtree rooted at a repeated-data node other than the node in the annotated record against the subtree rooted at the node in the annotated record;
Step 172: mask the unmatched nodes in the subtrees rooted at the repeated-data nodes other than the node in the annotated record.
Step 2 further comprises:
Step 181: match the DOM tree of each web page to be automatically annotated in a category against the web page template of that category, and mask the unmatched nodes in the DOM tree of the web page to be automatically annotated.
In step 3, parsing the training web pages of the category to generate the first wrapper file further comprises:
Step 191: convert each training web page into a token sequence, locate the separators before and after the annotated data, and determine the left and right rules of each separator.
After step 191, the method further comprises:
Treating each separator as a state of a non-deterministic state machine, and using the left and right rules of each separator as the transition rules for moving from the current state to the next state.
Step 4 further comprises:
Step 211: convert each training web page into a token sequence, locate the separators before and after the annotated data, and determine the left and right rules of each separator;
Step 212: treat each separator as a state of a non-deterministic state machine, the left and right rules of each separator being the transition rules for moving from the current state to the next state.
Step 5 further comprises:
Step 221: convert each web page that was not selected into a token sequence;
Step 222: traverse the tokens in the sequence and determine whether a token satisfies the transition rule of the current state; if so, move to the next state;
Step 223: when the current state and the next state correspond respectively to the separators before and after an attribute, save the web page text between the separator of the current state and the separator of the next state as the value of that attribute.
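Steps 221 to 223 can be illustrated with a minimal Python sketch. It is not the patented implementation: the wrapper format (a list of `(attribute, left_rule, right_rule)` tuples), the regular-expression tokenizer and the function names are assumptions made for illustration, and the machine is walked deterministically, whereas the text describes a non-deterministic state machine. After the last state the sketch returns to the first state, so that several records on one page are collected.

```python
import re

# Assumed tokenizer: every tag is one token, every text run is one token.
TOKEN = re.compile(r"<[^>]+>|[^<]+")

def tokenize(html):
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]

def extract(html, wrapper):
    """wrapper: list of (attribute, left_rule, right_rule) in page order.
    Rules are non-empty token tuples (an assumption of this sketch)."""
    tokens = tokenize(html)
    # Two states per attribute: its front separator, then its back separator.
    transitions = []
    for attribute, left, right in wrapper:
        transitions.append((tuple(left), None))
        transitions.append((tuple(right), attribute))
    records, record = [], {}
    state, i, data_start = 0, 0, 0
    while i < len(tokens):
        rule, attribute = transitions[state]
        if tuple(tokens[i:i + len(rule)]) == rule:
            if attribute is None:
                # Front separator matched: the data begins right after it.
                data_start = i + len(rule)
                i = data_start
            else:
                # Back separator matched: save the text in between. The
                # position is not advanced, because the next front
                # separator may overlap this back separator.
                record[attribute] = " ".join(tokens[data_start:i])
            state += 1
            if state == len(transitions):
                records.append(record)      # one complete record
                record, state = {}, 0
        else:
            i += 1
    return records
```

Called on a two-item product list with rules for a `title` and a `price` attribute, the function returns one dictionary per list item, which corresponds to extracting multiple records and multiple attributes per record.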
The beneficial effects of the invention are as follows. Through the template generation module, web pages are classified by template; clustering pages of the same kind together and extracting them with the same wrapper improves extraction precision. Through the web page homogenization and automatic annotation modules, the differences among the pages of a category are masked and the target data in all those pages is annotated, so that the pages can serve as training web pages for the wrapper file generation module; the differences among the pages are then learned by the wrapper and written into the extraction rules, so that these dissimilar pages can be extracted at the actual extraction stage, which improves recall. Through the wrapper file generation module and the online extraction module, the context token sequence features of the training web pages are learned to obtain extraction rules based on the context features of the data segments, and these rules allow massive amounts of data to be extracted quickly at the extraction stage. Moreover, because the invention relies only on the context sequence features of the target data in the web page, it places no further restrictions on the type of web page or the type of data to be extracted, and can therefore extract from most types of web pages and the data in them.
Description of drawings
Fig. 1 is a structural diagram of the web page information extraction system of the present invention;
Fig. 2 is a flowchart of the functions implemented by the template generation module;
Fig. 3 is a flowchart of the web page information extraction method of the present invention.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings.
The system architecture of the present invention is shown in Fig. 1 and comprises:
Template generation module 101, configured to select web pages to be automatically annotated from a web page collection, classify them according to training web pages annotated by a user, and at the same time generate a web page template for the category corresponding to each training web page.
Web page homogenization module 102, configured to mask, according to the web page template of a category, the differences between the web pages to be automatically annotated that belong to that category and the web page template of that category.
Automatic annotation module 103, configured to parse the training web pages of the category to generate a first wrapper file, and to automatically annotate the web pages to be automatically annotated in the category using the first wrapper file, so as to generate new training web pages.
Wrapper file generation module 104, configured to parse all the training web pages and generate a second wrapper file.
Online extraction module 105, configured to apply the second wrapper file to extract information from the web pages in the web page collection that were not selected.
A wrapper file is a set of extraction rules. Each extraction rule in the wrapper file describes the token sequence features of the context of the data to be extracted.
In a specific embodiment, the template generation module 101 is implemented as described below.
The Document Object Model (DOM) is a standard interface specification defined by the W3C. With the DOM, the structure of a web page can be described as a tree, commonly called a DOM tree. Each node in the tree is an object. The DOM tree not only describes the structure of the web page but also defines the behavior of the node objects; using the methods and attributes of these objects, the nodes and content of the tree can easily be manipulated dynamically, for example accessed, modified, added and deleted.
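As a rough illustration of describing a page as a tree, the following Python sketch builds a simplified DOM-like tree with the standard-library `html.parser` module. The `Node` class, the `VOID_TAGS` set and the helper names are assumptions of this sketch and are far smaller than a real W3C DOM implementation.

```python
from html.parser import HTMLParser

# Elements that never have children (a partial, assumed list).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node          # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag name, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()

def build_dom_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children)
```

The resulting tree keeps only tag names, text and parent/child links, which is the information the matching and similarity steps described later rely on.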
The specific implementation flow of the template generation module 101 is shown in Fig. 2.
Step S201: prompt the user to annotate a training web page; after the user has finished annotating, proceed to step S202.
In one specific implementation, a user interface tool lets the user annotate the data to be extracted with the mouse in a browser.
Step S202: select web pages to be automatically annotated from the web page collection; build the DOM tree of the training web page, denoted TempTree, and the DOM tree of each web page to be automatically annotated, denoted Tree.
Step S203: identify the repeated-data nodes in TempTree.
Repeated-data nodes are the nodes of similar records in the DOM tree. For example, the main post on a forum page may be followed by many replies; those replies are repeated-data-region nodes.
The repeated-data nodes are identified as follows:
In TempTree, determine the common ancestor node of all nodes in the annotated record, and find all same-named neighbor nodes of this ancestor node. Determine whether the similarity between the subtree rooted at a same-named neighbor node and the subtree rooted at the node in the annotated record is greater than a preset threshold. If so, the same-named neighbor node and the node in the annotated record are repeated-data nodes; otherwise, continue the search among the nodes one level above the ancestor node, until repeated-data nodes are found or the root node of TempTree is reached.
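The identification procedure can be sketched in Python as follows. This is a hedged illustration rather than the patented algorithm: the `Node` class, the position-based `similarity` measure and the default threshold of 0.8 are assumptions, and "same-named neighbor nodes" is read here as sibling nodes with the same tag name, compared against the subtree rooted at the common ancestor.

```python
class Node:
    def __init__(self, tag, children=()):
        self.tag = tag
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

def size(node):
    return 1 + sum(size(c) for c in node.children)

def matched(a, b):
    """Count nodes matched top-down, pairing children by position
    (a deliberate simplification of tree matching)."""
    if a.tag != b.tag:
        return 0
    return 1 + sum(matched(x, y) for x, y in zip(a.children, b.children))

def similarity(a, b):
    return matched(a, b) / ((size(a) + size(b)) / 2)

def ancestors(node):
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain                     # the node itself first, the root last

def common_ancestor(nodes):
    common = ancestors(nodes[0])
    for n in nodes[1:]:
        others = set(map(id, ancestors(n)))
        common = [a for a in common if id(a) in others]
    return common[0]                 # lowest shared ancestor

def find_repeated_data_nodes(labeled_nodes, threshold=0.8):
    record = common_ancestor(labeled_nodes)
    while record.parent is not None:
        repeats = [s for s in record.parent.children
                   if s is not record and s.tag == record.tag
                   and similarity(s, record) > threshold]
        if repeats:
            return [record] + repeats
        record = record.parent       # retry one level up
    return []                        # reached the root without a match
```

For a list of three structurally identical `li` records in which the user annotated fields of the first one, the function returns all three `li` nodes; for a page without repeated structure it returns an empty list.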
Step S204: obtain a Tree.
Step S205: determine whether there is an unprocessed Tree; if so, execute step S206; otherwise, execute step S210.
Step S206: compute the similarity between TempTree and Tree.
The similarity between Tree and TempTree is computed with a tree matching algorithm. The similarity formula is the edit distance between the two trees divided by the average number of nodes of the two trees.
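The formula can be made concrete with a small Python sketch. The text does not name the tree matching algorithm, so a simple top-down edit distance is assumed here (relabeling a node costs 1; inserting or deleting a subtree costs its node count), with trees written as `(tag, children)` tuples. Taken literally, the ratio of edit distance to average node count shrinks as two trees become more alike, so the sketch also reports `1 - ratio`, clamped at zero, as a similarity; that conversion is an assumption of this sketch, not something the text states.

```python
def size(tree):
    tag, children = tree
    return 1 + sum(size(c) for c in children)

def edit_distance(a, b):
    """Top-down tree edit distance between two (tag, children) trees."""
    (tag_a, kids_a), (tag_b, kids_b) = a, b
    rows, cols = len(kids_a) + 1, len(kids_b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):                      # delete subtrees of a
        d[i][0] = d[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, cols):                      # insert subtrees of b
        d[0][j] = d[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + size(kids_a[i - 1]),
                d[i][j - 1] + size(kids_b[j - 1]),
                d[i - 1][j - 1] + edit_distance(kids_a[i - 1], kids_b[j - 1]),
            )
    return (tag_a != tag_b) + d[-1][-1]

def normalized_distance(a, b):
    # Edit distance divided by the average node count of the two trees.
    return edit_distance(a, b) / ((size(a) + size(b)) / 2)

def similarity(a, b):
    # Assumed conversion so that larger values mean more similar trees.
    return max(0.0, 1.0 - normalized_distance(a, b))
```

With a five-node template and a four-node page that differ by one extra leaf, the edit distance is 1 and the normalized distance is 1 / 4.5.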
Step S207: determine whether the similarity is greater than a preset threshold. If so, the web page to be automatically annotated is determined to belong to the category corresponding to the training web page, and step S208 is executed; otherwise, step S201 is executed.
Step S208: simplify TempTree.
TempTree is matched against Tree, and the nodes in TempTree that are not matched and are not repeated-data nodes are deleted.
The web page template is TempTree, which contains the nodes shared by the Trees of the same category together with the repeated-data nodes.
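The simplification of step S208 might look like the following Python sketch. The matching strategy (pairing each template child with the next same-tag child of the page, in document order) and the `repeated` flag are assumptions for illustration; the text only requires that unmatched nodes that are not repeated-data nodes be deleted.

```python
class Node:
    def __init__(self, tag, children=(), repeated=False):
        self.tag = tag
        self.children = list(children)
        self.repeated = repeated   # flagged earlier as a repeated-data node

def simplify(template, page):
    """Prune, in place, template children with no counterpart in page."""
    kept, cursor = [], 0
    for child in template.children:
        # Find the next same-tag page child in document order.
        match = next((i for i in range(cursor, len(page.children))
                      if page.children[i].tag == child.tag), None)
        if match is not None:
            simplify(child, page.children[match])
            cursor = match + 1
            kept.append(child)
        elif child.repeated:
            kept.append(child)     # repeated-data nodes survive unmatched
    template.children = kept
    return template
```

Applying `simplify` once per page of the category leaves a template that holds only the nodes common to those pages plus the repeated-data nodes.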
Step S209: normalize TempTree.
Normalization uses the annotated data subtree to simplify the other data subtrees in the training page, so that all data subtrees are converted into data subtrees with a uniform structure.
Specifically, each subtree in TempTree rooted at a repeated-data node other than the node in the annotated record is matched against the subtree rooted at the node in the annotated record, and the unmatched nodes in the subtrees rooted at those other repeated-data nodes are masked.
Masking means attaching a label that tells the program to ignore that part of the content.
Step S210: end.
The TempTree finally obtained is the web page template of the category.
The web page homogenization module 102 is implemented as follows.
The DOM tree of each web page to be automatically labeled in a category is matched against the web page template of that category, and the nodes in the page's DOM tree that are not matched are shielded, so that the automatic labeling module 103 can apply the wrapper file generated for the web page template to automatically label the web pages to be automatically labeled in this category.
Shielding means attaching a tag that tells the program that the content of this part is to be ignored.
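The homogenization step can be sketched as follows. The text does not specify how the DOM tree is matched against the template, so this sketch assumes a greedy, in-order match of children by tag name; Node, shield_all, homogenize and visible_tags are hypothetical names, and the shielded flag stands in for the tag that marks content to be ignored.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    shielded: bool = False  # True means "ignore this part"


def shield_all(node: Node) -> None:
    """Shield a node together with its whole subtree."""
    node.shielded = True
    for child in node.children:
        shield_all(child)


def homogenize(page: Node, template: Node) -> None:
    """Shield every node of page that has no counterpart in template.

    Assumption: children are matched greedily, left to right, by tag name.
    """
    next_free = 0
    for child in page.children:
        j = next_free
        while j < len(template.children) and template.children[j].tag != child.tag:
            j += 1
        if j < len(template.children):
            homogenize(child, template.children[j])
            next_free = j + 1
        else:
            shield_all(child)


def visible_tags(node: Node) -> List[str]:
    """Preorder list of the tags that are not shielded."""
    if node.shielded:
        return []
    tags = [node.tag]
    for child in node.children:
        tags.extend(visible_tags(child))
    return tags


template = Node("body", [Node("div"), Node("table")])
page = Node("body", [Node("div"), Node("span", [Node("a")]), Node("table")])
homogenize(page, template)
```

After the call, the span subtree of the page is shielded and the remaining visible structure equals that of the template.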
The automatic labeling module 103 is implemented as follows.
The training web page is converted into a token sequence, the front and back separators of the labeled data are located, and the left and right rules of each separator are determined. Each separator serves as a state of a nondeterministic state machine, and the left and right rules of each separator serve as the transition rule for jumping from the current state to the next state.
Because the web pages to be automatically labeled in a category all share a unified structure, the extraction rules are applied to automatically label the web pages to be automatically labeled in this category.
After the web pages to be automatically labeled in this category have been processed by the web page homogenization module, the structures of the pages are very similar. The context token sequences of the target data in the homogenized web pages to be automatically labeled share the same features as the context token sequence rules of the target data learned from the manually labeled training web pages. According to the first wrapper file and the web pages to be automatically labeled, the left and right separators of the data segments in those pages are located; once the separator positions are found, the same labels as in the manual labeling are inserted at those positions, thereby labeling the pages automatically.
The wrapper file generation module 104 is implemented as follows.
The training web pages are converted into token sequences, the front and back separators of the labeled data are located, and the left and right rules of each separator are determined. Each separator is a state of a nondeterministic state machine, and the left and right rules of each separator are the transition rule for jumping from the current state to the next state.
A web page is composed of two kinds of tokens, html tokens and alph tokens: the HTML tags in the page correspond to html tokens, and the content between tags corresponds to alph tokens.
When the second wrapper file is generated, the training web page is first converted into a token sequence composed of html tokens and alph tokens, and the positions of the tokens holding the labeled data in the page are located.
The separator position begins to search for forward before the labeled data, and till the separator that runs into first a word token or a last labeled data, all token that run in the process of above-mentioned search are as the preceding label left side rule of labeled data.
It is not to form alph token by symbol fully that word token is defined as content.
The separator position begins to search for forward before the labeled data, and till the back separator that runs into first a word token or a last labeled data, all token that run in the process of above-mentioned search are as the preceding label left side rule of labeled data.
In like manner, seek backward till the preceding separator that runs into first word token or next labeled data from the back decollator of labeled data, all token that run in said process are as the right rule of the back separator of labeled data.
Seek forward till the preceding separator that runs into first word token or labeled data from the back decollator of labeled data, all token that run in said process are as the left side rule of the back separator of labeled data.
An html token stores the name of the corresponding HTML tag; for example, html(<div>) records the tag name div in the html token. An alph token corresponds to the part of the web page between two web page tags, and this part is composed entirely of non-character symbols. The training web page is serialized according to the above convention; the front and back separators of the labeled data in the training web page are then found, and the token sequences to the left and right of each separator are recorded.
An example of a web page fragment is shown below.
Author<tr><td><a></a><a>{AET:author}Daniel{/AET:author}</a><br><div>Farmer
Here, {AET:author} and {/AET:author} are the separators that label the author information in the web page. The left and right sequence rules of the separators are as follows.
The left rule of the front separator:
alph(Author)html(<tr>)html(<td>)html(<a>)html(</a>)html(<a>);
The right rule of the front separator: alph(_);
The left rule of the back separator: alph(_);
The right rule of the back separator: html(</a>)html(<br>)html(<div>)alph(Farmer).
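The tokenization and the outward search for context rules can be sketched on the fragment above. This is a minimal sketch under assumptions: the regular expression, the "sep" token kind and the function names tokenize, is_word and context_rule are hypothetical, and a word token is approximated as an alph token containing at least one letter or digit. The sketch reproduces only the two outer rules; the inner rules are generalized to alph(_) in the text.

```python
import re

# Separators such as {AET:author} and HTML tags are kept as their own pieces.
PIECE_RE = re.compile(r"(\{/?AET:\w+\}|<[^>]+>)")


def tokenize(fragment):
    """Turn a labeled fragment into (kind, value) tokens: html, alph or sep."""
    tokens = []
    for piece in PIECE_RE.split(fragment):
        piece = piece.strip()
        if not piece:
            continue
        if piece.startswith("{"):
            tokens.append(("sep", piece))
        elif piece.startswith("<"):
            tokens.append(("html", piece))
        else:
            tokens.append(("alph", piece))
    return tokens


def is_word(token):
    """Assumed test for a word token: an alph token not made only of symbols."""
    kind, value = token
    return kind == "alph" and any(ch.isalnum() for ch in value)


def context_rule(tokens, sep_index, step):
    """Collect tokens outward from a separator (step=-1 left, step=+1 right).

    The search stops after the first word token or before another separator.
    """
    collected = []
    i = sep_index + step
    while 0 <= i < len(tokens) and tokens[i][0] != "sep":
        collected.append(tokens[i])
        if is_word(tokens[i]):
            break
        i += step
    if step < 0:
        collected.reverse()
    return "".join(f"{kind}({value})" for kind, value in collected)


fragment = ("Author<tr><td><a></a><a>{AET:author}Daniel{/AET:author}"
            "</a><br><div>Farmer")
tokens = tokenize(fragment)
front = tokens.index(("sep", "{AET:author}"))
back = tokens.index(("sep", "{/AET:author}"))
left_of_front = context_rule(tokens, front, -1)
right_of_back = context_rule(tokens, back, +1)
```

With this input, left_of_front and right_of_back reproduce the left rule of the front separator and the right rule of the back separator listed above.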
To generate a second wrapper file that can extract from all other similar web pages, the separator rules need to generalize. The wildcard "_" is therefore introduced into the separator rules; it means that a token matches as long as it belongs to the given token type, regardless of its specific content. For example, the right rule of the {AET:author} separator above is alph(_), so the separator rule is considered matched whenever a text containing characters and digits is encountered. In addition, because the number, type, and order of the tokens in the context token sequences differ among some web pages, an "or" descriptor is introduced: when one state jumps to another, its left rule or right rule may consist of several alternative rules, and the left or right rule is considered matched as long as any one of them is satisfied. For example, suppose there is another web page fragment of the following form.
Author:<tr><td><a>{AET:author}Alex{/AET:author}</a><br><div>General
The left rule of the front separator:
alph(Author)html(<tr>)html(<td>)html(<a>)html(</a>)html(<a>)|alph(Author)html(<tr>)html(<td>)html(<a>)
The right rule of the front separator: alph(_)
The left rule of the back separator: alph(_)
The right rule of the back separator: html(</a>)html(<br>)html(<div>)alph(_)
Here, not only has an alternative been added to the left rule of the front separator, but the last token of the right rule of the back separator has also changed: the original alph(Farmer) has become alph(_). This is because, after the new training web page was added, the two tokens General and Farmer appeared at the same position, which triggered generalization, and the content of the alph token in the final separator rule was generalized to "_".
In the second example, the <a></a> tags are missing from the left context of the front separator {AET:author}, so the original separator rule cannot extract from this example. This web page is therefore labeled and used as a training web page to regenerate the rule. As shown above, the regenerated left rule of the front separator keeps the original rule
alph(Author)html(<tr>)html(<td>)html(<a>)html(</a>)html(<a>) and joins it with "|" to the alternative alph(Author)html(<tr>)html(<td>)html(<a>).
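The two generalization mechanisms, the wildcard "_" and the "or" descriptor, can be sketched as a merge of a newly learned rule into the existing alternatives. This is a sketch under assumptions: rules are represented as lists of (kind, value) pairs, and generalize and render are hypothetical names; the text does not say how the merge is actually implemented.

```python
def generalize(alternatives, new_rule):
    """Merge new_rule into a list of alternative rules.

    If an existing alternative has the same token kinds and the same html
    tags, differing alph contents are replaced by the wildcard "_".
    Otherwise the new rule is added as a further "or" alternative.
    """
    for i, old in enumerate(alternatives):
        same_shape = len(old) == len(new_rule) and all(
            old_kind == new_kind and (old_kind == "alph" or old_val == new_val)
            for (old_kind, old_val), (new_kind, new_val) in zip(old, new_rule)
        )
        if same_shape:
            alternatives[i] = [
                (kind, old_val if old_val == new_val else "_")
                for (kind, old_val), (_, new_val) in zip(old, new_rule)
            ]
            return alternatives
    alternatives.append(list(new_rule))
    return alternatives


def render(alternatives):
    """Print alternatives in the notation used in the text, joined by "|"."""
    return "|".join(
        "".join(f"{kind}({value})" for kind, value in rule) for rule in alternatives
    )


tail = [("html", "</a>"), ("html", "<br>"), ("html", "<div>")]
right_of_back = generalize([tail + [("alph", "Farmer")]], tail + [("alph", "General")])

head = [("alph", "Author"), ("html", "<tr>"), ("html", "<td>"), ("html", "<a>")]
long_rule = head + [("html", "</a>"), ("html", "<a>")]
left_of_front = generalize([long_rule], head)
```

Rendering the two results gives the right rule of the back separator ending in alph(_) and the two-alternative left rule of the front separator shown above.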
After all training web pages have been processed, the second wrapper file that is finally used for extraction is generated. The second wrapper file contains the left and right rules of all attribute separators, the rules for transitions between attribute states, and the rules for transitions between records.
The left and right rules corresponding to the end separator of the previous record and the start separator of the next record are the rules for transitions between records.
The left and right rules corresponding to the back separator of the previous attribute and the front separator of the next attribute are the rules for transitions between attributes.
For example, suppose a forum web page contains several reply posts, and the author and the content of each reply need to be extracted; each reply is regarded as one record. Let A and @A denote the front and back separators of the author attribute, C and @C the front and back separators of the reply content attribute, and RB and RE the record-begin and record-end separators. The wrapper file then contains the left and right rules for the jump from RB to A, from A to @A, from @A to C, from C to @C, from @C to RE, and from RE to RB. By finding the transition rules between all separators, all records and all attributes in the page are identified.
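The cycle of jumps in the forum example can be sketched as a small state machine. This sketch assumes that the separators have already been recognized and appear as explicit symbols in the input stream (in the described system they are located through their left and right rules); NEXT, ATTR and extract_records are hypothetical names.

```python
# Transition table for the forum example: RB -> A -> @A -> C -> @C -> RE -> RB
NEXT = {"RB": "A", "A": "@A", "@A": "C", "C": "@C", "@C": "RE", "RE": "RB"}
# Front separator of each attribute mapped to an attribute name (assumed names)
ATTR = {"A": "author", "C": "content"}


def extract_records(stream):
    """Walk the stream and collect one dict per record."""
    records, current, buffer = [], {}, []
    state, expected = None, "RB"
    for item in stream:
        if item == expected:
            if state in ATTR:
                # The back separator of an attribute was reached: save its value.
                current[ATTR[state]] = " ".join(buffer)
            if item == "RE":
                records.append(current)
                current = {}
            state, expected = item, NEXT[item]
            buffer = []
        elif state in ATTR:
            # Text between the front and back separator of an attribute.
            buffer.append(item)
    return records


stream = ["RB", "A", "alice", "@A", "C", "hello", "there", "@C", "RE",
          "RB", "A", "bob", "@A", "C", "hi", "@C", "RE"]
records = extract_records(stream)
```

Each pass through the cycle yields one record, so the two replies in the stream produce two records with an author and a content value each.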
The online extraction module 105 is implemented as follows.
The web page to be extracted is converted into a token sequence; the token types and the generation method are the same as those used when the second wrapper file was generated. The second wrapper file is read in and parsed into a nondeterministic automaton held in memory. The token sequence is traversed; for each token, it is determined whether the token satisfies the transition rule of the current state, and if so, the automaton jumps to the next state. When the current state and the next state correspond respectively to the front and back separators of an attribute, the web page text between the separator of the current state and the separator of the next state is saved as the value of that attribute.
The flowchart of the web page information extraction method of the present invention is shown in Figure 3.
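Online extraction with the author rules from the earlier example can be sketched as follows. This is a simplified sketch, not the nondeterministic automaton itself: it scans positions between tokens and tests the left and right rules of the front and back separators directly. The helper names tokenize, rule, matches, at_separator and extract, and the two unlabeled input pages, are assumptions made for illustration.

```python
import re


def tokenize(page):
    """Split an unlabeled page into ("html", tag) and ("alph", text) tokens."""
    tokens = []
    for piece in re.split(r"(<[^>]+>)", page):
        piece = piece.strip()
        if piece:
            tokens.append(("html" if piece.startswith("<") else "alph", piece))
    return tokens


def rule(text):
    """Parse 'alph(Author)html(<tr>)' into a list of (kind, value) pairs."""
    return re.findall(r"(html|alph)\((.*?)\)", text)


def matches(rule_tokens, window):
    """A rule token matches on equal kind and equal value or wildcard "_"."""
    return len(rule_tokens) == len(window) and all(
        rk == wk and rv in ("_", wv)
        for (rk, rv), (wk, wv) in zip(rule_tokens, window)
    )


def at_separator(tokens, pos, left_alts, right_alts):
    """True if some left alternative ends at pos and some right one starts there."""
    left_ok = any(
        len(r) <= pos and matches(r, tokens[pos - len(r):pos]) for r in left_alts
    )
    right_ok = any(matches(r, tokens[pos:pos + len(r)]) for r in right_alts)
    return left_ok and right_ok


def extract(tokens, front, back):
    """Return the text between the first front and back separator positions."""
    for p in range(len(tokens) + 1):
        if at_separator(tokens, p, *front):
            for q in range(p + 1, len(tokens) + 1):
                if at_separator(tokens, q, *back):
                    return " ".join(value for _, value in tokens[p:q])
    return None


# (left alternatives, right alternatives) for each separator, as in the text
FRONT = ([rule("alph(Author)html(<tr>)html(<td>)html(<a>)html(</a>)html(<a>)"),
          rule("alph(Author)html(<tr>)html(<td>)html(<a>)")],
         [rule("alph(_)")])
BACK = ([rule("alph(_)")],
        [rule("html(</a>)html(<br>)html(<div>)alph(_)")])

first = extract(tokenize("Author<tr><td><a></a><a>Daniel</a><br><div>Farmer"), FRONT, BACK)
second = extract(tokenize("Author<tr><td><a>Alex</a><br><div>General"), FRONT, BACK)
```

Both page variants are handled by the same rule set because the left rule of the front separator carries two alternatives and the alph tokens are wildcards.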
Step S301, select web pages to be automatically labeled from the web page collection, classify them according to the training web pages labeled by the user, and at the same time generate the web page template of the category corresponding to each training web page.
Step S301 further comprises:
Step 311, perform the operation of selecting the web pages to be automatically labeled from the web page collection.
Step 312, build the DOM tree of the user-labeled training web page and the DOM tree of the web page to be automatically labeled; the DOM tree of the training web page is the web page template of the category corresponding to the training web page.
Step 313, in the DOM tree of the training web page, determine the common ancestor node of all nodes in the labeled record, and search all same-named neighbor nodes of the ancestor node; judge whether the similarity between the subtree rooted at a same-named neighbor node and the subtree rooted at the node in the labeled record is greater than a predetermined threshold; if so, determine that the same-named neighbor node and the node in the labeled record are repeating data nodes; otherwise, continue the search among the nodes one level above the ancestor node, until repeating data nodes are found or the root node of the DOM tree of the training web page is reached.
Step 314, compute the similarity between the DOM tree of the training web page and the DOM tree of the web page to be automatically labeled. If the similarity is greater than a predetermined threshold, determine that the web page to be automatically labeled belongs to the category corresponding to the training web page; match the web page template against the DOM trees of the web pages to be automatically labeled in the category, and delete the nodes in the web page template that are not matched and are not repeating data nodes; match each subtree in the web page template rooted at a repeating data node other than the node in the labeled record against the subtree rooted at the node in the labeled record, and shield the nodes that are not matched in the subtrees rooted at those other repeating data nodes. Otherwise, execute step 315.
Step 315, prompt the user to label the web page to be automatically labeled so as to generate a new training web page, and execute step 312.
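The search for repeating data nodes in step 313 can be sketched as follows. Several points are assumptions: same-named neighbor nodes are taken to be siblings with the same tag name, the subtree similarity is approximated with difflib.SequenceMatcher over preorder tag lists instead of the tree similarity used in the text, and Node, tags, similarity and find_repeating are hypothetical names.

```python
from difflib import SequenceMatcher


class Node:
    def __init__(self, tag, children=()):
        self.tag = tag
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def tags(node):
    """Preorder list of tag names in the subtree rooted at node."""
    out = [node.tag]
    for child in node.children:
        out.extend(tags(child))
    return out


def similarity(a, b):
    """Assumed stand-in for subtree similarity, in the range 0..1."""
    return SequenceMatcher(None, tags(a), tags(b)).ratio()


def find_repeating(ancestor, threshold=0.8):
    """Start at the common ancestor of the labeled nodes and walk upward.

    At each level, same-named siblings whose subtrees are similar enough are
    reported together with the current node as repeating data nodes. An empty
    list means the root was reached without finding any.
    """
    node = ancestor
    while node.parent is not None:
        repeats = [
            sibling for sibling in node.parent.children
            if sibling is not node
            and sibling.tag == node.tag
            and similarity(sibling, node) > threshold
        ]
        if repeats:
            return [node] + repeats
        node = node.parent
    return []


first = Node("li", [Node("a"), Node("span")])
second = Node("li", [Node("a"), Node("span")])
third = Node("li", [Node("a"), Node("span")])
body = Node("body", [Node("ul", [first, second, third, Node("div")])])
found = find_repeating(first)
```

In this tree the three li records are reported as repeating data nodes; starting one level lower, at the a element of the first record, leads to the same result after one step upward.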
Step S302, according to the web page template, shield the differences between each web page to be automatically labeled and the web page template of the category to which it belongs.
Step S302 further comprises matching the DOM tree of each web page to be automatically labeled in a category against the web page template of that category, and shielding the nodes in that DOM tree that are not matched.
Step S303, parse the training web pages of the category to generate a first wrapper file, and use the first wrapper file to automatically label the web pages to be automatically labeled in the category, so as to generate new training web pages.
Step S303 further comprises converting the training web page into a token sequence, locating the front and back separators of the labeled data, and determining the left and right rules of each separator; taking each separator as a state of a nondeterministic state machine and the left and right rules of each separator as the transition rule for jumping from the current state to the next state; and performing automatic labeling according to the first wrapper file.
Step S304, parse all training web pages to generate a second wrapper file.
Step S304 further comprises:
Step 341, convert the training web pages into token sequences, locate the front and back separators of the labeled data, and determine the left and right rules of each separator;
Step 342, each separator is a state of a nondeterministic state machine, and the left and right rules of each separator are the transition rule for jumping from the current state to the next state.
Step S305, apply the second wrapper file to extract information from the web pages in the web page collection that were not selected.
Step S305 further comprises converting the unselected web page into a token sequence; traversing the tokens in the token sequence and judging whether each token satisfies the transition rule of the current state, and if so, jumping to the next state; and, when the current state and the next state correspond respectively to the front and back separators of an attribute, saving the web page text between the separator of the current state and the separator of the next state as the value of that attribute.
Those skilled in the art may make various modifications to the above without departing from the spirit and scope of the present invention as defined by the claims. The scope of the present invention is therefore not limited to the above description but is determined by the scope of the claims.

Claims (22)

1. A system for web page information extraction, characterized by comprising:
A template generation module, configured to select web pages to be automatically labeled from a web page collection, classify the web pages to be automatically labeled according to training web pages labeled by a user, and at the same time generate a web page template of the category corresponding to the training web page;
A web page homogenization module, configured to shield, according to the web page template of the category, the differences between the web pages to be automatically labeled that belong to the category and the web page template of the category;
An automatic labeling module, configured to parse the training web pages of the category to generate a first wrapper file, and to automatically label the web pages to be automatically labeled of the category by means of the first wrapper file, so as to generate new training web pages;
A wrapper file generation module, configured to parse all training web pages to generate a second wrapper file;
An online extraction module, configured to apply the second wrapper file to extract information from the web pages in the web page collection that were not selected.
2. The system for web page information extraction according to claim 1, characterized in that
The template generation module is further configured to perform the operation of selecting web pages to be automatically labeled from the web page collection; build the DOM tree of the user-labeled training web page and the DOM tree of the web page to be automatically labeled, the web page template of the category corresponding to the training web page being the DOM tree of the training web page; compute the similarity between the DOM tree of the training web page and the DOM tree of the web page to be automatically labeled so as to perform a similarity comparison; if the similarity is greater than a predetermined threshold, the web page to be automatically labeled belongs to the category corresponding to the training web page; otherwise, prompt the user to label the web page to be automatically labeled so as to generate a new training web page, build the DOM tree of the new training web page and the DOM tree of the web page to be automatically labeled, the web page template of the category corresponding to the new training web page being the DOM tree of the new training web page, and perform the similarity comparison between the DOM tree of the new training web page and the DOM tree of the web page to be automatically labeled, so as to complete the classification.
3. The system for web page information extraction according to claim 2, characterized in that
The template generation module is further configured to simplify the web page template by using the DOM trees of the web pages to be automatically labeled in the category.
4. The system for web page information extraction according to claim 3, characterized in that
The template generation module, when simplifying the web page template, is further configured to identify the repeating data nodes in the web page template, match the web page template against the DOM trees of the web pages to be automatically labeled in the category, and delete the nodes in the web page template that are not matched and are not repeating data nodes.
5. The system for web page information extraction according to claim 4, characterized in that
The template generation module, when identifying the repeating data nodes in the web page template, is further configured to determine, in the web page template, the common ancestor node of all nodes in the labeled record; search all same-named neighbor nodes of the ancestor node; and judge whether the similarity between the subtree rooted at a same-named neighbor node and the subtree rooted at the node in the labeled record is greater than a predetermined threshold; if so, the same-named neighbor node and the node in the labeled record are repeating data nodes; otherwise, the search continues among the nodes one level above the ancestor node, until repeating data nodes are found or the root node of the web page template is reached.
6. The system for web page information extraction according to claim 5, characterized in that
The template generation module is further configured to, after simplifying the web page template, match each subtree in the web page template rooted at a repeating data node other than the node in the labeled record against the subtree rooted at the node in the labeled record, and shield the nodes that are not matched in the subtrees rooted at the repeating data nodes other than the node in the labeled record.
7. The system for web page information extraction according to claim 2, characterized in that
The web page homogenization module is further configured to match the DOM trees of the web pages to be automatically labeled in a same category against the web page template, and shield the nodes in the DOM trees of the web pages to be automatically labeled that are not matched.
8. The system for web page information extraction according to claim 1, characterized in that
The automatic labeling module, when parsing the training web pages of the category to generate the first wrapper file, is further configured to convert the training web page into a token sequence, locate the front and back separators of the labeled data, and determine the left and right rules of the separators.
9. The system for web page information extraction according to claim 8, characterized in that
The automatic labeling module, when parsing the training web pages of the category to generate the first wrapper file, is further configured to take each separator as a state of a nondeterministic state machine and the left and right rules of each separator as the transition rule for jumping from the current state to the next state.
10. The system for web page information extraction according to claim 1, characterized in that
The wrapper file generation module is further configured to convert the training web pages into token sequences, locate the front and back separators of the labeled data, and determine the left and right rules of the separators; each separator is a state of a nondeterministic state machine, and the left and right rules of each separator are the transition rule for jumping from the current state to the next state.
11. The system for web page information extraction according to claim 10, characterized in that
The online extraction module is further configured to convert the web pages in the web page collection that were not selected into token sequences, traverse the tokens in each token sequence, and judge whether each token satisfies the transition rule of the current state; if so, jump to the next state; and, when the current state and the next state correspond respectively to the front and back separators of an attribute, save the web page text between the separator of the current state and the separator of the next state as the value of that attribute.
12. the method for a Web page information extraction is characterized in that, comprising:
Step 1 is chosen from collections of web pages and is treated to mark automatically webpage, treats that to described marking webpage automatically classifies according to the training webpage of user's mark, generates the web page template of described training webpage corresponding class simultaneously;
Step 2, according to the shielding of the web page template of described classification belong to described classification treat mark webpage automatically with the difference between the web page template of described classification;
Step 3 is resolved the training webpage of described classification, generates the first wrapper file, by the described first wrapper file to described classification treat that marking webpage automatically marks automatically, to generate new training webpage;
Step 4 is resolved all training webpages, generates the second wrapper file;
Step 5 is used the described second wrapper file info web that is not selected in the described collections of web pages is extracted.
13. the method for Web page information extraction as claimed in claim 12 is characterized in that,
Described step 1 further is:
Step 131 is carried out the described operation for the treatment of to mark automatically webpage of choosing from collections of web pages;
Step 132, make up the dom tree and the described dom tree for the treatment of to mark automatically webpage of the training webpage of user's mark, the dom tree of described training webpage is as the web page template of described training webpage corresponding class, the dom tree that calculates described training webpage is with the described similarity for the treatment of to mark automatically the dom tree of webpage, if described similarity is greater than predetermined threshold value, determine describedly to treat to mark automatically webpage and belong to described training webpage corresponding class, otherwise, execution in step 133;
Described step 133, the prompting user treats that to described marking webpage automatically marks, and to generate new training webpage, carries out described step 132.
14. the method for Web page information extraction as claimed in claim 13 is characterized in that,
Described step 132 also comprises:
Step 141, the dom tree that utilizes treating in the described classification to mark webpage is automatically simplified described web page template.
15. the method for Web page information extraction as claimed in claim 14 is characterized in that,
Described step 141 further is:
Step 151 identifies the repeating data node in the described web page template;
Step 152, determine described treat to mark automatically webpage and belong to described training webpage corresponding class after, the treat dom tree that automatically mark webpage of described web page template with described classification mated, will do not mated in the described web page template and not be the knot removal of described repeating data node.
16. the method for Web page information extraction as claimed in claim 15 is characterized in that,
Described step 151 further is
Step 161 is determined common ancestor's node that is labeled all nodes in the record in described web page template, search all neighbor nodes of the same name of described ancestor node;
Step 162, judgement with the described neighbor node of the same name subtree that is root node with the similarity of the subtree that is root node with the described node that is labeled in the record whether greater than predetermined threshold value, if determine that then described neighbor node of the same name and the described node that is labeled in the record are the repeating data node; Otherwise, in the last layer node of described ancestor node, search till finding the repeating data node or finding the root node of described web page template.
17. the method for Web page information extraction as claimed in claim 16 is characterized in that,
Also comprise after the described step 141:
Step 171, with in the described web page template with the repeating data node except that the described node that is labeled in the record be the subtree of root node subtree is mated with being root node with the described node that is labeled in the record,
Step 172, with described be the node shielding of not mated in the subtree of root node with the repeating data node except that the described node that is labeled in the record.
18. the method for Web page information extraction as claimed in claim 13 is characterized in that,
Described step 2 further is,
Step 181 will treat described in the same classification that the dom tree that marks webpage automatically mates with described web page template, treats to mark automatically the node shielding of not mated in the dom tree of webpage with described.
19. the method for Web page information extraction as claimed in claim 12 is characterized in that,
Resolving the training webpage of described classification in the described step 3, generating the first wrapper file and further be,
Step 191 is converted into flag sequence with described training webpage, locatees the front and back separator of the data of described mark, determine described separator about the rule.
20. the method for Web page information extraction as claimed in claim 19 is characterized in that,
Also comprise after the described step 191:
With the state of each separator, with regular about each separator correspondence as the redirect rule that jumps to NextState from current state as the nondeterministic statement machine.
21. the method for Web page information extraction as claimed in claim 12 is characterized in that,
Described step 4 further is,
Step 211 is converted into flag sequence with described training webpage, locatees the front and back separator of the data of described mark, determine described separator about the rule;
Step 212, each separator are as a state of nondeterministic statement machine, and rule is for jumping to the redirect rule of NextState about each separator correspondence from current state.
22. the method for Web page information extraction as claimed in claim 21 is characterized in that,
Described step 5 further is:
Step 221 is converted into flag sequence with the described webpage that is not selected;
Step 222 travels through mark in the described flag sequence, and whether judge mark meets the redirect rule of current state, if then jump to NextState;
Step 223, when the separator of the front and back of the respectively corresponding attribute of described current state and described NextState, web page text between the decollator of the separator of described current state correspondence and described NextState correspondence is preserved as the value of the attribute of described separator correspondence.
CN2009100765483A 2009-01-08 2009-01-08 Web page information extraction system and method Active CN101464905B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100765483A CN101464905B (en) 2009-01-08 2009-01-08 Web page information extraction system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100765483A CN101464905B (en) 2009-01-08 2009-01-08 Web page information extraction system and method

Publications (2)

Publication Number Publication Date
CN101464905A true CN101464905A (en) 2009-06-24
CN101464905B CN101464905B (en) 2011-03-23

Family

ID=40805480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100765483A Active CN101464905B (en) 2009-01-08 2009-01-08 Web page information extraction system and method

Country Status (1)

Country Link
CN (1) CN101464905B (en)

Cited By (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101944094B (en) * 2009-07-06 2014-06-18 富士通株式会社 Webpage information extraction method and device thereof
CN101957816B (en) * 2009-07-13 2013-03-20 上海华燕置业发展有限公司 Webpage metadata automatic extraction method and system based on multi-page comparison
CN102073654B (en) * 2009-11-20 2012-12-19 富士通株式会社 Methods and equipment for generating and maintaining web content extraction template
CN101950312A (en) * 2010-08-18 2011-01-19 赵清政 Method for analyzing webpage content of internet
CN101950312B (en) * 2010-08-18 2012-07-04 赵清政 Method for analyzing webpage content of internet
CN101916285A (en) * 2010-08-20 2010-12-15 北京新岸线网络技术有限公司 Method and device for analyzing internet web page contents
CN101916285B (en) * 2010-08-20 2016-06-08 北京新岸线移动多媒体技术有限公司 A kind of method for analyzing internet web page contents and device
CN105786972A (en) * 2010-08-20 2016-07-20 北京新岸线移动多媒体技术有限公司 Webpage template generation method and device
CN102637172A (en) * 2011-02-10 2012-08-15 北京百度网讯科技有限公司 Webpage blocking marking method and system
CN102637172B (en) * 2011-02-10 2013-11-27 北京百度网讯科技有限公司 Webpage blocking marking method and system
CN102790967A (en) * 2011-05-19 2012-11-21 华晶科技股份有限公司 Wireless network access method
CN102790967B (en) * 2011-05-19 2015-02-04 华晶科技股份有限公司 Wireless network access method
CN102890681A (en) * 2011-07-20 2013-01-23 阿里巴巴集团控股有限公司 Method and system for generating webpage structure template
CN102890681B (en) * 2011-07-20 2016-03-09 阿里巴巴集团控股有限公司 A kind of method and system of generating web page stay in place form
CN102298638A (en) * 2011-08-31 2011-12-28 北京中搜网络技术股份有限公司 Method and system for extracting news webpage contents by clustering webpage labels
CN102495847B (en) * 2011-11-16 2017-04-19 浙江盘石信息技术股份有限公司 Network commodity information extraction method
CN102495847A (en) * 2011-11-16 2012-06-13 浙江盘石信息技术有限公司 Network commodity information extraction method
CN102567494A (en) * 2011-12-22 2012-07-11 北京亿赞普网络技术有限公司 Website classification method and device
CN102567494B (en) * 2011-12-22 2014-07-02 北京亿赞普网络技术有限公司 Website classification method and device
CN102567530A (en) * 2011-12-31 2012-07-11 凤凰在线(北京)信息技术有限公司 Intelligent extraction system and intelligent extraction method for article type web pages
CN102567530B (en) * 2011-12-31 2014-06-11 凤凰在线(北京)信息技术有限公司 Intelligent extraction system and intelligent extraction method for article type web pages
CN102789474A (en) * 2012-04-12 2012-11-21 北京京东世纪贸易有限公司 Method and device for processing webpage data
CN103870506B (en) * 2012-12-17 2017-02-08 中国科学院计算技术研究所 Webpage information extraction method and system
CN103870506A (en) * 2012-12-17 2014-06-18 中国科学院计算技术研究所 Webpage information extraction method and system
CN103559199B (en) * 2013-09-29 2016-09-28 北京航空航天大学 Method for abstracting web page information and device
CN103559199A (en) * 2013-09-29 2014-02-05 北京航空航天大学 Web information extraction method and web information extraction device
CN103678510A (en) * 2013-11-25 2014-03-26 北京奇虎科技有限公司 Method and device for providing visualized label for webpage
CN103853823A (en) * 2014-02-26 2014-06-11 中国科学院计算技术研究所 Online encyclopedia oriented entity attribute extraction method and system
CN103853823B (en) * 2014-02-26 2017-01-18 中国科学院计算技术研究所 Online encyclopedia oriented entity attribute extraction method and system
CN103870567A (en) * 2014-03-11 2014-06-18 浪潮集团有限公司 Automatic identifying method for webpage collecting template of vertical search engine in cloud computing
CN103870601B (en) * 2014-04-02 2017-07-11 王青 A kind of method and device for identifying and highlighting web page contents
CN106575298A (en) * 2014-07-25 2017-04-19 高通股份有限公司 Fast rendering of websites containing dynamic content and stale content
CN106575298B (en) * 2014-07-25 2020-10-30 高通股份有限公司 Rapid presentation of web sites containing dynamic content and stale content
CN105528357A (en) * 2014-09-30 2016-04-27 中国银联股份有限公司 Webpage content extraction method based on similarity of URLs and similarity of webpage document structures
CN104504086A (en) * 2014-12-25 2015-04-08 北京国双科技有限公司 Clustering method and device for webpage
CN104504086B (en) * 2014-12-25 2017-11-21 北京国双科技有限公司 The clustering method and device of Webpage
US10530671B2 (en) 2015-01-15 2020-01-07 The University Of North Carolina At Chapel Hill Methods, systems, and computer readable media for generating and using a web page classification model
WO2016115319A1 (en) * 2015-01-15 2016-07-21 The University Of North Carolina At Chapel Hill Methods, systems, and computer readable media for generating and using a web page classification model
CN107402930A (en) * 2016-05-20 2017-11-28 阿里巴巴集团控股有限公司 The amending method and device of web page text
CN108874373B (en) * 2017-05-12 2023-05-30 深圳市雅阅科技有限公司 Method and device for inserting information into webpage, display terminal and storage medium
CN108874373A (en) * 2017-05-12 2018-11-23 腾讯科技(深圳)有限公司 Method and device, display terminal and the storage medium of information are inserted into webpage
CN107480134A (en) * 2017-07-28 2017-12-15 国信优易数据有限公司 A kind of data processing method and system
WO2019024755A1 (en) * 2017-08-01 2019-02-07 阿里巴巴集团控股有限公司 Webpage information extraction method, apparatus and system, and electronic device
CN110019829A (en) * 2017-09-19 2019-07-16 小草数语(北京)科技有限公司 Data attribute determines method, apparatus
CN107808000A (en) * 2017-11-13 2018-03-16 哈尔滨工业大学(威海) A kind of hidden web data collection and extraction system and method
CN107808000B (en) * 2017-11-13 2020-05-22 哈尔滨工业大学(威海) System and method for collecting and extracting data of dark net
US20190303501A1 (en) * 2018-03-27 2019-10-03 International Business Machines Corporation Self-adaptive web crawling and text extraction
US10922366B2 (en) * 2018-03-27 2021-02-16 International Business Machines Corporation Self-adaptive web crawling and text extraction
CN109446195A (en) * 2018-09-20 2019-03-08 成都捕风数据科技有限公司 A kind of design method of non-homogeneous digital asset standard
CN109359301A (en) * 2018-10-19 2019-02-19 国家计算机网络与信息安全管理中心 A kind of the various dimensions mask method and device of web page contents
CN109299271A (en) * 2018-10-30 2019-02-01 腾讯科技(深圳)有限公司 Training sample generation, text data, public sentiment event category method and relevant device
CN109299271B (en) * 2018-10-30 2022-04-05 腾讯科技(深圳)有限公司 Training sample generation method, text data method, public opinion event classification method and related equipment
CN111368227A (en) * 2018-12-25 2020-07-03 阿里巴巴集团控股有限公司 URL processing method and device
CN111368227B (en) * 2018-12-25 2023-06-27 阿里巴巴集团控股有限公司 URL processing method and device
CN109726341A (en) * 2018-12-28 2019-05-07 四川新网银行股份有限公司 A kind of automatic abstracting method of webpage information based on Web page classifying and cluster
CN111090797A (en) * 2019-11-29 2020-05-01 苏宁云计算有限公司 Data acquisition method and device, computer equipment and storage medium
CN111090797B (en) * 2019-11-29 2023-07-25 苏宁云计算有限公司 Data acquisition method, device, computer equipment and storage medium
CN113822272A (en) * 2020-11-12 2021-12-21 北京沃东天骏信息技术有限公司 Data processing method and device
CN112395483A (en) * 2020-11-13 2021-02-23 郑州阿帕斯数云信息科技有限公司 Page rendering method and device based on tree structure
CN112395483B (en) * 2020-11-13 2024-03-01 郑州阿帕斯数云信息科技有限公司 Page rendering method and device based on tree structure

Also Published As

Publication number Publication date
CN101464905B (en) 2011-03-23

Similar Documents

Publication Publication Date Title
CN101464905B (en) Web page information extraction system and method
Arasu et al. Extracting structured data from web pages
Nguyen et al. Relation extraction from Wikipedia using subtree mining
US7941420B2 (en) Method for organizing structurally similar web pages from a web site
CN101593200A (en) Chinese Web page classification method based on the keyword frequency analysis
CN101957816A (en) Webpage metadata automatic extraction method and system based on multi-page comparison
CN102831121A (en) Method and system for extracting webpage information
CN104199871A (en) High-speed test question inputting method for intelligent teaching
CN107436955B (en) English word correlation degree calculation method and device based on Wikipedia concept vector
CN100432996C (en) System, method and program for extracting web page core content based on web page layout
CN105022803A (en) Method and system for extracting text content of webpage
CN103530429A (en) Webpage content extracting method
CN107273546A (en) Counterfeit application detection method and system
CN108021682A (en) Open information extracts a kind of Entity Semantics method based on wikipedia under background
CN111190873B (en) Log mode extraction method and system for log training of cloud native system
Machanavajjhala et al. Collective extraction from heterogeneous web lists
Omari et al. Cross-supervised synthesis of web-crawlers
CN102567016B (en) Method and device for extracting use example of application programming interface
CN105843661B (en) A kind of code method for relocating and its system towards host system
CN104462151A (en) Method for evaluating web page publishing time and related device
Parameswaran et al. Optimal schemes for robust web extraction
CN109948015B (en) Meta search list result extraction method and system
CN117390329A (en) Webpage labeling method, device and equipment
CN103870590A (en) Webpage identification method and device with error-reported characteristic
Souza et al. ARCTIC: metadata extraction from scientific papers in pdf using two-layer CRF

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20090624

Assignee: Branch DNT data Polytron Technologies Inc

Assignor: Institute of Computing Technology, Chinese Academy of Sciences

Contract record no.: 2018110000033

Denomination of invention: Web page information extraction system and method

Granted publication date: 20110323

License type: Common License

Record date: 20180807