CN101872350A - Web page text extracting method and device thereof - Google Patents

Publication number
CN101872350A
CN101872350A CN200910137364A CN200910137364A CN101872350A CN 101872350 A CN101872350 A CN 101872350A CN 200910137364 A CN200910137364 A CN 200910137364A CN 200910137364 A CN200910137364 A CN 200910137364A CN 101872350 A CN101872350 A CN 101872350A
Authority
CN
China
Prior art keywords
node
given
webpage
tree structure
leaf
Legal status
Pending
Application number
CN200910137364A
Other languages
Chinese (zh)
Inventor
贾晓建
王主龙
孟遥
于浩
Current Assignee
Fujitsu Ltd
Original Assignee
Fujitsu Ltd

Abstract

The invention discloses a web page text extracting method and a device thereof. In one embodiment of the invention, the method comprises the following steps: expressing a web page as a tree structure; judging whether each node in the tree structure is a valid node; and combining the text information contained in the leaf nodes that are valid nodes to obtain the body text of the web page, wherein, for a given node in the tree structure, the given node is judged to be a valid node if the proportion of nodes of a predetermined type among its child nodes is less than or equal to a first threshold.

Description

Web page text extracting method and device
Technical field
The present invention relates to the field of information processing, and in particular to a web page text extracting method and device.
Background art
With the continuous development of Internet information technology, the amount of information on the Internet grows day by day. In recent years, the volume of data on the global Internet has been increasing explosively. According to an IDC report, the amount of information is expected to grow by 57% per year and to reach 988 EB (1 EB = 1 billion GB) in 2010, about six times the 2006 figure and equivalent to 18 million times the information contained in all books written since the dawn of human civilization. Faced with such a huge store of Internet information, how to better understand this mass of information has always been a key problem in the field of information processing.
Although XML can be regarded as the universal language of the Web, almost all network information currently available takes the form of web pages written in HTML, and this situation is unlikely to change in the short term (see non-patent literature [1] "Giacomo Fiumara. Automated Information Extraction from Web Source: a Survey. Salita Sperone 31, I-98166 Messina, Italy"). HTML is a display-oriented markup language designed mainly to make it easy for browsers to render pages. From a human reader's point of view a page contains a great deal of useless information, and even more once advertisements are introduced. Extracting the text information from HTML pages is therefore an essential precondition for better understanding the vast amount of information on the network. A web page text extracting method is thus needed so that the body content of structured documents on the network, such as web pages and XML documents, can be extracted by suitable technical means.
The traditional approach to web data extraction uses a wrapper to extract the data of interest from a web page. A wrapper is a program that reads specific content from an HTML document and saves it in some format, usually XML. A wrapper comprises a series of rules and uses them to extract specific content from a page. One of the main lines of current research on web data extraction is therefore to find effective methods for obtaining, with relatively little effort, the rules needed to construct a wrapper (see non-patent literature [1]).
The wrappers in the TSIMMIS tool described in non-patent literature [2] "Hammer J., McHugh J., et al. Semistructured Data: The TSIMMIS Experience [A]. In: Proceedings of the First East European Symposium on Advances in Databases and Information Systems [C]. 1997: 1-8" require extraction rules to be written by hand and placed in a special file. Each rule has the form [variables, source, pattern], where variables holds the extraction result, source holds the input, and pattern holds the pattern of the data in source. A variable can serve as the source of a later rule. After the last rule in the file has been executed, the variables hold the final extraction result. Writing rules by hand in this way is time-consuming, laborious, error-prone, and hard to maintain.
The wrappers of the XWRAP system described in non-patent literature [3] "Liu, L., Pu, C., et al. XWRAP: An XML-enable Wrapper Construction System for the Web Information Source [C]. In: Proceedings of the 16th IEEE International Conference on Data Engineering, 2000: 611-620" obtain their rules semi-automatically. The system provides a friendly human-computer interface, and the user writes the rules under the guidance of the system. The system finally generates a wrapper, written in Java, for a particular data source. Before extraction, the XWRAP system checks the web page, corrects syntax errors and non-standard tags in it, and parses the page into a tree.
The RoadRunner tool described in non-patent literature [4] "Valter Crescenzi, Giansalvatore Mecca, et al. RoadRunner: Towards Automatic Data Extraction from Large Web Site [A]. In: Proceedings of the 26th International Conference on Very Large Database Systems [C], 2001: 109-118" is a fully automatic wrapper generation tool. It generates a pattern for the data contained in web pages by comparing the structure of two (or more) sample pages from the same data source, and does not even require the user to provide samples of the data to be extracted or a target pattern. However, this method assumes that all target pages are generated automatically from some data source, since only then can it use the tag structure of the pages to derive the pattern of the data they contain. Its scope of application is therefore limited.
The wrappers generated by the methods described above all extract data by means of certain rules or patterns. Because web page structure is complex and poorly standardized, however, a given wrapper implementation can generally target only one information source. As can be seen from non-patent literature [5] "Alberto H. F. Laender, Berthier A. Ribeiro-Neto, et al. A Brief Survey of Web Data Extraction Tools [J]. SIGMOD Record, 2002, 31(2): 84-93", current web data extraction tools all require a wrapper or extraction rules to be written for each specific data source. If the information comes from many sources, many wrappers are needed, and generating and maintaining them becomes a complex task. For a task such as extracting text information from the large number of news pages on the network, a method that relies on wrappers for specific information sources is clearly not feasible.
Non-patent literature [6] "Sun Chengjie, Guan Yi. Research on a statistics-based method for extracting web page body text information [J]. Journal of Chinese Information Processing, 2004, 18(5): 17-22" approaches the problem from a statistical angle and attempts to solve web page text extraction with statistical methods. Based on the relevant statistics and on the characteristics of web page body text, that document proposes and implements a relatively general method and obtains fairly good results. However, the method requires all body text to be contained in table tags, which limits its wider application.
Summary of the invention
In view of the foregoing, the present invention proposes a web page text extracting method and device so that the vast amount of information on the network can be processed and used more conveniently.
To achieve the above object, according to one aspect of the present invention, a web page text extracting method is provided, comprising: expressing a web page as a tree structure; judging whether each node in the tree structure is a valid node; and combining the text information contained in the leaf nodes that are valid nodes to obtain the body text of the web page, wherein, for a given node in the tree structure, the given node is judged to be a valid node if the proportion of nodes of a predetermined type among the child nodes of the given node is less than or equal to a first threshold.
According to one embodiment of the present invention, before the web page is expressed as a tree structure, the method further comprises processing the web page to obtain a web page that conforms to web standards.
Preferably, whether each node in the tree structure is a valid node is judged by a post-order traversal of the tree structure. If a given node in the tree structure is judged to be an invalid node, the given node and all of its descendant nodes are discarded.
According to one embodiment of the present invention, expressing the web page as a tree structure comprises: defining a data structure for representing a node; and, according to the structural information of the web page, using the data structure to express the web page as the tree structure. The data structure comprises: the type of the node; the value of the node; information for finding all child nodes of the node; information for tracing back to the parent node of the node; information for finding the next sibling node of the node; and the name of the node.
Preferably, the nodes of the predetermined type are at least one of link nodes and image nodes.
According to one embodiment of the present invention, for a given non-leaf node in the tree structure, if the given non-leaf node is a link node or an image node, the given non-leaf node is judged to be a valid node; and if the given non-leaf node is a script node or a style node, the given non-leaf node is judged to be an invalid node.
According to one embodiment of the present invention, for a given leaf node in the tree structure, the given leaf node is judged to be an invalid node if it is not a text node, or if it is a text node but its parent node is a script node or a style node.
According to another embodiment of the present invention, for a given leaf node in the tree structure, if the given leaf node is a text node and its parent node is not a script node or a style node, the given leaf node is judged to be a valid node when its content length is greater than a second threshold. When the content length of the given leaf node is less than or equal to the second threshold, the given leaf node is judged to be a valid node if its parent node is a node used to adjust how the font is displayed.
According to another aspect of the present invention, a web page text extracting device is provided, comprising: a web page representation part for expressing a web page as a tree structure; a node validity judging part for judging whether each node in the tree structure is a valid node; and a text information combining part for combining the text information contained in the leaf nodes that are valid nodes to obtain the body text of the web page, wherein, for a given node in the tree structure, the node validity judging part judges the given node to be a valid node if the proportion of nodes of a predetermined type among the child nodes of the given node is less than or equal to a first threshold.
The web page text extracting method and device according to the present invention not only are highly general but also process pages quickly: a single traversal of the tree structure representing the web page suffices to obtain the body text of the page.
In addition, the present invention also provides a corresponding computer-readable storage medium and computer program.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention will be understood more readily from the following description of embodiments of the invention taken in conjunction with the accompanying drawings. In the drawings, identical or corresponding technical features or components are denoted by identical or corresponding reference numerals. In the drawings:
Fig. 1 is a flowchart of a web page text extracting method according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the tree structure of a web page obtained with the web page text extracting method according to an embodiment of the invention;
Fig. 3 is a schematic diagram of the tree structure of another web page obtained with the web page text extracting method according to an embodiment of the invention;
Fig. 4 is a flowchart of a process for judging node validity according to an embodiment of the invention;
Fig. 5 is a flowchart of a text information extraction process according to an embodiment of the invention; and
Fig. 6 is a schematic diagram of a web page text extracting device according to an embodiment of the invention.
Detailed description of the embodiments
Embodiments of the invention are described below with reference to the accompanying drawings. It should be noted that, for clarity, representations and descriptions of components and processes that are unrelated to the invention or known to those of ordinary skill in the art are omitted from the drawings and the description.
It should be understood that, in developing any such practical embodiment, many embodiment-specific decisions must be made in order to achieve the developer's specific goals, for example complying with system-related and business-related constraints, and these constraints may vary from one embodiment to another. It should also be appreciated that, although such development work may be complex and time-consuming, it is merely a routine task for those skilled in the art who have the benefit of this disclosure.
The general working principle of the web page text extracting method according to the embodiment of the invention is first described with reference to the drawings, in particular Fig. 1 to Fig. 5. Fig. 1 is a flowchart of a web page text extracting method according to an embodiment of the invention.
As shown in Fig. 1, the web page text extracting method according to this embodiment starts at step S100. In step S102, the web page is preprocessed to obtain a web page that conforms to web standards. Here, the ways in which pages to be processed may fail to conform to web standards are identified, and these cases are handled so that the pages conform to the general web standards.
Usually, a web page consists of three parts: structure, presentation, and behavior. The corresponding standards likewise fall into three groups: structural standard languages, mainly XHTML and XML; presentation standard languages, mainly CSS; and behavioral standards, mainly object models (such as the W3C DOM) and ECMAScript. Most of these standards are drafted and published by the W3C, while some are produced by other standards bodies, such as the ECMAScript standard of ECMA (European Computer Manufacturers Association). In the early days, before the HTML standard had fully taken shape, it did not matter whether a <p> tag was closed correctly, or even whether the markup ignored design and formatting rules altogether. Mismatched tags, missing attribute values, incorrect nesting, and similar errors all stem from the lack of a widely accepted standard; because most browsers have built-in fault tolerance, many web developers never even noticed these errors. Therefore, to obtain web pages that can be processed uniformly, the pages must be preprocessed so as to standardize them.
Standardizing a web page means making the page to be processed conform to web standards. The present invention focuses on the structural standard, that is, on making the page meet the requirements of the XHTML language. In the embodiment of the invention, the following aspects are given particular attention when a page is preprocessed. Other departures from the XHTML requirements can of course also be handled, depending on the actual situation, in order to standardize the page to be processed.
(1) "<" and ">" may be used only to delimit web page tags; where these two symbols occur elsewhere, they are replaced with "&lt;" and "&gt;".
(2) All tags must match, that is, every start tag must have a corresponding end tag.
(3) All attribute values of tags must be placed in quotation marks, as in <a href="www.w3c.org">.
(4) All tags must be correctly nested. For example, <a>...<b>...</a>...</b> is incorrectly nested; the correct nesting is <a>...<b>...</b>...</a>.
In step S102 of the method according to the embodiment of the invention, the web page can be processed with respect to the non-standard cases listed above, so that none of them remains in the page.
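As an illustration only, the four checks above can be approximated with the Python standard library. The sketch below is not the preprocessing used in the embodiment (which relies on a tool such as HTML Tidy); the function names `escape_text` and `is_well_formed` are hypothetical, and a strict XML parser is used here merely as a stand-in for detecting violations of rules (2) to (4).

```python
import html
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def escape_text(text):
    """Rule (1): replace literal '<' and '>' (and '&') in text with entities."""
    return html.escape(text, quote=False)


def is_well_formed(markup):
    """Rules (2)-(4): a strict XML parse fails on unclosed tags,
    unquoted attribute values, and incorrectly nested tags."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


# Hypothetical usage: detect pages that still need repair before tree building.
needs_repair = not is_well_formed("<a>x<b>y</a>z</b>")
```

A page for which `is_well_formed` returns `False` would still have to be repaired by a separate tool; this sketch only detects the problem.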
Many page standardization methods can be used for the preprocessing described above, and many free tools are available, a well-known one being HTML Tidy. Other tools, such as HTML-Kit, can of course also be used to standardize the page. HTML Tidy is a tool for checking and cleaning up HTML pages; it is particularly useful for detecting and correcting errors in deeply nested HTML. With HTML Tidy, a non-standard HTML page can be converted into a page that conforms to W3C standards.
In one embodiment of the invention, HTML Tidy is used to convert the HTML page into an XML-format page or, more precisely, into an XHTML page.
The differences between XHTML and HTML are subtle but important. XHTML can be regarded as HTML 4.01 that conforms to XML syntax. The main differences are:
(i) Every XHTML element must have a start tag and an end tag, whereas HTML does not require every element to have an end tag, for example <p>.
(ii) Empty elements must also follow the XHTML rules; for example, <br> should be written as <br/>.
(iii) Attribute values must be enclosed in double quotation marks.
These differences largely coincide with the standardization requirements set out in the embodiment of the invention, so a page processed by HTML Tidy can be regarded as a page that conforms to web standards.
After the web page has been processed in step S102 to obtain a page that conforms to web standards, the preprocessed page is expressed as a tree structure in step S104. Here, a data structure for representing a node is defined and, according to the structural information of the web page, the defined data structure is used to express the page as a tree structure.
The web page is expressed as a tree structure because a tree expresses the containment relationships between the nodes inside the page clearly and intuitively. HTML itself carries strong hierarchical information, so expressing an HTML page as a tree structure is convenient and straightforward. Representing the page as a tree also simplifies subsequent processing of the page.
Depending on the needs of the processing, the following information generally needs to be recorded for a given node when the web page is represented as a tree structure:
(a) information for tracing back to the parent node of the given node;
(b) information for obtaining all child nodes of the given node;
(c) information for finding the next sibling node of the given node;
(d) the type of the given node;
(e) the value of the given node; and
(f) the name of the given node.
Note that the tree structure obtained above is not necessarily a tree in the strict sense, because it also stores the sibling relationships between nodes.
Of the information above, items (b), (d), and (e) are indispensable and the remaining items are optional. However, obtaining the optional items while the tree structure is being built makes later processing of the page much more convenient. The information to be recorded can also be extended or reduced as required, so as to better represent the nodes in the tree structure of the web page and the relationships between them.
Node types include element nodes, text nodes, document nodes, comment nodes, attribute nodes, and so on. An element node is a basic unit of the HTML language; for example, <a> and <div> are element nodes. An attribute node is the attribute information of an element node; for example, in <a href="http://www.baidu.com"></a>, href="http://www.baidu.com" is an attribute node of the element "a". A text node is a node that contains text information; for example, "text" in <span>text</span> is a text node. The document node is the ancestor of all other nodes. A comment node is a comment written by the programmer when authoring the HTML document.
The meaning of a node's value depends on the node type: for text nodes and comment nodes, the value is the text content or the comment content respectively; for all other nodes, the value is NULL.
The name of a node is the name given to that node in the HTML page.
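The node record described by items (a) to (f) can be sketched as a small Python class. This is only an illustrative layout under stated assumptions: the class name `TreeNode`, its field names, and the helper `append_child` are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity; parent links make the graph cyclic
class TreeNode:
    node_type: str                      # (d) e.g. "element", "text", "comment"
    value: Optional[str] = None         # (e) text/comment content, None otherwise
    name: str = ""                      # (f) tag name for element nodes
    parent: Optional["TreeNode"] = field(default=None, repr=False)        # (a)
    children: List["TreeNode"] = field(default_factory=list, repr=False)  # (b)
    next_sibling: Optional["TreeNode"] = field(default=None, repr=False)  # (c)

    def append_child(self, child: "TreeNode") -> "TreeNode":
        """Attach child as the last child and maintain parent/sibling links."""
        child.parent = self
        if self.children:
            self.children[-1].next_sibling = child
        self.children.append(child)
        return child


# Hypothetical usage: the <body> content "Just<b>a</b>Sample!".
body = TreeNode("element", name="body")
first = body.append_child(TreeNode("text", value="Just"))
bold = body.append_child(TreeNode("element", name="b"))
bold.append_child(TreeNode("text", value="a"))
last = body.append_child(TreeNode("text", value="Sample!"))
```

Storing `next_sibling` alongside `children` is redundant, which matches the remark in the description that the resulting structure is not a tree in the strict sense but is convenient to traverse.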
In one embodiment of the invention, the tree structure is a Document Object Model (DOM) tree. Other tree structures can of course be used, as long as they contain the indispensable information described above.
In one embodiment of the invention, the minidom module of the Python language is used to obtain the DOM tree of the preprocessed HTML page for use in text information extraction. The minidom module is a tool commonly used by those skilled in the art.
For example, for a web page containing the following HTML code, the tree structure obtained with the web page text extracting method according to an embodiment of the invention is shown schematically in Fig. 2:
<html>
<head>
<title>Example</title>
</head>
<body>
Just<b>a</b>Sample!
</body>
</html>
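As a minimal sketch of how such a tree might be obtained with minidom, the snippet below parses the example above and lists its nodes. The line breaks between tags are removed here (an assumption made so that no whitespace-only text nodes appear), and the helper name `outline` is hypothetical.

```python
from xml.dom import minidom

# The example page above, written on one line so that the parser
# does not create whitespace-only text nodes between the tags.
PAGE = ("<html><head><title>Example</title></head>"
        "<body>Just<b>a</b>Sample!</body></html>")


def outline(node, depth=0):
    """Return one indented line per node: its name, plus its value if it has one."""
    if node.nodeValue is None:
        label = node.nodeName
    else:
        label = f"{node.nodeName}: {node.nodeValue!r}"
    lines = ["  " * depth + label]
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines


doc = minidom.parseString(PAGE)
tree_lines = outline(doc.documentElement)
```

Each minidom node exposes `parentNode`, `childNodes`, `nextSibling`, `nodeType`, `nodeValue`, and `nodeName`, which correspond to items (a) to (f) listed earlier.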
As another example, for a web page containing the following HTML code, the tree structure obtained with the web page text extracting method according to an embodiment of the invention is shown schematically in Fig. 3:
<html>
<head>
<title>Example!</title>
</head>
<body>
<div id="div1">
Just a sample!
</div>
<div id="div2">
<a href="http://www.test1.com">test1</a>
<a href="http://www.test2.com">test2</a>
<img src="sample_pic.jpg" alt="" border="0">img</img>
</div>
</body>
</html>
Returning to Fig. 1, after the web page has been expressed as a tree structure in step S104, it is judged node by node in step S106 whether each given node in the tree structure is a valid node, and in step S108 the text information contained in the leaf nodes that are valid nodes is combined to obtain the body text of the page being processed. The web page text extracting method according to this embodiment of the invention then ends at step S110.
Specifically, the node validity judgment in step S106 decides whether a given node in the tree structure is a valid node on the basis of the proportion of nodes of a predetermined type among the child nodes of that given node.
An invalid node is a node unrelated to the body text; such nodes mainly include advertisement nodes, link nodes, navigation nodes, format (style) information nodes, script information nodes, and so on. Analysis of these nodes shows that style nodes and script nodes are marked by distinctive tags and are therefore easy to recognize, whereas advertisement nodes, link nodes, navigation nodes, and the like all contain a large amount of link information and image information; indeed, the main content of such nodes is their URL links. Based on this analysis, a method is proposed that judges valid nodes by the proportion of link nodes, image nodes, and the like among a node's child nodes. A node that is not an invalid node is a valid node.
That is, when the validity of a given node in the tree structure is judged, the proportion of nodes of the predetermined type, such as link nodes and image nodes, among the child nodes of the given node is computed, and it is judged whether this proportion is greater than a predetermined threshold. If the proportion is greater than the predetermined threshold, the given node is judged to be an invalid node; otherwise, that is, if the proportion is less than or equal to the predetermined threshold, the given node is judged to be a valid node.
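The proportion test can be sketched as follows using minidom nodes. Several details are assumptions made for illustration and are not fixed by the source: link and image nodes are identified by the tag names `a` and `img`, every direct child node (including text nodes) counts toward the total, and the threshold value 0.5 is hypothetical.

```python
from xml.dom import minidom

LINK_OR_IMAGE = {"a", "img"}   # assumed tag names for link and image nodes
THRESHOLD_A = 0.5              # hypothetical value; the source derives it from statistics


def link_image_ratio(node):
    """Proportion of direct child nodes that are link or image nodes."""
    children = node.childNodes
    if not children:
        return 0.0
    hits = sum(1 for child in children if child.nodeName.lower() in LINK_OR_IMAGE)
    return hits / len(children)


def passes_ratio_test(node, threshold=THRESHOLD_A):
    """A node is kept when the proportion is less than or equal to the threshold."""
    return link_image_ratio(node) <= threshold


# Hypothetical usage on blocks resembling div1 and div2 of the Fig. 3 example.
doc = minidom.parseString(
    '<body><div id="div1">Just a sample!</div>'
    '<div id="div2"><a href="u1">test1</a><a href="u2">test2</a>'
    '<img src="p.jpg"/></div></body>'
)
div1, div2 = doc.documentElement.childNodes
```

With these assumptions, `div1` contains only text and is kept, while `div2` consists entirely of links and an image and is rejected.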
Fig. 4 is a flowchart of the detailed process of judging node validity according to an embodiment of the invention. As shown in Fig. 4, the node validity judgment according to this embodiment starts at step S400, and the node to be judged is input in step S402. It is then judged in step S404 whether the given node to be judged is a leaf node.
If it is determined in step S404 that the given node is a leaf node, it is judged in step S422 whether the given node is a text node. If it is not a text node, the given node is judged to be an invalid node in step S424.
If it is determined in step S422 that the given node is a text node, the type of the parent node of the given node is judged in step S426. If the parent node is a node such as style or script, the given node is judged to be an invalid node in step S428.
If it is determined in step S426 that the parent node of the given node is not a node such as style or script, it is judged in step S430 whether the length of the text contained in the given node is greater than threshold B. If the text length is greater than threshold B, the given node is judged to be a valid node in step S432.
If it is determined in step S430 that the length of the text contained in the given node is less than or equal to threshold B, it is judged in step S434 whether the parent node of the given node is a node that describes font type, such as strong. If so, the given node is judged to be a valid node in step S436. Otherwise, if it is determined in step S434 that the parent node is not a node that describes font type, the given node is judged to be an invalid node in step S438. Here, threshold B is a preset value obtained through statistics.
On the other hand, if the result of the judgment in step S404 is that the given node is not a leaf node, it is judged in step S406 whether the given node is a link node or an image node. If the given node is a link node or an image node, it is judged to be a valid node in step S408, so that links within the body text are not removed by mistake.
If it is determined in step S406 that the given node is not a link node or an image node, it is judged in step S410 whether the given node is a script node or a style node. If so, the given node is judged to be an invalid node in step S412.
If it is determined in step S410 that the given node is not a script node or a style node, the proportion of link nodes and image nodes among the child nodes of the given node is computed in step S414, and it is judged in step S416 whether this proportion is greater than threshold A. If the proportion is greater than threshold A, the given node is judged to be an invalid node in step S418.
If it is determined in step S416 that the proportion of link nodes and image nodes among the child nodes of the given node is less than or equal to threshold A, the given node is judged to be a valid node in step S420. Here, threshold A is likewise a preset value obtained through statistics, and it is an important basis for identifying invalid nodes.
After any of steps S424, S428, S432, S436, S438, S408, S412, S418, and S420 has judged the input node to be a valid node or an invalid node, the process ends at step S440.
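The decision flow of Fig. 4 can be sketched as a single function over minidom nodes. The step labels in the comments follow the description above; everything else is an assumption made for illustration: the tag sets, the threshold values, measuring text length after stripping surrounding whitespace, and the function name `is_valid`.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

LINK_OR_IMAGE = {"a", "img"}                     # assumed link/image tags
SCRIPT_OR_STYLE = {"script", "style"}            # assumed script/style tags
FONT_TAGS = {"strong", "b", "em", "i", "font"}   # assumed font-describing tags
THRESHOLD_A = 0.5                                # hypothetical ratio threshold
THRESHOLD_B = 10                                 # hypothetical length threshold


def is_valid(node, threshold_a=THRESHOLD_A, threshold_b=THRESHOLD_B):
    name = node.nodeName.lower()
    if not node.hasChildNodes():                         # S404: leaf node
        if node.nodeType != Node.TEXT_NODE:              # S422 -> S424
            return False
        parent = node.parentNode.nodeName.lower()
        if parent in SCRIPT_OR_STYLE:                    # S426 -> S428
            return False
        if len(node.data.strip()) > threshold_b:         # S430 -> S432
            return True
        return parent in FONT_TAGS                       # S434 -> S436 / S438
    if name in LINK_OR_IMAGE:                            # S406 -> S408
        return True
    if name in SCRIPT_OR_STYLE:                          # S410 -> S412
        return False
    children = node.childNodes                           # S414
    hits = sum(1 for c in children if c.nodeName.lower() in LINK_OR_IMAGE)
    return hits / len(children) <= threshold_a           # S416 -> S418 / S420


# Hypothetical usage on a small page fragment.
doc = minidom.parseString(
    '<body><div>This is a long paragraph of body text.</div>'
    '<div><a href="u1">t1</a><a href="u2">t2</a></div>'
    '<p><strong>Note</strong></p><script>var x = 1;</script></body>'
)
text_div, link_div, para, script = doc.documentElement.childNodes
```

Note that, under this flow, short text inside a link (such as "t1") is judged invalid even though the link element itself is judged valid, because the parent of that text is not a font-describing node.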
The general processing of the web page text extracting method according to the embodiment of the invention has been described above with reference to Fig. 1 to Fig. 4. The detailed process of text information extraction according to the embodiment of the invention is described below with reference to Fig. 5.
As shown in Fig. 5, the text information extraction process according to this embodiment starts at step S500, and initialization is performed in steps S502 and S504.
First, processing starts from the root node of the web page tree in step S502. In step S504, the root node is recorded as the current node to be processed, and a node stack and a flag stack are initialized. The node stack holds valid nodes, and the flag stack holds, for each node in the node stack, a flag indicating whether all child nodes of that node have been visited.
Then, after step S504 finishes initialization process, judge in step S506 whether present node is whether sky or node stack are empty.If present node and node stack are sky, show then that tree structure has been finished by traversal and node stack in also processed the finishing of node, process finishes at step S508.
If it is not empty determining present node and node stack in step S506, then showing still has the node that need handle, and process proceeds to step S510, judges in step S510 whether present node is empty.If present node is not empty, judge in step S512 then whether present node is effective node.Can utilize above-described node availability determination methods to judge whether present node is effective node.
If the judged result in step S512 is a present node is effective node, then in step S516, present node is put in the node stack, put in the labeled slots 0, and first child of record present node is new present node.Wherein, the inwhole accessed mistakes of child of 0 expression present node.Then, process turns back to step S510, so that next node is handled.
If step S512 finds that the current node is an invalid node, then in step S514 the current node is recorded as the most recently visited node and the current node is set to empty; the process then returns to step S510. In other words, once a node is judged invalid it is set to empty, so that neither the node nor any of its children is processed further.
On the other hand, if step S510 finds that the current node is empty, the process proceeds to step S518, which judges whether the node stack is empty. If the node stack is empty, the process ends at step S508.
If step S518 finds that the node stack is not empty, then in step S520 the top (first) elements of the flag stack and the node stack are read. The top element of the node stack is the valid node pushed most recently, and the top element of the flag stack is the flag indicating whether all children of that valid node have been visited. Step S522 then judges whether the top element of the node stack is a leaf node.
If step S522 finds that the top element of the node stack is a leaf node, then in step S534 the content of this leaf node, that is, its text information, is output. In step S536 the top element of the node stack is recorded as the most recently visited node, and the node stack and the flag stack are each popped. The process then returns to step S506.
If step S522 finds that the top element of the node stack is not a leaf node, the process proceeds to step S524, which judges whether the top element of the flag stack is 0. If the top element of the flag stack is not 0, all children of the top element of the node stack have been visited, and the process proceeds to step S526, in which the top element of the node stack is recorded as the most recently visited node and the node stack and the flag stack are each popped. The process then returns to step S506.
If step S524 finds that the top element of the flag stack is 0, then in step S528 the next sibling of the most recently visited node is recorded as the current node. Step S530 then judges whether this current node is empty. If it is empty, then in step S532 the flag stack is popped and 1 is pushed in its place, and the process returns to step S506. If step S530 finds that the current node is not empty, the process returns directly to step S506.
Following the above process, the nodes of the tree structure are traversed one by one and each is judged for validity, and the contents of leaf nodes are output in stack order, finally yielding the body text of the whole web page. When a node is handled, its validity is first judged according to the validity judgment process. If the node is valid, it is pushed onto the stack for further processing; otherwise the node and all of its descendant nodes are discarded. The contents of the leaf nodes that remain under valid nodes constitute the text information of the web page.
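The flow of Fig. 5 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Node` class, the `link` helper, the `extract_text` function and the sample tree are hypothetical, the validity test is passed in as a callback, and a leaf is assumed to be any node without children. Step numbers from the description appear as comments.

```python
from typing import Callable, List, Optional


class Node:
    """Hypothetical tree node with child and next-sibling pointers."""

    def __init__(self, name: str, text: str = "") -> None:
        self.name = name
        self.text = text
        self.children: List["Node"] = []
        self.next_sibling: Optional["Node"] = None


def link(parent: Node, *kids: Node) -> Node:
    """Attach kids to parent and wire up the next-sibling pointers."""
    parent.children = list(kids)
    for i, kid in enumerate(kids):
        kid.next_sibling = kids[i + 1] if i + 1 < len(kids) else None
    return parent


def extract_text(root: Node, is_valid: Callable[[Node], bool]) -> List[str]:
    output: List[str] = []
    node_stack: List[Node] = []       # valid nodes
    flag_stack: List[int] = []        # 0: children not all visited, 1: all visited
    current: Optional[Node] = root    # S502 / S504
    last_visited: Optional[Node] = None

    while current is not None or node_stack:            # S506
        if current is not None:                          # S510
            if is_valid(current):                        # S512
                node_stack.append(current)               # S516
                flag_stack.append(0)
                current = current.children[0] if current.children else None
            else:                                        # S514: prune this subtree
                last_visited = current
                current = None
            continue
        top = node_stack[-1]                             # S520
        if not top.children:                             # S522: leaf node
            output.append(top.text)                      # S534
            last_visited = top                           # S536
            node_stack.pop()
            flag_stack.pop()
        elif flag_stack[-1] != 0:                        # S524: all children visited
            last_visited = top                           # S526
            node_stack.pop()
            flag_stack.pop()
        else:
            current = last_visited.next_sibling          # S528
            if current is None:                          # S530
                flag_stack[-1] = 1                       # S532: pop the flag, push 1
    return output


# Hypothetical tree: one text-bearing block and one block judged invalid.
tree = link(
    Node("html"),
    link(Node("head"), link(Node("title"), Node("#text", "Example!"))),
    link(
        Node("body"),
        link(Node("div1"), Node("#text", "Just a sample!")),
        link(Node("div2"), Node("a"), Node("img"), Node("#text", "ad")),
    ),
)
result = extract_text(tree, lambda n: n.name != "div2")
```

Here the lambda stands in for the validity judgment of Fig. 4; with it, the whole "div2" subtree is skipped without visiting its children.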
In other words, the above text information extraction process rests on the idea of using the validity of a web page node to decide whether its descendant nodes contain text information. Specifically, if a node is an invalid node, all of its descendant nodes are regarded as invalid nodes.
Although the text information extraction process according to an embodiment of the invention has been described with reference to Fig. 5, it will be apparent to those skilled in the art that the process can be implemented in a variety of designs.
For example, the text information extraction process shown in Fig. 5 uses post-order traversal to walk the DOM tree obtained in the web page representation step. Those skilled in the art will understand that other traversal orders can also be used to traverse the tree.
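As one possible alternative traversal, a recursive depth-first walk can apply the same pruning rule. The sketch below is an assumption about how such a variant might look, not a design taken from the description; the `Node` class, `collect_text` and the sample tree are hypothetical.

```python
from typing import Callable, List, Optional


class Node:
    """Hypothetical tree node holding only a name, text and children."""

    def __init__(self, name: str, text: str = "",
                 children: Optional[List["Node"]] = None) -> None:
        self.name = name
        self.text = text
        self.children = children or []


def collect_text(node: Node, is_valid: Callable[[Node], bool]) -> List[str]:
    """Depth-first walk that drops an invalid node together with its descendants."""
    if not is_valid(node):
        return []                      # the whole subtree is discarded
    if not node.children:
        return [node.text]             # valid leaf: contributes its text
    texts: List[str] = []
    for child in node.children:
        texts.extend(collect_text(child, is_valid))
    return texts


page = Node("html", children=[
    Node("head", children=[Node("title", children=[Node("#text", "Example!")])]),
    Node("body", children=[
        Node("div1", children=[Node("#text", "Just a sample!")]),
        Node("div2", children=[Node("a"), Node("img"), Node("#text", "ad")]),
    ]),
])
texts = collect_text(page, lambda n: n.name != "div2")
```

The recursion relies on the call stack instead of the explicit node stack and flag stack, so it is shorter but may hit recursion limits on very deep trees.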
Taking the web pages represented by the HTML code given above as examples, the following describes in detail how the web page text extraction method according to an embodiment of the invention obtains the body text contained in a web page.
For example, for the tree structure of the web page shown in Fig. 2, processing starts from the "html" node. Because the number of link nodes and image nodes among the children of this node is 0, their proportion is below the predetermined threshold, so the "html" node is judged to be a valid node. On the same principle, non-leaf nodes such as "head", "title", "body" and "b" are judged to be valid nodes.
As for the leaf nodes, because they are all text nodes whose parent nodes are not style or script nodes, and assuming here that their content lengths are greater than the preset threshold, they are all judged to be valid nodes.
Finally, extracting the text information contained in the valid leaf nodes yields the following body text:
Example
Just a Sample!
As can be seen, all nodes of the tree structure in Fig. 2 are valid nodes. It should also be pointed out that, for the leaf node "a", if its content length is found to be less than the preset threshold, it can still be judged to be a valid leaf node because its parent node is a node used to adjust font display.
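The leaf-node rules applied in this example can be sketched as a single predicate. The function name, the threshold value of 3 characters and the set of tags treated as "font display" nodes are all assumptions made for illustration; the description does not fix them.

```python
SCRIPT_OR_STYLE = {"script", "style"}
# Assumption: tags regarded as "nodes used to adjust font display".
FONT_TAGS = {"b", "i", "u", "em", "strong", "font"}
MIN_TEXT_LENGTH = 3  # hypothetical value for the preset (second) threshold


def is_valid_leaf(node_type: str, text: str, parent_name: str,
                  min_length: int = MIN_TEXT_LENGTH) -> bool:
    """Judge a leaf node by its type, its parent and its content length."""
    if node_type != "text":
        return False                   # only text leaves can be valid
    if parent_name in SCRIPT_OR_STYLE:
        return False                   # text inside script or style is discarded
    if len(text) > min_length:
        return True                    # long enough to count as body text
    return parent_name in FONT_TAGS    # short text survives under a font node
```

With these assumed values, the short leaf "a" under a "b" parent is kept, while the same text directly under a "div" would be dropped.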
As a further example, for the tree structure of the web page shown in Fig. 3, processing likewise starts from the "html" node. Because the number of link nodes and image nodes among the children of this node is 0, their proportion is below the predetermined threshold, so the "html" node is a valid node. Other nodes are judged to be valid nodes in the same way.
Likewise, for the "body" node in the tree structure of the web page shown in Fig. 3, the number of link nodes and image nodes among the children of this node is also 0, and their proportion is below the predetermined threshold, so the "body" node is judged to be a valid node. The "div1" and "Just a sample" nodes are then handled in turn, and both can be judged to be valid nodes according to the processing flows shown in Fig. 4 and Fig. 5.
However, when the "div2" node is reached, the proportion of image nodes and link nodes among its children is 66.7%, which is greater than the preset threshold, so this node and its descendant nodes are all judged to be invalid nodes.
Finally, combining the text information contained in the valid leaf nodes yields the following body text of this web page:
Example!
Just a sample!
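The child-ratio test that rejects the "div2" node in the example above can be sketched as follows. The threshold of 0.5 is a hypothetical value, since the description only says it is preset, and the three-child layout is an assumption consistent with the stated 66.7% figure; the function names are illustrative.

```python
from typing import Sequence

FILTERED_TAGS = {"a", "img"}   # link nodes and image nodes
FIRST_THRESHOLD = 0.5          # hypothetical value for the first threshold


def filtered_ratio(child_names: Sequence[str]) -> float:
    """Share of children that are link or image nodes (0.0 if there are none)."""
    if not child_names:
        return 0.0
    hits = sum(name in FILTERED_TAGS for name in child_names)
    return hits / len(child_names)


def passes_child_ratio(child_names: Sequence[str],
                       threshold: float = FIRST_THRESHOLD) -> bool:
    """A node passes when the ratio is less than or equal to the threshold."""
    return filtered_ratio(child_names) <= threshold


div2_children = ["a", "img", "#text"]   # assumed: two of three children filtered
html_children = ["head", "body"]
```

Under these assumptions, `filtered_ratio(div2_children)` is two thirds, so "div2" fails the test, while "html" has a ratio of 0 and passes.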
The basic principle and processing procedure of the web page text extraction method according to an embodiment of the invention have been described in detail above with reference to Fig. 1 to Fig. 5. A web page text extraction device according to an embodiment of the invention is described below.
Fig. 6 is a schematic diagram of a web page text extraction device 600 according to an embodiment of the invention. As shown in Fig. 6, the web page text extraction device 600 of this embodiment comprises a web page preprocessing part 602, a web page representation part 604, a node validity judgment part 606 and a text information combination part 608.
The web page preprocessing part 602 preprocesses the web page whose body text is to be extracted, so that the web page conforms to Web standards. The web page representation part 604 expresses the web page preprocessed by the web page preprocessing part 602 as a tree structure. The node validity judgment part 606 judges whether each node in the tree structure obtained by the web page representation part 604 is a valid node. The text information combination part 608 combines the text information contained in the leaf nodes that the node validity judgment part 606 has judged to be valid nodes, thereby obtaining the body text of the web page.
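The cooperation of the four parts can be sketched as a small pipeline. This is an illustrative assumption, not the device itself: the tree is built with Python's standard `html.parser`, the preprocessing step is a placeholder, the validity judgment is reduced to the script/style rule and the child-ratio rule, and the threshold of 0.5 and all function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags without an end tag
FILTERED_TAGS = {"a", "img"}   # link and image nodes
RATIO_LIMIT = 0.5              # hypothetical first threshold


def new_node(name: str, text: str = "") -> Dict:
    return {"name": name, "text": text, "children": []}


class TreeBuilder(HTMLParser):
    """Stands in for representation part 604: markup to nested-dict tree."""

    def __init__(self) -> None:
        super().__init__()
        self.root = new_node("#document")
        self.open_nodes = [self.root]

    def handle_starttag(self, tag, attrs):
        node = new_node(tag)
        self.open_nodes[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.open_nodes.append(node)

    def handle_endtag(self, tag):
        # Only well-nested input is handled; preprocessing is assumed to ensure it.
        if len(self.open_nodes) > 1 and self.open_nodes[-1]["name"] == tag:
            self.open_nodes.pop()

    def handle_data(self, data):
        if data.strip():
            self.open_nodes[-1]["children"].append(new_node("#text", data.strip()))


def preprocess(html: str) -> str:
    """Placeholder for part 602; a real system would run a standardization tool."""
    return html


def represent(html: str) -> Dict:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def is_valid(node: Dict) -> bool:
    """Reduced form of part 606: script/style rule plus child-ratio rule."""
    if node["name"] in {"script", "style"}:
        return False
    kids = node["children"]
    if not kids:
        return node["name"] == "#text"
    filtered = sum(kid["name"] in FILTERED_TAGS for kid in kids)
    return filtered / len(kids) <= RATIO_LIMIT


def combine(node: Dict) -> List[str]:
    """Stands in for part 608: gather text of valid leaves, pruning invalid subtrees."""
    if not is_valid(node):
        return []
    if not node["children"]:
        return [node["text"]]
    return [text for kid in node["children"] for text in combine(kid)]


def extract_body_text(html: str) -> List[str]:
    return combine(represent(preprocess(html)))


page = ("<html><head><title>Example</title><script>var x = 1;</script></head>"
        "<body><div>Keep me</div>"
        "<div><a href='x'>ad</a><img src='y'>tail</div></body></html>")
body_text = extract_body_text(page)
```

Keeping the four stages as separate functions mirrors the division into parts 602 to 608, so that any one stage can be swapped out independently.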
Here, for a given node in the tree structure obtained by the web page representation part 604, if the proportion of nodes of a predefined type among the children of the given node is less than or equal to a first threshold, the node validity judgment part 606 judges the given node to be a valid node.
The web page preprocessing part 602 can be implemented with one of the web page standardization tools mentioned above, such as HTML Tidy or HTML-Kit, and is not described in detail here.
According to an embodiment of the invention, the web page representation part 604 can comprise a unit that defines a data structure for representing nodes, and a unit that uses the defined data structure to express the web page as a tree structure according to the structural information of the web page. The detailed processing of the web page representation part 604 can follow the description of step S104 of the web page text extraction method given above; a detailed description is omitted here to avoid unnecessary repetition.
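A node data structure of this kind might be sketched as below, using the six items listed in the claims (type, value, children, parent, next sibling, name). The class name `DomNode`, the field names and the `append_child` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class DomNode:
    node_type: str                      # type of the node, e.g. "element" or "text"
    name: str                           # name of the node, e.g. "body" or "#text"
    value: str = ""                     # value of the node, e.g. its text content
    children: List["DomNode"] = field(default_factory=list, repr=False)
    parent: Optional["DomNode"] = field(default=None, repr=False)
    next_sibling: Optional["DomNode"] = field(default=None, repr=False)

    def append_child(self, child: "DomNode") -> "DomNode":
        """Add a child and keep the parent and next-sibling links consistent."""
        child.parent = self
        if self.children:
            self.children[-1].next_sibling = child
        self.children.append(child)
        return child


body = DomNode("element", "body")
first = body.append_child(DomNode("text", "#text", "Just "))
bold = body.append_child(DomNode("element", "b"))
```

The parent and next-sibling links are what allow a stack-based traversal to move on to the following sibling without re-scanning the parent's child list.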
Likewise, the specific processing of the node validity judgment part 606 and the text information combination part 608 is similar to steps S106 and S108, respectively, of the web page text extraction method described above, and is therefore not described in further detail here for the sake of brevity.
In addition, the web page text extraction device according to an embodiment of the invention can also use the node validity judgment process described with reference to Fig. 4 and the text information extraction process described with reference to Fig. 5 to carry out the corresponding processing, thereby obtaining the body text of a web page efficiently.
As can be seen from the above description, with the web page text extraction method and device of the invention there is no need to write a wrapper or extraction rules for each specific data source. The method and device are therefore highly general and can readily extract text information from the web pages of various information sources.
In addition, with the web page text extraction method and device of the invention, a single traversal of the tree structure representing the web page is enough to obtain its text information, so processing is fast and well suited to today's rapidly expanding body of online information.
The basic principle of the invention has been described above in conjunction with specific embodiments. It should be pointed out, however, that those of ordinary skill in the art will understand that all or any of the steps or components of the method and device of the invention can be implemented in hardware, firmware, software or a combination thereof, in any computing device (including a processor, a storage medium and the like) or in a network of computing devices. Those of ordinary skill in the art can do so using their basic programming skills after reading the description of the invention, so a detailed description is omitted here.
Therefore, based on the above understanding, the object of the invention can also be achieved by running a program or a set of programs on any information processing device. The information processing device can be a well-known general-purpose device. The object of the invention can therefore also be achieved merely by providing a program product containing program code that implements the method or device. In other words, such a program product also constitutes the invention, and a storage medium storing such a program product also constitutes the invention. Obviously, the storage medium can be any known storage medium or any storage medium developed in the future, so there is no need to enumerate the various storage media here.
It should also be pointed out that, in the device and method of the invention, each component or step can obviously be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalents of the invention. Moreover, the steps of the above series of processes can naturally be carried out in chronological order as described, but they need not be carried out in that order. Some steps can be carried out in parallel or independently of one another.
In addition, the terms "include" and "comprise", and any other variants thereof, as used in this application are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. In the absence of further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that comprises the element.
Although specific embodiments of the invention have been described in detail, those of ordinary skill in the art should understand that the scope of protection of the invention is not limited to the specific details disclosed herein, and that various changes and equivalents are possible within the spirit and scope of the invention.

Claims (22)

1. A web page text extraction method, comprising:
expressing a web page as a tree structure;
judging whether each node in the tree structure is a valid node; and
combining the text information contained in leaf nodes that are valid nodes, to obtain the body text of the web page,
wherein, for a given node in the tree structure, if the proportion of nodes of a predefined type among the children of the given node is less than or equal to a first threshold, the given node is judged to be a valid node.
2. The method according to claim 1, further comprising, before the web page is expressed as a tree structure: processing the web page to obtain a web page conforming to Web standards.
3. The method according to claim 2, wherein whether each node in the tree structure is a valid node is judged by traversing the tree structure in post-order.
4. The method according to claim 3, wherein, if a given node in the tree structure is judged to be an invalid node, the given node and all of its descendant nodes are discarded.
5. The method according to claim 4, wherein expressing the web page as a tree structure comprises:
defining a data structure for representing nodes; and
using the data structure to express the web page as the tree structure according to the structural information of the web page.
6. The method according to claim 5, wherein the data structure comprises the following:
the type of the node;
the value of the node;
information for finding all children of the node;
information for tracing back to the parent node of the node;
information for finding the next sibling node of the node; and
the name of the node.
7. The method according to any one of claims 1 to 6, wherein the node of the predefined type is at least one of a link node and an image node.
8. The method according to any one of claims 1 to 6, wherein, for a given non-leaf node in the tree structure, if the given non-leaf node is a link node or an image node, the given non-leaf node is judged to be a valid node; and if the given non-leaf node is a script node or a style node, the given non-leaf node is judged to be an invalid node.
9. The method according to any one of claims 1 to 6, wherein, for a given leaf node in the tree structure, if the given leaf node is not a text node, or if the given leaf node is a text node but its parent node is a script node or a style node, the given leaf node is judged to be an invalid node.
10. The method according to any one of claims 1 to 6, wherein, for a given leaf node in the tree structure, if the given leaf node is a text node and its parent node is not a script node or a style node, the given leaf node is judged to be a valid node when its content length is greater than a second threshold.
11. The method according to claim 10, wherein, when the content length of the given leaf node is less than or equal to the second threshold, the given leaf node is judged to be a valid node if its parent node is a node used to adjust font display.
12. A web page text extraction device, comprising:
a web page representation part, configured to express a web page as a tree structure;
a node validity judgment part, configured to judge whether each node in the tree structure is a valid node; and
a text information combination part, configured to combine the text information contained in leaf nodes that are valid nodes, to obtain the body text of the web page,
wherein, for a given node in the tree structure, if the proportion of nodes of a predefined type among the children of the given node is less than or equal to a first threshold, the node validity judgment part judges the given node to be a valid node.
13. The device according to claim 12, further comprising:
a web page processing part, configured to process the web page that is to be expressed as a tree structure by the web page representation part, to obtain a web page conforming to Web standards.
14. The device according to claim 13, wherein the node validity judgment part judges whether each node in the tree structure is a valid node by traversing the tree structure in post-order.
15. The device according to claim 14, wherein, if the node validity judgment part judges a given node in the tree structure to be an invalid node, the given node and all of its descendant nodes are discarded.
16. The device according to claim 15, wherein the web page representation part comprises:
a unit that defines a data structure for representing nodes; and
a unit that uses the data structure to express the web page as the tree structure according to the structural information of the web page.
17. The device according to claim 16, wherein the data structure comprises the following:
the type of the node;
the value of the node;
information for finding all children of the node;
information for tracing back to the parent node of the node;
information for finding the next sibling node of the node; and
the name of the node.
18. The device according to any one of claims 12 to 17, wherein the node of the predefined type is at least one of a link node and an image node.
19. The device according to any one of claims 12 to 17, wherein, for a given non-leaf node in the tree structure, if the given non-leaf node is a link node or an image node, the node validity judgment part judges the given non-leaf node to be a valid node; and if the given non-leaf node is a script node or a style node, the node validity judgment part judges the given non-leaf node to be an invalid node.
20. The device according to any one of claims 12 to 17, wherein, for a given leaf node in the tree structure, if the given leaf node is not a text node, or if the given leaf node is a text node but its parent node is a script node or a style node, the node validity judgment part judges the given leaf node to be an invalid node.
21. The device according to any one of claims 12 to 17, wherein, for a given leaf node in the tree structure, if the given leaf node is a text node and its parent node is not a script node or a style node, the node validity judgment part judges the given leaf node to be a valid node when its content length is greater than a second threshold.
22. The device according to claim 21, wherein, when the content length of the given leaf node is less than or equal to the second threshold, the node validity judgment part judges the given leaf node to be a valid node if its parent node is a node used to adjust font display.
CN200910137364A 2009-04-24 2009-04-24 Web page text extracting method and device thereof Pending CN101872350A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910137364A CN101872350A (en) 2009-04-24 2009-04-24 Web page text extracting method and device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910137364A CN101872350A (en) 2009-04-24 2009-04-24 Web page text extracting method and device thereof

Publications (1)

Publication Number Publication Date
CN101872350A true CN101872350A (en) 2010-10-27

Family

ID=42997215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910137364A Pending CN101872350A (en) 2009-04-24 2009-04-24 Web page text extracting method and device thereof

Country Status (1)

Country Link
CN (1) CN101872350A (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102651002A (en) * 2011-02-28 2012-08-29 腾讯科技(深圳)有限公司 Webpage information extracting method and system
CN103136312A (en) * 2011-12-27 2013-06-05 北京麦克斯泰科技有限公司 Extracting method of contents of news webpage
CN103136312B (en) * 2011-12-27 2016-08-31 北京麦克斯泰科技有限公司 A kind of abstracting method of news web page content
CN102663041A (en) * 2012-03-28 2012-09-12 重庆大学 Automatic extraction method oriented to data of deep web pages
CN102663041B (en) * 2012-03-28 2014-01-01 重庆大学 Automatic extraction method oriented to data of deep web pages
CN102779172A (en) * 2012-06-25 2012-11-14 北京奇虎科技有限公司 Recognition system and recognition method of non-body text in webpage
CN102779172B (en) * 2012-06-25 2016-06-01 北京奇虎科技有限公司 The recognition system of non-body text and method in a kind of webpage
CN104123125A (en) * 2013-04-26 2014-10-29 腾讯科技(深圳)有限公司 Webpage resource acquisition method and device
US10110659B2 (en) 2013-04-26 2018-10-23 Tencent Technology (Shenzhen) Company Limited Method and apparatus for obtaining webpages
CN103761312B (en) * 2014-01-24 2017-02-08 福州大学 Information extraction system and method for multi-recording webpage
CN104376061A (en) * 2014-11-10 2015-02-25 武汉传神信息技术有限公司 Webpage text extracting method
CN106528068A (en) * 2015-09-15 2017-03-22 中国电信股份有限公司 Webpage content reconstruction method and system
CN106855859A (en) * 2015-12-08 2017-06-16 北京搜狗科技发展有限公司 A kind of webpage context extraction method and device
CN106855859B (en) * 2015-12-08 2020-11-10 北京搜狗科技发展有限公司 Webpage text extraction method and device
CN107463571A (en) * 2016-06-03 2017-12-12 北京京东尚科信息技术有限公司 Web color method
CN107463571B (en) * 2016-06-03 2020-03-31 北京京东尚科信息技术有限公司 Webpage duplicate elimination method and device and storage medium
CN106202579A (en) * 2016-08-26 2016-12-07 乐视控股(北京)有限公司 Web page text extraction process method and device, server, terminal
CN111339457A (en) * 2018-12-18 2020-06-26 富士通株式会社 Method and apparatus for extracting information from web page and storage medium
CN111339457B (en) * 2018-12-18 2023-09-08 富士通株式会社 Method and apparatus for extracting information from web page and storage medium
CN109710833A (en) * 2018-12-29 2019-05-03 上海蜜度信息技术有限公司 For determining the method and apparatus of content node
CN109710833B (en) * 2018-12-29 2021-07-16 上海蜜度信息技术有限公司 Method and apparatus for determining content node

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20101027