CN103761312B - Information extraction system and method for multi-recording webpage - Google Patents

Information extraction system and method for multi-recording webpage

Info

Publication number
CN103761312B
Authority
CN
China
Prior art keywords
node
record
posting field
number
document order
Prior art date
Application number
CN201410034376.4A
Other languages
Chinese (zh)
Other versions
CN103761312A (en)
Inventor
陈国龙
廖祥文
陈巧灵
杨定达
魏晶晶
Original Assignee
福州大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 福州大学
Priority to CN201410034376.4A
Publication of CN103761312A
Application granted
Publication of CN103761312B

Abstract

The invention relates to an information extraction system and method for multi-record webpages. The system comprises a webpage preprocessing module, a record region locating module, a record separator identification module and a record output module. The webpage preprocessing module converts an HTML webpage into an XHTML webpage, filters out the tags used only for rendering the display, and builds a document order tree from the nesting structure of the tags. The record region locating module receives the document order tree and locates the record region in it by horizontal hierarchical analysis. The record separator identification module finds the separators between records in the record region and stores them. The record output module outputs all text nodes in the record region in hierarchical order, emitting a separator line whenever a separator is encountered, to obtain the final extraction result. The system and method extract information efficiently and accurately from both traditional and new-style multi-record webpages, with high extraction speed, high accuracy, good generality and a wide range of application.

Description

An information extraction system and method for multi-record webpages

Technical field

The present invention relates to the field of information extraction, and in particular to an information extraction system and extraction method for multi-record webpages. It can be applied to traditional multi-record webpages (such as search engine results pages) and to new-style multi-record webpages (such as microblog record pages, forum post pages and product review pages), and is suitable for many different media and domains.

Background

In the prior art, many technical methods can be used for extraction from multi-record webpages. Traditional information extraction methods rely on hand-written rules, which can extract record information from a specific data source quickly and accurately. However, when the number of data sources grows to hundreds or thousands, writing rules by hand consumes a great deal of time and effort and cannot keep up with the rapid growth of information. In addition, the page template of a data source is not fixed: whenever the template is updated, the rules must be revised by hand, which causes a large maintenance cost. Other methods generate rules from a manually annotated training set, but because they also require human participation they are likewise unsuitable for extracting from massive, changeable multi-record webpages.

The prior art also includes automatic extraction methods for traditional multi-record webpages. In a traditional multi-record webpage, a CGI program on the server retrieves records from a database and generates the page dynamically from a prepared template. Because the template is fixed, the records are structurally very similar and highly regular. Automatic extraction methods can extract similar data records from a webpage automatically, based on the features of one webpage or one class of webpages. These techniques typically compute record structure similarity (Structure Similarity) and determine the record region from the computed similarity values.

The prior art further includes automatic extraction methods for new-style multi-record webpages. The body content of a new-style multi-record webpage is created by users themselves and is highly flexible. The records have a similar external structure, so the page still looks like a sequence of individual records, but their internal structure varies widely. Taking microblog records as an example, some microblogs are original posts containing only original content, while others are reposts that, in addition to their own content, embed the record being reposted. Automatic extraction methods can extract similar data records from a webpage automatically, based on the features of one webpage or one class of webpages. These techniques typically use domain knowledge, determining the record region from elements that appear in every record and are easy to identify.

However, new-style multi-record webpages have their own characteristics, which differ from those of traditional multi-record webpages. When an extraction method designed for traditional multi-record webpages computes structure similarity on a new-style webpage, the resulting values are generally low, so the record region cannot be identified correctly. Moreover, existing extraction methods for new-style multi-record webpages often focus on a single medium and do not extend well to others.

Existing multi-record webpage extraction methods do not fully take into account the structural features of new-style multi-record webpages and can only be applied to a particular medium. With the continuous generation of social media messages from microblogs, forums and similar services in recent years, new-style multi-record webpages now hold a large amount of data, from which information such as hot topics and opinion leaders needs to be discovered by data mining. This poses a challenge for multi-record information extraction: how to build a unified, effective information extraction system that meets the extraction needs of different media. There is therefore an urgent need for an efficient and accurate multi-record extraction method that can automatically locate the record region in a webpage, split the records within that region, and be used conveniently across different media and domains.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing an information extraction system and method for multi-record webpages. The system and method extract information efficiently and accurately from both traditional and new-style multi-record webpages, with fast extraction, high accuracy, strong generality and a wide range of application.

To achieve this object, the technical solution of the invention is an information extraction system for multi-record webpages, comprising: a webpage preprocessing module, which converts an HTML webpage into an XHTML webpage, filters out the tags used only for rendering the display, and then builds a document order tree from the nesting structure of the tags; a record region locating module, which receives the document order tree of the document to be extracted and locates the record region in the document order tree by horizontal hierarchical analysis; a record separator identification module, which uses a bidirectional search method to find the separators between records in the record region and stores them; and a record output module, which traverses all text nodes in the record region in hierarchical order and outputs them, emitting a separator line whenever a separator is encountered, to obtain the final extraction result.

Further, the webpage preprocessing module includes a SAX parser, which parses the XHTML webpage code in order to build the document order tree.

Further, the SAX parser includes four event handlers: the startDocument, endDocument, startElement and endElement event handlers. Each of the four event handlers contains a predefined series of operations, and the handlers are triggered and executed in turn according to the order in which tags are parsed.

Further, the record region locating module locates the record region in the document order tree by horizontal hierarchical analysis through the following steps. Step a1: Traverse the document order tree in preorder, counting and recording the number of identical child nodes of each node. Find the node with the most identical child nodes, subject to the condition that some text node in its subtree has more characters than the average text node of the document order tree, and define the subtree rooted at this node as candidate record region 1. Step a2: During the same traversal, determine the position of the text node that lies in the middle subtree of the document order tree and has the most characters. From that position, determine the unique path from the text node to the root of the document order tree, select the node on this path with the most identical child nodes, and define the subtree rooted at that node as candidate record region 2. Step a3: If the root node of candidate record region 1 is the same as the root node of candidate record region 2, the subtree they represent is the final record region. Otherwise, compute the quotient of the identical-child-node counts of the two subtree roots and introduce a threshold θ. If the quotient is greater than or equal to the threshold, the number of records is small and the text node with the most characters is needed to locate the record region precisely, so candidate record region 2 is selected as the final record region. If the quotient is less than the threshold, the number of records is large, and candidate record region 1 is selected as the final record region.

Further, the record separator identification module finds the separators between records in the record region through the following steps. Step b1: Define all child nodes of the record region root node as the record region node block, and use the bidirectional search method within this node block to find the non-overlapping repeated sub-block that covers the most nodes. The bidirectional search proceeds as follows: first determine the text node with the most characters in the record region, then expand in both directions from that text node to find the repeated sub-block. Step b2: Match the repeated sub-block against the record region node block, and insert a separator at the position of the first node of every match, thereby splitting the records in the record region.

The invention also provides an information extraction method for multi-record webpages, comprising the following steps. Step 1: The webpage preprocessing module converts the HTML webpage into an XHTML webpage, filters out the tags used only for rendering the display, and then builds a document order tree from the nesting structure of the tags. Step 2: The record region locating module receives the document order tree from the preprocessing module and locates the record region block in the document order tree by horizontal hierarchical analysis of the tree. Step 3: The record separator identification module receives the position of the record region block from the record region locating module, uses the bidirectional search method to find the separators between records in the record region block, and stores them. Step 4: The record output module traverses all text nodes in the record region in hierarchical order and outputs them, emitting a separator line whenever a separator is encountered, to obtain the final extraction result.

Further, in step 1, a SAX parser is used to parse the web document in order to build the document order tree.

Further, the SAX parser includes four event handlers: the startDocument, endDocument, startElement and endElement event handlers. Each of the four event handlers contains a predefined series of operations, and the handlers are triggered and executed in turn according to the order in which tags are parsed.

Further, in step 2, the record region is located in the document order tree by horizontal hierarchical analysis through the following steps. Step a1: Traverse the document order tree in preorder, counting and recording the number of identical child nodes of each node. Find the node with the most identical child nodes, subject to the condition that some text node in its subtree has more characters than the average text node of the document order tree, and define the subtree rooted at this node as candidate record region 1. Step a2: During the same traversal, determine the position of the text node that lies in the middle subtree of the document order tree and has the most characters. From that position, determine the unique path from the text node to the root of the document order tree, select the node on this path with the most identical child nodes, and define the subtree rooted at that node as candidate record region 2. Step a3: If the root node of candidate record region 1 is the same as the root node of candidate record region 2, the subtree they represent is the final record region. Otherwise, compute the quotient of the identical-child-node counts of the two subtree roots and introduce a threshold θ. If the quotient is greater than or equal to the threshold, the number of records is small and the text node with the most characters is needed to locate the record region precisely, so candidate record region 2 is selected as the final record region. If the quotient is less than the threshold, the number of records is large, and candidate record region 1 is selected as the final record region.

Further, in step 3, the separators between records are found in the record region through the following steps. Step b1: Define all child nodes of the record region root node as the record region node block, and use the bidirectional search method within this node block to find the non-overlapping repeated sub-block that covers the most nodes. The bidirectional search proceeds as follows: first determine the text node with the most characters in the record region, then expand in both directions from that text node to find the repeated sub-block. Step b2: Match the repeated sub-block against the record region node block, and insert a separator at the position of the first node of every match, thereby splitting the records in the record region.

Compared with the prior art, the beneficial effect of the invention is that it extracts information efficiently and accurately from traditional multi-record webpages (such as search engine results pages) and from new-style multi-record webpages (such as microblog record pages, forum post pages and product review pages), overcoming the poor applicability of existing multi-record extraction methods to new-style multi-record webpages. It offers fast extraction, high accuracy and high stability, is general and widely applicable, can be applied easily across different media and domains, and has strong practicality and broad application prospects.

Brief description of the drawings

The invention is further described below with reference to the drawings and a specific embodiment.

Fig. 1 is a schematic diagram of the system structure of an embodiment of the invention.

Fig. 2 is a schematic diagram of a record separation example in an embodiment of the invention.

Detailed description

As shown in Fig. 1, the information extraction system for multi-record webpages of the invention comprises:

(1) A webpage preprocessing module, which converts an HTML webpage into an XHTML webpage, filters out the tags used only for rendering the display (such as <script>), and then builds a document order tree from the nesting structure of the tags so as to record the parent-child relationships between tags. The webpage preprocessing module parses the webpage with the SAX mechanism and organizes its result as a document order tree.

(2) A record region locating module, which receives the document order tree of the document to be extracted and locates the record region in the document order tree by horizontal hierarchical analysis.

(3) A record separator identification module, which finds the separators between records in the record region and stores them.

(4) A record output module, which traverses all text nodes in the record region in hierarchical order and outputs them, emitting a separator line whenever a separator is encountered, to obtain the final extraction result.

The implementation of each module is described in detail below.

(1) Webpage preprocessing module

First, we describe how the SAX parser in the webpage preprocessing module parses a webpage, that is, how the HTML code of the webpage is converted into a document order tree.

At present there are two main mechanisms for processing webpage code: DOM parsers, which analyze the code by building a DOM tree, and SAX parsers, which analyze the code through defined events. To handle large documents, the invention parses webpages with a SAX parser.

The webpage preprocessing module includes a SAX parser, which parses the XHTML webpage code in order to build the document order tree. The SAX parser includes four event handlers: startDocument, endDocument, startElement and endElement. Each contains a predefined series of operations, and the four handlers are triggered and executed in turn according to the order in which tags are parsed. When the first tag of the document (for example <xml>) is parsed, the startDocument() event handler is triggered and the series of operations predefined in that handler is executed; once they finish, parsing continues. When a start tag such as <head> is parsed, the startElement() event handler is triggered, and when an end tag such as </head> is parsed, the endElement() event handler is triggered. Parsing proceeds in this way until the last tag, </xml>, is reached and the endDocument() event handler is triggered. After a single pass over the webpage, all parsing work on it has therefore been completed by the operations predefined in the event handlers.

To build a structure similar to a DOM tree under the SAX mechanism, the tag tree is constructed with a document order index: tags are numbered consecutively in the order in which they are traversed, and the parent-child relationships between the tags corresponding to those numbers are recorded, thereby building the document order tree.
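The numbering scheme above can be sketched with Python's standard `xml.sax` module. This is only an illustrative sketch, not the patent's implementation: the class name `OrderTreeBuilder`, the helper `build_order_tree`, the parallel-list representation and the choice of `script` and `style` as filtered tags are all assumptions made here, and the input is assumed to be well-formed XHTML already.

```python
import xml.sax


class OrderTreeBuilder(xml.sax.ContentHandler):
    """Numbers tags in parse order and records each tag's parent number."""

    SKIP = {"script", "style"}  # assumption: tags treated as display-only

    def startDocument(self):
        self.tags = []      # tags[i] is the name of the i-th tag in document order
        self.parent = []    # parent[i] is the number of its parent, -1 for the root
        self._open = []     # numbers of the tags that are currently open
        self._skipping = 0  # nesting depth inside a filtered tag
        self.finished = False

    def startElement(self, name, attrs):
        if self._skipping or name in self.SKIP:
            self._skipping += 1
            return
        self.tags.append(name)
        self.parent.append(self._open[-1] if self._open else -1)
        self._open.append(len(self.tags) - 1)

    def endElement(self, name):
        if self._skipping:
            self._skipping -= 1
        else:
            self._open.pop()

    def endDocument(self):
        self.finished = True


def build_order_tree(xhtml):
    """Parse an XHTML string and return (tags, parent) in document order."""
    handler = OrderTreeBuilder()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return handler.tags, handler.parent
```

Because only the stack of open tags is kept during parsing, the tree is built in a single pass without materializing a DOM, which is the reason the text gives for preferring SAX on large documents.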

(2) Record region locating module

Next, we describe how the record region locating module determines the record region block. The main idea is to perform horizontal hierarchical analysis on the document order tree: the nodes on the same level under the same parent node are treated as one large node block, and the problem of locating the record region is converted into finding the similar sub-blocks of a node block.

We propose the following two hypotheses:

Hypothesis 1: A node block containing more similar nodes is more likely to be the record region node block.

Hypothesis 2: A text node with more characters is more likely to be a text node inside the record region.

Based on these hypotheses, the document order tree is traversed in preorder, and the number of identical child nodes of each node is counted and recorded. The node with the most identical child nodes is taken as the root node of candidate record region 1, provided that some text node in its subtree has more characters than the average text node of the document order tree. This check is added to exclude interference from non-record node blocks such as webpage menu bars. A menu bar node block also contains many similar nodes, but its text nodes usually have few characters, so the character count can be used to tell the two apart.
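The selection of candidate record region 1 can be sketched as follows. The `Node` class, the `#text` tag for text nodes and the function names are hypothetical, and "identical child nodes" is read here as the largest number of element children sharing one tag name, which is an assumption about the patent's wording.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    text: str = ""  # set only on text nodes, which use the tag "#text"


def preorder(node):
    """Yield the node and then its descendants in preorder."""
    yield node
    for child in node.children:
        yield from preorder(child)


def same_child_count(node):
    """Largest number of element children that share one tag name."""
    counts = Counter(c.tag for c in node.children if c.tag != "#text")
    return max(counts.values(), default=0)


def text_lengths(node):
    return [len(n.text) for n in preorder(node) if n.tag == "#text"]


def candidate_region_1(root):
    """Node with the most identical children whose subtree holds a long text."""
    lengths = text_lengths(root)
    average = sum(lengths) / len(lengths) if lengths else 0.0
    best, best_count = None, 0
    for node in preorder(root):
        count = same_child_count(node)
        # The text-length check filters out menu-like blocks of short labels.
        if count > best_count and any(n > average for n in text_lengths(node)):
            best, best_count = node, count
    return best
```

In a page with a four-item menu of short labels and three records of long text, the menu has more identical children but fails the character check, so the record container is returned.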

While the tree is being traversed, the position of the text node with the most characters within the middle subtree of the document order tree is also determined. From that position, the unique path from the text node to the root of the document order tree is determined, the node with the most identical child nodes on this path is selected, and the subtree rooted at that node is candidate record region 2. The position of the text node with the most characters is checked in order to exclude interference from long text nodes outside the record region, such as the copyright notice at the bottom of a webpage. The record text to be extracted is usually located in the middle of the HTML code, which corresponds to the middle subtree of the DOM tree.
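A minimal sketch of how candidate record region 2 might be derived is given below. The text does not spell out how the "middle subtree" is tested, so that test is left as a hypothetical `in_middle` hook that receives the root-to-node path; the `Node` class and all function names are likewise assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    text: str = ""  # set only on text nodes, which use the tag "#text"


def same_child_count(node):
    """Largest number of element children that share one tag name."""
    counts = Counter(c.tag for c in node.children if c.tag != "#text")
    return max(counts.values(), default=0)


def candidate_region_2(root, in_middle=lambda path: True):
    """Ancestor of the longest accepted text node with the most identical children."""
    best_path, best_len = None, -1
    stack = [(root, [root])]  # each entry carries the path from the root
    while stack:
        node, path = stack.pop()
        if node.tag == "#text" and len(node.text) > best_len and in_middle(path):
            best_path, best_len = path, len(node.text)
        for child in reversed(node.children):
            stack.append((child, path + [child]))
    if best_path is None or len(best_path) < 2:
        return None
    # Only the ancestors on the unique path to the root are considered.
    return max(best_path[:-1], key=same_child_count)
```

With a long copyright notice at the end of the page, an accept-everything hook picks the wrong ancestor, while a hook that rejects the trailing subtree leads to the record container; this is the interference the position check is meant to exclude.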

If the root node of candidate record region 1 is the same as the root node of candidate record region 2, the subtree they represent is the final record region. Otherwise, the quotient of the identical-child-node counts of the two subtree roots is computed (the count for subtree 1 divided by the count for subtree 2) and a threshold θ is introduced. If the quotient is greater than or equal to the threshold, the number of records is small and the text node with the most characters is needed to locate the record region more precisely, so candidate record region 2 is selected as the final record region. If the quotient is less than the threshold, the number of records is large, and candidate record region 1 is the more reliable choice for the final record region. A threshold θ of 0.3 generally gives good extraction results on most multi-record webpages; θ can be adjusted according to the actual number of records to obtain the best extraction results.
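The decision rule can be written down directly. The function name and argument names below are hypothetical; the quotient direction (count for subtree 1 over count for subtree 2) and the default of 0.3 follow the text, while the zero-count guard is an assumption added so the sketch cannot divide by zero.

```python
def choose_final_region(count1, count2, same_root, theta=0.3):
    """Return 1 or 2: which candidate record region becomes the final one.

    count1 and count2 are the identical-child-node counts of the roots of
    candidate record regions 1 and 2.
    """
    if same_root:
        return 1  # both candidates denote the same subtree
    if count2 == 0:
        return 1  # assumption: no quotient can be formed, keep candidate 1
    quotient = count1 / count2
    return 2 if quotient >= theta else 1
```

For example, counts of 5 and 10 give a quotient of 0.5, which selects candidate 2, whereas counts of 2 and 10 give 0.2, which selects candidate 1.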

(3) Record separator identification module

Next, we describe how the record separator identification module identifies the separators between records. All child nodes of the record region root node are defined as the record region node block. In the first step, a suffix array is used within the record region node block to find the non-overlapping repeated sub-block that covers the most nodes. To simplify this search, the text node with the most characters in the record region can be determined first, and the repeated sub-block is then found by expanding in both directions from that text node. In the second step, the repeated sub-block is matched against the record region node block, and a separator is inserted at the position of the first node of every match, which splits the records in the record region.

(4) Record output module

Finally, we describe the record output module. It traverses all text nodes in the record region in hierarchical order and outputs them, emitting a separator line whenever a separator is encountered, to obtain the final extraction result.
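The output step might look like the sketch below. It assumes one particular reading of "hierarchical order": the children of the record region root are visited left to right, and the text nodes under each child are emitted breadth-first. The `Node` class, the `#text` tag, the set of separator positions and the dashed separator line are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    text: str = ""  # set only on text nodes, which use the tag "#text"


def output_records(region_children, separator_positions, line="-" * 20):
    """Emit text nodes of each child level by level, with separator lines."""
    out = []
    for index, child in enumerate(region_children):
        if index in separator_positions:
            out.append(line)  # a record starts at this child
        queue = deque([child])
        while queue:
            node = queue.popleft()
            if node.tag == "#text":
                out.append(node.text)
            queue.extend(node.children)
    return out
```

Text that sits directly under a record's top node is therefore printed before text nested inside its links or other child tags.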

The main innovation of the invention is to perform horizontal hierarchical analysis on the document order tree: the nodes on the same level under the same parent node are treated as one large node block, and the problem of finding similar subtrees is converted into finding the similar sub-blocks of a node block. In the document order tree shown in Fig. 1, the first layer has one node block, {<head>, <body>}; the second layer has one node block, {<p>, <ul>, <a>}; the third layer has one node block, {<li>, <li>, <li>, <li>, <li>}; and the fourth layer has five node blocks, from left to right {<a>, <T>, <a>}, {<T>, <a>, <a>}, {<T>, <T>}, {<T>} and {<T>}, where <T> denotes a text node. If the nodes inside the dashed box form the record region block, the root node of this block is called the record region root node (<ul> in the figure), and the node block formed by all child nodes of that root node is called the record region node block ({<li>, <li>, <li>, <li>, <li>} in the figure).
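The grouping into node blocks can be illustrated with a short sketch that rebuilds the layers listed for Fig. 1. The `Node` class and the function `node_blocks_by_layer` are hypothetical names, and text nodes are written with the tag `T` as in the figure description.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


def node_blocks_by_layer(root):
    """For each layer, list the node blocks (children grouped by parent)."""
    layers = []
    current = [root]
    while current:
        blocks = [[c.tag for c in n.children] for n in current if n.children]
        if blocks:
            layers.append(blocks)
        current = [c for n in current for c in n.children]
    return layers
```

Each block keeps siblings together in document order, so the record region node block is simply the block whose parent is the record region root node.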

Accordingly, the invention provides an information extraction method for multi-record webpages, comprising the following steps:

Step 1: The webpage preprocessing module converts the HTML webpage into an XHTML webpage, filters out the tags used only for rendering the display (such as <script>), and then builds a document order tree from the nesting structure of the tags.

Step 2: The record region locating module receives the document order tree from the preprocessing module and locates the record region block in the document order tree by horizontal hierarchical analysis of the tree.

Step 3: The record separator identification module receives the position of the record region block from the record region locating module, uses the bidirectional search method to find the separators between records in the record region block, and stores them.

Step 4: The record output module traverses all text nodes in the record region in hierarchical order and outputs them, emitting a separator line whenever a separator is encountered, to obtain the final extraction result.

In step 1, a SAX parser is used to parse the web document in order to build the document order tree. The SAX parser includes four event handlers: the startDocument, endDocument, startElement and endElement event handlers. Each of the four event handlers contains a predefined series of operations, and the handlers are triggered and executed in turn according to the order in which tags are parsed.

In step 2, the record region is located in the document order tree by horizontal hierarchical analysis through the following steps:

Step a1: Traverse the document order tree in preorder, counting and recording the number of identical child nodes of each node. Find the node with the most identical child nodes, subject to the condition that some text node in its subtree has more characters than the average text node of the document order tree, and define the subtree rooted at this node as candidate record region 1.

Step a2: During the same traversal, determine the position of the text node that lies in the middle subtree of the document order tree and has the most characters. From that position, determine the unique path from the text node to the root of the document order tree, select the node on this path with the most identical child nodes, and define the subtree rooted at that node as candidate record region 2.

Step a3: If the root node of candidate record region 1 is the same as the root node of candidate record region 2, the subtree they represent is the final record region. Otherwise, compute the quotient of the identical-child-node counts of the two subtree roots and introduce a threshold θ. If the quotient is greater than or equal to the threshold, the number of records is small and the text node with the most characters is needed to locate the record region precisely, so candidate record region 2 is selected as the final record region. If the quotient is less than the threshold, the number of records is large, and candidate record region 1 is selected as the final record region.

In step 3, the separators between records are found in the record region through the following steps:

Step b1: Define all child nodes of the record region root node as the record region node block, and use the bidirectional search method within this node block to find the non-overlapping repeated sub-block that covers the most nodes. The specific method is as follows:

Let X denote the node sequence of the record region node block. X contains mutually identical sub-blocks, each of which can in turn be expressed as a node sequence. The problem is then defined as: [formula not preserved in the source]

That is, the optimal sub-block is the one for which the sum of the tag counts of all sub-blocks identical to it, plus its own tag count, is largest. The sub-block obtained must also not contain a repeated sub-block: two node sequences within it must not be identical, where M is even. [formulas not preserved in the source]

Hypothesis 2 above is used to simplify the search for the optimal sub-block: first determine the text node with the most characters in the record region, then expand in both directions from this T node to find the repeated sub-block (the sub-block found contains this T node). Taking Fig. 2 as an example, the shaded P node is the text node with the most characters in the region. In the first round the sub-block has 1 tag and the objective value is 12. In the second round the sub-block has 2 tags; expanding to the left and to the right gives the repeated sub-blocks aP and PP respectively, and the value obtained is 8. Continuing in the same way, when the sub-block has 4 tags and is PPPa, the objective value is 16, which covers the most nodes, so the repeated sub-block is finally determined to be PPPa.
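One reading of the objective that is consistent with the values in the Fig. 2 example (12, 8 and 16) can be written as follows. The symbols are assumptions introduced here, not the patent's own notation: $b$ is a candidate sub-block that contains the longest text node, $|b|$ is its number of tags, and $N_X(b)$ is the number of non-overlapping occurrences of $b$ in $X$.

$$
b^{*} = \arg\max_{b} \; N_X(b) \cdot |b|
$$

Under this reading, the sub-block P gives $12 \cdot 1 = 12$, the sub-block PP gives $4 \cdot 2 = 8$, and the sub-block PPPa gives $4 \cdot 4 = 16$, matching the three values quoted in the example. The additional requirement that the chosen sub-block must not itself contain a repeated sub-block is stated only in words in the surviving text, so it is not formalized here.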

Step b2: Match the repeated sub-block against the record region node block, and insert a separator at the position of the first node of every match, thereby splitting the records in the record region. In Fig. 2, sep denotes an inserted record separator.
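Steps b1 and b2 can be sketched together as a brute-force search over windows that contain the pivot node. This is an illustration under stated assumptions rather than the patent's procedure: the score is taken to be the number of non-overlapping matches times the sub-block length, the "no repeated sub-block" constraint is read as "the two halves of an even-length sub-block must differ", and the names `match_positions`, `has_identical_halves` and `find_repeat_block` are hypothetical. The node block is given as a string of one-letter tags, as in the Fig. 2 example.

```python
def match_positions(seq, block):
    """Start positions of non-overlapping matches, scanning left to right."""
    positions, i, m = [], 0, len(block)
    while i + m <= len(seq):
        if seq[i:i + m] == block:
            positions.append(i)
            i += m
        else:
            i += 1
    return positions


def has_identical_halves(block):
    """True if an even-length block is one sequence repeated twice."""
    half, rest = divmod(len(block), 2)
    return rest == 0 and half > 0 and block[:half] == block[half:]


def find_repeat_block(seq, pivot):
    """Best-scoring sub-block among all windows that contain seq[pivot]."""
    best, best_score = None, 0
    for m in range(1, len(seq) + 1):
        # Windows of length m that still cover the pivot position.
        for start in range(max(0, pivot - m + 1), min(pivot, len(seq) - m) + 1):
            block = seq[start:start + m]
            if has_identical_halves(block):
                continue
            score = len(match_positions(seq, block)) * m
            if score > best_score:
                best, best_score = block, score
    return best, best_score
```

For the sequence `"PPPa" * 4` with the pivot at index 4, `find_repeat_block` returns `("PPPa", 16)`, and `match_positions` then gives `[0, 4, 8, 12]` as the places where a separator would be inserted. Because the halves check here is applied to every candidate, the intermediate sub-block PP is skipped rather than scored, which differs from the round-by-round narration in the text but not from its final result.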

The above is a preferred embodiment of the invention. Any change made according to the technical solution of the invention falls within the scope of protection of the invention, provided that the resulting function does not go beyond the scope of the technical solution.

Claims (8)

1. An information extraction system for multi-record web pages, characterized by comprising:
a web page preprocessing module, configured to convert an HTML web page into an XHTML web page, filter out the tags used in the web page to render display effects, and then build a document order tree according to the nesting structure of the tags;
a record area locating module, configured to receive the document order tree of the document to be extracted and to locate the position of the record area in said document order tree using a horizontal hierarchical analysis method;
a record separator identification module, configured to find the separators between records in said record area and store them; and
a record output module, configured to traverse and output all text nodes in the record area in hierarchical order, outputting a separator line whenever a separator is encountered, to obtain the final extraction result;
wherein said record area locating module locates the position of the record area in said document order tree using the horizontal hierarchical analysis method through the following steps:
Step a1: traverse the document order tree in preorder, count and record the number of identical child nodes of each node, find the node with the largest number of identical child nodes, and define the subtree rooted at that node as candidate record area 1; that node must satisfy the condition that the character count of some text node within its subtree is greater than the average character count of the text nodes of the document order tree;
Step a2: while traversing the document order tree, determine the position of the text node that lies in a middle subtree of the document order tree and has the most characters; once the position of that text node is determined, determine the unique path from that node to the root node of the document order tree, select the node on this path with the largest number of identical child nodes, and define the subtree rooted at that node as candidate record area 2;
Step a3: if the root node of candidate record area 1 is the same as the root node of candidate record area 2, the subtree represented by candidate record area 1 and candidate record area 2 is the final record area; otherwise, compute the quotient of the identical-child-node counts of the root nodes of the two subtrees and introduce a threshold θ. If the quotient is greater than or equal to the threshold, the number of records is small and the text node with the most characters is needed to locate the record area accurately, so candidate record area 2 is selected as the final record area; if the quotient is less than the threshold, the number of records is large, and candidate record area 1 is selected as the final record area.
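Steps a1 to a3 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several points the claim leaves open are filled in here as assumptions: "identical child nodes" is read as the largest group of element children sharing one tag name, the "middle subtree" condition of step a2 is dropped (the longest text node anywhere in the tree is used), the quotient is taken as the count of candidate 1 divided by the count of candidate 2, and the `Node` class, the function names, and the default `theta=2.0` are all hypothetical.

```python
from collections import Counter


class Node:
    """Hypothetical tree node; tag "#text" marks a text node."""

    def __init__(self, tag, text=""):
        self.tag, self.text = tag, text
        self.children, self.parent = [], None

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def preorder(node):
    yield node
    for child in node.children:
        yield from preorder(child)


def same_child_count(node):
    # Assumption: "identical children" = largest group of element children with one tag.
    tags = Counter(c.tag for c in node.children if c.tag != "#text")
    return max(tags.values(), default=0)


def locate_record_area(root, theta=2.0):
    nodes = list(preorder(root))
    texts = [n for n in nodes if n.tag == "#text"]
    if not texts:
        raise ValueError("the tree contains no text nodes")
    avg = sum(len(t.text) for t in texts) / len(texts)

    def has_long_text(n):
        return any(d.tag == "#text" and len(d.text) > avg for d in preorder(n))

    # Step a1: most identical children among nodes owning an above-average text node.
    qualified = [n for n in nodes if has_long_text(n)] or nodes
    cand1 = max(qualified, key=same_child_count)

    # Step a2 (simplified): walk from the longest text node up to the root.
    longest = max(texts, key=lambda t: len(t.text))
    path, cur = [], longest.parent
    while cur is not None:
        path.append(cur)
        cur = cur.parent
    cand2 = max(path, key=same_child_count) if path else longest

    # Step a3: identical roots settle it; otherwise compare the quotient with theta.
    if cand1 is cand2:
        return cand1
    c1, c2 = same_child_count(cand1), same_child_count(cand2)
    quotient = c1 / c2 if c2 else float("inf")  # assumed direction: cand1 / cand2
    return cand2 if quotient >= theta else cand1
```

With this reading, a page whose navigation list has more repeated children than its content block resolves to the content block only when the quotient reaches θ, so the choice of θ decides how strongly the longest text node is trusted.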
2. The information extraction system for multi-record web pages according to claim 1, characterized in that said web page preprocessing module comprises a SAX parser, configured to parse the XHTML web page code so as to build the document order tree.
3. The information extraction system for multi-record web pages according to claim 2, characterized in that said SAX parser comprises four event handlers, namely a startDocument event handler, an endDocument event handler, a startElement event handler, and an endElement event handler; each of the four event handlers contains a predefined series of operations, and the four event handlers are triggered and executed in turn according to the order in which the tags are parsed.
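As an illustration of how the four SAX events can drive tree construction, the following is a minimal sketch using Python's standard `xml.sax` module. It is an assumption-laden example rather than the claimed parser: the `Node` class, the `TreeBuilder` name, and the contents of `RENDER_TAGS` are hypothetical, and a fifth callback, `characters`, is added because text content has to be captured somewhere. The input is assumed to be well-formed XHTML.

```python
import xml.sax

# Hypothetical set of display-only tags to filter out; the patent does not list them.
RENDER_TAGS = {"b", "i", "u", "font", "span", "strong", "em"}


class Node:
    def __init__(self, tag, text=""):
        self.tag, self.text, self.children = tag, text, []


class TreeBuilder(xml.sax.ContentHandler):
    def startDocument(self):
        self.root = None
        self.stack = []  # open elements, innermost last

    def endDocument(self):
        self.stack.clear()

    def startElement(self, name, attrs):
        if name in RENDER_TAGS:
            return  # skip the tag itself; its text is attached to the enclosing node
        node = Node(name)
        if self.stack:
            self.stack[-1].children.append(node)
        else:
            self.root = node
        self.stack.append(node)

    def endElement(self, name):
        if name not in RENDER_TAGS:
            self.stack.pop()

    def characters(self, content):
        if not self.stack:
            return
        siblings = self.stack[-1].children
        if siblings and siblings[-1].tag == "#text":
            siblings[-1].text += content  # SAX may deliver text in several chunks
        elif content.strip():
            siblings.append(Node("#text", content))


def build_document_order_tree(xhtml: bytes) -> Node:
    handler = TreeBuilder()
    xml.sax.parseString(xhtml, handler)
    return handler.root
```

Because the parser fires the callbacks in document order, the stack always mirrors the current nesting depth, which is what lets the tree follow the nesting structure of the tags.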
4. The information extraction system for multi-record web pages according to claim 1, characterized in that said record separator identification module finds the separators between records in said record area through the following steps:
Step b1: define all child nodes of the root node of the record area as the record area node block, and use a bidirectional search method to find, within the record area node block, the non-overlapping repeated sub-block that covers the most nodes; the bidirectional search method finds the repeated sub-block as follows: first determine the text node with the most characters in the record area, then, taking that text node as the base, expand in both directions to find the repeated sub-block;
Step b2: match the repeated sub-block against the record area node block, and insert a separator at the position of the first node of every match, so as to split the records in the record area.
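Steps b1 and b2 can be sketched as follows. This is one possible reading, not the patent's own algorithm: the record area node block is reduced to a list of child tag names, `text_lens[i]` is assumed to hold the longest text length found under child `i`, every window that contains the anchor child is tried as a candidate sub-block, and ties in coverage are broken in favor of the shorter sub-block. The function names are hypothetical.

```python
def count_matches(seq, pattern):
    """Start indexes of non-overlapping occurrences of pattern, scanning left to right."""
    starts, i, k = [], 0, len(pattern)
    while i + k <= len(seq):
        if seq[i:i + k] == pattern:
            starts.append(i)
            i += k
        else:
            i += 1
    return starts


def find_separators(tags, text_lens):
    """Return the child indexes before which a record separator is inserted."""
    if not tags:
        return []
    # Step b1: anchor on the child that holds the longest text.
    anchor = max(range(len(tags)), key=lambda i: text_lens[i])
    best_key, best_starts = (0, 0), [0]  # fallback: the whole area is one record
    # Expand leftwards (start) and rightwards (end) around the anchor.
    for start in range(anchor, -1, -1):
        for end in range(anchor + 1, len(tags) + 1):
            pattern = tags[start:end]
            starts = count_matches(tags, pattern)
            if len(starts) < 2:
                continue  # a sub-block must repeat to count as a record template
            key = (len(starts) * len(pattern), -len(pattern))
            if key > best_key:
                best_key, best_starts = key, starts
    # Step b2: a separator goes before the first node of every match.
    return best_starts
```

For a block whose children alternate between a heading and a paragraph, the sub-block made of one heading followed by one paragraph covers every node, so a separator lands before each heading.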
5. An information extraction method for multi-record web pages, characterized by comprising the following steps:
Step 1: converting, by a web page preprocessing module, an HTML web page into an XHTML web page, filtering out the tags used in the web page to render display effects, and then building a document order tree according to the nesting structure of the tags;
Step 2: receiving, by a record area locating module, the document order tree from the preprocessing module, and locating the position of the record area block in said document order tree through horizontal hierarchical analysis of said document order tree;
Step 3: receiving, by a record separator identification module, the position of the record area block from the record area locating module, and finding the separators between records in the record area block and storing them;
Step 4: traversing and outputting, by a record output module, all text nodes in the record area in hierarchical order, outputting a separator line whenever a separator is encountered, to obtain the final extraction result;
wherein, in step 2, locating the position of the record area in said document order tree through horizontal hierarchical analysis comprises the following steps:
Step a1: traverse the document order tree in preorder, count and record the number of identical child nodes of each node, find the node with the largest number of identical child nodes, and define the subtree rooted at that node as candidate record area 1; that node must satisfy the condition that the character count of some text node within its subtree is greater than the average character count of the text nodes of the document order tree;
Step a2: while traversing the document order tree, determine the position of the text node that lies in a middle subtree of the document order tree and has the most characters; once the position of that text node is determined, determine the unique path from that node to the root node of the document order tree, select the node on this path with the largest number of identical child nodes, and define the subtree rooted at that node as candidate record area 2;
Step a3: if the root node of candidate record area 1 is the same as the root node of candidate record area 2, the subtree represented by candidate record area 1 and candidate record area 2 is the final record area; otherwise, compute the quotient of the identical-child-node counts of the root nodes of the two subtrees and introduce a threshold θ. If the quotient is greater than or equal to the threshold, the number of records is small and the text node with the most characters is needed to locate the record area accurately, so candidate record area 2 is selected as the final record area; if the quotient is less than the threshold, the number of records is large, and candidate record area 1 is selected as the final record area.
6. The information extraction method for multi-record web pages according to claim 5, characterized in that, in step 1, a SAX parser is used to parse the web page document so as to build the document order tree.
7. The information extraction method for multi-record web pages according to claim 6, characterized in that said SAX parser comprises four event handlers, namely a startDocument event handler, an endDocument event handler, a startElement event handler, and an endElement event handler; each of the four event handlers contains a predefined series of operations, and the four event handlers are triggered and executed in turn according to the order in which the tags are parsed.
8. The information extraction method for multi-record web pages according to claim 5, characterized in that, in step 3, finding the separators between records in the record area comprises the following steps:
Step b1: define all child nodes of the root node of the record area as the record area node block, and use a bidirectional search method to find, within the record area node block, the non-overlapping repeated sub-block that covers the most nodes; the bidirectional search method finds the repeated sub-block as follows: first determine the text node with the most characters in the record area, then, taking that node as the base, expand in both directions to find the repeated sub-block;
Step b2: match the repeated sub-block against the record area node block, and insert a separator at the position of the first node of every match, so as to split the records in the record area.
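The output step (step 4 of claim 5) can be sketched in a few lines of Python. This is only one plausible reading of "hierarchical order": the text nodes under each child of the record area root are emitted in level order (breadth-first), and a separator line is written before every child whose index was marked as the start of a record. The `Node` class, the function names, and the separator string are hypothetical.

```python
from collections import deque


class Node:
    """Hypothetical tree node; tag "#text" marks a text node."""

    def __init__(self, tag, text="", children=None):
        self.tag, self.text = tag, text
        self.children = children or []


def level_order_texts(node):
    """Collect the text of all text nodes under node, level by level."""
    queue, out = deque([node]), []
    while queue:
        cur = queue.popleft()
        if cur.tag == "#text":
            out.append(cur.text.strip())
        queue.extend(cur.children)
    return out


def output_records(area_root, separator_indexes, line="-" * 20):
    """Return the extraction result as a list of output lines."""
    seps = set(separator_indexes)
    lines = []
    for i, child in enumerate(area_root.children):
        if i in seps:
            lines.append(line)  # a separator marks the start of a new record
        lines.extend(level_order_texts(child))
    return lines
```

Returning a list of lines rather than printing keeps the sketch easy to check; joining the list with newlines gives the final extraction result as text.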
CN201410034376.4A 2014-01-24 2014-01-24 Information extraction system and method for multi-recording webpage CN103761312B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410034376.4A CN103761312B (en) 2014-01-24 2014-01-24 Information extraction system and method for multi-recording webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410034376.4A CN103761312B (en) 2014-01-24 2014-01-24 Information extraction system and method for multi-recording webpage

Publications (2)

Publication Number Publication Date
CN103761312A CN103761312A (en) 2014-04-30
CN103761312B true CN103761312B (en) 2017-02-08

Family

ID=50528549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410034376.4A CN103761312B (en) 2014-01-24 2014-01-24 Information extraction system and method for multi-recording webpage

Country Status (1)

Country Link
CN (1) CN103761312B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104217025B (en) * 2014-09-28 2018-04-13 福州大学 Entry extraction system and method for multi-record web pages
CN106294722B (en) * 2016-08-09 2019-11-22 上海资誉网络科技有限公司 Web page content extraction method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1786965A (en) * 2005-12-21 2006-06-14 北大方正集团有限公司 Method for acquiring news web page text information
CN101192234A (en) * 2007-06-07 2008-06-04 腾讯科技(深圳)有限公司 Searching system and method based on web page extraction
CN101515287A (en) * 2009-03-24 2009-08-26 崔志明;方 巍;赵朋朋 Automatic generating method of wrapper of complex page
CN101872350A (en) * 2009-04-24 2010-10-27 富士通株式会社 Web page text extracting method and device thereof
CN101582075A (en) * 2009-06-24 2009-11-18 大连海事大学 Web information extraction system
CN101727486A (en) * 2009-12-04 2010-06-09 中国人民解放军信息工程大学 Web forum information extraction system
EP2482206A1 (en) * 2011-01-27 2012-08-01 Samsung Electronics Co., Ltd. Method and apparatus for web browsing of handheld device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
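An automatic structured data extraction method for Web forums; Guan Mian et al.; Journal of Shandong University (Natural Science) (《山东大学学报(理学版)》); 2010-05-16; pp. 42-47 *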

Also Published As

Publication number Publication date
CN103761312A (en) 2014-04-30

Legal Events

Date Code Title Description
PB01 Publication
C06 Publication
SE01 Entry into force of request for substantive examination
C10 Entry into substantive examination
GR01 Patent grant
C14 Grant of patent or utility model