CN104217025B

CN104217025B - Entry extraction system and method for multi-record web pages

Info

Publication number: CN104217025B
Application number: CN201410503955.9A
Authority: CN
Inventors: 陈国龙; 廖祥文; 陈巧灵
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2014-09-28
Filing date: 2014-09-28
Publication date: 2018-04-13
Anticipated expiration: 2034-09-28
Also published as: CN104217025A

Abstract

The present invention relates to an entry extraction system and method for multi-record web pages. The system includes: a record tree alignment module, which receives the extracted record-region subtrees and aligns them using tag information and semantic information to obtain a hypertree, so that nodes with the same semantics correspond to the same node of the hypertree; a record content extraction module, which determines the position of the record content within a record using two metrics, text density and text density sum; an entry output module, which outputs all entries in the record region and their semantic annotations by a preorder traversal of the tree nodes; and a feedback framework, which, after the entries are extracted, uses the extraction result to check whether the record region was located correctly. If it was not, the record region is relocated and the entry extraction result is revised; if it was, the extraction flow ends directly. The system and method extract entries from the record regions of multi-record web pages efficiently and accurately, with high extraction speed, high accuracy, strong generality and wide applicability.

Description

Entry extraction system and method for multi-record web pages

Technical field

The present invention relates to the technical field of information extraction, and more particularly to an entry extraction system and method for multi-record web pages. It can be applied to web pages that contain multiple similar records, such as microblogs, forums and product reviews, and is suitable for a variety of media and domains.

Background

With the arrival in Web2.0 epoch, the webpages that record have become the important data source of data mining more.More record nets Page refers to there is posting field more than one in webpage, is made of the similar record of multiple structures, and each record can be solid comprising some Fixed entry.The pages of much more traditional record webpages often go out record by the cgi programs of server from database retrieval, then with The template dynamic generation made.Due to there is fixed template, so the structural similarity of every record is high, it is very regular. New-type more record webpages are due to there is user to participate in web page contents creation, the free and open property of its content format and answering for page structure Polygamy so that extract entry therein so that machine processing becomes very difficult.

In the prior art, many technical methods can be used for more record web page extractions.But mainly to data record Extracted, do not extract data item from data record further.And the extraction of data item can more meet data integration, number According to the demand of the data mining tasks such as analysis.The method that traditional data item abstracting method uses redaction rule, this method can Entry information is quickly and easily extracted from specific data source.But when data source scale increases hundreds and thousands of a, Again by manual compiling rule, it can take a substantial amount of time and energy, can not meet the process demand of the very fast expansion of present information. On the other hand, the web page template of each data source is not unalterable, once Page Template updates, it is necessary to manually repaiies again Change rule, cause huge maintenance cost.More also by manually marking training set come the method for create-rule, since it is desired that It is artificial to participate in being also not suitable for extracting the changeable more record webpages of magnanimity.

In the prior art, the entry abstracting method there is some for more record webpages.These methods are mainly closed Note in the extraction of the specific entry of particular intermediaries, such as the comment content of review pages；The authors' name of the model page, issuing time, Model content, without extracting other entries.And other entries also have its application value, particularly to domain knowledge Deeply excavating needs more comprehensive entry information.Such as to identify comment spam, it is necessary to using in review record commodity marking, Serviceability marking, commentator's information etc. are commented on, only extracting comment content is that cannot meet the needs of comment spam identification, is lacked A kind of entry abstracting method general for more record webpages.

In addition existing most of entry abstracting methods are all to use two stage method, i.e., after record is extracted, then carry out The extraction of entry.The advantages of which is to go deep into layer by layer, and Stepwise Refinement, record identification can substantially reduce entry and extract hardly possible Degree, shortcoming are that the extraction mistake of record can seriously affect the extraction of entry, cause the accumulation of mistake, while when extraction records Due to lacking the semantic information of entry, the extraction effect of record can be influenced.Another way is that entry uniformly extracts mode, It is carried out at the same time record extraction and entry extracts, will both regards the annotation process to tree node as.The advantages of which It is to be carried out at the same time to be conducive to efficiently using for both information.The semantic information of entry will be helpful to record and extract, and record at the same time Extract and will be helpful to improve the accuracy that entry extracts.Shortcoming is that the mask method in text needs training pattern, and required Characteristic set is to lead domain-dependent, it is necessary to manually mark training set, and present mass data is automatically taken out there is an urgent need to a kind of Take method.Existing work is not yet realized carries out entry extraction with non-supervisory, unified approach.

With continuous the producing of the medium message of the social activity such as microblogging, forum in recent years, the webpages that record have possessed largely more Data resource, and need to find the information such as much-talked-about topic therein, leader of opinion by data mining technology, this is just to record Item information extraction technique proposes a challenge：How a unification effective information extraction system is built to meet different media Information extraction need.Therefore, there is an urgent need to there is a kind of entry abstracting method of efficiently and accurately, this method should be able to take out automatically The entry of posting field is taken, and carries out the semantic alignment of entry, while can easily be made in different media, different field With.

Summary of the invention

The object of the present invention is to provide an entry extraction system and method for multi-record web pages that extracts entries from the record regions of such pages efficiently and accurately, with high extraction speed, high accuracy, strong generality and wide applicability.

To achieve the above object, the technical solution of the present invention is an entry extraction system for multi-record web pages, comprising:

a record tree alignment module, for receiving the extracted record-region subtrees and aligning the trees using tag information and semantic information to obtain a hypertree, so that nodes with the same semantics correspond to the same node of the hypertree;

a record content extraction module, for determining the position of the record content within a record using two metrics, text density and text density sum;

an entry output module, for outputting all entries in the record region and their semantic annotations by a preorder traversal of the tree nodes;

a feedback framework, for using the extraction result, after the entries are extracted, to check whether the record region was located correctly; if not, the record region is relocated and the entry extraction result is revised, until the extraction result is correct or no new record region can be located, in which case the flow ends abnormally; if so, the extraction flow ends directly.

Further, the record tree alignment module aligns the subtrees using the tags of the internal nodes of the DOM tree and the text semantics of its leaf nodes.

Further, the workflow of the record content extraction module comprises the following steps:

Step a1: align the subtrees to obtain a hypertree T_s, and select the set US of nodes in T_s that have no semantic annotation;

Step a2: compute the text density of each node in US, and from it the text density sum of each subtree;

Step a3: determine, according to the text density sum, the minimal set of subtrees that contains the record content.

Further, the workflow of the feedback framework comprises the following steps:

Step b1: extract the records;

Step b2: extract the entries and check whether every record has a time entry and an author entry; if so, extraction succeeds; if not, and only a few records fail the check, remove those records; if most records fail the check, redetermine the record region;

Step b3: repeat steps b1 and b2 until the extracted records satisfy the condition, or terminate when no new record region can be selected.

The present invention also provides an entry extraction method for multi-record web pages, comprising the following steps:

Step 1: the record tree alignment module receives the extracted record-region subtrees and aligns the trees using tag information and semantic information to obtain a hypertree, so that nodes with the same semantics correspond to the same node of the hypertree;

Step 2: the record content extraction module determines the position of the record content within a record using two metrics, text density and text density sum;

Step 3: the entry output module outputs all entries in the record region and their semantic annotations by a preorder traversal of the tree nodes;

Step 4: the feedback framework uses the extraction result, after the entries are extracted, to check whether the record region was located correctly; if not, the record region is relocated and the entry extraction result is revised, until the extraction result is correct or no new record region can be located, in which case the flow ends abnormally; if so, the extraction flow ends directly.

Further, in step 1, the record tree alignment module aligns the subtrees using the tags of the internal nodes of the DOM tree and the text semantics of its leaf nodes.

Further, in step 2, the workflow of the record content extraction module comprises steps a1 to a3 as described above.

Further, in step 4, the workflow of the feedback framework comprises steps b1 to b3 as described above.

Compared with the prior art, the present invention extracts entries efficiently and accurately from multi-record web pages (such as microblog pages, forum post pages and product review pages) and overcomes the error accumulation and lack of automation of existing extraction methods. Extraction is fast, accurate and stable, and the approach is general and widely applicable: it can be applied readily across different media and domains, and so has strong practicality and broad application prospects.

Brief description of the drawings

Fig. 1 is the system structure diagram of the embodiment of the present invention.

Fig. 2 is a schematic diagram of a record content extraction example in the embodiment of the present invention.

Fig. 3 is a schematic workflow diagram of the feedback framework in the embodiment of the present invention.

Embodiment

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 1, the entry extraction system for multi-record web pages of the present invention includes:

（1）A record tree alignment module, for receiving the extracted record-region subtrees and aligning the trees using tag information and semantic information to obtain a hypertree, so that nodes with the same semantics correspond to the same node of the hypertree. The record tree alignment module aligns the subtrees using the tags of the internal nodes of the DOM tree and the text semantics of its leaf nodes.

（2）A record content extraction module, which determines the position of the record content within a record using two metrics, text density and text density sum.

（3）An entry output module, for outputting all entries in the record region and their semantic annotations by a preorder traversal of the tree nodes.

（4）A feedback framework, for using the extraction result, after the entries are extracted, to check whether the record region was located correctly; if not, the record region is relocated and the entry extraction result is revised, until the extraction result is correct or no new record region can be located, in which case the flow ends abnormally; if so, the extraction flow ends directly.

The implementation of each module is described in detail separately below.

（1）Record tree alignment module

First, how the record tree alignment module performs tree alignment is described, that is, how nodes with the same semantics are made to correspond to the same node of the hypertree.

The existing approach traverses the DOM tree in postorder and matches subtrees using tree edit distance. During matching it only requires the tags to be identical and ignores the entry values, but HTML tags are designed only for laying out page information and lack semantic expressiveness. When a subtree can match several subtrees, the one that occurs first is chosen. This alignment is simple but not very accurate. To improve accuracy, the present invention annotates entries with obvious semantics (time, author and so on) and aligns nodes carrying the same semantic annotation, which avoids such errors. When a subtree can match several subtrees, the one with the same semantics is chosen; if no subtree with the same semantics is found, the one that occurs first is chosen. If a node has several semantic annotations, it is aligned with another node as long as one of its semantics matches that node.
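The tie-breaking rule above can be illustrated with a minimal Python sketch. It covers only the choice among several tag-compatible candidates, not the tree edit distance matching itself. The `Node` class, the `semantics` set and the function name `pick_match` are hypothetical names introduced for illustration and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node taking part in alignment."""
    tag: str
    semantics: set = field(default_factory=set)  # e.g. {"time"} or {"author"}


def pick_match(node, candidates):
    """Choose which candidate node `node` should be aligned with.

    Candidates must share the tag. Among them, one that shares at least
    one semantic annotation with `node` is preferred; otherwise the
    first tag-compatible candidate is used.
    """
    same_tag = [c for c in candidates if c.tag == node.tag]
    if not same_tag:
        return None
    for candidate in same_tag:
        if node.semantics & candidate.semantics:  # any shared semantic
            return candidate
    return same_tag[0]  # fall back to the first occurrence
```

With two `span` candidates annotated as author and time, a `span` annotated as time is aligned with the second candidate, while an unannotated `span` falls back to the first.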

（2）Record content extraction module

Next, how the record content extraction module determines the record content is described. In the DOM tree, short-text entries such as author and time usually correspond to a single leaf node, whereas the record content corresponds to a more complex subtree rather than a simple leaf node, so it is harder to extract than other entries. Because the record content is long text compared with the other entries, it is identified by the text density metric. The method uses a three-step strategy: first, the subtrees are aligned as described above to obtain a hypertree Ts, and the set US of nodes in Ts without semantic annotation is selected; then the text density of the nodes in US is computed, and from it the text density sum of each subtree; finally, the minimal set of subtrees containing the record content is determined from the text density sum. Each step is described below.

1. Subtree alignment

The record subtrees are aligned with the alignment method described in the previous section. All subtrees become isomorphic through the insertion of nodes, so that nodes with the same semantics in each tree are aligned to one node in Ts. The semantic annotation of Ts follows a voting principle: when more than half of the records label a node with a given semantic, the node is labeled with that semantic. At the same time, the character count of each text leaf node of the hypertree Ts is annotated. The annotation rule defines C_i, the number of characters of the i-th leaf node of the hypertree Ts, in terms of C_ij, the number of characters of node i in the j-th subtree.
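The voting principle can be sketched in a few lines of Python. The sketch assumes, for simplicity, that each record contributes at most one semantic label (or `None`) for a given aligned node; the function name `vote_label` is hypothetical.

```python
from collections import Counter


def vote_label(labels):
    """Return the semantic label for one hypertree node.

    `labels` holds one entry per record: the label that record assigns
    to the aligned node, or None if the record leaves it unlabeled.
    A label wins only if more than half of all records agree on it.
    """
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return None
    label, votes = counts.most_common(1)[0]
    return label if votes > len(labels) / 2 else None
```

For example, two votes for time out of three records label the node as time, whereas a label with no strict majority leaves the node unannotated, so it stays in the candidate set for record content.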

2. Text density and text density sum

The present invention uses the text density and text density sum formulas to determine, among the unannotated subtree nodes, the minimal set of subtrees containing the record content. First, the text density (formula 2) and the text density sum (formula 3) of all remaining nodes are computed.

（1）

（2）
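The two formulas are not legible in this text, so the following Python sketch rests on an assumption rather than on the patent's own definitions: text density is taken as the number of characters in a subtree divided by the number of tag nodes in it, and the text density sum of a node is taken as the sum of the text densities of its direct children. This is a common way of defining the two metrics and is used here only to show how they can be computed in one bottom-up pass. The `Node` class and `annotate` function are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical hypertree node without semantic annotation."""
    tag: str
    chars: int = 0                      # characters of text held directly
    children: list = field(default_factory=list)
    density: float = 0.0                # filled in by annotate()
    density_sum: float = 0.0            # filled in by annotate()


def annotate(node):
    """Fill density fields bottom-up; return (chars, tags) of the subtree.

    ASSUMED definitions (not taken from the patent text):
      density     = characters in subtree / tag nodes in subtree
      density_sum = sum of the densities of the direct children
    """
    total_chars, total_tags = node.chars, 1  # count the node itself as a tag
    for child in node.children:
        child_chars, child_tags = annotate(child)
        total_chars += child_chars
        total_tags += child_tags
    node.density = total_chars / total_tags
    node.density_sum = sum(child.density for child in node.children)
    return total_chars, total_tags
```

Under these assumed definitions, a container holding a 100-character paragraph and a 20-character span has a density of 40 and a density sum of 120, so long text blocks stand out through the density sum of their parent.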

3. Determining the minimal subtree set containing the record content

Find the node with the largest text density sum, DensitySummax, and add the subtree rooted at that node to the result set. On the path from DensitySummax to the record root node, find the node with the smallest text density sum; that value serves as the threshold. Then traverse all remaining nodes: if a node's text density exceeds the threshold, find the node with the largest text density sum among its children, and the subtree rooted at that child is a record content block. In many cases the record content is split into several text blocks, so the same operation is applied to every node whose text density exceeds the threshold, and together the results form the complete set of record content subtrees. Fig. 2 shows a hypertree Ts after alignment, which is used as an example of determining the record content subtree set. Shaded nodes are nodes without semantic annotation; T denotes a text node, and the number in brackets is its character count. The number beside each other tag node is that node's text density, and the number in brackets is the text density sum of the subtree under it. The <div> node with a text density sum of 230 has the largest density sum, so the subtree rooted at it is added to the result set. The smallest density sum on the path from that node to the record root is 151, so the threshold is set to 151. Among the remaining nodes only the <div> node with a text density of 167 exceeds the threshold; since it has no other child nodes, the subtree rooted at it is added to the result set directly. This yields the minimal subtree set, circled in the figure.
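The selection procedure can be sketched as follows, assuming that every node already carries its text density and text density sum. The `Node` class, the function names and the toy tree in the usage note are hypothetical. Treating the nodes on the path to the root as already handled is an interpretation of "remaining nodes", not something the text states explicitly.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based hashing so nodes can be kept in sets
class Node:
    name: str
    density: float = 0.0
    density_sum: float = 0.0
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def __post_init__(self):
        for child in self.children:
            child.parent = self


def walk(node):
    """Yield the nodes of a subtree in preorder."""
    yield node
    for child in node.children:
        yield from walk(child)


def select_content_subtrees(root):
    """Return (roots of the selected content subtrees, threshold)."""
    nodes = list(walk(root))
    best = max(nodes, key=lambda n: n.density_sum)

    # Threshold: smallest density sum on the path from `best` to the root.
    path, node = [], best
    while node is not None:
        path.append(node)
        node = node.parent
    threshold = min(n.density_sum for n in path)

    result = [best]
    handled = set(walk(best)) | set(path)  # assumption: path nodes are skipped
    for node in nodes:
        if node in handled or node.density <= threshold:
            continue
        # Take the child with the largest density sum, or the node itself
        # when it has no child nodes.
        if node.children:
            pick = max(node.children, key=lambda c: c.density_sum)
        else:
            pick = node
        result.append(pick)
        handled.update(walk(pick))
    return result, threshold
```

A toy tree that borrows the numbers 230, 151 and 167 from the example (the remaining values are invented) reproduces the described outcome: the subtree with density sum 230 is selected first, the threshold becomes 151, and the childless node with density 167 is added as a second content block.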

（3）Record output module

Finally, the record output module is described. It outputs all text nodes in the record region by traversing them in hierarchical order, emits a separator line whenever it encounters a separator, and thereby produces the final extraction result.
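A minimal sketch of the output step is given below. It follows the preorder traversal named for the entry output module and represents the tree as nested dictionaries. The dictionary keys (`label`, `text`, `separator`, `children`), the output format and the `unlabeled` placeholder are assumptions made for illustration only.

```python
def emit(node, lines=None):
    """Collect output lines for a region by preorder traversal.

    Each node is a dict. A node with a true "separator" value produces
    a separator line; a node with "text" produces "label: text", using
    "unlabeled" when no semantic annotation is present.
    """
    if lines is None:
        lines = []
    if node.get("separator"):
        lines.append("-" * 20)
    elif "text" in node:
        lines.append(f"{node.get('label', 'unlabeled')}: {node['text']}")
    for child in node.get("children", []):
        emit(child, lines)
    return lines
```

Applied to a region with two records and a separator node between them, the function yields the labeled entries of the first record, one separator line, and then the entries of the second record.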

（4）Feedback framework

Most existing entry extraction methods use a two-stage approach: records are extracted first and entries are extracted afterwards. The advantage is stepwise refinement, since record identification greatly reduces the difficulty of entry extraction. The disadvantage is that errors in record extraction seriously affect entry extraction and accumulate, while record extraction itself suffers from lacking the semantic information of the entries. The unified approach performs record extraction and entry extraction at the same time and treats both as a labeling process over tree nodes. Its advantage is that the two kinds of information can be used together: the semantic information of the entries helps record extraction, and record extraction helps improve the accuracy of entry extraction. Its disadvantage is that the labeling method needs a trained model whose feature set is domain-dependent, so training sets must be annotated by hand, whereas today's massive data urgently calls for an automatic extraction method.

The present invention proposes a feedback framework on top of the two-stage approach. It revises the record extraction on the basis of the entry extraction result and thereby improves the final extraction quality.

Assumption: every record on a UGC web page should contain one or more time entries and author entries.

If the extracted records do not contain a time entry and an author entry as this assumption requires, record region location and record extraction are performed again, until the records satisfy the assumption or the flow ends abnormally. The flow is as follows:

1. Extract the records.

2. Extract the entries and check whether every record has a time entry and an author entry.

If so, extraction succeeds and the flow ends.

If not, and only one or two records fail the check, those records are treated as advertisement records or the like and removed.

If the vast majority of records fail the check, the region block rooted at the node with the second-longest text, or at the node with the second-largest number of records, is taken as the new record region.

3. Repeat steps 1 and 2 until the extracted records satisfy the condition, or terminate when no new record region can be selected (all text nodes have been traversed, or the number of records is less than 3; a multi-record web page usually has at least 3 records).

The flow is fully automatic: whether a record satisfies the assumption is judged from the automatic semantic annotation results, without manual intervention. It can also correct record extraction errors well and avoid error accumulation.
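The feedback loop can be sketched as below. The sketch assumes that candidate record regions are supplied already ordered by preference and that a caller-provided function extracts records as dictionaries. The names `feedback_extract`, `extract_records` and `noise_ratio` are hypothetical. The text only distinguishes "one or two" failing records from "the vast majority", so the `noise_ratio` cut-off between the two cases is an assumption, as is skipping (rather than stopping at) a region with fewer than three records.

```python
def feedback_extract(candidate_regions, extract_records,
                     min_records=3, noise_ratio=0.5):
    """Try candidate regions in order until the records pass the check.

    Returns (region, accepted_records), or None when no candidate
    region yields acceptable records (abnormal end).
    """
    for region in candidate_regions:
        records = extract_records(region)
        if len(records) < min_records:
            continue  # too few records to count as a multi-record region
        valid = [r for r in records if r.get("time") and r.get("author")]
        failed = len(records) - len(valid)
        # All records pass, or only a few fail: drop the failures
        # (for example advertisement records) and accept the region.
        if failed == 0 or failed < noise_ratio * len(records):
            return region, valid
        # Most records fail: fall through and try the next region.
    return None
```

In use, a first region whose records carry neither a time nor an author entry is rejected, and the loop moves on to the next candidate; a region with one advertisement record among four is accepted with that record removed.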

The main innovations of the present invention are the following three points:

1. The present invention is the first to consider both the internal node tag values and the leaf node text semantics during tree alignment. Combining the two avoids some obvious tree alignment errors, such as the case where the internal node tag values are identical but the leaf node semantics differ.

2. The present invention is the first to use text density and text density sum to determine the record content within a record. In the DOM tree, short-text entries such as author and time usually correspond to a single leaf node, whereas the record content corresponds to a more complex subtree rather than a simple leaf node, so it is harder to extract than other entries. Because the record content is long text compared with the other entries, it can be identified by the text density metric.

3. The present invention is the first to propose the feedback framework. This flow avoids the error accumulation of two-stage extraction and also removes the need of unified extraction for manually annotated corpora and its domain dependence, so that data records and entries are extracted efficiently and accurately.

Correspondingly, the present invention provides an entry extraction method for multi-record web pages. Steps 1 to 3 are carried out by the record tree alignment module, the record content extraction module and the entry output module as described above, and are followed by:

Step 4: the feedback framework uses the extraction result, after the entries are extracted, to check whether the record region was located correctly; if not, the record region is relocated and the entry extraction result is revised, until the extraction result is correct or no new record region can be located, in which case the flow ends abnormally; if so, the extraction flow ends directly.

In step 1, the record tree alignment aligns the subtrees using the tags of the internal nodes of the DOM tree and the text semantics of its leaf nodes.

In step 2, the workflow of the record content extraction module comprises steps a1 to a3 as described above.

In step 4, the feedback framework is fully automatic; as shown in Fig. 3, its workflow comprises the following steps:

Step b1: extract the records;

Step b2: extract the entries and check whether every record has a time entry and an author entry; if so, extraction succeeds; if not, and only a few records fail the check, those records are treated as advertisement records or the like and removed; if most records fail the check, redetermine the record region;

Step b3: repeat steps b1 and b2 until the extracted records satisfy the condition, or terminate when no new record region can be selected.

The above are preferred embodiments of the present invention. Any change made according to the technical solution of the present invention whose resulting function does not go beyond the scope of the technical solution falls within the protection scope of the present invention.

Claims

1. An entry extraction system for multi-record web pages, characterized by comprising:

a record tree alignment module, for receiving the extracted record-region subtrees and aligning the trees using tag information and semantic information to obtain a hypertree, so that nodes with the same semantics correspond to the same node of the hypertree;

a record content extraction module, for determining the position of the record content within a record using two metrics, text density and text density sum;

an entry output module, for outputting all entries in the record region and their semantic annotations by a preorder traversal of the tree nodes;

a feedback framework, for using the extraction result, after the entries are extracted, to check whether the record region was located correctly; if not, the record region is relocated and the entry extraction result is revised, until the extraction result is correct or no new record region can be located, in which case the flow ends abnormally; if so, the extraction flow ends directly; the workflow of the feedback framework comprises the following steps:

Step b1: extract the records;

Step b2: extract the entries and check whether every record has a time entry and an author entry; if so, extraction succeeds; if not, and only a few records fail the check, remove those records; if most records fail the check, redetermine the record region;

Step b3: repeat steps b1 and b2 until the extracted records satisfy the condition, or terminate when no new record region can be selected.
2. The entry extraction system for multi-record web pages according to claim 1, characterized in that the record tree alignment module aligns the subtrees using the tags of the internal nodes of the DOM tree and the text semantics of its leaf nodes.
3. The entry extraction system for multi-record web pages according to claim 1, characterized in that the workflow of the record content extraction module comprises the following steps:

Step a1: align the subtrees to obtain a hypertree T_s, and select the set US of nodes in T_s that have no semantic annotation;

Step a2: compute the text density of each node in US, and from it the text density sum of each subtree;

Step a3: determine, according to the text density sum, the minimal set of subtrees that contains the record content.
4. An entry extraction method for multi-record web pages, characterized by comprising the following steps:

Step 1: the record tree alignment module receives the extracted record-region subtrees and aligns the trees using tag information and semantic information to obtain a hypertree, so that nodes with the same semantics correspond to the same node of the hypertree;

Step 2: the record content extraction module determines the position of the record content within a record using two metrics, text density and text density sum;

Step 3: the entry output module outputs all entries in the record region and their semantic annotations by a preorder traversal of the tree nodes;

Step 4: the feedback framework uses the extraction result, after the entries are extracted, to check whether the record region was located correctly; if not, the record region is relocated and the entry extraction result is revised, until the extraction result is correct or no new record region can be located, in which case the flow ends abnormally; if so, the extraction flow ends directly; the workflow of the feedback framework comprises the following steps:

Step b1: extract the records;

Step b2: extract the entries and check whether every record has a time entry and an author entry; if so, extraction succeeds; if not, and only a few records fail the check, remove those records; if most records fail the check, redetermine the record region;

Step b3: repeat steps b1 and b2 until the extracted records satisfy the condition, or terminate when no new record region can be selected.
5. The entry extraction method for multi-record web pages according to claim 4, characterized in that, in step 1, the record tree alignment module aligns the subtrees using the tags of the internal nodes of the DOM tree and the text semantics of its leaf nodes.
6. The entry extraction method for multi-record web pages according to claim 4, characterized in that, in step 2, the workflow of the record content extraction module comprises the following steps:

Step a1: align the subtrees to obtain a hypertree T_s, and select the set US of nodes in T_s that have no semantic annotation;

Step a2: compute the text density of each node in US, and from it the text density sum of each subtree;

Step a3: determine, according to the text density sum, the minimal set of subtrees that contains the record content.