CN104217025B - Entry extraction system and method for multi-record web pages - Google Patents

Entry extraction system and method for multi-record web pages

Info

Publication number
CN104217025B
Authority
CN
China
Prior art keywords
record
entry
posting field
node
extracted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410503955.9A
Other languages
Chinese (zh)
Other versions
CN104217025A (en
Inventor
陈国龙
廖祥文
陈巧灵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN201410503955.9A
Publication of CN104217025A
Application granted
Publication of CN104217025B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an entry extraction system and method for multi-record web pages. The system comprises: a record tree alignment module, which receives the extracted record-region subtrees and aligns them using tag information and semantic information to obtain a super tree, so that nodes with the same semantics correspond to the same node of the super tree; a record content extraction module, which locates the record content within each record using two metrics, text density and text density sum; an entry output module, which outputs all entries in the record region together with their semantic labels in preorder traversal of the tree nodes; and a feedback framework, which, after entries have been extracted, uses the extraction result to check whether the record region was located correctly: if not, the record region is relocated and the entry extraction result is revised accordingly; if so, the extraction flow ends directly. The system and method extract entries from the record regions of multi-record web pages efficiently and accurately, with high extraction speed, high accuracy, strong generality, and a wide range of application.

Description

Entry extraction system and method for multi-record web pages
Technical field
The present invention relates to the technical field of information extraction, and more particularly to an entry extraction system and method for multi-record web pages. It can be applied to web pages that contain multiple similar records, such as microblogs, forums, and product reviews, and is suitable for a variety of media and fields.
Background art
With the arrival of the Web 2.0 era, multi-record web pages have become an important data source for data mining. A multi-record web page is one that contains at least one record region made up of multiple structurally similar records, each of which may contain several fixed entries. Traditional multi-record pages are typically generated dynamically: a CGI program on the server retrieves the records from a database and renders them with a prepared template. Because the template is fixed, the records are highly similar in structure and very regular. In newer multi-record web pages, users take part in creating the content; the free and open content format and the complexity of the page structure make it very difficult to extract the entries for machine processing.
In the prior art, many techniques can be used to extract from multi-record web pages, but they mainly extract data records and do not go on to extract data items from those records. Extracting data items better serves data mining tasks such as data integration and data analysis. Traditional data item extraction relies on hand-written rules, which can extract entry information from a specific data source quickly and conveniently. When the number of data sources grows to hundreds or thousands, however, writing rules by hand consumes a great deal of time and effort and cannot keep up with the rapid growth of information. Moreover, the page template of a data source is not fixed: whenever the template is updated, the rules must be revised by hand, which incurs a large maintenance cost. Some methods generate rules from a manually labeled training set, but because they also need human involvement they are likewise unsuited to extracting from massive numbers of varied multi-record web pages.
The prior art also includes some entry extraction methods for multi-record web pages. These methods focus mainly on extracting specific entries from specific media, such as the comment content of review pages, or the author name, posting time, and post content of forum post pages, and do not extract other entries. Yet the other entries also have application value; in-depth mining of domain knowledge in particular requires more complete entry information. Identifying spam reviews, for example, requires the product rating, the helpfulness rating, and the reviewer information in each review record; the comment content alone is not sufficient. A general entry extraction method for multi-record web pages is lacking.
Furthermore, most existing entry extraction methods are two-stage: records are extracted first, and entries are then extracted from them. The advantage is stepwise refinement, since identifying records first greatly reduces the difficulty of entry extraction. The disadvantage is that record extraction errors seriously affect entry extraction, so errors accumulate; in addition, record extraction is performed without the semantic information of the entries, which impairs its quality. The alternative is unified extraction, which performs record extraction and entry extraction at the same time and treats both as a labeling process over tree nodes. Its advantage is that the two kinds of information are used together: the semantic information of entries helps record extraction, and record extraction helps improve the accuracy of entry extraction. Its disadvantage is that the labeling method requires a trained model, the required feature set is domain-dependent, and a training set must be labeled manually, whereas today's massive data urgently needs an automatic extraction method. Existing work has not yet achieved entry extraction with an unsupervised, unified method.
As social media such as microblogs and forums have produced a steady stream of messages in recent years, multi-record web pages have accumulated a large volume of data, from which information such as hot topics and opinion leaders must be discovered through data mining. This poses a challenge for entry extraction technology: how to build a unified, effective information extraction system that meets the extraction needs of different media. There is therefore an urgent need for an efficient and accurate entry extraction method that automatically extracts the entries of a record region, aligns them semantically, and can be used conveniently across different media and different fields.
Summary of the invention
The object of the present invention is to provide an entry extraction system and method for multi-record web pages that extract entries from the record regions of multi-record web pages efficiently and accurately, with high extraction speed, high accuracy, strong generality, and a wide range of application.
To achieve the above object, the technical solution of the present invention is an entry extraction system for multi-record web pages, comprising:
a record tree alignment module, for receiving the extracted record-region subtrees and aligning them using tag information and semantic information to obtain a super tree, so that nodes with the same semantics correspond to the same node of the super tree;
a record content extraction module, which locates the record content within each record using two metrics, text density and text density sum;
an entry output module, for outputting all entries in the record region together with their semantic labels in preorder traversal of the tree nodes;
a feedback framework, which, after entries have been extracted, uses the extraction result to check whether the record region was located correctly; if not, the record region is relocated and the entry extraction result is revised accordingly, until the extraction result is correct or the flow ends abnormally because no new record region can be located; if so, the extraction flow ends directly.
Further, the record tree alignment module aligns the subtrees using the tags of the internal nodes of the DOM tree and the text semantics of its leaf nodes.
Further, the workflow of the record content extraction module comprises the following steps:
Step a1: Align the subtrees to obtain a super tree Ts, and filter out the set US of nodes in Ts that have no semantic label;
Step a2: Calculate the text density of the nodes in US, and from it the text density sum of each subtree;
Step a3: Determine, according to the text density sum, the minimum subtree set containing the record content.
Further, the workflow of the feedback framework comprises the following steps:
Step b1: Extract records;
Step b2: Extract entries and judge whether every record has a time entry and an author entry. If so, extraction succeeds. If not: when only a few records fail the check, remove those records; when most records fail the check, re-determine the record region;
Step b3: Repeat steps b1 and b2 until the extracted records satisfy the condition, or until no new record region can be selected.
The present invention also provides an entry extraction method for multi-record web pages, comprising the following steps:
Step 1: The record tree alignment module receives the extracted record-region subtrees and aligns them using tag information and semantic information to obtain a super tree, so that nodes with the same semantics correspond to the same node of the super tree;
Step 2: The record content extraction module locates the record content within each record using two metrics, text density and text density sum;
Step 3: The entry output module outputs all entries in the record region together with their semantic labels in preorder traversal of the tree nodes;
Step 4: The feedback framework, after entries have been extracted, uses the extraction result to check whether the record region was located correctly; if not, the record region is relocated and the entry extraction result is revised accordingly, until the extraction result is correct or the flow ends abnormally because no new record region can be located; if so, the extraction flow ends directly.
Further, in step 1, the record tree alignment module aligns the subtrees using the tags of the internal nodes of the DOM tree and the text semantics of its leaf nodes.
Further, in step 2, the workflow of the record content extraction module comprises the following steps:
Step a1: Align the subtrees to obtain a super tree Ts, and filter out the set US of nodes in Ts that have no semantic label;
Step a2: Calculate the text density of the nodes in US, and from it the text density sum of each subtree;
Step a3: Determine, according to the text density sum, the minimum subtree set containing the record content.
Further, in step 4, the workflow of the feedback framework comprises the following steps:
Step b1: Extract records;
Step b2: Extract entries and judge whether every record has a time entry and an author entry. If so, extraction succeeds. If not: when only a few records fail the check, remove those records; when most records fail the check, re-determine the record region;
Step b3: Repeat steps b1 and b2 until the extracted records satisfy the condition, or until no new record region can be selected.
Compared with the prior art, the beneficial effect of the present invention is that it extracts entries from multi-record web pages (such as microblog pages, forum post pages, and product review pages) efficiently and accurately, and overcomes the error accumulation and lack of automation of existing extraction methods. It is fast, accurate, and stable, has strong generality and a wide range of application, can be applied conveniently across different media and fields, and has strong practicality and broad application prospects.
Brief description of the drawings
Fig. 1 is a system structure diagram of an embodiment of the present invention.
Fig. 2 is a schematic diagram of a record content extraction example in an embodiment of the present invention.
Fig. 3 is a workflow diagram of the feedback framework in an embodiment of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
The entry extraction system for multi-record web pages of the present invention, as shown in Fig. 1, comprises:
(1) A record tree alignment module, for receiving the extracted record-region subtrees and aligning them using tag information and semantic information to obtain a super tree, so that nodes with the same semantics correspond to the same node of the super tree. The record tree alignment module aligns the subtrees using the tags of the internal nodes of the DOM tree and the text semantics of its leaf nodes.
(2) A record content extraction module, which locates the record content within each record using two metrics, text density and text density sum.
(3) An entry output module, for outputting all entries in the record region together with their semantic labels in preorder traversal of the tree nodes.
(4) A feedback framework, which, after entries have been extracted, uses the extraction result to check whether the record region was located correctly; if not, the record region is relocated and the entry extraction result is revised accordingly, until the extraction result is correct or the flow ends abnormally because no new record region can be located; if so, the extraction flow ends directly.
The implementation of each module is described in detail below.
(1) Record tree alignment module
First, we describe how the record tree alignment module performs tree alignment, that is, how nodes with the same semantics are made to correspond to the same node of the super tree.
The existing alignment approach traverses the DOM tree in postorder and matches subtrees using tree edit distance. During matching, only the tags are required to agree and entry values are not considered; but HTML tags are designed only for laying out web page information and lack semantic expressiveness. When a subtree can match several subtrees, the one that occurs first is chosen. This alignment is simple, but its accuracy is not high. To improve accuracy, the present invention labels entries that have obvious semantics (time, author, etc.) and aligns nodes carrying the same semantic label, which avoids such errors. When a subtree can match several subtrees, the subtree with the same semantics is chosen; if no subtree with the same semantics is matched, the subtree that occurs first is chosen. If a node has several semantic labels, it is aligned with another node as long as one of its semantics matches that node.
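The tie-breaking rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Node` class, the `pick_match` function, and the reduction of "can match" to "has the same tag" are assumptions made for the example, and the tree-edit-distance matching itself is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for the root of a DOM subtree."""
    tag: str                                      # HTML tag of the subtree root
    semantics: set = field(default_factory=set)   # e.g. {"time"}; empty if unlabeled


def pick_match(node: Node, candidates: list) -> Optional[Node]:
    """Choose which candidate subtree `node` should be aligned with."""
    # Assumption: a candidate is structurally matchable when its tag agrees.
    same_tag = [c for c in candidates if c.tag == node.tag]
    if not same_tag:
        return None
    # Prefer the first candidate that shares at least one semantic label;
    # one shared label is enough when a node carries several.
    for c in same_tag:
        if node.semantics & c.semantics:
            return c
    # No semantic match: fall back to the candidate that occurs first.
    return same_tag[0]
```

With candidates `<span>` (time), `<span>` (author), and `<div>`, an author `<span>` is aligned with the second candidate rather than the first, which is the error that tag-only matching would make.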
(2) Record content extraction module
Second, we describe how the record content extraction module determines the record content. In the DOM tree, short text entries such as author and time usually correspond to a single leaf node, whereas the record content corresponds to a complex subtree rather than a single simple leaf node, so it is harder to extract than the other entries. Because the record content is long text compared with the other entries, it is located within the record by the text density metric. The method uses a three-step strategy: first, the subtrees are aligned by the method described above to obtain a super tree Ts, and the set US of nodes in Ts that have no semantic label is filtered out; next, the text density of the nodes in US is calculated, and from it the text density sum of each subtree; finally, the minimum subtree set containing the record content is determined according to the text density sum. Each step is introduced in detail below.
1. Subtree alignment
Record subtrees are aligned using the alignment method introduced in the previous section. All subtrees are made isomorphic by inserting nodes, so that nodes with the same semantics in each tree are aligned to one node in Ts. The semantic labels of Ts follow a voting principle: when more than half of the records label a node with a given semantic, that node is labeled with that semantic. At the same time, the character count of each text leaf node of the super tree Ts is annotated. The annotation rule (a formula image in the original, not reproduced in this text) is expressed in terms of $C_i$ and $C_{ij}$,
where $C_i$ is the character count of the i-th leaf node of the super tree Ts, and $C_{ij}$ is the character count of node i in the j-th subtree.
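The voting principle for the semantic labels of Ts can be sketched as below. The function name `vote_semantic` and the representation of each record's label as a string or `None` are assumptions made for illustration; the character-count annotation is not shown because its formula is not available in the text.

```python
from collections import Counter
from typing import Optional


def vote_semantic(labels: list) -> Optional[str]:
    """Return the label assigned by more than half of the records, else None.

    `labels` holds one entry per record: the semantic label that record
    gave to the aligned node, or None if the record gave it no label.
    """
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return None
    label, votes = counts.most_common(1)[0]
    # Strict majority over all records, including those without a label.
    return label if votes * 2 > len(labels) else None
```

For example, if two of three records label the aligned node as `"time"`, the super-tree node is labeled `"time"`; if only one of four does, it stays unlabeled and is a candidate for the set US.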
2. Text density and text density sum
The present invention uses the text density and text density sum formulas to determine, among the unlabeled subtree nodes, the minimum subtree set containing the record content. First, the text density and the text density sum of all remaining nodes are calculated.
(The two defining formulas, marked (1) and (2), are images in the original and are not reproduced in this text.)
3. Determine the minimum subtree set containing the record content
Find the node with the largest text density sum, DensitySummax, and add the subtree rooted at that node to the result set. On the path from DensitySummax to the record root node, find the node with the smallest text density sum, and take that node's text density sum as the threshold. Then traverse all remaining nodes: if a node's text density exceeds the threshold, find among its child nodes the one with the largest text density sum; the subtree rooted at that child is a record content block. In many cases the record content is split into several text blocks, so the same operation is applied to every node whose text density exceeds the threshold, and the results are merged into the complete set of record content subtrees. Fig. 2 shows an aligned super tree Ts, which is used here to illustrate how the record content subtree set is determined. Shaded nodes are nodes without a semantic label; T denotes a text node, and the number in parentheses is its character count. For the remaining tag nodes, the number beside the node is its text density and the number in parentheses is the text density sum of the subtree under it. The <div> node with a text density sum of 230 has the largest sum, so the subtree rooted at it is added to the result set. The smallest text density sum on the path from that node to the record root is 151, so the threshold is set to 151. Among the remaining nodes, only the <div> node with a text density of 167 exceeds the threshold; since it has no other child nodes, the subtree rooted at it is added directly to the result set. The minimum subtree set obtained in this way is circled in the figure.
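The selection procedure can be sketched as follows, taking each node's text density and text density sum as given inputs because the defining formulas are not available in the text. The `DNode` class and function names are hypothetical, and the sketch assumes that "remaining nodes" means nodes outside the chosen subtree and off its path to the record root.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class DNode:
    name: str
    density: float        # text density of this node (given, not computed here)
    density_sum: float    # text density sum of the subtree under it (given)
    children: list = field(default_factory=list)
    parent: Optional["DNode"] = field(default=None, repr=False)

    def add(self, child: "DNode") -> "DNode":
        child.parent = self
        self.children.append(child)
        return child


def walk(node):
    """Yield the node and all of its descendants in preorder."""
    yield node
    for child in node.children:
        yield from walk(child)


def content_subtrees(root: DNode) -> list:
    nodes = list(walk(root))
    # Step 1: the subtree with the largest density sum is record content.
    best = max(nodes, key=lambda n: n.density_sum)
    result = [best]
    skip = {id(n) for n in walk(best)}
    # Step 2: threshold = smallest density sum on the path best -> root.
    threshold, cur = best.density_sum, best
    while cur is not None:
        threshold = min(threshold, cur.density_sum)
        skip.add(id(cur))
        cur = cur.parent
    # Step 3: every remaining node whose density exceeds the threshold
    # contributes its densest child subtree (or itself if it has no children).
    for n in nodes:
        if id(n) in skip or n.density <= threshold:
            continue
        pick = max(n.children, key=lambda c: c.density_sum) if n.children else n
        if id(pick) not in skip:
            result.append(pick)
            skip.update(id(x) for x in walk(pick))
    return result


# Toy tree using the three numbers quoted from Fig. 2 (230, 151, 167);
# all other values and node names are invented for the example.
root = DNode("record", 20, 151)
main = root.add(DNode("div#main", 60, 230))
main.add(DNode("T", 40, 0))
root.add(DNode("div#quote", 167, 0))
root.add(DNode("span#meta", 30, 10))
selected = [n.name for n in content_subtrees(root)]
```

On this toy tree `selected` holds `div#main` and `div#quote`, mirroring the two subtrees picked in the Fig. 2 walkthrough.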
(3) Entry output module
Finally, we describe the entry output module. It traverses all text nodes in the record region in hierarchical order and outputs them, emitting a separator line whenever a separator is encountered, which yields the final extraction result.
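A minimal sketch of the output step is given below. It assumes a preorder walk, as named in the module summary; the `OutNode` class, the `SEPARATOR` string, the `"label: text"` output format, and the default label `"text"` for unlabeled nodes are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

SEPARATOR = "-" * 20  # hypothetical separator line printed between records


@dataclass
class OutNode:
    tag: str
    text: str = ""                   # non-empty only for text-bearing nodes
    semantic: Optional[str] = None   # e.g. "author", "time"; None if unlabeled
    is_separator: bool = False       # marks a boundary between records
    children: list = field(default_factory=list)


def emit(node: OutNode, out: Optional[list] = None) -> list:
    """Collect output lines by visiting the node before its children (preorder)."""
    if out is None:
        out = []
    if node.is_separator:
        out.append(SEPARATOR)
    elif node.text:
        out.append(f"{node.semantic or 'text'}: {node.text}")
    for child in node.children:
        emit(child, out)
    return out


region = OutNode("div", children=[
    OutNode("li", children=[OutNode("span", "alice", "author"),
                            OutNode("p", "hello")]),
    OutNode("hr", is_separator=True),
    OutNode("li", children=[OutNode("span", "bob", "author")]),
])
lines = emit(region)
```

Each entry is emitted together with its semantic label, and the separator node produces one separator line between the two records.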
(4) Feedback framework
Most existing entry extraction methods are two-stage: records are extracted first, and entries are then extracted from them. The advantage is stepwise refinement, since identifying records first greatly reduces the difficulty of entry extraction. The disadvantage is that record extraction errors seriously affect entry extraction, so errors accumulate; in addition, record extraction is performed without the semantic information of the entries, which impairs its quality. The unified approach performs record extraction and entry extraction at the same time and treats both as a labeling process over tree nodes. Its advantage is that the two kinds of information are used together: the semantic information of entries helps record extraction, and record extraction helps improve the accuracy of entry extraction. Its disadvantage is that the labeling method requires a trained model, the required feature set is domain-dependent, and a training set must be labeled manually, whereas today's massive data urgently needs an automatic extraction method.
The present invention proposes a feedback framework built on the two-stage method. It revises the record extraction on the basis of the entry extraction result, and thereby improves the final extraction quality.
Assumption: every record of a UGC web page contains one or more time entries and author entries.
Under this assumption, if the extracted records do not contain a time entry and an author entry, record region location and record extraction are carried out again, until the records satisfy the assumption or the flow ends abnormally. The flow is as follows:
1. Extract records.
2. Extract entries and judge whether every record has a time entry and an author entry.
If so, extraction ends successfully.
If not, and only one or two records fail the check, those records are treated as advertisement records or the like and removed.
If the overwhelming majority of records fail the check, the record region block rooted at the second-longest text node, or the one with the second-largest number of records, is taken as the new record region.
3. Repeat steps 1 and 2 until the extracted records satisfy the condition, or until no new record region can be selected (all text nodes have been traversed, or the number of records is less than 3; a multi-record web page normally has at least 3 records).
The flow is fully automatic. Whether a record satisfies the assumption is also judged from the automatic semantic labeling result, with no manual intervention, and the flow corrects record extraction errors effectively, which avoids error accumulation.
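The feedback loop can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: candidate regions are assumed to arrive already ordered by priority, `extract_records` is a hypothetical callable that returns each record as a dict of labeled entries, a region with fewer than three records is treated as unselectable and skipped, and anything beyond the tolerated one or two failing records is treated as a failed region.

```python
from typing import Callable, Iterable, Optional


def has_required(record: dict) -> bool:
    """A record satisfies the assumption if it has a time and an author entry."""
    return bool(record.get("time")) and bool(record.get("author"))


def extract_with_feedback(regions: Iterable,
                          extract_records: Callable,
                          min_records: int = 3,
                          tolerance: int = 2) -> Optional[list]:
    """Try candidate record regions in order until one passes the check.

    Returns the accepted records, or None when no region can be selected
    (the abnormal end of the flow).
    """
    for region in regions:
        records = extract_records(region)          # step 1: extract records
        if len(records) < min_records:
            continue                               # not a usable record region
        bad = [r for r in records if not has_required(r)]   # step 2: check
        if len(bad) <= tolerance:
            # Only a few outliers (e.g. advertisements): drop them and finish.
            return [r for r in records if has_required(r)]
        # Too many records fail: fall through to the next region (step 3).
    return None


# Hypothetical page with a wrongly located region followed by the right one.
pages = {
    "sidebar": [{"text": "ad"}] * 4,
    "posts": [{"time": "10:00", "author": "a"}] * 3 + [{"text": "ad"}],
}
accepted = extract_with_feedback(["sidebar", "posts"], pages.get)
```

In this example the first region is rejected because none of its records has a time and an author entry, and the second is accepted after its single advertisement record is removed.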
The main innovations of the present invention are the following three points:
1. The present invention is the first to consider both internal node tag values and leaf node text semantics during tree alignment. Combining the two avoids some obvious tree alignment errors, such as the case where the internal node tags are identical but the leaf node semantics differ.
2. The present invention is the first to use text density and text density sum to determine the record content within a record. In the DOM tree, short text entries such as author and time usually correspond to a single leaf node, whereas the record content corresponds to a complex subtree rather than a single simple leaf node, so it is harder to extract than the other entries. Because the record content is long text compared with the other entries, it can be located within the record by the text density metric.
3. The present invention is the first to propose the feedback framework. The design of this flow avoids the error accumulation of the two-stage extraction flow, and also solves the problems of the unified extraction approach, namely manually labeled corpora and domain dependence, so that data records and entries are extracted efficiently and accurately.
Accordingly, the present invention proposes an entry extraction method for multi-record web pages, comprising the following steps:
Step 1: The record tree alignment module receives the extracted record-region subtrees and aligns them using tag information and semantic information to obtain a super tree, so that nodes with the same semantics correspond to the same node of the super tree;
Step 2: The record content extraction module locates the record content within each record using two metrics, text density and text density sum;
Step 3: The entry output module outputs all entries in the record region together with their semantic labels in preorder traversal of the tree nodes;
Step 4: The feedback framework, after entries have been extracted, uses the extraction result to check whether the record region was located correctly; if not, the record region is relocated and the entry extraction result is revised accordingly, until the extraction result is correct or the flow ends abnormally because no new record region can be located; if so, the extraction flow ends directly.
In step 1, the record tree alignment module aligns the subtrees using the tags of the internal nodes of the DOM tree and the text semantics of its leaf nodes.
In step 2, the workflow of the record content extraction module comprises the following steps:
Step a1: Align the subtrees to obtain a super tree Ts, and filter out the set US of nodes in Ts that have no semantic label;
Step a2: Calculate the text density of the nodes in US, and from it the text density sum of each subtree;
Step a3: Determine, according to the text density sum, the minimum subtree set containing the record content.
In step 4, the feedback framework is fully automatic; as shown in Fig. 3, its workflow comprises the following steps:
Step b1: Extract records;
Step b2: Extract entries and judge whether every record has a time entry and an author entry. If so, extraction succeeds. If not: when only a few records fail the check, those records are treated as advertisement records or the like and removed; when most records fail the check, re-determine the record region;
Step b3: Repeat steps b1 and b2 until the extracted records satisfy the condition, or until no new record region can be selected.
The above are preferred embodiments of the present invention. Any change made according to the technical solution of the present invention falls within the protection scope of the present invention, provided that the resulting function does not go beyond the scope of the technical solution of the present invention.

Claims (6)

  1. An entry extraction system for multi-record web pages, characterized by comprising:
    a record tree alignment module, for receiving the extracted record-region subtrees and aligning them using tag information and semantic information to obtain a super tree, so that nodes with the same semantics correspond to the same node of the super tree;
    a record content extraction module, which locates the record content within each record using two metrics, text density and text density sum;
    an entry output module, for outputting all entries in the record region together with their semantic labels in preorder traversal of the tree nodes;
    a feedback framework, which, after entries have been extracted, uses the extraction result to check whether the record region was located correctly; if not, the record region is relocated and the entry extraction result is revised accordingly, until the extraction result is correct or the flow ends abnormally because no new record region can be located; if so, the extraction flow ends directly; the workflow of the feedback framework comprises the following steps:
    Step b1: Extract records;
    Step b2: Extract entries and judge whether every record has a time entry and an author entry. If so, extraction succeeds. If not: when only a few records fail the check, remove those records; when most records fail the check, re-determine the record region;
    Step b3: Repeat steps b1 and b2 until the extracted records satisfy the condition, or until no new record region can be selected.
  2. The entry extraction system for multi-record web pages according to claim 1, characterized in that the record tree alignment module aligns the subtrees using the tags of the internal nodes of the DOM tree and the text semantics of its leaf nodes.
  3. The entry extraction system for multi-record web pages according to claim 1, characterized in that the workflow of the record content extraction module comprises the following steps:
    Step a1: Align the subtrees to obtain a super tree Ts, and filter out the set US of nodes in Ts that have no semantic label;
    Step a2: Calculate the text density of the nodes in US, and from it the text density sum of each subtree;
    Step a3: Determine, according to the text density sum, the minimum subtree set containing the record content.
  4. An entry extraction method for multi-record web pages, characterized by comprising the following steps:
    Step 1: The record tree alignment module receives the extracted record-region subtrees and aligns them using tag information and semantic information to obtain a super tree, so that nodes with the same semantics correspond to the same node of the super tree;
    Step 2: The record content extraction module locates the record content within each record using two metrics, text density and text density sum;
    Step 3: The entry output module outputs all entries in the record region together with their semantic labels in preorder traversal of the tree nodes;
    Step 4: The feedback framework, after entries have been extracted, uses the extraction result to check whether the record region was located correctly; if not, the record region is relocated and the entry extraction result is revised accordingly, until the extraction result is correct or the flow ends abnormally because no new record region can be located; if so, the extraction flow ends directly; the workflow of the feedback framework comprises the following steps:
    Step b1: Extract records;
    Step b2: Extract entries and judge whether every record has a time entry and an author entry. If so, extraction succeeds. If not: when only a few records fail the check, remove those records; when most records fail the check, re-determine the record region;
    Step b3: Repeat steps b1 and b2 until the extracted records satisfy the condition, or until no new record region can be selected.
  5. The entry extraction method for multi-record webpages according to claim 4, characterized in that, in step 1, the record tree alignment module aligns subtrees using the tags of the DOM tree's internal nodes and the text semantics of its leaf nodes.
  6. The entry extraction method for multi-record webpages according to claim 4, characterized in that, in step 2, the workflow of the record content extraction module comprises the following steps:
    Step a1: Perform subtree alignment to obtain a hypertree Ts, and select from Ts the set US of nodes that have no semantic annotation;
    Step a2: Calculate the text density of the nodes in the set US, and from these obtain the text-density sum of each subtree;
    Step a3: Determine, according to the text-density sum, the minimum subtree set that contains the record content.
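
The feedback framework of steps b1 to b3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `Record`, `extract_with_feedback`, `candidate_regions`, and `extract_records` are hypothetical, the entry labels "time" and "author" are assumed keys, and the 0.5 cut-off for "most records" is an assumption, since the claims give no numeric threshold.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional


@dataclass
class Record:
    """One extracted record; `entries` maps a semantic label to its text."""
    entries: Dict[str, str] = field(default_factory=dict)

    def is_complete(self) -> bool:
        # Check from step b2: a record needs a time entry and an author entry.
        return bool(self.entries.get("time")) and bool(self.entries.get("author"))


def extract_with_feedback(
    candidate_regions: Iterator[str],
    extract_records: Callable[[str], List[Record]],
    majority: float = 0.5,  # assumed meaning of "most records"
) -> Optional[List[Record]]:
    """Try candidate posting fields in turn until one yields acceptable records."""
    for region in candidate_regions:
        # Steps b1 and b2: extract the records and their entries for this region.
        records = extract_records(region)
        if not records:
            continue
        failing = [r for r in records if not r.is_complete()]
        if len(failing) / len(records) > majority:
            # Most records fail the check: redetermine the posting field.
            continue
        # All records pass, or only some fail: drop the failing ones and stop.
        return [r for r in records if r.is_complete()]
    # Step b3, second exit: no new posting field can be selected.
    return None


# Toy data standing in for a real record extractor (purely illustrative).
pages = {
    "div#sidebar": [Record({"author": "ann"}), Record({})],
    "div#posts": [
        Record({"time": "2014-09-28", "author": "ann", "content": "first"}),
        Record({"time": "2014-09-29", "author": "bob", "content": "second"}),
        Record({"content": "advertisement"}),
    ],
}
result = extract_with_feedback(iter(pages), pages.__getitem__)
print(len(result))  # prints 2
```

In this toy run the first candidate region is rejected because both of its records lack a time entry, while the second is accepted after its one incomplete record is removed. Returning `None` corresponds to the abnormal termination described in step 4.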
CN201410503955.9A 2014-09-28 2014-09-28 For the entry extraction system and method for more record webpages Active CN104217025B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410503955.9A CN104217025B (en) 2014-09-28 2014-09-28 For the entry extraction system and method for more record webpages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410503955.9A CN104217025B (en) 2014-09-28 2014-09-28 For the entry extraction system and method for more record webpages

Publications (2)

Publication Number Publication Date
CN104217025A CN104217025A (en) 2014-12-17
CN104217025B true CN104217025B (en) 2018-04-13

Family

ID=52098515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410503955.9A Active CN104217025B (en) 2014-09-28 2014-09-28 For the entry extraction system and method for more record webpages

Country Status (1)

Country Link
CN (1) CN104217025B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108073678B (en) * 2017-11-06 2020-08-28 广东广业开元科技有限公司 Document analysis processing method, system and device applied to big data analysis
CN108959204B (en) * 2018-06-22 2021-03-05 中国科学院计算技术研究所 Internet financial project information extraction method and system
CN112559929B (en) * 2021-02-25 2021-05-07 中航信移动科技有限公司 Method, electronic device and medium for extracting webpage target information
CN113934914B (en) * 2021-12-20 2022-03-01 成都橙视传媒科技股份公司 Method for collecting batch encrypted data of news media

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102779170A (en) * 2012-06-25 2012-11-14 北京奇虎科技有限公司 System and method for identifying text floor of webpage
CN103761312A (en) * 2014-01-24 2014-04-30 福州大学 Information extraction system and method for multi-recording webpage

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Vision-based Web Data Records Extraction"; Liu Wei et al.; 《Workshop on the Web and Databases》; 20061231; pp. 20-25 *
"基于DOM节点文本密度的网页核心块抽取算法研究"; 孙飞; 《中国优秀硕士学位论文全文数据库 信息科技辑》; 20120715 (No. 7); thesis Sections 3.2 and 4.2 *
"基于Hadoop的Web评论自动抽取方法研究"; 颜佳伟; 《中国优秀硕士学位论文全文数据库 信息科技辑》; 20131215 (No. S2); thesis Sections 3.3-3.4, Fig. 3.6, Fig. 3.15 *

Also Published As

Publication number Publication date
CN104217025A (en) 2014-12-17

Similar Documents

Publication Publication Date Title
CN101464905B (en) Web page information extraction system and method
CN102831121B (en) Method and system for extracting webpage information
CN103678412B (en) A kind of method and device of file retrieval
CN104217025B (en) For the entry extraction system and method for more record webpages
CN103870506B (en) Webpage information extraction method and system
CN102591612B (en) General webpage text extraction method based on punctuation continuity and system thereof
CN102662969B (en) Internet information object positioning method based on webpage structure semantic meaning
CN102306177B (en) Multi-strategy combined ontology or instance matching method
CN103136358B (en) A kind of method of Automatic Extraction forum data
CN102270206A (en) Method and device for capturing valid web page contents
CN104598462B (en) Extract the method and device of structural data
US10789302B2 (en) Method and system for extracting user-specific content
CN108334493A (en) A kind of topic knowledge point extraction method based on neural network
CN105677638B (en) Web information abstracting method
CN106055667A (en) Method for extracting core content of webpage based on text-tag density
CN103699591A (en) Page body extraction method based on sample page
CN101661468B (en) Method for extracting post metadata from forum post list pages
CN109522452A (en) A kind of processing method of magnanimity semi-structured data
CN102117289A (en) Method and device for extracting comment content from webpage
CN103853770B (en) The method and system of model content in a kind of extraction forum Web pages
CN104933032A (en) Method for extracting keywords of blog based on complex network
CN103500216A (en) Method for extracting file information
CN107436931A (en) web page text extracting method and device
CN101727497A (en) Method for generating interactive document structure from web page document
CN104615728B (en) A kind of webpage context extraction method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant