CN104217025B - Entry extraction system and method for multi-record web pages - Google Patents
- Publication number
- CN104217025B (application number CN201410503955.9A)
- Authority
- CN
- China
- Prior art keywords
- record
- entry
- posting field
- node
- extracted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to an entry extraction system and method for multi-record web pages. The system includes: a record tree alignment module, which receives the record-region subtrees that have already been extracted and aligns them using tag information and semantic information to obtain a hypertree, so that nodes with the same semantics correspond to the same node of the hypertree; a record content extraction module, which locates the record content within each record using two metrics, text density and text density sum; an entry output module, which outputs all entries in the record region together with their semantic annotations in preorder traversal of the tree nodes; and a feedback framework, which, after the entries are extracted, uses the extraction result to check whether the record region was located correctly. If it was not, the record region is relocated and the entry extraction result is revised accordingly; if it was, the extraction flow ends directly. The system and method can extract entries from the record regions of multi-record web pages efficiently and accurately, with fast extraction, high accuracy, strong generality, and a wide range of application.
Description
Technical field
The present invention relates to the field of information extraction, and more particularly to an entry extraction system and method for multi-record web pages. It can be applied to web pages that contain multiple similar records, such as microblogs, forums, and product reviews, and is suitable for a variety of media and domains.
Background
With the arrival of the Web 2.0 era, multi-record web pages have become an important data source for data mining. A multi-record web page is a page that contains one or more record regions, each made up of multiple structurally similar records, where each record may contain a fixed set of entries. Traditional multi-record pages were typically produced by a server-side CGI program that retrieved the records from a database and rendered them dynamically through a prepared template. Because the template is fixed, the records are highly similar in structure and very regular. In newer multi-record pages, users take part in creating the content; the free and open content format and the complex page structure make it very difficult to extract the entries in a form suitable for machine processing.
In the prior art, many methods are available for extraction from multi-record web pages, but they mainly extract data records and do not go further to extract data items from those records. Extracting data items better serves data mining tasks such as data integration and data analysis. Traditional data item extraction relies on hand-written rules, which can extract entry information from a specific data source quickly and conveniently. When the number of data sources grows to hundreds or thousands, however, writing rules by hand consumes a great deal of time and effort and cannot keep up with the rapid growth of information. Moreover, the page template of a data source is not fixed: once the template is updated, the rules must be revised by hand, which incurs a high maintenance cost. Some methods generate rules from manually labeled training sets, but because they also require human involvement, they are likewise unsuited to extraction from massive, frequently changing multi-record pages.
The prior art also includes some entry extraction methods for multi-record web pages. These methods focus mainly on extracting specific entries for a specific medium, such as the review text on review pages, or the author name, posting time, and post content on forum post pages, and do not extract other entries. Yet the other entries also have application value; in-depth mining of domain knowledge in particular needs more complete entry information. Identifying spam reviews, for example, requires the product rating, the helpfulness rating, and the reviewer information in a review record; the review text alone is not enough. What is missing is a general entry extraction method for multi-record web pages.
In addition, most existing entry extraction methods are two-stage: records are extracted first, and entries are extracted afterwards. The advantage is stepwise refinement: identifying the records first greatly reduces the difficulty of entry extraction. The drawback is that errors in record extraction seriously affect entry extraction and accumulate, and that record extraction itself suffers because the semantic information of the entries is not yet available. The alternative is unified extraction, which performs record extraction and entry extraction simultaneously and treats both as a labeling process over tree nodes. Its advantage is that the two kinds of information are used together effectively: the semantic information of the entries helps record extraction, and record extraction helps improve the accuracy of entry extraction. Its drawback is that the labeling methods in the literature need a trained model whose feature set is domain-dependent and whose training set must be labeled by hand, whereas today's massive data urgently calls for an automatic extraction method. Existing work has not yet achieved entry extraction with an unsupervised, unified approach.
As social media such as microblogs and forums have continued to produce messages in recent years, multi-record web pages have come to hold a large amount of data, and data mining is needed to discover information in them such as hot topics and opinion leaders. This poses a challenge for entry extraction: how to build a unified and effective information extraction system that meets the extraction needs of different media. There is therefore an urgent need for an efficient and accurate entry extraction method that automatically extracts the entries of a record region, aligns them semantically, and can be used easily across different media and domains.
Summary of the invention
The object of the present invention is to provide an entry extraction system and method for multi-record web pages that can extract entries from the record regions of such pages efficiently and accurately, with fast extraction, high accuracy, strong generality, and a wide range of application.
To achieve the above object, the technical solution of the present invention is an entry extraction system for multi-record web pages, comprising:
a record tree alignment module, for receiving the extracted record-region subtrees and aligning them using tag information and semantic information to obtain a hypertree, so that nodes with the same semantics correspond to the same node of the hypertree;
a record content extraction module, which determines the position of the record content within a record using the text density and text density sum metrics;
an entry output module, for outputting all entries in the record region and their semantic annotations in preorder traversal of the tree nodes;
a feedback framework, for checking, after the entries are extracted, whether the record region was located correctly based on the extraction result. If it was not, the record region is relocated and the entry extraction result is revised, until the extraction result is correct or no new record region can be located, in which case the process ends abnormally; if it was, the extraction flow ends directly.
Further, the record tree alignment module aligns the subtrees using the tags of the DOM tree's internal nodes and the text semantics of its leaf nodes.
Further, the workflow of the record content extraction module comprises the following steps:
Step a1: align the subtrees to obtain a hypertree Ts, and select the set US of nodes in Ts that have no semantic annotation;
Step a2: compute the text density of each node in US, and from it the text density sum of each subtree;
Step a3: determine, from the text density sums, the minimum set of subtrees that contains the record content.
Further, the workflow of the feedback framework comprises the following steps:
Step b1: extract the records;
Step b2: extract the entries and check whether every record has a time entry and an author entry; if so, the extraction succeeds; if not, and only a few records fail the check, remove those records; if most records fail the check, determine a new record region;
Step b3: repeat steps b1 and b2 until the extracted records satisfy the condition or no new record region can be selected.
The present invention also provides an entry extraction method for multi-record web pages, comprising the following steps:
Step 1: the record tree alignment module receives the extracted record-region subtrees and aligns them using tag information and semantic information to obtain a hypertree, so that nodes with the same semantics correspond to the same node of the hypertree;
Step 2: the record content extraction module determines the position of the record content within a record using the text density and text density sum metrics;
Step 3: the entry output module outputs all entries in the record region and their semantic annotations in preorder traversal of the tree nodes;
Step 4: the feedback framework checks, after the entries are extracted, whether the record region was located correctly based on the extraction result. If it was not, the record region is relocated and the entry extraction result is revised, until the extraction result is correct or no new record region can be located, in which case the process ends abnormally; if it was, the extraction flow ends directly.
Further, in step 1, the record tree alignment module aligns the subtrees using the tags of the DOM tree's internal nodes and the text semantics of its leaf nodes.
Further, in step 2, the workflow of the record content extraction module comprises the following steps:
Step a1: align the subtrees to obtain a hypertree Ts, and select the set US of nodes in Ts that have no semantic annotation;
Step a2: compute the text density of each node in US, and from it the text density sum of each subtree;
Step a3: determine, from the text density sums, the minimum set of subtrees that contains the record content.
Further, in step 4, the workflow of the feedback framework comprises the following steps:
Step b1: extract the records;
Step b2: extract the entries and check whether every record has a time entry and an author entry; if so, the extraction succeeds; if not, and only a few records fail the check, remove those records; if most records fail the check, determine a new record region;
Step b3: repeat steps b1 and b2 until the extracted records satisfy the condition or no new record region can be selected.
Compared with the prior art, the present invention can extract entries from multi-record web pages (such as microblog pages, forum post pages, and product review pages) efficiently and accurately. It overcomes the error accumulation and lack of automation of existing extraction methods. It is fast, accurate, stable, and general, applies to a wide range of cases, can be used easily across different media and domains, and has strong practical value and broad application prospects.
Brief description of the drawings
Fig. 1 is a structural diagram of the system according to an embodiment of the present invention.
Fig. 2 is a schematic example of record content extraction in an embodiment of the present invention.
Fig. 3 is a workflow diagram of the feedback framework in an embodiment of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
The entry extraction system of the present invention for multi-record web pages, as shown in Fig. 1, includes:
(1) A record tree alignment module, for receiving the extracted record-region subtrees and aligning them using tag information and semantic information to obtain a hypertree, so that nodes with the same semantics correspond to the same node of the hypertree. The record tree alignment module aligns the subtrees using the tags of the DOM tree's internal nodes and the text semantics of its leaf nodes.
(2) A record content extraction module, which determines the position of the record content within a record using the text density and text density sum metrics.
(3) An entry output module, for outputting all entries in the record region and their semantic annotations in preorder traversal of the tree nodes.
(4) A feedback framework, for checking, after the entries are extracted, whether the record region was located correctly based on the extraction result. If it was not, the record region is relocated and the entry extraction result is revised, until the extraction result is correct or no new record region can be located, in which case the process ends abnormally; if it was, the extraction flow ends directly.
The implementation of each module is described in detail separately below.
(1) Record tree alignment module
First, we describe how the record tree alignment module performs tree alignment, that is, how nodes with the same semantics are mapped to the same node of the hypertree.
Existing alignment methods traverse the DOM tree in postorder and match subtrees using tree edit distance. During matching they only require the tags to agree and ignore the entry values, but HTML tags are designed for page layout and lack semantic expressiveness. When a subtree can match several subtrees, the first one encountered is chosen. This approach is simple but not very accurate. To improve accuracy, the present invention annotates the entries that have obvious semantics (time, author, and so on) and aligns nodes with the same semantic label, which avoids such errors. When a subtree can match several subtrees, the one with the same semantics is chosen; if no subtree with the same semantics is found, the first one encountered is chosen. If a node carries several semantic labels, it is aligned with another node as soon as one of its labels matches a label on that node.
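The matching preference described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Candidate` type, its `position` and `semantics` fields, and the function name `choose_match` are hypothetical, and the structural match (for example by tree edit distance) is assumed to have been computed already.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Candidate:
    """A subtree that already matches structurally (hypothetical type)."""
    position: int                                      # order of appearance
    semantics: Set[str] = field(default_factory=set)   # e.g. {"time"}


def choose_match(labels: Set[str], candidates: List[Candidate]) -> Optional[Candidate]:
    """Pick the alignment target for a subtree carrying the given semantic labels."""
    if not candidates:
        return None
    ordered = sorted(candidates, key=lambda c: c.position)
    # Prefer the first candidate that shares at least one semantic label.
    for cand in ordered:
        if labels & cand.semantics:
            return cand
    # No semantic match: fall back to the candidate that appears first.
    return ordered[0]
```

With two structurally matching candidates labeled `author` (position 0) and `time` (position 1), a subtree labeled `time` is aligned with the second one, while an unlabeled subtree falls back to the first.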
(2) Record content extraction module
Next, we describe how the record content extraction module determines the record content. In a DOM tree, short text entries such as author and time usually correspond to a single leaf node, whereas the record content corresponds to a complex subtree rather than a single leaf node, so it is harder to extract than the other entries. Because the record content is long text compared with the other entries, it can be located within a record using the text density metric. The method has three steps: first, align the subtrees as described above to obtain a hypertree Ts and select the set US of nodes in Ts that have no semantic annotation; then compute the text density of each node in US and from it the text density sum of each subtree; finally, use the text density sums to determine the minimum set of subtrees containing the record content. Each step is described below.
1. Subtree alignment
The record subtrees are aligned using the alignment method introduced in the previous section. By inserting nodes, all subtrees are made isomorphic, so that nodes with the same semantics in every tree are aligned to a single node in Ts. The semantic annotation of Ts follows a voting principle: when more than half of the records label a node with a given semantic, the node in Ts is labeled with that semantic. At the same time, each text leaf node of the hypertree Ts is annotated with its character count. The annotation rule is
[formula not preserved in the source text]
where $C_i$ is the character count of the i-th leaf node of the hypertree Ts, and $C_{ij}$ is the character count of node i in the j-th subtree.
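The voting principle for the semantic annotation of Ts can be illustrated with a short sketch. The function name `vote_label` and the use of `None` for "this record assigns no label to the node" are assumptions made for the example; the character count rule is not sketched because its formula is not available in this text.

```python
from collections import Counter
from typing import List, Optional


def vote_label(labels_per_record: List[Optional[str]]) -> Optional[str]:
    """Return the label given by more than half of the records, else None.

    labels_per_record holds, for one aligned hypertree node, the semantic
    label each record assigned to it (None when a record assigns no label).
    """
    votes = Counter(label for label in labels_per_record if label is not None)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    # "More than half" is taken over all records, labeled or not.
    return label if count * 2 > len(labels_per_record) else None
```

For instance, a node labeled `time` in two of three records is annotated as `time`, whereas a node labeled `time` in exactly half of the records stays unannotated.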
2. Text density and text density sum
The present invention uses the formulas for text density and text density sum to determine, among the subtree nodes without annotation, the minimum set of subtrees containing the record content. First, the text density (formula 2) and the text density sum (formula 3) of all remaining nodes are computed.
[The text density and text density sum formulas are images that are not preserved in the source text.]
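To make the two metrics concrete, the sketch below computes them on a small tree. The definitions used here are assumptions, since the patent's own formulas are not reproduced in this text: text density is taken as the number of characters under a node divided by the number of tag nodes in its subtree, and text density sum as the sum of the text densities of a node's children, following a commonly used DOM text-density formulation. The `Node` type and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    chars: int = 0                                   # characters of text directly in this node
    children: List["Node"] = field(default_factory=list)


def char_count(node: Node) -> int:
    """Total characters in the subtree rooted at node."""
    return node.chars + sum(char_count(c) for c in node.children)


def tag_count(node: Node) -> int:
    """Number of tag nodes in the subtree, counting the node itself."""
    return 1 + sum(tag_count(c) for c in node.children)


def text_density(node: Node) -> float:
    """Assumed definition: characters per tag node in the subtree."""
    return char_count(node) / tag_count(node)


def density_sum(node: Node) -> float:
    """Assumed definition: sum of the children's text densities."""
    return sum(text_density(c) for c in node.children)


# A long paragraph next to a short, tag-heavy fragment.
body = Node("div", 0, [Node("p", 120), Node("span", 8, [Node("a", 4)])])
print(text_density(body), density_sum(body))
```

Under these assumed definitions, a subtree dominated by long text receives a high density, while short entries wrapped in several tags receive a low one, which is the property the extraction step relies on.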
3. Determining the minimum set of subtrees containing the record content
Find the node with the largest text density sum, DensitySum_max, and add the subtree rooted at that node to the result set. On the path from DensitySum_max to the record's root node, find the node with the smallest text density sum and take that value as the threshold. Then traverse all remaining nodes: if a node's text density exceeds the threshold, find the child of that node with the largest text density sum and take the subtree rooted at that child as a record content block. In many cases the record content is split into several text blocks, so the same operation must be applied to every node whose text density exceeds the threshold, and the results are combined into the full set of record content subtrees. Fig. 2 shows a hypertree Ts after alignment, which is used here to illustrate how the record content subtree set is determined. Shaded nodes are nodes without semantic annotation. T denotes a text node, and the number in parentheses is the text node's character count. The number beside each remaining tag node is that node's text density, and the number in parentheses is the text density sum of the subtree under that node. The <div> node whose text density sum is 230 has the largest sum, so the subtree rooted at it is added to the result set. The smallest text density sum on the path from that node to the record root is 151, so the threshold is set to 151. Among the remaining nodes, only the <div> node with text density 167 exceeds the threshold; because it has no other child nodes, the subtree rooted at it is added to the result set directly. This yields the minimum subtree set, which is circled in the figure.
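The selection procedure can be sketched as below, taking precomputed text density and text density sum values as inputs. Several points are assumptions: "remaining nodes" is read as the nodes outside the chosen subtree and off the path to the root, a node without children is itself taken as the content block, and the `Node` type and function names are hypothetical. The values 230, 151, and 167 follow the Fig. 2 walkthrough; all other numbers are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)
class Node:
    name: str
    density: float
    density_sum: float
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def walk(node: Node) -> Iterator[Node]:
    yield node
    for child in node.children:
        yield from walk(child)


def content_subtrees(root: Node) -> List[Node]:
    nodes = list(walk(root))
    # Step 1: the node with the largest density sum seeds the result set.
    best = max(nodes, key=lambda n: n.density_sum)
    result = [best]
    # Step 2: threshold = smallest density sum on the path best -> root.
    path, cur = [], best
    while cur is not None:
        path.append(cur)
        cur = cur.parent
    threshold = min(p.density_sum for p in path)
    # Step 3: check every remaining node against the threshold.
    skipped = {id(n) for n in walk(best)} | {id(p) for p in path}
    for n in nodes:
        if id(n) in skipped or n.density <= threshold:
            continue
        block = max(n.children, key=lambda c: c.density_sum) if n.children else n
        if all(block is not r for r in result):
            result.append(block)
    return result


root = Node("record", 60, 151)
body = root.add(Node("div.body", 100, 230))
body.add(Node("p1", 150, 0))
body.add(Node("p2", 80, 0))
side = root.add(Node("div.side", 51, 167))
side.add(Node("div.quote", 167, 0))
print([n.name for n in content_subtrees(root)])
```

On this tree the sketch selects `div.body` first, derives the threshold 151 from the record root, and then adds `div.quote`, mirroring the order of the walkthrough above.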
(3) Entry output module
Finally, we describe the entry output module. It traverses all text nodes in the record region in hierarchical order and outputs them, emitting a separator line whenever a separator is encountered, which yields the final extraction result.
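A minimal sketch of the output step is given below. It assumes preorder traversal, as named in the summary of the invention, and treats the end of each record as the point where a separator line is emitted; the `Node` type, the `semantic` field, and the output format are hypothetical choices for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    tag: str
    text: str = ""
    semantic: Optional[str] = None                   # e.g. "author", "time", "content"
    children: List["Node"] = field(default_factory=list)


def preorder_entries(node: Node) -> Iterator[str]:
    """Yield 'label: text' lines for text-bearing nodes in preorder."""
    if node.text:
        yield f"{node.semantic or 'unlabeled'}: {node.text}"
    for child in node.children:
        yield from preorder_entries(child)


def render_region(records: List[Node], separator: str = "-" * 20) -> str:
    """Output every record's entries, followed by a separator line."""
    lines: List[str] = []
    for record in records:
        lines.extend(preorder_entries(record))
        lines.append(separator)
    return "\n".join(lines)
```

Each record is passed as the root of its own subtree, so the entries of one record always appear together and in document order.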
(4) Feedback framework
Most existing entry extraction methods are two-stage: records are extracted first, and entries are extracted afterwards. The advantage is stepwise refinement: identifying the records first greatly reduces the difficulty of entry extraction. The drawback is that errors in record extraction seriously affect entry extraction and accumulate, and that record extraction itself suffers because the semantic information of the entries is not yet available. The unified approach performs record extraction and entry extraction simultaneously and treats both as a labeling process over tree nodes. Its advantage is that the two kinds of information are used together effectively: the semantic information of the entries helps record extraction, and record extraction helps improve the accuracy of entry extraction. Its drawback is that the labeling methods in the literature need a trained model whose feature set is domain-dependent and whose training set must be labeled by hand, whereas today's massive data urgently calls for an automatic extraction method.
The present invention proposes a feedback framework on top of the two-stage approach: record extraction can be revised based on the entry extraction result, which improves the final extraction quality.
Assumption: every record on a UGC (user-generated content) web page contains one or more time entries and author entries.
Under this assumption, if the extracted records do not contain a time entry and an author entry, the record region must be located and the records extracted again, until the records satisfy the assumption or the process ends abnormally. The flow is as follows:
1. Extract the records.
2. Extract the entries and check whether every record has a time entry and an author entry.
If so, the extraction ends successfully.
If not, and only one or two records fail the check, those records are treated as advertisements or similar and are removed.
If the vast majority of records fail the check, the block rooted at the node with the second-longest text, or the record region block with the second-largest number of records, is taken as the new record region.
3. Repeat steps 1 and 2 until the extracted records satisfy the condition or no new record region can be selected (all text nodes have been traversed, or the number of records is less than 3; a multi-record web page usually has at least 3 records).
The flow is fully automatic. Whether a record satisfies the assumption is judged from the automatic semantic annotation result, without manual intervention, and the flow corrects record extraction errors effectively and avoids error accumulation.
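The feedback loop can be sketched as follows. The sketch makes several assumptions: candidate record regions are supplied already ranked (the next region in the sequence stands in for the block chosen by second-longest text or second-largest record count), records are represented as dictionaries keyed by semantic label, and the parameters `min_records` and `max_outliers` encode the "fewer than 3 records" and "one or two records" conditions. All names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, str]   # semantic label -> extracted text


def satisfies_assumption(record: Record) -> bool:
    """A record must contain both a time entry and an author entry."""
    return "time" in record and "author" in record


def extract_with_feedback(
    candidate_regions: Iterable[object],
    extract_records: Callable[[object], List[Record]],
    min_records: int = 3,
    max_outliers: int = 2,
) -> Optional[List[Record]]:
    """Try regions in order; return validated records, or None on abnormal end."""
    for region in candidate_regions:
        records = extract_records(region)          # steps 1 and 2
        if len(records) < min_records:
            return None                            # no usable region left
        good = [r for r in records if satisfies_assumption(r)]
        bad = len(records) - len(good)
        # Accept when only a few records fail and they are a minority;
        # the failing records (ads and the like) are dropped.
        if bad <= max_outliers and len(good) * 2 > len(records):
            return good
        # Otherwise most records fail: move on to the next candidate region.
    return None                                    # candidates exhausted
```

The caller supplies `extract_records`, which is expected to run record extraction and entry annotation for one region, so the loop itself stays independent of how those two stages are implemented.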
The main innovations of the present invention are the following three points:
1. The present invention is the first to consider both internal node tag values and leaf node text semantics during tree alignment. Combining the two avoids some obvious tree alignment errors, such as cases where the internal node tag values are identical but the leaf node semantics differ.
2. The present invention is the first to use text density and text density sum to determine the record content within a record. In a DOM tree, short text entries such as author and time usually correspond to a single leaf node, whereas the record content corresponds to a complex subtree rather than a single leaf node, so it is harder to extract than the other entries. Because the record content is long text compared with the other entries, it can be located within a record using the text density metric.
3. The present invention is the first to propose the feedback framework. The flow avoids the error accumulation of two-stage extraction and also avoids the need of unified extraction for manually labeled corpora and its domain dependence, so that data records and entries are extracted efficiently and accurately.
Accordingly, the present invention proposes an entry extraction method for multi-record web pages, comprising the following steps:
Step 1: the record tree alignment module receives the extracted record-region subtrees and aligns them using tag information and semantic information to obtain a hypertree, so that nodes with the same semantics correspond to the same node of the hypertree;
Step 2: the record content extraction module determines the position of the record content within a record using the text density and text density sum metrics;
Step 3: the entry output module outputs all entries in the record region and their semantic annotations in preorder traversal of the tree nodes;
Step 4: the feedback framework checks, after the entries are extracted, whether the record region was located correctly based on the extraction result. If it was not, the record region is relocated and the entry extraction result is revised, until the extraction result is correct or no new record region can be located, in which case the process ends abnormally; if it was, the extraction flow ends directly.
In step 1, the record tree alignment module aligns the subtrees using the tags of the DOM tree's internal nodes and the text semantics of its leaf nodes.
In step 2, the workflow of the record content extraction module comprises the following steps:
Step a1: align the subtrees to obtain a hypertree Ts, and select the set US of nodes in Ts that have no semantic annotation;
Step a2: compute the text density of each node in US, and from it the text density sum of each subtree;
Step a3: determine, from the text density sums, the minimum set of subtrees that contains the record content.
In step 4, the feedback framework is fully automatic. As shown in Fig. 3, its workflow comprises the following steps:
Step b1: extract the records;
Step b2: extract the entries and check whether every record has a time entry and an author entry; if so, the extraction succeeds; if not, and only a few records fail the check, those records are treated as advertisements or similar and are removed; if most records fail the check, determine a new record region;
Step b3: repeat steps b1 and b2 until the extracted records satisfy the condition or no new record region can be selected.
The above are preferred embodiments of the present invention. Any change made according to the technical solution of the present invention whose resulting function does not go beyond the scope of that technical solution falls within the protection scope of the present invention.
Claims (6)
- 1. An entry extraction system for multi-record web pages, characterized by comprising: a record tree alignment module, for receiving the extracted record-region subtrees and aligning them using tag information and semantic information to obtain a hypertree, so that nodes with the same semantics correspond to the same node of the hypertree; a record content extraction module, which determines the position of the record content within a record using the text density and text density sum metrics; an entry output module, for outputting all entries in the record region and their semantic annotations in preorder traversal of the tree nodes; and a feedback framework, for checking, after the entries are extracted, whether the record region was located correctly based on the extraction result, relocating the record region and revising the entry extraction result if it was not, until the extraction result is correct or no new record region can be located and the process ends abnormally, and ending the extraction flow directly if it was; wherein the workflow of the feedback framework comprises the following steps: Step b1: extract the records; Step b2: extract the entries and check whether every record has a time entry and an author entry; if so, the extraction succeeds; if not, and only a few records fail the check, remove those records; if most records fail the check, determine a new record region; Step b3: repeat steps b1 and b2 until the extracted records satisfy the condition or no new record region can be selected.
- 2. The entry extraction system for multi-record web pages according to claim 1, characterized in that the record tree alignment module aligns the subtrees using the tags of the DOM tree's internal nodes and the text semantics of its leaf nodes.
- 3. The entry extraction system for multi-record web pages according to claim 1, characterized in that the workflow of the record content extraction module comprises the following steps: Step a1: align the subtrees to obtain a hypertree Ts, and select the set US of nodes in Ts that have no semantic annotation; Step a2: compute the text density of each node in US, and from it the text density sum of each subtree; Step a3: determine, from the text density sums, the minimum set of subtrees that contains the record content.
- 4. An entry extraction method for multi-record web pages, characterized by comprising the following steps: Step 1: the record tree alignment module receives the extracted record-region subtrees and aligns them using tag information and semantic information to obtain a hypertree, so that nodes with the same semantics correspond to the same node of the hypertree; Step 2: the record content extraction module determines the position of the record content within a record using the text density and text density sum metrics; Step 3: the entry output module outputs all entries in the record region and their semantic annotations in preorder traversal of the tree nodes; Step 4: the feedback framework checks, after the entries are extracted, whether the record region was located correctly based on the extraction result, relocates the record region and revises the entry extraction result if it was not, until the extraction result is correct or no new record region can be located and the process ends abnormally, and ends the extraction flow directly if it was; wherein the workflow of the feedback framework comprises the following steps: Step b1: extract the records; Step b2: extract the entries and check whether every record has a time entry and an author entry; if so, the extraction succeeds; if not, and only a few records fail the check, remove those records; if most records fail the check, determine a new record region; Step b3: repeat steps b1 and b2 until the extracted records satisfy the condition or no new record region can be selected.
- 5. The entry extraction method for multi-record web pages according to claim 4, characterized in that, in step 1, the record tree alignment module aligns the subtrees using the tags of the DOM tree's internal nodes and the text semantics of its leaf nodes.
- 6. The entry extraction method for multi-record web pages according to claim 4, characterized in that, in step 2, the workflow of the record content extraction module comprises the following steps: Step a1: align the subtrees to obtain a hypertree Ts, and select the set US of nodes in Ts that have no semantic annotation; Step a2: compute the text density of each node in US, and from it the text density sum of each subtree; Step a3: determine, from the text density sums, the minimum set of subtrees that contains the record content.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410503955.9A CN104217025B (en) | 2014-09-28 | 2014-09-28 | For the entry extraction system and method for more record webpages |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410503955.9A CN104217025B (en) | 2014-09-28 | 2014-09-28 | For the entry extraction system and method for more record webpages |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104217025A (en) | 2014-12-17 |
CN104217025B (en) | 2018-04-13 |
Family
ID=52098515
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410503955.9A Active CN104217025B (en) | 2014-09-28 | 2014-09-28 | For the entry extraction system and method for more record webpages |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104217025B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108073678B (en) * | 2017-11-06 | 2020-08-28 | 广东广业开元科技有限公司 | Document analysis processing method, system and device applied to big data analysis |
CN108959204B (en) * | 2018-06-22 | 2021-03-05 | 中国科学院计算技术研究所 | Internet financial project information extraction method and system |
CN112559929B (en) * | 2021-02-25 | 2021-05-07 | 中航信移动科技有限公司 | Method, electronic device and medium for extracting webpage target information |
CN113934914B (en) * | 2021-12-20 | 2022-03-01 | 成都橙视传媒科技股份公司 | Method for collecting batch encrypted data of news media |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102779170A (en) * | 2012-06-25 | 2012-11-14 | 北京奇虎科技有限公司 | System and method for identifying text floor of webpage |
CN103761312A (en) * | 2014-01-24 | 2014-04-30 | 福州大学 | Information extraction system and method for multi-recording webpage |
- 2014-09-28: CN application CN201410503955.9A (patent CN104217025B), status Active
Non-Patent Citations (3)
Title |
---|
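"Vision-based Web Data Records Extraction"; Liu Wei et al.; Workshop on the Web and Databases; 2006-12-31; pp. 20-25 * |
"基于DOM节点文本密度的网页核心块抽取算法研究" [Research on a Web Page Core Block Extraction Algorithm Based on DOM Node Text Density]; 孙飞 (Sun Fei); China Master's Theses Full-text Database, Information Science and Technology; 2012-07-15 (No. 7); thesis Sections 3.2 and 4.2 * |
"基于Hadoop的Web评论自动抽取方法研究" [Research on a Hadoop-Based Method for Automatic Extraction of Web Comments]; 颜佳伟 (Yan Jiawei); China Master's Theses Full-text Database, Information Science and Technology; 2013-12-15 (No. S2); thesis Sections 3.3-3.4, Fig. 3.6, Fig. 3.15 * |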
Also Published As
Publication number | Publication date |
---|---|
CN104217025A (en) | 2014-12-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101464905B (en) | | Web page information extraction system and method |
CN102831121B (en) | | Method and system for extracting webpage information |
CN103678412B (en) | | A kind of method and device of file retrieval |
CN104217025B (en) | | For the entry extraction system and method for more record webpages |
CN103870506B (en) | | Webpage information extraction method and system |
CN102591612B (en) | | General webpage text extraction method based on punctuation continuity and system thereof |
CN102662969B (en) | | Internet information object positioning method based on webpage structure semantic meaning |
CN102306177B (en) | | Multi-strategy combined ontology or instance matching method |
CN103136358B (en) | | A kind of method of Automatic Extraction forum data |
CN102270206A (en) | | Method and device for capturing valid web page contents |
CN104598462B (en) | | Extract the method and device of structural data |
US10789302B2 (en) | | Method and system for extracting user-specific content |
CN108334493A (en) | | A kind of topic knowledge point extraction method based on neural network |
CN105677638B (en) | | Web information abstracting method |
CN106055667A (en) | | Method for extracting core content of webpage based on text-tag density |
CN103699591A (en) | | Page body extraction method based on sample page |
CN101661468B (en) | | Method for extracting post metadata from forum post list pages |
CN109522452A (en) | | A kind of processing method of magnanimity semi-structured data |
CN102117289A (en) | | Method and device for extracting comment content from webpage |
CN103853770B (en) | | The method and system of model content in a kind of extraction forum Web pages |
CN104933032A (en) | | Method for extracting keywords of blog based on complex network |
CN103500216A (en) | | Method for extracting file information |
CN107436931A (en) | | web page text extracting method and device |
CN101727497A (en) | | Method for generating interactive document structure from web page document |
CN104615728B (en) | | A kind of webpage context extraction method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||