CN106557565A - A kind of text message extracting method based on website construction - Google Patents

Publication number
CN106557565A
Authority
CN
China
Prior art keywords
webpage
node
block
text
web page
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611027102.8A
Other languages
Chinese (zh)
Inventor
陈星
王洲
王一洲
戴远飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-11-22
Filing date
2016-11-22
Publication date
2017-04-05
Application filed by Fuzhou University
Priority to CN201611027102.8A
Publication of CN106557565A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention relates to a text information extraction method based on website construction, that is, on clustering the pages of a website by structure. The method combines the website level with the web page level: page clustering at the website level smooths out the structural differences between individual pages, and page segmentation together with node density features is then used to locate the body text of each class of pages and to derive the corresponding extraction rule. The invention can effectively improve the accuracy of web page body-text extraction.

Description

A kind of text message extracting method based on website construction
Technical Field
The present invention relates to the technical field of web page content extraction, and in particular to a text information extraction method based on website construction.
Background
With the rapid development of Web technology, a large amount of data is presented in HTML format, and Web pages have become the main carrier for publishing information as well as one of the main channels through which people obtain it. In recent years, with the development of big data, the importance of data has been widely recognized. Big data plays a major role in areas such as business decision-making and public opinion monitoring.
Extracting the important content of Web pages has therefore become a research hotspot in academia. However, web page content extraction faces two difficulties. First, besides the body text the user is interested in, a Web page also contains noise unrelated to the topic, such as navigation bars, advertisements, recommended links and copyright notices. Second, the widespread use of dynamic scripts and CSS keeps increasing both the structural differences between pages and the structural complexity of each page. To address these two difficulties, statistics-based body-text extraction and page-segmentation-based body-text extraction have been proposed; however, these methods may fail in special cases.
In fact, an HTML page is the combination of data stored in a back-end database and an HTML content template, and the pages within a website are mostly generated from the same set of content templates. The design of the pages can therefore be regarded as following relatively regular patterns.
Summary of the Invention
In view of this, the purpose of the present invention is to provide a text information extraction method based on website construction that can effectively improve the accuracy of web page body-text extraction.
The present invention is implemented with the following scheme: a text information extraction method based on website construction, comprising the following steps:
Step S1: crawl the web pages corresponding to the input URL set, remove the script, comment and style tags from each HTML page, and parse each page into a DOM tree;
Step S2: given the DOM tree of a web page, traverse the tree to extract the structural features of the page; compute the similarity between pages from these structural features; cluster the page set according to the similarity between pages; and finally generate a set of labeled classes together with the page features of each labeled class;
Step S3: for the pages of one class, segment each page into blocks, compute the node density features of each block, find the block that contains the body content and no noise, and take the features of that block as the extraction rule for the content block of that class of pages;
Step S4: extract the content of the pages in each labeled class according to the extraction rule; for a new page from the same source, determine the labeled class it belongs to from the page features of the labeled classes, then extract its content according to the extraction rule of that class.
Further, step S2 specifically comprises the following steps:
Step S21: take the path of each block node in the DOM tree as a structural feature of the page; traverse the page's DOM tree T_d breadth-first, extract the path set F.f, and form the 2-tuple F = <w, f> to represent the structural features of page w;
Step S22: convert the input group of web pages W into the page feature set D = {F_1, F_2, …, F_k};
Step S23: from the structural features F, compute the similarity between pages with the following formula:
$$\operatorname{sim}(F_i, F_j) = \frac{\operatorname{card}(F_i.f \cap F_j.f)}{\min\left(\operatorname{card}(F_i.f),\ \operatorname{card}(F_j.f)\right)}$$
where F_i and F_j denote the structural features of the i-th and j-th pages respectively, and card(·) is the number of elements in a set;
Step S24: cluster the group of pages with a hierarchical clustering algorithm based on the computed page similarity.
Further, the calculation of the node density features in step S3 comprises the following steps:
Step S31: let n ∈ B be a block node in the DOM tree T_d; the text density of n is defined as:
$$p_{text_i} = \frac{T_n}{T}$$
where T_n is the number of plain-text characters contained in block node n and T is the number of plain-text characters in the whole document represented by T_d; neither T_n nor T includes link text;
Step S32: let n ∈ B be a block node in the DOM tree T_d; the link density of n is defined as:
$$p_{link_i} = \frac{lN_n}{lN}$$
where lN_n is the number of links contained in node n and lN is the number of links in the whole document represented by T_d;
Step S33: let n ∈ B be a block node in the DOM tree T_d; the node text density of n is defined as:
$$p_{textl_i} = \frac{T_n}{lT_n}$$
where T_n is the number of plain-text characters of block node n and lT_n is its number of text characters; T_n excludes link text, whereas lT_n includes it;
Step S34: compute the combined density feature value H(p) of the block node:
H(p) = λ * p1 * p2 * p3;
where p, with p = p(b) and b ∈ B, is the path of the block node; p1, p2 and p3 are the density features of the node, with p1 = p_text, p2 = 1 - p_link and p3 = p_textl; block nodes are evaluated with H(p).
Compared with the prior art, the present invention has the following beneficial effects: it combines the website level with the web page level, uses page clustering at the website level to smooth out the differences between pages, and then uses page segmentation and node density features to locate the body text of each class of pages and derive the corresponding extraction rule. This method effectively improves the accuracy of web page body-text extraction.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the principle of an embodiment of the present invention.
Detailed Description
The present invention is further described below with reference to the accompanying drawing and an embodiment.
As shown in Fig. 1, this embodiment provides a text information extraction method based on website construction, comprising the following steps:
Step S1: crawl the web pages corresponding to the input URL set, remove the script, comment and style tags from each HTML page, and parse each page into a DOM tree;
Step S2: given the DOM tree of a web page, traverse the tree to extract the structural features of the page; compute the similarity between pages from these structural features; cluster the page set according to the similarity between pages; and finally generate a set of labeled classes together with the page features of each labeled class;
Step S3: for the pages of one class, segment each page into blocks, compute the node density features of each block, find the block that contains the body content and no noise, and take the features of that block as the extraction rule for the content block of that class of pages;
Step S4: extract the content of the pages in each labeled class according to the extraction rule; for a new page from the same source, determine the labeled class it belongs to from the page features of the labeled classes, then extract its content according to the extraction rule of that class.
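The cleaning and parsing part of step S1 can be sketched as follows. This is a minimal illustration in Python using only the standard-library `html.parser` module, not the prototype described later in this document (which was written in Java). The names `Node`, `CleanDomBuilder`, `parse_clean_dom` and `iter_nodes` are hypothetical, crawling is left out, and the list of void tags is an assumption made so that tags such as `<br>` do not unbalance the tree.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (assumed list, kept short for the sketch).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Tags whose whole content is dropped, as required by step S1.
SKIPPED_TAGS = {"script", "style"}


class Node:
    """A minimal DOM node: tag name, attributes, children and direct text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = []  # text chunks found directly under this node


class CleanDomBuilder(HTMLParser):
    """Builds a tree of Node objects while skipping script, style and comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self._stack = [self.root]
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        node = Node(tag, attrs, self._stack[-1])
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if self._skip_depth:
            return
        # Pop back to the nearest matching open tag, tolerating sloppy HTML.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._stack[-1].text.append(data.strip())

    # Comments are dropped: HTMLParser.handle_comment does nothing by default.


def parse_clean_dom(html):
    """Parse one HTML string into a cleaned tree and return its root node."""
    builder = CleanDomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def iter_nodes(node):
    """Yield the node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)
```

Calling `parse_clean_dom` on a page and walking the result with `iter_nodes` gives the element tree that the later steps (path extraction, density counting) operate on; a production system would more likely rely on a full HTML5 parser.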
In this embodiment, step S2 specifically comprises the following steps:
Step S21: take the path of each block node in the DOM tree as a structural feature of the page; traverse the page's DOM tree T_d breadth-first, extract the path set F.f, and form the 2-tuple F = <w, f> to represent the structural features of page w;
Step S22: convert the input group of web pages W into the page feature set D = {F_1, F_2, …, F_k};
Step S23: from the structural features F, compute the similarity between pages with the following formula:
$$\operatorname{sim}(F_i, F_j) = \frac{\operatorname{card}(F_i.f \cap F_j.f)}{\min\left(\operatorname{card}(F_i.f),\ \operatorname{card}(F_j.f)\right)}$$
where F_i and F_j denote the structural features of the i-th and j-th pages respectively, and card(·) is the number of elements in a set;
Step S24: cluster the group of pages with a hierarchical clustering algorithm based on the computed page similarity.
In this embodiment, the calculation of the node density features in step S3 comprises the following steps:
Step S31: let n ∈ B be a block node in the DOM tree T_d; the text density of n is defined as:
$$p_{text_i} = \frac{T_n}{T}$$
where T_n is the number of plain-text characters contained in block node n and T is the number of plain-text characters in the whole document represented by T_d; neither T_n nor T includes link text;
Step S32: let n ∈ B be a block node in the DOM tree T_d; the link density of n is defined as:
$$p_{link_i} = \frac{lN_n}{lN}$$
where lN_n is the number of links contained in node n and lN is the number of links in the whole document represented by T_d;
Step S33: let n ∈ B be a block node in the DOM tree T_d; the node text density of n is defined as:
$$p_{textl_i} = \frac{T_n}{lT_n}$$
where T_n is the number of plain-text characters of block node n and lT_n is its number of text characters; T_n excludes link text, whereas lT_n includes it;
Step S34: compute the combined density feature value H(p) of the block node:
H(p) = λ * p1 * p2 * p3;
where p, with p = p(b) and b ∈ B, is the path of the block node; p1, p2 and p3 are the density features of the node, with p1 = p_text, p2 = 1 - p_link and p3 = p_textl; block nodes are evaluated with H(p).
Preferably, in this embodiment, the following four definitions are introduced to make the structural features of a web page easy to express.
Definition 1. Each Web page can be represented as a DOM tree T_d. T_d is a directed graph <V, E>, where V is the set of vertices, V = {v | v ∈ the HTML tag set Tag}, and E is the set of directed edges, E = {<u, v> | u, v ∈ V, where u is called the parent vertex of v, v is called a child vertex of u, and in the HTML structure the tag corresponding to v is enclosed by the tag corresponding to u}.
Definition 2. A DOM tree T_d can be represented as a set of page blocks B = {b_i | b_i ∈ V, and the HTML tag corresponding to node b_i is div or table}; such a node is called a block node.
Definition 3. Let T_d be a DOM tree rooted at v_0. For any node v ∈ V, v_0 i v_1 i … v_k i is the sequence of nodes in T_d from v_0 to v_k, where parent(v_j) = v_{j-1} (1 <= j <= k), i denotes the position of a node among its sibling nodes, and v_k = v. Then v_0 i v_1 i … v_k i is called the path of node v, denoted p(v). For example, "body1/div3/div2" is the path of a node.
Definition 4. Given a web page w, its structural features can be represented as the 2-tuple F = <w, f>, where f is a set of paths, f = {p_1, p_2, …, p_n | p_i = p(b_i), b_i ∈ B}.
To compute the similarity between pages quickly, this embodiment takes the path of each block node in the DOM tree as a structural feature of the page. The page's DOM tree T_d is traversed breadth-first, the path set F.f is extracted, and the 2-tuple F = <w, f> is formed to represent the structural features of page w. Finally, the input group of pages W is converted into the page feature set D = {F_1, F_2, …, F_k}.
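The breadth-first extraction of block-node paths can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the patented implementation: `Node` and `block_paths` are hypothetical names, and the position index in a path is assumed to count siblings with the same tag (so the second `div` child of `body` is `body1/div2`), which is one reading of Definition 3 that is consistent with the example "body1/div3/div2".

```python
from collections import deque
from dataclasses import dataclass, field

# Block nodes per Definition 2: nodes whose tag is div or table.
BLOCK_TAGS = {"div", "table"}


@dataclass
class Node:
    """A minimal DOM node for the sketch: a tag name and child nodes."""
    tag: str
    children: list = field(default_factory=list)


def block_paths(root):
    """Traverse the tree breadth-first and return the path set f of its block nodes."""
    paths = set()
    queue = deque([(root, f"{root.tag}1")])
    while queue:
        node, path = queue.popleft()
        if node.tag in BLOCK_TAGS:
            paths.add(path)
        seen = {}  # per-tag sibling counter (assumed indexing scheme)
        for child in node.children:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            queue.append((child, f"{path}/{child.tag}{seen[child.tag]}"))
    return paths
```

For a `body` holding a `div`, a `p`, and a second `div` that itself contains a `div` and a `table`, the function returns the four paths `body1/div1`, `body1/div2`, `body1/div2/div1` and `body1/div2/table1`; the `p` node is traversed but contributes no path because it is not a block node.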
In this embodiment, the similarity between pages is computed from the structural features F. The similarity function of two pages is defined as:
$$\operatorname{sim}(F_i, F_j) = \frac{\operatorname{card}(F_i.f \cap F_j.f)}{\min\left(\operatorname{card}(F_i.f),\ \operatorname{card}(F_j.f)\right)}$$
where F_i and F_j denote the structural features of the i-th and j-th pages respectively, and card(·) is the number of elements in a set. For convenience in describing the algorithm, the features of a labeled class in the clustering result are defined here.
Definition 5. Given a set of structurally identical pages W = {w_1, w_2, …, w_n | sim(<w_i, f_i>, <w_j, f_j>) > 0.82, 0 < i, j <= n} and the page structural features f = {f | f = F_i.f_i and F_i.w_i ∈ W}, the features of each labeled class in the clustering result can be represented as the triple C = <c, f, W>, where c is the label of the class, a positive integer.
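The similarity function and the 0.82 threshold of Definition 5 can be illustrated with a few lines of Python. The function name `sim`, the constant name and the sample path sets are hypothetical, and returning 0.0 for a page without any block node is an assumption added to avoid a division by zero; the source does not address that case.

```python
SAME_STRUCTURE_THRESHOLD = 0.82  # threshold used in Definition 5


def sim(f_i, f_j):
    """Shared block paths divided by the size of the smaller path set."""
    if not f_i or not f_j:
        return 0.0  # assumed convention for pages without block nodes
    return len(f_i & f_j) / min(len(f_i), len(f_j))


# Hypothetical path sets of three pages.
article_a = {"body1/div1", "body1/div2", "body1/div2/div1"}
article_b = article_a | {"body1/div4"}
list_page = {"body1/div1", "body1/div2", "body1/div3", "body1/div3/table1"}

same_template = sim(article_a, article_b) > SAME_STRUCTURE_THRESHOLD
different_template = sim(article_a, list_page) > SAME_STRUCTURE_THRESHOLD
```

Because the denominator is the smaller of the two set sizes, a page whose block paths are all contained in another page's path set scores 1.0 even when the larger page has extra blocks, as with `article_a` and `article_b`. The two pages that share only two of three paths score about 0.67 and fall below the threshold.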
This embodiment clusters the group of pages with a hierarchical clustering algorithm based on the computed page similarity, as in Algorithm 1 in the following table.
Algorithm 1 yields the labeled-class set M of the pages, in which each element represents the features of one labeled class.
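The table holding Algorithm 1 is not reproduced in this text, so the following Python sketch is not that algorithm. It is one plausible agglomerative clustering under explicit assumptions: clusters are merged by complete linkage (the weakest pairwise similarity must exceed the threshold), which is chosen here because Definition 5 requires every pair of pages in a class to have a similarity above 0.82. The names `linkage` and `cluster_pages` are hypothetical.

```python
from itertools import combinations

THRESHOLD = 0.82  # similarity threshold taken from Definition 5


def sim(f_i, f_j):
    """Page similarity: shared paths over the size of the smaller path set."""
    if not f_i or not f_j:
        return 0.0
    return len(f_i & f_j) / min(len(f_i), len(f_j))


def linkage(c1, c2, features):
    """Complete linkage: the weakest pairwise similarity between two clusters."""
    return min(sim(features[i], features[j]) for i in c1 for j in c2)


def cluster_pages(features, threshold=THRESHOLD):
    """Group page indices bottom-up; stop when no pair of clusters is similar enough."""
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > 1:
        score, (a, b) = max(
            ((linkage(clusters[a], clusters[b], features), (a, b))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[0],
        )
        if score <= threshold:
            break
        clusters[a].extend(clusters[b])  # a < b, so deleting b keeps a valid
        del clusters[b]
    return clusters
```

Each returned list of page indices corresponds to one labeled class; the class label c of Definition 5 could simply be the position of the list plus one, and the class feature f the path sets of its member pages.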
In this embodiment, once the clustering result has been obtained, a content extraction rule must still be derived for the pages of each labeled class. This embodiment locates the content of a page by combining statistics-based node density features with page segmentation. Body text is usually distributed in a concentrated way within a page, so the text density of the node holding the body text is higher than that of other nodes. Judging by how an HTML file is rendered in a browser, a page consists of several blocks, which are delimited by HTML container tags (the <div> and <table> tags). The HTML page is therefore cut into a block set B, and the content block, which contains the complete body text and no noise, is then selected from B.
This embodiment uses density features to decide whether a block node is the content block; three density definitions are given below:
Definition 6. Let n ∈ B be a block node in the DOM tree T_d; the text density of n is defined as:
$$p_{text_i} = \frac{T_n}{T}$$
where T_n is the number of plain-text characters (excluding link text) contained in block node n, and T is the number of plain-text characters (excluding link text) in the whole document represented by T_d.
p_text reflects how concentrated the text content is in a given block node relative to the whole page. Observation and experiment show that the larger p_text is, the more likely the node is to contain the content block being sought.
Definition 7. Let n ∈ B be a block node in the DOM tree T_d; the link density of n is defined as:
$$p_{link_i} = \frac{lN_n}{lN}$$
where lN_n is the number of links contained in node n and lN is the number of links in the whole document represented by T_d.
p_link reflects how concentrated the links are in a given block node relative to the whole page. Observation and experiment show that the larger p_link is, the more likely the block node is to contain noise.
Definition 8. Let n ∈ B be a block node in the DOM tree T_d; the node text density of n is defined as:
$$p_{textl_i} = \frac{T_n}{lT_n}$$
where T_n is the number of plain-text characters (excluding link text) of block node n and lT_n is the number of text characters (including link text) of block node n.
p_textl reflects the concentration of plain text within a node. Observation and experiment show that the larger p_textl is, the more likely the node is to contain the main text block being sought.
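The three densities can be computed from a parsed page roughly as follows. This Python sketch rests on assumptions: `Node`, `count_text` and `block_densities` are hypothetical names, a link is taken to be an `a` element, characters are counted with `len` (whitespace included), and a zero denominator is mapped to a density of 0.0, a case the definitions do not cover.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A minimal DOM node for the sketch: tag, direct text and child nodes."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def count_text(node, inside_link=False):
    """Return (plain_chars, link_chars, link_count) for the subtree of node."""
    is_link = node.tag == "a"
    inside = inside_link or is_link
    plain = 0 if inside else len(node.text)
    linked = len(node.text) if inside else 0
    links = 1 if is_link else 0
    for child in node.children:
        p, l, k = count_text(child, inside)
        plain, linked, links = plain + p, linked + l, links + k
    return plain, linked, links


def block_densities(block, document):
    """Compute p_text, p_link and p_textl of one block node within a document."""
    t_n, link_chars, ln_n = count_text(block)   # T_n, link text, lN_n
    t, _, ln = count_text(document)             # T, lN
    lt_n = t_n + link_chars                     # lT_n includes link text
    return {
        "p_text": t_n / t if t else 0.0,
        "p_link": ln_n / ln if ln else 0.0,
        "p_textl": t_n / lt_n if lt_n else 0.0,
    }
```

On a toy page with a navigation block of three links and an article block of 90 plain characters plus one 4-character link, the article block gets p_text = 1.0, p_link = 0.25 and p_textl = 90/94, while the navigation block gets p_text = 0.0 and p_link = 0.75, which matches the intuition behind Definitions 6 to 8.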
With the three density measures given, the combined density feature value H(p) of a block node is defined as:
H(p) = λ * p1 * p2 * p3
where p, with p = p(b) and b ∈ B, is the path of a block node, and p1, p2 and p3 are the density features of the node, with p1 = p_text, p2 = 1 - p_link and p3 = p_textl. Block nodes are evaluated with H(p).
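A small worked example shows how H(p) separates a content block from navigation and footer blocks. The Python sketch below uses made-up counts, the hypothetical names `BlockStats` and `combined_density`, and assumes λ = 1.0 because the source does not state a value for λ; zero denominators are again mapped to 0.0 as an assumption.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    """Raw counts for one block node (all values here are illustrative)."""
    path: str
    plain_chars: int  # T_n: plain-text characters, link text excluded
    all_chars: int    # lT_n: text characters, link text included
    links: int        # lN_n: number of links in the block


def combined_density(block, total_plain_chars, total_links, lam=1.0):
    """H(p) = lam * p_text * (1 - p_link) * p_textl; lam = 1.0 is an assumption."""
    p_text = block.plain_chars / total_plain_chars if total_plain_chars else 0.0
    p_link = block.links / total_links if total_links else 0.0
    p_textl = block.plain_chars / block.all_chars if block.all_chars else 0.0
    return lam * p_text * (1.0 - p_link) * p_textl


# Hypothetical page: T = 1000 plain-text characters and lN = 30 links in total.
blocks = [
    BlockStats("body1/div1", plain_chars=20, all_chars=220, links=25),   # navigation
    BlockStats("body1/div2", plain_chars=900, all_chars=950, links=3),   # article
    BlockStats("body1/div3", plain_chars=80, all_chars=130, links=2),    # footer
]
best = max(blocks, key=lambda b: combined_density(b, 1000, 30))
```

With these counts the article block scores 0.9 * 0.9 * (900/950), far above the footer and the navigation block, so `best` is the block at `body1/div2`. Because the three factors are multiplied, a block must do well on all of them to be selected.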
Algorithm 1 yields the labeled-class set M. This embodiment assumes that the content block is in the same position in all pages of one labeled class, so the content block is selected by density features within each class of pages, and the features of that content block are then taken as the extraction rule for the content block of that class, as shown in Algorithm 2. A block in a page can be characterized in three ways: the value of the block's class attribute, the value of its id attribute, and the path of the block. For convenience in describing the algorithm, this embodiment defines the extraction rule of each labeled class in the clustering result, i.e. the features of the content block in each labeled class.
Definition 9. Given the label c of a labeled class in the clustering result and the content block b ∈ B of the pages of that class, the extraction rule of that class is defined as the 4-tuple L(b) = <c, class, id, p>, where class is the value of the class attribute of the content block of that class of pages, id is the value of the id attribute of that content block, and p is the tag path of that content block, p = p(b) with b ∈ B.
Applying Algorithm 2 to the clustering result yields the extraction rule L(b) of the content block of each labeled class. The extraction rule set N is then used to process the Web pages of each labeled class in the set M and extract data from them. L(b) records three features of the content block: the value of its class attribute, the value of its id attribute, and its path. If L.class and L.id have values in L(b), the content block is extracted with the class and id attribute values; if L.class and L.id are empty, the content block is extracted with the block path L.p. Finally, the page content is extracted from the content block.
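Applying an extraction rule L(b) to a page can be sketched as follows. Algorithm 2 itself is not reproduced in this text, so this Python sketch only illustrates the lookup order described above. `Node`, `Rule`, `walk`, `matches`, `inner_text` and `extract` are hypothetical names, paths use the same-tag sibling index assumed for illustration, and matching on whichever of class and id is present (when only one has a value) is an assumption, since the source only covers the cases where both are set or both are empty.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A minimal DOM node for the sketch."""
    tag: str
    attrs: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)


@dataclass
class Rule:
    """Extraction rule L(b) = <c, class, id, p> from Definition 9."""
    label: int
    css_class: Optional[str]
    node_id: Optional[str]
    path: str


def walk(node, path):
    """Yield (node, path) pairs for the subtree, indexing siblings per tag."""
    yield node, path
    counts = {}
    for child in node.children:
        counts[child.tag] = counts.get(child.tag, 0) + 1
        yield from walk(child, f"{path}/{child.tag}{counts[child.tag]}")


def matches(node, path, rule):
    """Prefer the class/id attributes; fall back to the path if both are empty."""
    if rule.css_class or rule.node_id:
        if rule.css_class and node.attrs.get("class") != rule.css_class:
            return False
        if rule.node_id and node.attrs.get("id") != rule.node_id:
            return False
        return True
    return path == rule.path


def inner_text(node):
    """Concatenate the text of a node and of all its descendants."""
    parts = [node.text] + [inner_text(child) for child in node.children]
    return " ".join(part for part in parts if part)


def extract(root, rule):
    """Return the text of the first node matching the rule, or None."""
    for node, path in walk(root, f"{root.tag}1"):
        if matches(node, path, rule):
            return inner_text(node)
    return None
```

A rule carrying class and id values finds the content block even if the block moves within the page, whereas the path fallback depends on the block staying at the same position, which is the assumption this embodiment makes for pages of the same labeled class.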
In particular, to verify the effectiveness of the above method, this embodiment implements a corresponding prototype system in Java on the Eclipse platform. The input of the prototype system is a given group of Web pages, and the output is the content of those pages. The data set used in the experiments consists of 1000 web pages from 5 websites. It was collected online from the Internet in a semi-manual way (seed URLs + crawler + manual screening) from NetEase, Sohu, Sina, People's Net and Xinhuanet (www.xinhuanet.com); the pages come from different topic categories of these websites. This data set takes part in the page clustering process.
The experiments in this embodiment are of two kinds. The first is content extraction based on page segmentation and block node density features: the block with the higher combined density in the page is chosen as the content block and its content is extracted; the results are shown in Table 1. The second is content extraction based on page clustering, page segmentation and block node density features; the results are shown in Table 2.
Table 1. Web page content extraction results based on page segmentation and block node density features
Data set        Total pages    Accuracy
NetEase         200            86%
Sohu            200            98.5%
Sina            200            96%
Xinhuanet       200            99%
People's Net    200            100%
Table 2. Web page content extraction results based on page clustering
Data set        Total pages    Accuracy
NetEase         200            91%
Sohu            200            100%
Sina            200            100%
Xinhuanet       200            100%
People's Net    200            100%
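As a quick summary of the two tables: each site contributes the same number of pages (200), so the overall accuracy over the 1000 pages equals the plain mean of the five per-site accuracies. The averages below are derived here from the table values and are not reported in the source.

```latex
\bar{A}_{\text{Table 1}} = \frac{86 + 98.5 + 96 + 99 + 100}{5} = \frac{479.5}{5} = 95.9\%
\qquad
\bar{A}_{\text{Table 2}} = \frac{91 + 100 + 100 + 100 + 100}{5} = \frac{491}{5} = 98.2\%
```

The overall gain is therefore 2.3 percentage points, contributed by NetEase (+5), Sina (+4), Sohu (+1.5) and Xinhuanet (+1).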
The results in Table 1 show that content extraction based only on page segmentation and block node density features does not work well for some websites. We examined the pages on which extraction went wrong and identified the following causes:
(1) No block in the page contains the complete content: besides the <div> and <table> block node tags, page content is also held in <center> and <text> tags. Because only nodes tagged <div> or <table> are defined as block nodes here, large parts of the data to be extracted are missed, which lowers the accuracy.
(2) No block in the page contains only the complete content: this error occurs mainly in the NetEase data set. In some of the 200 NetEase pages we selected there is no precise content block; the content is mixed with the recommended-links block in the same block, so the extracted content block contains impurities, which lowers the accuracy.
(3) The wrong content block is chosen: when the body text of a page is short and the page contains a lot of noise, for example long visitor comments, the system picks the wrong block. This occurs mainly in pages from Sina, Sohu and Xinhuanet.
The experimental results show that the accuracy of content extraction based on page segmentation and block node density features is relatively unstable; when content is extracted after page clustering, the accuracy improves considerably and these three kinds of errors are largely eliminated.
The above is only a preferred embodiment of the present invention; all equivalent changes and modifications made within the scope of the patent claims of the present invention shall fall within the scope of the present invention.

Claims (3)

1. A text information extraction method based on website construction, characterized by comprising the following steps:
Step S1: crawl the web pages corresponding to the input URL set, remove the script, comment and style tags from each HTML page, and parse each page into a DOM tree;
Step S2: given the DOM tree of a web page, traverse the tree to extract the structural features of the page; compute the similarity between pages from these structural features; cluster the page set according to the similarity between pages; and finally generate a set of labeled classes together with the page features of each labeled class;
Step S3: for the pages of one class, segment each page into blocks, compute the node density features of each block, find the block that contains the body content and no noise, and take the features of that block as the extraction rule for the content block of that class of pages;
Step S4: extract the content of the pages in each labeled class according to the extraction rule; for a new page from the same source, determine the labeled class it belongs to from the page features of the labeled classes, then extract its content according to the extraction rule of that class.
2. The text information extraction method based on website construction according to claim 1, characterized in that step S2 specifically comprises the following steps:
Step S21: take the path of each block node in the DOM tree as a structural feature of the page; traverse the page's DOM tree T_d breadth-first, extract the path set F.f, and form the 2-tuple F = <w, f> to represent the structural features of page w;
Step S22: convert the input group of web pages W into the page feature set D = {F_1, F_2, …, F_k};
Step S23: from the structural features F, compute the similarity between pages with the following formula:
$$\operatorname{sim}(F_i, F_j) = \frac{\operatorname{card}(F_i.f \cap F_j.f)}{\min\left(\operatorname{card}(F_i.f),\ \operatorname{card}(F_j.f)\right)}$$
where F_i and F_j denote the structural features of the i-th and j-th pages respectively;
Step S24: cluster the group of pages with a hierarchical clustering algorithm based on the computed page similarity.
3. The text information extraction method based on website construction according to claim 1, characterized in that the calculation of the node density features in step S3 comprises the following steps:
Step S31: let n ∈ B be a block node in the DOM tree T_d; the text density of n is defined as:
$$p_{text_i} = \frac{T_n}{T}$$
where T_n is the number of plain-text characters contained in block node n and T is the number of plain-text characters in the whole document represented by T_d; neither T_n nor T includes link text;
Step S32: let n ∈ B be a block node in the DOM tree T_d; the link density of n is defined as:
$$p_{link_i} = \frac{lN_n}{lN}$$
where lN_n is the number of links contained in node n and lN is the number of links in the whole document represented by T_d;
Step S33: let n ∈ B be a block node in the DOM tree T_d; the node text density of n is defined as:
$$p_{textl_i} = \frac{T_n}{lT_n}$$
where T_n is the number of plain-text characters of block node n and lT_n is its number of text characters; T_n excludes link text, whereas lT_n includes it;
Step S34: compute the combined density feature value H(p) of the block node:
H(p) = λ * p1 * p2 * p3;
where p, with p = p(b) and b ∈ B, is the path of the block node; p1, p2 and p3 are the density features of the node, with p1 = p_text, p2 = 1 - p_link and p3 = p_textl; block nodes are evaluated with H(p).
Priority Applications (1)

Application Number    Publication Number    Priority Date    Filing Date    Title
CN201611027102.8A     CN106557565A (en)     2016-11-22       2016-11-22     A kind of text message extracting method based on website construction

Publications (1)

Publication Number    Publication Date
CN106557565A          2017-04-05

Family ID: 58444640
Country Status: CN, CN106557565A (en), Pending

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391678A (en) * 2017-07-21 2017-11-24 福州大学 Web page content information extracting method based on cluster
CN107590288A (en) * 2017-10-11 2018-01-16 百度在线网络技术(北京)有限公司 Method and apparatus for extracting webpage picture and text block
CN107894974A (en) * 2017-11-02 2018-04-10 华南农业大学 Webpage context extraction method based on tag path and text punctuate than Fusion Features
CN108520007A (en) * 2018-03-15 2018-09-11 江河瑞通(北京)技术有限公司 Web page information extracting method, storage medium and computer equipment
CN108694193A (en) * 2017-04-07 2018-10-23 北京国双科技有限公司 The judgment method and device of type of webpage
CN108694192A (en) * 2017-04-07 2018-10-23 北京国双科技有限公司 The judgment method and device of type of webpage
CN109145162A (en) * 2018-08-21 2019-01-04 慧安金科(北京)科技有限公司 For determining the method, equipment and computer readable storage medium of data similarity
CN109325204A (en) * 2018-09-13 2019-02-12 武汉伯远生物科技有限公司 Web page contents extraction method
CN110020038A (en) * 2017-08-01 2019-07-16 阿里巴巴集团控股有限公司 Webpage information extracting method, device, system and electronic equipment
CN110377796A (en) * 2019-07-25 2019-10-25 中南民族大学 Text extracting method, device, equipment and storage medium based on dom tree
CN110851606A (en) * 2019-11-18 2020-02-28 杭州安恒信息技术股份有限公司 Website clustering method and system based on webpage structure similarity
CN111625749A (en) * 2020-06-01 2020-09-04 深圳市小满科技有限公司 Method, device, equipment and medium for extracting detail page information of participating company website
CN113343140A (en) * 2020-03-03 2021-09-03 四川大学 Method for automatically extracting webpage text content based on neo4j graphic database

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663023A (en) * 2012-03-22 2012-09-12 浙江盘石信息技术有限公司 Implementation method for extracting web content
US20140164342A1 (en) * 2012-12-11 2014-06-12 Human Threading Corporation Human threading search engine

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663023A (en) * 2012-03-22 2012-09-12 浙江盘石信息技术有限公司 Implementation method for extracting web content
US20140164342A1 (en) * 2012-12-11 2014-06-12 Human Threading Corporation Human threading search engine

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
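SACHINDRA JOSHI ET AL: "A Bag of Paths Model for Measuring Structural Similarity in Web Documents", 《THE 9TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING》 *
张乃洲 (Zhang Naizhou) et al.: "A Web page mining method based on node density segmentation and label propagation", 《Chinese Journal of Computers》 *
邱韬奋 (Qiu Taofen): "Research on Web information extraction technology based on clustering algorithms", 《China Master's Theses Full-text Database, Information Science and Technology》 *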

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108694193A (en) * 2017-04-07 2018-10-23 北京国双科技有限公司 The judgment method and device of type of webpage
CN108694192A (en) * 2017-04-07 2018-10-23 北京国双科技有限公司 The judgment method and device of type of webpage
CN108694192B (en) * 2017-04-07 2021-05-14 北京国双科技有限公司 Webpage type judging method and device
CN107391678A (en) * 2017-07-21 2017-11-24 福州大学 Web page content information extracting method based on cluster
CN110020038A (en) * 2017-08-01 2019-07-16 阿里巴巴集团控股有限公司 Webpage information extracting method, device, system and electronic equipment
US10755091B2 (en) 2017-10-11 2020-08-25 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for retrieving image-text block from web page
CN107590288A (en) * 2017-10-11 2018-01-16 百度在线网络技术(北京)有限公司 Method and apparatus for extracting webpage picture and text block
CN107590288B (en) * 2017-10-11 2020-09-18 百度在线网络技术(北京)有限公司 Method and device for extracting webpage image-text blocks
CN107894974A (en) * 2017-11-02 2018-04-10 华南农业大学 Webpage context extraction method based on tag path and text punctuate than Fusion Features
CN108520007A (en) * 2018-03-15 2018-09-11 江河瑞通(北京)技术有限公司 Web page information extracting method, storage medium and computer equipment
CN108520007B (en) * 2018-03-15 2021-09-28 江河瑞通(北京)技术有限公司 Web page information extracting method, storage medium and computer equipment
CN109145162B (en) * 2018-08-21 2021-06-15 慧安金科(北京)科技有限公司 Method, apparatus, and computer-readable storage medium for determining data similarity
CN109145162A (en) * 2018-08-21 2019-01-04 慧安金科(北京)科技有限公司 For determining the method, equipment and computer readable storage medium of data similarity
CN109325204A (en) * 2018-09-13 2019-02-12 武汉伯远生物科技有限公司 Web page contents extraction method
CN109325204B (en) * 2018-09-13 2022-01-07 武汉伯远生物科技有限公司 Automatic extraction method of webpage content
CN110377796A (en) * 2019-07-25 2019-10-25 中南民族大学 Text extracting method, device, equipment and storage medium based on dom tree
CN110377796B (en) * 2019-07-25 2021-11-02 中南民族大学 Text extraction method, device and equipment based on DOM tree and storage medium
CN110851606A (en) * 2019-11-18 2020-02-28 杭州安恒信息技术股份有限公司 Website clustering method and system based on webpage structure similarity
CN113343140A (en) * 2020-03-03 2021-09-03 四川大学 Method for automatically extracting webpage text content based on neo4j graphic database
CN113343140B (en) * 2020-03-03 2022-12-13 四川大学 Method for automatically extracting webpage text content based on neo4j graphic database
CN111625749A (en) * 2020-06-01 2020-09-04 深圳市小满科技有限公司 Method, device, equipment and medium for extracting detail page information of participating company website
CN111625749B (en) * 2020-06-01 2023-08-11 深圳市小满科技有限公司 Method, device, equipment and medium for extracting website detail page information of participant company

Similar Documents

Publication Publication Date Title
CN106557565A (en) A kind of text message extracting method based on website construction
CN103955529B (en) A kind of internet information search polymerize rendering method
Buttler et al. A fully automated object extraction system for the World Wide Web
Zheng et al. Template-independent news extraction based on visual consistency
CN103559199B (en) Method for abstracting web page information and device
CN103927397B (en) Recognition method for Web page link blocks based on block tree
CN101727498A (en) Automatic extraction method of web page information based on WEB structure
US20200004792A1 (en) Automated website data collection method
CN107391678A (en) Web page content information extracting method based on cluster
CN105677638B (en) Web information abstracting method
CN106503211A (en) Information issues the method that the mobile edition of class website is automatically generated
CN102184189A (en) Webpage core block determining method based on DOM (Document Object Model) node text density
CN101515287A (en) Automatic generating method of wrapper of complex page
Ji et al. Tag tree template for Web information and schema extraction
CN108733813A (en) Information extracting method, system towards BBS forum Web pages contents and medium
CN107515849A (en) It is a kind of into word judgment model generating method, new word discovery method and device
CN109657114B (en) Method for extracting webpage semi-structured data
CN105740355B (en) Webpage context extraction method and device based on aggregation text density
CN112667940A (en) Webpage text extraction method based on deep learning
CN103064966B (en) A kind of method extracting rule noise from unirecord webpage
US20120221545A1 (en) Isolating desired content, metadata, or both from social media
CN107145591B (en) Title-based webpage effective metadata content extraction method
CN102262658A (en) Method for extracting web data from bottom to top based on entity
CN108255895A (en) A kind of web data acquisition methods using context environmental rule
Aslam et al. Web-AM: An efficient boilerplate removal algorithm for Web articles

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170405