CN103530429A - Webpage content extracting method - Google Patents

Webpage content extracting method

Info

Publication number
CN103530429A
Authority
CN
China
Prior art keywords
node
label
web page
text
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310538575.4A
Other languages
Chinese (zh)
Other versions
CN103530429B (en)
Inventor
涂波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongsou Cloud Business Network Technology Co ltd
Original Assignee
Beijing Zhongsou Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongsou Network Technology Co ltd
Priority to CN201310538575.4A
Publication of CN103530429A
Application granted
Publication of CN103530429B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents

Abstract

The invention provides a method for extracting webpage content. The method comprises the following steps: I, preprocessing the webpage; II, finding the longest string in the webpage; III, building a DOM tree and using it to find the node corresponding to the longest string; IV, determining a start node and an end node according to the tag of the node corresponding to the longest string; V, checking and filtering the start node and the end node; and VI, outputting the text in the filtered start node and end node. The method addresses the shortcomings of template-based and block-based techniques when applied to news content extraction, finds a seed paragraph based on the longest string, and improves the efficiency and accuracy of webpage content extraction.

Description

A method for extracting web page body text
Technical field
The present invention relates to a method in the computer field, and in particular to a method that extracts the body content of news web pages by searching for the "longest string" to locate a meaningful node.
Background
In the field of news (or information) search, body-text extraction is an essential step, and the quality of the extraction determines the quality of news search and the user experience.
Current body-text extraction methods take many forms and fall into two broad classes according to whether templates are used: template-based (or wrapper-based) extraction and template-free extraction.
Template-based extraction: a template is defined first, and a program is then written to parse and execute the template and obtain the data. According to how the template is generated, this class is further divided into manual template extraction and automatic template extraction. In manual template extraction, templates are written by hand for each target site; a template may use regular-expression matching or simple string matching. In automatic template extraction, a machine learning algorithm first obtains a portion of the web pages from the target site for training and derives a template, and the program then uses the template to extract the data.
Template-free extraction is mostly based on statistics and learning. The main algorithms at present are rule-based, block-based, vision-based, and so on. A representative example is Microsoft's vision-based page segmentation algorithm, which determines the main semantic block of a page in three steps: page block extraction, separator extraction, and semantic block reconstruction.
The drawback of hand-written templates is that writing them consumes a great deal of labor, and as the target sites change, maintaining the templates is also very costly. The drawback of automatic templates is that the algorithms are complex and the target sites must also be monitored periodically so that the templates can be kept up to date. Whether templates are produced manually or automatically, both approaches assume that the site's data is generated from templates. For some large sites this is largely unproblematic, except that different sections may use different templates. Many small and medium-sized sites, however, are not well templated, so template-based extraction recovers only most of the information and is more likely to include junk. The vision-based page segmentation algorithm has complex rules and low performance, which makes it unsuitable for a news search engine.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides an effective method for extracting Internet news content. Addressing the shortcomings of template-based and block-based extraction when applied to news content, the method finds a seed paragraph based on the "longest string" and extracts the news content with a tag-clustering algorithm, avoiding the drawbacks of hand-written rules and templates.
The solution adopted to achieve the above purpose is as follows:
A method for extracting web page body text, the improvement being that the method comprises the following steps:
I. preprocessing the web page;
II. finding the longest string in the web page;
III. creating a DOM tree and using it to find the node corresponding to the longest string;
IV. determining a start node and an end node according to the tag of the node corresponding to the longest string;
V. checking and filtering the start node and the end node;
VI. outputting the text in the filtered start node and end node.
Further, step I comprises: determining whether the web page contains ignorable tags such as comments, "script", and "meta"; and obtaining the ignorable tags together with their content and deleting them.
Further, step II comprises the following steps: 1) searching the page text line by line for the longest string in the web page;
2) obtaining and recording the length of the longest string, and further processing the obtained longest string: when the longest string lies inside particular tags, the obtained length is increased or decreased.
Further, in step III a DOM tree is created for the web page, the information of all nodes is obtained from the DOM tree, the nodes are stored in an array, and the array is searched for the node that contains the longest string;
The node information includes the word count, the Chinese character count, and the link count.
Further, in step IV, similar nodes are found by tag clustering according to the tag of the longest-string node stored in the array, thereby determining the start node and the end node.
Further, the tag clustering method comprises searching forward, backward, and in both directions on the tag features.
Further, the start node and end node selected in step IV are checked and filtered to obtain the remaining nodes, and the remaining nodes and their content are output.
Further, the body text is located on the basis that the paragraph with the most continuous text is part of the body: a seed node is found in the DOM tree and extended forward and backward from that seed node to find the whole body region.
Compared with the prior art, the invention has the following advantages:
(1) The method finds a seed paragraph based on the longest string and extracts news content with a tag-clustering algorithm. This addresses the shortcomings of template-based and block-based extraction for news content and avoids the drawbacks of hand-written rules and templates.
(2) The method does not need to build a DOM tree in the early stage of looking for the "longest string". The longest string is searched for directly in the page text, line by line; where there is no natural line break, processing of the current line is forcibly terminated. This improves efficiency, and accuracy is high.
(3) The method analyzes a single web page and needs no template, which saves a great deal of manual work. The seed-string search algorithm is simple and analysis is efficient. The method is also flexible, which makes abnormal cases easier to handle.
(4) The method extracts news page content from a single page by template-free tag clustering, and its results are more accurate. It provides high-quality data for subsequent fingerprint computation, content clustering, and news-event clustering.
(5) The method locates the body-text region relatively simply and quickly, and because most of the work is done on the DOM tree it is flexible: filtering rules and head and tail locating rules are easy to add. The method applies not only to Chinese but also to Western languages.
Brief description of the drawings
Fig. 1 is a flow chart of the web page body text extraction method.
Detailed description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawing.
A web page contains information such as the article title, source, publication time, body text, and author, and may also contain a large amount of advertising and other junk. In news pages the "longest string" mostly appears in the body text. Using this feature, one paragraph in the body region is found and its tag features are obtained; those tag features are then used to search forward, backward, or in both directions for nodes with similar tags. This process is called "tag clustering".
A method for extracting web page body text, which extracts the body content of news web pages by searching for the "longest string" to locate a meaningful node, comprises the following steps: I. deleting the ignorable tags in the web page and the content inside them; II. finding the longest string in the web page; III. creating a DOM tree and using it to find the node corresponding to the longest string; IV. determining a start node and an end node according to the tag of the node corresponding to the longest string; V. checking and filtering the start node and the end node; VI. outputting the text in the filtered start node and end node.
As shown in Fig. 1, the flow chart of the method, the method specifically comprises the following steps:
Step 1: Delete the ignorable tags in the web page and the content inside them.
Obtain the source file of the web page, for example with a crawling system;
Preprocess the HTML source file. Because the data in web pages varies widely, the HTML code in the source file must be normalized into a uniform form. This preprocessing comprises the following steps:
First, determine whether the tags in the source file are matched; if any tags are unpaired, correct them so that every opening tag has a matching closing tag;
Second, determine whether the web page contains ignorable tags, obtain the ignorable tags and their content, and delete both.
Ignorable tags are tags whose content is unrelated to the body text, such as comments, "script", and "meta".
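The deletion of ignorable tags in Step 1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: it uses regular expressions instead of a full parser, it does not repair unpaired tags, and the pattern list and the function name `strip_ignorable` are assumptions made for the example.

```python
import re

# Hypothetical patterns for the three ignorable constructs named in the text.
_IGNORABLE = [
    re.compile(r"<!--.*?-->", re.DOTALL),                                      # comments
    re.compile(r"<script\b[^>]*>.*?</script\s*>", re.DOTALL | re.IGNORECASE),  # script tag and its content
    re.compile(r"<meta\b[^>]*>", re.IGNORECASE),                               # meta tag (has no content)
]


def strip_ignorable(html: str) -> str:
    """Delete ignorable tags together with the content inside them."""
    for pattern in _IGNORABLE:
        html = pattern.sub("", html)
    return html


if __name__ == "__main__":
    page = '<head><meta charset="utf-8"><script>var a = 1;</script></head><body><!-- ad --><p>News body</p></body>'
    print(strip_ignorable(page))
```

A production system would more likely drop these nodes with an HTML parser, since regular expressions can be misled by unusual markup.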
Step 2: Find the longest string in the web page.
1) Scan the page text line by line, finding and recording the lengths of the continuous strings in the web page.
A continuous string contains no tags. When a tag is encountered, the current length is recorded if it is greater than the longest string length of the current line (which is initialized to 0 at the start of each line), and the length counter is reset to 0 so that a new string length count begins. The length is adjusted according to the enclosing tag: for example, it is increased when the string is inside a paragraph tag <p></p> (the adjustment multiplies the length by a proportionality coefficient) and reduced when it is inside tags such as <strong></strong>.
2) From the continuous string lengths recorded for each line, obtain the longest string.
A continuous string is a run of consecutive Chinese characters (2 or more) or consecutive words (2 or more Western-language words separated by spaces).
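The line-by-line scan of Step 2 might look like the sketch below. Several details are assumptions, since the text does not specify them: the coefficient values in `WEIGHT`, measuring length in characters, the Unicode range used for Chinese characters, and the small `VOID` set of tags that never take a closing tag. Forced termination of overly long lines is not shown.

```python
import re

TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
# A continuous string: 2+ Chinese characters, or 2+ Western words separated by single spaces.
RUN = re.compile(r"[\u4e00-\u9fff]{2,}|[A-Za-z]+(?: [A-Za-z]+)+")
# Hypothetical proportionality coefficients; the source gives no values.
WEIGHT = {"p": 1.5, "strong": 0.5}
VOID = {"br", "img", "hr", "input", "meta", "link"}


def longest_string(html: str):
    """Return (weighted_length, text) of the longest continuous string."""
    best = (0.0, "")
    open_tags = []  # tags currently open, carried across lines
    for line in html.splitlines():
        pos = 0
        # The trailing None handles the text after the last tag on the line.
        for m in list(TAG.finditer(line)) + [None]:
            end = m.start() if m else len(line)
            factor = WEIGHT.get(open_tags[-1], 1.0) if open_tags else 1.0
            for run in RUN.findall(line[pos:end]):
                score = len(run) * factor
                if score > best[0]:
                    best = (score, run)
            if m is None:
                break
            closing, name = m.group(1), m.group(2).lower()
            if closing:
                if name in open_tags:
                    # Pop back to the matching opening tag.
                    while open_tags and open_tags.pop() != name:
                        pass
            elif name not in VOID and not m.group(0).endswith("/>"):
                open_tags.append(name)
            pos = m.end()
    return best


if __name__ == "__main__":
    page = "<div><strong>Breaking news today here</strong></div>\n<p>The council approved the budget</p>"
    print(longest_string(page))
```

Because no DOM tree is needed at this stage, the scan is a single pass over the source text, which matches the efficiency argument made in the description.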
Step 3: Create a DOM tree and use it to find the node corresponding to the longest string.
1) Create a DOM tree for the source file of the obtained web page, compute the information of each node (including word count, Chinese character count, link count, and so on), and store the nodes in an array.
2) Using the longest string obtained in Step 2, find the corresponding node and its position in the array; the corresponding node is found by a simple substring search.
For example: search "<td><div>My test, language</div></td>" for the obtained longest string. If the longest string is "My test", it is found; if it is "catching a duck", it is not. The node in which the string is found is the matching node, that is, the seed node.
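Step 3 can be illustrated with Python's standard `html.parser` module. This is a sketch under assumptions: the `Node` class, the `VOID` set, and the choice to search only each element's own text (not its descendants' text) are illustrative and not taken from the patent. The node array is kept in document order, and each node carries a tag feature string such as `html:body:div:p`.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "input", "meta", "link"}


@dataclass
class Node:
    path: str        # tag feature string, e.g. "html:body:div:p"
    attrs: dict
    text: str = ""   # text directly inside this element
    links: int = 0   # number of <a> children

    @property
    def words(self) -> int:
        return len(re.findall(r"[A-Za-z]+", self.text))

    @property
    def hanzi(self) -> int:
        return len(re.findall(r"[\u4e00-\u9fff]", self.text))


class NodeCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.nodes = []    # all element nodes in document order
        self._stack = []   # indexes of the currently open nodes

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if tag == "a" and self._stack:
            self.nodes[self._stack[-1]].links += 1
        parent = self.nodes[self._stack[-1]].path if self._stack else ""
        self.nodes.append(Node(f"{parent}:{tag}" if parent else tag, dict(attrs)))
        self._stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if there is one.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[i]].path.rsplit(":", 1)[-1] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if self._stack:
            self.nodes[self._stack[-1]].text += data


def collect_nodes(html: str):
    parser = NodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def find_seed(nodes, longest: str) -> int:
    """Index of the first node whose own text contains the longest string, else -1."""
    for i, node in enumerate(nodes):
        if longest in node.text:
            return i
    return -1


if __name__ == "__main__":
    nodes = collect_nodes("<td><div>My test, language</div></td>")
    print(find_seed(nodes, "My test"), find_seed(nodes, "catching a duck"))
```

Searching only a node's own text means the deepest element containing the string becomes the seed, which is one reasonable reading of "corresponding node".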
Step 4: Using the array position of the node corresponding to the longest string and its tag feature string (for example html:body:div:p), find the start node and the end node.
Because the nodes of the body text share a common parent or grandparent node, similar nodes can be found according to the tag of the longest-string node stored in the array, which determines the start node and the end node.
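One simple way to realize the tag clustering of Step 4 is sketched below. It is an assumption-laden illustration: nodes are represented only by their tag feature strings in document order, "similar" is taken to mean an identical feature string, descendants of a similar node are skipped over, and the extension stops at the first node with a different structure. The patent does not state these rules, and the function name `cluster_by_tag` is hypothetical.

```python
def cluster_by_tag(paths, seed):
    """Return (start, end) indexes of the run of nodes similar to paths[seed]."""
    target = paths[seed]

    def scan(step):
        edge = i = seed
        while 0 <= i + step < len(paths):
            i += step
            if paths[i] == target:
                edge = i                          # similar node: extend the region
            elif not paths[i].startswith(target + ":"):
                break                             # different structure: stop extending
        return edge

    # Backward search gives the start node, forward search gives the end node.
    return scan(-1), scan(+1)


if __name__ == "__main__":
    paths = [
        "html", "html:body", "html:body:div", "html:body:div:h1",
        "html:body:div:p", "html:body:div:p", "html:body:div:p:a",
        "html:body:div:p", "html:body:div:span",
    ]
    print(cluster_by_tag(paths, seed=5))
```

Calling only `scan(-1)` or only `scan(+1)` would correspond to the backward-only and forward-only searches; calling both gives the two-way search.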
Step 5: Check and filter the obtained nodes (including the start node and the end node).
The start node and end node selected in Step 4 are checked and filtered to obtain the remaining nodes, and the remaining nodes and their content are output; the content includes text, pictures, and so on.
Filtering is based on features of the relevant nodes: for example, a node is deleted when its class equals a particular value, when its id equals a particular value, or when its content matches some feature and is followed by a URL. This is mainly used to clean up advertising-type information in the middle of the content.
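The filtering rules of Step 5 could be expressed as in the following sketch. The rule values (`BLOCKED_CLASSES`, `BLOCKED_IDS`, and the `PROMO` pattern) are hypothetical, since the text only speaks of "a particular value", and nodes are modeled as plain dictionaries for brevity.

```python
import re

# Hypothetical rule values chosen for illustration only.
BLOCKED_CLASSES = {"ad", "share"}
BLOCKED_IDS = {"related-links"}
# Content that matches a feature and is then followed by a URL.
PROMO = re.compile(r"(?:Sponsored|Advertisement)\b.*?https?://\S+", re.IGNORECASE)


def keep(node: dict) -> bool:
    """Return False for nodes that a filtering rule marks for deletion."""
    if set(node.get("class", "").split()) & BLOCKED_CLASSES:
        return False
    if node.get("id") in BLOCKED_IDS:
        return False
    return not PROMO.search(node.get("text", ""))


def filter_nodes(nodes):
    return [n for n in nodes if keep(n)]


if __name__ == "__main__":
    region = [
        {"text": "First paragraph."},
        {"class": "ad banner", "text": "Buy now"},
        {"text": "Sponsored: visit http://example.com/x"},
        {"text": "Last paragraph."},
    ]
    # Joining the surviving nodes' text corresponds to the final output step.
    print("\n".join(n["text"] for n in filter_nodes(region)))
```

Keeping the rules in data structures rather than in code reflects the description's point that new filtering rules should be easy to add.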
Step 6: Output the text in the filtered start node and end node.
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the application and do not limit its scope of protection. Although the application has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that, after reading the application, they may still make various changes, modifications, or equivalent substitutions to its embodiments, and that such changes, modifications, or equivalent substitutions fall within the scope of the pending claims.

Claims (8)

1. A method for extracting web page body text, characterized in that the method comprises the following steps:
I. preprocessing the web page;
II. finding the longest string in the web page;
III. creating a DOM tree and using it to find the node corresponding to the longest string;
IV. determining a start node and an end node according to the tag of the node corresponding to the longest string;
V. checking and filtering the start node and the end node;
VI. outputting the text in the filtered start node and end node.
2. The method for extracting web page body text as claimed in claim 1, characterized in that step I comprises: determining whether the web page contains ignorable tags such as comments, "script", and "meta"; and obtaining the ignorable tags together with their content and deleting them.
3. The method for extracting web page body text as claimed in claim 1, characterized in that step II comprises the following steps: 1) searching the page text line by line for the longest string in the web page;
2) obtaining and recording the length of the longest string, and further processing the obtained longest string: when the longest string lies inside particular tags, the obtained length is increased or decreased.
4. The method for extracting web page body text as claimed in claim 1, characterized in that in step III a DOM tree is created for the web page, the information of all nodes is obtained from the DOM tree, the nodes are stored in an array, and the array is searched for the node that contains the longest string;
The node information includes the word count, the Chinese character count, and the link count.
5. The method for extracting web page body text as claimed in claim 1, characterized in that similar nodes are found by tag clustering according to the tag of the longest-string node stored in the array, thereby determining the start node and the end node in step IV.
6. The method for extracting web page body text as claimed in claim 5, characterized in that the tag clustering method comprises searching forward, backward, and in both directions on the tag features.
7. The method for extracting web page body text as claimed in claim 1, characterized in that the start node and end node selected in step IV are checked and filtered to obtain the remaining nodes, and the remaining nodes and their content are output.
8. The method for extracting web page body text as claimed in claim 1, characterized in that the body text is located on the basis that the paragraph with the most continuous text is part of the body: a seed node is found in the DOM tree and extended forward and backward from that seed node to find the whole body region.
CN201310538575.4A 2013-11-04 2013-11-04 Webpage content extracting method Expired - Fee Related CN103530429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310538575.4A CN103530429B (en) 2013-11-04 2013-11-04 Webpage content extracting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310538575.4A CN103530429B (en) 2013-11-04 2013-11-04 Webpage content extracting method

Publications (2)

Publication Number Publication Date
CN103530429A (en) 2014-01-22
CN103530429B (en) 2017-01-18

Family

ID=49932438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310538575.4A Expired - Fee Related CN103530429B (en) 2013-11-04 2013-11-04 Webpage content extracting method

Country Status (1)

Country Link
CN (1) CN103530429B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942335A (en) * 2014-05-07 2014-07-23 武汉大学 Construction method of uninterrupted crawler system oriented to web page structure change
CN104376061A (en) * 2014-11-10 2015-02-25 武汉传神信息技术有限公司 Webpage text extracting method
CN104573097A (en) * 2015-01-30 2015-04-29 湖南蚁坊软件有限公司 Method for extracting webpage content
CN106802899A (en) * 2015-11-26 2017-06-06 北京搜狗科技发展有限公司 web page text extracting method and device
CN107203527A (en) * 2016-03-16 2017-09-26 北大方正集团有限公司 The text extracting method and system of news web page
CN107229668A (en) * 2017-03-07 2017-10-03 桂林电子科技大学 A kind of text extracting method based on Keywords matching
CN110377796A (en) * 2019-07-25 2019-10-25 中南民族大学 Text extracting method, device, equipment and storage medium based on dom tree
CN110390038A (en) * 2019-07-25 2019-10-29 中南民族大学 Segment method, apparatus, equipment and storage medium based on dom tree
CN111046302A (en) * 2019-12-30 2020-04-21 珠海趣印科技有限公司 Method and device for extracting webpage content

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727461A (en) * 2008-10-13 2010-06-09 中国科学院计算技术研究所 Method for extracting content of web page
US20110302486A1 (en) * 2010-06-03 2011-12-08 Beijing Ruixin Online System Technology Co., Ltd Method and apparatus for obtaining the effective contents of web page
CN102298638A (en) * 2011-08-31 2011-12-28 北京中搜网络技术股份有限公司 Method and system for extracting news webpage contents by clustering webpage labels
CN102314520A (en) * 2011-10-24 2012-01-11 莫雅静 Webpage text extraction method and device based on statistical backtracking positioning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
万晶: "Web网页正文抽取方法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》, no. 8, 15 August 2010 (2010-08-15) *
张瑞雪: "基于DOM树的网页相似度研究与应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》, no. 10, 15 October 2011 (2011-10-15) *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942335B (en) * 2014-05-07 2017-04-26 武汉大学 Construction method of uninterrupted crawler system oriented to web page structure change
CN103942335A (en) * 2014-05-07 2014-07-23 武汉大学 Construction method of uninterrupted crawler system oriented to web page structure change
CN104376061A (en) * 2014-11-10 2015-02-25 武汉传神信息技术有限公司 Webpage text extracting method
CN104573097A (en) * 2015-01-30 2015-04-29 湖南蚁坊软件有限公司 Method for extracting webpage content
CN104573097B (en) * 2015-01-30 2018-07-24 湖南蚁坊软件有限公司 A method of extraction Web page text
CN106802899B (en) * 2015-11-26 2020-11-24 北京搜狗科技发展有限公司 Webpage text extraction method and device
CN106802899A (en) * 2015-11-26 2017-06-06 北京搜狗科技发展有限公司 web page text extracting method and device
CN107203527A (en) * 2016-03-16 2017-09-26 北大方正集团有限公司 The text extracting method and system of news web page
CN107203527B (en) * 2016-03-16 2019-06-28 北大方正集团有限公司 The text extracting method and system of news web page
CN107229668A (en) * 2017-03-07 2017-10-03 桂林电子科技大学 A kind of text extracting method based on Keywords matching
CN110390038A (en) * 2019-07-25 2019-10-29 中南民族大学 Segment method, apparatus, equipment and storage medium based on dom tree
CN110377796A (en) * 2019-07-25 2019-10-25 中南民族大学 Text extracting method, device, equipment and storage medium based on dom tree
CN110390038B (en) * 2019-07-25 2021-10-15 中南民族大学 Page blocking method, device and equipment based on DOM tree and storage medium
CN110377796B (en) * 2019-07-25 2021-11-02 中南民族大学 Text extraction method, device and equipment based on DOM tree and storage medium
CN111046302A (en) * 2019-12-30 2020-04-21 珠海趣印科技有限公司 Method and device for extracting webpage content

Also Published As

Publication number Publication date
CN103530429B (en) 2017-01-18

Similar Documents

Publication Publication Date Title
CN103530429A (en) Webpage content extracting method
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
JP5746286B2 (en) High-performance data metatagging and data indexing method and system using a coprocessor
CN103023714B (en) The liveness of topic Network Based and cluster topology analytical system and method
CN104268148B (en) A kind of forum page Information Automatic Extraction method and system based on time string
CN106126648B (en) It is a kind of based on the distributed merchandise news crawler method redo log
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
JP6203374B2 (en) Web page style address integration
CN100462969C (en) Method for providing and inquiry information for public by interconnection network
CN109857956B (en) News webpage key information automatic extraction method based on label and block characteristics
TWI695277B (en) Automatic website data collection method
CN103823824A (en) Method and system for automatically constructing text classification corpus by aid of internet
CN108416034B (en) Information acquisition system based on financial heterogeneous big data and control method thereof
CN102298638A (en) Method and system for extracting news webpage contents by clustering webpage labels
CN110147439A (en) A kind of news event detecting method and system based on big data processing technique
CN102945244A (en) Chinese web page repeated document detection and filtration method based on full stop characteristic word string
CN102270212A (en) User interest feature extraction method based on hidden semi-Markov model
CN110457579B (en) Webpage denoising method and system based on cooperative work of template and classifier
CN105512143A (en) Method and device for web page classification
CN102043808A (en) Method and equipment for extracting bilingual terms using webpage structure
CN108959580A (en) A kind of optimization method and system of label data
CN109165373B (en) Data processing method and device
CN111190873B (en) Log mode extraction method and system for log training of cloud native system
US8954438B1 (en) Structured metadata extraction
CN112818200A (en) Data crawling and event analyzing method and system based on static website

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20170426

Address after: 100086 Beijing, Haidian District, North Third Ring Road West, No. 43, building 5, floor 08-09, No. 2

Patentee after: BEIJING ZHONGSOU CLOUD BUSINESS NETWORK TECHNOLOGY Co.,Ltd.

Address before: Room 0902, Shou Heng Technology Building, No. 51 Xueyuan Road, Haidian District, Beijing 100191

Patentee before: BEIJING ZHONGSOU NETWORK TECHNOLOGY Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170118

Termination date: 20211104
