CN109740097A

CN109740097A - A web page body text extraction method based on logical link blocks

Info

Publication number: CN109740097A
Application number: CN201811632086.4A
Authority: CN
Inventors: 王贤明
Original assignee: Wenzhou University Oujiang College
Current assignee: Wenzhou University of Technology
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2019-05-10
Anticipated expiration: 2038-12-29
Also published as: CN109740097B

Abstract

The invention discloses a web page body text extraction method based on logical link blocks. When extracting the web page template and the body text, the method relies only on the current web page itself and needs no heuristic rules, which gives it good generality. Template extraction requires no manual intervention and is highly automated. The analysis is simple and requires no tag parsing of the web page, so it is fast, resistant to interference, and adapts well to web pages whose design does not follow the standards. It also extracts well from pages whose body text is very short. Finally, the extracted template has a simple form and is easy to use. The invention therefore has potential application value in web page body text extraction: it can be used for news pages, blogs, and other web pages with a similar structure, and it is also broadly applicable in other Web information processing and mining fields that do not require fine-grained link blocks.

Description

A web page body text extraction method based on logical link blocks

Technical field

The invention belongs to the field of computer technology and relates to a web page body text extraction method, and in particular to a method, applicable to news pages, blogs, and other web pages with a similar structure, for extracting a web page body text extraction template based on logical link blocks (Content Extraction based on Logical Link Blocks, CELLB).

Background technique

Web page information extraction refers to extracting specific information content from web pages according to particular analysis and application needs. This content includes both shallow content taken directly from the page and deeper content formed through specific analysis of the page. Web page body text extraction is one of the main directions of this research; it has a long history and many related methods. Document 1 groups web page extraction methods into five classes: those based on wrappers, templates, machine learning, visual layout features, and HTML features. Wrappers and templates are generally considered to generalize poorly; they usually need manual involvement and ongoing maintenance, which is very time-consuming and laborious. For this reason, wrapper algorithms that need no template support or manual supervision ([document 2-4]) have been proposed and have achieved good results. Machine learning methods depend on a suitable training set and suitable features ([document 5-6]) and can rarely dispense with manual supervision entirely. The typical representatives of methods using visual layout features are VIPS and similar approaches ([document 7-9]). Although their accuracy is high, they demand overly fine parsing of the web page and are computationally expensive, and their robustness is hard to guarantee on large numbers of non-standard pages. Moreover, now that CSS ([document 10]) is widely used to control the visual presentation of page tags, the related CSS must also be parsed, which further increases the parsing workload and weakens program robustness. Methods based on HTML features mostly rely on heuristic rules ([document 11-15]) or statistical rules; their generality needs improvement, and producing the extraction rules is also time-consuming and laborious. Researchers have also proposed other methods, such as segmentation with fuzzy neural networks ([document 16]) and the MSS segmentation method ([document 17]), as well as methods that combine the above approaches, such as documents 18-19. Document 20 not only classifies the various extraction methods but also gives a fairly comprehensive summary of their applications. Although the extraction methods are diverse and each has its own characteristics, analysis shows that almost all current web page information extraction methods are based on the tag tree ([document 12-13,18,21-23]). The DOM ([document 24]) is the most common way of building a tag tree, and XPath is often used on top of the DOM for content analysis and extraction ([document 19]). Other methods are also essentially based on the HTML tag tree or the DOM ([document 25-26]). All of these methods place high demands on how well-formed the HTML is. In addition, methods that parse HTML into a DOM often need to combine text or link density ([document 27-30]), tag ratio ([document 31-32]), tag path ([document 33]), and similar information to extract the body text. These methods generally perform poorly on pages with very short body text, so the short-text case has attracted researchers' attention and some progress has been made ([document 28]).

In web page information extraction research, a considerable share of the work considers essentially only block-level HTML tags, such as div, table, tr, td, and p. Because the table tag is functionally rich ([document 34]), early page layout, decoration, and content organization were almost inseparable from table. Accordingly, some of the literature considers only web pages laid out with table and fails to distinguish well between tables used for layout and tables used to organize content. Son ([document 22]) specifically studied table-based web pages, distinguished and identified the two kinds of table tags, and demonstrated experimentally the advantages of the proposed method. However, handling only table is too limiting, since current web designs generally mix table and div. Uzun ([document 23]) considers both cases: blocking information is first obtained from div and td, and extraction rules are then generated with a decision tree. This achieves good results and, in particular, extraction speed comparable to hand-crafted rules. Wang ([document 14]) proposes the BSU (basic semantic unit) concept and, on that basis, uses clustering and heuristic rules to extract page information, with better results than methods based on div and table.

To automate web page body text extraction, the fact that pages of the same domain share a similar structure has received growing attention, and many research results are now based on web page similarity features. The TEX method ([document 35]) consists of two main steps, extraction and filtering; its core idea is to analyze and exploit the differences between two web pages to separate the template part of a page from its content part. Liu Qing et al. ([document 36]) use the similarity of HTML DOM nodes to delete identical nodes and extract the content part. Wang Yizhou et al. ([document 37]) represent a web page by the paths of block nodes in the DOM tree, compute the similarity between pages from this feature to cluster similar pages, and finally determine the body content using node density and other features.

Existing web page information extraction methods, especially those based on the tag tree, require web pages to follow the specifications well. These include both the tag syntax specifications of HTML, XHTML, and so on (for example, the pairing of tags) and the conventions of semantic design (for example, content that appears as a block after browser rendering is usually expressed in the code with block-level elements such as div and table, and visual headings are expressed with tags such as h1 and h2). In practice, however, a considerable number of pages on the Web follow neither the tag syntax specifications nor the semantic design conventions. Irregular HTML tag syntax can be corrected with existing or custom-built page normalization programs, but their accuracy cannot be guaranteed; correcting violations of semantic design conventions is harder still. In addition, DOM-based analysis is affected by CSS, background images, Flash, and so on ([document 38]). As a result, methods based on the tag tree or the DOM work well only on pages that are well designed or easy to correct, and struggle on non-standard pages. Their need for fine-grained parsing of HTML tag attributes also causes many difficulties when they are applied to the automated processing of massive numbers of web pages.

Existing web page body text extraction methods based on text density and similar information cannot handle pages with very short body text well, and their accuracy is also low on pages with videos or advertisements inserted in the body text.

Document 39 proposes the concept of the logical link block and a method for recognizing it. The decision rule for logical link blocks in that method is simple and needs no complex computation: a single scan of the web page suffices. While recognizing the link blocks of a page, it also avoids the tag tree or DOM parsing that link block recognition otherwise requires, which saves a great deal of fine-grained tag parsing time and adapts better to complicated HTML code that does not follow the specifications.

The present invention is further research based on document 39.

[document 1] AL-GHURIBI S M, ALSHOMRANI S. A Comprehensive Survey on Web Content Extraction Algorithms and Techniques[C]//2013 International Conference on Information Science and Applications (ICISA). IEEE, 2013: 1–5.

[document 2] WANG J F, HE X F, WANG C, et al.News article extraction with template-independent wrapper[C]//Proceedings of the 18th international conference on World wide web.New York,USA:ACM Press,2009:1085.

[document 3] WANG J F, CHEN C, WANG C, et al. Can we learn a template-independent wrapper for news article extraction from a single training site?[C]//Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, USA: ACM Press, 2009: 1345–1353.

[document 4] HE J, GU Y Q, LIU H Y, et al.Scalable and noise tolerant web knowledge extraction for search task simplification[J].Decision Support Systems,2013,56:156–167.

[document 5] PETERS M, LECOCQ D. Content extraction using diverse feature sets[C]//WWW '13 Companion: Proceedings of the 22nd international conference on World Wide Web companion. Geneva, Switzerland: 2013: 89–90.

[document 6] Hassan A. Sleiman, Rafael Corchuelo. A class of neural-network-based transducers for web information extraction[J]. Neurocomputing, 2014, 135: 61-68.

[document 7] Cai D, Yu S P, Wen J R, et al.VIPS:a vision-based page segmentation algorithm,Microsoft Technical Report,MSR-TR-2003-79,2003.

[document 8] Michael Cormier, Karyn Moffatt, Robin Cohen, et al. Purely vision-based segmentation of web pages for assistive technology[J]. Computer Vision and Image Understanding, 2016, 148: 46-66.

[document 9] Jan Zeleny, Radek Burget, Jaroslav Zendulka.Box clustering segmentation:A new method for vision-based web page preprocessing[J] .Information Processing and Management,2017,53:735–750.

[document 10] W3C. Cascading Style Sheets (CSS) Snapshot 2010[S/OL]. [2018-10-08]. http://www.w3.org/TR/CSS/.

[document 11] XUE Y, HU Y, XIN G, et al.Web page title extraction and its application[J].Information Processing&Management,2007,43(5):1332–1347.

[document 12] AHMADI H, KONG J.User-centric adaptation of Web information for small screens[J].Journal of Visual Languages&Computing,2012,23(1):13–28.

[document 13] JI X W, ZENG J P, ZHANG S Y, et al.Tag tree template for Web information and schema extraction[J].Expert Systems with Applications,2010,37 (12):8492–8498.

[document 14] WANG J Q, CHEN Q C, WANG X L, et al.Basic semantic units based web page content extraction[C]//2008IEEE International Conference on Systems, Man and Cybernetics.IEEE,2008:1489–1494.

[document 15] Patricia Jiménez, Rafael Corchuelo. On learning web information extraction rules with TANGO[J]. Information Systems, 2016, 62: 74-103.

[document 16] CAPONETTI L, CASTIELLO C,P.Document page segmentation using neuro-fuzzy approach[J].Applied Soft Computing,2008,8(1):118–126.

[document 17] PASTERNACK J, ROTH D. Extracting article text from the web with maximum subsequence segmentation[C]//Proceedings of the 18th international conference on World wide web. New York, USA: ACM Press, 2009: 971–980.

[document 18] Wang Haiyan, Cao Pan. Massive Web information extraction method based on node attributes and body content[J]. Journal on Communications, 2016, 37(10): 9-17.

[document 19] Leandro Neiva Lopes Figueiredo, Guilherme Tavares de Assis, Anderson A.Ferreira.DERIN:A data extraction method based on rendering information and n-gram.Information Processing and Management,2017,53:1120– 1138.

[document 20] Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, et al. Web data extraction, applications and techniques: A survey[J]. Knowledge-Based Systems, 2014, 70: 301-323.

[document 21] WONG T L, LAM W.An unsupervised method for joint information extraction and feature mining across different Web sites[J].Data&Knowledge Engineering,2009,68(1):107–125.

[document 22] SON J-W, PARK S-B.Web table discrimination with composition of rich structural and content information[J].Applied Soft Computing,2013,13(1): 47–57.

[document 23] UZUN E, AGUN H V, YERLIKAYA T. A hybrid approach for extracting informative content from web pages[J]. Information Processing & Management, 2013, 49(4): 928–944.

[document 24] W3C. Document Object Model (DOM)[S/OL]. [2018-10-08]. http://www.w3.org/DOM/.

[document 25]M,PAN A,RAPOSO J,et al.Extracting lists of data records from semi-structured web pages[J].Data&Knowledge Engineering,2008,64 (2):491–509.

[document 26] Li Zhiwen, Shen Zhirui. Research on Web page information extraction based on natural annotation[J]. Journal of the China Society for Scientific and Technical Information, 2013, 32(8): 853–859.

[document 27] Liu Pengcheng, Hu Jun, Wu Gongqing. Web page body text extraction based on text block density and tag path coverage[J]. Application Research of Computers, 2018, 35(6): 1645-1650.

[document 28] Xi Jiazhen, Guo Yan, Li Qiang, et al. An automatic body text extraction method for web pages with short body text[J]. Journal of Chinese Information Processing, 2016, 30(1): 8-15.

[document 29] Liao Jianjun. Automatic web page body text extraction based on tag style and density model[J]. Information Science, 2018, 36(7): 123-129.

[document 30] Zhu Zede, Li Miao, Zhang Jian, et al. Web body text extraction based on a text density model[J]. Pattern Recognition and Artificial Intelligence, 2013, 26(7): 667-672.

[document 31] David Insa, Josep Silva, Salvador Tamarit.Using the words/leafs ratio in the DOM tree for content extraction[J].The Journal of Logic and Algebraic Programming,2013,82(8):311-325.

[document 32] Yu-Chieh Wu.Language independent web news extraction system based on text detection framework[J].Information Sciences,2016,342:132–149.

[document 33] Wu Gong-Qing, Li Lei, Li Li, Wu Xindong.Web News Extraction via Tag Path Feature Fusion Using DS Theory[J].Journal of Computer Science and Technology,2016,31(4):661–672.

[document 34] CAFARELLA M J, HALEVY A, WANG D Z, et al. WebTables: exploring the power of tables on the web[C]//Proceedings of the VLDB Endowment. Auckland, New Zealand: 2008: 538–549.

[document 35] Hassan A.Sleiman, Rafael Corchuelo.TEX:An efficient and effective unsupervised Web information extractor[J].Knowledge-Based Systems, 2013,39:109-123.

[document 36] Liu Qing, Li Xiaodong, Geng Guanggang. Research on web page body content extraction based on layout similarity[J]. Application Research of Computers, 2015, 32(9): 2581-2586.

[document 37] Wang Yizhou, Chen Xing, Dai Yuanfei. Body text information extraction method based on website structure[J]. Journal of Chinese Computer Systems, 2018, 39(1): 111-115.

[document 38] Ahmet Selman Bozkir, Ebru Akcapinar Sezer. Layout-based computation of web page similarity ranks[J]. International Journal of Human-Computer Studies, 2018, 110: 95-114.

[document 39] Wang, X.M., Wu, Z.D., Huang, Y.N., Gu, Q. A new recognition approach for logical link blocks in webpages. Journal of Digital Information Management, 2015, 13(2): 76-85.

Summary of the invention

Methods based on the HTML tag tree place high demands on how well-formed the HTML is, and methods based on text density and similar information cannot handle short body text well. To address these problems, the invention proposes a new method, based on logical link blocks, for extracting body text extraction templates for news pages, blogs, and similar web pages (Content Extraction based on Logical Link Blocks, CELLB).

The technical solution adopted by the invention is a web page body text extraction method based on logical link blocks, characterized by comprising the following steps:

Step 1: generate the web page body text extraction template;

Step 1.1: input the URL for which a template is to be generated, URL₀;

Step 1.2: obtain the source code HTML₀ of the web page corresponding to URL₀ and extract all same-domain URLs in it (that is, URLs under the same second-level domain as URL₀), denoted URLList;

Step 1.3: using the URL similarity rule RuleURL, select from URLList the s URLs most similar to URL₀ (if fewer than s URLs are available, s takes the actual number of URLs) to form the similar URL list, denoted URL_s;

Step 1.4: obtain the source code of each URL in the similar URL list URL_s, denoted HTML₁, HTML₂, …, HTML_s; these constitute HTMLList₀;

Step 1.5: identify and remove the logical link blocks in every source code in HTMLList₀ to form a new source code list, denoted HTMLList, and perform the web page elementization operation on each web page in it; if a web page is compressed, perform the web page atomization operation on it;

Step 1.6: use the fuzzy body text region recognition rule RuleText to identify the fuzzy body text region of each web page in HTMLList, a₁, a₂, …, a_s, denoted A={a₁,a₂,…,a_s}; then obtain the text length of each fuzzy body text region, lt₁, lt₂, …, lt_s, denoted LT={lt₁,lt₂,…,lt_s};

Step 1.7: use the similar URL scoring rule RuleScore to rank the similar URL list URL_s and take the top c URLs as the candidate links URL_c (if c > s, c takes the same value as s);

Step 1.8: from the fuzzy body text regions A and the elementization results of the pages in the candidate links URL_c, compute the intersection of the elements before the fuzzy body text across the pages, E_itrs,First, and the intersection of the elements after it, E_itrs,Last;

Step 1.9: determine the template head element E_First according to the head element decision rule RuleFirst, and the template tail element E_Last according to the tail element decision rule RuleLast;

Step 1.10: select a suitable algorithm as required to generate the fingerprint URLFinger of URL₀, thereby determining the body text extraction template (URLFinger, E_First, E_Last) for the web page corresponding to URL₀, and store the extracted template;

Step 2: use the body text extraction template to extract the web page body text.
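Steps 1.8 and 1.9 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the exact rules RuleFirst and RuleLast are not given in this section, so the sketch uses the two features named later in the discussion of recall (uniqueness and minimal distance from the fuzzy body text region). The function names, the representation of each page as a list of row elements, and the representation of each fuzzy region as a `(start, end)` index pair are all hypothetical.

```python
from typing import List, Optional, Set, Tuple

Page = List[str]            # one page as a list of elements (assumed: one per source line)
Region = Tuple[int, int]    # assumed: inclusive element indices of the fuzzy body text region


def common_elements(pages: List[Page], regions: List[Region]) -> Tuple[Set[str], Set[str]]:
    """Intersect, across pages, the elements before and after each fuzzy region."""
    before = [set(page[:start]) for page, (start, _end) in zip(pages, regions)]
    after = [set(page[end + 1:]) for page, (_start, end) in zip(pages, regions)]
    return set.intersection(*before), set.intersection(*after)


def pick_head_tail(pages: List[Page], regions: List[Region]) -> Tuple[Optional[str], Optional[str]]:
    """Assumed rule: nearest element to the fuzzy region that is shared and unique in every page."""
    first_common, last_common = common_elements(pages, regions)
    ref, (start, end) = pages[0], regions[0]   # the first page serves as the reference page

    def unique_everywhere(element: str) -> bool:
        return all(page.count(element) == 1 for page in pages)

    # Walk upward from the region for the head element, downward for the tail element.
    head = next((ref[i] for i in range(start - 1, -1, -1)
                 if ref[i] in first_common and unique_everywhere(ref[i])), None)
    tail = next((ref[i] for i in range(end + 1, len(ref))
                 if ref[i] in last_common and unique_everywhere(ref[i])), None)
    return head, tail
```

In this sketch an element such as `<body>` or `</body>` is always a fallback candidate as long as it is shared and unique, which mirrors the worst case described for the recall rate.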

Compared with the prior art, the advantages of the invention are as follows:

1) The invention has a high recall rate;

A high recall rate means that, when extracting body text, the invention either extracts it essentially correctly and completely, or extracts a region that is too large so that the body text is only part of what is extracted; it essentially never omits body text. The reason is as follows. If the template is extracted correctly, the body text is extracted accurately and none of it is lost. Even if the template is inaccurate, the extracted region is generally larger than the true body text region, so the body text is still not lost. It should be specially noted that template extraction essentially never fails to find a head element and a tail element. For any fuzzy body text region (whether or not it was identified correctly), the head and tail elements are subsequently determined by two features: uniqueness and minimal distance from the fuzzy body text region. In the worst case, extending upward from the fuzzy body text region generally reaches the component "</head>" or "<body>", and extending downward generally reaches the component "</body>" or "</html>". The region they delimit is then far too broad, but the body text is not omitted: it is a subregion of the extracted text. Precision is not necessarily high in this case, but recall is 100%, and this is the fundamental reason the invention achieves a high recall rate.

2) Compared with conventional methods based on the tag tree, the invention places lower demands on how well-formed the HTML code is, and is therefore generally more efficient;

Because the invention does not parse the HTML tag tree and processes the HTML more coarsely, problems found in many web pages, such as unpaired tags, interleaved tag pairs, and missing tags, do not affect the execution of the method.

3) The invention handles content extraction well for web pages with short body text.

Traditional methods based on text density and similar information cannot handle such pages correctly: when the body text is too short, the text density of its region is very low and is easily misjudged. The invention does not extract from the current page directly. Instead, it selects related links from the current page, and the similar URL scoring rule used in the selection ensures that the selected pages generally do not have very short body text. In this way, the content of web pages with short body text is extracted correctly.

Description of the drawings

Fig. 1 is a flowchart of generating the web page body text extraction template according to an embodiment of the invention;

Fig. 2 is a flowchart of extracting the web page body text using the body text extraction template according to an embodiment of the invention.

Specific embodiment

To help those of ordinary skill in the art understand and implement the invention, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described here serve only to illustrate and explain the invention and are not intended to limit it.

In general, on almost all kinds of websites, such as news sites and blogs, a content detail page shows its main content in the middle of the page and, around that content (usually below it or to its right), a large number of related links. Some of these related links are pages matched by the site's recommendation system or by automatic matching (labelled, for example, "related news", "recommended for you", or "guess you like"). Others are links in one or more categories recommended by editors or by a ranking system (labelled, for example, "Editor's Choice", "hot news", or "featured recommendations"). Relative to the current page, the linked pages may belong to the same column or to different columns. Pages of the same website or column often share the same structural pattern, and pages of different columns may share it as well.

The identical structural pattern makes automatic template extraction theoretically feasible. In fact, the vast majority of websites use static page generation, and the pages before static generation usually come from a small number of dynamic pages: a large share of a site's pages are built by a few dynamic pages from structured data in a database. The dynamic page acts as a container, and the template to be extracted resides in it, while the most important content in the database is the body text to be extracted. The template is contained in the parts of same-domain pages that are identical or highly similar, and the dissimilar parts contain the body content to be extracted.

From the above analysis, the key points of template analysis are as follows:

1) Identification of similar URLs;

Similar URLs are identified with the URL similarity rule RuleURL. The aim is to obtain the s URLs most similar to the URL whose body text is to be extracted, for use in subsequent calculations.

2) Web page elementization;

For pages that are uncompressed or only lightly compressed, the newline character is used as the dividing mark for elementization. For heavily compressed pages, web page atomization or elementization should be performed first.

3) Locating the fuzzy body text region;

Locating the fuzzy body text region means identifying a particular subregion of the body text region, without having to identify the entire body text block precisely. This is the key point in which the invention differs from, and improves on, other body text extraction methods based on text density.

4) Determining the c candidate links;

The score of each similar URL is calculated with the similar URL scoring rule; after sorting by score in descending order, the top c candidate links that take part in the head and tail element analysis are determined.

5) Determining the head element and the tail element;

Using the head element and tail element rules, the identifying composite components before and after the body text region, namely the head element and the tail element, are determined, which fixes the body text extraction template of the web page.

6) Using the web page body text extraction template formed by the head element and the tail element, the web page body text can be extracted.
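How a stored template might be applied in key point 6) can be sketched as follows. The details of the extraction step are not spelled out in this section, so the sketch assumes that the body text is whatever lies between the first occurrence of the head element and the next occurrence of the tail element in the page source. The function name `extract_body` and its parameters are hypothetical.

```python
from typing import Optional


def extract_body(html: str, e_first: str, e_last: str) -> Optional[str]:
    """Return the source between the head and tail elements, or None if the template does not match."""
    start = html.find(e_first)
    if start == -1:
        return None                      # head element absent: template not applicable
    start += len(e_first)
    end = html.find(e_last, start)       # search for the tail element only after the head element
    if end == -1:
        return None                      # tail element absent: template not applicable
    return html[start:end].strip()
```

Plain substring search is used here on purpose: like the rest of the method, it needs no tag tree and is unaffected by unpaired or interleaved tags elsewhere in the page.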

The web page body text extraction method based on logical link blocks provided by the invention comprises the following steps:

Step 1: generate the web page body text extraction template;

Referring to Fig. 1, the specific implementation includes the following sub-steps:

Step 1.1: input the URL for which a template is to be generated, URL₀;

Step 1.3: using the URL similarity rule RuleURL, select from URLList the s URLs most similar to URL₀ (if fewer than s URLs are available, s takes the actual number of URLs) to form the similar URL list URL_s;

A URL is itself a very useful resource, especially for preliminary screening tasks such as classifying web pages or links and screening for highly similar pages. These can be carried out using the URL alone, which avoids unnecessary steps such as downloading pages.
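URL-only screening starts from the same-domain links found in the current page. The Python sketch below shows one way to collect them without parsing the tag tree. It rests on assumptions that are not part of the original text: links are found with a simple regular expression over quoted `href` attributes, and "same second-level domain" is approximated by comparing the last two labels of the host name, which is wrong for suffixes such as `.com.cn`. The names `site_key` and `same_domain_urls` are hypothetical.

```python
import re
from typing import List
from urllib.parse import urljoin, urlparse

# Quoted href values; fragment-only links such as href="#top" are skipped.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\'#]+)', re.IGNORECASE)


def site_key(url: str) -> str:
    """Naive stand-in for the second-level domain: last two labels of the host name."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def same_domain_urls(base_url: str, html: str) -> List[str]:
    """Collect absolute http(s) links in html that share base_url's site key, in page order."""
    key = site_key(base_url)
    seen = {base_url}                    # never return the reference URL itself
    result = []
    for href in HREF_RE.findall(html):
        absolute = urljoin(base_url, href.strip())
        if not absolute.startswith(("http://", "https://")):
            continue                     # skip javascript:, mailto:, and similar schemes
        if site_key(absolute) == key and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

A production version would likely replace `site_key` with a lookup against a public suffix list; the regular expression keeps the sketch independent of how well-formed the page is.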

To reduce the overall processing time, the invention uses URLs for preliminary screening. Let the URL of the web page currently being processed be the reference URL, denoted url, and let the same-domain URLs extracted from that page be URL={url₁,url₂,…,url_d}, where d is the number of same-domain URLs in the page corresponding to url. The similarity between the i-th URL and the reference URL url is then as follows:

where lcs() computes the longest common subsequence and Len() computes the length of a string.

A web page contains many links. Some point to other websites, some belong to the same website as the current URL but to different sub-columns, and some belong to the same sub-column as the current URL; clearly they play different roles in template extraction. The number of similar URLs, s, is the number of high-similarity URLs selected after the URLs have been scored and sorted by the URL similarity rule. The corresponding URLs constitute the similar URL list, denoted URL_s={url₁,url₂,…,url_s}, where s is the number of similar URLs. Selecting similar URLs reduces the computation needed for the subsequent fuzzy body text region recognition and speeds up subsequent processing.
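The selection of the s most similar URLs can be sketched in Python. The text states only that the similarity is built from lcs() and Len(); the exact formula is not reproduced here, so the normalization used below, 2·Len(lcs) / (Len(url_i) + Len(url)), is an assumption chosen for illustration and is not taken from the patent. The function names are hypothetical as well.

```python
from typing import List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def url_similarity(candidate: str, reference: str) -> float:
    """ASSUMED normalization, not the patent's formula: 2 * |lcs| / (|candidate| + |reference|)."""
    total = len(candidate) + len(reference)
    return 2 * lcs_length(candidate, reference) / total if total else 0.0


def top_similar(reference: str, candidates: List[str], s: int) -> List[str]:
    """Return the s candidates most similar to the reference URL."""
    ranked = sorted(candidates, key=lambda u: url_similarity(u, reference), reverse=True)
    return ranked[:s]   # slicing caps s at the number of available URLs
```

With this scoring, URLs from the same column as the reference URL (which share most of their path) rank above links to other sections of the site.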

Step 1.4: obtain the source code of each URL in the similar URL list URL_s, denoted HTML₁, HTML₂, …, HTML_s; these constitute HTMLList₀;

Step 1.5.1: web page atomization;

An HTML component is a basic building block of HTML. Components are also called atoms and are denoted e_i, where e stands for a component and i is its serial number. Components are divided into HTML tag components and content components. Start tags and end tags are each treated as independent HTML components, so tag pairing and nesting are not considered. For example, the HTML fragment "<img src='logo.jpg'><br><p align='left'>content 1<br>content 2</p>" contains seven components: "<img src='logo.jpg'>", "<br>", "<p align='left'>", "content 1", "<br>", "content 2", and "</p>".

Web page atomization is the process of converting web page code into its HTML component (atom) representation. The atomized representation of HTML is denoted H={e₁,e₂,…,e_f}, where H is the web page as a set of components, f=Card(H) is the total number of components in the page, and Card() gives the number of elements in a set. For example, the fragment above is expressed after atomization as H={"<img src='logo.jpg'>", "<br>", "<p align='left'>", "content 1", "<br>", "content 2", "</p>"}.

Atomizing a web page requires only a single scan of the HTML code. Compared with conventional parsing of the HTML tag tree or of the page's visual layout, it is faster and simpler to implement and demands less of the HTML's well-formedness, so the program is more robust. This matters all the more in the automated processing of massive numbers of web pages.
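A minimal Python sketch of single-pass atomization follows. It assumes that every `<` outside a tag opens a tag component and that the next `>` closes it, so a `>` inside an attribute value or a script would be split incorrectly. Dropping whitespace-only content is a simplifying choice of the sketch, and the name `atomize` is hypothetical.

```python
from typing import List


def atomize(html: str) -> List[str]:
    """Split HTML source into tag components and content components in one left-to-right scan."""
    atoms: List[str] = []
    buf: List[str] = []       # characters of the component currently being collected
    in_tag = False
    for ch in html:
        if ch == "<" and not in_tag:
            # A tag starts: flush any pending content component first.
            text = "".join(buf).strip()
            if text:
                atoms.append(text)
            buf = [ch]
            in_tag = True
        elif ch == ">" and in_tag:
            # A tag ends: emit it as one tag component, with no pairing check.
            buf.append(ch)
            atoms.append("".join(buf))
            buf = []
            in_tag = False
        else:
            buf.append(ch)
    tail = "".join(buf).strip()
    if tail:
        atoms.append(tail)    # trailing content or an unterminated tag is kept as-is
    return atoms
```

Applied to the fragment used in the example above, the function returns the same seven components, and unpaired tags such as a stray `</p>` simply become components of their own.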

Step 1.5.2: Web page elementization;

A composite component, also called an element, is formed from several adjacent components. For example, "<br><p align='left'>" in the fragment above is a composite component. A composite component is denoted E_i={e_p,…,e_q}, q≥p, where E denotes a composite component and i its serial number. Clearly, when p=q the composite component degenerates into a single component; when p=1 and q=f (where f is the total number of components in the atomized representation, as above), the composite component is the entire web page.

Web page elementization is the process of converting web page code into a composite component (element) representation. The elementized representation is denoted H={E₁,E₂,…,E_m}, where H denotes the web page and m is the number of composite components obtained by dividing the HTML code according to some division rule. The division rule is flexible and raises no particular technical difficulty. For the present invention, the rule is simply to divide at every newline. Other applications may call for other rules: for example, when recognizing body text (as distinct from the body text extraction of the present invention) or recognizing link blocks in a web page, the rule may be to divide at block-level element tags. Different division rules yield different elementized representations, and the element count m may differ accordingly.

In general, for most web pages the newline character is used as the division mark: the newlines in the HTML code elementize the page, and each line of the HTML is then called a row element. The advantage is that this makes use, to some extent, of the semantic division of the page already made by its developers.

Like atomization, elementization generally requires only a single pass over the HTML code. It is even faster and simpler to implement than atomization and places lower demands on how well-formed the HTML is, so it is also more practical.

It should also be noted that, in the method of the present invention, most web pages do not need to be atomized, because pages that are uncompressed or only lightly compressed already carry a good built-in division mark: the carriage return and line feed. For the same reason, essentially no pre-processing is required, which greatly improves processing efficiency.

Atomization is necessary only for highly compressed web pages. In such pages no formatting characters such as spaces or newlines are left between components, so the page cannot be elementized well; atomizing the page is then one of the best options.

Step 1.6: Using the fuzzy text region recognition rule RuleText, identify the fuzzy text region of each web page in HTMLList, a₁, a₂, …, a_s, denoted A={a₁,a₂,…,a_s}; then obtain the text length of each region, lt₁, lt₂, …, lt_s, denoted LT={lt₁,lt₂,…,lt_s};

In general, a web page only needs to be elementized, giving H={E₁,E₂,…,E_m}. Text is extracted from each composite component in turn to give T={t₁,t₂,…,t_m}, where E_i denotes the i-th composite component after elementization and t_i denotes the text extracted from E_i. Components between <script> and </script>, between <style> and </style>, and the like are not extracted. With Len() returning the length of a string, the fuzzy text region is then:

$$a = \arg\max_{1 \le i \le m} \mathrm{Len}(t_i)$$

That is, the fuzzy text region is expressed as the serial number of the component row containing the longest text. It is called a "fuzzy" text region because the a obtained here is not the complete body text region of the page; the true, complete body text region is determined below by the header element decision rule RuleFirst and the tail element decision rule RuleLast.

The fuzzy text length is the length of the text in the fuzzy text region, denoted lt=Len(t_a).
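A minimal Python sketch of row-element text extraction and fuzzy text region recognition follows. It assumes newline division, removes tags with a regular expression rather than a parser, and uses 0-based row indices (the serial numbers in the text are 1-based). The names `row_texts` and `fuzzy_region` are hypothetical.

```python
import re

# Whole <script>/<style> blocks, or any single tag (tags may span several lines).
_NON_TEXT = re.compile(r"<(script|style)\b.*?</\1\s*>|<[^>]*>",
                       re.IGNORECASE | re.DOTALL)


def _keep_newlines(match):
    # Preserve line breaks so row indices still match the original source.
    return "\n" * match.group(0).count("\n")


def row_texts(html):
    """Return the text t_i of every row element E_i (one element per line)."""
    cleaned = _NON_TEXT.sub(_keep_newlines, html)
    return [row.strip() for row in cleaned.split("\n")]


def fuzzy_region(html):
    """Return (a, lt): the row index with the longest text and that length."""
    texts = row_texts(html)
    a = max(range(len(texts)), key=lambda i: len(texts[i]))
    return a, len(texts[a])


sample = "\n".join([
    "<html>",
    "<script>",
    "var s = 'a very long line of script text that is not body text at all';",
    "</script>",
    "<div class='nav'><a href='/a'>Home</a></div>",
    "<p>This is the main body text of the page.</p>",
    "<div>footer</div>",
    "</html>",
])
a, lt = fuzzy_region(sample)
```

The script line is longer than the paragraph line but is ignored, so the paragraph row is selected.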

When a web page needs to be atomized, atomization gives H={e₁,e₂,…,e_f}. Processing is the same as above, except that text is extracted from each component rather than from each composite component. Extraction then reduces to checking the component type, since every component is either a text content component or an HTML tag component.

Step 1.7: Using the similar URL scoring rule RuleScore, rank the similar URL list URL_s and take the top c URLs to form the candidate links URL_c;

A web page usually contains a large number of related links, and for a page whose body text is to be extracted these links contribute differently to the extraction. The similar URL scoring rule scores each URL according to a fixed rule so that the links most useful for body text extraction can be selected from the similar links. Let the URL of the page to be extracted be url, and let URL_s be the similar URL list selected from the current page using the URL similarity rule and the similar URL count s. The longest common substrings of these URLs with url are LCS={lcs₁,lcs₂,…,lcs_s}, with lengths L={l₁,l₂,…,l_s}, that is, l_i=Len(lcs_i), where lcs_i denotes the i-th longest common substring and l_i its length. The fuzzy text lengths of the pages are LT={lt₁,lt₂,…,lt_s}. The score of the i-th URL url_i is then:

where α∈[0,1] is the weight adjustment factor, which adjusts the relative contributions of URL similarity and fuzzy text length to the URL score.

Candidate links are the links whose pages take part in the header element and tail element analysis. The number of candidate links is the candidate link count, denoted c. The similar URLs URL_s are scored with the URL scoring rule above and sorted in descending order of score, and the first c are taken as the candidate links URL_c. Clearly, when the candidate link count c=2, only the 2 additional links with the highest similarity are needed to complete body text and template extraction.
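The ranking step can be sketched as follows. The score expression itself is not reproduced in this text, so the sketch assumes a weighted sum of the two quantities, each normalized by its maximum over the list; that normalization, and the names `longest_common_substring_len` and `rank_candidates`, are assumptions and not the scoring rule as filed.

```python
def longest_common_substring_len(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def rank_candidates(url, similar_urls, fuzzy_lengths, alpha=0.8, c=2):
    """Return the top-c similar URLs under an ASSUMED normalized weighted score."""
    if not similar_urls:
        return []
    lcs = [longest_common_substring_len(url, u) for u in similar_urls]
    max_l = max(lcs) or 1               # avoid division by zero
    max_lt = max(fuzzy_lengths) or 1
    scores = [alpha * l / max_l + (1 - alpha) * lt / max_lt
              for l, lt in zip(lcs, fuzzy_lengths)]
    order = sorted(range(len(similar_urls)), key=lambda i: scores[i], reverse=True)
    return [similar_urls[i] for i in order[:min(c, len(order))]]


url = "http://news.example.com/a/2017/1110/100.html"
similar = [
    "http://news.example.com/a/2017/1110/101.html",
    "http://news.example.com/b/index.html",
    "http://news.example.com/a/2017/1109/099.html",
]
top = rank_candidates(url, similar, [800, 50, 600], alpha=0.8, c=2)
```

With alpha above 0.5, URL similarity dominates, so a page in another column is ranked last even if its text were long.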

Step 1.8: From the fuzzy text regions A and the elementized pages of the candidate links URL_c, compute the intersection of the elements before the fuzzy text region across the pages and the intersection of the elements after it, E_itrs,First and E_itrs,Last;

Itrs is an abbreviation of intersection. First indicates that the elements lie before the fuzzy text region and are used to compute the header element; Last indicates that the elements lie after the fuzzy text region and are used to compute the tail element.

The header element is the composite component in E_itrs,First that is unique and nearest to the fuzzy text region a; the tail element is the composite component in E_itrs,Last that is unique and nearest to the fuzzy text region a. The header and tail elements are composite components that are distinctive within the page and often also fairly general across pages; the body text extraction template is built from them.

The header and tail elements are identified from the pages corresponding to the c candidate links, denoted H₁,H₂,…,H_c. The elementized or atomized representation of the i-th page contains i_m=Card(H_i) composite components, and the fuzzy text region of the i-th page is a_i. Using a_i, H_i can be divided into two parts: the elements before a_i and the elements after a_i.

Intersecting the front parts of all pages, and likewise the rear parts, gives respectively:

where i denotes the serial number of a web page in the data set, and u=Card(E_itrs,First) and v=Card(E_itrs,Last) are the numbers of elements in the intersections of the composite component sets before and after the fuzzy text region, respectively. Note that the elements of each intersection keep their original order.
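The order-preserving intersection can be sketched in Python as below. The sketch assumes row elements are compared as whitespace-stripped strings, that empty rows are ignored, and that a repeated element is kept once at its first position; `split_around` and `ordered_intersection` are hypothetical names.

```python
def split_around(elements, a):
    """Split a page's row elements into the parts before and after row a (0-based)."""
    return elements[:a], elements[a + 1:]


def ordered_intersection(parts):
    """Elements present in every list, in the order they appear in the first list."""
    if not parts:
        return []
    cleaned = [[e.strip() for e in part if e.strip()] for part in parts]
    common = set(cleaned[0]).intersection(*cleaned[1:])
    seen, result = set(), []
    for element in cleaned[0]:
        if element in common and element not in seen:
            seen.add(element)
            result.append(element)
    return result


page1 = ["<html>", "<div id='nav'>", "<h1>", "BODY ONE",
         "<div class='share'>", "<div id='foot'>", "</html>"]
page2 = ["<html>", "<div id='nav'>", "<div class='ad'>", "<h1>", "BODY TWO, longer",
         "<div class='share'>", "<div id='foot'>", "</html>"]
before1, after1 = split_around(page1, 3)   # fuzzy text region of page 1 is row 3
before2, after2 = split_around(page2, 4)   # fuzzy text region of page 2 is row 4
first_common = ordered_intersection([before1, before2])
last_common = ordered_intersection([after1, after2])
```

The advertisement row that appears in only one page drops out of the front intersection, while shared template rows remain in their original order.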

The header element E_First of the body text region is:

where j denotes the serial number of a composite component and Count(i, E_j,First) denotes the number of occurrences of the component E_j,First in the i-th page, which must be greater than 0. The expression shows that the fewer times E_j,First occurs (the smaller the denominator) and the closer it is to the fuzzy text region (the larger the numerator), the larger the value. This header element decision rule yields the composite component that is highly unique within the page and as close as possible to the body text region: the header element.

The tail element E_Last of the body text region is:

where j denotes the serial number of a composite component and Count(i, E_j,Last) denotes the number of occurrences of the component E_j,Last in the i-th page, which must be greater than 0. Because the tail element lies after the body text region, a smaller j means the component is closer to the body text. This tail element decision rule yields the composite component that is highly unique within the page and as close as possible to the body text region: the tail element.
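The decision rules can be illustrated with the sketch below. The expressions themselves are not reproduced in this text, so the sketch adopts one plausible reading of the description: for the header element, maximize j divided by the total occurrence count over the candidate pages; for the tail element, minimize j multiplied by that count. Both scoring functions, and the names `pick_head` and `pick_tail`, are assumptions rather than the formulas as filed.

```python
def _total_count(element, pages):
    # Sum of Count(i, E) over all candidate pages i.
    return sum(page.count(element) for page in pages)


def pick_head(first_common, pages):
    """ASSUMED rule: maximize j / total count (j is 1-based; larger j is nearer the text)."""
    if not first_common:
        return None
    j_max = max(range(1, len(first_common) + 1),
                key=lambda j: j / _total_count(first_common[j - 1], pages))
    return first_common[j_max - 1]


def pick_tail(last_common, pages):
    """ASSUMED rule: minimize j * total count (smaller j is nearer the text)."""
    if not last_common:
        return None
    j_min = min(range(1, len(last_common) + 1),
                key=lambda j: j * _total_count(last_common[j - 1], pages))
    return last_common[j_min - 1]


page1 = ["<html>", "<div id='content'>", "<br>", "BODY ONE",
         "<br>", "<br>", "<br>", "<br>", "<div id='foot'>", "</html>"]
page2 = ["<html>", "<div id='content'>", "<br>", "BODY TWO",
         "<br>", "<br>", "<br>", "<br>", "<div id='foot'>", "</html>"]
pages = [page1, page2]
head = pick_head(["<html>", "<div id='content'>", "<br>"], pages)
tail = pick_tail(["<br>", "<div id='foot'>", "</html>"], pages)
```

In this example "<br>" is nearest to the text on both sides but occurs ten times in total, so the rarer neighbouring elements are chosen instead.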

Step 1.10: Select a suitable algorithm as required (this embodiment uses MD5) to generate the fingerprint URLFinger of URL₀, thereby determining the body text extraction template (URLFinger, E_First, E_Last) of the web page corresponding to URL₀, and store the extracted template;
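A short sketch of fingerprinting and template storage, using MD5 from Python's standard `hashlib`, is given below. The in-memory dictionary `template_store` and the function names are hypothetical; the text does not specify how templates are persisted.

```python
import hashlib


def url_fingerprint(url):
    """MD5 hex digest of the URL, used as the fingerprint URLFinger."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


template_store = {}  # hypothetical in-memory store, keyed by fingerprint


def store_template(url, head, tail):
    """Build and store the template (URLFinger, E_First, E_Last) for a URL."""
    finger = url_fingerprint(url)
    template = (finger, head, tail)
    template_store[finger] = template
    return template


tpl = store_template("http://news.example.com/a/1.html",
                     "<div id='content'>", "<div id='foot'>")
```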

Step 2: Extract the body text of the web page using the body text extraction template;

Referring to Fig. 2, the specific implementation comprises the following sub-steps:

Step 2.1: Input the URL, URL₀, of the page whose body text is to be extracted;

Step 2.2: Compute the fingerprint of URL₀ using the fingerprint algorithm;

Step 2.3: Determine whether the fingerprint of URL₀ already exists;

If not, go to step 2.4;

If so, go to step 2.6;

Step 2.4: Analyze the page to obtain a body text extraction template;

Step 2.5: Determine whether a body text extraction template was obtained;

If so, go to step 2.7;

If not, output failure information and end the process;

Step 2.6: Return the body text extraction template corresponding to the fingerprint of URL₀;

Step 2.7: Extract the body text from the page corresponding to URL₀ using the template;

Step 2.8: Determine whether body text was extracted;

If so, output the extracted body text and end the process;

If not, output failure information and end the process.

Note that if the URL fingerprint algorithm is highly unique, each web page corresponds to its own template. If it is less unique, for example if URLs that share a template are mapped to the same fingerprint, templates can be reused; this is often applicable to pages under the same column or the same domain name.
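The flow of steps 2.1 to 2.8, together with template reuse through a coarser fingerprint, might be sketched as follows. The sketch assumes that the body text is the set of rows strictly between the header element and the tail element, that the fingerprint is the MD5 of the host name, and that template analysis is supplied as a callback; `domain_fingerprint`, `extract_between`, `extract_body`, and `fake_analyze` are all hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit


def domain_fingerprint(url):
    """Coarse fingerprint: MD5 of the host name, so one domain shares one template."""
    return hashlib.md5(urlsplit(url).netloc.encode("utf-8")).hexdigest()


def extract_between(rows, head, tail):
    """Rows strictly between the first head row and the next tail row, or None."""
    try:
        start = rows.index(head) + 1
        end = rows.index(tail, start)
    except ValueError:
        return None
    body = [row for row in rows[start:end] if row.strip()]
    return body or None


def extract_body(url, rows, templates, analyze, fingerprint=domain_fingerprint):
    finger = fingerprint(url)                  # step 2.2
    template = templates.get(finger)           # steps 2.3 and 2.6
    if template is None:
        template = analyze(url)                # step 2.4
        if template is None:                   # step 2.5: analysis failed
            return None
        templates[finger] = template
    head, tail = template
    return extract_between(rows, head, tail)   # steps 2.7 and 2.8


rows = ["<html>", "<div id='content'>", "<p>Hello body</p>", "<div id='foot'>", "</html>"]
calls = []


def fake_analyze(url):
    calls.append(url)  # record how often template analysis is actually run
    return ("<div id='content'>", "<div id='foot'>")


store = {}
out1 = extract_body("http://news.example.com/a/1.html", rows, store, fake_analyze)
out2 = extract_body("http://news.example.com/a/2.html", rows, store, fake_analyze)
```

The second call finds the template under the same domain fingerprint, so the analysis callback runs only once.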

The present invention is further elaborated by the following examples:

The experimental data of this embodiment use the full-web news data provided by Sogou Labs. Because those data were released in 2012 and a large proportion of the linked pages can no longer be opened, a number of well-known domestic websites were also crawled to obtain more recent pages. The final data set totals 20 websites with 200 pages each. Since older pages are very likely to have been deleted by website operators, the explanation below mainly uses the newly collected pages, so that third parties can verify the method as far as possible.

So that the collected pages cover as many channels or columns of a website as possible, the self-collection proceeded as follows: the link URLs on the website's home page were extracted first; index-type pages, video-type pages and the like were removed; and the remaining links were sampled at random by third-level domain name, that is, randomly while covering as many third-level domain names as possible.

The experimental parameters were set as follows:

(1) The division mark for web page elementization is the newline character.

(2) Similar URL count s=30.

(3) Weight adjustment factor α=0.8.

(4) Candidate link count c=2.

In addition, for comparison with the density-based method (Content Extraction based on Density, abbreviated CED), an extraction experiment was run on the same data using the density method and parameter settings of reference 29.

The experimental results on the above data for the method of the present invention and for the density-based extraction method are shown in Table 1 below, where P denotes precision and R denotes recall.

Table 1: Experimental data and results

In this embodiment, a single web page is evaluated using the conventional precision (P_i), recall (R_i), and F1 value.

They are defined respectively as:

where i denotes the serial number of a web page in the data set. To distinguish the three single-page indexes from those of the whole data set below, i is used as a subscript or superscript on the single-page indexes. The definitions are stated in terms of the manually extracted body text of the i-th page and the automatically extracted body text of the i-th page.

The extraction result on a data set is likewise evaluated with precision P, recall R, and the F₁ value, except that here P and R are the arithmetic means of the per-page precision and recall over the data set. They are defined as follows:

where n denotes the size of the data set, that is, the number of web pages in it, and i denotes the i-th page in the data set.
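For illustration, the evaluation can be sketched in Python as below. The defining expressions are not reproduced in this text, so the sketch makes two labeled assumptions: the overlap between the manual and automatic extractions is measured as a character multiset intersection, and the data-set F₁ is computed from the averaged P and R. The names `page_scores` and `dataset_scores` are hypothetical.

```python
from collections import Counter


def page_scores(manual, auto):
    """Per-page (P_i, R_i, F1_i); overlap is an ASSUMED character multiset intersection."""
    overlap = sum((Counter(manual) & Counter(auto)).values())
    p = overlap / len(auto) if auto else 0.0
    r = overlap / len(manual) if manual else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def dataset_scores(pairs):
    """Data-set P and R as arithmetic means; F1 from the averaged P and R (assumption)."""
    per_page = [page_scores(manual, auto) for manual, auto in pairs]
    n = len(per_page)
    p = sum(s[0] for s in per_page) / n
    r = sum(s[1] for s in per_page) / n
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


single = page_scores("abcd", "abcdef")
overall = dataset_scores([("abcd", "abcd"), ("abcd", "abcdef")])
```

In the single-page example the automatic extraction contains everything in the manual text plus two extra characters, so recall is 1 and precision is 4/6.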

Table 1 above shows that the present invention achieves an average precision of 95.02% and a recall of 98.23%. Recall is particularly strong: nearly half of the websites reach 100%. On the averages of all three evaluation indexes, the invention is slightly better than the density-based method (CED).

By website, the present invention performs very well on sites such as new civilian network, www.ynet.com, Wenzhou net, and World Wide Web, while its performance on sites such as www.qq.com, www.xinhuanet.com, and China's net needs further improvement. Overall, however, on most websites the present invention outperforms the density-based method.

In general, the present invention performs well on the test data set and requires no manual participation in the extraction process. Analysis of the extraction results for specific pages shows that it compares with the text-density-based method mainly in the following respects.

(1) On pages with extremely short body text, the present invention performs very well by comparison; this is exactly the weak point of conventional text-density-based extraction. Text-density-based methods are easily affected by non-body plain text, and their error rate is higher on pages with short body text, for example http://ent.163.com/17/1110/13/D2SQ8TG500038FO9.html.

(2) On pages with longer body text, the two methods perform about equally.

(3) When the page to be extracted contains no related links, the present invention cannot extract normally, for example http://sports.xinhuanet.com/c/2017-11/14/c_1121950475.htm.

(4) Because body text extraction in the present invention is based on analyzing the related links in the page being processed, it depends on network conditions: extraction cannot be performed without a network connection, and it is slow when the network is slow.

The experimental parameters are discussed below:

(1) Similar URL count s;

In principle, the more similar URLs the better, but more similar URLs mean more subsequent computation. Because the similar URLs have already been screened by URL similarity, and low-similarity links do not help body text extraction, the value need not be large. It should not be too small either, or the subsequent selection of candidate links will suffer.

(2) Weight adjustment factor α;

The weight adjustment factor adjusts the contributions of URL similarity and fuzzy text length when selecting candidate links. Fuzzy text region recognition is worth performing only for URLs of high similarity; low-similarity URLs do not help body text extraction. Therefore, when scoring similar links, the weight adjustment factor should generally favor URL similarity, that is, α>0.5. In addition, because computing URL similarity is cheaper than computing the fuzzy text region, URL similarity is computed first and the fuzzy text region afterwards.

(3) Candidate link count c;

In theory, the larger the candidate link count c, the more general the extracted template is within the domain and the more accurate the result should be. In practice this does not necessarily hold: the larger c is, the more links are included in the analysis and the more likely it is that adverse factors are encountered. For example, some pages belong to the same domain but use different templates; or the page contains too few suitable links, so that links of low similarity have to be included to reach the candidate link count c. Either case can cause template extraction to fail, or make the template so broad that the extracted text has a "tail". The parameter should therefore not be too large; experience shows that it is generally set to 2 or 3.

During the extraction of the web page template and body text, the present invention relies only on the current web page itself and requires no heuristic rules, which gives the method good generality. Template extraction requires no manual intervention and is highly automated. The analysis is simple and does not even require tag parsing of the page, so it is fast, resistant to interference, and well suited to web pages with non-standard markup. It also extracts well from pages with very short body text. Finally, the extracted template is simple in form and easy to use. The method therefore has potential application value in web page body text extraction: it can be used for news, blogs, and other web pages with a similar structure, and it also has broad application prospects in other web information processing and mining fields that do not require fine-grained link blocks.

It should be understood that the part that this specification does not elaborate belongs to the prior art.

It should be understood that the above description of the preferred embodiment is relatively detailed and should not therefore be regarded as limiting the scope of patent protection of the present invention. Those skilled in the art may, under the teaching of the present invention and without departing from the scope protected by its claims, make substitutions or variations, all of which fall within the scope of protection of the present invention. The scope of protection claimed shall be determined by the appended claims.

Claims

1. A web page body text extraction method based on logical link blocks, characterized by comprising the following steps:

Step 1: Generate a web page body text extraction template;

Step 1.1: Input the URL, URL₀, for which a template is to be generated;

Step 1.2: Obtain the source code HTML₀ of the web page corresponding to URL₀ and extract all same-domain URLs from it, denoted URLList;

Step 1.3: Using the URL similarity rule RuleURL, select from URLList the s URLs most similar to URL₀ to form the similar URL list URL_s;

Step 1.4: Obtain the source code of each URL in the similar URL list URL_s, denoted HTML₁, HTML₂, …, HTML_s, which constitute HTMLList₀;

Step 1.5: Identify and remove the logical link blocks from all source code in HTMLList₀ to form a new source code list, denoted HTMLList, and perform web page elementization on each web page in it; for compressed web pages, perform web page atomization on them;

Step 1.6: Using the fuzzy text region recognition rule RuleText, identify the fuzzy text region of each web page in HTMLList, a₁, a₂, …, a_s, denoted A={a₁, a₂, …, a_s}; then obtain the text length of each fuzzy text region, lt₁, lt₂, …, lt_s, denoted LT={lt₁, lt₂, …, lt_s};

Step 1.7: Using the similar URL scoring rule RuleScore, rank the similar URL list URL_s and take the top c URLs to form the candidate links URL_c; if c>s, c takes the same value as s;

Step 1.8: From the fuzzy text regions A and the elementized pages of the candidate links URL_c, compute the intersection of the elements before the fuzzy text region across the pages and the intersection of the elements after it, E_{Itrs, First} and E_{Itrs, Last};

Step 1.9: Determine the template header element E_First according to the header element decision rule RuleFirst, and the template tail element E_Last according to the tail element decision rule RuleLast;

Step 1.10: Select a suitable algorithm as required to generate the fingerprint URLFinger of URL₀, thereby determining the body text extraction template (URLFinger, E_First, E_Last) of the web page corresponding to URL₀, and store the extracted template;

2. The web page body text extraction method based on logical link blocks according to claim 1, characterized in that the URL similarity rule RuleURL in step 1.3 is implemented as follows:

Let the URL of the web page currently being processed be the reference URL, denoted url, and let the same-domain URLs extracted from that page be URL={url₁, url₂, …, url_d}, where d denotes the number of same-domain URLs in the page corresponding to url; the similarity between the i-th URL and the reference URL url is then:

3. The web page body text extraction method based on logical link blocks according to claim 1, characterized in that the web page elementization or atomization performed on each web page in step 1.5 is implemented as follows:

Step 1.5.1: Web page atomization;

The web page code is converted into an HTML component representation H={e₁, e₂, …, e_f}, where H denotes the web page, f=Card(H) is the total number of components in the component set, and Card() returns the number of elements in a set; an HTML component is a basic building block of HTML and is either a tag component or a content component; a component is also called an atom and is denoted e_i, where e denotes a component and i its serial number;

Step 1.5.2: Web page elementization;

The web page code is converted into a composite component representation H={E₁, E₂, …, E_m}, where H denotes the web page and m is the number of composite components obtained by dividing the HTML code according to the division rule; a composite component, also called an element, is formed from several adjacent components and is denoted E_i={e_p, …, e_q}, q≥p; E_i denotes the i-th composite component after elementization; when p=q the composite component degenerates into a component; when p=1 and q=f, the composite component is the entire web page.

4. The web page body text extraction method based on logical link blocks according to claim 3, characterized in that, in the division rule of step 1.5.2, the newline character is used as the division mark for web page elementization, that is, the newlines in the HTML code elementize the page, and each line of the HTML is then called a row element.

5. The web page body text extraction method based on logical link blocks according to claim 3, characterized in that, under the fuzzy text region recognition rule RuleText of step 1.6, the fuzzy text region is:

$$a = \arg\max_{1 \le i \le m} \mathrm{Len}(t_i)$$

where t_i denotes the text extracted from E_i and Len() returns the length of a string; the fuzzy text region is expressed as the serial number of the component row containing the longest text; the fuzzy text length is the length of the text in the fuzzy text region, denoted lt=Len(t_a).

6. The web page body text extraction method based on logical link blocks according to claim 5, characterized in that the similar URL scoring rule RuleScore in step 1.7 is implemented as follows:

Let the URL of the page to be extracted be url, and let URL_s be the similar URL list selected from the current page using the URL similarity rule and the similar URL count s; the longest common substrings of these URLs with url are LCS={lcs₁, lcs₂, …, lcs_s}, with lengths L={l₁, l₂, …, l_s}, that is, l_i=Len(lcs_i), where lcs_i denotes the i-th longest common substring and l_i its length; the fuzzy text lengths of the pages are LT={lt₁, lt₂, …, lt_s};

The score of the i-th URL url_i is then:

where α∈[0,1] is the weight adjustment factor, which adjusts the relative contributions of URL similarity and fuzzy text length to the URL score.

7. The web page body text extraction method based on logical link blocks according to claim 5, characterized in that, in step 1.9, the header element is the composite component in E_{Itrs, First} that is unique and nearest to the fuzzy text region a, and the tail element is the composite component in E_{Itrs, Last} that is unique and nearest to the fuzzy text region a;

The header and tail elements are identified from the pages corresponding to the c candidate links, denoted H₁, H₂, …, H_c; the elementized or atomized representation of the i-th page contains i_m=Card(H_i) composite components, the fuzzy text region of the i-th page is a_i, and a_i divides H_i into two parts, the elements before a_i and the elements after a_i;

where i denotes the serial number of a web page in the data set, and u=Card(E_{Itrs, First}) and v=Card(E_{Itrs, Last}) are the numbers of elements in the intersections of the composite component sets before and after the fuzzy text region, respectively; the elements of each intersection keep their original order; Itrs is an abbreviation of intersection; First indicates that the elements lie before the fuzzy text region and are used to compute the header element; Last indicates that the elements lie after the fuzzy text region and are used to compute the tail element;

The header element E_First of the body text region is:

E_First=E_{Jmax, First}

where j denotes the serial number of a composite component and Count(i, E_{J, First}) denotes the number of occurrences of the component E_{J, First} in the i-th page, which must be greater than 0; the expression shows that the fewer times E_{J, First} occurs (the smaller the denominator) and the closer it is to the fuzzy text region, the larger the value; this header element decision rule yields the composite component that is highly unique within the page and as close as possible to the body text region: the header element;

The tail element E_Last of the body text region is:

E_Last=E_{Jmin, Last}

where j denotes the serial number of a composite component and Count(i, E_{J, Last}) denotes the number of occurrences of the component E_{J, Last} in the i-th page, which must be greater than 0; because the tail element lies after the body text region, a smaller j means the component is closer to the body text; this tail element decision rule yields the composite component that is highly unique within the page and as close as possible to the body text region: the tail element.

8. The web page body text extraction method based on logical link blocks according to any one of claims 1 to 7, characterized in that step 2 comprises the following sub-steps:

Step 2.1: Input the URL, URL₀, of the page whose body text is to be extracted;

Step 2.2: Compute the fingerprint of URL₀ using the fingerprint algorithm;

Step 2.3: Determine whether the fingerprint of URL₀ already exists;

If not, go to step 2.4;

If so, go to step 2.6;

Step 2.4: Analyze the page to obtain a body text extraction template;

Step 2.5: Determine whether a body text extraction template was obtained;

If so, go to step 2.7;

If not, output failure information and end the process;

Step 2.6: Return the body text extraction template corresponding to the fingerprint of URL₀;

Step 2.8: Determine whether body text was extracted;

If so, output the extracted body text and end the process;

If not, output failure information and end the process.