CN104598577B

CN104598577B - Method for extracting web page body text

Info

Publication number: CN104598577B
Application number: CN201510017223.3A
Authority: CN
Inventors: 汤奇峰; 刘作涛
Original assignee: ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Current assignee: ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Priority date: 2015-01-14
Filing date: 2015-01-14
Publication date: 2017-09-15
Anticipated expiration: 2035-01-14
Also published as: CN104598577A

Abstract

A method for extracting web page body text, comprising the following steps: step one, extracting the web page title with a regular expression; step two, preprocessing the web page; step three, dynamically dividing the text into text blocks; step four, scoring the text blocks and selecting the best text block; step five, iteratively expanding the text block. The method extracts quickly, works well on many kinds of web pages, including news portals, personal blogs, and forum communities, and offers high accuracy and good robustness.

Description

Method for extracting web page body text

Technical field

The present invention relates to the field of information processing, and in particular to a general and efficient method for extracting web page body text.

Technical background

Web page body text extraction refers to methods by which a computer system identifies structured text information, such as the title and body text, in Internet web pages that are unstructured and vary in content and layout. For web information retrieval systems, and search engines in particular, body text extraction is a very important basic module. If extraction is inaccurate, for example if large parts of the body are omitted or non-body content is identified as body text, subsequent matching against query terms cannot be guaranteed to be accurate, and it becomes difficult to meet users' needs.

In recent years, chat and sharing platforms such as WeChat and personalized reading applications have become popular. These applications need to extract the body text of third-party web pages so that the content can be adapted to screens of various sizes, improving the user experience. In addition, with the development of technologies such as precision advertising, the demand for big-data text mining keeps growing, and a prerequisite for such mining is the ability to identify the body text of the web pages behind the massive number of URLs on the Internet. These pages come not only from news portals but also from encyclopedia sites, personal blogs, question-and-answer communities, and so on; the data volume is large and the content is wide-ranging.

These needs all show the importance of web page body text extraction, and they also require the extraction method to be fast, accurate, and general. Existing web page extraction methods can be summarized as follows:

(1) Rule-based methods. Extraction rules, such as regular expressions or XPath, are written by hand for specific websites. The advantage is high accuracy. The disadvantages are that only pages from fixed websites or with fixed formats can be parsed, that writing the rules is time-consuming and laborious, and that once the page layout changes, the breakage is hard to detect and the rules are hard to maintain.

(2) Parsing the HTML DOM (Document Object Model) tree. A DOM tree is built for the HTML page and traversed; non-body information is identified and removed, and the body text is extracted according to rules such as page layout and text density. The biggest drawback is that building the DOM tree is slow, so extraction efficiency is very low. Moreover, as web authoring techniques evolve, the HTML of many websites has become increasingly complex and increasingly non-standard; this approach is not only slow but also has some probability of failing to build the DOM tree at all, which causes extraction to fail.

(3) Methods based on implicit annotations and visual information in the HTML (such as text color and font). These methods are not general enough and still require many hand-written rules. They usually work well only on news portals and the like, and have a low success rate on pages whose style varies widely, such as personal blogs.

Summary of the invention

To overcome the deficiencies of the prior art, the present invention provides a general and efficient method for extracting web page body text. It can process web pages and extract body text efficiently, and is well suited to tasks such as search engines and big-data text mining.

The object of the present invention is achieved through the following technical solution:

A method for extracting web page body text, comprising the following steps:

Step one: extract the web page title with a regular expression.

Step two: preprocess the web page. Irrelevant characters and web page tags are removed to obtain the continuous text fragments in the page. Irrelevant characters include, but are not limited to, page scripts, CSS, and comments. Removing web page tags specifically includes: (1) removing the original line breaks in the page; (2) removing the tags in the page and replacing each with a new line break.

Step three: dynamically divide the text into text blocks. (1) Count all blank lines in the text fragments obtained and, for each blank line, determine how many consecutive blank lines it belongs to, which yields an array; (2) sort the array in ascending order and select the value at its one-fifth smallest position as the threshold; (3) split the text fragments using this threshold as the number of consecutive blank lines, obtaining multiple dynamic text blocks.

Step four: score the text blocks and select the best text block. Specifically, for each text block, count its text length, number of hyperlinks, number of punctuation marks, number of stop words, similarity to the title, and relative position within the whole text, denoted respectively text_density, link_density, punctuation_density, stopword_density, title_match_rate, and position_rate. Each text block is scored; the higher the score, the more likely the block is body text, and the best text block is selected accordingly.

The scoring formula for a text block is:

y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density)

Step five: iteratively expand the text block. Starting from the best text block obtained, merge the text block immediately before or after it and recompute the score. If the score rises, the merged text block becomes the new best text block; if the score falls, the merge is discarded. This is repeated until the score can no longer be raised. The text block finally obtained is the body text to be extracted.

In summary, by adopting the above technical solution, the present invention has the following beneficial effects compared with the prior art:

(1) Efficient. The extraction method of the present invention needs neither to build a DOM tree nor to recursively traverse a tree structure. A few scans of the HTML are enough to complete preprocessing and obtain the text fragments, so extraction is fast.

(2) General. The present invention does not depend on a specific page layout and does not need to make extensive assumptions about web page tags, comments, and the like, so it generalizes well. Experiments show good extraction results on a variety of web pages, including news portals, personal blogs, and forum communities.

(3) Accurate. Because the extraction method scores the text blocks, it can distinguish and filter out non-body content well, and the strategy of iteratively merging outward from the best text block minimizes the omission of high-quality body text.

(4) Robust. The extraction method of the present invention makes no assumption about the correctness of the HTML, so body text can be extracted even from non-standard HTML.

Brief description of the drawings

Fig. 1 is a flow chart of the web page body text extraction method of the present invention.

Embodiment

Embodiments of the present invention are described in detail below with reference to the accompanying drawing and examples.

Referring to Fig. 1, the web page body text extraction method of the present invention is divided into two main stages: first, text block separation; second, statistical filtering and merging. The purpose of text block separation is to separate body segments from non-body segments; the purpose of statistical filtering and merging is to screen the body segments. Text block separation comprises three steps: extracting the web page title with a regular expression, preprocessing the web page, and dynamically dividing the text into text blocks. Statistical filtering and merging comprises two steps: selecting the best text block and iteratively expanding the text block.

First, the web page title is extracted from the original page in preparation for the later scoring of text blocks. The title can be extracted with general-purpose regular expressions, for example the following extraction rules in Python:

_title = re.compile(r'<title>(.*)</title>', re.I | re.S)

and

_title = re.compile(r'<h1>(.*?)</h1>', re.I | re.S)
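The two rules quoted above can be combined into a small helper, sketched here in Python. This is an illustration under stated assumptions, not the patent's implementation: the function name `extract_title`, the fallback order (`<title>` first, then `<h1>`), the non-greedy `(.*?)` in the `<title>` rule, and the `[^>]*` allowance for tag attributes are all choices made for this sketch.

```python
import re

# Patterns modeled on the two rules quoted in the text. The non-greedy match in
# the <title> rule and the [^>]* attribute allowance are assumptions of this sketch.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.I | re.S)
H1_RE = re.compile(r'<h1[^>]*>(.*?)</h1>', re.I | re.S)
TAG_RE = re.compile(r'<[^>]+>')


def extract_title(html: str) -> str:
    """Return the first <title> text, falling back to the first <h1>."""
    for pattern in (TITLE_RE, H1_RE):
        match = pattern.search(html)
        if match:
            # Drop tags nested inside the title and collapse whitespace.
            text = TAG_RE.sub('', match.group(1))
            text = ' '.join(text.split())
            if text:
                return text
    return ''
```

A non-greedy match keeps the capture from running on to a later `</title>`, which can occur when a page embeds SVG elements that carry their own titles.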

After the title is extracted, the web page is preprocessed. The purpose of preprocessing is to remove irrelevant characters and web page tags and obtain the continuous text fragments in the page. Irrelevant characters include, but are not limited to, page scripts, CSS, and comments; these characters are not part of the body text and do not help extraction. Because the HTML of different websites varies widely, with some pages containing line breaks and others not, the original line breaks in the page are removed first, before text fragments are extracted. To separate the text of the page and distinguish body from non-body content, the tags in the page are also removed, and each removed tag is replaced with a new line break. After this step, line-by-line continuous text fragments are obtained; the fragments differ in size and are separated by gaps of different lengths.
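The preprocessing described above might be sketched as follows. The regular expressions and the name `preprocess` are assumptions of this sketch; the patent does not give the exact patterns. Note that deleting the original line breaks outright, as the text describes, joins words that were separated only by a line break, which matters more for space-delimited languages than for Chinese.

```python
import re
from typing import List

# Script, style, and comment blocks are removed together with their contents.
# These patterns are assumptions of this sketch, not taken from the patent.
NOISE_RE = re.compile(
    r'<script\b.*?</script>|<style\b.*?</style>|<!--.*?-->',
    re.I | re.S,
)
TAG_RE = re.compile(r'<[^>]+>')


def preprocess(html: str) -> List[str]:
    """Return one entry per output line; empty strings mark blank lines."""
    html = NOISE_RE.sub('', html)
    # (1) Remove the page's own line breaks so they carry no layout meaning.
    html = html.replace('\r', '').replace('\n', '')
    # (2) Replace every remaining tag with a fresh line break.
    text = TAG_RE.sub('\n', html)
    # Whitespace-only lines are normalized to empty strings.
    return [line.strip() for line in text.split('\n')]
```

Because every tag becomes a line break, deeply nested markup around a fragment produces long runs of blank lines, which is the signal the next step relies on.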

Next, the text is dynamically divided into text blocks. The purpose is to aggregate text fragments that lie close together. However, because pages differ in structure, the sparseness of the text fragments produced by preprocessing varies greatly. To solve this problem, a dynamic threshold is set based on how sparse the text in the page is, and the page is divided into multiple text blocks using this threshold. Specifically: first count all blank lines in the text fragments obtained and, for each blank line, determine how many consecutive blank lines it belongs to, which yields an array such as [1, 1, 3, 3, 1, 1, 5, 1, 1]; then sort the array in ascending order and select the value at its one-fifth smallest position as the threshold; finally, split the text fragments using this threshold as the number of consecutive blank lines, obtaining multiple dynamic text blocks.
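A minimal Python sketch of this division step follows. Two points are interpretations rather than statements from the patent: the threshold is read as the element at index `len // 5` of the sorted array, and a new block is started only where a run of blank lines is strictly longer than the threshold. The function names are hypothetical.

```python
from typing import List


def blank_runs(lines: List[str]) -> List[int]:
    """Length of each run of blank lines lying between two text fragments."""
    runs, current, seen_text = [], 0, False
    for line in lines:
        if line.strip():
            if seen_text and current:
                runs.append(current)
            seen_text, current = True, 0
        elif seen_text:
            current += 1
    return runs


def dynamic_threshold(runs: List[int]) -> int:
    """Value at the one-fifth position of the ascending-sorted run lengths."""
    if not runs:
        return 0
    ordered = sorted(runs)
    return ordered[len(ordered) // 5]  # index choice is an assumption


def split_blocks(lines: List[str]) -> List[List[str]]:
    """Group fragments into blocks; a gap longer than the threshold splits."""
    threshold = dynamic_threshold(blank_runs(lines))
    blocks, current, gap = [], [], 0
    for line in lines:
        if not line.strip():
            gap += 1
            continue
        # Strict comparison is an assumption; the text does not say > or >=.
        if current and gap > threshold:
            blocks.append(current)
            current = []
        current.append(line.strip())
        gap = 0
    if current:
        blocks.append(current)
    return blocks
```

For the example array [1, 1, 3, 3, 1, 1, 5, 1, 1], the sorted array is [1, 1, 1, 1, 1, 1, 3, 3, 5] and this reading gives a threshold of 1, so under the strict comparison only the gaps of 3 and 5 blank lines separate blocks.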

After text block separation is complete, the text blocks are scored and the best text block is selected. Specifically: for each text block, count its text length, number of hyperlinks, number of punctuation marks, number of stop words (words that frequently appear in non-body regions, such as "advertisement", "about us", and "friendly links"), similarity to the title, and relative position within the whole text (for example, the first text block has position 1 and the last has position 0). These are denoted respectively text_density, link_density, punctuation_density, stopword_density, title_match_rate, and position_rate. Each text block is scored, and the best text block is selected accordingly.

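The scoring formula for a text block is:

y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density)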
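The scoring formula given in the summary can be written as a small Python function, sketched below. The `BlockFeatures` container and the `smooth` parameter are assumptions of this sketch: the patent does not say how a block with zero hyperlinks or zero stop words is handled, so a constant is added to both denominators to avoid division by zero. With `smooth=0` the function reduces to the formula as stated.

```python
import math
from dataclasses import dataclass


@dataclass
class BlockFeatures:
    """Per-block statistics, named as in the patent text."""
    text_density: float         # text length of the block
    link_density: float         # number of hyperlinks
    punctuation_density: float  # number of punctuation marks
    stopword_density: float     # number of stop words
    title_match_rate: float     # similarity to the page title
    position_rate: float        # 1 for the first block, 0 for the last


def score(f: BlockFeatures, smooth: float = 1.0) -> float:
    """Score a block; `smooth` is an assumed guard against zero denominators."""
    return (
        f.text_density
        * f.position_rate
        * (1 + f.title_match_rate)
        * math.sqrt(f.punctuation_density)
        / (f.link_density + smooth)
        / math.sqrt(f.stopword_density + smooth)
    )
```

The formula rewards long, punctuated text near the top of the page that resembles the title, and penalizes link-heavy blocks and blocks full of navigation-style stop words. A block with no punctuation at all scores zero.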

Starting from the best text block obtained, merge the text block immediately before or after it and recompute the score. If the score rises, the merged text block becomes the new best text block; if the score falls, the merge is discarded. This is repeated until the score can no longer be raised; the text block finally obtained is the body text to be extracted. The number of iterations is finite and does not exceed the number of text blocks. In this way, as much high-quality body text as possible is merged, reducing omissions.

The embodiments described above merely illustrate the technical idea and features of this patent. Their purpose is to enable those skilled in the art to understand the content of this patent and implement it accordingly; they do not limit the scope of this patent. Any equivalent change or modification made in accordance with the spirit disclosed in this patent still falls within its scope.
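The expansion loop can be sketched in Python as a greedy search over neighboring blocks. The function name `expand_best_block`, the order in which the two neighbors are tried (preceding block first), and the use of a strict comparison for "the score rises" are assumptions of this sketch; `score_fn` stands for any function that scores a merged block of text.

```python
from typing import Callable, Sequence, Tuple


def expand_best_block(
    blocks: Sequence[str],
    best: int,
    score_fn: Callable[[str], float],
) -> Tuple[int, int]:
    """Return the inclusive (start, end) block range after greedy expansion."""
    start = end = best
    current = score_fn(blocks[best])
    improved = True
    while improved:
        improved = False
        # Try merging the block immediately before the current range.
        if start > 0:
            candidate = score_fn('\n'.join(blocks[start - 1:end + 1]))
            if candidate > current:
                start, current, improved = start - 1, candidate, True
        # Then try merging the block immediately after it.
        if end < len(blocks) - 1:
            candidate = score_fn('\n'.join(blocks[start:end + 2]))
            if candidate > current:
                end, current, improved = end + 1, candidate, True
    return start, end
```

Each pass of the loop either grows the range by at least one block or stops, so the number of passes cannot exceed the number of blocks, matching the bound stated in the text.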

Claims

1. A method for extracting web page body text, characterized by comprising the following steps:

Step one: extracting the web page title with a regular expression;

Step two: preprocessing the web page to obtain the continuous text fragments in the page;

Step three: dynamically dividing the text into text blocks;

Step four: scoring each text block according to predefined rules, based on its text length, number of hyperlinks, number of punctuation marks, number of stop words, similarity to the title, and relative position within the whole text, and selecting the best text block;

Step five: iteratively expanding the best text block according to predefined rules, the expanded best text block being the body text.

2. The method for extracting web page body text according to claim 1, characterized in that step two comprises: removing irrelevant characters and web page tags to obtain the continuous text fragments in the page;

Removing web page tags specifically includes: (1) removing the original line breaks in the page; (2) removing the tags in the page and replacing each with a new line break;

The irrelevant characters include, but are not limited to, page scripts, CSS, and comments.

3. The method for extracting web page body text according to claim 1 or 2, characterized in that step three comprises:

(1) counting all blank lines in the text fragments obtained and, for each blank line, determining how many consecutive blank lines it belongs to, which yields an array;

(2) sorting the array in ascending order and selecting the value at its one-fifth smallest position as the threshold;

(3) splitting the text fragments using this threshold as the number of consecutive blank lines, obtaining multiple dynamic text blocks.

4. The method for extracting web page body text according to claim 1, characterized in that, in step four, the text length, number of hyperlinks, number of punctuation marks, number of stop words, similarity to the title, and relative position within the whole text of each text block are denoted respectively text_density, link_density, punctuation_density, stopword_density, title_match_rate, and position_rate,

and the scoring formula for a text block is:

y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density)

5. The method for extracting web page body text according to claim 1, characterized in that step five comprises: starting from the best text block obtained, merging the text block immediately before or after it and recomputing the score; if the score rises, the merged text block becomes the new best text block; if the score falls, the merge is discarded; this is repeated until the score can no longer be raised, and the text block finally obtained is the body text to be extracted.