CN104598577A - Extraction method for webpage text

Info

Publication number: CN104598577A
Application number: CN201510017223.3A
Authority: CN
Inventors: 汤奇峰; 刘作涛
Original assignee: ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Current assignee: ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Priority date: 2015-01-14
Filing date: 2015-01-14
Publication date: 2015-05-06
Anticipated expiration: 2035-01-14
Also published as: CN104598577B

Abstract

The invention relates to a method for extracting the body text of a web page. The method comprises the following steps: 1) extracting the page title with a regular expression; 2) preprocessing the page; 3) dynamically dividing the text into blocks; 4) scoring the text blocks and selecting the best one; 5) iteratively expanding the selected text block. The method extracts text quickly, performs well on pages from news portals, personal blogs and community forums alike, and is both accurate and robust.

Description

Method for extracting web page body text

Technical field

The present invention relates to the field of information processing, and in particular to a general-purpose and efficient method for extracting the body text of web pages.

Technical background

Web page text extraction refers to methods by which a computer system identifies structured text information, such as the title and the body text, in Internet web pages that are unstructured and vary in content and layout. For web information retrieval systems, and search engines in particular, body text extraction is an essential basic module. If extraction is inaccurate, for example if large parts of the body are omitted or non-body content is identified as body text, the subsequent matching against query terms cannot be guaranteed to be precise, and users' needs are hard to meet.

In recent years, with the spread of chat and sharing platforms such as WeChat and of personalized reading applications, these applications have needed to extract the body text of third-party web pages so that page content can be adapted to screens of different sizes and the user experience improved. In addition, with the development of technologies such as precision advertising, demand for big-data text mining keeps growing, and a prerequisite for such mining is the ability to identify the body text of the pages behind the massive number of URLs on the Internet. These pages come not only from news portals but also from encyclopedia sites, personal blogs, question-and-answer communities and so on; the data volume is large and the content wide-ranging.

These demands all show how important web page text extraction is, and they also require an extraction method that is fast, accurate and general. Existing extraction methods can be summarized as follows:

(1) Rule-based methods. Extraction rules, such as regular expressions or XPath expressions, are written by hand for specific websites. The advantage is high accuracy. The drawback is that only pages from fixed sites or in fixed formats can be parsed, writing the rules is time-consuming and laborious, and once a page layout changes the breakage is hard to detect and the rules are hard to maintain.

(2) Parsing the DOM (Document Object Model) tree structure of the HTML. A DOM tree is built for the HTML page and traversed; non-body information is identified and removed, and the body text is extracted according to rules based on page layout, text density and the like. The biggest drawback is that building the DOM tree is slow, so extraction efficiency is very low. Moreover, as web authoring techniques evolve, the HTML of many sites grows ever more complex and less standards-compliant; this approach is not only slow but also has some probability of failing to build the DOM tree, in which case extraction fails.

(3) Methods based on implicit annotations and visual information in the HTML (such as text color and font). These methods are not general enough and still need many hand-written rules. They usually work well only on news portals and the like; on pages whose style varies widely, such as personal blogs, the success rate is lower.

Summary of the invention

To overcome the deficiencies of the prior art, the invention provides a general-purpose and efficient method for extracting web page body text. It processes web pages efficiently and extracts their body text, and is well suited to tasks such as search engines and big-data text mining.

The object of the invention is achieved through the following technical solution:

A method for extracting web page body text, comprising the following steps:

Step 1: extract the page title with a regular expression.

Step 2: preprocess the page, that is, remove irrelevant characters and HTML tags to obtain the continuous text fragments in the page. Irrelevant characters include, but are not limited to, page scripts, cascading style sheets (CSS) and comments. Tags are removed as follows: (1) remove the original line breaks in the page; (2) remove the tags in the page and replace each with a new line break.

Step 3: dynamically divide the text into blocks. (1) Count the blank lines in the text fragments obtained, recording for each run of blank lines how many consecutive blank lines it contains, which yields an array; (2) sort the array in ascending order and take the value at the one-fifth position of the sorted array (counting from the smallest) as the threshold; (3) split the text fragments using this threshold as the number of consecutive blank lines that separates blocks, obtaining multiple dynamic text blocks.

Step 4: score the text blocks and select the best one. Specifically, for each text block, count its text length, number of hyperlinks, number of punctuation marks and number of stop words, and determine its similarity to the title and its relative position among all the text; these are denoted text_density, link_density, punctuation_density, stopword_density, title_match_rate and position_rate respectively. Each text block is then scored; the higher the score, the more likely the block is body text, and the best text block is chosen accordingly.

The score of a text block is computed as:

y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density).

Step 5: iteratively expand the text block. Starting from the best text block obtained, merge it with the text block immediately before or after it and recompute the score. If the score rises, keep the merged block as the new best block; if the score falls, discard the merge. Repeat until the score can no longer rise; the text block finally obtained is the body text to be extracted.

In summary, by adopting the above technical solution, the invention has the following beneficial effects compared with the prior art:

(1) Efficient. The extraction method does not need to build a DOM tree or recurse over a tree structure; a few scans of the HTML suffice to complete preprocessing and obtain the text fragments, so extraction is very fast.

(2) General. The invention does not depend on any particular page layout and does not need to make many assumptions about HTML tags, comments and the like, so it is highly general. Experiments show good extraction results on pages from news portals, personal blogs and community forums alike.

(3) Accurate. Because the method scores text blocks, it can reliably distinguish and filter out non-body content, and the strategy of iteratively merging outward from the best text block minimizes the omission of high-quality text.

(4) Robust. The method makes no assumptions about the correctness of the HTML, so it can extract body text even from non-standard HTML.

Brief description of the drawings

Fig. 1 is a flow chart of the web page body text extraction method of the invention.

Detailed description

Specific embodiments of the invention are described in detail below with reference to the drawing and examples.

Referring to Fig. 1, the method is divided into two main stages: (1) text block separation; (2) statistical filtering and merging. The purpose of text block separation is to separate body segments from non-body segments; the purpose of statistical filtering and merging is to screen the body segments. Text block separation comprises three steps: extracting the page title with a regular expression, preprocessing the page, and dynamically dividing the text into blocks. Statistical filtering and merging comprises two steps: selecting the best text block and iteratively expanding it.

First, the page title is extracted from the original page in preparation for the later scoring of text blocks. The title is extracted with general-purpose regular expressions, for example the following extraction rules in Python:

_title = re.compile(r'<title>(.*)</title>', re.I|re.S) and,

_title = re.compile(r'<h1>(.*?)</h1>', re.I|re.S)
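A minimal sketch of how the two rules might be applied together is given here. The names `_TITLE_TAG`, `_H1_TAG` and `extract_title` are hypothetical, and the order of preference (`<title>` first, then `<h1>`) and the use of a non-greedy quantifier in both patterns are assumptions; the description only states that such expressions are used.

```python
import re

# The two rules from the description, kept under separate (hypothetical) names
# so that neither overwrites the other. Both use a non-greedy group here.
_TITLE_TAG = re.compile(r'<title>(.*?)</title>', re.I | re.S)
_H1_TAG = re.compile(r'<h1>(.*?)</h1>', re.I | re.S)


def extract_title(html):
    """Return the first <title> text, else the first <h1> text, else ''."""
    # Assumption: <title> is preferred and <h1> is only a fallback.
    for pattern in (_TITLE_TAG, _H1_TAG):
        match = pattern.search(html)
        if match:
            return match.group(1).strip()
    return ''
```

Note that patterns of this form only match tags written without attributes, so a heading such as `<h1 class="x">` would not be matched by the second rule.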

After the title has been extracted, the page is preprocessed. The purpose of preprocessing is to remove irrelevant characters and HTML tags and obtain the continuous text fragments in the page. Irrelevant characters include, but are not limited to, page scripts, cascading style sheets (CSS) and comments; these never appear in the body text and do not help its extraction. Because the HTML of different sites varies widely, some containing line breaks and some not, the original line breaks in the page are removed before text fragments are extracted. To separate the page text so that body and non-body text can be distinguished, the tags in the page are also removed, and each removed tag is replaced with a new line break. After this step the page has become a sequence of text fragments, one per line; the fragments differ in size, and the gaps between them differ as well.
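The preprocessing step can be sketched with a few regular-expression passes, as below. This is an illustrative sketch only: the function name `preprocess` and the four patterns are assumptions, and original line breaks are replaced with a space rather than deleted outright so that adjacent words in space-delimited languages are not glued together.

```python
import re

# Hypothetical patterns for the "irrelevant characters" named in the text.
_SCRIPT = re.compile(r'<script\b.*?</script\s*>', re.I | re.S)
_STYLE = re.compile(r'<style\b.*?</style\s*>', re.I | re.S)
_COMMENT = re.compile(r'<!--.*?-->', re.S)
_TAG = re.compile(r'<[^>]*>')


def preprocess(html):
    """Return the page as a list of lines; '' marks a blank line."""
    # Remove scripts, style sheets and comments together with their content.
    for pattern in (_SCRIPT, _STYLE, _COMMENT):
        html = pattern.sub('', html)
    # (1) Remove the original line breaks (assumption: replaced by a space).
    html = html.replace('\r', '').replace('\n', ' ')
    # (2) Replace every remaining tag with a new line break.
    text = _TAG.sub('\n', html)
    return [line.strip() for line in text.split('\n')]
```

With this representation, the number of blank lines between two fragments reflects how many tags separated them in the source, which is the signal the block division step relies on.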

Next, the text is dynamically divided into blocks. The purpose of this step is to group nearby text fragments together. Because different pages are structured differently, the sparseness of the text fragments produced by preprocessing varies greatly. To handle this, a dynamic threshold is set according to how sparse the text in the page is, and the page is divided into multiple text blocks using this threshold. Specifically: first, count the blank lines in the text fragments obtained, recording for each run of blank lines how many consecutive blank lines it contains, which yields an array such as [1, 1, 3, 3, 1, 1, 5, 1, 1]; then sort the array in ascending order and take the value at the one-fifth position of the sorted array (counting from the smallest) as the threshold; finally, split the text fragments using this threshold as the number of consecutive blank lines that separates blocks, obtaining multiple dynamic text blocks. For the example array, the sorted array is [1, 1, 1, 1, 1, 1, 3, 3, 5] and the threshold is 1.
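The division step might be sketched as follows. The helper names are hypothetical, and two details are interpretations rather than statements from the description: the threshold is taken at index `len // 5` of the sorted array, and a new block starts when a gap contains more blank lines than the threshold.

```python
def blank_runs(lines):
    """Lengths of each run of consecutive blank lines between text lines."""
    runs, current, seen_text = [], 0, False
    for line in lines:
        if line.strip():
            if seen_text and current:
                runs.append(current)
            seen_text, current = True, 0
        else:
            current += 1
    return runs


def pick_threshold(runs):
    """Value at the one-fifth position of the sorted runs (0 if none)."""
    if not runs:
        return 0
    ordered = sorted(runs)
    return ordered[len(ordered) // 5]  # assumption: floor of len/5


def split_blocks(lines):
    """Group text lines into blocks separated by gaps above the threshold."""
    threshold = pick_threshold(blank_runs(lines))
    blocks, current, gap = [], [], 0
    for line in lines:
        if not line.strip():
            gap += 1
            continue
        # Assumption: a gap strictly longer than the threshold splits blocks.
        if current and gap > threshold:
            blocks.append(current)
            current = []
        current.append(line.strip())
        gap = 0
    if current:
        blocks.append(current)
    return blocks
```

For the example array in the text, `pick_threshold([1, 1, 3, 3, 1, 1, 5, 1, 1])` returns 1, so under this interpretation only the gaps of 3 and 5 blank lines would start new blocks.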

Once the text blocks have been separated, they are scored and the best one is selected. Specifically, for each text block, count its text length, number of hyperlinks, number of punctuation marks and number of stop words (words that often appear in non-body regions, such as "advertisement", "about us" and "friendly links"), and determine its similarity to the title and its relative position among all the text (for example, the first text block has position 1 and the last has position 0). These are denoted text_density, link_density, punctuation_density, stopword_density, title_match_rate and position_rate respectively. Each text block is scored, and the best text block is chosen accordingly.

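The score of a text block is computed as:

y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density).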
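The scoring formula stated in the summary of the invention can be written as a small function. The `BlockFeatures` container and the `eps` smoothing term are assumptions added for illustration: the description does not say how a block with zero hyperlinks or zero stop words is handled, and without some smoothing the formula would divide by zero for exactly the blocks that look most like body text.

```python
import math
from dataclasses import dataclass


@dataclass
class BlockFeatures:
    """Per-block statistics, named as in the description."""
    text_density: float         # text length of the block
    link_density: float         # number of hyperlinks
    punctuation_density: float  # number of punctuation marks
    stopword_density: float     # number of stop words
    title_match_rate: float     # similarity to the page title
    position_rate: float        # 1 for the first block, 0 for the last


def score(f, eps=1.0):
    """Score a block; eps (an assumption) guards the two denominators."""
    return (f.text_density * f.position_rate * (1 + f.title_match_rate)
            * math.sqrt(f.punctuation_density)
            / (f.link_density + eps)
            / math.sqrt(f.stopword_density + eps))
```

A long, well-punctuated block with no links scores far higher than a short navigation block full of links, which is the behavior the formula is designed to produce.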

Starting from the best text block obtained, merge it with the text block immediately before or after it and recompute the score. If the score rises, keep the merged block as the new best block; if the score falls, discard the merge. Repeat until the score can no longer rise; the text block finally obtained is the body text to be extracted. The number of iterations is bounded and cannot exceed the number of text blocks. In this way as much high-quality text as possible is merged and omissions from the body text are reduced.
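The expansion loop can be sketched as a greedy search over a contiguous range of blocks. The function `expand` and its `score_fn` parameter are hypothetical: `score_fn` stands for whatever function scores a merged run of blocks, and trying the preceding block before the following one in each round is an assumption.

```python
def expand(blocks, best, score_fn):
    """Grow the range [lo, hi] around index `best` while the score rises.

    `score_fn` takes a list of adjacent blocks and returns a number.
    Returns the inclusive (lo, hi) indices of the final merged range.
    """
    lo = hi = best
    current = score_fn(blocks[lo:hi + 1])
    # The loop is bounded by the number of blocks, as the description requires.
    for _ in range(len(blocks)):
        improved = False
        if lo > 0:
            candidate = score_fn(blocks[lo - 1:hi + 1])
            if candidate > current:  # keep the merge only if the score rises
                lo, current, improved = lo - 1, candidate, True
        if hi < len(blocks) - 1:
            candidate = score_fn(blocks[lo:hi + 2])
            if candidate > current:
                hi, current, improved = hi + 1, candidate, True
        if not improved:
            break
    return lo, hi
```

Because every successful round adds at least one block, the loop necessarily stops after at most as many rounds as there are blocks.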

The above embodiment merely illustrates the technical ideas and features of this patent. Its purpose is to enable those skilled in the art to understand the content of this patent and implement it accordingly; it does not limit the scope of the patent. Any equivalent change or modification made in accordance with the spirit disclosed by this patent still falls within the scope of the patent.

Claims

1. A method for extracting web page body text, characterized by comprising the following steps:

Step 1: extracting the page title with a regular expression;

Step 2: preprocessing the page;

Step 3: dynamically dividing the text into blocks;

Step 4: scoring the text blocks and selecting the best text block;

Step 5: iteratively expanding the text block.

2. The method for extracting web page body text according to claim 1, characterized in that step 2 comprises: removing irrelevant characters and HTML tags to obtain the continuous text fragments in the page;

the tags are removed by: (1) removing the original line breaks in the page; (2) removing the tags in the page and replacing each with a new line break;

the irrelevant characters include, but are not limited to, page scripts, cascading style sheets (CSS) and comments.

3. The method for extracting web page body text according to claim 1 or 2, characterized in that step 3 comprises:

(1) counting the blank lines in the text fragments obtained, recording for each run of blank lines how many consecutive blank lines it contains, which yields an array;

(2) sorting the array in ascending order and taking the value at the one-fifth position of the sorted array (counting from the smallest) as the threshold;

(3) splitting the text fragments using this threshold as the number of consecutive blank lines that separates blocks, obtaining multiple dynamic text blocks.

4. The method for extracting web page body text according to claim 1, characterized in that step 4 comprises: for each text block, counting its text length, number of hyperlinks, number of punctuation marks and number of stop words, and determining its similarity to the title and its relative position among all the text, these being denoted text_density, link_density, punctuation_density, stopword_density, title_match_rate and position_rate respectively; scoring each text block; and choosing the best text block accordingly;

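the score of a text block is computed as:

y = text_density * position_rate * (1 + title_match_rate) * sqrt(punctuation_density) / link_density / sqrt(stopword_density).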

5. The method for extracting web page body text according to claim 1, characterized in that step 5 comprises: starting from the best text block obtained, merging it with the text block immediately before or after it and recomputing the score; if the score rises, keeping the merged block as the new best block; if the score falls, discarding the merge; and repeating until the score can no longer rise, the text block finally obtained being the body text to be extracted.