CN103699591A

CN103699591A - Page body extraction method based on sample page

Info

Publication number: CN103699591A
Application number: CN201310665878.2A
Authority: CN
Inventors: 兰秋军
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2013-12-11
Filing date: 2013-12-11
Publication date: 2014-04-02

Abstract

The invention discloses a method for extracting the body text of web pages by filtering out noise such as advertisements, irrelevant links, and pictures. The method extracts body text quickly and accurately by exploiting the similarity in structure and content among pages in the same column of a website. It comprises the following steps: the user specifies a sample page; the system performs preliminary filtering using page tags; the text similarity between the page to be processed and the sample page is calculated on the basis of edit distance; and the boundaries of the body text are identified so that clean body content can be extracted from page text that contains noise. The method is simple to operate, practical, fast, and accurate. The technical scheme can be widely applied in systems that acquire, search, reprint, mine, and analyze web page text.

Description

A web page body text extraction method based on a sample page

Technical field

The present invention relates to methods for extracting web page text, and in particular to techniques for extracting body text from web pages.

Background art

We live in an age of information explosion, and the ability to collect and understand information is key to the competitive advantage of many enterprises. The rapid spread of the Internet has made the network the main place where information is produced, propagated, and received. Even traditional newspapers, magazines, and books are being digitized and now appear as online text. Collecting network information quickly, and then retrieving, classifying, reprinting, storing, and mining it, has therefore become an urgent need.

HTML is a markup language for creating web pages. It is an international standard maintained by the World Wide Web Consortium (W3C). Any web page is in essence a descriptive text made up of HTML tags, which can describe text, graphics, animation, sound, tables, links, and so on. An HTML document has two main parts, the head (Head) and the body (Body): the head describes information needed by the browser, and the body contains the content to be presented. Although websites generate their pages with widely varying technologies, every page is delivered to the client browser as HTML, so the same kind of HTML text can be obtained from any site in a standard way.

Existing web text extraction generally relies on rule-based templates: for a specific website, the page structure is analyzed to determine the page tags that bound the body text. This approach has a serious drawback. Different websites are built with different technologies, and site pages are redesigned frequently, so the organization collecting the information must spend considerable manpower and resources maintaining the template rules. Even with professional tool support, defining template rules remains complicated; in particular, operators need a thorough grasp of web development technology to understand the page structure correctly and extract the text accurately.

Through a statistical study of the columns of more than one hundred well-known websites in China and abroad, the applicant found that nearly all text-oriented websites share the following features:

1. Within the same topic section, two pages published not too far apart in time (with no redesign in between) usually have the same page layout and carry similar advertisements, hot-topic navigation, and so on; the parts of the page text other than the body are therefore highly similar. By contrast, template-based body text extraction methods need a wrapper (Wrapper) to extract the useful information from a page. A wrapper is a program, written as a parser for one specific class of pages on the basis of their layout features, that locates the body text within the page. Such methods are simple to implement and extract text accurately, but they generalize poorly: each class of pages needs its own wrapper, which makes them unsuitable for large-scale extraction, and if a class of pages changes, its wrapper fails and must be rewritten.

2. The body text of a web page is contiguous within the page text; it is rarely interrupted by advertisements or navigation text inserted in the middle.

3. The body text of nearly all websites is enclosed in <p> or <div> HTML tags. A method that relies on this feature alone has the drawback that a page contains too many passages enclosed in <p> or <div> tags, so the computation is complicated and time-consuming; for a specific invention of this kind, see application No. 2010105533273, titled "A web page text extraction method and device based on DIV position".

Summary of the invention

The object of the present invention is to provide a web page body text extraction method that is simpler for the user to operate and that filters out the large amount of noise present in a page, such as advertisements and hot-topic links, so that the body text needed for collection, reprinting, and mining analysis can be obtained from web pages conveniently.

The present invention consists of the following steps: (1) a user or a program specifies a sample page; (2) preliminary filtering is performed using web page tags to obtain text that contains the body text; (3) an edit distance model is used to calculate the similarity between the text of the page to be extracted and the text of the sample page, and the boundaries of the body text are identified; (4) the body text is extracted.

Any sample page may be specified, and body text is then extracted from pages to be extracted that are similar to that sample page.

Preliminary filtering is applied to both the page to be extracted and the sample page to improve the accuracy and efficiency of the subsequent extraction. The filtering uses the innerHTML attribute, which conforms to the W3C standard, to obtain page content text that still contains HTML tags, and then removes the HTML tags with regular expressions. The two filtered texts are referred to as the similar texts.

This keeps the method general: the text obtained is independent of the development language used by the website and of the browser. For convenience of description below, the two filtered texts are called PageA and PageB.

Text similarity analysis is applied to the similar texts to identify the start and end boundaries of the body text. The similarity calculation is based on a similarity model, and two similarity sequences need to be computed. The steps are as follows:

(1) align the heads of the two similar text strings;

(2) starting from the head, take a substring of a specified length from each similar text, denoted s1 and s2;

(3) calculate the similarity value of s1 and s2;

(4) move one character toward the end and take the next substring of the specified length from each text, again denoted s1 and s2;

(5) return to step (2) and repeat the process until the end is reached; the similarity sequence obtained in this way is denoted FS.

Similarly,

(1) align the tails of the two similar texts;

(2) starting from the tail, take a substring of the specified length from each similar text, denoted s1 and s2;

(3) calculate the similarity value of s1 and s2;

(4) move one character toward the head and take the next substring of the specified length from each text, again denoted s1 and s2;

(5) return to step (2) and repeat the process until the head is reached; the similarity sequence obtained in this way is denoted BS.

Once the two similarity sequences have been obtained, the boundaries of the body text are identified from them. In each sequence the similarity values are high at the start and at the end and fall rapidly in the middle, producing a turning point.

This turning point corresponds to a boundary of the body text: the turning point of sequence FS corresponds to the head of the body text, and the turning point of sequence BS corresponds to its tail.

By working backward from the way the sequence values were obtained, the head and tail boundary positions of the body text within the similar texts can be determined, so that the body text can be extracted and the noise text filtered out.

The advantage of the invention is its generality: it overcomes the drawback that different classes of web pages need different wrappers to extract useful information. The invention is simple and easy to operate, fast, and accurate, with high extraction accuracy. Its technical scheme can be widely applied in systems for web page text acquisition, search, reprinting, and mining analysis.

Brief description of the drawings

Fig. 1 is the system architecture diagram.

Fig. 2 shows page HTML code and an example of the corresponding DOM tree topology.

Fig. 3 is a schematic of how the two page texts are divided.

Fig. 4 is a concrete example of two page texts.

Fig. 5 is the flow chart.

Detailed description

A feature of the invention is that the user does not need to understand the layout structure of the pages or the technology behind them. Simply specifying a sample page tells the computer how to extract the body text, so no complicated operations are needed to define and maintain page template rules. Operation is very simple; the user needs only the following two steps to start body text extraction:

(1) Add a new task to the task list

For each task, the user only needs to provide the following four items of information:

the URL of a sample page;

the URL change pattern of the pages to be collected;

the collection period;

the storage information for the body text.

(2) Start the system to perform automatic extraction (see 12 in Fig. 1)

The system automatically reads the task list and, for each task in turn, periodically fetches the web page text to be collected according to the given URL change pattern, extracts the body text, and saves or publishes it to the specified location.

Fig. 1 shows the structure of the invention, which mainly comprises the following functional modules: a task definition module 11, a page capture engine module 12, a body text extraction engine module 16, and a storage/publishing engine module 18. The task definition module 11 implements the user interface and adds the specified task information to the task list 13. The task list is a formatted text file that stores the information for each task; each task record includes the sample page URL, the URL change pattern of the pages to be collected, the collection period, the storage/publishing mode, and so on. The page capture engine 12 reads the data in the task list 13 and, according to the collection period of each task, accesses the Internet 14 on schedule to collect the corresponding page information, producing HTML text 15. The body text extraction engine 16 takes the HTML text 15 and the sample page text and, using the extraction technique of the invention, extracts the body text from the page and filters out noise such as advertisements and navigation, producing plain text 17. Finally, the storage/publishing engine 18 publishes the result 19 according to the storage/publishing mode specified in the task information.
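The task record described above could be sketched as follows. This is only an illustration under stated assumptions: the field names, the one-JSON-object-per-line file layout, and the `{n}` placeholder in the URL pattern are hypothetical, since the specification leaves the task list format to the developer.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Task:
    sample_url: str      # URL of the sample page
    url_pattern: str     # how target page URLs vary (hypothetical "{n}" template)
    period_minutes: int  # collection period
    storage: str         # where or how the extracted body text is stored/published


def dump_tasks(tasks: List[Task]) -> str:
    """Serialize tasks as one JSON object per line (an assumed layout)."""
    return "\n".join(json.dumps(asdict(task)) for task in tasks)


def load_tasks(text: str) -> List[Task]:
    """Parse the text produced by dump_tasks, skipping blank lines."""
    return [Task(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

A plain text format of this kind keeps the task list readable and editable by hand, which fits the stated goal of simple operation.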

In the functional structure shown in Fig. 1, the page capture engine 12 is implemented mainly with the Document Object Model (DOM). DOM is a cross-platform, language-independent standard for representing and manipulating HTML, XHTML, and XML objects (see http://www.w3.org/dom/). The structure of any page can be represented as a DOM tag tree; as shown in Fig. 2, the left side is the HTML code of a web page and the right side is its corresponding DOM tag tree. The DOM application programming interface (API) makes it easy to access and manipulate each DOM node object. For example, in the Microsoft .Net framework, the innerText attribute of the HtmlElement object represents all the text contained within the <html> </html> tags with the HTML tags removed (see http://msdn.microsoft.com). On this basis, the page text that contains the body text can be obtained while most multimedia noise such as pictures and video is excluded. Part of the remaining noise can then be filtered out with regular expressions without losing body content.

Preliminary filtering is applied to both the page to be extracted and the sample page to improve the accuracy and efficiency of the subsequent extraction. The filtering uses the innerHTML attribute, which conforms to the W3C standard, to obtain page content text that still contains HTML tags, and then removes the HTML tags with regular expressions. This keeps the method general: the text obtained is independent of the development language used by the website and of the browser. For convenience of description below, the two filtered texts, i.e. the similar pages, are called PageA and PageB; see Fig. 4.
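The regular-expression step of the preliminary filtering might look like the following minimal sketch. The function name `strip_tags` is hypothetical, and dropping script, style, and comment content as well as decoding character entities are assumptions added here; the specification only says that HTML tags are removed with regular expressions.

```python
import html
import re

# Assumed extras: drop script/style blocks and comments before removing tags.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_TAG = re.compile(r"<[^>]+>")
_SPACE = re.compile(r"\s+")


def strip_tags(inner_html: str) -> str:
    """Return the plain text of an innerHTML string with the markup removed."""
    text = _SCRIPT_STYLE.sub(" ", inner_html)
    text = _COMMENT.sub(" ", text)
    text = _TAG.sub(" ", text)       # remove the remaining HTML tags
    text = html.unescape(text)       # decode entities such as &amp;
    return _SPACE.sub(" ", text).strip()
```

Applying the same function to the sample page and to the page to be extracted would yield the two filtered texts PageA and PageB.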

The body text extraction engine is the core of the invention. Its technical starting point is the three general features of website page text identified in the background section of this specification. The basic principle is as follows. As shown in Fig. 3, let PageA be the text of the sample page after preliminary filtering and PageB the text of the page to be extracted after preliminary filtering. Because body text is contiguous, each of the two texts can be divided into three parts: the noise text before the body, the body text itself, and the noise text after the body. PageA thus consists of the leading noise, body text, and trailing noise of the sample page, and PageB consists of the corresponding three parts of the page to be extracted.

Fig. 4 gives a concrete example: the texts of two real pages after preliminary filtering. For reasons of space, "... .." replaces part of the text in the figure, and for clarity the body text is underlined. In PageA, for instance, "German Chancellor ... Reuters reported" is the body text, "British Broadcasting ... 2010 22:31 UK" is the leading noise of Fig. 3, and "Print Sponsor ... you are able to do so." is the trailing noise.

The sample page and the page to be extracted are very similar in structure and content: in general, the leading noise of PageA closely matches the leading noise of PageB, and the same holds for the trailing noise. The two body texts, by contrast, differ greatly, and their similarity is extremely low. The head and tail boundaries of the two body texts can therefore be located from the way the similarity between the two texts changes, and the body text of PageB can finally be extracted from PageB.

The task definition module 11 in Fig. 1 can be implemented in any common programming language, such as Visual Basic, C++, or C# under Microsoft .Net. The task list can be stored as a file, whose exact format the developer is free to define.

The page capture engine module 12 in Fig. 1 can collect page data and perform the preliminary text filtering conveniently using DOM and regular expressions. It is relatively easy to implement; the programming tools may be based on Microsoft .Net, on Java, or on any other development tool that supports DOM objects and regular expressions, chosen by the developer as needed.

For the body text extraction engine module in Fig. 1, the key to implementation is calculating the similarity between texts. This can be done using the Levenshtein edit distance, which is defined as follows:

EditDist("", "") = 0

EditDist(s, "") = EditDist("", s) = L

EditDist(s1+ch1, s2+ch2) = min{ EditDist(s1, s2) + (if ch1 = ch2 then 0 else 1),
                                EditDist(s1+ch1, s2) + 1,
                                EditDist(s1, s2+ch2) + 1 }

Here EditDist is the function that calculates the edit distance between two strings, "" denotes the empty string, and L denotes the length of string s. ch1 and ch2 each denote a single character, and min is the minimum function.
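The recursive definition of EditDist can be transcribed almost literally, as in the sketch below. The memoized index-based formulation is a choice made here for illustration; it is suitable for the short substrings used in window comparisons, while long strings would call for an iterative table to stay within Python's recursion limit.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Levenshtein edit distance, written to mirror the recursive definition."""

    @lru_cache(maxsize=None)
    def dist(i: int, j: int) -> int:
        # dist(i, j) is the edit distance between a[:i] and b[:j].
        if i == 0:
            return j  # EditDist("", s) = L
        if j == 0:
            return i  # EditDist(s, "") = L
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            dist(i - 1, j - 1) + cost,  # EditDist(s1, s2) + (0 or 1)
            dist(i, j - 1) + 1,         # EditDist(s1+ch1, s2) + 1
            dist(i - 1, j) + 1,         # EditDist(s1, s2+ch2) + 1
        )

    return dist(len(a), len(b))
```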

The similarity between strings s1 and s2 can be calculated from the edit distance with a normalized formula of the following form:

Sim(s1, s2) = 1 - EditDist(s1, s2) / max(L1, L2)

Here L1 and L2 denote the lengths of strings s1 and s2 respectively, and max is the maximum function.
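A similarity function of this kind can be sketched as follows, assuming the common normalized form 1 - EditDist(s1, s2) / max(L1, L2), which uses exactly the quantities named above (the edit distance, the two string lengths, and max). The names `levenshtein` and `similarity`, and the value returned for two empty strings, are choices made for this illustration.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Edit distance computed row by row, keeping only two rows in memory."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def similarity(s1: str, s2: str) -> float:
    """Assumed form: 1 - EditDist(s1, s2) / max(len(s1), len(s2))."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(s1, s2) / longest
```

With this normalization the value lies between 0 (nothing in common) and 1 (identical strings), which makes values from windows of different lengths comparable.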

The text similarity calculation may be based on similarity measures such as the edit distance model proposed by Levenshtein, the vector space model, or the cosine of the included angle; these three are the similarity models in common use at present. Two similarity sequences need to be computed, as follows:

(1) align the heads of the strings PageA and PageB;

(2) starting from the head, take a substring of length d from each of PageA and PageB, denoted s1 and s2, where the length d is a specified parameter, a character-string length set according to the length of the web page;

(3) calculate the similarity value of s1 and s2;

(4) move one character toward the end and take the next substring of length d from each text, again denoted s1 and s2;

(5) return to step (2) and repeat the process until the end of PageA or PageB is reached; the similarity sequence obtained in this way is denoted FS.

Similarly,

(1) align the tails of PageA and PageB;

(2) starting from the tail, take a substring of length d from each of PageA and PageB, denoted s1 and s2;

(3) calculate the similarity value of s1 and s2;

(4) move one character toward the head and take the next substring of length d from each text, again denoted s1 and s2;

(5) return to step (2) and repeat the process until the head of PageA or PageB is reached; the similarity sequence obtained in this way is denoted BS.
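The two sliding-window passes can be sketched as below. The function names are hypothetical, the similarity measure is passed in as a parameter, and `positional_similarity` is only a simple stand-in used to keep the example self-contained; in the method described here an edit distance-based similarity would take its place.

```python
from typing import Callable, List

Similarity = Callable[[str, str], float]


def forward_sequence(page_a: str, page_b: str, d: int, sim: Similarity) -> List[float]:
    """FS: align the heads and slide a window of length d toward the end."""
    if d < 1:
        raise ValueError("window length d must be at least 1")
    n = min(len(page_a), len(page_b))
    return [sim(page_a[i:i + d], page_b[i:i + d]) for i in range(n - d + 1)]


def backward_sequence(page_a: str, page_b: str, d: int, sim: Similarity) -> List[float]:
    """BS: align the tails and slide a window of length d toward the head."""
    if d < 1:
        raise ValueError("window length d must be at least 1")
    la, lb = len(page_a), len(page_b)
    n = min(la, lb)
    return [
        sim(page_a[la - k - d:la - k], page_b[lb - k - d:lb - k])
        for k in range(n - d + 1)
    ]


def positional_similarity(s1: str, s2: str) -> float:
    """Stand-in measure: share of positions that hold the same character."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return sum(a == b for a, b in zip(s1, s2)) / longest
```

Index k in the backward sequence counts characters from the tail, so BS[0] compares the last d characters of each text. Each pass stops when the shorter text is exhausted.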

The similarity sequences comprise the forward similarity sequence FS and the backward similarity sequence BS, which are computed following the flow shown in Fig. 5. FS is obtained by aligning the heads of the sample page text and the text of the page to be extracted and calculating the similarity of a series of synchronized substrings from front to back; BS, conversely, is obtained by aligning the tails and calculating from back to front. The position in FS where the similarity value turns is the head boundary of the body text, and the position in BS where the similarity value turns is its tail boundary.

When the two sequences are analyzed, both are Z-shaped: the similarity values are high at the start and at the end and fall rapidly in the middle, producing a turning point.

This turning point corresponds to a boundary of the body text: the turning point of FS corresponds to the head of the body text, and the turning point of BS corresponds to its tail. By working backward from the way the sequence values were obtained, the head and tail boundary positions of the body text in PageA and PageB can be determined, so that the body text can be extracted and the noise text filtered out.
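One simple way to locate the turning points is sketched below. The fixed threshold and the function names are assumptions: the specification does not state how the turn is detected, nor how the window length d should be accounted for when mapping a sequence index back to a character position, so this sketch uses the raw index as an approximate boundary.

```python
from typing import Optional, Sequence


def turning_point(seq: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first value below the threshold, or None if there is none."""
    for index, value in enumerate(seq):
        if value < threshold:
            return index
    return None


def extract_body(page_b: str, fs: Sequence[float], bs: Sequence[float],
                 threshold: float = 0.5) -> str:
    """Cut PageB at the approximate head (from FS) and tail (from BS) boundaries."""
    head = turning_point(fs, threshold)  # characters of leading noise
    tail = turning_point(bs, threshold)  # characters of trailing noise
    if head is None or tail is None:
        return ""  # no drop in similarity found, so no boundary can be placed
    end = len(page_b) - tail
    return page_b[head:end] if head < end else ""
```

Because each similarity value summarizes a window of d characters, the boundary found this way may be off by up to about d characters; a practical implementation would probably refine the position, for example by snapping to the nearest sentence or paragraph break.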

Claims

1. A web page body text extraction method based on a sample page, characterized in that it consists of the following steps:

(1) a user or a program specifies a sample page;

(2) preliminary filtering is performed using web page tags to obtain text that contains the body text;

(3) an edit distance model is used to calculate the similarity between the text of the page to be extracted and the text of the sample page, and the boundaries of the body text are identified;

(4) the body text is extracted.

2. The web page body text extraction method based on a sample page according to claim 1, characterized in that any sample page is specified and body text is extracted from pages to be extracted that are similar to the sample page.

3. The web page body text extraction method based on a sample page according to claim 1, characterized in that preliminary filtering is applied to the page to be extracted and the sample page to improve the accuracy and efficiency of the subsequent extraction; the preliminary filtering uses the innerHTML attribute, which conforms to the W3C standard, to obtain page content text that contains HTML tags, and then removes the HTML tags with regular expressions; the two filtered texts are the similar texts.

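4. The web page body text extraction method based on a sample page according to claim 1 or 3, characterized in that text similarity analysis is applied to the similar texts to identify the start and end boundaries of the body text, the similarity calculation is based on a similarity model, and two similarity sequences need to be computed, with the following steps:

(1) align the heads of the two similar text strings;

(2) starting from the head, take a substring of a specified length from each similar text, denoted s1 and s2;

(3) calculate the similarity value of s1 and s2;

(4) move one character toward the end and take the next substring of the specified length from each text, again denoted s1 and s2;

(5) return to step (2) and repeat the process until the end is reached; the similarity sequence obtained in this way is denoted FS;

Similarly,

(1) align the tails of the two similar texts;

(2) starting from the tail, take a substring of the specified length from each similar text, denoted s1 and s2;

(3) calculate the similarity value of s1 and s2;

(4) move one character toward the head and take the next substring of the specified length from each text, again denoted s1 and s2;

(5) return to step (2) and repeat the process until the head is reached; the similarity sequence obtained in this way is denoted BS.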

5. The web page body text extraction method based on a sample page according to claim 4, characterized in that the two similarity sequences are obtained and the boundaries of the body text are identified from them; when the two sequences are analyzed, the similarity values are high at the start and at the end and fall rapidly in the middle, producing a turning point that corresponds to a boundary of the body text, wherein the turning point of sequence FS corresponds to the head of the body text and the turning point of sequence BS corresponds to its tail; by working backward from the way the sequence values were obtained, the head and tail boundary positions of the body text within the similar texts can be determined, so that the body text can be extracted and the noise text filtered out.