CN105574066A - Web page text extraction and comparison method and system thereof

Info

Publication number: CN105574066A
Application number: CN201510695688.4A
Authority: CN
Inventors: Not disclosed (inventor name withheld)
Original assignee: Qingdao Hengbo Instrument Co Ltd
Current assignee: Qingdao Hengbo Instrument Co Ltd
Priority date: 2015-10-23
Filing date: 2015-10-23
Publication date: 2016-05-11

Abstract

The invention discloses a web page text extraction and comparison method and a system thereof. The method comprises the following steps: A, judging whether a web page is a text page based on specific web page tags; and B, identifying parallel web pages. Step A further comprises the following sub-steps: (1) pre-processing the web page to construct an HTML tree; (2) pruning the HTML tree; (3) obtaining the theme of the web page; (4) extracting the string content of each block; (5) calculating the distance between the theme S and the content y of a block; and (6) comparing the edit distance L with max(p, q). The method has the following advantages: web pages with relatively short text can be extracted, because the length of the content does not affect the correctness of the selection and every text takes part in the calculation regardless of its length; and, for complex web pages with nested <table> tags, each <table> tag is processed consistently.

Description

Web page text extraction and comparison method and system thereof

Technical field

The present invention relates to computer network technology, and in particular to a web page text extraction and comparison method and a system thereof.

Background art

There are many web page text extraction methods, some of them designed specifically for comment pages or news pages; the present invention, however, concerns text extraction for most generic web pages. Broadly, current web page text extraction methods fall into the following directions: DOM-based methods, statistics-based methods, block-based methods, and other methods.

The Document Object Model (DOM) is a standard interface specification defined by the W3C. Because DOM nodes are organized in a hierarchical tree structure, once the tree has been built, operations on the web page can be carried out as operations on the tree. Although any web page structured according to the W3C standard can be converted into a DOM tree, many web pages do not in fact follow the standard. DOM-based methods therefore usually need a pre-processing module that finally abstracts the web page into a DOM tree.

1. DOM-based web page text extraction

DOM-based web page text extraction extracts web page content on the basis of the DOM; its original purpose was to improve PDA applications by removing advertising content. A DOM method first abstracts the web page content into corresponding objects represented as nodes, then organizes the nodes by parent-child relationship, finally forming a tree structure.

Most web pages from the same website share the same structure. For example, the <body> tag of Yahoo news pages always consists of an <iframe> tag and a <div> tag, so pages using such a template can be grouped into one class. Clustering similar DOM trees requires a similarity measure. The similarity of two simple DOM trees is calculated in two steps: first, judge whether the root nodes of the two trees are identical, returning 0 if they are not and otherwise going on to compare the leaf nodes of the two trees; second, compare the names and attributes of the leaf nodes of the two DOM trees and return the number of identical nodes.

2. Statistics-based web page text extraction

Statistics-based methods are mainly used to extract the text of news pages. Their underlying assumption is that the text of a web page can only be located in a <table> node of the page. The basic steps are: first, remove the noise of the page and represent the page as a tree according to its tags; second, process each <table> node, removing the HTML tags in the node to obtain a character string free of tags; third, compare the character counts of the nodes, usually choosing the node with the most characters as the web page text. The advantages of the method are that it exploits the characteristics of news pages, is general and simple to implement, needs neither per-page templates nor sample learning, and has low time complexity. Its shortcoming is that it only applies when all the text of a page is placed in a single <table> node; for pages whose text is spread over several <table> nodes it performs poorly. With the rise of microblogs, light blogs and the like, more and more pages with complex formats and short texts are being produced, and the limitation of this method becomes ever more apparent.

Table comparing the effects of existing web page text extraction methods:

In general, current algorithms for web page text extraction and web page similarity calculation still mainly target conventional Internet web pages. Neither text extraction nor similarity research has seriously considered the new characteristics of mobile Internet web page content, which shows mainly in the following shortcomings:

(1) The structure of mobile Internet web pages is becoming more and more complex and involves more and more new techniques, so the limitations of the traditional text extraction algorithms introduced above are increasingly evident.

(2) Because there is so much short-text web content, the theoretical basis of some text similarity algorithms no longer holds; their accuracy drops and they cannot meet the demands of large-scale data use.

Summary of the invention

The technical problem to be solved by the present invention is to provide a web page text extraction and comparison method based on theme-similar blocks; results show that the method of the invention achieves a considerable improvement in accuracy.

To solve the above technical problem, the invention provides a web page text extraction and comparison method comprising the following steps:

Step A: based on specific web page tags, judging whether a web page is a text page;

Step B: identifying parallel web pages;

Step B further comprises a feature information extraction sub-step and a support vector machine classification sub-step.

Step A may further comprise the following sub-steps:

Step 1: pre-processing the web page and constructing an HTML tree;

Step 2: pruning the HTML tree;

Step 3: obtaining the web page theme;

Step 4: extracting the string content in each block;

Step 5: calculating the distance between the theme S and the content y of a block;

Step 6: comparing the edit distance L with max(p, q).

Step 2 may further comprise the following sub-step: partitioning into blocks according to <table> tags, and removing leaf nodes that contain neither text nor link information.

Step 5 may further comprise: segmenting Chinese text into words, and using the Levenshtein distance shown in formula (2) and formula (3):

$$
L\big((x^{1},\dots,x^{p}),(y^{1},\dots,y^{q})\big)=
\begin{cases}
p, & q=0\\
q, & p=0\\
\min\left\{\begin{aligned}
&L\big((x^{1},\dots,x^{p-1}),(y^{1},\dots,y^{q})\big)+1,\\
&L\big((x^{1},\dots,x^{p}),(y^{1},\dots,y^{q-1})\big)+1,\\
&L\big((x^{1},\dots,x^{p-1}),(y^{1},\dots,y^{q-1})\big)+Z(x^{p},y^{q})
\end{aligned}\right\}, & \text{else}
\end{cases}
\tag{2}
$$

The similarity of the classified HTML tag sequences is calculated with an improved edit distance:

The edit distance between two character strings is the minimum number of editing operations needed to transform one string into the other, where the editing operations are replacing one character with another, inserting a character, and deleting a character. Taking the classification of tags into account, the improved edit distance is defined as the minimum operation cost of converting one sequence of classified tags into another by deletion, insertion and replacement. Deletion and insertion each cost 1, replacement within a class costs 0, and replacement between classes costs 1.5, that is:

Insertion operation: $C_i(t)=1$;

Deletion operation: $C_d(t)=1$;

Replacement operation:

$$
C_s(t_1,t_2)=
\begin{cases}
0, & \text{if } t_1,t_2\in T\\
1.5, & \text{if } t_1\in T_1,\ t_2\in T_2,\ T_1\neq T_2
\end{cases}
\qquad T_1,T_2,T:\ \text{tag categories}
$$

For HTML tag sequences $W=[w_0, w_1, \dots, w_a, \dots, w_A]$ and $Z=[z_0, z_1, \dots, z_b, \dots, z_B]$, dynamic programming is used to calculate the improved edit distance matrix M, whose element M[a, b] is:

$$
M[a,b]=
\begin{cases}
a, & \text{if } b=0\\
b, & \text{if } a=0\\
\min\big(M[a-1,b]+C_d(w_a),\ M[a-1,b-1]+C_s(w_a,z_b),\ M[a,b-1]+C_i(w_a)\big), & \text{otherwise}
\end{cases}
$$

The lower-right element M[A, B] of the matrix is the improved edit distance between the two sequences, and the tag structure information $D_t$ is:

$$D_t = M[A,B]/\max(A+1,B+1)$$

Step B may further comprise a feature information extraction sub-step and a support vector machine classification sub-step;

The feature information extraction sub-step further comprises:

Establishing feature information: the feature information comprises the HTML tag structure information of the web page and content-based text length information, text sentence count information and number sequence information;

HTML tags are divided into three classes, structure tags, format tags and irrelevant tags, according to their function in page layout, display and linking:

Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;

Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;

Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are left out when calculating structural symmetry.

The edit distance between two character strings is the minimum number of editing operations needed to transform one string into the other;

The editing operations are replacing one character with another, inserting a character, and deleting a character;

Taking the classification of tags into account, the improved edit distance is defined as the minimum operation cost of converting one sequence of classified tags into another by deletion, insertion and replacement.

To solve the above technical problem, the invention also provides a web page text extraction and comparison system comprising the following modules:

Module A: for judging, based on specific web page tags, whether a web page is a text page;

Module B: for identifying parallel web pages;

Module A may further comprise the following submodules:

Pre-processing submodule: for pre-processing the web page and constructing an HTML tree;

Pruning submodule: for pruning the HTML tree;

Theme obtaining submodule: for obtaining the web page theme;

Block extraction submodule: for extracting the string content in each block;

Distance calculation submodule: for calculating the distance between the theme S and the content y of a block;

Distance comparison submodule: for comparing the edit distance L with max(p, q).

The pruning submodule may further be used for: partitioning into blocks according to <table> tags, and removing leaf nodes that contain neither text nor link information.

The distance calculation submodule may further be used for: segmenting Chinese text into words, using the Levenshtein distance shown in formula (2) and formula (3):

$$
L\big((x^{1},\dots,x^{p}),(y^{1},\dots,y^{q})\big)=
\begin{cases}
p, & q=0\\
q, & p=0\\
\min\left\{\begin{aligned}
&L\big((x^{1},\dots,x^{p-1}),(y^{1},\dots,y^{q})\big)+1,\\
&L\big((x^{1},\dots,x^{p}),(y^{1},\dots,y^{q-1})\big)+1,\\
&L\big((x^{1},\dots,x^{p-1}),(y^{1},\dots,y^{q-1})\big)+Z(x^{p},y^{q})
\end{aligned}\right\}, & \text{else}
\end{cases}
\tag{2}
$$

Insertion operation: $C_i(t)=1$;

Deletion operation: $C_d(t)=1$;

Replacement operation:

$$
C_s(t_1,t_2)=
\begin{cases}
0, & \text{if } t_1,t_2\in T\\
1.5, & \text{if } t_1\in T_1,\ t_2\in T_2,\ T_1\neq T_2
\end{cases}
\qquad T_1,T_2,T:\ \text{tag categories}
$$

$$
M[a,b]=
\begin{cases}
a, & \text{if } b=0\\
b, & \text{if } a=0\\
\min\big(M[a-1,b]+C_d(w_a),\ M[a-1,b-1]+C_s(w_a,z_b),\ M[a,b-1]+C_i(w_a)\big), & \text{otherwise}
\end{cases}
$$

$$D_t = M[A,B]/\max(A+1,B+1)$$

Module B may further comprise the following submodules: a feature information extraction submodule and a support vector machine classification submodule;

The feature information extraction submodule is used for:

The beneficial technical effects of the invention are as follows. Compared with traditional web page extraction algorithms, the web page text extraction and comparison method of the invention, which is based on theme-similar blocks, has the following advantages:

(1) Web pages with short text can be extracted, and the length of the content does not affect the correctness of the selection, because texts of every length take part in the calculation and none is ignored.

(2) Web pages with complex nested <table> tags can be processed, because constructing an HTML tree ensures that each <table> tag is handled consistently.

(3) The amount of computation is reduced. No cluster analysis is needed (clustering is very time-consuming), nor does the entropy of blocks have to be calculated; analysing the web page itself is enough to make the judgment.

(4) A degree of semantic information is added. Because the semantic information of the title tags and the text is used effectively, the extracted text has stronger semantic relevance.

Embodiment

Embodiments of the invention are described in detail below, so that how the invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and implemented accordingly.

In the web page text extraction and comparison method based on theme-similar blocks, the theme means the title and tags of the web page. To avoid short-text blocks of mobile Internet pages being ignored, the algorithm of the invention does not calculate the entropy of content blocks; it mainly uses the similarity between the theme and a content block as the criterion for extracting the block. Specifically, it mainly exploits the following characteristics of web pages:

First, web page formats have a tree structure. More and more web page formats are now built according to the XML standard, and web page tags normally occur in nested pairs, so a page can be converted into an HTML tree structure; DOM-based text extraction methods in fact exploit the same property. In the method of the invention the HTML tree is built mainly so that useless branches can be cut off, reducing the amount of computation.

Second, web pages are usually laid out in blocks. Although the formats of mobile Internet web pages are complex, in terms of content each page basically contains the following blocks: a category block, a navigation block, a text block, a related-links block, an advertising block, and so on. Using this property, together with the fact that web page tags normally occur in nested pairs, the page can be partitioned into blocks by its tags. In practice, given the current wide use of DIV+CSS and the good layout properties of the <table></table> tag, most web pages today use <table> tags to lay out the page finally presented to the user. The text extraction method based on theme-similar blocks builds on this and uses <table> tags to parse the page.

Third, the theme and the content are correlated. A web page usually has a title and some tags that summarize its text at a high level, so the theme in fact best reflects the characteristics of the text and represents the key content of the page. Earlier text extraction methods all failed to take this into account. The method of the invention uses the relationship between theme and text as an important indicator for text extraction. In particular, because the structure of mobile Internet web pages is increasingly diverse, page content varies in length and advertising interference is heavy, short-text content is easily drowned in advertising; taking the similarity between theme and content into account in extraction is therefore essential. The similarity measure used by the invention is the edit distance (that is, the Levenshtein distance): the minimum number of insertions, deletions and replacements needed to convert a source string a into a target string b. The Levenshtein formula is shown in formula (1):

$$
\operatorname{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j), & \min(i,j)=0\\
\min\begin{cases}
\operatorname{lev}_{a,b}(i-1,j)+1\\
\operatorname{lev}_{a,b}(i,j-1)+1\\
\operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}
\end{cases} & \text{else}
\end{cases}
\tag{1}
$$

Here a and b are character strings, i is the length of string a and j is the length of string b. Building on the three points above, the basic idea of the text extraction method based on theme-similar blocks is as follows: convert the web page into an HTML tree structure; extract the theme of the page; extract content blocks using the page tags; and calculate the edit distance (Levenshtein distance) L between the theme and each content block. When L is less than the length p of the content block, the block is regarded as web page text to be extracted; when L is greater than or equal to the length of the content block, the block is ignored.
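As an illustration of formula (1), the following is a minimal sketch of the Levenshtein distance computed by dynamic programming over two rows. The function name `levenshtein` is chosen here for illustration and does not come from the patent.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between strings a and b (insert, delete, replace)."""
    # prev[j] holds lev(i-1, j); the base row is lev(0, j) = j.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # lev(i, 0) = i
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character of a
                curr[j - 1] + 1,           # insert a character of b
                prev[j - 1] + (ca != cb),  # replace, free when equal
            ))
        prev = curr
    return prev[-1]
```

Keeping only two rows is a memory optimisation; the full matrix is only needed when the alignment itself must be recovered.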

In one embodiment, the invention provides a web page text extraction and comparison method comprising the following steps:

Step B: identifying parallel web pages;

Step A may further comprise the following sub-steps:

Step 1: pre-processing the web page and constructing an HTML tree;

Step 2: pruning the HTML tree;

Step 3: obtaining the web page theme;

Step 4: extracting the string content in each block;

Step 5: calculating the distance between the theme S and the content y of a block;

Step 6: comparing the edit distance L with max(p, q).

$$
L\big((x^{1},\dots,x^{p}),(y^{1},\dots,y^{q})\big)=
\begin{cases}
p, & q=0\\
q, & p=0\\
\min\left\{\begin{aligned}
&L\big((x^{1},\dots,x^{p-1}),(y^{1},\dots,y^{q})\big)+1,\\
&L\big((x^{1},\dots,x^{p}),(y^{1},\dots,y^{q-1})\big)+1,\\
&L\big((x^{1},\dots,x^{p-1}),(y^{1},\dots,y^{q-1})\big)+Z(x^{p},y^{q})
\end{aligned}\right\}, & \text{else}
\end{cases}
\tag{2}
$$

Insertion operation: $C_i(t)=1$;

Deletion operation: $C_d(t)=1$;

Replacement operation:

$$
C_s(t_1,t_2)=
\begin{cases}
0, & \text{if } t_1,t_2\in T\\
1.5, & \text{if } t_1\in T_1,\ t_2\in T_2,\ T_1\neq T_2
\end{cases}
\qquad T_1,T_2,T:\ \text{tag categories}
$$

$$
M[a,b]=
\begin{cases}
a, & \text{if } b=0\\
b, & \text{if } a=0\\
\min\big(M[a-1,b]+C_d(w_a),\ M[a-1,b-1]+C_s(w_a,z_b),\ M[a,b-1]+C_i(w_a)\big), & \text{otherwise}
\end{cases}
$$

$$D_t = M[A,B]/\max(A+1,B+1)$$

The feature information extraction sub-step further comprises:

In another embodiment, the invention also provides a web page text extraction and comparison system comprising the following modules:

Module B: for identifying parallel web pages;

Module A may further comprise the following submodules:

Pre-processing submodule: for pre-processing the web page and constructing an HTML tree;

Pruning submodule: for pruning the HTML tree;

Theme obtaining submodule: for obtaining the web page theme;

Block extraction submodule: for extracting the string content in each block;

Distance comparison submodule: for comparing the edit distance L with max(p, q).

$$
L\big((x^{1},\dots,x^{p}),(y^{1},\dots,y^{q})\big)=
\begin{cases}
p, & q=0\\
q, & p=0\\
\min\left\{\begin{aligned}
&L\big((x^{1},\dots,x^{p-1}),(y^{1},\dots,y^{q})\big)+1,\\
&L\big((x^{1},\dots,x^{p}),(y^{1},\dots,y^{q-1})\big)+1,\\
&L\big((x^{1},\dots,x^{p-1}),(y^{1},\dots,y^{q-1})\big)+Z(x^{p},y^{q})
\end{aligned}\right\}, & \text{else}
\end{cases}
\tag{2}
$$

Insertion operation: $C_i(t)=1$;

Deletion operation: $C_d(t)=1$;

Replacement operation:

$$
C_s(t_1,t_2)=
\begin{cases}
0, & \text{if } t_1,t_2\in T\\
1.5, & \text{if } t_1\in T_1,\ t_2\in T_2,\ T_1\neq T_2
\end{cases}
\qquad T_1,T_2,T:\ \text{tag categories}
$$

$$
M[a,b]=
\begin{cases}
a, & \text{if } b=0\\
b, & \text{if } a=0\\
\min\big(M[a-1,b]+C_d(w_a),\ M[a-1,b-1]+C_s(w_a,z_b),\ M[a,b-1]+C_i(w_a)\big), & \text{otherwise}
\end{cases}
$$

$$D_t = M[A,B]/\max(A+1,B+1)$$

The feature information extraction submodule is used for:

In another embodiment, the invention provides a web page text extraction and comparison method comprising the following steps:

Step B: identifying parallel web pages.

Step A may further comprise the following sub-steps:

Step 1: pre-processing the web page and constructing an HTML tree;

Step 2: pruning the HTML tree;

Step 3: obtaining the web page theme;

Step 4: extracting the string content in each block;

Step 5: calculating the distance between the theme S and the content y of a block;

Step 6: comparing the edit distance L with max(p, q).

$$
L\big((x^{1},\dots,x^{p}),(y^{1},\dots,y^{q})\big)=
\begin{cases}
p, & q=0\\
q, & p=0\\
\min\left\{\begin{aligned}
&L\big((x^{1},\dots,x^{p-1}),(y^{1},\dots,y^{q})\big)+1,\\
&L\big((x^{1},\dots,x^{p}),(y^{1},\dots,y^{q-1})\big)+1,\\
&L\big((x^{1},\dots,x^{p-1}),(y^{1},\dots,y^{q-1})\big)+Z(x^{p},y^{q})
\end{aligned}\right\}, & \text{else}
\end{cases}
\tag{2}
$$

The feature information extraction sub-step further comprises:

Module B: for identifying parallel web pages.

Module A may further comprise the following submodules:

Pre-processing submodule: for pre-processing the web page and constructing an HTML tree;

Pruning submodule: for pruning the HTML tree;

Theme obtaining submodule: for obtaining the web page theme;

Block extraction submodule: for extracting the string content in each block;

Distance comparison submodule: for comparing the edit distance L with max(p, q).

$$
L\big((x^{1},\dots,x^{p}),(y^{1},\dots,y^{q})\big)=
\begin{cases}
p, & q=0\\
q, & p=0\\
\min\left\{\begin{aligned}
&L\big((x^{1},\dots,x^{p-1}),(y^{1},\dots,y^{q})\big)+1,\\
&L\big((x^{1},\dots,x^{p}),(y^{1},\dots,y^{q-1})\big)+1,\\
&L\big((x^{1},\dots,x^{p-1}),(y^{1},\dots,y^{q-1})\big)+Z(x^{p},y^{q})
\end{aligned}\right\}, & \text{else}
\end{cases}
\tag{2}
$$

The feature information extraction submodule is used for:

In another embodiment, following the basic idea of the text extraction method based on theme-similar blocks, the algorithm of the invention comprises three key steps: constructing the HTML tree, extracting the web page theme, and calculating the similarity between the theme and each block. In addition, because web pages are semi-structured, pre-processing is needed; and to reduce the amount of computation, the constructed tree needs to be pruned. Specifically, the basic steps of the algorithm are as follows:

Step 1: pre-process the web page and construct the HTML tree. The web page is normalized and finally mapped to a tree structure, through the following sub-steps:

(1) Replace every "<" and ">" that occurs anywhere other than in the corresponding <table> tags of the web page with &lt; and &gt;, and complete end tags that are missing because of non-standard markup, such as those of <li> and <hr>.

(2) Place the attribute values of all tags in quotation marks, for example

<a href="www.hust.edu.cn">.

(3) Tags must match in pairs, that is, each start tag has a corresponding end tag: <body> corresponds to </body>, and <head> corresponds to </head>.

(4) Tags must be correctly nested, for example <a><b></b></a>. Only correctly nested tags can be processed correctly by iteration.

(5) Remove useless tags such as form and img. Using the normalized tag information, recursively construct the HTML tree corresponding to the web page.

Step 2: prune the HTML tree. Because blocks are formed according to <table> tags, some leaf nodes contain neither text nor link information; these useless branches are removed to reduce the amount of computation.
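Steps 1 and 2 can be sketched roughly as follows, using Python's standard `html.parser` module. This is only an illustrative sketch under simplifying assumptions: the names `Node`, `TreeBuilder`, `prune` and `build_tree` are hypothetical, the set of void tags is an assumption, unclosed tags are closed implicitly rather than repaired in a separate normalization pass, and the content of a dropped `form` tag is simply attached to its parent.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # tags without end tag (assumed set)
DROPPED_TAGS = {"form", "img"}  # "useless" tags named in sub-step (5)


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree; unclosed tags are closed implicitly."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open tag with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def prune(node):
    """Drop subtrees with neither text nor a link; return True if node is kept."""
    node.children = [child for child in node.children if prune(child)]
    return bool(node.text) or node.tag == "a" or bool(node.children)


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return builder.root
```

Pruning bottom-up means that a branch disappears only when none of its descendants carries text or a link, which matches the intent of removing useless branches before the distance calculation.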

Step 3: obtain the web page theme. Obtain the content of the page title, the headings at all levels <h1> to <h6>, and the <meta> tags. For Chinese, the ICTCLAS word segmentation system developed by the Chinese Academy of Sciences can be used to segment this content; function words, stop words and the like are then removed, finally giving a sequence Stitle containing only content words.
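A possible sketch of Step 3 for English-language pages is given below. Several things here are assumptions rather than part of the patent: whitespace tokenization stands in for the ICTCLAS segmenter, the tiny `STOP_WORDS` set is purely illustrative, only `<meta>` tags named `keywords` or `description` are read, and the names `ThemeExtractor` and `extract_theme` are hypothetical.

```python
from html.parser import HTMLParser

THEME_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in"}  # illustrative only


class ThemeExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside a title or heading tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in THEME_TAGS:
            self.depth += 1
        elif tag == "meta":
            attributes = dict(attrs)
            # Assumption: only keyword and description meta tags carry the theme.
            if attributes.get("name") in ("keywords", "description"):
                self.parts.append(attributes.get("content") or "")

    def handle_endtag(self, tag):
        if tag in THEME_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_theme(html):
    """Return the theme as a list of lower-case content words."""
    parser = ThemeExtractor()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.parts).replace(",", " ").lower().split()
    return [word for word in words if word not in STOP_WORDS]
```

For Chinese pages the `split()` call would have to be replaced by a real word segmenter, since Chinese text has no spaces between words.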

Step 4: extract the string content in each block. Starting from the leaf nodes of the HTML tree, merge the subtree corresponding to each innermost <table> tag into one block and remove the HTML markup in the block to obtain the string content Y of the block.

Step 5: calculate the distance between the theme S and the content y of a block. Chinese text must first be segmented into words, again using the Chinese Academy of Sciences word segmentation system mentioned in Step 3. The Levenshtein distance used in the invention is shown in formula (2) and formula (3):

$$
L\big((x^{1},\dots,x^{p}),(y^{1},\dots,y^{q})\big)=
\begin{cases}
p, & q=0\\
q, & p=0\\
\min\left\{\begin{aligned}
&L\big((x^{1},\dots,x^{p-1}),(y^{1},\dots,y^{q})\big)+1,\\
&L\big((x^{1},\dots,x^{p}),(y^{1},\dots,y^{q-1})\big)+1,\\
&L\big((x^{1},\dots,x^{p-1}),(y^{1},\dots,y^{q-1})\big)+Z(x^{p},y^{q})
\end{aligned}\right\}, & \text{else}
\end{cases}
\tag{2}
$$

Step 6: compare the edit distance L with max(p, q). If L < max(p, q), the block contains text information and is extracted; otherwise it is identified as interference and ignored. The text of the web page is finally obtained.
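Steps 5 and 6 can be sketched as follows over sequences of words. Formula (3), which defines the replacement cost Z, is not reproduced in the text, so this sketch assumes the usual choice of Z = 0 for identical words and Z = 1 otherwise; the function names are hypothetical.

```python
def token_edit_distance(x, y):
    """Formula (2) over word sequences, assuming Z = 0 if equal, else 1."""
    prev = list(range(len(y) + 1))        # L for an empty prefix of x
    for i, xi in enumerate(x, start=1):
        curr = [i]                        # L for an empty prefix of y
        for j, yj in enumerate(y, start=1):
            z = 0 if xi == yj else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + z))
        prev = curr
    return prev[-1]


def is_text_block(theme, block):
    """Step 6: keep the block when L < max(p, q)."""
    p, q = len(theme), len(block)
    return token_edit_distance(theme, block) < max(p, q)
```

Under this assumption L equals max(p, q) exactly when no word of the theme can be aligned with a word of the block, so the rule keeps any block that shares at least one alignable word with the theme.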

In addition, the web page text extraction and comparison method of the invention also includes the identification of parallel web pages.

Parallel web page identification in the invention consists mainly of two parts: feature information extraction and support vector machine classification.

1. Feature information extraction

The feature information mainly comprises the HTML tag structure information of the web page and content-based text length information, text sentence count information and number sequence information.

(1) Tag structure features

The body content of bilingual parallel web pages is mutually translated, but the presentation of the pages often differs considerably. To avoid wrongly discarding parallel web pages because of differences in format, and to strengthen the similarity of structure tag overlap between parallel pages, HTML tags are divided into three classes, structure tags, format tags and irrelevant tags, according to their different functions in page layout, display, linking and so on:

Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul, etc.;

Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u, etc.;

Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend, etc.; these are left out when calculating structural symmetry.

The similarity of the classified HTML tag sequences is calculated with an improved edit distance.

The edit distance between two character strings is the minimum number of editing operations needed to transform one string into the other, where the editing operations are replacing one character with another, inserting a character, and deleting a character. Taking the classification of tags into account, the improved edit distance is defined as the minimum operation cost of converting one sequence of classified tags into another by deletion, insertion and replacement. Deletion and insertion each cost 1, replacement within a class costs 0, and replacement between classes costs 1.5, that is:

Insertion operation: $C_i(t)=1$;

Deletion operation: $C_d(t)=1$;

Replacement operation:

$$
C_s(t_1,t_2)=
\begin{cases}
0, & \text{if } t_1,t_2\in T\\
1.5, & \text{if } t_1\in T_1,\ t_2\in T_2,\ T_1\neq T_2
\end{cases}
\qquad T_1,T_2,T:\ \text{tag categories}
$$

For HTML tag sequences $W=[w_0, w_1, \dots, w_a, \dots, w_A]$ and $Z=[z_0, z_1, \dots, z_b, \dots, z_B]$, dynamic programming is used to calculate the improved edit distance matrix M, whose element M[a, b] is:

$$
M[a,b]=
\begin{cases}
a, & \text{if } b=0\\
b, & \text{if } a=0\\
\min\big(M[a-1,b]+C_d(w_a),\ M[a-1,b-1]+C_s(w_a,z_b),\ M[a,b-1]+C_i(w_a)\big), & \text{otherwise}
\end{cases}
$$

$$D_t = M[A,B]/\max(A+1,B+1)$$

For example, for the HTML tag sequences W = [div, style, style, div, style, style, p, p, div, div] and Z = [div, table, tr, td, span, span, td, tr, table, div], the improved edit distance matrix is shown in Table 1; the improved edit distance is 3 and the tag structure information is $D_t$ = 0.3.

Table 1: Improved edit distance matrix M of W and Z
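The worked example above can be checked with a short sketch of the improved edit distance. Some choices here are assumptions: the entries `table` and `dfn` are taken as the intended readings of the tag lists, the matrix carries an extra row and column for the empty prefix (the usual dynamic programming layout), and $D_t$ is normalised by the longer sequence length, which equals Max(A+1, B+1) when A and B are the last indices. The function names are hypothetical.

```python
STRUCTURE_TAGS = {
    "blockquote", "body", "dir", "div", "dt", "h", "head", "hr", "li", "menu",
    "p", "q", "table", "tbody", "td", "tfoot", "th", "thead", "tr", "ul",
}
FORMAT_TAGS = {
    "abbr", "acronym", "b", "big", "center", "cite", "code", "dfn", "em",
    "font", "i", "pre", "s", "small", "span", "strike", "strong", "style",
    "sub", "sup", "tt", "u",
}


def tag_class(tag):
    """Return the class of a tag, or None for irrelevant or unlisted tags."""
    if tag in STRUCTURE_TAGS:
        return "structure"
    if tag in FORMAT_TAGS:
        return "format"
    return None


def improved_edit_distance(w, z):
    """Return (distance, D_t) for two HTML tag sequences."""
    # Irrelevant tags are left out; the rest are compared by class only.
    w = [tag_class(t) for t in w if tag_class(t)]
    z = [tag_class(t) for t in z if tag_class(t)]
    rows, cols = len(w) + 1, len(z) + 1
    m = [[0.0] * cols for _ in range(rows)]
    for a in range(rows):
        m[a][0] = float(a)   # delete a tags
    for b in range(cols):
        m[0][b] = float(b)   # insert b tags
    for a in range(1, rows):
        for b in range(1, cols):
            replace_cost = 0.0 if w[a - 1] == z[b - 1] else 1.5
            m[a][b] = min(
                m[a - 1][b] + 1.0,                # deletion
                m[a - 1][b - 1] + replace_cost,   # replacement
                m[a][b - 1] + 1.0,                # insertion
            )
    longest = max(len(w), len(z))
    distance = m[-1][-1]
    return distance, (distance / longest if longest else 0.0)
```

For the sequences W and Z of the example, the only cross-class alignments are the two `style` tags facing `table` and `tr`, giving 2 × 1.5 = 3 and $D_t$ = 3/10 = 0.3.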

(2) Content surface features

To reduce dependence on a bilingual dictionary, content surface features refer specifically to information that is directly related to the content but is not lexical translation. They mainly comprise the text sentence count information, text length information and number sequence information of a text pair, calculated as follows:

1) Text sentence count information $D_s$:

$$D_s = \min(S_S, S_T)/\max(S_S, S_T)$$

2) Text length information $D_l$:

$$D_l = |L_S - L_T|/\max(L_S, L_T)$$

3) Number sequence information $D_n$:

$$D_n = 1 - Z/\max(m, n)$$

where m and n are the numbers of numerals appearing in the source-language text and the target-language text respectively, and Z is the maximum matching length, calculated in detail as follows:

Suppose the number sequences extracted from the source-language and target-language texts are $X=[x_1, x_2, \dots, x_i, \dots, x_m]$ and $Y=[y_1, y_2, \dots, y_j, \dots, y_n]$ respectively. An m×n matching relation matrix C is built from them, with element C[i, j]:

$$
C[i,j]=
\begin{cases}
0, & x_i\neq y_j\\
1, & x_i=y_j
\end{cases}
$$

The matrix C is used to build the maximum matching length matrix D of the sequences. The element D[i, j] is calculated according to the following principles:

A. Loop from right to left and from bottom to top.

B. The element D[i, j] is:

$$D[i,j]=\max\big(C[i,j]+D[i+1,j+1],\ D[i,j+1],\ D[i+1,j]\big)$$

The element D[0, 0] finally generated in matrix D is the maximum matching length Z.

To illustrate the calculation of the co-occurring number sequence information, take the number sequences X=[4, 5, 34, 5, 2, 45, 8, 12] and Y=[4, 7, 34, 8, 78, 9, 5, 2, 12]. The resulting matching relationship matrix C is shown in Table 2 and the maximum matching matrix D in Table 3. The maximum matching length Z is therefore 5, and the number sequence information D_n is 1-5/9=0.44.

Table 2: Matching relationship matrix C of X and Y

Table 3: Maximum matching matrix D of X and Y
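The maximum matching length calculation can be sketched as follows. This is an illustration under one stated assumption: each element of D reuses the already computed entries D[i+1, j+1], D[i, j+1] and D[i+1, j] (a longest-common-subsequence style recurrence), which reproduces Z=5 for the example sequences. The function names are hypothetical, and the matrix is zero-indexed with an extra row and column of zeros so that out-of-range entries count as 0.

```python
def max_matching_length(x, y):
    """Return Z, the maximum matching length of number sequences x and y."""
    m, n = len(x), len(y)
    # One extra row and column of zeros so d[i+1][j+1] is always defined.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # Loop from bottom to top and from right to left.
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            c = 1 if x[i] == y[j] else 0   # matching relationship C[i, j]
            d[i][j] = max(c + d[i + 1][j + 1], d[i][j + 1], d[i + 1][j])
    return d[0][0]


def number_sequence_info(x, y):
    """D_n = 1 - Z / Max(m, n)."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 0.0  # assumption: no numerals on either side means no mismatch
    return 1 - max_matching_length(x, y) / longest


X = [4, 5, 34, 5, 2, 45, 8, 12]
Y = [4, 7, 34, 8, 78, 9, 5, 2, 12]
print(max_matching_length(X, Y))             # 5
print(round(number_sequence_info(X, Y), 2))  # 0.44
```

The matched numerals in this example are 4, 34, 5, 2 and 12, which appear in the same order in both sequences.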

The Web page text extraction and comparison method of the present invention uses the support vector machine (SVM) classification algorithm. The SVM algorithm is an implementation of statistical learning theory. It is built on the VC dimension (Vapnik-Chervonenkis dimension) theory of statistical learning and on the structural risk minimization principle. By introducing a kernel function, sample vectors are mapped to a high-dimensional feature space, and an optimal separating hyperplane is then constructed in that space to obtain a linear optimal decision function. The advantage of SVM is that the kernel function handles the dimensionality problem, so that the computational complexity of the learning algorithm is not directly tied to the sample dimension.

Let {(x_i, y_i), i=1, ..., S} be the SVM training data set formed by S data points, where x_i ∈ R^n and y_i ∈ {-1, 1}. The optimal decision function is:

f(x) = \mathrm{Sgn}\left[\sum_{i=1}^{S} \alpha_i y_i \langle x \cdot x_i \rangle + b\right] \quad (2.8)

where Sgn[.] is the sign function, the non-negative variables α_i are the Lagrange multipliers, and b is the bias of the hyperplane.
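Formula (2.8) can be evaluated as in the sketch below. It is only an illustration with a plain inner product as the kernel; the function names, the toy support vectors and the convention that a sum of exactly zero maps to +1 are assumptions, not part of the source.

```python
def dot(u, v):
    """Inner product <u . v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def svm_decision(x, vectors, labels, alphas, b, kernel=dot):
    """Evaluate f(x) = Sgn[sum_i alpha_i * y_i * <x . x_i> + b]."""
    total = sum(
        alpha * y * kernel(x, x_i)
        for alpha, y, x_i in zip(alphas, labels, vectors)
    )
    # Assumption: a value of exactly zero is assigned to the +1 class.
    return 1 if total + b >= 0 else -1


# Toy data, chosen only for illustration.
vectors = [[1.0, 1.0], [-1.0, -1.0]]
labels = [1, -1]
alphas = [0.5, 0.5]
print(svm_decision([2.0, 0.0], vectors, labels, alphas, b=0.0))    # 1
print(svm_decision([-1.0, -3.0], vectors, labels, alphas, b=0.0))  # -1
```

Passing a different `kernel` callable would replace the inner product with a kernel function, corresponding to the mapping into a high-dimensional feature space.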

From the preprocessed source-language and target-language documents, web pages whose mirrored local paths differ by no more than two levels are selected to form candidate parallel web page pairs. For each web page pair, the HTML tag sequence information D_t, text length information D_l, text sentence number information D_s and number sequence information D_n are calculated to form the feature vector x_i ∈ R^n (n=4) of the SVM classifier. D_t reflects web page structure information and is extracted from the preprocessed web page; D_l, D_s and D_n reflect web page content information and are extracted from the web page body text.
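Assembling the four-dimensional feature vector can be sketched as follows. The sentence number and text length features follow the Min/Max and |L_S - L_T|/Max formulas given above; the function names, the argument order and the handling of empty texts are assumptions made for illustration, and the tag structure and number sequence features are taken as already computed inputs.

```python
def sentence_count_info(s_src, s_tgt):
    """Sentence number feature: Min(S_S, S_T) / Max(S_S, S_T)."""
    largest = max(s_src, s_tgt)
    return min(s_src, s_tgt) / largest if largest else 0.0  # assumption for empty texts


def text_length_info(l_src, l_tgt):
    """Text length feature: |L_S - L_T| / Max(L_S, L_T)."""
    largest = max(l_src, l_tgt)
    return abs(l_src - l_tgt) / largest if largest else 0.0  # assumption for empty texts


def feature_vector(d_t, l_src, l_tgt, s_src, s_tgt, d_n):
    """Build x_i in R^4: tag structure, text length, sentence number, number sequence."""
    return [
        d_t,
        text_length_info(l_src, l_tgt),
        sentence_count_info(s_src, s_tgt),
        d_n,
    ]


# Hypothetical page pair: lengths 100 and 80, sentence counts 10 and 8.
vector = feature_vector(0.3, 100, 80, 10, 8, 0.44)
print(vector)
```

Keeping the four components in a fixed order matters, because the classifier sees only positions in the vector and not feature names.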

The SVM is trained on a training set formed from known parallel and non-parallel web page pairs, and is then used to judge whether a web page pair of unknown class is a parallel web page pair. A support vector machine result of y_i=1 indicates that the web page pair is a parallel web page pair, and y_i=-1 indicates that it is a non-parallel web page pair.

The Web page text extraction and comparison method of the present invention compares traditional web page text extraction algorithms with the web page text extraction method based on theme-similar blocks; the latter has the following advantages:

All of the above is the primary implementation of this intellectual property and does not restrict other forms of implementing this new product and/or new method. Those skilled in the art may use this information to modify the foregoing content in order to achieve similar implementations. However, all modifications or alterations based on the new product of the present invention fall within the reserved rights.

Claims

1. A Web page text extraction and comparison method, characterized in that it comprises the following steps:

Step B: identifying parallel web pages;

2. The Web page text extraction and comparison method according to claim 1, characterized in that said step A further comprises the following sub-steps:

Step 1: preprocessing the web page and constructing an HTML tree;

Step 2: pruning the HTML tree;

Step 3: obtaining the theme of the web page;

Step 4: extracting the string content in each block;

Step 5: calculating the distance between the theme S and the content y in a block;

Step 6: comparing the edit distance L with max(p, q).

3. The Web page text extraction and comparison method according to claim 1 or 2, characterized in that said step 2 further comprises the following sub-step: dividing the page into blocks according to the <table> tag, and removing leaf nodes that contain neither text nor link information.

4. The Web page text extraction and comparison method according to any one of claims 1 to 3, characterized in that said step 5 further comprises: performing word segmentation on Chinese text, and using the Levenshtein distance shown in formula (2) and formula (3):

L((x^1, ..., x^p), (y^1, ..., y^q)) = \begin{cases} p, & q = 0 \\ q, & p = 0 \\ \min\big(L((x^1, ..., x^{p-1}), (y^1, ..., y^q)) + 1,\ L((x^1, ..., x^p), (y^1, ..., y^{q-1})) + 1,\ L((x^1, ..., x^{p-1}), (y^1, ..., y^{q-1})) + Z(x^p, y^q)\big), & \text{else} \end{cases} \quad (2)

Insertion operation: C_i(t)=1;

Deletion operation: C_d(t)=1;

Substitution operation:

C_s(t_1, t_2) = \begin{cases} 0, & \text{if } t_1, t_2 \in T \\ 1.5, & \text{if } t_1 \in T_1,\ t_2 \in T_2,\ T_1 \neq T_2 \end{cases} \quad (T_1, T_2, T: \text{tag categories})

M[a, b] = \begin{cases} a, & \text{if } b = 0 \\ b, & \text{if } a = 0 \\ \min\big(M[a-1, b] + C_d(w_a),\ M[a-1, b-1] + C_s(w_a, z_b),\ M[a, b-1] + C_i(w_a)\big), & \text{otherwise} \end{cases}

D_t = M[A, B] / Max(A+1, B+1).

5. The Web page text extraction and comparison method according to any one of claims 1 to 4, characterized in that

said feature information extraction sub-step further comprises:

6. A Web page text extraction and comparison system, characterized in that it comprises the following modules:

Module B: for identifying parallel web pages;

7. The Web page text extraction and comparison system according to claim 6, characterized in that said module A further comprises the following submodules:

Preprocessing submodule: for preprocessing the web page and constructing an HTML tree;

Pruning submodule: for pruning the HTML tree;

Theme obtaining submodule: for obtaining the theme of the web page;

Block extraction submodule: for extracting the string content in each block;

Distance comparison submodule: for comparing the edit distance L with max(p, q).

8. The Web page text extraction and comparison system according to claim 6 or 7, characterized in that said pruning submodule is further used for: dividing the page into blocks according to the <table> tag, and removing leaf nodes that contain neither text nor link information.

9. The Web page text extraction and comparison system according to any one of claims 6 to 8, characterized in that said distance calculation submodule is further used for: performing word segmentation on Chinese text, and using the Levenshtein distance shown in formula (2) and formula (3):

L((x^1, ..., x^p), (y^1, ..., y^q)) = \begin{cases} p, & q = 0 \\ q, & p = 0 \\ \min\big(L((x^1, ..., x^{p-1}), (y^1, ..., y^q)) + 1,\ L((x^1, ..., x^p), (y^1, ..., y^{q-1})) + 1,\ L((x^1, ..., x^{p-1}), (y^1, ..., y^{q-1})) + Z(x^p, y^q)\big), & \text{else} \end{cases} \quad (2)

Insertion operation: C_i(t)=1;

Deletion operation: C_d(t)=1;

Substitution operation:

C_s(t_1, t_2) = \begin{cases} 0, & \text{if } t_1, t_2 \in T \\ 1.5, & \text{if } t_1 \in T_1,\ t_2 \in T_2,\ T_1 \neq T_2 \end{cases} \quad (T_1, T_2, T: \text{tag categories})

M[a, b] = \begin{cases} a, & \text{if } b = 0 \\ b, & \text{if } a = 0 \\ \min\big(M[a-1, b] + C_d(w_a),\ M[a-1, b-1] + C_s(w_a, z_b),\ M[a, b-1] + C_i(w_a)\big), & \text{otherwise} \end{cases}

D_t = M[A, B] / Max(A+1, B+1).

10. The Web page text extraction and comparison system according to any one of claims 6 to 9, characterized in that said module B further comprises the following submodules: a feature information extraction submodule and a support vector machine classification submodule;

said feature information extraction submodule is used for: