CN105574066A - Web page text extraction and comparison method and system thereof - Google Patents

Web page text extraction and comparison method and system thereof

Info

Publication number
CN105574066A
Authority
CN
China
Prior art keywords
label
character
text
web page
character string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510695688.4A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Hengbo Instrument Co Ltd
Original Assignee
Qingdao Hengbo Instrument Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Hengbo Instrument Co Ltd filed Critical Qingdao Hengbo Instrument Co Ltd
Priority to CN201510695688.4A priority Critical patent/CN105574066A/en
Publication of CN105574066A publication Critical patent/CN105574066A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/957: Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577: Optimising the visualization of content, e.g. distillation of HTML documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a web page text extraction and comparison method and a system thereof. The method comprises the following steps: A, judging whether a web page is a text page based on specific web page tags; and B, identifying parallel web pages. Step A further comprises: (1) pre-processing the web page to construct an HTML tree; (2) pruning the HTML tree; (3) obtaining the theme of the web page; (4) extracting the string content of each block; (5) calculating the distance between the theme S and the content y of a block; and (6) comparing the edit distance with max(p, q). The method has the following advantages: web pages with relatively short text can be extracted, since the length of the content does not affect the correctness of the selection and text of any length takes part in the calculation rather than being ignored; and, for web pages with complex nested <table> tags, every <table> tag is processed consistently.

Description

Web page text extraction and comparison method and system thereof
Technical field
The present invention relates to computer network technology, and in particular to a web page text extraction and comparison method and a system thereof.
Background art
There are many web page text extraction methods, some designed specifically for comment pages or news pages; the present invention, however, concerns text extraction for most general web pages. Broadly, current web page text extraction methods fall into the following directions: DOM-based methods, statistics-based methods, block-based methods, and other methods.
The Document Object Model (DOM) is a standard interface specification formulated by the W3C. Because DOM nodes are organized in a tree hierarchy, once the tree has been built, operations on the web page can be carried out as operations on the tree. Although any page structured according to the W3C standard can be converted into a corresponding DOM tree, many web pages do not in fact follow the standard. DOM methods therefore usually need a pre-processing module that finally abstracts the web page into a DOM tree.
1. DOM-based web page text extraction
The DOM-based web page text extraction method is a DOM-based web content extraction method whose original purpose was to improve PDA applications by removing advertising content. The DOM method first abstracts the web page content into corresponding objects, converting them into nodes; it then organizes the nodes by parent-child relationship, finally forming a tree structure.
Most web pages from the same website on the Internet share the same structure. For example, the <body> tag of Yahoo news pages always consists of the two tags <iframe> and <div>, so pages built from such a template can be clustered into one class. Clustering similar DOM trees requires a similarity measure. The similarity of two simple DOM trees is computed in two steps. The first step judges whether the root nodes of the two trees are identical: if not, 0 is returned; if so, the leaf nodes of the two trees are compared. The second step compares the names and attributes of the leaf nodes of the two DOM trees and returns the number of identical nodes in the two trees.
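The two-step similarity described above can be sketched as follows. This is a minimal illustration, not the implementation of any cited method: the `Node` class, the function names, and the choice to count identical leaves as a multiset intersection are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal DOM node: tag name, attributes, children."""
    name: str
    attrs: tuple = ()                      # (attribute, value) pairs
    children: list = field(default_factory=list)


def leaves(node):
    """Yield (name, attrs) for every leaf node in the tree rooted at node."""
    if not node.children:
        yield (node.name, node.attrs)
    for child in node.children:
        yield from leaves(child)


def dom_similarity(a, b):
    # Step 1: if the root nodes differ, the similarity is 0.
    if (a.name, a.attrs) != (b.name, b.attrs):
        return 0
    # Step 2: compare leaf names and attributes and count the leaves
    # the two trees have in common (assumed: multiset intersection).
    common = Counter(leaves(a)) & Counter(leaves(b))
    return sum(common.values())
```

For two trees that share a `body` root and an `iframe` leaf but differ in the attributes of their `div` leaf, `dom_similarity` returns 1.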
2. Statistics-based web page text extraction
The statistics-based method is mainly used to extract the text of news pages. Its principle is that the text of a web page can only be located in the <table> tag nodes of the page. The basic steps are as follows. The first step removes page noise and represents the page as a tree according to its tags. The second step processes each <table> node, removing the HTML tags inside the node to obtain a string containing no tags. The third step compares the character counts of the nodes; the node with the most characters is usually chosen as the web page text. The advantages of the method are that it exploits the characteristics of news pages, is general and simple to implement, needs neither different templates for different pages nor sample learning, and has low time complexity. Its drawback is that it only applies when all the text of a page is placed in a single <table> node; for pages whose text is spread over several <table> nodes it performs poorly. With the rise of microblogs, light blogs and the like, more and more pages with complex formats and short text are being produced, and the limitations of this method are becoming more obvious.
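The three steps of the statistics-based baseline can be sketched with Python's standard `html.parser` module. The class and function names are hypothetical, treating `<script>` and `<style>` as the only page noise is an assumption, and so is attributing text to the innermost open `<table>` when tables are nested.

```python
from html.parser import HTMLParser


class TableTextCollector(HTMLParser):
    """Collect the tag-free text found inside each <table> element."""

    def __init__(self):
        super().__init__()
        self.open_tables = []  # text buffers of the <table> elements currently open
        self.tables = []       # tag-free text of every closed <table>
        self.skip_depth = 0    # > 0 while inside <script> or <style> (noise)

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "table":
            self.open_tables.append([])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "table" and self.open_tables:
            self.tables.append("".join(self.open_tables.pop()))

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tables and not self.skip_depth:
            # Assumption: text belongs to the innermost open <table>.
            self.open_tables[-1].append(text)


def largest_table_text(html):
    """Return the text of the <table> with the most characters ('' if none)."""
    parser = TableTextCollector()
    parser.feed(html)
    parser.close()
    return max(parser.tables, key=len, default="")
```

The sketch also makes the stated drawback visible: when the text is spread over several `<table>` nodes, only the single longest one is returned.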
Comparison table of the effects of existing web page text extraction methods:
In general, current algorithms for web page text extraction and web page similarity calculation still target conventional Internet pages. Neither text extraction nor similarity research has seriously considered the new characteristics of mobile Internet page content, which shows mainly in the following shortcomings:
(1) The structure of mobile Internet pages is becoming more and more complex and involves more and more new techniques, so the limitations of the traditional web page text extraction algorithms described above are increasingly obvious.
(2) Because there is too much short-text web content, the theoretical basis of some text similarity algorithms no longer holds; their accuracy drops and they cannot meet the demands of large-scale data.
Summary of the invention
The technical problem to be solved by the present invention is to provide a web page text extraction and comparison method based on theme-similarity blocking; results show that the method of the invention achieves a considerable improvement in accuracy.
To solve the above technical problem, the invention provides a web page text extraction and comparison method comprising the following steps:
Step A: judging, based on specific web page tags, whether a web page is a text page;
Step B: identifying parallel web pages;
Step B further comprises a feature information extraction sub-step and a support vector machine classification sub-step.
Step A may further comprise the following sub-steps:
Step one: pre-processing the web page and constructing an HTML tree;
Step two: pruning the HTML tree;
Step three: obtaining the theme of the web page;
Step four: extracting the string content of each block;
Step five: calculating the distance between the theme S and the content y of a block;
Step six: comparing the edit distance L with max(p, q).
Step two may further comprise the following sub-step: dividing the page into blocks according to <table> tags and removing leaf nodes that contain neither text nor link information.
Step five may further comprise: segmenting the Chinese text into words and using the Levenshtein distance shown in formulas (2) and (3):
$$
L\big((x_1,\dots,x_p),(y_1,\dots,y_q)\big)=
\begin{cases}
p, & q=0\\
q, & p=0\\
\min\big(L((x_1,\dots,x_{p-1}),(y_1,\dots,y_q))+1,\ L((x_1,\dots,x_p),(y_1,\dots,y_{q-1}))+1,\ L((x_1,\dots,x_{p-1}),(y_1,\dots,y_{q-1}))+Z(x_p,y_q)\big), & \text{else}
\end{cases}
\tag{2}
$$
An improved edit distance is used to calculate the similarity of the classified HTML tag sequences:
The edit distance between two character strings is the minimum number of editing operations needed to transform one string into the other, where the editing operations are replacing one character with another, inserting a character and deleting a character. Based on the classification of tags, the improved edit distance is defined as the minimum operation cost of converting the tag sequence of one string, whose tags belong to different classes, into the tag sequence of another by deletion, insertion and replacement. The cost of a deletion or an insertion is 1, the cost of a replacement within a class is 0, and the cost of a replacement between classes is 1.5, that is:
Insertion: $C_i(t)=1$;
Deletion: $C_d(t)=1$;
Replacement:
$$
C_s(t_1,t_2)=
\begin{cases}
0, & \text{if } t_1,t_2\in T\\
1.5, & \text{if } t_1\in T_1,\ t_2\in T_2,\ T_1\neq T_2
\end{cases}
\qquad T_1,T_2,T:\ \text{tag categories}
$$
For the HTML tag sequences $W=[w_0,w_1,\dots,w_a,\dots,w_A]$ and $Z=[z_0,z_1,\dots,z_b,\dots,z_B]$, dynamic programming is used to compute the matrix M of their improved edit distance, with matrix element M[a, b] given by:
$$
M[a,b]=
\begin{cases}
a, & \text{if } b=0\\
b, & \text{if } a=0\\
\operatorname{Min}\big(M[a-1,b]+C_d(w_a),\ M[a-1,b-1]+C_s(w_a,z_b),\ M[a,b-1]+C_i(w_a)\big), & \text{otherwise}
\end{cases}
$$
The bottom-right element M[A, B] of the matrix is the improved edit distance of $S_1$ and $S_2$, and the tag structure information $D_t$ is then:
$D_t = M[A,B]/\operatorname{Max}(A+1,B+1)$.
Step B may further comprise a feature information extraction sub-step and a support vector machine classification sub-step;
The feature information extraction sub-step further comprises:
Establishing feature information: the feature information comprises the HTML tag structure information of the web page and, based on content, text length information, sentence count information and number sequence information;
According to their functional characteristics in page layout, display and linking, HTML tags are divided into three classes: structure tags, format tags and irrelevant tags:
Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are omitted when calculating structural similarity.
An improved edit distance is used to calculate the similarity of the classified HTML tag sequences:
The edit distance between two character strings is the minimum number of editing operations needed to transform one string into the other;
The editing operations are replacing one character with another, inserting a character and deleting a character;
Based on the classification of tags, the improved edit distance is defined as the minimum operation cost of converting the tag sequence of one string, whose tags belong to different classes, into the tag sequence of another by deletion, insertion and replacement.
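The improved edit distance over classified tag sequences can be sketched as below, using the costs stated in the text (insertion and deletion 1, replacement within a class 0, replacement between classes 1.5). Several points are assumptions made for illustration: the function names, the abbreviated tag sets (only a few tags of each class are listed), treating every unlisted tag as irrelevant, and normalizing by the longer filtered sequence length as a reading of Max(A+1, B+1) for sequences indexed from 0.

```python
# Abbreviated tag classes; the full lists are given in the text above.
STRUCTURE = {"body", "div", "p", "li", "ul", "td", "tr", "th"}
FORMAT = {"b", "i", "em", "strong", "span", "font", "u"}


def category(tag):
    """Return the class of a tag, or None for irrelevant (or unlisted) tags."""
    if tag in STRUCTURE:
        return "structure"
    if tag in FORMAT:
        return "format"
    return None


def substitution_cost(t1, t2):
    # Replacement within a class costs 0; between classes it costs 1.5.
    return 0.0 if category(t1) == category(t2) else 1.5


def improved_edit_distance(w, z):
    """Dynamic-programming matrix M; returns its bottom-right element."""
    rows, cols = len(w) + 1, len(z) + 1
    m = [[0.0] * cols for _ in range(rows)]
    for a in range(rows):
        m[a][0] = float(a)          # a deletions
    for b in range(cols):
        m[0][b] = float(b)          # b insertions
    for a in range(1, rows):
        for b in range(1, cols):
            m[a][b] = min(
                m[a - 1][b] + 1.0,                                   # deletion
                m[a - 1][b - 1] + substitution_cost(w[a - 1], z[b - 1]),
                m[a][b - 1] + 1.0,                                   # insertion
            )
    return m[-1][-1]


def tag_structure_distance(w, z):
    """Normalized distance D_t after dropping irrelevant tags."""
    w = [t for t in w if category(t)]
    z = [t for t in z if category(t)]
    if not w and not z:
        return 0.0
    return improved_edit_distance(w, z) / max(len(w), len(z))
```

With these costs, `["div", "b"]` and `["td", "i"]` are at distance 0 because each position is replaced within its own class, while `["div"]` and `["b"]` are at distance 1.5.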
To solve the above technical problem, the invention also provides a web page text extraction and comparison system comprising the following modules:
Module A: for judging, based on specific web page tags, whether a web page is a text page;
Module B: for identifying parallel web pages;
Step B further comprises a feature information extraction sub-step and a support vector machine classification sub-step.
Module A may further comprise the following submodules:
Pre-processing submodule: for pre-processing the web page and constructing an HTML tree;
Pruning submodule: for pruning the HTML tree;
Theme obtaining submodule: for obtaining the theme of the web page;
Block extraction submodule: for extracting the string content of each block;
Distance calculation submodule: for calculating the distance between the theme S and the content y of a block;
Distance comparison submodule: for comparing the edit distance L with max(p, q).
The pruning submodule may further be used for: dividing the page into blocks according to <table> tags and removing leaf nodes that contain neither text nor link information.
The distance calculation submodule may further be used for: segmenting the Chinese text into words and using the Levenshtein distance shown in formulas (2) and (3):
$$
L\big((x_1,\dots,x_p),(y_1,\dots,y_q)\big)=
\begin{cases}
p, & q=0\\
q, & p=0\\
\min\big(L((x_1,\dots,x_{p-1}),(y_1,\dots,y_q))+1,\ L((x_1,\dots,x_p),(y_1,\dots,y_{q-1}))+1,\ L((x_1,\dots,x_{p-1}),(y_1,\dots,y_{q-1}))+Z(x_p,y_q)\big), & \text{else}
\end{cases}
\tag{2}
$$
An improved edit distance is used to calculate the similarity of the classified HTML tag sequences:
The edit distance between two character strings is the minimum number of editing operations needed to transform one string into the other, where the editing operations are replacing one character with another, inserting a character and deleting a character. Based on the classification of tags, the improved edit distance is defined as the minimum operation cost of converting the tag sequence of one string, whose tags belong to different classes, into the tag sequence of another by deletion, insertion and replacement. The cost of a deletion or an insertion is 1, the cost of a replacement within a class is 0, and the cost of a replacement between classes is 1.5, that is:
Insertion: $C_i(t)=1$;
Deletion: $C_d(t)=1$;
Replacement:
$$
C_s(t_1,t_2)=
\begin{cases}
0, & \text{if } t_1,t_2\in T\\
1.5, & \text{if } t_1\in T_1,\ t_2\in T_2,\ T_1\neq T_2
\end{cases}
\qquad T_1,T_2,T:\ \text{tag categories}
$$
For the HTML tag sequences $W=[w_0,w_1,\dots,w_a,\dots,w_A]$ and $Z=[z_0,z_1,\dots,z_b,\dots,z_B]$, dynamic programming is used to compute the matrix M of their improved edit distance, with matrix element M[a, b] given by:
$$
M[a,b]=
\begin{cases}
a, & \text{if } b=0\\
b, & \text{if } a=0\\
\operatorname{Min}\big(M[a-1,b]+C_d(w_a),\ M[a-1,b-1]+C_s(w_a,z_b),\ M[a,b-1]+C_i(w_a)\big), & \text{otherwise}
\end{cases}
$$
The bottom-right element M[A, B] of the matrix is the improved edit distance of $S_1$ and $S_2$, and the tag structure information $D_t$ is then:
$D_t = M[A,B]/\operatorname{Max}(A+1,B+1)$.
Module B may further comprise the following submodules: a feature information extraction submodule and a support vector machine classification submodule;
The feature information extraction submodule is used for:
Establishing feature information: the feature information comprises the HTML tag structure information of the web page and, based on content, text length information, sentence count information and number sequence information;
According to their functional characteristics in page layout, display and linking, HTML tags are divided into three classes: structure tags, format tags and irrelevant tags:
Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are omitted when calculating structural similarity.
An improved edit distance is used to calculate the similarity of the classified HTML tag sequences:
The edit distance between two character strings is the minimum number of editing operations needed to transform one string into the other;
The editing operations are replacing one character with another, inserting a character and deleting a character;
Based on the classification of tags, the improved edit distance is defined as the minimum operation cost of converting the tag sequence of one string, whose tags belong to different classes, into the tag sequence of another by deletion, insertion and replacement.
The beneficial effects of the invention are as follows. Compared with traditional web page text extraction algorithms, the web page text extraction and comparison method of the invention, which is based on theme-similarity blocking, has the following advantages:
(1) Web pages with short text can be extracted, and the length of the content does not affect the correctness of the selection, because text of any length takes part in the calculation and none is ignored.
(2) Web pages with complex nested <table> tags can be processed, because an HTML tree is constructed, which ensures that every <table> tag is processed consistently.
(3) The amount of computation is reduced. No cluster analysis is needed (clustering is very time-consuming) and the entropy of blocks need not be calculated; a judgment can be made by analysing the page itself.
(4) A degree of semantic information is added. Because the semantic information of the title, labels and text is used effectively, the extracted text is more semantically relevant.
Embodiments
Embodiments of the invention are described in detail below with reference to examples, so that how the invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and implemented accordingly.
In the web page text extraction and comparison method based on theme-similarity blocking, the theme means the title and labels of the web page. To avoid short-text blocks of mobile Internet pages being ignored, the algorithm of the invention does not calculate the entropy of content blocks; instead it mainly uses the similarity between the theme and a content block as the basis for deciding whether to extract the block. Specifically, the following characteristics of web pages are used:
First, web page formats have a tree structure. More and more pages are now built according to XML standards, and web page tags usually occur nested in pairs, so a page can be converted into an HTML tree structure; the DOM-based web page text extraction method in fact also uses this property. The method of the invention builds the HTML tree mainly in order to cut away useless branches and reduce the amount of computation.
Second, web pages are usually laid out in blocks. Although the format of mobile Internet pages is complex, in terms of content each page basically comprises the following blocks: a category block, a navigation block, a text block, a related-link block, an advertisement block and so on. Using this property, together with the fact that tags usually occur nested in pairs, the page is divided into blocks by its tags. In practice, owing to the current widespread use of the DIV+CSS approach and to the good layout properties of the <table></table> tag, most pages use <table> tags to lay out the page format that is finally presented to the user. The web page text extraction method based on theme-similarity blocking builds on this and uses <table> tags to parse the page.
Third, the theme is correlated with the content. A web page usually has a title and some labels that summarize its text at a high level, so the theme in fact best reflects the characteristics of the text and represents the key content of the page. Earlier web page text extraction methods all failed to take this into account. The method of the invention uses the relation between theme and text as an important indicator for text extraction. In particular, because mobile Internet pages are increasingly diverse in structure and uneven in content length, and carry much interfering advertising, short-text content is easily drowned in advertising; taking the similarity of theme and content into account is therefore essential in extraction. The invention measures similarity by the edit distance (the Levenshtein distance), which is the minimum number of insertions, deletions and replacements needed to convert the source string a into the target string b. The Levenshtein formula is shown in formula (1):
$$
\operatorname{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j), & \min(i,j)=0\\
\min\begin{cases}
\operatorname{lev}_{a,b}(i-1,j)+1\\
\operatorname{lev}_{a,b}(i,j-1)+1\\
\operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}
\end{cases}, & \text{else}
\end{cases}
\tag{1}
$$
Note: a and b are character strings, i is the length of string a and j is the length of string b. Based on the three points above, the basic idea of the web page text extraction method based on theme-similarity blocking is as follows: convert the web page into an HTML tree structure; extract the theme of the page; extract content blocks using web page tags; calculate the edit distance (Levenshtein distance) L between the theme and each content block. When L is less than the length p of the content block, the block is regarded as web page text to be extracted; when L is greater than or equal to the length of a content block, that block is ignored.
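The distance calculation and the selection rule in the paragraph above can be sketched as follows. The function names are hypothetical, and splitting on whitespace stands in for the Chinese word segmentation mentioned in the method; the threshold used is the block length, as stated in the paragraph above.

```python
def levenshtein(x, y):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(y) + 1))            # distances from the empty prefix of x
    for i, xi in enumerate(x, 1):
        cur = [i]                             # deleting the first i items of x
        for j, yj in enumerate(y, 1):
            cur.append(min(
                prev[j] + 1,                  # deletion
                cur[j - 1] + 1,               # insertion
                prev[j - 1] + (xi != yj),     # replacement (free when equal)
            ))
        prev = cur
    return prev[-1]


def select_blocks(theme, blocks):
    """Keep the blocks whose distance to the theme is below the block length."""
    return [b for b in blocks if levenshtein(theme, b) < len(b)]


# Usage: whitespace tokens stand in for segmented words.
theme = "web page text extraction".split()
blocks = [
    "web page text extraction method based on topic similarity".split(),
    "buy cheap watches now".split(),
]
kept = select_blocks(theme, blocks)           # only the first block is kept
```

The step list elsewhere in the description compares L with max(p, q) instead; swapping `len(b)` for `max(len(theme), len(b))` would give that variant, and which threshold is intended is not settled by the text.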
In one embodiment, the invention provides a web page text extraction and comparison method comprising the following steps:
Step A: judging, based on specific web page tags, whether a web page is a text page;
Step B: identifying parallel web pages;
Step B further comprises a feature information extraction sub-step and a support vector machine classification sub-step.
Step A may further comprise the following sub-steps:
Step one: pre-processing the web page and constructing an HTML tree;
Step two: pruning the HTML tree;
Step three: obtaining the theme of the web page;
Step four: extracting the string content of each block;
Step five: calculating the distance between the theme S and the content y of a block;
Step six: comparing the edit distance L with max(p, q).
Step two may further comprise the following sub-step: dividing the page into blocks according to <table> tags and removing leaf nodes that contain neither text nor link information.
Step five may further comprise: segmenting the Chinese text into words and using the Levenshtein distance shown in formulas (2) and (3):
$$
L\big((x_1,\dots,x_p),(y_1,\dots,y_q)\big)=
\begin{cases}
p, & q=0\\
q, & p=0\\
\min\big(L((x_1,\dots,x_{p-1}),(y_1,\dots,y_q))+1,\ L((x_1,\dots,x_p),(y_1,\dots,y_{q-1}))+1,\ L((x_1,\dots,x_{p-1}),(y_1,\dots,y_{q-1}))+Z(x_p,y_q)\big), & \text{else}
\end{cases}
\tag{2}
$$
An improved edit distance is used to calculate the similarity of the classified HTML tag sequences:
The edit distance between two character strings is the minimum number of editing operations needed to transform one string into the other, where the editing operations are replacing one character with another, inserting a character and deleting a character. Based on the classification of tags, the improved edit distance is defined as the minimum operation cost of converting the tag sequence of one string, whose tags belong to different classes, into the tag sequence of another by deletion, insertion and replacement. The cost of a deletion or an insertion is 1, the cost of a replacement within a class is 0, and the cost of a replacement between classes is 1.5, that is:
Insertion: $C_i(t)=1$;
Deletion: $C_d(t)=1$;
Replacement:
$$
C_s(t_1,t_2)=
\begin{cases}
0, & \text{if } t_1,t_2\in T\\
1.5, & \text{if } t_1\in T_1,\ t_2\in T_2,\ T_1\neq T_2
\end{cases}
\qquad T_1,T_2,T:\ \text{tag categories}
$$
For the HTML tag sequences $W=[w_0,w_1,\dots,w_a,\dots,w_A]$ and $Z=[z_0,z_1,\dots,z_b,\dots,z_B]$, dynamic programming is used to compute the matrix M of their improved edit distance, with matrix element M[a, b] given by:
$$
M[a,b]=
\begin{cases}
a, & \text{if } b=0\\
b, & \text{if } a=0\\
\operatorname{Min}\big(M[a-1,b]+C_d(w_a),\ M[a-1,b-1]+C_s(w_a,z_b),\ M[a,b-1]+C_i(w_a)\big), & \text{otherwise}
\end{cases}
$$
The bottom-right element M[A, B] of the matrix is the improved edit distance of $S_1$ and $S_2$, and the tag structure information $D_t$ is then:
$D_t = M[A,B]/\operatorname{Max}(A+1,B+1)$.
Step B may further comprise a feature information extraction sub-step and a support vector machine classification sub-step;
The feature information extraction sub-step further comprises:
Establishing feature information: the feature information comprises the HTML tag structure information of the web page and, based on content, text length information, sentence count information and number sequence information;
According to their functional characteristics in page layout, display and linking, HTML tags are divided into three classes: structure tags, format tags and irrelevant tags:
Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are omitted when calculating structural similarity.
An improved edit distance is used to calculate the similarity of the classified HTML tag sequences:
The edit distance between two character strings is the minimum number of editing operations needed to transform one string into the other;
The editing operations are replacing one character with another, inserting a character and deleting a character;
Based on the classification of tags, the improved edit distance is defined as the minimum operation cost of converting the tag sequence of one string, whose tags belong to different classes, into the tag sequence of another by deletion, insertion and replacement.
In another embodiment, the invention also provides a web page text extraction and comparison system comprising the following modules:
Module A: for judging, based on specific web page tags, whether a web page is a text page;
Module B: for identifying parallel web pages;
Step B further comprises a feature information extraction sub-step and a support vector machine classification sub-step.
Module A may further comprise the following submodules:
Pre-processing submodule: for pre-processing the web page and constructing an HTML tree;
Pruning submodule: for pruning the HTML tree;
Theme obtaining submodule: for obtaining the theme of the web page;
Block extraction submodule: for extracting the string content of each block;
Distance calculation submodule: for calculating the distance between the theme S and the content y of a block;
Distance comparison submodule: for comparing the edit distance L with max(p, q).
The pruning submodule may further be used for: dividing the page into blocks according to <table> tags and removing leaf nodes that contain neither text nor link information.
The distance calculation submodule may further be used for: segmenting the Chinese text into words and using the Levenshtein distance shown in formulas (2) and (3):
$$
L\big((x_1,\dots,x_p),(y_1,\dots,y_q)\big)=
\begin{cases}
p, & q=0\\
q, & p=0\\
\min\big(L((x_1,\dots,x_{p-1}),(y_1,\dots,y_q))+1,\ L((x_1,\dots,x_p),(y_1,\dots,y_{q-1}))+1,\ L((x_1,\dots,x_{p-1}),(y_1,\dots,y_{q-1}))+Z(x_p,y_q)\big), & \text{else}
\end{cases}
\tag{2}
$$
An improved edit distance is used to calculate the similarity of the classified HTML tag sequences:
The edit distance between two character strings is the minimum number of editing operations needed to transform one string into the other, where the editing operations are replacing one character with another, inserting a character and deleting a character. Based on the classification of tags, the improved edit distance is defined as the minimum operation cost of converting the tag sequence of one string, whose tags belong to different classes, into the tag sequence of another by deletion, insertion and replacement. The cost of a deletion or an insertion is 1, the cost of a replacement within a class is 0, and the cost of a replacement between classes is 1.5, that is:
Insertion: $C_i(t)=1$;
Deletion: $C_d(t)=1$;
Replacement:
$$
C_s(t_1,t_2)=
\begin{cases}
0, & \text{if } t_1,t_2\in T\\
1.5, & \text{if } t_1\in T_1,\ t_2\in T_2,\ T_1\neq T_2
\end{cases}
\qquad T_1,T_2,T:\ \text{tag categories}
$$
For the HTML tag sequences $W=[w_0,w_1,\dots,w_a,\dots,w_A]$ and $Z=[z_0,z_1,\dots,z_b,\dots,z_B]$, dynamic programming is used to compute the matrix M of their improved edit distance, with matrix element M[a, b] given by:
$$
M[a,b]=
\begin{cases}
a, & \text{if } b=0\\
b, & \text{if } a=0\\
\operatorname{Min}\big(M[a-1,b]+C_d(w_a),\ M[a-1,b-1]+C_s(w_a,z_b),\ M[a,b-1]+C_i(w_a)\big), & \text{otherwise}
\end{cases}
$$
The bottom-right element M[A, B] of the matrix is the improved edit distance of $S_1$ and $S_2$, and the tag structure information $D_t$ is then:
$D_t = M[A,B]/\operatorname{Max}(A+1,B+1)$.
Module B may further comprise the following submodules: a feature information extraction submodule and a support vector machine classification submodule;
The feature information extraction submodule is used for:
Establishing feature information: the feature information comprises the HTML tag structure information of the web page and, based on content, text length information, sentence count information and number sequence information;
According to their functional characteristics in page layout, display and linking, HTML tags are divided into three classes: structure tags, format tags and irrelevant tags:
Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are omitted when calculating structural similarity.
An improved edit distance is used to calculate the similarity of the classified HTML tag sequences:
The edit distance between two character strings is the minimum number of editing operations needed to transform one string into the other;
The editing operations are replacing one character with another, inserting a character and deleting a character;
Based on the classification of tags, the improved edit distance is defined as the minimum operation cost of converting the tag sequence of one string, whose tags belong to different classes, into the tag sequence of another by deletion, insertion and replacement.
In another embodiment, the invention provides a web page text extraction and comparison method, comprising the following steps:
Step A: determine, based on specific tags of the web page, whether the web page is a text page;
Step B: identify parallel web pages.
Said step A may further comprise the following sub-steps:
Step 1: preprocess the web page and construct the HTML tree;
Step 2: prune the HTML tree;
Step 3: obtain the web page topic;
Step 4: extract the string content of each block;
Step 5: compute the distance between the topic S and the content y of a block;
Step 6: compare the edit distance L with max(p, q).
Said step 2 may further comprise the following sub-step: divide the page into blocks according to <table> tags and remove the leaf nodes that contain neither text nor link information.
Said step 5 may further comprise: segment Chinese text into words; the Levenshtein distance used is given by formulas (2) and (3):
$$L\big((x_1, \ldots, x_p), (y_1, \ldots, y_q)\big) = \begin{cases} p, & q = 0 \\ q, & p = 0 \\ \min\big(L((x_1, \ldots, x_{p-1}), (y_1, \ldots, y_q)) + 1,\ L((x_1, \ldots, x_p), (y_1, \ldots, y_{q-1})) + 1,\ L((x_1, \ldots, x_{p-1}), (y_1, \ldots, y_{q-1})) + Z(x_p, y_q)\big), & \text{else} \end{cases} \tag{2}$$
Said step B may further comprise: a feature information extraction sub-step and a support vector machine classification sub-step;
Said feature information extraction sub-step further comprises:
Establish the feature information: the feature information comprises the HTML tag structure information of the web page and the content-based text length information, sentence count information, and number sequence information;
HTML tags are divided into three classes, structural tags, formatting tags, and irrelevant tags, according to their function in page layout, display, and linking:
Structural tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
Formatting tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are excluded when computing structural similarity.
The improved edit distance is used to compute the similarity of the classified HTML tag sequences:
The edit distance between two strings is the minimum number of edit operations needed to transform one string into the other;
The edit operations are substituting one character for another, inserting a character, and deleting a character;
Given the classification of tags, the improved edit distance is defined as the minimum operation cost of converting the classified tags of one sequence into those of another by deletion, insertion, and substitution.
In another embodiment, the invention also provides a web page text extraction and comparison system, comprising the following modules:
Module A: for determining, based on specific tags of the web page, whether the web page is a text page;
Module B: for identifying parallel web pages.
Said module A may further comprise the following submodules:
Preprocessing submodule: for preprocessing the web page and constructing the HTML tree;
Pruning submodule: for pruning the HTML tree;
Topic acquisition submodule: for obtaining the web page topic;
Block extraction submodule: for extracting the string content of each block;
Distance computation submodule: for computing the distance between the topic S and the content y of a block;
Distance comparison submodule: for comparing the edit distance L with max(p, q).
Said pruning submodule may further be used to: divide the page into blocks according to <table> tags and remove the leaf nodes that contain neither text nor link information.
Said distance computation submodule may further be used to: segment Chinese text into words; the Levenshtein distance used is given by formulas (2) and (3):
$$L\big((x_1, \ldots, x_p), (y_1, \ldots, y_q)\big) = \begin{cases} p, & q = 0 \\ q, & p = 0 \\ \min\big(L((x_1, \ldots, x_{p-1}), (y_1, \ldots, y_q)) + 1,\ L((x_1, \ldots, x_p), (y_1, \ldots, y_{q-1})) + 1,\ L((x_1, \ldots, x_{p-1}), (y_1, \ldots, y_{q-1})) + Z(x_p, y_q)\big), & \text{else} \end{cases} \tag{2}$$
Said module B may further comprise the following submodules: a feature information extraction submodule and a support vector machine classification submodule;
Said feature information extraction submodule is used to:
Establish the feature information: the feature information comprises the HTML tag structure information of the web page and the content-based text length information, sentence count information, and number sequence information;
HTML tags are divided into three classes, structural tags, formatting tags, and irrelevant tags, according to their function in page layout, display, and linking:
Structural tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
Formatting tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are excluded when computing structural similarity.
The improved edit distance is used to compute the similarity of the classified HTML tag sequences:
The edit distance between two strings is the minimum number of edit operations needed to transform one string into the other;
The edit operations are substituting one character for another, inserting a character, and deleting a character;
Given the classification of tags, the improved edit distance is defined as the minimum operation cost of converting the classified tags of one sequence into those of another by deletion, insertion, and substitution.
In another embodiment, following the basic idea of the web page text extraction method based on topic-similar blocks, the algorithm of the invention mainly comprises three key steps: constructing the HTML tree, extracting the web page topic, and computing the degree of similarity between the topic and each block. In addition, because web pages are semi-structured, preprocessing is required; and to reduce the amount of computation, the constructed tree must be pruned. Specifically, the basic steps of the algorithm are as follows:
Step 1: preprocess the web page and construct the HTML tree. The web page is normalized and finally mapped to a tree structure, through the following sub-steps:
(1) Occurrences of "<" and ">" anywhere other than in the web page's <table> and related tags are replaced with &lt; and &gt;, and end marks that are missing because of non-standard markup, as with <li> and <hr>, are completed.
(2) The attribute values of all tags in the web page are placed in quotation marks, e.g. <a href="www.hust.edu.cn">.
(3) Tags are matched in pairs, i.e. each start tag has a corresponding end tag: <body> corresponds to </body>, and <head> to </head>.
(4) Tags are correctly nested, e.g. <a><b></b></a>. Only correctly nested tags can be processed correctly in the iterative processing.
(5) Useless markup such as form and img is removed. Using the normalized tag information, the HTML tree corresponding to the web page is constructed recursively.
Step 2: prune the HTML tree. Because the page is divided into blocks according to <table> tags, some leaf nodes contain neither text nor link information; these useless branches are removed to reduce the amount of computation.
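Steps 1 and 2 can be illustrated with a minimal Python sketch. It is not the implementation of the invention: it assumes the standard-library `html.parser` module in place of the normalization sub-steps, and the names `Node`, `TreeBuilder`, `build_tree`, `prune`, and `VOID_TAGS` are hypothetical helpers introduced only for illustration.

```python
from html.parser import HTMLParser

# Tags that never take an end tag; treated as leaves (illustrative subset).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One element of the HTML tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""          # text found directly inside this element
        self.has_link = False   # True for <a> elements carrying an href


class TreeBuilder(HTMLParser):
    """Builds a Node tree; open children are closed when an ancestor ends."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        node.has_link = tag == "a" and any(name == "href" for name, _ in attrs)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node   # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def build_tree(html):
    """Parse markup and return the synthetic root node of the tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def prune(node):
    """Drop subtrees with no text and no link; return True if node is kept."""
    node.children = [child for child in node.children if prune(child)]
    return bool(node.children or node.text.strip() or node.has_link)
```

Calling `prune(build_tree(page))` on a page whose table contains an empty cell and an image-only row leaves only the branches that carry text or links, which is the reduction in computation that Step 2 aims for.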
Step 3: obtain the web page topic. The content of the page title, of the headings at each level <h1> to <h6>, and of the <meta> tags is obtained. For Chinese text, the ICTCLAS word segmentation system proposed by the Chinese Academy of Sciences can be used to segment this content; function words, stop words, and the like are then removed, finally yielding a sequence S_title that contains only content words.
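A rough Python sketch of Step 3 follows. Several points are assumptions made for illustration only: whitespace tokenization stands in for the ICTCLAS segmenter (which is not called here), the stop-word list `STOP_WORDS` is a hypothetical placeholder, and only `<meta>` tags named `keywords` or `description` are read.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # hypothetical list


class TopicCollector(HTMLParser):
    """Collects title, heading, and selected <meta> content from a page."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # > 0 while inside a title or heading element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.depth += 1
        elif tag == "meta":
            attributes = dict(attrs)
            # Assumption: only these two meta names describe the topic.
            if attributes.get("name") in ("keywords", "description"):
                if attributes.get("content"):
                    self.parts.append(attributes["content"])

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def topic_words(html):
    """Return the topic word sequence with stop words removed."""
    collector = TopicCollector()
    collector.feed(html)
    collector.close()
    # Whitespace tokenization replaces word segmentation in this sketch.
    words = " ".join(collector.parts).lower().replace(",", " ").split()
    return [word for word in words if word not in STOP_WORDS]
```

For Chinese pages the whitespace split would have to be replaced by a real word segmenter; the rest of the flow (collect, segment, filter) stays the same.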
Step 4: extract the string content of each block. Starting from the leaf nodes of the HTML tree, the subtree corresponding to each innermost <table> tag is merged into one block, the HTML markup inside the block is removed, and the string content Y of the block is obtained.
Step 5: compute the distance between the topic S and the content Y of a block. Chinese text must first be segmented into words, again with the Chinese Academy of Sciences segmentation system used in Step 3. The Levenshtein distance used in the invention is given by formulas (2) and (3):
$$L\big((x_1, \ldots, x_p), (y_1, \ldots, y_q)\big) = \begin{cases} p, & q = 0 \\ q, & p = 0 \\ \min\big(L((x_1, \ldots, x_{p-1}), (y_1, \ldots, y_q)) + 1,\ L((x_1, \ldots, x_p), (y_1, \ldots, y_{q-1})) + 1,\ L((x_1, \ldots, x_{p-1}), (y_1, \ldots, y_{q-1})) + Z(x_p, y_q)\big), & \text{else} \end{cases} \tag{2}$$
Step 6: compare the edit distance L with max(p, q), the length of the longer of the two sequences. If L < max(p, q), the block contains body text and is extracted; otherwise it is identified as interference and ignored. The body text of the web page is finally obtained.
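Steps 5 and 6 can be sketched in Python as below. Formula (3) is not reproduced in the text, so the sketch assumes the usual Levenshtein substitution term, Z(x_p, y_q) = 0 when the two words are equal and 1 otherwise; the function names `levenshtein` and `is_body_block` are hypothetical.

```python
def levenshtein(xs, ys):
    """Edit distance between two word sequences, following recurrence (2)."""
    p, q = len(xs), len(ys)
    previous = list(range(q + 1))          # distances for the empty prefix of xs
    for i in range(1, p + 1):
        current = [i] + [0] * q
        for j in range(1, q + 1):
            # Assumed form of Z(x_p, y_q): 0 for equal words, 1 otherwise.
            z = 0 if xs[i - 1] == ys[j - 1] else 1
            current[j] = min(previous[j] + 1,        # delete from xs
                             current[j - 1] + 1,     # insert into xs
                             previous[j - 1] + z)    # substitute or match
        previous = current
    return previous[q]


def is_body_block(topic, block):
    """Step 6: keep the block when L < max(p, q)."""
    return levenshtein(topic, block) < max(len(topic), len(block))
```

With the topic `["tea", "history", "china"]`, a block `["tea", "history", "in", "china", "today"]` has distance 2 against a maximum length of 5 and is kept, while a navigation block `["home", "login", "register"]` shares no word with the topic, reaches distance 3 = max(p, q), and is ignored.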
In addition, the web page text extraction and comparison method of the invention also includes the identification of parallel web pages.
Parallel web page identification in the invention consists of two parts: feature information extraction and support vector machine classification.
1. Feature information extraction
The feature information mainly comprises the HTML tag structure information of the web page and the content-based text length information, sentence count information, and number sequence information.
(1) Tag structure features
The body content of bilingual parallel web pages is mutually translated, but the presentation of the pages often differs considerably. To avoid wrongly rejecting parallel web pages because of differences in formatting, and to strengthen the similarity of structural-tag overlap between parallel pages, HTML tags are divided into three classes, structural tags, formatting tags, and irrelevant tags, according to their different functions in page layout, display, linking, and so on:
Structural tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul, etc.;
Formatting tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u, etc.;
Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend, etc.; these are excluded when computing structural similarity.
The improved edit distance is used to compute the similarity of the classified HTML tag sequences.
The edit distance between two strings is the minimum number of edit operations needed to transform one string into the other; the edit operations are substituting one character for another, inserting a character, and deleting a character. Given the classification of tags, the improved edit distance is defined as the minimum operation cost of converting the classified tags of one sequence into those of another by deletion, insertion, and substitution. Deletion and insertion each cost 1, substitution within a class costs 0, and substitution between classes costs 1.5, that is:
Insertion: $C_i(t) = 1$;
Deletion: $C_d(t) = 1$;
Substitution:
$$C_s(t_1, t_2) = \begin{cases} 0, & \text{if } t_1, t_2 \in T \\ 1.5, & \text{if } t_1 \in T_1,\ t_2 \in T_2,\ T_1 \neq T_2 \end{cases} \qquad T_1, T_2, T:\ \text{tag categories}$$
For the HTML tag sequences $W = [w_0, w_1, \ldots, w_a, \ldots, w_A]$ and $Z = [z_0, z_1, \ldots, z_b, \ldots, z_B]$, the improved edit distance matrix $M$ is computed by dynamic programming, with matrix element $M[a, b]$:
$$M[a, b] = \begin{cases} a, & \text{if } b = 0 \\ b, & \text{if } a = 0 \\ \mathrm{Min}\big(M[a-1, b] + C_d(w_a),\ M[a-1, b-1] + C_s(w_a, z_b),\ M[a, b-1] + C_i(w_a)\big), & \text{otherwise} \end{cases}$$
The bottom-right element $M[A, B]$ is the improved edit distance between the two sequences $W$ and $Z$, and the tag structure information $D_t$ is then:
$$D_t = M[A, B] / \mathrm{Max}(A + 1, B + 1)$$
For example, for the HTML tag sequences W = [div, style, style, div, style, style, p, p, div, div] and Z = [div, table, tr, td, span, span, td, tr, table, div], the improved edit distance matrix is shown in Table 1; the improved edit distance is 3, and the tag structure information is $D_t = 0.3$.
Table 1: Improved edit distance matrix M for W and Z
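The improved edit distance and the worked example above can be checked with a short Python sketch. It makes a few assumptions that are not stated in the text: `table` is counted as a structural tag (as the example sequence Z requires), irrelevant tags are filtered out before the comparison, and the matrix carries one extra row and column for the empty prefix, so the normalizer Max(A + 1, B + 1) becomes the length of the longer sequence. All function names are hypothetical.

```python
STRUCTURE = {"blockquote", "body", "dir", "div", "dt", "h", "head", "hr", "li",
             "menu", "p", "q", "table", "tbody", "td", "tfoot", "th", "thead",
             "tr", "ul"}
FORMAT = {"abbr", "acronym", "b", "big", "center", "cite", "code", "dfn", "em",
          "font", "i", "pre", "s", "small", "span", "strike", "strong", "style",
          "sub", "sup", "tt", "u"}


def tag_class(tag):
    """Return the class of a tag, or None for irrelevant tags."""
    if tag in STRUCTURE:
        return "structure"
    if tag in FORMAT:
        return "format"
    return None


def substitution_cost(t1, t2):
    """C_s: 0 within a class, 1.5 between classes."""
    return 0.0 if tag_class(t1) == tag_class(t2) else 1.5


def improved_edit_distance(w, z):
    """Fill the matrix M by dynamic programming and return its last element."""
    rows, cols = len(w), len(z)
    m = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for a in range(rows + 1):
        m[a][0] = float(a)
    for b in range(cols + 1):
        m[0][b] = float(b)
    for a in range(1, rows + 1):
        for b in range(1, cols + 1):
            m[a][b] = min(m[a - 1][b] + 1.0,                        # C_d = 1
                          m[a - 1][b - 1] + substitution_cost(w[a - 1], z[b - 1]),
                          m[a][b - 1] + 1.0)                        # C_i = 1
    return m[rows][cols]


def tag_structure_info(w, z):
    """D_t: improved edit distance normalized by the longer sequence length."""
    w = [t for t in w if tag_class(t) is not None]   # drop irrelevant tags
    z = [t for t in z if tag_class(t) is not None]
    longest = max(len(w), len(z))
    return improved_edit_distance(w, z) / longest if longest else 0.0
```

For the sequences W and Z of the example, the only unavoidable costs are the two positions where a formatting tag in W faces a structural tag in Z, each costing 1.5, which reproduces the distance of 3 and D_t = 0.3.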
(2) Content surface features
To reduce dependence on a bilingual dictionary, content surface features refer specifically to information that is directly related to the content but does not involve lexical translation. They mainly comprise the sentence count information, text length information, and number sequence information of a text pair; each feature is computed as follows:
1) Sentence count information $D_s$:
$$D_s = \mathrm{Min}(S_S, S_T) / \mathrm{Max}(S_S, S_T)$$
2) Text length information $D_l$:
$$D_l = |L_S - L_T| / \mathrm{Max}(L_S, L_T)$$
3) Number sequence information $D_n$:
$$D_n = 1 - Z / \mathrm{Max}(m, n)$$
where m and n are the numbers of numerals occurring in the source-language text and the target-language text respectively, and Z is the maximum matching length, computed as follows:
Suppose the number sequences extracted from the source-language and target-language texts of the pair are $X = [x_1, x_2, \ldots, x_i, \ldots, x_m]$ and $Y = [y_1, y_2, \ldots, y_j, \ldots, y_n]$. An m × n matching relation matrix C is built from them, with element C[i, j]:
$$C[i, j] = \begin{cases} 0, & x_i \neq y_j \\ 1, & x_i = y_j \end{cases}$$
The matrix C is used to build the maximum matching length matrix D of the sequences; the element D[i, j] is computed according to the following rules:
a. The loop runs from right to left and from bottom to top.
b. The element D[i, j], with entries outside the matrix taken as 0, is:
$$D[i, j] = \mathrm{Max}\big(C[i, j] + D[i+1, j+1],\ D[i, j+1],\ D[i+1, j]\big)$$
The element D[0, 0], generated last in matrix D, is the maximum matching length Z.
To illustrate the computation of the co-occurring number sequence information, take the number sequences X = [4, 5, 34, 5, 2, 45, 8, 12] and Y = [4, 7, 34, 8, 78, 9, 5, 2, 12]. The resulting matching relation matrix C is shown in Table 2 and the maximum matching matrix D in Table 3; the maximum matching length Z is therefore 5, and the number sequence information $D_n$ is 1 - 5/9 = 0.44.
Table 2: Matching relation matrix C for X and Y
Table 3: Maximum matching matrix D for X and Y
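The three content surface features can be sketched in Python as follows. The sketch assumes that the maximum matching length Z is the length of the longest common subsequence of the two number sequences, which is what the worked example (Z = 5) yields, and that all inputs are non-empty; the function names are hypothetical.

```python
def sentence_count_info(s_source, s_target):
    """D_s = Min(S_S, S_T) / Max(S_S, S_T); inputs assumed positive."""
    return min(s_source, s_target) / max(s_source, s_target)


def text_length_info(l_source, l_target):
    """Text length information |L_S - L_T| / Max(L_S, L_T); inputs assumed positive."""
    return abs(l_source - l_target) / max(l_source, l_target)


def max_matching_length(x, y):
    """Fill D from right to left and bottom to top; D[0][0] is Z."""
    m, n = len(x), len(y)
    # One extra zero row and column stand in for entries outside the matrix.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            c = 1 if x[i] == y[j] else 0          # matching relation C[i, j]
            d[i][j] = max(c + d[i + 1][j + 1], d[i][j + 1], d[i + 1][j])
    return d[0][0]


def number_sequence_info(x, y):
    """D_n = 1 - Z / Max(m, n); sequences assumed non-empty."""
    return 1 - max_matching_length(x, y) / max(len(x), len(y))
```

For X = [4, 5, 34, 5, 2, 45, 8, 12] and Y = [4, 7, 34, 8, 78, 9, 5, 2, 12] the matched numbers are 4, 34, 5, 2, and 12, so Z = 5 and D_n = 1 - 5/9, in agreement with the example.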
The web page text extraction and comparison method of the invention uses the SVM algorithm for support vector machine classification. The SVM algorithm is an implementation of statistical learning theory. It is built on the VC dimension (Vapnik-Chervonenkis dimension) theory of statistical learning and on the principle of structural risk minimization: by introducing a kernel function, the sample vectors are mapped to a high-dimensional feature space, an optimal separating surface is then constructed in that space, and a linear optimal decision function is obtained. The advantage of SVM is that the kernel function neatly solves the dimensionality problem, so that the computational complexity of the learning algorithm is not directly tied to the sample dimension.
Let $\{(x_i, y_i),\ i = 1, \ldots, S\}$ be the SVM training data set consisting of S data points, where $x_i \in R^n$ and $y_i \in \{-1, 1\}$; the optimal decision function is:
$$f(x) = \mathrm{Sgn}\Big[\sum_{i=1}^{S} \alpha_i y_i \langle x \cdot x_i \rangle + b\Big] \tag{2.8}$$
where Sgn[·] is the sign function, the non-negative variables $\alpha_i$ are the Lagrange multipliers, and b is the bias of the hyperplane.
From the preprocessed source-language and target-language documents, web pages whose mirrored local paths differ by no more than two levels are selected to form candidate parallel web page pairs. For each pair, the HTML tag sequence information $D_t$, the text length information $D_l$, the sentence count information $D_s$, and the number sequence information $D_n$ are computed and together form the feature vector $x_i \in R^n$ (n = 4) of the SVM classifier. $D_t$ reflects the structural information of the pages and is extracted from the preprocessed web pages; $D_l$, $D_s$, and $D_n$ reflect the content information of the pages and are extracted from the web page text.
By training the SVM on a training set formed from known parallel and non-parallel web page pairs, it is determined whether a web page pair of unknown class is parallel. A support vector machine result of $y_i = 1$ indicates that the pair is a parallel web page pair, and $y_i = -1$ indicates that it is a non-parallel web page pair.
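The decision function (2.8) can be evaluated with a few lines of Python once training has produced the multipliers and the bias. The sketch below covers only that evaluation step, with the plain inner product of formula (2.8) rather than a kernel; the support vectors, multipliers, and bias are hypothetical values chosen for illustration, not trained results, and a sum of exactly zero is mapped to +1 by convention.

```python
def svm_decide(x, support_vectors, labels, alphas, bias):
    """Evaluate f(x) = Sgn[sum_i alpha_i * y_i * <x, x_i> + b]."""
    total = bias
    for x_i, y_i, alpha_i in zip(support_vectors, labels, alphas):
        inner = sum(u * v for u, v in zip(x, x_i))   # inner product <x, x_i>
        total += alpha_i * y_i * inner
    return 1 if total >= 0 else -1


# Hypothetical model over the feature order (D_t, D_l, D_s, D_n).
SUPPORT_VECTORS = [[0.1, 0.1, 0.9, 0.1],   # a parallel pair, label +1
                   [0.8, 0.7, 0.3, 0.9]]   # a non-parallel pair, label -1
LABELS = [1, -1]
ALPHAS = [1.0, 1.0]
BIAS = 0.5

candidate_parallel = [0.2, 0.1, 0.95, 0.2]
candidate_other = [0.9, 0.8, 0.2, 0.95]
print(svm_decide(candidate_parallel, SUPPORT_VECTORS, LABELS, ALPHAS, BIAS))
print(svm_decide(candidate_other, SUPPORT_VECTORS, LABELS, ALPHAS, BIAS))
```

In practice the multipliers and bias would come from training on labeled web page pairs, and a kernel function could replace the inner product as the description of SVM above indicates.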
Comparing traditional web page segmentation algorithms with the web page text extraction method based on topic-similar blocks used in the invention, the latter has the following advantages:
(1) Web pages with short body text can be extracted; the length of the content does not affect the correctness of the selection, because text of any length takes part in the computation and none is ignored.
(2) Web pages with complex <table> nesting can be processed: because an HTML tree is constructed, every <table> tag is guaranteed to be handled consistently.
(3) The amount of computation is reduced. No cluster analysis is needed (clustering is very time-consuming), nor does the entropy of blocks need to be computed; analyzing the web page itself is enough to reach a decision.
(4) A degree of semantic information is added. Because the semantic information of the heading tags and the body text is used effectively, the extracted body text is more semantically relevant.
All of the above is the primary implementation of this intellectual property and does not restrict other forms of implementing this new product and/or new method. Those skilled in the art may use this information to modify the above content to achieve similar implementations. However, all modifications or adaptations based on the new product of the invention fall within the reserved rights.

Claims (10)

1. A web page text extraction and comparison method, characterized by comprising the following steps:
Step A: determine, based on specific tags of the web page, whether the web page is a text page;
Step B: identify parallel web pages;
Said step B further comprises: a feature information extraction sub-step and a support vector machine classification sub-step.
2. The web page text extraction and comparison method according to claim 1, characterized in that said step A further comprises the following sub-steps:
Step 1: preprocess the web page and construct the HTML tree;
Step 2: prune the HTML tree;
Step 3: obtain the web page topic;
Step 4: extract the string content of each block;
Step 5: compute the distance between the topic S and the content y of a block;
Step 6: compare the edit distance L with max(p, q).
3. The web page text extraction and comparison method according to claim 1 or 2, characterized in that said step 2 further comprises the following sub-step: divide the page into blocks according to <table> tags and remove the leaf nodes that contain neither text nor link information.
4. The web page text extraction and comparison method according to any one of claims 1 to 3, characterized in that said step 5 further comprises: segment Chinese text into words; the Levenshtein distance used is given by formulas (2) and (3):
$$L\big((x_1, \ldots, x_p), (y_1, \ldots, y_q)\big) = \begin{cases} p, & q = 0 \\ q, & p = 0 \\ \min\big(L((x_1, \ldots, x_{p-1}), (y_1, \ldots, y_q)) + 1,\ L((x_1, \ldots, x_p), (y_1, \ldots, y_{q-1})) + 1,\ L((x_1, \ldots, x_{p-1}), (y_1, \ldots, y_{q-1})) + Z(x_p, y_q)\big), & \text{else} \end{cases} \tag{2}$$
The improved edit distance is used to compute the similarity of the classified HTML tag sequences:
The edit distance between two strings is the minimum number of edit operations needed to transform one string into the other; the edit operations are substituting one character for another, inserting a character, and deleting a character. Given the classification of tags, said improved edit distance is defined as the minimum operation cost of converting the classified tags of one sequence into those of another by deletion, insertion, and substitution. Deletion and insertion each cost 1, substitution within a class costs 0, and substitution between classes costs 1.5, that is:
Insertion: $C_i(t) = 1$;
Deletion: $C_d(t) = 1$;
Substitution:
$$C_s(t_1, t_2) = \begin{cases} 0, & \text{if } t_1, t_2 \in T \\ 1.5, & \text{if } t_1 \in T_1,\ t_2 \in T_2,\ T_1 \neq T_2 \end{cases} \qquad T_1, T_2, T:\ \text{tag categories}$$
For the HTML tag sequences $W = [w_0, w_1, \ldots, w_a, \ldots, w_A]$ and $Z = [z_0, z_1, \ldots, z_b, \ldots, z_B]$, the improved edit distance matrix $M$ is computed by dynamic programming, with matrix element $M[a, b]$:
$$M[a, b] = \begin{cases} a, & \text{if } b = 0 \\ b, & \text{if } a = 0 \\ \mathrm{Min}\big(M[a-1, b] + C_d(w_a),\ M[a-1, b-1] + C_s(w_a, z_b),\ M[a, b-1] + C_i(w_a)\big), & \text{otherwise} \end{cases}$$
The bottom-right element $M[A, B]$ is the improved edit distance between the two sequences $W$ and $Z$, and the tag structure information $D_t$ is then:
$$D_t = M[A, B] / \mathrm{Max}(A + 1, B + 1).$$
5. The web page text extraction and comparison method according to any one of claims 1 to 4, characterized in that
Said feature information extraction sub-step further comprises:
Establish the feature information: the feature information comprises the HTML tag structure information of the web page and the content-based text length information, sentence count information, and number sequence information;
HTML tags are divided into three classes, structural tags, formatting tags, and irrelevant tags, according to their function in page layout, display, and linking:
Structural tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
Formatting tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are excluded when computing structural similarity.
The improved edit distance is used to compute the similarity of the classified HTML tag sequences:
The edit distance between two strings is the minimum number of edit operations needed to transform one string into the other;
The edit operations are substituting one character for another, inserting a character, and deleting a character;
Given the classification of tags, the improved edit distance is defined as the minimum operation cost of converting the classified tags of one sequence into those of another by deletion, insertion, and substitution.
6. A web page text extraction and comparison system, characterized by comprising the following modules:
Module A: for determining, based on specific tags of the web page, whether the web page is a text page;
Module B: for identifying parallel web pages;
Said step B further comprises: a feature information extraction sub-step and a support vector machine classification sub-step.
7. The web page text extraction and comparison system according to claim 6, characterized in that said module A further comprises the following submodules:
Preprocessing submodule: for preprocessing the web page and constructing the HTML tree;
Pruning submodule: for pruning the HTML tree;
Topic acquisition submodule: for obtaining the web page topic;
Block extraction submodule: for extracting the string content of each block;
Distance computation submodule: for computing the distance between the topic S and the content y of a block;
Distance comparison submodule: for comparing the edit distance L with max(p, q).
8. The web page text extraction and comparison system according to claim 6 or 7, characterized in that said pruning submodule is further used to: divide the page into blocks according to <table> tags and remove the leaf nodes that contain neither text nor link information.
9. The web page text extraction and comparison system according to any one of claims 6 to 8, characterized in that said distance computation submodule is further used to: segment Chinese text into words; the Levenshtein distance used is given by formulas (2) and (3):
$$L\big((x_1, \ldots, x_p), (y_1, \ldots, y_q)\big) = \begin{cases} p, & q = 0 \\ q, & p = 0 \\ \min\big(L((x_1, \ldots, x_{p-1}), (y_1, \ldots, y_q)) + 1,\ L((x_1, \ldots, x_p), (y_1, \ldots, y_{q-1})) + 1,\ L((x_1, \ldots, x_{p-1}), (y_1, \ldots, y_{q-1})) + Z(x_p, y_q)\big), & \text{else} \end{cases} \tag{2}$$
The improved edit distance is used to compute the similarity of the classified HTML tag sequences:
The edit distance between two strings is the minimum number of edit operations needed to transform one string into the other; the edit operations are substituting one character for another, inserting a character, and deleting a character. Given the classification of tags, said improved edit distance is defined as the minimum operation cost of converting the classified tags of one sequence into those of another by deletion, insertion, and substitution. Deletion and insertion each cost 1, substitution within a class costs 0, and substitution between classes costs 1.5, that is:
Insertion: $C_i(t) = 1$;
Deletion: $C_d(t) = 1$;
Substitution:
$$C_s(t_1, t_2) = \begin{cases} 0, & \text{if } t_1, t_2 \in T \\ 1.5, & \text{if } t_1 \in T_1,\ t_2 \in T_2,\ T_1 \neq T_2 \end{cases} \qquad T_1, T_2, T:\ \text{tag categories}$$
For the HTML tag sequences $W = [w_0, w_1, \ldots, w_a, \ldots, w_A]$ and $Z = [z_0, z_1, \ldots, z_b, \ldots, z_B]$, the improved edit distance matrix $M$ is computed by dynamic programming, with matrix element $M[a, b]$:
$$M[a, b] = \begin{cases} a, & \text{if } b = 0 \\ b, & \text{if } a = 0 \\ \mathrm{Min}\big(M[a-1, b] + C_d(w_a),\ M[a-1, b-1] + C_s(w_a, z_b),\ M[a, b-1] + C_i(w_a)\big), & \text{otherwise} \end{cases}$$
The bottom-right element $M[A, B]$ is the improved edit distance between the two sequences $W$ and $Z$, and the tag structure information $D_t$ is then:
$$D_t = M[A, B] / \mathrm{Max}(A + 1, B + 1).$$
10. The web page text extraction and comparison system according to any one of claims 6 to 9, characterized in that said module B further comprises the following submodules: a feature information extraction submodule and a support vector machine classification submodule;
Said feature information extraction submodule is used to:
Establish the feature information: the feature information comprises the HTML tag structure information of the web page and the content-based text length information, sentence count information, and number sequence information;
HTML tags are divided into three classes, structural tags, formatting tags, and irrelevant tags, according to their function in page layout, display, and linking:
Structural tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
Formatting tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are excluded when computing structural similarity.
The improved edit distance is used to compute the similarity of the classified HTML tag sequences:
The edit distance between two strings is the minimum number of edit operations needed to transform one string into the other;
The edit operations are substituting one character for another, inserting a character, and deleting a character;
Given the classification of tags, the improved edit distance is defined as the minimum operation cost of converting the classified tags of one sequence into those of another by deletion, insertion, and substitution.
CN201510695688.4A 2015-10-23 2015-10-23 Web page text extraction and comparison method and system thereof Pending CN105574066A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510695688.4A CN105574066A (en) 2015-10-23 2015-10-23 Web page text extraction and comparison method and system thereof

Publications (1)

Publication Number Publication Date
CN105574066A true CN105574066A (en) 2016-05-11

Family

ID=55884202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510695688.4A Pending CN105574066A (en) 2015-10-23 2015-10-23 Web page text extraction and comparison method and system thereof

Country Status (1)

Country Link
CN (1) CN105574066A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930031A (en) * 2012-11-08 2013-02-13 哈尔滨工业大学 Method and system for extracting bilingual parallel text in web pages
CN103020043A (en) * 2012-11-16 2013-04-03 哈尔滨工业大学 Distributed acquisition system facing web bilingual parallel corpora resources
CN103309961A (en) * 2013-05-30 2013-09-18 北京智海创讯信息技术有限公司 Webpage content extraction method based on Markov random field
US20150356196A1 (en) * 2014-06-04 2015-12-10 International Business Machines Corporation Classifying uniform resource locators

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
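Zhu Zede (朱泽德): "Research on Key Technologies of Web Bilingual Corpus Mining" (网络双语语料挖掘关键技术研究), China Doctoral Dissertations Full-text Database, Information Science and Technology (中国博士学位论文全文数据库 信息科技辑) *
Chen Qiu (陈秋): "Research on Content Similarity in the Mobile Internet" (移动互联网内容相似性研究), China Master's Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库 信息科技辑) *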

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105426388A (en) * 2015-10-23 2016-03-23 青岛恒波仪器有限公司 Apparatus for extracting and comparing webpage text
CN106598954A (en) * 2017-01-05 2017-04-26 北京工商大学 Method for recognizing social network sock puppet model based on frequency sub-tree
CN108920434A (en) * 2018-06-06 2018-11-30 武汉酷犬数据科技有限公司 A kind of general Web page subject method for extracting content and system
CN108920434B (en) * 2018-06-06 2022-08-30 武汉酷犬数据科技有限公司 Universal webpage theme content extraction method and system
CN109543126A (en) * 2018-11-19 2019-03-29 四川长虹电器股份有限公司 Web page text information extracting method based on block text accounting
CN109543126B (en) * 2018-11-19 2022-04-29 四川长虹电器股份有限公司 Webpage text information extraction method based on block character ratio
CN114239590A (en) * 2021-12-01 2022-03-25 马上消费金融股份有限公司 Data processing method and device
CN114239590B (en) * 2021-12-01 2023-09-19 马上消费金融股份有限公司 Data processing method and device

Similar Documents

Publication Publication Date Title
CN106528583A (en) Method for extracting and comparing web page main body
KR102237702B1 (en) Entity relationship data generating method, apparatus, equipment and storage medium
WO2022022045A1 (en) Knowledge graph-based text comparison method and apparatus, device, and storage medium
CN109388795B (en) Named entity recognition method, language recognition method and system
CN104933027B (en) A kind of open Chinese entity relation extraction method of utilization dependency analysis
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
CN101079025B (en) File correlation computing system and method
CN100552673C (en) Open type document isomorphism engines system
CN105243129A (en) Commodity property characteristic word clustering method
CN105574066A (en) Web page text extraction and comparison method and system thereof
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
CN102253930A (en) Method and device for translating text
CN111177591A (en) Knowledge graph-based Web data optimization method facing visualization demand
CN103646112A (en) Dependency parsing field self-adaption method based on web search
Yuan-jie et al. Web service classification based on automatic semantic annotation and ensemble learning
CN102662969A (en) Internet information object positioning method based on webpage structure semantic meaning
Kim et al. Web information extraction by HTML tree edit distance matching
CN105786963A (en) Corpus searching method and system
CN107145591B (en) Title-based webpage effective metadata content extraction method
CN110929518A (en) Text sequence labeling algorithm using overlapping splitting rule
Chen et al. A Structured Information Extraction Algorithm for Scientific Papers based on Feature Rules Learning.
CN105426388A (en) Apparatus for extracting and comparing webpage text
CN112199960A (en) Standard knowledge element granularity analysis system
CN105138517A (en) Parallel web page identification method and parallel web page identification device
You Automatic summarization and keyword extraction from web page or text file

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160511