WO2017080090A1

WO2017080090A1 - Extraction and comparison method for text of webpage

Info

Publication number: WO2017080090A1
Application number: PCT/CN2015/100180
Authority: WO
Inventors: 孙燕群
Original assignee: 孙燕群
Priority date: 2015-11-14
Filing date: 2015-12-31
Publication date: 2017-05-18
Also published as: CN106528583A

Abstract

An extraction and comparison method for the text of a web page comprises the following steps: step A: determining whether a web page is a text page according to a specific tag of the web page; and step B: identifying parallel web pages. Step A further comprises the following sub-steps: step one, preprocessing the web page and constructing an HTML tree; step two, pruning the HTML tree; step three, acquiring the web page theme; step four, extracting the string content in each block; step five, calculating the distance between the theme S and the content y in a block; and step six, comparing the edit distance L with max(p, q). The method has the following advantages: web pages with short text can be extracted, and the length of the content does not affect the correctness of the selection, because text of any length participates in the calculation and is not ignored; and all <table> tags are processed consistently when a web page with complicated nested <table> tags is processed.

Description

Web page text extraction and comparison method

Technical field

The invention relates to the technical field of computer networks, and in particular to a web page text extraction and comparison method.

Background art

There are many methods for extracting web page text, including methods aimed specifically at comment pages or news pages; the present invention, however, concerns a text extraction method for most common web pages. In general, the main web page text extraction methods are: DOM-based methods, statistics-based methods, block-based methods, and other methods.

The Document Object Model (DOM) is a standard interface specification developed by the W3C. Because DOM nodes are organized in a tree hierarchy, once the tree structure is established, operations on the web page can be converted into operations on the tree. Although a web page can be converted into a DOM tree according to the standards set by the W3C, in practice many web pages do not follow the standard. Therefore, a DOM-based method usually needs a preprocessing module before the web page can finally be abstracted into a DOM tree.

First, DOM-based web page text extraction method

The DOM-based web page text extraction method extracts web page content on the basis of the DOM; its original purpose was to improve PDA applications and to remove advertisement content. The DOM method abstracts the content of the web page into corresponding objects and converts them into nodes, then organizes the nodes by their parent-child relationships to form a tree structure.

The structures of web pages from the same website are mostly the same. For example, the <body> tag of a Yahoo News page is composed of two tags, <iframe> and <div>, so these web page templates can be grouped into one class. Clustering similar DOM trees requires calculating their similarity. The procedure for calculating the similarity of two simple DOM trees is as follows: the first step judges whether the root nodes of the two trees are the same; if they are not, 0 is returned; if they are, the leaf nodes of the two trees are compared. The second step compares the names and attributes of the leaf nodes of the two DOM trees and returns the number of identical nodes in the two DOM trees.
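The two-step similarity procedure described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular prior-art system: the `Node` class, the function names, and the choice to count identical leaves as a multiset intersection of (name, attributes) pairs are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical DOM node: tag name, attributes, child nodes."""
    name: str
    attrs: tuple = ()                      # e.g. (("class", "news"),)
    children: list = field(default_factory=list)


def leaves(node):
    """Collect all leaf nodes of the tree rooted at `node`."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def dom_similarity(tree1, tree2):
    # Step 1: if the root nodes differ, the similarity is 0.
    if tree1.name != tree2.name:
        return 0
    # Step 2: count leaf nodes whose names and attributes are identical
    # (assumption: duplicates are matched one-to-one).
    c1 = Counter((n.name, n.attrs) for n in leaves(tree1))
    c2 = Counter((n.name, n.attrs) for n in leaves(tree2))
    return sum((c1 & c2).values())


page_a = Node("body", children=[Node("iframe"), Node("div")])
page_b = Node("body", children=[Node("iframe"), Node("div"), Node("p")])
print(dom_similarity(page_a, page_b))
```

Pages whose similarity is high relative to their number of leaves would then be candidates for the same template class.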

Second, statistics-based web page text extraction method

The statistics-based method is mainly used to extract the body text of news web pages. It rests on the assumption that the body text of a web page is located only in a <table> tag node of the page. The basic steps are as follows: the first step removes noise from the page and represents the web page as a tree according to its tags; the second step processes each <table> node, removing the HTML tags in the node to obtain a string free of tags; the third step compares the number of characters in each node, and the node with the largest number of characters is usually the body of the web page. The advantages of this method are that it exploits the characteristics of news web pages, generalizes well, is simple to implement, does not need different templates for different web pages, requires no sample learning, and has low time complexity. The disadvantage is that the algorithm applies only when all the body text of the web page is placed in one <table> node, and it performs poorly on web pages whose text is spread over multiple <table> nodes. With the rise of Weibo, light blogs and the like, more and more pages with complex formats and short text have appeared, and the limitations of this method have become more obvious.
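The second and third steps of the statistics-based method can be sketched with Python's standard `html.parser` module. This is an illustrative simplification: the noise-removal step is omitted, text is attributed to the innermost open <table> only, and the class and function names are assumptions.

```python
from html.parser import HTMLParser


class TableTextCollector(HTMLParser):
    """Accumulates the tag-free text of every <table> node."""

    def __init__(self):
        super().__init__()
        self.open_tables = []   # indices of the <table> elements currently open
        self.texts = []         # one text string per <table>, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.texts.append("")
            self.open_tables.append(len(self.texts) - 1)

    def handle_endtag(self, tag):
        if tag == "table" and self.open_tables:
            self.open_tables.pop()

    def handle_data(self, data):
        # Assumption: text belongs to the innermost open <table>.
        if self.open_tables:
            self.texts[self.open_tables[-1]] += data.strip()


def largest_table_text(html):
    """Return the text of the <table> node with the most characters."""
    parser = TableTextCollector()
    parser.feed(html)
    parser.close()
    return max(parser.texts, key=len, default="")


sample = ("<table><tr><td>nav</td></tr></table>"
          "<table><tr><td>This is the long body text.</td></tr></table>")
print(largest_table_text(sample))
```

The sketch also makes the stated limitation visible: if the body were split over two <table> nodes, only the longer half would be returned.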

In the existing method, the webpage text extraction comparison effect table:

In general, the relevant algorithms for web page text extraction and web page similarity calculation still mainly target traditional Internet web pages. Neither web page text extraction research nor web page similarity research has given serious consideration to the new characteristics of mobile web page content, which is mainly manifested in the following shortcomings:

(1) The structure of mobile Internet web pages is becoming more and more complex and involves more and more emerging technologies, so the limitations of the traditional web page text extraction algorithms introduced in Section 2.2 are more and more obvious.

(2) Owing to the large amount of short-text page content, the theoretical basis of the text similarity algorithms introduced in Section 2.3 no longer holds, and the accuracy of the algorithms decreases, so they cannot meet the needs of large-scale data usage.

Summary of the invention

The technical problem to be solved by the present invention is to provide a web page text extraction and comparison method based on topic similarity; results show that the method of the present invention achieves a large improvement in accuracy.

In order to solve the above technical problem, the present invention provides a web page text extraction and comparison method, comprising the following steps:

Step A: determining whether the web page is a text page based on a specific tag of the web page;

Step B: Identification of parallel web pages.

Step C: For Chinese web pages, the body text usually contains Chinese punctuation, whereas titles contain little or none. A threshold on the number of Chinese punctuation marks is therefore set and used to judge the Chinese text in each <p> tag of the web page: if the number of Chinese punctuation marks is greater than the given threshold, the text is added to the body. The text of multiple consecutive <p> tags (one or two other tags may lie between the <p> tags) is then obtained and added to the body by the same judgment.

The step A may further comprise the following sub-steps:

Step 1: Preprocessing the web page to construct an HTML tree;

Step 2: Pruning the HTML tree;

Step 3: Obtain the webpage theme;

Step 4: Extract the contents of the string in the block;

Step 5: Calculate the distance between the subject S and the content y in a block;

Step 6: Compare the edit distances L and max(p, q).

Step 2 may further include the following sub-step: dividing the page into blocks according to the <table> tag, and removing the leaf nodes that contain neither text nor link information.

Step 5 may further include: performing Chinese word segmentation, and using the Levenshtein distance given by formula (2) and formula (3).

The step B may further include: a feature information extraction sub-step and a support vector machine classification sub-step;

The feature information extraction sub-step further includes:

Establishing feature information: the feature information includes webpage HTML tag structure information and content-based text length information, text sentence number information, and digital sequence information;

The HTML tags are divided into three types of tags: structure tags, format tags, and irrelevant tags according to their web page layout, display, and link features:

Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;

Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;

Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are deleted when calculating structural symmetry.

Calculate the similarity of the classified HTML tag sequences with improved edit distance:

The edit distance between two strings is the minimum number of edit operations required to convert one string into the other;

Edit operations include replacing one character with another, inserting one character, and deleting one character;

According to the classification of the tags, the improved edit distance is defined as the minimum operation cost of converting one string of tags of different types into another string by deletion, insertion, and replacement.

To solve the above technical problem, the present invention also provides a webpage text extraction and comparison system, comprising the following modules:

Module A: used to determine whether a web page is a text page based on a specific tag of the web page;

Module B: Used to identify parallel web pages.

The module A may further comprise the following sub-modules:

Pre-processing sub-module: used to preprocess the web page and construct an HTML tree;

Pruning sub-module: used to prune the HTML tree;

Theme acquisition sub-module: used to obtain the web page theme;

Block extraction sub-module: used to extract the string content within a block;

Distance calculation sub-module: used to calculate the distance between the theme S and the content y within a block;

Distance comparison sub-module: used to compare the edit distance L with max(p, q).

The pruning sub-module may be further configured to: divide the page into blocks according to the <table> tag, and remove the leaf nodes that contain neither text nor link information.

The distance calculation sub-module may be further used to: perform Chinese word segmentation, and use the Levenshtein distance given by formula (2) and formula (3).

The module B may further include the following sub-modules: a feature information extraction sub-module and a support vector machine classification sub-module;

The feature information extraction submodule is used to:

Establish feature information: the feature information includes web page HTML tag structure information and content-based text length information, text sentence number information, and digital sequence information;

Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are deleted when calculating structural symmetry.

The beneficial effects of the present invention are as follows: compared with the conventional web page blocking algorithm, the web page text extraction method based on topic-similarity blocking of the present invention has the following advantages:

(1) Web pages with short text can be extracted, and the length of the content does not affect the correctness of the selection, because text of any length participates in the calculation and is not ignored.

(2) Complex web pages with nested <table> tags can be handled. Because an HTML tree is built, every <table> tag is guaranteed to be processed consistently.

(3) The amount of calculation is reduced. Cluster analysis, which is very time-consuming, is not required, and the entropy of each block need not be calculated; the judgment is made by analyzing the web page itself.

(4) A certain degree of semantic information is added. Because the semantic information of the title tags and the body is effectively utilized, the extracted body has stronger semantic relevance.

Detailed description

The embodiments of the present invention are described in detail below, so that the process by which the invention applies technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented.

The theme referred to in the web page text extraction and comparison method based on topic-similarity blocking of the present invention is the title and tags of the web page. In order to prevent the short text blocks of mobile Internet pages from being ignored, the algorithm of the present invention does not calculate the entropy of content blocks; instead, it mainly uses the similarity between the theme and each content block as the basis for deciding which blocks to extract. Specifically, the main features of web pages are:

First, the web page format has a tree structure. More and more web pages are now built according to the XML standard, and web page tags are usually nested in pairs, so a page can be converted into an HTML tree structure; the DOM-based web page text extraction method in fact also exploits this feature. In the method of the present invention, the HTML tree is constructed mainly in order to cut off useless branches and reduce the amount of calculation.

Second, web pages are usually arranged in blocks. Although the formats of mobile Internet web pages are complicated, in terms of content each web page basically includes the following blocks: a classification block, a navigation block, a text block, a related-link block, and an advertisement block. Given this feature, and given that web page tags are usually nested in pairs, web page tags can be used to divide a page into blocks. In fact, owing to the widespread use of the DIV+CSS method and the good layout properties of the <table></table> tag, most web pages now use the <table> tag to lay out the page as finally presented to the user. The web page text extraction method based on topic-similarity blocking relies on this and uses the <table> tag to parse the web page.

Third, the theme and the content are related. A web page usually has a title and a number of tags that summarize the body of the page at a high level, so the theme reflects the characteristics of the body and represents the key content of the page. Previous web page extraction methods did not take this into account. The method of the present invention uses the relationship between the theme and the text as an important indicator for text extraction. In particular, because the structures of mobile Internet web pages are increasingly diverse, the lengths of web page content vary, and interfering advertisement information is abundant, short-text content is easily submerged in advertisements; considering the similarity between the theme and the web page content is therefore indispensable in web page extraction. The indicator used to measure similarity in the present invention is the edit distance (i.e., the Levenshtein distance). The Levenshtein distance is the minimum number of insertions, deletions, and substitutions required to convert the original string (a) into the target string (b). The Levenshtein formula is shown in the following equation (1):
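The Levenshtein distance is commonly written as the recurrence below, which matches the definition just given (minimum number of insertions, deletions, and substitutions). The notation $\mathrm{lev}_{a,b}(i,j)$ and the indicator $1_{(a_i \neq b_j)}$, which equals 1 when the i-th character of a differs from the j-th character of b and 0 otherwise, follow the usual textbook presentation and are assumptions, not symbols taken from the original formula.

$$
\mathrm{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j), & \text{if } \min(i,j)=0,\\[4pt]
\min\begin{cases}
\mathrm{lev}_{a,b}(i-1,j)+1\\
\mathrm{lev}_{a,b}(i,j-1)+1\\
\mathrm{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases}
$$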

Description: a and b are strings, i is the length of string a, and j is the length of string b. Based on the above three points, the basic idea of the web page text extraction method based on topic-similarity blocking is as follows: convert the web page into an HTML tree structure; extract the theme of the web page; extract the content blocks by using the web page tags; and compute the edit distance (Levenshtein distance) L between the theme and each content block. When the distance L is smaller than the length p of a content block, the block is regarded as web page body content; when the distance L is greater than or equal to the length of the content block, the block is ignored.

In an embodiment, the present invention provides a web page text extraction and comparison method, comprising the following steps:

Step A: determining whether the web page is a text page based on a specific tag of the web page;

Step B: identification of parallel web pages;

Step C: For Chinese web pages, the body text usually contains Chinese punctuation, whereas titles contain little or none. A threshold on the number of Chinese punctuation marks is therefore set and used to judge the text of each <p> tag: if the number of Chinese punctuation marks is greater than the given threshold, the text is added to the body. The text of multiple consecutive <p> tags (one or two other tags may lie between the <p> tags) is then obtained and added to the body by the same judgment.
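The punctuation threshold of Step C can be sketched as follows. The punctuation set, the default threshold of 2, and the function names are illustrative assumptions; the text does not specify them. The merging of consecutive <p> tags separated by one or two other tags is not covered by this sketch.

```python
# Assumed set of Chinese punctuation marks; the source does not enumerate them.
CHINESE_PUNCTUATION = set("，。、；：？！“”‘’（）《》")


def count_chinese_punctuation(text):
    """Number of Chinese punctuation marks in `text`."""
    return sum(1 for ch in text if ch in CHINESE_PUNCTUATION)


def select_body_paragraphs(paragraphs, threshold=2):
    """Keep the <p> texts whose punctuation count exceeds the threshold."""
    return [p for p in paragraphs
            if count_chinese_punctuation(p) > threshold]


paragraphs = [
    "首页 新闻 体育",                              # navigation-like text
    "今天，天气很好。我们去公园散步，然后回家。",  # body-like text
]
print(select_body_paragraphs(paragraphs))
```

In the example the first string contains no Chinese punctuation and is rejected, while the second contains four marks and is kept.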

The step A may further comprise the following sub-steps:

Step 1: Preprocessing the web page to construct an HTML tree;

Step 2: Pruning the HTML tree;

Step 3: Obtain the webpage theme;

Step 4: Extract the contents of the string in the block;

Step 6: Compare the edit distances L and max(p, q).


In another embodiment, the present invention also provides a web page text extraction and comparison system, comprising the following modules:

Module B: Used to identify parallel web pages.

The module A may further comprise the following sub-modules:

Pruning sub-module: used to prune the HTML tree;

Theme acquisition sub-module: used to obtain the web page theme;

Distance comparison sub-module: used to compare the edit distance L with max(p, q).


In another embodiment, in line with the basic idea of the web page text extraction method based on topic-similarity blocking, the algorithm of the present invention includes three main steps: constructing an HTML tree, extracting the web page theme, and calculating the similarity between the theme and each block. In addition, since web pages are semi-structured, preprocessing is required; and in order to reduce the amount of computation, the constructed tree needs to be pruned. Specifically, the basic steps of the algorithm are as follows:

Step 1: Web page preprocessing and construction of the HTML tree. The web page is normalized and finally mapped into a tree structure, through the following sub-steps:

(1) Replace the "<" and ">" characters that appear on the web page outside of the <table>-related tags with &lt; and &gt;, and add the missing end tags for tags such as <li> and <hr> that are not written in the standard form.

(2) Place the attribute values of all tags in the web page in quotation marks, such as <a href="www.hust.edu.cn">.

(3) Match the tags in pairs, that is, each start tag corresponds to an end tag, such as <body> and </body>, or <head> and </head>.

(4) Nest the tags correctly, such as <a><b></b></a>. Only correctly nested tags can be processed correctly by recursion.

(5) Remove some useless tags, such as form and img.

Using the normalized tag information, the HTML tree corresponding to the web page is then constructed recursively.

Step 2: Pruning the HTML tree. Since the block is segmented according to the <table> tag, some leaf nodes do not contain text and link information, so these useless branches are removed, reducing the amount of computation.
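Steps 1 and 2 can be sketched with the standard `html.parser` module. This is a simplified illustration rather than the patented procedure: the normalization sub-steps are left to the parser's tolerance of malformed HTML, "link information" is assumed to mean <a> elements, and `Node`, `TreeBuilder`, `build_tree`, `prune`, and the `VOID_TAGS` set are hypothetical names.

```python
from html.parser import HTMLParser

# Tags that never have an end tag (assumed subset, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Maps the tag stream onto a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def prune(node):
    """Drop subtrees with neither text nor links; return True if node stays."""
    node.children = [child for child in node.children if prune(child)]
    return bool(node.text) or node.tag == "a" or bool(node.children)


page = ("<html><body><table><tr><td>Body text</td><td></td></tr></table>"
        "<div><span></span></div><a href='x'></a></body></html>")
tree = build_tree(page)
prune(tree)
body = tree.children[0].children[0]
print([child.tag for child in body.children])
```

After pruning, the empty <td> and the empty <div><span> branch are gone, so later steps only visit nodes that can contribute text or links.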

Step 3: Obtain the web page theme. The content of the page title, of the headings at each level <h1>~<h6>, and of the <meta> tag is obtained. For Chinese text, the ICTCLAS word segmentation system developed by the Chinese Academy of Sciences can be used to segment this content; function words, stop words, and the like are then removed, finally yielding a sequence S_title that contains only content words.
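Collecting the raw theme text of Step 3 can be sketched as below. Word segmentation and stop-word removal (ICTCLAS in the text) are not reproduced; restricting <meta> to the `keywords` and `description` names is an assumption, as are the class and function names.

```python
from html.parser import HTMLParser

THEME_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}


class ThemeExtractor(HTMLParser):
    """Collects text from <title>, <h1>..<h6> and selected <meta> tags."""

    def __init__(self):
        super().__init__()
        self.open_theme_tags = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in THEME_TAGS:
            self.open_theme_tags += 1
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            content = (attributes.get("content") or "").strip()
            # Assumption: only keywords/description metadata describe the theme.
            if name in ("keywords", "description") and content:
                self.parts.append(content)

    def handle_endtag(self, tag):
        if tag in THEME_TAGS and self.open_theme_tags:
            self.open_theme_tags -= 1

    def handle_data(self, data):
        if self.open_theme_tags and data.strip():
            self.parts.append(data.strip())


def extract_theme(html):
    """Return the theme fragments of the page in document order."""
    parser = ThemeExtractor()
    parser.feed(html)
    parser.close()
    return parser.parts


doc = ('<html><head><title>Campus News</title>'
       '<meta name="keywords" content="news, campus"></head>'
       '<body><h1>Sports Day</h1><p>Body</p></body></html>')
print(extract_theme(doc))
```

The returned fragments would then be segmented and filtered to form the word sequence used as the theme.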

Step 4: Extract the string content in each block. First, each leaf node of the HTML tree, that is, the subtree corresponding to an innermost <table> tag, is merged into one block; the HTML tags in the block are removed, and the string content y of the block is obtained.

Step 5: Calculate the distance between the theme S and the content y within a block. For Chinese, the text must first be segmented into words, again using the Chinese Academy of Sciences word segmentation system from Step 3. The Levenshtein distance specifically used in the present invention is given by formulas (2) and (3).

Step 6: Compare the edit distance L with max(p, q). If L < max(p, q), the block is body information and is extracted; otherwise it is recognized as interference information and ignored. The body information of the web page is finally obtained.
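Steps 5 and 6 can be sketched as follows. Because formulas (2) and (3) are not available here, the sketch substitutes the standard Levenshtein distance computed over word sequences, and it assumes that p and q are the lengths of the block content and of the theme; both points are assumptions, as are the function names.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,             # delete x
                               current[j - 1] + 1,          # insert y
                               previous[j - 1] + (x != y)))  # substitute
        previous = current
    return previous[-1]


def is_body_block(theme_words, block_words):
    """Step 6: keep the block when L < max(p, q)."""
    p, q = len(block_words), len(theme_words)   # assumed meaning of p and q
    return levenshtein(theme_words, block_words) < max(p, q)


theme = ["campus", "sports", "day"]
related = ["the", "campus", "sports", "day", "opens"]
unrelated = ["buy", "cheap", "tickets", "now"]
print(is_body_block(theme, related), is_body_block(theme, unrelated))
```

A block that shares nothing with the theme needs max(p, q) edits, so the strict inequality rejects it, while any block sharing words in order with the theme falls below the bound regardless of how short it is.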

In addition, the webpage text extraction and comparison method of the present invention further includes the identification of parallel webpages.

The parallel webpage identification of the invention mainly comprises two parts: feature information extraction and support vector machine classification.

1. Feature information extraction

The feature information mainly includes webpage HTML tag structure information and content-based text length information, text sentence number information, and digital sequence information.

(1) Tag structure features

The main content of bilingual parallel web pages is mutually translated, but the presentation forms of the pages often differ. In order to avoid misjudging parallel web pages because of differences in form, and to increase the similarity of structural tag alignment between parallel web pages, the HTML tags are divided into three types according to their functions in web page layout, display, and linking: structure tags, format tags, and irrelevant tags:

Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul, etc.;

Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u, etc.;

Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend, etc.; these are deleted when calculating structural symmetry.

The similarity of the classified HTML tag sequences is calculated using the improved edit distance.

The edit distance between two strings is the minimum number of edit operations required to convert one string into the other. The edit operations are replacing one character with another, inserting a character, and deleting a character. According to the classification of the tags, the improved edit distance is defined as the minimum operation cost of converting one sequence of tags of different types into another sequence by deletion, insertion, and replacement. The cost of a deletion or an insertion is 1, the cost of a replacement within the same class is 0, and the cost of a replacement between classes is 1.5, that is:

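Insertion operation: $C_i(t) = 1$;

Deletion operation: $C_d(t) = 1$;

Replacement operation: $C_s(t_1, t_2) = 0$ if $t_1$ and $t_2$ belong to the same class of tags, and $C_s(t_1, t_2) = 1.5$ if they belong to different classes.

For the HTML tag sequences $W=[w_0, w_1, \ldots, w_a, \ldots, w_A]$ and $Z=[z_0, z_1, \ldots, z_b, \ldots, z_B]$, the improved edit distance matrix $M$ is computed by dynamic programming, with the matrix element $M[a,b]$ computed as: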

The lower-right element $M[A,B]$ is the improved edit distance of $W$ and $Z$, and the tag structure information $D_t$ is:

$D_t = M[A,B] / \max(A+1, B+1)$

For example, for the HTML tag sequences W = [div, style, style, div, style, style, p, p, div, div] and Z = [div, table, tr, td, span, span, td, tr, table, div], the improved edit distance matrix is shown in Table 1. The improved edit distance is 3 and the tag structure information $D_t$ = 0.3.

Table 1: W and Z improved edit distance matrix M
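The improved edit distance can be sketched as a weighted dynamic program using the stated costs (insertion 1, deletion 1, in-class replacement 0, between-class replacement 1.5). The boundary handling (an extra row and column for the empty prefix) is the usual textbook formulation and an assumption here; irrelevant tags are assumed to have been deleted beforehand, and the function names are hypothetical.

```python
STRUCTURE_TAGS = {"blockquote", "body", "dir", "div", "dt", "h", "head", "hr",
                  "li", "menu", "p", "q", "table", "tbody", "td", "tfoot",
                  "th", "thead", "tr", "ul"}
FORMAT_TAGS = {"abbr", "acronym", "b", "big", "center", "cite", "code", "dfn",
               "em", "font", "i", "pre", "s", "small", "span", "strike",
               "strong", "style", "sub", "sup", "tt", "u"}


def tag_class(tag):
    if tag in STRUCTURE_TAGS:
        return "structure"
    if tag in FORMAT_TAGS:
        return "format"
    return "other"


def replacement_cost(t1, t2):
    """0 within the same class of tags, 1.5 between classes."""
    return 0.0 if tag_class(t1) == tag_class(t2) else 1.5


def improved_edit_distance(w, z):
    rows, cols = len(w) + 1, len(z) + 1
    m = [[0.0] * cols for _ in range(rows)]
    for a in range(1, rows):
        m[a][0] = float(a)          # delete all of w[:a]
    for b in range(1, cols):
        m[0][b] = float(b)          # insert all of z[:b]
    for a in range(1, rows):
        for b in range(1, cols):
            m[a][b] = min(m[a - 1][b] + 1.0,                  # deletion
                          m[a][b - 1] + 1.0,                  # insertion
                          m[a - 1][b - 1]
                          + replacement_cost(w[a - 1], z[b - 1]))
    return m[-1][-1]


def tag_structure_information(w, z):
    """D_t: improved edit distance normalised by the longer sequence."""
    if not w and not z:
        return 0.0
    return improved_edit_distance(w, z) / max(len(w), len(z))


W = ["div", "style", "style", "div", "style", "style", "p", "p", "div", "div"]
Z = ["div", "table", "tr", "td", "span", "span", "td", "tr", "table", "div"]
print(improved_edit_distance(W, Z), tag_structure_information(W, Z))
```

For the example sequences this yields a distance of 3.0 and D_t = 0.3, in agreement with the values stated in the text: the two `style` tags aligned with `table` and `tr` each cost 1.5, and every other aligned pair is in the same class.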

(2) Content surface features

In order to reduce the dependence on bilingual dictionaries, the content surface features refer specifically to information that is directly related to the content but not to its vocabulary, mainly the text sentence number information, the text length information, and the digital sequence information of the text pair. The features are calculated as follows:

1) Text sentence number information $D_s$:

$D_s = \min(S_S, S_T) / \max(S_S, S_T)$

2) Text length information $D_l$:

$D_l = |L_S - L_T| / \max(L_S, L_T)$

3) Digital sequence information $D_n$:

$D_n = 1 - Z / \max(m, n)$

where m and n are the numbers of numerals appearing in the source-language text and the target-language text, respectively, and Z is the maximum matching length. The detailed calculation steps are as follows:

Assume that the number sequences extracted from the source-language and target-language texts are $X=[x_1, x_2, \ldots, x_i, \ldots, x_m]$ and $Y=[y_1, y_2, \ldots, y_j, \ldots, y_n]$, from which an m×n matching relation matrix C is constructed, with the matrix element c[i,j] being:

$c[i,j] = 1$ if $x_i = y_j$, and $c[i,j] = 0$ otherwise.

The matrix C is used to build the maximum matching length matrix D of the sequences; the element D[i,j] is computed as follows:

a. Loop from right to left and from bottom to top.

b. The element D[i,j] is (entries of D outside the matrix are taken as 0):

$D[i,j] = \max(C[i,j] + D[i+1,j+1],\ D[i,j+1],\ D[i+1,j])$

The element D[0,0] of the finally generated matrix D is the maximum matching length Z.

To illustrate the calculation of the digital sequence information, take the number sequences X=[4,5,34,5,2,45,8,12] and Y=[4,7,34,8,78,9,5,2,12]. The calculated matching relation matrix C is shown in Table 2, and the maximum matching matrix D is shown in Table 3. The maximum matching length Z is therefore 5, and the digital sequence information $D_n$ is 1-5/9=0.44.

Table 2: X and Y matching relationship matrix C

Table 3: X and Y maximum matching matrix D
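The three content surface features can be sketched as follows. The maximum matching length is computed as a longest-common-subsequence dynamic program filled from the bottom-right corner, with one extra row and column of zeros as the boundary; the function names are assumptions, and the sentence counts and lengths in the example are made up for illustration.

```python
def sentence_count_feature(s_src, s_tgt):
    """D_s = min(S_S, S_T) / max(S_S, S_T)."""
    return min(s_src, s_tgt) / max(s_src, s_tgt)


def text_length_feature(l_src, l_tgt):
    """Text length feature: |L_S - L_T| / max(L_S, L_T)."""
    return abs(l_src - l_tgt) / max(l_src, l_tgt)


def max_matching_length(x, y):
    """Z: length of the longest common subsequence of x and y."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]   # zero boundary row/column
    for i in range(m - 1, -1, -1):              # bottom to top
        for j in range(n - 1, -1, -1):          # right to left
            c = 1 if x[i] == y[j] else 0        # matching relation c[i, j]
            d[i][j] = max(c + d[i + 1][j + 1], d[i][j + 1], d[i + 1][j])
    return d[0][0]


def number_sequence_feature(x, y):
    """D_n = 1 - Z / max(m, n)."""
    return 1 - max_matching_length(x, y) / max(len(x), len(y))


X = [4, 5, 34, 5, 2, 45, 8, 12]
Y = [4, 7, 34, 8, 78, 9, 5, 2, 12]
print(max_matching_length(X, Y), round(number_sequence_feature(X, Y), 2))
```

For the example sequences the longest common subsequence is 4, 34, 5, 2, 12, so Z = 5 and D_n = 1 - 5/9 ≈ 0.44, matching the worked example in the text.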

The web page text extraction and comparison method of the present invention uses a support vector machine (SVM) for classification. The SVM algorithm is an implementation of statistical learning theory and is based on the theory of the Vapnik-Chervonenkis (VC) dimension and the principle of structural risk minimization. By introducing a kernel function, the sample vectors are mapped into a high-dimensional feature space, and the optimal separating hyperplane, a linear optimal decision function, is then constructed in that space. The advantage of the SVM is that the kernel function solves the dimensionality problem: the computational complexity of the learning algorithm is not directly tied to the dimension of the samples.

Let $\{(x_i, y_i),\ i = 1, \ldots, S\}$ be the SVM training data set consisting of S data points, where $x_i \in R^n$ and $y_i \in \{-1, 1\}$. The optimal decision function is:

$f(x) = \mathrm{Sgn}\left[\sum_{i=1}^{S} \alpha_i y_i K(x_i, x) + b\right]$

where Sgn[·] is the sign function, the non-negative variable $\alpha_i$ is a Lagrange multiplier, $K(x_i, x)$ is the kernel function, and b is the offset of the hyperplane.

Candidate parallel web page pairs are formed by selecting, from the preprocessed source-language and target-language documents, web pages located within two levels of each other in the locally mirrored path. The HTML tag sequence information $D_t$, the text length information $D_l$, the text sentence number information $D_s$, and the digital sequence information $D_n$ are calculated for each web page pair to form the feature vector $x_i \in R^n$ (n=4) of the SVM classifier. $D_t$ reflects the web page structure and is extracted from the preprocessed web page; $D_l$, $D_s$, and $D_n$ reflect the web page content and are extracted from the web page body.

The SVM is trained on a training set consisting of known parallel and non-parallel web page pairs and is then used to determine whether a web page pair of unknown class is parallel. A result of $y_i = 1$ indicates that the pair is a parallel web page pair, and $y_i = -1$ indicates that it is a non-parallel web page pair.
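Classifying a candidate pair with an already trained SVM can be sketched as the kernel expansion sign(Σ α_i y_i K(x_i, x) + b). Everything concrete below is hypothetical: the RBF kernel choice, the two support vectors, the multipliers, the bias, and the feature order (D_t, D_l, D_s, D_n) are illustrative values, not trained parameters from the source.

```python
import math


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel, one common choice of kernel function."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_decide(x, support_vectors, labels, alphas, bias, kernel=rbf_kernel):
    """Return 1 (parallel pair) or -1 (non-parallel pair)."""
    score = sum(alpha * y * kernel(sv, x)
                for sv, y, alpha in zip(support_vectors, labels, alphas))
    score += bias
    return 1 if score >= 0 else -1      # assumption: a score of 0 maps to 1


# Hypothetical model: feature order (D_t, D_l, D_s, D_n).
support_vectors = [[0.10, 0.05, 0.90, 0.10],   # a parallel pair
                   [0.80, 0.70, 0.30, 0.90]]   # a non-parallel pair
labels = [1, -1]
alphas = [1.0, 1.0]
bias = 0.0

candidate_a = [0.15, 0.10, 0.85, 0.15]
candidate_b = [0.75, 0.70, 0.35, 0.85]
print(svm_decide(candidate_a, support_vectors, labels, alphas, bias),
      svm_decide(candidate_b, support_vectors, labels, alphas, bias))
```

In practice the support vectors, multipliers, and bias would come from training on labelled parallel and non-parallel pairs rather than being written by hand.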

In still another embodiment of the present invention, a web page text extraction and comparison method that includes bilingual sentence alignment is also provided.

The bilingual sentence alignment step of the method is as follows: after document-level bilingual parallel web page documents are obtained, the body text of the bilingual parallel web pages is extracted and split into sentences to form candidate aligned sentence pairs $(S_i, T_j)$, whose word sequences are $C = \{c_1, c_2, \ldots, c_n\}$ and $B = \{b_1, b_2, \ldots, b_n\}$, respectively, where $c_i$ and $b_i$ are words obtained by word segmentation. Assuming that there are K pairs of mutually translated words, the similarity of $(S_i, T_j)$ is calculated as follows:

where $stf(c_m, b_m)$ is the number of occurrences of the mutually translated word pair in the sentence pair; $|S_i|$ and $|T_j|$ are the numbers of sentences in the source-language $S_i$ and the target-language $T_j$, respectively; $idtf(c_m)$ is the ratio of the total number of occurrences of $c_m$ in $S_i$ to the number of occurrences of $c_m$ in the text;

and

are the lengths of the sentences in the source-language $S_i$ and the target-language $T_j$, respectively; $Matching(|S_i|, |T_j|)$ is a penalty factor that penalizes different alignment modes to different degrees, to prevent the algorithm from merging too many sentences together;

is a penalty factor determined by length.

Based on the similarity evaluation function Sim(S _i , T _j ), dynamic programming is used to find the optimal sentence alignment path to obtain bilingual parallel corpus.

Compared with the traditional web page blocking algorithm, the web page text extraction method based on topic-similarity blocking of the present invention has the advantages described above.

The above embodiments are not intended to limit other forms of implementation of the invention. Those skilled in the art may use the information disclosed above and modify it to achieve similar implementations; however, all modifications or adaptations based on the invention remain within its reserved rights.

Claims

A method for extracting and comparing webpage texts, comprising the steps of:

Step A: determining whether the webpage is a text page based on a specific label for the webpage;

Step B: identification of parallel web pages;

Step C: setting a threshold number of Chinese punctuation for the Chinese webpage; determining the text of the webpage <p> by the threshold of the number of Chinese punctuation: if the number of Chinese punctuation is greater than the set threshold, then It is added to the body.
The method for extracting and comparing webpage text according to claim 1, wherein said step A further comprises the following substeps:

Step 1: Preprocessing the web page to construct an HTML tree;

Step 2: Pruning the HTML tree;

Step 3: Obtain the webpage theme;

Step 4: Extract the contents of the string in the block;

Step 5: Calculate the distance between the subject S and the content y in a block;

Step 6: Compare the edit distances L and max(p, q).
The webpage text extraction and comparison method according to claim 1 or 2, wherein the step 2 further comprises the substep of: deleting the leaf nodes that do not contain the text and the link information according to the <table> tag.
The method for extracting and comparing webpage text according to any one of claims 1 to 3, wherein the step 5 further comprises: performing Chinese word segmentation, and using the Levenshtein distance as shown in equations (2) and (3):
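Equations (2) and (3) are not reproduced here. As an illustration only, the standard Levenshtein recurrence can be sketched as below; it works on any two sequences, so the inputs may be character strings or lists of segmented words. The normalisation in `theme_similarity` is an assumption added for the example, motivated by the fact that the edit distance L between sequences of lengths p and q never exceeds max(p, q).

```python
def levenshtein(s, t):
    """Standard edit distance with unit-cost insert, delete, and substitute."""
    p, q = len(s), len(t)
    prev = list(range(q + 1))  # distances from the empty prefix of s
    for i in range(1, p + 1):
        cur = [i] + [0] * q
        for j in range(1, q + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # delete s[i-1]
                         cur[j - 1] + 1,      # insert t[j-1]
                         prev[j - 1] + cost)  # substitute or match
        prev = cur
    return prev[q]


def theme_similarity(theme, content):
    """Assumed normalisation: 1 - L / max(p, q), a value between 0 and 1."""
    longest = max(len(theme), len(content))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(theme, content) / longest


# Word-level comparison on already segmented text (segmentation is assumed
# to be done by a separate tool).
distance = levenshtein(["网页", "正文", "提取"], ["网页", "正文", "比较"])
```

A distance equal to max(p, q) means the two sequences share nothing in corresponding positions, while a distance of 0 means they are identical.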
The webpage text extraction and comparison method according to any one of claims 1 to 4, wherein the step B further comprises: a feature information extraction sub-step and a support vector machine classification sub-step;

The feature information extraction sub-step further includes:

Establishing feature information: the feature information includes webpage HTML tag structure information and content-based text length information, text sentence number information, and digital sequence information;

The HTML tags are divided into three types of tags: structure tags, format tags, and irrelevant tags according to their web page layout, display, and link features:

Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;

Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;

Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these tags are removed when calculating structural symmetry.

Calculate the similarity of the classified HTML tag sequences with improved edit distance:

The edit distance between two strings is the minimum number of edit operations required to convert one string into the other;

Editing operations include replacing one character with another, inserting one character, and deleting one character;

According to the classification of the tags, the improved edit distance is defined as: the minimum operation cost of converting one tag string into another by deleting, inserting, and replacing tags of different types;
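A class-aware edit distance over tag sequences can be sketched as follows. The cost values, the abbreviated tag sets, and the treatment of unlisted tags as irrelevant are assumptions chosen for illustration; the claim states only that the operation cost depends on the tag type.

```python
# Abbreviated tag sets; the full lists are given in the claim.
STRUCTURE = {"body", "div", "p", "table", "tr", "td", "ul", "li"}
FORMAT = {"b", "i", "em", "font", "span", "strong", "u"}

# Assumed costs: structure tags matter more than format tags.
INDEL_COST = {"structure": 1.0, "format": 0.5}
CROSS_CLASS_COST = 1.5


def tag_class(tag):
    """Classify a tag; anything not listed is treated as irrelevant (assumption)."""
    if tag in STRUCTURE:
        return "structure"
    if tag in FORMAT:
        return "format"
    return "irrelevant"


def indel(tag):
    """Cost of inserting or deleting a tag."""
    return INDEL_COST[tag_class(tag)]


def sub_cost(a, b):
    """Cost of replacing tag a with tag b."""
    if a == b:
        return 0.0
    if tag_class(a) == tag_class(b):
        return indel(a)
    return CROSS_CLASS_COST


def tag_distance(seq_a, seq_b):
    """Weighted edit distance between two tag sequences, irrelevant tags removed."""
    a = [t for t in seq_a if tag_class(t) != "irrelevant"]
    b = [t for t in seq_b if tag_class(t) != "irrelevant"]
    dist = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dist[i][0] = dist[i - 1][0] + indel(a[i - 1])
    for j in range(1, len(b) + 1):
        dist[0][j] = dist[0][j - 1] + indel(b[j - 1])
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dist[i][j] = min(dist[i - 1][j] + indel(a[i - 1]),
                             dist[i][j - 1] + indel(b[j - 1]),
                             dist[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return dist[-1][-1]


d = tag_distance(["div", "p", "b", "br"], ["div", "p", "i"])
```

With these weights, two pages that differ only in inline formatting come out closer than two pages whose block layout differs, which is the intent of weighting by tag type.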

The method for extracting and comparing webpage texts further includes a bilingual sentence-aligned webpage text extraction and comparison step;

The bilingual sentence-aligned webpage text extraction and comparison step is: after the document-level bilingual parallel webpage documents are obtained, the body text of the bilingual parallel webpages is extracted and split into sentences to form sentence pairs (S_i, T_j); the candidate aligned sentences C and B are {c_1, c_2, ..., c_n} and {b_1, b_2, ..., b_n}, respectively, where c_i and b_i are words after word segmentation; assuming that there are K pairs of mutually translated words, the similarity of (S_i, T_j) is calculated as follows:

where stf(c_m, b_m) is the number of times the mutually translated word pair appears in the sentence pair;

|S_i| and |T_j| are the numbers of sentences in the source language S_i and the target language T_j, respectively;

idtf(c_m) is the ratio of the total number of occurrences of c_m in S_i to the number of occurrences of c_m in the text;

[formula image] and [formula image] are the lengths of the sentences in the source language S_i and the target language T_j, respectively;

Matching(|S_i|, |T_j|) is a penalty factor; different alignment modes are penalized to different degrees to prevent the algorithm from merging too many sentences;

[formula image] is a penalty factor determined by length;

Based on the similarity evaluation function Sim(S_i, T_j), dynamic programming is used to find the optimal sentence alignment path and thereby obtain a bilingual parallel corpus.
A webpage text extraction and comparison system, comprising the following modules:

Module A: used to determine whether a webpage is a text page based on a specific tag of the webpage;

Module B: Used to identify parallel web pages.
The webpage text extraction and comparison system according to claim 6, wherein the module A further comprises the following sub-modules:

Pre-processing sub-module: used to pre-process the web page and construct an HTML tree;

Pruning sub-module: used to prune the HTML tree;

Theme acquisition sub-module: used to obtain the webpage theme;

Block extraction sub-module: used to extract the string content within each block;

Distance calculation sub-module: used to calculate the distance between the theme S and the content y within a block;

Distance comparison sub-module: used to compare the edit distance L with max(p, q).
The webpage text extraction and comparison system according to claim 6 or 7, wherein the pruning sub-module is further configured to: partition the page into blocks according to the <table> tag, and remove the leaf nodes that contain neither text nor link information.
The webpage text extraction and comparison system according to any one of claims 6 to 8, wherein the distance calculation sub-module is further used for: performing Chinese word segmentation, and using the Levenshtein distance as shown in equations (2) and (3):
The webpage text extraction and comparison system according to any one of claims 6 to 9, wherein the module B further comprises the following sub-modules: a feature information extraction sub-module and a support vector machine classification sub-module;

The feature information extraction submodule is used to:

Establishing feature information: the feature information includes webpage HTML tag structure information and content-based text length information, text sentence number information, and digital sequence information;

The HTML tags are divided into three types of tags: structure tags, format tags, and irrelevant tags according to their web page layout, display, and link features:

Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;

Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;

Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these tags are removed when calculating structural symmetry.

Calculate the similarity of the classified HTML tag sequences with improved edit distance:

The edit distance between two strings is the minimum number of edit operations required to convert one string into the other;

Editing operations include replacing one character with another, inserting one character, and deleting one character;

According to the classification of the tags, the improved edit distance is defined as: the minimum operation cost of converting one tag string into another by deleting, inserting, and replacing tags of different types;

The webpage text extraction and comparison system further comprises a bilingual sentence-aligned webpage text extraction and comparison module;

The bilingual sentence-aligned webpage text extraction and comparison module is configured to: after the document-level bilingual parallel webpage documents are obtained, extract the body text of the bilingual parallel webpages and split it into sentences to form sentence pairs (S_i, T_j); the candidate aligned sentences C and B are {c_1, c_2, ..., c_n} and {b_1, b_2, ..., b_n}, respectively, where c_i and b_i are words after word segmentation; assuming that there are K pairs of mutually translated words, the similarity of (S_i, T_j) is calculated as follows:

where stf(c_m, b_m) is the number of times the mutually translated word pair appears in the sentence pair;

|S_i| and |T_j| are the numbers of sentences in the source language S_i and the target language T_j, respectively;

idtf(c_m) is the ratio of the total number of occurrences of c_m in S_i to the number of occurrences of c_m in the text;

[formula image] and [formula image] are the lengths of the sentences in the source language S_i and the target language T_j, respectively;

Matching(|S_i|, |T_j|) is a penalty factor; different alignment modes are penalized to different degrees to prevent the algorithm from merging too many sentences;

[formula image] is a penalty factor determined by length;

Based on the similarity evaluation function Sim(S_i, T_j), dynamic programming is used to find the optimal sentence alignment path and thereby obtain a bilingual parallel corpus.